This paper describes: why network reliability is important; the kinds of risk that need to be tackled; what the JISC is doing to improve the reliability of JANET itself; what institutions can do to improve the reliability of their own networks.

Senior Management Briefing Paper: Improving Network Reliability

June 2000

JISC would like to thank Mike Tedd of the University of Aberystwyth for writing this paper and Pat Crocker and Roland Trice for their comments.

JANET, including SuperJANET, is the world-class academic and research network that links UK HE and FE institutions to each other and to the Internet. Institutions have their own local networks for internal communication and to connect to JANET. These networks are becoming more and more vital to teaching, marketing and administration activities as well as research.
The JISC is taking forward various initiatives to make JANET even more reliable but, to make the whole network highly reliable, institutions need to improve their local networks.

This paper describes:

  • Why network reliability is important
  • The kinds of risk that need to be tackled
  • What the JISC is doing to improve the reliability of JANET itself
  • What institutions can do to improve the reliability of their own networks

It is of interest to:

  • Vice Chancellors, Registrars
  • Heads of Computing Services, Heads of Information Services

Why Network Reliability is Important

Networks have become essential for nearly all the key activities within any educational institution, including administration, teaching and learning, research and student recruitment.

Institutions can no longer function adequately without access to all their computer systems and networks.

The recent fiasco on the London Stock Exchange, when systems failed and no trading could be done for most of the last day of the tax year, was a graphic illustration of this. It is difficult to overestimate the importance of network reliability.

The JANET network

The JANET network, including SuperJANET, is a world-class, high-speed broadband network serving FE, HE and Research Council sites. The backbone of JANET is a very-high-capacity resilient ring, which should not fail. Links to sites are not all resilient, but they are reasonably reliable: links to nearly every site are working 90% of the time and, on average, the links to over 92% of sites are operational 99% of the time.
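To put these percentages in perspective, the short Python sketch below converts an availability figure into expected hours of downtime per year. It is an illustration only: the function name and the assumption of a 365-day year are ours, and the calculation treats availability as a simple long-run average.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours per year a link is down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (90, 99):
    hours = annual_downtime_hours(pct)
    print(f"{pct}% availability -> about {hours:.1f} hours down per year")
```

On this basis a link that is working 99% of the time is still unavailable for roughly 88 hours a year, and one working 90% of the time for roughly 876 hours, or over 36 days.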

The next phase of the network development, SuperJANET4, is currently being designed and procured. It is expected that most institutions will become part of a MAN (metropolitan area network), and that SuperJANET4 will provide an extremely high bandwidth core, connecting the MANs to each other and to other networks in the world, in a very reliable way.

If we are to deliver high reliability networking to users, it is important that both the MANs and the local arrangements in each institution are highly reliable.

Classes of Risk to Network Operation

There are many different risks that could lead to partial or total loss of service in JANET and the local networks connected to it. Most of these risks fall into the following groups.

Human Issues

There are many ways in which human errors can compromise service or affect the ability to respond to incidents. Networks become more complex all the time, making human error more likely, so regular reviews are needed to see how complexity can be reduced. Having high-calibre staff is very important.

Single points of failure

Although it is possible to configure a computer network to have no “single point of failure”, the typical network today has many of them. Some can be addressed quite cheaply, but it can be costly to provide alternative equipment, paths and sites.
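As an illustration of how single points of failure can be found, the Python sketch below removes each device from a network map in turn and checks whether the remaining devices can still reach one another. The topology and device names are hypothetical, and the check covers devices only; a fuller study would also consider individual links, power supplies and sites.

```python
from collections import deque


def is_connected(links, nodes):
    """Return True if every device in `nodes` can reach every other one."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for a, b in links:
            if a == current:
                neighbour = b
            elif b == current:
                neighbour = a
            else:
                continue
            # Ignore links that lead to a device that has been removed.
            if neighbour in nodes and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == nodes


def single_points_of_failure(links):
    """Devices whose loss splits the remaining network into isolated parts."""
    all_nodes = {device for link in links for device in link}
    return sorted(
        device
        for device in all_nodes
        if not is_connected(links, all_nodes - {device})
    )


# Hypothetical campus: one router sits between the JANET link and two switches.
campus_links = [
    ("janet-link", "campus-router"),
    ("campus-router", "library-switch"),
    ("campus-router", "science-switch"),
    ("library-switch", "science-switch"),
]
print(single_points_of_failure(campus_links))

# Adding a second (hypothetical) router provides an alternative path.
resilient_links = campus_links + [
    ("janet-link", "backup-router"),
    ("backup-router", "science-switch"),
]
print(single_points_of_failure(resilient_links))
```

In the first map the campus router is the only device reported; in the second, with an alternative path in place, none is.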

Organisation and staffing

Where organisations work together, for example in MANs, it is common to operate as a friendly consortium without legal status or formal contractual arrangements. This could lead to problems such as confused responsibilities and inadequate response to incidents.

Capacity Risks

Network traffic grows inexorably and, when demand exceeds capacity, this leads to compromised quality of service and sometimes to total breakdown.

Security

Computer rooms should be secure and physical access to network equipment should be limited to authorised individuals. Intrusion over the network by hackers is also a major risk; the computers and switches in the network can be interfered with from anywhere in the world unless adequate measures are taken to protect them. Viruses are also a growing threat, affecting users and whole sites.

What JISC and UKERNA are Doing

In 1998, with support from SHEFC (Scottish Higher Education Funding Council), UKERNA studied risks in the JANET network in Scotland and took a number of measures to improve resilience there, including setting up multiple paths into each Scottish MAN and the introduction of a second independent link to the SuperJANET core in England.

In early 1999, JISC’s networking committee commissioned from its Technical Advisory Unit (TAU) a study of critical risks in the JANET network. The TAU report identified a substantial number of important risks to network operation and assessed the probability and level of impact of each one, leading to an ordered list of the main risks.

UKERNA set up four working groups to study risk reduction measures as a response to the TAU report. Outcomes from this work recommend a number of measures to improve the current network and are guiding the ongoing SuperJANET4 design and procurement. A number of extra links have been introduced into the existing JANET network with resilience as well as capacity in mind.

A community workshop on Managing Risks was held in February 2000. The JISC has set up a ‘MAN Guidelines’ group, which is looking at a range of MAN issues, including management and support arrangements. UKERNA is also initiating work with the UK MANs group aimed at operational and service level areas.

What Institutions Should Be Doing

Local arrangements are a vital part of the service seen by individual users. Like JANET itself, local networks typically evolved when high reliability operation was not a high priority. Raising the level of reliability needs careful thought and planning.

Risk assessment

The first stage is to study the risks in the local network. Issues to be addressed include organisational responsibilities, staffing, training, security, equipment, communications links and power supplies as well as network design. The studies of the national network mentioned above provide models of how this might be undertaken.
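One common way to turn such a study into an ordered list is to score each risk for probability and impact and rank by the product. The Python sketch below shows the idea. The 1 to 5 scales, the multiplication rule and the example risks are illustrative assumptions of ours and are not taken from the TAU report.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: int  # 1 (rare) to 5 (frequent); illustrative scale
    impact: int       # 1 (minor) to 5 (severe); illustrative scale

    @property
    def score(self) -> int:
        # Simple probability x impact scoring; other weightings are possible.
        return self.probability * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Placeholder entries for illustration only.
example_risks = [
    Risk("Power failure in main computer room", probability=2, impact=5),
    Risk("Configuration error by staff", probability=4, impact=3),
    Risk("Traffic exceeds link capacity", probability=3, impact=2),
]

for risk in rank_risks(example_risks):
    print(f"{risk.score:>2}  {risk.name}")
```

The value of the exercise lies less in the exact numbers than in forcing an explicit, comparable judgement on each risk, so that investment can be directed at the top of the list first.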

Reducing risk

Identified risks to service should be tackled where practicable. It may be possible to tackle some of the identified risks with relatively limited investment: for example, it may be possible to introduce some alternative paths quite cheaply, and measures like uninterruptible power supplies and improved security do not have to be very expensive.
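The benefit of an alternative path can be estimated with simple arithmetic. The Python sketch below assumes that the paths fail independently of one another, which will not hold if they share a duct, a power supply or a piece of equipment. The 99% input is used purely as an illustrative figure, and the function name is ours.

```python
def combined_availability(path_availabilities):
    """Availability of a connection that works if at least one path works.

    Assumes path failures are independent of one another.
    """
    all_paths_down = 1.0
    for availability in path_availabilities:
        all_paths_down *= 1.0 - availability
    return 1.0 - all_paths_down


one_path = combined_availability([0.99])
two_paths = combined_availability([0.99, 0.99])
print(f"one path:  {one_path:.2%}")
print(f"two paths: {two_paths:.2%}")
```

Under these assumptions a second, genuinely independent path of the same quality raises availability from 99% to about 99.99%, which is why even a modest alternative route can be worth its cost.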

Quality management

Establishing Quality Management Systems, with procedures to be followed and checking of error-prone operations, can help reduce the chance of human error compromising network operation.

Equipment and staff

Major issues will often involve equipment and key staff. Some equipment is expensive and there may be a shortage of professional staff to provide cover in emergencies. Duplicating equipment or recruiting extra staff may be unattractive. However, it may be possible for a group of institutions working together to make mutual arrangements for fault monitoring and reporting and perhaps for cover.

Awareness raising

Incorporating higher reliability when new networks are being designed can be significantly less costly than introducing it later. Clear understanding of responsibilities is important; larger sites will usually have clear organisational arrangements, but some smaller sites may not. Institutions are well advised to establish a security policy, which should be policed and enforced.

Contingency planning

In the event of a major disaster like the loss of a key computer centre by fire, the impact should be reduced if there is already a plan to handle such an eventuality. This plan might cover such issues as who would take charge, how emergency authorisations might be made, what space might be used for emergency arrangements, how new equipment would be acquired and where the funds to make everything possible would come from.

MAN partnerships

Finally, where institutions are partners in a MAN, they should encourage and support the MAN in its activities to improve network reliability.

JISC Senior Management Briefing papers are designed to inform senior managers in the FE and HE sectors about a wide range of strategic issues that affect them, from network developments to implementing new technologies and from electronic content to legal issues.

Publication Date
1 June 2000