Copyright 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This article was published in the May 1998 issue of IEEE Communications Magazine.


Abstract
This article presents results from recent research that included the modeling of various self-healing structures in order to compute the service availability between any two nodes within a given survivable network. In particular, terminal-pair availabilities are computed for a service-bearing transport signal that traverses a UPSR system, a BLSR system, a series of interconnected ring (UPSR and/or BLSR) systems, called a ring chain, within a general ring-based mesh network, and a series of interconnected point-to-point systems within a general point-to-point mesh network. In addition, these network structures are evaluated with varying degrees of fiber link and system interconnection node survivabilities; specifically, two-fiber and four-fiber link configurations are evaluated, and an intersystem nodal survivability factor α is used to characterize a transport signal of a given service level. Based on the results of these computations, several key observations are highlighted that should be considered when planning and designing short- and long-haul self-healing networks.

 



The Quantitative Impact of Survivable Network Architectures on Service Availability


Mark R. Wilson
Bell Laboratories

 

The survivability of telecommunications networks has become a particular source of concern among service providers and end customers. The high negative exposure generated by several infamous network failures, along with the increasing competition among network service providers, has made network planners acutely concerned with the survivability of their networks. Furthermore, many large customers, realizing this competition for their business, have demanded higher service availabilities for both their voice and data traffic.
A major development in improving the survivability of these networks has been the introduction of self-healing ring architectures, which restore service-bearing transport signals from most failures in under 60 ms. Although the deployment of these ring architectures into networks generally improves network survivability, a rigorous quantification of these improvements in terms of service availabilities has not been demonstrated for general ring-based mesh networks. To quantify these improvements proactively, the mathematical relationship between network survivability and service availability needs to be developed and evaluated.
This article presents results from recent research that attempted to quantify the impact on service availability of implementing varying degrees of network robustness within survivable transport networks. Specifically, closed-form algebraic terminal-pair availability models were developed for single-ring systems, ring-based mesh networks, and point-to-point-based mesh networks. Computational approximations to the mesh models were made to reduce the general closed-form algebraic terminal-pair expressions into factored-form expressions to make the models computationally feasible. In addition, algorithms that compute the terminal-pair availability for these network models were developed, and these algorithms were then applied to report terminal-pair availability parametrically for single-ring architectures, ring-based mesh networks, and point-to-point mesh networks, and to quantify the sensitivity of this terminal-pair availability to changes in link and intersystem nodal survivability profiles. Bellcore fiber and equipment failure rates were used for all computations.
To better understand the terminology used in this article, a few terms need to be defined. The term survivability refers to the ability of a system/network to be maintained in the working state, given that a deterministic set of failures occurs to the system/network; therefore, the survivability is always "yes" or "no" for a given failure scenario (based on some reference restoration time, e.g., 60 ms). The term availability, A, is the probability that a system/network is in the working state at some time in the future (i.e., the fraction of time the system/network is operational). Availability can be computed for an existing system/network based on past performance data; however, to predict availability of a new system a priori, probabilistic models need to be formulated. In particular, when evaluating network performance, the relevant metric is terminal-pair or service availability, which can be computed for a system/network having a given survivability profile. Finally, unavailability, U, is the probabilistic complement of the availability (i.e., U = 1 – A) and is defined as the probability (fraction of time) the system/network is in the failed state. When reporting system/network performance, unavailability is usually converted to minutes per year or, if the mean time to repair (MTTR) from a nonsurvivable failure of the system/network is known, to the mean time between failures (MTBF), usually in years, where U = MTTR/MTBF.
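The conversion between availability and minutes of downtime per year can be sketched as follows. This is a minimal illustration, not code from the original research; the function names are hypothetical, and a 365-day year is assumed.

```python
# Minimal sketch: convert an availability A into unavailability U = 1 - A
# and then into expected downtime in minutes per year.
# Assumption (not from the article): a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def unavailability(availability: float) -> float:
    """Return U = 1 - A for an availability A in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return 1.0 - availability


def downtime_minutes_per_year(u: float) -> float:
    """Express an unavailability (fraction of time failed) in min/yr."""
    return u * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year: float) -> float:
    """Inverse conversion: min/yr of downtime back to an availability."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


if __name__ == "__main__":
    a = 0.99999  # illustrative value only
    print(f"A = {a}: {downtime_minutes_per_year(unavailability(a)):.3f} min/yr")
```

With this convention, an unavailability of 1 min/yr corresponds to an availability of roughly 0.999998.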

Ring System Architectures

Figure 1 illustrates the survivability model of an N-node unidirectional path-switched ring (UPSR). Although service-bearing transport signals that are transported around the ring are usually bidirectional circuits, this discussion will focus on only one of these directions, specifically from node s, called the origination node, to node t, called the destination node, as shown in Fig. 1; the discussions for the reverse direction (i.e., from t to s) are identical for this and all other architectures discussed in this article. Moreover, since both directions of a given transport signal generally traverse the identical set of links and nodes between nodes s and t, the service availabilities of the two directions can be assumed to be mathematically identical, and the bidirectional service availability, which is the mathematical intersection of these two directions, is equal to the service availability of either direction.
For the UPSR, the transport signal is duplicated at origination node s and transmitted onto both directions of the UPSR such that two copies of the transport signal are presented to the path selector at destination node t. In this model, we assume that when the circuit was originally provisioned on this UPSR, the upper path was established as the service path, which traverses h links of the N-node UPSR. If a link or node along this h-link path between s and t should fail, as shown in Fig. 2, the path selector at destination node t would perform a path protection switch to receive the copy of the transport signal arriving via the N – h link lower path and thus restore from the failure within 50 ms. Once the failure in Fig. 2 is repaired, the path selector can be either revertive, in which the path selector is switched back to the upper path, or nonrevertive, in which the path selector continues to select the lower-path connection, which then becomes the new service path.
Therefore, in terms of the service availability model, the service-bearing transport signal survives if one or more links and/or intermediate nodes along the h-link service path fail, and the surviving restoration path is always the N – h link path between s and t. When formulating the s–t service availability model, these two parallel s–t paths, which are illustrated in Fig. 3, completely capture all failure scenarios and thus make the model mathematically complete.
Figure 4 illustrates the survivability model of an N-node two-fiber bidirectional line-switched ring (BLSR). As was the case with the UPSR, only one direction of transport, specifically between origination node s and destination node t, is discussed. With the BLSR, transport signals that are routed between two nodes s and t are normally carried only on one set of links between s and t; this set of links makes up what is called the service path, which is shown in Fig. 4 as traversing h links. If a link or node along this h-link path between s and t should fail, as shown in Fig. 5, the two ring nodes adjacent to the failure perform loopbacks to route the channels that contain the s–t transport signals, called the s–t service channels, of one fiber onto the reserved protection channels of the other fiber, thereby creating a folded unidirectional ring that contains both directions of all service channels. Once the failure in Fig. 5 is repaired, the BLSR tests the repaired portion of the ring and then releases the loopbacks and reverts back to its normal configuration (Fig. 4).
Although the protection switching mechanisms are different, the BLSR survivability profile is identical to that of the UPSR; that is, the service-bearing transport signal survives if one or more links and/or intermediate nodes along the h-link service path fail. However, in terms of the service availability model, since the surviving protection paths are different for each failure scenario and depend on the nature and location of the failure, the s–t paths that need to be considered to calculate the service availability are the h-link s–t normal service path and all resulting s–t restoration paths corresponding to all possible failure scenarios from which the BLSR can restore within 50 ms. Nevertheless, when this mathematical exercise is carried out, all restoration paths mathematically reduce to only one such restoration path: that which is implemented when all links and intermediate nodes along the h-link service path fail. For this failure scenario, which is shown in Fig. 6, the loopbacks would be implemented in the two end nodes s and t, and the surviving restoration path would span the N – h links between s and t. Therefore, the two-fiber BLSR s–t service availability model is mathematically complete by considering only the normal h-link s–t service path and the N – h link s–t restoration path, as shown in Fig. 7. Consequently, the algebraic s–t service availability models of the UPSR and BLSR are identical, and service availability results are equally applicable to both ring architectures.
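The two-path structure shared by the UPSR and BLSR models can be illustrated with a simple series-parallel calculation. This is a sketch under stated assumptions rather than the article's closed-form expression: all failures are assumed independent, every link has the same availability `a_link`, every node the same availability `a_node`, and the end nodes s and t are treated as series elements common to both paths. The function names and numeric values are hypothetical.

```python
# Sketch of a two-parallel-path s-t availability model for an N-node ring.
# Assumptions (not from the article): independent failures, identical
# links and nodes, end nodes s and t in series with both paths.


def path_availability(links: int, a_link: float, a_node: float) -> float:
    """Series path of `links` links and (links - 1) intermediate nodes."""
    if links < 1:
        raise ValueError("a path needs at least one link")
    return a_link ** links * a_node ** (links - 1)


def ring_st_availability(n: int, h: int, a_link: float, a_node: float) -> float:
    """Availability between s and t when the service path spans h links
    and the restoration path spans the remaining n - h links."""
    if not 1 <= h < n:
        raise ValueError("h must satisfy 1 <= h < n")
    service = path_availability(h, a_link, a_node)
    restoration = path_availability(n - h, a_link, a_node)
    both_paths_fail = (1.0 - service) * (1.0 - restoration)
    return a_node ** 2 * (1.0 - both_paths_fail)


if __name__ == "__main__":
    # Illustrative availabilities only; not Bellcore data.
    for h in (1, 4, 8):
        a_st = ring_st_availability(16, h, a_link=0.999, a_node=0.9999)
        print(f"N = 16, h = {h}: U_st = {1.0 - a_st:.3e}")
```

Under these assumptions the expression is symmetric in h and N – h, and the unavailability is largest when the service path spans half the ring.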
When the service availability expressions of the UPSR and BLSR were formulated, the link variables were modeled in a separable layer such that they were actually link-configuration-independent; therefore, these two-fiber models can be used as a basis for all ring architectures, and the availability model of a given link configuration can be algebraically substituted into this basic expression to obtain the desired link-configuration-dependent service availability expression.
The three link configurations implemented on ring systems are shown in Fig. 8; they are:
  • Two-fiber links
  • Four-fiber links
  • Diversely routed four-fiber links
Although these link configurations could be placed on UPSRs or BLSRs, our discussion will focus on the BLSR, since four-fiber links are deployed only on BLSRs. Nevertheless, models and results found for the four-fiber BLSR would be equally applicable for a four-fiber UPSR, since the service availability expressions were found to be identical for both architectures.
The two-fiber links configuration is, of course, exactly the configuration used to construct two-fiber BLSRs; since there is only one set of fibers connecting adjacent nodes on the ring, no span protection capabilities are possible, and the BLSR completely depends on the ring loopback protection switching mechanism to restore from all fiber and/or nodal failures.
The four-fiber links configuration assumes that two fiber pairs sharing a common sheath are connected between the two adjacent nodes; in this configuration, if a transmitter/receiver element (T/R in Fig. 8) on one of the nodes should fail, span protection switching could take place to immediately restore from the failure. Nevertheless, since both fiber pairs are within the same sheath, ring loopback protection switching would still be required to restore from any fiber or major nodal failures.
In the diversely routed four-fiber link configuration, the two fiber pairs are placed in separate sheaths and diversely routed; therefore, span protection switching can be used to restore from both transmitter/receiver nodal failures and fiber failures. In this case, ring loopback protection switching is needed only for multiple failures on the same link (e.g., a T/R failure followed by a protection fiber failure) or for major nodal failures.
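One way to see how the three link configurations differ is to write each as a small series-parallel availability expression. The sketch below is an illustrative simplification, not the article's link model: it assumes independent failures, lumps the transmitter/receiver elements serving one fiber pair into a single availability `a_tr`, and treats each sheath as a single element with availability `a_sheath` (so a cable cut takes out every fiber in that sheath). All names and values are hypothetical.

```python
# Sketch of span-level availability for the three link configurations.
# Assumptions (not from the article): independent failures; `a_tr` lumps
# the T/R elements for one fiber pair; `a_sheath` covers the whole cable.


def two_fiber_link(a_sheath: float, a_tr: float) -> float:
    """One fiber pair: the span works only if sheath and T/R both work."""
    return a_sheath * a_tr


def four_fiber_shared_sheath_link(a_sheath: float, a_tr: float) -> float:
    """Two fiber pairs in one sheath: span switching covers a T/R
    failure, but a sheath cut still takes the span down."""
    return a_sheath * (1.0 - (1.0 - a_tr) ** 2)


def four_fiber_diverse_link(a_sheath: float, a_tr: float) -> float:
    """Two fiber pairs in separate, diversely routed sheaths: the span
    fails only if both (sheath + T/R) branches fail."""
    branch = a_sheath * a_tr
    return 1.0 - (1.0 - branch) ** 2


if __name__ == "__main__":
    a_sheath, a_tr = 0.9995, 0.9999  # illustrative values only
    for name, fn in (
        ("two-fiber", two_fiber_link),
        ("four-fiber, shared sheath", four_fiber_shared_sheath_link),
        ("four-fiber, diversely routed", four_fiber_diverse_link),
    ):
        print(f"{name}: span unavailability = {1.0 - fn(a_sheath, a_tr):.3e}")
```

In a ring, a span that fails under any of these configurations is still covered by ring loopback switching, so these span-level values would be substituted into a ring-level model rather than used on their own.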
One of the charts derived from the single-ring computations that were based on the above models is shown in Fig. 9, which graphs the s–t service unavailability Ust in minutes per year versus the number of ring nodes N for a service-bearing transport signal that traverses h = N/2 links, which yields the worst-case s–t unavailability of an N-node ring with an average link distance between adjacent nodes of 10 mi. As one would expect, the worst-case service unavailabilities increase with the number of nodes on the ring. In addition, because of the span protection capabilities, the four-fiber ring unavailabilities are lower and less sensitive to the number of ring nodes than is the case for the two-fiber ring; in fact, the diversely routed four-fiber ring is virtually insensitive to the number of ring nodes, yielding a constant worst-case service unavailability of 0.500 min/yr. Although the differences in unavailability increase with the number of nodes, the maximum worst-case difference between the two-fiber and diversely routed four-fiber rings, which occurs at N = 16, is only 0.659 min/yr, and more typical worst-case differences are within 0.1–0.4 min/yr. For example, at N = 8 nodes, this worst-case difference is only 0.165 min/yr.
Some additional service availability results are summarized in Table 1. Specifically, service unavailabilities and MTBF results are given for two-fiber rings, four-fiber rings, diversely routed four-fiber rings, and equivalent series-interconnected diversely routed point-to-point systems for both short- and long-haul networks. The first two columns report unavailability results for an average service circuit and a worst-case service circuit, respectively, that traverse a 16-node ring within a typical short-haul network. For this short-haul model, the average link distance between adjacent nodes was assumed to be 10 mi, and the MTTR from a nonsurvivable failure was assumed to be 4 hr.
The worst-case circuit data were just discussed and graphed in Fig. 9, and, for N = 16, these worst-case unavailability figures are 1.159 min/yr, 0.621 min/yr, and 0.500 min/yr for the two-fiber ring, four-fiber ring, and diversely routed four-fiber ring, respectively; with an MTTR of 4 hr, the corresponding MTBF figures are 207 years, 386 years, and 479 years, respectively. Once again, although the diversely routed four-fiber ring yielded the lowest unavailability results, the service unavailability results for the three ring architectures are all within only 0.659 min/yr of each other. For the equivalent diversely routed point-to-point architecture, which consists of eight series-connected point-to-point systems, the corresponding unavailability and MTBF results are 6.30 min/yr and 38.1 years, respectively, which is 5.14 min/yr greater than the two-fiber ring result. For an average circuit, which traverses one-quarter of the ring circumference (i.e., four links), the unavailability results are somewhat more clustered, with all three ring unavailabilities within 0.494 min/yr of each other; meanwhile, the service unavailability of the equivalent four-link diversely routed point-to-point architecture is 2.99 min/yr, over 2 min/yr greater than that for the two-fiber ring.
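The MTBF figures quoted above follow from the relation U = MTTR/MTBF. The sketch below reproduces them from the reported unavailabilities; the function name is hypothetical and a 365-day year is assumed.

```python
# Sketch: derive MTBF (years) from an unavailability in min/yr and an
# MTTR in hours, using U = MTTR / MTBF.
# Assumption (not from the article): a 365-day year.

MINUTES_PER_YEAR = 365 * 24 * 60
HOURS_PER_YEAR = 365 * 24


def mtbf_years(u_minutes_per_year: float, mttr_hours: float) -> float:
    """MTBF in years for a given unavailability and mean time to repair."""
    if u_minutes_per_year <= 0:
        raise ValueError("unavailability must be positive")
    u_fraction = u_minutes_per_year / MINUTES_PER_YEAR  # dimensionless U
    mtbf_hours = mttr_hours / u_fraction                # MTBF = MTTR / U
    return mtbf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # Unavailabilities reported in the text for N = 16, MTTR = 4 hr.
    for label, u in (
        ("two-fiber ring", 1.159),
        ("four-fiber ring", 0.621),
        ("diversely routed four-fiber ring", 0.500),
        ("diversely routed point-to-point", 6.30),
    ):
        print(f"{label}: MTBF = {mtbf_years(u, 4.0):.1f} years")
```

Using the rounded value of 0.500 min/yr gives about 480 years rather than the reported 479 years; the small gap is presumably due to rounding of the published unavailability.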
The third and fourth columns report unavailability results for a worst-case service circuit that traverses a 16-node ring within a typical long-haul network having average link distances between adjacent nodes of 50 and 100 mi, respectively. For this long-haul model, the MTTR from a nonsurvivable failure was assumed to be 6 hr.
As expected, the unavailability results are all higher than those computed for 10-mi links. For the 50-mi link case, for which no intermediate regeneration was assumed, the worst-case unavailabilities for the two-fiber, four-fiber, and diversely routed four-fiber rings are 5.34 min/yr, 3.52 min/yr, and 0.500 min/yr, respectively; for the equivalent diversely routed point-to-point architecture consisting of eight 50-mi links, the unavailability is 6.82 min/yr, which is only about 1.48 min/yr higher than that for the two-fiber ring. Note that the worst-case service unavailability determined for the diversely routed four-fiber ring is identical to that found with 10-mi links, thus illustrating the insensitivity of the diversely routed four-fiber architecture to increased link distances. In addition, note that the difference between the two- and four-fiber ring worst-case unavailabilities is about 1.82 min/yr; that is, the four-fiber ring provides a 34 percent improvement over the two-fiber ring.
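The differences and the percentage improvement quoted for the 50-mi case can be checked with a few lines of arithmetic. The helper name below is hypothetical; the input values are the unavailabilities reported in the text.

```python
# Sketch: absolute and relative differences between reported worst-case
# unavailabilities (min/yr) for 50-mi links.


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction in unavailability relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    two_fiber, four_fiber, point_to_point = 5.34, 3.52, 6.82
    print(f"two- vs four-fiber difference: {two_fiber - four_fiber:.2f} min/yr")
    print(f"four-fiber improvement: {100 * relative_improvement(two_fiber, four_fiber):.0f} percent")
    print(f"point-to-point vs two-fiber: {point_to_point - two_fiber:.2f} min/yr")
```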
Based on the 50-mi link results, one could expect that the worst-case unavailabilities of the two- and four-fiber ring architectures would continue to diverge, while those for the diversely routed four-fiber ring would remain about the same for longer link distances. However, as shown in the fourth column in Table 1 for 100-mi links, although the worst-case unavailabilities for the ring architectures all do increase significantly, the difference between the two- and four-fiber architectures is only 3.8 min/yr; that is, the four-fiber ring provides only a 19 percent improvement over the two-fiber ring. The erosion of the difference between two- and four-fiber ring service unavailabilities, which is due to the inclusion of one regenerator on each 100-mi ring link, becomes even more pronounced at longer distances that require additional regenerators. Note that even the worst-case service unavailability of the diversely routed four-fiber architecture was increased slightly by the addition of the regenerator onto each fiber link on the ring.

Survivable Mesh Networks

The previous section focused on the service availability models of individual ring system architectures. In this section, selected results are presented from the general service availability models developed for ring-based and point-to-point-based mesh networks. In particular, consider Fig. 10, which illustrates the service availability model between two nodes S and T that are connected via three dual-connected rings (called a three-ring ST ring chain).
Although failures that occur within the individual rings are covered by the ring system models discussed in the previous section, ring-based meshes have additional points of vulnerability at the interconnection nodes between adjacent rings; therefore, dual-node connectivity (or at least connectability) is recommended to allow for restoration from a failure within a given interconnection node via its corresponding secondary interconnection node. To model the various methods of restoring from such a failure, a normalized survivability factor α was defined (i.e., 0 ≤ α ≤ 1) to capture the full scope of restoration alternatives, which range from having no secondary node connectability (α = 0) to implementing full-time drop-and-continue-based dual-node connectivity (α = 1).
Next, consider Fig. 11, which illustrates an equivalent point-to-point-based mesh network in which three point-to-point mesh segments are similarly dual-connected between nodes S and T. For point-to-point mesh networks, dual connectivity (or at least dual connectability) is required to restore from both interconnection node failures and fiber link failures (unless the point-to-point system links are diversely routed). To model the various methods of restoring from these failures, the normalized survivability factor α that was defined for the ring-based mesh network models is used to capture the full scope of restoration alternatives possible for the point-to-point mesh topology; these alternatives range from having no secondary node connectability (α = 0) to implementing full-time dual routing of the signal through the network (α = 1).
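The article does not give the algebraic form in which the normalized survivability factor (written α here) enters the mesh models. Purely as an illustration, the sketch below uses one plausible interpretation: an interconnection survives if the primary node works, or if the primary fails and the secondary node both works and can take over, with α weighting that takeover. This formula, the function names, and the numeric values are assumptions, and failures are assumed independent.

```python
# Illustrative sketch only: one possible way a survivability factor alpha
# in [0, 1] could weight secondary-node restoration at an interconnection.
# This is NOT the article's formulation; it is an assumed model.


def interconnection_availability(a_node: float, alpha: float) -> float:
    """alpha = 0: primary node only; alpha = 1: full dual-node protection."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return 1.0 - (1.0 - a_node) * (1.0 - alpha * a_node)


def chain_availability(segments: list[float], a_node: float, alpha: float) -> float:
    """Series chain of segments (e.g., rings) joined by interconnections.

    `segments` holds the S-T availability contribution of each segment;
    k segments are joined by k - 1 interconnections.
    """
    if not segments:
        raise ValueError("at least one segment is required")
    result = 1.0
    for a_segment in segments:
        result *= a_segment
    return result * interconnection_availability(a_node, alpha) ** (len(segments) - 1)


if __name__ == "__main__":
    rings = [0.999999, 0.999999, 0.999999]  # illustrative values only
    for alpha in (0.0, 0.5, 1.0):
        u = 1.0 - chain_availability(rings, a_node=0.99999, alpha=alpha)
        print(f"alpha = {alpha}: chain unavailability = {u:.3e}")
```

In this assumed model the chain unavailability falls monotonically as α increases from zero to unity, which matches the qualitative trend the article reports.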
The S–T service unavailabilities for the point-to-point-based mesh network shown in Fig. 11 are graphed in Fig. 12 versus the node survivability factor α for point-to-point systems having:
  • Unprotected two-fiber links
  • Four-fiber links that share the same sheath
  • Diversely routed four-fiber links
All link distances were assumed to be 10 mi. As shown in this graph, unless some nodal restoration capability is implemented, only the diversely routed scenario yields service unavailabilities that are at all acceptable. Nevertheless, note that all three link configurations yield about the same service unavailability if the signal is dual-routed through the network (i.e., α = 1), which indicates that the additional survivability provided by the four-fiber links is dominated by the survivability provided by the dual-node routing between S and T.
The worst-case S–T service unavailabilities for the ring-based mesh network shown in Fig. 10 are graphed in Fig. 13 versus the node survivability factor α for ring systems having:
  • Two-fiber links
  • Four-fiber links that share the same sheath
  • Diversely routed four-fiber links
For comparison purposes, the results from Fig. 12 for the diversely routed point-to-point mesh topology are also included. All link distances were assumed to be 10 mi, and, to maintain a fair basis of comparison with the point-to-point scenarios, all ring systems were assumed to consist of four nodes.
As shown in Fig. 13, the worst-case service unavailabilities of all three ring architectures are tightly clustered together and are all about 1 min/yr less than those found for the diversely routed point-to-point topology for all values of α; moreover, the service unavailabilities all experience about a 1 min/yr improvement as α increases from zero to unity. In particular, all three ring results range from about 2.6 min/yr (for α = 0) to about 1.6 min/yr (for α = 1); meanwhile, the diversely routed point-to-point results range from about 3.6 min/yr (for α = 0) to about 2.6 min/yr (for α = 1).
Therefore, based on these results, all three ring-based mesh scenarios yield service unavailabilities that are approximately equal and are always lower than any point-to-point mesh for all values of α. In addition, while the use of diversely routed links is essential to obtain acceptable unavailability results within point-to-point-based mesh networks, the use of four-fiber links within ring-based mesh networks provides no discernible improvement in service unavailability. Therefore, optimal service availabilities can be sufficiently realized with two-fiber ring-based mesh networks that also implement some level of intersystem nodal survivability.

Conclusions

In this article, selected results from recent research were presented that included the modeling of various self-healing architectures in order to compute the service availability between any two nodes within a given survivable transport network. Based on the results that were presented in this article, the following conclusions regarding the design of survivable networks can be stated:
  • Ring-based mesh networks generally yield significantly higher service availabilities than point-to-point mesh networks.
  • Span-protected links significantly increase the service availabilities of point-to-point mesh networks, but not of ring-based mesh networks.
  • Interconnection node protection can further improve the service availabilities of both ring-based and point-to-point mesh networks.
  • Additional link protection is no substitute for interconnection node protection; that is, optimal service availabilities are realizable only if sufficient levels of both link and interconnection node protection are implemented.

Biography
Mark R. Wilson [M] is a Distinguished Member of Technical Staff in the Transport Networking Evolution and Planning Group at Bell Laboratories in Holmdel, New Jersey. Before joining Bell Labs, he received his B.S.E.E. from Drexel University and his M.S.E. from the Moore School of Electrical Engineering at the University of Pennsylvania. Since joining Bell Laboratories, he has completed his Ph.D. degree, also from the Moore School at the University of Pennsylvania. His research at Bell Labs has included analyzing the performance and economics of transport networking technologies (IP, ATM, STM, and WDM) and developing network evolution/vision architectures that optimize the deployment of these technologies into public, private, and virtual private networks; he has also designed numerous survivable network plans for both short-haul and long-haul network service providers. He is a member of the IEEE Communications and Reliability Societies, Eta Kappa Nu, and Tau Beta Pi.