Copyright 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This article was published in the May 1998 issue of IEEE Communications Magazine.

Abstract
This article discusses the design of a scalable high-performance multiservice network based on programmable transport. A meshed network with dynamically adjustable link capacities and nodes which provide data packing into "containers" for transport is proposed. A ring-based network, exchanging data containers among its nodes, is the preferred implementation due to its flexibility, maintainability, and high reliability. With lossless rings, the quality of service is controlled solely by the origin and destination nodes, without any interference from other data streams. Flexible programmable transport greatly improves the performance, simplifies the controls, and facilitates scalability. The concept is a departure from classical network thinking. By reducing the complexity of the network core, an economical, reliable, and manageable network with feature-rich edge nodes can be realized. An architecture with recursive ring-based structures provides a high degree of flexibility in bandwidth allocation and is compatible with current transport networks and future all-optical networks.

 


Architecture and Control of an Adaptive High-Capacity Flat Network

Eric Livermore, Richard P. Skillen, Maged Beshai, and Marek Wernik
Nortel

 

To date, communication networks have been used primarily for voice services, and support limited data and computer services. In recognition of the predominance of voice traffic, a circuit-switched channelized architecture evolved and was optimized for telephony. Data and computer communications access and transport have been provided as an overlay on this channelized infrastructure.
Although many data technologies have been developed, the advent and subsequent popularity of the Internet, coupled with its ability to support multiple services, including telephony, has changed the way users communicate and do business. Today, telecommunications networks are shifting from specialized networks toward multipurpose multifunctional networks.
The emerging data network must be able to grow to a much higher capacity than that of today's voice and data networks. In addition to the huge capacity requirement, the emerging networks must provide versatile services. The multiplicity of connection protocols and the effort required for their interworking reduce the ability of the network to provide service diversity. The simplest network would be fully connected, allowing every networking device to have a physical connection to every other networking device. However, as the network size grows, this fully meshed structure rapidly becomes impractical. Due to the spatial variation of traffic loads and the typically large modular sizes of transport links, a fully meshed network normally leads to underutilized transport facilities.
Currently, transport capacity sharing is based on coarse granularity, where point-to-point connections are defined as multiples of standard tributaries such as OC-3, OC-12, and so on. This renders tandem switching -- at a lower granularity -- necessary to establish end-to-end connections with an acceptable quality of service (QoS) and an acceptable overall network efficiency. A low-connectivity network with excessive tandem switching may, however, lead to an uneconomical network, due to the increased number of hops between origin and destination and the cost of processing in intermediate nodes. Tandem switching in the multiprotocol environment is rather complicated, and several techniques have been proposed to reduce the processing at intermediate nodes by creating "shortcuts" [1]. In addition, tandem switching adds variable delay and increases latency.
The multiplicity of communications protocols and the need for coexistence between new and legacy networks complicate the network planning function. In addition, QoS is difficult to realize in such a heterogeneous network. Currently, IP-based networks do not take QoS into account, and one of the approaches to provide QoS assurance is to interwork -- at a prohibitive cost -- with an underlying ATM network. The complexity and fragility of such approaches are formidable impediments to network scalability.
Advances in electronics, photonics, and processing facilitate the development of high-capacity switching and transport nodes and enable flexible bandwidth allocations. Higher-capacity nodes may result in reducing the number of hops from origin to destination, thus simplifying the routing protocols and overall network design, and the flexible bandwidth allocations greatly simplify the network provisioning task. Adaptive capacity allocation is a highly desirable feature since long-term traffic forecasts are no longer feasible in the rapidly evolving data network.
A flat, highly connected network, with an adaptive allocation of capacity for each node pair, realizes simplicity, versatility, scalability, and high transport efficiency at the same time.

Flat Network Architecture

A flat network may be defined as a fully connected, or very highly connected, network. Network connectivity may be defined as the reciprocal of the traffic-weighted mean number of hops per node-pair; a fully connected network would have a connectivity of unity.
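The connectivity measure defined above can be illustrated with a minimal sketch. The three-node demand values and the function name `connectivity` are hypothetical, chosen only to show the arithmetic; they do not come from the article.

```python
def connectivity(traffic, hops):
    """Reciprocal of the traffic-weighted mean number of hops per node pair.

    traffic[(a, b)] is the demand from node a to node b (any consistent unit);
    hops[(a, b)] is the number of hops on the path serving that pair.
    """
    total = sum(traffic.values())
    if total <= 0:
        raise ValueError("total traffic must be positive")
    weighted_hops = sum(demand * hops[pair] for pair, demand in traffic.items())
    return total / weighted_hops


# Hypothetical three-node demand matrix (illustrative values only).
traffic = {("A", "B"): 4.0, ("A", "C"): 1.0, ("B", "C"): 5.0}

full_mesh = {pair: 1 for pair in traffic}                # every pair is one hop apart
via_b = {("A", "B"): 1, ("A", "C"): 2, ("B", "C"): 1}    # A reaches C through B

print(connectivity(traffic, full_mesh))  # fully connected network: 1.0
print(connectivity(traffic, via_b))      # one tandem hop lowers the value below 1
```

In the second case only one tenth of the traffic takes the two-hop path, so the connectivity drops only slightly, to 10/11.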

Rigid vs. Flexible Mesh Structures

Traditional transport systems can offer a meshed network by providing direct interconnections between the networking devices. However, the connections would be based on channelized time-division multiplexing, where the bandwidth allocated to a node pair is fixed and dedicated to the node pair. When the connection between two devices is inactive for some period of time, the transport bandwidth is still reserved so that it cannot be utilized by other active connections. Thus, the networking device interfaces may not be as efficiently utilized as in the tandem approach, which allows interconnections from many networking devices to share the interface to a given device.
A fully meshed network, as depicted in Fig. 1, is not scalable to cover a large number of nodes, unless the link capacities are elastic and can be modified rapidly to follow the traffic demand variation. Such a network would allow all the connections to share a common pool of capacity through paths whose capacities are dynamically adjustable. The nodes provide data packaging into "containers" for transport, and a ring exchanges data containers among its nodes. The containers may be of fixed or variable sizes. A service rate calculation for each source-destination node pair is carried out by a centralized or distributed controller, which either monitors the traffic or receives updated capacity allocation requests from the nodes and assigns an appropriate data rate at which each node can transmit to each destination. With lossless rings (traffic-wise), the QoS is controlled solely by the source and destination nodes, without any interference from other data streams within the network. By reducing the complexity of the network core, an economical, reliable, and manageable network with feature-rich edge nodes can be realized.
In summary, a flexible programmable transport simplifies network management and extends the network capacity and network coverage by letting the end nodes control the QoS and end-to-end capacity allocation. Rather than forcing the network to cope with multiple protocols, node pairs can communicate directly through adaptive end-to-end links. Thus, interoperability is replaced by protocol disengagement. Flexible capacity allocation can be realized in frame-based or packet-based schemes.

Topology Overview

The envisaged network has a high-connectivity structure. The basic requirements of such a network are simplicity, scalability, transport efficiency, the ability to accommodate existing legacy subnetworks, and -- most important -- high reliability. There are several candidate topologies which primarily fall into two main categories: cross-connection and ring sharing.
Both cross-connection- and ring-based networks can be configured to be fully meshed or almost fully meshed. A cross-connection-based network would normally have lower connectivity than a ring-based network and is not discussed in this article. A ring structure lends itself to fine capacity partitioning with relatively simple controls. Ring sharing can be achieved in several ways; for example, by using ATM nodes and SONET rings, as depicted in Fig. 2a. However, a flexible-transport layer (Fig. 2b) would achieve a more economical solution. A path of controlled variable capacity for each node pair is established in the network, thus creating a flexible fully connected network. Each path may carry multiple traffic classes, and the capacity of the path can be dynamically shared among the classes at the container-packing stage. The network can accommodate a mixture of data, voice, and video traffic of both unicast and multicast natures. The multiclass service discipline is decided only at the originating and terminating nodes.
Data is transferred in containers of any desired size. Each node in the network is a simple container packing and unpacking device. Each node is assigned a number of containers to carry its data to other nodes in the ring. The container size (i.e., number of bits per container) is arbitrary. The set of containers available to all nodes forms a frame. The frame size is arbitrary and is governed by granularity or data-delay specifications. The containers belonging to a given originating node are said to form a subframe; that is, a frame contains a subframe for each originating node.
At least two control schemes are possible, one based on processing a frame header, the other on processing the headers of the individual containers. In the first, a node identifies the locations of its receivable payload from the information in the frame header. In the second scheme, each node reads the labels of all containers but copies only the relevant data. Some containers may be multicast containers. A multicast container is inserted once per ring but may be read by each node in the ring. The container duration is determined by the container size and ring speed. With an OC-192 (10 Gb/s) ring, for example, a 4096-bit container consumes 0.41 µs. A container's share of the ring capacity is the reciprocal of the number of containers per frame. Generally, a large number of containers per frame results in a better resolution at the expense of a higher delay. A frame of 1000 containers, of 4096 bits capacity each, has a period of 410 µs in an OC-192 (10 Gb/s) ring, with each container representing 1/1000 of the ring rate; 10 Mb/s in the OC-192 case. Increasing the size of the frame to 10,000 containers increases the frame period to 4.1 ms, and each container represents a capacity unit of 1 Mb/s. A large frame period, however, is undesirable, and with appropriate container scheduling it is possible to realize a very high resolution using relatively short frames.
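The container and frame figures quoted above follow from three simple ratios. The sketch below reproduces them, taking the nominal 10 Gb/s ring rate used in the text; the function names are illustrative.

```python
def container_time_s(container_bits, ring_rate_bps):
    """Time one container occupies on the ring."""
    return container_bits / ring_rate_bps


def frame_period_s(containers_per_frame, container_bits, ring_rate_bps):
    """Duration of a frame made of equal-size containers."""
    return containers_per_frame * container_bits / ring_rate_bps


def capacity_unit_bps(containers_per_frame, ring_rate_bps):
    """Share of the ring rate represented by one container per frame."""
    return ring_rate_bps / containers_per_frame


RING_RATE = 10e9   # nominal OC-192 rate, as used in the text
BITS = 4096        # container size from the example

print(container_time_s(BITS, RING_RATE))          # about 0.41 microseconds
for k in (1000, 10_000):
    # frame period in seconds, capacity unit in bits per second
    print(k, frame_period_s(k, BITS, RING_RATE), capacity_unit_bps(k, RING_RATE))
```

The trade-off is visible directly: a tenfold longer frame gives a tenfold finer capacity unit (1 Mb/s instead of 10 Mb/s) at the cost of a tenfold longer frame period.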
A single-channel ring constitutes a basic control domain. The domain definition may be extended to include several parallel single channels, as in Fig. 3. Extending the network capacity is realized by increasing the domain capacity, and extending the network coverage is realized by using several intersecting domains. A domain may share common nodes with one or more other domains. A node in a given domain sends the traffic to a neighboring domain through one or more common nodes. A domain may also function as a transit medium between two nonadjacent domains. A common node between domain X and domain Y merges the contents of all containers received from the X domain nodes and forms new containers to the Y domain nodes, and vice versa. In this process, there are no surprises: the rate controllers at the network edge ensure that the exchange process leaves no residues; hence, no buffering -- apart from what is needed for the exchange process -- is required. The common nodes, connecting domains, perform the extra task of container merging.
The allocation of containers to each node is controlled by a designated domain node. The allocation can be changed from one frame to the next using a simple rate controller which facilitates precise rate allocation for each node pair. The train of containers from a given source to a destination node is naturally concatenated at the destination node. Therefore, the container-packing process need not be aware of the data format and need not limit the content of a container to an integer number of data transaction units, such as ATM cells, IP packets, or circuit-switched data blocks.

Ring Capacity

A basic ring interconnects a number of add-drop multiplexers which share a single unidirectional channel. Generally, the term channel is used to indicate a tributary or an entire wavelength in a single fiber. In this article, the term refers to a wavelength. The total capacity per channel in a ring of n nodes (i.e., the maximum traffic carried by the channel) depends on the capacity R of the channel (e.g., 10 Gb/s) and the spatial distribution of the traffic demand (the traffic distribution among the end nodes). Assuming a uniform distribution of the spatial traffic demand, the capacity limit of a unidirectional ring is C = 2R. Using a ring pair, comprising a clockwise ring and a counter-clockwise ring, as shown in Fig. 4, routes with more than n/2 hops can be avoided, and the capacity C per channel increases to C = 4nR/(n + 1) if n is odd, or C = 4(n - 1)R/n if n is even, under the same assumption of spatially balanced traffic demand. Thus, using dual rings enhances the capacity per wavelength by a factor of approximately two: a dual ring carries almost double the traffic of two identical unidirectional rings.
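The capacity expressions above can be checked by counting hops. The sketch below assumes, as the text does, uniform traffic among all node pairs, and additionally assumes shortest-way routing on the dual ring. A channel offers n links of rate R, and each unit of carried traffic occupies as many links as its path has hops, so the capacity per channel is nR divided by the mean hop count. Capacities are expressed in units of R; the function names are illustrative.

```python
from fractions import Fraction


def mean_hops_unidirectional(n):
    """Mean hop count when every destination is reached in one direction only."""
    return Fraction(sum(range(1, n)), n - 1)


def mean_hops_dual(n):
    """Mean hop count when each destination is reached the shorter way round."""
    return Fraction(sum(min(d, n - d) for d in range(1, n)), n - 1)


def capacity_per_channel(n, mean_hops):
    """Capacity per channel in units of R: n links shared by paths of mean_hops links."""
    return Fraction(n) / mean_hops


for n in (7, 16):
    uni = capacity_per_channel(n, mean_hops_unidirectional(n))
    dual = capacity_per_channel(n, mean_hops_dual(n))
    # Closed forms quoted in the text, in units of R.
    closed = Fraction(4 * n, n + 1) if n % 2 else Fraction(4 * (n - 1), n)
    print(n, uni, dual, closed)
```

For every n the unidirectional result is exactly 2 and the dual-ring result matches the closed form, which tends to 4 as n grows.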
To realize a ring (domain) capacity on the order of tens of terabits per second, it is necessary to use multiple fibers, each supporting multiple wavelengths. The channels of a domain must be interconnected at least at one point. Using M fibers, each supporting N wavelengths (i.e., MN channels) at a bit rate of R Gb/s per channel yields a total capacity of more than RMN Gb/s. The MN channels must exchange payload data through one or more container switches. For example, with M = 128, N = 16, and R = 10 Gb/s, container switches of 20 Tb/s capacity each would be needed. The domain of Fig. 3 has 128 fiber rings, each supporting 16 wavelengths, with 10 Gb/s per wavelength. Four interconnecting container switches, topologically evenly spaced, are used. The domain capacity is not a linear function of the number of container switches, and it is sufficient to use four container switches per domain, each interconnecting all the channels of the domain, to realize almost the maximum attainable capacity for a given set of single rings. In the domain of Fig. 3, the maximum capacity is about 80 Tb/s (only by coincidence, this equals the sum of the capacities of the container switches). It is noted that using more than four 20 Tb/s switches results in an insignificant increase in the overall capacity.
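The figures in this example can be related to the dual-ring expression given earlier. The article does not spell out how the 80 Tb/s value is obtained; the sketch below is one plausible reading, assuming that every channel is part of a dual ring and that each ring has n = 16 nodes (the value described as typical in the frame-based control example). Both are assumptions.

```python
M = 128        # fibers in the domain
N = 16         # wavelengths per fiber
R_GBPS = 10    # bit rate per channel, Gb/s
NODES = 16     # assumed nodes per ring (not stated for Fig. 3)


def dual_ring_factor(n):
    """Capacity per channel of a dual ring, in units of R, under uniform traffic."""
    return 4 * n / (n + 1) if n % 2 else 4 * (n - 1) / n


channels = M * N                                  # number of wavelength channels
aggregate_tbps = channels * R_GBPS / 1000         # raw rate a container switch must handle
domain_bound_tbps = aggregate_tbps * dual_ring_factor(NODES)

print(channels, aggregate_tbps, domain_bound_tbps)
```

Under these assumptions the aggregate channel rate is 20.48 Tb/s, in line with the 20 Tb/s container switches, and the bound of 76.8 Tb/s is of the same order as the quoted maximum of about 80 Tb/s.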
It may also be desirable to partially connect some of the rings, using intermediate-capacity container switches, in order to shorten some end-to-end paths as shown in Fig. 5. The channels are arranged in dual pairs (clockwise/counter-clockwise). The figure illustrates the interconnection of all the rings by 20 Tb/s container switches, and also shows smaller container switches interconnecting dual rings.

The Basic Node and Higher-Capacity Nodes

The basic node interconnects a dual ring and local subtending edge devices. In the example of Fig. 6, the basic node shown has a total input plus output capacity of 60 Gb/s. In a symmetrical node, the input and output sides have 30 Gb/s capacity each. The total capacity may, however, be divided unevenly between input and output in order to accommodate multicast traffic without wasting switch-core capacity, since in multicasting the data is inserted once but may be delivered to several destinations.
As mentioned earlier, a very-high-capacity container switch is needed to interconnect the channels in a wavelength-division multiplexed (WDM) multifiber ring. Large container switches, with capacities ranging from 40 Gb/s to 20 Tb/s, may be realizable using multiple WDM links and a rotator-based core architecture [2]. Several alternate realizations are possible. The simplest may be an all-electronic switch core. Other "mostly optical" solutions may also become economically viable. Multi-dual-channel add-drop modules are advantageous in reducing the route lengths.

Topology Design

Determining the network layout, connectivity, and capacity allocation in a wide-coverage multidomain network is a challenging task. Optimization procedures are required for planning and provisioning purposes. These procedures are often computation-intensive but, fortunately, need not be performed in real time. On the other hand, fast topological algorithms are needed to respond (in milliseconds) to network-state changes. The topology design is influenced by the path-routing process, and vice versa. This interdependence offers a good opportunity for network optimization.

Control

In order to explain the main features of the control system, the discussion is limited to a single domain of a single channel. The control may be either frame- or container-based.

Frame-Based Control

The process is explained with the help of a numerical example. The container size, in this discussion, is assumed to be fixed at 4096 bits. To realize high-resolution end-to-end capacity allocation with independent successive frames (e.g., in units of 1 kb/s), the frame length (the number of containers per frame) may be prohibitively large. However, using a remainder and carry-over process, the frame length can be selected to be only about n(n - 1), n being the number of nodes per ring (typically 16). It is important to note that an end-to-end capacity of an arbitrary value may be allocated. For example, if a node pair requires 3.6 Mb/s, it is allocated one container every 10.82 frames, in a 256-container frame, with a channel capacity, R, of 10 Gb/s. The capacity allocation for a node pair is determined as c = R/S, where S is the mean number of time slots between successive containers for the node pair. A single container is carried during a single time slot. Thus, in this example, S = 2770 (i.e., 10.82 frames). The frame period is T = KL/R, where K is the number of containers per frame and L the number of bits per container. In the example of Fig. 7, K = 256, L = 4096, R = 10 Gb/s, and T = 105 µs.
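The numbers in this example can be reproduced from the two relations c = R/S and T = KL/R. The following sketch simply evaluates them for the stated values; the variable names mirror the symbols in the text.

```python
R = 10e9     # channel capacity, b/s
K = 256      # containers per frame
L = 4096     # bits per container
S = 2770     # mean number of time slots between successive containers

c = R / S                 # capacity allocated to the node pair, b/s
frames_between = S / K    # spacing between containers, in frames
T = K * L / R             # frame period, seconds
slot = L / R              # duration of one time slot, seconds

print(round(c / 1e6, 2), "Mb/s")          # close to the 3.6 Mb/s requested
print(round(frames_between, 2), "frames")
print(round(T * 1e6, 1), "microseconds per frame")
```

With these values the frame period comes out at roughly 105 microseconds, and one container every 2770 slots corresponds to about 3.61 Mb/s.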
The nodes may be of different capacities. Assuming equal-capacity nodes, each node is assigned the same number of containers which may be distributed unevenly according to destination. The containers from a given node will naturally be separated by containers originating from other nodes. Figure 8 shows the destinations of containers emanating from node 0; the shaded areas in the figure refer to the time slots granted to originating node 0 (i.e., the subframe of node 0). Under control of the domain scheduler, the container allocation may differ between consecutive frames for two possible reasons: change in traffic demand, and/or the allocation of a noninteger number of containers per path per frame, since a noninteger allocation necessitates fraction carry-over between successive frames.
It is noted that the unused time slots in the control frame can be granted to "best-effort" traffic streams, which make no capacity reservation, probably according to a weighted-priority discipline. A frame header identifies each container according to destination. Unassigned containers can be seized without reservation. Several techniques which attempt to maximize the transport efficiency, with predefined fairness, can be used.
Figure 9 depicts the change in allocation in a subframe, over a period of four successive frames, to accurately represent noninteger allocations. The containers in this example have a fixed size. Although the allocated containers for a node pair appear to be contiguous in the subframe of Fig. 9, they are in fact well separated in the time domain. The numbers of containers from node 0 to node 2 during the above four successive frames are 2, 1, 2, and 1, with a mean value of 1.5 containers/frame. Similarly, the allocations for node pair 0-7 are 2, 3, 2, and 2, with a mean value of 2.25 containers/frame. An arbitrary noninteger number of containers may be allocated for any node pair.
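A remainder and carry-over process of this kind can be sketched as follows. This is a minimal illustration, not the scheduler described in the article: each frame, the requested rate is added to a running carry, the integer part is granted as containers, and the fraction is carried to the next frame. The starting carry of 1/2 is an assumption, chosen because it happens to reproduce the per-frame counts quoted for Fig. 9.

```python
from fractions import Fraction
import math


def carry_over_allocation(rate, frames, carry=Fraction(0)):
    """Integer container grants per frame whose running mean tracks `rate`.

    rate  -- requested containers per frame (may be a noninteger Fraction)
    carry -- fractional credit held before the first frame (assumed value)
    """
    rate = Fraction(rate)
    grants = []
    for _ in range(frames):
        carry += rate
        grant = math.floor(carry)   # whole containers granted in this frame
        carry -= grant              # fractional remainder carried forward
        grants.append(grant)
    return grants


half = Fraction(1, 2)  # assumed initial carry
print(carry_over_allocation(Fraction(3, 2), 4, half))   # [2, 1, 2, 1]
print(carry_over_allocation(Fraction(9, 4), 4, half))   # [2, 3, 2, 2]
```

Exact fractions are used so that the remainder never drifts. A different starting carry changes only the order of the grants, not their mean.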
It is worthwhile to mention that, for low-speed applications, containers of very small sizes may be used, and byte-interleaving may even be appropriate.

Container-Based Control

In a container-based system, the added load at each node is rate regulated (Fig. 6) with an allocated service rate to each destination. The access capacity is shared by a number of traffic streams, which are allocated separate queues at ring access. A stream is identified by its origin and destination. A stream is allocated a direct end-to-end path, the capacity of which may vary with time. The traffic within a stream may constitute several classes, which share the capacity allocated to the stream. The streams may share the ring capacity according to one of many policies, such as strict priority, a fixed rate per stream, or a guaranteed minimum rate per stream. The capacity allocation per stream may be updated individually as the need arises.
Figure 10 shows the division of the stream capacity among four traffic classes according to the normalized capacity allocations 0.29, 0.21, 0.30, and 0.20. The first is allocated 0.29 of the capacity, the second 0.21, and so on. In this example, the containers are assumed to be of a fixed size. Fig. 10 uses symbols for the normalized capacity allocation for the stream under consideration, the duration of the data unit, and the container size in data units (β). During each data-unit duration, a given stream is credited by its normalized capacity allocation (0.29 for the first stream in the above example). A stream which accumulates a credit of β or more becomes eligible to transmit a container. The remainder, if the credit exceeds β, is retained by the class. If the credit is expressed as a fraction of the channel capacity, the value of β is unity.
Figure 11 illustrates the process in tabular form. An entry in Fig. 11 indicates the credit available to the corresponding class, and a class is ready for access when its credit is unity or more. A shaded entry corresponds to the time slot at which the class is entitled to network access.
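The credit mechanism can be sketched with the four allocations of the example. The sketch assumes a container size of unity, one container transmitted per time slot, and a tie-breaking rule that serves the eligible queue holding the largest credit; the article does not state how simultaneous eligibility is resolved, so that rule and the function name are assumptions.

```python
from fractions import Fraction


def credit_schedule(allocations, slots, beta=Fraction(1)):
    """Simulate credit accumulation and return (served, credits, timeline).

    allocations -- normalized capacity allocation per queue
    beta        -- container size, in the same units as the credits
    timeline[t] -- index of the queue served in slot t, or None if none is eligible
    """
    credits = [Fraction(0)] * len(allocations)
    served = [0] * len(allocations)
    timeline = []
    for _ in range(slots):
        # Every queue is credited with its allocation once per slot.
        credits = [c + a for c, a in zip(credits, allocations)]
        # Assumed rule: among eligible queues, serve the one with the most credit.
        winner = max(range(len(credits)), key=lambda i: credits[i])
        if credits[winner] >= beta:
            credits[winner] -= beta   # the remainder stays with the queue
            served[winner] += 1
            timeline.append(winner)
        else:
            timeline.append(None)
    return served, credits, timeline


allocations = [Fraction(29, 100), Fraction(21, 100), Fraction(30, 100), Fraction(20, 100)]
served, credits, timeline = credit_schedule(allocations, 100)
print(served)
print(timeline[:7])   # [None, None, None, 2, 0, 1, 3]
```

No queue reaches a credit of unity in the first three slots. After that, exactly one container is sent per slot, and the number of containers served for each queue stays within three of its exact share.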

Path Allocation for a Dual Ring

In the simplest path-allocation form, each stream is allocated a single end-to-end path which may traverse more than one ring. It may be desirable, however, to have two or more paths per stream for reliability as well as even distribution of the traffic intensity across the network.
The end-to-end routes may be separated between clockwise and counter-clockwise rings in such a way as to balance the traffic loads on individual internodal links and optimize the overall throughput. Path selection for multiple rings is more elaborate. It should aim at reducing the number of hops traversed from origin to destination, with a view to global optimality.
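For a single dual ring, the separation of routes between the two directions can be sketched as below. The sketch assumes shortest-way routing and uniform demand between all node pairs. When the two directions are equally long (n even, diametrically opposite nodes), it alternates by source parity; that tie rule is an assumption made here to keep the link loads even, not a rule taken from the article.

```python
def choose_direction(src, dst, n):
    """Return (direction, hops) for the shorter way round an n-node dual ring."""
    cw = (dst - src) % n     # hops on the clockwise ring
    ccw = (src - dst) % n    # hops on the counter-clockwise ring
    if cw < ccw:
        return "cw", cw
    if ccw < cw:
        return "ccw", ccw
    # Tie: alternate by source parity (assumed rule) to spread the load.
    return ("cw", cw) if src % 2 == 0 else ("ccw", ccw)


def link_loads(n, demand=1):
    """Load on every internodal link when each ordered node pair offers `demand`.

    loads[direction][i] is the load on the link leaving node i in that direction.
    """
    loads = {"cw": [0] * n, "ccw": [0] * n}
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue
            direction, hops = choose_direction(src, dst, n)
            step = 1 if direction == "cw" else -1
            node = src
            for _ in range(hops):
                loads[direction][node] += demand
                node = (node + step) % n
    return loads


print(choose_direction(0, 5, 7))   # ('ccw', 2)
print(link_loads(7))
print(link_loads(8))
```

With uniform demand every link carries the same load (6 units per link for n = 7, 8 units for n = 8), which is the balanced condition the text aims for. Nonuniform demand would require weighting the choice by the measured loads.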

Container Scheduling

For a lossless network, it is of paramount importance that the number of containers per node pair be time regulated. A domain scheduler performs this function using frequently updated traffic and topology information. It is important to remember that, due mainly to propagation delay, the traffic input to the scheduling process may differ from the actual traffic waiting at each node at scheduling time. To compensate for the time delay, real-time traffic characterization and projection may be applied at each node.
In a single domain, a multicast container is transmitted once and received by several nodes; that is, the number of received containers exceeds the number of transmitted containers. The scheduling function should attempt to exploit this asymmetry in order to maximize the overall throughput. In a multidomain network, a multicast container propagates from one domain to neighboring domains through common nodes.

Quality of Service Issues

As mentioned frequently above, the objective is to control the QoS at the edge since this avoids the complexity of having to deal with it at intermediate nodes. When end-to-end bandwidth demand cannot be accommodated for all node pairs, distinction based on some class or other criteria is required for acceptance of new capacity demands. It is noted that QoS may have several interpretations. The flat network architecture allows the QoS to be defined on a node-pair basis.

Extending the Programmable Network Coverage

The capacity of a domain is extended using multiwavelength multifiber interconnected parallel rings. The network's reach is realized by multiple intersecting domains. Figure 12 illustrates the case of a central domain interconnecting four partially interconnected side domains. The common nodes, connecting domains, perform the extra task of container merging.
A network of wide coverage, extending to hundreds of nodes, can be arranged in a multilevel structure as illustrated in Fig. 13. A four-level structure with 10 nodes/domain realizes a network of about 10,000 nodes, each of which may accommodate several subtending nodes. The use of lateral rings interconnecting side domains (not shown in Fig. 12 and Fig. 13) reduces the amount of traffic in the central domain and may also reduce the path length. It is noted that the network rings need not be of equal capacity. When two or more paths are available for a node pair, a given connection must follow the same path in order to maintain container order.

The Impact of WDM: Alternate Architectures

Unprecedented traffic growth is providing a huge demand for fiber facilities. Not only are the backbone networks outstripping their original design capacities, but the routes will need fiber cable replacements to allow full buildup of high-density WDM. In addition, as the routes grow there will be further pressures to create better diversity to improve restoration. Currently, WDM is primarily used to increase point-to-point transport capacity. The abundance of transport bandwidth due to WDM may justify a highly meshed topology at wavelength granularity. However, due to the spatial traffic variation, a wide-coverage network may still require tandem switching and capacity-sharing controls, which are now realized electronically. Two-dimensional space-WDM switching nodes may be used to realize, at wavelength granularity, either a partially interconnected network or a fully meshed network. Figure 14 depicts an optical cross-connect configuration, and Fig. 15 depicts a ring configuration, both using space-WDM switching nodes. In either configuration, traffic consolidation and reinsertion at the edges must be done by electronic means to realize arbitrary granularity; that is, the edges perform the add/drop as well as tandem switching. The path capacity may then be defined in multiples of kilobits per second rather than multiples of 2.5 Gb/s (OC-48), for example. It is possible with such configurations to construct a fully-meshed network with fine granularity and realize the benefits of edge control. The control mechanism would differ from that of the ring architecture of Fig. 3.

Internet Services

The programmable-transport nodes can support edge devices implementing a variety of protocols. The need for interworking among the protocols is eliminated since the tandem function is removed. An IP router, for example, can forward its packets directly to their destinations with improved performance. One of the fast-growing services facilitated by the Internet is electronic commerce. The impact of electronic commerce on today's telecommunication network may be profound. Electronic commerce is expected to continue to grow steadily at an accelerated rate, fueling the growth of existing data networks.

Conclusions

The realization of an economical flat network is feasible with dynamic capacity partitioning at a fine granularity level. A flat network is scalable to multi-terabit-per-second capacity and can support multiple traffic types with many specified QoS objectives.
A network would be much easier to control and manage if it were fully meshed with each node having a direct path to each other node. This objective is not economically attractive with coarse granularity. Fortunately, adaptive path capacities with fine granularity -- at multiples of 10 kb/s, for example -- are now feasible using rate controllers. Deploying such controllers enables the realization of an almost fully meshed network with simple controls. A virtual private network is easy to implement in such a scheme.
In a real network, this approach is simpler and less expensive to implement than an ATM-based solution.

References
[1] P. Newman, T. Lyon, and G. Minshall, "Flow-Labelled IP: A Connectionless Approach to ATM," IEEE INFOCOM, 1996, pp. 1251-60.
[2] M. Beshai and E. Münter, "Multi-Tera-bit/s Switch Based on Burst Transfer and Independent Shared Buffers," GLOBECOM, Singapore, Nov. 1995, pp. 1724-30.

Biographies
Eric Livermore received his B.A.Sc. from the University of Toronto in 1966 and has worked in the telecommunications industry for his entire professional career. He is currently a senior manager with the Advanced Network Research Group, Nortel (Northern Telecom). His work has spanned diverse areas ranging from the application of fundamental physics to device design, circuit design, ASIC tool and circuit design, manufacturing, and systems architecture and design. During his career, he has been awarded more than 10 patents. He is a member of the Association of Professional Engineers of Ontario.
Richard P. Skillen [M] received his B.Eng. and M.Eng. from McMaster University. Currently he is vice president, business development, SE Communications with responsibilities for new business assessments and implementations for connectionless multipurpose networks and development of specific applications of these networks for electronic commerce. Prior to joining SE Communications, he was assistant vice president, business development, Nortel. He has held various senior management positions in Nortel, Bell-Northern Research, and Bell Canada. He is a member of the Association of Professional Engineers of Ontario, and has served the IEEE Communications Society in many capacities over the past 20 years.
Maged Beshai [M] received his B.Sc. from Ain Shams University, Cairo, and M.Eng. and Ph.D. from McMaster University, Hamilton, Ontario, all in electrical engineering. He has been with Nortel (Northern Telecom) for 19 years and is currently a senior advisor with the Advanced Network Research Group. His work has included switching systems engineering, network planning, architecture analysis, traffic performance, and network research. He has been awarded four patents, with several pending, and has published several papers on network planning and switching systems architecture. He is a member of the Association of Professional Engineers of Ontario, Canada.
Marek Wernik [M] received his M.Sc. and Ph.D. degrees in electrical engineering from Warsaw Technical University, Poland, in 1973 and 1978, respectively. In 1984 he joined Bell Northern Research to conduct exploratory work on photonic and broadband switching and fiber network planning. Between 1988 and 1994 he was a manager responsible for planning and design of broadband networks and services in BNR's Systems Engineering Division. His contributions included the definition of management and control for ATM networks, working with telecommunications carriers to identify requirements of and deployment plans for broadband services/networks, and Nortel broadband product planning. As director of advanced network planning, he manages a multidisciplinary team responsible for development of new network engineering methodologies for future large-scale and high-performance networks and for establishment of network architecture directions in partnership with Nortel lead customers.