Tony Przygienda and Olivier Vandezande, Juniper Networks
Published: 30 Jun 2022
CTN Issue: June 2022
A note from the editor:
According to Wikipedia, routing is the process of selecting a path for traffic in a network or between or across multiple networks. Routing is no doubt a key enabling technology of the Internet and has been evolving over the last few decades. With the development of Cloud Computing, Edge Computing and Network Virtualization, and with a new generation of data center networks based on IP fabrics, new challenges as well as opportunities for advancing the state of the art in routing technology appear.
In this series of articles about modern routing, we start with an introduction to lay out the history and foundation of routing in IP fabrics, principles, and problems. Later, we'll have articles to introduce two novel routing protocols designed for large-scale data centers: Routing In Fat Trees (RIFT) and Link State Vector Routing (LSVR). This “crash course” should be a joy ride.
Yingzhen Qu, CTN Editor
Principles and Problems of Routing in IP Fabrics
The evolution of modern data centers and access technologies, coupled with shifts in application architecture (e.g., microservices), is bringing new challenges to the network designs of yore. Those challenges lead to an exponential increase in the complexity of network layouts and the related control technologies, ultimately resulting in higher maintenance and operation efforts when relying on traditional techniques.
More specifically, with the ongoing shift from centralized to distributed computation, virtualization, and ultimately cloud computing, most of the current data volume consists of intra-data center traffic, leading to continuously increasing demand for east-west bandwidth capacity. Making things even more difficult are the rising expectations of reliability and predictability of backend services hosted in such high-volume networks. To add insult to injury, the demarcation point between the servers and the data center network is shifting increasingly “inside” the server, owing to the introduction of virtualization helped by techniques like virtual machines and containers. Things do not look better on the other side of the network, either. Close to consumers, i.e., in the metro area, massive bandwidth requirements are raising their heads as well, triggered by content caching and the ever-increasing number of devices, both mobile and fixed. Such devices not only attach to the network passively but also often generate uninterrupted streams of data flowing in the direction of the backends. Everyone needs to stream their doorbell back into the cloud, it seems. Between those two burgeoning poles remains the squeezed backbone, where scaling demands massive upgrades of long links limited by geography, as it ever was.
Hence, in this brave new cloud computing era, the existing control protocols as we know and like them and the network architectures (mainly inherited from the last 20 years of wide-area or global network design) are having a hard time coping with these changes. The well-known network design dimensions identified under the SAMS acronym (Scalable, Available, Manageable and Secure) are straining under pressure.
Over the past decades, attempts to scale networks vertically with bigger chassis, higher port counts per box, and modifications to existing, traditional protocols have resulted in mixed outcomes that addressed the issues only partially while seeing limited implementation. Arbitrary topological designs relying on vertical scaling techniques and technologies like routing protocols have served networking well over the last two decades and allowed the Internet to span the planet. These topologies, though still unavoidable to a large extent in the backbone, prove more and more inadequate in this new world of connected devices that by now vastly outnumber the human population and still procreate at an incredible pace.
Leaving technical arguments aside for a second, it goes without saying that data center and network operators are faced with relentless pressure to bring down capital and operational spending at the same time. Those offering services to third parties or attaching a cost to their services are especially bound to “deliver more for less.”
So, with all those developments over the course of the last decade, both operators and vendors have built up a demand for a radically new approach to the networking challenges in data centers and metro networks. Thankfully, their physical locality allows for radical rethinking of network design and operation.
By now, a new generation of data center networks based on IP fabrics has come to the fore and is still gaining dominance every day. But what is this “magical” IP fabric mentioned so often, the very heart of the new generation of local networking? And what are the associated challenges it brings, along with the opportunities of ubiquitous connectivity where the network becomes the computer?
The Network Fabric
First things first, let us explore the concept of a ‘fabric’ itself. In a modern data center architecture, contrary to a traditional, largely arbitrarily connected mesh of devices, a fabric is made of highly interconnected and regular layers of network devices that can be represented, due to the resulting properties, as a unified logical entity. Unlike traditional multitier architectures generally used in wide area networks, a data center fabric flattens the network architecture, effectively reducing the distance between network endpoints within the data center. Such a design results in extreme bandwidth efficiency and low latency. It also provides multiple redundant communications paths and delivers higher total throughput to the server and the storage nodes connected to it compared to an irregular mesh.
But how are these interconnections made? Well, in a sense we are going forward full speed ahead into the past.
In 1953 at Bell Labs, traditional telephony networks had arrived at a point at which growing hard-wired cross-connects became physically impossible as well as economically infeasible, and relays had inaugurated the era of “flipping cables without human hands”. At that point, a little-known researcher by the name of Charles Clos thought about scaling multi-stage non-blocking circuit-switching networks for telecom usage [CLOS]. His famous three-stage Clos topology (the ingress stage, the middle stage, and the egress stage) is still used in the vast majority of data center network architectures today. The Clos topology provides a non-blocking capability by appropriately dimensioning the middle stage. Among its other advantages is an easy way to scale out horizontally by adding more capacity at the different stages, or vertically by adding more stages, though at the cost of added delay. This topology is the foundation of the spine-and-leaf network architecture that constitutes almost all “IP fabrics” today. When the Clos topology is represented with the ingress stage and the egress stage at the same level, it is in fact the spine-and-leaf topology, where the middle stage is the spine level and the ingress/egress stage is the leaf level. Clos offers additional predictability, as all traffic in a spine-and-leaf network, east-west or north-south, becomes equal. All traffic traverses the same number of hops from ingress point to egress point, which enforces consistent delay and jitter characteristics due to the topology's uniform blocking characteristics. And Clos, being really a “cross-connect of cross-connects”, is also, economically speaking, a very efficient way to provision bandwidth when looking at the cost of cabling. This architecture has in fact been used in most high-end, dense routers for the last 20 years to route traffic internally, so in a sense we are just “externalizing” what was hidden in plain sight as the “fabric module” in large, chassis-based devices.
And well, it’s not easy to beat Bell Labs mathematicians who did not have calculators but had time to solve hard, efficiency-constrained linear programming problems, and as we know, algebra does not change much over time.
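The “appropriate dimensioning” of the middle stage can be made concrete with the classic results for a symmetric three-stage Clos network. The following is a minimal sketch, assuming the conventional notation (not used in this article) where n is the number of inputs per ingress switch and m is the number of middle-stage switches: the network is strict-sense non-blocking when m ≥ 2n − 1 and rearrangeably non-blocking when m ≥ n. The function name is hypothetical.

```python
def clos_blocking_class(n: int, m: int) -> str:
    """Classify a symmetric three-stage Clos network.

    n: inputs per ingress-stage switch (assumed notation)
    m: number of middle-stage switches (assumed notation)
    """
    if n < 1 or m < 1:
        raise ValueError("n and m must be positive")
    if m >= 2 * n - 1:
        # A free middle switch always exists for any new connection.
        return "strict-sense non-blocking"
    if m >= n:
        # A new connection may require rearranging existing ones.
        return "rearrangeably non-blocking"
    return "blocking"


# With 4 inputs per ingress switch, try three middle-stage sizes.
for middle in (3, 4, 7):
    print(f"n=4, m={middle}: {clos_blocking_class(4, middle)}")
```

In packet-switched fabrics the term “non-blocking” is commonly used in the looser sense of fabric-facing capacity matching server-facing capacity, so the circuit-switching thresholds above should be read as background rather than as a design rule for IP fabrics.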
Taking things further, simple three-stage IP fabrics are perfectly suitable for small to medium size data centers or metro networks, but bigger ones require additional considerations.
Initially, the overall fabric scale can be increased by creating multiple smaller spine-and-leaf fabrics (a.k.a. PoDs, where PoD stands for “point of delivery”) and interconnecting them with additional spine-like layer(s), since Clos is inherently a recursive concept. Depending on the scale needed, multiple layers can be added in a tree-like structure where links (and network device capacity) nearer to the top of the tree can be provisioned with more bandwidth (fatter) than links (and network device capacity) further down the hierarchy. Hence the origin of the “fat tree”, a term borrowed from the original “real fat tree” used in supercomputing, which in fact had a trunk with a single supercomputer attached to it.
By dimensioning the links and the network device capacity at each level of the tree accordingly, a design resulting in non-blocking and predictable network paths is possible.
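The dimensioning arithmetic can be sketched in a few lines. This is an illustration under stated assumptions rather than a sizing tool: all switches are assumed to be identical with an even port count (radix), every switch below the top tier splits its ports evenly between downlinks and uplinks, and the function names are hypothetical.

```python
from fractions import Fraction


def oversubscription(down_ports: int, down_gbps: int,
                     up_ports: int, up_gbps: int) -> Fraction:
    """Ratio of server-facing to fabric-facing capacity on one switch.

    A value of 1 means this level is non-blocking; above 1 it is
    oversubscribed.
    """
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


def max_server_ports(radix: int, tiers: int) -> int:
    """Server-facing ports of a 1:1 fabric built from identical switches.

    tiers=2 is a leaf-and-spine (three-stage) fabric, tiers=3 adds one
    more layer on top of the PoDs, and so on.
    """
    if radix % 2 or tiers < 2:
        raise ValueError("need an even radix and at least two tiers")
    return 2 * (radix // 2) ** tiers


# A leaf with 48 x 25G downlinks and 6 x 100G uplinks is 2:1 oversubscribed.
print(oversubscription(48, 25, 6, 100))   # 2
# Radix-64 switches: 2048 server ports with two tiers, 65536 with three.
print(max_server_ports(64, 2), max_server_ports(64, 3))
```

The second function shows why Clos recursion is attractive: each added tier multiplies the reachable server ports by half the radix while every device stays the same size.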
Here is an example of a generic fat tree topology:
For even larger, “hyper-scale” data center IP fabrics, the next step in scalability can be achieved using a so-called “multi-plane” topology, described below [FBOOK].
As shown before, in a single-plane topology a single fat tree interconnects each spine of each pod. With a multi-plane topology, several parallel planes (fat trees) each interconnect a subset of spine nodes in each pod. This option is often dictated by the limited radix of available devices, but it provides hyper-scalability, predictable failure domains, and a minimal blast radius in case of failures. Given that there is no free lunch in networking, it brings with it other important challenges in terms of the necessary IP fabric control plane.
A multi-plane topology that provides only sparser connectivity and "partitioned spines" is depicted in the figure below since it is hard to grasp by intuition or description.
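The difference in connectivity can also be enumerated programmatically. The sketch below is a simplified model, assuming one plane per spine position in a PoD (so plane i contains spine i of every PoD); real designs may group several spines per plane. Node names and function names are hypothetical.

```python
from itertools import product


def single_plane_links(pods: int, spines_per_pod: int, tops: int):
    """Every top-level node connects to every spine of every PoD."""
    return [(f"top{t}", f"pod{p}-spine{s}")
            for t, p, s in product(range(tops), range(pods),
                                   range(spines_per_pod))]


def multi_plane_links(pods: int, spines_per_pod: int, tops_per_plane: int):
    """Plane s has its own top-level nodes and only spine s of each PoD."""
    return [(f"plane{s}-top{t}", f"pod{p}-spine{s}")
            for s, t, p in product(range(spines_per_pod),
                                   range(tops_per_plane), range(pods))]


def ports_per_top(links):
    """Number of downward-facing ports each top-level node needs."""
    counts = {}
    for top, _spine in links:
        counts[top] = counts.get(top, 0) + 1
    return counts


single = single_plane_links(pods=8, spines_per_pod=4, tops=4)
multi = multi_plane_links(pods=8, spines_per_pod=4, tops_per_plane=4)

print(set(ports_per_top(single).values()))        # {32}
print(set(ports_per_top(multi).values()))         # {8}
print(("plane0-top0", "pod0-spine1") in multi)    # False: spines are partitioned
```

With these example numbers both designs use 128 links, but each top-level node in the multi-plane variant needs a quarter of the ports. The price is that a top-level node in one plane has no direct link to spines of another plane, which is the sparser connectivity the control plane has to cope with.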
The IP Fabric
Looking through the lens of control traffic again, the ‘IP’ in the IP fabric is the result of early L2 fabrics being unable to guarantee the stability of the control plane as the fabrics grew in size.
As the name suggests, an IP fabric is based on IPv6 and/or IPv4 forwarding, which implies that the IP fabric is routed and operated at the network layer. The fabric can be IP native or rely on Multiprotocol Label Switching (MPLS) or segment routing (SR) for the forwarding, even though the levels of complexity those technologies introduce start to affect the simplicity desired in IP fabric operation. An important consideration for using these technologies is the fact that bandwidth is so abundant and resilient that no further technologies to protect or traffic engineer streams should be needed. The inherent scalability of layer 3 IP technology brings the usual (and, in addition, Clos-specific) challenges as well.
Considering what IP fabrics are used for and the properties they must provide, as described earlier, the applications operating on such fabrics require a lot of addressing agility and diversity of services supported by the network. Highly volatile, virtualized environments have to deliver complex network services which may imply mobile addressing and segmentation, stretched Ethernet segments for L2 applications, workload mobility, and other network layer abstractions that must be provisioned or decommissioned in a timely manner. Implementing all those things natively in the IP fabric substrate would complicate it significantly and lead to possible loss of its regularity. This opens the door to novel algorithmic approaches in the control plane (described later) that offer substantial savings in operational complexity as well as in the resources necessary to deploy it.
A solution to the challenge of things collapsing into the so often resulting “ball of yarn” is the usual layering, i.e., the IP fabric is kept strictly as a simple “IP forwarding bandwidth substrate”, akin to RAIDs delivering a uniform, cheap and scalable form factor for persistent, failure resilient storage. With such restricted scope, IP fabrics remain simple to deploy and scale and amount to nothing more than a failure resilient layer of connectivity in the physical network, delegating the complexity of dealing with agile, complex network services to an overlay network that runs on top of it. Such an IP fabric, when paired with said overlay, is called the underlay. In a very similar way, BGP [BGP4] in traditional IP is responsible for complex connectivity policies while IGPs like OSPF [OSPF] or IS-IS [ISIS] guarantee the overall reachability inside an autonomous system.
Such a layered design using two network domains is naturally isolated in terms of flexibility and free to use different control protocols and management systems in each layer, resulting in decoupled failure domains. Despite the underlay and the overlay being separated, the overlay still relies on the IP fabric underlay to work. The underlay’s role is simply to ensure a solid forwarding substrate guaranteeing reachability resolution and optimal connection paths by offering symmetrical bi-sectional bandwidth, redundancy to physical infrastructure failures, optimal routing and efficient load-balancing.
As a result, the overlay remains free to build complex services on top of such a solid IP fabric, independently of the physical infrastructure restrictions that irregular meshes often result in. It ends up supplying virtual topologies which tunnel traffic edge-to-edge. In other words, the overlay consists of tunnels interconnecting Top of Rack (ToR) switches, leaf switches or servers attached to the IP fabric. The underlying IP substrate is oblivious as to what passes on top of it in the different encapsulations or tunnels and, in fact, such “ignorance” is a desirable outcome.
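The separation of concerns can be illustrated with a small model of edge-to-edge tunneling. This is a sketch with hypothetical types and names, not a rendering of any particular encapsulation: the edge wraps tenant traffic in an outer header addressed to the remote tunnel endpoint, and a fabric node looks at nothing but that outer destination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantFrame:
    """Overlay traffic as the tenant sees it (opaque to the underlay)."""
    tenant_id: int
    src: str
    dst: str
    payload: bytes


@dataclass(frozen=True)
class UnderlayPacket:
    """What the IP fabric forwards: outer addresses are tunnel endpoints."""
    outer_src: str
    outer_dst: str
    inner: TenantFrame


def encapsulate(frame, local_endpoint, endpoint_of):
    """At the edge (leaf or server): find the remote tunnel endpoint."""
    remote = endpoint_of[(frame.tenant_id, frame.dst)]
    return UnderlayPacket(local_endpoint, remote, frame)


def underlay_next_hops(packet, fib):
    """In the fabric: only the outer destination address is consulted."""
    return fib[packet.outer_dst]


# Overlay state (assumed to be distributed by the overlay control plane).
endpoint_of = {(10, "bb:bb"): "192.0.2.2"}
# Underlay state on a leaf: route to the remote endpoint via two spines.
fib = {"192.0.2.2": ["spine1", "spine2"]}

frame = TenantFrame(10, "aa:aa", "bb:bb", b"hello")
packet = encapsulate(frame, "192.0.2.1", endpoint_of)
print(underlay_next_hops(packet, fib))   # ['spine1', 'spine2']
```

Note that the underlay forwarding table contains only tunnel endpoints, so its size depends on the number of fabric edges and not on the number of tenants or workloads.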
Further, with a spine-and-leaf topology in the overlay, the tendency is for the “network intelligence” to be located at the edges. It is implemented either in leaf devices (such as ToR switches) or in endpoint servers connected to the fabric. The spine devices mainly act as a transit layer for the leaf devices. To synchronize overlay networking state, BGP is used almost exclusively as the control plane, often with additional extensions like Ethernet Virtual Private Network [EVPN] or L3VPN [L3VPN]. When it came to the underlying IP fabric control plane, a significant debate ensued, and a variety of solutions became available. Some of today’s protocols find their roots in the pre-IP fabric era and have evolved to try to meet some of the new needs. Changing the architecture and design of a control protocol to meet the profile necessary for (hyper) scale, cloud computing, etc., is feasible but only up to a certain point. After a certain threshold at which the complexity and architectural assumptions of traditional routing protocols, decorated with an endless number of adaptations, become unmanageable, it can be highly beneficial to start over, rethink, and redesign the underlay control plane from scratch to meet all the new demands.
The IP Fabric Challenges
To outline in more detail additional and sometimes different challenges an IP fabric underlay presents compared to traditional underlay routing, we can use the previously mentioned SAMS paradigm to have a few predictable categories and cluster similar considerations together.
In terms of scalability, the IP fabric underlay routing causes an unusual mix of challenges that seem to hamper both traditional IGPs as well as BGP from doing a good job for different reasons:
- Architectural extension to include servers in the routing control plane pushes large fabrics into the magnitude of half a million devices, a scale no traditional IGP can address due to architectural limitations: it either relies on a flat flooding domain or otherwise provides very suboptimal routing, or even blackholing in cases of link and node failures, due to compulsory topology abstraction and prefix aggregation. BGP can meet the scale prefix-wise, but having every host hold half a million routes for optimal routing, with the accompanying convergence challenges on every topology change, makes such a solution quite a difficult choice in terms of operational complexity. BGP also cannot provide a “link view” of the topology, so error triangulation cannot be tackled by control plane mechanisms, and other solutions that add operational complexity have to be deployed alongside a BGP solution.
- A high degree of ECMP in the fabrics is not handled well by either approach. IGPs rely on Dijkstra, which performs best on sparse meshes, and BGP needs workarounds that directly break the original protocol architecture, such as forcing non-unique AS numbers on different devices or making the peers sending updates dependent on each other, to prevent thrashing of CPU and forwarding silicon during initial convergence or failure events.
- A fabric, being a fat tree, can become quite unbalanced if fat links higher up in the fabric are lost. This can be resolved either by applying back pressure to the traffic and choosing randomization in forwarding, or by the routing protocol adjusting the weighted tables of the next hops. An IGP is ill suited to choose anything but ECMP, due to the nature of the forwarding tables it builds using Dijkstra and its hard insistence on “shortest-path first forwarding”; BGP can provide some limited indication of bandwidth imbalances in a very local radius without really guaranteeing global loop-free behaviour.
- In a broad sense, a protocol for IP fabrics should scale to a high degree in multiple dimensions while consuming minimal resources on nodes lower in the fabric (control plane state minimization), and guarantee as small a blast radius as possible while converging at the fastest possible speeds during topology changes to avoid complex path protection solutions. Overall, those requirements seemingly contradict each other in terms of the architecture of traditional routing technologies and have never been addressed in this combination by standardized routing protocols in the Internet space.
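The option of adjusting next-hop weights after the loss of fat links can be sketched as follows. This is a minimal illustration, assuming each next hop somehow reports the northbound bandwidth still usable behind it; that input and the function name are hypothetical and not defined by any protocol discussed here.

```python
from functools import reduce
from math import gcd


def next_hop_weights(northbound_gbps: dict[str, int]) -> dict[str, int]:
    """Smallest integer weights proportional to usable bandwidth.

    Next hops with no remaining northbound bandwidth are dropped.
    Equal bandwidth everywhere degenerates to plain ECMP (all weights 1).
    """
    usable = {nh: bw for nh, bw in northbound_gbps.items() if bw > 0}
    if not usable:
        return {}
    divisor = reduce(gcd, usable.values())
    return {nh: bw // divisor for nh, bw in usable.items()}


# Healthy fabric: every spine has 400G northbound, so plain ECMP.
print(next_hop_weights({"spine1": 400, "spine2": 400, "spine3": 400}))
# -> {'spine1': 1, 'spine2': 1, 'spine3': 1}

# spine3 lost half of its uplinks, spine4 lost all of them.
print(next_hop_weights({"spine1": 400, "spine2": 400,
                        "spine3": 200, "spine4": 0}))
# -> {'spine1': 2, 'spine2': 2, 'spine3': 1}
```

The arithmetic is the easy part. The sketch says nothing about how the bandwidth information is propagated or how such unequal-cost forwarding stays loop-free across the whole fabric, which is exactly where the bullets above see traditional protocols struggle.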
As it relates to availability, requirements that are somewhat unique to IP fabrics emerge:
- Fast convergence of the IP fabric is crucial to limit negative effects on the complex fabric overlay network. Due to the nature of the ultra-high-speed interfaces used in such fabrics, the volume of traffic lost in case of link or node failure in such a meshed and dense environment is very significant. To minimize the disruption, once the failure is detected, the nodes in the IP fabric must converge quickly by exchanging routing and/or topology information with each other and computing on it at the scale of modern data centers. This information must be consistent, reflecting the current state of the network, and free of routing loops or any other kind of persistent inconsistencies. This process must also provide path alternatives in the fastest possible fashion to suppress traffic black-holing and suboptimal routing on high-capacity links. Absent the scale considerations described earlier, these requirements tend to favour the distributed computation performed by IGPs over the slower diffused computation used by BGP and other distance-vector protocols.
- Fast convergence is also correlated with overall lower stability of the network under entropy events. To use an imperfect analogy, the situation is comparable to sports cars turning much faster than buses do but also spinning out much faster under smaller control inputs. To increase the stability under faster convergence characteristics the blast radius of the generated control information must be contained to a minimum to prevent oscillation of the whole fabric under link or node changes.
- Likewise, changes to the topology (growth, reduction, replacement, etc.) must cause minimum disruption and be easy to perform. IP fabrics are comparable here to RAIDs again: they should provide maximum availability under reconfiguration events and make such events as simple to perform as pulling hot disks out of an array. As a more sophisticated example, any associated cabling issue (mismatch, defect, violation, etc.), which is a quite common operational issue due to the very high link count, must also be managed by the IP fabric control plane and have no impact on the overall substrate when present in the fabric.
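As an illustration of the cabling point in the last bullet, the regularity of a fat tree makes some violations easy to detect once every node knows its level. The sketch below assumes a strict fat tree in which links may only connect adjacent levels (level 0 being the leaves); designs that deliberately allow same-level links would need a different rule. The function name and node names are hypothetical.

```python
def miscabled_links(levels: dict[str, int],
                    links: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return links that do not connect two adjacent fat-tree levels."""
    return [(a, b) for a, b in links if abs(levels[a] - levels[b]) != 1]


levels = {"leaf1": 0, "leaf2": 0, "spine1": 1, "top1": 2}
links = [
    ("leaf1", "spine1"),   # fine: level 0 to level 1
    ("spine1", "top1"),    # fine: level 1 to level 2
    ("leaf1", "top1"),     # skips a level
    ("leaf1", "leaf2"),    # stays within one level
]
print(miscabled_links(levels, links))
# -> [('leaf1', 'top1'), ('leaf1', 'leaf2')]
```

A control plane that performs such a check can refuse to use the offending links for forwarding, so a cabling mistake stays a local event instead of affecting the substrate.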
The next category to consider, manageability, is of the utmost importance in the highly sensitive and dense IP fabric where an apparently harmless event could have disastrous consequences:
- To start with, underlay configuration on any fabric device should be kept to a minimum, and ideally be generic (as in not dependent on the placement in the fabric), with some control plane autonomy. This helps reduce the load on operations teams when, for example, a new device is removed from shrink-wrap and placed in an arbitrary place in the fabric to replace another node that failed. This points relatively quickly to a true zero-touch requirement akin to the properties of Ethernet switching. As a plus, a fully automated construction of fat-tree topologies based on detection and sensing of the links with limited or no configuration, detection of topology mis-cabling, and deployment of fast failover technologies like BFD can make an IP fabric much cheaper from the operational perspective than manually constructed networks.
- Likewise, nodes should be taken out of production quickly and without disruption or complex procedures, something BGP is ill suited for.
- Visibility of the entire IP fabric topology and telemetry information are also important, at least from a single point like an IP fabric node and/or exported to an external machine.
Generally speaking, the lower the operational and deployment complexity achieved by a well-engineered IP fabric underlay routing technology, the more appealing such a form factor becomes, compared to traditional routing and network technologies.
Last, but not least, security requirements:
- Though it would seem quite intuitive that security in IP fabrics is not of that much importance, the opposite is the case. Those fabrics most often carry traffic of high value, especially in overlay, and an attack vector in underlay routing technology could have a significant impact in many dimensions, well beyond simple service unavailability. Hence a good IP fabric underlay is obliged to support many security models, down to the trust extended from a specific switch port to a corresponding port on another switch. Unfortunately, tightening of security/trust is negatively correlated with simplicity of operation, since zero touch plug-and-play technologies rely invariably on universal trust without further verification.
- Interestingly enough, in the security model where all nodes on the fabric have to present a credential, e.g. a key, a common problem is the resulting compromise of the secret. In traditional operations “rolling the fabric over to a new secret” means a careful migration of credentials on a node-by-node basis. An underlay protocol, ideally within the routing control plane, should allow a “one point push” of the new credentials to anyone already on the fabric while withdrawing the current trust.
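The credential roll-over described in the last bullet can be sketched with a key ring that accepts several keys during the transition. This is a minimal illustration using HMAC-SHA256 from the Python standard library; the class, its methods, and the idea of authenticating each control message with a shared key are assumptions made for the example, and the secure distribution of the new key (the “one point push” itself) is out of scope.

```python
import hashlib
import hmac


class FabricKeyRing:
    """Hypothetical key ring: sign with the newest key, accept any trusted key."""

    def __init__(self, initial_key: bytes):
        self._keys = [initial_key]          # newest key is last

    def push(self, new_key: bytes) -> None:
        """Start trusting a new key while older ones remain valid."""
        self._keys.append(new_key)

    def withdraw_old(self) -> None:
        """Stop trusting everything except the newest key."""
        self._keys = self._keys[-1:]

    def sign(self, message: bytes) -> bytes:
        return hmac.new(self._keys[-1], message, hashlib.sha256).digest()

    def verify(self, message: bytes, tag: bytes) -> bool:
        return any(
            hmac.compare_digest(
                hmac.new(key, message, hashlib.sha256).digest(), tag)
            for key in self._keys
        )


node_a = FabricKeyRing(b"old-secret")
node_b = FabricKeyRing(b"old-secret")

node_b.push(b"new-secret")                  # the push reaches node_b first
print(node_b.verify(b"hello", node_a.sign(b"hello")))   # True: old key still trusted

node_a.push(b"new-secret")
node_b.withdraw_old()                       # trust in the old secret is withdrawn
print(node_b.verify(b"hello", node_a.sign(b"hello")))   # True: signed with new key
```

Because every node keeps accepting the old key until it is explicitly withdrawn, the order in which nodes receive the new key does not matter, which removes the need for a carefully sequenced node-by-node migration.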
Ultimately, and after considering new criteria for a new IP fabric underlay along with the classification introduced, several issues are still worth mentioning:
- IP fabrics consist of point-to-point links only. This makes many aspects of the routing protocol design much simpler.
- IP fabrics operate on mixtures of IPv4 and IPv6, including the desire to forward IPv4 over links which have only IPv6 link-local addresses.
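What forwarding IPv4 over IPv6-only links means for a routing table entry can be shown with Python's standard `ipaddress` module. The `Route` type and helper functions are hypothetical; the point is only that the prefix and the next hop belong to different address families, and that a link-local next hop is meaningful only together with an outgoing interface.

```python
import ipaddress
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Route:
    prefix: Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
    next_hop: Union[ipaddress.IPv4Address, ipaddress.IPv6Address]
    interface: str   # disambiguates a link-local next hop


def make_route(prefix: str, next_hop: str, interface: str) -> Route:
    route = Route(ipaddress.ip_network(prefix),
                  ipaddress.ip_address(next_hop), interface)
    if route.next_hop.is_link_local and not interface:
        raise ValueError("a link-local next hop needs an outgoing interface")
    return route


def is_cross_family(route: Route) -> bool:
    """True when the prefix and next hop use different IP versions."""
    return route.prefix.version != route.next_hop.version


route = make_route("10.1.0.0/16", "fe80::1", "eth0")
print(is_cross_family(route), route.next_hop.is_link_local)   # True True
```

With such entries, fabric links need no IPv4 addressing at all, which removes one more item from the per-link configuration.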
This background material has covered many aspects of the current state of the art along with the requirements and challenges IP fabrics currently pose. Further articles will focus on novel approaches to IP fabric routing that address those requirements by different design choices based either on existing protocols or on novel ideas.
- [CLOS] Clos, C., “A study of non-blocking switching networks”, Bell System Technical Journal, 1953.
- [BGP4] Rekhter, Y. et al., “A Border Gateway Protocol 4 (BGP-4)”, RFC 4271, IETF, 2006.
- [OSPF] Moy, J., “OSPF Version 2”, RFC 2328, IETF, 1998.
- [ISIS] ISO/IEC, “Intermediate system to Intermediate system intra-domain routeing information exchange protocol for use in conjunction with the protocol for providing the connectionless-mode Network Service (ISO 8473)”, ISO/IEC 10589, 2002.
- [EVPN] Sajassi, A. et al., “BGP MPLS-Based Ethernet VPN”, RFC 7432, IETF, 2015.
- [L3VPN] Rosen, E. et al., “BGP/MPLS IP Virtual Private Networks (VPNs)”, RFC 4364, IETF, 2006.
- [FBOOK] Andreyev, A., “Introducing data center fabric, the next-generation Facebook data center network”, Facebook, 2014.
Statements and opinions given in a work published by the IEEE or the IEEE Communications Society are the expressions of the author(s). Responsibility for the content of published articles rests upon the author(s), not IEEE nor the IEEE Communications Society.