Copyright 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This article was published in the May 1998 issue of
IEEE Communications Magazine.

CIRULE1.GIF (372 bytes)

Abstract
Multiprotocol label switching (MPLS) is rapidly emerging as an Internet Engineering Task Force (IETF) standard intended to enhance the speed, scalability, and service provisioning capabilities in the Internet. MPLS uses the technique of packet forwarding based on labels, to enable the implementation of a simpler high-performance packet-forwarding engine. This also decouples packet forwarding from routing, facilitating the provision of varied routing services independent of the packet-forwarding paradigm. The authors briefly track the evolution of this technology in relation to other existing technologies. Then an overview of the MPLS architecture and design is provided. In addition, some of the work that was a precursor to MPLS is discussed, as well as related issues and debates.

 

CIRULE2.GIF (100 bytes)


Evolution of Multiprotocol Label Switching

CIRULE3.GIF (212 bytes)

Arun Viswanathan, Bell Laboratories-Lucent Technologies
Nancy Feldman, IBM T. J. Watson Research Center
Zheng Wang, Bell Laboratories-Lucent Technologies
Ross Callon, IronBridge Networks

 

With the advent of the World Wide Web, the Internet has seen enormous growth from its roots as a network of modest proportions, mostly used by the research and academic community, to a large public data network. Several thousands of corporate users and several millions of dial-in residential users have gone online in the last few years, making the Internet a true public data network. This accelerated growth in subscription has led to a surge of data in the Internet backbone. In order to keep up with this demand in service and bandwidth, all Internet service providers (ISPs) have scaled their networks severalfold, in both size and bandwidth. The forecast for this continuing growth is even more astounding [1, 2]. With fast emerging technologies, such as voice over IP and electronic commerce, the need to scale the network beyond present capabilities is paramount. For this the Internet has to scale in several dimensions, including but not limited to bandwidth, routing, quality of service (QoS), and customer service provisioning.
Multiprotocol label switching (MPLS) [3, 4] is not a silver bullet to cure existing or forthcoming problems, but rather an enabling technology which addresses some of these scaling issues. It does this by replacing the standard destination-based hop-by-hop forwarding paradigm with a label-swapping forwarding1 paradigm. This has the benefit of simplifying the packet-forwarding engine, enabling easy scaling to terabit rates. Furthermore, it decouples forwarding from routing, enabling one to apply new specialized or customized routing services without requiring changes in the forwarding path. For the Internet to smoothly scale to gargantuan proportions, as will be needed in the near future, the use of a routing hierarchy to limit the growth of routing tables on the Internet backbone routers is also of paramount importance. With MPLS it is possible to switch data through a routing hierarchy without compromising hierarchical requirements and possibly alleviating core backbone routers from maintaining external routes. Another beneficial aspect of MPLS is that it can associate a wide range of forwarding granularities with a label, ranging from all data destined through an "exit" router to a host-to-host application flow. MPLS has largely evolved from the need to use the high speeds of existing label-swapping technologies, such as asynchronous transfer mode (ATM), with existing router networks, by seamlessly integrating the datagram nature of IP to the label-swapping hardware capability of switches. This approach eliminates several of the problems associated with overlaying an IP network on top of a connection-oriented network. This section discusses some of the existing techniques network operators have used to scale their networks and the limitations therein. We also briefly discuss the evolution of IP switching. The second section outlines the MPLS architecture and design. The third section discusses some of the prior work from which MPLS has derived much of its architecture. The fourth section discuss some of the major design issues that were considered during the initial evolution of MPLS. The fifth section cites some of issues that have not yet been addressed by MPLS in its entirety and can be a topic for future research. This article discusses the MPLS architecture with respect to IP. MPLS will support multiprotocol, and these concepts can easily be extended to other protocols such as IPX and CLNP.

IP over NBMA

Nonbroadcast multi-access (NBMA) networks, such as ATM and frame relay, are connection-oriented networks. That is, data is transferred after a virtual connection (VC) is established between the source and the destination, much like a voice call in the telephony network. These technologies use the label-swapping2 paradigm to forward native data units. Largely due to the label-swapping paradigm, these switches have been scaling extremely well in speed. Usually these switching devices operate an order of magnitude above a typical router in terms of backplane bandwidth capacity. Due to this particular aspect, network operators have successfully deployed NBMA networks in the backbone to scale their networks to a higher bandwidth bracket. There are two standard approaches to achieving this, which are discussed in the following subsections.

The IETF Approach

An NBMA network when viewed as a data link layer is different from a point-to-point or shared media network, in that multiple nodes can be accessed but there is no support for data link layer broadcast. Of the several proposals available, the Internet Engineering Task Force (IETF) chose the classical IP (CIP) model [5] for NBMA. This was called classical because the hosts attached to an ATM network view the attachment as a shared media access forming a logical IP subnet (LIS). The relationship between the hosts is only logical because they share several logically imposed properties, such as a single IP subnet address for all hosts and routers in an LIS, the same LLC/SNAP encapsulation of IP packets [6], the same maximum transmission unit (MTU) between all LIS members, the same routing architecture as in shared media, and finally, an address resolution mechanism to resolve the relationship between IP addresses and ATM addresses of members in an LIS. This is also called the "overlay model." In this, the ATM network employs its own addressing and routing scheme completely independent of the IP network being "overlaid." There can be several LISs overlaid on top of a single ATM network. IP hosts (or IP routers) on an LIS form adjacencies by establishing a permanent VC (PVC) or switched VC (SVC) to each member of the LIS, forming a mesh connectivity between the LIS members. Since the IP address and ATM address of the interface attached to the ATM network have no inherent relationship, the CIP approach provides a address resolution mechanism in case of SVC. An ATM Address Resolution Protocol (ARP) server is employed for address resolution. This is unlike the shared media address resolution because of the nonbroadcast nature of ATM networks. Members in an LIS register themselves with the LIS's ATM ARP server. They can obtain the IP address of other members in the LIS from the server, and request the server to resolve the ATM address for an IP address. Once the ATM address is obtained, an SVC connection can be established to that member. In the PVC case, there is no need for an ATM ARP server because the IP address and PVC mapping can be configured when the PVCs are defined. Alternatively, the members can use the inverse ATM ARP mechanism to resolve the mapping between PVCs and IP addresses.
Packets between members of the same LIS are directly switched through the ATM infrastructure. However, when packets have to be forwarded between hosts in two different LIS, they must be IP forwarded through a router that is a member in both LISs, even when both LISs are overlaid on the same ATM network. The packet may traverse several such routers before it reaches the destination host on the same ATM network.
To overcome this undesirable behavior, the IETF extended the standards to incorporate another mechanism called the Next Hop Resolution Protocol (NHRP) [7]. NHRP is basically an extension of the ATM ARP mechanism, except that the address resolution is done across LISs. NHRP is based on a client/server model, in which the client (NHC), a host or router, makes a request for address resolution of a destination IP address to the nearest configured NHRP server (NHS). This request is forwarded along the normal hop-by-hop IP forwarding path toward the IP destination address whose ATM address resolution is requested. An NHS along the path that has the address resolution information responds with an ATM address to the requesting NHC. If the destination node is directly attached to the ATM network, the resolved ATM address is that of the destination node. If the destination node is not directly attached to the ATM network, the resolved ATM address is that of the egress router on the ATM network which is "nearest" to the destination node. Once the NHC has the ATM address, it opens a direct SVC connection to the ATM address and forwards packets through the connection toward the destination IP address. The NHRP specification precludes the NHC and the far-end node whose ATM is supplied from being routers. This is because NHRP does not provide a mechanism to indicate the NHC when a route change happens after the resolution is made, and in some situations can lead to a stable routing loop.
Multicast over ATM is also built on the LIS concept. A multicast address resolution server (MARS) [8] is used for this purpose. The MARS replaces the IGMP [9] facility in the shared media environment. The MARS serves as a registry of multicast group memberships and maintains the association between the multicast group address and the ATM addresses of the members in a multicast group. A sender can request the MARS to resolve a multicast group address to a list of ATM addresses that identifies the group members. The sender then opens a point-to-multipoint VC to the enlisted members. If the number of senders is very large, a mesh of pont-to-multipoint VCs may quickly exhaust the resources in the ATM network. To circumvent this problem, MARS provides another mechanism where the sender opens a single VC to a multicast server (not the MARS), which in turn maintains a point-to-multipoint VC to the receiving members in a multicast group.

The ATM Forum Approach

The ATM Forum also took the "overlay model" approach. Nonetheless, the ATM Forum approach is different in many ways from that of the IETF. It is geared more toward campus and enterprise network environments. The initial ATM Forum work is called LAN Emulation (LANE) [10]. In this scheme, the host interface attached to the ATM network "appears" like a traditional shared media interface to the upper layers. In this way the upper layers require no change and can continue to operate in the traditional manner. The main motivation for the ATM Forum considering this backward compatibility was to make ATM attractive to the campus and enterprise community which had a large installed base of legacy equipment and software.
LANE employs three servers: a LAN configuration server (LECS), a LAN emulation server (LES), and a broadcast and unknown server (BUS). Each LAN emulation client (LEC), such as a NIC card in a host, is either configured with the ATM address of the LECS or provided with an ATM address of the LECS by the integrated layer management interface (ILMI), or uses an anycast ATM address to create a connection to the LECS. Alternatively, the LEC may be configured with a PVC to the LECS. LEC then obtains the ATM address of the LES from LECS and registers by creating a connection to LES. A LEC to resolve an address of a destination host sends a LANE address resolution protocol (LE_ARP) message to the LES to obtain the ATM address of the destination client. A LEC obtains the ATM address of the BUS by issuing an LE_ARP for the all ones MAC address and opens a direct VC to the BUS. Now if LEC A has a MAC frame to be delivered to LEC B, it issues an LE_ARP to the LES with the MAC address of B to obtain the ATM address of LEC B. Since LEC B must have registered with LES, LEC A receives the required address from the LES, on which it opens a direct data connection to LEC B. If the LES cannot resolve the MAC address of LEC B, LEC A needs to send the frame to the BUS and the BUS will broadcast the message to all the clients. The LES at times cannot resolve LE_ARP request if the destination client is behind a proxy LEC (e.g., a bridge); in such cases the MAC address is learned when there is return traffic from LEC B. Similar to the CIP, hosts on different emulated LANs have to communicate via a router.
LANE version 2 adds logical link control (LLC) multiplexing for VC sharing, available bit rate (ABR) and other QoS support, enhanced multicast support, and multiprotocol over ATM (MPOA) support. The multicast enhancement provides support for a separate path from the broadcast path so that multicast traffic is sent to those member that need to receive the multicast frames.
Again, to overcome the requirement that a router forwards packets at the network layer between hosts in different emulated LANs but in the same ATM infrastructure, the ATM Forum also extended the standard to include MPOA [11], a combination of LANE and NHRP. With this combination, it is able to provide the bridging benefits of LANE and NHRP shortcuts. MPOA finds the optimal path to a destination host using both bridging and routing information. In addition to this, MPOA provides support for a "virtual router." A virtual router can be viewed as a geographically dispersed router with a "route control module" that runs routing protocols to compute and maintain routing information, and an "interface module" that caches routes by requesting the route control module. The data is forwarded from an ingress interface module to an egress interface module using the ATM infrastructure.

Limitations of These Approaches

All of the IP-over-ATM approaches have several scaling issues. The approaches described above use a complete or partial mesh of VCs to connect host members in a logical subnet. This implies that at maximum n*(n – 1) connections are required to connect n members in a logical subnet. This can consume a lot of connections for large memberships, increasing the overhead to set up, maintain, and tear down these connections. Besides the resource factor, each member in a logical subnet has n – 1 routing peers. This drastically increases the computational overhead for the routing protocols because the computational overhead is directly proportional to the number of links (physical or virtual) in a network.
In both CIP and LANE, the data must be forwarded between logical subnets, which does not fully exploit the switching infrastructure. Multicast is also based on logical subnets, and must be IP forwarded between subnets. Although NHRP and MPOA provide mechanisms to shortcut the VC across multiple subnets for unicast data, the mechanism is traffic-driven and thus has associated setup, maintenance, and teardown overheads besides the caching latency issues.
The salient selling point of switches, besides their low prices and fast speeds, was QoS. However, none of these approaches exploit that potential. The connections used in these schemes are still equivalent to best-effort. The protocol and signaling mechanisms only increase the complexity of configuration, maintenance, and operations of an overlay network. Thus, the benefits of switching were available at an unreasonable cost.

The Emergence of IP Switching

One motivation for some of the initial IP switching proposals was the desire to use label-switching devices for their speeds, yet without the added complexity and scaling issues detailed above. Another motivating factor was to use well-known, extensively deployed and exercised, proven standard protocols such as OSPF and BGP. That is, in IP switching the high-speed switches participate in IP routing and become an integral part of the routing domain, while being seamlessly intertwined with routers. This approach eliminates the n-squared peering problem as well as the n-squared mesh connectivity problem. Additionally, QoS can be natively exploited directly by the IP QoS protocols, and multicast traffic can be switched end to end with native multicast support.
These are not the only motivating factors for the evolution of IP switching. Traffic engineering through explicit routing (e.g., source routing), switching through routing hierarchy, virtual private network support, and switching all traffic in a network with minimal connections are some of the other objectives. This is discussed in detail in the following section.
IP switching on ATM hardware, can coexist with ATM protocol stack. An ISP can use the same physical topology to overlay several logical networks.

Architecture

MPLS WG Requirements

The purpose of the MPLS Working Group is to standardize a base technology that integrates the label-swapping paradigm with network-layer routing. Its objectives include standardizing a set of protocols for the distribution and maintenance of labels with existing unicast, multicast, QoS, and explicit routing, as well as defining the procedures for employing this technology over various link-level technologies. The goal is to improve network-layer scalability, traffic engineering capabilities, and price/performance.

Label Distribution and Forwarding

Label switch routers (LSR) in MPLS use link-level forwarding to provide a simple and fast packet-forwarding capability. Normal network layer forwarding requires parsing a relatively large header and performing a longest match algorithm in order to forward packets. However, label-swapping packet forwarding is based on a simple short-label exact match. This results in a simpler forwarding paradigm.
Network-layer routing maintains information from common routing protocols (such as OSPF and BGP) to determine how packets ought to be routed. This routing information partitions the entire forwarding space into forwarding equivalency classes (FECs). A set of packets following the same path, belonging to the same FEC, is also referred to as a "stream" and forwarded in a similar manner. In turn, each FEC is assigned a short, fixed-length, locally significant identifier known as a "label." A packet is "labeled" by either encoding a label in an available location in the data link layer or network-layer header, or encapsulating the packet with a header specifically for this purpose. As a packet enters an MPLS network, a conventional layer-three lookup is performed; however, in addition to the conventional next hop, the associated FEC with the assigned label is found. The packet is forwarded to its next hop with the assigned label. At subsequent nodes, the label is used as an index into a table which specifies the new outgoing label and next hop. The old label is replaced with the new, and the packet is forwarded to the next hop. This eliminates the need for network-layer lookups at all but the first node in the path.
Information from the routing protocols is used to assign and distribute labels to MPLS peers. In general, an MPLS node receives an "outgoing" label mapping from the peer that is the next hop for a stream, and allocates and distributes "incoming" labels to upstream peers for a given stream. The labels are extended into a switched path through the network as each MPLS node "splices" the incoming to outgoing labels. This series of one or more concatenated labels is termed a label switched path (LSP).
The label distribution for the unicast is done via the Label Distribution Protocol (LDP). MPLS neighbors create an LDP peering session for the exchange of LDP's distribution and withdrawal procedures. LDP supports two styles of label distribution: independent or ordered. In the case of independent label distribution, each node may at any time distribute labels for each stream it recognizes. In the case of ordered control, label distribution for a stream is initiated by the egress node for that stream. An LSR is considered an egress LSR for a stream if its next hop for that stream is not an LSR or the node is located at a routing boundary. Any given node may only distribute the incoming labels for each stream it recognizes if it is the egress, or if it has an outgoing label for the given stream. This method ensures that label-to-stream mappings are consistent throughout a network with regard to granularity, and provides a higher probability that unlabeled packets are not forwarded to downstream nodes.
Label allocation within MPLS is performed by downstream peers, where downstream is defined with respect to routing. There are two forms of label allocation: downstream and downstream-on-demand. In downstream allocation the label assignments are made by the downstream node and distributed to the neighboring LSRs. In downstream-on-demand allocation an upstream LSR specifically requests a label assignment for a stream from a downstream node. Downstream-on-demand allocation is useful in ATM networks where merging of LSPs is not possible.
In both independent and ordered distribution, labels can be distributed in liberal or conservative mode. In the liberal mode, an LSR R assigns labels for all neighbors for a FEC. The neighbors maintain the label from Rd even when Rd is not the downstream next-hop for that FEC. This way an LSR can immediately start using a predistributed label when a next hop changes. In the conservative mode, labels are maintained only by those neighbors of Rd that are using Rd as the downstream next hop for that FEC. This scheme is useful in scenarios when the LSR has limited label space. The liberal and conservative allocation mode are interoperable, and the mode of operation is exchanged by the LDP peers when initializing an LDP peer session

Label Granularity

A key component of MPLS is that one or multiple flows may be assigned to the same label (or LSP). Thus, a stream can range from fine to coarse granularity. The choice of label granularity balances the need to share the same label among many destinations with the need to maximize switching benefits while preserving resources. Some common granularities are:
IP Prefix -- This type results in an IP destination prefix in a router's forwarding table sustaining its own LSP. One advantage of this granularity when used in conjunction with the liberal mode is that the label assignments have to be made only once, either at the peer initialization phase or when an address prefix is learned. This reduces the LDP messaging to a minimum. Use of this type of granularity may not scale in large networks with LSRs that have very limited label space.
Egress Router -- This type results in all IP destinations that share a common egress router sharing the same LSP. This information may be extracted from the BGP next hop in a BGP update message, from the OSPF router ID in the OSPF advertisement, or learned directly via MPLS label distribution. It represents the coarsest granularity available.
Application Flow -- This type results in the finest granularity, since the application flow sustains its own switched path. This is the least scalable of all granularity types. The advantage of this is that it provides end-to-end switching. It is most suited to campus and enterprise environments. This may not scale in an ISP core backbone network because it requires maintenance of states for application flows.

Generic Encapsulation (Shim Header)

Figure 1 shows a shim header. MPLS augments traditional routers with the capability to "label" packets [12]. The label is carried after the data link layer header. For example, in the case of point-to-point networks the label can be carried after the PPP header. The definition of MPLS label encapsulation is given in Fig. 1.
The encapsulation contains a 20-bit label. Each labeled packet can have a class of service (CoS) similar to the IP type of service (ToS). An 8-bit time to live (TTL) value is provided to mimic the TTL support in IP headers. A single stack bit indicates if the label is the last in a stack of label encapsulation. This allows for multiple label encapsulations to exist between the data link layer and IP headers. The purpose of label stacking is the topic of the next subsection.

Routing Hierarchy and Label Stacking

The purpose of label stacking is to provide a capability comparable to IP-in-IP tunneling or loose source routing. Stacking essentially maintains identity of several streams when they are aggregated into a single LSP. This enables to switch packets at the de-aggregation point. For the ease of visualization, this concept can be likened to be a generalization of the VP tunnels in ATM, the main difference being that the ATM has only dual hierarchy, where as MPLS provides multilevel hierarchy.
Consider a network that uses an aggregated LSP to an exit node E (Fig. 2). Ingress nodes A and B send packets on LSP terminating at E. Now E must IP forward the packets destined to different networks shown in the figure. Now if E supplies A and B with a specific label for each of the network behind it (label L2 for destination 128.2), ingress nodes A and B can stack the appropriate label before sending packets on the LSP terminating at E. Thus, packets originating at ingress node A have two label encapsulation; the top label is that of the LSP that terminates at E (L1) and the next encapsulation has the label supplied by E for destination prefix in the packet header (L2 for destination 128.2). The packet is switched based on the top level encapsulation (L1). When E receives such packets from LSP terminating at E it can pop the top label (L1) and switch the packet based on the next label in the stack (L2). Note that the label (L1) can also be popped by the penultimate router. Thus, stacking avoids the need to IP forward the packet at E. For example, such FEC-specific stack labels can be piggybacked in BGP. Ingress LSRs A and B learn about the stack label along with route updates through their BGP peering with E.
LDP supports two types of peering for exchanging stack labels: explicit and implicit. Explicit peering is similar to MPLS neighbor peering for LDP, the only difference being that the peering is between remote LSRs.
In explicit peering, LDP connections are set up between remote LDP peers exactly as between local LDP peers. This is most useful when the number of remote LDP peers is small or the number of stack label mappings large.
Implicit peering does not have an LDP connection to a remote LDP peer. Instead, the stack labels are piggybacked onto the LDP messaging when the lower-level LSP is set up between the implicit peering LSRs. The intermediate LDP peers for the lower-level LSP propagate the stack labels as attributes of the lower-level labels. This way, the ingress nodes of the lower-level LSP receive the stack label from the egress LSR. The advantage of this peering scheme is that it does not require the n-square peering mesh as in explicit peering, especially when the number of remote peers is very large. However, this requires that the intermediate LSR maintain the label stack information even when they are not using it.

MPLS and ATM

The applicability of MPLS on an ATM LSR has some major differences from MPLS on a shim header LSR. This is due to the fact that ATM inherently differs from other switching technology because of its fixed-cell nature.
Merging -- The switched paths in an MPLS network take the form of a multipoint-to-point tree. A tree results because of the "merging" of switched paths that occurs at a node when multiple upstream switched paths for a given stream are spliced to a single downstream switched path for the stream. An MPLS node that is capable of merging is capable of creating O(n) switched paths which provide network reachability to all n destinations. The meaning of n depends on the granularity of the switched paths. If n is the number of prefixes existing in a router forwarding table, MPLS will scale the same as today's routing. However, the value of n may be reduced by the use of aggregation. If n represents the number of paths to each egress node, network scalability is greatly improved, since the number of switched paths in the network is reduced with no loss in switching performance.
Unfortunately, in the case of ATM optimal merging is not always possible, since much of the current ATM hardware is not capable of reassembling cells from multiple inbound VCs without the problem of cell interleaving. One solution to this problem is to use virtual paths (VPs) to merge streams rather than VCs. The use of VP merging creates trees of VPs within a network. Cell interleaving is prevented by the assignment of unique VC identifiers (VCIs) within each VP. Each node that is assigned a unique VCI may safely inject traffic using that VCI onto a VP, since it is guaranteed that no other node will use the same VCI value. Thus, a node performing merging can safely merge upstream VP flows for a stream to a single downstream VP without collisions.
However, VP merging uses VCs much less efficiently than a VC merging implementation, since one entire VP is used per stream. Although this variation uses more VC space, the work involved in establishing and maintaining switched paths is still O(n).
Loop Prevention and Detection -- The routing protocols used in an MPLS network in many cases are based on distributed computation. As such, the routing protocols will routinely form transient routing loops while routing convergence is taking place. Because MPLS forms switched paths based on routing information, if the routing information is in a loop, the MPLS switched path will also form a loop unless preventive measures are taken.
In the case of standard IP routing, the damage from routing loops is mitigated by the use of the TTL field within the packet header. This field is decremented by one at each forwarding hop. If the value is decremented to zero, the packet is discarded. However, MPLS packets that are forwarded on ATM labels have no such mechanism, since the ATM header does not have a TTL field. This can cause severe damage to a network because traffic in an ATM loop remains in that loop for as long as the switched path exists. At best, the traffic in the switched path loop steals bandwidth from other unspecified bit rate (UBR) switched paths; at worst, the traffic interferes with IP routing traffic, slows down routing convergence, and lengthens the life of the switched path loop.
There are two methods for handling loops: detection and prevention. In the case of detection, a loop may be formed, but the MPLS mapping distribution message will notice the loop and terminate and/or ignore the switched path. In the case of prevention, mechanisms are provided to prevent a switched path loop from ever being created.
Loop detection is achieved by the use of a path vector field within the MPLS distribution message. This path vector contains the unique identifier of each node that has forwarded a mapping message for a stream. Upon receipt of this field, an MPLS node sees if its own unique identifier is already in the vector; if so, the switched path for the stream is in a loop. The MPLS node is responsible for terminating the path to the stream if it exists, and rejecting the mapping message. If no loop is found, the node is responsible for appending itself to the path vector before forwarding the mapping message.
Loop prevention is achieved by the use of the path vector field in conjunction with a diffusion algorithm. In this, an MPLS node which detects its next hop for a given stream has changed does use the new downstream label until it has performed a diffusion computation to verify that splicing the upstream labels to the new downstream label will not form a looping path. This diffusion computation prunes all upstream nodes that would be on a looping path if the new downstream switched path were used. The MPLS node sends a query message with a path vector to its upstream nodes. Each upstream node that finds itself in the path vector, and thus in a loop, prunes itself off the tree, and sends a response to the originating node. Each node that does not find itself appends its unique identifier to the path vector, and forwards the query message to its upstream nodes. This pattern continues on each upstream path until pruning occurs, or the ingress node is reached and sends a response. The originating node may replace the old downstream path with the new one once it has received a response from each of its upstream nodes.
Idiosyncrasies of Non-Merge ATM -- Traditionally, ATM switches have neither VC-merge capability nor a large VP space for VP-merge. However, there is a large installed base of such switches; therefore, MPLS has explicit support for non-merge switches. Both distribution schemes, independent and ordered, cannot be directly applied to these switches because a node in a network has no knowledge of how many LSRs are upstream to it which require a label. Note that traffic from each ingress LSR for a particular FEC must be carried in its own switched path since merging is not possible. However, any scheme must support an optional loop prevention scheme to prevent and detect loops (or specifically spirals, because paths are never merged in this environment) in such a network environment. Another difference in ATM MPLS is label stacking. There is no mechanism in ATM to stack multiple labels to achieve capabilities similar to those described earlier. The VP switching can be viewed as a stacking mechanism, but is very limited in its use. Moreover, it may not be possible to have VP-based LSP through carrier ATM networks. The only other mechanisms to carry the label stack is after ATM adaptation layer 5 (AAL5) encapsulation.

Multicast

To support multicast, a point-to-multipoint LSP must be established along the multicast tree. There are a number of ways to achieve this: piggybacking on multicast protocol control messages, using LPD or a separate protocol. When multicast trees are established with control messages, the labels may be distributed with piggybacking. For example, labels can be included in PIM-SM Join messages so that the LSPs are constructed along the path toward the rendezvous point (RP) [13, 14]. However, this approach requires extensions to every multicast protocol. Therefore, it is also desirable to be able to use LDP to set multicast LSP. Also, some multicast protocols do not have explicit join messages. For example, in PIM-DM or DVMRP, the creation of a multicast tree is triggered by data packets rather than control messages. In this case, the downstream LSR may establish the LSP toward the source of the traffic via LDP. Label distribution over multi-access LANs is less straightforward. On point-to-point links, a switch can use the entire label space. However, over a multi-access LAN, it is necessary to partition the label space among LSRs connected to the LAN [15]. Otherwise, it would be impossible for the downstream LSR to figure out which LSR is the upstream for a packet. When an interface of a switch connected to a LAN is enabled, it should obtain a unique range of labels for use over the LAN. The first switch joins a multicast group selects the label for the group. All other downstream switches that send Join messages should use the same label.

Explicit Routing

With normal datagram forwarding the overhead of carrying a complete explicit route with each packet is prohibitive. MPLS makes explicit routing practical by allowing the explicit route to be carried only at the time the label switched path is set up. This in turn makes possible a number of advanced routing features which depend on explicit routing.
For example, explicit routing is potentially useful for traffic engineering. Traffic engineering refers to the process of selecting the paths chosen by data traffic in order to balance the traffic load on the links in the network.
Traffic engineering is commonly done today in IP-over-ATM networks via manual configuration of PVCs. Traffic engineering is difficult to accomplish with datagram routing. Some degree of load balancing can be obtained by adjusting the metrics associated with network links, but a greater degree of control over network loading can be accomplished by explicitly routing traffic flows.
There are some issues which need to be carefully considered in the design of any network using explicit routing. For example, in most cases only part of a packet's path will be determined by explicit routing. In order to avoid stable loops in the data path, it is necessary to ensure that routing over the part of the path which is explicitly routed is consistent with the part of the path that is hop-by-hop routed. This also implies that if part of an explicitly routed LSP fails, the part of the LSP from the ingress to the failure point cannot safely be used without introducing the possibility of traffic looping.

RSVP Flows

Recognition of RSVP [16] flows is a relatively difficult processing task for routers to perform at very high bandwidths. MPLS allows a straightforward method to simplify the forwarding of RSVP flows. MPLS therefore presents a straightforward method to improve the efficiency of RSVP.
In RSVP senders transmit a PATH message toward the receivers, and receivers transmit a RESV message to reserve resources in the network. The RESV message traces its way back toward the sender using the state information created by the PATH message. For a multicast session the RESV messages are merged as they traverse toward the senders. The MPLS extension to support RESV in an MPLS network is to piggyback label distribution in the RESV message [17]. The motivation to piggyback distribution over RSVP is to synchronize the label assignment with the QoS setup. Moreover, label distribution via the RESV message requires downstream label assignment that matches the LDP label assignment. It is simpler if label assignments are done by a single neighbor (on a direction) because otherwise the label space has to be partitioned between the neighbors or the label assignment negotiated between the neighbors.

Prior Work

A number of proposals for label switching have been put forward in recent years. Although all the proposals are based on the same paradigm of label swapping and integration with network-layer routing, there are some fundamental differences. The approaches can be broadly categorized into control-driven and data-driven. In the control-driven approach the LSP setup is either piggybacked onto an existing control protocol or closely follows the control information via an independent protocol (triggered by the control information). In the traffic-driven approach the LSP setup is triggered by data. In this section we outline some of the related work on label switching from which MPLS has derived much of its current architecture.

Tag Switching

Tag switching is based on the control-driven approach [18]. In tag switching the setup of LSPs closely follows control messages such as routing updates and RSVP messages. For the unicast case, the tag (label) bindings are carried by the tag distribution protocol (TDP). TDP peering between neighbors is established via a TCP connection between the peers. Tag bindings corresponding to information in the forwarding database (FECs) are distributed to the neighbors via TDP. All tag switch router (TSR) distribution of higher-level tags (for tag stacking) is done by remote TDP peering. However, in case of BGP the higher-level tags are piggybacked in a BGP extension. There is no generic multicast tag distribution support. Tag switching piggybacks binding information in PIM control messages for supporting multicast in a PIM environment. Tag switching also provides an RSVP extension to piggyback tag binding information for RSVP flows. The support for explicit routing is via RSVP by extending RSVP to carry a new source route object.

Aggregate Route-Based IP Switching

Aggregate route-based IP switching (ARIS) is also based on the control-driven approach [19, 20]. Architecturally, ARIS is very similar to tag switching. ARIS introduces the concept of an "egress identifier" (FECs) to express the granularity of a LSPs. The LSP setup for an FEC starts at the "egress" node for the FEC. A node is considered an egress for an FEC if the next hop is a non-ARIS neighbor or the node is at a routing boundary. ARIS provides loop prevention and detection capability that can be useful in ATM type networks. ARIS also provides hop count on LSPs for TTL decrement at the ingress node to support scoping utilities such as traceroute. Multicast support is integral to ARIS and extends loop prevention and detection capabilities to multicast LSPs. Explicit routing support is also integral to the protocol. The only other distribution mechanism used is RSVP for RSVP application flows. For ATM networks, ARIS supports merging via VC-merge and VP-merge, and can operate in non-merge ATM networks.

IP Navigator

IP Navigator [21] is again a control-driven protocol. IP Navigator is based on the use of OSPF as the internal routing protocol used within a routing domain. Explicit routing is used for setting up the VCs. This allows a simple method to prevent loops, and is easy to do because of the assumption of a link state routing protocol (OSPF) as the IGP of choice.
IP Navigator makes an ATM or frame relay cloud look like an IP network to the edge routers. Inside the cloud, a VC signaling protocol is employed to set to multipoint-to-point and point-to-multipoint VCs. When a packet arrives at the edge of the cloud, the ingress router performs a route lookup to determine the egress, and then forwards the packet along the pre-established VC to the destination switch.

IFMP -- IP Switching

Ipsilon Flow Management Protocol (IFMP) [22] is a traffic-driven protocol. When a flow first starts to transmit data, IP packets are sent to the controller of the IP switch and are forwarded with normal destination-based IP forwarding. When the number of packets from a flow exceeds a predetermined threshold, the controller uses IFMP to set up an LSP for the particular flow. Once the LSP is established, the packets from the flow start to follow the shortcut LSPs rather than going through the controller. The LSPs are periodically refreshed to ensure that they are consistent with the underlying forwarding tables. If an LSP is idle for a period of time, it is automatically timed out and deleted.

Cell Switch Router (CSR)

The CSR proposal is similar to IP switching in many respects [23]. CSR is primarily designed as a device for interconnecting ATM clouds. Within an LIS, ATM Forum standards are used to connection hosts and switched together. Multiple LISs are then interconnected with CSRs that are capable of running both IP forwarding and cell forwarding. Based on the port number of the packets, CSRs may choose to set up LSPs for long-lasting flows. The setup of LSPs is data-driven for best-effort traffic and RSVP-driven for flows that require resource reservation. Once an LSP is set up, the packets will follow the short-cut paths and bypass IP level forwarding in the CSRs.

Issues

Faster Forwarding

Is label forwarding faster than longest prefix match lookup forwarding? It is a fact today that several routers are able to do traditional packet forwarding at full line rate with total packet throughput comparable to the effective packet throughput of switches. The argument of speed is somewhat misplaced in that sense. What MPLS provides is a simpler mechanism for forwarding which may improve on the price/performance and time-to-market capabilities. Moreover, it decouples forwarding from routing. This enables the provision of varied routing services without requiring changes in the forwarding paradigm. Some of these routing services may be harder to implement with a hop-by-hop destination-based forwarding paradigm.

Loop Prevention and Detection

The MPLS WG requirement states that the solution "defined by MPLS must provide protocol mechanism(s) to either prevent the formation of loops and/or contain the amount of (networking) resources that could be consumed due to the presence of such loops." Discussion of this covers the entire spectrum, ranging from loop prevention as essential and mandatory for implementation to loops as not being a problem and an issue that should be ignored because it is largely a routing issue. However, the rough consensus of the WG is to support some sort of loop prevention/detection scheme within the MPLS solution. The current diffusion scheme is favored because the node that identifies a change in the next hop continues to use the old next hop until the changeover to the new next hop is validated via a diffusion process. Thus, the packets are switched using the old next hop and not blackholed until a new next hop is validated to not form a looped LSP.
Similar to the diffusion scheme, the WG is still considering other schemes that may be more scalable. The concept is based on using a monotonically ordered attribute (e.g., the hop count) to identify and prevent forming looping LSPs.

Independent vs. Ordered Distribution

The independent distribution method has the drawback that an upstream label can be advertised before a downstream label is received, resulting in unlabeled packets being sent to the downstream node. Furthermore, since each node independently assigns labels, it is possible for different MPLS nodes to make inconsistent decisions with regard to the granularity of the streams for which it is assigning labels. While it has been argued that the ordered distribution may not be robust because if a node has to wait for another node upon which to act before initiating its own setup procedure, this can potentially lead to deadlock or a delayed connection setup. However, ordered distribution provides features independent distribution cannot provide, such as loop prevention/detection, class of service, and LSP policy. For these reasons, the architecture supports both distribution mechanisms since they are completely interoperable and can provide benefits of both distribution paradigms.

VC-Merging

ARIS was one of the first to propose VC-merging to solve the O(n2) mesh problem. VC-merge requires that ATM switches serialize the packets incoming on different VCs and outbound on a single VC. This requires that ATM switches must buffer cells to serialize packets. Contrary to initial belief it has been shown that this requirement is fairly reasonable to achieve in ATM switches [24]. The ATM Forum is considering proposals to include the VC-merge capability in PNNI.

Control Driven vs. Data Driven

In a control-driven approach, the setup of switched paths is initiated by L3 control protocols such as a routing update and resource reservation request. There are two ways of implementing this. One way is to piggyback on the control protocols, but this involves extensions to the control protocols. For example, one can extend the current routing protocols so that labels are distributed with the routing updates and switched paths are set up along the forwarding tables. Alternatively, the switched path can also be set up by the label distribution protocol by following the control protocol operation. An example is the ARIS proposal which sets up sink trees along the forwarding paths.
Since the setup is decoupled with data packets, there is usually little or no delay before data flows can use the switched paths. The control-driven approach also allows one to add more information such as label granularity by extensions to control protocols or the fields in the label distribution protocol. Refreshing the state of switched paths can be done in a timely fashion to synchronize with routing changes. However, the downside of the control-driven approach is that switched paths are confined to a control domain. For example, routing-based schemes can only establish switched paths with a routing domain. It requires other mechanisms such as a label header and label stack to cross the control domain borders.
By its nature, presetup of all possible switched paths always produces the worst-case label requirements. In large networks it may lead to scalability problems, Furthermore, the mechanisms used in a control-driven approach are inevitably specific to the control protocols on which the mechanisms are based.
In the data-driven approach, the setup is triggered by data packets. In Ipsilon's scheme, for example, the setup of a switched path starts when a switch detects that a flow is long-lasting (by either matching a well-known port number or waiting for a fixed number of packets to go through).
A data-driven approach can set up switched paths across control domains (such as routing domains) and even end-to-end if required. The setup is inherently done on an on-demand basis, so it may be able to take advantage of locality of the traffic. It is more generic in the sense that the setup mechanism does not have to understand how the forwarding paths are constructed by the control protocols, and it can simply set up switched paths by following the data packets.
However, the disadvantage of such an approach is that there may be setup delay before the flows can use the switched paths. The switched paths have to be refreshed periodically in order to adapt to route changes, and there may be some delay between a route change and the next refreshing point. Another problem is that it is difficult to convey extra information without changing the data packet header. For example, in Ipsilon's scheme it is not possible to specify explicitly which flow type (which decides the flow granularity) to use for classification.

RSVP and MPLS

It has been proposed that RSVP and MPLS be combined with explicit routing, and that this be used for traffic engineering. There are, however, a couple of problems with this approach.
Normal RSVP operation calls for the normal best-effort route to be used when the RSVP path fails for any reason. The combination of some traffic following carefully prescribed routes while other traffic follows dynamically specified routes defeats the purpose of traffic engineering.
Also, if traffic follows an explicit route for part of the path, but follows a dynamically determined hop-by-hop route for part of its path, there is the possibility of creating permanent routing loops.

MPLS in Shared Media

There have been two proposals for MPLS over shared media. One approach is that the shim header (generic encapsulation) is placed between the medium access control (MAC) header and the network-layer header, and the shim header is not used for forwarding traffic through the shared media, but only by the edge routers on a shared media. Another scheme is to encode the label in the MAC header itself by redefining the destination MAC address semantics. The advantage the latter proposal claims is that frames would not require to undergo fragmentation (unlike the former proposal) and bridges have routing functionality (essentially building routers out of bridges). However, the severe limitation of this proposal is in its inability to interoperate in LAN environments with legacy bridging equipment.

Future Trends

Several areas remain that are not fully explored by the MPLS WG. There exists a large installed based of ATM networks which use different styles of IP over ATM. The interoperability between MPLS and the overlay ATM subnet requires more investigation to eliminate the IP forwarding hop between the network boundary. Another such area is the IP forwarding hop at nodes where route summarization is applied. It is important to maintain the hierarchical routing architecture by summarizing routes at routing boundaries. This entails that the packet from an LSP terminating at such a node be IP forwarded. Label stacking cannot be used in this scenario because labels specific to more specific routes cannot be distributed. Otherwise, it would be contrary to the route scaling gained via a hierarchical routing architecture. One possible area of investigation could be that sources be supplied with a more specific label on demand and stack the label before sending packets into aggregated LSPs.

Conclusion

MPLS has emerged as a promising technology that will improve the scalability of hop-by-hop routing and forwarding, and provide traffic engineering capabilities for better network provisioning. It decouples forwarding from routing and allows multiprotocol support without requiring changes to the basic forwarding paradigm. The MPLS standardization effort is still in a relatively early stage, and there are a number of technical issues which need to be resolved before the standard is complete. Still further work is needed before multivendor interoperability becomes a reality. However, MPLS hass the potential to offer a useful technique for improving several aspects of network operation.

References
[1] Cyberatlas, "CyberAtlas: Market Size: Forecast", March 1998.
[2] V. Jones, "The Internet, Consolidated Facts and Figures", 1998.
[3] R. Callon et al., "A Framework for Multiprotocol Label Switching," Internet Draft, Nov. 1997.
[4] E. Rosen, A. Viswanathan, and R. Callon, "A Proposed Architecture for MPLS," Internet draft, July 1997.
[5] M. Laubach, "Classical IP and ARP over ATM," RFC 1577, Jan. 1994.
[6] J. Heinanen, "Multiprotocol Encapsulation over ATM Adaptation Layer 5," RFC1483, 1993.
[7] J. Luciani, et al., "NBMA Next Hop Resolution Protocol," Internet draft, Mar. 1997.
[8] G. Armitage, "Support for Multicast over UNI 3.0/3.1 based ATM Networks," RFC 2022, Nov. 1996.
[9] W. Fenner, "Internet Group Management Protocol, Version 2," RFC 2236, Nov. 1997
[10] "Lan Emulation over ATM 1.0," ATM Forum, af-lane-0021.000, Jan. 1995.
[11] "Multi-Protocol over ATM (MPOA) version 1.0," ATM Forum, STR-MPOA-01.000, Apr. 1997.
[12] E. Rosen et al., "MPLS Label Stack Encoding," Internet draft, draft-ietf-mpls-label-encaps-01.txt, Feb. 1998.
[13] Estrin et al., "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification," Internet draft, Oct., 1996.
[14] D. Farinacci and Y. Rekhter, "Multicast Tag Binding and Distribution Using PIM," Internet draft, Dec. 1996.
[15] D. Farinacci, "Partitioning Tag Space among Multicast Routers on a Common Subnet," Internet draft, Dec. 1996.
[16] R. Braden et al., "Resource ReSerVation Protocol (RSVP) -- Version 1 Functional Specification," RFC2205, Sept. 1997.
[17] B. Davie et al., "Use of Label Switching With RSVP," Internet draft, Nov. 1997.
[18] Y. Rekhter et al., "Cisco Systems' Tag Switching Architecture Overview," RFC2105, Feb. 1997
[19] A. Viswanathan et al., "ARIS: Aggregate Route-Based IP Switching," Internet draft, Mar. 1997.
[20] N. Feldman and A. Viswanathan, "ARIS Specification," Internet draft, Mar. 1997.
[21] H. Ahmed et al., "IP Switching for Scalable IP Services," Proc. IEEE, vol. 85, no. 12, Dec. 1997.
[22] P. Newman et al., "Ipsilon Flow Management Protocol Specification for IPv4 Version 1.0," RFC1953, May 1996.
[23] Y. Katsube et al., "Internetworking Based on Cell Switch Router -- Architecture and Protocol Overview," Proc IEEE, vol. 85, no. 12, Dec. 1997.
[24] I. Widjaja and A. Elwalid, "Performance Issues in VC-Merge Capable Switches for IP over ATM Networks," INFOCOM '98, San Francisco, CA, 29 Mar.–2 Apr. 1998.

Biographies
Arun Viswanathan received his Master's in computer science and technology from Jawaharlal Nehru University, New Delhi, India, in 1989 and his Master's with Honors in mathematics from Panjab University, Chandigarh, India, in 1987. He is a member of technical staff with the High-Speed Networking Research Department at Bell Laboratories, Lucent Technologies. Earlier he worked with IBM in the Advanced Networking Laboratory where he co-invented the aggregate route-based IP switching (ARIS) protocol. Over the past ten years he has worked on several networking projects ranging from ARIS and the NSFNet to application protocols and digital switches. His current interests include multiprotocol label switching (MPLS), high-speed routers, and quality of service provisioning in the Internet. He is a co-author of the MPLS Working Group Framework and Architecture documents.
Nancy Feldman is a software engineer at IBM T. J. Watson Research Center. While a member of IBM's Advanced Networking Laboratory she was an co-inventor and architect of the ARIS protocol. She was also a key member of the High Performance Computing and Communication Department which was responsible for the design and implementation of IBM's NSFNet Router and Milford Router. Nancy is an active member of the IETF MPLS Working Group and a co-author of the MPLS Framework and MPLS Label Distribution Protocol specification.
Zheng Wang received a Ph.D. degree in computer science from University College London (UCL), England in 1992. He is a member of technical staff with the High-Speed Networks Research Department at Bell Laboratories, Lucent Technologies, and is currently working on high-speed routers, label switching, and bandwidth management. Prior to joining Bell Labs, he worked at UCL and Cambridge University. For the past ten years he has been working on a wide range of Internet-related research, including routing, congestion control, resource management, protocols design, multicast, reliable multicast, web caching, and prefetching.
Ross Callon is the chief systems architect for IronBridge Networks, Inc. He is responsible for overseeing the internal systems protocol architecture which governs a product's internal behavior as well as the network systems architecture which governs the way a product functions in a network. He has extensive experience in routing, addressing, and multiprotocol co-existence and interoperability, and has been a major contributor to networking standards through his work in the IETF, ATM Forum, and other standards bodies. He is active in the IETF MPLS Working Group, and is co-author of the MPLS Framework Document and Protocol Architecture. He holds a Master of Science degree in operations research from Stanford University (1977), and a Bachelor of Science degree in mathematics from the Massachusetts Institute of Technology (1973).