TCP-Aware Buffer Management -- Since TCP-controlled traffic constitutes a significant component of Internet traffic, it is natural to incorporate mechanisms in routers that enhance TCP performance. The following general criteria act as a guideline. For long TCP flows (transfers much larger than the obtained bandwidth-delay product) we would like to keep the link share as close to the desired one as
possible.5 For short transfers, we would like small queuing delays for some TCP flows such as telnet flows. We would like to achieve good TCP performance even in the presence of cross-traffic which is not TCP-controlled, uncontrolled, or controlled in a manner that is unfair to TCP sources. Also, we would like to keep the link utilization as high as possible.
TCP Observations
Traffic Burstiness -- TCP creates bursty network traffic. This is because data transmission in TCP is ack clocked, and acks bunch together in FIFO queues to create burstiness [46, 47]. Also, TCP slow-start is very bursty due to the doubling of the window sizes every round-trip time. Losses in the slow-start phase can lead to very poor throughput, and buffering at least equal to approximately one-third the bandwidth delay product is needed to avoid losses in the slow-start phase [48]. However, to keep throughput high, especially for long transfers, a generally accepted rule of thumb is to have buffering equal to at least one bandwidth delay product.
With such large buffers, packets from quasi-delay-sensitive applications, like telnet, may queue up behind a large slow-start burst from a file transfer and experience large queuing delays despite the use of mechanisms like RED [49]. Hence, it would be good to separate telnet-like sources from others.
Unfairness -- It is known that TCP is inherently unfair to connections with long round-trip times [50], and the unfairness can sometimes be as bad as the inverse square of round-trip times [48]. This implies the use of active TCP-aware buffer management in routers to alleviate unfairness as suggested in [34].
Synchronization -- Another reason for the use of TCP-aware buffer management is that with drop-tail queuing, TCP windows can synchronize, leading to poor and oscillatory link utilization [51]. It is necessary to reduce synchronization by appropriate buffer management.
Random Loss Sensitivity -- Since TCP assumes that every loss is a congestion loss and reduces its transmission rate drastically, TCP throughput is very vulnerable to loss. For a fixed loss rate, the throughput goes down as the inverse square of the bandwidth-delay product [48, 52]. This loss sensitivity is further increased if there is a slow reverse path or the ack path is congested. The sensitivity increase is proportional to the ratio of the transmission time of ack packets in the reverse path to that of data packets in the forward path [53]. Random losses can be caused by transient fast (faster than round-trip times) fluctuations in open-loop traffic such as uncontrolled UDP traffic. Buffer management must protect TCP sources from losses caused by TCP-unfriendly sources (i.e., sources that do not react to losses by reducing transmissions as fast as TCP).
Is Per-Flow Fair Queuing Sufficient for Achieving TCP Performance Goals? -- Per-flow fair queuing provides fair opportunities-to-transmit. However, if a flow does not have a backlog (due to small windows, for instance) often enough then fair opportunities-to-transmit do not translate to fair link sharing [35]. Proper buffer management is needed in conjunction with fair queuing to ensure that TCP flows share the available bandwidth in a fair manner (or in a controlled unfair manner if desired).
The link-sharing goals for TCP are to achieve:
-
High throughput
-
Fairness in bandwidth sharing among competing TCP connections
-
Protection of conforming connections from malicious users
-
Reduce the well-known ack-compression phenomenon for TCP [46] which causes network traffic to be bursty even for smooth sources
-
Provision of low latency to telnet-like applications which are delay-sensitive and use TCP
Is Buffer Management Alone Sufficient? -- In the absence of per-flow queuing, control of link sharing is achieved by buffer management and packet discard schemes like Random Early Detection (RED) and drop-from-front. Although some of the above goals can be achieved, the lack of per-flow state hampers the degree to which the goals are achieved. However, the above goals can be achieved much better with a combination of per-flow queuing and appropriate buffer management [35].
TCP-Aware Buffer Management for Use with Fair Queuing -- Merely dividing available buffer space and using RED on each queue is not appealing since buffer space is wasted drastically. For good buffer usage, we would like to share buffer space between different flows. Using RED on this shared buffer space breaks the natural flow isolation that per-flow queuing provides (see [35] for an example) and allows TCP connections to be affected by TCP-unfriendly sources. Also, fair queuing with uncontrolled buffer usage by each flow leads to unfairness.
A solution is to give each flow a nominal reserved buffer space. These nominal allocations can be set in proportion to weights (or ratio of bandwidth to round-trip times if they are known). Each flow's buffer usage can exceed reservation if unused buffer space is available. However, if the total buffer occupancy exceeds a global threshold and a new packet arrives, some packet already in the buffer must be pushed out. The candidate queues eligible for pushout are those queues whose occupancy is above their nominal allocation. From this candidate set, we pick the queue that has the most excess occupancy (i.e., longest deviation from nominal location). The reason we pick this longest queue is that TCP flows with short round-trip times and high bandwidth usage are likely to have the longest queues above nominal since their windows grow faster than flows with longer round-trip times. We alleviate unfairness to long round-trip time connections by dropping from this longest queue. Also, TCP-unfriendly flows are likely to have long queues. Within the longest queue, we use drop-from-front [54].
Drop-from-front triggers TCP's fast retransmit feature more quickly, reducing the length of congestion episodes and improving link utilization. It improves fairness even with FIFO queuing, and reduces chances of windows synchronizing. It is also a useful mechanism for dropping acks [53] (in case of ack congestion) since TCP acks are cumulative. Also, when used with fair queuing it almost eliminates the chance of retransmitted packets being lost [35]. Loss of retransmitted packets leads to expensive TCP timeouts and consequent loss of throughput.
This buffer management scheme protects TCP sources from TCP-unfriendly flows. Hence, with this buffer management there is no need for other applications (such as reliable multicast or RTP applications) to have TCP-friendly congestion control. Those applications can adapt their traffic in whatever manner is preferable to them without having to comply with TCP's congestion control mechanism. TCP-aware buffer management eliminates the need for other applications to be TCP-aware. Another useful aspect is that this buffer management allows TCP to have larger initial windows (hence speeding up completion of short transfers) without affecting other TCP flows.
Protection from TCP-Unfriendly Sources -- The protection that the above buffer management scheme gives to TCP sources is illustrated in Fig. 5.
Figure 5 shows the fraction of the link bandwidth used by 10 persistent TCP sources sharing a link with a loss-insensitive TCP-unfriendly source [35]. The sending rate of this unfriendly source is varied from zero to twice the link bandwidth. With a consistent overload produced by the TCP-unfriendly source, the buffer management and dropping strategy have a stronger impact on bandwidth sharing than the scheduling policy itself. When the TCP-unfriendly source increases its rate, for both FCFS and fair queuing with global random early detection (RED) there is very similar TCP-source performance degradation. However, with longest queue drop policies (marked ALQD in the figure to denote an approximated longest queue implementation), an FQ scheduler is able to almost perfectly maintain the guaranteed rate of 111 per flow, irrespective of whether the flow is TCP-like or totally loss-insensitive.
Figure 6 shows the coefficient of variation of the throughput of different TCP flows when they share a common link with fair queuing and with RED or the buffer management scheme proposed in the previous section. Clearly, the proposed buffer management scheme has better fairness than RED.
Implementation Issues
It has been a long-standing assumption that implementation of large-scale per-flow queuing and hierarchical link sharing is not possible at reasonable cost with current technologies. However, prototype implementations of gigabit routers with per-flow queuing and hierarchical link sharing has shown that it is indeed possible to implement these mechanisms at the necessary scale at high speeds. A primary limitation in applying intelligent buffer management and scheduling mechanisms was that processing elements were performing all functions for header processing, which limited throughput as functionality increased. Distribution of processing on a functional basis and not on a per-packet basis, as shown in Fig. 3, allows throughput to be increased considerably while increasing functionality as well. Furthermore, the state information needed for per-flow queuing is not prohibitive, as has been assumed.
In the Bell Laboratories Router prototype, shown in Fig. 7, we have demonstrated that per-flow queuing and hierarchical link sharing can be implemented, on a scale necessary for backbone routers, using a single FPGA device running at 33 MHz and supporting output link capacities up to 1 Mb/s.
Switch Fabric
Switch fabric design is a very well studied area, especially in the context of ATM networks [5, 5559], and a detailed discussion of switch fabric design issues is omitted for brevity. The key point to keep in mind is that transport through the fabric must be done in a service-preserving manner.
Discussion and Conclusions
With the Internet's ongoing evolution from a best-effort network to a network with service differentiation, an important topic of interest is the type of differentiated services that service providers may want to provide. One possibility is to encode in the ToS bits a set of canonic differentiated service types and incorporate only these mechanisms into the network infrastructure. However, it is not clear whether this simple and evolutionary approach can satisfy differentiated services demands.
The point of view taken in this article is that it is possible to incorporate in routers using current technology, mechanisms which, in addition to enabling simple differentiated services, as proposed recently, also enables differentiated services that are more sophisticated in their ability to meet specific customer needs. The required mechanisms are:
-
All header processing done at wire speed: "no queuing before processing"
-
Packet classification with range matching on many fields for a large number of rules, extending route lookups to source-and-destination-based lookups
-
Flexible and general scheduling and buffer management techniques
-
QoS-capable switch fabric
-
A mechanism for flow isolation
Using these mechanisms, routers can aggregate customers in a very flexible manner, isolate individual customers' traffic, and allocate resources in a customizable manner. This permits each ISP to customize its service offerings to suit its customer demands by offering services such as VPNs with customized service differentiaton, virtual leased lines, and so on. Restricting router functionality to simple differentiated services possibilities due to technology constraints is not necessary.
Acknowledgments
The authors would like to thank members of the Bell Laboratories Router team wo were responsible for building a prototype router encompassing many of the ideas discussed in this article. We would like to thank, in particular, R. Arunachalam, S. Rathnavelu, K. J. Singh, B. Suter, and H. Tzeng for their many contributions.
References
[1] K. C. Claffy, "Internet Traffic Characterization," Ph.D. thesis, UC San Diego, 1994.
[2] K. Claffy, C. Polyzos, and H. W. Braun, "Application of Sampling Methodologies to Network Traffic Characterization," Proc. ACM SIGCOMM '93, Sept. 1993, pp. 194203.
[3] R. Jain and S. Routhier, "Packet Trains -- Measurements and a New Model for Computer Network Traffic," IEEE JSAC, vol. 4, 1986, pp. 98695.
[4] K. Thomson, G. J. Miller, and R. Wilder, "Wide-Area Traffic Patterns and Characteristics," IEEE Network, Dec. 1997.
[5] N. McKeown, V. Anantharam, and J. Walrand, "Achieving 100% Throughput in an Input-Queued Switch," Proc. INFOCOM '96, Mar. 1996, pp. 296302.
[6] V. Fuller et al., "Classless Inter-Domain Routing", RFC1519, June 1993.
[7] L. Zhang et al., "RSVP: A New Resource Reservation Protocol," IEEE Network, vol. 7, no. 5, Sept. 1993, pp. 818.
[8] C. Partridge, "Locality and Route Caches," NSF Wksp. on Internet Statistics Measurement and Analysis, San Diego, CA, Feb. 1996.
[9] A. Asthana et al., "Design of a Gigabit IP Router," Tech. rep. 11251-911105-09TM, AT&T Bell Labs., Nov. 1991.
[10] A. Asthana et al., "Towards a Gigabit IP Router," J. High-Speed Networks, vol. 1, no. 4, 1992.
[11] D. Ghosal, T. V. Lakshman, and Y. Huang, "Parallel Architectures for Processing High-Speed Network Signaling Protocols," IEEE/ACM Trans. Networking, Dec. 1995, pp. 71628.
[12] K. Sklower, "A Tree-Based Routing Table for Berkeley Unix," Tech. rep., UC Berkeley, 1993.
[13] W. Doeringer, G. Karjoth, and M. Nassehi, "Routing on Longest-Matching Prefixes, IEEE/ACM Trans. Networking, vol. 4, no. 1, Feb. 1996, pp. 8697.
[14] D. C. Feldmeier, "Improving gateway performance with a routing-table cache," Proc. of IEEE INFOCOM '88, New Orleans, LI, March 1988.
[15] P. Van Emde Boas, "Preserving Order in a Forest in Less than Logarithmic Time," Proc. 16th IEEE Conf. Foundations of Comp. Sci., 1975, pp. 7584.
[16] P. Van Emde Boas, R. Kaas, and E. Zijlstra, "Design and Implementation of an Efficient Priority Queue," Math. Sys. Theory, vol. 10, 1977, pp. 99127.
[17] M. De Berg, M. van Kreveld, and J. Snoeyink, "Two- and Three-Dimensional Point Location in Rectangular Subdivisions," J. Algorithms, vol. 18, 1995, pp. 25677.
[18] M. Waldvogel et al., "Scalable High Speed IP Routing Lookup," Proc. ACM SIGCOMM '97, France, Sept. 1997.
[19] M. Degermark et al., "Small Forwarding Tables for Fast Routing Lookups," Proc. ACM SIGCOMM '97, France, Sept. 1997.
[20] H.-Y. Tzeng, "Longest Prefix Search Using Compressed Trees," Tech. rep. TM, Bell Labs., under submission, 1997.
[21] D. Clark, "Service Allocation Profiles", Internet draft, 1997.
[22] K. Nichols and S. Blake, "Differentiated Services Operational Model and Definitions", Internet draft, Feb. 1998.
[23] B. Chazelle, "How to Search in History," Info. and Control, vol. 64, 1985, pp. 7799.
[24] B. Chazelle and J. Friedman, "Point Location among Hyperplanes and Unidirectional Ray Shooting," Comp. Geometry: Theory and Apps., vol. 4, 1994, pp. 5362.
[25] K. L. Clarkson, "New Applications of Random Sampling in Computational Geometry," Discrete & Comp. Geometry, vol. 2, 1987, pp. 195222.
[26] M. H. Overmars and A. F. van der Stappen, "Range Searching and Point Location among Fat Objects," J. Algorithms, vol. 21, no. 3, 1996, pp. 62956.
[27] T. V. Lakshman and D. Stiliadis, "Packet Classification Algorithms for Gigabit Internet Routers," Tech. Rep. 113470-980202-02T, Lucent Technologies-Bell Labs., Jan. 1998; also to appear, Proc. ACM SIGCOMM '98.
[28] T. Li and Y. Rekhter, "Provider Architecture for Differentiated Services and Traffic Engineering (PASTE)", Internet draft, 1998.
[29] J. Boyle, RSVP Extensions for CIDR Aggregated Data Flows, Internet draft, 1997.
[30] D. Estrin et al., "Protocol Independent Multicast -- Sparse Mode," Protocol spec. RFC 2117, June 1997.
[31] D. Waitzman, C. Partridge, and S. Deering, "Distance Vector Multicast Routing Protocol", RFC 1075, June 1993.
[32] S. J. Golestani, "A Self-Clocked Fair Queuing Scheme for Broadband Applications," Proc. IEEE INFOCOM '94, Apr. 1994, pp. 63646.
[33] D. Stiliadis and A. Varma, "Design and Analysis of Frame-Based Fair Queuing: A New Traffic Scheduling Algorithm for Packet-Switched Networks", Proc. ACM SIGMETRICS '96, May 1996, pp. 10415.
[34] B. Braded et al., "Recommendations on Queue Management and Congestion Avoidance in the Internet," Internet draft, Mar. 1997.
[35] B. Suter et al., "Design Considerations for Supporting TCP with Per-Flow Queuing," Proc. IEEE INFOCOM '98, San Francisco, CA, 1998.
[36] B. Suter et al., "Efficient Active Queue Management for Internet routers," Proc. Interop Engineers Conf., Las Vegas, NV, May 1998.
[37] D. Lin and R. Morris, "Dynamics of Random Early Detection," Proc. ACM SIGCOMM '97, Sept. 1997.
[38] D. D. Clark, S. Shenker, and L. Zhang, "Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanism," Proc. ACM SIGCOMM '92, Aug. 1992, pp. 1426.
[39] S. Floyd and V. Jacobson," Link Sharing and Resource Management Models for Packet Networks," IEEE/ACM Trans. Networking, vol. 3, no. 4, Aug. 1995, pp. 36586.
[40] J. C. R. Bennett and H. Zhang, "Hierarchical Packet Fair Queuing Algorithms," Proc. ACM SIGCOMM '96, Palo Alto, CA, Aug. 1996, pp. 14356.
[41] D. Stiliadis and A. Varma, "A General Methodology for Designing Efficient Traffic Scheduling and Shaping Algorithms," Proc. IEEE INFOCOM '97, 1997.
[42] P. Newman, T. Tylon, and G. Minshall, "Flow Labelled IP: A Connectionless Approach to ATM," Proc. IEEE INFOCOM '96, San Francisco, CA, Apr. 1996, pp. 125160.
[43] A. K. Parekh and R. G. Gallager, "A Generalized Processor Sharing Approach to Flow Control -- The Single Node Case," Proc. INFOCOM '92, vol. 2, May 1992, pp. 91524.
[44] A. Demers, S. Keshav, and S. Shenker, "Analysis and Simulation of a Fair Queuing Algorithm," Internetworking: Res. and Experience, vol. 1, no. 1, 1990, pp. 326.
[45] D. Stiliadis and A. Varma, "Rate-Proportional Servers: A Design Methodology for Fair Queuing Algorithms," IEEE/ACM Trans. Networking, Apr. 1998.
[46] S. Shenker, L. Zhang, and D. D. Clark, "Some Observations on the Dynamics of a Congestion Control Algorithm," Comp. Commun. Rev., Oct. 1990, pp. 3039.
[47] L. Zhang, S. Shenker, and D. D. Clark, "Observations on the Dynamics of a Congestion Control Algorithm: The Effects of Two-Way Traffic," Proc. ACM SIGCOMM '91, 1991, pp. 13347.
[48] T. V. Lakshman and U. Madhow, "The Performance of TCP/IP for Networks with High Bandwidth-Delay Products and Random Loss," IEEE/ACM Trans. Networking, June 1997.
[49] S. Floyd and V. Jacobson, "Random Early Detection Gateways for Congestion Avoidance," IEEE/ACM Trans. Networking, Aug. 1993.
[50] S. Floyd and V. Jacobson," "On Traffic Phase Effects in Packet Switched Gateways," Internetworking: Res. and Experience, vol. 3, no. 3, Sept. 1992, pp. 11556.
[51] L. Zhang, "A New Architecture for Packet Switching Network Protocols," Ph.D. thesis, MIT Comp. Sci., Cambridge, MA, 1989.
[52] T. J. Ott, M. Mathis, and J. H. B. Kemperman, "The Stationary Behavior of Ideal TCP Congestion Avoidance", 1996.
[53] T. V. Lakshman, U. Madhow, and B. Suter, "Window-Based Error Recovery and Flow Control with a Slow Acknowledgment Channel: A Study of TCP/IP Performance," Proc. IEEE INFOCOM '97, Kobe, Japan, Apr. 1997.
[54] T. V. Lakshman, A. Neidhart, and T. J. Ott, "The Drop-from-Front Strategy in TCP over ATM and Its Interworking with Other Control Features," Proc. IEEE INFOCOM '96, San Francisco, CA, 1996, pp. 124250.
[55] F. A. Tobagi, "Fast Packet Switch Architectures for Broadband Integrated Services Digital Networks," Proc. IEEE, vol. 78, no. 1, Nov. 1990, pp. 13367.
[56] F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, "Low-Cost Scalable Switching Solutions for Broadband Networks," IEEE Commun. Mag., vol. 35, no. 12, Dec. 1997, pp. 3643.
[57] F. M. Chiussi, "Design, Performance and Implementation of a Three-Stage Banyan-Based Architecture with Input and Output Buffers for Large Fast Packet Switches," Tech. rep. CSL-TR-93-573, CSL, Stanford, CA, June 1993.
[58] T. E. Anderson et al., "High Speed Switch Scheduling for Local Area Networks," ACM Trans. Comp. Sys., vol. 11, no. 4, Nov. 1993, pp. 31952; also, Digital Systems Research Center tech. rep. no. 99.
[59] D. Stiliadis and A. Varma, "Providing Bandwidth Guarantees in an Input-Buffered Crossbar Switch," Proc. INFOCOM '95, Apr. 1995.
Biographies
Vijay P. Kumar is head of High Speed Networks Research at Bell Laboratories, Lucent Technologies. He is responsible for research and development in algorithms, architectures, protocols, chips, and systems for high-speed data networking. He is also responsible for making next-generation routing engines into products. He obtained his Bachelor of Engineering degree in electronics and communication engineering from Osmania University, Hyderabad, India, in 1980, and M.S. and Ph.D. degrees in electrical and computer engineering from the University of Iowa, Iowa City, in 1982 and 1985, respectively. He has been with Bell Laboratories since 1985, where he has conducted and led research activities in fast packet switching, VLSI communication architectures, fault-tolerant networking, and VLSI yield enhancement. His work has resulted in several ATM chipsets and a gigabit routing engine for differentiated services.
T. V. Lakshman received his Ph.D. degree in computer science from the University of Maryland, College Park, in 1986. Prior to that he received a Master's degree from the Department of Physics, Indian Institute of Technology, Bangalore, India. From 1986 to 1995 he was at Bellcore, where he was most recently a senior research scientist and technical project manager in the Information Networking Research Laboratory. He is currently with the High Speed Networks Research Department at Bell Laboratories. His recent research has been in issues related to traffic characterization and provision of quality of service for video services, architectures and algorithms for gigabit IP routers, end-to-end flow control in high-speed networks, traffic shaping and policing, switch scheduling, and parallel architectures for fast signaling and connection management in high-speed networks. His current research interests are in the areas of high-speed networking, distributed computing, and multimedia systems. He is a corecipient of the 1995 ACM Sigmetrics/Performance Conference Outstanding Paper Award. He is an editor of the IEEE/ACM Transactions on Networking.
Dimitrios Stiliadis received his Diploma in computer engineering and informatics from the University of Patras, Greece, in 1992. He received M.S. and Ph.D. degrees in computer engineering from the University of California at Santa Cruz in 1994 and 1996, respectively. He is currently a member of technical staff in High-Speed Networks Research of Bell Laboratories, where he has been leading the architecture and design of a gigabit router supporting differentiated services. His current research interests include traffic scheduling, network performance analysis, and architectures of fast forwarding engines.