Copyright 2000 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This article was published in the May 2000 issue of
Network.

ABSTRACT

 

Understanding Web traffic characteristics is key to improving the performance and scalability of the Web. In this article Web proxy workloads from different levels of a caching hierarchy are used to understand how the workload characteristics change across different levels of a caching hierarchy. The main observations of this study are that HTML and image documents account for 95 percent of the documents seen in the workload; the distribution of transfer sizes of documents is heavy-tailed, with the tails becoming heavier as one moves up the caching hierarchy ; the popularity profile of documents does not precisely follow the Zipf distribution; one-timers account for approximately 70 percent of the documents referenced; concentration of references is less at proxy caches than at servers, and concentration of references diminishes as one moves up the caching hierarchy; and the modification rate is higher at higher-level proxies.

 

 

Traffic Analysis of a Web Proxy Caching Hierarchy

 

Anirban Mahanti, Carey Williamson, and Derek Eager
University of Saskatchewan

 

The World Wide Web has experienced phenomenal growth in recent years. This growth of the Web has contributed significantly to the network traffic on the Internet, and motivated much research into improving the performance and scalability of the Web.

In recent years, Web proxy caches have been deployed to reduce network traffic and provide better response time for Web accesses. A Web proxy consists of application-level software that accepts document retrieval requests from a set of clients, forwards these requests to appropriate servers if the requested documents are not already present in the proxy's cache, and sends documents back to the clients. Originally, proxies were designed to allow network administrators to be able to control Internet access from within an intranet [1]. It was recognized, however, that proxies may also serve as repositories for frequently requested documents. This role of proxies has made them very popular. Caching documents at the proxy can save network bandwidth and reduce network latency for document accesses [2].

Most browsers can be configured to use a proxy. Proxies can be deployed almost anywhere in the Internet. A second-level cache, the first level being the browser cache, can be provided by a proxy cache that serves requests from a large community such as a large corporation, a university, or the customers of an Internet service provider. Higher-level proxies have also been deployed. These proxies typically have other second-level caches (i.e., lower-level proxies) as their clients.

Web caches connected in the above described tree-like fashion are said to form a cache hierarchy. Web caches distribute load away from server "hot spots," saving wide area network bandwidth and reducing access latencies. Lower-level caches typically configure a higher-level cache as their parent. The first Web proxy, the CERN httpd server, allowed caches to be arranged hierarchically. In such Web caches, requests that cannot be satisfied from a lower-level cache are forwarded to the higher-level cache, with the expectation that some of these requested documents will be stored there. The Internet Cache Protocol (ICP) [3] was developed as a part of the Harvest caching project [4] to provide a more efficient method of intercache communication. ICP allows lower-level caches to send queries to a higher-level cache to inquire whether or not it has a copy of the requested document. Requests are forwarded to the higher level only if a copy of the document is found; otherwise, the request is forwarded to the origin server.

Understanding Web traffic characteristics is key to the design of techniques that save network bandwidth, reduce latency, and improve response time of Web accesses. This article is concerned with understanding factors that can affect the performance of Web proxy caches. The study builds on previous work done by Arlitt and Williamson [5] on Web server workloads, Cunha et al. [6] on Web client workloads, and Abdulla et al. [7] on Web proxy workloads. The present work differs from that in [7] because workloads from proxies at different levels of a caching hierarchy are considered, permitting a more complete study of how the workload characteristics change as one moves from the client side to the server side of the network.

The rest of the article is organized as follows. We describe the data collection, reduction, and analysis of access logs performed. Next, we present the results of the workload analysis, followed by conclusions.

Data Collection, Reduction, and Analysis

Our traffic analysis is conducted using access logs from Web proxy servers. Each entry in the access logs records the URL of the document being requested, the date and time of the request, the name (or IP address) of the client host making the request, the number of bytes returned to the requesting client, and additional information that describes how the client's request was treated at the proxy. Processing these log entries can produce useful summary statistics about workload volume, document types and sizes, popularity of documents, and proxy cache performance. The next section describes the data collection sites and the "raw" data sets, while another section describes the reduction of the raw data sets to only contain useful information for the analysis performed in this article.

Configuration of Proxy Sites

The access logs for this study were obtained from three Web proxy servers: the proxy server at the University of Saskatchewan; the CANARIE proxy [8] located at Bell Canada in Toronto; and the National Laboratory for Applied Networking Research (NLANR) proxy [9] at the University of Illinois, Urbana-Champaign (UIUC). Each of these sites continuously records access logs, which were obtained on a daily basis using FTP. Further detail on each of these sites is provided below.

The Web proxy server at the University of Saskatchewan represents an institution-level Web proxy in the CA*net II caching hierarchy. It serves several hundred users on the University of Saskatchewan campus who have configured their browsers to use the proxy cache. This proxy server is operated by the Department of Computing Services at the University of Saskatchewan. The proxy server uses a Digital AlphaServer 1200 5/400 with two 400 MHz processors running Squid v. 1.2.beta20 with 5 Gbytes disk cache. The proxy is configured to use the CANARIE cache as its parent. That is, cache misses at the University of Saskatchewan cache result in requests to the CANARIE cache for the missing document.

The CANARIE proxy cache is the root of the CA*net II caching hierarchy. This proxy is a SUN Ultra Sparc 180 MHz machine running Squid v. 1.2.beta22. The CANARIE proxy server is physically located at Bell Canada, Toronto, although it is administratively controlled by the CANARIE Advanced Research and Development Network Operations Centre (ARDNOC) in Ottawa. This cache currently functions as a parent for several first-level proxies, including proxies at the University of Saskatchewan, the University of Alberta, Dalhousie University, and McMaster University. There are also a small number of users who configure their browsers to directly use the CANARIE proxy cache. The CANARIE proxy has parent links to two nodes in the NLANR caching hierarchy: the Pittsburgh NLANR node and the NLANR node at UIUC.

The NLANR traces in this study come from the National Center for Supercomputing Applications (NCSA) proxy server at UIUC. This site represents one of several top-level nodes in the NLANR Web caching hierarchy; it receives requests from sibling caches at the top level, as well as from lower-level caches that use it as a parent. The NCSA proxy server is a Digital AlphaServer 1000 266 MHz machine running Squid v. 1.2.beta17 with about 10 Gbytes of disk cache.

The access logs of individual days were concatenated to obtain longer data sets for each site. A 45-day trace was created for the University of Saskatchewan proxy server by concatenating access logs from February 5 to March 30, 1999.1 Similarly, a 45-day CANARIE trace spanning February 2 to March 24, 1999,2 and a 30-day NLANR trace spanning February 3 to March 4, 1999 were created. The University of Saskatchewan proxy server recorded a total of 30,330,149 requests in 45 days of activity. The CANARIE proxy server recorded a total of 20,725,429 requests in 45 days of activity. The NLANR access logs recorded 35,631,782 requests in 30 days of activity. These three data sets will be referred to as USask, CANARIE, and NLANR in the rest of this article. The access logs are presented in this relative order to reflect the progression from an institution-level Web proxy cache to a top-level Web proxy cache.

Data Reduction and Analysis

The extremely large volume of the raw proxy traces necessitated pruning of the data sets to contain only the information useful for our study. Both the USask and CANARIE Web proxies were configured to use ICP for intercache queries. ICP queries account for a significant fraction of the total requests in the proxy traces, although they do not result in document transfers. An analysis of the raw data sets showed that 17 percent of the requests made to the USask proxy were ICP queries. Similarly, 52 percent of the requests in the CANARIE proxy trace were ICP queries, while for the NLANR proxy none of the requests were ICP queries, since the NLANR access log was not configured to record activity at the ICP port.

After removing the ICP queries from the raw data sets, the next step used the HTTP reply codes in the access logs for data reduction. The HTTP reply codes describe how a client's request was serviced. For example, a reply code of 200 (OK) means that a valid document was made available to the client, either directly from the proxy cache, or by retrieving the document from another proxy cache or the origin server. A reply code of 304 (Not Modified) implies that a client had issued a GET If-Modified-Since request to determine whether or not the client's cached copy of a document was up to date, and the server (or some intermediate proxy) replied indicating that the client has a valid copy of the document. An HTTP response code of 206 (Partial Content) implies partial transfer of the document to the client in response to a partial GET request. Partial GETs are used to recover efficiently from partial failed transfers in HTTP/1.1. An HTTP response of 302 (Found) means that the requested document is known to reside in a different location than that specified by the URL.

In the data sets, approximately 90 percent of the HTTP requests made to the proxy result in the client successfully obtaining the document (HTTP 200 and HTTP 304). In this study, we consider all requests that would result in the document being accessed from the origin server in the absence of intermediate proxies. Therefore, only the 200 (OK) and 206 (Partial Content) status codes are considered.

Table 1 summarizes the reduced access logs for the three Web proxy sites. Based on the average number of requests seen per day at each proxy server, the NLANR proxy server has the highest activity, while the CANARIE proxy server has the least activity. This is not surprising since very few institutional caches currently use the CANARIE proxy cache.3

Table 1 also indicates the number of distinct documents, servers, and clients recorded in the access logs. Each proxy handles requests for millions of distinct documents located at thousands of different Web servers. The number of clients (i.e., distinct client IP addresses seen) varies quite significantly in the three traces. This number is not known precisely for the NLANR log, since the client IP addresses in the NLANR access logs are randomized every day (for privacy concerns). However, on any particular day, about 700 clients generate requests to the NLANR cache.

Overall, the mean and median document transfer sizes are quite small, as has been reported in previous studies of Web servers [5, 10] and Web proxy servers [7]. In these data sets, the mean size of the documents transferred ranges from 8 to 13 kbytes, while the median is in the range of 2–3 kbytes. The mean transfer size is larger than the median transfer size because there are some very large documents that skew the mean of the transfer size distribution. There is also high variability in the sizes of documents transferred, as indicated by the coefficient of variation (COV) reported in Table 1. These reduced data sets are used in the analyses in the rest of the article.

Proxy Traffic Analysis

The following sections present the results from a detailed traffic analysis of Web proxy workloads. The following characteristics are studied: document types and sizes, proportion of documents accessed only once in the access logs, distribution of transfer sizes, popularity of documents, and rate of document modifications.

Document Types and Sizes

The next step in the analysis classifies document requests (based on URL extensions4) into the following categories: Any document that could not be classified under one of the above categories was placed in the Others category.

The results of this analysis for the three data sets are summarized in Table 2. The table shows that HTML and image documents account for close to 95 percent of the total requests. Similar results were reported for client traces by Cunha et al. [6], and for server traces by Arlitt and Williamson [5].

Unlike Web server workloads, however, Table 2 shows that image documents are consistently the most requested document type (68–78 percent), followed by HTML documents (about 20 percent). Similar observations were made by Abdulla et al. [7] in their characterization of Web proxy traffic. It is also observed that image type documents are responsible for the highest percentage of bytes transferred (40–52 percent), followed by HTML documents (18–23 percent).

A substantial portion of the byte transfer volume is accounted for by audio, video, compressed, and formatted document types, which, despite their relatively infrequent access (typically close to 1 percent of requests), are large enough to generate 20–30 percent of the bytes transferred. This observation is substantiated by the large mean and median transfer sizes indicated for these document types compared to other document types. In general, the COV values within each document type category are much lower than the COV of transfer sizes for the aggregate data set (Table 1). The large variation in the mean transfer sizes across the diverse document types helps explain the high COV reported in the aggregate data sets.

One-Time Referencing

A surprising observation made in previous analyses of Web server workloads was that (regardless of the duration of the access log studied) typically 15–30 percent of the documents accessed in the log were accessed only once in the log. This so-called one-time referencing behavior is of concern for Web caching, since there is clearly no point caching something that will be accessed only once.

The precise cause of this one-time referencing behavior is not fully understood. Several explanations have been proposed. First, it might indicate the vastness of the Web and the low signal-to-noise ratio for much of its content. Second, it might reflect human nature in browsing habits (e.g., once a site has been visited, there is no need to visit it again). Third, it might reflect the behavior of content providers, who might use date-based URL names, or redesign or modify Web pages on a regular basis to keep them current, while possibly removing or renaming old pages. Fourth, it may reflect the presence of search engines or Web robots that traverse many pages to construct an index. Fifth, the presence of a high percentage of one-timers might indicate that caching hierarchies are actually working well. Finally, it may be the consequence of document prefetching (i.e., prediction) algorithms in some proxies and/or client browsers. All of these hypotheses seem plausible. If any (or all) of them are true, they indicate a challenging workload environment for Web caching algorithms.

Several other explanations are less plausible. For example, attributing this one-time referencing to the presence of dynamic requests (e.g., CGI) is not possible, since dynamic documents typically account for much less than 5 percent of the workload in Web server and Web proxy access logs. Attributing one-time referencing to typographical errors made by clients in requested URL names does not make sense, since the access log analyses usually focus on successful requests, not errors.

Table 3 summarizes the contribution of one-timers with respect to the number of distinct documents, total requests, and total bytes transferred for the three data sets. One-timers account for a substantial portion of the total requests (18–47 percent) and total bytes transferred (27–48 percent). Also, approximately 70 percent of the documents referenced are one-timers. The latter number is significantly higher than that for Web server workloads [5]. Presumably this is because requests made to a proxy are for a much larger population of Web documents than those made to a single server.

The predominance of one-time referencing for Web documents highlights the need for novel Web proxy caching policies that can effectively discriminate against one-timers. For example, frequency-based algorithms, such as Least Frequently Used (LFU), tend to perform better than recency-based algorithms, such as Least Recently Used (LRU) [11]. Similar observations have been made for Web server caching algorithms [5, 12, 13].

Transfer Size Distribution

The next analysis focuses on the transfer size distribution for the documents returned to the requesting clients (either directly by the proxy, or after obtaining the document from a higher-level proxy or the originating server). Of particular interest are the shape of this distribution, the presence of a heavy tail in the distribution, and the impact of the heavy-tailed distribution on Web and network performance.

Figure 1 shows the cumulative distribution function of the transfer size for each proxy server, using a logarithmic scale on the horizontal axis. Almost all the transfers are in the range from 100 to 100,000 bytes, with very few small transfers (less than 100 bytes) and very few large transfers (more than 100,000 bytes). This distribution is similar to the document size distribution reported for Web clients [6] and Web servers [5, 10, 14, 15].

The transfer size distribution in Fig. 1 is heavy-tailed. In simple terms, a "heavy tail" in a Web document size distribution means that the very large "elephants" (i.e., outliers) in the tail of the distribution, although relatively few in number, are large enough to contribute significantly to the overall traffic volume observed (e.g., skew the mean transfer size distribution, as observed in Table 1).

Figure 2 illustrates the "heavy-tail" transfer size distribution for the USask data set. Figure 2 shows that in the USask data set 75 percent of all the distinct documents requested are for transfers less than 10,000 bytes in size.5 About 80 percent of the references are for documents in this category. However, transfers of documents smaller than 10,000 bytes account for only 27 percent of the total bytes transferred by the proxy to the clients. The tail of the distribution (transfers over 100,000 bytes) accounts for a significant 30 percent of the total bytes transferred. Similar observations can be made for the other data sets [16]. Therefore, caching smaller documents can increase the hit ratio at the cost of more bytes transferred over the network. To reduce the network traffic volume, proxies can cache larger documents, thus sacrificing the cache hit ratio.

The simplest example of a heavy-tailed distribution is the doubly exponential Pareto distribution; its cumulative distribution function is

F(x) = 1 – (k/x) , k > 0, x k

The parameter is known as the tail index [15], and k defines the beginning of the "tail" of the distribution (i.e., it represents the smallest possible value of the random variable in the heavy-tailed distribution). As decreases, the tail of the distribution becomes heavier [15]. In other words, an arbitrarily large portion of the probability mass may be present in the tail of the distribution as decreases.

To estimate the tail index for the transfer size data sets, the approach outlined in [15, 17] is followed. First, a log-log complementary distribution (LLCD) is plotted for the transfer sizes in the data sets. An LLCD plot graphs log(x) = log(1 – F(x)) vs. logx, for large x [17]. Heavy-tailed distributions have the following property:

An estimate of the tail index is obtained by determining the slope of the LLCD plot for values of x greater than k, using least-squares linear regression [18].

Figure 3 shows the LLCD plot for the NLANR data set, along with the least-squares regression fit for the heavy tail (k = 10,000 bytes). Table 4 summarizes the estimated value for each data set, along with the coefficient of determination (R2), which assesses the "goodness of fit" for the linear regression. These results show a very strong fit R2, and values ranging from 1.07 to 1.27.

It can be concluded that proxy transfer sizes are heavy-tailed, just like server transfer sizes (0.93 1.33). Among the three data sets considered, the NLANR data set exhibited the heaviest tail ( = 1.07), while the USask data set shows the least heavy tail ( = 1.27). One possible explanation for this observation is successful caching of small documents at lower-level proxies, resulting in heavier tails at higher-level caches.

Document Popularity

Many researchers have noted that the popularity of Web documents is highly uneven [5, 6, 10, 14, 17, 19, 20]. The objective of this analysis is to determine whether the Zipf distribution applies to the proxy document reference streams across different levels of a caching hierarchy.

Zipf's Law [21] states that if items are ranked (r) according to their popularity (P) measured by the frequency of reference for individual items, then the popularity of an item is given by

where the exponent ß is often close to unity [22].

In the special case where the exponent ß = 1, the Nth most popular item is exactly twice as popular as the 2Nth item, and so on. The Zipf distribution has been widely used to model many aspects of social and economic behavior [21, 22]. It has also been applied to model memory referencing behavior of computer programs [23, 24] and, most recently, Web referencing behavior [20].

To determine whether or not Web proxy document references follow the Zipf distribution, the documents are first ranked, with the most referenced document being assigned a rank of one, followed by the next most referenced document with a rank of two, and so on.

The frequency vs. rank plot for the USask data set (on a log-log scale) appears in Fig. 4. Visual inspection of the graph suggests that the distribution follows Zipf's Law, albeit not as strictly as has been observed for Web servers. Similar observations can be made for the other two data sets [16]. There appears to be some flattening at the most popular and least popular ends of the plots. This might indicate caching of the frequently requested "hot" documents at browsers and lower-level caches. The middle portion appears to follow the Zipf distribution more accurately. Similar observations are made by Breslau et al. [20].

To determine the best-fit exponent () of the Zipf distribution, the method in [25] is used. This involves grouping all documents that have been referenced an identical number of times and assigning them the same rank. For example, if three documents are referenced 100 times each and are ranked 1000, 1001, and 1002, the modified data set considers them to be a single data point with a rank of 1001 (i.e., the average of the three ranks) and popularity equal to 100. After obtaining the modified data set, a least-squares fit over the (log-transformed) modified data set is performed. The calculated values for and goodness of fit R2 are shown in Table 5. Figure 4 also shows that the Zipf distribution plot obtained with the calculated value (the dotted line) is very similar to the corresponding frequency vs. rank plot. These results suggest that a Zipf-like model can capture the distribution of references seen at a proxy. Note that the exponent decreases for higher-level proxies. This is most likely due to filtering of popular documents at the browsers and lower-level proxies. Roadknight et al. came to similar conclusions after a detailed study of document popularity characteristics.

Another way of characterizing the uneven document access patterns is to determine the extent to which references are skewed toward certain documents. A measure of skewness, referred to as concentration, was applied to memory and file referencing behavior [23, 26], and later to Web server document accesses [5, 10, 27].

The concentration phenomenon, as applied to documents sorted by their reference counts, is illustrated in Fig. 5a. Nonuniform referencing of Web documents is clearly reflected in this figure, with 30 percent of the documents accounting for 60–80 percent of the references. The remaining 70 percent of the documents (primarily the one-timers) account for the remaining requests.

Concentration of references is also observed when documents are sorted according to the volume of bytes transferred in response to requests for them. Figure 5b shows that 10 percent of the documents account for approximately 90 percent of the bytes transferred by the proxy to clients.

Higher-level proxies (i.e., CANARIE, NLANR) receive requests from clients that are typically lower-level proxies. Therefore, the concentration of requests at these proxies might be due to the overlap in requests from different clients, since frequent requests from users generally get captured by the browsers and lower-level proxy caches. Among the three data sets, the USask data set shows the most concentration, while the CANARIE one shows the least. The lower concentration at the CANARIE proxy might be due to its small client population (about 13).

These results suggest that the concentration of references is lower at the Web proxy servers than at the Web servers [5]. This observation makes sense intuitively, since the clients of a Web proxy can effectively access any available document in the Web (i.e., the document set is very large). For Web servers the requests are restricted to a limited set of documents (i.e., the documents present at the Web server).

However, some recent studies have shown that many static documents are never cached [28, 29]. There are three main reasons for these documents not being cached. First, a significant fraction of the responses received from the Web servers contain no Last Modified dates, disabling caching of these requests [30]. Second, many requests in the proxy traces are associated with "cookies,"6 and most proxies do not cache responses that are cookie-specific. Third, some static documents are associated with an HTTP directive ("no-cache" pragma in HTTP/1.0) which informs the clients that the document should not be cached.

A cursory analysis of the access logs revealed that there are many documents never being cached (at the browser and/or proxy). Therefore, part of the concentration of references seen is due to the uncachability of the documents. Since the access logs used in this analysis did not have the necessary information to distinguish clearly between cachable and uncachable documents, no further attempt was made to understand the impact of cachability of documents.

The Rate of Change of Documents

As already seen, there are multiple references to many Web documents. However, some instances of multiple references may represent accesses to different versions of a document that has been updated. This type of referencing behavior would yield quite different cache behavior than would multiple accesses to an unmodified document. Therefore, it is important to know if there is any correlation between frequency of access and document modifications. Since Web caches do not cache dynamic files, the following analysis focuses on static7 documents only.

A document is assumed to be modified when the size associated with a reference to the document is different from the size associated with the document when it was last referenced. There is some error in this assumption, since the access logs report the bytes transferred from the proxy to the clients, which includes HTTP headers. Since document size is approximated by bytes transferred, modifications are reported when header size changes occur and when clients abort requests. The change ratio of a document is defined to be the ratio of the number of modifications seen to the number of references made since the document was first referenced.

Figure 6 shows the average change ratio of documents as a function of the midpoints of reference count intervals. The change ratio for all documents with reference counts in the interval [xi, xi+1], where xi+1 = 1.1 * xi, were averaged to produce each point in the figure.

The average change ratio appears to increase initially, then decrease, and finally increase again as the reference count increases. The documents seen in the traces tend to be more frequently modified as one moves higher up in the caching hierarchy because requests from lower-level proxies are forwarded to higher-level proxies when document modifications occur.

Conclusions

This article presented a workload analysis for Web proxy servers. Three different Web proxies were studied: the Web proxy cache at the University of Saskatchewan; the CANARIE Web proxy cache; and an NLANR Web proxy cache at the University of Illinois, Urbana-Champaign. These Web proxies reflect the progression from an institutional-level Web proxy cache to a top-level Web cache.

The main observations of the study are as follows:

These observations present a snapshot of Web proxy workload characteristics on one particular Web caching hierarchy at one point in time. Clearly, to comment on the general characteristics at different levels of a caching hierarchy, a study needs to look at more than one caching hierarchy, as was done for Web server [5], proxy [7], and client [6] workloads. Nevertheless, these results and observations can be used to design a flexible synthetic workload generator. Such workload generators can be used to learn more about the impact of different workload parameters on the performance of Web proxy caches, whether hierarchically configured or not.

As for the future, the Web is a continually evolving system. Forecasts indicate that by the year 2002, there will be about 320 million Web users, compared to approximately 100 million users in the year 1998 [30]. The number of multimedia services is also increasing, resulting in more audio and video traffic. Therefore, analysis of Web proxy traffic over time needs to be conducted to understand the changes occurring in the traffic characteristics.

 

References
[1] M. Baentsch et al., "World-Wide Web Caching-The Application Level View of the Internet," IEEE Commun. Mag., vol. 35, no. 6, June 1997.

[2] S. Glassman, "A Caching Relay for the World-Wide Web," Comp. Networks and IDSN Sys., vol. 27, no. 2, Nov. 1994, pp. 165–74.

[3] D. Wessels and K. Claffy, "ICP and the Squid Web Cache," IEEE JSAC, vol. 16, no. 3, Apr. 1998, available at http://ircache.nlanr.net/~wessels/Papers/icp-squid.ps.gz, pp, 345–57.

[4] A. Chankhunthod et al., "A Hierarchical Internet Object Cache," Proc. 1996 USENIX Tech. Conf., San Diego, CA, Jan. 1996, pp. 153-166.

[5] M. Arlitt and C. Williamson, "Internet Web Servers: Workload Characterization and Performance Implications," IEEE/ACM Trans. Net., vol. 5, no. 5, Oct. 1997, pp. 631–45.

[6] C. Cunha, A. Bestavros, and M. Crovella, "Characteristics of WWW Client-Based Traces," Tech. rep. TR-95-010, Dept. of Comp. Sci., Boston Univ., Apr. 1995, available at ftp://cs-ftp.bu.edu/techreports/1995-010-www-client-traces.ps.Z

[7] G. Abdulla et al., "WWW Proxy Traffic Characterization with Application to Caching," Tech. rep. TR-97-03, Comp. Sci. Dt., Virginia Tech., Mar. 1997, available at http://www.cs.vt.edu/~chitra/work.html

[8] CANARIE, CA*net II sanitized access logs, available at http://ardnoc41.canet2.net/cache/squid/rawlogs/

[9] NLANR sanitized access logs, available at ftp://ircache.nlanr.net/Traces

[10] H. Braun and K. Claffy, "Web Traffic Characterization: An Assessment of the Impact of Caching Documents from NCSA's Web Server," Comp. Networks and ISDN Sys., vol. 28, no. 1 & 2, Jan. 1995, pp. 37–51.

[11] M. Arlitt, R. Friedrich, and T. Jin, "Performance Evaluation of Web Proxy Cache Replacement Policies," Tech. rep. HPL-98-97, HP Labs, May 1998, available at http://www.hpl.hp.com/techreports/98/HPL-98-97.html

[12] S. Williams et al., "Removal Policies in Network Caches for World-Wide Web Documents," Proc. 1996 ACM SIGCOMM Conf. Appls., Technologies, Architectures, and Protocols for Comp. Commun., Stanford, CA, Aug. 1996, pp. 293–305.

[13] M. Arlitt and C. Williamson, "Trace-Driven Simulation of Document Caching for Internet Web Servers," Simulation, vol. 68, no. 1, Jan. 1997, pp. 23–33.

[14] P. Barford and M. Crovella, "Generating Representative Web Workloads for Network and Server Performance Evaluation," Proc. 1998 ACM SIGMETRICS Conf. Measurement and Modeling of Comp. Sys., Madison, WI, July 1998, pp. 151–60.

[15] M. Crovella and A. Bestavros, "Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes," IEEE/ACM Trans. Networking, vol. 5, no. 6, Dec. 1997, pp. 835–71.

[16] A. Mahanti, Web Proxy Workload Characterisation and Modelling, M.Sc. thesis, Dept. of Comp. Sci., Univ. of Saskatchewan, Sept. 1999, available at ftp://ftp.cs.usask.ca/pub/discus/thesis-mahanti.ps.Z

[17] P. Barford et al., "Changes in Web Client Access/Patterns: Characteristics and Caching Implications," World Wide Web J., vol. 2, 1999, pp. 15–28, available at http://cs-www.bu.edu/faculty/crovella/paper-archive/traces98.ps

[18] V. Paxson, "Empirically Derived Analytic Models of Wide-Area TCP Connections," IEEE/ACM Trans. Net., vol. 2, no. 4, Aug. 1994, pp. 316–36.

[19] V. Almeida et al., "Characterizing Reference Locality in the WWW," Proc. 1996 Int'l. Conf. Parallel and Dist. Info. Sys., Miami Beach, FL, Dec. 1996, pp. 92–103.

[20] L. Breslau et al., "On the Implications of Zipf's Law for Web Caching," Proc. IEEE INFOCOM '99, New York, Mar. 1999, available at http://www.cs.wisc.edu/~cao/papers/zipf-implications.html

[21] G. Zipf, Human Behavior and the Principle of Least-Effort, Cambridge, MA: Addison-Wesley, 1949.

[22] M. Kendall, "Natural Law in the Social Sciences," J. Royal Stat. Soc. 4, vol. 124, 1961, pp. 1–18.

[23] R. Bunt and J. Murphy, "The Measurement of Locality and the Behaviour of Programs," Comp. J., vol. 27, no. 2, Aug. 1984, pp. 238–45.

[24] J. Peachey, R. Bunt, and C. Colbourn, "Some Empirical Observations on Program Behaviour with Applications to Program Restructuring," IEEE Trans. Software Eng., vol. 11, no. 2, Feb. 1985, pp. 188-93.

[25] C. Roadknight, I. Marshall, and D. Vearer, "File Popularity Characterisation," Proc. 2nd Wksp. Internet Server Perf., Atlanta, GA, May 1999, available at http://www.cc.gatech.edu/fac/Ellen.Zegura/wisp99/papers/roadknight.ps

[26] C. Williamson and R. Bunt, "Characterizing Short Term File Referencing Behaviour," Proc. 5th Annual IEEE Phoenix Conf. Comp. and Commun., Phoenix, AZ, Mar. 1986, pp. 651–60.

[27] T. Kwan, R. McGrath, and D. Reed, "NCSA's World Wide Web Server: Design and Performance," IEEE Comp., vol. 28, no. 11, Nov. 1995, pp. 68–74.

[28] R. Caceres et al., "Web Proxy Caching: The Devil is in the Details," Perf. Eval. Rev., vol. 26, no. 1, Dec. 1998, pp. 11–15.

[29] C. Wills and M. Mikhailov, "Examining the Cacheability of User-Requested Web Resources," Proc. 4th Int'l. Web Caching Wksp., San Diego, CA, Mar./Apr. 1999, available at http://www.cs.wpi.edu/~mikhail

[30] L. Lange, "The Internet," IEEE Spectrum, vol. 36, no. 1, Jan. 1999, pp. 35–40.

Additional Reading
[1] CERN httpd Web Server

Biographies
Anirban Mahanti received a B.E. degree in computer science in 1997 from Birla Institute of Technology, Ranchi, India, and an M.Sc. degree in computer science from the University of Saskatchewan in 1999. He is currently working toward a Ph.D. degree in computer science at the University of Saskatchewan. His research interests are in parallel processing and Web performance.

Carey Williamson [M] is a professor in the Department of Computer Science at the University of Saskatchewan. He received a B.Sc. (Honours) in computer science from the University of Saskatchewan in 1985, and a Ph.D. in computer science from Stanford University in 1991. His general research interests are in computer networks and computer systems performance evaluation. More specific interests include network traffic measurement, workload characterization, network simulation, and Web performance. He is a member of ACM and SCS.

Derek L. Eager received a B.Sc. degree in computer science from the University of Regina in 1979, and M.Sc. and Ph.D. degrees in computer science from the University of Toronto in 1981 and 1984, respectively. Currently he is a professor in the Department of Computer Science at the University of Saskatchewan. His research interests are in the areas of performance evaluation, streaming media delivery, and distributed systems.