Publications

IEEE CTN
Written By:

Songze Li and A. Salman Avestimehr, University of Southern California

Published: 30 Aug 2018

CTN Issue: August 2018

A note from the editor:

Edge compute is seen as an important part of future high bandwidth 5G networks, but it comes with the challenges of distributed compute over a network with variable availability of bandwidth and compute capacity due to the service orientated nature of modern wireless networks. This month Songze and Salman present distributed coding of computation as a technology to combat these issues, outlining two new strategies to balance compute and network bandwidth requirements to optimize the Edge compute QoS. Fascinating stuff. Your comments as always are very welcome.

Alan Gatherer
Editor-in-Chief

Songze Li

University of Southern California

A. Salman Avestimehr

University of Southern California

The edge computing architecture (see Fig. 1) has been recently proposed to better satisfy the service requirements of the emerging Internet-of-Things (IoT) applications (see, e.g., [1]). Unlike the Cloud computing that stores and processes end-users' data in remote and centralized datacenters, edge computing brings the provision of services closer to the end-users by pooling the available resources at the edge of the network (e.g., smartphones, tablets, smart cars, base stations and routers) (see, e.g., [2], [3]). As a result, the main driving vision for edge computing is to leverage the significant amount of dispersed computing resources at the edge of the network to provide much more user-aware, resource-efficient, scalable and low-latency services for IoT.

In this article, we demonstrate how coding can be effectively utilized to trade abundant computing resources at network edge for performance improvement in various aspects of edge computing. In particular, we illustrate two coding concepts, which leverage the available or under-utilized computing resources at various parts of the network to enable coding opportunities that significantly reduce the bandwidth consumption, and minimize the risk from slow, irresponsive, and adversarial edge nodes.

The first coding concept introduced in [4], [5], which is referred to as “Coded MapReduce” (CMR), enables a surprising inversely proportional tradeoff between computation load and communication load in distributed edge computing. More specifically, for general MapReduce-type distributed computing frameworks, CMR demonstrates that increasing the computation load by a factor of r (i.e., evaluating each computation at r carefully chosen nodes) can create novel coding opportunities that reduce the required communication load for computing by the same factor. Hence, CMR can be utilized to pool the underutilized computing resources at network edge to slash the required bandwidth for edge computing.

The second coding concept introduced in [6] which is referred to as “Lagrange Coded Computing” (LCC), leverages computation redundancy (in novel coded forms) to simultaneously provide the following salient features: 1) resiliency against missing results from straggling or irresponsive nodes; 2) security against Byzantine (or malicious) nodes that deliberately modify the computation for their benefit; and 3) privacy of the dataset from possible collusion of nodes. LCC, which exploits the well-known Lagrange polynomial to create computation redundancy, can be applied to any scenario in which the function of interest is an arbitrary multivariate polynomial of the input dataset, hence covering many computations of interest in edge computing. Lagrange Coded Computing significantly generalizes prior works on coded computing (e.g., [7]-[9]) to go beyond linear computations that have so far been the main focus in this area.

In this article, we give an overview of these two coding concepts, illustrate their key ideas via motivating examples, and demonstrate their impacts on edge networks. More generally, noting that redundancy is abundant in edge networks (i.e., many computing/storage points) and grows linearly with network size, we demonstrate the transformational role of coding in edge computing for leveraging such redundancy to enable low-latency, scalable, and secure edge computing. We also point out that while these two coding techniques are also applicable to cloud computing applications, they are expected to play a much more substantial role in improving the system performance of edge applications, due to the fact that communication bottleneck, availability and security of computation resources are far more severe issues in edge environments compared with its cloud counterpart. We finally highlight some exciting open problems and research directions for utilizing coding in edge computing architectures.

Coding for Bandwidth-Efficient Edge Computing

We illustrate Coded MapReduce (CMR) in a typical edge computing scenario, in which an edge client aims to utilize the network edge for its computation task. For instance, a driver wants to find the best route through a navigation application offered by the edge, in which the map information and traffic condition are distributedly stored in edge nodes (ENs) like roadside monitors, smart traffic lights, or other smart cars that collaborate to find the best route. Another example is object recognition that is the key enabler of many augmented reality applications. To provide an object recognition service over edge, edge nodes like routers and base stations, each stores parts of the dataset repository, and collaboratively process the images or videos provided by the edge client.

For the above edge computing applications, the computation task is over a large dataset that is distributedly stored on the edge nodes (e.g., map/traffic information or dataset repository), and the computations are often decomposed using MapReduce-type frameworks (e.g., [10], [11]), in which a collection of edge nodes distributedly Map a set of input files, generating some intermediate values, from which they Reduce a set of output functions.

We now demonstrate the main concepts of CMR in a simple problem depicted in Fig. 2. In this case, a client uploads a job of computing 3 output functions (represented by red/circle, green/square, and blue/triangle respectively) from 6 input files to the edge. Three edge nodes in the edge, i.e., EN 1, EN 2 and EN 3, collaborate to perform the computation. Each EN is responsible for computing a unique output function, e.g., EN 1 computes the red/circle function, EN 2 computes the green/square function, and EN 3 computes the blue/triangle function. When an EN maps a locally stored input file, it computes 3 intermediate values, one for each output function. To reduce an output function, each EN needs to know the intermediate values of this output for all 6 input files. We first consider the case where no redundancy is imposed on the computations, i.e., each file is mapped exactly once. Then as shown in Fig. 2(a), each EN maps 2 input files locally, obtaining 2 out of 6 required intermediate values. Hence, each EN needs another 4 intermediate values transferred from the other ENs, yielding a communication load of 4 x 3=12.

Now, we demonstrate how CMR can substantially reduce the communication load by injecting redundancy in computation. As shown in Fig. 2(b), let us double the computation such that each file is mapped on two ENs (files are downloaded to the ENs offline). It is apparent that since more local computations are performed, each EN now only requires 2 other intermediate values, and an uncoded shuffling scheme would achieve a communication load of 2 x 3=6. However, we can do better with CMR. As shown in Fig. 2(b), instead of unicasting individual intermediate values, every EN multicasts a bit-wise XOR, denoted by ⨁, of 2 intermediate values to the other two ENs, simultaneously satisfying their data demands. For example, knowing the blue triangle in File 3, EN 2 can cancel it from the coded packet multicast by EN 1, recovering the needed green square in File 1. In general, the bandwidth consumption of multicasting one packet to two nodes is less than that of unicasting two packets, and here we consider a scenario in which it is as much as that of unicasting one packet (which is the case for wireless networks). Therefore, the above CMR scheme incurs a communication load of 3, achieving a 4x gain from the case without computation redundancy and a 2x gain from the uncoded shuffling.

More generally, we can consider an edge computing scenario, in which K edge nodes collaborate to compute Q output functions from N input files that are distributedly stored at the nodes. We define the computation load, r, to be the total number of input files that are mapped across the nodes, normalized by N. That is, e.g., r = 2 means that on average each file is mapped on two nodes. We can similarly define the communication load L to be the total (normalized) number of information bits exchanged across nodes during data shuffling, in order to compute the Q output functions.

For this scenario, it was shown in [5] that, compared with conventional uncoded strategies, CMR can surprisingly reduce the communication load by a multiplicative factor that equals to the computation load r, when computing r times more sub-tasks than the execution without redundancy (i.e., r=1). Or more specifically,

$$L_{coded}= \frac{1}{r}L_{uncoded}=\frac{1}{r}(1-\frac{r}{K})= \Theta(\frac{1}{r}).$$   (1)

CMR employs a specific strategy to assign the computations of the Map and Reduce functions, in order to enable novel coding opportunities for data shuffling. In particular, each data block is repetitively mapped on r distinct nodes according to a specific pattern, in order to create coded multicast messages that deliver useful data simultaneously to r ≥ 1 nodes. For example, as demonstrated in Fig. 3, the overall communication load can be reduced by more than 50% when each Map task is repeated at only one other node (i.e., r = 2).

The idea of efficiently creating and exploiting coded multicast opportunities was initially proposed to solve caching problems in [12], [13], and extended to wireless D2D networks in [14], where caches pre-fetch part of the content to enable coding during the content delivery, minimizing the network traffic. CMR extends such coding opportunities to data shuffling of distributed computing frameworks, significantly reducing the required communication load.

Apart from significantly slashing the bandwidth consumption, CMR also has the following major impacts on the design of edge computing architectures.

• Reducing Overall Response Time. Let us consider an arbitrary edge computing application for which the overall response time is composed of the time spent computing the intermediate tasks, denoted by $T_{Task Computation}$, and the time spent moving intermediate results, denoted by $T_{Data Movement}$. In many applications of interest (e.g., video/image analytics or recommendation services), most of the job execution time is spent for data movement. For example, consider the scenarios in which $T_{Data Movement}$ is 10x ~ 100x of $T_{Task Computation}$. Using CMR with computation load r, we can achieve an overall response time of

$$T_{Total, coded} \approx \mathbb{E}[rT_{Task Computation} + T_{Data Movement}/r].$$   (2)

To minimize the above response time, one would choose the optimum computation load $r^* = \sqrt{T_{Data Movement}/T_{Task Computation}}$. Then in the above example, utilizing CMR can reduce the overall job response time by approximately 1.5 ~ 5 times.

The practical impact of CMR on reducing the response time has been recently demonstrated in [15] through a series of experiments over Amazon EC2 clusters. In particular, the CMR techniques were incorporated into the well-known distributed sorting algorithm TeraSort [16], to develop a new coded sorting algorithm, namely CodedTeraSort, which allows a flexible selection of the computation load r. Here we summarize in Table I, the runtime performance of a particular job of sorting 12 GB of data over 16 EC2 instances.

Table I: Average response times for sorting 12 GB of data over 16 EC2 instances using 100 Mbps network speed.

Computation (sec.)

Communication
(sec.)

Total
(sec.)

Speedup

TeraSort

15.53

945.72

961.25

CodedTeraSort: r = 5

60.5

222.83

283.33

3.39x

Theoretically according to (1), with a computation load r=5, CodedTeraSort promises to reduce the data shuffling time by a factor of approximately 5. From Table I, we can see that CodedTeraSort brought down the data shuffling time, which was the limiting component of the runtime of this application, by 4.24x. As a result, CodedTeraSort reduced the overall job response time by 3.39x.

• Scalable Mobile Computation. CMR also found its application in mobile edge computing. For a mobile computing platform proposed in [17], a collection of mobile users, each has an input to process overall a large dataset (e.g., the image repository of an image recognition application), collaborate to store the dataset and perform the computations, using their own storage and computing resources. All participating users communicate the locally computed intermediate results among each other to reduce the final results.

Utilizing CMR in this mobile computing platform leads to a scalable design. More specifically, let us consider a scenario where K users, each processing a fraction of the dataset, denoted by μ (for some 1/K≤ μ ≤1), collaborate for mobile edge computing. It was demonstrated in [17] that CMR can achieve a (normalized) bandwidth consumption of 1/μ-1 to shuffle all required intermediate results. This reduces the communication load of the uncoded scheme, i.e., K(1-μ), by a factor of μK, which scales linearly with the aggregated storage size of all collaborating users. Also, since the consumed bandwidth is independent of the number of users K, CMR allows this platform to simultaneously serve an unlimited number of users with a constant communication load.

Coding for Resilient, Secure, and Private Edge Computing

We now move to the second coding concept, named Lagrange Coded Computing (LCC) and introduced in [6], and demonstrate it for a class of edge computing applications, in which a client requires processing a massive dataset (possibly over multiple iterations). The application is supported by a group of edge nodes, which have distributedly stored the entire dataset. Each node processes the parts of the dataset it locally has, and returns the computed results to the client. The client reduces the final results after collecting intermediate results from all edge nodes. Many distributed machine learning algorithms fall into this category. For example, a gradient decent algorithm for model training requires computing the gradient over all training data to update the model parameters in each iteration. To do that at network edge, each edge node stores locally a subset of the training data. During computation, each node computes a partial gradient using its local data and returns the result to the client.

To be more specific, let us consider a distributed computing problem where given the input data, partition into K batches $X_1,\ldots,X_K$, the goal is to compute K output results $Y_k = f(X_k)$, $k=1,\ldots,K$, for some function f that can be an arbitrary multivariate polynomial with degree d. This computation framework captures many computation tasks of interest. For example, let us consider linear computations like matrix-vector multiplication that underlies many machine learning applications. In this case, the goal is to compute $X \cdot b$, for some data matrix X partitioned into K sub-matrices $\{X_k\}_{k=1}^K$ and some vector b. The output in this case is computed as $Y_k = f(X_k) = X_k \cdot b$. Another example is the bilinear computations that include matrix-matrix multiplications. Given two lists of matrices $\{A_k\}_{k=1}^K$ and $\{B_k\}_{k=1}^K$, the goal is to compute element-wise products $\{A_k \cdot B_k\}_{k=1}^K$. In this case, the data batch $X_k = (A_k, B_k)$, and the output $Y_k = f(X_k) = A_k \cdot B_k$.

One natural approach to tackle this problem is to use K edge nodes, each one of which is responsible for storing one data batch and computing one output result. However, it is obvious that this approach is susceptible to straggling or failed edge nodes. That is, even one slow or missing node can cause significant delay, or even failure of the entire computation. In order to improve the resiliency against slow or irresponsive nodes, we can employ N ≥ K edge nodes, and some redundant computations on these N nodes. For a naive repetition scheme, we can repeat the task of processing a data batch over N/K nodes. In the worst case, we can tolerate missing results from (N/K-1) nodes, and the results from any subset of N-N/K+1 nodes are sufficient for the client to recover the desired results.

The LCC scheme provides the optimal design of the data and task placement to achieves the minimum recovery threshold, which is defined as the minimum number of nodes the client needs to wait for before the overall computation can be accomplished. LCC has the following three main components.

The first key component is the coded data placement. Instead of storing original data batches $X_1,\ldots,X_K$, each edge node i stores a coded data batch $\tilde{X}_i$, which is a linear combination of the original batches, for all i=1, …, N. To create these coded batches, the first step is to select K distinct elements $\beta_1,\ldots,\beta_K$ in the relevant field, and construct the following Lagrange polynomial u(z) that has value $X_k$ at the point $z=\beta_k$, for all k=1, …, K.

$$u(z) = \sum_{k=1}^K X_k \prod_{i \neq k} \frac{z-\beta_i}{\beta_k-\beta_i}.$$   (3)

Then, using a set of N distinct elements $\alpha_1,\ldots,\alpha_N$, as shown in Fig. 4(a), the coded data batch $\tilde{X}_i = u(\alpha_i)$ is created by evaluating u(z) at the point $\alpha_i$ for all i=1, …, N. This encoding process can be performed efficiently using fast multi-point polynomial evaluation algorithms (see, e.g., [18]). We emphasize on the following two salient features of the data encoding of LCC:

• Universal: The data encoding is oblivious of the output function f. Therefore, the coded data placement can be performed offline without knowing which operations will be applied on the data.

• Incremental: When new data become available and coded data batches need to be updated, we only need to encode the new data and append them to the previously coded batches, instead of accessing the entire uncoded data and re-encoding them to update the coded data.

The second key component is computation on coded data (i.e., coded computing). During the process of local computations, each EN i computes $f(\tilde{X}_i)$ on its locally stored coded data $\tilde{X}_i$, and returns the computation result to the client. Given that f is a polynomial of degree d, it is easy to realize that $f(\tilde{X}_i) = f (u(\alpha_i))$ is essentially the value of the univariate polynomial f(u(z)) of degree d(K-1) at the point $z = \alpha_i$.

The third key component is computation decoding. Since the computation at each node provides the value of the polynomial f(u(z)) at a distinct point, as shown in Fig. 4(b), using the computation results from any subset of d(K-1)+1 nodes, the client can interpolate the polynomial f(u(z)). This operation can be done efficiently using fast polynomial arithmetic algorithms (see, e.g., [19]). Finally, the client evaluates f(u(z)) at the points $\beta_1,\ldots,\beta_K$ to recover the desired results $f(u(\beta_k)) = f(X_k)$, k=1, …, K.

The above LCC scheme achieves recovery threshold of d(K-1)+1 = Θ(K), which significantly improves over the aforementioned naive repetition scheme, whose recovery threshold is N-N/K+1 = Θ(N). To see this, for fixed K, LCC only needs a fixed number of nodes to return their results, and this number is independent of the network size N. In contrast, the number of surviving nodes required by the repetition scheme scales linearly with N. As the edge network expands, the LCC scheme benefits much more from the abundant computation resources in alleviating the negative effects caused by slow or failed nodes, which leads to a much lower computation latency. In fact, it was proven in [6] that LCC achieves the minimum possible recovery threshold among all distributed computing schemes.

Table II: Breakdowns of the run-times for running gradient descent for 100 iterations on Amazon EC2 over 40 worker nodes.

Schemes

Recovery threshold

Communication time

Computation time

Total run-time

uncoded

40

24.125 s

0.237 s

24.362 s

GC

31

6.033 s

2.431 s

8.464 s

LCC

7

1.719 s

1.868 s

3.587 s

Having theoretically demonstrated the optimality of the LCC scheme, we also empirically evaluate its performance on practical workloads. In a recent work [20], we apply LCC to solve least-squares regression problems, where a weight vector is trained from an input dataset that attains the minimum quadratic loss. To perform this task distributedly using the gradient descent algorithm, the training data matrix is partitioned into K sub-matrices $\{X_k\}_{k=1}^K$, which are stored and processed on N distributed worker nodes. Within each iteration of the algorithm, to compute the gradient and update the weight vector w at a central master node, the master node utilizes the computation results from the workers to recover $\sum_{k=1}^K X_k X_k^T w$. Taking the function $f(X_k) = X_kX_k^T w$, the LCC scheme can be directly applied here to mitigate the effects of slow/failed nodes (or stragglers), and speedup the computation. We list one set of experiment results in Table II. We can see that compared with the uncoded scheme, for which K=N, and each worker processes a disjoint subset of data, LCC trades redundant local computations to significantly reduce the required recovery threshold, slashing the total run-time by 6.8x. Also, compared with the state-of-the-art straggler mitigation scheme for gradient computation, known as “Gradient Coding” (GC) [21], LCC achieves a much smaller recovery threshold, and a 2.4x reduction on the total run-time. In contrast to the practice of GC that codes over results computed from uncoded data, LCC directly codes over the original data, and utilizes the computation results of coded data to decode the final results.

Through minor modifications, LCC can also simultaneously guarantee security and privacy in edge computing, against malicious and colluding nodes. To be more specific, in order for LCC to tolerate up to A adversarial/malicious nodes who may return arbitrarily erroneous results, and up to T colluding nodes who may use their local coded data to jointly infer the original data, LCC can achieve security against the adversarial nodes and privacy against the colluding nodes with a recovery threshold d(K+T-1)+2A+1. The additional term 2A in this recovery threshold accounts for the additional nodes the client needs to wait for in order to decode the correct information symbols in the presence of A corrupted code symbols, just like decoding a Maximum Distance Separable (MDS) code with minimum distance 2A. The additional term Td accounts for appending random data to the original data before encoding, such that after data mixing in the encoding process, any subset of T colluding nodes are kept oblivious of the original data.

Conclusions and Future Research Directions

We demonstrated how coding can be effectively utilized to leverage abundant computing resources at the network edge to enable a bandwidth-efficient, resilient, secure, and private edge computing environment. In particular, we illustrated two recently proposed coding concepts, namely Coded MapReduce and Lagrange Coded Computing, and discussed their impacts on edge computing.

We envision codes to play a fundamental role in edge computing by enabling an efficient utilization of computation, communication, and storage resources at network edge. This area opens many important and exciting future research directions. Here we list a few:

• Heterogeneous computing nodes: In distributed edge networks, different nodes have different processing and storage capacities. The ideas outlined in this article can be used to develop heuristic solutions for heterogeneous networks. For example, one simple approach is to break the more powerful nodes into multiple smaller virtual nodes that have homogeneous capability, and then apply the proposed coding techniques for the homogeneous setting. However, systematically developing practical task assignment and coding techniques for these systems, that are provably optimum (approximately), is a challenging open problem.

• Coding for partial stragglers: Most of the prior works assume that stragglers fail catastrophically and do not respond to any requests. However, this is rarely the case in practice, and machines can experience temporary slowdown due to various random factors like vibrating hard drive and background programs. For an iterative computing process (e.g., model training), a machine may straggle in some iterations but runs completely normal in others. This motivates us to define performance metrics and design optimal coding schemes, not one round at a time, but considering the iterative nature of the whole process jointly.

• Networks with multi-layer and structured topology: The current code designs for distributed computing [4], [5] are developed for a basic topology, in which the processing nodes are connected through a shared link. While these results demonstrate the significant gain of coding in distributed edge computing, we need to extend these ideas to more general network topologies. In such networks, nodes can be connected through multiple switches and links in different layers with different capacities.

• Multi-stage computation tasks: Another important direction is to consider more general computing frameworks, in which the computation job is represented by a Directed Acyclic Task Graph (DAG). While we can apply the aforementioned code designs for each stage of computation locally, we expect to achieve a higher reduction in bandwidth consumption and response time by globally designing codes for the entire task graph and accounting for interactions between consecutive stages.

• Coded computing overhead: The current edge computing system under consideration lacks appropriate modeling of the coding overhead, which includes the cost for the encoding and decoding processes, the cost for performing multicast communications, and the cost for maintaining desired data redundancy across edge nodes. To make the study of coding in practical edge systems more relevant, it is important to carefully formulate a comprehensive model that systematically accounts for these overhead.

• Verifiable distributed computing: Edge computing architectures facilitate offloading of computational tasks from relatively weak computational devices (clients) to more powerful nodes in the edge network. As a result, there is a critical need for “Verifiable Computing” methods, in which clients can make sure they receive the correct calculations. This is typically achieved by injecting redundancy in computations by the clients. We expect codes to provide much more efficient methods for leveraging computation redundancy in order to provide verified computing in edge applications.

• Exploiting the algebraic structures of computation tasks: Recall that CMR can be applied to any general computation task that can be cast in a MapReduce framework. However, we expect to improve the overall performance, if we exploit the specific algebraic properties of the underlying tasks. For example, if the task has some linearity, we may be able to incorporate it in communication and coding design in order to further reduce the bandwidth consumption and latency. Furthermore, extending LCC beyond polynomials (e.g., non-linearity in neural networks) is of great interest.

• Plug-and-Play edge nodes: We can finally envision a software package (or App) that can be installed and maintained distributedly on each edge node. This package should allow an edge computing node to join the system anytime to work with the rest of the nodes or leave the system asynchronously, still the entire network operates near optimum. Designing codes that guarantee integrity of computations despite such network dynamics is a very interesting and important research direction.

References

[1] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016.
[2] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854– 864, Dec. 2016.
[3] Y. C. Hu, M. Patel, D. Sabella, N. Sprecher, and V. Young, “Mobile edge computing – a key technology towards 5g,” ETSI white paper, vol. 11, no. 11, pp. 1–16, 2015.
[4] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, “CodedMapReduce,” in Proceedings of the 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sept. 2015, pp. 964–971.
[5] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, “A fundamental tradeoff between computation and communication in distributed computing,” IEEE Trans. Inf. Theory, vol. 64, no. 1, Jan. 2018.
[6] Q. Yu, N. Raviv, J. So, and A. S. Avestimehr, “Lagrange coded computing: Optimal design for resiliency, security and privacy,” e-print arXiv:1806.00939, 2018.
[7] K. Lee, M. Lam, R. Pedarsani, D.Papailiopoulos, and K.Ramchandran, “Speeding up distributed machine learning using codes,” in Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), July 2016, pp. 1143–1147.
[8] S. Dutta, V. Cadambe, and P.Grover, “Short-dot: Computing large linear transforms distributedly using coded short dot products,” in NIPS, 2016, pp. 2092–2100.
[9] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “Polynomial codes: an optimal design for high-dimensional coded matrix multiplication,” in NIPS, 2017, pp. 4406–4416.
[10] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in Proceedings of the 2004 6th USENIX Symposium on Operating Systems Design and Implementation, ser. OSDI ’04. USENIX, Dec. 2004, pp. 137–150.
[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: cluster computing with working sets,” in Proceedings of the 2010 2nd USENIX Workshop on Hot Topics in Cloud Computin, ser. HotCloud ’10. USENIX, June 2010, pp. 10–10.
[12] M. A. Maddah-Ali and U. Niesen, “Fundamental limits of caching,” IEEE Transactions on Information Theory, vol. 60, no. 5, pp. 2856– 2867, May 2014.
[13] M. A. Maddah-Ali and U. Niesen, “Decentralized coded caching attains order-optimal memory-rate tradeoff,” IEEE/ACM Transactions on Networking, vol. 23, no. 4, pp. 1029–1040, Aug. 2015.
[14] M. Ji, G. Caire, and A. F. Molisch, “Fundamental limits of caching in wireless D2D networks,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 849–869, Feb. 2016.
[15] S. Li, S. Supittayapornpong, M. A. Maddah-Ali, and A. S. Avestimehr, “Coded terasort,” IPDPS ParLearning Workshop, May 2017.
[16] O. O’Malley, “TeraByte Sort on Apache Hadoop,” available online at: http://sortbenchmark.org/YahooHadoop.pdf, 2008, Accessed on August 28, 2018.
[17] S. Li, Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, “A scalable framework for wireless distributed computing,” IEEE/ACM Transactions on Networking, vol. 25, no. 5, pp. 2643–2654, Oct. 2017.
[18] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer Algorithms, 1st ed. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1974.
[19] K. S. Kedlaya and C. Umans, “Fast polynomial factorization and modular composition,” SIAM Journal on Computing, vol. 40, no. 6, pp. 1767–1802, 2011.
[20] S. Li, S. M. M. Kalan, Q. Yu, M. Soltanolkotabi, and A. S. Avestimehr, “Polynomially coded regression: Optimal straggler mitigation via data encoding,” e-print arXiv:1805.09934, 2018.
[21] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, “Gradient coding: Avoiding stragglers in distributed learning,” in Proceedings of the 34th International Conference on Machine Learning, Aug. 2017, pp. 3368–3376.

Statements and opinions given in a work published by the IEEE or the IEEE Communications Society are the expressions of the author(s). Responsibility for the content of published articles rests upon the authors(s), not IEEE nor the IEEE Communications Society.