Reading List for Distributed Systems and Cloud Computing

Understanding the literature is usually the first step to do research, which is the same for systems research on cloud computing. A reading list may help a lot to those that just start in cloud computing research.

Prof. Lin Gu, my PhD supervisor, compiled a reading list for system research on cloud computing. The reading list includes a list of papers related to Internet-scale systems and datacenters, techniques in distributed computing like Paxos, execution frameworks like MapReduce, distributed file systems like GFS, and storage systems like Dynamo. These are very good papers which every one in this area should study.

Please find the reading list here:

A copy of the reading list is as follows.

Reading List for Distributed Systems and Cloud Computing

Datacenters form the backbone of cloud-based systems. Barroso et al. introduced the Google search system, which provides a good starting point for understanding Internet-scale systems and datacenters:
Luiz Barroso, Jeffrey Dean, Urs Hoelzle. Web Search for a Planet: The Google Cluster Architecture. IEEE Micro, Vol. 23, No. 2, pp. 22-28, Mar./Apr. 2003.

The MapReduce framework is a pioneer of large-scale data-intensive computing in datacenters:
Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation – Volume 6 (San Francisco, CA, December 06 – 08, 2004).

Our recent development, DVM, shows an efficient way to extend instruction-level virtualization to a large number of physical hosts, and can potentially provide an abstraction of a “single computer” for a datacenter:
Zhiqiang Ma, Zhonghua Sheng and Lin Gu. DVM: A Big Virtual Machine for Cloud Computing. IEEE Transactions on Computers (to appear).

Other instruction-level abstractions for data intensive computing include Layer Zero, X10 and OpenCL. Specifically, Layer Zero is an open platform for data-intensive computing using unified memory and CPU resources on a cluster and provides the “single computer” abstraction. Such emerging technologies emphasize computation in the memory, and take different approaches to manage the sources and model the data and computation. Spark takes a functional approach on an RDD (Resilient Distributed Dataset) model, and a set of technologies (Pregel, Power Graph, etc.) focus on computation in a graph model.

Zhiqiang Ma, Ke Hong and Lin Gu. VOLUME: Enable Large-Scale In-Memory Computation on Commodity Clusters. The 5th IEEE Intl. Conf. on Cloud Computing Technology and Science (CloudCom’13). Bristol, UK. Dec. 2-5, 2013.

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI’12.

The datacenter software builds on techniques in distributed computing. Among these techniques, Paxos plays an important role in many core services in cloud systems. The following papers describe Paxos and a few systems using it:
Lamport, L. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (May. 1998), 133-169.

L. Lamport. Paxos made simple. ACM SIGACT News, 32(4):18-25, 2001.

Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06 – 08, 2006). 335-350.

The classic design of a large-scale storage system in datacenters is the Google File System (GFS), which provides a systematic solution to scalability, consistency, and software fault tolerance:
Ghemawat, S., Gobioff, H., and Leung, S. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP’03), Bolton Landing, NY, USA, October 19 – 22, 2003. 29-43.

Above the file system abstraction, researchers have constructed key value stores and databases. Often not supporting the full ACID semantics, the database design is often referred to as a NoSQL database.

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. 2007. Dynamo: amazon’s highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (Stevenson, Washington, USA, October 14 – 17, 2007). SOSP ’07. ACM, New York, NY, 205-220.

Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Seattle, Washington, November 06 – 08, 2006). 205-218.

As the technology evolves, it becomes clear that reasonably strong semantics cannot be entirely ignored. While it is relatively easy to provide atomicity on single records, it is a tremendous technical challenge to support distributed transactions in high throughput, at affordable cost and in a large distributed system. Recent systems have made certain progress in this direction.

Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-Michel Leon, Yawei Li, Alexander Lloyd and Vadim Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. In Proceedings of the Conference on Innovative Data system Research (CIDR), 2011, pp. 223-234.

An earlier system, PNUTS, showcases another design with several similar goals.

Cooper, B. F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H., Puz, N., Weaver, D., and Yerneni, R. PNUTS: Yahoo!’s hosted data serving platform. Proc. VLDB Endow. 1, 2 (Aug. 2008), 1277-1288.

A more recent development is Spanner:

James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. Spanner: Google’s Globally-Distributed Database. OSDI’12: Tenth Symposium on Operating System Design and Implementation, Hollywood, CA, October, 2012.

Eric Zhiqiang Ma

Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

Leave a Reply

Your email address will not be published. Required fields are marked *