Large-scale Data Storage and Processing System in Datacenters

Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I have compiled a list of some large-scale data storage and processing systems in datacenters below.

Storage systems

Google File System (GFS): http://research.google.com/archive/gfs.html
  HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Colossus (GFS2): Colossus: Successor to the Google File System (GFS)
BigTable: http://research.google.com/archive/bigtable.html
Megastore: http://research.google.com/pubs/pub36971.html
Spanner: http://research.google.com/archive/spanner.html
Dynamo: http://dl.acm.org/citation.cfm?id=1294281
RAMCloud: http://dl.acm.org/citation.cfm?id=1965751 and http://dl.acm.org/citation.cfm?id=2043560

Compute systems

MapReduce: http://research.google.com/archive/mapreduce.html
  Hadoop implementation: Hadoop MapReduce Tutorials
Sawzall: http://research.google.com/archive/sawzall.html
FlumeJava: http://dl.acm.org/citation.cfm?id=1806638
Pig Latin: http://dl.acm.org/citation.cfm?id=1376726
Dryad/DryadLINQ: http://research.microsoft.com/en-us/projects/dryad/
Pregel: http://dl.acm.org/citation.cfm?id=1807184 and http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
Dremel: http://research.google.com/pubs/pub36632.html
Storm: https://blog.twitter.com/2011/a-storm-is-coming-more-details-and-plans-for-release and https://github.com/nathanmarz/storm/wiki
Spark: https://www.usenix.org/conference/nsdi12/resilient-distributed-datasets-fault-tolerant-abstraction-memory-cluster-computing and http://spark-project.org/
DVM: IEEE Transactions on Computers paper and VEE paper

Resource management

Mesos: http://mesos.apache.org/documentation/latest/architecture/

Colossus: Successor to the Google File System (GFS)

Colossus is the successor to the Google File System (GFS), as mentioned in the recent Spanner paper at OSDI 2012. Spanner also uses Colossus to store its tablets. Much less has been published about Colossus than about GFS, which was described in a SOSP 2003 paper. Still, some information about Colossus is available on the Web, and I list some of it here.

Reading List for Distributed Systems and Cloud Computing

Understanding the literature is usually the first step in doing research, and systems research on cloud computing is no exception. A reading list can be a great help to those who are just starting out in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for systems research on cloud computing. It covers papers on Internet-scale systems and datacenters, distributed computing techniques like Paxos, execution frameworks like MapReduce, distributed file systems like GFS, and storage systems like Dynamo.

mrcc – A Distributed C Compiler System on MapReduce

The mrcc project’s homepage is here: mrcc project. Abstract: mrcc is an open source compilation system that uses MapReduce to distribute C code compilation across the servers of a cloud computing platform. mrcc is built to use Hadoop by default, but it is easy to port to other cloud computing platforms, such as MRlite, by changing only the interface to the platform.