Hadoop Installation Tutorial (Hadoop 2.x)

Hadoop 2 or YARN is the new version of Hadoop. It adds the yarn resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce designed and implemented by Google initially for processing and generating large data sets. HDFS is Hadoop’s underlying data persistency layer, loosely modeled after the Google file system (GFS). » Read more

Big Data Benchmark from AMPLab of UC Berkeley

Benchmarks are important to understand the performance and quantitative and qualitative comparison of different systems. Many analytic frameworks, such as Hive, Impala and Shark, are designed and implemented these years and become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem. The Big Data Benchmark ∞ The Big Data Benchmark from AMPLab, UC Berkeley provides quantitative and qualitative comparisons of five systems by the time this post is written: Redshift – a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse, Hive – a Hadoop-based data warehousing system, Shark – a Hive-compatible SQL engine which runs on top of the Spark computing framework, Impala – a Hive-compatible* SQL engine with its own MPP-like execution engine and Stinger/Tez – Tez is a next generation Hadoop execution engine currently in development. » Read more

Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean

Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean. You can download the slides from Software Engineering Advice from Building Large-Scale Distributed Systems by Jeff Dean. These slides contain the “Numbers everyone should know” which everyone working on systems should be familiar with. Numbers Everyone Should Know ∞ L1 cache reference 0.5 ns Branch mispredict 5 ns L2 cache reference 7 ns Mutex lock/unlock 100 ns Main memory reference 100 ns Compress 1K bytes with Zippy 10,000 ns Send 2K bytes over 1 Gbps network 20,000 ns Read 1 MB sequentially from memory 250,000 ns Round trip within same datacenter 500,000 ns Disk seek 10,000,000 ns Read 1 MB sequentially from network 10,000,000 ns Read 1 MB sequentially from disk 30,000,000 ns Send packet CA->Netherlands->CA 150,000,000 ns        » Read more

Hadoop MapReduce Tutorials

Here is a list of tutorials for learning how to write MapReduce programs on Hadoop, the opensource MapReduce implementation with HDFS. MapReduce Tutorials ∞ The official tutorial on Hadoop MapReduce framework: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html. Yahoo! Hadoop Tutorial ∞ A comprehensive tutorial on Hadoop from Yahoo! Developer Network: http://developer.yahoo.com/hadoop/tutorial/. More about MapReduce ∞ To better understand the design behind MapReduce, it is always good to read Jeff Dean and Sanjay Ghemawat’s MapReduce paper: http://research.google.com/archive/mapreduce.html. » Read more

PUMA: A MapReduce Benchmark Suite

MapReduce is a well-known programming model designed for generating and processing large data. There are various MapReduce implementations. One widely known and used one may be Hadoop. Benchmarking MapReduce frameworks gets to be important. Faraz Ahmad et al. developed a benchmark suite: PUMA MapReduce Benchmark. During our work on MapReduce, we developed a benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. » Read more

Large-scale Data Storage and Processing System in Datacenters

Research on Cloud Computing has made big progresses and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems ∞Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Colossus (GFS2): Colossus: Successor to the Google File System (GFS) BigTable: http://research.google.com/archive/bigtable.html Megastore: http://research.google.com/pubs/pub36971.html Spanner: http://research.google.com/archive/spanner.html Dynamo: http://dl.acm.org/citation.cfm?id=1294281RAMCloud: http://dl.acm.org/citation.cfm?id=1965751 and http://dl.acm.org/citation.cfm?id=2043560Compute systems ∞MapReduce: http://research.google.com/archive/mapreduce.html Hadoop implementation: Hadoop MapReduce Tutorials Sawzall: http://research.google.com/archive/sawzall.html FlumeJava: http://dl.acm.org/citation.cfm?id=1806638 Pig latin: http://dl.acm.org/citation.cfm?id=1376726 Dryad/DryadLINQ: http://research.microsoft.com/en-us/projects/dryad/ Pregel: http://dl.acm.org/citation.cfm?id=1807184 and http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html Dremel: http://research.google.com/pubs/pub36632.html Storm: https://blog.twitter.com/2011/a-storm-is-coming-more-details-and-plans-for-release and https://github.com/nathanmarz/storm/wiki Spark: https://www.usenix.org/conference/nsdi12/resilient-distributed-datasets-fault-tolerant-abstraction-memory-cluster-computing and http://spark-project.org/DVM: IEEE Transactions on Computers paper and VEE paperResource management ∞Mesos: http://mesos.apache.org/documentation/latest/architecture/       » Read more

Microsofts Cosmos Service

Cosmos is “Microsoft’s internal data storage/query system for analyzing enormous amounts (as in petabytes) of data”. There is no paper/technical report about Cosmos published yet. I compiled a list of information about Cosmos on the Web as follows. What is Microsoft’s Cosmos service? by Yaron Y. Goland. Microsoft Cosmos: Petabytes perfectly processed perfunctorily by Seth Eliot. Cosmos Big Data and Big Challenges by Pat Helland. » Read more

Hadoop Installation Tutorial (Hadoop 1.x)

Update: If you are new to Hadoop and trying to install one. Please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that is initially designed and implemented by Google for processing and generating large data sets [1]. » Read more

mrcc – A Distributed C Compiler System on MapReduce

The mrcc project’s homepage is here: mrcc project. Abstract mrcc is an open source compilation system that uses MapReduce to distribute C code compilation across the servers of the cloud computing platform. mrcc is built to use Hadoop by default, but it is easy to port it to other could computing platforms, such as MRlite, by only changing the interface to the platform. » Read more