|

How to Compress/Uncompress Files in Linux Using gzip, bzip2, 7z, rar and zip

Compress/uncompress files are frequent operations. The normal tools for compressing/uncompressing in Linux is gzip, bzip2, 7z, rar and zip. This post introduces how to compress and uncompress file in Linux using these tools. We use best compressing rate with all these tools and mark the options for “best rate” in bold fonts. We can delete…

|

PUMA: A MapReduce Benchmark Suite

MapReduce is a well-known programming model designed for generating and processing large data. There are various MapReduce implementations. One widely known and used one may be Hadoop. Benchmarking MapReduce frameworks gets to be important. Faraz Ahmad et al. developed a benchmark suite: PUMA MapReduce Benchmark. During our work on MapReduce, we developed a benchmark suite…

Hadoop TeraSort Benchmark

TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and sorting implementations: the TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running…

|

Large-scale Data Storage and Processing System in Datacenters

Research on Cloud Computing has made big progresses and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows. Storage systems Google File System (GFS): http://research.google.com/archive/gfs.html HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html Colossus (GFS2): Colossus: Successor to the Google File System (GFS)…

| |

Microsofts Cosmos Service

Cosmos is “Microsoft’s internal data storage/query system for analyzing enormous amounts (as in petabytes) of data”. There is no paper/technical report about Cosmos published yet. I compiled a list of information about Cosmos on the Web as follows. What is Microsoft’s Cosmos service? by Yaron Y. Goland. Microsoft Cosmos: Petabytes perfectly processed perfunctorily by Seth…

Conference Ranking by Average Number of Citations in the Last 5 Years, 2012

I am trying to find out the top conferences that have the largest average number of citations in the last 5 years on the Internet but fail to find one. However, there are many rankings about the overall citations and numbers of publications. Hence, it is not hard to calculate the average number of citations…

| |

Hadoop Installation Tutorial (Hadoop 1.x)

Update: If you are new to Hadoop and trying to install one. Please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x). Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that is initially designed…

Reading List for Distributed Systems and Cloud Computing

Understanding the literature is usually the first step to do research, which is the same for systems research on cloud computing. A reading list may help a lot to those that just start in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for system research on cloud computing. The reading…

Conferences on Cloud Computing 2013

This post lists important conferences related to Cloud Computing in year 2013. SOSP 2013 SOSP’13: The 24th ACM Symposium on Operating Systems Principles. November 3-6, 2013, Nemacolin Woodlands Resort, Pennsylvania. The biennial ACM Symposium on Operating Systems Principles is the world’s premier forum for researchers, developers, programmers, and teachers of computer systems technology. Academic and…

Hadoop Default Ports

Hadoop’s namenode and datanodes expose a bunch of TCP ports used by Hadoop’s daemons to communicate to each other or listen directly to users’ requests. These ports information are needed by both the Hadoop users and cluster administrators to write programs or configure firewalls/gateways accordingly. A post written by Philip Zeyliger from Cloudera’s blog summarizes the…

Conferences on Cloud Computing 2012

This post lists important conferences on Cloud Computing in year 2012. OSDI 2012 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12) October 8–10, 2012, Hollywood, CA “The tenth OSDI seeks to present innovative, exciting research in computer systems. OSDI brings together professionals from academic and industrial backgrounds in what has become a…

Pitfalls and Lessons on Configuing and Tuning Hadoop

This post lists pitfalls and lessons learning when configuring and tuning Hadoop. Hadoop with IPv6 Hadoo doesn’t support IPv6 currently (up to 0.20.2 and 0.21.0): Hadoop and IPv6. The performance of the cluster may suffer from turning IPv6 on in clusters: mail archive. One good practice is to disable IPv6 on servers in the Hadoop…