How to set the replication factor for one file when it is uploaded by `hdfs dfs -put` command line in HDFS?

When uploading a file with the `hdfs dfs -put` command in HDFS, how can I set a replication factor for that file that differs from the global one? For example, HDFS’s global replication factor is 3. For some temporary files, I would like to keep just one copy to upload faster and save disk space. The […]

How to set the number of mappers and reducers of Hadoop in command line?

How to set the number of mappers and reducers of Hadoop in command line? The number of mappers and reducers can be set in the command line like this (5 mappers, 2 reducers): `-D mapred.map.tasks=5 -D mapred.reduce.tasks=2`. In the code, one can configure JobConf variables:

    job.setNumMapTasks(5); // 5 mappers
    job.setNumReduceTasks(2); // 2 reducers

Note that on Hadoop […]

Hadoop 2 (YARN) default configuration values

Where to check the default Hadoop 2 (YARN) configuration values for HDFS (hdfs-site.xml), YARN (yarn-site.xml) and MapReduce (mapred-site.xml)? The default configuration values for Hadoop 2.2.0 are documented on the Apache Hadoop website: hdfs-default.xml for HDFS, yarn-default.xml for YARN and mapred-default.xml for MapReduce. Answered by Eric Z Ma.

Good introductions to Hadoop 2.0 (YARN)?

Which introductions to Hadoop 2.0 (YARN) are recommended? Pointers to webpages are good. These are the good ones I have found:
- The SoCC13 paper “Apache Hadoop YARN: Yet Another Resource Negotiator” by Vinod Kumar Vavilapalli et al.
- The introduction from Hortonworks by Arun Murthy.
- The “Official” one from the Apache Hadoop website (very brief).
Answered […]

How to choose the number of mappers and reducers in Hadoop

How to choose the number of mappers and reducers in Hadoop to get good job performance? The Hadoop Wiki discusses this. Some valuable points: About the number of maps: The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to […]
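
As a rough illustration of the point that the number of maps is usually driven by the number of DFS blocks in the input files, the following sketch counts one map task per block of each input file. The helper name `estimate_map_tasks` and the 128 MiB default block size are assumptions made for this example. It also ignores the split-size settings and other details of Hadoop's real input-split computation, so treat the result as an estimate only.

```python
def estimate_map_tasks(file_sizes_bytes, block_size_bytes=128 * 1024 * 1024):
    """Estimate map tasks as one task per DFS block of each input file.

    Hypothetical helper for illustration; it is not part of any Hadoop API.
    The default block size is an assumption; pass the cluster's real value.
    """
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    total = 0
    for size in file_sizes_bytes:
        if size < 0:
            raise ValueError("file sizes must be non-negative")
        # Integer ceiling division avoids float rounding on very large files.
        total += (size + block_size_bytes - 1) // block_size_bytes
    return total


MIB = 1024 * 1024
# One 10 GiB file (80 blocks of 128 MiB) plus one 1 MiB file (1 block).
print(estimate_map_tasks([10 * 1024 * MIB, 1 * MIB]))  # prints 81
```

Blocks are counted per file because a block never spans two files, so many small files produce many small map tasks.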

Big Data Benchmark from AMPLab of UC Berkeley

Benchmarks are important for understanding the performance of different systems and for comparing them quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem. The Big Data Benchmark […]

PUMA: A MapReduce Benchmark Suite

MapReduce is a well-known programming model designed for generating and processing large data sets. There are various MapReduce implementations, of which Hadoop is probably the most widely known and used. Benchmarking MapReduce frameworks has become important. Faraz Ahmad et al. developed a benchmark suite: PUMA MapReduce Benchmark. During our work on MapReduce, we developed a benchmark suite […]

Hadoop TeraSort Benchmark

TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort performs the sort. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running […]
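
When sizing a TeraGen input, it helps to remember that TeraGen is told how many rows to generate, not how many bytes, and that each row is 100 bytes long. The small sketch below converts a target data size into a row count. The helper name `teragen_rows` is hypothetical and exists only for this illustration.

```python
TERAGEN_ROW_BYTES = 100  # each row generated by TeraGen is 100 bytes long


def teragen_rows(target_bytes):
    """Return the row count that yields at most target_bytes of TeraGen data.

    Hypothetical helper for illustration; it is not part of Hadoop.
    """
    if target_bytes < 0:
        raise ValueError("target_bytes must be non-negative")
    # Floor division: a partial row cannot be generated.
    return target_bytes // TERAGEN_ROW_BYTES


# Row count for 1 TB (10**12 bytes) of input data.
print(teragen_rows(10**12))  # prints 10000000000
```

The resulting number is what would be passed as the row-count argument to TeraGen.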

Reading List for Distributed Systems and Cloud Computing

Understanding the literature is usually the first step in doing research, and systems research on cloud computing is no exception. A reading list can help a lot for those who are just starting out in cloud computing research. Prof. Lin Gu, my PhD supervisor, compiled a reading list for systems research on cloud computing. The reading […]

Hadoop Default Ports

Hadoop’s namenode and datanodes expose a number of TCP ports that Hadoop’s daemons use to communicate with each other or to listen directly for users’ requests. This port information is needed by both Hadoop users and cluster administrators to write programs or configure firewalls/gateways accordingly. A post written by Philip Zeyliger from Cloudera’s blog summarizes the […]

Pitfalls and Lessons on Configuring and Tuning Hadoop

This post lists pitfalls and lessons learned when configuring and tuning Hadoop. Hadoop with IPv6: Hadoop doesn’t support IPv6 currently (up to 0.20.2 and 0.21.0); see “Hadoop and IPv6”. The performance of the cluster may suffer if IPv6 is turned on (see the mail archive). One good practice is to disable IPv6 on servers in the Hadoop […]

mrcc – A Distributed C Compiler System on MapReduce

The mrcc project’s homepage is here: mrcc project. Abstract: mrcc is an open source compilation system that uses MapReduce to distribute C code compilation across the servers of the cloud computing platform. mrcc is built to use Hadoop by default, but it is easy to port it to other cloud computing platforms, such as MRlite, […]