Hadoop TeraSort Benchmark

By Eric Ma | Dec 18, 2012 | Sep 5, 2020

TeraSort is one of the most widely used Hadoop benchmarks. The Hadoop distribution includes both the input generator and the sorting implementation: TeraGen generates the input and TeraSort sorts it. This short tutorial shows how to run the Hadoop TeraSort benchmark.

TeraGen generates random data that can be used as input for a subsequent TeraSort run.

Generate input by TeraGen

The syntax for TeraGen:

$ hadoop jar hadoop-*examples*.jar teragen \
<number of 100-byte rows> <output dir>
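
As a rough sketch of how to pick the first argument: each row is 100 bytes, so the row count for a target dataset size is the size in bytes divided by 100. The helper name rows_for_bytes below is hypothetical and not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): convert a target size in bytes
# into the number of 100-byte rows that TeraGen expects as its first argument.
rows_for_bytes() {
    echo $(( $1 / 100 ))
}

# 1 GB (10^9 bytes) of input corresponds to 10,000,000 rows.
rows_for_bytes 1000000000
```

The printed value can then be passed to teragen as the number of rows.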

To make TeraGen run on multiple nodes with multiple tasks, you may need to specify the number of map tasks (30 here as an example; the property name is the one used by Hadoop 2). The -D option goes after the program name and takes the form property=value:

$ hadoop jar hadoop-*examples*.jar teragen \
-D mapreduce.job.maps=30 \
<number of 100-byte rows> <output dir>

The number of mappers depends on the number of rows you will generate and the number of nodes you have. For more information on how to set the number of mappers and reducers, please check this post.
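
One way to reason about the trade-off is to estimate how much data each map task writes. The sketch below assumes the rows are divided evenly among the map tasks; bytes_per_map is a hypothetical helper, and the figures are only examples.

```shell
#!/bin/sh
# Hypothetical helper: approximate number of bytes written by each map task
# when `rows` 100-byte rows are divided evenly among `maps` map tasks.
bytes_per_map() {
    rows=$1
    maps=$2
    echo $(( rows * 100 / maps ))
}

# 10,000,000 rows (1 GB) over 30 map tasks: roughly 33 MB per task.
bytes_per_map 10000000 30
```

If the per-task size comes out very small or very large for your cluster, adjusting the number of map tasks may be worthwhile.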

Run TeraSort

After the data is generated, run the sort with TeraSort:

$ hadoop jar hadoop-*examples*.jar terasort \
<input dir> <output dir>

You may also need to set the number of mappers and reducers for better performance.
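
As a sketch of setting the reducer count, the helper below builds a TeraSort command line that passes the Hadoop 2 property mapreduce.job.reduces. The helper name terasort_cmd, the value 30, and the directory names are placeholders chosen for illustration. The number of map tasks for the sort is largely determined by the input splits, so only reducers are shown here.

```shell
#!/bin/sh
# Sketch: build a TeraSort command line with an explicit reducer count.
# terasort_cmd is a hypothetical helper; the paths below are placeholders.
terasort_cmd() {
    reduces=$1
    input_dir=$2
    output_dir=$3
    echo "hadoop jar hadoop-*examples*.jar terasort" \
         "-D mapreduce.job.reduces=$reduces" \
         "$input_dir" "$output_dir"
}

# Print the command for 30 reducers; pipe the output to `sh` to run it.
terasort_cmd 30 /terasort/input /terasort/output
```

Printing the command first makes it easy to check before submitting a long-running job.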

Validate the sorted output data of TeraSort

TeraValidate ensures that the output data of TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate \
<terasort output dir> <teravalidate output dir>
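
The three steps can be chained in a small script. This is a minimal sketch, not a definitive setup: the variables EXAMPLES_JAR, ROWS, BASE_DIR, and DRY_RUN, the run helper, and the HDFS path are all assumptions made for illustration. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of a full benchmark run: generate, sort, validate.
# All names, paths, and sizes here are placeholders; adjust to your cluster.
EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-*examples*.jar}
ROWS=${ROWS:-10000000}                      # 10,000,000 rows x 100 bytes = 1 GB
BASE_DIR=${BASE_DIR:-/benchmarks/terasort}  # hypothetical HDFS directory
DRY_RUN=${DRY_RUN:-1}                       # 1: print commands only; 0: run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
run hadoop jar $EXAMPLES_JAR teragen "$ROWS" "$BASE_DIR/input" &&
run hadoop jar $EXAMPLES_JAR terasort "$BASE_DIR/input" "$BASE_DIR/output" &&
run hadoop jar $EXAMPLES_JAR teravalidate "$BASE_DIR/output" "$BASE_DIR/validate"
```

Setting DRY_RUN=0 in the environment runs the jobs for real; wrapping each step with the shell's time command is one way to record how long each phase takes.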

One Comment

Eric Zhiqiang Ma says:

Jul 23, 2014 at 6:34 pm

For large datasets, you may need to specify the number of mappers and reducers to make the computation and data distributed across nodes:

https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line
