Tutorial

Hadoop TeraSort Benchmark

By Eric Ma | Dec 18, 2012 | Sep 5, 2020

TeraSort is one of Hadoop’s most widely used benchmarks. The Hadoop distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort performs the sort. This is a short tutorial on using the Hadoop TeraSort benchmark.

TeraGen generates random data that can be used as input for a subsequent TeraSort run.

Generate input with TeraGen

The syntax for TeraGen:

$ hadoop jar hadoop-*examples*.jar teragen \
<number of 100-byte rows> <output dir>
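
Because every TeraGen row is 100 bytes, the row count follows directly from the amount of data you want to generate. The following is a minimal sketch of that arithmetic; the helper name `rows_for_bytes` and the 1 TB example size are illustrative assumptions, not part of Hadoop.

```shell
# Hypothetical helper: convert a target size in bytes to a TeraGen row count.
# Each TeraGen row is 100 bytes, so rows = bytes / 100.
rows_for_bytes() {
    echo $(( $1 / 100 ))
}

# Example: 1 TB (10^12 bytes) of input corresponds to 10^10 rows.
rows=$(rows_for_bytes 1000000000000)
echo "teragen rows: $rows"
```

The printed value can then be passed as the first argument to teragen.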

To make TeraGen run on multiple nodes with multiple tasks, you may need to specify the number of map tasks (30 here as an example; for Hadoop 2). -D is a generic option, so it goes after the program name teragen and takes the form property=value:

$ hadoop jar hadoop-*examples*.jar teragen \
-D mapreduce.job.maps=30 \
<number of 100-byte rows> <output dir>

The number of mappers depends on the number of rows you will generate and the number of nodes you have. For more information on how to set the number of mappers and reducers, see https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line

Run TeraSort

After the data is generated, run the sort with TeraSort:

$ hadoop jar hadoop-*examples*.jar terasort \
<input dir> <output dir>

You may also need to set the number of mappers and reducers for better performance.
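
As a rough sketch of how the reducer count could be set for the sort, the snippet below builds a TeraSort command line with the Hadoop 2 property `mapreduce.job.reduces`. It only prints the command instead of running it; the function name `terasort_cmd`, the value 30, and the directory names are placeholders chosen for illustration.

```shell
# Sketch: build a TeraSort command line with an explicit reducer count.
# mapreduce.job.reduces is the Hadoop 2 property for the number of reduce tasks.
# The function only prints the command, so it is safe to run without a cluster.
terasort_cmd() {
    reduces=$1; input_dir=$2; output_dir=$3
    echo "hadoop jar hadoop-*examples*.jar terasort" \
         "-D mapreduce.job.reduces=$reduces $input_dir $output_dir"
}

# Placeholder values: 30 reducers, example input and output directories.
terasort_cmd 30 /tmp/tera-in /tmp/tera-out
```

A suitable reducer count depends on the cluster and the data size, so treat 30 as a starting point to tune rather than a recommendation.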

Validate the sorted output data of TeraSort

TeraValidate ensures that the output data of TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate \
<output dir> <terasort-validate dir>
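
The three steps can be chained into one script. The sketch below is one possible way to do it, not an official tool: the variable names `HADOOP` and `EXAMPLES_JAR`, the function name, and the directory layout under a single base directory are all assumptions. By default it runs in a dry-run mode that only prints the commands.

```shell
# Sketch of the three benchmark steps in order. HADOOP defaults to "echo hadoop"
# so the script only prints the commands; set HADOOP=hadoop to really run them.
HADOOP=${HADOOP:-echo hadoop}
EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-examples.jar}   # placeholder jar path

run_terasort_benchmark() {
    rows=$1; base_dir=$2
    # Stop at the first failing step: each stage needs the previous one's output.
    $HADOOP jar "$EXAMPLES_JAR" teragen "$rows" "$base_dir/input" &&
    $HADOOP jar "$EXAMPLES_JAR" terasort "$base_dir/input" "$base_dir/output" &&
    $HADOOP jar "$EXAMPLES_JAR" teravalidate "$base_dir/output" "$base_dir/validate"
}

# Placeholder arguments: 1,000,000 rows (100 MB) under /tmp/terasort.
run_terasort_benchmark 1000000 /tmp/terasort
```

To run against a real cluster, set `HADOOP=hadoop` and point `EXAMPLES_JAR` at the examples jar shipped with your Hadoop distribution.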

One Comment

Eric Zhiqiang Ma says:

Jul 23, 2014 at 6:34 pm

For large datasets, you may need to specify the number of mappers and reducers to make the computation and data distributed across nodes:

https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line
