A Simple Sort Benchmark on Hadoop

By Eric Ma | Jan 7, 2012 | Apr 5, 2016

After [[hadoop-installation-tutorial|installing Hadoop]], we usually run some benchmark programs to test whether the system works well. In the Hadoop installation tutorial, we showed a very simple example that greps strings from a small set of files. In this post, we introduce the Sort program for testing and benchmarking Hadoop. Sort is also included in the Hadoop distribution package, which also ships an input data generator that produces 10 GB * (number of slave nodes) of input data to sort. This program processes a much larger dataset, which puts more stress on Hadoop, including the execution engine and HDFS.

The Sort example program simply uses the MapReduce framework to sort the input directory into the output directory. The mapper is the predefined IdentityMapper and the reducer is the predefined IdentityReducer, both of which just pass their inputs directly to the output. The inputs and outputs must be Sequence files where the keys and values are BytesWritable.
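To see why an identity mapper and an identity reducer are enough to sort, it can help to mimic the job locally. The sketch below is only an analogy on plain text lines, not how Hadoop executes the job: `identity_map`, `identity_reduce`, and `local_sort_job` are names made up for this example, and the `sort` between the two stages stands in for the sorting that the MapReduce framework does between the map and reduce phases.

```shell
#!/bin/sh
# Local analogy only: the real job reads and writes sequence files on HDFS.
# identity_map, identity_reduce and local_sort_job are hypothetical names.

# Pass every record through unchanged, like the identity mapper.
identity_map() { cat; }

# Pass every record through unchanged, like the identity reducer.
identity_reduce() { cat; }

# The sort in the middle plays the role of the framework's sort between
# map and reduce; LC_ALL=C makes it compare raw bytes.
local_sort_job() {
  identity_map | LC_ALL=C sort | identity_reduce
}

printf 'pear\napple\nfig\n' | local_sort_job
# prints:
# apple
# fig
# pear
```

Neither stage changes the data; all of the ordering comes from the step between them, which is the part of Hadoop that this benchmark exercises.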

The RandomWriter example program writes 10 GB (by default) of random data per host to HDFS using MapReduce. Each map takes a single file name as input and writes random BytesWritable keys and values to the DFS sequence file. The maps do not emit any output and the reduce phase is not used.
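Before generating the data, it is worth estimating how much input the cluster will have to hold. The following is a minimal sketch of that arithmetic, assuming the default of 10 GB per host mentioned above; `expected_input_gb` is a hypothetical helper name, and the figure does not account for HDFS replication.

```shell
#!/bin/sh
# expected_input_gb is a hypothetical helper for this example.
# Usage: expected_input_gb HOSTS [GB_PER_HOST]
# GB_PER_HOST defaults to 10, the RandomWriter default described in the text.
expected_input_gb() {
  echo $(( $1 * ${2:-10} ))
}

expected_input_gb 4    # prints 40: four hosts at the default 10 GB each
```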

For a quick test of the Sort benchmark, execute these two commands after [[hadoop-installation-tutorial|setting up and starting Hadoop]] (here we are in the Hadoop directory; if you run the commands from outside the Hadoop directory, use the full or relative path to the jar file):

# hadoop jar hadoop-*-examples.jar randomwriter rand
# hadoop jar hadoop-*-examples.jar sort rand rand-sort

The first command generates random data into the rand directory, and the second command sorts the data in rand and puts the result into rand-sort.
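Since this is a benchmark, you may also want to record how long each step takes. Here is a rough sketch of a wrapper around the two commands above. The variable names `EXAMPLES_JAR`, `INPUT_DIR`, `OUTPUT_DIR`, and `DRY_RUN`, as well as the helper functions, are invented for this example and are not Hadoop settings. By default the script only prints the commands it would run; set `DRY_RUN=0` to actually launch the jobs.

```shell
#!/bin/sh
# Sketch only. EXAMPLES_JAR, INPUT_DIR, OUTPUT_DIR and DRY_RUN are names
# invented for this example; they are not Hadoop settings.
EXAMPLES_JAR=${EXAMPLES_JAR:-hadoop-*-examples.jar}
INPUT_DIR=${INPUT_DIR:-rand}
OUTPUT_DIR=${OUTPUT_DIR:-rand-sort}

# Print the command when DRY_RUN is 1 (the default here); otherwise run it
# and report the elapsed wall-clock seconds.
run_step() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $*"
    return 0
  fi
  step_start=$(date +%s)
  "$@" || return 1
  echo "finished in $(( $(date +%s) - step_start ))s: $*"
}

# Generate the random input, then sort it; stop if the first step fails.
run_benchmark() {
  # $EXAMPLES_JAR is left unquoted on purpose so the shell expands the wildcard.
  run_step hadoop jar $EXAMPLES_JAR randomwriter "$INPUT_DIR" &&
    run_step hadoop jar $EXAMPLES_JAR sort "$INPUT_DIR" "$OUTPUT_DIR"
}

run_benchmark
```

Run it from the Hadoop directory, as with the commands above, or set `EXAMPLES_JAR` to the full path of the jar file.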

For more details and more options of the Sort and RandomWriter example programs, please refer to the Hadoop Wiki: Sort and RandomWriter.
