A Simple Sort Benchmark on Hadoop

After installing Hadoop, we usually run some benchmark programs to test whether the system works well. In the post of the Hadoop install tutorial, we show a very simple to grep strings from a simple sets of files. In this post, we introduce the Sort for testing and benchmarking Hadoop. The Sort program is also included in the Hadoop distribution package, and the package also includes a input data generator which generate 10 GB * (number of slave nodes) input data to sort. This program processes larger a datasets, which gives some strength to Hadoop including the execution engine and HDFS.

The Sort example program simply uses the MapReduce framework to sort the input directory into the output directory. The mapper is the predefined IdentityMapper and the reducer is the predefined IdentityReducer, both of which just pass their inputs directly to the output. The inputs and outputs must be Sequence files where the keys and values are BytesWritable.

The RandomWriter example program writes 10 GB (by default) of random data per host to HDFS using MapReduce. Each map takes a single file name as input and writes random BytesWritable keys and values to the DFS sequence file. The maps do not emit any output and the reduce phase is not used.

For a quick test of the Sort benchmark, just execute these two commands after setting up and starting the Hadoop] (here we are in the Hadoop directory. If run the commands outside the Hadoop directory, simply use the full/relative path for the jar file):

  1. hadoop jar hadoop-*-examples.jar randomwriter rand
  2. hadoop jar hadoop-*-examples.jar sort rand rand-sort

The first command generates the random data into rand and the second commands sorts the generated data in rand and the result is put into rand-sort.

For more details and more options of the Sort and RandomWriter example programs, please refer to the Hadoop Wiki: Sort and RandomWriter.

Eric Z Ma

Eric is a father and systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

One comment:

Leave a Reply

Your email address will not be published. Required fields are marked *