Tutorial

Hadoop TeraSort Benchmark

By Eric Ma, Dec 18, 2012 (updated Sep 5, 2020)

TeraSort is one of Hadoop’s widely used benchmarks. The Hadoop distribution includes both the input generator and the sorting implementation: TeraGen generates the input and TeraSort sorts it. This short tutorial shows how to use the Hadoop TeraSort benchmark.

TeraGen generates random data that can be used as input for a subsequent TeraSort run.

Generate input with TeraGen

The syntax for TeraGen:

$ hadoop jar hadoop-*examples*.jar teragen \
<number of 100-byte rows> <output dir>
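Because every row is 100 bytes, the row count follows directly from the amount of data you want. The sketch below is only an illustration: the helper name rows_for_bytes, the 1 GB target, and the output path /user/demo/terasort-input are assumptions, and the command is printed rather than executed.

```shell
# Each TeraGen row is 100 bytes, so rows = target bytes / 100.
# rows_for_bytes is a hypothetical helper, not part of Hadoop.
rows_for_bytes() {
    echo $(( $1 / 100 ))
}

# Hypothetical target: 1 GB (10^9 bytes) of input data.
ROWS=$(rows_for_bytes 1000000000)

# Print the resulting command for review; the output path is a placeholder.
echo "hadoop jar hadoop-*examples*.jar teragen $ROWS /user/demo/terasort-input"
```

For the classic 1 TB run, the same arithmetic gives 10,000,000,000 rows.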

To make TeraGen run on multiple nodes with multiple tasks, you may need to specify the number of map tasks (30 here as an example; the property name is for Hadoop 2). Pass the property as -D name=value after the program name:

$ hadoop jar hadoop-*examples*.jar teragen \
-D mapreduce.job.maps=30 \
<number of 100-byte rows> <output dir>

The number of mappers depends on the number of rows you will generate and the number of nodes you have. For more information on how to set the number of mappers and reducers, please check this post.
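One rough way to sanity-check a candidate map count is to estimate how much data each map task would write. The sketch below assumes the rows are divided evenly among the map tasks; bytes_per_map is a hypothetical helper, and the row and task counts are example values only.

```shell
# Estimate the bytes written by each map task, assuming an even split of rows.
# $1 = total number of 100-byte rows, $2 = number of map tasks.
# bytes_per_map is a hypothetical helper, not part of Hadoop.
bytes_per_map() {
    echo $(( $1 * 100 / $2 ))
}

# Example: 10^10 rows (1 TB) spread over 30 map tasks.
bytes_per_map 10000000000 30
```

If the per-task volume looks too large for your nodes, increase the number of map tasks and recompute.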

Run TeraSort

After the data is generated, run the sort with TeraSort:

$ hadoop jar hadoop-*examples*.jar terasort \
<input dir> <output dir>

You may also need to set the number of mappers and reducers for better performance.
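As a sketch of what setting the reducer count could look like on Hadoop 2: mapreduce.job.reduces is the Hadoop 2 property for the number of reduce tasks, and it defaults to 1 unless your cluster configuration overrides it. The value 30 and both paths below are placeholders, and the command is printed rather than run.

```shell
# Hypothetical values: 30 reduce tasks and placeholder HDFS paths.
REDUCES=30
IN_DIR=/user/demo/terasort-input
OUT_DIR=/user/demo/terasort-output

# Build the command as a string and print it for review instead of running it.
# The jar glob is not expanded inside the quoted string.
CMD="hadoop jar hadoop-*examples*.jar terasort -D mapreduce.job.reduces=$REDUCES $IN_DIR $OUT_DIR"
echo "$CMD"
```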

Validate the sorted output data of TeraSort

TeraValidate ensures that the output data of TeraSort is globally sorted.

The syntax for TeraValidate:

$ hadoop jar hadoop-*examples*.jar teravalidate \
<output dir> <terasort-validate dir>
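The three steps can be chained into one script. The following is a minimal sketch, not a definitive driver: the row count, the HDFS paths, the run_step helper, and the DRY_RUN switch are all assumptions made for illustration. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# End-to-end sketch: generate, sort, validate.
# All sizes and paths below are hypothetical placeholders.
ROWS=10000000                        # 10^7 rows x 100 bytes = 1 GB
IN_DIR=/user/demo/terasort-input
OUT_DIR=/user/demo/terasort-output
VAL_DIR=/user/demo/terasort-validate
JAR='hadoop-*examples*.jar'          # glob pattern for the examples jar
DRY_RUN=${DRY_RUN:-1}                # default: print commands only

# run_step is a hypothetical wrapper around "hadoop jar <examples jar> ...".
run_step() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "hadoop jar $JAR $*"
    else
        # $JAR is intentionally unquoted so the shell expands the glob.
        hadoop jar $JAR "$@"
    fi
}

# Each step runs only if the previous one succeeded.
run_step teragen "$ROWS" "$IN_DIR" &&
run_step terasort "$IN_DIR" "$OUT_DIR" &&
run_step teravalidate "$OUT_DIR" "$VAL_DIR"
```

Chaining with && stops the script at the first failing step, so a failed sort is never validated.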

One Comment

Eric Zhiqiang Ma says:

Jul 23, 2014 at 6:34 pm

For large datasets, you may need to specify the number of mappers and reducers to make the computation and data distributed across nodes:

https://www.systutorials.com/qa/947/how-set-the-number-mappers-and-reducers-hadoop-command-line
