Hadoop Installation Tutorial (Hadoop 1.x)

Update: If you are new to Hadoop and trying to install one. Please check the newer version: Hadoop Installation Tutorial (Hadoop 2.x).

Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that is initially designed and implemented by Google for processing and generating large data sets [1]. HDFS is Hadoop’s underlying data persistency layer, which is loosely modelled after Google file system GFS [2]. Hadoop has seen active development activities and increasing adoption. Many cloud computing services, such as Amazon EC2, provide MapReduce functions, and the research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science [3]. Although MapReduce has its limitations [3], it is an important framework to process large data sets.

How to set up a Hadoop environment in a cluster is introduced in this tutorial. In this tutorial, we set up a Hadoop cluster, one node runs as the NameNode, one node runs as the JobTracker and many nodes runs as the TaskTracker (slaves).

First we assume we have created a Linux user “hadoop” on each nodes that we use and the “hadoop” user’s home directory is “/home/hadoop/”.

Enable “hadoop” user to password-less SSH login to slaves

Just for our convenience, make sure the “hadoop” user from NameNode and JobTracker can ssh to the slaves without password so that we need not to input the password every time.

Details about password-less SSH login can be found Enabling Password-less ssh Login.

Install softwared needed by Hadoop

Java JDK:

Java JDK can be downloaded form: http://java.sun.com/. Then we can install (actually copy the jdk directory) Java JDK on all nodes of the Hadoop cluster.

As an example in this tutorial, the JDK is installed into

/home/hadoop/jdk1.6.0_24

I provide a simple bash script to duplicate the JDK directory to all nodes:

$ for i in `cat nodes`; do scp -rq /home/hadoop/jdk1.6.0_24 hadoop@$i:/home/hadoop/; done;

‘nodes’ is a file that contains all the nodes IPs or host names, one in one line.

Hadoop

Hadoop softwar can be downloaded from here. In this tutorial, we use Hadoop 0.20.203.0.

Then we can install Hadoop on all nodes of the Hadoop cluster.

We can directly unpack it to a directory. In this example, we store it in

/home/hadoop/hadoop/

which is a directory under the hadoop Linux user’s home directory.

The hadoop directory can also be duplicated to all nodes using the script above.

Configure environment variables of “hadoop” user

We assume the “hadoop” user use bash as its shell.

Add these two lines at the bottom of ~/.bashrc on all nodes:

export HADOOP_COMMON_HOME="/home/hadoop/hadoop/"
export PATH=$HADOOP_COMMON_HOME/bin/:$PATH

The HADOOP_COMMON_HOME environment variable is used by Hadoop’s utility scripts, and it must be set, otherwise the scripts may report an error message “Hadoop common not found”.

The second line adds hadoop’s bin directory to the PATH sothat we can directly run hadoop’s commands without specifying the full path to it.

Configure Hadoop

conf/hadoop-env.h

Add or change these lines to specify the JAVA_HOME and directory to store the logs:

export JAVA_HOME=/home/hadoop/jdk1.6.0_24
export HADOOP_LOG_DIR=/home/hadoop/data/logs

conf/core-site.xml

Here the NameNode runs on 10.1.1.30.

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://10.1.1.30:9000</value>
</property>
</configuration>

conf/hdfs-site.xml

<configuration>

<property>
<name>dfs.replication</name>
<value>3</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>/lhome/hadoop/data/dfs/name/</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>/lhome/hadoop/data/dfs/data/</value>
</property>

</configuration>

dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. dfs.data.dir is comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.

conf/mapred-site.xml

Here the JobTracker runs on 10.1.1.2.

<configuration>

<property>
<name>mapred.job.tracker</name>
<value>10.1.1.2:9001</value>
</property>

<property>
<name>mapred.system.dir</name>
<value>/hadoop/data/mapred/system/</value>
</property>

<property>
<name>mapred.local.dir</name>
<value>/lhome/hadoop/data/mapred/local/</value>
</property>

</configuration>

mapreduce.jobtracker.address is host or IP and port of JobTracker. mapreduce.jobtracker.system.dir is the path on the HDFS where where the Map/Reduce framework stores system files. mapreduce.cluster.local.dir is comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

conf/slaves

Delete localhost and add all the names of the TaskTrackers, each in on line. For example:

jobtrackname1
jobtrackname2
jobtrackname3
jobtrackname4
jobtrackname5
jobtrackname6

Duplicate Hadoop configuration files to all nodes

We may duplicate the configuration files under conf diretory to all nodes. The script mentioned above can be used.

By now, we have finished copying Hadoop softwares and configuring the Hadoop. Now let’s have some fun with Hadoop.

Start Hadoop

We need to start both the HDFS and MapReduce to start Hadoop.

Format a new HDFS

On NameNode (10.1.1.30):

$&nbsp;hadoop namenode -format

Remember to delete HDFS’s local files on all nodes before re-formating it:

$ rm /home/hadoop/data /tmp/hadoop-hadoop -rf

Start HDFS

On NameNode (10.1.1.30):

$ start-dfs.sh

Check the HDFS status:

On NameNode (10.1.1.30):

$ hadoop dfsadmin -report

There may be less nodes listed in the report than we actually have. We can try it again.

Start mapred:

On JobTracker (10.1.1.2):

$ start-mapred.sh

Check job status:

$ hadoop job -list

Run Hadoop jobs

A simple example

We run a simple example built in Hadoop’s distribution. For easy-to-run and more larger tests, please consider the A Simple Sort Benchmark on Hadoop.

Copy the input files into the distributed filesystem:

$ hadoop fs -put /home/hadoop/hadoop/conf input

Run some of the examples:

$ hadoop jar /home/hadoop/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

Examine the output files:

Copy the output files from the distributed filesystem to the local
filesytem and examine them:

$ hadoop fs -get output output
$ cat output/*

or

View the output files on the distributed filesystem:

$ hadoop fs -cat output/*

Shut down Hadoop cluster

We can stop Hadoop when we no long use it.

Stop HDFS on NameNode (10.1.1.30):

$ stop-dfs.sh

Stop JobTracker and TaskTrackers on JobTracker (10.1.1.2):

$ stop-mapred.sh

Some possible problems

Firewall blocks connections

Configure iptables: We can configure iptables to allow all connections, if these nodes are in a secure local area network which is most of the situation, by this command on all nodes:

# iptables -F
# service iptables save

For a list of the default ports used by Hadoop, please refer to: Hadoop Default Ports.

Pitfalls and Lessons

Please also check Pitfalls and Lessons on Configuing and Tuning Hadoop.

References

[1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters.” in the 6th Conference on Symposium on Operating Systems Design & Implementation, vol. 6, San Francisco, CA, 2004, pp. 137–150.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google filesystem,” in Proc. of the 9th ACM Symposium on Operating Systems Principles (SOSP’03), 2003, pp. 29–43.
[3] Z. Ma and L. Gu. The limitation of MapReduce: A probing case and a lightweight solution. In CLOUD COMPUTING 2010: Proc. of the 1st Intl. Conf. on Cloud Computing, GRIDs, and Virtualization, pages 68–73, 2010.

Other Hadoop tutorials

Cluster Setup from Apache.
Managing a Hadoop Cluster from Yahoo.

Additional content

Some additional content for this post.

An example of Hadoop configuration files

Added on Dec. 20, 2012.

An example of Hadoop 1.0.3 configuration files.

Shown here as changes to the default conf directory.

diff -rupN conf/core-site.xml /lhome/hadoop/hadoop-1.0.3/conf/core-site.xml
--- conf/core-site.xml  2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/core-site.xml   2012-07-26 15:45:41.372840027 +0800
@@ -4,5 +4,8 @@
 <!-- Put site-specific property overrides in this file. -->

 <configuration>
-
+<property>
+<name>fs.default.name</name>
+<value>hdfs://hadoop0:9000</value>
+</property>
 </configuration>
diff -rupN conf/hadoop-env.sh /lhome/hadoop/hadoop-1.0.3/conf/hadoop-env.sh
--- conf/hadoop-env.sh  2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/hadoop-env.sh   2012-07-26 15:49:41.025839796 +0800
@@ -6,7 +6,7 @@
 # remote nodes.

 # The java implementation to use.  Required.
-# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
+export JAVA_HOME=/usr/java/jdk1.6.0_24/

 # Extra Java CLASSPATH elements.  Optional.
 # export HADOOP_CLASSPATH=
@@ -32,6 +32,7 @@ export HADOOP_JOBTRACKER_OPTS="-Dcom.sun

 # Where log files are stored.  $HADOOP_HOME/logs by default.
 # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
+export HADOOP_LOG_DIR=/lhome/hadoop/data/logs

 # File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.
 # export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
diff -rupN conf/hdfs-site.xml /lhome/hadoop/hadoop-1.0.3/conf/hdfs-site.xml
--- conf/hdfs-site.xml  2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/hdfs-site.xml   2012-07-26 15:46:06.185839356 +0800
@@ -4,5 +4,18 @@
 <!-- Put site-specific property overrides in this file. -->

 <configuration>
+<property>
+<name>dfs.replication</name>
+<value>3</value>
+</property>

+<property>
+<name>dfs.name.dir</name>
+<value>/lhome/hadoop/data/dfs/name/</value>
+</property>
+
+<property>
+<name>dfs.data.dir</name>
+<value>/lhome/hadoop/data/dfs/data/</value>
+</property>
 </configuration>
diff -rupN conf/mapred-site.xml /lhome/hadoop/hadoop-1.0.3/conf/mapred-site.xml
--- conf/mapred-site.xml    2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/mapred-site.xml 2012-07-26 15:47:39.586907398 +0800
@@ -5,4 +5,24 @@

 <configuration>

+<property>
+<name>mapred.job.tracker</name>
+<value>hadoop0:9001</value>
+</property>
+
+<property>
+<name>mapred.tasktracker.reduce.tasks.maximum</name>
+<value>1</value>
+</property>
+
+<property>
+<name>mapred.tasktracker.map.tasks.maximum</name>
+<value>1</value>
+</property>
+
+<property>
+<name>mapred.local.dir</name>
+<value>/lhome/hadoop/data/mapred/local/</value>
+</property>
+
 </configuration>
diff -rupN conf/slaves /lhome/hadoop/hadoop-1.0.3/conf/slaves
--- conf/slaves 2012-05-09 04:34:50.000000000 +0800
+++ /lhome/hadoop/hadoop-1.0.3/conf/slaves  2012-07-26 15:48:54.811839973 +0800
@@ -1 +1,2 @@
-localhost
+hadoop1
+hadoop2

Eric Zhiqiang Ma

Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

3 comments:

      1. Yes, Eric. I followed Hadoop 2.x and also posted my comments there after resolving the mapreduce hang issue by added the missing resourcemanager.hostname in yarn-site.xml file.

        Thanks again.

Leave a Reply

Your email address will not be published. Required fields are marked *