Hadoop Installation Tutorial (Hadoop 2.x)

Hadoop 2, also known as YARN, is the new version of Hadoop. It adds the YARN resource manager alongside the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of the MapReduce framework originally designed and implemented at Google for processing and generating large data sets. HDFS is Hadoop’s underlying data persistence layer, loosely modeled after the Google File System (GFS). Many cloud computing services, such as Amazon Elastic MapReduce, provide MapReduce functionality. Although MapReduce has its limitations, it remains an important framework for processing large data sets.

This tutorial shows how to set up a Hadoop 2.x (YARN) environment in a cluster. We set up a Hadoop (YARN) cluster in which one node runs as the NameNode and the ResourceManager, and the other nodes run as DataNodes and NodeManagers (slaves). If you are not familiar with these names, please take a look at the YARN architecture first.

First, we assume that you have created a Linux user “hadoop” on each node that you will use for running Hadoop, and that the “hadoop” user’s home directory is /home/hadoop/.

Configure hostnames

Hadoop uses hostnames to identify nodes by default. So you should first give each host a name and allow the other nodes to resolve the hosts’ IP addresses from their names. The simplest way may be to add the host-to-IP mappings to every node’s /etc/hosts file. For a larger cluster, you may use a DNS service. Here, we use 3 nodes as the example and add these lines to the /etc/hosts file on each node:

10.0.3.29	hofstadter
10.0.3.30	snell
10.0.3.31	biot

Enable password-less SSH login from the “hadoop” user to the slaves

Just for our convenience, make sure the “hadoop” user on the NameNode and ResourceManager node can ssh to the slaves without a password, so that we do not need to enter the password every time.

Details about password-less SSH login can be found in Enabling Password-less ssh Login.
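As a quick sketch (assuming OpenSSH and the three example hostnames above), the key setup usually looks like this:

```shell
# Run as the "hadoop" user on the NameNode/ResourceManager node.
# Generate an RSA key pair with an empty passphrase if none exists yet.
mkdir -p ~/.ssh
test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa -q

# Copy the public key to each slave (hostnames from our example cluster).
for host in hofstadter snell biot; do
  ssh-copy-id "hadoop@$host"
done
```

After this, `ssh hofstadter` from the master should log in without prompting for a password.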

Install software needed by Hadoop

Besides Hadoop itself, the only software needed to install Hadoop is Java (we use the Oracle JDK here).

Java JDK

The Oracle Java JDK can be downloaded from the JDK webpage. You need to install the JDK (actually, just copy the JDK directory) on all nodes of the Hadoop cluster.

As an example in this tutorial, the JDK is installed into

/usr/java/default/

You may need to create a soft link at /usr/java/default pointing to the actual location where you installed the JDK.
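For example, if the JDK was unpacked to /usr/java/jdk1.7.0_60 (a hypothetical version number; substitute your actual directory), the link can be created like this:

```shell
# Create the parent directory and point /usr/java/default at the real JDK.
sudo mkdir -p /usr/java
sudo ln -sfn /usr/java/jdk1.7.0_60 /usr/java/default
```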

Add these 2 lines to the “hadoop” user’s ~/.bashrc on all nodes:

export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH

Hadoop

Hadoop can be downloaded from the Hadoop website. In this tutorial, we use Hadoop 2.5.0.

You can unpack the tarball to a directory. In this example, we unpack it to

/home/hadoop/hadoop/

which is a directory under the hadoop Linux user’s home directory.
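As a sketch (the mirror URL below is an assumption; pick a mirror from the Hadoop download page), fetching and unpacking the release might look like:

```shell
cd ~
# Download the 2.5.0 release tarball (URL is an assumption; use a mirror
# listed on the Hadoop website) and unpack it into ~/hadoop.
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.5.0/hadoop-2.5.0.tar.gz
tar -xzf hadoop-2.5.0.tar.gz
mv hadoop-2.5.0 ~/hadoop
```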

The Hadoop directory needs to be duplicated to all nodes after configuration. Remember to do this after you finish the configuration below.

Configure environment variables for the “hadoop” user

We assume the “hadoop” user uses bash as its shell.

Add these lines at the bottom of ~/.bashrc on all nodes:

export HADOOP_COMMON_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export YARN_HOME=$HADOOP_COMMON_HOME
export PATH=$PATH:$HADOOP_COMMON_HOME/bin
export PATH=$PATH:$HADOOP_COMMON_HOME/sbin

The last 2 lines add Hadoop’s bin and sbin directories to the PATH so that we can run Hadoop commands directly without specifying their full paths.

Configure Hadoop

The configuration files for Hadoop are under $HADOOP_COMMON_HOME/etc/hadoop for our installation. In each of the .xml files below, the content is added between <configuration> and </configuration>.
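For reference, each of these configuration files has the following overall shape; the <property> elements shown in the sections below go between the <configuration> tags:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- <property> elements from the sections below go here -->
</configuration>
```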

core-site.xml

Here the NameNode runs on biot.

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://biot/</value>
    <description>NameNode URI</description>
  </property>

yarn-site.xml

The YARN ResourceManager runs on biot and supports MapReduce shuffle.

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>biot</value>
  <description>The hostname of the ResourceManager</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>shuffle service for MapReduce</description>
</property>

hdfs-site.xml

The configuration here is optional. Add the following settings if you need them; the description of each property explains its purpose.

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///home/hadoop/hdfs/</value>
        <description>DataNode directory for storing data chunks.</description>
    </property>

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///home/hadoop/hdfs/</value>
        <description>NameNode directory for namespace and transaction logs storage.</description>
    </property>

    <property>
        <name>dfs.replication</name>
        <value>3</value>
        <description>Number of replicas for each chunk.</description>
    </property>

mapred-site.xml

First copy mapred-site.xml.template to mapred-site.xml and add the following content.

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>Execution framework.</description>
</property>

slaves

Delete localhost and add the hostnames of all slave nodes (the nodes that run the DataNode and NodeManager), each on its own line. For example:

hofstadter
snell
biot

Duplicate Hadoop configuration files to all nodes

Now duplicate the Hadoop directory, including the configuration files under etc/hadoop, to all nodes. You may use this script to duplicate the hadoop directory:

cd
for i in `cat hadoop/etc/hadoop/slaves`; do 
  echo $i; rsync -avxP --exclude=logs hadoop/ $i:hadoop/; 
done

By now, we have finished copying the Hadoop software and configuring it. Now let’s have some fun with Hadoop.

Format a new distributed filesystem

Format a new distributed file system by

hdfs namenode -format

Start HDFS/YARN

Manually

Start the HDFS with the following command, run on the designated NameNode:

hadoop-daemon.sh --script hdfs start namenode

Run the script to start DataNodes on each slave:

hadoop-daemon.sh --script hdfs start datanode

Start the YARN with the following command, run on the designated ResourceManager:

yarn-daemon.sh start resourcemanager

Run a script to start NodeManagers on all slaves:

yarn-daemon.sh start nodemanager

By scripts

start-dfs.sh
start-yarn.sh

Check status

Check HDFS status

hdfs dfsadmin -report

It should show a report like:

Configured Capacity: 158550355968 (147.66 GB)
Present Capacity: 11206017024 (10.44 GB)
DFS Remaining: 11205943296 (10.44 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 10.0.3.30:50010 (snell)
Hostname: snell
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 49800732672 (46.38 GB)
DFS Remaining: 3049361408 (2.84 GB)
DFS Used%: 0.00%
DFS Remaining%: 5.77%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014


Name: 10.0.3.31:50010 (biot)
Hostname: biot
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 49513574400 (46.11 GB)
DFS Remaining: 3336519680 (3.11 GB)
DFS Used%: 0.00%
DFS Remaining%: 6.31%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014


Name: 10.0.3.29:50010 (hofstadter)
Hostname: hofstadter
Decommission Status : Normal
Configured Capacity: 52850118656 (49.22 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 48030068736 (44.73 GB)
DFS Remaining: 4820025344 (4.49 GB)
DFS Used%: 0.00%
DFS Remaining%: 9.12%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Fri Sep 05 07:50:17 GMT 2014

Check YARN status

yarn node -list

It should show a report like:

Total Nodes:3
         Node-Id	     Node-State	Node-Http-Address	Number-of-Running-Containers
hofstadter:43469	        RUNNING	  hofstadter:8042	                           0
     snell:57039	        RUNNING	       snell:8042	                           0
      biot:52834	        RUNNING	        biot:8042	                           0

Run basic tests

Let’s play with grep as the basic test of Hadoop YARN/HDFS/MapReduce.

First, create the directory to store data for the hadoop user.

hadoop fs -mkdir /user
hadoop fs -mkdir /user/hadoop

Then, put the configuration file directory into HDFS as the input.

hadoop fs -put /home/hadoop/hadoop/etc/hadoop /user/hadoop/hadoop-config

or simply refer to the directory under the hadoop user’s HDFS home (check the discussion in the comments; thanks to Thirumal Venkat for this tip):

hadoop fs -put /home/hadoop/hadoop/etc/hadoop hadoop-config

Let’s run ls on it to check the content:

hdfs dfs -ls /user/hadoop/hadoop-config

It should print out the results like follows.

Found 26 items
-rw-r--r--   3 hadoop supergroup       3589 2014-09-05 07:53 /user/hadoop/hadoop-config/capacity-scheduler.xml
-rw-r--r--   3 hadoop supergroup       1335 2014-09-05 07:53 /user/hadoop/hadoop-config/configuration.xsl
-rw-r--r--   3 hadoop supergroup        318 2014-09-05 07:53 /user/hadoop/hadoop-config/container-executor.cfg
-rw-r--r--   3 hadoop supergroup        917 2014-09-05 07:53 /user/hadoop/hadoop-config/core-site.xml
-rw-r--r--   3 hadoop supergroup       3589 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-env.cmd
-rw-r--r--   3 hadoop supergroup       3443 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-env.sh
-rw-r--r--   3 hadoop supergroup       2490 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-metrics.properties
-rw-r--r--   3 hadoop supergroup       1774 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-metrics2.properties
-rw-r--r--   3 hadoop supergroup       9201 2014-09-05 07:53 /user/hadoop/hadoop-config/hadoop-policy.xml
-rw-r--r--   3 hadoop supergroup        775 2014-09-05 07:53 /user/hadoop/hadoop-config/hdfs-site.xml
-rw-r--r--   3 hadoop supergroup       1449 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-env.sh
-rw-r--r--   3 hadoop supergroup       1657 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-log4j.properties
-rw-r--r--   3 hadoop supergroup         21 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-signature.secret
-rw-r--r--   3 hadoop supergroup        620 2014-09-05 07:53 /user/hadoop/hadoop-config/httpfs-site.xml
-rw-r--r--   3 hadoop supergroup      11118 2014-09-05 07:53 /user/hadoop/hadoop-config/log4j.properties
-rw-r--r--   3 hadoop supergroup        918 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-env.cmd
-rw-r--r--   3 hadoop supergroup       1383 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-env.sh
-rw-r--r--   3 hadoop supergroup       4113 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-queues.xml.template
-rw-r--r--   3 hadoop supergroup        887 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-site.xml
-rw-r--r--   3 hadoop supergroup        758 2014-09-05 07:53 /user/hadoop/hadoop-config/mapred-site.xml.template
-rw-r--r--   3 hadoop supergroup         22 2014-09-05 07:53 /user/hadoop/hadoop-config/slaves
-rw-r--r--   3 hadoop supergroup       2316 2014-09-05 07:53 /user/hadoop/hadoop-config/ssl-client.xml.example
-rw-r--r--   3 hadoop supergroup       2268 2014-09-05 07:53 /user/hadoop/hadoop-config/ssl-server.xml.example
-rw-r--r--   3 hadoop supergroup       2178 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-env.cmd
-rw-r--r--   3 hadoop supergroup       4567 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-env.sh
-rw-r--r--   3 hadoop supergroup       1007 2014-09-05 07:54 /user/hadoop/hadoop-config/yarn-site.xml

Now, let’s run grep:

cd
hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep /user/hadoop/hadoop-config /user/hadoop/output 'dfs[a-z.]+'

It will print status as follows if everything works well.

14/09/05 07:54:36 INFO client.RMProxy: Connecting to ResourceManager at biot/10.0.3.31:8032
14/09/05 07:54:37 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
14/09/05 07:54:37 INFO input.FileInputFormat: Total input paths to process : 26
14/09/05 07:54:37 INFO mapreduce.JobSubmitter: number of splits:26
14/09/05 07:54:37 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1409903409779_0001
14/09/05 07:54:37 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/05 07:54:38 INFO impl.YarnClientImpl: Submitted application application_1409903409779_0001
14/09/05 07:54:38 INFO mapreduce.Job: The url to track the job: http://biot:8088/proxy/application_1409903409779_0001/
14/09/05 07:54:38 INFO mapreduce.Job: Running job: job_1409903409779_0001
14/09/05 07:54:45 INFO mapreduce.Job: Job job_1409903409779_0001 running in uber mode : false
14/09/05 07:54:45 INFO mapreduce.Job:  map 0% reduce 0%
14/09/05 07:54:50 INFO mapreduce.Job:  map 23% reduce 0%
14/09/05 07:54:52 INFO mapreduce.Job:  map 81% reduce 0%
14/09/05 07:54:53 INFO mapreduce.Job:  map 100% reduce 0%
14/09/05 07:54:56 INFO mapreduce.Job:  map 100% reduce 100%
14/09/05 07:54:56 INFO mapreduce.Job: Job job_1409903409779_0001 completed successfully
14/09/05 07:54:56 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=319
		FILE: Number of bytes written=2622017
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=65815
		HDFS: Number of bytes written=405
		HDFS: Number of read operations=81
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=26
		Launched reduce tasks=1
		Data-local map tasks=26
		Total time spent by all maps in occupied slots (ms)=116856
		Total time spent by all reduces in occupied slots (ms)=3000
		Total time spent by all map tasks (ms)=116856
		Total time spent by all reduce tasks (ms)=3000
		Total vcore-seconds taken by all map tasks=116856
		Total vcore-seconds taken by all reduce tasks=3000
		Total megabyte-seconds taken by all map tasks=119660544
		Total megabyte-seconds taken by all reduce tasks=3072000
	Map-Reduce Framework
		Map input records=1624
		Map output records=23
		Map output bytes=566
		Map output materialized bytes=469
		Input split bytes=3102
		Combine input records=23
		Combine output records=12
		Reduce input groups=10
		Reduce shuffle bytes=469
		Reduce input records=12
		Reduce output records=10
		Spilled Records=24
		Shuffled Maps =26
		Failed Shuffles=0
		Merged Map outputs=26
		GC time elapsed (ms)=363
		CPU time spent (ms)=15310
		Physical memory (bytes) snapshot=6807674880
		Virtual memory (bytes) snapshot=32081272832
		Total committed heap usage (bytes)=5426970624
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=62713
	File Output Format Counters
		Bytes Written=405
14/09/05 07:54:56 INFO client.RMProxy: Connecting to ResourceManager at biot/10.0.3.31:8032
14/09/05 07:54:56 WARN mapreduce.JobSubmitter: No job jar file set.  User classes may not be found. See Job or Job#setJar(String).
14/09/05 07:54:56 INFO input.FileInputFormat: Total input paths to process : 1
14/09/05 07:54:56 INFO mapreduce.JobSubmitter: number of splits:1
14/09/05 07:54:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1409903409779_0002
14/09/05 07:54:56 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
14/09/05 07:54:56 INFO impl.YarnClientImpl: Submitted application application_1409903409779_0002
14/09/05 07:54:56 INFO mapreduce.Job: The url to track the job: http://biot:8088/proxy/application_1409903409779_0002/
14/09/05 07:54:56 INFO mapreduce.Job: Running job: job_1409903409779_0002
14/09/05 07:55:02 INFO mapreduce.Job: Job job_1409903409779_0002 running in uber mode : false
14/09/05 07:55:02 INFO mapreduce.Job:  map 0% reduce 0%
14/09/05 07:55:07 INFO mapreduce.Job:  map 100% reduce 0%
14/09/05 07:55:12 INFO mapreduce.Job:  map 100% reduce 100%
14/09/05 07:55:13 INFO mapreduce.Job: Job job_1409903409779_0002 completed successfully
14/09/05 07:55:13 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=265
		FILE: Number of bytes written=193601
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=527
		HDFS: Number of bytes written=179
		HDFS: Number of read operations=7
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=2616
		Total time spent by all reduces in occupied slots (ms)=2855
		Total time spent by all map tasks (ms)=2616
		Total time spent by all reduce tasks (ms)=2855
		Total vcore-seconds taken by all map tasks=2616
		Total vcore-seconds taken by all reduce tasks=2855
		Total megabyte-seconds taken by all map tasks=2678784
		Total megabyte-seconds taken by all reduce tasks=2923520
	Map-Reduce Framework
		Map input records=10
		Map output records=10
		Map output bytes=239
		Map output materialized bytes=265
		Input split bytes=122
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=265
		Reduce input records=10
		Reduce output records=10
		Spilled Records=20
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=20
		CPU time spent (ms)=2090
		Physical memory (bytes) snapshot=415334400
		Virtual memory (bytes) snapshot=2382364672
		Total committed heap usage (bytes)=401997824
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=405
	File Output Format Counters
		Bytes Written=179

After the grep job finishes, we can check the results. List the content of the output directory by:

hdfs dfs -ls /user/hadoop/output/

It should print output as follows.

Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2014-09-05 07:55 /user/hadoop/output/_SUCCESS
-rw-r--r--   3 hadoop supergroup        179 2014-09-05 07:55 /user/hadoop/output/part-r-00000

part-r-00000 contains the result. Let’s cat it out to check the content.

hdfs dfs -cat /user/hadoop/output/part-r-00000

It should print output as follows.

6	dfs.audit.logger
4	dfs.class
3	dfs.server.namenode.
2	dfs.period
2	dfs.audit.log.maxfilesize
2	dfs.audit.log.maxbackupindex
1	dfsmetrics.log
1	dfsadmin
1	dfs.servers
1	dfs.file

Stop HDFS/YARN

Manually

On the ResourceManager and NodeManager nodes:

yarn-daemon.sh stop resourcemanager
yarn-daemon.sh stop nodemanager

On the NameNode:

hadoop-daemon.sh --script hdfs stop namenode

On each DataNode:

hadoop-daemon.sh --script hdfs stop datanode

With scripts

stop-yarn.sh
stop-dfs.sh

Debug information

You may check the logs for debugging. The logs on each node are under:

hadoop/logs/

You may want to clean up everything on all nodes, including the data directories on the DataNodes (if you did not set hdfs-site.xml to choose the directories yourself, the data is stored under /tmp). Be careful: the following script deletes everything under /tmp that the current user can delete. Adapt it if you store useful data under /tmp.

rm -rf /tmp/* ~/hadoop/logs/*
for i in `cat hadoop/etc/hadoop/slaves`; do 
  echo $i; ssh $i 'rm -rf /tmp/* ~/hadoop/logs/*'; 
done

More readings on HDFS/YARN/Hadoop

Here are links to some good articles related to Hadoop on the Web:

Default Hadoop configuration values: http://ask.fclose.com/749/hadoop-2-yarn-default-configuration-values

Official cluster setup tutorial: https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-common/ClusterSetup.html

Guides to configure Hadoop:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/bk_installing_manually_book/content/rpm-chap1-11.html
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

Additional notes

Control number of containers

You may want to configure the number of containers managed by YARN on each node. You can refer to the example below.

The following lines are added to yarn-site.xml to specify that each node offers 3072 MB of memory and each container uses at least 1536 MB of memory; that is, at most 2 containers run on each node.

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>3072</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1536</value>
</property>
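The container count per node follows from integer division of the node’s memory by the minimum allocation; a quick sanity check (plain shell arithmetic, nothing Hadoop-specific):

```shell
node_mem_mb=3072     # yarn.nodemanager.resource.memory-mb
min_alloc_mb=1536    # yarn.scheduler.minimum-allocation-mb
echo $(( node_mem_mb / min_alloc_mb ))   # prints 2: max containers per node
```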

Eric Zhiqiang Ma

Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.

32 comments:

  1. I believe if you use /user/hadoop instead you can directly access folders inside it similar to your home directory in HDFS.

    Ex:
    /user/hadoop/input can be just referenced as input while accessing HDFS.

    1. On my cluster:

      $ hdfs dfs -ls /user/hadoop/
      ls: `/user/hadoop/': No such file or directory

      and

      $ hdfs dfs -put hadoop-2.5.0.tar.gz hadoop-2.5.0.tar.gz
      put: `hadoop-2.5.0.tar.gz': No such file or directory

      It seems that the /user/hadoop/ directory is not automatically created.

      After it is created, it can be just referenced as you posted. Nice!

      BTW: the path is currently hardcoded in HDFS: https://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java?view=markup

      171 @Override
      172 public Path getHomeDirectory() {
      173 return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
      174 }

  2. Dear Eric Zhiqiang,
    I am very new to hadoop, and this blog instruction explain the multinode cluster setup in very nice way, however i have some query before starting the multinode cluster setup.
    1. Can multiple virtual machines (hosted on a single ESXi server) be used as nodes of a multi-node hadoop cluster.
    2. As per your suggestion, first we have to do hadoop configuration on a specific node(say client node) then have to Duplicate Hadoop configuration files to all nodes,
    so can we used NameNode or any datanode as the client node or have to use a dedicated node as client node
    3. Is it necessary to write name node host name in slaves file, if i want to run my task tracker service only on datanodes.
    4. I am planning to use RHEL 6.4 on all my nodes and hadoop version hadoop-2.5.1.tar.gz, so can we use inbox open jdk with below version:
    java version “1.7.0_09-icedtea”
    OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
    OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

  3. Hi Eric,

    I am getting the following error when trying the check the HDFS status on namenode or datanode:

    [hadoop@namenode ~]$ hdfs dfsadmin -report
    14/11/29 10:45:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Configured Capacity: 0 (0 B)
    Present Capacity: 0 (0 B)
    DFS Remaining: 0 (0 B)
    DFS Used: 0 (0 B)
    DFS Used%: NaN%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0

    ————————————————-
    Can you please suggest me a solution.
    Thanks in advance.

  4. It seems the DataNodes are not identified by the NameNode.

    1. Some problems noted in http://www.highlyscalablesystems.com/3022/pitfalls-and-lessons-on-configuing-and-tuning-hadoop/ may still apply. You may check them.

    2. Another common problems for me is that the firewalls on these nodes block the network traffic. If the nodes are in a controlled and trusted cluster, you may disable firewallD (on F20: http://ask.fclose.com/692/how-to-totally-disable-firewall-or-iptables-on-fedora-20 ) or iptables (earlier releases: http://www.fclose.com/3837/flushing-iptables-on-fedora/ ).

    3. You may also log on the nodes running DataNode and use `ps aux | grep java` to check whether the DataNode daemon is running.

    Hope these tips help.

  5. Hi Eric,

    Thank u so much for your help. But firewall is already off on my machine. I performed the following steps, and the problem got resolved:
    1. Stop the cluster
    2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml
    3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
    4. Restart the cluster
    Courtesy:stackoverflow.com/questions/10097246/no-data-nodes-are-started

  6. Thanks for the detailed info and i believe every one likes the tutorial and the way you took us on each and every individual step. Kudos

  7. Hi Eric,

    Very useful blog. Just wondering about containers – do you have more details on them. For example:
    1. If one node has TWO containers, can one map-reduce job spawn up to two tasks only on that node? Or can each container have more than one tasks each?
    2. Do you know the internals of Yarn, in particular, which part of the Yarn script actually spawn off different container / tasks?

    Thanks
    C.Chee

    1. You may check the “Architecture of Next Generation Apache Hadoop MapReduce
      Framework”:

      https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf

      The “Resource Model” discussed the mode for YARN v1.0:

      The Scheduler models only memory in YARN v1.0. Every node in the system is considered to be composed of multiple containers of minimum size of memory (say 512MB or 1 GB). The ApplicationMaster can request any container as a multiple of the minimum memory size.

      Eventually we want to move to a more generic resource model, however, for Yarn v1 we propose a rather straightforward model:

      The resource model is completely based on memory (RAM) and every node is made up discreet chunks of memory.

      For the implementation, you may need to dive into the source code tree.

  8. Thanks for the tutorial. I have one question and one issue requiring your help.

    Is the value mapreduce_shuffle or mapreduce.shuffle?

    yarn.nodemanager.aux-services
    mapreduce_shuffle
    shuffle service for MapReduce

    I configured Hadoop 2.5.2 following your guideline. HDFS is configured and datanodes are reporting. yarn node -list is running and reports the nodes in my cluster. I am getting the Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted at 23% of the Map task.

    Could you please help me to get out of this exception.

    15/02/05 12:00:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    15/02/05 12:00:36 INFO client.RMProxy: Connecting to ResourceManager at 101-master/192.168.0.18:8032
    15/02/05 12:00:42 INFO input.FileInputFormat: Total input paths to process : 1
    15/02/05 12:00:43 INFO mapreduce.JobSubmitter: number of splits:1
    15/02/05 12:00:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423137573492_0001
    15/02/05 12:00:44 INFO impl.YarnClientImpl: Submitted application application_1423137573492_0001
    15/02/05 12:00:44 INFO mapreduce.Job: The url to track the job: http://101-master:8088/proxy/application_1423137573492_0001/
    15/02/05 12:00:44 INFO mapreduce.Job: Running job: job_1423137573492_0001
    15/02/05 12:00:58 INFO mapreduce.Job: Job job_1423137573492_0001 running in uber mode : false
    15/02/05 12:00:58 INFO mapreduce.Job: map 0% reduce 0%
    15/02/05 12:01:17 INFO mapreduce.Job: map 8% reduce 0%
    15/02/05 12:01:20 INFO mapreduce.Job: map 12% reduce 0%
    15/02/05 12:01:23 INFO mapreduce.Job: map 16% reduce 0%
    15/02/05 12:01:26 INFO mapreduce.Job: map 23% reduce 0%
    15/02/05 12:01:30 INFO mapreduce.Job: Task Id : attempt_1423137573492_0001_m_000000_0, Status : FAILED
    Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr

    ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    1. Hi Tariq,

      It is “mapreduce_shuffle”.

      Check: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html

      I have no idea what’s the reason for the error.

      The exit status 134 may tell some information (check a discussion here: https://groups.google.com/forum/#!topic/comp.lang.java.machine/OibTSkLJ-bY ). In this post, the JVM is from Oracle. Your JVM seems the OpenJDK. You may try Oracle JVM.

    1. Hi Tariq, just noticed your comment.

      I do ever experience the `exitCode=134` problem once.

      My solution is to add the following setting to `hadoop/etc/hadoop/yarn-site.xml`:

      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
      </property>
      

      That was the only change I made.

      You may check how much memory your program uses in one task and set the value to be larger than that.
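      If it helps, here is a rough sketch (my assumption: a Linux NodeManager host with `pgrep` available) of checking the resident memory of running map/reduce task JVMs, so you can pick an allocation larger than what a task actually uses:

```shell
# List running YarnChild (map/reduce task) JVMs and their resident memory.
# Assumes a Linux /proc filesystem; run on a NodeManager (slave) host
# while a job is executing.
for pid in $(pgrep -f org.apache.hadoop.mapred.YarnChild); do
  rss_kb=$(awk '/VmRSS/ {print $2}' /proc/"$pid"/status)  # resident set size in kB
  echo "PID $pid: $((rss_kb / 1024)) MB resident"
done
```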

  9. Thank you for the wonderful tutorial; since I am a beginner, it was really easy to follow. I made a cluster with two slave nodes and one master node. I have a question: how do I check whether map/reduce tasks are actually running on the slave nodes? Are there specific files to check in the logs directory, and if yes, which ones? `yarn node -list` shows 3 nodes with status running.

  10. Hi Eric,

    Please help me. I am unable to run my first basic example; I am getting the message below:
    “15/03/20 20:35:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/hadoop-hduser/mapred/staging/hduser/.staging/job_201503201943_0007
    15/03/20 20:35:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
    Exception in thread “main” org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)”

    Thanks
    Shekar M

  11. Thanks for the great tutorial. It’s the most up-to-date information I’ve found. I have a couple of questions.

    You don’t mention the masters file, which some other cluster configuration blogs show. Should we add a masters file to go along with the included slaves file? Also, your script will distribute the modified slaves files to the slave nodes. Do the slave nodes need the modified slaves file, or is it ignored on the slave nodes?

    When we format the dfs with “hdfs namenode -format” should this be done on all nodes, or just the master?

    1. About the masters file: the masters file is for the Secondary NameNodes ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode ). In 2.5.0, it is better to use the “dfs.namenode.secondary.http-address” property ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ). I am not sure whether the masters file still works for specifying the Secondary NameNodes. You may give it a try, and you are welcome to share your findings here.

      FYI: the start-dfs.sh script starts the Secondary NameNodes:

      # secondary namenodes (if any)
      
      SECONDARY_NAMENODES=$($HADOOP_PREFIX/bin/hdfs getconf -secondarynamenodes 2>/dev/null)
      

      The `hdfs getconf` command gets the addresses of the secondary namenodes. A quick try shows me that the masters file has no effect and the address from `dfs.namenode.secondary.http-address` is used.

      The slave nodes do not need the slaves file. You can skip it.

      For `hdfs namenode -format`, it only needs to be done on the master (the NameNode).

  12. Eric,

    I really appreciate your efforts in publishing this article and answering queries from Hadoop users. Before coming across your article, I struggled for two weeks to run MapReduce code successfully in a multi-node environment: the MapReduce jobs used to hang indefinitely. I was missing the “yarn.resourcemanager.hostname” parameter in the “yarn-site.xml” config file. Your article helped me find this missing piece, and now all my MapReduce jobs run successfully.

    Thanks a lot.

  13. Now that I’ve got a working Hadoop cluster I’d like to install HBase and Zookeeper too. Do you know any good tutorials for installing HBase and Zookeeper?

    1. It is not covered in this tutorial. To make the NameNode highly available with more than one NameNode node and avoid a single point of failure, you may consider 2 choices:

      HDFS High Availability using a shared NFS directory to share edit logs between the Active and Standby NameNodes: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
      HDFS High Availability Using the Quorum Journal Manager: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
