Setting Up a Hadoop 2.x Cluster for Development and Legacy Systems

By Eric Ma. Posted on Sep 14, 2014. Updated on Apr 12, 2026.

Hadoop 2.x is a legacy release line. This guide covers setup for learning purposes and legacy system maintenance only. For production deployments, use Hadoop 3.x or later, which adds performance improvements, better YARN scheduling, HDFS erasure coding, and improved security. Cloud-managed options such as AWS EMR, Google Dataproc, and Azure HDInsight remove most of the operational overhead of running your own cluster.

This guide walks through setting up a three-node cluster (hofstadter, snell, biot) with hofstadter as the NameNode/ResourceManager. Hadoop 2.x introduced YARN (Yet Another Resource Negotiator), which separates cluster resource management from MapReduce job execution, a significant architectural change from 1.x.

Prerequisites

Linux user hadoop exists on all nodes with home directory /home/hadoop/
Root or sudo access to configure system settings
Network connectivity between all nodes with consistent DNS or /etc/hosts configuration
JDK 8 (Hadoop 2.7 through 2.10 support Java 7 and 8; Java 11 and later are not supported on the 2.x line)

Hostname and Network Configuration

Hadoop uses hostnames for node discovery. Edit /etc/hosts on every node:

10.0.3.29   hofstadter
10.0.3.30   snell
10.0.3.31   biot

Verify connectivity:

ping -c 1 hofstadter
ping -c 1 snell
ping -c 1 biot

For clusters with 10+ nodes, use DNS instead of /etc/hosts to reduce maintenance burden.
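Since every node needs the same entries, it can help to check a hosts file for the expected names before going further. The following is a minimal sketch, not part of Hadoop: the function name missing_hosts is hypothetical, and it assumes the simple "IP hostname [aliases]" layout shown above.

```shell
#!/usr/bin/env bash
# missing_hosts FILE HOST...   (hypothetical helper, not a Hadoop tool)
# Prints every HOST that has no entry in FILE and returns non-zero if any
# host is missing. Text after '#' on a line is treated as a comment.
missing_hosts() {
  local file=$1; shift
  local host rc=0
  for host in "$@"; do
    if ! awk -v h="$host" '
          { sub(/#.*/, "")                       # drop comments
            for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
          END { exit (found ? 0 : 1) }' "$file"; then
      echo "$host"
      rc=1
    fi
  done
  return $rc
}

# Example (run on each node):
#   missing_hosts /etc/hosts hofstadter snell biot
```

No output and a zero exit status mean all listed hostnames are present in the file.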

SSH Key-Based Authentication

The Hadoop user on the NameNode must SSH to all DataNodes without password prompts. Generate Ed25519 keys as the hadoop user on the NameNode:

su - hadoop
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

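Distribute the public key to all nodes. Include hofstadter itself, because the cluster start scripts also SSH to the local node when it is listed as a DataNode:

for node in hofstadter snell biot; do
  ssh-copy-id -i ~/.ssh/id_ed25519.pub hadoop@$node
done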

Test passwordless login:

ssh hadoop@snell hostname
ssh hadoop@biot hostname

Both commands should return the node’s hostname without prompting for a password.
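To check every node in one pass without risking a hanging password prompt, the login test can be wrapped in a loop. This is a sketch under assumptions: check_ssh and the SSH_BIN override are hypothetical names introduced here, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/usr/bin/env bash
# check_ssh NODE...   (hypothetical helper)
# Tries a non-interactive login as user hadoop on each node.
# BatchMode=yes makes ssh fail instead of prompting, so a missing key shows
# up as FAILED rather than as a hang.
# SSH_BIN is an assumed override hook for dry runs; it defaults to ssh.
check_ssh() {
  local node rc=0
  for node in "$@"; do
    if "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 \
         "hadoop@$node" true </dev/null; then
      echo "$node: ok"
    else
      echo "$node: FAILED"
      rc=1
    fi
  done
  return $rc
}

# Example (run as the hadoop user on the NameNode):
#   check_ssh hofstadter snell biot
```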

Java Installation

Hadoop requires a JDK, not just a JRE. Install via your package manager or manually:

# Using package manager (recommended)
apt-get install openjdk-8-jdk  # Debian/Ubuntu
# or
dnf install java-1.8.0-openjdk-devel  # RHEL/CentOS/Fedora

For manual installation, download from the OpenJDK repository and extract to /usr/java/:

mkdir -p /usr/java
cd /usr/java
# Download and extract JDK tarball
ln -s <extracted-jdk8-directory> default  # use the directory name the tarball actually creates

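Add to ~/.bashrc for the hadoop user on all nodes. The path below matches the manual installation; for a package install, set JAVA_HOME to the JDK directory your distribution uses: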

export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH

Source and verify:

source ~/.bashrc
java -version

Hadoop Installation

Download Hadoop 2.10.1 (one of the last 2.x releases) from the Apache archive:

cd /home/hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
tar xzf hadoop-2.10.1.tar.gz
ln -s hadoop-2.10.1 hadoop

Add to ~/.bashrc for the hadoop user on all nodes:

export HADOOP_COMMON_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export YARN_HOME=$HADOOP_COMMON_HOME
export PATH=$PATH:$HADOOP_COMMON_HOME/bin:$HADOOP_COMMON_HOME/sbin

Reload and verify:

source ~/.bashrc
hadoop version

Configuration

All configuration files live in $HADOOP_COMMON_HOME/etc/hadoop/. Edit the .xml files directly with a text editor rather than generating them with scripts, because a script that rewrites a file can silently discard manual edits.

core-site.xml

Set the NameNode address. Add this inside the <configuration> tags:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hofstadter/</value>
  <description>NameNode URI</description>
</property>
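After editing, a quick text check can confirm that a property was saved as intended. The sketch below is only an illustration: get_property is a hypothetical helper, it is a plain text match rather than a real XML parser, and it assumes <name> and <value> each sit on their own line as in the snippets in this guide.

```shell
#!/usr/bin/env bash
# get_property FILE NAME   (hypothetical helper)
# Prints the <value> of the property called NAME in a Hadoop *-site.xml file.
get_property() {
  awk -v want="$2" '
    /<name>/  { n = $0; gsub(/.*<name>|<\/name>.*/, "", n) }     # remember last name
    /<value>/ { v = $0; gsub(/.*<value>|<\/value>.*/, "", v)     # strip the tags
                if (n == want) { print v; exit } }
  ' "$1"
}

# Example:
#   get_property "$HADOOP_COMMON_HOME/etc/hadoop/core-site.xml" fs.defaultFS
```

On a node with Hadoop installed, hdfs getconf -confKey fs.defaultFS reports the value Hadoop itself resolves, which is the more authoritative check.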

hdfs-site.xml

Configure HDFS replication and storage directories:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Number of block replicas</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop/hdfs/namenode</value>
  <description>NameNode metadata storage</description>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/hadoop/hdfs/datanode</value>
  <description>DataNode block storage</description>
</property>

Create these directories on all nodes:

mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode
chown -R hadoop:hadoop /home/hadoop/hdfs

Set dfs.replication to match your number of DataNodes or fewer. A three-node cluster can use replication factor 3; smaller clusters should lower this value.
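The rule above amounts to taking the smaller of the desired replica count and the number of DataNodes. As a tiny sketch (effective_replication is a hypothetical name used only for illustration):

```shell
#!/usr/bin/env bash
# effective_replication DESIRED DATANODES   (hypothetical helper)
# Prints the smaller of the two numbers: HDFS cannot place more replicas of
# a block than there are DataNodes to hold them.
effective_replication() {
  local desired=$1 datanodes=$2
  if [ "$desired" -gt "$datanodes" ]; then
    echo "$datanodes"
  else
    echo "$desired"
  fi
}

effective_replication 3 2   # prints 2
effective_replication 3 3   # prints 3
```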

yarn-site.xml

Configure YARN resource management:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hofstadter</value>
  <description>ResourceManager host</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Shuffle service for MapReduce</description>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
  <description>Memory available per NodeManager in MB</description>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>512</value>
  <description>Minimum memory per container in MB</description>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>
  <description>Maximum memory per container in MB</description>
</property>

Adjust memory values based on the RAM actually available. A common rule of thumb: yarn.nodemanager.resource.memory-mb = total RAM minus about 2 GB reserved for the operating system and Hadoop daemons. On a node with 8 GB of RAM, that gives 6144 MB.
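The arithmetic behind that rule of thumb can be written down as a one-line function. This is a sketch, with nodemanager_memory_mb as a hypothetical name and the 2048 MB reserve as the default assumption from the text:

```shell
#!/usr/bin/env bash
# nodemanager_memory_mb TOTAL_RAM_MB [RESERVE_MB]   (hypothetical helper)
# Total RAM minus a fixed system reserve (2048 MB unless overridden).
nodemanager_memory_mb() {
  local total=$1 reserve=${2:-2048}
  echo $(( total - reserve ))
}

nodemanager_memory_mb 8192    # prints 6144
nodemanager_memory_mb 16384   # prints 14336
```

On Linux, the total in MB can be read with awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo and passed in as the first argument.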

mapred-site.xml

Copy the template and configure MapReduce to use YARN:

cp $HADOOP_COMMON_HOME/etc/hadoop/mapred-site.xml.template \
   $HADOOP_COMMON_HOME/etc/hadoop/mapred-site.xml

Add to mapred-site.xml:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>Use YARN for MapReduce</description>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hofstadter:10020</value>
  <description>Job history server address</description>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hofstadter:19888</value>
  <description>Job history web UI</description>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>512</value>
  <description>Memory per map task</description>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>512</value>
  <description>Memory per reduce task</description>
</property>

These map/reduce memory settings should not exceed yarn.scheduler.maximum-allocation-mb.
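A small check can catch task sizes that the scheduler would reject. The sketch below uses a hypothetical function, container_size_mb. The rounding step reflects how YARN schedulers are commonly described as normalizing requests to a multiple of the minimum allocation; treat that as an assumption and verify it against the scheduler you run.

```shell
#!/usr/bin/env bash
# container_size_mb REQUEST_MB MIN_MB MAX_MB   (hypothetical helper)
# Rounds REQUEST_MB up to the next multiple of MIN_MB and fails if the
# rounded size is larger than MAX_MB.
container_size_mb() {
  local request=$1 min=$2 max=$3
  local size=$(( (request + min - 1) / min * min ))
  if [ "$size" -gt "$max" ]; then
    echo "request of ${request} MB exceeds maximum of ${max} MB" >&2
    return 1
  fi
  echo "$size"
}

container_size_mb 512 512 4096   # prints 512
container_size_mb 700 512 4096   # prints 1024
```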

slaves (renamed workers in Hadoop 3.x)

Edit etc/hadoop/slaves and list all DataNode hostnames, one per line:

hofstadter
snell
biot

The NameNode/ResourceManager can also run DataNode/NodeManager services (as shown here), though production deployments often dedicate the NameNode to metadata management only.

Distribute Configuration to All Nodes

Use rsync to copy the Hadoop installation, including its configuration, to the other nodes:

cd /home/hadoop
for node in snell biot; do
  echo "Syncing to $node..."
  rsync -avxz --delete hadoop/ hadoop@$node:/home/hadoop/hadoop/
done

Verify on each node:

ssh hadoop@snell ls /home/hadoop/hadoop/etc/hadoop/core-site.xml
ssh hadoop@biot ls /home/hadoop/hadoop/etc/hadoop/core-site.xml

Format the NameNode

On the NameNode only, initialize HDFS metadata:

hdfs namenode -format

This creates the initial fsimage. Do this only once. Re-formatting destroys all HDFS data and cannot be undone.
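Because an accidental second format is destructive, a guard in front of the command can be useful. This is a minimal sketch: safe_to_format is a hypothetical name, and the check relies on a formatted NameNode keeping a VERSION file under current/ in its name directory.

```shell
#!/usr/bin/env bash
# safe_to_format NAME_DIR   (hypothetical helper)
# Succeeds only when NAME_DIR does not already hold NameNode metadata.
safe_to_format() {
  local name_dir=$1
  if [ -e "$name_dir/current/VERSION" ]; then
    echo "refusing: $name_dir is already formatted" >&2
    return 1
  fi
  return 0
}

# Example:
#   safe_to_format /home/hadoop/hdfs/namenode && hdfs namenode -format
```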

Start the Cluster

Option 1: Individual Service Control

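On the NameNode:

hadoop-daemon.sh start namenode

On each DataNode:

hadoop-daemon.sh start datanode

On the ResourceManager (usually the NameNode):

yarn-daemon.sh start resourcemanager

On each NodeManager:

yarn-daemon.sh start nodemanager

To use the Job History Server configured in mapred-site.xml, start it on hofstadter:

mr-jobhistory-daemon.sh start historyserver

In Hadoop 2.x, these scripts (all in sbin/) start a single daemon in the background on the local node. The hdfs --daemon and yarn --daemon forms were introduced in Hadoop 3.x and do not work here. Starting daemons one at a time is useful when restarting a single failed service without touching the rest of the cluster.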

Option 2: Use Cluster Start Scripts

From the NameNode, start HDFS across all nodes listed in slaves:

start-dfs.sh

This command uses SSH to launch NameNode and DataNode services. Similarly, start YARN:

start-yarn.sh

Run these scripts from the NameNode to ensure proper SSH communication.

Verify Cluster Status

Check HDFS health:

hdfs dfsadmin -report

Sample output:

Configured Capacity: 335854977024 (312.79 GB)
Present Capacity: 335854977024 (312.79 GB)
DFS Remaining: 335854977024 (312.79 GB)
DFS Used: 0 (0 B)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

Live datanodes (3):
Name: 10.0.3.30:50010 (snell)
Hostname: snell
...
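For scripted health checks, the live DataNode count can be pulled out of the report. A small sketch, with live_datanodes as a hypothetical helper that relies on the "Live datanodes (N):" line shown in the sample output:

```shell
#!/usr/bin/env bash
# live_datanodes   (hypothetical helper)
# Reads `hdfs dfsadmin -report` output on stdin and prints the number from
# the "Live datanodes (N):" line, or 0 if that line is absent.
live_datanodes() {
  awk -F'[()]' '/^Live datanodes/ { n = $2 } END { print n + 0 }'
}

# Example:
#   hdfs dfsadmin -report | live_datanodes
```

A result lower than the number of hosts in the DataNode list suggests that at least one DataNode failed to start or cannot reach the NameNode.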

Check YARN node status:

yarn node -list

Sample output:

Total Nodes:3
         Node-Id            Node-State Node-Http-Address   Number-of-Running-Containers
hofstadter:43469           RUNNING    hofstadter:8042                              0
snell:57039                RUNNING    snell:8042                                    0
biot:52834                 RUNNING    biot:8042                                     0

All nodes should report RUNNING state.
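The same check can be automated by counting rows in the RUNNING state. This sketch assumes the column layout of the sample output above; running_nodes is a hypothetical name.

```shell
#!/usr/bin/env bash
# running_nodes   (hypothetical helper)
# Reads `yarn node -list` output on stdin and counts rows whose second
# column (Node-State) is RUNNING.
running_nodes() {
  awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}

# Example:
#   [ "$(yarn node -list 2>/dev/null | running_nodes)" -eq 3 ] && echo "all NodeManagers up"
```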

Web UIs

Access cluster dashboards via browser:

HDFS NameNode: http://hofstadter:50070/ (shows block inventory, replication status, dead nodes; 9870 is the Hadoop 3.x port)
YARN ResourceManager: http://hofstadter:8088/ (shows running and completed applications, node resources)
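Job History Server: http://hofstadter:19888/ (MapReduce job logs; available once the history server has been started with mr-jobhistory-daemon.sh start historyserver)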

Run a Test Job

Create HDFS home directory:

hadoop fs -mkdir -p /user/hadoop

Upload test data:

hadoop fs -put $HADOOP_COMMON_HOME/etc/hadoop test-input

Run the built-in grep example:

hadoop jar $HADOOP_COMMON_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  grep test-input test-output 'dfs[a-z.]*'

Monitor the job at http://hofstadter:8088/. Once complete, check results:

hadoop fs -ls test-output/
hadoop fs -cat test-output/part-r-00000

The output lists each distinct string that matched the regex, preceded by the number of times it occurred in the input.
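To sanity-check the job's result, a rough local equivalent can be run on the same files. The example job extracts every substring matching the regex and counts how often each occurs; the sketch below imitates that with standard tools. count_matches is a hypothetical name, and the output format is only an approximation of the job's count-then-match layout.

```shell
#!/usr/bin/env bash
# count_matches PATTERN FILE...   (hypothetical helper)
# Prints "count<TAB>match" for every distinct substring matching PATTERN,
# most frequent first.
count_matches() {
  local pattern=$1; shift
  grep -ohE -e "$pattern" "$@" | sort | uniq -c | sort -k1,1nr -k2 |
    awk '{ print $1 "\t" $2 }'
}

# Example:
#   count_matches 'dfs[a-z.]*' "$HADOOP_COMMON_HOME"/etc/hadoop/*
```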

Stop the Cluster

Option 1: Individual Service Control

Stop services in the reverse of the start order, using the Hadoop 2.x daemon scripts:

# On each NodeManager
yarn-daemon.sh stop nodemanager

# On ResourceManager
yarn-daemon.sh stop resourcemanager

# On each DataNode
hadoop-daemon.sh stop datanode

# On NameNode
hadoop-daemon.sh stop namenode

Option 2: Use Cluster Stop Scripts

From the NameNode:

stop-yarn.sh
stop-dfs.sh

These scripts use SSH to stop services across all nodes in slaves. Stopping YARN first keeps running applications from losing HDFS underneath them.

Debugging and Troubleshooting

Check logs on any node:

ls $HADOOP_COMMON_HOME/logs/
tail -f $HADOOP_COMMON_HOME/logs/hadoop-hadoop-namenode-hofstadter.log
tail -f $HADOOP_COMMON_HOME/logs/hadoop-hadoop-datanode-snell.log

Connection Refused / SSH Failures

Verify SSH is running on all nodes: systemctl status ssh (or sshd on some systems)
Check that the hadoop user can SSH without a password: ssh hadoop@snell echo test
Ensure firewall rules permit SSH (port 22) and Hadoop ports (8020, 50070, 8088, 50010)

Port Conflicts

Default ports in Hadoop 2.x:
NameNode: 8020 (RPC), 50070 (web UI)
DataNode: 50010 (data transfer), 50020 (IPC)
ResourceManager: 8032 (RPC), 8088 (web UI)
NodeManager: 8042 (web UI), 8040 (localizer IPC)

Check if ports are in use:

netstat -tlnp | grep :8020
lsof -i :8088
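To check several ports in a script, the listener table can be parsed directly. This sketch relies on the local address being the fourth column in both ss -tln and netstat -tln output; port_in_use is a hypothetical name.

```shell
#!/usr/bin/env bash
# port_in_use PORT   (hypothetical helper)
# Reads `ss -tln` or `netstat -tln` output on stdin and succeeds if any
# local address (4th column) ends in :PORT.
port_in_use() {
  awk -v p="$1" '
    { n = split($4, parts, ":"); if (parts[n] == p) found = 1 }
    END { exit (found ? 0 : 1) }'
}

# Example:
#   for p in 8032 8088 8042; do
#     ss -tln | port_in_use "$p" && echo "port $p is already bound"
#   done
```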

Java Not Found

Verify JAVA_HOME is set: echo $JAVA_HOME
Confirm the path exists: ls $JAVA_HOME/bin/java
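Source ~/.bashrc again if recently added: source ~/.bashrc
If daemons launched over SSH by the start scripts still report that JAVA_HOME is not set, define it explicitly in etc/hadoop/hadoop-env.sh, since non-interactive SSH sessions may not read ~/.bashrc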

Blocks Under-Replicated

Check that dfs.replication does not exceed the number of DataNodes
Verify all DataNodes are healthy: hdfs dfsadmin -report
Wait for the NameNode to re-replicate the affected blocks; this can take minutes on large datasets

NameNode Safemode

On startup, HDFS enters safemode while blocks are inventoried. Check status:

hdfs dfsadmin -safemode get

Force exit (only if blocks are verified):

hdfs dfsadmin -safemode leave
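Rather than forcing an exit, scripts that run right after startup can simply wait until safemode ends. The sketch below is an illustration: safemode_is_off is a hypothetical helper that matches the "Safe mode is OFF" text printed by the get subcommand.

```shell
#!/usr/bin/env bash
# safemode_is_off   (hypothetical helper)
# Reads `hdfs dfsadmin -safemode get` output on stdin and succeeds when it
# reports that safemode is off.
safemode_is_off() {
  grep -q 'Safe mode is OFF'
}

# Example: poll every 5 seconds until the NameNode leaves safemode.
#   until hdfs dfsadmin -safemode get | safemode_is_off; do sleep 5; done
```

The built-in hdfs dfsadmin -safemode wait blocks in a similar way and is usually the simpler choice when no custom handling is needed.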

Full Cluster Reset

To reset from scratch:

# Stop all services
stop-dfs.sh
stop-yarn.sh

# Clean data on NameNode
rm -rf /home/hadoop/hdfs /home/hadoop/hadoop/logs

# Clean data on all DataNodes
for node in snell biot; do
  ssh hadoop@$node 'rm -rf /home/hadoop/hdfs /home/hadoop/hadoop/logs'
done

# Re-format and restart
hdfs namenode -format
start-dfs.sh
start-yarn.sh

Configuration Reference

Parameter	File	Purpose	Example
`fs.defaultFS`	core-site.xml	NameNode address	`hdfs://hofstadter/`
`dfs.replication`	hdfs-site.xml	Block replication factor	`3`
`d

34 Comments

Thirumal Venkat says:

Sep 14, 2014 at 10:46 am

I believe if you use /user/hadoop instead you can directly access folders inside it similar to your home directory in HDFS.

Ex:
/user/hadoop/input can be just referenced as input while accessing HDFS.

1. Eric Zhiqiang Ma says:
  
  Sep 15, 2014 at 2:14 am
  
  On my cluster:
  
  $ hdfs dfs -ls /user/hadoop/
  ls: `/user/hadoop/': No such file or directory
  
  and
  
  $ hdfs dfs -put hadoop-2.5.0.tar.gz hadoop-2.5.0.tar.gz
  put: `hadoop-2.5.0.tar.gz': No such file or directory
  
  It seems that the /user/hadoop/ directory is not automatically created.
  
  After it is created, it can be referenced just as you posted. Nice!
  
  BTW: the path is currently hardcoded in HDFS: https://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java?view=markup
  
  @Override
  public Path getHomeDirectory() {
    return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
  }
  
Luke says:

Oct 24, 2014 at 3:01 am

Thank you very much for this guide. The Apache docs are not easy to follow.
md says:

Nov 26, 2014 at 8:38 pm

Dear Eric Zhiqiang,
I am very new to Hadoop, and this blog explains the multi-node cluster setup very nicely. However, I have some questions before starting the setup.
1. Can multiple virtual machines (hosted on a single ESXi server) be used as nodes of a multi-node Hadoop cluster?
2. As per your suggestion, we first configure Hadoop on a specific node (say, a client node) and then duplicate the configuration files to all nodes. Can we use the NameNode or any DataNode as the client node, or do we need a dedicated client node?
3. Is it necessary to list the NameNode host name in the slaves file if I want to run the task tracker service only on DataNodes?
4. I am planning to use RHEL 6.4 on all my nodes with hadoop-2.5.1.tar.gz. Can we use the bundled OpenJDK with the version below?
java version “1.7.0_09-icedtea”
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

1. Eric Zhiqiang Ma says:
  
  Nov 26, 2014 at 9:25 pm
  
  Hi md,
  
  1. Yes.
  
  2. I used the NameNode in this tutorial as the “client node”.
  
  3. No, not if you do not want to run a DataNode on the NameNode host.
  
  4. I suggest using the Oracle JVM. But 1.7.0_09-icedtea seems to be reported as “Good” too: https://wiki.apache.org/hadoop/HadoopJavaVersions .
  1. md says:
    
    Nov 26, 2014 at 10:24 pm
    
    Thanks Eric Zhiqiang for your quick response.
    
Eric Zhiqiang Ma says:

Nov 29, 2014 at 12:35 am

It seems the DataNodes are not identified by the NameNode.

1. Some problems noted in http://www.highlyscalablesystems.com/3022/pitfalls-and-lessons-on-configuing-and-tuning-hadoop/ may still apply. You may check them.

2. Another common problem for me is that the firewalls on these nodes block the network traffic. If the nodes are in a controlled and trusted cluster, you may disable firewallD (on F20: https://www.systutorials.com/qa/692/how-to-totally-disable-firewall-or-iptables-on-fedora-20 ) or iptables (earlier releases: http://www.fclose.com/3837/flushing-iptables-on-fedora/ ).

3. You may also log on to the nodes running DataNode and use `ps aux | grep java` to check whether the DataNode daemon is running.

Hope these tips help.
md says:

Nov 28, 2014 at 9:33 pm

Hi Eric,

I am getting the following error when trying to check the HDFS status on the namenode or datanode:

[hadoop@namenode ~]$ hdfs dfsadmin -report
14/11/29 10:45:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

————————————————-
Can you please suggest a solution?
Thanks in advance.
md says:

Nov 29, 2014 at 2:19 am

Hi Eric,

Thank you so much for your help. But the firewall is already off on my machine. I performed the following steps, and the problem was resolved:
1. Stop the cluster
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml
3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
4. Restart the cluster
Courtesy: stackoverflow.com/questions/10097246/no-data-nodes-are-started
1. Eric Zhiqiang Ma says:
  
  Nov 29, 2014 at 5:07 am
  
  Hi md,
  
  Noted. Thanks for sharing.
  
Eric Zhiqiang Ma says:

Dec 7, 2014 at 9:28 pm

NameNode metadata is critical for the whole HDFS cluster. If you use a single node for the NameNode, make replicas of the metadata on 2 or more separated disks for higher data reliability.

Please check this post for more information on how to replicate and set up 2-disk metadata storage for the NameNode:

https://www.systutorials.com/qa/1315/add-new-hdfs-namenode-metadata-directory-existing-cluster

John says:

Dec 10, 2014 at 4:53 pm

I had better luck defining JAVA_HOME in hadoop-env.sh
1. Eric Zhiqiang Ma says:
  
  Dec 10, 2014 at 9:29 pm
  
  That’s better if you use a Java version different from the system-wide one.
Saravanan says:

Dec 29, 2014 at 7:11 pm

Thanks for the detailed info. I believe everyone likes the tutorial and the way you took us through each individual step. Kudos
Chee says:

Jan 12, 2015 at 11:34 pm

Hi Eric,

Very useful blog. Just wondering about containers: do you have more details on them? For example:
1. If one node has TWO containers, can one map-reduce job spawn only up to two tasks on that node? Or can each container run more than one task?
2. Do you know the internals of YARN, in particular, which part of the YARN scripts actually spawns the different containers / tasks?

Thanks
C.Chee
1. Eric Zhiqiang Ma says:
  
  Jan 20, 2015 at 6:18 am
  
  You may check the “Architecture of Next Generation Apache Hadoop MapReduce
  Framework”:
  
  https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf
  
  The “Resource Model” section discusses the model for YARN v1.0:
  
  The Scheduler models only memory in YARN v1.0. Every node in the system is considered to be composed of multiple containers of minimum size of memory (say 512MB or 1 GB). The ApplicationMaster can request any container as a multiple of the minimum memory size.
  
  Eventually we want to move to a more generic resource model, however, for Yarn v1 we propose a rather straightforward model:
  
  The resource model is completely based on memory (RAM) and every node is made up discreet chunks of memory.
  
  For the implementation, you may need to dive into the source code tree.
  
Tariq says:

Feb 5, 2015 at 4:22 am

Thanks for the tutorial. I have one question and one issue requiring your help.

Is the value mapreduce_shuffle or mapreduce.shuffle?

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>shuffle service for MapReduce</description>
</property>

I configured Hadoop 2.5.2 following your guideline. HDFS is configured and the DataNodes are reporting. yarn node -list runs and reports the nodes in my cluster. I am getting the Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted at 23% of the map task.

Could you please help me resolve this exception?

15/02/05 12:00:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/05 12:00:36 INFO client.RMProxy: Connecting to ResourceManager at 101-master/192.168.0.18:8032
15/02/05 12:00:42 INFO input.FileInputFormat: Total input paths to process : 1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: number of splits:1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423137573492_0001
15/02/05 12:00:44 INFO impl.YarnClientImpl: Submitted application application_1423137573492_0001
15/02/05 12:00:44 INFO mapreduce.Job: The url to track the job: http://101-master:8088/proxy/application_1423137573492_0001/
15/02/05 12:00:44 INFO mapreduce.Job: Running job: job_1423137573492_0001
15/02/05 12:00:58 INFO mapreduce.Job: Job job_1423137573492_0001 running in uber mode : false
15/02/05 12:00:58 INFO mapreduce.Job: map 0% reduce 0%
15/02/05 12:01:17 INFO mapreduce.Job: map 8% reduce 0%
15/02/05 12:01:20 INFO mapreduce.Job: map 12% reduce 0%
15/02/05 12:01:23 INFO mapreduce.Job: map 16% reduce 0%
15/02/05 12:01:26 INFO mapreduce.Job: map 23% reduce 0%
15/02/05 12:01:30 INFO mapreduce.Job: Task Id : attempt_1423137573492_0001_m_000000_0, Status : FAILED
Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr

ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr

at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

1. Eric Zhiqiang Ma says:
  
  Feb 5, 2015 at 6:57 am
  
  Hi Tariq,
  
  It is “mapreduce_shuffle”.
  
  Check: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
  
  I have no idea what the reason for the error is.
  
  The exit status 134 may give some information (check a discussion here: https://groups.google.com/forum/#!topic/comp.lang.java.machine/OibTSkLJ-bY ). In this post, the JVM is from Oracle. Your JVM seems to be OpenJDK. You may try the Oracle JVM.
Tariq says:

Feb 19, 2015 at 7:32 am

Dear Eric,

I am having problems with the YARN/MapReduce configuration. I have asked these questions on Stack Overflow. Could you please have a look at them and answer if possible? Thanks
http://stackoverflow.com/questions/28586561/yarn-container-lauch-failed-exception-and-mapred-site-xml-configuration

http://stackoverflow.com/questions/28609639/yarn-container-configuration-for-javacv

Regards,

1. Eric Zhiqiang Ma says:
  
  Feb 21, 2015 at 4:19 am
  Hi Tariq, just noticed your comment.
  
  I did run into the `exitCode=134` problem once.
  
  My solution is to add the following setting to `hadoop/etc/hadoop/yarn-site.xml`:
```
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
```
  That is the only change I made.
  
  You may check how much memory your program uses in one task and set the value to be larger than that.
Shrenik Gala says:

Feb 20, 2015 at 5:54 pm

Thank you for the wonderful tutorial. Since I am a beginner, it was really easy to follow. I made a cluster with two slave nodes and one master node. My question: how do I check whether map/reduce tasks are running on the slave nodes? Are there specific files to check in the logs directory, and if so, which ones? yarn node -list shows 3 nodes with status RUNNING.
shekar says:

Mar 20, 2015 at 4:39 am

Hi Eric,

Please help me. I am unable to run my first basic example; I am getting the message below:
“15/03/20 20:35:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/hadoop-hduser/mapred/staging/hduser/.staging/job_201503201943_0007
15/03/20 20:35:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
Exception in thread “main” org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)”

Thanks
Shekar M

1. Eric Zhiqiang Ma says:
  
  Mar 20, 2015 at 4:45 am
  Make sure the input path exists in HDFS; you can check with `hdfs dfs -ls /usr/local/hadoop/input`. Hadoop reports:
```
Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
```
Dean Schulze says:

Mar 23, 2015 at 7:36 pm

Thanks for the great tutorial. It’s the most up-to-date information I’ve found. I have a couple of questions.

You don’t mention the masters file, which some other cluster configuration blogs show. Should we add a masters file to go along with the included slaves file? Also, your script will distribute the modified slaves files to the slave nodes. Do the slave nodes need the modified slaves file, or is it ignored on the slave nodes?

When we format the dfs with “hdfs namenode -format” should this be done on all nodes, or just the master?

1. Eric Zhiqiang Ma says:
  
  Mar 23, 2015 at 9:24 pm
  About the masters file: the masters file is for the Secondary NameNodes ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode ). In 2.5.0, it is better to use the “dfs.namenode.secondary.http-address” property ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ). I am not sure whether the masters file still works for specifying the Secondary NameNodes. You are welcome to try it and share your findings here.
  
  FYI: the start-dfs.sh starts the secondary namenodes:
```
# secondary namenodes (if any)

SECONDARY_NAMENODES=$($HADOOP_PREFIX/bin/hdfs getconf -secondarynamenodes 2>/dev/null)
```
  The `hdfs getconf` command gets the addresses of the secondary namenodes. A quick try shows me that the masters file has no effect and the address from `dfs.namenode.secondary.http-address` is used.
  
  The slave nodes do not need the slaves file. You can skip it.
  
  `hdfs namenode -format` only needs to be run on the master.
Nandu says:

Mar 24, 2015 at 1:28 pm

Eric,

I really appreciate your efforts in publishing this article and answering queries from Hadoop users. Before coming across your article, I struggled for two weeks to run MapReduce code successfully in a multi-node environment. The MapReduce job used to hang indefinitely. I was missing the “yarn.resourcemanager.hostname” parameter in the “yarn-site.xml” config file. Your article helped me find this missing piece, and I could run all my MapReduce jobs successfully.

Thanks a lot.
1. Eric Zhiqiang Ma says:
  
  Mar 24, 2015 at 6:25 pm
  
  Great to hear that! :)
Dean Schulze says:

Mar 26, 2015 at 5:12 am

Now that I’ve got a working Hadoop cluster I’d like to install HBase and Zookeeper too. Do you know any good tutorials for installing HBase and Zookeeper?

1. Eric Zhiqiang Ma says:
  
  Mar 26, 2015 at 6:28 am
  
  You may try the official one first: https://hbase.apache.org/apache_hbase_reference_guide.pdf . I did not try it out myself. But it looks pretty well written.
  
Shefali says:

Jun 2, 2016 at 6:38 pm

Hi, can we add a NameNode to a running cluster?
If yes, what would be the steps?
1. Eric Z Ma says:
  
  Jun 7, 2016 at 6:34 pm
  
  It is not covered in this tutorial. To make the NameNode highly available with more than one NameNode and avoid a single point of failure, you may consider 2 choices:
  
  HDFS High Availability using a shared NFS directory to share edit logs between the Active and Standby NameNodes: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
  HDFS High Availability Using the Quorum Journal Manager: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
  
Dev Mukherjee says:

Jun 8, 2017 at 6:30 pm

Hi,

Which Linux distribution would be the standard OS for performing the above installation steps?

Thanks!
Dev
1. Eric says:
  
  Jun 8, 2017 at 11:40 pm
  
  The tutorial does not rely on any specific Linux distro. CentOS 7, Fedora 12+, Ubuntu 12+ and other distros should be good enough as long as the needed tools are installed.
