Setting Up a Hadoop 2.x Cluster for Development and Legacy Systems
The Hadoop 2.x line is no longer actively maintained. This guide covers setup for learning purposes and legacy system maintenance only. For production deployments, use Hadoop 3.x or later, which adds HDFS erasure coding, better YARN scheduling, and performance and security improvements. Cloud-managed options like AWS EMR, Google Dataproc, and Azure HDInsight remove most of the operational overhead.
This guide walks through setting up a three-node cluster (hofstadter, snell, biot) with hofstadter as the NameNode/ResourceManager. Hadoop 2.x introduced YARN (Yet Another Resource Negotiator), decoupling resource management from MapReduce job scheduling — a significant architectural change from 1.x.
Prerequisites
- Linux user hadoop exists on all nodes with home directory /home/hadoop/
- Root or sudo access to configure system settings
- Network connectivity between all nodes with consistent DNS or /etc/hosts configuration
- JDK 7 or 8 (the Hadoop 2.x line is built and tested against Java 7 and 8; newer JDKs such as 11 or 21 are not supported by 2.x)
Hostname and Network Configuration
Hadoop uses hostnames for node discovery. Edit /etc/hosts on every node:
10.0.3.29 hofstadter
10.0.3.30 snell
10.0.3.31 biot
Verify connectivity:
ping -c 1 hofstadter
ping -c 1 snell
ping -c 1 biot
For clusters with 10+ nodes, use DNS instead of /etc/hosts to reduce maintenance burden.
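Before moving on, it can help to confirm that a hosts file really lists every node. The helper below is only a sketch: missing_hosts is a hypothetical function name, and it assumes the plain "IP name [alias...]" layout shown above.

```shell
# missing_hosts FILE NAME...
# Print every NAME that has no entry in FILE (hypothetical helper).
# Assumes the usual hosts layout: "IP name [alias...]", with # comments.
missing_hosts() {
  local file=$1
  shift
  local name
  for name in "$@"; do
    # Strip comments, then look for the name in any field after the IP.
    if ! awk -v n="$name" '
        { sub(/#.*/, "") }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit (found ? 0 : 1) }' "$file"; then
      echo "$name"
    fi
  done
}
```

Running missing_hosts /etc/hosts hofstadter snell biot on each node should print nothing when all three entries are present.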
SSH Key-Based Authentication
The Hadoop user on the NameNode must SSH to all DataNodes without password prompts. Generate Ed25519 keys as the hadoop user on the NameNode:
su - hadoop
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
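Distribute the public key to all nodes, including hofstadter itself, because the cluster start scripts also SSH to the local node when it is listed as a worker:
for node in hofstadter snell biot; do
ssh-copy-id -i ~/.ssh/id_ed25519.pub hadoop@$node
done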
Test passwordless login:
ssh hadoop@snell hostname
ssh hadoop@biot hostname
Both commands should return the node’s hostname without prompting for a password.
Java Installation
Hadoop requires a JDK, not just a JRE. Install via your package manager or manually:
# Using package manager (recommended)
apt-get install openjdk-8-jdk # Debian/Ubuntu
# or
dnf install java-1.8.0-openjdk-devel # RHEL/CentOS/Fedora
For manual installation, download from the OpenJDK repository and extract to /usr/java/:
mkdir -p /usr/java
cd /usr/java
# Download and extract JDK tarball
ln -s jdk1.8.0_xxx default # replace with the name of the extracted JDK 8 directory
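Add to ~/.bashrc for the hadoop user on all nodes (if the JDK came from the package manager, point JAVA_HOME at its directory under /usr/lib/jvm/ instead):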
export JAVA_HOME=/usr/java/default
export PATH=$JAVA_HOME/bin:$PATH
Source and verify:
source ~/.bashrc
java -version
Hadoop Installation
Download Hadoop 2.10.1 (one of the last 2.x releases) from the Apache archive:
cd /home/hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
tar xzf hadoop-2.10.1.tar.gz
ln -s hadoop-2.10.1 hadoop
Add to ~/.bashrc for the hadoop user on all nodes:
export HADOOP_COMMON_HOME=$HOME/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_COMMON_HOME
export HADOOP_HDFS_HOME=$HADOOP_COMMON_HOME
export HADOOP_YARN_HOME=$HADOOP_COMMON_HOME
export PATH=$PATH:$HADOOP_COMMON_HOME/bin:$HADOOP_COMMON_HOME/sbin
Reload and verify:
source ~/.bashrc
hadoop version
Configuration
All configuration files live in $HADOOP_COMMON_HOME/etc/hadoop/. Edit the .xml files directly with a text editor rather than with scripts that regenerate them, since regeneration can overwrite manual edits.
core-site.xml
Set the NameNode address. With no port in the URI, the NameNode RPC port defaults to 8020. Add this inside the <configuration> tags:
<property>
<name>fs.defaultFS</name>
<value>hdfs://hofstadter/</value>
<description>NameNode URI</description>
</property>
hdfs-site.xml
Configure HDFS replication and storage directories:
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Number of block replicas</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
<description>NameNode metadata storage</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
<description>DataNode block storage</description>
</property>
Create these directories on all nodes:
mkdir -p /home/hadoop/hdfs/namenode /home/hadoop/hdfs/datanode
chown -R hadoop:hadoop /home/hadoop/hdfs
Set dfs.replication to match your number of DataNodes or fewer. A three-node cluster can use replication factor 3; smaller clusters should lower this value.
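The replication rule above can be checked mechanically against the worker list file (etc/hadoop/slaves in Hadoop 2.x). This is a minimal sketch: count_workers and replication_ok are hypothetical helper names, and the file is assumed to hold one hostname per line with optional # comments.

```shell
# count_workers FILE
# Count hostnames in a worker list, ignoring blank lines and # comments.
count_workers() {
  awk '{ sub(/#.*/, "") } NF { n++ } END { print n + 0 }' "$1"
}

# replication_ok FACTOR FILE
# Succeed when the replication factor does not exceed the number of workers.
replication_ok() {
  local factor=$1
  local workers
  workers=$(count_workers "$2")
  [ "$factor" -le "$workers" ]
}
```

For example, replication_ok 3 "$HADOOP_COMMON_HOME/etc/hadoop/slaves" succeeds for a three-line worker file and fails if a node is removed from it.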
yarn-site.xml
Configure YARN resource management:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hofstadter</value>
<description>ResourceManager host</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Shuffle service for MapReduce</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
<description>Memory available per NodeManager in MB</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>512</value>
<description>Minimum memory per container in MB</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>4096</value>
<description>Maximum memory per container in MB</description>
</property>
Adjust memory values based on the RAM actually available on each node. A common rule of thumb: yarn.nodemanager.resource.memory-mb = total RAM minus about 2 GB reserved for the operating system and the Hadoop daemons. For a node with 8 GB RAM, that gives 8192 - 2048 = 6144 MB.
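The rule of thumb is simple enough to script. The function below is an illustrative sketch only: yarn_nm_memory_mb is a hypothetical name, and the 2048 MB default overhead is the guide's rule of thumb rather than a Hadoop setting.

```shell
# yarn_nm_memory_mb TOTAL_MB [OVERHEAD_MB]
# Print a candidate value for yarn.nodemanager.resource.memory-mb:
# total RAM minus a reserved overhead (default 2048 MB, an assumption).
yarn_nm_memory_mb() {
  local total_mb=$1
  local overhead_mb=${2:-2048}
  if [ "$total_mb" -le "$overhead_mb" ]; then
    echo "error: ${total_mb} MB leaves nothing after ${overhead_mb} MB overhead" >&2
    return 1
  fi
  echo $(( total_mb - overhead_mb ))
}
```

On Linux, the total can be read with free -m | awk '/^Mem:/ {print $2}'; the kernel usually reports slightly less than the nominal RAM size, so the result will be a little under the round figure.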
mapred-site.xml
Copy the template and configure MapReduce to use YARN:
cp $HADOOP_COMMON_HOME/etc/hadoop/mapred-site.xml.template \
$HADOOP_COMMON_HOME/etc/hadoop/mapred-site.xml
Add to mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>Use YARN for MapReduce</description>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hofstadter:10020</value>
<description>Job history server address</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hofstadter:19888</value>
<description>Job history web UI</description>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>512</value>
<description>Memory per map task</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>512</value>
<description>Memory per reduce task</description>
</property>
These map/reduce memory settings should not exceed yarn.scheduler.maximum-allocation-mb.
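To see how these numbers interact, the sketch below checks a task size against the scheduler maximum and estimates how many containers of one size fit on a node. Both function names are hypothetical, and the estimate looks at memory only: it ignores vcores and the container used by the MapReduce ApplicationMaster.

```shell
# fits_scheduler_max TASK_MB MAX_MB
# Succeed when a task's memory request fits under
# yarn.scheduler.maximum-allocation-mb.
fits_scheduler_max() {
  [ "$1" -le "$2" ]
}

# max_containers NODE_MB CONTAINER_MB
# Memory-only upper bound on concurrent containers per NodeManager.
max_containers() {
  echo $(( $1 / $2 ))
}
```

With the values in this guide, max_containers 4096 512 gives 8 containers per node, and fits_scheduler_max 512 4096 succeeds.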
slaves (renamed workers in Hadoop 3)
Edit etc/hadoop/slaves and list all DataNode hostnames:
hofstadter
snell
biot
The NameNode/ResourceManager can also run DataNode/NodeManager services (as shown here), though production deployments often dedicate the NameNode to metadata management only.
Distribute Configuration to All Nodes
Use rsync to copy Hadoop configuration to all DataNodes:
cd /home/hadoop
for node in snell biot; do
echo "Syncing to $node..."
rsync -avxz --delete hadoop/ hadoop@$node:/home/hadoop/hadoop/
done
Verify on each node:
ssh hadoop@snell ls /home/hadoop/hadoop/etc/hadoop/core-site.xml
ssh hadoop@biot ls /home/hadoop/hadoop/etc/hadoop/core-site.xml
Format the NameNode
On the NameNode only, initialize HDFS metadata:
hdfs namenode -format
This creates the initial fsimage. Do this only once. Re-formatting destroys all HDFS data and cannot be undone.
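Because an accidental re-format is so costly, a small guard around the command can be worthwhile. The sketch below relies on the fact that a formatted NameNode directory contains current/VERSION; the function names are hypothetical, and the directory to pass is whatever dfs.namenode.name.dir points at.

```shell
# namenode_is_formatted NAME_DIR
# Succeed when NAME_DIR already holds NameNode metadata.
namenode_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# safe_format NAME_DIR
# Format only when no existing metadata is found (hypothetical wrapper).
safe_format() {
  local name_dir=$1
  if namenode_is_formatted "$name_dir"; then
    echo "refusing to format: $name_dir already holds NameNode metadata" >&2
    return 1
  fi
  hdfs namenode -format
}
```

On this guide's layout the call would be safe_format /home/hadoop/hdfs/namenode.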
Start the Cluster
Option 1: Individual Service Control
On the NameNode:
hadoop-daemon.sh start namenode
On each DataNode:
hadoop-daemon.sh start datanode
On the ResourceManager (usually the NameNode):
yarn-daemon.sh start resourcemanager
On each NodeManager:
yarn-daemon.sh start nodemanager
These scripts live in $HADOOP_COMMON_HOME/sbin and start a single daemon in the background on the local node. The hdfs --daemon and yarn --daemon forms seen in newer guides were introduced in Hadoop 3 and are not available in 2.x.
Option 2: Use Cluster Start Scripts
From the NameNode, start HDFS across all nodes listed in slaves:
start-dfs.sh
This command uses SSH to launch the NameNode, the SecondaryNameNode, and the DataNode services. Similarly, start YARN:
start-yarn.sh
Run these scripts from the NameNode, since that is where the passwordless SSH keys were set up.
Verify Cluster Status
Check HDFS health:
hdfs dfsadmin -report
Sample output:
Configured Capacity: 335,854,977,024 (312.59 GB)
Present Capacity: 335,854,977,024 (312.59 GB)
DFS Remaining: 335,854,977,024 (312.59 GB)
DFS Used: 0 (0 B)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Live datanodes (3):
Name: 10.0.3.30:50010 (snell)
Hostname: snell
...
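For scripted health checks, the live DataNode count can be pulled out of the report. This is a sketch that assumes the "Live datanodes (N):" line format shown in the sample output; live_datanodes is a hypothetical helper name.

```shell
# live_datanodes
# Read `hdfs dfsadmin -report` output on stdin and print the number
# from the "Live datanodes (N):" line. Prints nothing if the line is absent.
live_datanodes() {
  sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*$/\1/p' | head -n 1
}
```

Typical use is hdfs dfsadmin -report | live_datanodes, comparing the result with the number of hosts in the worker list.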
Check YARN node status:
yarn node -list
Sample output:
Total Nodes:3
Node-Id Node-State Node-Http-Address Number-of-Running-Containers
hofstadter:43469 RUNNING hofstadter:8042 0
snell:57039 RUNNING snell:8042 0
biot:52834 RUNNING biot:8042 0
All nodes should report RUNNING state.
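The same check can be automated by counting RUNNING rows. The sketch assumes the column layout of the sample output, with the state in the second column; running_nodes is a hypothetical helper name.

```shell
# running_nodes
# Read `yarn node -list` output on stdin and print how many rows
# have RUNNING in the Node-State column.
running_nodes() {
  awk '$2 == "RUNNING" { n++ } END { print n + 0 }'
}
```

Typical use is yarn node -list | running_nodes. Nodes in other states only appear when the -all option is passed to yarn node -list.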
Web UIs
Access cluster dashboards via browser:
- HDFS NameNode: http://hofstadter:50070/ (shows block inventory, replication status, dead nodes; port 9870 applies to Hadoop 3 only)
- YARN ResourceManager: http://hofstadter:8088/ (shows running and completed applications, node resources)
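- Job History Server: http://hofstadter:19888/ (MapReduce job logs; available once the history server has been started with mr-jobhistory-daemon.sh start historyserver)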
Run a Test Job
Create HDFS home directory:
hadoop fs -mkdir -p /user/hadoop
Upload test data:
hadoop fs -put $HADOOP_COMMON_HOME/etc/hadoop test-input
Run the built-in grep example:
hadoop jar $HADOOP_COMMON_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
grep test-input test-output 'dfs[a-z.]*'
Monitor the job at http://hofstadter:8088/. Once complete, check results:
hadoop fs -ls test-output/
hadoop fs -cat test-output/part-r-00000
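The output lists each distinct string that matched the regex together with the number of times it occurred, sorted by decreasing count. It does not print the matching lines themselves.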
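To get a feel for what the example job computes, the same counting can be approximated locally with standard tools. This is only a rough local analogue, not how the MapReduce job is implemented, and local_grep_count is a hypothetical name.

```shell
# local_grep_count PATTERN [FILE...]
# Count occurrences of each distinct match of an extended regex and
# print "count<TAB>match", highest count first. Reads stdin without files.
local_grep_count() {
  local pattern=$1
  shift
  grep -ohE -- "$pattern" "$@" \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2 \
    | awk '{ print $1 "\t" $2 }'
}
```

Running local_grep_count 'dfs[a-z.]*' "$HADOOP_COMMON_HOME"/etc/hadoop/* on the NameNode gives a result to compare with the job output.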
Stop the Cluster
Option 1: Individual Service Control
Stop services in the reverse of the start order, YARN first and then HDFS:
# On each NodeManager
yarn-daemon.sh stop nodemanager
# On ResourceManager
yarn-daemon.sh stop resourcemanager
# On each DataNode
hadoop-daemon.sh stop datanode
# On NameNode
hadoop-daemon.sh stop namenode
Option 2: Use Cluster Stop Scripts
From the NameNode:
stop-yarn.sh
stop-dfs.sh
These scripts use SSH to stop services across all nodes in slaves.
Debugging and Troubleshooting
Check logs on any node:
ls $HADOOP_COMMON_HOME/logs/
tail -f $HADOOP_COMMON_HOME/logs/hadoop-hadoop-namenode-hofstadter.log
tail -f $HADOOP_COMMON_HOME/logs/hadoop-hadoop-datanode-snell.log
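When there are many log files, a quick count of error lines shows where to look first. The sketch below assumes the default log4j layout, in which the level appears as a separate word such as ERROR or FATAL; count_log_problems is a hypothetical helper name.

```shell
# count_log_problems LOG_DIR
# Print the number of ERROR or FATAL lines across all *.log files
# in LOG_DIR. Prints 0 when the directory has no .log files.
count_log_problems() {
  local dir=$1
  grep -hE ' (ERROR|FATAL) ' "$dir"/*.log 2>/dev/null | wc -l | tr -d ' '
}
```

Typical use is count_log_problems "$HADOOP_COMMON_HOME/logs" on each node after a failed start.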
Connection Refused / SSH Failures
- Verify SSH is running on all nodes: systemctl status ssh (or sshd on some systems)
- Check that the hadoop user can SSH without a password: ssh hadoop@snell echo test
- Ensure firewall rules permit SSH (port 22) and the Hadoop ports (8020, 50070, 8088, 50010)
Port Conflicts
- NameNode: 8020 (RPC, the default when fs.defaultFS names no port), 50070 (web UI)
- DataNode: 50010 (data transfer), 50020 (IPC)
- ResourceManager: 8032 (RPC), 8088 (web UI)
- NodeManager: 8042 (web UI), 8040 (localizer IPC)
Check if ports are in use:
netstat -tlnp | grep :8020
lsof -i :8088
Java Not Found
- Verify JAVA_HOME is set: echo $JAVA_HOME
- Confirm the path exists: ls $JAVA_HOME/bin/java
- Source ~/.bashrc again if recently added: source ~/.bashrc
Blocks Under-Replicated
- Check that dfs.replication does not exceed the number of DataNodes
- Verify all DataNodes are healthy: hdfs dfsadmin -report
- Wait for the NameNode to finish re-replicating blocks; this can take minutes on large datasets
NameNode Safemode
On startup, HDFS enters safemode while blocks are inventoried. Check status:
hdfs dfsadmin -safemode get
Force exit (only if blocks are verified):
hdfs dfsadmin -safemode leave
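In start-up scripts it is usually safer to wait for safemode to end than to force it. hdfs dfsadmin -safemode wait blocks until the NameNode leaves safemode on its own. For a non-blocking check, the sketch below tests the text printed by -safemode get, assuming it reads "Safe mode is OFF" or "Safe mode is ON"; safemode_is_off is a hypothetical helper name.

```shell
# safemode_is_off STATUS_TEXT
# Succeed when the given `hdfs dfsadmin -safemode get` output
# reports that safemode is off.
safemode_is_off() {
  case $1 in
    *"Safe mode is OFF"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Typical use is safemode_is_off "$(hdfs dfsadmin -safemode get)" && echo "HDFS is writable".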
Full Cluster Reset
To reset from scratch:
# Stop all services
stop-dfs.sh
stop-yarn.sh
# Clean data on NameNode
rm -rf /home/hadoop/hdfs /home/hadoop/hadoop/logs
# Clean data on all DataNodes
for node in snell biot; do
ssh hadoop@$node 'rm -rf /home/hadoop/hdfs /home/hadoop/hadoop/logs'
done
# Re-format and restart
hdfs namenode -format
start-dfs.sh
start-yarn.sh
Configuration Reference
| Parameter | File | Purpose | Example |
|---|---|---|---|
| fs.defaultFS | core-site.xml | NameNode address | hdfs://hofstadter/ |
| dfs.replication | hdfs-site.xml | Block replication factor | 3 |

I believe if you use /user/hadoop instead you can directly access folders inside it similar to your home directory in HDFS. Ex: /user/hadoop/input can be just referenced as input while accessing HDFS.
On my cluster:
$ hdfs dfs -ls /user/hadoop/
ls: `/user/hadoop/': No such file or directory
and
$ hdfs dfs -put hadoop-2.5.0.tar.gz hadoop-2.5.0.tar.gz
put: `hadoop-2.5.0.tar.gz': No such file or directory
It seems that the /user/hadoop/ directory is not automatically created.
After it is created, it can be just referenced as you posted. Nice!
BTW: the path is currently hardcoded in HDFS: https://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java?view=markup
@Override
public Path getHomeDirectory() {
  return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
}
Thank you very much for this guide. The Apache docs are not easy to follow.
Dear Eric Zhiqiang,
I am very new to Hadoop, and this blog post explains the multi-node cluster setup very nicely. However, I have some questions before starting the multi-node cluster setup.
1. Can multiple virtual machines (hosted on a single ESXi server) be used as nodes of a multi-node Hadoop cluster?
2. As per your suggestion, we first have to do the Hadoop configuration on a specific node (say, a client node) and then duplicate the Hadoop configuration files to all nodes. Can we use the NameNode or any DataNode as the client node, or do we have to use a dedicated node as the client node?
3. Is it necessary to write the NameNode host name in the slaves file if I want to run my task tracker service only on DataNodes?
4. I am planning to use RHEL 6.4 on all my nodes and Hadoop version hadoop-2.5.1.tar.gz. Can we use the in-box OpenJDK with the version below?
java version “1.7.0_09-icedtea”
OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)
Hi md,
1. Yes.
2. I used the NameNode in this tutorial as the “client node”.
3. No, not if you do not want to run a DataNode on the NameNode node.
4. I suggest using the Oracle JVM. But 1.7.0_09-icedtea seems to be reported as “Good” too: https://wiki.apache.org/hadoop/HadoopJavaVersions .
Thanks Eric Zhiqiang for your quick response.
It seems the DataNodes are not identified by the NameNode.
1. Some problems noted in http://www.highlyscalablesystems.com/3022/pitfalls-and-lessons-on-configuing-and-tuning-hadoop/ may still apply. You may check them.
2. Another common problem for me is that the firewalls on these nodes block the network traffic. If the nodes are in a controlled and trusted cluster, you may disable firewalld (on F20: https://www.systutorials.com/qa/692/how-to-totally-disable-firewall-or-iptables-on-fedora-20 ) or iptables (earlier releases: http://www.fclose.com/3837/flushing-iptables-on-fedora/ ).
3. You may also log on to the nodes running the DataNode and use `ps aux | grep java` to check whether the DataNode daemon is running.
Hope these tips help.
Hi Eric,
I am getting the following error when trying to check the HDFS status on the NameNode or a DataNode:
[hadoop@namenode ~]$ hdfs dfsadmin -report
14/11/29 10:45:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
————————————————-
Can you please suggest a solution?
Thanks in advance.
Hi Eric,
Thank you so much for your help. But the firewall is already off on my machine. I performed the following steps, and the problem got resolved:
1. Stop the cluster
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml (on Hadoop 2.x, dfs.datanode.data.dir in etc/hadoop/hdfs-site.xml)
3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
4. Restart the cluster
Courtesy:stackoverflow.com/questions/10097246/no-data-nodes-are-started
Hi md,
Noted. Thanks for sharing.
NameNode metadata is critical for the whole HDFS cluster. If you use a single node for the NameNode, keep replicas of the metadata on 2 or more separate disks for higher data reliability.
Please check this post for more information on how to replicate and set up 2-disk metadata storage for the NameNode:
https://www.systutorials.com/qa/1315/add-new-hdfs-namenode-metadata-directory-existing-cluster
I had better luck defining JAVA_HOME in hadoop-env.sh
That’s the better choice if you use a Java version different from the system-wide one.
Thanks for the detailed info. I believe everyone likes the tutorial and the way you took us through each individual step. Kudos
Hi Eric,
Very useful blog. Just wondering about containers: do you have more details on them? For example:
1. If one node has TWO containers, can one map-reduce job spawn up to two tasks only on that node? Or can each container run more than one task?
2. Do you know the internals of YARN, in particular, which part of the YARN scripts actually spawns off the different containers / tasks?
Thanks
C.Chee
You may check the “Architecture of Next Generation Apache Hadoop MapReduce
Framework”:
https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf
The “Resource Model” discussed the mode for YARN v1.0:
For the implementation, you may need to dive into the source code tree.
Thanks for the tutorial. I have one question and one issue requiring your help.
Is the value mapreduce_shuffle or mapreduce.shuffle?
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>shuffle service for MapReduce</description>
</property>
I configured Hadoop 2.5.2 following your guideline. HDFS is configured and the DataNodes are reporting. yarn node -list runs and reports the nodes in my cluster. I am getting the exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted at 23% of the Map task.
Could you please help me to get out of this exception.
15/02/05 12:00:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
15/02/05 12:00:36 INFO client.RMProxy: Connecting to ResourceManager at 101-master/192.168.0.18:8032
15/02/05 12:00:42 INFO input.FileInputFormat: Total input paths to process : 1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: number of splits:1
15/02/05 12:00:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423137573492_0001
15/02/05 12:00:44 INFO impl.YarnClientImpl: Submitted application application_1423137573492_0001
15/02/05 12:00:44 INFO mapreduce.Job: The url to track the job: http://101-master:8088/proxy/application_1423137573492_0001/
15/02/05 12:00:44 INFO mapreduce.Job: Running job: job_1423137573492_0001
15/02/05 12:00:58 INFO mapreduce.Job: Job job_1423137573492_0001 running in uber mode : false
15/02/05 12:00:58 INFO mapreduce.Job: map 0% reduce 0%
15/02/05 12:01:17 INFO mapreduce.Job: map 8% reduce 0%
15/02/05 12:01:20 INFO mapreduce.Job: map 12% reduce 0%
15/02/05 12:01:23 INFO mapreduce.Job: map 16% reduce 0%
15/02/05 12:01:26 INFO mapreduce.Job: map 23% reduce 0%
15/02/05 12:01:30 INFO mapreduce.Job: Task Id : attempt_1423137573492_0001_m_000000_0, Status : FAILED
Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr
ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Hi Tariq,
It is “mapreduce_shuffle”.
Check: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html
I have no idea what the reason for the error is.
The exit status 134 may give some information (check a discussion here: https://groups.google.com/forum/#!topic/comp.lang.java.machine/OibTSkLJ-bY ). In that post, the JVM is from Oracle. Your JVM seems to be OpenJDK. You may try the Oracle JVM.
Dear Eric,
I am having problems with the YARN/MapReduce configuration. I have asked these questions on Stack Overflow. Could you please have a look at them and answer them, if possible? Thanks
http://stackoverflow.com/questions/28586561/yarn-container-lauch-failed-exception-and-mapred-site-xml-configuration
http://stackoverflow.com/questions/28609639/yarn-container-configuration-for-javacv
Regards,
Hi Tariq, just noticed your comment.
I did experience the `exitCode=134` problem once.
My solution is to add the following setting to `hadoop/etc/hadoop/yarn-site.xml`:
What I did is only this.
You may check how much memory your program uses in one task and set the value to be larger than that.
Thank you for the wonderful tutorial; as a beginner I found it really easy. I made a cluster with two slave nodes and one master node. I have a question: how do I check whether map/reduce tasks are working on the slave nodes? Are there specific files to check in the logs directory, and if yes, which ones? yarn node -list is showing 3 nodes with status RUNNING.
HI Eric,
Please help me. I am unable to run my first basic example; I am getting the message below:
“15/03/20 20:35:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/hadoop-hduser/mapred/staging/hduser/.staging/job_201503201943_0007
15/03/20 20:35:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
Exception in thread “main” org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)”
Thanks
Shekar M
Make sure `hdfs dfs -ls /usr/local/hadoop/input` exists. As Hadoop prints
Thanks for the great tutorial. It’s the most up-to-date information I’ve found. I have a couple of questions.
You don’t mention the masters file, which some other cluster configuration blogs show. Should we add a masters file to go along with the included slaves file? Also, your script will distribute the modified slaves files to the slave nodes. Do the slave nodes need the modified slaves file, or is it ignored on the slave nodes?
When we format the dfs with “hdfs namenode -format” should this be done on all nodes, or just the master?
About the masters file: the masters file is for the Secondary NameNodes ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode ). In 2.5.0, it is better to use the “dfs.namenode.secondary.http-address” property ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ). I am not sure whether the masters file still works for specifying the Secondary NameNodes. You may give it a try, and you are welcome to share your findings here.
FYI: the start-dfs.sh starts the secondary namenodes:
The `hdfs getconf` command gets the addresses of the secondary namenodes. A quick try shows me that the masters file has no effect and the address from `dfs.namenode.secondary.http-address` is used.
The slave nodes do not need the slaves file. You can skip it.
For `hdfs namenode -format`, it only needs to be done on the master.
Eric,
I really appreciate your efforts in publishing this article and answering queries from Hadoop users. Before coming across your article, I struggled for two weeks to run MapReduce code successfully in a multi-node environment. The MapReduce job used to hang indefinitely. I was missing the “yarn.resourcemanager.hostname” parameter in the “yarn-site.xml” config file. Your article helped me find this missing piece, and I could run all my MapReduce jobs successfully.
Thanks a lot.
Great to hear that! :)
Now that I’ve got a working Hadoop cluster I’d like to install HBase and Zookeeper too. Do you know any good tutorials for installing HBase and Zookeeper?
You may try the official one first: https://hbase.apache.org/apache_hbase_reference_guide.pdf . I did not try it out myself. But it looks pretty well written.
Hi, can we add a NameNode to a running cluster?
If yes, what would be the steps?
It is not covered in this tutorial. To make the NameNode highly available, with more than one NameNode to avoid a single point of failure, you may consider 2 choices:
HDFS High Availability using a shared NFS directory to share edit logs between the Active and Standby NameNodes: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
HDFS High Availability Using the Quorum Journal Manager: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
Hi,
What is the standard OS (in Linux) on which I can perform the above steps to install it?
Thanks!
Dev
The tutorial does not rely on any specific Linux distro. CentOS 7, Fedora 12+, Ubuntu 12+ and other distros should be good enough as long as the needed tools are installed.