Installing Hadoop 1.x: A Complete Guide

Hadoop 1.x reached end-of-life in 2014 and is no longer maintained. This post documents a deprecated architecture for historical reference only.

For new deployments, use Hadoop 3.x+ with YARN, which offers significant improvements in resource management, multi-tenancy, and reliability. See the Apache Hadoop documentation for current versions.

For managed services, consider AWS EMR, Google Dataproc, Azure HDInsight, or Cloudera.

Architecture Overview

Hadoop consists of two core components:

HDFS (Hadoop Distributed File System): The distributed storage layer, loosely modeled after Google File System (GFS). Stores data in blocks across DataNodes with configurable replication.
MapReduce: A programming model and execution framework for batch processing. Hadoop’s implementation is an open-source variant of Google’s original MapReduce, used for processing large datasets across clusters.

The typical cluster topology uses:

NameNode: Manages the HDFS namespace and file system tree
JobTracker: Schedules MapReduce jobs and monitors TaskTrackers
DataNodes: Store HDFS blocks
TaskTrackers: Execute map and reduce tasks, normally on the same worker (slave) nodes that run the DataNodes

The JobTracker/TaskTracker layer was replaced by YARN in Hadoop 2.x, which separates cluster-wide resource management from per-application scheduling and monitoring.

Prerequisites

Linux cluster with passwordless SSH configured between nodes
Dedicated Linux user (typically hadoop) on each node
Java Development Kit installed on all nodes (Hadoop 1.x was documented against Java 6; newer JDKs are not guaranteed to work)

Setting Up SSH Key-Based Authentication

Enable the hadoop user to SSH between nodes without passwords:

# On the NameNode, as the hadoop user
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# Copy the public key to all DataNodes
for host in $(cat /home/hadoop/nodes); do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@$host
done

The nodes file contains one hostname or IP per line.
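
Before moving on, it can help to confirm that key-based login really works for every host. The following is a minimal sketch rather than part of Hadoop: the helper names read_nodes and check_ssh are made up for illustration, and it additionally tolerates blank lines and # comments in the nodes file, which the format described above does not require.

```shell
# read_nodes FILE: print one host per line, skipping blank lines and '#' comments
# (hypothetical helper, not shipped with Hadoop).
read_nodes() {
  sed -e 's/#.*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# check_ssh FILE: attempt a non-interactive login to every host in FILE.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_ssh() {
  failed=0
  for host in $(read_nodes "$1"); do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "hadoop@$host" true; then
      echo "ok      $host"
    else
      echo "FAILED  $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (on the NameNode, as the hadoop user):
#   check_ssh /home/hadoop/nodes
```

Any host reported as FAILED still needs its key copied or its ~/.ssh permissions fixed.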

Installing Java

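Hadoop 1.x was developed and documented against Java 6 (1.6.x), and current JDK releases are not guaranteed to work with it. Install the oldest JDK your distribution still packages; Java 8 is used here as an example:

# Ubuntu/Debian
sudo apt-get install openjdk-8-jdk

# RHEL/CentOS
sudo yum install java-1.8.0-openjdk-devel

# Or download a JDK archive manually and extract it to /home/hadoop/jdk/

If you install from packages, set JAVA_HOME in the later steps to the package's install directory instead of /home/hadoop/jdk.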

If installing manually, distribute to all nodes:

for host in $(cat /home/hadoop/nodes); do
  scp -r /home/hadoop/jdk hadoop@$host:/home/hadoop/
done

Installing Hadoop 1.x

Download Hadoop from the Apache Archive. Extract to all nodes:

# On NameNode
tar xzf hadoop-1.2.1.tar.gz -C /home/hadoop/
ln -s /home/hadoop/hadoop-1.2.1 /home/hadoop/hadoop

# Distribute to DataNodes
for host in $(cat /home/hadoop/nodes); do
  scp -r /home/hadoop/hadoop-1.2.1 hadoop@$host:/home/hadoop/
  ssh hadoop@$host "ln -s /home/hadoop/hadoop-1.2.1 /home/hadoop/hadoop"
done

Configuring Environment Variables

Add to ~/.bashrc for the hadoop user on all nodes:

export HADOOP_HOME="/home/hadoop/hadoop"
export JAVA_HOME="/home/hadoop/jdk"
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

Source the file:

source ~/.bashrc
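
A quick sanity check after sourcing the file can catch a mistyped path early. The sketch below is one way to do it; check_hadoop_env is a hypothetical helper name, and it only inspects paths without starting any daemon.

```shell
# check_hadoop_env: verify that the variables from ~/.bashrc point at real files.
# Hypothetical helper; returns non-zero if any check fails.
check_hadoop_env() {
  status=0
  if [ ! -x "${JAVA_HOME:-}/bin/java" ]; then
    echo "JAVA_HOME does not contain bin/java: ${JAVA_HOME:-unset}" >&2
    status=1
  fi
  if [ ! -x "${HADOOP_HOME:-}/bin/hadoop" ]; then
    echo "HADOOP_HOME does not contain bin/hadoop: ${HADOOP_HOME:-unset}" >&2
    status=1
  fi
  if [ ! -d "${HADOOP_CONF_DIR:-}" ]; then
    echo "HADOOP_CONF_DIR is not a directory: ${HADOOP_CONF_DIR:-unset}" >&2
    status=1
  fi
  return "$status"
}

# Usage: check_hadoop_env && echo "environment looks consistent"
```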

Configuring Hadoop

Edit configuration files in $HADOOP_HOME/conf/ on the NameNode, then distribute to all nodes.

hadoop-env.sh

Set Java and log paths:

export JAVA_HOME=/home/hadoop/jdk
export HADOOP_LOG_DIR=/home/hadoop/data/logs
export HADOOP_PID_DIR=/home/hadoop/pids

Create required directories on all nodes:

mkdir -p /home/hadoop/data/logs /home/hadoop/pids

core-site.xml

Set the NameNode address (example: 10.1.1.30):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.1.1.30:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/data/tmp</value>
  </property>
</configuration>
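
To double-check what a node will actually read from a site file, a small lookup helper can be handy. This is a sketch under a simplifying assumption: each <name> and <value> element sits on a single line, as in the examples in this guide. The function name get_prop is hypothetical, and this is not a general XML parser.

```shell
# get_prop NAME FILE: print the value of property NAME from a Hadoop *-site.xml.
# Assumes <name>...</name> and <value>...</value> each fit on one line.
get_prop() {
  awk -v want="$1" '
    /<name>/ {
      name = $0
      sub(/.*<name>/, "", name)
      sub(/<\/name>.*/, "", name)
    }
    /<value>/ && name == want {
      value = $0
      sub(/.*<value>/, "", value)
      sub(/<\/value>.*/, "", value)
      print value
      exit
    }
  ' "$2"
}

# Usage:
#   get_prop fs.default.name "$HADOOP_CONF_DIR/core-site.xml"
```

Running the same lookup on each node is a cheap way to confirm that every host points at the same NameNode address.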

hdfs-site.xml

Configure replication and storage paths:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/data/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/data/dfs/data</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>10.1.1.31:50090</value>
  </property>
</configuration>

Create directories on all nodes:

mkdir -p /home/hadoop/data/dfs/name /home/hadoop/data/dfs/data

mapred-site.xml

Configure the JobTracker (example: 10.1.1.2):

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.1.1.2:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/hadoop/data/mapred/local</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>

Create local directories on all DataNodes:

mkdir -p /home/hadoop/data/mapred/local
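
The two tasks.maximum settings cap how many tasks each TaskTracker runs at once, so cluster-wide capacity is that value multiplied by the number of worker nodes. A small sketch of the arithmetic, assuming every host in the slaves file runs one TaskTracker with identical settings (total_slots is a hypothetical helper name):

```shell
# total_slots SLAVES_FILE SLOTS_PER_NODE
# Counts non-empty, non-comment lines in the slaves file and multiplies.
total_slots() {
  nodes=$(grep -c -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" || true)
  echo $(( nodes * $2 ))
}

# Usage:
#   total_slots /home/hadoop/hadoop/conf/slaves 2
```

With four workers and a per-node maximum of 2, the cluster can run at most 8 map tasks and 8 reduce tasks concurrently.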

slaves

List all DataNode hostnames (one per line), excluding NameNode and JobTracker:

datanode1
datanode2
datanode3
datanode4

Remove or comment out localhost if present.

Distributing Configuration to All Nodes

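After configuring the NameNode, copy the conf directory to the other nodes. The loop below covers the hosts listed in slaves; if the JobTracker or secondary NameNode runs on a separate machine (10.1.1.2 and 10.1.1.31 in the examples above), copy conf there as well: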

for host in $(cat /home/hadoop/hadoop/conf/slaves); do
  scp -r /home/hadoop/hadoop/conf hadoop@$host:/home/hadoop/hadoop/
done
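
A stale copy of conf on one node is a common source of confusing startup errors. One way to spot it is to compare a checksum of the directory on each host with the NameNode's copy. This is only a sketch: SUM_CMD, conf_checksum and compare_conf are hypothetical names, and the remote half assumes the passwordless SSH configured earlier.

```shell
# Pipeline that reduces the current directory to one checksum.
# It is run unchanged on every host so the results are comparable.
SUM_CMD='find . -type f | LC_ALL=C sort | xargs cksum | cksum | cut -d" " -f1'

# conf_checksum DIR: checksum of all files under a local directory.
conf_checksum() {
  ( cd "$1" && sh -c "$SUM_CMD" )
}

# compare_conf: compare every worker's conf directory with the local one.
compare_conf() {
  conf=/home/hadoop/hadoop/conf
  expected=$(conf_checksum "$conf")
  for host in $(cat "$conf/slaves"); do
    actual=$(ssh "hadoop@$host" "cd $conf && $SUM_CMD")
    if [ "$actual" = "$expected" ]; then
      echo "match     $host"
    else
      echo "MISMATCH  $host"
    fi
  done
}

# Usage (on the NameNode): compare_conf
```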

Starting the Cluster

Format HDFS (First Time Only)

On the NameNode, format the namespace:

hadoop namenode -format

This initializes the HDFS directory structure. Do this only once—formatting an existing cluster erases all data.
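
Because an accidental second format is destructive, a small guard in front of the format step can be worthwhile. The sketch below assumes that a formatted NameNode directory contains a current/VERSION file; both function names are hypothetical.

```shell
# name_dir_is_formatted DIR: succeed if DIR already looks like a formatted
# NameNode directory. Assumption: formatting creates DIR/current/VERSION.
name_dir_is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# safe_to_format DIR: print a verdict; return non-zero when formatting
# would overwrite existing metadata.
safe_to_format() {
  if name_dir_is_formatted "$1"; then
    echo "refusing: $1 already holds NameNode metadata" >&2
    return 1
  fi
  echo "ok: $1 has not been formatted yet"
}

# Usage: run this first and continue to the format step only if it succeeds.
#   safe_to_format /home/hadoop/data/dfs/name
```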

Start HDFS

On the NameNode:

start-dfs.sh

Verify all DataNodes are registered:

hadoop dfsadmin -report

Wait a few seconds if not all nodes appear immediately.
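
Rather than re-running the report by hand, the wait can be scripted. This sketch assumes the report contains a summary line of the form "Datanodes available: N (M total, K dead)"; check the output on your cluster before relying on it. The function names are hypothetical.

```shell
# live_datanodes: read a dfsadmin report on stdin, print the live DataNode count.
# Assumes a summary line that starts with "Datanodes available: N".
live_datanodes() {
  sed -n 's/^Datanodes available: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# wait_for_datanodes EXPECTED [TRIES]: poll every 5 seconds until at least
# EXPECTED DataNodes are live, giving up after TRIES attempts (default 12).
wait_for_datanodes() {
  expected=$1
  tries=${2:-12}
  while [ "$tries" -gt 0 ]; do
    live=$(hadoop dfsadmin -report 2>/dev/null | live_datanodes)
    if [ "${live:-0}" -ge "$expected" ]; then
      echo "$live DataNodes registered"
      return 0
    fi
    tries=$((tries - 1))
    sleep 5
  done
  echo "only ${live:-0} of $expected DataNodes registered" >&2
  return 1
}

# Usage: wait_for_datanodes 4
```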

Start MapReduce

On the JobTracker node:

start-mapred.sh

Check job status:

hadoop job -list

Running a Test Job

Test with a built-in example. Copy input data:

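hadoop fs -mkdir input
# Copy only the XML files; putting the whole conf directory into an
# existing input directory would nest it as input/conf
hadoop fs -put /home/hadoop/hadoop/conf/*.xml input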

Run the grep example:

hadoop jar /home/hadoop/hadoop/hadoop-examples-1.*.jar \
  grep input output 'dfs[a-z.]+'

Retrieve results:

hadoop fs -cat output/* | head -20

Or copy locally:

hadoop fs -get output ./output-local
cat ./output-local/*

Stopping the Cluster

# On the JobTracker node: stop MapReduce first
stop-mapred.sh

# On the NameNode: then stop HDFS
stop-dfs.sh

Troubleshooting

Nodes Don’t Connect

Check network connectivity and firewall rules:

# Test SSH from NameNode
ssh hadoop@datanode1 "hostname"

# Check Hadoop ports are reachable
telnet datanode1 50075  # DataNode port
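
If telnet is not installed, the same reachability test can be scripted. The sketch below relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, so both must be present; port_open is a hypothetical helper name.

```shell
# port_open HOST PORT: succeed if a TCP connection opens within 3 seconds.
# The inner bash receives HOST as $0 and PORT as $1.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Usage: probe the DataNode web UI port on every worker in the slaves file.
#   for host in $(cat /home/hadoop/hadoop/conf/slaves); do
#     if port_open "$host" 50075; then
#       echo "$host reachable"
#     else
#       echo "$host blocked or down"
#     fi
#   done
```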

NameNode Safe Mode

If the NameNode won’t exit safe mode after startup:

hadoop dfsadmin -safemode leave

Safe mode can take minutes on large clusters. Check status with:

hadoop dfsadmin -safemode get

Permission Issues

Ensure the hadoop user owns all Hadoop directories:

sudo chown -R hadoop:hadoop /home/hadoop/data /home/hadoop/hadoop

Firewall Blocking Connections

On RHEL/CentOS systems, disable restrictive firewall rules for internal networks:

# Flush all iptables rules (warning: affects all traffic)
sudo iptables -F
# Do not restart the iptables service afterwards: a restart reloads
# the saved rules and undoes the flush

For production, use specific rules instead of flushing all policies.

Check Logs

Review logs on NameNode:

tail -f /home/hadoop/data/logs/hadoop-hadoop-namenode-*.log

DataNode logs:

tail -f /home/hadoop/data/logs/hadoop-hadoop-datanode-*.log
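
When a daemon has already failed, filtering the logs for problems is often quicker than tailing them. This sketch assumes the default log4j layout, in which the level (INFO, WARN, ERROR, FATAL) follows the timestamp and is surrounded by spaces; log_problems is a hypothetical helper name.

```shell
# log_problems FILE...: print ERROR and FATAL lines from Hadoop daemon logs.
# Returns non-zero when no such lines are found (grep's usual behaviour).
log_problems() {
  grep -E ' (ERROR|FATAL) ' "$@"
}

# Usage:
#   log_problems /home/hadoop/data/logs/hadoop-hadoop-*.log | tail -n 20
```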

Default Ports

Service	Port
NameNode Web UI	50070
DataNode Web UI	50075
JobTracker Web UI	50030
TaskTracker Web UI	50060
NameNode RPC	9000
JobTracker RPC	9001

Access web UIs at http://namenode-ip:50070/ and http://jobtracker-ip:50030/.

3 Comments

Nandu says:

Mar 24, 2015 at 8:37 am

Very nice article. Will follow some of the suggestions to set up a multi-node environment.
Thanks.

1. Eric Zhiqiang Ma says:
  
  Mar 24, 2015 at 5:25 pm
  
  Thanks. If you are considering installing Hadoop, it is better to use the 2.x versions: https://www.systutorials.com/hadoop-installation-tutorial-hadoop-2-x/ .
  
  1. Nandu says:
    
    Mar 24, 2015 at 6:00 pm
    
    Yes, Eric. I followed Hadoop 2.x and also posted my comments there after resolving the mapreduce hang issue by adding the missing resourcemanager.hostname in the yarn-site.xml file.
    
    Thanks again.