34 Comments

  1. I believe if you use /user/hadoop instead you can directly access folders inside it similar to your home directory in HDFS.

    Ex:
    /user/hadoop/input can be just referenced as input while accessing HDFS.

    1. On my cluster:

      $ hdfs dfs -ls /user/hadoop/
      ls: `/user/hadoop/': No such file or directory

      and

      $ hdfs dfs -put hadoop-2.5.0.tar.gz hadoop-2.5.0.tar.gz
      put: `hadoop-2.5.0.tar.gz': No such file or directory

      It seems that unless the /user/hadoop/ directory is not automatically created.

      After it is created, it can be just referenced as you posted. Nice!

      BTW: the path is currently hardcoded in HDFS: https://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java?view=markup

      171 @Override
      172 public Path getHomeDirectory() {
      173 return makeQualified(new Path("/user/" + dfs.ugi.getShortUserName()));
      174 }

  2. Dear Eric Zhiqiang,
    I am very new to hadoop, and this blog instruction explain the multinode cluster setup in very nice way, however i have some query before starting the multinode cluster setup.
    1. Can multiple vitual machines (hosted on a single exsi server) be used as nodes of multi node hadoop cluster.
    2. As per your suggestion, first we have to do hadoop configuration on a specific node(say client node) then have to Duplicate Hadoop configuration files to all nodes,
    so can we used NameNode or any datanode as the client node or have to use a dedicated node as client node
    3. Is it necessary to write name node host name in slaves file, if i want to run my task tracker service only on datanodes.
    4. I am planning to use RHEL 6.4 on all my nodes and hadoop version hadoop-2.5.1.tar.gz, so can we use inbox open jdk with below version:
    java version “1.7.0_09-icedtea”
    OpenJDK Runtime Environment (rhel-2.3.4.1.el6_3-x86_64)
    OpenJDK 64-Bit Server VM (build 23.2-b09, mixed mode)

  3. It seems the DataNodes are not identified by the NameNode.

    1. Some problems noted in http://www.highlyscalablesystems.com/3022/pitfalls-and-lessons-on-configuing-and-tuning-hadoop/ may still validate. You may check them.

    2. Another common problems for me is that the firewalls on these nodes block the network traffic. If the nodes are in a controlled and trusted cluster, you may disable firewallD (on F20: https://www.systutorials.com/qa/692/how-to-totally-disable-firewall-or-iptables-on-fedora-20 ) or iptables (earlier releases: http://www.fclose.com/3837/flushing-iptables-on-fedora/ ).

    3. You may also log on the nodes running DataNode and use `ps aux | grep java` to check whether the DataNode daemon is running.

    Hope these tips help.

  4. Hi Eric,

    I am getting the following error when trying the check the HDFS status on namenode or datanode:

    [hadoop@namenode ~]$ hdfs dfsadmin -report
    14/11/29 10:45:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    Configured Capacity: 0 (0 B)
    Present Capacity: 0 (0 B)
    DFS Remaining: 0 (0 B)
    DFS Used: 0 (0 B)
    DFS Used%: NaN%
    Under replicated blocks: 0
    Blocks with corrupt replicas: 0
    Missing blocks: 0

    ————————————————-
    Can you please suggest me a solution.
    Thanks in advance.

  5. Hi Eric,

    Thank u so much for your help. But firewall is already off on my machine. I performed the following steps, and the problem got resolved:
    1. Stop the cluster
    2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml
    3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!)
    4. Restart the cluster
    Courtesy:stackoverflow.com/questions/10097246/no-data-nodes-are-started

  6. Thanks for the detailed info and i believe every one likes the tutorial and the way you took us on each and every individual step. Kudos

  7. Hi Eric,

    Very useful blog. Just wondering about containers – do you have more details on them. For example:
    1. If one node has TWO containers, can one map-reduce job spawn up to two tasks only on that node? Or can each container have more than one tasks each?
    2. Do you know the internals of Yarn, in particular, which part of the Yarn script actually spawn off different container / tasks?

    Thanks
    C.Chee

    1. You may check the “Architecture of Next Generation Apache Hadoop MapReduce
      Framework”:

      https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf

      The “Resource Model” discussed the mode for YARN v1.0:

      The Scheduler models only memory in YARN v1.0. Every node in the system is considered to be composed of multiple containers of minimum size of memory (say 512MB or 1 GB). The ApplicationMaster can request any container as a multiple of the minimum memory size.

      Eventually we want to move to a more generic resource model, however, for Yarn v1 we propose a rather straightforward model:

      The resource model is completely based on memory (RAM) and every node is made up discreet chunks of memory.

      For the implementation, you may need to dive into the source code tree.

  8. Pingback: Installation of Hadoop | HadoopYoda Blog
  9. Thaanks for the tutorial. I have one question and one issu requiring your help.

    Is the value mapreduce_shuffle or mapreduce.shuffle?

    yarn.nodemanager.aux-services
    mapreduce_shuffle
    shuffle service for MapReduce

    I configured Hadoop 2.5.2 following your guideline. HDFS is confgiured and datanodes are reporting. yarn node -list is running and reports the nodes in my cluster. I am getting the Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted at 23% of the Map task.

    Could you please help me to get out of this exception.

    15/02/05 12:00:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    15/02/05 12:00:36 INFO client.RMProxy: Connecting to ResourceManager at 101-master/192.168.0.18:8032
    15/02/05 12:00:42 INFO input.FileInputFormat: Total input paths to process : 1
    15/02/05 12:00:43 INFO mapreduce.JobSubmitter: number of splits:1
    15/02/05 12:00:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1423137573492_0001
    15/02/05 12:00:44 INFO impl.YarnClientImpl: Submitted application application_1423137573492_0001
    15/02/05 12:00:44 INFO mapreduce.Job: The url to track the job: http://101-master:8088/proxy/application_1423137573492_0001/
    15/02/05 12:00:44 INFO mapreduce.Job: Running job: job_1423137573492_0001
    15/02/05 12:00:58 INFO mapreduce.Job: Job job_1423137573492_0001 running in uber mode : false
    15/02/05 12:00:58 INFO mapreduce.Job: map 0% reduce 0%
    15/02/05 12:01:17 INFO mapreduce.Job: map 8% reduce 0%
    15/02/05 12:01:20 INFO mapreduce.Job: map 12% reduce 0%
    15/02/05 12:01:23 INFO mapreduce.Job: map 16% reduce 0%
    15/02/05 12:01:26 INFO mapreduce.Job: map 23% reduce 0%
    15/02/05 12:01:30 INFO mapreduce.Job: Task Id : attempt_1423137573492_0001_m_000000_0, Status : FAILED
    Exception from container-launch: ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr

    ExitCodeException exitCode=134: /bin/bash: line 1: 29182 Aborted (core dumped) /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1423137573492_0001/container_1423137573492_0001_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 192.168.0.19 41158 attempt_1423137573492_0001_m_000000_0 2 > /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stdout 2> /home/ubuntu/hadoop/logs/userlogs/application_1423137573492_0001/container_1423137573492_0001_01_000002/stderr

    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

    1. Hi Tariq,

      It is “mapreduce_shuffle”.

      Check: https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html

      I have no idea what’s the reason for the error.

      The exit status 134 may tell some information (check a discussion here: https://groups.google.com/forum/#!topic/comp.lang.java.machine/OibTSkLJ-bY ). In this post, the JVM is from Oracle. Your JVM seems the OpenJDK. You may try Oracle JVM.

    1. Hi Tariq, just noticed your comment.

      I do ever experience the `exitCode=134` problem once.

      My solution is to add the following setting to `hadoop/etc/hadoop/yarn-site.xml`:

      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>2048</value>
      </property>
      

      What I did is only this.

      You may check how much memory your program uses in one task and set the value to be larger than that.

  10. Thank you for the wonderful tutorial since I am a beginner it was really easy. I made a cluster with two slave nodes and one master node. I had a doubt how do I check whether map/reduce tasks are working on slave nodes.Are there specific files to check in the logs directory if yes then which ones. The yarn node -list is showing 3 nodes with status running.

  11. HI Eric,

    Please help me ,I am unable to run my first basic example,i am getting below message
    “15/03/20 20:35:13 INFO mapred.JobClient: Cleaning up the staging area hdfs://localhost:54310/usr/local/hadoop/tmp/hadoop-hduser/mapred/staging/hduser/.staging/job_201503201943_0007
    15/03/20 20:35:13 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
    Exception in thread “main” org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://localhost:54310/usr/local/hadoop/input
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:962)”

    Thanks
    Shekar M

  12. Thanks for the great tutorial. It’s the most up-to-date information I’ve found. I have a couple of questions.

    You don’t mention the masters file, which some other cluster configuration blogs show. Should we add a masters file to go along with the included slaves file? Also, your script will distribute the modified slaves files to the slave nodes. Do the slave nodes need the modified slaves file, or is it ignored on the slave nodes?

    When we format the dfs with “hdfs namenode -format” should this be done on all nodes, or just the master?

    1. About the masters file: the masters file is for the Secondary NameNodes ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#Secondary_NameNode ). In 2.5.0, better use the “dfs.namenode.secondary.http-address” property ( https://hadoop.apache.org/docs/r2.5.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml ). I am not sure whether the masters file still works for specifying the Secondary NameNodes. You may take a try and will be welcome to share your findings here.

      FYI: the start-dfs.sh starts the secondary namenodes:

      # secondary namenodes (if any)
      
      SECONDARY_NAMENODES=$($HADOOP_PREFIX/bin/hdfs getconf -secondarynamenodes 2>/dev/null)
      

      The `hdfs getconf` command gets the addresses of the secondary namenodes. A quick try shows me that the masters file has no effect and the address from `dfs.namenode.secondary.http-address` is used.

      The slave nodes do not need the slaves file. You can skip it.

      For `hdfs namenode -format`, it only need to be done on the master.

  13. Eric,

    I really appreciate your efforts in publishing this article and answering queries from hadoop users. Before coming across to your posting/article, I struggled for two weeks to run mapreduce code successfully in multi-node environment. The mapreduce job used to hang indefinitely. I was missing the “yarn.resourcemanager.hostname” parameter, in “yarn-site.xml” config file. Your article helped me finding this missing piece and I could run all my mapreduce job successfully.

    Thanks a lot.

  14. Now that I’ve got a working Hadoop cluster I’d like to install HBase and Zookeeper too. Do you know any good tutorials for installing HBase and Zookeeper?

    1. It is not covered in this tutorial. To make NameNode high availability with more than one NameNode nodes to avoid the single point of failure, you may consider 2 choices:

      HDFS High Availability using a shared NFS directory to share edit logs between the Active and Standby NameNodes: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
      HDFS High Availability Using the Quorum Journal Manager: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html

  15. Hi,

    What will be the standard OS (in linux) where I can perform the above steps to install it.

    Thanks!
    Dev

    1. The tutorial does not reply on any specific Linux distro. CentOS 7, Fedora 12+, Ubuntu 12+ and more other distro should be good enough as long as the needed tools used are installed.

Leave a Reply

Your email address will not be published. Required fields are marked *