hadoop Archives - Page 2 of 3

QA

How to find the DataNodes that actually store a file in HDFS?

By Eric Ma | Mar 24, 2018

A file may be split into many chunks, with replicas stored on many DataNodes in HDFS. Now, the question is how to find the DataNodes that actually store a file in HDFS? You may use the fsck tool from the Hadoop HDFS utilities. Here is an example:
$ hadoop fsck /user/aaa/file.name -files -locations -blocks…

QA

How to write an /etc/fstab entry for --bind mounting?

By Eric Ma | Mar 24, 2018

How to write an /etc/fstab entry for --bind mounting, like
mount --bind /home/hadoop/hdfs/store-tmp /home/store/tmp
From man 8 mount: Since Linux 2.4.0 it is possible to remount part of the file hierarchy somewhere else. The call is
mount --bind olddir newdir
or, with the short option,
mount -B olddir newdir
or, as an fstab entry:
/olddir /newdir none bind
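As a small worked example, the fstab pattern from the man page can be applied to the two directories in the question. The sketch below is only illustrative: the helper name `fstab_bind_entry` is hypothetical, and the trailing `0 0` (the optional dump and fsck-order fields, which default to zero when omitted) is an addition that the quoted man page text does not show.

```python
def fstab_bind_entry(olddir, newdir):
    """Format an /etc/fstab line that bind-mounts olddir onto newdir.

    Hypothetical helper for illustration. The trailing "0 0" fills the
    optional dump and fsck-order fields with their default values.
    """
    for path in (olddir, newdir):
        if not path.startswith("/"):
            raise ValueError(f"expected an absolute path, got {path!r}")
        if any(ch.isspace() for ch in path):
            # fstab separates fields by whitespace; spaces in paths
            # would need \040 escapes, which this sketch does not handle.
            raise ValueError(f"whitespace in path is not supported: {path!r}")
    return f"{olddir} {newdir} none bind 0 0"


entry = fstab_bind_entry("/home/hadoop/hdfs/store-tmp", "/home/store/tmp")
print(entry)
```

The printed line is what would be appended to /etc/fstab; both directories must already exist before the mount is attempted.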

QA

What’s the difference between Reliability, Durability, and Availability for data storage system?

By Weiwei Jia | Mar 24, 2018 | Jan 7, 2020

These are important concepts in distributed systems such as the Hadoop Distributed File System, the Google File System, and so on. Answer from http://www.quora.com/Whats-the-difference-between-Reliability-Durability-and-Availability-for-data-storage-system The difference between durability and availability is fairly simple. Durability is about what happens when all power goes out everywhere. Has all data been written to stable storage that doesn’t require power (e.g. disk/flash), in…

QA

How to change number of replications of certain files in HDFS?

By Eric Ma | Mar 24, 2018

HDFS has a configuration in hdfs-site.xml that sets the global replication factor of blocks with the “dfs.replication” property. However, some “hot” files are accessed by many nodes. How to increase the number of replicas for these particular files in HDFS? You can set the replication number of a certain file to 10: hdfs…

QA

How to get logs of a specific time range on Linux?

By Eric Ma | Mar 24, 2018

The logs I am processing are Hadoop logs (log4j), in a format like:
2014-09-20 21:55:11,855 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 36
2014-09-20 21:55:11,863 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 55
2014-09-20 22:10:11,907 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now
2014-09-20 22:10:11,907 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Not doing static UID/GID mapping because ‘/etc/nfs.map’ does not exist.
Now, I…
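One possible way to select a time range from logs in this format is sketched below in Python. It is not necessarily the approach the original post takes. It assumes every record starts with a 23-character timestamp of the form shown above, and it treats lines without a timestamp (such as stack-trace continuations) as belonging to the preceding record. The function names and the chosen range are illustrative.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # e.g. 2014-09-20 21:55:11,855
TS_LENGTH = 23                      # number of characters in the timestamp


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def lines_in_range(lines, start, end):
    """Yield lines whose record timestamp t satisfies start <= t < end.

    Lines without a timestamp follow the decision made for the most
    recent timestamped line, so multi-line records stay together.
    """
    keep = False
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            keep = start <= ts < end
        if keep:
            yield line


sample = [
    "2014-09-20 21:55:11,855 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated user map size: 36",
    "2014-09-20 21:55:11,863 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Updated group map size: 55",
    "2014-09-20 22:10:11,907 INFO org.apache.hadoop.nfs.nfs3.IdUserGroup: Update cache now",
]
start = datetime(2014, 9, 20, 22, 0, 0)
end = datetime(2014, 9, 20, 23, 0, 0)
selected = list(lines_in_range(sample, start, end))
for line in selected:
    print(line)
```

To run it against a real file, pass an open file object instead of `sample`; the generator reads line by line, so large logs do not need to fit in memory.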

QA

Making Hadoop Java process heap larger?

By Eric Ma | Mar 24, 2018

In Hadoop 2.5.0, I use ‘ps -aux’ and find that the Java process has the option -Xmx1000m. However, my nodes have 32GB of memory. How to make the Hadoop Java process heap larger? In yarn-env.sh, you can find:
# For setting YARN specific HEAP sizes please use this
# Parameter and set appropriately
# YARN_HEAPSIZE=1000
In hadoop-env.sh, you can…

QA

How to set the number of mappers and reducers of Hadoop in command line?

By Eric Ma | Mar 24, 2018 | Feb 26, 2019

How to set the number of mappers and reducers of Hadoop in command line? The number of mappers and reducers can be set on the command line like this (5 mappers, 2 reducers):
-D mapred.map.tasks=5 -D mapred.reduce.tasks=2
In the code, one can configure JobConf variables:
job.setNumMapTasks(5); // 5 mappers
job.setNumReduceTasks(2); // 2 reducers
Note that on Hadoop…
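To show where the two -D options sit in a full invocation, here is a minimal Python sketch that assembles a `hadoop jar` command line. The jar name, main class, and input/output paths are hypothetical placeholders. The sketch assumes the job's main class parses generic options (for example through ToolRunner), which requires them to come before the application's own arguments.

```python
def hadoop_jar_command(jar, main_class, app_args, maps=None, reduces=None):
    """Build an argv list for `hadoop jar` with optional task-count settings.

    The -D generic options are placed after the main class and before
    the application's own arguments.
    """
    cmd = ["hadoop", "jar", jar, main_class]
    if maps is not None:
        cmd += ["-D", f"mapred.map.tasks={maps}"]
    if reduces is not None:
        cmd += ["-D", f"mapred.reduce.tasks={reduces}"]
    return cmd + list(app_args)


# Hypothetical job: wordcount.jar, WordCount, /input, /output are placeholders.
cmd = hadoop_jar_command(
    "wordcount.jar", "WordCount", ["/input", "/output"], maps=5, reduces=2
)
print(" ".join(cmd))
# prints: hadoop jar wordcount.jar WordCount -D mapred.map.tasks=5 -D mapred.reduce.tasks=2 /input /output
```

The list form can be passed directly to `subprocess.run` on a machine where the `hadoop` command is installed. The map count is generally treated by Hadoop as a hint, since the actual number of map tasks follows from the input splits.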

QA

How to set the data replication factor of Hadoop HDFS?

By Eric Ma | Mar 24, 2018

How to set the data replication factor of Hadoop HDFS in Hadoop 2 (YARN)? The default replication factor in HDFS is controlled by the dfs.replication property. The value is 3 by default. To change the replication factor, you can add a dfs.replication property setting to the hdfs-site.xml configuration file of Hadoop:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Replication…
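To check which value a given hdfs-site.xml actually sets, the file can be read with a few lines of Python. This is a minimal sketch: the embedded XML is a made-up stand-in for a real configuration file, `get_property` is an illustrative helper name, and the fallback of 3 is the default stated above.

```python
import xml.etree.ElementTree as ET

# Stand-in for the contents of an hdfs-site.xml file (illustrative only).
HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
"""


def get_property(xml_text, name, default=None):
    """Return the <value> of the named <property>, or default if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            return prop.findtext("value", "").strip()
    return default


# Fall back to "3" when the file does not set the property.
replication = int(get_property(HDFS_SITE, "dfs.replication", default="3"))
print(replication)  # prints: 1
```

For a real cluster, read the text of the hdfs-site.xml in the Hadoop configuration directory and pass it to `get_property` in place of `HDFS_SITE`.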

QA

Hadoop 2 (YARN) default configuration values

By Eric Ma | Mar 24, 2018 | Feb 26, 2019

Where to check the default Hadoop 2 (YARN) configuration values for:
HDFS: hdfs-site.xml
YARN: yarn-site.xml
MapReduce: mapred-site.xml
Default Hadoop 2 (YARN) configuration values for Hadoop 2.2.0 from the Apache Hadoop website:
HDFS: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN: https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
MapReduce: https://hadoop.apache.org/docs/r2.2.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

QA

Good introductions to Hadoop 2.0 (YARN)?

By Eric Ma | Mar 24, 2018 | Feb 26, 2019

Which introductions to Hadoop 2.0 (YARN) are recommended? Pointers to webpages are good. These are good ones that I have found:
The SoCC13 paper “Apache Hadoop YARN: Yet Another Resource Negotiator” by Vinod Kumar Vavilapalli et al.: http://www.socc2013.org/home/program/a5-vavilapalli.pdf
The introduction from Hortonworks by Arun Murthy: http://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/
The “Official” one from the Apache Hadoop website (very brief): https://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html

QA

Classpath for compiling MapReduce jobs on Hadoop 2.2.0

By Eric Ma | Mar 24, 2018

How to get the correct classpath for compiling MapReduce jobs on Hadoop 2.2.0 (YARN)? The yarn command from Hadoop 2 can find it for you:
yarn classpath
If yarn is not in your $PATH, use its full path; it is under the bin directory of the Hadoop distribution package.
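A typical use of this output is to pass it to javac through the -cp option. The Python sketch below shows one way to wire the two together. The source file name WordCount.java and the output directory classes are hypothetical, and when the yarn command is not found the script only prints a command with a placeholder classpath.

```python
import shutil
import subprocess


def yarn_classpath(yarn="yarn"):
    """Run `yarn classpath` and return its output as a single string."""
    result = subprocess.run(
        [yarn, "classpath"], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def javac_command(classpath, sources, out_dir="classes"):
    """Build a javac argv list that compiles sources against classpath."""
    return ["javac", "-cp", classpath, "-d", out_dir, *sources]


# WordCount.java is a hypothetical source file used for illustration.
if shutil.which("yarn"):
    command = javac_command(yarn_classpath(), ["WordCount.java"])
else:
    # yarn is not on $PATH here; show the command shape with a placeholder.
    command = javac_command("<output of yarn classpath>", ["WordCount.java"])
print(" ".join(command))
```

If yarn is installed outside `$PATH`, pass its full path, for example `yarn_classpath("/path/to/hadoop/bin/yarn")` with the path adjusted to the local installation.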

QA

How to choose the number of mappers and reducers in Hadoop

By Q A | Mar 24, 2018

How to choose the number of mappers and reducers in Hadoop to get good job performance? The Hadoop Wiki gives a discussion on this: http://wiki.apache.org/hadoop/HowManyMapsAndReduces Some valuable points: About the number of Maps: The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to…

QA

SQL layers on NoSQL databases

By Q A | Mar 24, 2018 | Mar 27, 2018

What are the SQL layer solutions over NoSQL databases such as key/value stores?
Phoenix: A SQL layer on HBase: https://github.com/forcedotcom/phoenix They also show some performance results: https://github.com/forcedotcom/phoenix/wiki/Performance
F1 – The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business: http://research.google.com/pubs/pub38125.html With F1, we have built a novel hybrid system that combines the scalability, fault tolerance, transparent sharding,…

Storage systems | Systems | Tutorial

How to force a metadata checkpointing in HDFS

By Eric Ma | Sep 9, 2017 | Sep 11, 2017

Metadata checkpointing in HDFS is done by the Secondary NameNode, which periodically merges the fsimage and the edits log files and keeps the edits log size within a limit. For various reasons, checkpointing by the Secondary NameNode may fail. For example, the HDFS SecondaryNameNode log shows errors as follows. 2017-08-06 10:54:14,488…

Computing systems | Resource management | Storage systems | Systems | Tutorial

Hadoop Installation Tutorial (Hadoop 2.x)

By Eric Ma | Sep 14, 2014 | Dec 29, 2019

Hadoop 2, also known as YARN, is the new version of Hadoop. It adds the YARN resource manager in addition to the HDFS and MapReduce components. Hadoop MapReduce is a programming model and software framework for writing applications; it is an open-source variant of MapReduce, which was initially designed and implemented by Google for processing and generating large data…

Computing systems | Storage systems | Systems

Big Data Benchmark from AMPLab of UC Berkeley

By Eric Ma | Mar 17, 2014 | Sep 5, 2020

Benchmarks are important for understanding performance and for comparing different systems quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem. The Big Data Benchmark The…

Computing systems | Tutorial

Hadoop MapReduce Tutorials

By Eric Ma | Jul 17, 2013 | Sep 5, 2020

Here is a list of tutorials for learning how to write MapReduce programs on Hadoop, the open-source MapReduce implementation with HDFS.
MapReduce Tutorials: The official tutorial on the Hadoop MapReduce framework: http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
Yahoo! Hadoop Tutorial: A comprehensive tutorial on Hadoop from Yahoo! Developer Network: http://developer.yahoo.com/hadoop/tutorial/
More about MapReduce: To better understand the design behind MapReduce, it…

Computing systems | News

PUMA: A MapReduce Benchmark Suite

By Eric Ma | Dec 20, 2012 | Sep 5, 2020

MapReduce is a well-known programming model designed for generating and processing large data. There are various MapReduce implementations, and one of the most widely known and used is Hadoop. Benchmarking MapReduce frameworks has become important. Faraz Ahmad et al. developed a benchmark suite: PUMA MapReduce Benchmark. During our work on MapReduce, we developed a benchmark suite…

Tutorial

Hadoop TeraSort Benchmark

By Eric Ma | Dec 18, 2012 | Sep 5, 2020

TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and sorting implementations: the TeraGen generates the input and TeraSort conducts the sorting. Here, we provide a short tutorial for using the Hadoop TeraSort benchmark. TeraGen generates random data that can be used as input data for a subsequent running…

Computing systems | Storage systems

Large-scale Data Storage and Processing System in Datacenters

By Eric Ma | Dec 11, 2012 | Aug 30, 2020

Research on cloud computing has made great progress, and many excellent large-scale systems have been designed in recent years. I compiled a list of some large-scale data storage and processing systems in datacenters as follows.
Storage systems
Google File System (GFS): http://research.google.com/archive/gfs.html
HDFS implementation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
Colossus (GFS2): Colossus: Successor to the Google File System (GFS)…