hadoop Archives - SysTutorials

Insights

What is the Future of Big Data Analytics and Hadoop?

By manchun | Jul 18, 2019 | Nov 21, 2019

Big Data has taken the lead in the IT industry and plays a significant role in business growth and in decision-making processes that give you an edge over competitors. This applies equally to organizations and to professionals in the analytics domain. Big Data Analytics brings an ocean of opportunities…

QA

How to force a checkpointing of metadata in HDFS?

By Eric Ma | Mar 24, 2018 | Nov 21, 2019

The HDFS SecondaryNameNode log shows:

2017-08-06 10:54:14,488 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -63 namespaceID = 1920275013 cTime = 0 ; clusterId = CID-f38880ba-3415-4277-8abf-b5c2848b7a63 ; blockpoolId = BP-578888813-10.6.1.2-1497278556180.
Expecting respectively: -63; 263120692; 0; CID-d22222fd-e28a-4b2d-bd2a-f60e1f0ad1b1; BP-622207878-10.6.1.2-1497242227638.
    at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:134)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:531)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:395)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:361)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:357)

It seems the checkpoint…

QA

How to find out all files with replication factor 1 in HDFS?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

How to find out all files with replication factor 1 in HDFS? The hdfs dfsadmin -report output shows that there are blocks with replication factor 1: "Missing blocks (with replication factor 1): 7". How to find them? You can run hdfs fsck to list all files with their replication counts and grep those with replication factor…
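The fsck-and-grep idea in the excerpt above can be sketched in Python. This is a minimal illustration, not the post's own command: the function name `files_with_replication` and the sample text are hypothetical, and the parser assumes that fsck prints one line per file followed by block lines carrying a `repl=N` (or `Live_repl=N`) field, which may differ between Hadoop versions.

```python
import re

# Assumed line shapes (they may vary by Hadoop version):
#   file line:  "<path> <size> bytes, ... block(s): ..."
#   block line: "... len=<n> repl=<k>"  or  "... Live_repl=<k>"
FILE_LINE = re.compile(r"^(/.+?) \d+ bytes, ")
REPL_FIELD = re.compile(r"\b(?:Live_)?repl=(\d+)")

# Hypothetical sample text, written only to exercise the parser.
SAMPLE_FSCK_OUTPUT = """\
/user/demo/a.txt 1234 bytes, 1 block(s):  OK
0. BP-1-10.0.0.1-1:blk_1073741825_1001 len=1234 repl=3

/user/demo/b.txt 99 bytes, 1 block(s):  OK
0. BP-1-10.0.0.1-1:blk_1073741826_1002 len=99 repl=1
"""


def files_with_replication(fsck_text, factor=1):
    """Return paths that have at least one block reported with `factor` replicas."""
    matches = []
    current_file = None
    for line in fsck_text.splitlines():
        file_match = FILE_LINE.match(line)
        if file_match:
            # A new file section starts; remember its path.
            current_file = file_match.group(1)
            continue
        repl_match = REPL_FIELD.search(line)
        if repl_match and current_file is not None:
            if int(repl_match.group(1)) == factor and current_file not in matches:
                matches.append(current_file)
    return matches
```

In practice the text would come from saving the fsck output to a file first and reading it back, then checking a few returned paths by hand before acting on them.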

QA

How to estimate the memory usage of HDFS NameNode for a HDFS cluster?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

HDFS stores the metadata of files and blocks in the memory of the NameNode. How to estimate the memory usage of the HDFS NameNode for an HDFS cluster? Each file and each block has around 150 bytes of metadata on the NameNode, so you can base the calculation on this. For example, assume the block size is…
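The rule of thumb above (about 150 bytes per file and per block) can be turned into a small calculation. The sketch below is only an illustration: the function name and all input numbers are hypothetical, and it assumes every file occupies at least one block and that all files have the same size.

```python
import math

BYTES_PER_OBJECT = 150  # rough per-file and per-block figure quoted in the excerpt


def estimate_namenode_memory(num_files, avg_file_size, block_size):
    """Rough NameNode heap estimate in bytes for file and block metadata."""
    # Assumption: each file needs at least one block.
    blocks_per_file = max(1, math.ceil(avg_file_size / block_size))
    num_blocks = num_files * blocks_per_file
    return (num_files + num_blocks) * BYTES_PER_OBJECT


MIB = 1024 * 1024
# Hypothetical cluster: 1,000,000 files of 300 MiB each with 128 MiB blocks.
# Each file spans 3 blocks, so there are 4,000,000 objects in total.
estimate = estimate_namenode_memory(1_000_000, 300 * MIB, 128 * MIB)
print(f"{estimate} bytes, about {estimate / MIB:.0f} MiB")
```

The result is a lower bound for planning purposes; a real heap size needs extra headroom on top of it.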

Which filesystem operations in HDFS is atomic?

By Q A | Mar 24, 2018 | Nov 25, 2019

Atomicity is an important and fundamental property of filesystems. Application semantics and many functions depend on, and are only available with, the atomicity model of the underlying filesystem. Which filesystem operations in HDFS are atomic, so that locks can be implemented on top of them? In a reasonably widely usable filesystem, some…

QA

How to put files with spaces in names into HDFS?

By Q A | Mar 24, 2018

I got this error when I tried to save a file with a space in its name into HDFS: $ hdfs dfs -put -f "/home/u1/testa/test a" "/u1/testa/test a" put: unexpected URISyntaxException while HDFS seems to allow spaces in its file names: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/filesystem/model.html . How to achieve the effect of saving the files with spaces in…

QA

Where to download an old release package of Hadoop?

By Q A | Mar 24, 2018

The download page of Hadoop http://hadoop.apache.org/releases.html only contains several recent release packages. I would like to download some old release packages such as Hadoop 2.5.0 which is not available anymore on the release page. Where to download a copy of the old release? You can download the old release packages of Hadoop on the http://archive.apache.org…

QA

how change my policy of scheduling in hadoop?

By Eric Ma | Mar 24, 2018 | Oct 7, 2019

I want to change the scheduling policy in Hadoop. How can I change the job order in MapReduce automatically? Assume you are using Hadoop 2 / YARN. The configuration parameter named yarn.resourcemanager.scheduler.class controls the class to be used as the resource scheduler for YARN/Hadoop. The default value for the scheduler class (check more at…

QA

How to manually kill HDFS DataNodes?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

stop-dfs.sh reports that there are no DataNodes running on some nodes, like hdfs-node-000208: no datanode to stop. However, there are DataNode processes running there. How to clean up these processes on many (100s of) nodes? You may use this piece of bash script: for i in `cat hadoop/etc/hadoop/slaves`; do echo $i; ssh $i 'jps | grep…

QA

How to set the replication factor for one file when it is uploaded by `hdfs dfs -put` command line in HDFS?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

When uploading a file by the hdfs dfs -put command line in HDFS, how to set a replication factor instead of the global one for that file? For example, HDFS’s global replication factor is 3. For some temporary files, I would like to save just one copy for faster uploading and saving disk space. The…

QA

How to make Fedora Linux not clean some files in /tmp/?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

On my Fedora 20, I find that the system automatically cleans up files under /tmp/. This is convenient. However, it causes problems for some programs. For example, HDFS puts its DataNode pid file under /tmp/ by default, like hadoop-hadoop-datanode.pid. After it is cleaned up, the hadoop-daemon.sh script will consider that there is no DataNode running…

Storage systems | Systems

How to handle missing blocks and blocks with corrupt replicas in HDFS?

By Eric Ma | Mar 24, 2018 | Feb 20, 2020

hdfs dfsadmin -report on one HDFS cluster shows: Under replicated blocks: 139016; Blocks with corrupt replicas: 9; Missing blocks: 0. The “Under replicated blocks” can be re-replicated automatically after some time. How to handle the missing blocks and blocks with corrupt replicas in HDFS? Understanding these blocks: A block is “with corrupt replicas” in HDFS…

QA

HDFS stays in safe mode because of reported blocks not reaching 0.9990 of total blocks

By Eric Ma | Mar 24, 2018 | Feb 9, 2019

After a node failure and a restart of HDFS, the NameNode reports in the log: “The reported blocks 1968810 needs additional 5071 blocks to reach the threshold 0.9990 of total blocks 1975856. Safe mode will be turned off automatically.” Why does this happen? And how to fix it? About why the NameNode stays in the safe mode:…
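The figures in that log message can be checked with a little arithmetic. The sketch below is a hypothetical helper, not NameNode code: it computes how many more reported blocks are needed so that the reported share reaches the threshold. The NameNode's own rounding may differ in edge cases.

```python
import math


def additional_blocks_needed(reported, total, threshold=0.999):
    """Smallest number of extra reported blocks so that reported/total >= threshold."""
    required = math.ceil(total * threshold)
    return max(0, required - reported)


# Numbers taken from the log message quoted above.
needed = additional_blocks_needed(reported=1968810, total=1975856, threshold=0.999)
print(needed)
```

With the values from the message, 0.999 of 1975856 is about 1973880.1, so 1973881 blocks must be reported, and the helper returns the same shortfall that the log states.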

QA

how to skip mapper function in hadoop

By Eric Ma | Mar 24, 2018 | Mar 28, 2018

In Hadoop, I need to skip the mapper function and directly execute the reducer function. We are doing this to improve Hadoop performance: if the Hadoop framework is used to analyze the same data sets, the mapper’s output will be the same for different kinds of jobs. To save the redundant computation for the same results, I am planning to…

QA

How to set replication factors for HDFS directories?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

Is it possible to set the replication factor for a specific directory in HDFS to one that is different from the default replication factor? This should set the replication factors not only of the existing files but also of new files created in that directory. This can simplify the administration. We can set the replication factor of /tmp/ to…

QA

How to add a new HDFS NameNode metadata directory to an existing cluster?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

We have a running HDFS cluster. Currently, the NameNode metadata setting has only one directory configured in hdfs-site.xml: <property> <name>dfs.namenode.name.dir</name> <value>file:///home/hadoop/hdfs/</value> <description>NameNode directory for namespace and transaction logs storage.</description> </property> We would like to add a new directory for dfs.namenode.name.dir to keep replicas of the metadata on a separate disk for higher data reliability…

QA

How to check the replication factor of a file in HDFS?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

A related question: how to find the replication factors of files in an HDFS cluster? Method 1: You can use the HDFS command line to ls the file. The second column of the output shows the replication factor of the file. For example, $ hdfs dfs -ls /usr/GroupStorage/data1/out.txt -rw-r--r-- 3 hadoop zma 11906625598 2014-10-22…
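Reading the second column of an ls line can be sketched as a small Python helper. This is only an illustration: the function name and the sample lines are hypothetical, and it assumes the usual whitespace-separated layout in which directories show "-" in the replication column.

```python
def replication_from_ls_line(line):
    """Return the replication factor from one `hdfs dfs -ls` output line, or None."""
    fields = line.split()
    if len(fields) < 2:
        return None
    # Directories carry "-" instead of a number in the second column.
    return int(fields[1]) if fields[1].isdigit() else None


# Hypothetical sample lines, made up for this sketch.
FILE_LINE = "-rw-r--r--   3 hadoop zma 1024 2014-10-22 12:00 /tmp/example.txt"
DIR_LINE = "drwxr-xr-x   - hadoop zma 0 2014-10-22 12:00 /tmp/example-dir"

print(replication_from_ls_line(FILE_LINE))
print(replication_from_ls_line(DIR_LINE))
```

Splitting on whitespace is enough here because only the second field is read, so spaces in the path at the end of the line do not matter.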

QA

How to change an running HDFS cluster’s replication factor?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

I have a running HDFS cluster storing lots of files, and I want to change its default replication factor. How to change it? What will happen after it is changed? For example, if I change it from 2 to 3, will HDFS automatically re-replicate the data chunks? First, the replication factor is decided by the client. Second, the replication factor…

QA

What is the design of Snapshots in HDFS?

By Eric Ma | Mar 24, 2018 | Mar 24, 2018

What is the design of Snapshots in HDFS? This PDF documents the design of snapshots. Jing Zhao and Tsz-Wo Sze from Hortonworks gave a great talk on the design of HDFS snapshots. The slides can be downloaded here. The development of snapshots is tracked by HDFS-2802.

QA

How to balance DataNode storage in HDFS?

By Eric Ma | Mar 24, 2018 | Feb 26, 2019

As nodes are added to and removed from a Hadoop cluster, storage usage across DataNodes may become uneven. Some DataNodes’ disks are almost used up while others’ are almost empty. How to balance data across DataNodes in HDFS? Hadoop provides the balancer to redistribute the data. Brief introduction to balancer in Hadoop: balancer. The design…