yarn

Scripting & Utilities

Finding HDFS Files with Replication Factor 1
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

Files with replication factor 1 are a liability in production HDFS clusters. They have no redundancy, meaning a single node failure results in data loss. Identifying and fixing these files should be part of your regular cluster maintenance.

Using the HDFS CLI

The most straightforward approach is using hdfs fsck with output parsing: hdfs fsck…
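A minimal sketch of the output-parsing idea, using a different listing command than the truncated hdfs fsck invocation above: `hdfs dfs -ls -R` prints the replication factor in its second column (`-` for directories), so a small awk filter can pick out files at replication 1. The function name `filter_replication_one` and the `/data` path are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list files whose replication factor is 1.
# Reads `hdfs dfs -ls -R` output on stdin; column 2 is the replication
# factor ("-" for directories) and column 8 onward is the path.
filter_replication_one() {
  awk '$2 == "1" {
    path = $8
    for (i = 9; i <= NF; i++) path = path " " $i   # rejoin paths containing single spaces
    print path
  }'
}

# Usage on a live cluster (the /data path is a placeholder):
#   hdfs dfs -ls -R /data | filter_replication_one
```

Because the filter only reads stdin, it can be tried against a saved listing before being pointed at a production namespace.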

Design Patterns & Architecture

Configuring Hadoop’s Job Scheduling Policy
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

The YARN resource scheduler determines how cluster resources are allocated to jobs. By default, Hadoop uses the Capacity Scheduler, but you can switch to an alternative like the Fair Scheduler or configure different scheduling policies.

Identifying Your Current Scheduler

Check which scheduler is currently active by examining your configuration:

    grep -A 2 "yarn.resourcemanager.scheduler.class" $HADOOP_HOME/etc/hadoop/yarn-site.xml

The…
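The grep check above can be wrapped in a small helper that prints only the configured class. This is a sketch, assuming the usual `<name>`/`<value>` property layout in yarn-site.xml; the function name `scheduler_class` is hypothetical, and the awk parsing is line-based rather than a real XML parser.

```shell
#!/usr/bin/env bash
# Sketch: print the value of yarn.resourcemanager.scheduler.class from a
# yarn-site.xml file, or a note when the property is not set explicitly.
scheduler_class() {
  local file="$1" value
  value=$(awk '
    /<name>[ \t]*yarn\.resourcemanager\.scheduler\.class[ \t]*<\/name>/ { found = 1 }
    found && /<value>/ {
      sub(/.*<value>[ \t]*/, "")     # drop everything up to the opening tag
      sub(/[ \t]*<\/value>.*/, "")   # drop the closing tag and anything after it
      print
      exit
    }
  ' "$file")
  if [ -n "$value" ]; then
    echo "$value"
  else
    echo "not set (built-in default applies)"
  fi
}

# Usage:
#   scheduler_class "$HADOOP_HOME/etc/hadoop/yarn-site.xml"
```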

Linux & Systems Administration

Safely Stopping Stray HDFS DataNode Processes on Multiple Nodes
By Eric Ma | Mar 24, 2018 | Apr 11, 2026

When stop-dfs.sh fails to cleanly terminate DataNode processes, you’re left with orphaned Java processes consuming resources and potentially causing cluster issues. This happens most often after ungraceful cluster shutdowns or when the standard stop script encounters communication problems with specific nodes.

Understanding the Problem

The typical symptom is stop-dfs.sh reporting that certain nodes have no…
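One way to locate the orphaned processes is sketched below, assuming `jps` is available on each node and is run as the user that owns the DataNode JVM. The function name `datanode_pids` and the `hosts.txt` file are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract PIDs of DataNode JVMs from `jps` output read on stdin.
# `jps` prints one "<pid> <main class name>" pair per line.
datanode_pids() {
  awk '$2 == "DataNode" { print $1 }'
}

# Usage on one node (plain kill sends SIGTERM, giving the DataNode a chance
# to shut down cleanly; -r is a GNU xargs option):
#   jps | datanode_pids | xargs -r kill
#
# Looping over nodes (hosts.txt is a hypothetical file, one hostname per line):
#   while read -r host; do
#     ssh -n "$host" 'jps' | datanode_pids | while read -r pid; do
#       ssh -n "$host" "kill $pid"
#     done
#   done < hosts.txt
```

Escalating to `kill -9` is best kept as a manual last step for processes that ignore SIGTERM.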

Linux & Systems Administration

Setting Custom Replication Factor for Individual HDFS Files
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

When uploading files to HDFS using hdfs dfs -put, the replication factor defaults to the cluster-wide setting in hdfs-site.xml (typically 3). For temporary files, logs, or staging data, you often want a lower replication factor to reduce write latency and disk usage.

Override Replication Factor at Upload Time

Use the -D flag to pass HDFS…

Systems & Architecture

Diagnosing and Repairing Corrupt HDFS Blocks
By Eric Ma | Mar 24, 2018 | Apr 11, 2026

When you see output like this from hdfs dfsadmin -report, you need to understand what each metric means and how to respond:

    Under replicated blocks: 139016
    Blocks with corrupt replicas: 9
    Missing blocks: 0

Understanding block states in HDFS is essential for cluster health. The distinction between these categories determines how urgent your response needs…
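The three counters can be pulled out of the report and turned into a rough triage message. This is a sketch only: the function names are hypothetical, and the severity ordering (missing, then corrupt, then under-replicated) is an assumption for illustration rather than something the excerpt states.

```shell
#!/usr/bin/env bash
# Sketch: pull one block counter out of `hdfs dfsadmin -report` text on stdin.
# $1 is the label exactly as printed, e.g. "Missing blocks".
block_metric() {
  awk -F': *' -v label="$1" '
    { key = $1; sub(/^[ \t]+/, "", key) }   # tolerate leading whitespace
    key == label { print $2; exit }
  '
}

# Rough triage over the three counters (severity ordering is an assumption).
triage_blocks() {
  local report missing corrupt under
  report=$(cat)
  missing=$(printf '%s\n' "$report" | block_metric "Missing blocks")
  corrupt=$(printf '%s\n' "$report" | block_metric "Blocks with corrupt replicas")
  under=$(printf '%s\n' "$report" | block_metric "Under replicated blocks")
  if [ "${missing:-0}" -gt 0 ]; then
    echo "critical: $missing missing blocks"
  elif [ "${corrupt:-0}" -gt 0 ]; then
    echo "warning: $corrupt blocks with corrupt replicas"
  elif [ "${under:-0}" -gt 0 ]; then
    echo "info: $under under-replicated blocks"
  else
    echo "ok"
  fi
}

# Usage:
#   hdfs dfsadmin -report | triage_blocks
```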

Programming Languages

Recovering HDFS from Safe Mode After DataNode Failures
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

When a NameNode restarts, it enters safe mode to rebuild its understanding of block-to-DataNode mappings. If DataNodes don’t report their blocks quickly enough or some don’t come back online, the NameNode may stay stuck in safe mode with a message like:

    The reported blocks 1968810 needs additional 5071 blocks to reach the threshold 0.9990 of…
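The numbers in that message can be extracted to see how far the NameNode is from its threshold. A minimal sketch, assuming the message wording shown above; the function name `safemode_gap` is hypothetical, and the third value is simply the sum of the two numbers in the message.

```shell
#!/usr/bin/env bash
# Sketch: read a safe mode status message on stdin and print
# "reported additional target", where target = reported + additional.
safemode_gap() {
  sed -n 's/.*reported blocks \([0-9][0-9]*\) needs additional \([0-9][0-9]*\) blocks.*/\1 \2/p' |
    awk '{ print $1, $2, $1 + $2 }'
}

# Related commands on a live cluster:
#   hdfs dfsadmin -safemode get     # report whether safe mode is on
#   hdfs dfsadmin -safemode leave   # force exit; only once the gap is understood
```

For the message quoted above, the gap is 5071 blocks out of roughly 1.97 million, which usually points at a small number of DataNodes that have not reported yet.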

Programming Languages

Caching Mapper Output in Hadoop: Strategies for Reusing Intermediate Results
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

The core problem here is legitimate: if you’re running multiple jobs on the same dataset where the mapper phase produces identical intermediate results, recomputing those results is wasteful. However, skipping the mapper phase entirely breaks MapReduce’s processing model. There are better approaches.

Why You Can’t Just Skip the Mapper

MapReduce assumes data flows through map…

Programming Languages

Configuring HDFS Replication Factors by Directory
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

HDFS doesn’t natively support directory-level replication factor inheritance. Even if you set a specific replication factor on a directory and its files, new files created in that directory will default to the cluster’s global dfs.replication setting (typically 3). This limitation can complicate multi-tier storage strategies where you want temporary or low-priority data on fewer replicas…

Linux & Systems Administration

Checking HDFS File Replication Factor
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

When managing HDFS clusters, you often need to verify the replication factor of specific files to ensure data redundancy meets your requirements. Here are the practical methods to check this.

Using hdfs dfs -ls

The most straightforward way is to list the file with hdfs dfs -ls:

    hdfs dfs -ls /usr/GroupStorage/data1/out.txt

Output:

    -rw-r--r-- 3 hadoop…

Linux & Systems Administration

Configuring Heap Size for Hadoop NameNode, DataNode, and YARN
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

When running Hadoop on systems with substantial memory, the default 1GB heap size is often inadequate. If you check running processes with ps aux and see -Xmx1000m, you’re working with the default configuration that doesn’t scale to modern hardware.

Understanding Hadoop Heap Configuration

Hadoop’s Java process memory is controlled by environment variables set in configuration…
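The ps check can be narrowed to just the heap setting. A sketch, assuming a single Java command line on stdin; the function name `effective_xmx` is hypothetical. Hadoop daemon command lines sometimes carry more than one -Xmx flag, and the JVM normally honors the last one, so that is the one printed.

```shell
#!/usr/bin/env bash
# Sketch: print the last -Xmx setting found in a Java command line on stdin.
# When a JVM receives several -Xmx flags, the last one normally takes effect.
effective_xmx() {
  tr ' ' '\n' | grep -E '^-Xmx[0-9]+[kKmMgG]?$' | tail -n 1 | sed 's/^-Xmx//'
}

# Usage on a node running a NameNode (the grep pattern is illustrative and
# should match exactly one process):
#   ps aux | grep '[N]ameNode' | effective_xmx
```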

Linux & Systems Administration

Configuring Mappers and Reducers in Hadoop: CLI and Code Approaches
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

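To set the number of mappers and reducers when submitting a Hadoop job, use the -D flag with the appropriate property names. The correct properties depend on your Hadoop version.

Hadoop 2.x and Later (YARN)

Use the modern property names. Place the -D options after the jar (and after the main class, if you pass one) so they are handed to the job instead of being read as the jar path; they take effect when the driver parses generic options, for example through ToolRunner:

    hadoop jar yourapp.jar -Dmapreduce.job.maps=5 -Dmapreduce.job.reduces=2

The older mapred.map.tasks and mapred.reduce.tasks properties are deprecated in Hadoop…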

Scripting & Utilities

Configuring HDFS Replication: Cluster and Per-File Settings
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

The replication factor determines how many copies of each data block HDFS maintains across your cluster. The default is 3, which provides adequate fault tolerance for most deployments by surviving both node and rack failures simultaneously.

Setting Cluster-Wide Default Replication

To set the default replication factor for all new files, add the dfs.replication property to…

Design Patterns & Architecture

Understanding Hadoop Configuration Files: Locations and Defaults
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

Hadoop uses three primary configuration files to define YARN, HDFS, and MapReduce behavior:

HDFS: hdfs-site.xml
YARN: yarn-site.xml
MapReduce: mapred-site.xml

These files live in $HADOOP_HOME/etc/hadoop/ and override the built-in defaults when present.

Finding Official Default Values

Apache publishes default configuration documentation for each release. For current versions:

Hadoop 3.4.x (Latest)

HDFS defaults: https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
YARN defaults: https://hadoop.apache.org/docs/r3.4.0/hadoop-yarn/hadoop-yarn-common/yarn-default.xml…
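Since the site files only override defaults when they are present, a quick check of which ones exist can explain why a setting is or is not taking effect. A minimal sketch; the function name `list_site_files` is hypothetical, and the directory defaults to the $HADOOP_HOME/etc/hadoop location named above.

```shell
#!/usr/bin/env bash
# Sketch: report which site files are present in a Hadoop config directory.
# A missing file means only the built-in defaults apply for that component.
list_site_files() {
  local conf_dir="${1:-$HADOOP_HOME/etc/hadoop}" f
  for f in hdfs-site.xml yarn-site.xml mapred-site.xml; do
    if [ -f "$conf_dir/$f" ]; then
      echo "$f: present"
    else
      echo "$f: missing"
    fi
  done
}

# Usage:
#   list_site_files                 # uses $HADOOP_HOME/etc/hadoop
#   list_site_files /etc/hadoop/conf
```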

Development Best Practices

Understanding YARN: Resource Management and Cluster Fundamentals
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

YARN (Yet Another Resource Negotiator) fundamentally restructured Hadoop in the 2.0 release by decoupling resource management from application logic. If you’re transitioning from Hadoop 1.x or building systems on top of YARN, understanding its architecture is essential for effective cluster administration and application development.

Essential Reading

The foundational paper

Start with “Apache Hadoop YARN: Yet Another Resource Negotiator”…

Design Patterns & Architecture

Configuring Hadoop Classpath for MapReduce Compilation
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

When compiling MapReduce jobs against a Hadoop installation, you need to include the correct classpath to resolve Hadoop dependencies. The yarn classpath command handles this automatically.

Getting the classpath

Run this command to output the full classpath:

    yarn classpath

If yarn isn’t in your $PATH, use the full path:

    $HADOOP_HOME/bin/yarn classpath

Replace $HADOOP_HOME with your…
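As a sketch of how the classpath output is typically consumed, the helper below counts the colon-separated entries as a sanity check, and the usage comments show it being passed to javac. The function name `classpath_entries`, the `classes` output directory, and `WordCount.java` are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count the entries in a colon-separated classpath read on stdin,
# a quick sanity check before handing the classpath to javac.
classpath_entries() {
  tr ':' '\n' | grep -c .
}

# Usage (WordCount.java is a hypothetical source file):
#   yarn classpath | classpath_entries
#   mkdir -p classes
#   javac -cp "$(yarn classpath)" -d classes WordCount.java
```

A count of zero suggests the command is not resolving the installation at all, which is worth fixing before looking at compiler errors.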

Linux & Systems Administration

Setting Up a Hadoop 2.x Cluster for Development and Legacy Systems
By Eric Ma | Sep 14, 2014 | Apr 12, 2026

Hadoop 2.x is a legacy release line that is no longer actively developed. This guide covers setup for learning purposes and legacy system maintenance only. For production deployments, use Hadoop 3.x or later, which includes performance improvements, better YARN scheduling, HDFS erasure coding, and improved security. Cloud-managed options like AWS EMR, Google Dataproc, and Azure HDInsight eliminate most operational overhead. This guide…
