dfs

Programming Languages

Forcing HDFS Metadata Checkpoints
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

The NameNode in HDFS maintains the filesystem namespace in memory. The Secondary NameNode (or the Standby NameNode in HA setups) periodically merges the namespace image (fsimage) with the edit logs to create a new checkpoint. Understanding how to force this process is essential for cluster maintenance and recovery operations.

How Checkpointing Works

The checkpoint process combines…
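A checkpoint can usually be forced from the command line with `hdfs dfsadmin`. The sketch below is not taken from the truncated post. It assumes the common sequence of entering safe mode, running `-saveNamespace`, and leaving safe mode, and it assumes the `hdfs` binary is on the PATH. The `force_checkpoint` helper and its `dry_run` flag are hypothetical names used only for illustration.

```python
import subprocess
from typing import List

# Assumed sequence (not from the post): -saveNamespace requires safe mode,
# so the NameNode is put into safe mode first and released afterwards.
CHECKPOINT_STEPS: List[List[str]] = [
    ["hdfs", "dfsadmin", "-safemode", "enter"],
    ["hdfs", "dfsadmin", "-saveNamespace"],
    ["hdfs", "dfsadmin", "-safemode", "leave"],
]


def force_checkpoint(dry_run: bool = True) -> List[str]:
    """Run (or, with dry_run, only list) the checkpoint commands in order."""
    executed = []
    for step in CHECKPOINT_STEPS:
        executed.append(" ".join(step))
        if not dry_run:
            # check=True raises at the first failing command.
            subprocess.run(step, check=True)
    return executed


if __name__ == "__main__":
    for line in force_checkpoint(dry_run=True):
        print(line)
```

With `dry_run=False`, a failure in the middle step leaves the NameNode in safe mode, so a real script would want to handle that case explicitly.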

Code Optimization

Estimating HDFS NameNode Memory Usage
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

The HDFS NameNode holds the entire filesystem namespace and block map in memory. Estimating memory requirements accurately is critical for cluster planning and preventing out-of-memory failures that can cripple your entire cluster.

Core Memory Components

The NameNode’s memory consumption breaks down into several key areas:

Namespace Objects: The NameNode maintains an in-memory representation of the…
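As a rough illustration of this kind of estimate, the sketch below multiplies the number of namespace objects by a fixed per-object cost. The 150-byte figure is a commonly quoted rule of thumb and an assumption here, not a number from this post. The function name and the sample counts are hypothetical.

```python
# Assumed rule of thumb: roughly 150 bytes of heap per file, directory, or block.
# Real usage varies with path lengths, Hadoop version, and JVM settings.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_bytes(files: int, directories: int, blocks: int) -> int:
    """Return an approximate heap size for the given namespace object counts."""
    return (files + directories + blocks) * BYTES_PER_OBJECT


if __name__ == "__main__":
    # Hypothetical cluster: 10M files, 1M directories, 12M blocks.
    heap = estimate_namenode_heap_bytes(10_000_000, 1_000_000, 12_000_000)
    print(f"{heap / 1024**3:.2f} GiB")
```

Treat the result as a lower bound for planning and leave generous headroom for garbage collection and growth.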

Programming Languages

Handling Files with Spaces in HDFS
By Q A | Mar 24, 2018 | Apr 13, 2026

When working with the Hadoop Distributed File System (HDFS), files containing spaces in their names require special handling during the hdfs dfs -put command. Without proper escaping or quoting, the shell interprets spaces as delimiters, splitting the filename into multiple arguments.

Direct Upload with Quoting

The simplest approach is to wrap the filename in quotes: hdfs…
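When the upload command is generated by a script, the same quoting can be applied programmatically. The sketch below uses Python's standard `shlex.quote` to build a shell-safe `hdfs dfs -put` command line. The helper name and the example paths are hypothetical, and it only covers shell quoting of the arguments.

```python
import shlex


def build_put_command(local_path: str, hdfs_dir: str) -> str:
    """Return a shell-safe `hdfs dfs -put` command line as a single string."""
    parts = ["hdfs", "dfs", "-put", shlex.quote(local_path), shlex.quote(hdfs_dir)]
    return " ".join(parts)


if __name__ == "__main__":
    # Hypothetical local file and HDFS target directory.
    print(build_put_command("my report.txt", "/data/in"))
```

Arguments without special characters are left unquoted, so the output stays readable for ordinary filenames.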

Linux & Systems Administration

Safely Stopping Stray HDFS DataNode Processes on Multiple Nodes
By Eric Ma | Mar 24, 2018 | Apr 11, 2026

When stop-dfs.sh fails to cleanly terminate DataNode processes, you’re left with orphaned Java processes consuming resources and potentially causing cluster issues. This happens most often after ungraceful cluster shutdowns or when the standard stop script encounters communication problems with specific nodes.

Understanding the Problem

The typical symptom is stop-dfs.sh reporting that certain nodes have no…

Linux & Systems Administration

Setting Custom Replication Factor for Individual HDFS Files
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

When uploading files to HDFS using hdfs dfs -put, the replication factor defaults to the cluster-wide setting in hdfs-site.xml (typically 3). For temporary files, logs, or staging data, you often want a lower replication factor to reduce write latency and disk usage.

Override Replication Factor at Upload Time

Use the -D flag to pass HDFS…

Programming Languages

Recovering HDFS from Safe Mode After DataNode Failures
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

When a NameNode restarts, it enters safe mode to rebuild its understanding of block-to-DataNode mappings. If DataNodes don’t report their blocks quickly enough or some don’t come back online, the NameNode may stay stuck in safe mode with a message like:

The reported blocks 1968810 needs additional 5071 blocks to reach the threshold 0.9990 of…
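For monitoring scripts, the numbers in that message can be pulled out with a regular expression. The sketch below assumes the wording shown in the excerpt, which may differ between Hadoop versions. The pattern and the function name are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the portion of the safe mode message quoted in the excerpt above.
SAFE_MODE_PATTERN = re.compile(
    r"reported blocks (\d+) needs additional (\d+) blocks "
    r"to reach the threshold (\d+\.\d+)"
)


def parse_safe_mode_message(message: str) -> Optional[Tuple[int, int, float]]:
    """Return (reported, additional, threshold), or None if the text does not match."""
    match = SAFE_MODE_PATTERN.search(message)
    if match is None:
        return None
    reported, additional, threshold = match.groups()
    return int(reported), int(additional), float(threshold)


if __name__ == "__main__":
    text = ("The reported blocks 1968810 needs additional 5071 blocks "
            "to reach the threshold 0.9990 of")
    print(parse_safe_mode_message(text))
```

The `additional` value is the useful one to watch: if it stops shrinking, some DataNodes are probably not reporting.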

Programming Languages

Configuring HDFS Replication Factors by Directory
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

HDFS doesn’t natively support directory-level replication factor inheritance. Even if you set a specific replication factor on a directory and its files, new files created in that directory will default to the cluster’s global dfs.replication setting (typically 3). This limitation can complicate multi-tier storage strategies where you want temporary or low-priority data on fewer replicas…

Languages & Frameworks

Adding a Secondary NameNode Metadata Directory to HDFS
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

Adding a second metadata directory to your HDFS NameNode increases reliability by maintaining synchronized replicas of the namespace and transaction logs across separate disks. This guide walks through the process safely.

Prerequisites and Planning

Before starting, verify your current configuration and plan the new directory location:

grep -A2 "dfs.namenode.name.dir" $HADOOP_HOME/etc/hadoop/hdfs-site.xml

The new directory should be…

Linux & Systems Administration

Checking HDFS File Replication Factor
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

When managing HDFS clusters, you often need to verify the replication factor of specific files to ensure data redundancy meets your requirements. Here are the practical methods to check this.

Using hdfs dfs -ls

The most straightforward way is to list the file with hdfs dfs -ls:

hdfs dfs -ls /usr/GroupStorage/data1/out.txt

Output:

-rw-r--r-- 3 hadoop…
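In the listing, the replication factor is the second column, directly after the permission string. A minimal sketch of reading it from a captured line follows. The sample lines in the usage section are hypothetical, since the output in the excerpt is truncated, and the function name is illustrative.

```python
from typing import Optional


def replication_from_ls_line(line: str) -> Optional[int]:
    """Return the replication factor from one `hdfs dfs -ls` line, if present."""
    fields = line.split()
    # Expected columns: permissions, replication, owner, group, size, date, time, path.
    if len(fields) < 8:
        return None
    replication = fields[1]
    # Directories show "-" in the replication column.
    return int(replication) if replication.isdigit() else None


if __name__ == "__main__":
    # Hypothetical listing lines for a file and a directory.
    file_line = "-rw-r--r--   3 hadoop supergroup 1048576 2018-03-24 10:15 /tmp/example.txt"
    dir_line = "drwxr-xr-x   - hadoop supergroup 0 2018-03-24 10:15 /tmp"
    print(replication_from_ls_line(file_line))
    print(replication_from_ls_line(dir_line))
```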

Linux & Systems Administration

Adjusting HDFS Replication Factor Per File
By Eric Ma | Mar 24, 2018 | Apr 13, 2026

HDFS uses the dfs.replication property in hdfs-site.xml to set a global default replication factor for all blocks. However, you can override this on a per-file or per-directory basis using the hdfs dfs -setrep command, which is useful for frequently accessed “hot” files that need higher availability.

Basic syntax

hdfs dfs -setrep [-R] [-w] <numReplicas> <path>

Setting…
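The syntax above maps directly onto an argument list, which is convenient when the command is issued from a script. The sketch below only assembles the arguments. The helper name and the example path are hypothetical, and running the result still requires the `hdfs` binary on the PATH.

```python
from typing import List


def build_setrep_args(replicas: int, path: str,
                      recursive: bool = False, wait: bool = False) -> List[str]:
    """Build the argv for `hdfs dfs -setrep [-R] [-w] <numReplicas> <path>`."""
    if replicas < 1:
        raise ValueError("replication factor must be at least 1")
    args = ["hdfs", "dfs", "-setrep"]
    if recursive:
        args.append("-R")
    if wait:
        # -w blocks until the target replication is reached.
        args.append("-w")
    args += [str(replicas), path]
    return args


if __name__ == "__main__":
    # Hypothetical hot file raised to 5 replicas; pass the list to subprocess.run.
    print(build_setrep_args(5, "/data/hot/file.txt", wait=True))
```

Passing the list form to `subprocess.run` avoids shell quoting issues with unusual paths.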

Scripting & Utilities

Configuring HDFS Replication: Cluster and Per-File Settings
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

The replication factor determines how many copies of each data block HDFS maintains across your cluster. The default is 3, which provides adequate fault tolerance for most deployments: with rack awareness configured, data survives the loss of a single node or of an entire rack.

Setting Cluster-Wide Default Replication

To set the default replication factor for all new files, add the dfs.replication property to…
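The cost of a given replication factor is easy to quantify: raw disk usage scales linearly with it, and a block stays readable as long as one replica survives. The sketch below captures just that arithmetic. It ignores erasure coding and block placement details, and the function names are hypothetical.

```python
def raw_storage_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw disk space consumed by data stored with the given replication factor."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return logical_bytes * replication


def tolerated_replica_losses(replication: int = 3) -> int:
    """Number of replicas of a block that can be lost while one copy remains."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    return replication - 1


if __name__ == "__main__":
    one_tib = 1024 ** 4
    print(raw_storage_bytes(one_tib) // one_tib, "TiB raw per TiB of data")
    print(tolerated_replica_losses(), "replica losses tolerated per block")
```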

Linux & Systems Administration

Tuning MapReduce: Choosing Mapper and Reducer Counts
By Q A | Mar 24, 2018 | Apr 12, 2026

Choosing the right number of mappers and reducers directly impacts Hadoop job performance. This isn’t a set-and-forget configuration: it depends on your cluster characteristics, data size, and task complexity.

Mapper Configuration

The number of mappers is primarily determined by the number of HDFS blocks in your input files. Each block typically generates one map task by…
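A quick way to anticipate the mapper count is to count blocks per input file. The sketch below assumes one map task per block, a 128 MiB block size (the usual default in Hadoop 2.x and 3.x), and splittable input with no combining of small files. The function name is hypothetical.

```python
from typing import Iterable

DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # assumed dfs.blocksize of 128 MiB


def estimate_map_tasks(file_sizes_bytes: Iterable[int],
                       block_size_bytes: int = DEFAULT_BLOCK_SIZE) -> int:
    """Estimate map tasks as the total number of blocks across all input files."""
    total = 0
    for size in file_sizes_bytes:
        if size > 0:
            # Integer ceiling division: a partial last block still needs a task.
            total += -(-size // block_size_bytes)
    return total


if __name__ == "__main__":
    # Hypothetical input: one 1 GiB file and one 200 MiB file.
    sizes = [1024 ** 3, 200 * 1024 ** 2]
    print(estimate_map_tasks(sizes))
```

Because the count is per file, many small files produce many short map tasks even when the total data volume is modest.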

Systems & Architecture

HDFS NameNode Metadata Checkpointing: Resolving Sync Issues
By Eric Ma | Sep 9, 2017 | Apr 12, 2026

The Secondary NameNode periodically merges the fsimage and edits log files to keep the edits log manageable and consolidate metadata. When this checkpointing fails, you’ll see errors like:

ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.

This typically indicates a mismatch between the NameNode and Secondary NameNode metadata states, often from unclean shutdowns or…

Linux & Systems Administration

Setting Up a Hadoop 2.x Cluster for Development and Legacy Systems
By Eric Ma | Sep 14, 2014 | Apr 12, 2026

Hadoop 2.x is a legacy release line. This guide covers setup for learning purposes and legacy system maintenance only. For production deployments, use Hadoop 3.x or later, which includes performance improvements, better YARN scheduling, HDFS erasure coding, and improved security. Cloud-managed options like AWS EMR, Google Dataproc, and Azure HDInsight eliminate most operational overhead. This guide…

Linux & Systems Administration

Installing Hadoop 1.x: A Complete Guide
By Eric Ma | Oct 9, 2012 | Apr 12, 2026

Hadoop 1.x reached end-of-life in 2014 and is no longer maintained. This post documents a deprecated architecture for historical reference only. For new deployments, use Hadoop 3.x+ with YARN, which offers significant improvements in resource management, multi-tenancy, and reliability. See the Apache Hadoop documentation for current versions. For managed services, consider AWS EMR, Google Dataproc,…

Design Patterns & Architecture

Hadoop Port Reference Guide
By Eric Ma | Jan 15, 2012 | Apr 13, 2026

Hadoop daemons communicate over specific TCP ports that you’ll need to know for cluster setup, firewall configuration, and application development. These ports are configurable but come with sensible defaults in Hadoop 3.x.

HDFS Ports

Service | Port | Configuration Property | Purpose
NameNode HTTP UI | 9870 | dfs.namenode.http-address | Web UI and metadata operations
NameNode HTTPS UI | 9871 | dfs.namenode.https-address | Secure…

Algorithms & Data Structures

Sorting Performance Benchmarks on Hadoop
By Eric Ma | Jan 7, 2012 | Apr 12, 2026

Running benchmarks after a Hadoop installation helps validate that your cluster works correctly and gives you baseline performance metrics. The Sort benchmark is a standard test that exercises the MapReduce execution engine and HDFS, making it useful for stress-testing both the distributed storage layer and computation framework.

What the Sort Benchmark Does

The Sort benchmark…

Linux & Systems Administration

mrcc – A Distributed C Compiler System on MapReduce (Archived 2010)
By Eric Ma | Jan 16, 2010 | Apr 12, 2026

Archived Content (2010): This post describes a research project from 2010. The tools and versions mentioned (Hadoop 0.20, MapReduce Streaming) are historically significant but have been largely superseded by modern distributed build systems like Bazel Remote Execution and cloud-native CI/CD pipelines.

mrcc – A Distributed C Compiler System on MapReduce

Original Project Date: January 2010…
