Spark

Linux & Systems Administration

Choosing Stream Processing Tools Based on Latency Needs
By Eric Ma · Nov 27, 2018 · Apr 11, 2026

The fundamental lesson from years of big data work is straightforward: solve problems with tools designed for those problems. Organizations routinely force large-scale data through mismatched infrastructure — pushing streams through micro-batching systems, analyzing graphs as tables, handling real-time requirements with batch windows. Each mismatch adds latency, complexity, and operational burden. Stream processing deserves particular…

Programming Languages

Caching Mapper Output in Hadoop: Strategies for Reusing Intermediate Results
By Eric Ma · Mar 24, 2018 · Apr 13, 2026

The core problem here is legitimate: if you’re running multiple jobs on the same dataset where the mapper phase produces identical intermediate results, recomputing those results is wasteful. However, skipping the mapper phase entirely breaks MapReduce’s processing model. There are better approaches.

Why You Can’t Just Skip the Mapper

MapReduce assumes data flows through map…

Development Best Practices

Spark SQL: DDL and DML Operations Explained
By Eric Ma · Mar 24, 2018 · Apr 12, 2026

Spark SQL doesn’t have a separate DDL/DML specification distinct from HiveQL; it inherits its SQL dialect directly from Hive. If you’re designing a SQL engine or looking to understand Spark SQL’s data definition and manipulation capabilities, you need to reference Hive’s DDL and DML documentation.

Why Spark SQL Uses HiveQL

Spark SQL…

Linux & Systems Administration

AMPLab Big Data Benchmark: Key Metrics and Lasting Impact
By Eric Ma · Mar 17, 2014 · Apr 12, 2026

Benchmarks exist to answer one question: how fast can this system process data? The AMPLab Big Data Benchmark, developed by UC Berkeley, became a foundational effort to answer that question systematically across different distributed query engines.

The Benchmark’s Original Scope

The AMPLab benchmark evaluated five systems circa 2013:

Redshift – Amazon’s columnar data warehouse (built…

Linux & Systems Administration

Designing Scalable Data Storage and Processing Architectures
By Eric Ma · Dec 11, 2012 · Apr 12, 2026

Modern datacenters rely on distributed storage and processing systems designed to handle massive datasets across clusters of commodity hardware. Understanding these systems is essential for anyone working with infrastructure at scale. Here’s an overview of the major architectures and implementations you’ll encounter.

Storage Systems

Google File System (GFS) and successors

GFS established the foundation for…

Software Architecture

Distributed Systems and Cloud Computing: Essential Reading Guide
By Eric Ma · Sep 15, 2012 · Apr 12, 2026

Understanding distributed systems and cloud computing starts with the literature. These papers form the foundation for anyone working in the space: they are not just historical artifacts but are still actively referenced, and their design principles remain relevant.

Internet-Scale Systems and Datacenters

The Google cluster architecture paper remains essential reading. It laid out the practical realities…
