Data processing

System Administration & Cloud

How Cloud Infrastructure Transforms Remote Work and Reduces Costs
By Ethan Millar | Sep 4, 2019 | Apr 12, 2026

Cloud computing has fundamentally reshaped how organizations operate, shifting from on-premises infrastructure to distributed, internet-accessible systems. Rather than maintaining physical servers and data centers, companies now leverage cloud platforms where data storage, processing, and management occur across networked server clusters accessible via the internet. This shift has concrete implications for how work gets done, where…

Linux & Systems Administration

Choosing Stream Processing Tools Based on Latency Needs
By Eric Ma | Nov 27, 2018 | Apr 11, 2026

The fundamental lesson from years of big data work is straightforward: solve problems with tools designed for those problems. Organizations routinely force large-scale data through mismatched infrastructure: pushing streams through micro-batching systems, analyzing graphs as tables, and handling real-time requirements with batch windows. Each mismatch adds latency, complexity, and operational burden. Stream processing deserves particular…

Scripting & Utilities

Printing Fields After a Specific Field in awk
By Eric Ma | Mar 24, 2018 | Apr 12, 2026

When processing text data, you often need to extract everything from a certain field onward. awk makes this straightforward once you understand the mechanics.

Basic Syntax

The fundamental approach uses awk's field variables and looping:

awk '{for(i=N;i<=NF;i++) printf "%s ", $i; print ""}' file.txt

Replace N with the field number you want to start from….
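As a rough illustration of the loop in the excerpt, the sketch below wraps it in a shell function and runs it on two sample lines. The function name `fields_from` and the sample input are hypothetical, and this variant deliberately differs from the excerpt's command in one way: it skips the trailing space after the last field.

```shell
# Hypothetical helper (not from the original post): print fields N..NF
# of each input line, separated by single spaces.
fields_from() {
  awk -v n="$1" '{
    # Emit each field from n onward; add a space only between fields.
    for (i = n; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "")
    # Terminate the output line (an empty line results when n > NF).
    print ""
  }'
}

# Sample input: start printing from the third field.
printf 'a b c d\none two three\n' | fields_from 3
# prints:
# c d
# three
```

Passing the start field with `-v n=...` avoids editing the awk program text each time, which is one possible design choice rather than the only one.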

Languages & Frameworks

Reading and Processing Files Line by Line in C++
By Q A | Mar 24, 2018 | Apr 12, 2026

Processing files line by line is a common task in systems programming, data processing, and log analysis. C++ provides several approaches with different trade-offs in performance, memory usage, and code clarity.

Using std::getline with std::ifstream

The most straightforward approach uses std::getline with an input file stream:

#include <fstream>
#include <string>
#include <iostream>

int main() {…

Linux & Systems Administration

Installing Hadoop 1.x: A Complete Guide
By Eric Ma | Oct 9, 2012 | Apr 12, 2026

Hadoop 1.x reached end-of-life in 2014 and is no longer maintained. This post documents a deprecated architecture for historical reference only. For new deployments, use Hadoop 3.x+ with YARN, which offers significant improvements in resource management, multi-tenancy, and reliability. See the Apache Hadoop documentation for current versions. For managed services, consider AWS EMR, Google Dataproc,…

Software Architecture

Distributed Systems and Cloud Computing: Essential Reading Guide
By Eric Ma | Sep 15, 2012 | Apr 12, 2026

Understanding distributed systems and cloud computing starts with the literature. These papers form the foundation for anyone working in the space. They are not just historical artifacts: they are still actively referenced, and their design principles remain relevant.

Internet-Scale Systems and Datacenters

The Google cluster architecture paper remains essential reading. It laid out the practical realities…

Linux & Systems Administration

mrcc – A Distributed C Compiler System on MapReduce (Archived 2010)
By Eric Ma | Jan 16, 2010 | Apr 12, 2026

Archived Content (2010): This post describes a research project from 2010. The tools and versions mentioned (Hadoop 0.20, MapReduce Streaming) are historically significant but have been largely superseded by modern distributed build systems like Bazel Remote Execution and cloud-native CI/CD pipelines.

mrcc – A Distributed C Compiler System on MapReduce

Original Project Date: January 2010…
