Designing Scalable Data Storage and Processing Architectures
Modern datacenters rely on distributed storage and processing systems designed to handle massive datasets across clusters of commodity hardware. Understanding these systems is essential for anyone working with infrastructure at scale. Here’s an overview of the major architectures and implementations you’ll encounter.
Storage Systems
Google File System (GFS) and successors
GFS established the foundation for distributed storage at scale. It prioritizes high throughput over latency, using replication across three or more nodes for fault tolerance. The design assumes node failures are routine and handles them transparently.
HDFS borrowed heavily from GFS principles and remains the standard distributed filesystem for Hadoop deployments. It uses a NameNode for metadata management and DataNodes for actual block storage. Be aware that HDFS works best for write-once, read-many workloads. For frequently updated data, you’ll want something else.
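As a concrete illustration, here is a minimal sketch of the write-once, read-many pattern from Python via pyarrow's HDFS binding. This assumes a reachable cluster and a local libhdfs installation; the NameNode address and paths are placeholders.

```python
# Hedged sketch of write-once/read-many HDFS access via pyarrow.
# The NameNode host/port and file paths below are placeholders.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# Write once...
with hdfs.open_output_stream("/data/events/part-00000") as f:
    f.write(b"event-1\nevent-2\n")

# ...read many times. In-place updates are where HDFS fights you.
with hdfs.open_input_stream("/data/events/part-00000") as f:
    print(f.read().decode())
```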
Google’s Colossus (internally referred to as GFS2) improved on the original design with better scalability, metadata distribution, and handling of very large files.
Structured data stores
BigTable introduced a sparse, distributed, multi-dimensional sorted map as its data model, organized into column families. It handles large sparse datasets efficiently and powers many Google services. By layering tablet servers on top of GFS, the design separates storage from compute, allowing independent scaling.
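To make the data model concrete, here is a toy, single-process sketch of that core abstraction: a sparse map sorted by (row key, column, timestamp), with newer versions sorting first. All names and the `TinyTable` class are illustrative, not Bigtable's API.

```python
# Toy sketch of Bigtable's data model: a sparse, sorted map from
# (row_key, column, timestamp) -> value.
import bisect, time

class TinyTable:
    def __init__(self):
        self._keys = []   # sorted list of (row, column, -timestamp)
        self._vals = {}

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        key = (row, column, -ts)          # newest version sorts first
        bisect.insort(self._keys, key)
        self._vals[key] = value

    def get(self, row, column):
        # The first key in the (row, column) group is the newest version.
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            return self._vals[self._keys[i]]
        return None

t = TinyTable()
t.put("com.example/index.html", "contents:", "<html>v1</html>")
t.put("com.example/index.html", "contents:", "<html>v2</html>")
print(t.get("com.example/index.html", "contents:"))  # newest: v2
```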
Spanner extended BigTable concepts with strong consistency guarantees across geographic regions through TrueTime, an API that exposes bounded clock uncertainty for timestamp-based multi-version concurrency control. It has become the model for systems requiring both scalability and ACID properties.
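The key mechanism is commit wait: a transaction's result is withheld until its timestamp is guaranteed to be in the past everywhere. A single-process sketch of the idea, with a made-up uncertainty bound standing in for TrueTime's epsilon:

```python
# Hedged sketch of Spanner's commit-wait rule. TrueTime exposes an
# interval [earliest, latest] bounding the true time; here the
# uncertainty bound is an illustrative constant, not a real API.
import time

CLOCK_UNCERTAINTY_S = 0.007   # assume a ~7 ms bound

def tt_now():
    """Return an (earliest, latest) interval around the true time."""
    t = time.time()
    return (t - CLOCK_UNCERTAINTY_S, t + CLOCK_UNCERTAINTY_S)

def commit(txn_apply):
    _, latest = tt_now()
    commit_ts = latest            # a timestamp known to be >= true now
    txn_apply(commit_ts)          # apply writes at commit_ts
    # Commit wait: don't acknowledge until commit_ts is definitely
    # past, which is what yields external consistency across replicas.
    while tt_now()[0] < commit_ts:
        time.sleep(0.001)
    return commit_ts

commit(lambda ts: print("applied at", ts))
```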
Megastore, which predates Spanner, layered ACID transactions within entity groups on top of BigTable, serving relational-style workloads that need consistency alongside high availability.
Key-value and in-memory systems
Dynamo pioneered highly available key-value storage through consistent hashing, eventual consistency, and peer-to-peer replication. Amazon's architecture heavily influenced modern NoSQL databases.
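Consistent hashing is the heart of the design: keys and nodes map onto one ring, and a key's replicas are the next distinct nodes clockwise. A minimal sketch, with illustrative node names and virtual-node counts:

```python
# Minimal consistent-hashing ring in the spirit of Dynamo. Virtual
# nodes smooth out load; replication walks the ring clockwise.
import bisect, hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        self._points = sorted((_hash(f"{n}#{i}"), n)
                              for n in nodes for i in range(vnodes))

    def preference_list(self, key, n_replicas=3):
        """First n_replicas distinct nodes clockwise from hash(key)."""
        i = bisect.bisect(self._points, (_hash(key), ""))
        out = []
        while len(out) < n_replicas:
            node = self._points[i % len(self._points)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.preference_list("user:42"))   # e.g. three distinct nodes
```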
RAMCloud optimized for ultra-low latency by keeping all data in DRAM across a cluster. Durability comes from log segments backed up on other servers' storage; after a crash, the lost server's data is reconstructed in parallel across the cluster within seconds, making DRAM-based storage practical for mission-critical data.
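The recovery idea, reduced to a toy sketch: the crashed server's backed-up log segments are partitioned by key range and replayed by many workers at once, so recovery time shrinks as the cluster grows. Segment contents here are invented.

```python
# Illustrative sketch of RAMCloud-style parallel recovery: segments
# are grouped by key-range partition and replayed concurrently.
from concurrent.futures import ThreadPoolExecutor

def recover_partition(segments):
    # Replay log segments for one key range; later entries win.
    rebuilt = {}
    for seg in segments:
        rebuilt.update(seg)
    return rebuilt

# Pretend backup segments, already grouped by key-range partition.
partitions = [
    [{"k1": "v1"}, {"k1": "v1b"}],
    [{"k2": "v2"}],
    [{"k3": "v3"}, {"k4": "v4"}],
]

with ThreadPoolExecutor() as pool:
    recovered = list(pool.map(recover_partition, partitions))
print(recovered)
```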
Compute Systems
Batch processing frameworks
MapReduce simplified distributed processing by hiding cluster management complexity. Users write map and reduce functions; the framework handles data distribution, fault tolerance, and aggregation. Hadoop's implementation is still widely deployed, though Spark has displaced it for most new projects.
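The programming model fits in a few lines. This single-process sketch shows the map and reduce functions plus the shuffle that the framework would normally perform across a cluster:

```python
# Word count as the classic map/reduce pair. In a real framework the
# shuffle, distribution, and fault tolerance between the two phases
# are handled for you; this sketch runs in one process.
from collections import defaultdict

def map_fn(doc):
    for word in doc.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def run(docs):
    groups = defaultdict(list)          # the "shuffle" phase
    for doc in docs:
        for k, v in map_fn(doc):
            groups[k].append(v)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(run(["the quick fox", "the lazy dog"]))
```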
Spark improves on MapReduce with in-memory RDDs (Resilient Distributed Datasets) that persist across operations, dramatically accelerating iterative algorithms and interactive queries. Most teams migrating from Hadoop should consider Spark first.
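The same word count in PySpark, assuming a local Spark installation; the input path is a placeholder. Note the `.cache()` call, which is what makes repeated use of an RDD cheap:

```python
# Word count on an RDD. Caching keeps the result in memory, so
# iterative algorithms and interactive queries reuse it for free.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")
lines = sc.textFile("hdfs:///data/corpus.txt")   # placeholder path
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .cache())                 # persist in memory for reuse
print(counts.take(10))
```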
Specialized processing models
Dremel introduced columnar storage for fast interactive analysis. The query engine reads only the columns a query needs, achieving near-realtime analytics on massive datasets. BigQuery is the external productization of Dremel, and the approach influenced tools like Presto.
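Columnar reads in miniature: with a columnar format like Parquet, an engine pulls only the columns a query touches. The file and column names below are placeholders.

```python
# Reading just two columns from a Parquet file; the rest of the
# file's bytes are never touched.
import pyarrow.parquet as pq
import pyarrow.compute as pc

table = pq.read_table("events.parquet",
                      columns=["user_id", "latency_ms"])
print(pc.mean(table["latency_ms"]))
```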
Storm and modern stream processors (Kafka Streams, Flink) handle continuous data streams, processing events as they arrive rather than in batch windows. Choose stream processors when you need sub-second latency on unbounded data.
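The core of windowed stream processing, stripped of any framework: group events into tumbling windows as they arrive and emit each window when it closes. This sketch assumes in-order events; the event shape and window size are made up.

```python
# Bare-bones tumbling-window counting over an unbounded stream,
# the basic move behind Storm/Flink-style processing.
from collections import Counter

def tumbling_counts(events, window_s=60):
    """events: in-order iterable of (timestamp, key) pairs."""
    window_start, counts = None, Counter()
    for ts, key in events:               # process each event on arrival
        bucket = ts - (ts % window_s)
        if window_start is not None and bucket != window_start:
            yield (window_start, dict(counts))   # emit the closed window
            counts = Counter()
        window_start = bucket
        counts[key] += 1
    if window_start is not None:
        yield (window_start, dict(counts))

stream = [(0, "a"), (30, "b"), (61, "a"), (62, "a"), (130, "b")]
print(list(tumbling_counts(stream)))
```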
Pregel abstracted graph processing as iterative vertex-centric computation. For iterative algorithms on graphs with billions of vertices, this bulk synchronous parallel model is a far better fit than chains of MapReduce jobs.
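A toy superstep loop makes the model concrete: each vertex processes its inbox, updates local state, and sends messages; a barrier separates supersteps, and computation stops when no messages remain. Here, single-source shortest paths on an invented three-vertex graph:

```python
# Pregel-style supersteps: vertices exchange messages and run the
# same compute function until the system goes quiet.
INF = float("inf")

graph = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": []}
dist = {v: INF for v in graph}
messages = {"a": [0]}                  # seed the source vertex

while messages:                        # one iteration = one superstep
    outbox = {}
    for v, inbox in messages.items():
        best = min(inbox)
        if best < dist[v]:             # vertex acts only if it improves
            dist[v] = best
            for nbr, w in graph[v]:    # send candidate distances
                outbox.setdefault(nbr, []).append(best + w)
    messages = outbox                  # barrier between supersteps

print(dist)   # {'a': 0, 'b': 1, 'c': 2}
```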
FlumeJava and Pig Latin provided higher-level abstractions for pipeline construction. Pig particularly appealed to analysts with database backgrounds, who could express dataflows without writing Java, though SQL-on-Hadoop tools have since become the standard.
Dryad and DryadLINQ (Microsoft) offered flexible dataflow graphs before Spark. Dryad’s dynamic optimization influenced later systems.
Resource Management
Cluster schedulers
Mesos provides fine-grained resource sharing across multiple frameworks on a single cluster. Rather than one scheduler managing all resources, Mesos acts as a broker: each framework registers its own scheduler, the Mesos master offers available capacity, and the framework accepts or declines each offer. This two-level architecture enabled running MapReduce, Spark, and other workloads simultaneously.
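The offer cycle, reduced to a sketch: the master hands each framework's scheduler an offer, and the scheduler accepts or declines by its own policy. Resource sizes and framework policies here are invented for illustration.

```python
# Two-level scheduling in miniature: the master makes offers,
# framework schedulers decide.
cluster = {"agent-1": {"cpus": 8, "mem_gb": 32},
           "agent-2": {"cpus": 4, "mem_gb": 16}}

def spark_scheduler(offer):
    # Accept only offers big enough for an executor.
    return offer["cpus"] >= 4 and offer["mem_gb"] >= 16

def cron_scheduler(offer):
    return offer["cpus"] >= 1              # happy with scraps

frameworks = {"spark": spark_scheduler, "cron": cron_scheduler}

for agent, resources in cluster.items():   # master offers in turn
    for name, accepts in frameworks.items():
        if accepts(resources):
            print(f"{name} accepted {resources} on {agent}")
            break                          # offer consumed
        print(f"{name} declined offer on {agent}")
```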
YARN (Hadoop’s resource negotiator) became the more common choice for Hadoop-centric environments, though Kubernetes has since emerged as the dominant cluster orchestration platform for containerized workloads.
Kubernetes’ declarative scheduling, automatic bin-packing, and service discovery make it the standard for new infrastructure. Most new distributed systems integrate with Kubernetes rather than Mesos directly.
Practical Considerations for 2026
When building distributed storage and processing pipelines today:
- Choose Spark over Hadoop MapReduce for batch processing unless you have deeply entrenched legacy code
- Evaluate Kubernetes before Mesos for new cluster deployments
- Use managed services (e.g., BigQuery, Snowflake) instead of building custom systems when scale and budget permit
- Start with stream processing (Kafka Streams, Flink, or Spark Structured Streaming) if your workload has temporal structure
- Separate storage from compute to scale each independently: object storage (S3, GCS) with query engines (DuckDB, Presto) is often simpler than tightly coupled systems; see the sketch after this list
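The sketch promised above: DuckDB querying Parquet files directly in object storage. The bucket path is a placeholder, credential setup is omitted, and the `httpfs` extension supplies the S3 support in recent DuckDB versions.

```python
# Storage/compute split in miniature: Parquet lives in S3, DuckDB
# queries it in place with no cluster to manage.
import duckdb

con = duckdb.connect()
con.sql("INSTALL httpfs")   # S3 support; credential config omitted
con.sql("LOAD httpfs")

top = con.sql("""
    SELECT user_id, avg(latency_ms) AS avg_latency
    FROM read_parquet('s3://my-bucket/events/*.parquet')
    GROUP BY user_id
    ORDER BY avg_latency DESC
    LIMIT 10
""")
print(top)
```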
The principles from these foundational papers remain relevant: replication for fault tolerance, separation of concerns, and embracing node failures as expected events rather than exceptions.

Facebook's Memcache and TAO are also interesting examples of scalable, real-world systems: http://www.systutorials.com/qa/364/cache-at-facebook