AMPLab Big Data Benchmark: Key Metrics and Lasting Impact
Benchmarks exist to answer one question: how fast can this system process data? The AMPLab Big Data Benchmark, developed at UC Berkeley's AMPLab, was a foundational effort to answer that question systematically across different distributed query engines.
The Benchmark’s Original Scope
The AMPLab benchmark evaluated five systems circa 2013:
- Redshift – Amazon’s columnar data warehouse (built on ParAccel)
- Hive – Hadoop SQL layer for batch processing
- Shark – early Spark-based SQL engine (superseded by Spark SQL)
- Impala – Cloudera’s MPP query engine for Hadoop, targeting low-latency, in-memory query execution
- Tez/Stinger – alternative Hadoop execution engine (from the Stinger initiative) that replaces MapReduce’s rigid map/reduce stages with general DAGs
While these specific implementations have evolved or been replaced, the benchmark design itself remains instructive.
What The Benchmark Measured
The suite ran standard SQL queries across four categories:
Scans: Full-table and partial scans to measure I/O performance and predicate pushdown effectiveness.
Aggregations: GROUP BY operations and COUNT/SUM aggregates to evaluate shuffle and reduction performance.
Joins: Multiple join patterns—including star joins with fact and dimension tables—to stress distributed join algorithms.
User-Defined Functions (UDFs): Custom functions to measure serialization overhead and per-row processing costs.
Each query ran against datasets ranging from 1 GB to 270 GB, with measurements focused on wall-clock response time from query submission to result return.
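To make these categories concrete, here is an illustrative sketch of the query shapes, written as a Python dictionary of SQL strings. The table and column names (rankings, uservisits, pageRank, sourceIP, adRevenue, destURL) follow the benchmark’s published schema, but the SQL is an approximation of the query shapes rather than the official query text, and parse_url is a hypothetical UDF.

```python
# Illustrative queries modeled on the AMPLab benchmark's four categories.
# The SQL approximates the published query shapes; it is not verbatim.
QUERIES = {
    # Scan: filter a single table, stressing I/O and predicate pushdown
    "scan": """
        SELECT pageURL, pageRank
        FROM rankings
        WHERE pageRank > 100
    """,
    # Aggregation: GROUP BY with SUM, stressing shuffle and reduction
    "aggregation": """
        SELECT SUBSTR(sourceIP, 1, 8) AS ip_prefix, SUM(adRevenue)
        FROM uservisits
        GROUP BY ip_prefix
    """,
    # Join: combine fact and dimension tables, stressing distributed joins
    "join": """
        SELECT r.pageURL, r.pageRank, v.adRevenue
        FROM rankings r
        JOIN uservisits v ON r.pageURL = v.destURL
        WHERE v.visitDate BETWEEN '2000-01-01' AND '2000-06-30'
    """,
    # UDF: per-row custom function, stressing serialization overhead
    # (parse_url is a hypothetical UDF registered with the engine)
    "udf": """
        SELECT parse_url(pageURL), COUNT(*)
        FROM rankings
        GROUP BY parse_url(pageURL)
    """,
}
```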
The Architectural Trade-offs The Benchmark Revealed
The benchmark exposed fundamental design decisions:
MapReduce-style batch systems (Hive, and Shark atop Spark in this era) prioritize fault tolerance and flexibility. They tolerate slower query execution in exchange for the ability to handle arbitrary code (UDFs), scale across thousands of nodes, and recover from node failures mid-query. Response times typically ranged from seconds to minutes.
MPP databases (Redshift, Impala, and later engines such as Trino) optimize for query latency through aggressive compilation, columnar storage, and vectorized execution. They target sub-second to low-second latency but require manual schema design, statistics management, and typically run on smaller, more powerful clusters. They handle failures through replication and redundancy rather than distributed recovery.
This wasn’t a case of one approach being “better”—they solved different problems. The benchmark made these trade-offs explicit and measurable.
Datasets and Reproducibility
The benchmark used public datasets hosted on AWS S3:
- Tables: `pageviews`, `rankings`, `uservisits` (star schema structure)
- Sizes: 1 GB, 5 GB, 25 GB, 100 GB, and 270 GB variants
- Source: Intel’s HiBench toolkit for synthetic data generation, plus Common Crawl samples
- Access: Public S3 bucket enabling anyone to reproduce results
This design—open datasets, clear queries, public results—was crucial. You could deploy your own cluster and validate whether your setup matched published baselines.
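For instance, the data historically lived under the public big-data-benchmark bucket (prefix pavlo/). Below is a minimal boto3 sketch for listing it anonymously; the bucket name and prefix are the historical values and may no longer resolve.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client; the bucket and prefix below are the historical
# AMPLab benchmark locations and may no longer be available.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="big-data-benchmark", Prefix="pavlo/text/tiny/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```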
What The Benchmark Didn’t Measure
Notably absent:
- Streaming ingestion and real-time query latency
- Machine learning pipeline performance
- Processing of semi-structured data (e.g., JSON) or modern columnar formats (e.g., Parquet)
- Update/insert performance or ACID compliance
- Cost-per-query in cloud environments
- Optimizer behavior across mixed, concurrent analytical workloads (typical in real systems)
The benchmark was a slice of the analytical workload spectrum, not the whole pie.
Applying This Today
The original benchmark targeted systems and configurations that have largely moved on. But its methodology remains relevant:
For evaluating modern systems, consider:
- TPC-H and TPC-DS have become industry standards with more comprehensive query coverage and clearly defined scale factors (SF 100, SF 1000, etc.); see the runnable sketch after this list
- Cloud data warehouses (Snowflake, BigQuery, Redshift) now dominate analytical workloads—benchmark them in their native cloud region with their actual cost structure, not as lift-and-shift comparisons
- Open-source engines such as Trino, DuckDB, and ClickHouse have shifted baselines significantly through improved query optimization, adaptive execution, and columnar processing
- Your actual queries and schemas matter more than synthetic benchmarks—if your workload is heavy on star joins and aggregations, TPC-H is relevant; if it’s mostly scans with UDFs, you need different tests
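As a concrete example of the first and third points, DuckDB ships a tpch extension that generates TPC-H data at a chosen scale factor and runs the official queries locally. A minimal sketch, assuming DuckDB is installed (pip install duckdb):

```python
import time

import duckdb

con = duckdb.connect()               # in-memory database
con.execute("INSTALL tpch")          # fetch the TPC-H extension
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 0.1)")  # generate tables at SF 0.1 (~100 MB)

# Run the official TPC-H query 1 and time it end to end.
start = time.perf_counter()
rows = con.execute("PRAGMA tpch(1)").fetchall()
print(f"Q1 returned {len(rows)} rows in {time.perf_counter() - start:.3f}s")
```

Scaling the sf parameter up reproduces the standard scale factors, so the same script works for quick local sanity checks and for larger comparisons.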
Running Your Own Benchmark
If you want to benchmark a new cluster or compare tools:
- Define queries that resemble your actual workloads—not hypothetical ones
- Use reproducible datasets (public benchmarks, anonymized production samples, or synthetic generators with fixed seeds)
- Measure end-to-end latency, not just execution time (include planning, compilation, network roundtrips)
- Run multiple iterations to account for caching effects, garbage collection, and statistical noise (see the harness sketch after this list)
- Document the environment explicitly: CPU, memory, storage type, network topology, configuration tuning
- Acknowledge what you didn’t test: failover behavior, concurrent workloads, data skew, schema changes
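As a minimal illustration of several of these points, the sketch below uses only the Python standard library, with SQLite standing in for a real distributed engine: it generates synthetic data from a fixed seed, discards warm-up iterations, and reports end-to-end wall-clock statistics. All table and column names are placeholders.

```python
import random
import sqlite3
import statistics
import time

SEED = 42                 # fixed seed -> reproducible dataset
ROWS = 100_000
WARMUP, ITERATIONS = 2, 5

# Generate a synthetic table deterministically from the seed.
random.seed(SEED)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE uservisits (source_ip TEXT, ad_revenue REAL)")
con.executemany(
    "INSERT INTO uservisits VALUES (?, ?)",
    (
        (f"10.0.{random.randrange(256)}.{random.randrange(256)}",
         random.random())
        for _ in range(ROWS)
    ),
)
con.commit()

QUERY = """
    SELECT substr(source_ip, 1, 4) AS prefix, SUM(ad_revenue)
    FROM uservisits GROUP BY prefix
"""

timings = []
for i in range(WARMUP + ITERATIONS):
    start = time.perf_counter()      # end-to-end: parse + execute + fetch
    con.execute(QUERY).fetchall()
    elapsed = time.perf_counter() - start
    if i >= WARMUP:                  # discard warm-up (cache) iterations
        timings.append(elapsed)

print(f"median {statistics.median(timings):.4f}s "
      f"min {min(timings):.4f}s max {max(timings):.4f}s "
      f"over {ITERATIONS} runs")
```

A real harness would also record the environment (CPU, memory, storage, engine version and configuration) alongside the timings so results can be compared later.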
The AMPLab benchmark’s lasting value isn’t in its specific numbers (which are dated) but in demonstrating that rigorous, reproducible measurement of distributed systems is possible and necessary. The methodology holds up; the systems have simply moved forward.
