Benchmarks are important for understanding performance and for comparing different systems quantitatively and qualitatively. Many analytic frameworks, such as Hive, Impala and Shark, have been designed and implemented in recent years and have become fundamental software for processing big data. How to benchmark these big data analytic systems is an interesting problem.
The Big Data Benchmark
The Big Data Benchmark from AMPLab, UC Berkeley provides quantitative and qualitative comparisons of five systems at the time of writing: Redshift, a hosted MPP database offered by Amazon.com based on the ParAccel data warehouse; Hive, a Hadoop-based data warehousing system; Shark, a Hive-compatible SQL engine that runs on top of the Spark computing framework; Impala, a Hive-compatible SQL engine with its own MPP-like execution engine; and Stinger/Tez, where Tez is a next-generation Hadoop execution engine currently in development.
What is being evaluated
As stated by the benchmark website:
This benchmark measures response time on a handful of relational queries: scans, aggregations, joins, and UDF’s, across different data sizes. Keep in mind that these systems have very different sets of capabilities. MapReduce-like systems (Shark/Hive) target flexible and large-scale computation, supporting complex User Defined Functions (UDF’s), tolerating failures, and scaling to thousands of nodes. Traditional MPP databases are strictly SQL compliant and heavily optimized for relational queries. The workload here is simply one set of queries that most of these systems can complete.
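To make the idea of "response time on a handful of relational queries" concrete, here is a minimal sketch that times one scan, one aggregation, one join and one UDF query. It is not the benchmark's own harness: it uses Python's built-in sqlite3 module on a tiny in-memory dataset, and the table names, column names, the ip_prefix function and the generated rows are all hypothetical, chosen only for illustration.

```python
import sqlite3
import time


def time_query(conn, sql):
    """Run one query and return (elapsed_seconds, number_of_rows)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, len(rows)


# Hypothetical toy schema and data; not the benchmark's hosted datasets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rankings (page_url TEXT, page_rank INTEGER);
    CREATE TABLE visits (dest_url TEXT, source_ip TEXT, ad_revenue REAL);
""")
conn.executemany(
    "INSERT INTO rankings VALUES (?, ?)",
    [(f"url{i}", i % 100) for i in range(1000)],
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [(f"url{i % 1000}", f"10.0.{i % 50}.1", 1.0) for i in range(5000)],
)

# A user-defined function (hypothetical): drop the last octet of an IP address.
conn.create_function("ip_prefix", 1, lambda ip: ip.rsplit(".", 1)[0])

# One query per category named in the quote above.
QUERIES = {
    "scan": "SELECT page_url, page_rank FROM rankings WHERE page_rank > 90",
    "aggregation": "SELECT source_ip, SUM(ad_revenue) FROM visits "
                   "GROUP BY source_ip",
    "join": "SELECT r.page_url, v.ad_revenue FROM rankings r "
            "JOIN visits v ON r.page_url = v.dest_url",
    "udf": "SELECT ip_prefix(source_ip), COUNT(*) FROM visits GROUP BY 1",
}

results = {}
for name, sql in QUERIES.items():
    elapsed, row_count = time_query(conn, sql)
    results[name] = (elapsed, row_count)
    print(f"{name}: {row_count} rows in {elapsed:.6f} s")
```

A real comparison would run each query several times against each system on the same data size and report a summary such as the median, since a single run is noisy.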
The dataset is an important part of a benchmark if others want to reproduce or verify the results. The Big Data Benchmark provides hosted datasets on S3. The largest dataset is around 270 GB and is intended for 5-node tests. The datasets were generated using Intel’s Hadoop Benchmark Suite (HiBench) and data sampled from the Common Crawl document corpus.