AMPLab Big Data Benchmark: Key Metrics and Lasting Impact
Benchmarks exist to answer one question: how fast can this system process data? The AMPLab Big Data Benchmark, developed at UC Berkeley's AMPLab, was a foundational effort to answer that question systematically across different distributed query engines.
The Benchmark’s Original Scope
The AMPLab benchmark evaluated five systems circa 2013:
- Redshift – Amazon’s columnar data warehouse (built on ParAccel)
- Hive – Hadoop SQL layer for batch processing
- Shark – early Spark-based SQL engine (superseded by Spark SQL)
- Impala – Cloudera’s MPP query engine for Hadoop, targeting low-latency, in-memory query execution
- Tez/Stinger – alternative Hadoop execution engine (from the Stinger initiative) that replaces MapReduce’s rigid map/reduce stages with general DAGs
While these specific implementations have evolved or been replaced, the benchmark design itself remains instructive.
What The Benchmark Measured
The suite ran standard SQL queries across four categories:
Scans: Full-table and partial scans to measure I/O performance and predicate pushdown effectiveness.
Aggregations: GROUP BY operations and COUNT/SUM aggregates to evaluate shuffle and reduction performance.
Joins: Multiple join patterns—including star joins with fact and dimension tables—to stress distributed join algorithms.
User-Defined Functions (UDFs): Custom functions to measure serialization overhead and per-row processing costs.
Each query ran against datasets ranging from 1 GB to 270 GB, with measurements focused on wall-clock response time from query submission to result return.
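To make these categories concrete, here is an illustrative sketch of the query shapes, written as a Python dictionary of SQL strings. The table and column names (rankings, uservisits, pageRank, sourceIP, adRevenue, destURL) follow the benchmark’s published schema, but the SQL is an approximation of the query shapes rather than the official query text, and parse_url is a hypothetical UDF.

```python
# Illustrative queries modeled on the AMPLab benchmark's four categories.
# The SQL approximates the published query shapes; it is not verbatim.
QUERIES = {
    # Scan: filter a single table, stressing I/O and predicate pushdown
    "scan": """
        SELECT pageURL, pageRank
        FROM rankings
        WHERE pageRank > 100
    """,
    # Aggregation: GROUP BY with SUM, stressing shuffle and reduction
    "aggregation": """
        SELECT SUBSTR(sourceIP, 1, 8) AS ip_prefix, SUM(adRevenue)
        FROM uservisits
        GROUP BY ip_prefix
    """,
    # Join: combine fact and dimension tables, stressing distributed joins
    "join": """
        SELECT r.pageURL, r.pageRank, v.adRevenue
        FROM rankings r
        JOIN uservisits v ON r.pageURL = v.destURL
        WHERE v.visitDate BETWEEN '2000-01-01' AND '2000-06-30'
    """,
    # UDF: per-row custom function, stressing serialization overhead
    # (parse_url is a hypothetical UDF registered with the engine)
    "udf": """
        SELECT parse_url(pageURL), COUNT(*)
        FROM rankings
        GROUP BY parse_url(pageURL)
    """,
}
```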
The Architectural Trade-offs The Benchmark Revealed
The benchmark exposed fundamental design decisions:
MapReduce-style batch systems (Hive, and Shark atop Spark in this era) prioritize fault tolerance and flexibility. They tolerate slower query execution in exchange for the ability to handle arbitrary code (UDFs), scale across thousands of nodes, and recover from node failures mid-query. Response times typically ranged from seconds to minutes.
MPP databases (Redshift, Impala, and later engines such as Trino) optimize for query latency through aggressive compilation, columnar storage, and vectorized execution. They target sub-second to low-second latency but require manual schema design, statistics management, and typically run on smaller, more powerful clusters. They handle failures through replication and redundancy rather than distributed recovery.
This wasn’t a case of one approach being “better”—they solved different problems. The benchmark made these trade-offs explicit and measurable.
Datasets and Reproducibility
The benchmark used public datasets hosted on AWS S3:
- Tables: `pageviews`, `rankings`, `uservisits` (star schema structure)
- Sizes: 1 GB, 5 GB, 25 GB, 100 GB, and 270 GB variants
- Source: Intel’s HiBench toolkit for synthetic data generation, plus Common Crawl samples
- Access: Public S3 bucket enabling anyone to reproduce results
This design—open datasets, clear queries, public results—was crucial. You could deploy your own cluster and validate whether your setup matched published baselines.
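For instance, the data historically lived under the public big-data-benchmark bucket (prefix pavlo/). Below is a minimal boto3 sketch for listing it anonymously; the bucket name and prefix are the historical values and may no longer resolve.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous S3 client; the bucket and prefix below are the historical
# AMPLab benchmark locations and may no longer be available.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="big-data-benchmark", Prefix="pavlo/text/tiny/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```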
What The Benchmark Didn’t Measure
Notably absent:
- Streaming ingestion and real-time query latency
- Machine learning pipeline performance
- Processing of semi-structured data (e.g., JSON) or modern columnar formats (e.g., Parquet)
- Update/insert performance or ACID compliance
- Cost-per-query in cloud environments
- Optimizer behavior across mixed, concurrent analytical workloads (typical in real systems)
The benchmark was a slice of the analytical workload spectrum, not the whole pie.
Applying This Today
The original benchmark targeted systems and configurations that have largely moved on. But its methodology remains relevant:
For evaluating modern systems, consider:
- TPC-H and TPC-DS have become industry standards with more comprehensive query coverage and clearly defined scale factors (SF 100, SF 1000, etc.); see the runnable sketch after this list
- Cloud data warehouses (Snowflake, BigQuery, Redshift) now dominate analytical workloads—benchmark them in their native cloud region with their actual cost structure, not as lift-and-shift comparisons
- Open-source engines such as Trino, DuckDB, and ClickHouse have shifted baselines significantly through improved query optimization, adaptive execution, and columnar processing
- Your actual queries and schemas matter more than synthetic benchmarks—if your workload is heavy on star joins and aggregations, TPC-H is relevant; if it’s mostly scans with UDFs, you need different tests
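As a concrete example of the first and third points, DuckDB ships a tpch extension that generates TPC-H data at a chosen scale factor and runs the official queries locally. A minimal sketch, assuming DuckDB is installed (pip install duckdb):

```python
import time

import duckdb

con = duckdb.connect()               # in-memory database
con.execute("INSTALL tpch")          # fetch the TPC-H extension
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf = 0.1)")  # generate tables at SF 0.1 (~100 MB)

# Run the official TPC-H query 1 and time it end to end.
start = time.perf_counter()
rows = con.execute("PRAGMA tpch(1)").fetchall()
print(f"Q1 returned {len(rows)} rows in {time.perf_counter() - start:.3f}s")
```

Scaling the sf parameter up reproduces the standard scale factors, so the same script works for quick local sanity checks and for larger comparisons.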
Running Your Own Benchmark
If you want to benchmark a new cluster or compare tools:
- Define queries that resemble your actual workloads—not hypothetical ones
- Use reproducible datasets (public benchmarks, anonymized production samples, or synthetic generators with fixed seeds)
- Measure end-to-end latency, not just execution time (include planning, compilation, network roundtrips)
- Run multiple iterations to account for caching effects, garbage collection, and statistical noise (see the harness sketch after this list)
- Document the environment explicitly: CPU, memory, storage type, network topology, configuration tuning
- Acknowledge what you didn’t test: failover behavior, concurrent workloads, data skew, schema changes
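As a minimal illustration of several of these points, the sketch below uses only the Python standard library, with SQLite standing in for a real distributed engine: it generates synthetic data from a fixed seed, discards warm-up iterations, and reports end-to-end wall-clock statistics. All table and column names are placeholders.

```python
import random
import sqlite3
import statistics
import time

SEED = 42                 # fixed seed -> reproducible dataset
ROWS = 100_000
WARMUP, ITERATIONS = 2, 5

# Generate a synthetic table deterministically from the seed.
random.seed(SEED)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE uservisits (source_ip TEXT, ad_revenue REAL)")
con.executemany(
    "INSERT INTO uservisits VALUES (?, ?)",
    (
        (f"10.0.{random.randrange(256)}.{random.randrange(256)}",
         random.random())
        for _ in range(ROWS)
    ),
)
con.commit()

QUERY = """
    SELECT substr(source_ip, 1, 4) AS prefix, SUM(ad_revenue)
    FROM uservisits GROUP BY prefix
"""

timings = []
for i in range(WARMUP + ITERATIONS):
    start = time.perf_counter()      # end-to-end: parse + execute + fetch
    con.execute(QUERY).fetchall()
    elapsed = time.perf_counter() - start
    if i >= WARMUP:                  # discard warm-up (cache) iterations
        timings.append(elapsed)

print(f"median {statistics.median(timings):.4f}s "
      f"min {min(timings):.4f}s max {max(timings):.4f}s "
      f"over {ITERATIONS} runs")
```

A real harness would also record the environment (CPU, memory, storage, engine version and configuration) alongside the timings so results can be compared later.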
The AMPLab benchmark’s lasting value isn’t in its specific numbers (which are dated) but in demonstrating that rigorous, reproducible measurement of distributed systems is possible and necessary. The methodology holds up; the systems have simply moved forward.
