pumabenchmarks

MapReduce Benchmarks

by Faraz Ahmad, Seyong Lee, Mithuna Thottethodi, T. N. Vijaykumar

MapReduce is a well-known programming model, developed within Google, for processing large amounts of raw data, for example, crawled documents or web request logs. This data is usually so large that it must be distributed across thousands of machines in order to be processed in a reasonable time. The ease of programmability, automatic data management, and transparent fault tolerance have made MapReduce a favored choice for batch processing in large-scale data centers. Map, written by a user of the MapReduce library, takes an input pair and produces a set of intermediate key/value pairs. The library groups together all intermediate values associated with the same intermediate key and passes them to the reduce function through an all-map-to-all-reduce communication step called Shuffle. Reduce, also written by the user, receives an intermediate key along with its set of values from Map and merges these values to produce the final output. Hadoop is an open-source implementation of MapReduce which is regularly improved and extended by software developers and researchers and is maintained by the Apache Software Foundation. Despite the vast effort that has gone into developing Hadoop MapReduce, comparatively little rigorous work has been done on the benchmark side.
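In Hadoop, the user supplies these two functions by extending the framework's Mapper and Reducer classes. The fragment below is a minimal, illustrative sketch of that shape (names and types are placeholders; the job driver is omitted):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// User-written Map: consumes one input record and emits intermediate <key, value> pairs.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // ... derive one or more intermediate keys/values from the record ...
    context.write(new Text("some-key"), new IntWritable(1));
  }
}

// User-written Reduce: receives an intermediate key with all of its values
// (grouped by the framework's Shuffle) and merges them into the final output.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int merged = 0;
    for (IntWritable v : values) merged += v.get();
    context.write(key, new IntWritable(merged));
  }
}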

During our work on MapReduce, we developed a benchmark suite that represents a broad range of MapReduce applications, exhibiting characteristics of high/low computation and high/low shuffle volume. There are a total of 13 benchmarks, of which Tera-Sort, Word-Count, and Grep come from the Hadoop distribution. The remaining benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from the Hadoop distribution are slightly modified to take the number of reduce tasks as a user input and to generate final job-completion time statistics. The benchmarks and data sets can be downloaded from here.

 

1. Word-Count

counts the occurrences of each word in a large collection of documents. Map emits <word,1> tuples. Reduce adds up the counts for a given word from all map tasks and outputs the final count.

 

Input format: any document (usually a web document in text/xml format) 

Output format: <word> <count>

Dataset: Downloaded from http://dumps.wikimedia.org/enwiki/ . Due to HDFS (Hadoop Distributed File System) limitations, the datasets needed some processing, such as (i) copying all files from multiple hierarchical directories into one directory, (ii) merging multiple files to create a small number of large files rather than a large number of small files, and (iii) eliminating file names containing special characters.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar wordcount -r <num-reduces> <input-dir> <output-dir>
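A minimal sketch of the Word-Count logic described above, using Hadoop's Mapper/Reducer API; class names are illustrative and the job driver is omitted:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
  // Map: emit <word, 1> for every word in the input line.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: add up the counts for a given word across all map tasks.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(word, new IntWritable(sum));
    }
  }
}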

 

 

2. Inverted-Index

takes a list of documents as input and generates word-to-document indexing. Map emits <word, docId> tuples with each word emitted once per docId. Reduce combines all tuples on key <word> and emits <word,list(docId)> tuples after removing duplicates.

 

Input format: any document (usually a web document in text/xml format)

Output format: <word> <docId>

Dataset: Same as for Word-Count.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar invertedindex -m <num-maps> -r <num-reduces> <input-dir> <output-dir>

Note: To let Hadoop figure out the number of map tasks, supply -m 1 here.
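A minimal sketch of the Inverted-Index logic, assuming the input file name serves as the docId and de-duplicating per map task; class names are illustrative, the job driver is omitted, and the released benchmark may differ in detail:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexSketch {
  // Map: emit <word, docId>, at most once per word within this map task's input.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Set<String> seen = new HashSet<>();
    private String docId;
    @Override
    protected void setup(Context context) {
      docId = ((FileSplit) context.getInputSplit()).getPath().getName();
    }
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        String word = tok.nextToken();
        if (seen.add(word)) {
          context.write(new Text(word), new Text(docId));
        }
      }
    }
  }

  // Reduce: combine all docIds for a word, drop duplicates, emit <word, list(docId)>.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text word, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      Set<String> unique = new TreeSet<>();
      for (Text d : docIds) unique.add(d.toString());
      context.write(word, new Text(String.join(",", unique)));
    }
  }
}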

 

 

3. Term-Vector

determines the most frequent words in a set of documents and is useful in analyzing a host’s relevance to a search. Map emits <host,termvector> tuples, where termvector is itself a tuple of the form <word, 1>. Reduce discards the words whose frequency is below a cut-off, sorts the rest of the list for each key in descending order of count, and emits tuples of the form <host, list(termvector)>.

 

Input format: any document (usually a web document in text/xml format)

Output format: <host> <termvector>

Dataset: Same as for Word-Count.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar termvectorperhost -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
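A sketch of the Term-Vector logic under simplifying assumptions: the input file name stands in for the host, the frequency cut-off value is illustrative, and the job driver is omitted. The released benchmark may differ in detail:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TermVectorSketch {
  private static final int CUTOFF = 5;  // frequency cut-off (illustrative value)

  // Map: emit <host, "word:1"> for every word; the input file name stands in for the host.
  public static class TermMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text host;
    @Override
    protected void setup(Context context) {
      host = new Text(((FileSplit) context.getInputSplit()).getPath().getName());
    }
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      while (tok.hasMoreTokens()) {
        context.write(host, new Text(tok.nextToken() + ":1"));
      }
    }
  }

  // Reduce: total the counts per word, drop words below the cut-off,
  // sort the remainder by descending count, and emit <host, list(word:count)>.
  public static class TermReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text host, Iterable<Text> termvectors, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> freq = new HashMap<>();
      for (Text tv : termvectors) {
        String word = tv.toString().split(":")[0];   // assumes words contain no ':'
        freq.merge(word, 1, Integer::sum);
      }
      List<Map.Entry<String, Integer>> kept = new ArrayList<>();
      for (Map.Entry<String, Integer> e : freq.entrySet()) {
        if (e.getValue() >= CUTOFF) kept.add(e);
      }
      kept.sort((a, b) -> b.getValue() - a.getValue());   // descending by count
      StringBuilder out = new StringBuilder();
      for (Map.Entry<String, Integer> e : kept) {
        if (out.length() > 0) out.append(",");
        out.append(e.getKey()).append(":").append(e.getValue());
      }
      context.write(host, new Text(out.toString()));
    }
  }
}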

 

 

4. Self-Join

is similar to the candidate-generation part of the Apriori data mining algorithm, which generates associations among k+1 fields given the set of k-field associations. Map receives candidate lists of the form {element1, element2, ..., elementk}, each list in alphanumerically sorted order, and breaks them into <{element1, element2, ..., elementk-1}, {elementk}> tuples. Reduce prepares a sorted list of all the Map values for a given key, building <{element1, element2, ..., elementk-1}, {val1, val2, ..., valj}> tuples. From these tuples, (k+1)-sized candidates can be obtained by appending consecutive pairs of map values vali, vali+1 to the (k-1)-sized key. By avoiding repetition of the (k-1)-sized key for every pair of values in the list, the tuples are a compact representation of the (k+1)-sized candidate set.

 

Input format: {e1, e2, ..., ek}

Output format: <e1, e2, ..., ek-1><ek, ek+1>

Dataset: Synthetic data (details here)

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar selfjoin -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
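A sketch of the Self-Join logic, assuming one comma-separated candidate list per input line (job driver omitted; the released benchmark may differ in detail):

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SelfJoinSketch {
  // Map: split a sorted k-element candidate "e1,e2,...,ek" into
  // key = first k-1 elements, value = last element.
  public static class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text candidate, Context context)
        throws IOException, InterruptedException {
      String line = candidate.toString().replaceAll("[{}]", "").trim();
      int cut = line.lastIndexOf(',');
      if (cut <= 0) return;                                   // need at least two elements
      context.write(new Text(line.substring(0, cut)),         // e1,...,e(k-1)
                    new Text(line.substring(cut + 1).trim())); // ek
    }
  }

  // Reduce: sort all last elements that share a (k-1)-sized prefix and emit them as
  // one list; consecutive pairs (val_i, val_i+1) appended to the prefix give the
  // (k+1)-sized candidates without repeating the prefix for every pair.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text prefix, Iterable<Text> lastElems, Context context)
        throws IOException, InterruptedException {
      List<String> vals = new ArrayList<>();
      for (Text v : lastElems) vals.add(v.toString());
      Collections.sort(vals);
      context.write(prefix, new Text(String.join(",", vals)));
    }
  }
}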

 

 

5. Adjacency-List

is similar to search-engine computation that generates adjacency and reverse-adjacency lists of the nodes of a graph for use by PageRank-like algorithms. Map receives as input the edges <p, q> of a directed graph that follows the power law of the World Wide Web: we assume the probability that a node has an out-degree of i is proportional to 1/i^skew, with an average out-degree of 7.2. Map emits tuples of the form <q, from_list{p}:to_list{}> and <p, from_list{}:to_list{q}>. For a given key, Reduce takes the union of the individual from_list and to_list fields, sorts the items within each union, and emits <p (and q), from_list{sorted union of all individual from_lists}:to_list{sorted union of all individual to_lists}> tuples.

 

Input format: {p,q}

Output format: <p><from{list_of_in_degree}:to{list_of_out_degree}>, <q><from{list_of_in_degree}:to{list_of_out_degree}>

Dataset: Synthetic data (details here)

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar adjlist -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
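A sketch of the Adjacency-List logic, assuming one edge per input line and a simple from{...}:to{...} text encoding for the intermediate values (job driver omitted; the released benchmark may differ):

import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AdjacencyListSketch {
  // Map: for an edge p -> q, emit <q, from{p}:to{}> and <p, from{}:to{q}>.
  public static class EdgeMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text edge, Context context)
        throws IOException, InterruptedException {
      String[] pq = edge.toString().replaceAll("[{}]", "").trim().split("[,\\s]+");
      if (pq.length < 2) return;
      context.write(new Text(pq[1]), new Text("from{" + pq[0] + "}:to{}"));
      context.write(new Text(pq[0]), new Text("from{}:to{" + pq[1] + "}"));
    }
  }

  // Reduce: union the from/to fragments per node, sort each side, and emit
  // <node, from{sorted in-neighbors}:to{sorted out-neighbors}>.
  public static class UnionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text node, Iterable<Text> fragments, Context context)
        throws IOException, InterruptedException {
      TreeSet<String> from = new TreeSet<>();
      TreeSet<String> to = new TreeSet<>();
      for (Text f : fragments) {
        String s = f.toString();
        String fromPart = s.substring(s.indexOf("from{") + 5, s.indexOf("}:to{"));
        String toPart = s.substring(s.indexOf("}:to{") + 5, s.length() - 1);
        if (!fromPart.isEmpty()) from.add(fromPart);
        if (!toPart.isEmpty()) to.add(toPart);
      }
      context.write(node, new Text("from{" + String.join(",", from)
          + "}:to{" + String.join(",", to) + "}"));
    }
  }
}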

 

 

6. K-Means

is a popular data mining algorithm that clusters input data into k clusters; it iterates to successively improve the clustering. We classify movies based on their ratings, using anonymized movie rating data of the form <movie_id: list{rater_id, rating}>, and use random starting values for the cluster centroids. Map computes the cosine-vector similarity of a given movie with each centroid and determines the centroid to which the movie is closest (i.e., the cluster to which it belongs). Map emits <centroid_id, (similarity_value, movie_data)>, where movie_data is (movie_id: list{rater_id, rating}). Reduce determines the new centroid of each cluster by computing the average similarity of all the movies in the cluster; the movie closest to this average becomes the new centroid. Reduce emits the movie data along with the current centroid, as well as the new centroid data (model file) for use in the next iteration. The algorithm iterates until the change in the centroids falls below a threshold.

 

Input Format: {movie_id: userid1_rating1, userid2_rating2, ...}

Output Format: kmeans produces two types of outputs:

(a) <centroid_num><{movie_id: userid1_rating1, userid2_rating2, ...}> (list of all movies associated with a particular centroid)

(b) <centroid_num><{similarity_value}{centroid_movie_id}{num_members}{userid1_rating1, userid2_rating2, …}> (new centroid)

Datasets: movie ratings dataset.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar kmeans_itertxt_hr -m <num-maps> -r <num-reduces> <input-dir> <output-dir>

Note: To run multiple iterations of kmeans, a wrapper script should be executed. The details can be found here.
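A sketch of one k-means iteration as described above. Loading the current centroids, the exact record encoding, and the job driver are omitted, and the helper names are illustrative; the released benchmark may differ in detail:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansSketch {
  // Parse "movie_id: user1_rating1,user2_rating2,..." into a rater -> rating vector.
  static Map<String, Double> ratings(String record) {
    Map<String, Double> v = new HashMap<>();
    for (String ur : record.substring(record.indexOf(':') + 1).trim().split(",")) {
      String[] parts = ur.trim().split("_");
      if (parts.length == 2) v.put(parts[0], Double.parseDouble(parts[1]));
    }
    return v;
  }

  // Cosine similarity between two rating vectors, over the raters they share.
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0, na = 0, nb = 0;
    for (double x : a.values()) na += x * x;
    for (double x : b.values()) nb += x * x;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      Double y = b.get(e.getKey());
      if (y != null) dot += e.getValue() * y;
    }
    return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Map: find the centroid this movie is most similar to and emit
  // <centroid_id, "similarity;movie_record">. Loading the current centroids
  // (model file) in setup() is omitted here.
  public static class ClusterMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Map<String, Double>> centroids = new HashMap<>();
    @Override
    public void map(LongWritable offset, Text movie, Context context)
        throws IOException, InterruptedException {
      Map<String, Double> m = ratings(movie.toString());
      String best = null;
      double bestSim = -1;
      for (Map.Entry<String, Map<String, Double>> c : centroids.entrySet()) {
        double sim = cosine(m, c.getValue());
        if (sim > bestSim) { bestSim = sim; best = c.getKey(); }
      }
      if (best != null) context.write(new Text(best), new Text(bestSim + ";" + movie));
    }
  }

  // Reduce: average the similarities within a cluster, pick the movie whose
  // similarity is closest to that average as the new centroid, and emit both
  // the cluster membership and the new centroid record for the next iteration.
  public static class CentroidReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text centroid, Iterable<Text> members, Context context)
        throws IOException, InterruptedException {
      List<Double> sims = new ArrayList<>();
      List<String> movies = new ArrayList<>();
      double sum = 0;
      for (Text m : members) {
        String[] parts = m.toString().split(";", 2);
        double sim = Double.parseDouble(parts[0]);
        sims.add(sim);
        movies.add(parts[1]);
        sum += sim;
        context.write(centroid, new Text(parts[1]));          // cluster membership
      }
      if (movies.isEmpty()) return;
      double avg = sum / movies.size();
      int best = 0;
      for (int i = 1; i < movies.size(); i++) {
        if (Math.abs(sims.get(i) - avg) < Math.abs(sims.get(best) - avg)) best = i;
      }
      context.write(centroid, new Text("NEW_CENTROID;" + movies.get(best)));  // model file entry
    }
  }
}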

 

 

7. Classification

classifies the movies into one of k pre-determined clusters. Like k-means, classification uses anonymized movie rating data of the form <movie_id: list{rater_id, rating}>, and Map computes the cosine-vector similarity of a given movie with each centroid and determines the centroid to which the movie is closest (i.e., the cluster to which it belongs). Map emits <centroid_id, movie_id>. Unlike k-means, the details of the movie ratings are not emitted, because there are no further iterations that would need them. Reduce collects all the movies in a cluster and emits <centroid_id, movie_id>.

 

Input Format: {movie_id: userid1_rating1, userid2_rating2, ….}

Output Format: <centroid_num><movieid>

Datasets: movie ratings dataset.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar classification -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
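A sketch of the Classification logic; it reuses the ratings() and cosine() helpers from the K-Means sketch above (assumed to be in the same package), centroid loading and the job driver are omitted, and names are illustrative:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ClassificationSketch {
  // Map: assign each movie to its closest pre-determined centroid and emit only the ids.
  public static class AssignMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, Map<String, Double>> centroids = new HashMap<>();
    // setup() would load the k fixed centroids; omitted here.
    @Override
    public void map(LongWritable offset, Text movie, Context context)
        throws IOException, InterruptedException {
      String record = movie.toString();
      String movieId = record.substring(0, record.indexOf(':')).trim();
      Map<String, Double> m = KMeansSketch.ratings(record);
      String best = null;
      double bestSim = -1;
      for (Map.Entry<String, Map<String, Double>> c : centroids.entrySet()) {
        double sim = KMeansSketch.cosine(m, c.getValue());
        if (sim > bestSim) { bestSim = sim; best = c.getKey(); }
      }
      // Unlike K-Means, only the movie id is emitted: no later iteration needs the ratings.
      if (best != null) context.write(new Text(best), new Text(movieId));
    }
  }

  // Reduce: collect all movie ids assigned to a centroid and emit <centroid_id, movie_id>.
  public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text centroid, Iterable<Text> movieIds, Context context)
        throws IOException, InterruptedException {
      for (Text id : movieIds) context.write(centroid, id);
    }
  }
}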

 

 

8. Histogram-Movies

generates a histogram of the input data and is a generic tool used in many data analyses. We use the movie rating data, with input of the form <movie_id: list{rater_id, rating}>. Based on the average rating of each movie (ratings range from 1 to 5), we bin the movies into 8 bins, each covering a range of 0.5. Map computes the average rating for a movie, determines its bin, and emits <bin_value, 1> tuples. Reduce collects all the tuples for a bin and outputs a <bin_value, n> tuple.

 

Input Format: {movie_id: userid1_rating1, userid2_rating2, ….}

Output Format: <bin_value><num_of_movies>

Datasets: movie ratings dataset.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar histogram_movies -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
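A sketch of the Histogram-Movies logic; the record parsing and the exact bin-boundary arithmetic are illustrative, and the job driver is omitted:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HistogramMoviesSketch {
  // Map: average a movie's ratings, round the average up to the next 0.5-wide
  // bin boundary (1.5, 2.0, ..., 5.0), and emit <bin, 1>.
  public static class BinMapper extends Mapper<LongWritable, Text, DoubleWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    public void map(LongWritable offset, Text movie, Context context)
        throws IOException, InterruptedException {
      String record = movie.toString();
      String[] reviews = record.substring(record.indexOf(':') + 1).trim().split(",");
      double sum = 0;
      int n = 0;
      for (String ur : reviews) {
        String[] parts = ur.trim().split("_");
        if (parts.length == 2) { sum += Double.parseDouble(parts[1]); n++; }
      }
      if (n == 0) return;
      double bin = Math.ceil((sum / n) * 2.0) / 2.0;   // one of the 0.5-wide bins
      context.write(new DoubleWritable(bin), ONE);
    }
  }

  // Reduce: total the number of movies that fall in each bin.
  public static class CountReducer extends Reducer<DoubleWritable, IntWritable, DoubleWritable, IntWritable> {
    @Override
    public void reduce(DoubleWritable bin, Iterable<IntWritable> ones, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable o : ones) count += o.get();
      context.write(bin, new IntWritable(count));
    }
  }
}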

 

 

9. Histogram-Ratings

generates a histogram of the user-rating trend. The input is of the form <movie_id: list{rater_id, rating}>. Here, we bin the user ratings of 1-5 into 5 bins; Map emits a <rating, 1> tuple for each review, and Reduce collects all the tuples for a rating and emits a <rating, n> tuple.

 

Input Format: {movie_id: userid1_rating1, userid2_rating2, ….}

Output Format: <rating><num_of_user_reviews>

Datasets: movie ratings dataset.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar histogram_ratings -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
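A sketch of the Histogram-Ratings logic, with illustrative record parsing and the job driver omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class HistogramRatingsSketch {
  // Map: emit <rating, 1> for every individual user review of a movie.
  public static class RatingMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    public void map(LongWritable offset, Text movie, Context context)
        throws IOException, InterruptedException {
      String record = movie.toString();
      for (String ur : record.substring(record.indexOf(':') + 1).trim().split(",")) {
        String[] parts = ur.trim().split("_");
        if (parts.length == 2) {
          context.write(new IntWritable((int) Double.parseDouble(parts[1])), ONE);
        }
      }
    }
  }

  // Reduce: total the number of reviews for each of the five rating values.
  public static class CountReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    public void reduce(IntWritable rating, Iterable<IntWritable> ones, Context context)
        throws IOException, InterruptedException {
      int count = 0;
      for (IntWritable o : ones) count += o.get();
      context.write(rating, new IntWritable(count));
    }
  }
}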

 

 

10. Sequence-Count

generates a count of all unique sequences of three consecutive words per document in the input data. Map emits <word1|word2|word3|filename, 1> tuples. Reduce adds up the counts for each word sequence from all map tasks and outputs the final count.

 

Input format: any document (usually a web document in text/xml format)

Output format: <word1|word2|word3|filename> <count>

Dataset: Same as for Word-Count.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar sequencecounts -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
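A sketch of the Sequence-Count logic; for simplicity, three-word sequences that span line boundaries are ignored, names are illustrative, and the job driver is omitted:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SequenceCountSketch {
  // Map: slide a window of three consecutive words over each line and emit
  // <word1|word2|word3|filename, 1>.
  public static class TrigramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private String fileName;
    @Override
    protected void setup(Context context) {
      fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    }
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(line.toString());
      String w1 = null, w2 = null;
      while (tok.hasMoreTokens()) {
        String w3 = tok.nextToken();
        if (w1 != null) {
          context.write(new Text(w1 + "|" + w2 + "|" + w3 + "|" + fileName), ONE);
        }
        w1 = w2;
        w2 = w3;
      }
    }
  }

  // Reduce: add up the counts of each three-word sequence per document.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text sequence, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(sequence, new IntWritable(sum));
    }
  }
}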

 

 

11. Ranked-Inverted-Index

takes lists of words and their frequencies per document and, for each word sequence, generates a list of the documents containing it in decreasing order of frequency. Map takes the Sequence-Count benchmark’s output <word-sequence|filename, n> as its input and separates the count from the rest of the data; the map output format is <word-sequence, {filename, n}>. Reduce takes all map outputs and, per word-sequence, produces a list in decreasing order of occurrence in the respective documents: <word-sequence><{count1, file1},{count2, file2}, …>.

 

Input format: <word-sequence|filename><count>

Output format: <word-sequence> <count | file>

Dataset: Output of Sequence-Count

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar rankedinvertedindex -m <num-maps> -r <num-reduces> <input-dir> <output-dir>
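A sketch of the Ranked-Inverted-Index logic, assuming the tab-separated text output produced by the Sequence-Count job above (job driver omitted):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RankedInvertedIndexSketch {
  // Map: split a Sequence-Count output line "word-sequence|filename<TAB>count"
  // into key = word-sequence and value = "count,filename".
  public static class SplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] keyAndCount = line.toString().split("\t");
      if (keyAndCount.length < 2) return;
      int cut = keyAndCount[0].lastIndexOf('|');
      if (cut < 0) return;
      String sequence = keyAndCount[0].substring(0, cut);
      String fileName = keyAndCount[0].substring(cut + 1);
      context.write(new Text(sequence), new Text(keyAndCount[1] + "," + fileName));
    }
  }

  // Reduce: sort the (count, file) pairs for a word-sequence by decreasing count
  // and emit <word-sequence, {count1, file1},{count2, file2}, ...>.
  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text sequence, Iterable<Text> countFiles, Context context)
        throws IOException, InterruptedException {
      List<String[]> pairs = new ArrayList<>();
      for (Text cf : countFiles) pairs.add(cf.toString().split(",", 2));
      pairs.sort((a, b) -> Long.compare(Long.parseLong(b[0]), Long.parseLong(a[0])));
      StringBuilder out = new StringBuilder();
      for (String[] p : pairs) out.append("{").append(p[0]).append(", ").append(p[1]).append("},");
      if (out.length() > 0) out.setLength(out.length() - 1);   // drop trailing comma
      context.write(sequence, new Text(out.toString()));
    }
  }
}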

 

 

12. Tera-Sort

sorts 100-byte <key,value> tuples on the keys, where the key is a 10-byte field and the remaining 90 bytes form the value (payload). Map is an identity function that simply reads and emits the tuples, and Reduce emits the sorted data to the final output. The sorting itself happens in MapReduce’s built-in sort; the reduce tasks simply emit the sorted tuples.

 

Input format: {10-bytes}{90-bytes} 

Output format: <10-bytes><90-bytes>

Dataset: Generated through TeraGen in Hadoop. Here is a sample 3GB dataset.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar terasort <input-dir> <output-dir> <num-reduces>
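For illustration only, the user-written sides of Tera-Sort reduce to identity functions; the sketch below treats records as text lines and omits the sampling partitioner that the shipped TeraSort uses to make reducer outputs globally ordered:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TeraSortSketch {
  // Map: split each 100-byte record into a 10-byte key and 90-byte payload and pass it
  // through unchanged; all sorting is done by the framework between map and reduce.
  public static class IdentityKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      String r = record.toString();
      if (r.length() < 10) return;
      context.write(new Text(r.substring(0, 10)), new Text(r.substring(10)));
    }
  }

  // Reduce: emit the already-sorted tuples unchanged.
  public static class IdentityReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> payloads, Context context)
        throws IOException, InterruptedException {
      for (Text p : payloads) context.write(key, p);
    }
  }
}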

 

 

13. Grep

searches for a pattern in a file and is a generic search tool used in many data analyses. Each map task emits a <regex, 1> tuple for every match of the pattern in its input. Reduce adds up the counts and emits <regex, n> tuples.

 

Input format: any document (usually a web document in text/xml format)

Output format: <regex> <count>

Dataset: Same as for Word-Count.

Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar grep <input-dir> <output-dir> <num-reduces> <regex> [<group>]
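A simplified sketch of the Grep logic; the pattern would normally come from the job configuration (hard-coded here), and the bundled Hadoop example additionally runs a second job to sort the results by count, which is omitted:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class GrepSketch {
  // Map: emit <matched text, 1> for every match of the pattern in a line.
  public static class MatchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Pattern pattern = Pattern.compile("example-regex");  // normally read from the job config
    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      Matcher m = pattern.matcher(line.toString());
      while (m.find()) {
        context.write(new Text(m.group()), ONE);
      }
    }
  }

  // Reduce: add up the match counts and emit <match, n>.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text match, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int n = 0;
      for (IntWritable c : counts) n += c.get();
      context.write(match, new IntWritable(n));
    }
  }
}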

 

 

Download benchmarks and datasets