How to skip the mapper function in Hadoop

In Hadoop, I need to skip the mapper function and execute the reducer function directly.

We are doing this to improve Hadoop performance. If the Hadoop framework is used to analyze the same data sets, the mappers’ output will be the same across different kinds of jobs. To avoid recomputing the same results, I am planning to run the mapper function once, store its output as a cache, and reuse it in future jobs by bypassing the mapper function and running the reducer function directly on the preprocessed mapper results.

Any help or pointers will be appreciated.

(sorry for bad english)

I feel your requirement does not fit the MapReduce model well.

If you must use MapReduce/HDFS, you may consider using multiple MapReduce jobs:

The first MapReduce job stores the shuffle results in HDFS, using a reduce function that simply outputs its input, so that they can be reused.

The other MapReduce jobs have only map tasks, which read as input the data that the first MapReduce job wrote to HDFS. You must carefully organize and partition the data in HDFS to simulate the reduce tasks’ semantics if your applications rely on those semantics.

It can work but is ugly.
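To make the two-job layout above concrete, here is a minimal in-memory sketch in plain Java. It does not use the Hadoop API: the class name `CachedShuffleSketch`, the "key value" record format, and the `sum`/`max` reducers are all hypothetical and chosen only for illustration. The grouped map stands in for the shuffle output that the first job would write to HDFS.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Function;

public class CachedShuffleSketch {

    // Counts map invocations so the reuse is visible.
    static int mapCalls = 0;

    // Stand-in for an expensive map function: parses "key value" into one pair.
    static Map.Entry<String, Integer> map(String record) {
        mapCalls++;
        String[] parts = record.trim().split("\\s+");
        return new SimpleEntry<>(parts[0], Integer.parseInt(parts[1]));
    }

    // "Job 1": map every record once and group the values by key, as the shuffle would.
    // In the real setup, this grouped data is what the pass-through reducer stores in HDFS.
    static SortedMap<String, List<Integer>> mapAndShuffle(List<String> records) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> kv = map(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    // "Later jobs": apply only the reduce logic to the cached groups; map() is never called.
    static SortedMap<String, Integer> reduceCached(
            SortedMap<String, List<Integer>> cached,
            Function<List<Integer>, Integer> reducer) {
        SortedMap<String, Integer> out = new TreeMap<>();
        cached.forEach((key, values) -> out.put(key, reducer.apply(values)));
        return out;
    }

    // Two example reduce functions that share the same cached input.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) total += v;
        return total;
    }

    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) best = Math.max(best, v);
        return best;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a 3", "b 5", "a 7");
        SortedMap<String, List<Integer>> cached = mapAndShuffle(records);

        System.out.println(reduceCached(cached, CachedShuffleSketch::sum)); // {a=10, b=5}
        System.out.println(reduceCached(cached, CachedShuffleSketch::max)); // {a=7, b=5}
        System.out.println("map calls: " + mapCalls);                       // map calls: 3
    }
}
```

In Hadoop itself, the pass-through reducer for the first job can be the base `org.apache.hadoop.mapreduce.Reducer` class, whose default `reduce` writes its input unchanged, and the later jobs can be made map-only with `job.setNumReduceTasks(0)`. The sketch keeps all values of a key in one group; on HDFS that property depends on how the first job's output is partitioned, which is the part that needs care.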

Another choice is to implement a MapReduce-like framework on Hadoop (YARN/Hadoop 2, not Hadoop v1) that can skip the map phase. You may take a look at the approaches of HaLoop and DryadInc.

Overall, my feeling is that it is better to use programming models/systems other than plain MapReduce for your workloads.

Answered by Eric Z Ma.

Hey Eric,
Appreciate your response, will look into other models as you suggested and update this post.

Regards,
Sanjeev


Hi Eric,

Can you please point me to some materials on how to do the same using “Hadoop 2”?

Regards,
Sanjeev


You may take a look at these pages:

YARN architecture: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Writing YARN applications: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Example applications include MapReduce in Hadoop 2 and Spark on YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Eric Z Ma

