In Hadoop, I need to skip the mapper function and execute the reducer function directly.
We are doing this to improve Hadoop performance: if the Hadoop framework is used to analyze the same data sets, the mappers' output will be the same for different kinds of jobs. To avoid recomputing the same results, I am planning to run the mapper function once, store its output as a cache, and reuse it in future jobs by bypassing the mapper function and running the reducer function directly on the preprocessed mapper results.
Any help or pointers would be appreciated.
(Sorry for my bad English.)
I feel your requirement does not fit the MapReduce model well.
If you must use MapReduce/HDFS, you may consider using multiple MapReduce jobs:
The first MapReduce job stores the shuffle results in HDFS (using a reduce function that simply outputs its input) so that they can be reused.
The other MapReduce jobs have only map tasks, which read the data that the first MapReduce job wrote to HDFS as their input. You must carefully organize and partition the data in HDFS to simulate the reduce tasks' semantics if your applications rely on those semantics.
It can work but is ugly.
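To make the two-stage idea concrete, here is a minimal sketch in plain Python that simulates it with local temporary files instead of HDFS. It is not Hadoop API code: the function names (`job1_map_and_shuffle`, `later_job`, `partition_of`), the JSON file layout, and the word-count map function are all assumptions made for illustration. In a real Hadoop setup, the first job would use an identity reducer and the later jobs would be configured as map-only (for example with `job.setNumReduceTasks(0)`). Note also that an identity reducer writes individual key/value pairs sorted by key, whereas this sketch writes one already-grouped record per key for brevity.

```python
import json
import os
import tempfile
import zlib
from collections import defaultdict


def map_words(line):
    """Hypothetical map function: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word.lower(), 1


def partition_of(key, num_partitions):
    # Stable hash, so a given key always lands in the same partition file.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def job1_map_and_shuffle(lines, map_fn, out_dir, num_partitions):
    """Job 1: run the map function once, group by key, and cache the groups."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for line in lines:
        for key, value in map_fn(line):
            partitions[partition_of(key, num_partitions)][key].append(value)
    for i, groups in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{i:05d}.json")
        with open(path, "w") as f:
            # One [key, values] record per line, sorted by key.
            for key in sorted(groups):
                f.write(json.dumps([key, groups[key]]) + "\n")


def later_job(cache_dir, reduce_fn):
    """Later jobs: a map-only pass that applies reduce logic to cached groups."""
    result = {}
    for name in sorted(os.listdir(cache_dir)):
        with open(os.path.join(cache_dir, name)) as f:
            for record in f:
                key, values = json.loads(record)
                result[key] = reduce_fn(key, values)
    return result


def run_demo():
    lines = ["a b a", "b c", "a"]
    with tempfile.TemporaryDirectory() as cache_dir:
        # The expensive map + shuffle step runs only once ...
        job1_map_and_shuffle(lines, map_words, cache_dir, num_partitions=2)
        # ... and two different "reduce" computations reuse its output.
        counts = later_job(cache_dir, lambda key, values: sum(values))
        repeated = later_job(cache_dir, lambda key, values: len(values) > 1)
    return counts, repeated
```

Because every key is assigned to exactly one partition file, each later job sees all values for a key together, which is the part of the reduce semantics that has to be preserved by hand in this scheme.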
Overall, my feeling is that it is better to use a programming model or system other than plain MapReduce for your workloads.
Appreciate your response, will look into other models as you suggested and update this post.
Can you please point me to some material on how to do the same using “Hadoop 2”?
You may take a look at these pages:
Writing YARN applications: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Examples of such applications include MapReduce in Hadoop 2 and Spark on YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html