How to skip the mapper function in Hadoop

By Eric Ma Mar 24, 2018 Mar 28, 2018

In Hadoop, I need to skip the mapper function and directly execute the reducer function.

We are doing this to improve Hadoop performance: if the Hadoop framework is used to analyze the same data sets, then the mapper’s output will be the same for different kinds of jobs. To avoid redundantly computing the same results, I am planning to run the mapper function once, store its output as a cache, and reuse it for future jobs by bypassing the mapper function and directly executing the reducer function on the preprocessed mapper results.

Any help or pointers would be appreciated.

(Sorry for my bad English.)

I feel your requirement does not fit into the MapReduce model well.

If you must use MapReduce/HDFS, you may consider using multiple MapReduce jobs:

The first MapReduce job runs the map function and stores the shuffle results in HDFS, using a reduce function that simply outputs its input, so that the results can be reused.

The other MapReduce jobs have only map tasks, which read the data generated by the first MapReduce job from HDFS as input. You must carefully organize and partition the data in HDFS to simulate the reduce tasks’ semantics if your applications rely on those semantics.

It can work but is ugly.
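To make the two-job idea concrete, here is a toy in-memory sketch in Python. It is not Hadoop code: the function names (`map_words`, `run_first_job`, `run_later_job`) are hypothetical, each dict stands in for one output file in HDFS, and the hash partitioning is only an assumption about how you might lay out the data so that all values of a key stay together.

```python
from collections import defaultdict
from zlib import crc32


def map_words(line):
    """Hypothetical expensive map function: emit (word, 1) for each word."""
    for word in line.split():
        yield word.lower(), 1


def run_first_job(records, map_fn, num_partitions):
    """Job 1: map, shuffle, and an identity 'reduce' that stores its input.

    Returns one dict per partition, standing in for one stored output
    file per reduce task. All values of a key land in the same partition.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in map_fn(record):
            # Stable hash, so a key always maps to the same partition.
            index = crc32(key.encode("utf-8")) % num_partitions
            partitions[index][key].append(value)
    # Keep keys sorted within each partition, as a shuffle would.
    return [dict(sorted(p.items())) for p in partitions]


def run_later_job(cached_partitions, reduce_fn):
    """Later jobs: 'map-only' tasks, one per cached partition.

    Each task applies the reduce logic to already grouped (key, values) pairs.
    """
    output = {}
    for partition in cached_partitions:
        for key, values in partition.items():
            output[key] = reduce_fn(key, values)
    return output


lines = ["a b a", "b c", "a"]
cache = run_first_job(lines, map_words, num_partitions=2)  # map runs only once

# Two different "jobs" reuse the same cached shuffle output.
counts = run_later_job(cache, lambda key, values: sum(values))
seen_twice = run_later_job(cache, lambda key, values: len(values) > 1)
print(counts)
print(seen_twice)
```

The later jobs only see correct groups because the first job kept every value of a key in a single partition. That is the part you have to organize carefully when the data really lives in HDFS.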

Another choice is to implement a MapReduce-like framework on Hadoop (YARN/Hadoop 2, not Hadoop v1) that can skip the map phase. You may take a look at the approaches of HaLoop and DryadInc.

Overall, my feeling is that it is better to use programming models/systems other than plain MapReduce for your workloads.

Hey Eric,
Appreciate your response, will look into other models as you suggested and update this post.

Regards,
Sanjeev

Hi Eric,

Can you please point me to some materials on how to do the same using “Hadoop 2”?

Regards,
Sanjeev

You may take a look at these pages:

YARN architecture: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Writing YARN applications: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Examples of such applications include MapReduce in Hadoop 2 and Spark on YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
