How to skip the mapper function in Hadoop

By Eric Ma | Mar 24, 2018 | Mar 28, 2018

In Hadoop, I need to skip the mapper function and directly execute the reducer function.

We are doing this to improve Hadoop performance. If the Hadoop framework is used to analyze the same data sets, the mapper's output will be the same for different kinds of jobs. To avoid recomputing the same results, I am planning to run the mapper function once, store its output as a cache, and reuse it in future jobs by bypassing the mapper function and directly executing the reducer function on the preprocessed mapper results.

Any help or pointer will be appreciated.

(Sorry for my bad English.)

I feel your requirement does not fit into the MapReduce model well.

If you must use MapReduce/HDFS, you may consider using multiple MapReduce jobs:

The first MapReduce job stores the shuffle results in HDFS, using a reduce function that simply outputs its input, so that they can be reused.

The other MapReduce jobs have only map tasks, which read the data generated by the first MapReduce job from HDFS as their input. You must carefully organize and partition the data in HDFS to simulate the reduce tasks' semantics if your applications rely on those semantics.

It can work but is ugly.
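The two-job idea above can be sketched as a plain Python simulation of the data flow. This is not Hadoop code: `expensive_map`, `partition_of`, `job1_build_cache`, `run_cached_job`, and `NUM_PARTITIONS` are hypothetical names chosen for illustration, and the word-count mapper is only a stand-in for whatever costly mapper you actually have.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 2  # hypothetical number of cached "reduce" partitions


def partition_of(key, num_partitions=NUM_PARTITIONS):
    # Stable hash so that a given key always lands in the same partition.
    return crc32(key.encode("utf-8")) % num_partitions


def expensive_map(record):
    # Stand-in for the costly mapper: emit (word, 1) for each word.
    for word in record.split():
        yield word.lower(), 1


def job1_build_cache(records, num_partitions=NUM_PARTITIONS):
    """First job: map, shuffle by key, then an identity reduce.

    Returns one sorted list of (key, values) per partition; this plays
    the role of the files the first job would leave in HDFS.
    """
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for record in records:
        for key, value in expensive_map(record):
            partitions[partition_of(key, num_partitions)][key].append(value)
    # Identity reduce: keep the grouped values unchanged, sorted by key.
    return [sorted(p.items()) for p in partitions]


def run_cached_job(cache, reduce_fn):
    """Later job: a map-only pass that applies reduce logic per cached group."""
    output = {}
    for partition in cache:
        for key, values in partition:
            output[key] = reduce_fn(key, values)
    return output


if __name__ == "__main__":
    cache = job1_build_cache(["a good boy", "a good day"])
    # Two different "jobs" reuse the same cached map output.
    print(run_cached_job(cache, lambda key, values: sum(values)))
    print(run_cached_job(cache, lambda key, values: len(values) > 1))
```

In a real deployment the later jobs would typically be configured as map-only (for example with `job.setNumReduceTasks(0)`), and the cached files would need to be laid out so that all values for a key are seen by the same map task. That layout requirement is the "ugly" part this sketch hides.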

Another choice is to implement a MapReduce-like framework on Hadoop (YARN/Hadoop 2, not Hadoop v1) that can skip the map phase. You may take a look at the approaches of HaLoop and DryadInc.

Overall, my feeling is that it is better to use programming models/systems other than plain MapReduce for your workloads.

Hey Eric,
Appreciate your response, will look into other models as you suggested and update this post.

Regards,
Sanjeev

Hi Eric,

Can you please give me some materials about how to do the same using "Hadoop 2"?

Regards,
Sanjeev

You may take a look at these pages:

YARN architecture: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html

Writing YARN applications: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html

Examples of YARN applications include MapReduce in Hadoop 2 and Spark on YARN: https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
