How to choose the number of mappers and reducers in Hadoop to get good job performance?
The Hadoop Wiki gives a discussion on this: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
Some valuable points:
About the number of Maps:
The number of maps is usually driven by the number of DFS blocks in
the input files. Although that causes people to adjust their DFS block
size to adjust the number of maps. The right level of parallelism for
maps seems to be around 10-100 maps/node, although we have taken it up
to 300 or so for very cpu-light map tasks. Task setup takes awhile, so
it is best if the maps take at least a minute to execute.
About the mumber of Reduces:
The right number of reduces seems to be 0.95 or 1.75 (nodes
mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can
launch immediately and start transfering map outputs as the maps
finish. At 1.75 the faster nodes will finish their first round of
reduces and launch a second round of reduces doing a much better job
of load balancing.