Caching Mapper Output in Hadoop: Strategies for Reusing Intermediate Results
The core problem here is legitimate: if you’re running multiple jobs on the same dataset, and the mapper phase produces identical intermediate results each time, recomputing those results is wasteful. However, skipping the mapper phase entirely breaks MapReduce’s processing model. There are better approaches.

Why You Can’t Just Skip the Mapper

MapReduce assumes data flows through map…
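The reuse idea itself is sound: materialize the mapper’s output once, then let later runs read the stored pairs instead of recomputing them. In Hadoop terms this is typically done by running a map-only job and pointing subsequent jobs at its output with an identity mapper. A minimal language-agnostic sketch of the pattern, with a hypothetical word-count mapper and a pickle file standing in for the materialized job output:

```python
import os
import pickle


def map_phase(records):
    # Hypothetical mapper: emit (word, 1) pairs, word-count style.
    for line in records:
        for word in line.split():
            yield (word, 1)


def cached_map(records, cache_path):
    """Materialize map output to disk once; reuse it on later runs.

    Mirrors the Hadoop pattern of running a map-only job and having
    later jobs consume its stored output via an identity mapper,
    rather than re-running the mapper over the raw input.
    """
    if os.path.exists(cache_path):
        # Cache hit: skip the map computation entirely.
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    pairs = list(map_phase(records))
    with open(cache_path, "wb") as f:
        pickle.dump(pairs, f)
    return pairs
```

The second call with the same `cache_path` returns the stored pairs without touching the raw input, which is exactly the saving the caching strategy is after. In real Hadoop the cache would be a SequenceFile directory on HDFS rather than a local pickle.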
