MapReduce
Definition
- MapReduce is a data processing model designed for handling large-scale datasets. It emerged as a response to the challenge of processing increasingly large datasets that could not be managed effectively through traditional vertical scaling.
- In the early 2000s, Google engineers recognized the limitations of existing data processing methods due to the rapidly growing size of datasets. This realization led to the development of a new approach.
- Jeffrey Dean and Sanjay Ghemawat authored a seminal 2004 paper titled “MapReduce: Simplified Data Processing on Large Clusters.”
- This paper introduced the MapReduce model to the world and detailed how it could efficiently process large amounts of data using a distributed and parallel architecture. The paper became a foundational text in the field of big data processing and inspired numerous implementations and adaptations in various big data technologies.
- A MapReduce job consists of three main steps (a minimal sketch follows this list):
- Map: The map function accesses various chunks of the dataset in a distributed system, transforming them into intermediate key-value pairs.
- Shuffle: Reorganizes the intermediate key-value pairs so that pairs with the same key are routed to the same machine for the final step.
- Reduce: The reduce function operates on the newly shuffled key-value pairs, transforming them into more meaningful data (e.g., aggregating all values that share a key).
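To make the three phases concrete, here is a minimal, single-machine sketch of a word count expressed as map, shuffle, and reduce steps. The function names (map_fn, shuffle, reduce_fn) are illustrative only and not tied to any particular framework; a real cluster would run the map and reduce calls in parallel on different machines.

```python
from collections import defaultdict

def map_fn(chunk):
    """Map: turn a chunk of raw text into intermediate (word, 1) pairs."""
    for word in chunk.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate pairs so all values for a key end up together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: collapse all values for a key into a single result."""
    return (key, sum(values))

if __name__ == "__main__":
    # Each chunk stands in for a split of the dataset stored on a different machine.
    chunks = ["the quick brown fox", "the lazy dog", "the fox jumps"]

    intermediate = [pair for chunk in chunks for pair in map_fn(chunk)]
    grouped = shuffle(intermediate)
    results = [reduce_fn(key, values) for key, values in grouped.items()]
    print(sorted(results))
```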
Distributed File System (DFS)
- A DFS is an abstraction over a (usually large) cluster of machines that allows them to act like one large file system.
- A DFS provides distributed storage, replication and high availability, scalability, transparency, reliability, and fault tolerance.
- Two famous DFS implementations are the Google File System (GFS) and HDFS (Hadoop Distributed File System).
Hadoop
- A popular, open-source framework that supports MapReduce jobs and many other kinds of data-processing pipelines. Its central component is HDFS (Hadoop Distributed File System), on top of which other technologies have been developed.
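As a rough illustration of how such a job can be written for Hadoop, its Streaming interface lets the map and reduce steps be ordinary programs that read lines from stdin and write tab-separated key-value pairs to stdout, with the framework sorting mapper output by key before it reaches the reducer. The sketch below assumes a word-count job; the script name and command-line flag are hypothetical.

```python
import sys

def mapper():
    # Emit one "word<TAB>1" line per word read from standard input.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Input lines arrive sorted by key, so counts for one word are contiguous.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

if __name__ == "__main__":
    # Hypothetical usage: `python wordcount.py map` or `python wordcount.py reduce`.
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if role == "map" else reducer()
```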