MapReduce


MapReduce is made up of two major phases: map and reduce. By writing mappers and reducers, we can analyse large quantities of data.

The first stage is the mapper.

What is a Mapper?

The raw input is fed to the mapper line by line to be broken down into smaller pieces. The primary purpose of a mapper is to organise similar data under a common key: the mapper function is run once for each input line, and its task is to assign a key and a value to that line. After all the input lines have been processed, values that share the same key are grouped together into a list.

The mapper itself emits (key, value) pairs; after the grouping step, each unique key is associated with the list of all values emitted for it: (key, list(values)).
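
As a minimal sketch, not tied to any particular MapReduce framework, a mapper in Python might look like the following. The comma-separated "city,temperature" input format is a made-up example:

    def mapper(line):
        # Hypothetical input format: "city,temperature", e.g. "oslo,-3.0".
        city, temp = line.strip().split(",")
        # Key by city so that all temperatures for one city end up
        # grouped together under the same key.
        yield city, float(temp)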

This grouped output is then fed to the reducer, the second stage.

What is a Reducer?

The reducer is run once for every unique key in the mapper output, and its primary function is to operate on the list of values associated with that key. The output of a reducer is again a key-value pair, whose exact contents are up to the user.
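
Continuing the hypothetical city/temperature example from above, a reducer that picks the maximum temperature recorded per city might look like this:

    def reducer(key, values):
        # `values` is the list of all temperatures emitted for this city.
        # Any aggregation works here; taking the maximum is just one choice.
        return key, max(values)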

So why MapReduce?

The biggest advantage of MapReduce is that its operations can run concurrently across a distributed architecture: many lines of input can be processed by mappers at the same time, and once the map phase has completed, several reducers can run concurrently on the grouped output.
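
As a rough illustration of this parallelism, here is a sketch that simulates the whole map, group and reduce flow in a single Python program, running the mappers concurrently in a pool of worker processes. It reuses the hypothetical city/temperature mapper and reducer from above (the mapper returns a list rather than yielding, since generators cannot cross process boundaries):

    from collections import defaultdict
    from multiprocessing import Pool

    def mapper(line):
        # Hypothetical "city,temperature" records, as above.
        city, temp = line.strip().split(",")
        return [(city, float(temp))]

    def reducer(key, values):
        return key, max(values)

    def run(lines):
        # Map phase: every line is independent, so a pool of worker
        # processes can handle many lines at the same time.
        with Pool() as pool:
            mapped = pool.map(mapper, lines)

        # Grouping phase: collect the emitted values by key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Reduce phase: one reducer call per unique key. These calls are
        # independent too and could also run concurrently; a plain loop
        # keeps the sketch short.
        return [reducer(key, values) for key, values in groups.items()]

    if __name__ == "__main__":
        data = ["london,11.2", "oslo,-3.0", "london,14.8", "oslo,1.5"]
        print(run(data))  # [('london', 14.8), ('oslo', 1.5)]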

Let's check out an example to understand this better: a word counter.