Simplified MapReduce
Feb 12, 2014
Here is an introduction to what MapReduce is:
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
From: http://en.wikipedia.org/wiki/MapReduce
MapReduce is mainly used to gather information from massive data sets faster and more easily. The algorithm consists of two main functions, map and reduce. The map function collects data from the inputs: at this step, it breaks the input into smaller pieces. The reduce function then aggregates the map function’s results into a single result.
The reduce function is always performed after the map function.
To make the process easier to understand, I’d like to give an example. Suppose we have a news website where each news item is an entity in our database. Each news item has an array of keywords that describe it. The following is a sample news item:
{
  title: 'Hello world!',
  description: 'Hello world! This is the first post from our awesome news portal; we will publish more news here. Thanks.',
  keywords: [{
    word: 'hello',
    count: 1
  }, {
    word: 'world',
    count: 1
  }, {
    word: 'news',
    count: 2
  }, {
    word: 'post',
    count: 1
  }]
}
So, what do we want to do? We have a lot of news items, each with an array of keywords inside. We are going to determine the most popular keywords across all news items.
First of all, the map function breaks each news item into smaller pieces. In practice, inside the map function we emit each keyword together with its count. The emit function pushes a value into a temporary collection of key-value pairs, where each key holds an array of values; this array is later used by the reduce function to generate a single value.
The following is an example of the map function’s source code:
function () {
  // `this` is the news item being processed
  this.keywords.forEach(function (doc) {
    // emit one (word, count) pair per keyword
    emit(doc.word, doc.count);
  });
}
To understand the map function better, here is the output it produces. When the word “hello” appears in two news items, with counts of 1 and 3, the output will be:
{ "hello": [1, 3] }
And when the word “post” appears in only one news item, with a count of 2, the output will be:
{ "post": [2] }
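To see this grouping happen outside a database, here is a minimal sketch in plain JavaScript. The emit function and the two news items below are my own stand-ins for illustration, not part of any framework; a real MapReduce framework supplies emit for you and may store the grouped values differently.

```javascript
// The map function from the article: `this` is one news item.
function map() {
  this.keywords.forEach(function (doc) {
    emit(doc.word, doc.count);
  });
}

// Hypothetical stand-in for the emit function a framework would provide:
// it appends each value to the array stored under its key.
const grouped = {};
function emit(key, value) {
  if (!Object.prototype.hasOwnProperty.call(grouped, key)) {
    grouped[key] = [];
  }
  grouped[key].push(value);
}

// Two made-up news items, trimmed to the one field the map function reads.
const newsItems = [
  { keywords: [{ word: 'hello', count: 1 }] },
  { keywords: [{ word: 'hello', count: 3 }, { word: 'post', count: 2 }] },
];

// Map phase: call map once per item, with the item bound to `this`.
newsItems.forEach(function (item) {
  map.call(item);
});

console.log(grouped); // { hello: [ 1, 3 ], post: [ 2 ] }
```

Binding each item with map.call(item) imitates the way the map function above reads its input through `this` rather than through a parameter.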
Then we have the reduce function. Inside the reduce function, we combine the map function’s results into a single value: a keyword with its total count across all news items.
The following is the reduce function’s source code:
function (key, values) {
  // add up all the counts emitted for this keyword
  return Array.sum(values);
}
The following is the output of the reduce function for the first map output:
{ id: "hello", value: 4 }
And for the second map output, the result will be:
{ id: "post", value: 2 }
After the reduce function has run, we will have a set of keywords with their total counts across all news items, that is, the array of popular keywords.
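The reduce step can also be tried in plain JavaScript. This is only a sketch under a few assumptions: Array.sum is a helper that MongoDB's JavaScript environment provides rather than standard JavaScript, so the sum is written with Array.prototype.reduce instead, and the mapOutput and popular names, as well as the final sort by total, are my own additions to show how a "most popular first" list could be produced.

```javascript
// Grouped map output, copied from the two examples above.
const mapOutput = { hello: [1, 3], post: [2] };

// The article's reduce function, with the sum written in standard JavaScript.
function reduce(key, values) {
  return values.reduce(function (total, count) {
    return total + count;
  }, 0);
}

// Reduce phase: apply reduce once per key, then sort by total, highest first.
const popular = Object.keys(mapOutput)
  .map(function (key) {
    return { id: key, value: reduce(key, mapOutput[key]) };
  })
  .sort(function (a, b) {
    return b.value - a.value;
  });

console.log(popular);
// [ { id: 'hello', value: 4 }, { id: 'post', value: 2 } ]
```

The key argument is not used inside reduce here, but it is kept so that the signature matches the reduce function shown earlier.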
Of course, the explanation above was only a brief look at the MapReduce algorithm. There are many MapReduce frameworks, and you can find them in NoSQL databases, MongoDB for instance.