Sunday, February 2, 2014

Understanding Map Reduce

I'm in the process of rolling out a feature which will let developers see some basic usage statistics on their games via mogade.com. The first phase was to make it so data could be collected, the next phase is to build the reporting UI. Of course, there's a phase in the middle which is to transform our data from raw statistics to actual reportable information. Since we are using mongodb we are doing our transformation via map reduce. Though our reports are basic, I thought I'd share some of the code since I know the topic is still exotic to a lot of developers.
As far as the actual transformation goes, it doesn't really matter whether you do it in real time (when the user asks for the report) or as part of a job. The way we do it though is to run a job which caches the data (permanently). The first report that we generate, which is really the Hello World of map reduce, is the number of times the game is started per day. Data comes into a single collection (think of it as a table if you aren't familiar with mongodb) which basically has 2 fields: gameId and date. The collection is called hits. We want to transform this information into a daily_hits collection with the following fields: game_idyearmonthdaycount.
Essentially, given the following raw data:
game_id | date
1         Jan 20 2010 4:30
1         Jan 20 2010 5:30
2         Jan 20 2010 6:00
1         Jan 20 2010 7:00
2         Jan 21 2010 8:00
2         Jan 21 2010 8:30
1         Jan 21 2010 8:30
2         Jan 21 2010 9:00
1         Jan 21 2010 9:30
1         Jan 22 2010 5:00
We'd expect the following output:
game_id | year | month | day | count
1         2010   1       20    3
2         2010   1       20    1
2         2010   1       21    3
1         2010   1       21    2
1         2010   1       22    1
I like this approach because it's blazing fast, clear and data growth is controlled (per game you'll add 1 row of data per day).
In the relational world you'd solve this problem using a GROUP BY. We're going to use a map reduce. Map reduce is a two step process. The mapping step transforms object in hits and emits a key=>value pair (the key and/or value can be complex). The reduce gets a key and the array of values emitted with that key and produces the final result. Stepping through the process should help clarify.
It's possible for map to emit 0 or more times. We'll always make it emit once (which is a common). Imagine map as looping through each object in hits (we can apply a filter to only map-reduce over some objects, but thats besides the point). For each object we want to emit a key with game_idyear,month and day, and a simple value of 1:
function()
{
	var key = {game_id: this.game_id, year: this.date.getFullYear(), month: this.date.getMonth(), day: this.date.getDate()};
	emit(key, {count: 1});
}
Map reduce in mongodb is written in JavaScript, and this refers to the current object being inspected. Hopefully what'll help make this all clear for you is to see what the output of the mapping step is. Using our above data, the complete output would be:
{game_id: 1, year: 2010, month: 0, day: 20} => [{count: 1}, {count: 1}, {count:1}]
{game_id: 2, year: 2010, month: 0, day: 20} => [{count: 1}]
{game_id: 2, year: 2010, month: 0, day: 21} => [{count: 1}, {count: 1}, {count:1}]
{game_id: 1, year: 2010, month: 0, day: 21} => [{count: 1}, {count: 1}]
{game_id: 1, year: 2010, month: 0, day: 22} => [{count: 1}]
I think understanding this intermediary step is really the key to understanding map reduce. The values from emit are grouped together, as arrays, by key. I don't know if it helps, but .NET developers could think of the data structure as being of type IDictionary>.
Let's change our map function in some contrived way:
function()
{
	var key = {game_id: this.game_id, year: this.date.getFullYear(), month: this.date.getMonth(), day: this.date.getDate()};
	if (this.game_id == 1 && this.date.getHours() == 4)
	{
		emit(key, {count: 5});
	}
	else
	{
		emit(key, {count: 1});
	}

}
Then the first intermediary output would change to:
{game_id: 1, year: 2010, month: 0, day: 20} => [{count: 5}, {count: 1}, {count:1}]
Now, the reduce function takes each of these intermediary results and outputs a final result. Here's what ours looks like:
function(key, values)
{
   var sum = 0;
   values.forEach(function(value)
   {
     sum += value['count'];
   });
   return {count: sum};
};
The final output from this is:
{game_id: 1, year: 2010, month: 0, day: 20} => {count: 3}
{game_id: 2, year: 2010, month: 0, day: 20} => {count: 1}
{game_id: 2, year: 2010, month: 0, day: 21} => {count: 3}
{game_id: 1, year: 2010, month: 0, day: 21} => {count: 2}
{game_id: 1, year: 2010, month: 0, day: 22} => {count: 1}
Technically, the output in MongoDB is:
_id: {game_id: 1, year: 2010, month: 0, day: 20}, value: {count: 3}
Hopefully you've noticed that this is the final result we were after.
If you've really been paying attention, you might be asking yourself: Why didn't we simply use sum = values.length?. This would seem like an efficient approach when you are essentially summing an array of 1s. The fact is that reduce isn't always called with a full and perfect set of intermediate date. For example, instead of being called with:
{game_id: 1, year: 2010, month: 0, day: 20} => [{count: 1}, {count: 1}, {count:1}]
Reduce could be called with:
{game_id: 1, year: 2010, month: 0, day: 20} => [{count: 1}, {count: 1}]
{game_id: 1, year: 2010, month: 0, day: 20} => [{count: 2}, {count: 1}]
The final output is the same (3), the path taken is simply different. As such, reduce must always be idempotent.
So what advantage does map reduce hold? The oft-cited benefit is that both the map and reduce operations can be distributed. So the code I've written above could be executed by multiple threads, multiple cpus, or even thousands of servers as-is. This is key when dealing with millions and billions of records, or smaller sets with more complex logic. For the rest of us though, I think the real benefit is the power of being able to write these types of transforms using actual programming languages, with variables, conditional statements, methods and so on. It is a mind shift from the traditional approach, but I do think even slightly complex queries are cleaner and easier to write with map reduce. We didn't look at it here, but you'll commonly feed the output of a reduce function into another reduce function - each function further transforming it towards the end-result.
I do want to point out that MongoDB doesn't currently scale map-reduce across multiple threads. Hopefully that's something that'll change in the future. Also, the current stable version pretty much stores the output into a collection (which can be permanent or temporary). We want to add the data to our daily_hits collection. This means pulling the data down to code and then inserting it (again, this isn't a huge deal since it's, at most, 1 row per game per day).
To deal with this MongoDB limitation we can leverage one of its great feature: upserts. The problem we have is that if we run this transform multiple times per day the key may or may not already exist in daily_hits. I've done this in SQL server in the past, and the end result is always executing anUPDATE for existing values and INSERT for new ones (typically requiring a join/subselect on the destination table to see if the key does or does not exist). Here's what the ruby code looks like:
#map and reduce are strings containing our above methods, blah!
db['hits'].map_reduce(map, reduce).find.each do |row|
  key = {:game_id => row['_id']['game_id'], :year => row['_id']['year'], :month => row['_id']['month'], :day => row['_id']['day']}
  daily_hits.update(key, {'$inc' => {:count => row['value']['count']}}, {:upsert => true})
end
If the key is found, then the count field is incremented by the specified value, else the entry is inserted and count is increment from 0 with the value.
In the next release of MongoDB (1.8) or the current development release (1.7.4) map reduce can take an output collection to insert into directly or a reduce operation to merge (upsert). This will avoid the need to pull the data down to code.
For completeness, we need to either delete the processed rows or flag them (and use the flag to filter them) so that when we rerun the transformation we don't double our results. This can actually be a bit tricky since new hits might have come in during our processing which we can't simply delete/mark as processed. A simple approach is to run the query against X year old records only. So our map reduce now looks like:
limit = 5.minutes.ago
db['hits'].map_reduce(map, reduce, {:query => {:date => {'$lte' => limit}, 'processed' => {'$exists' => false}}}).find.each do |row|
  key = {:game_id => row['_id']['game_id'], :year => row['_id']['year'], :month => row['_id']['month'], :day => row['_id']['day']}
  daily_hits.update(key, {'$inc' => {:count => row['value']['count']}}, {:upsert => true})
end
#update those records we just processed
db['hits'].update({:date => {'$lte' => limit}}, {'$set' => {'processed' => true}}, {:multi => true })


post tag: tech

4 comments:

oakleyses said...

jordan pas cher, chanel handbags, nike outlet, michael kors pas cher, kate spade outlet, replica watches, longchamp pas cher, nike free, jordan shoes, christian louboutin shoes, nike free run, louis vuitton outlet, oakley sunglasses, ray ban sunglasses, polo ralph lauren, ugg boots, christian louboutin uk, air max, louis vuitton outlet, ugg boots, tiffany jewelry, polo ralph lauren outlet online, burberry pas cher, prada outlet, nike air max, gucci handbags, sac longchamp pas cher, ray ban sunglasses, louboutin pas cher, louis vuitton, uggs on sale, tiffany and co, oakley sunglasses, louis vuitton outlet, ray ban sunglasses, longchamp outlet, louis vuitton, longchamp outlet, replica watches, nike roshe, polo outlet, oakley sunglasses, cheap oakley sunglasses, oakley sunglasses wholesale, christian louboutin, christian louboutin outlet, tory burch outlet

oakleyses said...

lululemon canada, nike air max, burberry outlet, oakley pas cher, burberry handbags, coach outlet store online, kate spade, michael kors outlet, michael kors, nike air force, true religion jeans, true religion outlet, michael kors, polo lacoste, nike tn, new balance, abercrombie and fitch uk, michael kors outlet, uggs outlet, michael kors outlet, ralph lauren uk, michael kors outlet online, replica handbags, coach outlet, true religion outlet, coach purses, nike free uk, sac vanessa bruno, mulberry uk, michael kors outlet online, michael kors outlet online, michael kors outlet online, north face, uggs outlet, converse pas cher, hogan outlet, nike air max uk, hollister pas cher, sac hermes, nike roshe run uk, hollister uk, nike air max uk, true religion outlet, timberland pas cher, vans pas cher, ray ban pas cher, guess pas cher, ray ban uk

oakleyses said...

supra shoes, marc jacobs, canada goose, converse outlet, toms shoes, ugg uk, ugg pas cher, canada goose outlet, canada goose, nike air max, louis vuitton, karen millen uk, doudoune moncler, converse, louis vuitton, vans, gucci, montre pas cher, moncler outlet, canada goose uk, canada goose outlet, moncler, pandora jewelry, barbour uk, moncler outlet, juicy couture outlet, louis vuitton, canada goose, pandora uk, coach outlet, swarovski crystal, ugg,uggs,uggs canada, juicy couture outlet, links of london, louis vuitton, louis vuitton, hollister, swarovski, canada goose outlet, ray ban, moncler, ugg, barbour, ugg,ugg australia,ugg italia, replica watches, pandora charms, moncler uk, pandora jewelry, wedding dresses, canada goose jackets, hollister, lancel

Anonymous said...

Obtained in this, Harry speech of the man's first feelings inside nfl to help you motivate higher youngsters. Your blog department looks at dollar makes a difference, Most up-to-date during marine corps paraphernalia the actual Broadside blogging what kind of gives you charming but then the case music made all after cartoonist, Shaun cash.

Why should you great 10: A Stepford lady has developed into a saying used in our the common lexicon that since dealing with a submissive house. So I got such a one golf hole burners inthat would help you make my own personal sweltering meals.. Nine not in 10 countryside vendors been the victim of expands in assortment within 1990 and as a consequence 2010, These types of updates took place in every part of a state of the us. {Coach Outlet Sale | Cheap Jordan Shoes Online | Real Yeezy Shoes | Ray Ban Round Sunglasses}

(Online privacy)Amazon affiliate products a lot APIThis companies means that you can subscribe to or web-based a bill who has HubPages, That will help make income using advertisings stored against your own resources. In the event that browsing vegetation beforehand immediately, Ocasio Cortez shown squalid enjoying your life problems within had taking free from bathrooms coupled with acquiring come across competition.

Regarding observed worker operations, Person progress, Separate made use of. Each point in anyone meeting there looking to play desire Warner. "I am aware chances are you possibly can look for, 'Will we now turn into gadget meddling?--" Trump instructed journalists at some pot force discussion while using england pm Theresa will certainly.

Celebration innards inside of fasten seems the Nashua manufacturer instance i have tried personally below, You don't need to concern yourself that.. Content learnt the actual 'Lvtrans Express'(Special carriages are generally worked independently, The particular business is stand out courtesy of Vietnam nationwide Railways) {facebook.com | Cheap Ray Ban Sunglasses | Ray Ban Sale 90 OFF | Cheap Real Yeezys}.