Tuesday, November 5, 2013

GROUP BY AND SUM WITH MONGO

Deprecated: This post is now made obsolete by the very capable MongoDB Aggregation Framework. I love when a post becomes obsolete, but at least we have the past to look back on…
Original post:

As a dyed-in-the-wool SQL guy, it can be quite humbling to be suddenly stumbling around NoSQL town, snagged by what feels like the easiest query in the world.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
SELECT    period_id,
          account_id,
          product_id,
          customer_id,
          employee_id,
          vendor_id,
          SUM(amount_cents) AS amount
FROM      transactions
GROUP BY  period_id,
          account_id,
          product_id,
          customer_id,
          employee_id,
          vendor_id
SQL 101, right? The old GROUP BY and SUM yields a pretty simple aggregation. We are using Mongoid as an ORM for a Rails 3 app, and I was embarassingly stumped. The Mongoid documentation enables you to perform aggregations with filters right out of the box, soBand.where(:likes.gt => 100).sum(:likes) is simple enough.
But “complex” aggregations like grouping require MapReduce? (Complex to whom?!) Previously, I hadn’t had the time to play around with MapReduce, so I was hoping to avoid having to go down that road for something so simple. I am sure there are plenty of you out there who—like me—are/were just using Mongo for schema-less data storage who don’t dig into MapReduce. If so, this may be a helpful entre.

GROUPING BY 1 FIELD

MapReduce, by its nature, requires that you boil down your query to an array of hashes, which you recall are key-value pairs. Naturally, you can have multiple values per row, but you can’t have multiple keys, so grouping by 1 field is a bit easier.
See Kyle Banker’s excellent discussion of MapReduce basics for this one.
Map function:
1
2
3
4
5
// our key is author's username
// our value is the number of votes for the current comment
var map = function() {
  emit(this.author, {votes: this.votes});
};
Reduce function:
1
2
3
4
5
6
7
8
// add up all the votes for each key
var reduce = function(key, values) {
  var sum = 0;
  values.forEach(function(doc) {
    sum += doc.votes;
  });
  return {votes: sum};
};
Since you have an array of values with a single key, you can group by that key, sum any arbitrary amount field, and you’re done. As Kyle notes, if you’re doing this in ruby you then just need to parse the results.

GROUPING BY MULTIPLE FIELDS

The issue we ran into was that we needed to group by multiple fields. This is actually no problem, because the “key” in the map function can be a hash itself, so you can “pack the uniqueness” into a hash and then unpack it later.
Let’s take a look. A simplified version of our model looks something like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
class Transactions <<
  include Mongoid::Document

  field :amount_cents, type: Integer
  # [...many other fields providing granular detail...]

  belongs_to :period
  belongs_to :account
  belongs_to :product
  belongs_to :customer
  belongs_to :employee
  belongs_to :vendor
end
Presuming we want to aggregate these granular transactions across above the 6 dimensions and sum the amount_cents field, we would want to employ a summary class method that would work with filters/scopes applied. It could work something like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# Aggregates transactions into summary format
#
# Example
#
#   txns.create!(period: @period_1, amount_cents: 1)
#   txns.create!(period: @period_1, amount_cents: 5)
#   txns.create!(period: @period_2, amount_cents: 9)
#   txns.summary # => [
#     {period: @period_1, amount_cents: 6}
#     {period: @period_2, amount_cents: 9}
#   ]
#
# Returns Array of Hash values containing aggregated rows
def self.summary

  # Array to return once populated
  results = []

  # Group by and sum field names could be abstracted to parameters
  group_by_fields = [
    :period_id, :account_id, :product_id,
    :customer_id, :employee_id, :vendor_id
  ]
  sum_field = :amount_cents

  # "Pack" the unique group by fields into a hash
  hash_contents = group_by_fields.map { | group_by_field |
    "#{group_by_field.to_s}: this.#{group_by_field.to_s}"
  }.join(',')

  # Build the map function based on group by and sum fields
  map = "function(){ emit(
            {#{hash_contents}},
            {amount: this.#{sum_field.to_s}}
          );
        };"

  # Reduce function is constant
  reduce = "function(key, values){
    var sum = 0;
    values.forEach(
      function(doc){
        sum += doc.amount;
      }
    );
    return {amount: sum};
  };"

  # Execute MapReduce function on whatever set of Transactions is provided
  self.map_reduce(map, reduce).out(inline: true).to_a.map do |reduce|
    # Unpack the group by fields
    hash = {}
    group_by_fields.each do |group_by_field|
      hash[group_by_field] = reduce["_id"][group_by_field.to_s] if
        reduce["_id"][group_by_field.to_s]
    end
    # Include the aggregated value
    hash[sum_field] = reduce["value"]
    results << hash
  end

  # Return the Array of Hash values containing the aggregated rows
  results

end

NULL KEYS

It’s worth noting that null values in keys are simply not included in the hash, and so if the logic of your usage for the returned array requires these values either be present pointing to a null value or point to a default value, this logic should supplant the if reduce["_id"][group_by_field.to_s] clause above.
There it is. Group by 1 or more fields using Mongoid. Just pack the uniqueness into a hash, aggregate with MapReduce, then unpack the keys and return.
Hope this helps!

No comments: