Tuesday, November 5, 2013

Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused what use MapReduce is if you already have all the keys (you have to have searched for them, somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
  SELECT SUM(price)                         AS revenue,
         FROM_UNIXTIME(created_at, '%Y-%m') AS month,
    FROM sales
GROUP BY month, product_key
(Ordering not important right now).
share|improve this question
Ok, I get it, Riak isn't really made for handling lots of data. Many of the examples lead you to believe it is (traversing graphs of facebook friends, processing lots of tweets etc), but in reality Riak is just a key value store. It's MapReduce is effectively pointless, because for it to actually perform, you're working with such low volumes of data that you could do it client-side. I'll try other MapReduce databases for comparison, as maybe this is a general MapReduce problem. –  d11wtq Apr 14 at 12:35

2 Answers

Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
share|improve this answer
Yeah, I don't think the secondary index would help for my use case though... we need the full bucket, always, since we're trying to aggregate the data it contains. Secondary indexes help with finding documents, but as soon as you need the entire bucket, their value is lost. Either way, it seems like Riak with 500K records just plain sucks ;) MapReduce support is misleading really... before it becomes a feasible option, you could be processing the data client-side. –  d11wtq Apr 15 at 2:29 
You are correct, MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case. Use more than one bucket. Instead of just a Sales bucket you could break them down by product, region, or month so the data is already split by one of your common reporting criteria. Consider adding a secondary index to each document for each field. Your month query could then be a range query of the created_at index. If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know). You may also consider breaking each document a series of keys. Instead of just storing an id key with a json document for a value, store a key for each field like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from the disk and stored in RAM in order to process your MapReduce.
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using their solr-like search, but there are probably ways to restructure your data that it might be able to handle. Or perhaps this means that another data store would better fit your needs.
share|improve this answer

No comments: