Tuesday, November 19, 2013

Using DynamoDB with Amazon Elastic MapReduce

This article shows how to use EMR to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.

Details

Submitted By: AdamG@AWS
AWS Products Used: Amazon DynamoDB, Amazon Elastic MapReduce
Created On: February 15, 2012 4:57 AM GMT
Last Updated: September 26, 2013 12:23 AM GMT
Apache Hadoop and NoSQL databases are complementary technologies that together provide a powerful toolbox for managing, analyzing, and monetizing Big Data. That's why we were so excited to provide out-of-the-box Amazon Elastic MapReduce (Amazon EMR) integration with Amazon DynamoDB, giving customers an integrated solution that eliminates the often prohibitive costs of administration, maintenance, and upfront hardware. Customers can now move vast amounts of data into and out of DynamoDB, as well as perform sophisticated analytics on that data, using EMR's highly parallelized environment to distribute the work across the number of servers of their choice. Further, as EMR uses Hive, a SQL-based engine for Hadoop, you need only know basic SQL while we handle distributed application complexities such as estimating ideal data splits based on hash keys, pushing appropriate filters down to DynamoDB, and distributing tasks across all the instances in your EMR cluster.
In this article, we'll demonstrate how EMR can be used to efficiently export DynamoDB tables to S3, import S3 data into DynamoDB, and perform sophisticated queries across tables stored in both DynamoDB and other storage services such as S3.
We will also use sample product order data stored in S3 to demonstrate how you can keep current data in DynamoDB while storing older, less frequently accessed data in S3. By exporting your rarely used data to Amazon S3, you can reduce your storage costs while preserving the low-latency access required for your high-velocity data. Further, exported data in S3 is still directly queryable via EMR (and you can even join your exported tables with current DynamoDB tables).
The sample order data uses the schema below. This includes Order ID as its primary key, a Customer ID field, an Order Date stored as the number of seconds since epoch, and Total representing the total amount spent by the customer on that order. The data also has folder-based partitioning by both year and month, and you'll see why in a bit.
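To make that record layout concrete, here is a minimal Python sketch that parses one tab-delimited line in the format described above. The field values in the sample line are hypothetical, chosen only to illustrate the schema, not taken from the actual sample data set.

```python
# Parse one tab-delimited order record matching the schema described above.
# The sample line below uses hypothetical values for illustration only.
line = "o-1a2b3c\tc-9z8y7x\t1325548800\t129.99"

order_id, customer_id, order_date, total = line.split("\t")
record = {
    "order_id": order_id,
    "customer_id": customer_id,
    "order_date": int(order_date),  # seconds since the epoch
    "total": float(total),
}
print(record)
```

This is exactly the shape Hive will expect when we declare the S3 table with FIELDS TERMINATED BY '\t' below.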
Creating a DynamoDB Table
Let's create a DynamoDB table for the month of January 2012 named Orders-2012-01. We will specify Order ID as the primary key. Using a table for each month makes it much easier to export data and to delete tables once their data no longer requires low-latency access.
For this sample, a read capacity and a write capacity of 100 units each should be more than sufficient. When setting these values, keep in mind that the larger the EMR cluster, the more capacity it will be able to take advantage of. Further, you will be sharing this capacity with any other applications utilizing your DynamoDB table.
Launching an EMR Cluster
Please follow Steps 1-3 in the EMR for DynamoDB section of the Elastic MapReduce Developer Guide to launch an interactive EMR cluster and SSH to its Master Node to begin submitting SQL-based queries. Note that we recommend you use at least three instances of m1.large size for this sample.
At the hadoop command prompt for the current master node, type hive. You should see a hive prompt: hive>
As no other applications will be using our DynamoDB table, let's tell EMR to use 100% of the available read throughput (by default it will use 50%). Note that this can adversely affect the performance of other applications simultaneously using your DynamoDB table and should be set cautiously.
SET dynamodb.throughput.read.percent=1.0;
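As a quick sanity check on what this setting means, assuming the 100 read capacity units provisioned earlier:

```python
# Read capacity available to the Hive job = provisioned units * read percent.
provisioned_read_units = 100
default_share = provisioned_read_units * 0.5  # EMR default: 50%
full_share = provisioned_read_units * 1.0     # after SET dynamodb.throughput.read.percent=1.0
print(default_share, full_share)  # 50.0 100.0
```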
Creating Hive Tables
Outside data sources are referenced in your Hive cluster by creating an EXTERNAL TABLE. First, let's create an EXTERNAL TABLE for the exported order data in S3. Note that this simply creates a reference to the data; no data is moved yet.
CREATE EXTERNAL TABLE orders_s3_export ( order_id string, customer_id string, order_date int, total double )
 PARTITIONED BY (year string, month string)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY '\t'
 LOCATION 's3://elastic-mapreduce/samples/ddb-orders' ;
You can see that we specified the data location, the ordered data fields, and the folder-based partitioning scheme.
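With Hive's folder-based partitioning, each partition corresponds to a key=value subfolder under the table's LOCATION. A small Python sketch of the S3 prefix Hive would read for January 2012, assuming the standard Hive layout (the bucket path comes from the LOCATION above):

```python
# Hive maps each partition to a key=value subfolder under the table LOCATION.
base = "s3://elastic-mapreduce/samples/ddb-orders"

def partition_path(year, month):
    """S3 prefix for one (year, month) partition, standard Hive layout."""
    return f"{base}/year={year}/month={month}/"

print(partition_path("2012", "01"))
# s3://elastic-mapreduce/samples/ddb-orders/year=2012/month=01/
```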
Now let's create an EXTERNAL TABLE for our DynamoDB table.
CREATE EXTERNAL TABLE orders_ddb_2012_01 ( order_id string, customer_id string, order_date bigint, total double )
 STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
 TBLPROPERTIES (
 "dynamodb.table.name" = "Orders-2012-01",
 "dynamodb.column.mapping" = "order_id:Order ID,customer_id:Customer ID,order_date:Order Date,total:Total"
 );
This is a bit more complex. We need to specify the DynamoDB table name, the DynamoDB storage handler, the ordered fields, and a mapping between the EXTERNAL TABLE fields (which can't include spaces) and the actual DynamoDB fields.
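The mapping string pairs each Hive column name with its DynamoDB attribute name. A minimal Python sketch that parses it into a dict, using the exact mapping value from the table definition above:

```python
# Parse the dynamodb.column.mapping value: comma-separated "hive_col:DynamoDB Attr" pairs.
mapping = "order_id:Order ID,customer_id:Customer ID,order_date:Order Date,total:Total"

columns = dict(pair.split(":", 1) for pair in mapping.split(","))
print(columns)
# {'order_id': 'Order ID', 'customer_id': 'Customer ID',
#  'order_date': 'Order Date', 'total': 'Total'}
```

Note how the Hive-side names avoid the spaces that appear in the DynamoDB attribute names.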
Now we're ready to start moving some data!
Importing Data into DynamoDB
In order to access the data in our S3 EXTERNAL TABLE, we first need to specify which partitions we want in our working set via the ADD PARTITION command. Let's start with the data for January 2012.
ALTER TABLE orders_s3_export ADD PARTITION (year='2012', month='01') ;
Now if we query our S3 EXTERNAL TABLE, only this partition will be included in the results. Let's load all of the January 2012 order data into our external DynamoDB Table. Note that this may take several minutes.
INSERT OVERWRITE TABLE orders_ddb_2012_01
 SELECT order_id, customer_id, order_date, total
 FROM orders_s3_export ;
Looks a lot like standard SQL, doesn't it?
Querying Data in DynamoDB Using SQL
Now let's find the top 5 customers by spend over the first week of January. Note the use of unix_timestamp, as order_date is stored as the number of seconds since the epoch.
SELECT customer_id, sum(total) spend, count(*) order_count
 FROM orders_ddb_2012_01 
 WHERE order_date >= unix_timestamp('2012-01-01', 'yyyy-MM-dd')
 AND order_date < unix_timestamp('2012-01-08', 'yyyy-MM-dd')
 GROUP BY customer_id
 ORDER BY spend DESC
 LIMIT 5 ;
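To verify the date window, here is a Python sketch computing the same epoch-second boundaries. This assumes the dates are interpreted as UTC midnight; Hive's unix_timestamp actually uses the cluster's local timezone, so the exact values will shift if the cluster is not running in UTC.

```python
import calendar
import time

def epoch_seconds(date_str):
    """Seconds since the epoch for a yyyy-MM-dd date, taken as UTC midnight."""
    return calendar.timegm(time.strptime(date_str, "%Y-%m-%d"))

start = epoch_seconds("2012-01-01")
end = epoch_seconds("2012-01-08")
print(start, end)              # 1325376000 1325980800
print((end - start) // 86400)  # 7 days in the window
```

Because order_date is a plain integer of epoch seconds, the BETWEEN-style range comparison in the query above is simple integer arithmetic.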
Querying Exported Data in S3
It looks like customer 'c-2cC5fF1bB' was the biggest spender for that week. Now let's query our historical data in S3 to see what that customer spent in each of the final 6 months of 2011. First, though, we have to include the additional data in our working set; the RECOVER PARTITIONS command makes it easy to add every partition that exists under the table's S3 location.
ALTER TABLE orders_s3_export RECOVER PARTITIONS;
We will now query the 2011 exported data for customer 'c-2cC5fF1bB' from S3. Note that the partition fields, both month and year, can be used in your Hive query.
SELECT year, month, customer_id, sum(total) spend, count(*) order_count
 FROM orders_s3_export
 WHERE customer_id = 'c-2cC5fF1bB'
 AND month >= 6
 AND year = 2011
 GROUP BY customer_id, year, month
 ORDER BY month DESC;
Exporting Data to S3
Now let's export the January 2012 DynamoDB table data to a different S3 bucket owned by you. We'll first need to create an EXTERNAL TABLE for that S3 bucket. Note that we again partition the data by year and month.
CREATE EXTERNAL TABLE orders_s3_new_export ( order_id string, customer_id string, order_date int, total double )
 PARTITIONED BY (year string, month string)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 LOCATION 's3://';
Now export the data from DynamoDB to S3, specifying the appropriate partition values for that table's month and year.
INSERT OVERWRITE TABLE orders_s3_new_export
 PARTITION (year='2012', month='01')
 SELECT * from orders_ddb_2012_01;
Note that if this were the end of a month and you no longer needed low-latency access to that table's data, you could also delete the table in DynamoDB. You may also now want to terminate your job flow from the EMR console to ensure you do not continue being charged.
Please visit our documentation for more examples, including how to specify the format and compression scheme for your exported files.
