Tag Archives: AWS DMS

Amazon Redshift – 2017 Recap

Post Syndicated from Larry Heathcote original https://aws.amazon.com/blogs/big-data/amazon-redshift-2017-recap/

We have been busy adding new features and capabilities to Amazon Redshift, and we wanted to give you a glimpse of what we’ve been doing over the past year. In this article, we recap a few of our enhancements and provide a set of resources that you can use to learn more and get the most out of your Amazon Redshift implementation.

In 2017, we made more than 30 announcements about Amazon Redshift. We listened to you, our customers, and delivered Redshift Spectrum, a feature of Amazon Redshift, that gives you the ability to extend analytics to your data lake—without moving data. We launched new DC2 nodes, doubling performance at the same price. We also announced many new features that provide greater scalability, better performance, more automation, and easier ways to manage your analytics workloads.

To see a full list of our launches, visit our what’s new page—and be sure to subscribe to our RSS feed.

Major launches in 2017

Amazon Redshift Spectrumextend analytics to your data lake, without moving data

We launched Amazon Redshift Spectrum to give you the freedom to store data in Amazon S3, in open file formats, and have it available for analytics without the need to load it into your Amazon Redshift cluster. It enables you to easily join datasets across Redshift clusters and S3 to provide unique insights that you would not be able to obtain by querying independent data silos.

With Redshift Spectrum, you can run SQL queries against data in an Amazon S3 data lake as easily as you analyze data stored in Amazon Redshift. And you can do it without loading data or resizing the Amazon Redshift cluster based on growing data volumes. Redshift Spectrum separates compute and storage to meet workload demands for data size, concurrency, and performance. Redshift Spectrum scales processing across thousands of nodes, so results are fast, even with massive datasets and complex queries. You can query open file formats that you already use—such as Apache Avro, CSV, Grok, ORC, Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV—directly in Amazon S3, without any data movement.

For complex queries, Redshift Spectrum provided a 67 percent performance gain,” said Rafi Ton, CEO, NUVIAD. “Using the Parquet data format, Redshift Spectrum delivered an 80 percent performance improvement. For us, this was substantial.

To learn more about Redshift Spectrum, watch our AWS Summit session Intro to Amazon Redshift Spectrum: Now Query Exabytes of Data in S3, and read our announcement blog post Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data.

DC2 nodes—twice the performance of DC1 at the same price

We launched second-generation Dense Compute (DC2) nodes to provide low latency and high throughput for demanding data warehousing workloads. DC2 nodes feature powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks (SSDs). We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.

Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights,” said Bradley Todd, technology architect at Liberty Mutual. “We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1.”

Read our customer testimonials to see the performance gains our customers are experiencing with DC2 nodes. To learn more, read our blog post Amazon Redshift Dense Compute (DC2) Nodes Deliver Twice the Performance as DC1 at the Same Price.

Performance enhancements— 3x-5x faster queries

On average, our customers are seeing 3x to 5x performance gains for most of their critical workloads.

We introduced short query acceleration to speed up execution of queries such as reports, dashboards, and interactive analysis. Short query acceleration uses machine learning to predict the execution time of a query, and to move short running queries to an express short query queue for faster processing.

We launched results caching to deliver sub-second response times for queries that are repeated, such as dashboards, visualizations, and those from BI tools. Results caching has an added benefit of freeing up resources to improve the performance of all other queries.

We also introduced late materialization to reduce the amount of data scanned for queries with predicate filters by batching and factoring in the filtering of predicates before fetching data blocks in the next column. For example, if only 10 percent of the table rows satisfy the predicate filters, Amazon Redshift can potentially save 90 percent of the I/O for the remaining columns to improve query performance.

We launched query monitoring rules and pre-defined rule templates. These features make it easier for you to set metrics-based performance boundaries for workload management (WLM) queries, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.

Customer insights

Amazon Redshift and Redshift Spectrum serve customers across a variety of industries and sizes, from startups to large enterprises. Visit our customer page to see the success that customers are having with our recent enhancements. Learn how companies like Liberty Mutual Insurance saw a 9x reduction in month-end reporting time using DC2 nodes. On this page, you can find case studies, videos, and other content that show how our customers are using Amazon Redshift to drive innovation and business results.

In addition, check out these resources to learn about the success our customers are having building out a data warehouse and data lake integration solution with Amazon Redshift:

Partner solutions

You can enhance your Amazon Redshift data warehouse by working with industry-leading experts. Our AWS Partner Network (APN) Partners have certified their solutions to work with Amazon Redshift. They offer software, tools, integration, and consulting services to help you at every step. Visit our Amazon Redshift Partner page and choose an APN Partner. Or, use AWS Marketplace to find and immediately start using third-party software.

To see what our Partners are saying about Amazon Redshift Spectrum and our DC2 nodes mentioned earlier, read these blog posts:

Resources

Blog posts

Visit the AWS Big Data Blog for a list of all Amazon Redshift articles.

YouTube videos

GitHub

Our community of experts contribute on GitHub to provide tips and hints that can help you get the most out of your deployment. Visit GitHub frequently to get the latest technical guidance, code samples, administrative task automation utilities, the analyze & vacuum schema utility, and more.

Customer support

If you are evaluating or considering a proof of concept with Amazon Redshift, or you need assistance migrating your on-premises or other cloud-based data warehouse to Amazon Redshift, our team of product experts and solutions architects can help you with architecting, sizing, and optimizing your data warehouse. Contact us using this support request form, and let us know how we can assist you.

If you are an Amazon Redshift customer, we offer a no-cost health check program. Our team of database engineers and solutions architects give you recommendations for optimizing Amazon Redshift and Amazon Redshift Spectrum for your specific workloads. To learn more, email us at [email protected].

If you have any questions, email us at [email protected].

 


Additional Reading

If you found this post useful, be sure to check out Amazon Redshift Spectrum – Exabyte-Scale In-Place Queries of S3 Data, Using Amazon Redshift for Fast Analytical Reports and How to Migrate Your Oracle Data Warehouse to Amazon Redshift Using AWS SCT and AWS DMS.


About the Author

Larry Heathcote is a Principle Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out and the taste of classic barbeque.

 

 

 

Combine Transactional and Analytical Data Using Amazon Aurora and Amazon Redshift

Post Syndicated from Re Alvarez-Parmar original https://aws.amazon.com/blogs/big-data/combine-transactional-and-analytical-data-using-amazon-aurora-and-amazon-redshift/

A few months ago, we published a blog post about capturing data changes in an Amazon Aurora database and sending it to Amazon Athena and Amazon QuickSight for fast analysis and visualization. In this post, I want to demonstrate how easy it can be to take the data in Aurora and combine it with data in Amazon Redshift using Amazon Redshift Spectrum.

With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.

In this post, we describe how to combine data in Aurora in Amazon Redshift. Here’s an overview of the solution:

  • Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.
  • Save data in an Amazon S3
  • Query data using Amazon Redshift Spectrum.

We use the following services:

Serverless architecture for capturing and analyzing Aurora data changes

Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.

In this example, you take the changes in data in an Aurora database table and save it in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.

By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.

The following diagram shows the flow of data as it occurs in this tutorial:

The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.

Creating an Aurora database

First, create a database by following these steps in the Amazon RDS console:

  1. Sign in to the AWS Management Console, and open the Amazon RDS console.
  2. Choose Launch a DB instance, and choose Next.
  3. For Engine, choose Amazon Aurora.
  4. Choose a DB instance class. This example uses a small, since this is not a production database.
  5. In Multi-AZ deployment, choose No.
  6. Configure DB instance identifier, Master username, and Master password.
  7. Launch the DB instance.

After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.

The following screenshot shows the MySQL Workbench configuration:

Next, create a table in the database by running the following SQL statement:

Create Table
CREATE TABLE Sales (
InvoiceID int NOT NULL AUTO_INCREMENT,
ItemID int NOT NULL,
Category varchar(255),
Price double(10,2), 
Quantity int not NULL,
OrderDate timestamp,
DestinationState varchar(2),
ShippingType varchar(255),
Referral varchar(255),
PRIMARY KEY (InvoiceID)
)

You can now populate the table with some sample data. To generate sample data in your table, copy and run the following script. Ensure that the highlighted (bold) variables are replaced with appropriate values.

#!/usr/bin/python
import MySQLdb
import random
import datetime

db = MySQLdb.connect(host="AURORA_CNAME",
                     user="DBUSER",
                     passwd="DBPASSWORD",
                     db="DB")

states = ("AL","AK","AZ","AR","CA","CO","CT","DE","FL","GA","HI","ID","IL","IN",
"IA","KS","KY","LA","ME","MD","MA","MI","MN","MS","MO","MT","NE","NV","NH","NJ",
"NM","NY","NC","ND","OH","OK","OR","PA","RI","SC","SD","TN","TX","UT","VT","VA",
"WA","WV","WI","WY")

shipping_types = ("Free", "3-Day", "2-Day")

product_categories = ("Garden", "Kitchen", "Office", "Household")
referrals = ("Other", "Friend/Colleague", "Repeat Customer", "Online Ad")

for i in range(0,10):
    item_id = random.randint(1,100)
    state = states[random.randint(0,len(states)-1)]
    shipping_type = shipping_types[random.randint(0,len(shipping_types)-1)]
    product_category = product_categories[random.randint(0,len(product_categories)-1)]
    quantity = random.randint(1,4)
    referral = referrals[random.randint(0,len(referrals)-1)]
    price = random.randint(1,100)
    order_date = datetime.date(2016,random.randint(1,12),random.randint(1,30)).isoformat()

    data_order = (item_id, product_category, price, quantity, order_date, state,
    shipping_type, referral)

    add_order = ("INSERT INTO Sales "
                   "(ItemID, Category, Price, Quantity, OrderDate, DestinationState, \
                   ShippingType, Referral) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")

    cursor = db.cursor()
    cursor.execute(add_order, data_order)

    db.commit()

cursor.close()
db.close() 

The following screenshot shows how the table appears with the sample data:

Sending data from Amazon Aurora to Amazon S3

There are two methods available to send data from Amazon Aurora to Amazon S3:

  • Using a Lambda function
  • Using SELECT INTO OUTFILE S3

To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.

Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.

Creating a Kinesis data delivery stream

The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.

To create a delivery stream:

  1. Open the Kinesis Data Firehose console
  2. Choose Create delivery stream.
  3. For Delivery stream name, type AuroraChangesToS3.
  4. For Source, choose Direct PUT.
  5. For Record transformation, choose Disabled.
  6. For Destination, choose Amazon S3.
  7. In the S3 bucket drop-down list, choose an existing bucket, or create a new one.
  8. Enter a prefix if needed, and choose Next.
  9. For Data compression, choose GZIP.
  10. In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.
  11. Review all the details on the screen, and choose Create delivery stream when you’re finished.

 

Creating a Lambda function

Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.

To create the Lambda function:

  1. Open the AWS Lambda console.
  2. Ensure that you are in the AWS Region where your Amazon Aurora database is located.
  3. If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.
  4. Choose Author from scratch.
  5. Give your function a name and select Python 3.6 for Runtime
  6. Choose and existing or create a new Role, the role would need to have access to call firehose:PutRecord
  7. Choose Next on the trigger selection screen.
  8. Paste the following code in the code window. Change the stream_name variable to the Kinesis data delivery stream that you created in the previous step.
  9. Choose File -> Save in the code editor and then choose Save.
import boto3
import json

firehose = boto3.client('firehose')
stream_name = ‘AuroraChangesToS3’


def Kinesis_publish_message(event, context):
    
    firehose_data = (("%s,%s,%s,%s,%s,%s,%s,%s\n") %(event['ItemID'], 
    event['Category'], event['Price'], event['Quantity'],
    event['OrderDate'], event['DestinationState'], event['ShippingType'], 
    event['Referral']))
    
    firehose_data = {'Data': str(firehose_data)}
    print(firehose_data)
    
    firehose.put_record(DeliveryStreamName=stream_name,
    Record=firehose_data)

Note the Amazon Resource Name (ARN) of this Lambda function.

Giving Aurora permissions to invoke a Lambda function

To give Amazon Aurora permissions to invoke a Lambda function, you must attach an IAM role with appropriate permissions to the cluster. For more information, see Invoking a Lambda Function from an Amazon Aurora DB Cluster.

Once you are finished, the Amazon Aurora database has access to invoke a Lambda function.

Creating a stored procedure and a trigger in Amazon Aurora

Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.

DROP PROCEDURE IF EXISTS CDC_TO_FIREHOSE;
DELIMITER ;;
CREATE PROCEDURE CDC_TO_FIREHOSE (IN ItemID VARCHAR(255), 
									IN Category varchar(255), 
									IN Price double(10,2),
                                    IN Quantity int(11),
                                    IN OrderDate timestamp,
                                    IN DestinationState varchar(2),
                                    IN ShippingType varchar(255),
                                    IN Referral  varchar(255)) LANGUAGE SQL 
BEGIN
  CALL mysql.lambda_async('arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:CDCFromAuroraToKinesis', 
     CONCAT('{ "ItemID" : "', ItemID, 
            '", "Category" : "', Category,
            '", "Price" : "', Price,
            '", "Quantity" : "', Quantity, 
            '", "OrderDate" : "', OrderDate, 
            '", "DestinationState" : "', DestinationState, 
            '", "ShippingType" : "', ShippingType, 
            '", "Referral" : "', Referral, '"}')
     );
END
;;
DELIMITER ;

Create a trigger TR_Sales_CDC on the Sales table. When a new record is inserted, this trigger calls the CDC_TO_FIREHOSE stored procedure.

DROP TRIGGER IF EXISTS TR_Sales_CDC;
 
DELIMITER ;;
CREATE TRIGGER TR_Sales_CDC
  AFTER INSERT ON Sales
  FOR EACH ROW
BEGIN
  SELECT  NEW.ItemID , NEW.Category, New.Price, New.Quantity, New.OrderDate
  , New.DestinationState, New.ShippingType, New.Referral
  INTO @ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral;
  CALL  CDC_TO_FIREHOSE(@ItemID , @Category, @Price, @Quantity, @OrderDate
  , @DestinationState, @ShippingType, @Referral);
END
;;
DELIMITER ;

If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.

Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.

Every time a new record is inserted in the sales table, a stored procedure is called, and it updates data in Amazon S3.

Querying data in Amazon Redshift

In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.

Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.

Redshift Spectrum supports open, common data types, including CSV/TSV, Apache Parquet, SequenceFile, and RCFile. Files can be compressed using gzip or Snappy, with other data types and compression methods in the works.

First, create an Amazon Redshift cluster. Follow the steps in Launch a Sample Amazon Redshift Cluster.

Next, create an IAM role that has access to Amazon S3 and Athena. By default, Amazon Redshift Spectrum uses the Amazon Athena data catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3.

In the demo setup, I attached AmazonS3FullAccess and AmazonAthenaFullAccess. In a production environment, the IAM roles should follow the standard security of granting least privilege. For more information, see IAM Policies for Amazon Redshift Spectrum.

Attach the newly created role to the Amazon Redshift cluster. For more information, see Associate the IAM Role with Your Cluster.

Next, connect to the Amazon Redshift cluster, and create an external schema and database:

create external schema if not exists spectrum_schema
from data catalog 
database 'spectrum_db' 
region 'us-east-1'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'
create external database if not exists;

Don’t forget to replace the IAM role in the statement.

Then create an external table within the database:

 CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
  ItemID int,
  Category varchar,
  Price DOUBLE PRECISION,
  Quantity int,
  OrderDate TIMESTAMP,
  DestinationState varchar,
  ShippingType varchar,
  Referral varchar)
ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://{BUCKET_NAME}/CDC/'

Query the table, and it should contain data. This is a fact table.

select top 10 * from spectrum_schema.ecommerce_sales

 

Next, create a dimension table. For this example, we create a date/time dimension table. Create the table:

CREATE TABLE date_dimension (
  d_datekey           integer       not null sortkey,
  d_dayofmonth        integer       not null,
  d_monthnum          integer       not null,
  d_dayofweek                varchar(10)   not null,
  d_prettydate        date       not null,
  d_quarter           integer       not null,
  d_half              integer       not null,
  d_year              integer       not null,
  d_season            varchar(10)   not null,
  d_fiscalyear        integer       not null)
diststyle all;

Populate the table with data:

copy date_dimension from 's3://reparmar-lab/2016dates' 
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/redshiftspectrum'
DELIMITER ','
dateformat 'auto';

The date dimension table should look like the following:

Querying data in local and external tables using Amazon Redshift

Now that you have the fact and dimension table populated with data, you can combine the two and run analysis. For example, if you want to query the total sales amount by weekday, you can run the following:

select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales 
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate 
group by date_dimension.d_season

You get the following results:

Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:

With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.

Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.

Analyzing and visualizing Amazon Redshift data in Amazon QuickSight

Modify the Amazon Redshift security group to allow an Amazon QuickSight connection. For more information, see Authorizing Connections from Amazon QuickSight to Amazon Redshift Clusters.

After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.

Enter the database connection details, validate the connection, and create the data source.

Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.

Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.

On the next screen, choose Edit.

In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price*Quantity fields. Then choose Create. Save and visualize it.

Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Data formatted as month on the y-axis.

After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.

Final notes

Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.

In this example, we dealt with data being inserted, but triggers can be activated in response to an INSERT, UPDATE, or DELETE trigger.

Keep the following in mind:

  • Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.
  • A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.
  • Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.

In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.

For design considerations while using Redshift Spectrum, see Using Amazon Redshift Spectrum to Query External Data.

If you have questions or suggestions, please comment below.


Additional Reading

If you found this post useful, be sure to check out Capturing Data Changes in Amazon Aurora Using AWS Lambda and 10 Best Practices for Amazon Redshift Spectrum


About the Authors

Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.

 

 

 

Event-Driven Computing with Amazon SNS and AWS Compute, Storage, Database, and Networking Services

Post Syndicated from Christie Gifrin original https://aws.amazon.com/blogs/compute/event-driven-computing-with-amazon-sns-compute-storage-database-and-networking-services/

Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging

Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.

However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.

If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.

What is event-driven computing?

Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.

Which AWS services publish events to SNS natively?

Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.

Compute services

  • Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster.As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).

  • AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers.As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.

  • Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses.You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.

Storage services

  • Amazon S3: Object storage built to store and retrieve any amount of data.You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF).In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket.

  • Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud.You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically.Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.

  • Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup.You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.

  • AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data.You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers.As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.

Database services

  • Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud.RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints.As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.

  • Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud.ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect.To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.

  • Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools.Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a ready-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold.At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.

  • AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database.DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.

Networking services

  • Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources.You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.

  • AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections.You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.

More event-driven computing on AWS

In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including:

Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.

Conclusion

Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub to AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3 and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.

Start now by visiting Amazon SNS in the AWS Management Console, or by trying the AWS 10-Minute Tutorial, Send Fan-out Event Notifications with Amazon SNS and Amazon SQS.

 

Near Zero Downtime Migration from MySQL to DynamoDB

Post Syndicated from YongSeong Lee original https://aws.amazon.com/blogs/big-data/near-zero-downtime-migration-from-mysql-to-dynamodb/

Many companies consider migrating from relational databases like MySQL to Amazon DynamoDB, a fully managed, fast, highly scalable, and flexible NoSQL database service. For example, DynamoDB can increase or decrease capacity based on traffic, in accordance with business needs. The total cost of servicing can be optimized more easily than for the typical media-based RDBMS.

However, migrations can have two common issues:

  • Service outage due to downtime, especially when customer service must be seamlessly available 24/7/365
  • Different key design between RDBMS and DynamoDB

This post introduces two methods of seamlessly migrating data from MySQL to DynamoDB, minimizing downtime and converting the MySQL key design into one more suitable for NoSQL.

AWS services

I’ve included sample code that uses the following AWS services:

  • AWS Database Migration Service (AWS DMS) can migrate your data to and from most widely used commercial and open-source databases. It supports homogeneous and heterogeneous migrations between different database platforms.
  • Amazon EMR is a managed Hadoop framework that helps you process vast amounts of data quickly. Build EMR clusters easily with preconfigured software stacks that include Hive and other business software.
  • Amazon Kinesis can continuously capture and retain a vast amount of data such as transaction, IT logs, or clickstreams for up to 7 days.
  • AWS Lambda helps you run your code without provisioning or managing servers. Your code can be automatically triggered by other AWS services such Amazon Kinesis Streams.

Migration solutions

Here are the two options I describe in this post:

  1. Use AWS DMS

AWS DMS supports migration to a DynamoDB table as a target. You can use object mapping to restructure original data to the desired structure of the data in DynamoDB during migration.

  1. Use EMR, Amazon Kinesis, and Lambda with custom scripts

Consider this method when more complex conversion processes and flexibility are required. Fine-grained user control is needed for grouping MySQL records into fewer DynamoDB items, determining attribute names dynamically, adding business logic programmatically during migration, supporting more data types, or adding parallel control for one big table.

After the initial load/bulk-puts are finished, and the most recent real-time data is caught up by the CDC (change data capture) process, you can change the application endpoint to DynamoDB.

The method of capturing changed data in option 2 is covered in the AWS Database post Streaming Changes in a Database with Amazon Kinesis. All code in this post is available in the big-data-blog GitHub repo, including test codes.

Solution architecture

The following diagram shows the overall architecture of both options.

Option 1:  Use AWS DMS

This section discusses how to connect to MySQL, read the source data, and then format the data for consumption by the target DynamoDB database using DMS.

Create the replication instance and source and target endpoints

Create a replication instance that has sufficient storage and processing power to perform the migration job, as mentioned in the AWS Database Migration Service Best Practices whitepaper. For example, if your migration involves a large number of tables, or if you intend to run multiple concurrent replication tasks, consider using one of the larger instances. The service consumes a fair amount of memory and CPU.

As the MySQL user, connect to MySQL and retrieve data from the database with the privileges of SUPER, REPLICATION CLIENT. Enable the binary log and set the binlog_format parameter to ROW for CDC in the MySQL configuration. For more information about how to use DMS, see Getting Started  in the AWS Database Migration Service User Guide.

mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'welcome1';
mysql> GRANT all ON <database name>.* TO 'repl'@'%';
mysql> GRANT SUPER,REPLICATION CLIENT  ON *.* TO 'repl'@'%';

Before you begin to work with a DynamoDB database as a target for DMS, make sure that you create an IAM role for DMS to assume, and grant access to the DynamoDB target tables. Two endpoints must be created to connect the source and target. The following screenshot shows sample endpoints.

The following screenshot shows the details for one of the endpoints, source-mysql.

Create a task with an object mapping rule

In this example, assume that the MySQL table has a composite primary key (customerid + orderid + productid). You are going to restructure the key to the desired structure of the data in DynamoDB, using an object mapping rule.

In this case, the DynamoDB table has the hash key that is a combination of the customerid and orderid columns, and the sort key is the productid column. However, the partition key should be decided by the user in an actual migration, based on data ingestion and access pattern. You would usually use high-cardinality attributes. For more information about how to choose the right DynamoDB partition key, see the Choosing the Right DynamoDB Partition Key AWS Database blog post.

DMS automatically creates a corresponding attribute on the target DynamoDB table for the quantity column from the source table because rule-action is set to map-record-to-record and the column is not listed in the exclude-columns attribute list. For more information about map-record-to-record and map-record-to-document, see Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service.

Migration starts immediately after the task is created, unless you clear the Start task on create option. I recommend enabling logging to make sure that you are informed about what is going on with the migration task in the background.

The following screenshot shows the task creation page.

You can use the console to specify the individual database tables to migrate and the schema to use for the migration, including transformations. On the Guided tab, use the Where section to specify the schema, table, and action (include or exclude). Use the Filter section to specify the column name in a table and the conditions to apply.

Table mappings also can be created in JSON format. On the JSON tab, check Enable JSON editing.

Here’s an example of an object mapping rule that determines where the source data is located in the target. If you copy the code, replace the values of the following attributes. For more examples, see Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service.

  • schema-name
  • table-name
  • target-table-name
  • mapping-parameters
  • attribute-mappings
{
  "rules": [
   {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "1",
      "object-locator": {
        "schema-name": "mydatabase",
        "table-name": "purchase"
      },
      "rule-action": "include"
    },
    {
      "rule-type": "object-mapping",
      "rule-id": "2",
      "rule-name": "2",
      "rule-action": "map-record-to-record",
      "object-locator": {
        "schema-name": "mydatabase",
        "table-name": "purchase"
 
      },
      "target-table-name": "purchase",
      "mapping-parameters": {
        "partition-key-name": "customer_orderid",
        "sort-key-name": "productid",
        "exclude-columns": [
          "customerid",
          "orderid"           
        ],
        "attribute-mappings": [
          {
            "target-attribute-name": "customer_orderid",
            "attribute-type": "scalar",
            "attribute-sub-type": "string",
            "value": "${customerid}|${orderid}"
          },
          {
            "target-attribute-name": "productid",
            "attribute-type": "scalar",
            "attribute-sub-type": "string",
            "value": "${productid}"
          }
        ]
      }
    }
  ]
}

Start the migration task

If the target table specified in the target-table-name property does not exist in DynamoDB, DMS creates the table according to data type conversion rules for source and target data types. There are many metrics to monitor the progress of migration. For more information, see Monitoring AWS Database Migration Service Tasks.

The following screenshot shows example events and errors recorded by CloudWatch Logs.

DMS replication instances that you used for the migration should be deleted once all migration processes are completed. Any CloudWatch logs data older than the retention period is automatically deleted.

Option 2: Use EMR, Amazon Kinesis, and Lambda

This section discusses an alternative option using EMR, Amazon Kinesis, and Lambda to provide more flexibility and precise control. If you have a MySQL replica in your environment, it would be better to dump data from the replica.

Change the key design

When you decide to change your database from RDMBS to NoSQL, you need to find a more suitable key design for NoSQL, for performance as well as cost-effectiveness.

Similar to option #1, assume that the MySQL source has a composite primary key (customerid + orderid + productid). However, for this option, group the MySQL records into fewer DynamoDB items by customerid (hash key) and orderid (sort key). Also, remove the last column (productid) of the composite key by converting the record values productid column in MySQL to the attribute name in DynamoDB, and setting the attribute value as quantity.

This conversion method reduces the number of items. You can retrieve the same amount of information with fewer read capacity units, resulting in cost savings and better performance. For more information about how to calculate read/write capacity units, see Provisioned Throughput.

Migration steps

Option 2 has two paths for migration, performed at the same time:

  • Batch-puts: Export MySQL data, upload it to Amazon S3, and import into DynamoDB.
  • Real-time puts: Capture changed data in MySQL, send the insert/update/delete transaction to Amazon Kinesis Streams, and trigger the Lambda function to put data into DynamoDB.

To keep the data consistency and integrity, capturing and feeding data to Amazon Kinesis Streams should be started before the batch-puts process. The Lambda function should stand by and Streams should retain the captured data in the stream until the batch-puts process on EMR finishes. Here’s the order:

  1. Start real-time puts to Amazon Kinesis Streams.
  2. As soon as real-time puts commences, start batch-puts.
  3. After batch-puts finishes, trigger the Lambda function to execute put_item from Amazon Kinesis Streams to DynamoDB.
  4. Change the application endpoints from MySQL to DynamoDB.

Step 1:  Capture changing data and put into Amazon Kinesis Streams

Firstly, create an Amazon Kinesis stream to retain transaction data from MySQL. Set the Data retention period value based on your estimate for the batch-puts migration process. For data integrity, the retention period should be enough to hold all transactions until batch-puts migration finishes. However you do not necessarily need to select the maximum retention period. It depends on the amount of data to migrate.

In the MySQL configuration, set binlog_format to ROW to capture transactions by using the BinLogStreamReader module. The log_bin parameter must be set as well to enable the binlog. For more information, see the Streaming Changes in a Database with Amazon Kinesis AWS Database blog post.

 

[mysqld]
secure-file-priv = ""
log_bin=/data/binlog/binlog
binlog_format=ROW
server-id = 1
tmpdir=/data/tmp

The following sample code is a Python example that captures transactions and sends them to Amazon Kinesis Streams.

 

#!/usr/bin/env python
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
  DeleteRowsEvent,
  UpdateRowsEvent,
  WriteRowsEvent,
)

def main():
  kinesis = boto3.client("kinesis")

  stream = BinLogStreamReader(
    connection_settings= {
      "host": "<host IP address>",
      "port": <port number>,
      "user": "<user name>",
      "passwd": "<password>"},
    server_id=100,
    blocking=True,
    resume_stream=True,
    only_events=[DeleteRowsEvent, WriteRowsEvent, UpdateRowsEvent])

  for binlogevent in stream:
    for row in binlogevent.rows:
      event = {"schema": binlogevent.schema,
      "table": binlogevent.table,
      "type": type(binlogevent).__name__,
      "row": row
      }

      kinesis.put_record(StreamName="<Amazon Kinesis stream name>", Data=json.dumps(event), PartitionKey="default")
      print json.dumps(event)

if __name__ == "__main__":
main()

The following code is sample JSON data generated by the Python script. The type attribute defines the transaction recorded by that JSON record:

  • WriteRowsEvent = INSERT
  • UpdateRowsEvent = UPDATE
  • DeleteRowsEvent = DELETE
{"table": "purchase_temp", "row": {"values": {"orderid": "orderidA1", "quantity": 100, "customerid": "customeridA74187", "productid": "productid1"}}, "type": "WriteRowsEvent", "schema": "test"}
{"table": "purchase_temp", "row": {"before_values": {"orderid": "orderid1", "quantity": 1, "customerid": "customerid74187", "productid": "productid1"}, "after_values": {"orderid": "orderid1", "quantity": 99, "customerid": "customerid74187", "productid": "productid1"}}, "type": "UpdateRowsEvent", "schema": "test"}
{"table": "purchase_temp", "row": {"values": {"orderid": "orderid100", "quantity": 1, "customerid": "customerid74187", "productid": "productid1"}}, "type": "DeleteRowsEvent", "schema": "test"}

Step 2. Dump data from MySQL to DynamoDB

The easiest way is to use DMS, which recently added Amazon S3 as a migration target. For an S3 target, both full load and CDC data is written to CSV format. However, CDC is not a good fit as UPDATE and DELETE statements are not supported. For more information, see Using Amazon S3 as a Target for AWS Database Migration Service.

Another way to upload data to Amazon S3 is to use the INTO OUTFILE SQL clause and aws s3 sync CLI command in parallel with your own script. The degree of parallelism depends on your server capacity and local network bandwidth. You might find a third-party tool useful, such as pt-archiver (part of the Percona Toolkit see the appendix for details).

SELECT * FROM purchase WHERE <condition_1>
INTO OUTFILE '/data/export/purchase/1.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
SELECT * FROM purchase WHERE <condition_2>
INTO OUTFILE '/data/export/purchase/2.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
...
SELECT * FROM purchase WHERE <condition_n>
INTO OUTFILE '/data/export/purchase/n.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';

I recommend the aws s3 sync command for this use case. This command works internally with the S3 multipart upload feature. Pattern matching can exclude or include particular files. In addition, if the sync process crashes in the middle of processing, you do not need to upload the same files again. The sync command compares the size and modified time of files between local and S3 versions, and synchronizes only local files whose size and modified time are different from those in S3. For more information, see the sync command in the S3 section of the AWS CLI Command Reference.

$ aws s3 sync /data/export/purchase/ s3://<your bucket name>/purchase/ 
$ aws s3 sync /data/export/<other path_1>/ s3://<your bucket name>/<other path_1>/
...
$ aws s3 sync /data/export/<other path_n>/ s3://<your bucket name>/<other path_n>/ 

After all data is uploaded to S3, put it into DynamoDB. There are two ways to do this:

  • Use Hive with an external table
  • Write MapReduce code

Hive with an external table

Create a Hive external table against the data on S3 and insert it into another external table against the DynamoDB table, using the org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler property. To improve productivity and the scalability, consider using Brickhouse, which is a collection of UDFs for Hive.

The following sample code assumes that the Hive table for DynamoDB is created with the products column, which is of type ARRAY<STRING >.  The productid and quantity columns are aggregated, grouping by customerid and orderid, and inserted into the products column with the CollectUDAF columns provided by Brickhouse.

hive> DROP TABLE purchase_ext_s3; 
--- To read data from S3 
hive> CREATE EXTERNAL TABLE purchase_ext_s3 (
customerid string,
orderid    string,
productid  string,
quantity   string) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
LOCATION 's3://<your bucket name>/purchase/';

Hive> drop table purchase_ext_dynamodb ; 
--- To connect to DynamoDB table  
Hive> CREATE EXTERNAL TABLE purchase_ext_dynamodb (
      customerid STRING, orderid STRING, products ARRAY<STRING>)
      STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' 
      TBLPROPERTIES ("dynamodb.table.name" = "purchase", 
      "dynamodb.column.mapping" = "customerid:customerid,orderid:orderid,products:products");

--- Batch-puts to DynamoDB using Brickhouse 
hive> add jar /<jar file path>/brickhouse-0.7.1-SNAPSHOT.jar ; 
hive> create temporary function collect as 'brickhouse.udf.collect.CollectUDAF';
hive> INSERT INTO purchase_ext_dynamodb 
select customerid as customerid , orderid as orderid
       ,collect(concat(productid,':' ,quantity)) as products
      from purchase_ext_s3
      group by customerid, orderid; 

Unfortunately, the MAP, LIST, BOOLEAN, and NULL data types are not supported by the  DynamoDBStorageHandler class, so the ARRAY<STRING> data type has been chosen. The products column of ARRAY<STRING> data type in Hive is matched to the StringSet type attribute in DynamoDB. The sample code mostly shows how Brickhouse works, and only for those who want to aggregate multiple records into one StringSet type attribute in DynamoDB.

Python MapReduce with Hadoop Streaming

A mapper task reads each record from the input data on S3, and maps input key-value pairs to intermediate key-value pairs. It divides source data from S3 into two parts (key part and value part) delimited by a TAB character (“\t”). Mapper data is sorted in order by their intermediate key (customerid and orderid) and sent to the reducer. Records are put into DynamoDB in the reducer step.

#!/usr/bin/env python
import sys
 
# get all lines from stdin
for line in sys.stdin:
    line = line.strip()
    cols = line.split(',')
# divide source data into Key and attribute part.
# example output : “cusotmer1,order1	product1,10”
    print '%s,%s\t%s,%s' % (cols[0],cols[1],cols[2],cols[3] )

Generally, the reduce task receives the output produced after map processing (which is key/list-of-values pairs) and then performs an operation on the list of values against each key.

In this case, the reducer is written in Python and is based on STDIN/STDOUT/hadoop streaming. The enumeration data type is not available. The reducer receives data sorted and ordered by the intermediate key set in the mapper, customerid and orderid (cols[0],cols[1]) in this case, and stores all attributes for the specific key in the item_data dictionary. The attributes in the item_data dictionary are put, or flushed, into DynamoDB every time a new intermediate key comes from sys.stdin.

#!/usr/bin/env python
import sys
import boto.dynamodb
 
# create connection to DynamoDB
current_keys = None
conn = boto.dynamodb.connect_to_region( '<region>', aws_access_key_id='<access key id>', aws_secret_access_key='<secret access key>')
table = conn.get_table('<dynamodb table name>')
item_data = {}

# input comes from STDIN emitted by Mapper
for line in sys.stdin:
    line = line.strip()
    dickeys, items  = line.split('\t')
    products = items.split(',')
    if current_keys == dickeys:
       item_data[products[0]]=products[1]  
    else:
        if current_keys:
          try:
              mykeys = current_keys.split(',') 
              item = table.new_item(hash_key=mykeys[0],range_key=mykeys[1], attrs=item_data )
              item.put() 
          except Exception ,e:
              print 'Exception occurred! :', e.message,'==> Data:' , mykeys
        item_data = {}
        item_data[products[0]]=products[1]
        current_keys = dickeys

# put last data
if current_keys == dickeys:
   print 'Last one:' , current_keys #, item_data
   try:
       mykeys = dickeys.split(',')
       item = table.new_item(hash_key=mykeys[0] , range_key=mykeys[1], attrs=item_data )
       item.put()
   except Exception ,e:
print 'Exception occurred! :', e.message, '==> Data:' , mykeys

To run the MapReduce job, connect to the EMR master node and run a Hadoop streaming job. The hadoop-streaming.jar file location or name could be different, depending on your EMR version. Exception messages that occur while reducers run are stored at the directory assigned as the –output option. Hash key and range key values are also logged to identify which data causes exceptions or errors.

$ hadoop fs -rm -r s3://<bucket name>/<output path>
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
           -input s3://<bucket name>/<input path> -output s3://<bucket name>/<output path>\
           -file /<local path>/mapper.py -mapper /<local path>/mapper.py \
           -file /<local path>/reducer.py -reducer /<local path>/reducer.py

In my migration experiment using the above scripts, with self-generated test data, I found the following results, including database size and the time taken to complete the migration.

ServerMySQL instancem4.2xlarge
EMR cluster

master : 1 x m3.xlarge

core  : 2 x m4.4xlarge

DynamoDB2000 write capacity unit
DataNumber of records1,000,000,000
Database file size (.ibc)100.6 GB
CSV files size37 GB
Performance (time)Export to CSV6 min 10 sec
Upload to S3 (sync)3 min 30 sec
Import to DynamoDBdepending on write capacity unit

 

The following screenshot shows the performance results by write capacity.

Note that the performance result is flexible and can vary depending on the server capacity, network bandwidth, degree of parallelism, conversion logic, program language, and other conditions. All provisioned write capacity units are consumed by the MapReduce job for data import, so the more you increase the size of the EMR cluster and write capacity units of DynamoDB table, the less time it takes to complete. Java-based MapReduce code would be more flexible for function and MapReduce framework.

Step 3: Amazon Lambda function updates DynamoDB by reading data from Amazon Kinesis

In the Lambda console, choose Create a Lambda function and the kinesis-process-record-python blueprint. Next, in the Configure triggers page, select the stream that you just created.

The Lambda function must have an IAM role with permissions to read from Amazon Kinesis and put items into DynamoDB.

The Lambda function can recognize the transaction type of the record by looking up the type attribute. The transaction type determines the method for conversion and update.

For example, when a JSON record is passed to the function, the function looks up the type attribute. It also checks whether an existing item in the DynamoDB table has the same key with the incoming record. If so, the existing item must be retrieved and saved in a dictionary variable (item, in this case). Apply a new update information command to the item dictionary before it is put back into DynamoDB table. This prevents the existing item from being overwritten by the incoming record.

from __future__ import print_function

import base64
import json
import boto3

print('Loading function')
client = boto3.client('dynamodb')

def lambda_handler(event, context):
    #print("Received event: " + json.dumps(event, indent=2))
    for record in event['Records']:
        # Amazon Kinesis data is base64-encoded so decode here
        payload = base64.b64decode(record['kinesis']['data'])
        print("Decoded payload: " + payload)
        data = json.loads(payload)
        
        # user logic for data triggered by WriteRowsEvent
        if data["type"] == "WriteRowsEvent":
            my_table = data["table"]
            my_hashkey = data["row"]["values"]["customerid"]
            my_rangekey = data["row"]["values"]["orderid"]
            my_productid = data["row"]["values"]["productid"]
            my_quantity = str( data["row"]["values"]["quantity"] )
            try:
                response = client.get_item( Key={'customerid':{'S':my_hashkey} , 'orderid':{'S':my_rangekey}} ,TableName = my_table )
                if 'Item' in response:
                    item = response['Item']
                    item[data["row"]["values"]["productid"]] = {"S":my_quantity}
                    result1 = client.put_item(Item = item , TableName = my_table )
                else:
                    item = { 'customerid':{'S':my_hashkey} , 'orderid':{'S':my_rangekey} , my_productid :{"S":my_quantity}  }
                    result2 = client.put_item( Item = item , TableName = my_table )
            except Exception, e:
                print( 'WriteRowsEvent Exception ! :', e.message  , '==> Data:' ,data["row"]["values"]["customerid"]  , data["row"]["values"]["orderid"] )
        
        # user logic for data triggered by UpdateRowsEvent
        if data["type"] == "UpdateRowsEvent":
            my_table = data["table"]
            
        # user logic for data triggered by DeleteRowsEvent    
        if data["type"] == "DeleteRowsEvent":
            my_table = data["table"]
            
            
    return 'Successfully processed {} records.'.format(len(event['Records']))

Step 4:  Switch the application endpoint to DynamoDB

Application codes need to be refactored when you change from MySQL to DynamoDB. The following simple Java code snippets focus on the connection and query part because it is difficult to cover all cases for all applications. For more information, see Programming with DynamoDB and the AWS SDKs.

Query to MySQL

The following sample code shows a common way to connect to MySQL and retrieve data.

import java.sql.* ;
...
try {
    Connection conn =  DriverManager.getConnection("jdbc:mysql://<host name>/<database name>" , "<user>" , "<password>");
    stmt = conn.createStatement();
    String sql = "SELECT quantity as quantity FROM purchase WHERE customerid = '<customerid>' and orderid = '<orderid>' and productid = '<productid>'";
    ResultSet rs = stmt.executeQuery(sql);

    while(rs.next()){ 
       int quantity  = rs.getString("quantity");   //Retrieve by column name 
       System.out.print("quantity: " + quantity);  //Display values 
       }
} catch (SQLException ex) {
    // handle any errors
    System.out.println("SQLException: " + ex.getMessage());}
...
==== Output ====
quantity:1
Query to DynamoDB

To retrieve items from DynamoDB, follow these steps:

  1. Create an instance of the DynamoDB
  2. Create an instance of the Table
  3. Add the withHashKey and withRangeKeyCondition methods to an instance of the QuerySpec
  4. Execute the query method with the querySpec instance previously created. Items are retrieved as JSON format, so use the getJSON method to look up a specific attribute in an item.
...
DynamoDB dynamoDB = new DynamoDB( new AmazonDynamoDBClient(new ProfileCredentialsProvider()));

Table table = dynamoDB.getTable("purchase");

QuerySpec querySpec = new QuerySpec()
        .withHashKey("customerid" , "customer1")  // hashkey name and its value 
        .withRangeKeyCondition(new RangeKeyCondition("orderid").eq("order1") ) ; // Ranage key and its condition value 

ItemCollection<QueryOutcome> items = table.query(querySpec); 

Iterator<Item> iterator = items.iterator();          
while (iterator.hasNext()) {
Item item = iterator.next();
System.out.println(("quantity: " + item.getJSON("product1"));   // 
}
...
==== Output ====
quantity:1

Conclusion

In this post, I introduced two options for seamlessly migrating data from MySQL to DynamoDB and minimizing downtime during the migration. Option #1 used DMS, and option #2 combined EMR, Amazon Kinesis, and Lambda. I also showed you how to convert the key design in accordance with database characteristics to improve read/write performance and reduce costs. Each option has advantages and disadvantages, so the best option depends on your business requirements.

The sample code in this post is not enough for a complete, efficient, and reliable data migration code base to be reused across many different environments. Use it to get started, but design for other variables in your actual migration.

I hope this post helps you plan and implement your migration and minimizes service outages. If you have questions or suggestions, please leave a comment below.

Appendix

To install the Percona Toolkit:

# Install Percona Toolkit

$ wget https://www.percona.com/downloads/percona-toolkit/3.0.2/binary/redhat/6/x86_64/percona-toolkit-3.0.2-1.el6.x86_64.rpm

$ yum install perl-IO-Socket-SSL

$ yum install perl-TermReadKey

$ rpm -Uvh percona-toolkit-3.0.2-1.el6.x86_64.rpm

# run pt-archiver

Example command:

$ pt-archiver –source h=localhost,D=blog,t=purchase –file ‘/data/export/%Y-%m-%d-%D.%t’  –where “1=1” –limit 10000 –commit-each

 


About the Author

Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”

 

 

 


Converging Data Silos to Amazon Redshift Using AWS DMS

 

AWS Database Migration Service – 20,000 Migrations and Counting

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-database-migration-service-20000-migrations-and-counting/

I first wrote about AWS Database Migration Service just about a year ago in my AWS Database Migration Service post. At that time I noted that over 1,000 AWS customers had already made use of the service as part of their move to AWS.

As a quick recap, AWS Database Migration Service and Schema Conversion Tool (SCT) help our customers migrate their relational data from expensive, proprietary databases and data warehouses (either on premises on in the cloud, but with restrictive licensing terms either way) to more cost-effective cloud-based databases and data warehouses such as Amazon Aurora, Amazon Redshift, MySQL, MariaDB, and PostgreSQL, with minimal downtime along the way. Our customers tell us that they love the flexibility and the cost-effective nature of these moves. For example, moving to Amazon Aurora gives them access to a database that is MySQL and PostgreSQL compatible, at 1/10th the cost of a commercial database. Take a peek at our AWS Database Migration Services Customer Testimonials to see how Expedia, Thomas Publishing, Pega, and Veoci have made use of the service.

20,000 Unique Migrations
I’m pleased to be able to announce that our customers have already used AWS Database Migration Service to migrate 20,000 unique databases to AWS and that the pace continues to accelerate (we reached 10,000 migrations in September of 2016).

We’ve added many new features to DMS and SCT over the past year. Here’s a summary:

Learn More
Here are some resources that will help you to learn more and to get your own migrations underway, starting with some recent webinars:

Migrating From Sharded to Scale-Up – Some of our customers implemented a scale-out strategy in order to deal with their relational workload, sharding their database across multiple independent pieces, each running on a separate host. As part of their migration, these customers often consolidate two or more shards onto a single Aurora instance, reducing complexity, increasing reliability, and saving money along the way. If you’d like to do this, check out the blog post, webinar recording, and presentation.

Migrating From Oracle or SQL Server to Aurora – Other customers migrate from commercial databases such as Oracle or SQL Server to Aurora. If you would like to do this, check out this presentation and the accompanying webinar recording.

We also have plenty of helpful blog posts on the AWS Database Blog:

  • Reduce Resource Consumption by Consolidating Your Sharded System into Aurora – “You might, in fact, save bunches of money by consolidating your sharded system into a single Aurora instance or fewer shards running on Aurora. That is exactly what this blog post is all about.”
  • How to Migrate Your Oracle Database to Amazon Aurora – “This blog post gives you a quick overview of how you can use the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) to facilitate and simplify migrating your commercial database to Amazon Aurora. In this case, we focus on migrating from Oracle to the MySQL-compatible Amazon Aurora.”
  • Cross-Engine Database Replication Using AWS Schema Conversion Tool and AWS Database Migration Service – “AWS SCT makes heterogeneous database migrations easier by automatically converting source database schema. AWS SCT also converts the majority of custom code, including views and functions, to a format compatible with the target database.”
  • Database Migration—What Do You Need to Know Before You Start? – “Congratulations! You have convinced your boss or the CIO to move your database to the cloud. Or you are the boss, CIO, or both, and you finally decided to jump on the bandwagon. What you’re trying to do is move your application to the new environment, platform, or technology (aka application modernization), because usually people don’t move databases for fun.”
  • How to Script a Database Migration – “You can use the AWS DMS console or the AWS CLI or the AWS SDK to perform the database migration. In this blog post, I will focus on performing the migration with the AWS CLI.”

The documentation includes five helpful walkthroughs:

There’s also a hands-on lab (you will need to register in order to participate).

See You at a Summit
The DMS team is planning to attend and present at many of our upcoming AWS Summits and would welcome the opportunity to discuss your database migration requirements in person.

Jeff;

 

 

Amazon RDS – 2016 in Review

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-rds-2016-in-review/

Even though we published 294 posts on this blog last year, I left out quite a number of worthwhile launches! Today I would like to focus on Amazon Relational Database Service (RDS) and recap all of the progress that the teams behind this family of services made in 2016. The team focused on four major areas last year:

  • High Availability
  • Enhanced Monitoring
  • Simplified Security
  • Database Engine Updates

Let’s take a look at each of these areas…

High Availability
Relational databases are at the heart of many types of applications. In order to allow our customers to build applications that are highly available, RDS has offered multi-AZ support since early 2010 (read Amazon RDS – Multi-AZ Deployments For Enhanced Availability & Reliability for more info). Instead of spending weeks setting up multiple instances, arranging for replication, writing scripts to detect network, instance, & network issues, making failover decisions, and bringing a new secondary instance online, you simply opt for Multi-AZ Deployment when you create the Database Instance. RDS also makes it easy for you to create cross-region read replicas.

Here are some of the other enhancements that we made in 2016 in order to help you to achieve high availability:

Enhanced Monitoring
We announced the first big step toward enhanced monitoring at the end of 2015 (New – Enhanced Monitoring for Amazon RDS) with support for MySQL, MariaDB, and Amazon Aurora and then made additional announcements in 2016:

Simplified Security
We want to make it as easy and simple as possible for you to use encryption to protect your data, whether it is at rest or in motion. Here are the enhancements that we made in this area last year:

Database Engine Updates
The open source community and the vendors of commercial databases add features and produce new releases at a rapid pace and we track their work very closely, aiming to update RDS as quickly as possible after each significant release. Here’s what we did in 2016:

Stay Tuned
We’ve already made some big announcements this year (you can find them in the AWS What’s New for 2017) with plenty more in store including the recently announced PostgreSQL-compatible version of Aurora, so stay tuned! You may also want to subscribe to the AWS Database Blog for detailed posts that will show you how to get the most from RDS, Amazon Aurora, and Amazon ElastiCache.

Jeff;

PS – This post does not include all of the enhancements that we made to AWS Database Migration Service or the Schema Conversion Tool last year. I’m working on another post on that topic.

Month in Review: January 2017

Post Syndicated from Derek Young original https://aws.amazon.com/blogs/big-data/month-in-review-january-2017/

Another month of big data solutions on the Big Data Blog!

Take a look at our summaries below and learn, comment, and share. Thank you for reading!

NEW POSTS

Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn’t it be great to get a chance to interact with the users before they’re about to leave? In this post, learn how ironSource worked with gaming studio Upopa to build an efficient, cheap, and accurate way to battle churn and make data-driven decisions using ironSource Atom’s data pipeline and Amazon ML.

Create a Healthcare Data Hub with AWS and Mirth Connect
Healthcare providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient. In this post, learn how to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.

Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017 in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios. The workshop is soliciting research papers that describe preliminary and ongoing research results.

Converging Data Silos to Amazon Redshift Using AWS DMS
In this post, learn to use AWS Database Migration Service (AWS DMS) and other AWS services to easily converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.

Run Mixed Workloads with Amazon Redshift Workload Management
It’s common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time. Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn’t as valuable as other more business-critical jobs. This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.

Secure Amazon EMR with Encryption
In this post, learn how to set up encryption of data at multiple levels using security configurations with EMR. You’ll walk through the step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration.

FROM THE ARCHIVE

Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
In this post, learn to generate a predictive model for flight delays that can be used to help pick the flight least likely to add to your travel stress. To accomplish this, you’ll use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models.


Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

Converging Data Silos to Amazon Redshift Using AWS DMS

Post Syndicated from Pratim Das original https://aws.amazon.com/blogs/big-data/converging-data-silos-to-amazon-redshift-using-aws-dms/

Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.

This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogonous data sources to Amazon Redshift. You can then use Amazon QuickSight, to visualize the converged dataset to gain additional business insights.

AWS service overview

Here’s a brief overview of AWS services that help with data convergence.

AWS DMS

With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:

  • Amazon Aurora
  • PostgreSQL
  • MySQL
  • MariaDB
  • Oracle
  • SAP ASE
  • SQL Server

DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.

Amazon QuickSight

Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.

QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.

Amazon Redshift

Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.

Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the 10s of TB range and requires high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.

Walkthrough

Assume that you have an events company specializing on sports, and have built a MySQL database that holds data for the players and the sporting events. Customers and ticket information is stored in another database; in this case, assume it is PostgresSQL and this gets updated when customer purchases tickets from our website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.

These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or other cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer didn’t complete buying the ticket even though it was added to the shopping cart) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. We then use AWS Data Pipeline to orchestrate a process to cleanse that data using Amazon EMR and make it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.

Architecture

In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.

DataSilos_1_1

After the data is available on Amazon Redshift, you could easily build BI dashboards and generate intelligent reports to gain insights using Amazon QuickSight. You could also take this a step further and build a model using Amazon Machine Learning. Amazon Machine Learning uses powerful algorithms to create ML models by finding patterns in your existing data stored in Amazon S3, or Amazon Redshift. It is also highly scalable and can generate billions of predictions daily, and serve those predictions in real time and at high throughput.

Creating source databases

For the purposes of this post, create two RDS databases, one with a MySQL engine, and the other with PostgreSQL and then load some data. These represent a real-life scenario where databases could be located on-premises, on AWS, or both. Just as in real life, there may be more than two source databases; the process described in this post would still be reasonably similar.

Follow the steps in Tutorial: Create a Web Server and an Amazon RDS Database to create the two source databases. Use the links from the main tutorial page to see how to connect to specific databases and load data. For more information, see:

Make a note of the security group that you create and associate all the RDS instances with it. Call it “MyRDSSecurityGroup”.

Afterward, you should be able to see all the databases listed in the RDS Instances dashboard.

DataSilos_2_1

Setting up a target Amazon Redshift cluster

Set up a two-node cluster as shown below, with a cluster name similar to “consolidated-dwh” and a database named similar to “mydwh”. You could also set up a one-node cluster based on the instance type; the instance type may be available on the AWS Free Tier.

DataSilos_3

In the next step, choose Publicly Accessible for non-production usage to keep the configuration simple.

Also, for simplicity, choose the same VPC where you have placed the RDS instances and include the MyRDSSecurityGroup in the list of security groups allowed to access the Amazon Redshift cluster.

Setting up DMS

You can set up DMS easily, as indicated in the AWS Database Migration Service post on the AWS blog. However, rather than using the wizard, you may take a step-by-step approach:

  1. Create a replication instance.
  2. Create the endpoints for the two source databases and the target Amazon Redshift database.
  3. Create a task to synchronize each of the sources to the target.

Create a replication instance

In the DMS console, choose Replication instances, Create replication instance. The instance type you select depends on the data volume you deal with. After setup, you should be able to see your replication instance.

DataSilos_4
Create endpoints

In the DMS console, choose Endpoints, Create endpoint. You need to configure the two source endpoints representing the PostgreSQL and MySQL RDS databases. You also need to create the target endpoint by supplying the Amazon Redshift database that you created in the previous steps. After configuration, the endpoints look similar to the following screenshot:

DataSilos_5

Create a task and start data migration

You can rely on DMS to create the target tables in your target Amazon Redshift database or you may want to take advantage of AWS Schema Conversion Tool to create the target schema and also do a compatibility analysis in the process. Using the AWS Schema Conversion Tool is particularly useful when migrating using heterogeneous data sources. For more information, see Getting Started with the AWS Schema Conversion Tool.

For simplicity, I avoided using the AWS Schema Conversion Tool in this post and used jump to DMS to create the target schema and underlying tables and then set up the synchronization between the data sources and the target.

In the DMS console, choose Tasks, Create Tasks. Fill in the fields as shown in the following screenshot:

DataSilos_6

Note that given the source is RDS MySQL and you chose Migrate data and replicate on going changes, you need to enable bin log retention. Other engines have other requirements and DMS prompts you accordingly. For this particular case, run the following command:

call mysql.rds_set_configuration('binlog retention hours', 24);

Now, choose Start task on create. In the task settings, choose Drop tables on target to have DMS create the tables, if you haven’t already created the target tables using the AWS Schema Conversion Tool, as described earlier. Choose Enable logging but note that this incurs additional costs as the generated CloudWatch logs require storage.

In the table mappings, for Schema to migrate, ensure that the correct schema has been selected from the source databases. DMS creates the schema on the target if it does not already exist.

Repeat for the other data source, choosing the other source endpoint and the same Amazon Redshift target endpoint. In the table mappings section, choose Custom and customize as appropriate. For example, you can specify the schema names to include and tables to exclude, as shown in the following screenshot:

DataSilos_7

Using this custom configuration, you can perform some minor transformations, such as down casing target table names, or choosing a different target schema for both sources.

After both tasks have successfully completed, the Tasks tab now looks like the following:

DataSilos_8

Running queries on Amazon Redshift

In Amazon Redshift, select your target cluster and choose Loads. You can see all operations that DMS performed in the background to load the data from the two source databases into Amazon Redshift.

DataSilos_9

Ensure change data capture is working

Generate additional data on Amazon RDS PostgreSQL in the ticketing.sporting_event_ticket by running the script provided in the generate_mlb_season.sql aws-database-migration-samples GitHub repository. Notice that the tasks have caught up and are showing the migration in progress. You can also query the target tables and see that the new data is in the target table.

Visualization options

Set up QuickSight and configure your data source to be your Amazon Redshift database. If you have a Redshift cluster in the same account and in the same region, it will appear when you clock Redshift (Auto-discovered) from the data sets page, as shown below.

DataSilos_16

Access to any other Redshift cluster can be configured as follows using the Redshift (Manual connect) link:

DataSilos_10

Now, create your data set. Choose New Data Set and select either a new data source or an existing data source listed at the bottom of the page. Choose Ticketing for Sports.

DataSilos_11_1
In the next step, choose Create Data Set.

In the next step, when QuickSight prompts you to choose your table, you can select the schema and the required table and choose Select. Alternatively, you may choose Edit/Preview data.

DataSilos_11

You could use the graphical options shown below to start creating your data set. Given that you have data from multiple sources, it’s safe to assume that your target tables are in separate schemas. Select the schema and tables, select the other schemas, and bring the appropriate tables to the palette by selecting them using the check box to the right. For each join, select the join type and then map the appropriate keys between the tables until the two reds turn to one of the blue join types.

DataSilos_12

In this case, rather than preparing the data set in the palette, you provide a custom SQL query. On the left pane, choose Tables, Switch to Custom SQL tool.

Paste the following SQL query in the Custom SQL field and enter a name.

select to_char( e.start_date_time, 'YYYY-MM-DD' ) event_date, 
to_char( e.start_date_time, 'HH24:MI' ) start_time, e.sold_out, 
e.sport_type_name, l.name event_location, l.city event_city, 
l.seating_capacity, hteam.name home_team, hl.name home_field, 
hl.city home_city, ateam.name away_team, al.name away_field, 
al.city away_city, sum( t.ticket_price ) total_ticket_price, 
avg( t.ticket_price ) average_ticket_price, 
min ( t.ticket_price ) cheapest_ticket, 
max( t.ticket_price ) most_expensive_ticket, count(*) num_tickets

from ticketing.sporting_event_ticket t, sourcemysql.sporting_event e, 
sourcemysql.sport_location l, sourcemysql.sport_team hteam, 
sourcemysql.sport_team ateam, sourcemysql.sport_location hl, 
sourcemysql.sport_location al

where t.sporting_event_id = e.id
and t.sport_location_id = l.id
and e.home_team_id = hteam.id
and e.away_team_id = ateam.id
and hteam.home_field_id = hl.id
and ateam.home_field_id = al.id

group by to_char( e.start_date_time, 'YYYY-MM-DD' ), 
to_char( e.start_date_time, 'HH24:MI' ), e.start_date_time, 
e.sold_out, e.sport_type_name, l.name, l.city, l.seating_capacity, 
hteam.name, ateam.name, hl.name, hl.city, al.name, al.city;

DataSilos_13

You can choose Save and visualize and view the QuickSight visualization toolkit and filter options. Here you can build your story or dashboards and start sharing them with your team.

Now, you can choose various fields from the field list and the various measures to get the appropriate visualization, like the one shown below. This one was aimed to understand the date at which each event in each city reached the maximum capacity.

DataSilos_14

You can also combine many such visualizations and prepare your dashboard for management reporting. The analysis may also drive where you need to invent on campaigns and where things are going better than expected to ensure a healthy sales pipeline.

DataSilos_15

Summary

In this post, you used AWS DMS to converge multiple heterogonous data sources to an Amazon Redshift cluster. You also used Quicksight to create a data visualization on the converged dataset to provide you with additional insights. Although we have used an e-commerce use case related to an events company, this concept of converging multiple data silos to a target is also applicable to other verticals such as retail, health-care, finance, insurance and banking, gaming, and so on.

If you have questions or suggestions, please comment below.


About the Author

 

Pratim_DasPratim Das is a Specialist Solutions Architect for Analytics in EME. He works with customers on big data and analytical projects, helping them build solutions on AWS, using AWS services and (or) other open source or commercial solution from the big data echo system. In his spare time he enjoys cooking and creating exciting new recipes always with that spicy kick.

 

 


Related

Derive Insights from IoT in Minutes using AWS IoT, Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight

o_realtime_1_1_1