We have been busy adding new features and capabilities to Amazon Redshift, and we wanted to give you a glimpse of what we’ve been doing over the past year. In this article, we recap a few of our enhancements and provide a set of resources that you can use to learn more and get the most out of your Amazon Redshift implementation.
In 2017, we made more than 30 announcements about Amazon Redshift. We listened to you, our customers, and delivered Redshift Spectrum, a feature of Amazon Redshift that gives you the ability to extend analytics to your data lake—without moving data. We launched new DC2 nodes, doubling performance at the same price. We also announced many new features that provide greater scalability, better performance, more automation, and easier ways to manage your analytics workloads.
To see a full list of our launches, visit our what’s new page—and be sure to subscribe to our RSS feed.
Major launches in 2017
Amazon Redshift Spectrum—extend analytics to your data lake, without moving data
We launched Amazon Redshift Spectrum to give you the freedom to store data in Amazon S3, in open file formats, and have it available for analytics without the need to load it into your Amazon Redshift cluster. It enables you to easily join datasets across Redshift clusters and S3 to provide unique insights that you would not be able to obtain by querying independent data silos.
With Redshift Spectrum, you can run SQL queries against data in an Amazon S3 data lake as easily as you analyze data stored in Amazon Redshift. And you can do it without loading data or resizing the Amazon Redshift cluster based on growing data volumes. Redshift Spectrum separates compute and storage to meet workload demands for data size, concurrency, and performance. Redshift Spectrum scales processing across thousands of nodes, so results are fast, even with massive datasets and complex queries. You can query open file formats that you already use—such as Apache Avro, CSV, Grok, ORC, Apache Parquet, RCFile, RegexSerDe, SequenceFile, TextFile, and TSV—directly in Amazon S3, without any data movement.
“For complex queries, Redshift Spectrum provided a 67 percent performance gain,” said Rafi Ton, CEO, NUVIAD. “Using the Parquet data format, Redshift Spectrum delivered an 80 percent performance improvement. For us, this was substantial.”
DC2 nodes—twice the performance of DC1 at the same price
We launched second-generation Dense Compute (DC2) nodes to provide low latency and high throughput for demanding data warehousing workloads. DC2 nodes feature powerful Intel E5-2686 v4 (Broadwell) CPUs, fast DDR4 memory, and NVMe-based solid state disks (SSDs). We’ve tuned Amazon Redshift to take advantage of the better CPU, network, and disk on DC2 nodes, providing up to twice the performance of DC1 at the same price. Our DC2.8xlarge instances now provide twice the memory per slice of data and an optimized storage layout with 30 percent better storage utilization.
“Redshift allows us to quickly spin up clusters and provide our data scientists with a fast and easy method to access data and generate insights,” said Bradley Todd, technology architect at Liberty Mutual. “We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1.”
On average, our customers are seeing 3x to 5x performance gains for most of their critical workloads.
We introduced short query acceleration to speed up execution of queries such as reports, dashboards, and interactive analysis. Short query acceleration uses machine learning to predict the execution time of a query, and to move short running queries to an express short query queue for faster processing.
We launched results caching to deliver sub-second response times for queries that are repeated, such as dashboards, visualizations, and those from BI tools. Results caching has an added benefit of freeing up resources to improve the performance of all other queries.
We also introduced late materialization to reduce the amount of data scanned for queries with predicate filters by batching and factoring in the filtering of predicates before fetching data blocks in the next column. For example, if only 10 percent of the table rows satisfy the predicate filters, Amazon Redshift can potentially save 90 percent of the I/O for the remaining columns to improve query performance.
We launched query monitoring rules and pre-defined rule templates. These features make it easier for you to set metrics-based performance boundaries for workload management (WLM) queries, and specify what action to take when a query goes beyond those boundaries. For example, for a queue that’s dedicated to short-running queries, you might create a rule that aborts queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.
Customer insights
Amazon Redshift and Redshift Spectrum serve customers across a variety of industries and sizes, from startups to large enterprises. Visit our customer page to see the success that customers are having with our recent enhancements. Learn how companies like Liberty Mutual Insurance saw a 9x reduction in month-end reporting time using DC2 nodes. On this page, you can find case studies, videos, and other content that show how our customers are using Amazon Redshift to drive innovation and business results.
In addition, check out these resources to learn about the success our customers are having building out a data warehouse and data lake integration solution with Amazon Redshift:
You can enhance your Amazon Redshift data warehouse by working with industry-leading experts. Our AWS Partner Network (APN) Partners have certified their solutions to work with Amazon Redshift. They offer software, tools, integration, and consulting services to help you at every step. Visit our Amazon Redshift Partner page and choose an APN Partner. Or, use AWS Marketplace to find and immediately start using third-party software.
To see what our Partners are saying about Amazon Redshift Spectrum and our DC2 nodes mentioned earlier, read these blog posts:
If you are evaluating or considering a proof of concept with Amazon Redshift, or you need assistance migrating your on-premises or other cloud-based data warehouse to Amazon Redshift, our team of product experts and solutions architects can help you with architecting, sizing, and optimizing your data warehouse. Contact us using this support request form, and let us know how we can assist you.
If you are an Amazon Redshift customer, we offer a no-cost health check program. Our team of database engineers and solutions architects give you recommendations for optimizing Amazon Redshift and Amazon Redshift Spectrum for your specific workloads. To learn more, email us at [email protected].
Larry Heathcote is a Principal Product Marketing Manager at Amazon Web Services for data warehousing and analytics. Larry is passionate about seeing the results of data-driven insights on business outcomes. He enjoys family time, home projects, grilling out and the taste of classic barbeque.
This post courtesy of Shane Baldacchino, Solutions Architect at Amazon Web Services.
Many customers ask for guidance on migrating end-to-end solutions running on virtual machines over to AWS. This post provides an overview of moving a common WordPress blog running on a virtualized platform to AWS, including re-pointing the DNS records associated with the website.
AWS Server Migration Service (AWS SMS) is an agentless service that makes it easier and faster for you to migrate thousands of on-premises workloads to AWS. AWS SMS allows you to automate, schedule, and track incremental replications of live server volumes, making it easier for you to coordinate large-scale server migrations.
Walkthrough
The key elements of this migration process include the following steps:
Establish your AWS environment.
Replicate your database.
Download the SMS Connector from the AWS Management Console.
Configure AWS SMS and Hypervisor permissions.
Install and configure the SMS Connector appliance.
Import your virtual machine inventory and create a replication job.
Launch your Amazon EC2 instance.
Change your DNS records to resolve the WordPress blog to your EC2 instance.
Before you start, ensure that your source system's OS and vCenter version are supported by AWS. For more information, see the Server Migration Service FAQ.
Establish your AWS environment
For this walkthrough, your WordPress blog is currently running as a two-tier LAMP stack in a corporate data center. You have a frontend running Apache and PHP, plus a backend database running on MySQL. All systems are hosted on a virtualized platform.
First, establish your AWS environment. If your organization is new to AWS, this may include account or subaccount creation, a new virtual private cloud (VPC), and associated subnets, route tables, internet gateways, and so on. Think of this phase as setting up your software-defined data center. For more information, see Getting Started with Amazon EC2.
The blog is a two-tier stack, so go with two private subnets. Because you want it to be highly available, use multiple Availability Zones. A zone resides within an AWS Region. Each zone is isolated, but the zones within a region are connected through low-latency links. This allows architects and solution designers to build highly available solutions.
Replicate your database
WordPress uses a MySQL relational database. You could continue to run MySQL yourself and manage the EC2 instances needed to maintain and scale the database. For this walkthrough, use this opportunity to migrate to Amazon Aurora on Amazon RDS, as it is a MySQL-compatible database. Not only is Amazon Aurora a high-performance database engine, it also frees you up to focus on application development by handling time-consuming database administration tasks, including backups, software patching, monitoring, scaling, and replication.
Use AWS Database Migration Service (AWS DMS) to migrate your MySQL database to Amazon Aurora easily and securely. After a DMS replication instance has been created, configure the source and destination endpoints and create a replication task.
By attaching to the MySQL binlog, you can seed in the current data in the database and also capture all future state changes in near real time. For more information, see Migrating a MySQL-Compatible Database to Amazon Aurora.
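The console steps above can also be scripted. The following boto3 sketch creates a full-load-plus-CDC replication task; the endpoint and instance ARNs, schema name, and task identifier are placeholders, not values from this walkthrough.
import json

import boto3

dms = boto3.client("dms")

# include every table in the WordPress schema (placeholder schema name)
table_mappings = json.dumps({
    "rules": [{
        "rule-type": "selection",
        "rule-id": "1",
        "rule-name": "include-wordpress",
        "object-locator": {"schema-name": "wordpress", "table-name": "%"},
        "rule-action": "include"}]})

task = dms.create_replication_task(
    ReplicationTaskIdentifier="wordpress-mysql-to-aurora",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # seed existing data, then apply ongoing changes
    TableMappings=table_mappings)
print(task["ReplicationTask"]["Status"])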
Finally, the task shows that you are replicating current data in your WordPress blog database and future changes from MySQL into Amazon Aurora.
Download the SMS Connector from the AWS Management Console
Now, use AWS SMS to migrate your Apache PHP frontend to EC2. AWS SMS is delivered as an appliance for your hypervisor.
To download the SMS Connector, log in to the console and choose Server Migration Service, Connectors, SMS Connector setup guide.
Configure AWS SMS
Your hypervisor and AWS SMS will need an appropriate user with sufficient privileges to perform migrations.
AWS SMS – Use the AWS CLI or the IAM console to create an IAM user with the ServerMigrationConnector policy attached.
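If you prefer to script this step, here is a minimal boto3 sketch; the user name is hypothetical, and the generated access key pair is what you later supply when configuring the connector.
import boto3

iam = boto3.client("iam")

# hypothetical user name for the connector
iam.create_user(UserName="sms-connector")
iam.attach_user_policy(
    UserName="sms-connector",
    PolicyArn="arn:aws:iam::aws:policy/ServerMigrationConnector")

# the connector is configured with this user's access key pair
access_key = iam.create_access_key(UserName="sms-connector")["AccessKey"]
print(access_key["AccessKeyId"])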
Launch a new VM based on the SMS Connector that you downloaded. To configure the connector, connect to it over HTTPS; you can obtain the SMS Connector IP address from your hypervisor. In this example, the connector IP address is 10.0.0.31, so in your browser, enter https://10.0.0.31.
Configure the connector with the IAM and hypervisor credentials that you created earlier.
After it’s configured, and the associated connectivity and authentication checks have passed, return to the console and view your connector in AWS SMS.
Import your virtual machine inventory and create a replication job
After validating that the SMS Connector is in a “HEALTHY” state, import your server catalog to AWS SMS. This process can take up to a minute.
Select the server to migrate and choose Create replication job. The console guides you through the process. The time that the initial replication task takes to complete is dependent on the available bandwidth and the size of your VM. After the initial seed replication, network bandwidth is minimized as AWS SMS replicates only incremental changes occurring on the VM.
Launch your EC2 instance
When your replication task is complete, the artifact created by AWS SMS is a custom AMI that you can use to deploy an EC2 instance. Follow the usual process to launch your EC2 instance, noting that you may need to replace any host-based firewalls with security groups and NACLs.
When you create an EC2 instance, ensure that you pick the most suitable EC2 instance type and size to match your performance requirements while optimizing for cost.
While your new EC2 instance is a replica of your on-premises VM, you should always validate that applications are functioning. How you do this differs on an application-by-application basis. You can use a combination of approaches, such as editing a local host file and testing your application, SSH, or Telnet.
From the RDS console, get your connection string details and update your WordPress configuration file to point to the Amazon Aurora database. As WordPress is expecting a MySQL database and Amazon Aurora is MySQL-compliant, this change of database engine is transparent to WordPress.
Change your DNS records to resolve the WordPress blog to your EC2 instance
You have validated that your WordPress application is running correctly, and you are still receiving changes from your on-premises data center via AWS DMS into your Amazon Aurora database.
You can now update your DNS zone file using Amazon Route 53. Route 53 can be driven by multiple methods: console, SDK, or AWS CLI.
For this walkthrough, update your DNS zone file via the AWS CLI. The JSON example shows upserting the A record in your zone to resolve to your EC2 instance.
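The original change-batch JSON is not reproduced here. The following boto3 sketch expresses the same UPSERT; the hosted zone ID, record name, and IP address are placeholders, and the equivalent JSON document can instead be passed to the aws route53 change-resource-record-sets command.
import boto3

route53 = boto3.client("route53")

# placeholder zone, record name, and target IP address
route53.change_resource_record_sets(
    HostedZoneId="Z1EXAMPLE",
    ChangeBatch={
        "Comment": "Cut the blog over to the new EC2 instance",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "blog.example.com.",
                "Type": "A",
                "TTL": 300,
                "ResourceRecords": [{"Value": "203.0.113.10"}]}}]})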
Use the AWS CLI to execute the request and update the record in your zone file. The cut-over period between the original off-cloud location and AWS is defined by the TTL in the SOA (start of authority) record in your DNS zone. During this period, any requests resolving to your off-cloud server that result in database writes are automatically replicated to your Amazon Aurora instance via AWS DMS.
You have now successfully migrated your WordPress blog to AWS. Based on the TTL of your DNS zone file, end users slowly resolve the WordPress blog to AWS.
After you have validated your successful migration, be sure to delete your AWS DMS task and your AWS SMS replication job.
Summary
In this post, you moved a WordPress blog to AWS, using AWS SMS to migrate the web front end, AWS DMS to migrate the database, and Amazon Route 53 to re-point the associated DNS records.
Many architectures can be extended to take advantage of the inherent benefits of AWS with little effort. For example, by placing an Application Load Balancer in front of your web tier and using Amazon CloudWatch metrics to drive Auto Scaling policies, you remove the single point of failure of a single Amazon EC2 instance and ensure that your deployed capacity closely follows customer demand. Think big and get building!
With Amazon Redshift, you can build petabyte-scale data warehouses that unify data from a variety of internal and external sources. Because Amazon Redshift is optimized for complex queries (often involving multiple joins) across large tables, it can handle large volumes of retail, inventory, and financial data without breaking a sweat.
In this post, we describe how to combine data from Amazon Aurora with data in Amazon Redshift. Here’s an overview of the solution:
Use AWS Lambda functions with Amazon Aurora to capture data changes in a table.
Serverless architecture for capturing and analyzing Aurora data changes
Consider a scenario in which an e-commerce web application uses Amazon Aurora for a transactional database layer. The company has a sales table that captures every single sale, along with a few corresponding data items. This information is stored as immutable data in a table. Business users want to monitor the sales data and then analyze and visualize it.
In this example, you take the data changes in an Aurora database table and save them in Amazon S3. After the data is captured in Amazon S3, you combine it with data in your existing Amazon Redshift cluster for analysis.
By the end of this post, you will understand how to capture data events in an Aurora table and push them out to other AWS services using AWS Lambda.
The following diagram shows the flow of data as it occurs in this tutorial:
The starting point in this architecture is a database insert operation in Amazon Aurora. When the insert statement is executed, a custom trigger calls a Lambda function and forwards the inserted data. Lambda writes the data that it received from Amazon Aurora to a Kinesis data delivery stream. Kinesis Data Firehose writes the data to an Amazon S3 bucket. Once the data is in an Amazon S3 bucket, it is queried in place using Amazon Redshift Spectrum.
Creating an Aurora database
First, create a database by following these steps in the Amazon RDS console:
Sign in to the AWS Management Console, and open the Amazon RDS console.
Choose Launch a DB instance, and choose Next.
For Engine, choose Amazon Aurora.
Choose a DB instance class. This example uses a small instance class, since this is not a production database.
In Multi-AZ deployment, choose No.
Configure DB instance identifier, Master username, and Master password.
Launch the DB instance.
After you create the database, use MySQL Workbench to connect to the database using the CNAME from the console. For information about connecting to an Aurora database, see Connecting to an Amazon Aurora DB Cluster.
The following screenshot shows the MySQL Workbench configuration:
Next, create a table in the database by running the following SQL statement:
Create Table
CREATE TABLE Sales (
InvoiceID int NOT NULL AUTO_INCREMENT,
ItemID int NOT NULL,
Category varchar(255),
Price double(10,2),
Quantity int not NULL,
OrderDate timestamp,
DestinationState varchar(2),
ShippingType varchar(255),
Referral varchar(255),
PRIMARY KEY (InvoiceID)
)
You can now populate the table with some sample data. To generate sample data in your table, run an insert script such as the one that follows, making sure that the connection variables are replaced with appropriate values for your environment.
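The original data-generation script is not reproduced here; the following is a minimal sketch that inserts random rows using the PyMySQL client library, with placeholder connection details.
import random

import pymysql  # assumes the PyMySQL client library is installed

# placeholder connection details for your Aurora cluster endpoint
connection = pymysql.connect(host="<aurora-cluster-endpoint>",
                             user="<username>",
                             password="<password>",
                             database="<database-name>")

categories = ["Books", "Electronics", "Garden", "Toys"]
states = ["WA", "CA", "NY", "TX"]
shipping_types = ["Standard", "Expedited", "Overnight"]
referrals = ["Search", "Email", "Social"]

try:
    with connection.cursor() as cursor:
        for _ in range(100):
            cursor.execute(
                "INSERT INTO Sales (ItemID, Category, Price, Quantity, OrderDate, "
                "DestinationState, ShippingType, Referral) "
                "VALUES (%s, %s, %s, %s, NOW(), %s, %s, %s)",
                (random.randint(1, 500),
                 random.choice(categories),
                 round(random.uniform(5, 200), 2),
                 random.randint(1, 10),
                 random.choice(states),
                 random.choice(shipping_types),
                 random.choice(referrals)))
    connection.commit()
finally:
    connection.close()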
The following screenshot shows how the table appears with the sample data:
Sending data from Amazon Aurora to Amazon S3
There are two methods available to send data from Amazon Aurora to Amazon S3:
Using a Lambda function
Using SELECT INTO OUTFILE S3
To demonstrate the ease of setting up integration between multiple AWS services, we use a Lambda function to send data to Amazon S3 using Amazon Kinesis Data Firehose.
Alternatively, you can use a SELECT INTO OUTFILE S3 statement to query data from an Amazon Aurora DB cluster and save it directly in text files that are stored in an Amazon S3 bucket. However, with this method, there is a delay between the time that the database transaction occurs and the time that the data is exported to Amazon S3 because the default file size threshold is 6 GB.
Creating a Kinesis data delivery stream
The next step is to create a Kinesis data delivery stream, since it’s a dependency of the Lambda function.
To create a delivery stream:
Open the Kinesis Data Firehose console
Choose Create delivery stream.
For Delivery stream name, type AuroraChangesToS3.
For Source, choose Direct PUT.
For Record transformation, choose Disabled.
For Destination, choose Amazon S3.
In the S3 bucket drop-down list, choose an existing bucket, or create a new one.
Enter a prefix if needed, and choose Next.
For Data compression, choose GZIP.
In IAM role, choose either an existing role that has access to write to Amazon S3, or choose to generate one automatically. Choose Next.
Review all the details on the screen, and choose Create delivery stream when you’re finished.
Creating a Lambda function
Now you can create a Lambda function that is called every time there is a change that needs to be tracked in the database table. This Lambda function passes the data to the Kinesis data delivery stream that you created earlier.
To create the Lambda function:
Open the AWS Lambda console.
Ensure that you are in the AWS Region where your Amazon Aurora database is located.
If you have no Lambda functions yet, choose Get started now. Otherwise, choose Create function.
Choose Author from scratch.
Give your function a name and select Python 3.6 for Runtime.
Choose an existing role or create a new one; the role needs permission to call firehose:PutRecord.
Choose Next on the trigger selection screen.
Paste the function code (see the sketch after these steps) in the code window. Change the stream_name variable to the name of the Kinesis data delivery stream that you created in the previous step.
Choose File -> Save in the code editor and then choose Save.
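The function body itself is not reproduced above, so here is a minimal sketch of what the handler can look like, assuming the event passed in is the JSON document built by the Aurora stored procedure and that the delivery stream is the AuroraChangesToS3 stream created earlier.
import json

import boto3

firehose = boto3.client("firehose")
stream_name = "AuroraChangesToS3"  # the delivery stream created earlier


def lambda_handler(event, context):
    # forward the change record emitted by the Aurora stored procedure to Firehose
    firehose.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")})
    return "Record sent to Kinesis Data Firehose"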
Once the function is saved, also make sure that the Amazon Aurora cluster has been granted access to invoke Lambda functions; for Aurora MySQL, you do this by associating an IAM role that allows the lambda:InvokeFunction action with the cluster.
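If you prefer to script that permission step, here is a hedged boto3 sketch; the cluster identifier, cluster parameter group name, and role ARN are placeholders, and the aws_default_lambda_role parameter applies to Aurora MySQL clusters.
import boto3

rds = boto3.client("rds")

role_arn = "arn:aws:iam::123456789012:role/AuroraInvokeLambdaRole"  # placeholder

# associate the role with the cluster
rds.add_role_to_db_cluster(
    DBClusterIdentifier="<aurora-cluster-id>",
    RoleArn=role_arn)

# Aurora MySQL also reads the role from the cluster parameter group
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="<cluster-parameter-group>",
    Parameters=[{"ParameterName": "aws_default_lambda_role",
                 "ParameterValue": role_arn,
                 "ApplyMethod": "immediate"}])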
Creating a stored procedure and a trigger in Amazon Aurora
Now, go back to MySQL Workbench, and run the following command to create a new stored procedure. When this stored procedure is called, it invokes the Lambda function you created. Change the ARN in the following code to your Lambda function’s ARN.
DROP PROCEDURE IF EXISTS CDC_TO_FIREHOSE;
DELIMITER ;;
CREATE PROCEDURE CDC_TO_FIREHOSE (IN ItemID VARCHAR(255),
IN Category varchar(255),
IN Price double(10,2),
IN Quantity int(11),
IN OrderDate timestamp,
IN DestinationState varchar(2),
IN ShippingType varchar(255),
IN Referral varchar(255)) LANGUAGE SQL
BEGIN
CALL mysql.lambda_async('arn:aws:lambda:us-east-1:XXXXXXXXXXXXX:function:CDCFromAuroraToKinesis',
CONCAT('{ "ItemID" : "', ItemID,
'", "Category" : "', Category,
'", "Price" : "', Price,
'", "Quantity" : "', Quantity,
'", "OrderDate" : "', OrderDate,
'", "DestinationState" : "', DestinationState,
'", "ShippingType" : "', ShippingType,
'", "Referral" : "', Referral, '"}')
);
END
;;
DELIMITER ;
Create a trigger TR_Sales_CDC on the Sales table. When a new record is inserted, this trigger calls the CDC_TO_FIREHOSE stored procedure.
DROP TRIGGER IF EXISTS TR_Sales_CDC;
DELIMITER ;;
CREATE TRIGGER TR_Sales_CDC
AFTER INSERT ON Sales
FOR EACH ROW
BEGIN
SELECT NEW.ItemID , NEW.Category, New.Price, New.Quantity, New.OrderDate
, New.DestinationState, New.ShippingType, New.Referral
INTO @ItemID , @Category, @Price, @Quantity, @OrderDate
, @DestinationState, @ShippingType, @Referral;
CALL CDC_TO_FIREHOSE(@ItemID , @Category, @Price, @Quantity, @OrderDate
, @DestinationState, @ShippingType, @Referral);
END
;;
DELIMITER ;
If a new row is inserted in the Sales table, the Lambda function that is mentioned in the stored procedure is invoked.
Verify that data is being sent from the Lambda function to Kinesis Data Firehose to Amazon S3 successfully. You might have to insert a few records, depending on the size of your data, before new records appear in Amazon S3. This is due to Kinesis Data Firehose buffering. To learn more about Kinesis Data Firehose buffering, see the “Amazon S3” section in Amazon Kinesis Data Firehose Data Delivery.
Every time a new record is inserted in the Sales table, the stored procedure is called, and the new data lands in Amazon S3 through Kinesis Data Firehose.
Querying data in Amazon Redshift
In this section, you use the data you produced from Amazon Aurora and consume it as-is in Amazon Redshift. In order to allow you to process your data as-is, where it is, while taking advantage of the power and flexibility of Amazon Redshift, you use Amazon Redshift Spectrum. You can use Redshift Spectrum to run complex queries on data stored in Amazon S3, with no need for loading or other data prep.
Just create a data source and issue your queries to your Amazon Redshift cluster as usual. Behind the scenes, Redshift Spectrum scales to thousands of instances on a per-query basis, ensuring that you get fast, consistent performance even as your dataset grows to beyond an exabyte! Being able to query data that is stored in Amazon S3 means that you can scale your compute and your storage independently. You have the full power of the Amazon Redshift query model and all the reporting and business intelligence tools at your disposal. Your queries can reference any combination of data stored in Amazon Redshift tables and in Amazon S3.
Redshift Spectrum supports open, common data types, including CSV/TSV, Apache Parquet, SequenceFile, and RCFile. Files can be compressed using gzip or Snappy, with other data types and compression methods in the works.
Next, create an IAM role that has access to Amazon S3 and Athena. By default, Amazon Redshift Spectrum uses the Amazon Athena data catalog. Your cluster needs authorization to access your external data catalog in AWS Glue or Athena and your data files in Amazon S3.
In the demo setup, I attached AmazonS3FullAccess and AmazonAthenaFullAccess. In a production environment, the IAM roles should follow the standard security practice of granting least privilege. For more information, see IAM Policies for Amazon Redshift Spectrum. With the role in place, create an external schema:
create external schema if not exists spectrum_schema
from data catalog
database 'spectrum_db'
region 'us-east-1'
IAM_ROLE 'arn:aws:iam::XXXXXXXXXXXX:role/RedshiftSpectrumRole'
create external database if not exists;
Don’t forget to replace the IAM role in the statement.
Then create an external table within the database:
CREATE EXTERNAL TABLE IF NOT EXISTS spectrum_schema.ecommerce_sales(
ItemID int,
Category varchar,
Price DOUBLE PRECISION,
Quantity int,
OrderDate TIMESTAMP,
DestinationState varchar,
ShippingType varchar,
Referral varchar)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION 's3://{BUCKET_NAME}/CDC/'
Query the table, and it should contain data. This is a fact table.
select top 10 * from spectrum_schema.ecommerce_sales
Next, create a dimension table. For this example, we create a date/time dimension table. Create the table:
CREATE TABLE date_dimension (
d_datekey integer not null sortkey,
d_dayofmonth integer not null,
d_monthnum integer not null,
d_dayofweek varchar(10) not null,
d_prettydate date not null,
d_quarter integer not null,
d_half integer not null,
d_year integer not null,
d_season varchar(10) not null,
d_fiscalyear integer not null)
diststyle all;
Populate the table with data:
copy date_dimension from 's3://reparmar-lab/2016dates'
iam_role 'arn:aws:iam::XXXXXXXXXXXX:role/redshiftspectrum'
DELIMITER ','
dateformat 'auto';
The date dimension table should look like the following:
Querying data in local and external tables using Amazon Redshift
Now that you have the fact and dimension table populated with data, you can combine the two and run analysis. For example, if you want to query the total sales amount by season, you can run the following:
select sum(quantity*price) as total_sales, date_dimension.d_season
from spectrum_schema.ecommerce_sales
join date_dimension on spectrum_schema.ecommerce_sales.orderdate = date_dimension.d_prettydate
group by date_dimension.d_season
You get the following results:
Similarly, you can replace d_season with d_dayofweek to get sales figures by weekday:
With Amazon Redshift Spectrum, you pay only for the queries you run against the data that you actually scan. We encourage you to use file partitioning, columnar data formats, and data compression to significantly minimize the amount of data scanned in Amazon S3. This is important for data warehousing because it dramatically improves query performance and reduces cost.
Partitioning your data in Amazon S3 by date, time, or any other custom keys enables Amazon Redshift Spectrum to dynamically prune nonrelevant partitions to minimize the amount of data processed. If you store data in a columnar format, such as Parquet, Amazon Redshift Spectrum scans only the columns needed by your query, rather than processing entire rows. Similarly, if you compress your data using one of the supported compression algorithms in Amazon Redshift Spectrum, less data is scanned.
Analyzing and visualizing Amazon Redshift data in Amazon QuickSight
After modifying the Amazon Redshift security group, go to Amazon QuickSight. Create a new analysis, and choose Amazon Redshift as the data source.
Enter the database connection details, validate the connection, and create the data source.
Choose the schema to be analyzed. In this case, choose spectrum_schema, and then choose the ecommerce_sales table.
Next, we add a custom field for Total Sales = Price*Quantity. In the drop-down list for the ecommerce_sales table, choose Edit analysis data sets.
On the next screen, choose Edit.
In the data prep screen, choose New Field. Add a new calculated field Total Sales $, which is the product of the Price and Quantity fields. Then choose Create. Save and visualize it.
Next, to visualize total sales figures by month, create a graph with Total Sales on the x-axis and Order Date formatted as month on the y-axis.
After you’ve finished, you can use Amazon QuickSight to add different columns from your Amazon Redshift tables and perform different types of visualizations. You can build operational dashboards that continuously monitor your transactional and analytical data. You can publish these dashboards and share them with others.
Final notes
Amazon QuickSight can also read data in Amazon S3 directly. However, with the method demonstrated in this post, you have the option to manipulate, filter, and combine data from multiple sources or Amazon Redshift tables before visualizing it in Amazon QuickSight.
In this example, we dealt with data being inserted, but triggers can be activated in response to INSERT, UPDATE, or DELETE statements.
Keep the following in mind:
Be careful when invoking a Lambda function from triggers on tables that experience high write traffic. This would result in a large number of calls to your Lambda function. Although calls to the lambda_async procedure are asynchronous, triggers are synchronous.
A statement that results in a large number of trigger activations does not wait for the call to the AWS Lambda function to complete. But it does wait for the triggers to complete before returning control to the client.
Similarly, you must account for Amazon Kinesis Data Firehose limits. By default, Kinesis Data Firehose is limited to a maximum of 5,000 records/second. For more information, see Monitoring Amazon Kinesis Data Firehose.
In certain cases, it may be optimal to use AWS Database Migration Service (AWS DMS) to capture data changes in Aurora and use Amazon S3 as a target. For example, AWS DMS might be a good option if you don’t need to transform data from Amazon Aurora. The method used in this post gives you the flexibility to transform data from Aurora using Lambda before sending it to Amazon S3. Additionally, the architecture has the benefits of being serverless, whereas AWS DMS requires an Amazon EC2 instance for replication.
Re Alvarez-Parmar is a solutions architect for Amazon Web Services. He helps enterprises achieve success through technical guidance and thought leadership. In his spare time, he enjoys spending time with his two kids and exploring outdoors.
Contributed by Otavio Ferreira, Manager, Software Development, AWS Messaging
Like other developers around the world, you may be tackling increasingly complex business problems. A key success factor, in that case, is the ability to break down a large project scope into smaller, more manageable components. A service-oriented architecture guides you toward designing systems as a collection of loosely coupled, independently scaled, and highly reusable services. Microservices take this even further. To improve performance and scalability, they promote fine-grained interfaces and lightweight protocols.
However, the communication among isolated microservices can be challenging. Services are often deployed onto independent servers and don’t share any compute or storage resources. Also, you should avoid hard dependencies among microservices, to preserve maintainability and reusability.
If you apply the pub/sub design pattern, you can effortlessly decouple and independently scale out your microservices and serverless architectures. A pub/sub messaging service, such as Amazon SNS, promotes event-driven computing that statically decouples event publishers from subscribers, while dynamically allowing for the exchange of messages between them. An event-driven architecture also introduces the responsiveness needed to deal with complex problems, which are often unpredictable and asynchronous.
What is event-driven computing?
Given the context of microservices, event-driven computing is a model in which subscriber services automatically perform work in response to events triggered by publisher services. This paradigm can be applied to automate workflows while decoupling the services that collectively and independently work to fulfil these workflows. Amazon SNS is an event-driven computing hub, in the AWS Cloud, that has native integration with several AWS publisher and subscriber services.
Which AWS services publish events to SNS natively?
Several AWS services have been integrated as SNS publishers and, therefore, can natively trigger event-driven computing for a variety of use cases. In this post, I specifically cover AWS compute, storage, database, and networking services, as depicted below.
Compute services
Auto Scaling: Helps you ensure that you have the correct number of Amazon EC2 instances available to handle the load for your application. You can configure Auto Scaling lifecycle hooks to trigger events, as Auto Scaling resizes your EC2 cluster. As an example, you may want to warm up the local cache store on newly launched EC2 instances, and also download log files from other EC2 instances that are about to be terminated. To make this happen, set an SNS topic as your Auto Scaling group’s notification target, then subscribe two Lambda functions to this SNS topic. The first function is responsible for handling scale-out events (to warm up cache upon provisioning), whereas the second is in charge of handling scale-in events (to download logs upon termination).
AWS Elastic Beanstalk: An easy-to-use service for deploying and scaling web applications and web services developed in a number of programming languages. You can configure event notifications for your Elastic Beanstalk environment so that notable events can be automatically published to an SNS topic, then pushed to topic subscribers. As an example, you may use this event-driven architecture to coordinate your continuous integration pipeline (such as Jenkins CI). That way, whenever an environment is created, Elastic Beanstalk publishes this event to an SNS topic, which triggers a subscribing Lambda function, which then kicks off a CI job against your newly created Elastic Beanstalk environment.
Elastic Load Balancing: Automatically distributes incoming application traffic across Amazon EC2 instances, containers, or other resources identified by IP addresses. You can configure CloudWatch alarms on Elastic Load Balancing metrics, to automate the handling of events derived from Classic Load Balancers. As an example, you may leverage this event-driven design to automate latency profiling in an Amazon ECS cluster behind a Classic Load Balancer. In this example, whenever your ECS cluster breaches your load balancer latency threshold, an event is posted by CloudWatch to an SNS topic, which then triggers a subscribing Lambda function. This function runs a task on your ECS cluster to trigger a latency profiling tool, hosted on the cluster itself. This can enhance your latency troubleshooting exercise by making it timely.
Amazon S3: Object storage built to store and retrieve any amount of data. You can enable S3 event notifications, and automatically get them posted to SNS topics, to automate a variety of workflows. For instance, imagine that you have an S3 bucket to store incoming resumes from candidates, and a fleet of EC2 instances to encode these resumes from their original format (such as Word or text) into a portable format (such as PDF). In this example, whenever new files are uploaded to your input bucket, S3 publishes these events to an SNS topic, which in turn pushes these messages into subscribing SQS queues. Then, encoding workers running on EC2 instances poll these messages from the SQS queues; retrieve the original files from the input S3 bucket; encode them into PDF; and finally store them in an output S3 bucket. (A minimal notification-configuration sketch appears after this list of services.)
Amazon EFS: Provides simple and scalable file storage, for use with Amazon EC2 instances, in the AWS Cloud. You can configure CloudWatch alarms on EFS metrics, to automate the management of your EFS systems. For example, consider a highly parallelized genomics analysis application that runs against an EFS system. By default, this file system is instantiated on the “General Purpose” performance mode. Although this performance mode allows for lower latency, it might eventually impose a scaling bottleneck. Therefore, you may leverage an event-driven design to handle it automatically. Basically, as soon as the EFS metric “Percent I/O Limit” breaches 95%, CloudWatch could post this event to an SNS topic, which in turn would push this message into a subscribing Lambda function. This function automatically creates a new file system, this time on the “Max I/O” performance mode, then switches the genomics analysis application to this new file system. As a result, your application starts experiencing higher I/O throughput rates.
Amazon Glacier: A secure, durable, and low-cost cloud storage service for data archiving and long-term backup. You can set a notification configuration on an Amazon Glacier vault so that when a job completes, a message is published to an SNS topic. Retrieving an archive from Amazon Glacier is a two-step asynchronous operation, in which you first initiate a job, and then download the output after the job completes. Therefore, SNS helps you eliminate polling your Amazon Glacier vault to check whether your job has been completed, or not. As usual, you may subscribe SQS queues, Lambda functions, and HTTP endpoints to your SNS topic, to be notified when your Amazon Glacier job is done.
AWS Snowball: A petabyte-scale data transport solution that uses secure appliances to transfer large amounts of data. You can leverage Snowball notifications to automate workflows related to importing data into and exporting data from AWS. More specifically, whenever your Snowball job status changes, Snowball can publish this event to an SNS topic, which in turn can broadcast the event to all its subscribers. As an example, imagine a Geographic Information System (GIS) that distributes high-resolution satellite images to users via Web browser. In this example, the GIS vendor could capture up to 80 TB of satellite images; create a Snowball job to import these files from an on-premises system to an S3 bucket; and provide an SNS topic ARN to be notified upon job status changes in Snowball. After Snowball changes the job status from “Importing” to “Completed”, Snowball publishes this event to the specified SNS topic, which delivers this message to a subscribing Lambda function, which finally creates a CloudFront web distribution for the target S3 bucket, to serve the images to end users.
Amazon RDS: Makes it easy to set up, operate, and scale a relational database in the cloud. RDS leverages SNS to broadcast notifications when RDS events occur. As usual, these notifications can be delivered via any protocol supported by SNS, including SQS queues, Lambda functions, and HTTP endpoints. As an example, imagine that you own a social network website that has experienced organic growth, and needs to scale its compute and database resources on demand. In this case, you could provide an SNS topic to listen to RDS DB instance events. When the “Low Storage” event is published to the topic, SNS pushes this event to a subscribing Lambda function, which in turn leverages the RDS API to increase the storage capacity allocated to your DB instance. The provisioning itself takes place within the specified DB maintenance window.
Amazon ElastiCache: A web service that makes it easy to deploy, operate, and scale an in-memory data store or cache in the cloud. ElastiCache can publish messages using Amazon SNS when significant events happen on your cache cluster. This feature can be used to refresh the list of servers on client machines connected to individual cache node endpoints of a cache cluster. For instance, an ecommerce website fetches product details from a cache cluster, with the goal of offloading a relational database and speeding up page load times. Ideally, you want to make sure that each web server always has an updated list of cache servers to which to connect. To automate this node discovery process, you can get your ElastiCache cluster to publish events to an SNS topic. Thus, when ElastiCache event “AddCacheNodeComplete” is published, your topic then pushes this event to all subscribing HTTP endpoints that serve your ecommerce website, so that these HTTP servers can update their list of cache nodes.
Amazon Redshift: A fully managed data warehouse that makes it simple to analyze data using standard SQL and BI (Business Intelligence) tools. Amazon Redshift uses SNS to broadcast relevant events so that data warehouse workflows can be automated. As an example, imagine a news website that sends clickstream data to a Kinesis Firehose stream, which then loads the data into Amazon Redshift, so that popular news and reading preferences might be surfaced on a BI tool. At some point though, this Amazon Redshift cluster might need to be resized, and the cluster enters a read-only mode. Hence, this Amazon Redshift event is published to an SNS topic, which delivers this event to a subscribing Lambda function, which finally deletes the corresponding Kinesis Firehose delivery stream, so that clickstream data uploads can be put on hold. At a later point, after Amazon Redshift publishes the event that the maintenance window has been closed, SNS notifies a subscribing Lambda function accordingly, so that this function can re-create the Kinesis Firehose delivery stream, and resume clickstream data uploads to Amazon Redshift.
AWS DMS: Helps you migrate databases to AWS quickly and securely. The source database remains fully operational during the migration, minimizing downtime to applications that rely on the database. DMS also uses SNS to provide notifications when DMS events occur, which can automate database migration workflows. As an example, you might create data replication tasks to migrate an on-premises MS SQL database, composed of multiple tables, to MySQL. Thus, if replication tasks fail due to incompatible data encoding in the source tables, these events can be published to an SNS topic, which can push these messages into a subscribing SQS queue. Then, encoders running on EC2 can poll these messages from the SQS queue, encode the source tables into a compatible character set, and restart the corresponding replication tasks in DMS. This is an event-driven approach to a self-healing database migration process.
Amazon Route 53: A highly available and scalable cloud-based DNS (Domain Name System). Route 53 health checks monitor the health and performance of your web applications, web servers, and other resources. You can set CloudWatch alarms and get automated Amazon SNS notifications when the status of your Route 53 health check changes. As an example, imagine an online payment gateway that reports the health of its platform to merchants worldwide, via a status page. This page is hosted on EC2 and fetches platform health data from DynamoDB. In this case, you could configure a CloudWatch alarm for your Route 53 health check, so that when the alarm threshold is breached, and the payment gateway is no longer considered healthy, then CloudWatch publishes this event to an SNS topic, which pushes this message to a subscribing Lambda function, which finally updates the DynamoDB table that populates the status page. This event-driven approach avoids any kind of manual update to the status page visited by merchants.
AWS Direct Connect (AWS DX): Makes it easy to establish a dedicated network connection from your premises to AWS, which can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections. You can monitor physical DX connections using CloudWatch alarms, and send SNS messages when alarms change their status. As an example, when a DX connection state shifts to 0 (zero), indicating that the connection is down, this event can be published to an SNS topic, which can fan out this message to impacted servers through HTTP endpoints, so that they might reroute their traffic through a different connection instead. This is an event-driven approach to connectivity resilience.
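As a concrete illustration of the Amazon S3 integration described in the list above, here is a minimal boto3 sketch that turns on S3 event notifications for new objects and points them at an SNS topic. The bucket name and topic ARN are placeholders, and the SNS topic's access policy must already allow s3.amazonaws.com to publish to it.
import boto3

s3 = boto3.client("s3")

# placeholder bucket and topic; the topic policy must allow S3 to publish to it
s3.put_bucket_notification_configuration(
    Bucket="resume-input-bucket",
    NotificationConfiguration={
        "TopicConfigurations": [{
            "TopicArn": "arn:aws:sns:us-east-1:123456789012:resume-uploads",
            "Events": ["s3:ObjectCreated:*"]}]})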
In addition to SNS, event-driven computing is also addressed by Amazon CloudWatch Events, which delivers a near real-time stream of system events that describe changes in AWS resources. With CloudWatch Events, you can route each event type to one or more targets, including AWS Lambda functions, Amazon Kinesis streams, Amazon SNS topics, and Amazon SQS queues.
Many AWS services publish events to CloudWatch. As an example, you can get CloudWatch Events to capture events on your ETL (Extract, Transform, Load) jobs running on AWS Glue and push failed ones to an SQS queue, so that you can retry them later.
Conclusion
Amazon SNS is a pub/sub messaging service that can be used as an event-driven computing hub by AWS customers worldwide. By capturing events natively triggered by AWS services, such as EC2, S3, and RDS, you can automate and optimize all kinds of workflows, namely scaling, testing, encoding, profiling, broadcasting, discovery, failover, and much more. Business use cases presented in this post ranged from recruiting websites, to scientific research, geographic systems, social networks, retail websites, and news portals.
Many companies consider migrating from relational databases like MySQL to Amazon DynamoDB, a fully managed, fast, highly scalable, and flexible NoSQL database service. For example, DynamoDB can increase or decrease capacity based on traffic, in accordance with business needs. The total cost of servicing can be optimized more easily than for the typical media-based RDBMS.
However, migrations can have two common issues:
Service outage due to downtime, especially when customer service must be seamlessly available 24/7/365
Different key design between RDBMS and DynamoDB
This post introduces two methods of seamlessly migrating data from MySQL to DynamoDB, minimizing downtime and converting the MySQL key design into one more suitable for NoSQL.
AWS services
I’ve included sample code that uses the following AWS services:
AWS Database Migration Service (AWS DMS) can migrate your data to and from most widely used commercial and open-source databases. It supports homogeneous and heterogeneous migrations between different database platforms.
Amazon EMR is a managed Hadoop framework that helps you process vast amounts of data quickly. Build EMR clusters easily with preconfigured software stacks that include Hive and other business software.
Amazon Kinesis can continuously capture and retain vast amounts of data such as transactions, IT logs, or clickstreams for up to 7 days.
AWS Lambda helps you run your code without provisioning or managing servers. Your code can be automatically triggered by other AWS services such as Amazon Kinesis Streams.
Migration solutions
Here are the two options I describe in this post:
Use AWS DMS
AWS DMS supports migration to a DynamoDB table as a target. You can use object mapping to restructure original data to the desired structure of the data in DynamoDB during migration.
Use EMR, Amazon Kinesis, and Lambda with custom scripts
Consider this method when more complex conversion processes and flexibility are required. Fine-grained user control is needed for grouping MySQL records into fewer DynamoDB items, determining attribute names dynamically, adding business logic programmatically during migration, supporting more data types, or adding parallel control for one big table.
After the initial load/bulk-puts are finished, and the most recent real-time data is caught up by the CDC (change data capture) process, you can change the application endpoint to DynamoDB.
The following diagram shows the overall architecture of both options.
Option 1: Use AWS DMS
This section discusses how to connect to MySQL, read the source data, and then format the data for consumption by the target DynamoDB database using DMS.
Create the replication instance and source and target endpoints
Create a replication instance that has sufficient storage and processing power to perform the migration job, as mentioned in the AWS Database Migration Service Best Practices whitepaper. For example, if your migration involves a large number of tables, or if you intend to run multiple concurrent replication tasks, consider using one of the larger instances. The service consumes a fair amount of memory and CPU.
The MySQL user that AWS DMS uses to connect to the source database and retrieve data needs the SUPER and REPLICATION CLIENT privileges. Enable the binary log and set the binlog_format parameter to ROW for CDC in the MySQL configuration. For more information about how to use DMS, see Getting Started in the AWS Database Migration Service User Guide.
mysql> CREATE USER 'repl'@'%' IDENTIFIED BY 'welcome1';
mysql> GRANT all ON <database name>.* TO 'repl'@'%';
mysql> GRANT SUPER,REPLICATION CLIENT ON *.* TO 'repl'@'%';
Before you begin to work with a DynamoDB database as a target for DMS, make sure that you create an IAM role for DMS to assume, and grant access to the DynamoDB target tables. Two endpoints must be created to connect the source and target. The following screenshot shows sample endpoints.
The following screenshot shows the details for one of the endpoints, source-mysql.
Create a task with an object mapping rule
In this example, assume that the MySQL table has a composite primary key (customerid + orderid + productid). You are going to restructure the key to the desired structure of the data in DynamoDB, using an object mapping rule.
In this case, the DynamoDB table has the hash key that is a combination of the customerid and orderid columns, and the sort key is the productid column. However, the partition key should be decided by the user in an actual migration, based on data ingestion and access pattern. You would usually use high-cardinality attributes. For more information about how to choose the right DynamoDB partition key, see the Choosing the Right DynamoDB Partition Key AWS Database blog post.
DMS automatically creates a corresponding attribute on the target DynamoDB table for the quantity column from the source table because rule-action is set to map-record-to-record and the column is not listed in the exclude-columns attribute list. For more information about map-record-to-record and map-record-to-document, see Using an Amazon DynamoDB Database as a Target for AWS Database Migration Service.
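The mapping JSON used for this task is not reproduced above. The following sketch shows the general shape of such an object-mapping rule, built as a Python dict and serialized for the task's table mappings; the schema, table, and key names are illustrative, not values from this walkthrough.
import json

table_mappings = {
    "rules": [
        {"rule-type": "selection",
         "rule-id": "1",
         "rule-name": "include-purchase",
         "object-locator": {"schema-name": "test", "table-name": "purchase"},
         "rule-action": "include"},
        {"rule-type": "object-mapping",
         "rule-id": "2",
         "rule-name": "map-purchase",
         "rule-action": "map-record-to-record",
         "object-locator": {"schema-name": "test", "table-name": "purchase"},
         "target-table-name": "purchase",
         "mapping-parameters": {
             # combine customerid and orderid into the partition key,
             # keep productid as the sort key
             "partition-key-name": "customer_orderid",
             "sort-key-name": "productid",
             "exclude-columns": ["customerid", "orderid"],
             "attribute-mappings": [
                 {"target-attribute-name": "customer_orderid",
                  "attribute-type": "scalar",
                  "attribute-sub-type": "string",
                  "value": "${customerid}|${orderid}"}]}}]}

print(json.dumps(table_mappings, indent=2))  # supply this JSON on the task's JSON tab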
Migration starts immediately after the task is created, unless you clear the Start task on create option. I recommend enabling logging to make sure that you are informed about what is going on with the migration task in the background.
The following screenshot shows the task creation page.
You can use the console to specify the individual database tables to migrate and the schema to use for the migration, including transformations. On the Guided tab, use the Where section to specify the schema, table, and action (include or exclude). Use the Filter section to specify the column name in a table and the conditions to apply.
Table mappings also can be created in JSON format. On the JSON tab, check Enable JSON editing.
The following screenshot shows example events and errors recorded by CloudWatch Logs.
DMS replication instances that you used for the migration should be deleted once all migration processes are completed. Any CloudWatch logs data older than the retention period is automatically deleted.
Option 2: Use EMR, Amazon Kinesis, and Lambda
This section discusses an alternative option using EMR, Amazon Kinesis, and Lambda to provide more flexibility and precise control. If you have a MySQL replica in your environment, it would be better to dump data from the replica.
Change the key design
When you decide to change your database from an RDBMS to NoSQL, you need to find a more suitable key design for NoSQL, for performance as well as cost-effectiveness.
Similar to option #1, assume that the MySQL source has a composite primary key (customerid + orderid + productid). However, for this option, group the MySQL records into fewer DynamoDB items by customerid (hash key) and orderid (sort key). Also, remove the last column (productid) of the composite key by converting the values of the productid column in MySQL into attribute names in DynamoDB, and setting the attribute value to quantity.
This conversion method reduces the number of items. You can retrieve the same amount of information with fewer read capacity units, resulting in cost savings and better performance. For more information about how to calculate read/write capacity units, see Provisioned Throughput.
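To make the conversion concrete, here is a small illustrative sketch, with made-up values, of how rows sharing a customerid and orderid collapse into a single DynamoDB item whose attribute names are product IDs and whose attribute values are quantities.
from collections import defaultdict

# rows as they come out of MySQL: (customerid, orderid, productid, quantity)
rows = [
    ("cust-1", "order-1", "prod-1", 2),
    ("cust-1", "order-1", "prod-2", 1),
    ("cust-1", "order-2", "prod-1", 5),
]

# group by (customerid, orderid); each productid becomes an attribute name,
# and its quantity becomes the attribute value
grouped = defaultdict(dict)
for customerid, orderid, productid, quantity in rows:
    grouped[(customerid, orderid)][productid] = quantity

for (customerid, orderid), attributes in grouped.items():
    item = {"customerid": customerid, "orderid": orderid}
    item.update(attributes)
    print(item)  # one DynamoDB item per (customerid, orderid) pair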
Migration steps
Option 2 has two paths for migration, performed at the same time:
Batch-puts: Export MySQL data, upload it to Amazon S3, and import into DynamoDB.
Real-time puts: Capture changed data in MySQL, send the insert/update/delete transaction to Amazon Kinesis Streams, and trigger the Lambda function to put data into DynamoDB.
To keep the data consistency and integrity, capturing and feeding data to Amazon Kinesis Streams should be started before the batch-puts process. The Lambda function should stand by and Streams should retain the captured data in the stream until the batch-puts process on EMR finishes. Here’s the order:
Start real-time puts to Amazon Kinesis Streams.
As soon as real-time puts commences, start batch-puts.
After batch-puts finishes, trigger the Lambda function to execute put_item from Amazon Kinesis Streams to DynamoDB.
Change the application endpoints from MySQL to DynamoDB.
Step 1: Capture changing data and put into Amazon Kinesis Streams
Firstly, create an Amazon Kinesis stream to retain transaction data from MySQL. Set the Data retention period value based on your estimate for the batch-puts migration process. For data integrity, the retention period should be long enough to hold all transactions until the batch-puts migration finishes. However, you do not necessarily need to select the maximum retention period; it depends on the amount of data to migrate.
In the MySQL configuration, set binlog_format to ROW to capture transactions by using the BinLogStreamReader module. The log_bin parameter must be set as well to enable the binlog. For more information, see the Streaming Changes in a Database with Amazon Kinesis AWS Database blog post.
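The capture script itself is not reproduced here; the following is a minimal sketch of the approach, using the BinLogStreamReader module (from the python-mysql-replication library) together with boto3. The connection settings, server_id, and stream name are placeholders.
import json

import boto3
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (DeleteRowsEvent, UpdateRowsEvent,
                                           WriteRowsEvent)

kinesis = boto3.client("kinesis", region_name="us-east-1")

# placeholder connection settings for the MySQL source
mysql_settings = {"host": "<mysql-host>", "port": 3306,
                  "user": "repl", "passwd": "<password>"}

stream = BinLogStreamReader(
    connection_settings=mysql_settings,
    server_id=100,            # must be unique among replicas of this source
    blocking=True,            # keep reading as new binlog events arrive
    resume_stream=True,
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent])

for event in stream:
    for row in event.rows:
        payload = {"schema": event.schema, "table": event.table,
                   "type": type(event).__name__, "row": row}
        kinesis.put_record(StreamName="<your-kinesis-stream>",
                           Data=json.dumps(payload, default=str).encode("utf-8"),
                           PartitionKey=event.table)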
Another way to upload data to Amazon S3 is to use the INTO OUTFILE SQL clause and the aws s3 sync CLI command in parallel with your own script. The degree of parallelism depends on your server capacity and local network bandwidth. You might find a third-party tool useful, such as pt-archiver (part of the Percona Toolkit; see the appendix for details).
SELECT * FROM purchase WHERE <condition_1>
INTO OUTFILE '/data/export/purchase/1.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
SELECT * FROM purchase WHERE <condition_2>
INTO OUTFILE '/data/export/purchase/2.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
...
SELECT * FROM purchase WHERE <condition_n>
INTO OUTFILE '/data/export/purchase/n.csv' FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n';
I recommend the aws s3 sync command for this use case. This command works internally with the S3 multipart upload feature. Pattern matching can exclude or include particular files. In addition, if the sync process crashes in the middle of processing, you do not need to upload the same files again. The sync command compares the size and modified time of files between local and S3 versions, and synchronizes only local files whose size and modified time are different from those in S3. For more information, see the sync command in the S3 section of the AWS CLI Command Reference.
After all data is uploaded to S3, put it into DynamoDB. There are two ways to do this:
Use Hive with an external table
Write MapReduce code
Hive with an external table
Create a Hive external table against the data on S3 and insert it into another external table against the DynamoDB table, using the org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler property. To improve productivity and the scalability, consider using Brickhouse, which is a collection of UDFs for Hive.
The following sample code assumes that the Hive table for DynamoDB is created with a products column of type ARRAY<STRING>. The productid and quantity columns are aggregated, grouped by customerid and orderid, and inserted into the products column with the CollectUDAF function provided by Brickhouse.
hive> DROP TABLE purchase_ext_s3;
-- To read data from S3
hive> CREATE EXTERNAL TABLE purchase_ext_s3 (
customerid string,
orderid string,
productid string,
quantity string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<your bucket name>/purchase/';

hive> DROP TABLE purchase_ext_dynamodb;
-- To connect to the DynamoDB table
hive> CREATE EXTERNAL TABLE purchase_ext_dynamodb (
customerid STRING, orderid STRING, products ARRAY<STRING>)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "purchase",
"dynamodb.column.mapping" = "customerid:customerid,orderid:orderid,products:products");

-- Batch-puts to DynamoDB using Brickhouse
hive> ADD JAR /<jar file path>/brickhouse-0.7.1-SNAPSHOT.jar;
hive> CREATE TEMPORARY FUNCTION collect AS 'brickhouse.udf.collect.CollectUDAF';
hive> INSERT INTO purchase_ext_dynamodb
SELECT customerid AS customerid, orderid AS orderid,
collect(concat(productid, ':', quantity)) AS products
FROM purchase_ext_s3
GROUP BY customerid, orderid;
Unfortunately, the DynamoDBStorageHandler class does not support the MAP, LIST, BOOLEAN, and NULL data types, so the ARRAY<STRING> data type was chosen. The products column of type ARRAY<STRING> in Hive is mapped to a String Set type attribute in DynamoDB. The sample code mainly shows how Brickhouse works, and is intended only for those who want to aggregate multiple records into one String Set type attribute in DynamoDB.
Python MapReduce with Hadoop Streaming
A mapper task reads each record from the input data on S3 and maps input key-value pairs to intermediate key-value pairs. It divides the source data into a key part and a value part, delimited by a TAB character ("\t"). Mapper output is sorted by its intermediate key (customerid and orderid) and sent to the reducer. Records are put into DynamoDB in the reducer step.
#!/usr/bin/env python
import sys

# Read all input lines from stdin
for line in sys.stdin:
    line = line.strip()
    cols = line.split(',')
    # Divide the source data into a key part and an attribute part.
    # Example output: "customer1,order1\tproduct1,10"
    print '%s,%s\t%s,%s' % (cols[0], cols[1], cols[2], cols[3])
Generally, the reduce task receives the output produced after map processing (which is key/list-of-values pairs) and then performs an operation on the list of values against each key.
In this case, the reducer is written in Python 2 and uses Hadoop Streaming over STDIN/STDOUT. The reducer receives data sorted by the intermediate key set in the mapper, customerid and orderid (cols[0] and cols[1]) in this case, and stores all attributes for a specific key in the item_data dictionary. The attributes in the item_data dictionary are put, or flushed, into DynamoDB each time a new intermediate key arrives from sys.stdin.
#!/usr/bin/env python
import sys
import boto.dynamodb

# Create a connection to DynamoDB
current_keys = None
conn = boto.dynamodb.connect_to_region('<region>',
                                       aws_access_key_id='<access key id>',
                                       aws_secret_access_key='<secret access key>')
table = conn.get_table('<dynamodb table name>')
item_data = {}

# Input comes from STDIN, emitted by the mapper
for line in sys.stdin:
    line = line.strip()
    dickeys, items = line.split('\t')
    products = items.split(',')
    if current_keys == dickeys:
        # Same intermediate key: accumulate another attribute
        item_data[products[0]] = products[1]
    else:
        if current_keys:
            # New intermediate key: flush the accumulated item to DynamoDB
            try:
                mykeys = current_keys.split(',')
                item = table.new_item(hash_key=mykeys[0], range_key=mykeys[1], attrs=item_data)
                item.put()
            except Exception as e:
                print 'Exception occurred! :', e.message, '==> Data:', mykeys
        item_data = {}
        item_data[products[0]] = products[1]
        current_keys = dickeys

# Put the last accumulated item
if current_keys == dickeys:
    print 'Last one:', current_keys  # , item_data
    try:
        mykeys = dickeys.split(',')
        item = table.new_item(hash_key=mykeys[0], range_key=mykeys[1], attrs=item_data)
        item.put()
    except Exception as e:
        print 'Exception occurred! :', e.message, '==> Data:', mykeys
To run the MapReduce job, connect to the EMR master node and run a Hadoop streaming job. The hadoop-streaming.jar file location or name can differ, depending on your EMR version. Exception messages that occur while the reducers run are stored in the directory assigned to the --output option. Hash key and range key values are also logged to identify which data causes exceptions or errors.
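If you prefer not to SSH into the master node, one alternative (not part of the original walkthrough) is to submit the same job as an EMR step with boto3. The cluster ID, script locations, and input/output paths below are placeholders:
import boto3

emr = boto3.client('emr')

# Submits the streaming job as an EMR step instead of running it over SSH.
# command-runner.jar resolves the hadoop-streaming command on recent EMR releases;
# on older AMI versions you would reference the hadoop-streaming.jar path directly.
emr.add_job_flow_steps(
    JobFlowId='<emr cluster id>',
    Steps=[{
        'Name': 'mysql-to-dynamodb-batch-puts',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-files', 's3://<your bucket name>/scripts/mapper.py,'
                          's3://<your bucket name>/scripts/reducer.py',
                '-mapper', 'mapper.py',
                '-reducer', 'reducer.py',
                '-input', 's3://<your bucket name>/purchase/',
                '-output', 's3://<your bucket name>/logs/batch-puts/'
            ]
        }
    }])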
In my migration experiment using the above scripts, with self-generated test data, I found the following results, including database size and the time taken to complete the migration.
Server
- MySQL instance: m4.2xlarge
- EMR cluster: master, 1 x m3.xlarge; core, 2 x m4.4xlarge
- DynamoDB: 2,000 write capacity units

Data
- Number of records: 1,000,000,000
- Database file size (.ibd): 100.6 GB
- CSV files size: 37 GB

Performance (time)
- Export to CSV: 6 min 10 sec
- Upload to S3 (sync): 3 min 30 sec
- Import to DynamoDB: depends on write capacity units
The following screenshot shows the performance results by write capacity.
Note that performance can vary depending on server capacity, network bandwidth, degree of parallelism, conversion logic, programming language, and other conditions. All provisioned write capacity units are consumed by the MapReduce job during the import, so the more you increase the size of the EMR cluster and the write capacity units of the DynamoDB table, the less time the import takes. Java-based MapReduce code would offer more flexibility, because it has full access to the MapReduce framework.
Step 3: The AWS Lambda function updates DynamoDB by reading data from Amazon Kinesis
In the Lambda console, choose Create a Lambda function and the kinesis-process-record-python blueprint. Next, in the Configure triggers page, select the stream that you just created.
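If you script the setup instead of using the console blueprint, the trigger can be attached with boto3's create_event_source_mapping; the function name and stream ARN below are placeholders:
import boto3

lambda_client = boto3.client('lambda')

# Wires the stream to the function, equivalent to selecting it on the Configure triggers page.
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:kinesis:<region>:<account id>:stream/mysql-migration-stream',
    FunctionName='<lambda function name>',
    StartingPosition='TRIM_HORIZON',   # read from the oldest retained record
    BatchSize=100,
    Enabled=True)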
The Lambda function must have an IAM role with permissions to read from Amazon Kinesis and put items into DynamoDB.
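The exact policy depends on your stream and table names; the following boto3 sketch attaches a minimal inline policy, with all ARNs and the role name as placeholders:
import json
import boto3

iam = boto3.client('iam')

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read the captured transactions from the stream
            "Effect": "Allow",
            "Action": ["kinesis:DescribeStream", "kinesis:GetShardIterator",
                       "kinesis:GetRecords", "kinesis:ListStreams"],
            "Resource": "arn:aws:kinesis:<region>:<account id>:stream/mysql-migration-stream"
        },
        {   # Look up and write items in the target table
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:PutItem"],
            "Resource": "arn:aws:dynamodb:<region>:<account id>:table/purchase"
        },
        {   # Standard Lambda logging
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*"
        }
    ]
}

iam.put_role_policy(RoleName='<lambda execution role>',
                    PolicyName='mysql-to-dynamodb-migration',
                    PolicyDocument=json.dumps(policy))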
The Lambda function can recognize the transaction type of the record by looking up the type attribute. The transaction type determines the method for conversion and update.
For example, when a JSON record is passed to the function, the function looks up the type attribute. It also checks whether an item with the same key as the incoming record already exists in the DynamoDB table. If so, the existing item is retrieved and saved in a dictionary variable (item, in this case), and the new update information is applied to that dictionary before it is put back into the DynamoDB table. This prevents the existing item from being overwritten by the incoming record.
from __future__ import print_function

import base64
import json
import boto3

print('Loading function')
client = boto3.client('dynamodb')


def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    for record in event['Records']:
        # Amazon Kinesis data is base64-encoded, so decode it here
        payload = base64.b64decode(record['kinesis']['data'])
        print("Decoded payload: " + payload)
        data = json.loads(payload)

        # User logic for data triggered by WriteRowsEvent
        if data["type"] == "WriteRowsEvent":
            my_table = data["table"]
            my_hashkey = data["row"]["values"]["customerid"]
            my_rangekey = data["row"]["values"]["orderid"]
            my_productid = data["row"]["values"]["productid"]
            my_quantity = str(data["row"]["values"]["quantity"])
            try:
                response = client.get_item(
                    Key={'customerid': {'S': my_hashkey}, 'orderid': {'S': my_rangekey}},
                    TableName=my_table)
                if 'Item' in response:
                    # Merge the new attribute into the existing item instead of overwriting it
                    item = response['Item']
                    item[my_productid] = {"S": my_quantity}
                    client.put_item(Item=item, TableName=my_table)
                else:
                    item = {'customerid': {'S': my_hashkey},
                            'orderid': {'S': my_rangekey},
                            my_productid: {"S": my_quantity}}
                    client.put_item(Item=item, TableName=my_table)
            except Exception as e:
                print('WriteRowsEvent Exception ! :', e.message, '==> Data:',
                      data["row"]["values"]["customerid"], data["row"]["values"]["orderid"])

        # User logic for data triggered by UpdateRowsEvent
        if data["type"] == "UpdateRowsEvent":
            my_table = data["table"]

        # User logic for data triggered by DeleteRowsEvent
        if data["type"] == "DeleteRowsEvent":
            my_table = data["table"]

    return 'Successfully processed {} records.'.format(len(event['Records']))
Step 4: Switch the application endpoint to DynamoDB
Application code needs to be refactored when you change from MySQL to DynamoDB. The following simple Java code snippets focus on the connection and query parts, because it is difficult to cover every case for every application. For more information, see Programming with DynamoDB and the AWS SDKs.
Query to MySQL
The following sample code shows a common way to connect to MySQL and retrieve data.
import java.sql.*;
...
try {
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://<host name>/<database name>", "<user>", "<password>");
    Statement stmt = conn.createStatement();
    String sql = "SELECT quantity as quantity FROM purchase WHERE customerid = '<customerid>' and orderid = '<orderid>' and productid = '<productid>'";
    ResultSet rs = stmt.executeQuery(sql);
    while (rs.next()) {
        String quantity = rs.getString("quantity"); // Retrieve by column name
        System.out.print("quantity: " + quantity);  // Display the value
    }
} catch (SQLException ex) {
    // Handle any errors
    System.out.println("SQLException: " + ex.getMessage());
}
...
==== Output ====
quantity: 1
Query to DynamoDB
To retrieve items from DynamoDB, follow these steps:
Create an instance of the DynamoDB class.
Create an instance of the Table class.
Add the withHashKey and withRangeKeyCondition methods to an instance of the QuerySpec class.
Execute the query method with the QuerySpec instance created previously. Items are retrieved in JSON format, so use the getJSON method to look up a specific attribute in an item.
...
DynamoDB dynamoDB = new DynamoDB(new AmazonDynamoDBClient(new ProfileCredentialsProvider()));
Table table = dynamoDB.getTable("purchase");

QuerySpec querySpec = new QuerySpec()
    .withHashKey("customerid", "customer1")                                // hash key name and value
    .withRangeKeyCondition(new RangeKeyCondition("orderid").eq("order1")); // range key and condition value

ItemCollection<QueryOutcome> items = table.query(querySpec);
Iterator<Item> iterator = items.iterator();
while (iterator.hasNext()) {
    Item item = iterator.next();
    System.out.println("quantity: " + item.getJSON("product1"));
}
...
==== Output ====
quantity: 1
Conclusion
In this post, I introduced two options for seamlessly migrating data from MySQL to DynamoDB and minimizing downtime during the migration. Option #1 used DMS, and option #2 combined EMR, Amazon Kinesis, and Lambda. I also showed you how to convert the key design in accordance with database characteristics to improve read/write performance and reduce costs. Each option has advantages and disadvantages, so the best option depends on your business requirements.
The sample code in this post is not sufficient for a complete, efficient, and reliable data migration code base that can be reused across many different environments. Use it to get started, but design for the other variables in your actual migration.
I hope this post helps you plan and implement your migration and minimizes service outages. If you have questions or suggestions, please leave a comment below.
Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”
I first wrote about AWS Database Migration Service just about a year ago in my AWS Database Migration Service post. At that time I noted that over 1,000 AWS customers had already made use of the service as part of their move to AWS.
As a quick recap, AWS Database Migration Service and Schema Conversion Tool (SCT) help our customers migrate their relational data from expensive, proprietary databases and data warehouses (either on premises or in the cloud, but with restrictive licensing terms either way) to more cost-effective cloud-based databases and data warehouses such as Amazon Aurora, Amazon Redshift, MySQL, MariaDB, and PostgreSQL, with minimal downtime along the way. Our customers tell us that they love the flexibility and the cost-effective nature of these moves. For example, moving to Amazon Aurora gives them access to a database that is MySQL and PostgreSQL compatible, at 1/10th the cost of a commercial database. Take a peek at our AWS Database Migration Services Customer Testimonials to see how Expedia, Thomas Publishing, Pega, and Veoci have made use of the service.
20,000 Unique Migrations
I'm pleased to be able to announce that our customers have already used AWS Database Migration Service to migrate 20,000 unique databases to AWS, and that the pace continues to accelerate (we reached 10,000 migrations in September of 2016).
We’ve added many new features to DMS and SCT over the past year. Here’s a summary:
Learn More
Here are some resources that will help you to learn more and to get your own migrations underway, starting with some recent webinars:
Migrating From Sharded to Scale-Up – Some of our customers implemented a scale-out strategy in order to deal with their relational workload, sharding their database across multiple independent pieces, each running on a separate host. As part of their migration, these customers often consolidate two or more shards onto a single Aurora instance, reducing complexity, increasing reliability, and saving money along the way. If you’d like to do this, check out the blog post, webinar recording, and presentation.
Migrating From Oracle or SQL Server to Aurora – Other customers migrate from commercial databases such as Oracle or SQL Server to Aurora. If you would like to do this, check out this presentation and the accompanying webinar recording.
How to Migrate Your Oracle Database to Amazon Aurora – “This blog post gives you a quick overview of how you can use the AWS Schema Conversion Tool (AWS SCT) and AWS Database Migration Service (AWS DMS) to facilitate and simplify migrating your commercial database to Amazon Aurora. In this case, we focus on migrating from Oracle to the MySQL-compatible Amazon Aurora.”
Database Migration—What Do You Need to Know Before You Start? – “Congratulations! You have convinced your boss or the CIO to move your database to the cloud. Or you are the boss, CIO, or both, and you finally decided to jump on the bandwagon. What you’re trying to do is move your application to the new environment, platform, or technology (aka application modernization), because usually people don’t move databases for fun.”
How to Script a Database Migration – “You can use the AWS DMS console or the AWS CLI or the AWS SDK to perform the database migration. In this blog post, I will focus on performing the migration with the AWS CLI.”
The documentation includes five helpful walkthroughs:
There’s also a hands-on lab (you will need to register in order to participate).
See You at a Summit
The DMS team is planning to attend and present at many of our upcoming AWS Summits and would welcome the opportunity to discuss your database migration requirements in person.
Even though we published 294 posts on this blog last year, I left out quite a number of worthwhile launches! Today I would like to focus on Amazon Relational Database Service (RDS) and recap all of the progress that the teams behind this family of services made in 2016. The team focused on four major areas last year:
High Availability
Enhanced Monitoring
Simplified Security
Database Engine Updates
Let’s take a look at each of these areas…
High Availability
Relational databases are at the heart of many types of applications. To allow our customers to build applications that are highly available, RDS has offered Multi-AZ support since early 2010 (read Amazon RDS – Multi-AZ Deployments For Enhanced Availability & Reliability for more info). Instead of spending weeks setting up multiple instances, arranging for replication, writing scripts to detect instance and network issues, making failover decisions, and bringing a new secondary instance online, you simply opt for Multi-AZ Deployment when you create the Database Instance. RDS also makes it easy for you to create cross-region read replicas.
Here are some of the other enhancements that we made in 2016 in order to help you to achieve high availability:
Enhanced Monitoring
We announced the first big step toward enhanced monitoring at the end of 2015 (New – Enhanced Monitoring for Amazon RDS), with support for MySQL, MariaDB, and Amazon Aurora, and then made additional announcements in 2016:
Simplified Security
We want to make it as easy and simple as possible for you to use encryption to protect your data, whether it is at rest or in motion. Here are the enhancements that we made in this area last year:
Database Engine Updates
The open source community and the vendors of commercial databases add features and produce new releases at a rapid pace, and we track their work very closely, aiming to update RDS as quickly as possible after each significant release. Here's what we did in 2016:
Another month of big data solutions on the Big Data Blog!
Take a look at our summaries below and learn, comment, and share. Thank you for reading!
NEW POSTS
Decreasing Game Churn: How Upopa used ironSource Atom and Amazon ML to Engage Users
Ever wondered what it takes to keep a user from leaving your game or application after all the hard work you put in? Wouldn't it be great to get a chance to interact with the users before they're about to leave? In this post, learn how ironSource worked with gaming studio Upopa to build an efficient, cheap, and accurate way to battle churn and make data-driven decisions using ironSource Atom's data pipeline and Amazon ML.
Create a Healthcare Data Hub with AWS and Mirth Connect
Healthcare providers record patient information across different software platforms. Each of these platforms can have varying implementations of complex healthcare data standards. Also, each system needs to communicate with a central repository called a health information exchange (HIE) to build a central, complete clinical record for each patient. In this post, learn how to consume different data types as messages, transform the information within the messages, and then use AWS services to take action depending on the message type.
Call for Papers! DEEM: 1st Workshop on Data Management for End-to-End Machine Learning
Amazon and Matroid will hold the first workshop on Data Management for End-to-End Machine Learning (DEEM) on May 14th, 2017, in conjunction with the premier systems conference SIGMOD/PODS 2017 in Raleigh, North Carolina. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research to discuss data management issues in ML application scenarios. The workshop is soliciting research papers that describe preliminary and ongoing research results.
Converging Data Silos to Amazon Redshift Using AWS DMS
In this post, learn to use AWS Database Migration Service (AWS DMS) and other AWS services to easily converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset and gain additional business insights.
Run Mixed Workloads with Amazon Redshift Workload Management
It's common for mixed workloads to have some processes that require higher priority than others. Sometimes, this means a certain job must complete within a given SLA. Other times, this means you only want to prevent a non-critical reporting workload from consuming too many cluster resources at any one time. Without workload management (WLM), each query is prioritized equally, which can cause a person, team, or workload to consume excessive cluster resources for a process which isn't as valuable as other more business-critical jobs. This post provides guidelines on common WLM patterns and shows how you can use WLM query insights to optimize configuration in production workloads.
Secure Amazon EMR with Encryption
In this post, learn how to set up encryption of data at multiple levels using security configurations with EMR. You'll walk through the step-by-step process to achieve all the encryption prerequisites, such as building the KMS keys, building SSL certificates, and launching the EMR cluster with a strong security configuration.
FROM THE ARCHIVE
Powering Amazon Redshift Analytics with Apache Spark and Amazon Machine Learning
In this post, learn to generate a predictive model for flight delays that can be used to help pick the flight least likely to add to your travel stress. To accomplish this, you'll use Apache Spark running on Amazon EMR for extracting, transforming, and loading (ETL) the data, Amazon Redshift for analysis, and Amazon Machine Learning for creating predictive models.
Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming Data educational pages.
Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.
Organizations often grow organically—and so does their data in individual silos. Such systems are often powered by traditional RDBMS systems and they grow orthogonally in size and features. To gain intelligence across heterogeneous data sources, you have to join the data sets. However, this imposes new challenges, as joining data over dblinks or into a single view is extremely cumbersome and an operational nightmare.
This post walks through using AWS Database Migration Service (AWS DMS) and other AWS services to make it easy to converge multiple heterogeneous data sources to Amazon Redshift. You can then use Amazon QuickSight to visualize the converged dataset to gain additional business insights.
AWS service overview
Here’s a brief overview of AWS services that help with data convergence.
AWS DMS
With DMS, you can migrate your data to and from most widely used commercial and open-source databases. The service supports homogenous migrations such as Oracle to Oracle, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora or Microsoft SQL Server to MySQL. It also allows you to stream data to Amazon Redshift from any of the supported sources including:
Amazon Aurora
PostgreSQL
MySQL
MariaDB
Oracle
SAP ASE
SQL Server
DMS enables consolidation and easy analysis of data in the petabyte-scale data warehouse. It can also be used for continuous data replication with high availability.
Amazon QuickSight
Amazon QuickSight provides very fast, easy-to-use, cloud-powered business intelligence at 1/10th the cost of traditional BI solutions. QuickSight uses a new, super-fast, parallel, in-memory calculation engine (“SPICE”) to perform advanced calculations and render visualizations rapidly.
QuickSight integrates automatically with AWS data services, enables organizations to scale to hundreds of thousands of users, and delivers fast and responsive query performance to them. You can easily connect QuickSight to AWS data services, including Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon S3, and Amazon Athena. You can also upload CSV, TSV, and spreadsheet files or connect to third-party data sources such as Salesforce.
Amazon Redshift
Amazon Redshift delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes. Amazon Redshift is typically priced at 1/10th of the price of the competition. We have many customers running petabyte scale data analytics on AWS using Amazon Redshift.
Amazon Redshift is also ANSI SQL compliant, supports JDBC/ODBC, and is easy to connect to your existing business intelligence (BI) solution. However, if your storage requirement is in the 10s of TB range and requires high levels of concurrency across small queries, you may want to consider Amazon Aurora as the target converged database.
Walkthrough
Assume that you have an events company specializing in sports, and you have built a MySQL database that holds data for the players and the sporting events. Customer and ticket information is stored in another database, in this case PostgreSQL, which gets updated when customers purchase tickets from your website and mobile apps. You can download a sample dataset from the aws-database-migration-samples GitHub repo.
These databases could be anywhere: at an on-premises facility; on AWS in Amazon EC2 or Amazon RDS, or other cloud provider; or in a mixture of such locations. To complicate things a little more, you can assume that the lost opportunities (where a customer didn’t complete buying the ticket even though it was added to the shopping cart) are streamed via clickstream through Amazon Kinesis and then stored on Amazon S3. We then use AWS Data Pipeline to orchestrate a process to cleanse that data using Amazon EMR and make it ready for loading to Amazon Redshift. The clickstream integration is not covered in this post but was demonstrated in the recent Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics post.
Architecture
In this solution, you use DMS to bring the two data sources into Amazon Redshift and run analytics to gain business insights. The following diagram demonstrates the proposed solution.
After the data is available on Amazon Redshift, you could easily build BI dashboards and generate intelligent reports to gain insights using Amazon QuickSight. You could also take this a step further and build a model using Amazon Machine Learning. Amazon Machine Learning uses powerful algorithms to create ML models by finding patterns in your existing data stored in Amazon S3, or Amazon Redshift. It is also highly scalable and can generate billions of predictions daily, and serve those predictions in real time and at high throughput.
Creating source databases
For the purposes of this post, create two RDS databases, one with a MySQL engine, and the other with PostgreSQL and then load some data. These represent a real-life scenario where databases could be located on-premises, on AWS, or both. Just as in real life, there may be more than two source databases; the process described in this post would still be reasonably similar.
Follow the steps in Tutorial: Create a Web Server and an Amazon RDS Database to create the two source databases. Use the links from the main tutorial page to see how to connect to specific databases and load data. For more information, see:
Make a note of the security group that you create and associate all the RDS instances with it. Call it “MyRDSSecurityGroup”.
Afterward, you should be able to see all the databases listed in the RDS Instances dashboard.
Setting up a target Amazon Redshift cluster
Set up a two-node cluster as shown below, with a cluster name similar to "consolidated-dwh" and a database name similar to "mydwh". You could also set up a one-node cluster, depending on the instance type; some instance types may be available on the AWS Free Tier.
In the next step, choose Publicly Accessible for non-production usage to keep the configuration simple.
Also, for simplicity, choose the same VPC where you have placed the RDS instances and include the MyRDSSecurityGroup in the list of security groups allowed to access the Amazon Redshift cluster.
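If you would rather script the cluster creation than click through the console, a boto3 sketch along the same lines might look like the following; the node type, credentials, and security group ID are placeholders:
import boto3

redshift = boto3.client('redshift')

redshift.create_cluster(
    ClusterIdentifier='consolidated-dwh',
    DBName='mydwh',
    ClusterType='multi-node',
    NodeType='dc2.large',          # pick a node type that fits your data volume
    NumberOfNodes=2,
    MasterUsername='<master user>',
    MasterUserPassword='<master password>',
    PubliclyAccessible=True,       # keep this False for production workloads
    VpcSecurityGroupIds=['<MyRDSSecurityGroup id>'])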
Setting up DMS
You can set up DMS easily, as described in the AWS Database Migration Service post on the AWS Blog. However, rather than using the wizard, you may take a step-by-step approach (a scripted boto3 alternative is sketched after the following list):
Create a replication instance.
Create the endpoints for the two source databases and the target Amazon Redshift database.
Create a task to synchronize each of the sources to the target.
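As a scripted alternative to the console steps that follow, the same three steps could be sketched with boto3; every identifier, endpoint, and credential below is a placeholder, and the table mapping is the simplest include-everything rule:
import json
import boto3

dms = boto3.client('dms')

# 1. Replication instance
instance = dms.create_replication_instance(
    ReplicationInstanceIdentifier='dms-convergence-instance',
    ReplicationInstanceClass='dms.t2.medium',
    AllocatedStorage=50)

# Wait until the instance is available before creating tasks against it
dms.get_waiter('replication_instance_available').wait(
    Filters=[{'Name': 'replication-instance-id',
              'Values': ['dms-convergence-instance']}])

# 2. Source and target endpoints (repeat the source call for the PostgreSQL database);
#    depending on your account setup, a Redshift target may also need a service access role.
source = dms.create_endpoint(
    EndpointIdentifier='mysql-source', EndpointType='source', EngineName='mysql',
    ServerName='<mysql endpoint>', Port=3306,
    Username='<user>', Password='<password>')

target = dms.create_endpoint(
    EndpointIdentifier='redshift-target', EndpointType='target', EngineName='redshift',
    ServerName='<redshift endpoint>', Port=5439, DatabaseName='mydwh',
    Username='<master user>', Password='<master password>')

# 3. One task per source: full load plus ongoing replication into Amazon Redshift
table_mappings = {'rules': [{
    'rule-type': 'selection', 'rule-id': '1', 'rule-name': '1',
    'object-locator': {'schema-name': '%', 'table-name': '%'},
    'rule-action': 'include'}]}

dms.create_replication_task(
    ReplicationTaskIdentifier='mysql-to-redshift',
    SourceEndpointArn=source['Endpoint']['EndpointArn'],
    TargetEndpointArn=target['Endpoint']['EndpointArn'],
    ReplicationInstanceArn=instance['ReplicationInstance']['ReplicationInstanceArn'],
    MigrationType='full-load-and-cdc',
    TableMappings=json.dumps(table_mappings))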
Create a replication instance
In the DMS console, choose Replication instances, Create replication instance. The instance type you select depends on the data volume you deal with. After setup, you should be able to see your replication instance.
Create endpoints
In the DMS console, choose Endpoints, Create endpoint. You need to configure the two source endpoints representing the PostgreSQL and MySQL RDS databases. You also need to create the target endpoint by supplying the Amazon Redshift database that you created in the previous steps. After configuration, the endpoints look similar to the following screenshot:
Create a task and start data migration
You can rely on DMS to create the target tables in your target Amazon Redshift database or you may want to take advantage of AWS Schema Conversion Tool to create the target schema and also do a compatibility analysis in the process. Using the AWS Schema Conversion Tool is particularly useful when migrating using heterogeneous data sources. For more information, see Getting Started with the AWS Schema Conversion Tool.
For simplicity, I avoided using the AWS Schema Conversion Tool in this post and jumped straight to DMS to create the target schema and underlying tables, and then set up the synchronization between the data sources and the target.
In the DMS console, choose Tasks, Create Tasks. Fill in the fields as shown in the following screenshot:
Note that because the source is RDS MySQL and you chose Migrate data and replicate ongoing changes, you need to enable binlog retention. Other engines have other requirements, and DMS prompts you accordingly. For this particular case, run the following command:
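The command itself is not reproduced above. For an Amazon RDS MySQL source, the usual way to enable binary log retention is the RDS-provided mysql.rds_set_configuration stored procedure; here it is wrapped in a small PyMySQL sketch, with the endpoint, credentials, and 24-hour value as placeholders:
import pymysql

# Assumes the source is Amazon RDS for MySQL and PyMySQL is installed locally.
conn = pymysql.connect(host='<mysql endpoint>', user='<user>', password='<password>')
try:
    with conn.cursor() as cur:
        # RDS stored procedure that keeps binary logs long enough for DMS to read them
        cur.execute("call mysql.rds_set_configuration('binlog retention hours', 24)")
    conn.commit()
finally:
    conn.close()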
Now, choose Start task on create. In the task settings, choose Drop tables on target to have DMS create the tables, if you haven’t already created the target tables using the AWS Schema Conversion Tool, as described earlier. Choose Enable logging but note that this incurs additional costs as the generated CloudWatch logs require storage.
In the table mappings, for Schema to migrate, ensure that the correct schema has been selected from the source databases. DMS creates the schema on the target if it does not already exist.
Repeat for the other data source, choosing the other source endpoint and the same Amazon Redshift target endpoint. In the table mappings section, choose Custom and customize as appropriate. For example, you can specify the schema names to include and tables to exclude, as shown in the following screenshot:
Using this custom configuration, you can perform minor transformations, such as lowercasing target table names or choosing a different target schema for both sources.
After both tasks have successfully completed, the Tasks tab now looks like the following:
Running queries on Amazon Redshift
In Amazon Redshift, select your target cluster and choose Loads. You can see all operations that DMS performed in the background to load the data from the two source databases into Amazon Redshift.
Ensure change data capture is working
Generate additional data on Amazon RDS PostgreSQL in the ticketing.sporting_event_ticket table by running the generate_mlb_season.sql script provided in the aws-database-migration-samples GitHub repository. Notice that the tasks have caught up and are showing the migration in progress. You can also query the target tables and see that the new data is in the target table.
Visualization options
Set up QuickSight and configure your data source to be your Amazon Redshift database. If you have a Redshift cluster in the same account and in the same region, it appears when you click Redshift (Auto-discovered) on the data sets page, as shown below.
Access to any other Redshift cluster can be configured as follows using the Redshift (Manual connect) link:
Now, create your data set. Choose New Data Set and select either a new data source or an existing data source listed at the bottom of the page. Choose Ticketing for Sports.
In the next step, choose Create Data Set.
In the next step, when QuickSight prompts you to choose your table, you can select the schema and the required table and choose Select. Alternatively, you may choose Edit/Preview data.
You can use the graphical options shown below to start creating your data set. Given that you have data from multiple sources, it's safe to assume that your target tables are in separate schemas. Select the schema and tables, then select the other schemas and bring the appropriate tables to the palette by selecting them using the check box to the right. For each join, select the join type and then map the appropriate keys between the tables until the two red icons turn into one of the blue join types.
In this case, rather than preparing the data set in the palette, you provide a custom SQL query. On the left pane, choose Tables, Switch to Custom SQL tool.
Paste the following SQL query in the Custom SQL field and enter a name.
select to_char( e.start_date_time, 'YYYY-MM-DD' ) event_date,
to_char( e.start_date_time, 'HH24:MI' ) start_time, e.sold_out,
e.sport_type_name, l.name event_location, l.city event_city,
l.seating_capacity, hteam.name home_team, hl.name home_field,
hl.city home_city, ateam.name away_team, al.name away_field,
al.city away_city, sum( t.ticket_price ) total_ticket_price,
avg( t.ticket_price ) average_ticket_price,
min ( t.ticket_price ) cheapest_ticket,
max( t.ticket_price ) most_expensive_ticket, count(*) num_tickets
from ticketing.sporting_event_ticket t, sourcemysql.sporting_event e,
sourcemysql.sport_location l, sourcemysql.sport_team hteam,
sourcemysql.sport_team ateam, sourcemysql.sport_location hl,
sourcemysql.sport_location al
where t.sporting_event_id = e.id
and t.sport_location_id = l.id
and e.home_team_id = hteam.id
and e.away_team_id = ateam.id
and hteam.home_field_id = hl.id
and ateam.home_field_id = al.id
group by to_char( e.start_date_time, 'YYYY-MM-DD' ),
to_char( e.start_date_time, 'HH24:MI' ), e.start_date_time,
e.sold_out, e.sport_type_name, l.name, l.city, l.seating_capacity,
hteam.name, ateam.name, hl.name, hl.city, al.name, al.city;
You can choose Save and visualize to view the QuickSight visualization toolkit and filter options. Here you can build your story or dashboards and start sharing them with your team.
Now, you can choose various fields from the field list and the various measures to get the appropriate visualization, like the one shown below. This one was aimed at understanding the date on which each event in each city reached maximum capacity.
You can also combine many such visualizations and prepare your dashboard for management reporting. The analysis may also drive where you need to invest in campaigns and where things are going better than expected, to ensure a healthy sales pipeline.
Summary
In this post, you used AWS DMS to converge multiple heterogeneous data sources to an Amazon Redshift cluster. You also used QuickSight to create a data visualization on the converged dataset to provide you with additional insights. Although we used an e-commerce use case related to an events company, this concept of converging multiple data silos to a target is also applicable to other verticals such as retail, healthcare, finance, insurance and banking, gaming, and so on.
If you have questions or suggestions, please comment below.
About the Author
Pratim Das is a Specialist Solutions Architect for Analytics in EMEA. He works with customers on big data and analytical projects, helping them build solutions on AWS using AWS services and (or) other open source or commercial solutions from the big data ecosystem. In his spare time he enjoys cooking and creating exciting new recipes, always with that spicy kick.