Tag Archives: database

Analyze your Amazon S3 spend using AWS Glue and Amazon Redshift

Post Syndicated from Shayon Sanyal original https://aws.amazon.com/blogs/big-data/analyze-your-amazon-s3-spend-using-aws-glue-and-amazon-redshift/

The AWS Cost & Usage Report (CUR) tracks your AWS usage and provides estimated charges associated with that usage. You can configure this report to present the data at hourly or daily intervals, and it is updated at least one time per day until it is finalized at the end of the billing period. The Cost & Usage Report is delivered automatically to an Amazon S3 bucket that you specify, and you can download it from there directly. You can also integrate the report into Amazon Redshift, query it with Amazon Athena, or upload it to Amazon QuickSight. For more information, see Query and Visualize AWS Cost and Usage Data Using Amazon Athena and Amazon QuickSight.

This post presents a solution that uses AWS Glue Data Catalog and Amazon Redshift to analyze S3 usage and spend by combining the AWS CUR, S3 inventory reports, and S3 server access logs.

Prerequisites

Before you begin, complete the following prerequisites:

  • You need an S3 bucket for your S3 inventory and server access log data files. For more information, see Create a Bucket and What is Amazon S3?
  • You must have the appropriate IAM permissions for Amazon Redshift to be able to access the S3 buckets – for this post, choose two non-restrictive IAM roles (AmazonS3FullAccess and AWSGlueConsoleFullAccess), but restrict your access accordingly for your own scenarios.

Amazon S3 inventory

Amazon S3 inventory is one of the tools S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. Amazon S3 inventory provides comma-separated values (CSV), Apache optimized row columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for a given S3 bucket.

Amazon S3 server access logs

Server access logging provides detailed records for the requests you make to a bucket. Server access logs are useful for many applications, for example in security and access audits. It can also help you learn about your customer base and understand your S3 bill.

AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. AWS Glue consists of a central metadata repository known as the Data Catalog, a crawler to populate the Data Catalog with tables, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. AWS Glue is serverless, so there’s no infrastructure to set up or manage. This post uses AWS Glue to catalog S3 inventory data and server access logs, which makes it available for you to query with Amazon Redshift Spectrum.

Amazon Redshift

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can use Amazon Redshift to efficiently query and retrieve structured and semi-structured data from files in S3 without having to load the data into Amazon Redshift native tables. You can create Amazon Redshift external tables by defining the structure for files and registering them as tables in the AWS Glue Data Catalog.

Setting up S3 inventory reports for analysis

This post uses the Parquet file format for its inventory reports and delivers the files daily to S3 buckets. You can select both the frequency of delivery and output file formats under Advanced settings as shown in the screenshot below:

For more information about configuring your S3 inventory, see How Do I Configure Amazon S3 Inventory?

The following diagram shows the data flow for this solution:

Below steps summarize the data flow diagram represented above:

  • S3 Inventory Reports are delivered to an S3 bucket that you configure.
  • The AWS Glue crawler then crawls this S3 bucket and populates the metadata in the AWS Glue Data Catalog.
  • The AWS Glue Data Catalog is then accessible through an external schema in Redshift.
  • The S3 Inventory Reports (available in the AWS Glue Data Catalog) and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.

The inventory reports are delivered to an S3 bucket. The following screenshot shows the S3 bucket structure for the S3 inventory reports:

There is a data folder in this bucket. This folder contains the Parquet data you want to analyze. The following screenshot shows the content of the folder.

Because these are daily files, there is one file per day.

Configuring an AWS Glue crawler

You can use an AWS Glue crawler to discover this dataset in your S3 bucket and create the table schemas in the Data Catalog. After you create these tables, you can query them directly from Amazon Redshift.

To configure your crawler to read S3 inventory files from your S3 bucket, complete the following steps:

  1. Choose a crawler name.
  2. Choose S3 as the data store and specify the S3 path up to the data
  3. Choose an IAM role to read data from S3 – AmazonS3FullAccess and AWSGlueConsoleFullAccess.
  4. Set a frequency schedule for the crawler to run.
  5. Configure the crawler’s output by selecting a database and adding a prefix (if any).

This post uses the database s3spendanalysis.

The following screenshot shows the completed crawler configuration.

Run this crawler to add tables to your Glue Data Catalog. After the crawler has completed successfully, go to the Tables section on your AWS Glue console to verify the table details and table metadata. The following screenshot shows the table details and table metadata after your AWS Glue crawler has completed successfully:

Creating an external schema

Before you can query the S3 inventory reports, you need to create an external schema (and subsequently, external tables) in Amazon Redshift. An Amazon Redshift external schema references an external database in an external data catalog. Because you are using an AWS Glue Data Catalog as your external catalog, after you create an external schema in Amazon Redshift, you can see all the external tables in your Data Catalog in Amazon Redshift. To create the external schema, enter the following code:

create external schema spectrum_schema from data catalog
database 's3spendanalysis'
iam_role 'arn:aws:iam::<AWS_IAM_ROLE>';

Querying the table

On the Amazon Redshift dashboard, under Query editor, you can see the data table. You can also query the svv_external_schemas system table to verify that your external schema has been created successfully. See the following screenshot.

You can now query the S3 inventory reports directly from Amazon Redshift without having to move the data into Amazon Redshift first. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:

Setting up S3 server access logs for analysis

The following diagram shows the data flow for this solution.

Below steps summarize the data flow diagram represented above:

  • S3 Server Access Logs are delivered to an S3 bucket that you configure.
  • These server access logs are then directly accessible to be queried from Amazon Redshift (note that we’ll be using CREATE EXTERNAL TABLE in Redshift Spectrum for this purpose, explained below).
  • The S3 Server Access Logs and the Cost and Usage Reports (available in another S3 bucket) are now ready to be joined and queried for analysis.

The S3 server access logs are delivered to an S3 bucket. For more information about setting up server access logging, see Amazon S3 Server Access Logging.

The following screenshot shows the S3 bucket structure for the server access logs.

The server access log files consist of a sequence of new-line delimited log records. Each log record represents one request and consists of space-delimited fields. The following code is an example log record:

b8ad5f5cfd3c09418536b47b157851fb7bea4a00486471093a7d765e35a4f8ef s3spendanalysisblog [23/Sep/2018:22:10:52 +0000] 72.21.196.65 arn:aws:iam::<AWS Account #>:user/shayons D5633DAD1063C5CA REST.GET.LIFECYCLE - "GET /s3spendanalysisblog?lifecycle= HTTP/1.1" 404 NoSuchLifecycleConfiguration 332 - 105 - "-" "S3Console/0.4, aws-internal/3 aws-sdk-java/1.11.408 Linux/4.9.119-0.1.ac.277.71.329.metal1.x86_64 OpenJDK_64-Bit_Server_VM/25.181-b13 java/1.8.0_181" -

Creating an external table

You can define the S3 server access logs as an external table. Because you already have an external schema, create an external table using the following code. This post uses RegEx SerDe to create a table that allows you to correctly parse all the fields present in the S3 server access logs. See the following code:

CREATE EXTERNAL TABLE spectrum_schema.s3accesslogs(
BucketOwner                   varchar(256), 
Bucket                        varchar(256), 
RequestDateTime               varchar(256), 
RemoteIP                      varchar(256), 
Requester                     varchar(256), 
RequestID                     varchar(256), 
Operation                     varchar(256), 
Key                           varchar(256), 
RequestURI_operation          varchar(256),
RequestURI_key                varchar(256),
RequestURI_httpProtoversion   varchar(256),
HTTPstatus                    varchar(256), 
ErrorCode                     varchar(256), 
BytesSent                     varchar(256), 
ObjectSize                    varchar(256), 
TotalTime                     varchar(256), 
TurnAroundTime                varchar(256), 
Referrer                      varchar(256), 
UserAgent                     varchar(256), 
VersionId                     varchar(256))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) ([^ ]*) \\[(.*?)\\] ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) \"([^ ]*) ([^ ]*) ([^ ]*)\" (-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (\"[^\"]*\") ([^ ]*)'
  )
STORED AS TEXTFILE
LOCATION
  's3://s3spendanalysisblog/accesslogs/';

Validating the data

You can validate the external table data in Amazon Redshift. The following screenshot shows how to do this using the Query Editor in the Amazon Redshift console:

You are now ready to analyze the data.

Analyzing the data using Amazon Redshift

In this post, you have a CUR file per day in your S3 bucket. The files themselves are organized in a monthly hierarchy. See the following screenshot.

Each day’s file consists of the following files for CUR data:

  • myCURReport-1.csv.gz – A zipped file of the data itself
  • myCURReport-Manifest.json – A JSON file that contains the metadata for the file
  • myCURReport-RedshiftCommands.sql – Amazon Redshift table creation scripts and a COPY command to create the CUR table from a Redshift manifest file
  • myCURReport-RedshiftManifest.json – The Amazon Redshift manifest file to create the CUR table

Using Amazon Redshift is one of the many ways to carry out this analysis. Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing Business Intelligence (BI) tools. Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and BI tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.

You are now ready to run SQL queries with the Amazon Redshift SQL Query Editor. This post also uses the psql client tool, a terminal-based front end from PostgreSQL, to query the data in the cluster.

To query the data, complete the following steps:

  1. Create a custom schema to contain your tables for analysis. See the following code:
    create schema if not exists redshift_schema;

    You should create your table in a schema other than public to control user access to database objects.

  2. Create a CUR table for the latest month in Amazon Redshift using the CUR SQL file in S3. See the following code:
    create table redshift_schema.AWSBilling201910 (
    identity_LineItemId VARCHAR(256),
    identity_TimeInterval VARCHAR(100),
    bill_InvoiceId VARCHAR(100),
    bill_BillingEntity VARCHAR(10),
    bill_BillType VARCHAR(100),
    bill_PayerAccountId VARCHAR(100),
    bill_BillingPeriodStartDate TIMESTAMPTZ,
    bill_BillingPeriodEndDate TIMESTAMPTZ,
    lineItem_UsageAccountId VARCHAR(100),
    lineItem_LineItemType VARCHAR(100),
    lineItem_UsageStartDate TIMESTAMPTZ,
    lineItem_UsageEndDate TIMESTAMPTZ,
    lineItem_ProductCode VARCHAR(100),
    lineItem_UsageType VARCHAR(100),
    lineItem_Operation VARCHAR(100),
    lineItem_AvailabilityZone VARCHAR(100),
    lineItem_ResourceId VARCHAR(256),
    lineItem_UsageAmount DECIMAL(11,2),
    lineItem_NormalizationFactor VARCHAR(10),
    lineItem_NormalizedUsageAmount DECIMAL(11,2),
    lineItem_CurrencyCode VARCHAR(10),
    lineItem_UnblendedRate DECIMAL(11,2),
    lineItem_UnblendedCost DECIMAL(11,2),
    lineItem_BlendedRate DECIMAL(11,2),
    lineItem_BlendedCost DECIMAL(11,2),
    lineItem_LineItemDescription VARCHAR(100),
    lineItem_TaxType VARCHAR(100),
    lineItem_LegalEntity VARCHAR(100),
    product_ProductName VARCHAR(100),
    product_alarmType VARCHAR(100),
    product_automaticLabel VARCHAR(100),
    product_availability VARCHAR(100),
    product_availabilityZone VARCHAR(100),
    product_clockSpeed VARCHAR(100),
    product_currentGeneration VARCHAR(100),
    product_databaseEngine VARCHAR(100),
    product_dedicatedEbsThroughput VARCHAR(100),
    product_deploymentOption VARCHAR(100),
    product_durability VARCHAR(100),
    product_ecu VARCHAR(100),
    product_edition VARCHAR(100),
    product_engineCode VARCHAR(100),
    product_enhancedNetworkingSupported VARCHAR(100),
    product_eventType VARCHAR(100),
    product_feeCode VARCHAR(100),
    product_feeDescription VARCHAR(100),
    product_fromLocation VARCHAR(100),
    product_fromLocationType VARCHAR(100),
    product_gpu VARCHAR(100),
    product_gpuMemory VARCHAR(100),
    product_group VARCHAR(100),
    product_groupDescription VARCHAR(100),
    product_instanceFamily VARCHAR(100),
    product_instanceType VARCHAR(100),
    product_instanceTypeFamily VARCHAR(100),
    product_io VARCHAR(100),
    product_labelingTaskType VARCHAR(100),
    product_licenseModel VARCHAR(100),
    product_location VARCHAR(100),
    product_locationType VARCHAR(100),
    product_maxThroughputvolume VARCHAR(100),
    product_maxVolumeSize VARCHAR(100),
    product_memory VARCHAR(100),
    product_messageDeliveryFrequency VARCHAR(100),
    product_messageDeliveryOrder VARCHAR(100),
    product_minVolumeSize VARCHAR(100),
    product_networkPerformance VARCHAR(100),
    product_normalizationSizeFactor VARCHAR(100),
    product_operation VARCHAR(100),
    product_physicalCpu VARCHAR(100),
    product_physicalGpu VARCHAR(100),
    product_physicalProcessor VARCHAR(100),
    product_processorArchitecture VARCHAR(100),
    product_processorFeatures VARCHAR(100),
    product_productFamily VARCHAR(100),
    product_protocol VARCHAR(100),
    product_queueType VARCHAR(100),
    product_region VARCHAR(100),
    product_servicecode VARCHAR(100),
    product_servicename VARCHAR(100),
    product_sku VARCHAR(100),
    product_storage VARCHAR(100),
    product_storageClass VARCHAR(100),
    product_storageMedia VARCHAR(100),
    product_subscriptionType VARCHAR(100),
    product_toLocation VARCHAR(100),
    product_toLocationType VARCHAR(100),
    product_transferType VARCHAR(100),
    product_usageFamily VARCHAR(100),
    product_usagetype VARCHAR(100),
    product_vcpu VARCHAR(100),
    product_version VARCHAR(100),
    product_volumeType VARCHAR(100),
    product_workforceType VARCHAR(100),
    pricing_RateId VARCHAR(100),
    pricing_publicOnDemandCost DECIMAL(11,2),
    pricing_publicOnDemandRate DECIMAL(11,2),
    pricing_term VARCHAR(100),
    pricing_unit VARCHAR(100),
    reservation_AmortizedUpfrontCostForUsage DECIMAL(11,2),
    reservation_AmortizedUpfrontFeeForBillingPeriod DECIMAL(11,2),
    reservation_EffectiveCost DECIMAL(11,2),
    reservation_EndTime TIMESTAMPTZ,
    reservation_ModificationStatus VARCHAR(100),
    reservation_NormalizedUnitsPerReservation BIGINT,
    reservation_RecurringFeeForUsage DECIMAL(11,2),
    reservation_StartTime TIMESTAMPTZ,
    reservation_SubscriptionId VARCHAR(100),
    reservation_TotalReservedNormalizedUnits BIGINT,
    reservation_TotalReservedUnits BIGINT,
    reservation_UnitsPerReservation BIGINT,
    reservation_UnusedAmortizedUpfrontFeeForBillingPeriod DECIMAL(11,2),
    reservation_UnusedNormalizedUnitQuantity BIGINT,
    reservation_UnusedQuantity BIGINT,
    reservation_UnusedRecurringFee DECIMAL(11,2),
    reservation_UpfrontValue BIGINT
    );

  3. Load the data into Amazon Redshift for the latest month, using the provided CUR Manifest file. See the following code:
    copy AWSBilling201910 from 's3://ss-cur//myCURReport/20191001-20191101/fd76beee-0709-42d5-bcb2-bb45f8ba1aae/myCURReport-RedshiftManifest.json'
    credentials 'arn:aws:iam::<AWS_IAM_ROLE>'
    GZIP CSV IGNOREHEADER 1 TIMEFORMAT 'auto' manifest;

  4. Validate the data loaded in the Amazon Redshift table. See the following code:
    select * from AWSBilling201910
    where lineItem_ProductCode = 'AmazonS3'
    and lineItem_ResourceId = 's3spendanalysisblog' limit 10;

    The following screenshot shows that data has been loaded correctly in the Amazon Redshift table:

Managing database security

You can manage database security in Amazon Redshift by controlling which users have access to which database objects. To make sure your objects are secure, create two groups: FINANCE and ADMIN, with two users in FINANCE and one user in ADMIN. Complete the following steps:

  1. Create the groups where the user accounts are assigned. The following code creates two different user groups:
    create group finance;
    create group admin;

    To view all user groups, query the PG_GROUP system catalog table (you should see finance and admin here):

    select * from pg_group:

  2. Create three database users with different privileges and add them to the groups. See the following code:
    create user finance1 password 'finance1Pass'
    in group finance;
    
    create user finance2 password 'finance2Pass'
    in group finance;
    
    create user admin1 password 'admin1Pass'
    in group admin;

    Validate the users have been successfully created. To view a list of users, query the PG_USER catalog table:

  3. Grant SELECT privileges to the FINANCE group and ALL privileges to the ADMIN group for your table AWSBilling201910 in redshift_schema. See the following code:
    grant select on table redshift_schema.AWSBilling201910 to group finance; 
    grant all on table redshift_schema.AWSBilling201910 to group admin;

    You can verify if you enforced database security correctly. The user finance1 tried to rename the table AWSBilling201910 in redshift_schema, but got a permission denied error message (due to restricted access). The following screenshot shows this scenario and the subsequent error message:

Example S3 inventory analysis

S3 charges split per bucket. The following query identifies the data storage and transfer costs for each separate S3 bucket:

SELECT
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."storage_class",
  SUM(CASE
    WHEN "lineitem_usagetype" like '%Byte%' THEN "lineitem_usageamount"/1024
    ELSE "lineitem_usageamount"
  END) as "Usage",
  CASE
    WHEN "lineitem_usagetype" like '%Byte%' THEN 'TBs'
    ELSE 'Requests'
  END as "Usage Units",
  sum("lineitem_blendedcost") as cost
from awsbilling201902 a
  join spectrum_schema.data b
    on a.lineItem_ResourceId = b.bucket
where "product_productname" = 'Amazon Simple Storage Service'
group by
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."storage_class"
order by
  sum("lineitem_blendedcost") desc;

The following screenshot shows the results of executing the above query:

Costs are split by type of storage (for example, Glacier versus standard storage).

The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 storage class (usage amount, unblended cost, blended cost):

SELECT
 lineitem_productcode
 ,product_fromlocation
 ,product_tolocation,
  b.storage_class
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  join spectrum_schema.data b
ON
    a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
 AND a.product_productfamily = 'Data Transfer'
GROUP BY
 1,2,3,4
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 fee, API request, and storage charges:

SELECT
 lineitem_productcode
 ,product_productfamily
 ,b.storage_class
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  join spectrum_schema.data b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
  and a.product_productfamily <> 'Data Transfer'
GROUP BY
 1,2,3
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

Server access logs sample analysis queries

S3 access log charges per operation type. The following query identifies the data storage and transfer costs for each separate HTTP operation:

SELECT
  "lineitem_productcode",
  "lineitem_usagetype",
  "lineitem_resourceid",
  b."operation",
  b."httpstatus",
  b."bytessent",
  SUM(CASE
      WHEN "lineitem_usagetype" like '%Byte%'
        THEN "lineitem_usageamount" / 1024
      ELSE "lineitem_usageamount"
      END) as "Usage",
  CASE
  WHEN "lineitem_usagetype" like '%Byte%'
    THEN 'TBs'
  ELSE 'Requests'
  END  as "Usage Units",
  sum("lineitem_blendedcost") as cost
from awsbilling201902 a
  join spectrum_schema.s3accesslogs b
    on a.lineItem_ResourceId = b.bucket
where "product_productname" = 'Amazon Simple Storage Service'
group by
  1, 2, 3, 4, 5, 6
order by
  sum("lineitem_blendedcost") desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 data transfer costs (intra-region and inter-region) by S3 operation and HTTP status (usage amount, unblended cost, blended cost):

SELECT
 lineitem_productcode
 ,product_fromlocation
 ,product_tolocation
 ,b.operation
 ,b.httpstatus
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  JOIN spectrum_schema.s3accesslogs b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
 AND a.product_productfamily = 'Data Transfer'
GROUP BY
 1,2,3,4,5
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

The following query identifies S3 fee, API request, and storage charges:

SELECT
 lineitem_productcode
 ,product_productfamily
 ,b.operation
 ,b.httpstatus
 ,sum(lineitem_usageamount) usageamount
 ,sum(lineitem_unblendedcost) unblendedcost
 ,sum(lineitem_blendedcost) blendedcost
FROM
awsbilling201902 a
  JOIN spectrum_schema.s3accesslogs b
ON
   a.lineItem_ResourceId = b.bucket
WHERE
 a.lineitem_productcode = 'AmazonS3'
  and a.product_productfamily <> 'Data Transfer'
GROUP BY
 1,2,3,4
ORDER BY
 usageamount desc;

The following screenshot shows the result of executing the above query:

Overall data flow diagram

The following diagram shows the complete data flow for this solution.

Conclusion

AWS Glue makes provides an easy and convenient way to discover data stored in your S3 buckets automatically in a cloud-native, secure, and efficient way. This post demonstrated how to use AWS Glue and Amazon Redshift to analyze your S3 spend using Cost and Usage Reports. You also learned best practices for managing database security in Amazon Redshift through users and groups. Using this framework, you can start analyzing your S3 bucket spend with a few clicks in a matter of minutes on the AWS Management Console!

If you have questions or suggestions, please leave your thoughts in the comments section below.

 


About the Author

 Shayon Sanyal is a Data Architect, Data Lake for Global Financial Services at AWS.

 

 

 

ICYMI: Serverless Q4 2019

Post Syndicated from Rob Sutter original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2019/

Welcome to the eighth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, checkout what happened last quarter here.

The three months comprising the fourth quarter of 2019

AWS re:Invent

AWS re:Invent 2019

re:Invent 2019 dominated the fourth quarter at AWS. The serverless team presented a number of talks, workshops, and builder sessions to help customers increase their skills and deliver value more rapidly to their own customers.

Serverless talks from re:Invent 2019

Chris Munns presenting 'Building microservices with AWS Lambda' at re:Invent 2019

We presented dozens of sessions showing how customers can improve their architecture and agility with serverless. Here are some of the most popular.

Videos

Decks

You can also find decks for many of the serverless presentations and other re:Invent presentations on our AWS Events Content.

AWS Lambda

For developers needing greater control over performance of their serverless applications at any scale, AWS Lambda announced Provisioned Concurrency at re:Invent. This feature enables Lambda functions to execute with consistent start-up latency making them ideal for building latency sensitive applications.

As shown in the below graph, provisioned concurrency reduces tail latency, directly impacting response times and providing a more responsive end user experience.

Graph showing performance enhancements with AWS Lambda Provisioned Concurrency

Lambda rolled out enhanced VPC networking to 14 additional Regions around the world. This change brings dramatic improvements to startup performance for Lambda functions running in VPCs due to more efficient usage of elastic network interfaces.

Illustration of AWS Lambda VPC to VPC NAT

New VPC to VPC NAT for Lambda functions

Lambda now supports three additional runtimes: Node.js 12, Java 11, and Python 3.8. Each of these new runtimes has new version-specific features and benefits, which are covered in the linked release posts. Like the Node.js 10 runtime, these new runtimes are all based on an Amazon Linux 2 execution environment.

Lambda released a number of controls for both stream and async-based invocations:

  • You can now configure error handling for Lambda functions consuming events from Amazon Kinesis Data Streams or Amazon DynamoDB Streams. It’s now possible to limit the retry count, limit the age of records being retried, configure a failure destination, or split a batch to isolate a problem record. These capabilities help you deal with potential “poison pill” records that would previously cause streams to pause in processing.
  • For asynchronous Lambda invocations, you can now set the maximum event age and retry attempts on the event. If either configured condition is met, the event can be routed to a dead letter queue (DLQ), Lambda destination, or it can be discarded.

AWS Lambda Destinations is a new feature that allows developers to designate an asynchronous target for Lambda function invocation results. You can set separate destinations for success and failure. This unlocks new patterns for distributed event-based applications and can replace custom code previously used to manage routing results.

Illustration depicting AWS Lambda Destinations with success and failure configurations

Lambda Destinations

Lambda also now supports setting a Parallelization Factor, which allows you to set multiple Lambda invocations per shard for Kinesis Data Streams and DynamoDB Streams. This enables faster processing without the need to increase your shard count, while still guaranteeing the order of records processed.

Illustration of multiple AWS Lambda invocations per Kinesis Data Streams shard

Lambda Parallelization Factor diagram

Lambda introduced Amazon SQS FIFO queues as an event source. “First in, first out” (FIFO) queues guarantee the order of record processing, unlike standard queues. FIFO queues support messaging batching via a MessageGroupID attribute that supports parallel Lambda consumers of a single FIFO queue, enabling high throughput of record processing by Lambda.

Lambda now supports Environment Variables in the AWS China (Beijing) Region and the AWS China (Ningxia) Region.

You can now view percentile statistics for the duration metric of your Lambda functions. Percentile statistics show the relative standing of a value in a dataset, and are useful when applied to metrics that exhibit large variances. They can help you understand the distribution of a metric, discover outliers, and find hard-to-spot situations that affect customer experience for a subset of your users.

Amazon API Gateway

Screen capture of creating an Amazon API Gateway HTTP API in the AWS Management Console

Amazon API Gateway announced the preview of HTTP APIs. In addition to significant performance improvements, most customers see an average cost savings of 70% when compared with API Gateway REST APIs. With HTTP APIs, you can create an API in four simple steps. Once the API is created, additional configuration for CORS and JWT authorizers can be added.

AWS SAM CLI

Screen capture of the new 'sam deploy' process in a terminal window

The AWS SAM CLI team simplified the bucket management and deployment process in the SAM CLI. You no longer need to manage a bucket for deployment artifacts – SAM CLI handles this for you. The deployment process has also been streamlined from multiple flagged commands to a single command, sam deploy.

AWS Step Functions

One powerful feature of AWS Step Functions is its ability to integrate directly with AWS services without you needing to write complicated application code. In Q4, Step Functions expanded its integration with Amazon SageMaker to simplify machine learning workflows. Step Functions also added a new integration with Amazon EMR, making EMR big data processing workflows faster to build and easier to monitor.

Screen capture of an AWS Step Functions step with Amazon EMR

Step Functions step with EMR

Step Functions now provides the ability to track state transition usage by integrating with AWS Budgets, allowing you to monitor trends and react to usage on your AWS account.

You can now view CloudWatch Metrics for Step Functions at a one-minute frequency. This makes it easier to set up detailed monitoring for your workflows. You can use one-minute metrics to set up CloudWatch Alarms based on your Step Functions API usage, Lambda functions, service integrations, and execution details.

Step Functions now supports higher throughput workflows, making it easier to coordinate applications with high event rates. This increases the limits to 1,500 state transitions per second and a default start rate of 300 state machine executions per second in US East (N. Virginia), US West (Oregon), and Europe (Ireland). Click the above link to learn more about the limit increases in other Regions.

Screen capture of choosing Express Workflows in the AWS Management Console

Step Functions released AWS Step Functions Express Workflows. With the ability to support event rates greater than 100,000 per second, this feature is designed for high-performance workloads at a reduced cost.

Amazon EventBridge

Illustration of the Amazon EventBridge schema registry and discovery service

Amazon EventBridge announced the preview of the Amazon EventBridge schema registry and discovery service. This service allows developers to automate discovery and cataloging event schemas for use in their applications. Additionally, once a schema is stored in the registry, you can generate and download a code binding that represents the schema as an object in your code.

Amazon SNS

Amazon SNS now supports the use of dead letter queues (DLQ) to help capture unhandled events. By enabling a DLQ, you can catch events that are not processed and re-submit them or analyze to locate processing issues.

Amazon CloudWatch

Amazon CloudWatch announced Amazon CloudWatch ServiceLens to provide a “single pane of glass” to observe health, performance, and availability of your application.

Screenshot of Amazon CloudWatch ServiceLens in the AWS Management Console

CloudWatch ServiceLens

CloudWatch also announced a preview of a capability called Synthetics. CloudWatch Synthetics allows you to test your application endpoints and URLs using configurable scripts that mimic what a real customer would do. This enables the outside-in view of your customers’ experiences, and your service’s availability from their point of view.

CloudWatch introduced Embedded Metric Format, which helps you ingest complex high-cardinality application data as logs and easily generate actionable metrics. You can publish these metrics from your Lambda function by using the PutLogEvents API or using an open source library for Node.js or Python applications.

Finally, CloudWatch announced a preview of Contributor Insights, a capability to identify who or what is impacting your system or application performance by identifying outliers or patterns in log data.

AWS X-Ray

AWS X-Ray announced trace maps, which enable you to map the end-to-end path of a single request. Identifiers show issues and how they affect other services in the request’s path. These can help you to identify and isolate service points that are causing degradation or failures.

X-Ray also announced support for Amazon CloudWatch Synthetics, currently in preview. CloudWatch Synthetics on X-Ray support tracing canary scripts throughout the application, providing metrics on performance or application issues.

Screen capture of AWS X-Ray Service map in the AWS Management Console

X-Ray Service map with CloudWatch Synthetics

Amazon DynamoDB

Amazon DynamoDB announced support for customer-managed customer master keys (CMKs) to encrypt data in DynamoDB. This allows customers to bring your own key (BYOK) giving you full control over how you encrypt and manage the security of your DynamoDB data.

It is now possible to add global replicas to existing DynamoDB tables to provide enhanced availability across the globe.

Another new DynamoDB capability to identify frequently accessed keys and database traffic trends is currently in preview. With this, you can now more easily identify “hot keys” and understand usage of your DynamoDB tables.

Screen capture of Amazon CloudWatch Contributor Insights for DynamoDB in the AWS Management Console

CloudWatch Contributor Insights for DynamoDB

DynamoDB also released adaptive capacity. Adaptive capacity helps you handle imbalanced workloads by automatically isolating frequently accessed items and shifting data across partitions to rebalance them. This helps reduce cost by enabling you to provision throughput for a more balanced workload instead of over provisioning for uneven data access patterns.

Amazon RDS

Amazon Relational Database Services (RDS) announced a preview of Amazon RDS Proxy to help developers manage RDS connection strings for serverless applications.

Illustration of Amazon RDS Proxy

The RDS Proxy maintains a pool of established connections to your RDS database instances. This pool enables you to support a large number of application connections so your application can scale without compromising performance. It also increases security by enabling IAM authentication for database access and enabling you to centrally manage database credentials using AWS Secrets Manager.

AWS Serverless Application Repository

The AWS Serverless Application Repository (SAR) now offers Verified Author badges. These badges enable consumers to quickly and reliably know who you are. The badge appears next to your name in the SAR and links to your GitHub profile.

Screen capture of SAR Verifiedl developer badge in the AWS Management Console

SAR Verified developer badges

AWS Developer Tools

AWS CodeCommit launched the ability for you to enforce rule workflows for pull requests, making it easier to ensure that code has pass through specific rule requirements. You can now create an approval rule specifically for a pull request, or create approval rule templates to be applied to all future pull requests in a repository.

AWS CodeBuild added beta support for test reporting. With test reporting, you can now view the detailed results, trends, and history for tests executed on CodeBuild for any framework that supports the JUnit XML or Cucumber JSON test format.

Screen capture of AWS CodeBuild

CodeBuild test trends in the AWS Management Console

Amazon CodeGuru

AWS announced a preview of Amazon CodeGuru at re:Invent 2019. CodeGuru is a machine learning based service that makes code reviews more effective and aids developers in writing code that is more secure, performant, and consistent.

AWS Amplify and AWS AppSync

AWS Amplify added iOS and Android as supported platforms. Now developers can build iOS and Android applications using the Amplify Framework with the same category-based programming model that they use for JavaScript apps.

Screen capture of 'amplify init' for an iOS application in a terminal window

The Amplify team has also improved offline data access and synchronization by announcing Amplify DataStore. Developers can now create applications that allow users to continue to access and modify data, without an internet connection. Upon connection, the data synchronizes transparently with the cloud.

For a summary of Amplify and AppSync announcements before re:Invent, read: “A round up of the recent pre-re:Invent 2019 AWS Amplify Launches”.

Illustration of AWS AppSync integrations with other AWS services

Q4 serverless content

Blog posts

October

November

December

Tech talks

We hold several AWS Online Tech Talks covering serverless tech talks throughout the year. These are listed in the Serverless section of the AWS Online Tech Talks page.

Here are the ones from Q4:

Twitch

October

There are also a number of other helpful video series covering Serverless available on the AWS Twitch Channel.

AWS Serverless Heroes

We are excited to welcome some new AWS Serverless Heroes to help grow the serverless community. We look forward to some amazing content to help you with your serverless journey.

AWS Serverless Application Repository (SAR) Apps

In this edition of ICYMI, we are introducing a section devoted to SAR apps written by the AWS Serverless Developer Advocacy team. You can run these applications and review their source code to learn more about serverless and to see examples of suggested practices.

Still looking for more?

The Serverless landing page has much more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. We’re also kicking off a fresh series of Tech Talks in 2020 with new content providing greater detail on everything new coming out of AWS for serverless application developers.

Throughout 2020, the AWS Serverless Developer Advocates are crossing the globe to tell you more about serverless, and to hear more about what you need. Follow this blog to keep up on new launches and announcements, best practices, and examples of serverless applications in action.

You can also follow all of us on Twitter to see latest news, follow conversations, and interact with the team.

Chris Munns: @chrismunns
Eric Johnson: @edjgeek
James Beswick: @jbesw
Moheeb Zara: @virgilvox
Ben Smith: @benjamin_l_s
Rob Sutter: @rts_rob
Julian Wood: @julian_wood

Happy coding!

Amazon Redshift at re:Invent 2019

Post Syndicated from Corina Radovanovich original https://aws.amazon.com/blogs/big-data/amazon-redshift-at-reinvent-2019/

The annual AWS re:Invent learning conference is an exciting time full of new product and program launches. At the first re:Invent conference in 2012, AWS announced Amazon Redshift. Since then, tens of thousands of customers have started using Amazon Redshift as their cloud data warehouse. In 2019, AWS shared several significant launches and dozens of sessions. This post presents highlights of what happened with Amazon Redshift at re:Invent 2019.

Andy Jassy’s AWS re:Invent 2019 keynote

When Andy Jassy takes the stage to talk about what’s new at AWS, he launches the new Amazon Redshift node type, RA3 with managed storage; the new Federated Query (preview) feature, Export to Data Lake; and Advanced Query Accelerator (AQUA) (preview) for Amazon Redshift. Watch AWS re:Invent 2019 – Keynote with Andy Jassy on YouTube, or jump ahead for the Amazon RedShift announcements.

Deep dive and best practices for Amazon Redshift

Every year the Amazon Redshift deep dive session rates highly, and people continue to watch and re-watch it after the event. This year was no different. Specialist Solution Architects Harshida Patel and Tony Gibbs take an in-depth look at best practices for data warehousing with Amazon Redshift. It’s a must-see for existing Amazon Redshift users. Watch AWS re:Invent 2019: Deep dive and best practices for Amazon Redshift (ANT418) on YouTube.

What’s new with Amazon Redshift, featuring Yelp and Workday

With over 200 new features and capabilities launched in the last 18 months, there’s a lot to cover in a session about what’s new with Amazon Redshift. Join one of the Product Managers driving RA3 and managed storage, Himanshu Raja, to catch up on the recent performance, concurrency, elasticity, and manageability enhancements behind Amazon Redshift’s record price-to-performance ratio. You also get more insight into the architectural evolution of Amazon Redshift with RA3 and managed storage, and how it uses machine learning to create a self-optimizing data warehouse. In the second half of the session, you hear from Steven Moy, a software engineer at Yelp, about how Amazon Redshift’s latest features have helped Yelp achieve optimization and scale for an organization with an enormous about of data and sophisticated analytics. Watch AWS re:Invent 2019: What’s new with Amazon Redshift, featuring Yelp (ANT320-R1) on YouTube.

The session repeated with Michalis Petropoulos, Director of Engineering, and Erol Guney of Workday. Watch the full session to get a slightly different take on what’s new, or jump to the customer presentation to hear from Erol Guney, Architect, Data Platform, at Workday about how Amazon Redshift empowers their Data as a Service product team to focus on architecture goals and business logic.

Migrate your data warehouse to the cloud in record time, featuring Nielsen and Fannie Mae

In this session, you learn about important concepts and tips for migrating your legacy on-premise data warehouse to the cloud. You hear from Tejas Desai, VP of Technology at Neilsen, about their migration journey and benefits. Watch AWS re:Invent 2019: Migrate your data warehouse to the cloud, featuring Nielsen (ANT334-R1) on YouTube.

The repeat of this session features Amy Tseng from Fannie Mae. If you don’t want to listen to Tony’s overview again, skip ahead to learn how Fannie Mae embraced a data lake architecture with Amazon Redshift for analytics to save costs, maximize performance, and scale. Amy’s presentation was a crowd favorite, with some of the most positive customer feedback and a wealth of great information about how Fannie Mae managed their migration.

How to scale data analytics with Amazon Redshift, featuring Duolingo and Warner Brothers Interactive Entertainment

Data is growing fast, and so is the value that business users need to gain from business data. When AWS first announced Amazon Redshift in 2012, it could handle up to 640 TB of compressed data. It can now scale to 8 PB of compressed data. Learn more about Amazon Redshift’s unique ability to deliver top performance at the lowest and most predictable cost from Vinay Shukla, Principal Product Manager. This is an especially important session if you want to learn more about the newest Amazon Redshift node type RA3. You also hear from Jonathan Burket of Duolingo about their experience in the preview of RA3 nodes and how Duolingo uses Amazon Redshift. Duolingo is a wildly popular language-learning platform and the most downloaded education app in the world, with over 300 million users. Enabling data-driven decisions with A/B tests and ad hoc analysis has been a driver of their success. Watch AWS re:Invent 2019: How to scale data analytics with Amazon Redshift (ANT335-R2) on YouTube.

The repeat session features Redshift Product Manager Maor Kleider with an in-depth case study from Matt Howell, Executive Director, Analytics, and Kurt Larson, Technical Director, Analytics, at Warner Brothers Interactive Entertainment. Watch the full session for another perspective about how to scale with the latest Amazon Redshift features, with unique insights about analytics across Amazon Redshift and your data lake. You can also jump to the customer presentation. Not only is this session packed with interesting insights about how data analytics drives the success of games like Batman and Mortal Kombat, it also has an action-packed trailer about all the awesome Warner Brothers games.

If you prefer to see a session without the announcements from the keynote and with demos, watch Debu Panda showcase the new Amazon Redshift console and share practical tips about using Amazon Redshift.

Amazon Redshift reimagined: RA3 and AQUA

This embargoed session is the first opportunity to learn more about AQUA for Amazon Redshift, and how it improves query performance to up to 10 times faster. Britt Johnston, Director of Product Management, kicks off with an intro into the next generation of Amazon Redshift, and Senior Principal Engineer Andy Caldwell jumps in to share the origin and vision of the exciting new technology. The enthusiasm Andy feels about sharing AQUA with customers for the first time is palpable. Watch AWS re:Invent 2019: [NEW LAUNCH!] Amazon Redshift reimagined: RA3 and AQUA (ANT230) on YouTube.

State-of-the-art cloud data warehousing, featuring Asurion and Comcast

This session serves as a great introduction to cloud data warehousing at AWS, with insightful presentations from a different customer in each delivery. You can hear from Asurion about how they use data analytics to serve over 300 million people with excellent customer satisfaction scores. You learn about how to use AWS services with Amazon Redshift and why Asurion believes in their data lake-based architecture. Watch AWS re:Invent 2019: State-of-the-art cloud data warehousing, featuring Asurion (ANT213-R1) on YouTube.

In the repeat session, Rajat Garg, Senior Principal Architect from Comcast, talks about moving to Amazon Redshift from a legacy on-premise Oracle Exadata environment. He shares their strategy, approach, and performance improvements.

What’s next and more information

In addition to these sessions at re:Invent, there are also hands-on workshops, intimate builder roundtables, and interactive chalk talks that weren’t recorded.

Keep exploring the following links for more information about new releases:

We hope to see you in Las Vegas for re:Invent 2020, or at one of the hundreds of other AWS virtual and in-person events running around the world. For more information, see AWS Events and Webinars.


About the authors

Corina Radovanovich leads product marketing for cloud data warehousing at AWS.

 

 

 

Maximize data ingestion and reporting performance on Amazon Redshift

Post Syndicated from Vasu Kiran Gorti original https://aws.amazon.com/blogs/big-data/maximize-data-ingestion-and-reporting-performance-on-amazon-redshift/

This is a guest post from ZS. In their own words, “ZS is a professional services firm that works closely with companies to help develop and deliver products and solutions that drive customer value and company results. ZS engagements involve a blend of technology, consulting, analytics, and operations, and are targeted toward improving the commercial experience for clients.”

ZS was involved in setting up and operating a MicroStrategy-based BI application that sources 700 GB of data from Amazon Redshift as a data warehouse in an Amazon-hosted backend architecture. ZS sourced healthcare data from various pharma data vendors from different systems such as Amazon S3 buckets and FTP systems into the data lake. They processed this data using transient Amazon EMR clusters and stored it on Amazon S3 for reporting consumption. The reporting-specific data is moved to Amazon Redshift using COPY commands, and MicroStrategy uses it to refresh front-end dashboards.

ZS has strict, client-set SLAs to meet with the available Amazon Redshift infrastructure. We carried out experiments to identify an approach to handle large data volumes using the available small Amazon Redshift cluster.

This post provides an approach for loading a large volume of data from S3 to Amazon Redshift and applies efficient distribution techniques for enhanced performance of reporting queries on a relatively small Amazon Redshift cluster.

Data processing methodology

ZS infrastructure is hosted on AWS, where they store and process pharma industry data from various vendors using AWS services before reporting the data on a MicroStrategy BI reporting tool. The following diagram shows the overall data flow from flat files to reports shown on MicroStrategy for end-users.

Step 1: Pharma data is sourced from various vendors and different systems like FTP location, individual systems and Amazon S3 buckets etc.

Step 2: Cost-effective transient clusters are spun as needed to provide compute power to execute pyspark codes.

Step 3: Post the processing, data is stored in Amazon S3 buckets for consumption of downstream applications.

Step 4: 700 GB of data is then ingested into Amazon Redshift for MSTR consumption.

Step 5: This data is read from Amazon Redshift and the insights are displayed to the end-users in the form of reports on MicroStrategy.

Dataset under consideration

In this specific scenario, ZS was working with data from the pharma domain. The following table demonstrates the data’s typical structure—it has several doctor, patient, treatment-pertinent IDs, and healthcare metrics.

Table 1
Column Name EMR datatypeAmazon Redshift datatype
Time IDintegerint
Geography IDintegerint
Product IDintegerint
Market IDintegerint
Doctor IDintegerint
Doctor Attribute 1 IDintegerint
Doctor Attribute 2 IDintegerint
Doctor Attribute 3 IDintegerint
Doctor Attribute 4 IDintegerint
Doctor Rankintegerint
Metric 1doubledecimal(18,6)
Metric 2doubledecimal(18,6)
Metric 3doubledecimal(18,6)
Metric 4doubledecimal(18,6)
Metric 5doubledecimal(18,6)
Metric 6doubledecimal(18,6)
Metric 7doubledecimal(18,6)
Metric 8doubledecimal(18,6)
Metric 9doubledecimal(18,6)
Metric 10doubledecimal(18,6)
Metric 11doubledecimal(18,6)
Metric 12doubledecimal(18,6)
Metric 13doubledecimal(18,6)
Metric 14doubledecimal(18,6)
Metric 15doubledecimal(18,6)
Metric 16doubledecimal(18,6)
Metric 17doubledecimal(18,6)
Metric 18doubledecimal(18,6)
Metric 19doubledecimal(18,6)
Metric 20doubledecimal(18,6)
Metric 21doubledecimal(18,6)
Metric 22doubledecimal(18,6)
Metric 23doubledecimal(18,6)
Data Snapshot Datetimestamptimestamp
Data Refresh Datetimestamptimestamp
Data Refresh IDstringvarchar

Each table has approximately 35–40 columns and holds approximately 200–250M rows of data. ZS used 40 such tables; they sourced the data in these tables from various healthcare data vendors and processed them as per reporting needs.

The total dataset is approximately 2 TB in size in CSV format and approximately 700 GB in Parquet format.

Challenges and constraints

The five-step process for data refresh and insight generation outlined previously takes place over the weekends within a stipulated time frame. Under default, unoptimized state data load from S3 to Amazon Redshift and MicroStrategy refresh (Step 4 in the previous diagram) took almost 13–14 hours on a 2node ds2.8xlarge cluster and was affecting the overall weekend run SLA (1.5 hours).

The following diagram outlines the three constraints that ZS had to solve for to meet client needs:

Weekly time-based SLA – Load within 1 hour and fetch data on MSTR within 1.5 hours

The client IT and Business teams set a strict SLA to load 700 GB of Parquet data (equivalent to 2 TB CSV) onto Amazon Redshift and refresh the reports on the MicroStrategy BI tool. In this scenario, the client team had moved from another vendor to AWS, and the overall client expectation was to reduce costs without a significant performance dip.

Fixed cluster size – Pre-decided 2 node ds2.8xlarge cluster

The client IT teams determined the cluster size and configuration, and took into consideration the cost, data volumes, and load patterns. These were fixed and not adjustable: a 2 node ds2.8xlarge cluster. ZS carried out PoCs to optimize the environment subject to these constraints.

High data volume – Truncate load 700GB data in Parquet format

The data that ZS used was pertinent to the Pharma domain. The dataset under consideration in this scenario was 700 GB in Parquet format. In this specific use case, with every refresh, even historic data was updated, and therefore a lot of data could not be appended. Therefore, we followed a truncate and load process.

Iterative optimization

With constraints over time, data volume, and cluster size, ZS performed various experiments to optimize Amazon Redshift data load and read time—two key aspects to gauge performance. ZS created an iterative framework that helps do the following:

  • Decide the file format
  • Define optimal data distribution through distribution and sort keys
  • Identify the techniques to parallelize the data-loading process

The below diagram shows the key steps that can be followed to get the best data load and read performance on any Amazon RedShift cluster.

Data load optimization

We identified and optimized four key factors impacting data load performance: file formats, file size at source, concurrency, and column encoding.

File formats

Many projects usually load data in CSV format from S3 to Amazon Redshift. ZS had data available in Parquet format with snappy compression as an output of Spark processes. (Spark processes work best with this combination.)

To identify an efficient format for Amazon Redshift, we compared Parquet with commonly used CSV and GZIP formats. We loaded a table from S3, with 200M rows of data generated through the Spark process, which equates to 41 GB in CSV, 11 GB in Parquet, and 10 GB in GZIP, and compared load time and CPU utilization. The below diagram shows load time vs CPU utilization for same data stored in different file formats.

For the dataset and constraints we were working with, loading the Parquet format file required low CPU utilization and lesser I/O compared to CSV and GZIP, and occupied a smaller memory footprint on S3 compared to the memory-intensive CSV format. Lower CPU utilization allowed more parallel loads in this scenario, thereby reducing the overall runtime required to load Parquet files.

File size at source

The next aspect was to choose the block size in which the Parquet files were broken down and stored on S3. A block size of 128 MB is commonly used for Spark jobs and considered to be optimal for data processing. However, Amazon Redshift works best with larger files.

We loaded 10 GB of Parquet data broken down into smaller equisized files of 250 MB, 750 MB, 1 GB, 1.5 GB, and 3 GB block sizes and noted the performance in each case. The following graph shows the different load times.

There was a gradual improvement in the data load time until the block size reached 1 GB (with best load timing). Beyond 1GB mark there was a dip in performance observed with larger files and Amazon Redshift took more time to process larger files.

These numbers are specific to the kind of data we were working with. The recommendations can vary as the form and shape of data changes.

As a best practice, identify the block size and have the number of files as a multiple of the number of slices of the Amazon Redshift cluster. This makes sure that each slice does an equal amount of work and there are no idle slices, thereby increasing efficiency and improving performance. For more information, see Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift.

Concurrency

The COPY command is relatively low on memory. The more parallel the loads, the better the performance. ZS loaded a table approximately 7.3 GB multiple times with separate concurrency settings. We measured the throughput in terms of the average time taken per GB to move files to Amazon Redshift with 1 to 20 concurrent loads. The following table summarizes the results.

TestNumber of tables loaded in parallel (concurrency)Total data loaded (GB)
Test 117.3
Test 2536.5
Test 31073
Test 415109.5
Test 520146

The below diagram shows the time taken to load 1GB of data and the CPU utilization for various concurrency settings.

For the dataset and constraints we were working with, a concurrency of 10 gave the best throughput for our specific dataset, with a 25% CPU availability buffer of about 25%. Depending on the nature of the data and volume fluctuations every release, you can opt for a different buffer.

Column encoding

To identify the column encoding and compression on Amazon Redshift that gives the best performance and occupies lower storage footprint, ZS compared ZSTD (which the ANALYZE COMPRESSSION command recommended), LZO, and none encoding formats on Amazon Redshift tables for load performance. The below diagram shows the time taken to load same data volume into the tables with none, ZSTD and LZO encoding applied to the columns.

For the dataset and constraints we were working with, ZSTD encoding on columns offered a high compression ratio (~3, than when there was no compression used) and gave the best data copy performance and low storage footprint on Amazon Redshift for our use case. You can have varied results depending on the datatype and cardinality of the data.

Note: This solution was implemented prior to the feature release for AZ64 encoding and hence does not consider its impact. One could use the approach described in this blog post considering AZ64 compression encoding among all the compression encodings Amazon Redshift supports.

Data read optimization

ZS also improved the data read performance by MicroStrategy from Amazon Redshift by using distribution and sorting keys and SQL optimization (minimizing filters on MicroStrategy auto-generated SQL queries).

SQL queries

MicroStrategy is a business intelligence tool and reads data from a database by intelligently building its own SQL. We compared the performance of MSTR SQLs (typical DW queries such as SELECT, GROUP BY, or temporary tables) with and without filters and observed that the query with and without filters for our dataset ran in almost the same time and used the same resources. But the unfiltered query gave four times more data. The following table summarizes the results.

Filters?

Rows fetchedQuery runtimeCPU utilization

Y

500K

5.1 (mins)

15%

N2M5.1 (mins)

15%

Tweaking the SQLs to have minimum or no filters can help fetch significantly more rows with only a small increase in processing time, compared to what it takes to read data with more filters. If you need to use the entire dataset, it is better to fetch complete data using one query than using several SQLs with separate filters and running them in parallel.

Distribution and sort keys

The following table is an analysis of a table load and read performance with and without distribution and sort keys for our use case.

Dist. Style Dist KeySort KeysLoad time (mins)Query run time (mins)Query CPU utilization
None16.759.559%
KeyMost suitable distribution key depending on the query that gives even distribution6 columns that appear in where clause, in the order of group by clause in SQL17.536.132%

The table without a distribution key loaded a little faster. However, differences in data read time and CPU utilization between the two queries is significant. This means that more and more parallel loads could run efficiently when the keys were set appropriately, because the overall CPU utilization reduced significantly. The key takeaway is to use a distribution style—auto, even, or manual—to optimize data read from Amazon Redshift and allow more parallel processing. ZS used a manual distribution style and chose distribution and sort keys based on an extensive understanding of data and refresh cadence.

Next steps to get the best output from your Amazon Redshift instance

As an outcome of the multiple POC results, we identified the most suitable file formats, compression techniques, re-partitioned Parquet file block sizes, and distribution and interleaved sorting logic that offer the best performance for our dataset for reporting on MicroStrategy with Amazon Redshift as the database. This helped us identify the best data load and read combinations to load 700 GB of Parquet data (equivalent to 2 TB of CSV data) within the client-set 2.5 hour SLA using the available fixed 2 node ds2.8xlarge cluster.

The following diagram shows the iterative process you can follow to identify the best data load and read techniques suited for an Amazon Redshift cluster configuration.

The following are a few key takeaways:

  • Parquet and Amazon Redshift worked well together. The data in Parquet format had low CPU utilization and I/O requirements, which allowed more parallel loads.
  • ZSTD encoding worked best for this specific dataset due to encoding on numbers as well.
  • Sort keys and distribution keys on tables can bring down read time by approximately 80% compared to tables with no distribution and sorting logic applied.
  • Filtering data on Amazon Redshift does not work the same as typical databases. You can improve data filtering performance with the appropriate sort keys.
  • File size at the source and concurrency are interrelated and you should choose them accordingly. Larger blocks (maximum 1 GB, for this specific dataset) are loaded faster onto Amazon Redshift.

 


About the Authors

Vasu Kiran Gorti is a result-oriented professional with technology and functional/domain experience predominantly in sales and marketing consulting. Essaying a role of Associate Consultant at ZS Associates, he has experience in working with life sciences and healthcare clients in alignment to their business objectives and expectations. He specializes in MicroStrategy, Business Intelligence, Analytics, and reporting. Proactive and innovative, Vasu enjoys taking up stimulating initiatives that bridge the gap between technology and business. He’s a permanent beta—always learning, improvising and evolving.

 

 

Ajit Pathak is a Technology Consultant at ZS Associates and leads BI and data management projects for Pharmaceutical companies. His areas of interest and expertise include MSTR, Redshift, and AWS suite. He loves designing complex applications and recommending best architectural practices that lead to efficient and concise dashboards. A qualified technology consultant, Ajit has focused on driving informed business decisions through clear data communication. When he is not focusing on areas of data visualizations and applications, Ajit loves to read, play badminton, and participate in debates ranging from politics to sports.

Working with nested data types using Amazon Redshift Spectrum

Post Syndicated from Juan Yu original https://aws.amazon.com/blogs/big-data/working-with-nested-data-types-using-amazon-redshift-spectrum/

Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types. This post discusses which use cases can benefit from nested data types, how to use Amazon Redshift Spectrum with nested data types to achieve excellent performance and storage efficiency, and some of the limitations of nested data types.

This post uses a data set generated with dummy data. You can view its table schema. If you’d like to try the dataset, deploy a Redshift cluster, execute the DDLs there, and use the example queries from this post or build your own.

Data modeling

In many scenarios, data is generated in a hierarchy. For example, assume a customer bought several items. For analytic purposes, there are various data modeling approaches to save storage or speed up data processing. One popular approach to achieve storage efficiency is the dimensional model.

The following table shows dummy customer data.

usernamenamesexaddressmailbirthdate
1erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/10
2shepherdlisaMark LeeM754 Michelle Gateway Port Johnstad, ME 35695[email protected]11/10/32
3palmerpaulJennifer MarshallF869 Harrell Forges Apt. 111 East Monica, MO 01243[email protected]3/11/07
4brettmcgeeTravis WilsonM535 Lisa Flat East Andrew, ID 43332[email protected]3/22/10
5torresdianaAshley HoffmanF7815 Lauren Ranch Ambertown, FL 93225[email protected]5/14/60

The following table contains dummy order data, which is linked to the customer table via a foreign key username.

usernametransaction_dateshipping_dateitemsprice
1erin1510/11/1910/13/19104794
2erin1510/11/1910/12/1971697
3erin1510/7/1910/9/19215
4erin1510/6/1910/10/1951744
5erin1510/5/1910/10/1976346

In the dimensional model, each customer’s information is stored only one time. There is no duplicated data, even though a customer could order multiple items at various times.

The dimensional model is optimal for storage. However, it can be challenging to process data efficiently. To get a full picture of your data, you need to join the two tables together to restore the hierarchy.

For example, to find out how many items customer Mark Lee bought and his total spending in the last three months, the query needs to join the customers and orders table. See the following code:

Select c.username, o.transaction_date, o.shipping_date, sum(items), sum(price) 
from customers c inner join orders o on (c.username = o.username) 
where c.name = ‘Mark Lee’ 
and transaction_date > DATEADD(month, -3, GETDATE())
group by 1,2,3;

When there are millions of customers who might buy multiple items in each transaction, the join can be very expensive. A fast-growing dataset can be so large that you need to store it in a distributed system. To perform the join, you need to shuffle data through the network, and the cost becomes even more significant.

As storage becomes cheaper and cheaper, people are starting to use a flattened model. In this model, data is pre-joined to gain processing efficiency. The following table shows that the customer and order information is stored in one record and ready to be analyzed.

usernamenamesexaddressmailbirthdatetransaction_dateshipping_dateitemsprice
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/109/14/1910/12/1921237
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/109/16/1910/9/1984824
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/109/17/1910/10/1994392
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/109/17/1910/9/1931079
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/109/25/1910/7/191208
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/1010/2/1910/5/19103689
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/1010/5/1910/10/1976346
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/1010/6/1910/10/1951744
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/1010/7/1910/9/19215
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/1010/11/1910/13/19104794
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/1010/11/1910/12/1971697
palmerpaulJennifer MarshallF869 Harrell Forges Apt. 111 East Monica, MO 01243[email protected]3/11/079/14/199/22/1964642
palmerpaulJennifer MarshallF869 Harrell Forges Apt. 111 East Monica, MO 01243[email protected]3/11/079/17/199/21/191527
palmerpaulJennifer MarshallF869 Harrell Forges Apt. 111 East Monica, MO 01243[email protected]3/11/0710/9/1910/12/195408
torresdianaAshley HoffmanF7815 Lauren Ranch Ambertown, FL 93225[email protected]5/14/609/17/199/28/1995452

This model also works well on a distributed system. Because each row contains complete information, you can process it on any node, and don’t need to shuffle data. You can also use the columnar format to store data, which allows the query engine to read only the needed columns instead of the whole row. This technique improves analytics performance and is storage efficient.

Both models have their pros and cons. The dimensional model trades compute power for storage efficiency, and the flattened model trades storage for processing efficiency.

Some new data types are available that achieve the best of both. Instead of putting child records into another table, you can nest them into the parent record and get the full information without performing a join. It effectively denormalizes the data without duplicating the parent record.

The following diagram illustrates this workflow.

You can apply this model to a schemaful hierarchy dataset. Continuing with the customer and order example, although a customer might buy multiple items, each order item contains the same type of information, such as product ID, price, and vendor.

The hierarchy is clear and consistent. You can map data to a nested structured schema, which you can store and access efficiently via SQL language.

The following table is a nested data presentation of the previous example.

usernamenamesexaddressmailbirthdatetransaction_dateshipping_dateitemsprice
erin15Sarah NewmanF795 Nancy Shoal Apt. 684 Phillipschester, MI 01979[email protected]4/24/109/14/1910/12/1921237
9/16/1910/9/1984824
9/17/1910/10/1994392
9/17/1910/9/1931079
9/25/1910/7/191208
10/2/1910/5/19103689
10/5/1910/10/1976346
10/6/1910/10/1951744
10/7/1910/9/19215
10/11/1910/13/19104794
10/11/1910/12/1971697
palmerpaulJennifer MarshallF869 Harrell Forges Apt. 111 East Monica, MO 01243[email protected]3/11/079/14/199/22/1964642
9/17/199/21/191527
10/9/1910/12/195408
torresdianaAshley HoffmanF7815 Lauren Ranch Ambertown, FL 93225[email protected]5/14/609/17/199/28/1995452

The following graph compares the storage usage for the three models (all in parquet format).

The graph shows that nested structure is as storage efficient as the dimensional model.

Using nested data types

Nested data types are structured data types for some common data patterns. Nested data types support structs, arrays, and maps.

A struct is similar to a relational table. It groups object properties together. For example, if a customer profile contains their name, address, email, and birthdate, it appears as the following schema:

customer struct<
              username:string,
              name:string,
              sex:string,
              address:string,
              mail:string,
              birthdate:string>

The data appears as the following code:

    "customer": {
        "username": "kevin35",
        "name": "Nancy Alvarez",
        "sex": "F",
        "address": "05472 Kathleen Turnpike\nNew Ashley, NV 84430",
        "mail": "[email protected]",
        "birthdate": "1961-02-05"
    }

An array stores one-to-many relationships. For example, a customer may have multiple shipping addresses or phone numbers. If a customer has several phone numbers, it appears as the following schema:

Phonenumbers array<string>

The data appears as the following code:

[‘555-5555’, ‘555-1234’]

A map is a collection of key-value pairs. You can consider it as a list of struct<key, value> elements. For example, if a customer has particular reward preferences, it appears as the following schema:

preference map<string,boolean> 

The data appears as the following code:

{one_day_delivery=true, 
 coupon=false, 
 free_shipping=true}

Nested data could have another nested data type as a member. The most common one is an array of structs. For example, an order containing multiple items could appear as the following schema:

orders array<
            struct<
              product_id:string,
              price:int,
              onsale:boolean,
              tax:int,
              weight:int,
              others:int,
              vendor:string>
            > 

You can create a complex object by combining them. For example, a customer’s online transaction appears as the following schema:

customer struct<
              username:string,
              name:string,
              sex:string,
              address:string,
              mail:string,
              birthdate:string> , 
  shipping_address array<
                      struct<
                        name:string,
                        street_address:string,
                        city:string,
                        postcode:string>
                      > , 
  creditcard string , 
  transaction_date string , 
  shipping_date string , 
  membership string , 
  preference map<string,boolean> , 
  orders array<
            struct<
              product_id:string,
              price:int,
              onsale:boolean,
              tax:int,
              weight:int,
              others:int,
              vendor:string>
            > , 
  platform string , 
  comments string 

Popular query engines such as Hive, Spark, Presto, and Redshift Spectrum support nested data types. The SQL syntax those engines support can be different. To make it straightforward and consistent, all query examples in this post use Amazon Redshift Spectrum. For more information, see Tutorial: Querying Nested Data with Amazon Redshift Spectrum.

Use cases for nested data types

Nested data types have many benefits: simplify your ETL, data modeling, and achieve the good performance. The following are some common use cases that can benefit from nested data types.

Parent-child relationship

Nested data types keep the parent-child (summary-details) relationship by storing them collocated. This often matches how you want to analyze the data. For example, to analyze customers’ purchasing habits, you may need to find the following:

  • Customers who purchase often but buy only a few items each time. They likely want an annual membership that covers the shipping cost.
  • Customers who purchase less frequently but buy many items in one transaction. They likely expect a free shipping benefit or discount.

You need support information from the orders data, such as how many items, on average, a customer buys per transaction.

To find a list of customers who order online at least once per week, with fewer than four items each time, use the following code:

with purchases as (
select co.customer.username as customer, co.transaction_date as transaction_date, co.customer.address as address,
  (select count(*) from co.orders) as total_items, 
  (select sum(case when onsale = true then 1 else 0 end) from co.orders) as items_onsale
from demo.customer_order_nested_parq co )
select customer, count(transaction_date) as tran_cnt, avg(total_items) 
from purchases 
where total_items <= 3 and items_onsale > 0 
      and transaction_date >= '2019-09-01' 
group by 1 having tran_cnt >= 4;

With the nested order details, per item information is already grouped by customer per transaction. Children aggregation is straightforward; you can aggregate order details to categorize a customer. If you use a denormalized table, you have to do GROUP BY two times. The query could also take longer. See the following code:

with purchases as (
select cc_username as customer, transaction_date, cc_address as address,
  count(*) as total_items, 
  sum(case when co_onsale = true then 1 else 0 end) as items_onsale
from demo.customer_order_flatten_parq 
group by 1,2,3)
select customer, count(transaction_date) as tran_cnt, avg(total_items) 
from purchases 
where total_items <= 3 and items_onsale > 0 
      and transaction_date > '2019-09-01'
group by 1 having tran_cnt >= 4;

To find customers who order only once per quarter with at least 10 items and high total spending, use the following code:

with purchases as (
select co.customer.username as customer, co.transaction_date as transaction_date, 
  (select count(*) from co.orders) as total_items, 
  (select sum(price) from co.orders) as total_spending
from demo.customer_order_nested_parq co )
select customer, count(transaction_date) as tran_cnt, avg(total_spending) from purchases 
where total_items >= 10 and total_spending > 5000 and transaction_date > '2019-07-01' transaction_date < '2019-09-30'
group by 1 having tran_cnt < 2
order by 3 desc;

Another benefit of using nested data types for parent-child data analysis is resource usage reduction. If there are one million customer transactions, there could be over five times the item orders. For example, to find each day how many goods ship to Michigan, use the following code:

select co.shipping_date, sum(coo.weight)
from demo.customer_order_nested_parq co, co.orders coo
where co.customer.address like '%MI 012__'
group by 1
order by 1;

Assuming that 3% of customers ship orders to Michigan, after filtering the customer data, there could be approximately 3% of matching transactions. You only need to process 150 thousand item orders instead of 5 million. This greatly reduces the data to process and the resources to use when compared to a flattened model.

For the parent-child use case, nested data types provide straightforward aggregation on children, more efficient filtering, group by, windowing, and storage saving.

Many-to-many relationship

Customers could buy many items from various vendors, and a vendor could sell a product to many customers. This is a many-to-many relationship.

In a dimensional model, you need three tables: a customers table, an orders table, and a transactions table. To find the top vendors who have the most customers, you need to join the three tables. See the following code:

select vendor, transaction_date, count(distinct cc.username)
from customers cc,
     transactions tt,
     orders oo
where cc.username = tt.username
and oo.transaction_id = tt.transaction_id
and tt.transaction_date >= '2019-01-01' 
group by 1,2
order by 3 desc;

With nested data types, the query is similar to the one using the dimensional model. However, because the orders data is collocated with customer transactions, you can join them on-the-fly without paying the cost. See the following code:

select coo.vendor, co.transaction_date, count(distinct co.customer.username)
from demo.customer_order_nested_parq co, 
co.orders coo
where co.transaction_date > '2019-01-01'
group by 1,2
order by 3 desc;

As another example, your vendor, Smith PLC, had a big sale event on October 10, 2019. You want to find out which customers bought your product during this sale and the top customers who spent the most. To do so, use the following code:

select co.customer.username, count(coo.product_id), sum(coo.price)
from demo.customer_order_nested_parq co, co.orders coo
where co.transaction_date = '2019-10-10'
and (select count(*) from co.orders 
     where vendor = 'Smith PLC' and onsale = true) > 0
group by 1
order by 3 desc;

Compared to the dimensional model query, the nested model is two-to-three times faster. This is on a relatively small dataset with only a few million rows. For a larger dataset, the performance improvement is even greater, and with less resource usage.

Sparse and frequently changed data

Assume that you want to reward customers who order from your online store. For each transaction, the customer can choose one or more rewards, such as free shipping, one-day delivery, a discount, or a coupon. Depending on how effective a reward is, you have to frequently modify the reward types, add new ones, or remove ones that aren’t popular.

If you store the data in a flattened model, there are two common options to track this data. The first method is creating a table with one column for each type of reward. You have to think of all possible rewards at the outset and create those columns. This could lead to a wide table and very sparse data. Alternatively, you can modify your table schema when you want to add or remove a reward type. That adds more maintenance work and you may lose history data. The following table demonstrates this method (all transaction_id data in below table examples are faked one).

transaction_idfree_shippingone_day_deliverydiscountcoupon
pklein35966659391853535FALSETRUETRUE
rebeccawiliams228880139768961FALSETRUE
brooke39180013629693040TRUEFALSETRUETRUE
jchapman4283556333561927FALSETRUEFALSEFALSE
mariamartin3515336516983566FALSEFALSETRUE

The second option is storing one reward per row. This avoids the wide table issue and the burden of constantly updating the schema. The approach is suitable if you only need to analyze a single reward. If you want to see whether there is any correlation between rewards, such as if more customers prefer free shipping and one-day delivery more than a discount and coupon, this option is more complicated. This model also needs more storage. The following table demonstrates this method.

transaction_idrewordtypevalue
pklein35966659391853535free_shippingFALSE
pklein35966659391853535one_day_deliveryTRUE
pklein35966659391853535couponTRUE
rebeccawiliams228880139768961one_day_deliveryFALSE
rebeccawiliams228880139768961couponTRUE
brooke39180013629693040free_shippingTRUE
brooke39180013629693040one_day_deliveryFALSE
brooke39180013629693040discountTRUE
brooke39180013629693040couponTRUE

A compromise is to use a JSON string to store selected rewards together in one column, which avoids schema change. See the following code:

preference varchar(65535)	

The following table shows how the data is stored in JSON string:

transaction_idpreference
pklein35966659391853535{“coupon”:true, “free_shipping”:false,”one_day_delivery”:true}
rebeccawiliams228880139768961{“coupon”:true, one_day_delivery”:false}
brooke39180013629693040{“coupon”:true, “discount”:true, “free_shipping”:true,”one_day_delivery”:false}
jchapman4283556333561927{“coupon”:false, “discount”:false, “free_shipping”:false, “one_day_delivery”:true}
mariamartin3515336516983566{“discount”:true, “free_shipping”:false,”one_day_delivery”:false}

You can analyze it by using a JSON function to extract the reward data. See the following code:

select correlation, count(username) from (
select username,
(case when 
    (json_extract_path_text(preference,'free_shipping')  = 'true' and  
     json_extract_path_text(preference,'one_day_delivery')  = 'true') 
     then 1
 when 
    (json_extract_path_text(preference,'discount') = 'true' and  
     json_extract_path_text(preference,'coupon')  = 'true') 
     then 2
else 0 
 end) as correlation
from demo.transactions
  )
group by 1;

This solution is acceptable, but you could be more storage efficient and more performant by using the nested data type map. See the following code:

preference map<string, boolean>

The following table shows how the data is stored in map:

transaction_idpreference
pklein35966659391853535{coupon=true, free_shipping=false,one_day_delivery=true}
rebeccawiliams228880139768961{coupon=true, one_day_delivery=false}
brooke39180013629693040{coupon=true, discount=true, free_shipping=true,one_day_delivery=false}
jchapman4283556333561927{coupon=false, discount=false, free_shipping=false, one_day_delivery=true}
mariamartin3515336516983566{discount=true, free_shipping=false,one_day_delivery=false}

 

You can analyze a single reward or multiple rewards using SQL. For example, to find how many customers prefer free shipping, use the following code:

select count(distinct co.customer.username)
from demo.customer_order_nested_parq co, co.preference cm
where cm.key = 'free_shipping' and cm.value = true;

To find how many customers prefer free shipping and one-day delivery more than a coupon or discount, use the following code:

with customer_rewards as (
select co.customer.username as customer, 
 (select count(*) from co.preference cm 
where cm.key = 'free_shipping' and cm.value = true) as shipping_pref,
 (select count(*) from co.preference cm 
where cm.key = 'one_day_delivery' and cm.value = true) as delivery_pref,
 (select count(*) from co.preference cm 
where cm.key = 'coupon' and cm.value = true) as coupon_pref,
 (select count(*) from co.preference cm 
where cm.key = 'discount' and cm.value = true) as discount_pref
from demo.customer_order_nested_parq co;
select case when shipping_pref > 0 and delivery_pref > 0 then 1
            when coupon_pref > 0 and discount_pref > 20 then 2
            else 0
       end as correlation, count(customer)
from customer_rewards
group by 1;

The map type allows you to add any key-value pair. You can add a new reward type at any time without a schema change, and you can analyze the new reward right away.

The main advantage of the map type is that it supports flexible schema and eliminates the need to update the schema frequently. However, there is not much performance benefit. If performance is your top priority, a flattened table is recommended. You can also flatten the most-often accessed columns, and use map for the less frequently accessed columns.

Limitations of nested data types

Although nested data types are useful in many use cases, they have the following limitations:

  • There is a hard limit on children size.
  • You can only append, and updating data is difficult and slow. You need to rewrite the entire nested object even if you want to modify one child attribute.
  • Processing is split at the parent record level. You may run into problems if the children data is heavily skewed.
  • The query engine may not support all types of analytics on nested data.
  • Amazon Redshift Spectrum Nested Data Limitations.

Summary

This post discussed the benefits of nested data types and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. There are many more use cases in which nested data types can be an ideal solution. Try it out and share your experiences!

 


About the Author

 Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

 

 

 

 

DBLog: A Generic Change-Data-Capture Framework

Post Syndicated from Netflix Technology Blog original https://medium.com/netflix-techblog/dblog-a-generic-change-data-capture-framework-69351fb9099b?source=rss----2615bd06b42e---4

Andreas Andreakis, Ioannis Papapanagiotou

Overview

Change-Data-Capture (CDC) allows capturing committed changes from a database in real-time and propagating those changes to downstream consumers [1][2]. CDC is becoming increasingly popular for use cases that require keeping multiple heterogeneous datastores in sync (like MySQL and ElasticSearch) and addresses challenges that exist with traditional techniques like dual-writes and distributed transactions [3][4].

In databases like MySQL and PostgreSQL, transaction logs are the source of CDC events. As transaction logs typically have limited retention, they aren’t guaranteed to contain the full history of changes. Therefore, dumps are needed to capture the full state of a source. There are several open source CDC projects, often using the same underlying libraries, database APIs, and protocols. Nonetheless, we found a number of limitations that could not satisfy our requirements e.g. stalling the processing of log events until a dump is complete, missing ability to trigger dumps on demand, or implementations that block write traffic by using table locks.

This motivated the development of DBLog, which offers log and dump processing under a generic framework. In order to be supported, a database is required to fulfill a set of features that are commonly available in systems like MySQL, PostgreSQL, MariaDB, and others.

Some of DBLog’s features are:

  • Processes captured log events in-order.
  • Dumps can be taken any time, across all tables, for a specific table or specific primary keys of a table.
  • Interleaves log with dump events, by taking dumps in chunks. This way log processing can progress alongside dump processing. If the process is terminated, it can resume after the last completed chunk without needing to start from scratch. This also allows dumps to be throttled and paused if needed.
  • No locks on tables are ever acquired, which prevent impacting write traffic on the source database.
  • Supports any kind of output, so that the output can be a stream, datastore, or even an API.
  • Designed with High Availability in mind. Hence, downstream consumers receive change events as they occur on a source.

Requirements

In a previous blog post, we discussed Delta, a data enrichment and synchronization platform. The goal of Delta is to keep multiple datastores in sync, where one store is the source of truth (like MySQL) and others are derived stores (like ElasticSearch). One of the key requirements is to have low propagation delays from the source of truth to the destinations and that the flow of events is highly available. These conditions apply regardless if multiple datastores are used by the same team, or if one team is owning data which another team is consuming. In our Delta blog post, we also described use cases beyond data synchronization, such as event processing.

For data synchronization and event processing use cases, we need to fulfill the following requirements, beyond the ability to capture changes in real-time:

  • Capturing the full state. Derived stores (like ElasticSearch) must eventually store the full state of the source. We provide this via dumps from the source database.
  • Triggering repairs at any time. Instead of treating dumps as a one-time setup activity, we aim to enable them at any time: across all tables, on a specific table, or for specific primary keys. This is crucial for repairs downstream when data has been lost or corrupted.
  • Providing high availability for real-time events. The propagation of real-time changes has high availability requirements; it is undesired if the flow of events stops for a longer duration of time (such as minutes or longer). This requirement needs to be fulfilled even when repairs are in progress so that they don’t stall real-time events. We want real-time and dump events to be interleaved so that both make progress.
  • Minimizing database impact. When connecting to a database, it is important to ensure that it is impacted as little as possible in terms of its bandwidth and ability to serve reads and writes for applications. For this reason, it is preferred to avoid using APIs which can block write traffic such as locks on tables. In addition to that, controls must be put in place which allow throttling of log and dump processing, or to pause the processing if needed.
  • Writing events to any output. For streaming technology, Netflix utilizes a variety of options such as Kafka, SQS, Kinesis, and even Netflix specific streaming solutions such as Keystone. Even though having a stream as an output can be a good choice (like when having multiple consumers), it is not always an ideal choice (as if there is only one consumer). We want to provide the ability to directly write to a destination without passing through a stream. The destination may be a datastore or an external API.
  • Supporting Relational Databases. There are services at Netflix that use RDBMS kind of databases such as MySQL or PostgreSQL via AWS RDS. We want to support these systems as a source so that they can provide their data for further consumption.

Existing Solutions

We evaluated a series of existing Open Source offerings, including: Maxwell, SpinalTap, Yelp’s MySQL Streamer, and Debezium. Existing solutions are similar in regard to capturing real-time changes that originate from a transaction log. For example by using MySQL’s binlog replication protocol, or PostgreSQL’s replication slots.

In terms of dump processing, we found that existing solutions have at least one of the following limitations:

  • Stopping log event processing while processing a dump. This limitation applies if log events are not processed while a dump is in progress. As a consequence, if a dump has a large volume, log event processing stalls for an extended period of time. This is an issue when downstream consumers rely on short propagation delays of real-time changes.
  • Missing ability to trigger dumps on demand. Most solutions execute a dump initially during a bootstrap phase or if data loss is detected at the transaction logs. However, the ability to trigger dumps on demand is crucial for bootstrapping new consumers downstream (like a new ElasticSearch index) or for repairs in case of data loss.
  • Blocking write traffic by locking tables. Some solutions use locks on tables to coordinate the dump processing. Depending on the implementation and database, the duration of locking can either be brief or can last throughout the whole dump process [5]. In the latter case, write traffic is blocked until the dump completes. In some cases, a dedicated read replica can be configured in order to avoid impacting writes on the master. However, this strategy does not work for all databases. For example in PostgreSQL RDS, changes can only be captured from the master.
  • Using proprietary database features. We found that some solutions use advanced database features that are not transferable to other systems, such as: using MySQL’s blackhole engine or getting a consistent snapshot for dumps from the creation of a PostgreSQL replication slot. This prevents code reuse across databases.

Ultimately, we decided to implement a different approach to handle dumps. One which:

  • interleaves log with dump events so that both can make progress
  • allows to trigger dumps at any time
  • does not use table locks
  • uses standardized database features

DBLog Framework

DBLog is a Java-based framework, able to capture changes in real-time and to take dumps. Dumps are taken in chunks so that they interleave with real-time events and don’t stall real-time event processing for an extended period of time. Dumps can be taken any time, via a provided API. This allows downstream consumers to capture the full database state initially or at a later time for repairs.

We designed the framework to minimize database impact. Dumps can be paused and resumed as needed. This is relevant both for recovery after failure and to stop processing if the database reached a bottleneck. We also don’t take locks on tables in order not to impact the application writes.

DBLog allows writing captured events to any output, even if it is another database or API. We use Zookeeper to store state related to log and dump processing, and for leader election. We have built DBLog with pluggability in mind allowing implementations to be swapped as desired (like replacing Zookeeper with something else).

The following subsections explain log and dump processing in more detail.

Log Processing

The framework requires a database to emit an event for each changed row in real-time and in a commit order. A transaction log is assumed to be the origin of those events. The database is sending them to a transport that DBLog can consume. We use the term ‘change log’ for that transport. An event can either be of type: create, update, or delete. For each event, the following needs to be provided: a log sequence number, the column state at the time of the operation, and the schema that applied at the time of the operation.

Each change is serialized into the DBLog event format and is sent to the writer so that it can be delivered to an output. Sending events to the writer is a non-blocking operation, as the writer runs in its own thread and collects events in an internal buffer. Buffered events are written to an output in-order. The framework allows to plugin a custom formatter for serializing events to a custom format. The output is a simple interface, allowing to plugin any desired destination, such as a stream, datastore or even an API.

Dump Processing

Dumps are needed as transaction logs have limited retention, which prevents their use for reconstituting a full source dataset. Dumps are taken in chunks so that they can interleave with log events, allowing both to progress. An event is generated for each selected row of a chunk and is serialized in the same format as log events. This way, a downstream consumer does not need to be concerned if events originate from the log or dumps. Both log and dump events are sent to the output via the same writer.

Dumps can be scheduled any time via an API for all tables, a specific table or for specific primary keys of a table. A dump request per table is executed in chunks of a configured size. Additionally, a delay can be configured to hold back the processing of new chunks, allowing only log event processing during that time. The chunk size and the delay allow to balance between log and dump event processing and both settings can be updated at runtime.

Chunks are selected by sorting a table in ascending primary key order and including rows, where the primary key is greater than the last primary key of the previous chunk. It is required for a database to execute this query efficiently, which typically applies for systems that implement range scans over primary keys.

Figure 1. Chunking a table with 4 columns c1-c4 and c1 as the primary key (pk). Pk column is of type integer and chunk size is 3. Chunk 2 is selected with the condition c1 > 4.

Chunks need to be taken in a way that does not stall log event processing for an extended period of time and which preserves the history of log changes so that a selected row with an older value can not override newer state from log events.

In order to achieve this, we create recognizable watermark events in the change log so that we can sequence the chunk selection. Watermarks are implemented via a table at the source database. The table is stored in a dedicated namespace so that no collisions occur with application tables. Only a single row is contained in the table which stores a UUID field. A watermark is generated by updating this row to a specific UUID. The row update results in a change event which is eventually received through the change log.

By using watermarks, dumps are taken using the following steps:

  1. Briefly pause log event processing.
  2. Generate low watermark by updating the watermark table.
  3. Run SELECT statement for the next chunk and store result-set in-memory, indexed by primary key.
  4. Generate a high watermark by updating the watermark table.
  5. Resume sending received log events to the output. Watch for the low and high watermark events in the log.
  6. Once the low watermark event is received, start removing entries from the result-set for all log event primary keys that are received after the low watermark.
  7. Once the high watermark event is received, send all remaining result-set entries to the output before processing new log events.
  8. Go to step 1 if more chunks present.

The SELECT is assumed to return state from a consistent snapshot, which represents committed changes up to a certain point in history. Or equivalently: the SELECT executed on a specific position of the change log, considering changes up to that point. Databases typically don’t expose the log position which corresponds to a select statement execution (MariaDB is an exception).

The core idea of our approach is to determine a window on the change log which guarantees to contain the SELECT. As the exact selection position is unknown, all selected rows are removed which collide with log events within that window. This ensures that the chunk selection can not override the history of log changes. The window is opened by writing the low watermark, then the selection runs, and finally, the window is closed by writing the high watermark. In order for this to work, the SELECT must read the latest state from the time of the low watermark or later (it is ok if the selection also includes writes that committed after the low watermark write and before the read).

Figures 2a and 2b are illustrating the chunk selection algorithm. We provide an example with a table that has primary keys k1 to k6. Each change log entry represents a create, update, or delete event for a primary key. In figure 2a, we showcase the watermark generation and chunk selection (steps 1 to 4). Updating the watermark table at step 2 and 4 creates two change events (magenta color) which are eventually received via the log. In figure 2b, we focus on the selected chunk rows that are removed from the result set for primary keys that appear between the watermarks (steps 5 to 7).

Figure 2a — The watermark algorithm for chunk selection (steps 1 to 4).
Figure 2b — The watermark algorithm for chunk selection (steps 5–7).

Note that a large count of log events may appear between the low and high watermark, if one or more transactions committed a large set of row changes in between. This is why our approach is briefly pausing log processing during steps 2–4 so that the watermarks are not missed. This way, log event processing can resume event-by-event afterwards, eventually discovering the watermarks, without ever needing to cache log event entries. Log processing is paused only briefly as steps 2–4 are expected to be fast: watermark updates are single write operations and the SELECT runs with a limit.

Once the high watermark is received at step 7, the non-conflicting chunk rows are handed over to the written for in-order delivery to the output. This is a non-blocking operation as the writer runs in a separate thread, allowing log processing to quickly resume after step 7. Afterwards, log event processing continues for events that occur post the high watermark.

In Figure 2c we are depicting the order of writes throughout a chunk selection, by using the same example as figures 2a and 2b. Log events that appear up to the high watermark are written first. Then, the remaining rows from the chunk result (magenta color). And finally, log events that occur after the high watermark.

Figure 2c — Order of output writes. Interleaving log with dump events.

Database support

In order to use DBLog a database needs to provide a change log from a linear history of committed changes and non-stale reads. These conditions are fulfilled by systems like MySQL, PostgreSQL, MariaDB, etc. so that the framework can be used uniformly across these kind of databases.

So far, we added support for MySQL and PostgreSQL. Integrating log events required using different libraries as each database uses a proprietary protocol. For MySQL, we use shyiko/mysql-binlog-connector which implementing the binlog replication protocol in order to receive events from a MySQL host. For PostgreSQL, we are using replication slots with the wal2json plugin. Changes are received via the streaming replication protocol which is implemented by the PostgreSQL jdbc driver. Determining the schema per captured change varies between MySQL and PostgreSQL. In PostgreSQL, wal2json contains the column names and types alongside with the column values. For MySQL schema changes must be tracked which are received as binlog events.

Dump processing was integrated by using SQL and JDBC, only requiring to implement the chunk selection and watermark update. The same code is used for MySQL and PostgreSQL and can be used for other similar databases as well. The dump processing itself has no dependency on SQL or JDBC and allows to integrate databases which fulfill the DBLog framework requirements even if they use different standards.

Figure 3 — DBLog High Level Architecture.

High Availability

DBLog uses active-passive architecture. One instance is active and the others are passive standbys. We leverage Zookeeper for leader election to determine the active instance. The leadership is a lease and is lost if it is not refreshed in time, allowing another instance to take over. We currently deploy one instance per AZ (typically we have 3 AZs), so that if one AZ goes down, an instance in another AZ can continue processing with minimal overall downtime. Passive instances across regions are also possible, though it is recommended to operate in the same region as the database host in order to keep the change capture latencies low.

Production usage

DBLog is the foundation of the MySQL and PostgreSQL Connectors at Netflix, which are used in Delta. Delta is used in production since 2018 for datastore synchronization and event processing use cases in Netflix studio applications. On top of DBLog, the Delta Connectors are using a custom event serializer, so that the Delta event format is used when writing events to an output. Netflix specific streams are used as outputs such as Keystone.

Figure 4— Delta Connector.

Beyond Delta, DBLog is also used to build Connectors for other Netflix data movement platforms, which have their own data formats.

Stay Tuned

DBLog has additional capabilities which are not covered by this blog post, such as:

  • Ability to capture table schemas without using locks.
  • Schema store integration. Storing the schema of each event that is sent to an output and having a reference in the payload of each event to the schema store.
  • Monotonic writes mode. Ensuring that once the state has been written for a specific row, a less recent state can not be written afterward. This way downstream consumers experience state transitions only in a forward direction, without going back-and-forth in time.

We are planning to open source DBLog in 2020 and include additional documentation.

Credits

We would like to thank the following persons for contributing to the development of DBLog: Josh Snyder, Raghuram Onti Srinivasan, Tharanga Gamaethige, and Yun Wang.

References

[1] Das, Shirshanka, et al. “All aboard the Databus!: Linkedin’s scalable consistent change data capture platform.” Proceedings of the Third ACM Symposium on Cloud Computing. ACM, 2012

[2] “About Change Data Capture (SQL Server)”, Microsoft SQL docs, 2019

[3] Kleppmann, Martin, “Using logs to build a solid data infrastructure (or: why dual writes are a bad idea)“, Confluent, 2015

[4] Kleppmann, Martin, Alastair R. Beresford, and Boerge Svingen. “Online event processing.” Communications of the ACM 62.5 (2019): 43–49

[5] https://debezium.io/documentation/reference/0.10/connectors/mysql.html#snapshots


DBLog: A Generic Change-Data-Capture Framework was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2

Post Syndicated from Asim Kumar Sasmal original https://aws.amazon.com/blogs/big-data/etl-and-elt-design-patterns-for-lake-house-architecture-using-amazon-redshift-part-2/

Part 1 of this multi-post series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 1, discussed common customer use cases and design best practices for building ELT and ETL data processing pipelines for data lake architecture using Amazon Redshift Spectrum, Concurrency Scaling, and recent support for data lake export.

This post shows you how to get started with a step-by-step walkthrough of a few ETL and ELT design patterns of Amazon Redshift using AWS sample datasets.

Prerequisites

Before getting started, make sure that you meet the following prerequisites:

  1. This post uses two publicly available AWS sample datasets from the US-West-2 (Oregon) Region. Use the US-West-2 (Oregon) Region for your test run to reduce cross-region network latency and cost due to the data movement.
  2. You have an AWS account in the same Region.
  3. You have the AdministratorAccess policy granted to your AWS account (for production, you should restrict this further).
  4. You have an existing Amazon S3 bucket named eltblogpost in your data lake to store unloaded data from Amazon Redshift. Because bucket names are unique across AWS accounts, replace eltblogpost with your unique bucket name as applicable in the sample code provided.
  5. You have AWS CLI installed and configured to use with your AWS account.
  6. You have an IAM policy named redshift-elt-test-s3-policy with the following read and write permissions for the Amazon S3 bucket named eltblogpost:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:ListMultipartUploadParts",
                    "s3:AbortMultipartUpload",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::eltblogpost",
                    "arn:aws:s3:::eltblogpost/*"
                ],
                "Effect": "Allow"
            }
        ]
    }

  7. You have an IAM policy named redshift-elt-test-sampledata-s3-read-policy with read only permissions for the Amazon S3 bucket named awssampledbuswest2, hosting the sample data used for this walkthrough.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": [
                    "arn:aws:s3:::awssampledbuswest2",
                    "arn:aws:s3:::awssampledbuswest2/*"
                ]
            }
        ]
    }

  8. You have an IAM role named redshift-elt-test-role that has a trust relationship with redshift.amazonaws.com and glue.amazonaws.com and the following IAM policies (for production, you should restrict this further as needed):
    • redshift-elt-test-s3-policy
    • redshift-elt-test-sampledata-s3-read-policy
    • AWSGlueServiceRole
    • AWSGlueConsoleFullAccess
  9. Make a note of the ARN for redshift-elt-test-role IAM role.
  10. You have an existing Amazon Redshift cluster with the following parameters:
    • Cluster name as rseltblogpost.
    • Database name as rselttest.
    • Four dc2.large nodes.
    • An associated IAM role named redshift-elt-test-role.
    • A publicly available endpoint.
    • A cluster parameter group named eltblogpost-parameter-group, which you use to change the Concurrency Scaling
    • Cluster workload management set to manual.
  11. You have SQL Workbench/J (or another tool of your choice) and can connect successfully to the cluster.
  12. You have an EC2 instance in the same Region with PostgreSQL client CLI (psql) and can connect successfully to the cluster.
  13. You have an AWS Glue catalog database named eltblogpost as the metadata catalog for Amazon Athena and Redshift Spectrum queries.

Loading data to Amazon Redshift local storage

This post uses the Star Schema Benchmark (SSB) dataset. It is provided publicly in an S3 bucket, which any authenticated AWS user with access to Amazon S3 can access.

To load data to Amazon Redshift local storage, complete the following steps:

  1. Connect to the cluster from the SQL Workbench/J.
  2. Execute the CREATE TABLE statements from the Github repo from your SQL Workbench/J to create tables from the SSB dataset. The following diagram shows the list of tables.
  3. Execute the COPY statements from the Github repo. This step loads data into the tables you created using the sample data available in s3://awssampledbuswest2/ssbgz/. Remember to replace your IAM role ARN noted previously.
  4. To verify that each table loaded correctly, run the following commands:
    select count(*) from LINEORDER; 
    select count(*) from PART;
    select count(*) from CUSTOMER;
    select count(*) from SUPPLIER;
    select count(*) from DWDATE;

    The following results table shows the number of rows for each table in the SSB dataset:

    Table Name    Record Count
    LINEORDER     600,037,902
    PART            1,400,000
    CUSTOMER        3,000,000
    SUPPLIER        1,000,000
    DWDATE              2,556

In addition to the record counts, you can also check for a few sample records from each table.

Performing ELT and ETL using Amazon Redshift and unload to S3

The following are the high-level steps for this walkthrough:

  1. You are looking to pre-aggregate some commonly asked measures by your end-users on the point of sales (POS) data you loaded in Amazon Redshift local storage.
  2. You want to then unload the aggregated data from Amazon Redshift to your data lake (S3) in an open and analytics optimized and compressed Parquet file format. You also want to consider an optimized partitioning for the unloaded data in your data lake to help with the end-user query performance and eventually lower cost.
  3. You want to query the unloaded data from your data lake with Redshift Spectrum. You also want to share the data with other AWS services such as Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets (such as ERP, finance, or third-party data) stored in your data lake, and Amazon SageMaker for machine learning.

Complete the following steps:

  1. To compute the necessary pre-aggregates, execute the following three ELT queries available on the Github repo from your SQL Workbench/J:
    • ELT Query 1 – This query summarizes the revenue by manufacturer, category, and brand per month per year per supplier region.
    • ELT Query 2 – This query summarizes the revenue by brand per month per year by supplier region and city.
    • ELT Query 3 – This query drills down in time by customer city, supplier city, month, and year.
  2. To unload the aggregated data to S3 with Parquet file format and proper partitioning to help with the access patterns of the unloaded data in the data lake, execute the three UNLOAD queries available on the Github repo from your SQL Workbench/J. To use Redshift Spectrum for querying the unloaded data, you need the following:
    • An Amazon Redshift cluster and a SQL client (SQL Workbench/J or another tool of your choice) that can connect to your cluster and execute SQL commands. The cluster and the data files in S3 must be in the same Region.
    • An external schema in Amazon Redshift that references a database in the external data catalog and provides the IAM role ARN that authorizes your cluster to access S3 on your behalf. It is a best practice to have an external data catalog in AWS Glue.You are now ready to create an AWS Glue crawler.
  3. From AWS CLI, run the following code (replace <Your AWS Account>):
    aws glue create-crawler --cli-input-json file://mycrawler.json --region us-west-2
    Where the file mycrawler.json contains:
    {
        "Name": "eltblogpost_redshift_spectrum_etl_elt_glue_crawler",
        "Role": "arn:aws:iam::<Your AWS Account>:role/redshift-elt-test-role",
        "DatabaseName": "eltblogpost",
        "Description": "",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_manufacturer_category_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_city_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/yearly_revenue_by_city"
                }
            ]
        }
    }

    You can also schedule the crawler to run periodically based on your use case. For example, you can schedule the crawler every 35 minutes to keep the AWS Glue catalog tables up to date with the data being unloaded every 30 minutes. However, this post does not configure any scheduling.

  4. After you create the AWS Glue crawler, run it manually from AWS CLI with the following command:
    aws glue start-crawler --name "eltblogpost_redshift_spectrum_etl_elt_glue_crawler" --region us-west-2
  5. When the AWS Glue crawler run is complete, go to the AWS Glue console to see the following three AWS Glue catalog tables under the database eltblogpost:
    • monthly_revenue_by_region_manufacturer_category_brand
    • monthly_revenue_by_region_city_brand
    • yearly_revenue_by_city
  6. Now that you have an external data catalog in AWS Glue named etlblogpost, create an external schema in the persistent cluster named eltblogpost using the following SQL from your SQL Workbench/J (replace <Your AWS Account>):
    create external schema spectrum_eltblogpost 
    from data catalog 
    database 'eltblogpost' 
    iam_role 'arn:aws:iam::<Your AWS Account>:role/redshift-elt-test-role'
    create external database if not exists;

    Using Spectrum, you can now query the three AWS Glue catalog tables you set up earlier.

  7. Go to SQL Workbench/J and run the following sample queries:
    • Top 10 brands by category and manufacturer contributing to revenue for the region AFRICA for the year of 1992 and month of March:
      SELECT brand, category, manufacturer, revenue 
      from "spectrum_eltblogpost"."monthly_revenue_by_region_manufacturer_category_brand"
      where year = '1992'
      and month = 'March' 
      and supplier_region = 'AFRICA'
      order by revenue desc
      limit 10;
      
      brand | category | manufacturer | revenue
      ----------+----------+--------------+-----------
      MFGR#1313 | MFGR#13 | MFGR#1 | 5170356068
      MFGR#5325 | MFGR#53 | MFGR#5 | 5106463527
      MFGR#3428 | MFGR#34 | MFGR#3 | 5055551376
      MFGR#2425 | MFGR#24 | MFGR#2 | 5046250790
      MFGR#4126 | MFGR#41 | MFGR#4 | 5037843130
      MFGR#219 | MFGR#21 | MFGR#2 | 5018018040
      MFGR#159 | MFGR#15 | MFGR#1 | 5009626205
      MFGR#5112 | MFGR#51 | MFGR#5 | 4994133558
      MFGR#5534 | MFGR#55 | MFGR#5 | 4984369900
      MFGR#5332 | MFGR#53 | MFGR#5 | 4980619214

    • Monthly revenue for the region AMERICA for the year 1995 across all brands:
      SELECT month, sum(revenue) revenue
      FROM "spectrum_eltblogpost"."monthly_revenue_by_region_city_brand"
      where year = '1992'
      and supplier_region = 'AMERICA'
      group by month;
      
      month | revenue
      ----------+--------------
      April | 4347703599195
      January | 4482598782080
      September | 4332911671240
      December | 4489411782480
      May | 4479764212732
      August | 4485519151803
      October | 4493509053843
      June | 4339267242387
      March | 4477659286311
      February | 4197523905580
      November | 4337368695526
      July | 4492092583189

    • Yearly revenue for supplier city ETHIOPIA 4 for the years 1992–1995 and month of December:
      SELECT year, supplier_city, sum(revenue) revenue
      FROM "spectrum_eltblogpost"."yearly_revenue_by_city"
      where supplier_city in ('ETHIOPIA 4')
      and year between '1992' and '1995'
      and month = 'December'
      group by year, supplier_city
      order by year, supplier_city;
      
      year | supplier_city | revenue
      -----+---------------+------------
      1992 | ETHIOPIA 4 | 91006583025
      1993 | ETHIOPIA 4 | 90617597590
      1994 | ETHIOPIA 4 | 92015649529
      1995 | ETHIOPIA 4 | 89732644163

When the data is in S3 and cataloged in the AWS Glue catalog, you can query the same catalog tables using Athena, AWS Glue, Amazon EMR, Amazon SageMaker, Amazon QuickSight, and many more AWS services that have seamless integration with S3.

Accelerating ELT and ETL using Redshift Spectrum and unload to S3

Assume that you need to pre-aggregate a set of commonly requested metrics from your end-users on a large dataset stored in the data lake (S3) cold storage using familiar SQL and unload the aggregated metrics in your data lake for downstream consumption.

The following are the high-level steps for this walkthrough:

  1. This is a batch workload that requires standard SQL joins and aggregations on a fairly large volume of relational and structured data. You want to use the power of Redshift Spectrum to perform the required SQL transformations on the data stored in S3 and unload the transformed results back to S3.
  2. You want to query the unloaded data from your data lake using Redshift Spectrum if you have an existing cluster, Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets in your data lake, and Amazon SageMaker for machine learning.

Because Redshift Spectrum allows you to query the data directly from your data lake without needing to load into Amazon Redshift local storage, you can spin up a short-lived cluster to perform ELT at a massive scale using Redshift Spectrum and terminate the cluster when the work is complete. You can automate spinning up and terminating the short-lived cluster using AWS CloudFormation. That way, you only pay for the few minutes or hours of use. A short-lived cluster can also avoid overloading the current persistent cluster serving interactive queries from live users. For this post, use your existing cluster rseltblogpost.

This post uses a publicly available sample dataset named tickit provided by AWS,  which any authenticated AWS user with access to S3 can access:

  • Sales – s3://awssampledbuswest2/tickit/spectrum/sales/
  • Event – s3://awssampledbuswest2/tickit/allevents_pipe.txt
  • Date – s3://awssampledbuswest2/tickit/date2008_pipe.txt
  • Users – s3://awssampledbuswest2/tickit/allusers_pipe.txt

It is a best practice of Redshift Spectrum for performance reasons to load the dimension tables in the local storage of your short-lived cluster and use an external table for the fact table Sales.

Complete the following steps:

  1. Connect to the cluster from the SQL Workbench/J.To use Redshift Spectrum for querying data from data lake (S3), you need to have the following:
    • An Amazon Redshift cluster and a SQL client (SQL Workbench/J or another tool of your choice) that can connect to your cluster and execute SQL commands. The cluster and the data files in S3 must be in the same Region.
    • An external schema in Amazon Redshift that references a database in the external data catalog and provides the IAM role ARN that authorizes your cluster to access S3 on your behalf. It is a best practice to have an external data catalog in AWS Glue.
    • An AWS Glue catalog database named eltblogpost that you already created.
    • An external schema in your Redshift cluster named spectrum_eltblogpost that you already created.
  2. Execute the SQL available on the Github repo to create an external table named sales in the same external schema named spectrum_eltblogpost. As shown in the previous section, you can also use an AWS Glue crawler to create the external table.
  3. Execute the SQLs available on the Github repo to create the dimension tables to load the data into Amazon Redshift local storage for Redshift Spectrum performance best practices.
  4. Execute the COPY statements available on the Github repo to load the dimension tables using the sample data available in s3://awssampledbuswest2/tickit/. Replace the IAM role ARN with the IAM role ARN you noted earlier, which is associated with your cluster.
  5. To verify that each table has the correct record count, execute the following commands:
    select count(*) from date;
    select count(*) from users;
    select count(*) from event;
    select count(*) from spectrum_eltblogpost.sales;

    The following results table shows the number of rows for each table in the tickit dataset:

    Table Name                    Record Count
    DATE                           365
    USERS                       49,990
    EVENT                        8,798
    spectrum_eltblogpost.sales 172,456

    In addition to the record counts, you can also check for a few sample records from each table.

  6. To compute the necessary pre-aggregates, execute the following three ELT queries available on the Github repo from your SQL Workbench/J:
    • ELT Query 1 – Total quantity sold on a given calendar date.
    • ELT Query 2 – Total quantity sold to each buyer.
    • ELT Query 3 – Events in the 99.9 percentile in terms of all-time gross sales.
  7. To unload the aggregated data to S3 with Parquet file format and proper partitioning to help with the access patterns of the unloaded data in the data lake, execute the three UNLOAD queries available on the Github repo from your SQL Workbench/J.
  8. To use Redshift Spectrum for querying the unloaded data, you can either create a new AWS Glue crawler or modify the previous crawler named eltblogpost_redshift_spectrum_etl_elt_glue_crawler. Update the existing crawler using the following code from AWS CLI (replace <Your AWS Account>):
    aws glue update-crawler --cli-input-json file://mycrawler.json --region us-west-2
    Where the file mycrawler.json contains:
    {
        "Name": "eltblogpost_redshift_spectrum_etl_elt_glue_crawler",
        "Role": "arn:aws:iam::<Your AWS Account>:role/redshift-elt-test-role",
        "DatabaseName": "eltblogpost",
        "Description": "",
        "Targets": {
            "S3Targets": [
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_manufacturer_category_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/monthly_revenue_by_region_city_brand"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/yearly_revenue_by_city"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/total_quantity_sold_by_date"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/total_quantity_sold_by_buyer_by_date"
                },
                {
                    "Path": "s3://eltblogpost/unload_parquet/total_price_by_eventname"
                }
            ]
        }
    }

  9. After you create the crawler successfully, run it manually from AWS CLI with the following command:
    aws glue start-crawler --name "eltblogpost_redshift_spectrum_etl_elt_glue_crawler" --region us-west-2
  10. When the crawler run is complete, go to the AWS Glue console. The following additional catalog tables are in the catalog database eltblogpost:
    • total_quantity_sold_by_date
    • total_quantity_sold_by_buyer_by_date
    • total_price_by_eventname
  11. Using Spectrum, you can now query the three preceding catalog tables. Go to SQL Workbench/J and run the following sample queries:
      • Top 10 days for quantities sold in February and March 2008:
        SELECT caldate, total_quantity
        FROM "spectrum_eltblogpost"."total_quantity_sold_by_date"
        where caldate between '2008-02-01' and '2008-03-30'
        order by total_quantity desc
        limit 10;
        
        caldate | total_quantity
        -----------+---------------
        2008-02-20 | 1170
        2008-02-25 | 1146
        2008-02-19 | 1145
        2008-02-24 | 1141
        2008-03-26 | 1138
        2008-03-22 | 1136
        2008-03-17 | 1129
        2008-03-08 | 1129
        2008-02-16 | 1127
        2008-03-23 | 1121

      • Top 10 buyers for quantities sold in February and March 2008:
        SELECT firstname,lastname,total_quantity
        FROM "spectrum_eltblogpost"."total_quantity_sold_by_buyer_by_date"
        where caldate between '2008-02-01' and '2008-03-31'
        order by total_quantity desc
        limit 10;
        
        firstname | lastname | total_quantity
        ----------+------------+---------------
        Laurel | Clay | 9
        Carolyn | Valentine | 8
        Amelia | Osborne | 8
        Kai | Gill | 8
        Gannon | Summers | 8
        Ignacia | Nichols | 8
        Ahmed | Mcclain | 8
        Amanda | Mccullough | 8
        Blair | Medina | 8
        Hadley | Bennett | 8

      • Top 10 event names for total price:
        SELECT eventname, total_price
        FROM "spectrum_eltblogpost"."total_price_by_eventname"
        order by total_price desc
        limit 10;
        
        eventname | total_price
        ---------------------+------------
        Adriana Lecouvreur | 51846.00
        Janet Jackson | 51049.00
        Phantom of the Opera | 50301.00
        The Little Mermaid | 49956.00
        Citizen Cope | 49823.00
        Sevendust | 48020.00
        Electra | 47883.00
        Mary Poppins | 46780.00
        Live | 46661.00

After the data is in S3 and cataloged in the AWS Glue catalog, you can query the same catalog tables using Amazon Athena, AWS Glue, Amazon EMR, Amazon SageMaker, Amazon QuickSight, and many more AWS services that have seamless integration with S3.

Scaling ELT and unload running in parallel using Concurrency Scaling

Assume that you have a mixed workload under concurrency when the UNLOAD queries and the ELT jobs run in parallel in your cluster with Concurrency Scaling turned on. When Concurrency Scaling is enabled, Amazon Redshift automatically adds additional cluster capacity when you need to process an increase in concurrent read queries including UNLOAD queries. By default, Concurrency Scaling mode is turned off for your cluster. In this post, you enable the Concurrency Scaling mode for your cluster.

Complete the following steps:

  1. Go to your cluster parameter group named eltblogpost-parameter-group and complete the following:
    • Update max_concurrency_scaling_clusters to 5.
    • Create a new queue named Queue 1 with Concurrency Scaling mode set to Auto and a query group named unload_query for the UNLOAD jobs in the next steps.
  2. After you make these changes, reboot your cluster for the changes to take effect.
  3. For this post, use psql client to connect to your cluster rseltblogpost from the EC2 instance that you set up earlier.
  4. Open an SSH session to your EC2 instance and copy the nine files as shown below from the concurrency folder in the Github repo to /home/ec2-user/eltblogpost/ in your EC2 instance.
  5. Review the concurrency-elt-unload.sh script that runs the following eight jobs in parallel:
    • An ELT script for SSB dataset, which kicks off one query at a time.
    • An ELT script for the tickit dataset, which kicks off one query at a time.
    • Three unload queries for the SSB datasets kicked off in parallel.
    • Three unload queries for the tickit datasets kicked off in parallel.
  6. Run the concurrency-elt-unload.sh While the script is running, you will see the following sample output:             The following are the response times taken by the script:
    real 2m40.245s
    user 0m0.104s
    sys 0m0.000s
  7. Run the following query to validate that some of the UNLOAD queries ran in the Concurrency Scaling clusters (look for “which_cluster = Concurrency Scaling” in the query output below):
    SELECT query,
    Substring(querytxt,1,90) query_text,
    starttime starttime_utc,
    (endtime-starttime)/(1000*1000) elapsed_time_secs,
    case when aborted= 0 then 'complete' else 'error' end status,
    case when concurrency_scaling_status = 1 then 'Concurrency Scaling' else 'Main' end which_cluster
    FROM stl_query
    WHERE database = 'rselttest'
    AND starttime between '2019-10-20 22:53:00' and '2019-10-20 22:56:00’
    AND userid=100
    AND querytxt NOT LIKE 'padb_fetch_sample%'
    AND (querytxt LIKE 'create%' or querytxt LIKE 'UNLOAD%')
    ORDER BY query DESC;

    See the following output from the query: 

  8. Comment out the following SET statement in the six UNLOAD query files (ssb-unload<1-3>.sql and tickit-unload<1-3>.sql) to force all six UNLOAD queries to run in the main cluster:
    set query_group to 'unload_query';

    In other words, disable Concurrency Scaling mode for the UNLOAD queries.

  9. Run the concurrency-elt-unload.sh script. While the script is running, you will see the following sample output:              The following are the response times taken by the script:
    real 3m40.328s
    user 0m0.104s
    sys 0m0.000s

    The following shows the Workload Management settings for the Redshift cluster: 

  10. Run the following query to validate that all the queries ran in the main cluster (look for “which_cluster = Main” in the query output below):
    SELECT query,
    Substring(querytxt,1,90) query_text,
    starttime starttime_utc,
    (endtime-starttime)/(1000*1000) elapsed_time_secs,
    case when aborted= 0 then 'complete' else 'error' end status,
    case when concurrency_scaling_status = 1 then 'Concurrency Scaling' else 'Main' end which_cluster
    FROM stl_query
    WHERE database = 'rselttest'
    AND starttime between '2019-10-20 23:19:00' and '2019-10-20 23:24:00’
    AND userid=100
    AND querytxt NOT LIKE 'padb_fetch_sample%'
    AND (querytxt LIKE 'create%' or querytxt LIKE 'UNLOAD%')
    ORDER BY query DESC;

    See the following output from the query:With Concurrency Scaling, the end-to-end runtime improved by 37.5% (60 seconds faster).

    Summary

    This post provided a step-by-step walkthrough of a few straightforward examples of the common ELT and ETL design patterns of Amazon Redshift using some key Amazon Redshift features such as Amazon Redshift Spectrum, Concurrency Scaling, and recent support for data lake export.

    As always, AWS welcomes feedback. Please submit thoughts or questions in the comments.

     


    About the Authors

    Asim Kumar Sasmal is a senior data architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe to design and build data driven solutions by providing expert technical consulting, best practices guidance, and implementation services on AWS platform. He is passionate about working backwards from customer ask, help them to think big, and dive deep to solve real business problems by leveraging the power of AWS platform.

     

    Maor Kleider is a principal product manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 1

Post Syndicated from Asim Kumar Sasmal original https://aws.amazon.com/blogs/big-data/etl-and-elt-design-patterns-for-lake-house-architecture-using-amazon-redshift-part-1/

Part 1 of this multi-post series discusses design best practices for building scalable ETL (extract, transform, load) and ELT (extract, load, transform) data processing pipelines using both primary and short-lived Amazon Redshift clusters. You also learn about related use cases for some key Amazon Redshift features such as Amazon Redshift Spectrum, Concurrency Scaling, and recent support for data lake export.

Part 2 of this series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2, shows a step-by-step walkthrough to get started using Amazon Redshift for your ETL and ELT use cases.

ETL and ELT

There are two common design patterns when moving data from source systems to a data warehouse. The primary difference between the two patterns is the point in the data-processing pipeline at which transformations happen. This also determines the set of tools used to ingest and transform the data, along with the underlying data structures, queries, and optimization engines used to analyze the data. The first pattern is ETL, which transforms the data before it is loaded into the data warehouse. The second pattern is ELT, which loads the data into the data warehouse and uses the familiar SQL semantics and power of the Massively Parallel Processing (MPP) architecture to perform the transformations within the data warehouse.

In the following diagram, the first represents ETL, in which data transformation is performed outside of the data warehouse with tools such as Apache Spark or Apache Hive on Amazon EMR or AWS Glue. This pattern allows you to select your preferred tools for data transformations. The second diagram is ELT, in which the data transformation engine is built into the data warehouse for relational and SQL workloads. This pattern is powerful because it uses the highly optimized and scalable data storage and compute power of MPP architecture.

Redshift Spectrum

Amazon Redshift is a fully managed data warehouse service on AWS. It uses a distributed, MPP, and shared nothing architecture. Redshift Spectrum is a native feature of Amazon Redshift that enables you to run the familiar SQL of Amazon Redshift with the BI application and SQL client tools you currently use against all your data stored in open file formats in your data lake (Amazon S3).

A common pattern you may follow is to run queries that span both the frequently accessed hot data stored locally in Amazon Redshift and the warm or cold data stored cost-effectively in Amazon S3, using views with no schema binding for external tables. This enables you to independently scale your compute resources and storage across your cluster and S3 for various use cases.

Redshift Spectrum supports a variety of structured and unstructured file formats such as Apache Parquet, Avro, CSV, ORC, JSON to name a few. Because the data stored in S3 is in open file formats, the same data can serve as your single source of truth and other services such as Amazon Athena, Amazon EMR, and Amazon SageMaker can access it directly from your S3 data lake.

For more information, see Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

Concurrency Scaling

Using Concurrency Scaling, Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency Scaling resources are added to your Amazon Redshift cluster transparently in seconds, as concurrency increases, to serve sudden spikes in concurrent requests with fast performance without wait time. When the workload demand subsides, Amazon Redshift automatically shuts down Concurrency Scaling resources to save you cost.

The following diagram shows how the Concurrency Scaling works at a high-level:

For more information, see New – Concurrency Scaling for Amazon Redshift – Peak Performance at All Times.

Data lake export

Amazon Redshift now supports unloading the result of a query to your data lake on S3 in Apache Parquet, an efficient open columnar storage format for analytics. The Parquet format is up to two times faster to unload and consumes up to six times less storage in S3, compared to text formats. You can also specify one or more partition columns, so that unloaded data is automatically partitioned into folders in your S3 bucket to improve query performance and lower the cost for downstream consumption of the unloaded data. For example, you can choose to unload your marketing data and partition it by year, month, and day columns. This enables your queries to take advantage of partition pruning and skip scanning of non-relevant partitions when filtered by the partitioned columns, thereby improving query performance and lowering cost. For more information, see UNLOAD.

Use cases

You may be using Amazon Redshift either partially or fully as part of your data management and data integration needs. You likely transitioned from an ETL to an ELT approach with the advent of MPP databases due to your workload being primarily relational, familiar SQL syntax, and the massive scalability of MPP architecture.

This section presents common use cases for ELT and ETL for designing data processing pipelines using Amazon Redshift.

ELT

Consider a batch data processing workload that requires standard SQL joins and aggregations on a modest amount of relational and structured data. You selected initially a Hadoop-based solution to accomplish your SQL needs. However, over time, as data continued to grow, your system didn’t scale well. You now find it difficult to meet your required performance SLA goals and often refer to ever-increasing hardware and maintenance costs. Relational MPP databases bring an advantage in terms of performance and cost, and lowers the technical barriers to process data by using familiar SQL.

Amazon Redshift has significant benefits based on its massively scalable and fully managed compute underneath to process structured and semi-structured data directly from your data lake in S3.

The following diagram shows how Redshift Spectrum allows you to simplify and accelerate your data processing pipeline from a four-step to a one-step process with the CTAS (Create Table As) command.

The preceding architecture enables seamless interoperability between your Amazon Redshift data warehouse solution and your existing data lake solution on S3 hosting other Enterprise datasets such as ERP, finance, and third-party for a variety of data integration use cases.

The following diagram shows the seamless interoperability between your Amazon Redshift and your data lake on S3:

When you use an ELT pattern, you can also use your existing ELT-optimized SQL workload while migrating from your on-premises data warehouse to Amazon Redshift. This eliminates the need to rewrite relational and complex SQL workloads into a new compute framework from scratch. With Amazon Redshift, you can load, transform, and enrich your data efficiently using familiar SQL with advanced and robust SQL support, simplicity, and seamless integration with your existing SQL tools. You also need the monitoring capabilities provided by Amazon Redshift for your clusters.

ETL

You have a requirement to unload a subset of the data from Amazon Redshift back to your data lake (S3) in an open and analytics-optimized columnar file format (Parquet). You then want to query the unloaded datasets from the data lake using Redshift Spectrum and other AWS services such as Athena for ad hoc and on-demand analysis, AWS Glue and Amazon EMR for ETL, and Amazon SageMaker for machine learning.

You have a requirement to share a single version of a set of curated metrics (computed in Amazon Redshift) across multiple business processes from the data lake. You can use ELT in Amazon Redshift to compute these metrics and then use the unload operation with optimized file format and partitioning to unload the computed metrics in the data lake.

You also have a requirement to pre-aggregate a set of commonly requested metrics from your end-users on a large dataset stored in the data lake (S3) cold storage using familiar SQL and unload the aggregated metrics in your data lake for downstream consumption. In other words, consider a batch workload that requires standard SQL joins and aggregations on a fairly large volume of relational and structured cold data stored in S3 for a short duration of time. You can use the power of Redshift Spectrum by spinning up one or many short-lived Amazon Redshift clusters that can perform the required SQL transformations on the data stored in S3, unload the transformed results back to S3 in an optimized file format, and terminate the unneeded Amazon Redshift clusters at the end of the processing. This way, you only pay for the duration in which your Amazon Redshift clusters serve your workloads.

As shown in the following diagram, once the transformed results are unloaded in S3, you then query the unloaded data from your data lake either using Redshift Spectrum if you have an existing Amazon Redshift cluster, Athena with its pay-per-use and serverless ad hoc and on-demand query model, AWS Glue and Amazon EMR for performing ETL operations on the unloaded data and data integration with your other datasets (such as ERP, finance, and third-party data) stored in your data lake, and Amazon SageMaker for machine learning.

You can also scale the unloading operation by using the Concurrency Scaling feature of Amazon Redshift. This provides a scalable and serverless option to bulk export data in an open and analytics-optimized file format using familiar SQL.

Best practices

The following recommended practices can help you to optimize your ELT and ETL workload using Amazon Redshift.

Analyze requirements to decide ELT versus ETL

MPP architecture of Amazon Redshift and its Spectrum feature is efficient and designed for high-volume relational and SQL-based ELT workload (joins, aggregations) at a massive scale. A common practice to design an efficient ELT solution using Amazon Redshift is to spend sufficient time to analyze the following:

  • Type of data from source systems (structured, semi-structured, and unstructured)
  • Nature of the transformations required (usually encompassing cleansing, enrichment, harmonization, transformations, and aggregations)
  • Row-by-row, cursor-based processing needs versus batch SQL
  • Performance SLA and scalability requirements considering the data volume growth over time
  • Cost of the solution

This helps to assess if the workload is relational and suitable for SQL at MPP scale.

Key considerations for ELT

For ELT and ELT both, it is important to build a good physical data model for better performance for all tables, including staging tables with proper data types and distribution methods. A dimensional data model (star schema) with fewer joins works best for MPP architecture including ELT-based SQL workloads. Consider using a TEMPORARY table for intermediate staging tables as feasible for the ELT process for better write performance, because temporary tables only write a single copy.

A common rule of thumb for ELT workloads is to avoid row-by-row, cursor-based processing (a commonly overlooked finding for stored procedures). This is sub-optimal because such processing needs to happen on the leader node of an MPP database like Amazon Redshift. Instead, the recommendation for such a workload is to look for an alternative distributed processing programming framework, such as Apache Spark.

Several hundreds to thousands of single record inserts, updates, and deletes for highly transactional needs are not efficient using MPP architecture. Instead, stage those records for either a bulk UPDATE or DELETE/INSERT on the table as a batch operation.

With the external table capability of Redshift Spectrum, you can optimize your transformation logic using a single SQL as opposed to loading data first in Amazon Redshift local storage for staging tables and then doing the transformations on those staging tables.

Key considerations for data lake export

When you unload data from Amazon Redshift to your data lake in S3, pay attention to data skew or processing skew in your Amazon Redshift tables. The UNLOAD command uses the parallelism of the slices in your cluster. Hence, if there is a data skew at rest or processing skew at runtime, unloaded files on S3 may have different file sizes, which impacts your UNLOAD command response time and query response time downstream for the unloaded data in your data lake.

You should also control maximum file size to approximately 100 MB or less in the UNLOAD command for better performance for downstream consumption. Similarly, for S3 partitioning, a rule of thumb is to not exceed number of partitions per table on S3 to couple of hundreds by choosing the low cardinality partitioning columns (year, quarter, month, and day are good choices) in the UNLOAD command. This avoids creating too many partitions, which in turn creates a large volume of metadata in the AWS Glue catalog, leading to high query times via Athena and Redshift Spectrum.

To get the best throughput and performance under concurrency for multiple UNLOAD commands running in parallel, create a separate queue for unload queries with Concurrency Scaling turned on. This lets Amazon Redshift burst additional Concurrency Scaling clusters as required.

Key considerations for Redshift Spectrum for ELT

To get the best performance from Redshift Spectrum, pay attention to the maximum pushdown operations possible, such as S3 scan, projection, filtering, and aggregation, in your query plans for a performance boost. This is because you want to utilize the powerful infrastructure underneath that supports Redshift Spectrum. Using predicate pushdown also avoids consuming resources in the Amazon Redshift cluster.

In addition, avoid complex operations like DISTINCT or ORDER BY on more than one column and replace them with GROUP BY as applicable. Amazon Redshift can push down a single column DISTINCT as a GROUP BY to the Spectrum compute layer with a query rewrite capability underneath, whereas multi-column DISTINCT or ORDER BY operations need to happen inside Amazon Redshift cluster.

Amazon Redshift optimizer can use external table statistics to generate more optimal execution plans. Without statistics, an execution plan is generated based on heuristics with the assumption that the S3 table is relatively large. It is recommended to set the table statistics (numRows) manually for S3 external tables.

For more information on Amazon Redshift Spectrum best practices, see Twelve Best Practices for Amazon Redshift Spectrum and How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

Summary

This post discussed the common use cases and design best practices for building ELT and ETL data processing pipelines for data lake architecture using few key features of Amazon Redshift: Spectrum, Concurrency Scaling, and the recently released support for data lake export with partitioning.

Part 2 of this series, ETL and ELT design patterns for lake house architecture using Amazon Redshift: Part 2, shows you how to get started with a step-by-step walkthrough of a few simple examples using AWS sample datasets.

As always, AWS welcomes feedback. Please submit thoughts or questions in the comments.

 


About the Authors

Asim Kumar Sasmal is a senior data architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe to design and build data driven solutions by providing expert technical consulting, best practices guidance, and implementation services on AWS platform. He is passionate about working backwards from customer ask, help them to think big, and dive deep to solve real business problems by leveraging the power of AWS platform.

 

Maor Kleider is a principal product manager for Amazon Redshift, a fast, simple and cost-effective data warehouse. Maor is passionate about collaborating with customers and partners, learning about their unique big data use cases and making their experience even better. In his spare time, Maor enjoys traveling and exploring new restaurants with his family.

New – Amazon Managed Apache Cassandra Service (MCS)

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-managed-apache-cassandra-service-mcs/

Managing databases at scale is never easy. One of the options to store, retrieve, and manage large amounts of structured data, including key-value and tabular formats, is Apache Cassandra. With Cassandra, you can use the expressive Cassandra Query Language (CQL) to build applications quickly.

However, managing large Cassandra clusters can be difficult and takes a lot of time. You need specialized expertise to set up, configure, and maintain the underlying infrastructure, and have a deep understanding of the entire application stack, including the Apache Cassandra open source software. You need to add or remove nodes manually, rebalancing partitions, and doing so while keeping your application available with the required performance. Talking with customers, we found out that they often keep their clusters scaled up for peak load because scaling down is complex. To keep your Cassandra cluster updated, you have to do it node by node. It’s hard to backup and restore a cluster if something goes wrong during an update, and you may end up skipping patches or running an outdated version.

Introducing Amazon Managed Cassandra Service
Today, we are launching in open preview Amazon Managed Apache Cassandra Service (MCS), a scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon MCS is serverless, so you pay for only the resources you use and the service automatically scales tables up and down in response to application traffic. You can build applications that serve thousands of requests per second with virtually unlimited throughput and storage.

With Amazon MCS, you can run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today. Amazon MCS implements the Apache Cassandra version 3.11 CQL API, allowing you to use the code and drivers that you already have in your applications. Updating your application is as easy as changing the endpoint to the one in the Amazon MCS service table.

Amazon MCS provides consistent single-digit-millisecond read and write performance at any scale, so you can build applications with low latency to provide a smooth user experience. You have visibility into how your application is performing using Amazon CloudWatch.

There is no limit on the size of a table or the number of items, and you do not need to provision storage. Data storage is fully managed and highly available. Your table data is replicated automatically three times across multiple AWS Availability Zones for durability.

All customer data is encrypted at rest by default. You can use encryption keys stored in AWS Key Management Service (KMS)Amazon MCS is also integrated with AWS Identity and Access Management (IAM) to help you manage access to your tables and data.

Using Amazon Managed Cassandra Service
You can use Amazon MCS with the console, CQL, or existing Apache 2.0 licensed Cassandra drivers. In the console there is a CQL editor, or you can connect using cqlsh. 

To connect using cqlsh, I need to generate service-specific credentials for an existing IAM user. This is just a command using the AWS Command Line Interface (CLI):

aws iam create-service-specific-credential --user-name USERNAME --service-name mcs.amazonaws.com

{
    "ServiceSpecificCredential": {
        "CreateDate": "2019-11-27T14:36:16Z",
        "ServiceName": "mcs.amazonaws.com",
        "ServiceUserName": "USERNAME-at-123412341234",
        "ServicePassword": "...",
        "ServiceSpecificCredentialId": "...",
        "UserName": "USERNAME",
        "Status": "Active"
    }
}

Amazon MCS only accepts secure connections using TLS.  I download the Amazon root certificate and edit the cqlshrc configuration file to use it. Now, I can connect with:

cqlsh {endpoint} {port} -u {ServiceUserName} -p {ServicePassword} --ssl

First, I create a keyspace. A keyspace contains one or more tables and defines the replication strategy for all the tables it contains. With Amazon MCS the default replication strategy for all keyspaces is the Single-region strategy. It replicates data 3 times across multiple Availability Zones in a single AWS Region.

To create a keyspace I can use the console or CQL. In the Amazon MCS console, I provide the name for the keyspace.

Similarly, I can use CQL to create the bookstore keyspace:

CREATE KEYSPACE IF NOT EXISTS bookstore WITH REPLICATION={'class': 'SingleRegionStrategy'};

Now I create a table. A table is where your data is organized and stored. Again, I can use the console or CQL. From the console, I select the bookstore keyspace and give the table a name.

Below that, I add the columns for my books table. Each row in a table is referenced by a primary key, that can be composed of one or more columns, the values of which determine which partition the data is stored in. In my case the primary key is the ISBN. Optionally, I can add clustering columns, which determine the sort order of records within a partition. I am not using clustering columns for this table.

Alternatively, using CQL, I can create the table with the following commands:

USE bookstore;

CREATE TABLE IF NOT EXISTS books
(isbn text PRIMARY KEY,
title text,
author text,
pages int,
year_of_publication int);

I now use CQL to insert a record in the books table:

INSERT INTO books (isbn, title, author, pages, year_of_publication)
VALUES ('978-0201896831',
'The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)',
'Donald E. Knuth', 672, 1997);

Let’s run a quick query. In the console, I select the books table and then Query table.

In the CQL Editor, I use the default query and select Run command.

By default, I see the result of the query in table view:

If I prefer, I can see the result in JSON format, similar to what an application using the Cassandra API would see:

To insert more records, I use csqlsh again and upload some data from a local CSV file:

COPY books (isbn, title, author, pages, year_of_publication)
FROM './books.csv' WITH delimiter=',' AND header=TRUE;

Now I look again at the content of the books table:

SELECT * FROM books;

I can select a row using a primary key, or use filtering for additional conditions. For example:

SELECT title FROM books WHERE isbn='978-1942788713';

SELECT title FROM books WHERE author='Scott Page' ALLOW FILTERING;

With Amazon MCS you can use existing Apache Cassandra 2.0–licensed drivers and developer tools. Open-source Cassandra drivers are available for Java, Python, Ruby, .NET, Node.js, PHP, C++, Perl, and Go.

You can learn more in the Amazon MCS documentation.

Available in Open Preview
Amazon MCS is available today in open preview in US East (N. Virginia), US East (Ohio), Europe (Stockholm), Asia Pacific (Singapore), Asia Pacific (Tokyo).

As we work with the Cassandra API libraries, we are contributing bug fixes to the open source Apache Cassandra project. We are also contributing back improvements such as built-in support for AWS authentication (SigV4), which simplifies managing credentials for customers running Cassandra on Amazon Elastic Compute Cloud (EC2), since EC2 and IAM can handle distribution and management of credentials using instance roles automatically. We are also announcing the funding of AWS promotional service credits for testing Cassandra-related open-source projects. To learn more about these contributions, visit the Open Source blog.

During the preview, you can use Amazon MCS with on-demand capacity. At general availability, we will also offer the option to use provisioned throughput for more predictable workloads. With on-demand capacity mode, Amazon MCS charges you based on the amount of data your applications read and write from your tables. You do not need to specify how much read and write throughput capacity to provision to your tables because Amazon MCS accommodates your workloads instantly as they scale up or down.

As part of the AWS Free Tier, you can get started with Amazon MCS for free. For the first three months, you are offered a monthly free tier of 30 million write request units, 30 million read request units, and 1 GB of storage. Your free tier starts when you create your first Amazon MCS resource.

Next year we are making it easier to migrate your data Amazon MCS, adding support to use AWS Database Migration Service.

Amazon MCS makes it easy to use Cassandra workloads at any scale, providing a simple programming interface to build new applications, or migrate existing ones. I can’t wait to see what are you going to use it for!

New for Amazon Redshift – Data Lake Export and Federated Query

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-data-lake-export-and-federated-queries/

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools.

To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets.

Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake:

  • Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics.
  • Federated Query to be able, from a Redshift cluster, to query across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases.

This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services.

Let’s explain the interactions you see in the diagram better, starting from how you can use these features, and the advantages they provide.

Using Redshift Data Lake Export

You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format.

You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker.

To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events.

Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the  sales and date tables, adding information on the gross sales for an event (total_price in the query), and the percentile in terms of all time gross sales compared to all events.

To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
2019-11-25 14:26:56 1638550 0000_part_00.parquet
2019-11-25 14:26:56 1635489 0001_part_00.parquet
2019-11-25 14:26:56 1624418 0002_part_00.parquet
2019-11-25 14:26:56 1646179 0003_part_00.parquet

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost.

To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales.

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date:

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/
2019-11-25 14:36:17 11940 0000_part_00.parquet
2019-11-25 14:36:17 11052 0001_part_00.parquet
2019-11-25 14:36:17 11138 0002_part_00.parquet
2019-11-25 14:36:18 12582 0003_part_00.parquet

Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR.

The sales data is now ready to be processed together with the unstructured and semi-structured  (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any Sagemaker built-in algorithm to access the data and get new insights.

Using Redshift Federated Query
You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available. Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse.

Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network.

Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system
FROM POSTGRES
DATABASE 'online_sales_db' SCHEMA 'online_system'
URI ‘my-hostname' port 5432
IAM_ROLE 'iam-role-arn'
SECRET_ARN 'ssm-secret-arn';

Schema and port are optional here. Schema will default to public if left unspecified and default port for PostgreSQL databases is 5432. Redshift is using AWS Secrets Manager to manage the credentials to connect to the external databases.

With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake.

Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays.

For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query:

SELECT eventid,
       sum(pricepaid) total_price,
       sum(online_pricepaid) online_total_price
  FROM sales, online_system.current_sales
 GROUP BY eventid
 WHERE eventid = online_eventid;

Redshift doesn’t import database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation.

Available Now
Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters.

The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick three tracks for maintenance: Current, Trailing, or Preview. Within the Preview track, preview_features should be chosen to participate to the Federated Query public preview. For example:

These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for!

Danilo

New for Amazon Aurora – Use Machine Learning Directly From Your Databases

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/

Machine Learning allows you to get better insights from your data. But where is most of the structured data stored? In databases! Today, in order to use machine learning with data in a relational database, you need to develop a custom application to read the data from the database and then apply the machine learning model. Developing this application requires a mix of skills to be able to interact with the database and use machine learning. This is a new application, and now you have to manage its performance, availability, and security.

Can we make it easier to apply machine learning to data in a relational database? Even for existing applications?

Starting today, Amazon Aurora is natively integrated with two AWS machine learning services:

  • Amazon SageMaker, a service providing you with the ability to build, train, and deploy custom machine learning models quickly.
  • Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights in text.

Using this new functionality, you can use a SQL function in your queries to apply a machine learning model to the data in your relational database. For example, you can detect the sentiment of a user comment using Comprehend, or apply a custom machine learning model built with SageMaker to estimate the risk of “churn” for your customers. Churn is a word mixing “change” and “turn” and is used to describe customers that stop using your services.

You can store the output of a large query including the additional information from machine learning services in a new table, or use this feature interactively in your application by just changing the SQL code run by the clients, with no machine learning experience required.

Let’s see a couple of examples of what you can do from an Aurora database, first by using Comprehend, then SageMaker.

Configuring Database Permissions
The first step is to give the database permissions to access the services you want to use: Comprehend, SageMaker, or both. In the RDS console, I create a new Aurora MySQL 5.7 database. When it is available, in the Connectivity & security tab of the regional endpoint, I look for the Manage IAM roles section.

There I connect Comprehend and SageMaker to this database cluster. For SageMaker, I need to provide the Amazon Resource Name (ARN) of the endpoint of a deployed machine learning model. If you want to use multiple endpoints, you need to repeat this step. The console takes care of creating the service roles for the Aurora database to access those services in order for the new machine learning integration to work.

Using Comprehend from Amazon Aurora
I connect to the database using a MySQL client. To run my tests, I create a table storing comments for a blogging platform and insert a few sample records:

CREATE TABLE IF NOT EXISTS comments (
       comment_id INT AUTO_INCREMENT PRIMARY KEY,
       comment_text VARCHAR(255) NOT NULL
);

INSERT INTO comments (comment_text)
VALUES ("This is very useful, thank you for writing it!");
INSERT INTO comments (comment_text)
VALUES ("Awesome, I was waiting for this feature.");
INSERT INTO comments (comment_text)
VALUES ("An interesting write up, please add more details.");
INSERT INTO comments (comment_text)
VALUES ("I don’t like how this was implemented.");

To detect the sentiment of the comments in my table, I can use the aws_comprehend_detect_sentiment and aws_comprehend_detect_sentiment_confidence SQL functions:

SELECT comment_text,
       aws_comprehend_detect_sentiment(comment_text, 'en') AS sentiment,
       aws_comprehend_detect_sentiment_confidence(comment_text, 'en') AS confidence
  FROM comments;

The aws_comprehend_detect_sentiment function returns the most probable sentiment for the input text: POSITIVE, NEGATIVE, or NEUTRAL. The aws_comprehend_detect_sentiment_confidence function returns the confidence of the sentiment detection, between 0 (not confident at all) and 1 (fully confident).

Using SageMaker Endpoints from Amazon Aurora
Similarly to what I did with Comprehend, I can access a SageMaker endpoint to enrich the information stored in my database. To see a practical use case, let’s implement the customer churn example mentioned at the beginning of this post.

Mobile phone operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct a machine learning model. As input for the model, we’re looking at the current subscription plan, how much the customer is speaking on the phone at different times of day, and how often has called customer service.

Here’s the structure of my customer table:

SHOW COLUMNS FROM customers;

To be able to identify customers at risk of churn, I train a model following this sample SageMaker notebook using the XGBoost algorithm. When the model has been created, it’s deployed to a hosted endpoint.

When the SageMaker endpoint is in service, I go back to the Manage IAM roles section of the console to give the Aurora database permissions to access the endpoint ARN.

Now, I create a new will_churn SQL function giving input to the endpoint the parameters required by the model:

CREATE FUNCTION will_churn (
       state varchar(2048), acc_length bigint(20),
       area_code bigint(20), int_plan varchar(2048),
       vmail_plan varchar(2048), vmail_msg bigint(20),
       day_mins double, day_calls bigint(20),
       eve_mins double, eve_calls bigint(20),
       night_mins double, night_calls bigint(20),
       int_mins double, int_calls bigint(20),
       cust_service_calls bigint(20))
RETURNS varchar(2048) CHARSET latin1
       alias aws_sagemaker_invoke_endpoint
       endpoint name 'estimate_customer_churn_endpoint_version_123';

As you can see, the model looks at the customer’s phone subscription details and service usage patterns to identify the risk of churn. Using the will_churn SQL function, I run a query over my customers table to flag customers based on my machine learning model. To store the result of the query, I create a new customers_churn table:

CREATE TABLE customers_churn AS
SELECT *, will_churn(state, acc_length, area_code, int_plan,
       vmail_plan, vmail_msg, day_mins, day_calls,
       eve_mins, eve_calls, night_mins, night_calls,
       int_mins, int_calls, cust_service_calls) will_churn
  FROM customers;

Let’s see a few records from the customers_churn table:

SELECT * FROM customers_churn LIMIT 7;

I am lucky the first 7 customers are apparently not going to churn. But what happens overall? Since I stored the results of the will_churn function, I can run a SELECT GROUP BY statement on the customers_churn table.

SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn;

Starting from there, I can dive deep to understand what brings my customers to churn.

If I create a new version of my machine learning model, with a new endpoint ARN, I can recreate the will_churn function without changing my SQL statements.

Available Now
The new machine learning integration is available today for Aurora MySQL 5.7, with the SageMaker integration generally available and the Comprehend integration in preview. You can learn more in the documentation. We are working on other engines and versions: Aurora MySQL 5.6 and Aurora PostgreSQL 10 and 11 are coming soon.

The Aurora machine learning integration is available in all regions in which the underlying services are available. For example, if both Aurora MySQL 5.7 and SageMaker are available in a region, then you can use the integration for SageMaker. For a complete list of services availability, please see the AWS Regional Table.

There’s no additional cost for using the integration, you just pay for the underlying services at your normal rates. Pay attention to the size of your queries when using Comprehend. For example, if you do sentiment analysis on user feedback in your customer service web page, to contact those who made particularly positive or negative comments, and people are making 10,000 comments a day, you’d pay $3/day. To optimize your costs, remember to store results.

It’s never been easier to apply machine learning models to data stored in your relational databases. Let me know what you are going to build with this!

Danilo

Simplify management of Amazon Redshift clusters with the Redshift console

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/simplify-management-of-amazon-redshift-clusters-with-the-redshift-console/

Amazon Redshift is the most popular and the fastest cloud data warehouse. It includes a console for administrators to create, configure, and manage Amazon Redshift clusters. The new Amazon Redshift console modernizes the user interface and adds several features to improve managing your clusters and workloads running on clusters.

The new Amazon Redshift console provides the following benefits:

  • Visibility to the health and performance of clusters from a unified dashboard.
  • Simplified management of clusters by streamlining several screens and flows, and reducing the number of clicks for several everyday operations.
  • Improved mean-time-to-diagnose query performance issues by adding capabilities to monitor user queries and correlate with cluster performance metrics.

This post discusses how you can use the new console to create your first cluster, and manage and monitor your clusters in your AWS account.

Prerequisites

For the best experience in using the new Amazon Redshift console, make sure you meet the following prerequisites.

If you are using the AmazonRedshiftFullAccess, you don’t need to make any changes to your permissions. The AmazonRedshiftFullAccess grants full access to all Amazon Redshift resources for the AWS account.

If you are using custom policies, either attach AmazonRedshiftFullAccess or add the following code to your IAM user’s policy:

{  
   "Version":"2012-10-17",
   "Statement":[  
      {  
         "Action":[  
            "cloudwatch:ListMetrics",
            "cloudwatch:GetMetricWidgetImage",
            "cloudwatch:GetMetricData",
            "tag:GetResources",
            "tag:UntagResources",
            "tag:GetTagValues",
            "tag:GetTagKeys",
            "tag:TagResources",
            "iam:ListRoles"
         ],
         "Effect":"Allow",
         "Resource":"*"
      }
   ]
}

You also have to upgrade your clusters to the latest maintenance patch.

New console launch page

When you log in to the new console and you don’t have a cluster, you see the launch page. You can navigate to your dashboard, clusters, queries, Query Editor, configuration changes, and advisor using the left navigation menu.

You are redirected to the Dashboard page if you have at least one cluster.

The following screenshot shows your Amazon Redshift console and navigation menu.

To expand the menu, choose the expand icon at the top of the menu. See the following screenshot.

Setting up a new Amazon Redshift data warehouse

To use Amazon Redshift, you have to first create a new Amazon Redshift cluster. You can launch a cluster by providing a few parameters such as node type, number of nodes, and the master user password. The rest of the parameters can be default values. You can also get an estimated cost for your cluster, and calculate the best configuration based on the size of your data.

Now that you created an Amazon Redshift cluster, you’re ready to create some tables and load data using the Query Editor.

Query Editor

The Query Editor allows you to author and execute queries in your cluster. To connect to your cluster, from Editor, choose Connect to database.

You have to provide your cluster URL and database credentials to connect to your database. See the following screenshot.

You can run SQL commands to create tables, load data, run queries, and perform visual analysis in the Query Editor.

You can switch to a new database connection by clicking on the “Change connection” button as highlighted below.

Monitoring dashboard

More than 15,000 customers use Amazon Redshift to power their analytical workloads to enable modern analytics use cases such as Business Intelligence, predictive analytics, and real-time streaming analytics. As an Amazon Redshift administrator or developer, you want your users, such as data analysts or BI professionals, to get optimal performance for their workloads. Being proactive, you might set up Amazon CloudWatch alarms on different Amazon Redshift metrics.

With the existing console, you can navigate to different pages to get a glance of the health and performance of Amazon Redshift clusters. The new dashboard provides unified visibility to the health and performance of your clusters.

The dashboard enables Amazon Redshift administrators and operators to gain visibility into the following:

  • The number of clusters, number of nodes, availability, and cluster health status.
  • CloudWatch alarms for your clusters.
  • Critical performance metrics for the top five clusters, such as the number of queries, database connections, and CPU utilization.
  • Performance workloads such as query throughput, average query length, average run time, and average wait time for your clusters.
  • Consolidated view of all events for your clusters.

You can filter data in each widget by a specific filter or modify the range to perform a trend analysis for a metric.

The dashboard also allows you to isolate issues in your clusters to focus on the problem. For example, if you have CloudWatch alarm for an Amazon Redshift metric such as DB Connections, as the preceding screenshot shows, you can drill down to view the alarm details or drill down to the cluster to get details such as queries and loads running, and cluster and database metrics.

Cluster management

The Clusters page allows you to view node details and key performance metrics for the clusters, tags, and notifications. You can customize this page to include your preferred metric and areas of focus. In this post, the page includes a maintenance track and release status. Through the Actions menu, you can perform frequent operations such as modify, resize, manage tags, reboot, create a snapshot or configure a cross-region snapshot, or delete your cluster.

The following screenshot shows the Clusters page and options on the Actions menu.

Cluster details

If you are diagnosing a performance problem, you can drill down to view cluster details. The cluster details page shows the details for a specific cluster, organized in the following categories:

  • Query monitoring
  • Cluster performance
  • Maintenance
  • Backup
  • Properties

The Actions menu allows you to perform everyday operations such as resize, reboot, or change configuration. You can also modify the password for your master user for the database, create a snapshot, or restore a table in your database.

The following screenshot shows the details page for a cluster and the available menus and options.

Query and Load monitoring

You can monitor the current workload for your WLM queues and view critical metrics such as active queries and query throughput.

The following screenshot shows the Query monitoring section and the time-series view of the breakdown of the workload.

Cluster and database monitoring

You can view cluster and database performance metrics for your cluster, such as CPU utilization, percentage of disk space used, and database connections using the Cluster performance tab. You can also change the time period and period and granularity of data.

The following screenshot shows the Cluster performance section.

You can change your default metric view to the metrics of your choice by using the config button. You can either add 6, 9, or 20 metrics to this page. It is recommended that you keep these six metrics for optimal page performance.

Cluster maintenance

The Maintenance and monitoring tab allows you to view your CloudWatch alarms, events, and maintenance level. You can also enable auditing, create CloudWatch alarms, and create maintenance windows for your cluster.

The following screenshot shows the Maintenance and monitoring section.

Backup and restore

Amazon Redshift automatically takes incremental snapshots (backups) of your data every eight hours or 5 GB per node of data change. Alternatively, you can create a snapshot schedule to control when to create snapshots. You can create a new cluster by restoring from an existing snapshot.

You can view, search your snapshots, enable automatic snapshots, take a manual snapshot, and configure cross-region snapshots from this page.

You can also create a new cluster from an existing snapshot. With this release, Amazon Redshift allows you to restore snapshots to create a cluster with different node types than the original version. You can either migrate your Amazon Redshift data warehouse to a more powerful node type or clone to a smaller node type for development and testing.

You can view snapshots for your clusters by navigating the snapshots submenu from the cluster menu. You can search, view, and delete snapshots generated by existing and deleted clusters. You can restore a cluster from a snapshot, including a snapshot generated from a deleted cluster.

The following screenshot shows the Backup section and available actions in the Snapshots section.

Properties

Amazon Redshift clusters provide several customization options. The properties tab allows you to view and edit standard configurations such as IAM roles, cluster configuration details, database configurations, and network configurations. You can also manage tags for your cluster.

You need the JDBC and ODBC URLs for connecting to a cluster for using with different ETL, Business Intelligence, and developer tools. This post also provides you connection details such as JDBC and ODBC URLs and IP addresses for the nodes in the cluster.

The following screenshot shows the Properties section.

Conclusion

This post showed you the new features in the Amazon Redshift management console, such as monitoring your dashboard and updated flows to create, manage, and monitor Amazon Redshift clusters. The Clusters page logically organizes information for better navigation and simplify everyday operations. Switch today to the new Redshift Console and simplify management of your Amazon Redshift clusters.

In a later post, you can learn how the Amazon Redshift console improves query monitoring and helps you diagnose query performance.

 


About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies. He has more than 20 years of experience in the IT industry and has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

 

 

 

 

Raja Bhogi is an engineering manager at AWS. He is responsible for building delightful and easy-to-use web experiences for analytics and blockchain products. His work includes launching web experiences for new analytics products, and working on new feature launches for existing products. He is passionate about web technologies, performance insights, and tuning. He is a thrill seeker and enjoys everything from roller coasters to bungy jumping.

 

 

 

 

Migration Complete – Amazon’s Consumer Business Just Turned off its Final Oracle Database

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/migration-complete-amazons-consumer-business-just-turned-off-its-final-oracle-database/

Over my 17 years at Amazon, I have seen that my colleagues on the engineering team are never content to leave good-enough alone. They routinely re-evaluate every internal system to make sure that it is as scalable, efficient, performant, and secure as possible. When they find an avenue for improvement, they will use what they have learned to thoroughly modernize our architectures and implementations, often going so far as to rip apart existing systems and rebuild them from the ground up if necessary.

Today I would like to tell you about an internal database migration effort of this type that just wrapped up after several years of work. Over the years we realized that we were spending too much time managing and scaling thousands of legacy Oracle databases. Instead of focusing on high-value differentiated work, our database administrators (DBAs) spent a lot of time simply keeping the lights on while transaction rates climbed and the overall amount of stored data mounted. This included time spent dealing with complex & inefficient hardware provisioning, license management, and many other issues that are now best handled by modern, managed database services.

More than 100 teams in Amazon’s Consumer business participated in the migration effort. This includes well-known customer-facing brands and sites such as Alexa, Amazon Prime, Amazon Prime Video, Amazon Fresh, Kindle, Amazon Music, Audible, Shopbop, Twitch, and Zappos, as well as internal teams such as AdTech, Amazon Fulfillment Technology, Consumer Payments, Customer Returns, Catalog Systems, Deliver Experience, Digital Devices, External Payments, Finance, InfoSec, Marketplace, Ordering, and Retail Systems.

Migration Complete
I am happy to report that this database migration effort is now complete. Amazon’s Consumer business just turned off its final Oracle database (some third-party applications are tightly bound to Oracle and were not migrated).

We migrated 75 petabytes of internal data stored in nearly 7,500 Oracle databases to multiple AWS database services including Amazon DynamoDB, Amazon Aurora, Amazon Relational Database Service (RDS), and Amazon Redshift. The migrations were accomplished with little or no downtime, and covered 100% of our proprietary systems. This includes complex purchasing, catalog management, order fulfillment, accounting, and video streaming workloads. We kept careful track of the costs and the performance, and realized the following results:

  • Cost Reduction – We reduced our database costs by over 60% on top of the heavily discounted rate we negotiated based on our scale. Customers regularly report cost savings of 90% by switching from Oracle to AWS.
  • Performance Improvements – Latency of our consumer-facing applications was reduced by 40%.
  • Administrative Overhead – The switch to managed services reduced database admin overhead by 70%.

The migration gave each internal team the freedom to choose the purpose-built AWS database service that best fit their needs, and also gave them better control over their budget and their cost model. Low-latency services were migrated to DynamoDB and other highly scalable non-relational databases such as Amazon ElastiCache. Transactional relational workloads with high data consistency requirements were moved to Aurora and RDS; analytics workloads were migrated to Redshift, our cloud data warehouse.

We captured the shutdown of the final Oracle database, and had a quick celebration:

DBA Career Path
As I explained earlier, our DBAs once spent a lot of time managing and scaling our legacy Oracle databases. The migration freed up time that our DBAs now use to do an even better job of performance monitoring and query optimization, all with the goal of letting them deliver a better customer experience.

As part of the migration, we also worked to create a fresh career path for our Oracle DBAs, training them to become database migration specialists and advisors. This training includes education on AWS database technologies, cloud-based architecture, cloud security, OpEx-style cost management. They now work with both internal and external customers in an advisory role, where they have an opportunity to share their first-hand experience with large-scale migration of mission-critical databases.

Migration Examples
Here are examples drawn from a few of the migrations:

Advertising – After the migration, this team was able to double their database fleet size (and their throughput) in minutes to accommodate peak traffic, courtesy of RDS. This scale-up effort would have taken months.

Buyer Fraud – This team moved 40 TB of data with just one hour of downtime, and realized the same or better performance at half the cost, powered by Amazon Aurora.

Financial Ledger – This team moved 120 TB of data, reduced latency by 40%, cut costs by 70%, and cut overhead by the same 70%, all powered by DynamoDB.

Wallet – This team migrated more than 10 billion records to DynamoDB, reducing latency by 50% and operational costs by 90% in the process. To learn more about this migration, read Amazon Wallet Scales Using Amazon DynamoDB.

My recent Prime Day 2019 post contains more examples of the extreme scale and performance that are possible with AWS.

Migration Resources
If you are ready to migrate from Oracle (or another hand-managed legacy database) to one or more AWS database services, here are some resources to get you started:

AWS Migration Partners – Our slate of AWS Migration Partners have the experience, expertise, and tools to help you to understand, plan, and execute a database migration.

Migration Case Studies -Read How Amazon is Achieving Database Freedom Using AWS to learn more about this effort; read the Prime Video, Advertising, Items & Offers, Amazon Fulfillment, and Analytics case studies to learn more about the examples that I mentioned above.

AWS Professional Services – My colleagues at AWS Professional Services are ready to work alongside you to make your migration a success.

AWS Migration Tools & Services – Check out our Cloud Migration page, read more about Migration Hub, and don’t forget about the Database Migration Service.

AWS Database Freedom – The AWS Database Freedom program is designed to help qualified customers migrate from traditional databases to cloud-native AWS databases.

AWS re:Invent Sessions – We are finalizing an extensive lineup of chalk talks and breakout sessions for AWS re:Invent that will focus on this migration effort, all led by the team members that planned and executed the migrations.

Jeff;

 

 

Orchestrate Amazon Redshift-Based ETL workflows with AWS Step Functions and AWS Glue

Post Syndicated from Ben Romano original https://aws.amazon.com/blogs/big-data/orchestrate-amazon-redshift-based-etl-workflows-with-aws-step-functions-and-aws-glue/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud that offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. Many customers also like to use Amazon Redshift as an extract, transform, and load (ETL) engine to use existing SQL developer skillsets, to quickly migrate pre-existing SQL-based ETL scripts, and—because Amazon Redshift is fully ACID-compliant—as an efficient mechanism to merge change data from source data systems.

In this post, I show how to use AWS Step Functions and AWS Glue Python Shell to orchestrate tasks for those Amazon Redshift-based ETL workflows in a completely serverless fashion. AWS Glue Python Shell is a Python runtime environment for running small to medium-sized ETL tasks, such as submitting SQL queries and waiting for a response. Step Functions lets you coordinate multiple AWS services into workflows so you can easily run and monitor a series of ETL tasks. Both AWS Glue Python Shell and Step Functions are serverless, allowing you to automatically run and scale them in response to events you define, rather than requiring you to provision, scale, and manage servers.

While many traditional SQL-based workflows use internal database constructs like triggers and stored procedures, separating workflow orchestration, task, and compute engine components into standalone services allows you to develop, optimize, and even reuse each component independently. So, while this post uses Amazon Redshift as an example, my aim is to more generally show you how to orchestrate any SQL-based ETL.

Prerequisites

If you want to follow along with the examples in this post using your own AWS account, you need a Virtual Private Cloud (VPC) with at least two private subnets that have routes to an S3 VPC endpoint.

If you don’t have a VPC, or are unsure if yours meets these requirements, I provide an AWS CloudFormation template stack you can launch by selecting the following button. Provide a stack name on the first page and leave the default settings for everything else. Wait for the stack to display Create Complete (this should only take a few minutes) before moving on to the other sections.

Scenario

For the examples in this post, I use the Amazon Customer Reviews Dataset to build an ETL workflow that completes the following two tasks which represent a simple ETL process.

  • Task 1: Move a copy of the dataset containing reviews from the year 2015 and later from S3 to an Amazon Redshift table.
  • Task 2: Generate a set of output files to another Amazon S3 location which identifies the “most helpful” reviews by market and product category, allowing an analytics team to glean information about high quality reviews.

This dataset is publicly available via an Amazon Simple Storage Service (Amazon S3) bucket. Complete the following tasks to get set up.

Solution overview

The following diagram highlights the solution architecture from end to end:

The steps in this process are as follows:

  1. The state machine launches a series of runs of an AWS Glue Python Shell job (more on how and why I use a single job later in this post!) with parameters for retrieving database connection information from AWS Secrets Manager and an .sql file from S3.
  2. Each run of the AWS Glue Python Shell job uses the database connection information to connect to the Amazon Redshift cluster and submit the queries contained in the .sql file.
    1. For Task 1: The cluster utilizes Amazon Redshift Spectrum to read data from S3 and load it into an Amazon Redshift table. Amazon Redshift Spectrum is commonly used as an means for loading data to Amazon Redshift. (See Step 7 of Twelve Best Practices for Amazon Redshift Spectrum for more information.)
    2. For Task 2: The cluster executes an aggregation query and exports the results to another Amazon S3 location via UNLOAD.
  3. The state machine may send a notification to an Amazon Simple Notification Service (SNS) topic in the case of pipeline failure.
  4. Users can query the data from the cluster and/or retrieve report output files directly from S3.

I include an AWS CloudFormation template to jumpstart the ETL environment so that I can focus this post on the steps dedicated to building the task and orchestration components. The template launches the following resources:

  • Amazon Redshift Cluster
  • Secrets Manager secret for storing Amazon Redshift cluster information and credentials
  • S3 Bucket preloaded with Python scripts and .sql files
  • Identity and Access Management (IAM) Role for AWS Glue Python Shell jobs

See the following resources for how to complete these steps manually:

Be sure to select at least two private subnets and the corresponding VPC, as shown in the following screenshot. If you are using the VPC template from above, the VPC appears as 10.71.0.0/16 and the subnet names are A private and B private.

The stack should take 10-15 minutes to launch. Once it displays Create Complete, you can move on to the next section. Be sure to take note of the Resources tab in the AWS CloudFormation console, shown in the following screenshot, as I refer to these resources throughout the post.

Building with AWS Glue Python Shell

Begin by navigating to AWS Glue in the AWS Management Console.

Making a connection

Amazon Redshift cluster resides in a VPC, so you first need to create a connection using AWS Glue. Connections contain properties, including VPC networking information, needed to access your data stores. You eventually attach this connection to your Glue Python Shell Job so that it can reach your Amazon Redshift cluster.

Select Connections from the menu bar, and then select Add connection. Give your connection a name like blog_rs_connection,  select Amazon Redshift as the Connection type, and then select Next, as shown in the following screenshot.

Under Cluster, enter the name of the cluster that the AWS CloudFormation template launched, i.e blogstack-redshiftcluster-####. Because the Python code I provide for this blog already handles credential retrieval, the rest of the values around database information you enter here are largely placeholders. The key information you are associating with the connection is networking-related.

Please note that you are not able to test the connection without the correct cluster information.  If you are interested in doing so, note that Database name and Username are auto-populated after selecting the correct cluster, as shown in the following screenshot. Follow the instructions here to retrieve the password information from Secrets Manager to copy into the Password field.

ETL code review

Take a look at the two main Python scripts used in this example:

Pygresql_redshift_common.py is a set of functions that can retrieve cluster connection information and credentials from Secrets Manger, make a connection to the cluster, and submit queries respectively. By retrieving cluster information at runtime via a passed parameter, these functions allow the job to connect to any cluster to which it has access. You can package these functions into a library by following the instructions to create a python .egg file (already completed as a part of the AWS CloudFormation template launch). Note that AWS Glue Python Shell supports several python libraries natively.

import pg
import boto3
import base64
from botocore.exceptions import ClientError
import json

#uses session manager name to return connection and credential information
def connection_info(db):

	session = boto3.session.Session()
	client = session.client(
		service_name='secretsmanager'
	)

	get_secret_value_response = client.get_secret_value(SecretId=db)

	if 'SecretString' in get_secret_value_response:
		secret = json.loads(get_secret_value_response['SecretString'])
	else:
		secret = json.loads(base64.b64decode(get_secret_value_response['SecretBinary']))
		
	return secret


#creates a connection to the cluster
def get_connection(db,db_creds):

	con_params = connection_info(db_creds)
	
	rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (con_params['host'], con_params['port'], db, con_params['username'], con_params['password'])
	rs_conn = pg.connect(dbname=rs_conn_string)
	rs_conn.query("set statement_timeout = 1200000")
	
	return rs_conn


#submits a query to the cluster
def query(con,statement):
    res = con.query(statement)
    return res

The AWS Glue Python Shell job runs rs_query.py when called. It starts by parsing job arguments that are passed at invocation. It uses some of those arguments to retrieve a .sql file from S3, then connects and submits the statements within the file to the cluster using the functions from pygresql_redshift_common.py. So, in addition to connecting to any cluster using the Python library you just packaged, it can also retrieve and run any SQL statement. This means you can manage a single AWS Glue Python Shell job for all of your Amazon Redshift-based ETL by simply passing in parameters on where it should connect and what it should submit to complete each task in your pipeline.

from redshift_module import pygresql_redshift_common as rs_common
import sys
from awsglue.utils import getResolvedOptions
import boto3

#get job args
args = getResolvedOptions(sys.argv,['db','db_creds','bucket','file'])
db = args['db']
db_creds = args['db_creds']
bucket = args['bucket']
file = args['file']

#get sql statements
s3 = boto3.client('s3') 
sqls = s3.get_object(Bucket=bucket, Key=file)['Body'].read().decode('utf-8')
sqls = sqls.split(';')

#get database connection
print('connecting...')
con = rs_common.get_connection(db,db_creds)

#run each sql statement
print("connected...running query...")
results = []
for sql in sqls[:-1]:
    sql = sql + ';'
    result = rs_common.query(con, sql)
    print(result)
    results.append(result)

print(results)

Creating the Glue Python Shell Job

Next, put that code into action:

  1. Navigate to Jobs on the left menu of the AWS Glue console page and from there, select Add job.
  2. Give the job a name like blog_rs_query.
  3. For the IAM role, select the same GlueExecutionRole you previously noted from the Resources section of the AWS CloudFormation console.
  4. For Type, select Python shell, leave Python version as the default of Python 3, and for This job runs select An existing script that you provide.
  5. For S3 path where the script is stored, navigate to the script bucket created by the AWS CloudFormation template (look for ScriptBucket in the Resources), then select the python/py file.
  6. Expand the Security configuration, script libraries, and job parameters section to add the Python .egg file with the Amazon Redshift connection library to the Python library path. It is also located in the script bucket under python /redshift_module-0.1-py3.6.egg.

When all is said and done everything should look as it does in the following screenshot:

Choose Next. Add the connection you created by choosing Select to move it under Required connections. (Recall from the Making a connection section that this gives the job the ability to interact with your VPC.) Choose Save job and edit script to finish, as shown in the following screenshot.

Test driving the Python Shell job

After creating the job, you are taken to the AWS Glue Python Shell IDE. If everything went well, you should see the rs_query.py code. Right now, the Amazon Redshift cluster is sitting there empty, so use the Python code to run the following SQL statements and populate it with tables.

  1. Create an external database (amzreviews).
  2. Create an external table (reviews) from which Amazon Redshift Spectrum can read from the source data in S3 (the public reviews dataset). The table is partitioned by product_category because the source files are organized by category, but in general you should partition on frequently filtered columns (see #4).
  3. Add partitions to the external table.
  4. Create an internal table (reviews) local to the Amazon Redshift cluster. product_id works well as a DISTKEY because it has high cardinality, even distribution, and most likely (although not explicitly part of this blog’s scenario) a column that will be used to join with other tables. I choose review_date as a SORTKEY to efficiently filter out review data that is not part of my target query (after 2015). Learn more about how to best choose DISTKEY/SORTKEY as well as additional table design parameters for optimizing performance by reading the Designing Tables documentation.
    CREATE EXTERNAL SCHEMA amzreviews 
    from data catalog
    database 'amzreviews'
    iam_role 'rolearn'
    CREATE EXTERNAL database IF NOT EXISTS;
    
    
    
    CREATE EXTERNAL TABLE amzreviews.reviews(
      marketplace varchar(10), 
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25), 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_headline varchar(25), 
      review_body varchar(1024), 
      review_date date, 
      year int)
    PARTITIONED BY ( 
      product_category varchar(25))
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
      's3://amazon-reviews-pds/parquet/';
      
      
      
    ALTER TABLE amzreviews.reviews ADD
    partition(product_category='Apparel') 
    location 's3://amazon-reviews-pds/parquet/product_category=Apparel/'
    partition(product_category='Automotive') 
    location 's3://amazon-reviews-pds/parquet/product_category=Automotive'
    partition(product_category='Baby') 
    location 's3://amazon-reviews-pds/parquet/product_category=Baby'
    partition(product_category='Beauty') 
    location 's3://amazon-reviews-pds/parquet/product_category=Beauty'
    partition(product_category='Books') 
    location 's3://amazon-reviews-pds/parquet/product_category=Books'
    partition(product_category='Camera') 
    location 's3://amazon-reviews-pds/parquet/product_category=Camera'
    partition(product_category='Grocery') 
    location 's3://amazon-reviews-pds/parquet/product_category=Grocery'
    partition(product_category='Furniture') 
    location 's3://amazon-reviews-pds/parquet/product_category=Furniture'
    partition(product_category='Watches') 
    location 's3://amazon-reviews-pds/parquet/product_category=Watches'
    partition(product_category='Lawn_and_Garden') 
    location 's3://amazon-reviews-pds/parquet/product_category=Lawn_and_Garden';
    
    
    CREATE TABLE reviews(
      marketplace varchar(10),
      customer_id varchar(15), 
      review_id varchar(15), 
      product_id varchar(25) DISTKEY, 
      product_parent varchar(15), 
      product_title varchar(50), 
      star_rating int, 
      helpful_votes int, 
      total_votes int, 
      vine varchar(5), 
      verified_purchase varchar(5), 
      review_date date, 
      year int,
      product_category varchar(25))
      
      SORTKEY (
         review_date
        );

Do this first job run manually so you can see where all of the elements I’ve discussed come into play. Select Run Job at the top of the IDE screen. Expand the Security configuration, script libraries, and job parameters section. This is where you add in the parameters as key-value pairs, as shown in the following screenshot.

KeyValue
–dbreviews
–db_credsreviewssecret
–bucket<name of s3 script bucket>
–filesql/reviewsschema.sql

Select Run job to start it. The job should take a few seconds to complete. You can look for log outputs below the code in the IDE to watch job progress.

Once the job completes, navigate to Databases in the AWS Glue console and look for the amzreviews database and reviews table, as shown in the following screenshot. If they are there, then everything worked as planned! You can also connect to your Amazon Redshift cluster using the Redshift Query Editor or with your own SQL client tool and look for the local reviews table.

Step Functions Orchestration

Now that you’ve had a chance to run a job manually, it’s time to move onto something more programmatic that is orchestrated by Step Functions.

Launch Template

I provide a third AWS CloudFormation template for kickstarting this process as well. It creates a Step Functions state machine that calls two instances of the AWS Glue Python Shell job you just created to complete the two tasks I outlined at the beginning of this post.

For BucketName, paste the name of the script bucket created in the second AWS CloudFormation stack. For GlueJobName, type in the name of the job you just created. Leave the other information as default, as shown in the following screenshot. Launch the stack and wait for it to display Create Complete—this should take only a couple of minutes—before moving on to the next section.

Working with the Step Functions State Machine

State Machines are made up of a series of steps, allowing you to stitch together services into robust ETL workflows. You can monitor each step of execution as it happens, which means you can identify and fix problems in your ETL workflow quickly, and even automatically.

Take a look at the state machine you just launched to get a better idea. Navigate to Step Functions in the AWS Console and look for a state machine with a name like GlueJobStateMachine-######. Choose Edit to view the state machine configuration, as shown in the following screenshot.

It should look as it does in the following screenshot:

As you can see, state machines are created using JSON templates made up of task definitions and workflow logic. You can run parallel tasks, catch errors, and even pause workflows and wait for manual callback to continue. The example I provide contains two tasks for running the SQL statements that complete the goals I outlined at the beginning of the post:

  1. Load data from S3 using Redshift Spectrum
  2. Transform and writing data back to S3

Each task contains basic error handling which, if caught, routes the workflow to an error notification task. This example is a simple one to show you how to build a basic workflow, but you can refer to the Step Functions documentation for an example of more complex workflows to help build a robust ETL pipeline. Step Functions also supports reusing modular components with Nested Workflows.

SQL Review

The state machine will retrieve and run the following SQL statements:

INSERT INTO reviews
SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_date, year, product_category
FROM amzreviews.reviews
WHERE year > 2015;

As I mentioned previously, Amazon Redshift Spectrum is a great way to run ETL using an INSERT INTO statement. This example is a simple load of the data as it is in S3, but keep in mind you can add more complex SQL statements to transform your data prior to loading.

UNLOAD ('SELECT marketplace, product_category, product_title, review_id, helpful_votes, AVG(star_rating) as average_stars FROM reviews GROUP BY marketplace, product_category, product_title, review_id, helpful_votes ORDER BY helpful_votes DESC, average_stars DESC')
TO 's3://bucket/testunload/'
iam_role 'rolearn';

This statement groups reviews by product, ordered by number of helpful votes, and writes to Amazon S3 using UNLOAD.

State Machine execution

Now that everything is in order, start an execution. From the state machine main page select Start an Execution.

Leave the defaults as they are and select Start to begin execution. Once execution begins you are taken to a visual workflow interface where you can follow the execution progress, as shown in the following screenshot.

Each of the queries takes a few minutes to run. In the meantime, you can watch the Amazon Redshift query logs to track the query progress in real time. These can be found by navigating to Amazon Redshift in the AWS Console, selecting your Amazon Redshift cluster, and then selecting the Queries tab, as shown in the following screenshot.

Once you see COMPLETED for both queries, navigate back to the state machine execution. You should see success for each of the states, as shown in the following screenshot.

Next, navigate to the data bucket in the S3 AWS Console page (refer to the DataBucket in the CloudFormation Resources tab). If all went as planned, you should see a folder named testunload in the bucket with the unloaded data, as shown in the following screenshot.

Inject Failure into Step Functions State Machine

Next, test the error handling component of the state machine by intentionally causing an error. An easy way to do this is to edit the state machine and misspell the name of the Secrets Manager secret in the ReadFilterJob task, as shown in the following screenshot.

If you want the error output sent to you, optionally subscribe to the error notification SNS Topic. Start another state machine execution as you did previously. This time the workflow should take the path toward the NotifyFailure task, as shown in the following screenshot. If you subscribed to the SNS Topic associated with it, you should receive a message shortly thereafter.

The state machine logs will show the error in more detail, as shown in the following screenshot.

Conclusion

In this post I demonstrated how you can orchestrate Amazon Redshift-based ETL using serverless AWS Step Functions and AWS Glue Python Shells jobs. As I mentioned in the introduction, the concepts can also be more generally applied to other SQL-based ETL, so use them to start building your own SQL-based ETL pipelines today!

 


About the Author

Ben Romano is a Data Lab solution architect at AWS. Ben helps our customers architect and build data and analytics prototypes in just four days in the AWS Data Lab.

 

 

 

 

Protect and Audit PII data in Amazon Redshift with DataSunrise Security

Post Syndicated from Saunak Chandra original https://aws.amazon.com/blogs/big-data/protect-and-audit-pii-data-in-amazon-redshift-with-datasunrise-security/

DataSunrise, in their own words: DataSunrise is a database security software company that offers a breadth of security solutions, including data masking (dynamic and static masking), activity monitoring, database firewalls, and sensitive data discovery for various databases. The goal is to protect databases against external and internal threats and vulnerabilities. Customers often choose DataSunrise Database Security because it gives them unified control and a single-user experience when protecting different database engines that run on AWS, including Amazon Redshift, Amazon Aurora, all Amazon RDS database engines, Amazon DynamoDB, and Amazon Athena, among others. DataSunrise Security Suite is a set of tools that can protect and audit PII data in Amazon Redshift.

DataSunrise offers passive security with data auditing in addition to active data and database security. Active security is based on predefined security policies, such as preventing unauthorized access to sensitive data, blocking suspicious SQL queries, preventing SQL-injection attacks, or dynamically masking and obfuscating data in real time. DataSunrise comes with high availability, failover, and automatic scaling.

This post focuses on active security for Amazon Redshift, in particular DataSunrise’s capabilities for masking and access control of personally identifiable information (PII), which you can back with DataSunrise’s passive security offerings such as auditing access of sensitive information. This post discusses DataSunrise security for Amazon Redshift, how it works, and how to get started.

Why you need active security for Amazon Redshift

Amazon Redshift is a massively parallel processing (MPP), fully managed petabyte-scale data warehouse (DW) solution with over 15,000 deployments worldwide. Amazon Redshift provides a database encryption mechanism to protect sensitive data, such as payment information and health insurance. For more information, see Amazon Redshift Database Encryption.

Many organizations store sensitive data, commonly classified as personally identifiable information (PII) or sensitive personal information (SPI). You may need solutions to manage access control to such sensitive information, and want to manage it efficiently and flexibly, preferably using a management tool. DataSunrise is a centralized management tool that masks that data. It resolves the PII and SPI data access control requirement by allowing you to enforce masking policies against all queries against your Amazon Redshift data warehouse.

What DataSunrise masking does

DataSunrise enables masking queries against Amazon Redshift by acting as a proxy layer between your applications and the backend stores of Amazon Redshift, enabling transparent data flow, bindings, and so on, while your end-users receive masked or obfuscated data that allows them to do their job but prevents any risk of revealing PII data unintentionally.

DataSunrise can exempt users who are authorized to access this information by composing policies and choosing from predefined policy templates that would allow those users to bypass masking constraints when needed.

How it works

DataSunrise operates as a proxy between users or applications that connect to the database and the database server. DataSunrise intercepts the traffic for in-depth analysis and filtering. It applies data masking and access control policies to enforce active security policies against your PII data. When the database firewall is enabled and a security policy violation is detected, DataSunrise can block the malicious SQL query and notify administrators via SMTP or SNMP. With real-time alerts, you can maintain continuous database security and streamline compliance.

DataSunrise operates as a proxy

Getting started with DataSunrise

You can deploy DataSunrise on a Windows or Linux instance in Amazon EC2. You can download a fully prepared DataSunrise AMI from AWS Marketplace to protect your Amazon Redshift cluster. DataSunrise Database and Data Security are available for both Windows and Linux platforms.

After deploying DataSunrise, you can configure security policies for Amazon Redshift and create data masking and access control security rules. After you configure and activate the security policies, DataSunrise enacts those policies against the user and application traffic that would connect to the database through DataSunrise’s proxy.

DataSunrise customers need to configure the Amazon Redshift cluster security group inbound rule to allow DataSunrise IP. For more information, see Amazon Redshift Cluster Security Group. Additionally, you can include the DataSunrise security group in the cluster security group when it runs on the same AWS VPC. Users can execute queries only through connecting to the DataSunrise endpoint and not to the Amazon Redshift cluster endpoint. All DB users and groups are imported from Amazon Redshift into DataSunrise for authentication and authorization to Amazon Redshift objects.

Creating a dynamic data masking rule

Masking obfuscates part or an entire column value. When a column is masked, the column values are replaced by fake values. It is effected either by replacing some original characters with fake ones or by using some masking functions. DataSunrise has many built-in masking functions for credit card numbers, e-mails, etc. Masking protects sensitive or personally identifiable data such as credit card numbers. This is not the same as encryption or hashing, which applies a sophisticated algorithm to a scalar value to convert it into another value.

You can create dynamic masking rules using object-based filters in DataSunrise’s console. DataSunrise identifies the protected objects during application calls and enforces those security rules against targeted operations, schemas, or objects in general within your Amazon Redshift cluster. Security administrators can enable those rules granularly based on the object level and caller identity. They can allow exemptions when needed by identifying authorized callers.

To perform dynamic masking in DataSunrise, you need to create data masking rules as part of defining such security policies.

Complete the following steps to create those masking policies:

  1. In the DataSunrise console, choose Masking > Dynamic Masking Rules.
  2. Choose Add Rule. Add required information.

    Create Dynamic Data Masking Rule

Create Dynamic Data Masking Rule

  1. In the Masking Settings section click Select and navigate to a table in a schema and check the columns you want to mask. See the following screenshot of the Check Columns page:

    Redshift columns to enable dynamic masking

Redshift columns to enable dynamic masking

Click Done after you decide which column to protect and choose the masking method and any relevant settings to allow business-oriented outcomes of the masked information.

In Add Rule, Filter Sessions, you can choose which users, applications, hosts, and more are affected by this rule.

Creating a static data masking rule

You can mask data permanently with static masking as opposed to dynamic data masking. It stores the objects permanently in a separate schema or database. During static masking, DataSunrise copies each selected table into a separate schema or database. So, static masking may require additional storage space. Some of these tables have columns with masked content stored on the disk. This replicated schema is a fully functional schema in which you can run user queries. The source tables remain untouched and unmasked data can be viewed. In case original data has been changed, it is required to repeat the static masking procedure again. In that case it is necessary to truncate tables with previously masked data.

  1. From the menu, choose Masking > Static Masking.
  2. In New Static Masking Task, in Source and Target Instances, choose the source database, schema, and corresponding target destination.See the following screenshot of the New Static Masking Task page:
  1. In Select Source Tables to Transfer and Columns to Mask, choose the objects to which you wish to apply masking.The following screenshot shows the list of available tables:

DataSunrise also enables you to reschedule recurring static masking jobs so you can refresh your masked records based on your source or production data.

Static data masking under DataSunrise applies to Amazon Redshift local tables. In addition to local tables, Amazon Redshift allows querying external tables in Amazon S3; however, DataSunrise does not support static masking on data stored in Amazon S3 accessed in Amazon Redshift via external tables. For more information, see Using Amazon Redshift Spectrum to Query External Data.

Creating a security/access control rule

While data masking can help in many cases to allow your Amazon Redshift users the appropriate access, you may need further enforcement of access control to filter out any operations that might violate your security strategy. DataSunrise can import database users’ and groups’ metadata from Amazon Redshift, which the DataSunrise administrator can use to configure the security profile. If you already have a set of users defined in your existing Redshift DB you don’t need to additionally recreate the users for DataSunrise. DataSunrise will use this list of users only to configure rules. DataSunrise does not modify any traffic related to the authentication process of database users by default.

  1. In the DataSunrise console, from the menu, choose Security > Rules.The following screenshot shows the Security Rules page:
  2. Choose Add Rule.The following screenshot shows the details you can enter to compose this new rule:

DataSunrise also allows you to compose your rule to restrict or allow certain users, applications, hosts, and so on from performing activities that you consider prohibited against particular objects or areas within your Amazon Redshift cluster.

The following screenshot shows the Filter Sessions page:

DataSunrise enables you to create rules for specific objects, query groups, query types, and SQL injection activities, and trigger actions when authorization errors occur.

Static masking doesn’t have a direct impact on performance, but if a customer uses the DataSunrise custom function, it could impact performance because custom functions execute on the DataSunrise server.

Using DataSunrise audit and compliance policies

From the console, Compliance > Add Compliance.

In the Compliance orchestrator page, you can initiate a scan of your Amazon Redshift cluster to identify all PII data or sensitive information in general, per your compliance standards. DataSunrise comes bundled with internal scans for HIPAA, GDPR, and other compliance standards, but you can create or amend any of those libraries to accommodate any special requirements that your security strategy mandates. The following screenshot shows the Add Compliance page:

After completing the scan, DataSunrise guides you through the process of composing rules for sensitive information within your Amazon Redshift cluster.

You can also create your audit rules manually. The following screenshot shows the New Audit Rule page:

You can set audit rules for any restriction to make sure that transactional trails are only collected when necessary. You can target objects starting from the entire database down to a single column in your Amazon Redshift cluster. See the following screenshot:

Conclusion

DataSunrise’s masking feature allows for descriptive specifications of access control to sensitive columns, in addition to built-in encryption provided by the Amazon Redshift cluster. Its proxy enables more fine-grained access control, auditing, and masking capabilities to better monitor, protect, and comply with regulatory standards that address the ever-increasing needs of securing and protecting data. DataSunrise’s integration with Amazon Redshift addresses those concerns by simplifying and automating the security rules and its applications. Keep your data safe and protected at all times!

To get started with DataSunrise with Amazon Redshift, visit DataSunrise in AWS Marketplace.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors


Saunak Chandra is a senior partner solutions architect for Redshift at AWS.
Saunak likes to experiment with new products in the technology space, alongside his day to day work. He loves exploring the nature in the Pacific Northwest. A short hiking or biking in the trails is his favorite weekend morning routine. He also likes to do yoga when he gets time from his kid.

 

 

 

Radik Chumaren is an engineering leader at DataSunrise. Radik is specializing in heterogeneous database environments with focus on building database security software in the cloud. He enjoys reading and playing soccer.

 

 

 

 

NoSQL Workbench for Amazon DynamoDB – Available in Preview

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/nosql-workbench-for-amazon-dynamodb-available-in-preview/

I am always impressed by the flexibility of Amazon DynamoDB, providing our customers a fully-managed key-value and document database that can easily scale from a few requests per month to millions of requests per second.

The DynamoDB team released so many great features recently, from on-demand capacity, to support for native ACID transactions. Here’s a great recap of other recent DynamoDB announcements such as global tables, point-in-time recovery, and instant adaptive capacity. DynamoDB now encrypts all customer data at rest by default.

However, switching mindset from a relational database to NoSQL is not that easy. Last year we had two amazing talks at re:Invent that can help you understand how DynamoDB works, and how you can use it for your use cases:

To help you even further, we are introducing today in preview NoSQL Workbench for Amazon DynamoDB, a free, client-side application available for Windows and macOS to help you design and visualize your data model, run queries on your data, and generate the code for your application!

The three main capabilities provided by the NoSQL Workbench are:

  • Data modeler — to build new data models, adding tables and indexes, or to import, modify, and export existing data models.
  • Visualizer — to visualize data models based on their applications access patterns, with sample data that you can add manually or import via a SQL query.
  • Operation builder — to define and execute data-plane operations or generate ready-to-use sample code for them.

To see how this new tool can simplify working with DynamoDB, let’s build an application to retrieve information on customers and their orders.

Using the NoSQL Workbench
In the Data modeler, I start by creating a CustomerOrders data model, and I add a table, CustomerAndOrders, to hold my customer data and the information on their orders. You can use this tool to create a simple data model where customers and orders are in two distinct tables, each one with their own primary keys. There would be nothing wrong with that. Here I’d like to show how this tool can also help you use more advanced design patterns. By having the customer and order data in a single table, I can construct queries that return all the data I need with a single interaction with DynamoDB, speeding up the performance of my application.

As partition key, I use the customerId. This choice provides an even distribution of data across multiple partitions. The sort key in my data model will be an overloaded attribute, in the sense that it can hold different data depending on the item:

  • A fixed string, for example customer, for the items containing the customer data.
  • The order date, written using ISO 8601 strings such as 20190823, for the items containing orders.

By overloading the sort key with these two possible values, I am able to run a single query that returns the customer data and the most recent orders. For this reason, I use a generic name for the sort key. In this case, I use sk.

Apart from the partition key and the optional sort key, DynamoDB has a flexible schema, and the other attributes can be different for each item in a table. However, with this tool I have the option to describe in the data model all the possible attributes I am going to use for a table. In this way, I can check later that all the access patterns I need for my application work well with this data model.

For this table, I add the following attributes:

  • customerName and customerAddress, for the items in the table containing customer data.
  • orderId and deliveryAddress, for the items in the table containing order data.

I am not adding a orderDate attribute, because for this data model the value will be stored in the sk sort key. For a real production use case, you would probably have much more attributes to describe your customers and orders, but I am trying to keep things simple enough here to show what you can do, without getting lost in details.

Another access pattern for my application is to be able to get a specific order by ID. For that, I add a global secondary index to my table, with orderId as partition key and no sort key.

I add the table definition to the data model, and move on to the Visualizer. There, I update the table by adding some sample data. I add data manually, but I could import a few rows from a table in a MySQL database, for example to simplify a NoSQL migration from a relational database.

Now, I visualize my data model with the sample data to have a better understanding of what to expect from this table. For example, if I select a customerId, and I query for all the orders greater than a specific date, I also get the customer data at the end, because the string customer, stored in the sk sort key, is always greater that any date written in ISO 8601 syntax.

In the Visualizer, I can also see how the global secondary index on the orderId works. Interestingly, items without an orderId are not part of this index, so I get only 4 of the 6 items that are part of my sample data. This happens because DynamoDB writes a corresponding index entry only if the index sort key value is present in the item. If the sort key doesn’t appear in every table item, the index is said to be sparseSparse indexes are useful for queries over a subsection of a table.

I now commit my data model to DynamoDB. This step creates server-side resources such as tables and global secondary indexes for the selected data model, and loads the sample data. To do so, I need AWS credentials for an AWS account. I have the AWS Command Line Interface (CLI) installed and configured in the environment where I am using this tool, so I can just select one of my named profiles.

I move to the Operation builder, where I see all the tables in the selected AWS Region. I select the newly created CustomerAndOrders table to browse the data and build the code for the operations I need in my application.

In this case, I want to run a query that, for a specific customer, selects all orders more recent that a date I provide. As we saw previously, the overloaded sort key would also return the customer data as last item. The Operation builder can help you use the full syntax of DynamoDB operations, for example adding conditions and child expressions. In this case, I add the condition to only return orders where the deliveryAddress contains Seattle.

I have the option to execute the operation on the DynamoDB table, but this time I want to use the query in my application. To generate the code, I select between Python, JavaScript (Node.js), or Java.

You can use the Operation builder to generate the code for all the access patterns that you plan to use with your application, using all the advanced features that DynamoDB provides, including ACID transactions.

Available Now
You can find how to set up NoSQL Workbench for Amazon DynamoDB (Preview) for Windows and macOS here.

We welcome your suggestions in the DynamoDB discussion forum. Let us know what you build with this new tool and how we can help you more!

Learn about AWS Services & Solutions – September AWS Online Tech Talks

Post Syndicated from Jenny Hang original https://aws.amazon.com/blogs/aws/learn-about-aws-services-solutions-september-aws-online-tech-talks/

Learn about AWS Services & Solutions – September AWS Online Tech Talks

AWS Tech Talks

Join us this September to learn about AWS services and solutions. The AWS Online Tech Talks are live, online presentations that cover a broad range of topics at varying technical levels. These tech talks, led by AWS solutions architects and engineers, feature technical deep dives, live demonstrations, customer examples, and Q&A with AWS experts. Register Now!

Note – All sessions are free and in Pacific Time.

Tech talks this month:

 

Compute:

September 23, 2019 | 11:00 AM – 12:00 PM PTBuild Your Hybrid Cloud Architecture with AWS – Learn about the extensive range of services AWS offers to help you build a hybrid cloud architecture best suited for your use case.

September 26, 2019 | 1:00 PM – 2:00 PM PTSelf-Hosted WordPress: It’s Easier Than You Think – Learn how you can easily build a fault-tolerant WordPress site using Amazon Lightsail.

October 3, 2019 | 11:00 AM – 12:00 PM PTLower Costs by Right Sizing Your Instance with Amazon EC2 T3 General Purpose Burstable Instances – Get an overview of T3 instances, understand what workloads are ideal for them, and understand how the T3 credit system works so that you can lower your EC2 instance costs today.

 

Containers:

September 26, 2019 | 11:00 AM – 12:00 PM PTDevelop a Web App Using Amazon ECS and AWS Cloud Development Kit (CDK) – Learn how to build your first app using CDK and AWS container services.

 

Data Lakes & Analytics:

September 26, 2019 | 9:00 AM – 10:00 AM PTBest Practices for Provisioning Amazon MSK Clusters and Using Popular Apache Kafka-Compatible Tooling – Learn best practices on running Apache Kafka production workloads at a lower cost on Amazon MSK.

 

Databases:

September 25, 2019 | 1:00 PM – 2:00 PM PTWhat’s New in Amazon DocumentDB (with MongoDB compatibility) – Learn what’s new in Amazon DocumentDB, a fully managed MongoDB compatible database service designed from the ground up to be fast, scalable, and highly available.

October 3, 2019 | 9:00 AM – 10:00 AM PTBest Practices for Enterprise-Class Security, High-Availability, and Scalability with Amazon ElastiCache – Learn about new enterprise-friendly Amazon ElastiCache enhancements like customer managed key and online scaling up or down to make your critical workloads more secure, scalable and available.

 

DevOps:

October 1, 2019 | 9:00 AM – 10:00 AM PT – CI/CD for Containers: A Way Forward for Your DevOps Pipeline – Learn how to build CI/CD pipelines using AWS services to get the most out of the agility afforded by containers.

 

Enterprise & Hybrid:

September 24, 2019 | 1:00 PM – 2:30 PM PT Virtual Workshop: How to Monitor and Manage Your AWS Costs – Learn how to visualize and manage your AWS cost and usage in this virtual hands-on workshop.

October 2, 2019 | 1:00 PM – 2:00 PM PT – Accelerate Cloud Adoption and Reduce Operational Risk with AWS Managed Services – Learn how AMS accelerates your migration to AWS, reduces your operating costs, improves security and compliance, and enables you to focus on your differentiating business priorities.

 

IoT:

September 25, 2019 | 9:00 AM – 10:00 AM PTComplex Monitoring for Industrial with AWS IoT Data Services – Learn how to solve your complex event monitoring challenges with AWS IoT Data Services.

 

Machine Learning:

September 23, 2019 | 9:00 AM – 10:00 AM PTTraining Machine Learning Models Faster – Learn how to train machine learning models quickly and with a single click using Amazon SageMaker.

September 30, 2019 | 11:00 AM – 12:00 PM PTUsing Containers for Deep Learning Workflows – Learn how containers can help address challenges in deploying deep learning environments.

October 3, 2019 | 1:00 PM – 2:30 PM PTVirtual Workshop: Getting Hands-On with Machine Learning and Ready to Race in the AWS DeepRacer League – Join DeClercq Wentzel, Senior Product Manager for AWS DeepRacer, for a presentation on the basics of machine learning and how to build a reinforcement learning model that you can use to join the AWS DeepRacer League.

 

AWS Marketplace:

September 30, 2019 | 9:00 AM – 10:00 AM PTAdvancing Software Procurement in a Containerized World – Learn how to deploy applications faster with third-party container products.

 

Migration:

September 24, 2019 | 11:00 AM – 12:00 PM PTApplication Migrations Using AWS Server Migration Service (SMS) – Learn how to use AWS Server Migration Service (SMS) for automating application migration and scheduling continuous replication, from your on-premises data centers or Microsoft Azure to AWS.

 

Networking & Content Delivery:

September 25, 2019 | 11:00 AM – 12:00 PM PTBuilding Highly Available and Performant Applications using AWS Global Accelerator – Learn how to build highly available and performant architectures for your applications with AWS Global Accelerator, now with source IP preservation.

September 30, 2019 | 1:00 PM – 2:00 PM PTAWS Office Hours: Amazon CloudFront – Just getting started with Amazon CloudFront and [email protected]? Get answers directly from our experts during AWS Office Hours.

 

Robotics:

October 1, 2019 | 11:00 AM – 12:00 PM PTRobots and STEM: AWS RoboMaker and AWS Educate Unite! – Come join members of the AWS RoboMaker and AWS Educate teams as we provide an overview of our education initiatives and walk you through the newly launched RoboMaker Badge.

 

Security, Identity & Compliance:

October 1, 2019 | 1:00 PM – 2:00 PM PTDeep Dive on Running Active Directory on AWS – Learn how to deploy Active Directory on AWS and start migrating your windows workloads.

 

Serverless:

October 2, 2019 | 9:00 AM – 10:00 AM PTDeep Dive on Amazon EventBridge – Learn how to optimize event-driven applications, and use rules and policies to route, transform, and control access to these events that react to data from SaaS apps.

 

Storage:

September 24, 2019 | 9:00 AM – 10:00 AM PTOptimize Your Amazon S3 Data Lake with S3 Storage Classes and Management Tools – Learn how to use the Amazon S3 Storage Classes and management tools to better manage your data lake at scale and to optimize storage costs and resources.

October 2, 2019 | 11:00 AM – 12:00 PM PTThe Great Migration to Cloud Storage: Choosing the Right Storage Solution for Your Workload – Learn more about AWS storage services and identify which service is the right fit for your business.

 

 

Creating custom Pinpoint dashboards using Amazon QuickSight, part 3

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/creating-custom-pinpoint-dashboards-using-amazon-quicksight-part-3/

Note: This post was written by Manan Nayar and Aprajita Arora, Software Development Engineers on the AWS Digital User Engagement team.


This is the third and final post in our series about creating custom visualizations of your Amazon Pinpoint metrics using Amazon QuickSight.

In our first post, we used the Metrics APIs to retrieve specific Key Performance Indicators (KPIs), and then created visualizations using QuickSight. In the second post, we used the event stream feature in Amazon Pinpoint to enable more in-depth analyses.

The examples in the first two posts used Amazon S3 to store the metrics that we retrieved from Amazon Pinpoint. This post takes a different approach, using Amazon Redshift to store the data. By using Redshift to store this data, you gain the ability to create visualizations on large data sets. This example is useful in situations where you have a large volume of event data, and situations where you need to store your data for long periods of time.

Step 1: Provision the storage

The first step in setting up this solution is to create the destinations where you’ll store the Amazon Pinpoint event data. Since we’ll be storing the data in Amazon Redshift, we need to create a new Redshift cluster. We’ll also create an S3 bucket, which will house the original event data that’s streamed from Amazon Pinpoint.

To create the Redshift cluster and the S3 bucket

  1. Create a new Redshift cluster. To learn more, see the Amazon Redshift Getting Started Guide.
  2. Create a new table in the Redshift cluster that contains the appropriate columns. Use the following query to create the table:
    create table if not exists pinpoint_events_table(
      rowid varchar(255) not null,
      project_key varchar(100) not null,
      event_type varchar(100) not null,
      event_timestamp timestamp not null,
      campaign_id varchar(100),
      campaign_activity_id varchar(100),
      treatment_id varchar(100),
      PRIMARY KEY (rowid)
    );
  3. Create a new Amazon S3 bucket. For complete procedures, see Create a Bucket in the Amazon S3 Getting Started Guide.

Step 2: Set up the event stream

This example uses the event stream feature of Amazon Pinpoint to send event data to S3. Later, we’ll create a Lambda function that sends the event data to your Redshift cluster when new event data is added to the S3 bucket. This method lets us store the original event data in S3, and transfer a subset of that data to Redshift for analysis.

To set up the event stream

  1. Sign in to the Amazon Pinpoint console at http://console.aws.amazon.com/pinpoint. In the list of projects, choose the project that you want to enable event streaming for.
  2. Under Settings, choose Event stream.
  3. Choose Edit, and then configure the event stream to use Amazon Kinesis Data Firehose. If you don’t already have a Kinesis Data Firehose stream, follow the link to create one in the Kinesis console. Configure the stream to send data to an S3 bucket. For more information about creating streams, see Creating an Amazon Kinesis Data Firehose Delivery Stream.
  4. Under IAM role, choose Automatically create a role. Choose Save.

Step 3: Create the Lambda function

In this section, you create a Lambda function that processes the raw event stream data, and then writes it to a table in your Redshift cluster.
To create the Lambda function:

  1. Download the psycopg2 binary from https://github.com/jkehler/awslambda-psycopg2. This Python library lets you interact with PostgreSQL databases, such as Amazon Redshift. It contains certain libraries that aren’t included in Lambda.
    • Note: This Github repository is not an official AWS-managed repository.
  2. Within the awslambda-psycopg2-master folder, you’ll find a folder called psycopg2-37. Rename the folder to psycopg2 (you may need to delete the existing folder with that name), and then compress the entire folder to a .zip file.
  3. Create a new Lambda function from scratch, using the Python 3.7 runtime.
  4. Upload the psycopg2.zip file that you created in step 1 to Lambda.
  5. In Lambda, create a new function called lambda_function.py. Paste the following code into the function:
    import datetime
    import json
    import re
    import uuid
    import os
    import boto3
    import psycopg2
    from psycopg2 import Error
    
    cluster_redshift = "<clustername>"
    dbname_redshift = "<dbname>"
    user_redshift = "<username>"
    password_redshift = "<password>"
    endpoint_redshift = "<endpoint>"
    port_redshift = "5439"
    table_redshift = "pinpoint_events_table"
    
    # Get the file that contains the event data from the appropriate S3 bucket.
    def get_file_from_s3(bucket, key):
        s3 = boto3.client('s3')
        obj = s3.get_object(Bucket=bucket, Key=key)
        text = obj["Body"].read().decode()
    
        return text
    
    # If the object that we retrieve contains newline-delineated JSON, split it into
    # multiple objects.
    def clean_and_split(json_raw):
        json_delimited = re.sub('}\s{','}---X-DELIMITER---{',json_raw)
        json_clean = re.sub('\s+','',json_delimited)
        data = json_clean.split("---X-DELIMITER---")
    
        return data
    
    # Set all of the variables that we'll use to create the new row in Redshift.
    def set_variables(in_json):
    
        for line in in_json:
            content = json.loads(line)
            app_id = content['application']['app_id']
            event_type = content['event_type']
            event_timestamp = datetime.datetime.fromtimestamp(content['event_timestamp'] / 1e3).strftime('%Y-%m-%d %H:%M:%S')
    
            if (content['attributes'].get('campaign_id') is None):
                campaign_id = ""
            else:
                campaign_id = content['attributes']['campaign_id']
    
            if (content['attributes'].get('campaign_activity_id') is None):
                campaign_activity_id = ""
            else:
                campaign_activity_id = content['attributes']['campaign_activity_id']
    
            if (content['attributes'].get('treatment_id') is None):
                treatment_id = ""
            else:
                treatment_id = content['attributes']['treatment_id']
    
            write_to_redshift(app_id, event_type, event_timestamp, campaign_id, campaign_activity_id, treatment_id)
                
    # Write the event stream data to the Redshift table.
    def write_to_redshift(app_id, event_type, event_timestamp, campaign_id, campaign_activity_id, treatment_id):
        row_id = str(uuid.uuid4())
    
        query = ("INSERT INTO " + table_redshift + "(rowid, project_key, event_type, "
                + "event_timestamp, campaign_id, campaign_activity_id, treatment_id) "
                + "VALUES ('" + row_id + "', '"
                + app_id + "', '"
                + event_type + "', '"
                + event_timestamp + "', '"
                + campaign_id + "', '"
                + campaign_activity_id + "', '"
                + treatment_id + "');")
    
        try:
            conn = psycopg2.connect(user = user_redshift,
                                    password = password_redshift,
                                    host = endpoint_redshift,
                                    port = port_redshift,
                                    database = dbname_redshift)
    
            cur = conn.cursor()
            cur.execute(query)
            conn.commit()
            print("Updated table.")
    
        except (Exception, psycopg2.DatabaseError) as error :
            print("Database error: ", error)
        finally:
            if (conn):
                cur.close()
                conn.close()
                print("Connection closed.")
    
    # Handle the event notification that we receive when a new item is sent to the 
    # S3 bucket.
    def lambda_handler(event,context):
        print("Received event: \n" + str(event))
    
        bucket = event['Records'][0]['s3']['bucket']['name']
        key = event['Records'][0]['s3']['object']['key']
        data = get_file_from_s3(bucket, key)
    
        in_json = clean_and_split(data)
    
        set_variables(in_json)

    In the preceding code, make the following changes:

    • Replace <clustername> with the name of the cluster.
    • Replace <dbname> with the name of the database.
    • Replace <username> with the user name that you specified when you created the Redshift cluster.
    • Replace <password> with the password that you specified when you created the Redshift cluster.
    • Replace <endpoint> with the endpoint address of the Redshift cluster.
  6. In IAM, update the execution role that’s associated with the Lambda function to include the GetObject permission for the S3 bucket that contains the event data. For more information, see Editing IAM Policies in the AWS IAM User Guide.

Step 4: Set up notifications on the S3 bucket

Now that we’ve created the Lambda function, we’ll set up a notification on the S3 bucket. In this case, the notification will refer to the Lambda function that we created in the previous section. Every time a new file is added to the bucket, the notification will cause the Lambda function to run.

To create the event notification

  1. In S3, create a new bucket notification. The notification should be triggered when PUT events occur, and should trigger the Lambda function that you created in the previous section. For more information about creating notifications, see Configuring Amazon S3 Event Notifications in the Amazon S3 Developer Guide.
  2. Test the event notification by sending a test campaign. If you send an email campaign, your Redshift database should contain events such as _campaign.send, _email.send, _email.delivered, and others. You can check the contents of the Redshift table by running the following query in the Query Editor in the Redshift console:
    select * from pinpoint_events_table;

Step 5: Add the data set in Amazon QuickSight

If your Lambda function is sending event data to Redshift as expected, you can use your Redshift database to create a new data set in Amazon QuickSight. QuickSight includes an automatic database discovery feature that helps you add your Redshift database as a data set with only a few clicks. For more information, see Creating a Data Set from a Database in the Amazon QuickSight User Guide.

Step 6: Create your visualizations

Now that QuickSight is retrieving information from your Redshift database, you can use that data to create visualizations. To learn more about creating visualizations in QuickSight, see Creating an Analysis in the Amazon QuickSight User Guide.

This brings us to the end of our series. While these posts focused on using Amazon QuickSight to visualize your analytics data, you can also use these same techniques to create visualizations using 3rd party applications. We hope you enjoyed this series, and we can’t wait to see what you build using these examples!

Amazon Aurora PostgreSQL Serverless – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-aurora-postgresql-serverless-now-generally-available/

The database is usually the most critical part of a software architecture and managing databases, especially relational ones, has never been easy. For this reason, we created Amazon Aurora Serverless, an auto-scaling version of Amazon Aurora that automatically starts up, shuts down and scales up or down based on your application workload.

The MySQL-compatible edition of Aurora Serverless has been available for some time now. I am pleased to announce that the PostgreSQL-compatible edition of Aurora Serverless is generally available today.

Before moving on with details, I take the opportunity to congratulate the Amazon Aurora development team that has just won the 2019 Association for Computing Machinery’s (ACM) Special Interest Group on Management of Data (SIGMOD) Systems Award!

When you create a database with Aurora Serverless, you set the minimum and maximum capacity. Your client applications transparently connect to a proxy fleet that routes the workload to a pool of resources that are automatically scaled. Scaling is very fast because resources are “warm” and ready to be added to serve your requests.

 

There is no change with Aurora Serverless on how storage is managed by Aurora. The storage layer is independent from the compute resources used by the database. There is no need to provision storage in advance. The minimum storage is 10GB and, based on the database usage, the Amazon Aurora storage will automatically grow, up to 64 TB, in 10GB increments with no impact to database performance.

Creating an Aurora Serverless PostgreSQL Database
Let’s start an Aurora Serverless PostgreSQL database and see the automatic scalability at work. From the Amazon RDS console, I select to create a database using Amazon Aurora as engine. Currently, Aurora serverless is compatible with PostgreSQL version 10.5. Selecting that version, the serverless option becomes available.

I give the new DB cluster an identifier, choose my master username, and let Amazon RDS generate a password for me. I will be able to retrieve my credentials during database creation.

I can now select the minimum and maximum capacity for my database, in terms of Aurora Capacity Units (ACUs), and in the additional scaling configuration I choose to pause compute capacity after 5 minutes of inactivity. Based on my settings, Aurora Serverless automatically creates scaling rules for thresholds for CPU utilization, connections, and available memory.

Testing Some Load on the Database
To generate some load on the database I am using sysbench on an EC2 instance. There are a couple of Lua scripts bundled with sysbench that can help generate an online transaction processing (OLTP) workload:

  • The first script, parallel_prepare.lua, generates 100,000 rows per table for 24 tables.
  • The second script, oltp.lua, generates workload against those data using 64 worker threads.

By using those scripts, I start generating load on my database cluster. As you can see from this graph, taken from the RDS console monitoring tab, the serverless database capacity grows and shrinks to follow my requirements. The metric shown on this graph is the number of ACUs used by the database cluster. First it scales up to accommodate the sysbench workload. When I stop the load generator, it scales down and then pauses.

Available Now
Aurora Serverless PostgreSQL is available now in US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo). With Aurora Serverless, you pay on a per-second basis for the database capacity you use when the database is active, plus the usual Aurora storage costs.

For more information on Amazon Aurora, I recommend this great post explaining why and how it was created:

Amazon Aurora ascendant: How we designed a cloud-native relational database

It’s never been so easy to use a relational database in production. I am so excited to see what you are going to use it for!

How to build databases using Python and text files | Hello World #9

Post Syndicated from Mac Bowley original https://www.raspberrypi.org/blog/how-to-build-databases-using-python-and-text-files-hello-world-9/

In Hello World issue 9, Raspberry Pi’s own Mac Bowley shares a lesson that introduces students to databases using Python and text files.

In this lesson, students create a library app for their books. This will store information about their book collection and allow them to display, manipulate, and search their collection. You will show students how to use text files in their programs that act as a database.

The project will give your students practical examples of database terminology and hands-on experience working with persistent data. It gives opportunities for students to define and gain concrete experience with key database concepts using a language they are familiar with. The script that accompanies this activity can be adapted to suit your students’ experience and competency.

This ready-to-go software project can be used alongside approaches such as PRIMM or pair programming, or as a worked example to engage your students in programming with persistent data.

What makes a database?

Start by asking the students why we need databases and what they are: do they ever feel unorganised? Life can get complicated, and there is so much to keep track of, the raw data required can be overwhelming. How can we use computing to solve this problem? If only there was a way of organising and accessing data that would let us get it out of our head. Databases are a way of organising the data we care about, so that we can easily access it and use it to make our lives easier.

Then explain that in this lesson the students will create a database, using Python and a text file. The example I show students is a personal library app that keeps track of which books I own and where I keep them. I have also run this lesson and allowed the students pick their own items to keep track of — it just involves a little more planning time at the end. Split the class up into pairs; have each of them discuss and select five pieces of data about a book (or their own item) they would like to track in a database. They should also consider which type of data each of them is. Give them five minutes to discuss and select some data to track.

Databases are organised collections of data, and this allows them to be displayed, maintained, and searched easily. Our database will have one table — effectively just like a spreadsheet table. The headings on each of the columns are the fields: the individual pieces of data we want to store about the books in our collection. The information about a single book are called its attributes and are stored together in one record, which would be a single row in our database table. To make it easier to search and sort our database, we should also select a primary key: one field that will be unique for each book. Sometimes one of the fields we are already storing works for this purpose; if not, then the database will create an ID number that it uses to uniquely identify each record.

Create a library application

Pull the class back together and ask a few groups about the data they selected to track. Make sure they have chosen appropriate data types. Ask some if they can find any of the fields that would be a primary key; the answer will most likely be no. The ISBN could work, but for our simple application, having to type in a 10- or 13-digit number just to use for an ID would be overkill. In our database, we are going to generate our own IDs.

The requirements for our database are that it can do the following things: save data to a file, read data from that file, create new books, display our full database, allow the user to enter a search term, and display a list of relevant results based on that term. We can decompose the problem into the following steps:

  • Set up our structures
  • Create a record
  • Save the data to the database file
  • Read from the database file
  • Display the database to the user
  • Allow the user to search the database
  • Display the results

Have the class log in and power up Python. If they are doing this locally, have them create a new folder to hold this project. We will be interacting with external files and so having them in the same folder avoids confusion with file locations and paths. They should then load up a new Python file. To start, download the starter file from the link provided. Each student should make a copy of this file. At first, I have them examine the code, and then get them to run it. Using concepts from PRIMM, I get them to print certain messages when a menu option is selected. This can be a great exemplar for making a menu in any application they are developing. This will be the skeleton of our database app: giving them a starter file can help ease some cognitive load from students.

Have them examine the variables and make guesses about what they are used for.

  • current_ID – a variable to count up as we create records, this will be our primary key
  • new_additions – a list to hold any new records we make while our code is running, before we save them to the file
  • filename – the name of the database file we will be using
  • fields – a list of our fields, so that our dictionaries can be aligned with our text file
  • data – a list that will hold all of the data from the database, so that we can search and display it without having to read the file every time

Create the first record

We are going to use dictionaries to store our records. They reference their elements using keys instead of indices, which fit our database fields nicely. We are going to generate our own IDs. Each of these must be unique, so a variable is needed that we can add to as we make our records. This is a user-focused application, so let’s make it so our user can input the data for the first book. The strings, in quotes, on the left of the colon, are the keys (the names of our fields) and the data on the right is the stored value, in our case whatever the user inputs in response to our appropriate prompts. We finish this part of by adding the record to the file, incrementing the current ID, and then displaying a useful feedback message to the user to say their record has been created successfully. Your students should now save their code and run it to make sure there aren’t any syntax errors.

You could make use of pair programming, with carefully selected pairs taking it in turns in the driver and navigator roles. You could also offer differing levels of scaffolding: providing some of the code and asking them to modify it based on given requirements.

How to use the code in your class

To complete the project, your students can add functionality to save their data to a CSV file, read from a database file, and allow users to search the database. The code for the whole project is available at helloworld.cc/database.

An example of the code

You may want to give your students the entire piece of code. They can investigate and modify it to their own purpose. You can also lead them through it, having them follow you as you demonstrate how an expert constructs a piece of software. I have done both to great effect. Let me know how your classes get on! Get in touch at [email protected]

Hello World issue 9

The brand-new issue of Hello World is out today, and available right now as a free PDF download from the Hello World website.



UK-based educators can also sign up to receive Hello World as printed magazine FOR FREE, direct to their door. And those outside the UK, educator or not, can subscribe to receive new digital issues of Hello World in their inbox on the day of release.

The post How to build databases using Python and text files | Hello World #9 appeared first on Raspberry Pi.