
Create cross-account, custom Amazon Managed Grafana dashboards for Amazon Redshift

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/create-cross-account-custom-amazon-managed-grafana-dashboards-for-amazon-redshift/

Amazon Managed Grafana recently announced a new data source plugin for Amazon Redshift, enabling you to query, visualize, and alert on your Amazon Redshift data from Amazon Managed Grafana workspaces. With the new Amazon Redshift data source, you can now create dashboards and alerts in your Amazon Managed Grafana workspaces to analyze your structured and semi-structured data across data warehouses, operational databases, and data lakes. The Amazon Redshift plugin also comes with default out-of-the-box dashboards that make it simple to get started monitoring the health and performance of your Amazon Redshift clusters.

In this post, we present a step-by-step tutorial on using the Amazon Redshift data source plugin to visualize metrics from Amazon Redshift clusters hosted in different AWS accounts with AWS Single Sign-On (AWS SSO), and on creating custom dashboards in Amazon Managed Grafana that visualize data from Amazon Redshift system tables.

Solution overview

Let’s look at the AWS services that we use in our tutorial:

Amazon Managed Grafana is a fully managed service for open-source Grafana developed in collaboration with Grafana Labs. Grafana is a popular open-source analytics platform that enables you to query, visualize, alert on, and understand your operational metrics. You can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure and more time improving the health, performance, and availability of your applications. Amazon Managed Grafana natively integrates with AWS services (like Amazon Redshift) so you can securely add, query, visualize, and analyze operational and performance data across multiple accounts and Regions for the underlying AWS service.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers from Fortune 500 companies, startups, and everything in between use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

AWS SSO is where you create or connect your workforce identities in AWS and manage access centrally across your AWS organization. You can choose to manage access just to your AWS accounts or cloud applications. You can create user identities directly in AWS SSO, or you can bring them from your Microsoft Active Directory or a standards-based identity provider, such as Okta Universal Directory or Azure AD. With AWS SSO, you get a unified administration experience to define, customize, and assign fine-grained access. Your workforce users get a user portal to access all their assigned AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) Windows instances, or cloud applications. AWS SSO can be flexibly configured to run alongside or replace AWS account access management via AWS Identity and Access Management (IAM).

The following diagram illustrates the solution architecture.

The solution includes the following components:

  • Captured metrics from the Amazon Redshift clusters in the development and production AWS accounts.
  • Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data API.
  • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.

We walk you through the following steps in this post:

  1. Create a user in AWS SSO for Amazon Managed Grafana workspace access.
  2. Configure an Amazon Managed Grafana workspace.
  3. Set up two Amazon Redshift clusters as the data sources in Grafana.
  4. Import the Amazon Redshift dashboard supplied with the data source.
  5. Create a custom Amazon Redshift dashboard to visualize metrics from the Amazon Redshift clusters.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Set up AWS SSO

In this section, we set up AWS SSO and register users.

In addition to AWS SSO integration, Amazon Managed Grafana also supports direct SAML integration with SAML 2.0 identity providers.

  1. If you don’t have AWS SSO enabled, open the AWS SSO console and choose Enable AWS SSO.
  2. After AWS SSO is enabled, choose Users in the navigation pane.
  3. Choose Add user.
  4. Enter the user details and choose Next: Groups.
  5. Choose Add user.

Set up your Amazon Managed Grafana workspace

In this section, we demonstrate how to set up a Grafana workspace using Amazon Managed Grafana. We set up authentication using AWS SSO, register data sources, and add administrative users for the workspace.

  1. On the Amazon Managed Grafana console, choose Create workspace.
  2. For Workspace name, enter a suitable name.
  3. Choose Next.
  4. For Authentication access, select AWS Single Sign-On.
  5. For Permission type, select Service managed.
  6. Choose Next.
  7. Select Current account.
  8. For Data sources, select Amazon Redshift.
  9. Choose Next.
  10. Review the details and choose Create workspace.

    Now we assign a user to the workspace.
  11. On the Workspaces page, choose the workspace you created.
  12. Note the IAM role attached to your workspace.
  13. Choose Assign new user or group.
  14. Select the user to assign to the workspace.
  15. Choose Assign users and groups.

    For the purposes of this post, we need an admin user.
  16. To change the permissions of the user you just assigned, select the user name and choose Make admin.

For the cross-account setup, we use two Amazon Redshift clusters: production and development. In the next section, we configure IAM roles in both the production and development accounts so that the Grafana workspace in the production account can connect to the Amazon Redshift clusters in both the production and the development account.

Configure an IAM role for the development account

In this section, we set up the IAM role in the AWS account hosting the development environment. This role is assumed by the Amazon Managed Grafana service role from the production AWS account to establish the connection between Amazon Managed Grafana and the Amazon Redshift cluster in the development account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Custom trust policy.
  4. Use the following policy code (update the account number for your production account and the Grafana service role attached to the workspace):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<production-account-number>:role/service-role/AmazonGrafanaServiceRole-xxxxxxxxxx",
                    "Service": "grafana.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  5. Choose Next.
  6. Attach the managed IAM policy AmazonGrafanaRedshiftAccess to this role. For instructions, refer to Modifying a role permissions policy (console).
  7. Provide a role name, description, and tags (optional), and create the role.

Configure an IAM role for the production account

Next, we configure the IAM role created by the Amazon Managed Grafana service in order to establish the connection between Amazon Managed Grafana and the Amazon Redshift cluster in the production account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for the AmazonGrafanaServiceRole-xxxxxxx role attached to your Grafana workspace.
  3. Create an inline IAM policy and attach it to this role with the following policy code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole"
                ],
                "Resource": "arn:aws:iam::<dev-account-number>:role/<DevAccountRoleName>"
            }
        ]
    }

  4. Provide a policy name, then review and create the policy.

Import the default dashboard

In this section, we connect to the Amazon Redshift clusters in the production and development accounts from the Amazon Managed Grafana console and import the default dashboard.

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Choose the workspace you just created (authenticate and sign in if needed).
  3. In the navigation pane, choose Settings and on the Configuration menu, choose Data sources.
  4. Choose Add data source.
  5. Search for and choose Amazon Redshift.
  6. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  7. For Default Region, choose us-east-1.
  8. Under Redshift Details, choose Temporary credentials.
  9. Enter the cluster identifier and database name for your Amazon Redshift cluster in the production account.
  10. For Database user, enter redshift_data_api_user.
  11. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working. You can now move on to the next step.
  12. Repeat these steps to add another data source to connect to the Amazon Redshift cluster in the development account.
  13. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  14. For Assume Role ARN, enter the ARN of the IAM role you created earlier in the development account (arn:aws:iam::<dev-account-number>:role/<cross-account-role-name>).
  15. For Default Region, choose us-east-1.
  16. Under Redshift Details, choose Temporary credentials.
  17. Enter the cluster identifier and database name for your Amazon Redshift cluster in the development account.
  18. For Database user, enter redshift_data_api_user.
  19. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working.
  20. On the Dashboards tab, choose Import next to Amazon Redshift.

On the dashboard page, you can switch the data source between your production and development clusters using the drop-down menu.

The default Amazon Redshift dashboard, as shown in the following screenshot, makes it easy to monitor the overall health of the cluster by showing different cluster metrics, like total storage capacity used, storage utilization per node, open and closed connections, WLM mode, AQUA status, and more.

Additionally, the default dashboard displays several table-level metrics such as size of the tables, total number of rows, unsorted rows percentage, and more, in the Schema Insights section.

Add a custom dashboard for Amazon Redshift

The Amazon Redshift data source plugin allows you to query and visualize Amazon Redshift data metrics from within Amazon Managed Grafana. It’s preconfigured with general metrics. To add a custom metric from the Amazon Redshift cluster, complete the following steps:

  1. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  2. Choose the Grafana workspace URL for the workspace you want to modify.
  3. Choose Sign in with AWS SSO and provide your credentials.
  4. On the Amazon Managed Grafana workspace page, choose the plus sign and on the Create menu, choose Dashboard.
  5. Choose Add a new panel.
  6. Add the following custom SQL to get the data from the Amazon Redshift cluster:
    select p.usename,
        count(*) as Num_Query,
        SUM(DATEDIFF('second', starttime, endtime)) as Total_Execution_seconds
    from stl_query s
    inner join pg_user p on s.userid = p.usesysid
    where starttime between $__timeFrom() and $__timeTo()
        and s.userid > 1
    group by 1

    For this post, we use the default settings, but you can control and link the time range using the $__timeFrom() and $__timeTo() macros; they’re bound with the time range control of your dashboard. For more information and details about the supported expressions, see Query Redshift data.

  7. To inspect the data, choose Query inspector to test the custom query outcome.
    Amazon Managed Grafana supports a variety of visualizations. For this post, we create a bar chart.
  8. On the Visualizations tab in the right pane, choose Bar chart.
  9. Enter a title and description for the custom chart, and leave all other properties as default.
    For more information about supported properties, see Visualizations.
  10. Choose Save.
  11. In the pop-up window, enter a dashboard name and choose Save.

    A new dashboard is created with a custom metric.
  12. To add more metrics, choose the Add panel icon, choose Add a new panel, and repeat the previous steps.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Managed Grafana workspace.
  2. If you created a new Amazon Redshift cluster for this demonstration, delete the cluster.

Conclusion

In this post, we demonstrated how to use AWS SSO and Amazon Managed Grafana to create an operational view to monitor the health and performance of Amazon Redshift clusters. We learned how to extend your default dashboard by adding custom and insightful dashboards to your Grafana workspace.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr. Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he worked in several analytics and system engineering roles. Outside of work, he loves watching sports, and is an avid foodie and craft beer enthusiast.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Use the AWS Glue connector to read and write Apache Iceberg tables with ACID transactions and perform time travel

Post Syndicated from Tomohiro Tanaka original https://aws.amazon.com/blogs/big-data/use-the-aws-glue-connector-to-read-and-write-apache-iceberg-tables-with-acid-transactions-and-perform-time-travel/

Nowadays, many customers have built their data lakes as the core of their data analytic systems. In a typical use case of data lakes, many concurrent queries run to retrieve consistent snapshots of business insights by aggregating query results. A large volume of data constantly comes from different data sources into the data lakes. There is also a common demand to reflect the changes occurring in the data sources into the data lakes. This means that not only inserts but also updates and deletes need to be replicated into the data lakes.

Apache Iceberg provides the capability of ACID transactions on your data lakes, which allows concurrent queries to add or delete records isolated from any existing queries with read-consistency for queries. Iceberg is an open table format designed for large analytic workloads on huge datasets. You can perform ACID transactions against your data lakes by using simple SQL expressions. It also enables time travel, rollback, hidden partitioning, and schema evolution changes, such as adding, dropping, renaming, updating, and reordering columns.

AWS Glue is one of the key elements to building data lakes. It extracts data from multiple sources and ingests your data to your data lake built on Amazon Simple Storage Service (Amazon S3) using both batch and streaming jobs. To expand the accessibility of your AWS Glue extract, transform, and load (ETL) jobs to Iceberg, AWS Glue provides an Apache Iceberg connector. The connector allows you to build Iceberg tables on your data lakes and run Iceberg operations such as ACID transactions, time travel, rollbacks, and so on from your AWS Glue ETL jobs.

In this post, we give an overview of how to set up the Iceberg connector for AWS Glue and configure the relevant resources to use Iceberg with AWS Glue jobs. We also demonstrate how to run typical Iceberg operations on AWS Glue interactive sessions with an example use case.

Apache Iceberg connector for AWS Glue

With the Apache Iceberg connector for AWS Glue, you can take advantage of the following Iceberg capabilities:

  • Basic operations on Iceberg tables – This includes creating Iceberg tables in the AWS Glue Data Catalog and inserting, updating, and deleting records with ACID transactions in the Iceberg tables
  • Inserting and updating records – You can run UPSERT (update and insert) queries for your Iceberg table
  • Time travel on Iceberg tables – You can read a specific version of an Iceberg table from table snapshots that Iceberg manages
  • Rollback of table versions – You can revert an Iceberg table back to a specific version of the table

Iceberg offers additional useful capabilities such as hidden partitioning; schema evolution with add, drop, update, and rename support; automatic data compaction; and more. For more details about Iceberg, refer to the Apache Iceberg documentation.

Next, we demonstrate how the Apache Iceberg connector for AWS Glue works for each Iceberg capability based on an example use case.

Overview of example customer scenario

Let’s assume that an ecommerce company sells products on their online platform. Customers can buy products and write reviews to each product. Customers can add, update, or delete their reviews at any time. The customer reviews are an important source for analyzing customer sentiment and business trends.

In this scenario, we have the following teams in our organization:

  • Data engineering team – Responsible for building and managing data platforms.
  • Data analyst team – Responsible for analyzing customer reviews and creating business reports. This team queries the reviews daily, creates a business intelligence (BI) report, and shares it with sales team.
  • Customer support team – Responsible for replying to customer inquiries. This team queries the reviews when they get inquiries about the reviews.

Our solution has the following requirements:

  • Query scalability is important because the website is huge.
  • Individual customer reviews can be added, updated, and deleted.
  • The data analyst team needs to use both notebooks and ad hoc queries for their analysis.
  • The customer support team sometimes needs to view the history of the customer reviews.
  • Customer reviews can always be added, updated, and deleted, even while one of the teams is querying the reviews for analysis. This means that any result in a query isn’t affected by uncommitted customer review write operations.
  • Any changes in customer reviews that are made by the organization’s various teams need to be reflected in BI reports and query results.

In this post, we build a data lake of customer review data on top of Amazon S3. To meet these requirements, we introduce Apache Iceberg to enable adding, updating, and deleting records; ACID transactions; and time travel queries. We also use an AWS Glue Studio notebook to integrate and query the data at scale. First, we set up the connector so we can create an AWS Glue connection for Iceberg.

Set up the Apache Iceberg connector and create the Iceberg connection

To use Apache Iceberg with AWS Glue jobs, we first set up the Apache Iceberg connector for AWS Glue and create an AWS Glue connection that uses it. Complete the following steps:

  1. Navigate to the Apache Iceberg connector for AWS Glue page in AWS Marketplace.
  2. Choose Continue to Subscribe.
  3. Review the information under Terms and Conditions, and choose Accept Terms to continue.
  4. When the subscription is complete, choose Continue to Configuration.
  5. For Fulfillment option, choose Glue 3.0. (1.0 and 2.0 are also available options.)
  6. For Software version, choose the latest software version.

As of this writing, 0.12.0-2 is the latest version of the Apache Iceberg connector for AWS Glue.

  7. Choose Continue to Launch.
  8. Choose Usage instructions.
  9. Choose Activate the Glue connector from AWS Glue Studio.

You’re redirected to AWS Glue Studio.

  10. For Name, enter a name for your connection (for example, iceberg-connection).
  11. Choose Create connection and activate connector.

A message appears that the connection was successfully added, and the connection is now visible on the AWS Glue Studio console.

Configure resources and permissions

We use a provided AWS CloudFormation template to set up Iceberg configuration for AWS Glue. AWS CloudFormation creates the following resources:

  • An S3 bucket to store an Iceberg configuration file and actual data
  • An AWS Lambda function to generate an Iceberg configuration file based on parameters provided by a user for the CloudFormation template, and to clean up the resources created through this post
  • AWS Identity and Access Management (IAM) roles and policies with necessary permissions
  • An AWS Glue database in the Data Catalog to register Iceberg tables

To deploy the CloudFormation template, complete the following steps:

  1. Choose Launch Stack:
  2. For DynamoDBTableName, enter a name for an Amazon DynamoDB table that is created automatically when AWS Glue creates an Iceberg table.

This table is used for an AWS Glue job to obtain a commit lock to avoid concurrently modifying records in Iceberg tables. For more details about commit locking, refer to DynamoDB for Commit Locking. Note that you shouldn’t specify the name of an existing table.

  3. For IcebergDatabaseName, enter a name for the AWS Glue database that is created in the Data Catalog and used for registering Iceberg tables.
  4. Choose Next.
  5. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Create stack.

Start an AWS Glue Studio notebook to use Apache Iceberg

After you launch the CloudFormation stack, you create an AWS Glue Studio notebook to perform Iceberg operations. Complete the following steps:

  1. Download the Jupyter notebook file.
  2. On the AWS Glue console, choose Jobs in the navigation pane.
  3. Under Create job, select Jupyter Notebook.
  4. Select Upload and edit an existing notebook and upload iceberg-with-glue.ipynb.
  5. Choose Create.
  6. For Job name, enter a name.
  7. For IAM role, choose IcebergConnectorGlueJobRole, which was created via the CloudFormation template.
  8. Choose Start notebook job.

The process takes a few minutes to complete, after which you can see an AWS Glue Studio notebook view.

  9. Choose Save to save the notebook.

Set up the Iceberg configuration

To set up the Iceberg configuration, complete the following steps:

  1. Run the following cells with multiple options (magics). Note that you set your connection name for the %connections magic in the cell.

For more information, refer to Configuring AWS Glue Interactive Sessions for Jupyter and AWS Glue Studio notebooks.
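
The exact cell contents depend on your environment, but a configuration cell along the following lines is typical; the connection name iceberg-connection and the worker settings shown here are illustrative assumptions, not values prescribed by the notebook.

%glue_version 3.0
%connections iceberg-connection
%worker_type G.1X
%number_of_workers 2
%idle_timeout 60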

A message Session <session-id> has been created appears when your AWS Glue Studio notebook is ready.

In the last cell in this section, you load your Iceberg configuration, which you specified when launching the CloudFormation stack. The Iceberg configuration includes a warehouse path for Iceberg actual data, a DynamoDB table name for commit locking, a database name for your Iceberg tables, and more.

To load the configuration, set the S3 bucket name that was created via the CloudFormation stack.

  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose the stack you created.
  4. On the Outputs tab, copy the S3 bucket name.

  5. Set the S3 bucket name as the S3_BUCKET parameter in your notebook.

  6. Run the cell and load the Iceberg configuration that you set.

Initialize the job with Iceberg configurations

We continue to run cells to initiate a SparkSession in this section.

  1. Set an Iceberg warehouse path and a DynamoDB table name for Iceberg commit locking from the user_config parameter.
  2. Initialize a SparkSession by setting the Iceberg configurations.
  3. With the SparkSession object, create SparkContext and GlueContext objects.

The following screenshot shows the relevant section in the notebook.

We provide the details of each parameter that you configure for the SparkSession in the appendix of this post.
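
The notebook cell isn’t reproduced here, but a minimal sketch of the initialization, based on the parameters described in the appendix, might look like the following. The catalog name, warehouse path, and lock table shown are illustrative; in the notebook they come from the loaded user_config and the S3_BUCKET value you set earlier.

from pyspark.sql import SparkSession
from awsglue.context import GlueContext

# Illustrative values; in the notebook these come from user_config and S3_BUCKET.
CATALOG = 'glue_catalog'
WAREHOUSE_PATH = f's3://{S3_BUCKET}/data/'
DYNAMODB_TABLE = 'myGlueLockTable'

spark = SparkSession.builder \
    .config(f'spark.sql.catalog.{CATALOG}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{CATALOG}.warehouse', WAREHOUSE_PATH) \
    .config(f'spark.sql.catalog.{CATALOG}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config(f'spark.sql.catalog.{CATALOG}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config(f'spark.sql.catalog.{CATALOG}.lock-impl', 'org.apache.iceberg.aws.glue.DynamoLockManager') \
    .config(f'spark.sql.catalog.{CATALOG}.lock.table', DYNAMODB_TABLE) \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.session.timeZone', 'UTC') \
    .getOrCreate()

# Create the SparkContext and GlueContext objects from the SparkSession.
sc = spark.sparkContext
glueContext = GlueContext(sc)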

For this post, we demonstrate setting the Spark configuration for Iceberg. You can also set the configuration as AWS Glue job parameters. For more information, refer to the Usage Information section in the Iceberg connector product page.

Use case walkthrough

To walk through our use case, we use two tables: acr_iceberg and acr_iceberg_report. The table acr_iceberg contains the customer review data. The table acr_iceberg_report contains BI analysis results based on the customer review data. All changes to acr_iceberg also impact acr_iceberg_report. The table acr_iceberg_report needs to be updated daily, right before sharing business reports with stakeholders.

To demonstrate this use case, we walk through the following typical steps:

  1. A data engineering team registers the acr_iceberg and acr_iceberg_report tables in the Glue Data Catalog.
  2. Customers (ecommerce users) add reviews to products in the Industrial_Supplies category. These reviews are added to the Iceberg table.
  3. A customer requests to update their reviews. We simulate updating the customer review in the acr_iceberg table.
  4. We reflect the customer’s request of the updated review in acr_iceberg into acr_iceberg_report.
  5. We revert the customer’s request of the updated review for the customer review table acr_iceberg, and reflect the reversion in acr_iceberg_report.

1. Create Iceberg tables of customer reviews and BI reports

In this step, the data engineering team creates the acr_iceberg Iceberg table for customer reviews data (based on the Amazon Customer Reviews Dataset), and the team creates the acr_iceberg_report Iceberg table for BI reports.

Create the acr_iceberg table for customer reviews

The following code initially extracts the Amazon customer reviews, which are stored in a public S3 bucket. Then it creates an Iceberg table of the customer reviews and loads these reviews into your specified S3 bucket (created via CloudFormation stack). Note that the script loads partial datasets to avoid taking a lot of time to load the data.

# Loading the dataset and creating an Iceberg table. This will take about 3-5 minutes.
spark.read \
    .option('basePath', INPUT_BASE_PATH) \
    .parquet(*INPUT_CATEGORIES) \
    .writeTo(f'{CATALOG}.{DATABASE}.{TABLE}') \
    .tableProperty('format-version', '2') \
    .create()

Regarding the tableProperty parameter, we specify format version 2 to make the table version compatible with Amazon Athena. For more information about Athena support for Iceberg tables, refer to Considerations and limitations. To learn more about the difference between Iceberg table versions 1 and 2, refer to Appendix E: Format version changes.

Let’s run the following cells. Running the second cell takes around 3–5 minutes.

After you run the cells, the acr_iceberg table is available in your specified database in the Glue Data Catalog.

You can also see the actual data and metadata of the Iceberg table in the S3 bucket that is created through the CloudFormation stack. Iceberg creates the table and writes actual data and relevant metadata that includes table schema, table version information, and so on. See the following objects in your S3 bucket:

$ aws s3 ls 's3://your-bucket/data/' --recursive
YYYY-MM-dd hh:mm:ss   83616660 data/iceberg_blog_default.db/acr_iceberg/data/00000-44-c2983230-c43a-4f4a-9b89-1f7c13e59645-00001.parquet
YYYY-MM-dd hh:mm:ss   83247771 
...
YYYY-MM-dd hh:mm:ss       5134 data/iceberg_blog_default.db/acr_iceberg/metadata/00000-bc5d3ea2-280f-4e28-a71f-4c2b749ed637.metadata.json
YYYY-MM-dd hh:mm:ss     116950 data/iceberg_blog_default.db/acr_iceberg/metadata/411308cd-1f4d-4535-9444-f6b56a56697f-m0.avro
YYYY-MM-dd hh:mm:ss       3821 data/iceberg_blog_default.db/acr_iceberg/metadata/snap-6122957686233868728-1-411308cd-1f4d-4535-9444-f6b56a56697f.avro

The job tries to create a DynamoDB table, which you specified in the CloudFormation stack (in the following screenshot, its name is myGlueLockTable), if it doesn’t exist already. As we discussed earlier, the DynamoDB table is used for commit locking for Iceberg tables.

Create the acr_iceberg_report Iceberg table for BI reports

The data engineering team also creates the acr_iceberg_report table for BI reports in the Glue Data Catalog. This table initially has the following records.

comment_count avg_star product_category
1240 4.20729367860598 Camera
95 4.80167540490342 Industrial_Supplies
663 3.80123467540571 PC

To create the table, run the following cell.
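
The cell itself is in the provided notebook; a sketch of how the report table could be created from the initial records above, reusing the same CATALOG and DATABASE variables as before, follows (the DataFrame construction here is an illustrative assumption).

# Build a small DataFrame from the initial report records and create the Iceberg table.
initial_report = [
    (1240, 4.20729367860598, 'Camera'),
    (95, 4.80167540490342, 'Industrial_Supplies'),
    (663, 3.80123467540571, 'PC'),
]
spark.createDataFrame(initial_report, ['comment_count', 'avg_star', 'product_category']) \
    .writeTo(f'{CATALOG}.{DATABASE}.acr_iceberg_report') \
    .tableProperty('format-version', '2') \
    .create()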

The two Iceberg tables have been created. Let’s check the acr_iceberg table records by running a query.

Determine the average star rating for each product category by querying the Iceberg table

You can see the Iceberg table records by using a SELECT statement. In this section, we query the acr_iceberg table with an ad hoc query to simulate viewing the current BI report data.

Run the following cell in the notebook to get the aggregated number of customer comments and mean star rating for each product_category.
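
The cell isn’t shown here; a sketch of the aggregation, run through Spark SQL against the acr_iceberg table, might look like this:

# Aggregate the number of comments and the mean star rating per product category.
spark.sql(f'''
    SELECT product_category,
           count(*) AS comment_count,
           avg(star_rating) AS avg_star
    FROM {CATALOG}.{DATABASE}.acr_iceberg
    GROUP BY product_category
''').show()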

The cell output has the following results.

Another way to query Iceberg tables is to use Amazon Athena (when you use Athena with Iceberg tables, you need to set up the Iceberg environment) or Amazon EMR.

2. Add customer reviews in the Iceberg table

In this section, customers add comments for some products in the Industrial Supplies product category, and we add these comments to the acr_iceberg table. To demonstrate this scenario, we create a Spark DataFrame based on the following new customer reviews and then add them to the table with an INSERT statement.

marketplace | customer_id | review_id | product_id | product_parent | product_title | star_rating | helpful_votes | total_votes | vine | verified_purchase | review_headline | review_body | review_date | year | product_category
US | 12345689 | ISB35E4556F144 | I00EDBY7X8 | 989172340 | plastic containers | 5 | 0 | 0 | N | Y | Five Stars | Great product! | 2022-02-01 | 2022 | Industrial_Supplies
US | 78901234 | IS4392CD4C3C4 | I00D7JFOPC | 952000001 | battery tester | 3 | 0 | 0 | N | Y | nice one, but it broke some days later | nope | 2022-02-01 | 2022 | Industrial_Supplies
US | 12345123 | IS97B103F8B24C | I002LHA74O | 818426953 | spray bottle | 2 | 1 | 1 | N | N | Two Stars | the bottle isn’t as big as pictured. | 2022-02-01 | 2022 | Industrial_Supplies
US | 23000093 | ISAB4268D46F3X | I00ARPLCGY | 562945918 | 3d printer | 5 | 3 | 3 | N | Y | Super great | very useful | 2022-02-01 | 2022 | Industrial_Supplies
US | 89874312 | ISAB4268137V2Y | I80ARDQCY | 564669018 | circuit board | 4 | 0 | 0 | Y | Y | Great, but a little bit expensive | you should buy this, but note the price | 2022-02-01 | 2022 | Industrial_Supplies

Run the following cells in the notebook to insert the customer comments to the Iceberg table. The process takes about 1 minute.
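
Those cells aren’t reproduced here; assuming the records above have been loaded into a DataFrame named new_reviews_df (an illustrative name), the insert could be sketched as follows:

# Register the new reviews as a temporary view and add them with an INSERT statement.
new_reviews_df.createOrReplaceTempView('new_reviews')
spark.sql(f'INSERT INTO {CATALOG}.{DATABASE}.acr_iceberg SELECT * FROM new_reviews')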

Run the next cell to see an addition to the product category Industrial_Supplies with 5 under comment_count.

3. Update a customer review in the Iceberg table

In the previous section, we added new customer reviews to the acr_iceberg Iceberg table. In this section, a customer requests an update of their review. Specifically, customer 78901234 requests the following update of the review ID IS4392CD4C3C4.

  • Change star_rating from 3 to 5
  • Update the review_headline from nice one, but it broke some days later to very good

We update the customer comment by running an UPDATE query in the following cell.
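
A sketch of such an UPDATE statement, based on the requested changes for review IS4392CD4C3C4, is shown below (the exact cell is in the notebook):

# Update the star rating and headline for the review from customer 78901234.
spark.sql(f"""
    UPDATE {CATALOG}.{DATABASE}.acr_iceberg
    SET star_rating = 5,
        review_headline = 'very good'
    WHERE customer_id = '78901234'
      AND review_id = 'IS4392CD4C3C4'
""")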

We can review the updated record by running the next cell as follows.

Also, when you rerun the aggregation cell for the report data, you can see the updated avg_star value for the Industrial_Supplies product category. Specifically, the avg_star value has changed from 3.8 to 4.2 as a result of the star_rating changing from 3 to 5:

4. Reflect changes in the customer reviews table in the BI report table with a MERGE INTO query

In this section, we reflect the changes in the acr_iceberg table into the BI report table acr_iceberg_report. To do so, we run the MERGE INTO query and combine the two tables based on the condition of the product_category column in each table. This query works as follows:

  • When the product_category value in each table matches, the query combines the records (summing comment_count and averaging avg_star)
  • When there is no matching product_category value, the query inserts a new record

This MERGE INTO operation is also referred to as an UPSERT (update and insert).

Run the following cell to reflect the update of customer reviews in the acr_iceberg table into the acr_iceberg_report BI table.
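
The cell is in the notebook; a sketch of what the MERGE INTO might look like, assuming the aggregated values are computed from acr_iceberg in a subquery, follows:

# Merge the per-category aggregates from acr_iceberg into the BI report table.
spark.sql(f'''
    MERGE INTO {CATALOG}.{DATABASE}.acr_iceberg_report AS t
    USING (
        SELECT product_category,
               count(*) AS comment_count,
               avg(star_rating) AS avg_star
        FROM {CATALOG}.{DATABASE}.acr_iceberg
        GROUP BY product_category
    ) AS s
    ON t.product_category = s.product_category
    WHEN MATCHED THEN UPDATE SET
        t.comment_count = t.comment_count + s.comment_count,
        t.avg_star = (t.avg_star + s.avg_star) / 2
    WHEN NOT MATCHED THEN INSERT *
''')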

After the MERGE INTO query is complete, you can see the updated acr_iceberg_report table by running the following cell.

The MERGE INTO query performed the following changes:

  • In the Camera, Industrial_Supplies, and PC product categories, each comment_count is the sum between the initial value of the acr_iceberg_report table and the aggregated table value. For example, in the Industrial_Supplies product category row, the comment_count 100 is calculated by 95 (in the initial version of acr_iceberg_report) + 5 (in the aggregated report table).
  • In addition to comment_count, the avg_star value in the Camera, Industrial_Supplies, and PC product category rows is computed by averaging each avg_star value in acr_iceberg_report with the corresponding value in the aggregated table.
  • In other product categories, each comment_count and avg_star is the same as each value in the aggregated table, which means that each value in the aggregated table is inserted into the acr_iceberg_report table.

5. Roll back the Iceberg tables and reflect changes in the BI report table

In this section, the customer who requested the update of the review now requests to revert the updated review.

Iceberg stores table versions as snapshots for each operation on the table. We can see the information for each table version by inspecting the table metadata, and we can also time travel or roll back tables to an old table version.

To complete the customer request to revert the updated review, we need to revert the table version of acr_iceberg to the earlier version when we first added the reviews. Additionally, we need to update the acr_iceberg_report table to reflect the rollback of the acr_iceberg table version. Specifically, we need to perform the following three steps to complete these operations:

  1. Check the history of table changes of acr_iceberg and acr_iceberg_report to get each table snapshot.
  2. Roll back acr_iceberg to the version when we first inserted records, and also roll back the acr_iceberg_report table to its initial version so the reverted review can be reflected.
  3. Merge the acr_iceberg table with the acr_iceberg_report table again.

Get the metadata of each report table

As a first step, we check table versions by inspecting the table. Run the following cells.
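
Iceberg exposes this information through metadata tables; a sketch of the kind of queries those cells run:

# Inspect the snapshot history of both tables through Iceberg's metadata tables.
spark.sql(f'SELECT * FROM {CATALOG}.{DATABASE}.acr_iceberg.history').show(truncate=False)
spark.sql(f'SELECT * FROM {CATALOG}.{DATABASE}.acr_iceberg.snapshots').show(truncate=False)
spark.sql(f'SELECT * FROM {CATALOG}.{DATABASE}.acr_iceberg_report.history').show(truncate=False)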

Now you can see the following table versions in acr_iceberg and acr_iceberg_report:

  • acr_iceberg has three versions:
    • The oldest one is the initial version of this table, which shows the append operation
    • The second oldest one is the record insertion, which shows the append operation
    • The latest one is the update, which shows the overwrite operation
  • acr_iceberg_report has two versions:
    • The oldest one is the initial version of this table, which shows the append operation
    • The other one is from the MERGE INTO query in the previous section, which shows the overwrite operation

As shown in the following screenshot, we roll back acr_iceberg to the table version in which the records were inserted, based on the customer’s revert request. We also roll back acr_iceberg_report to its initial version to discard the MERGE INTO operation in the previous section.

Roll back the acr_iceberg and acr_iceberg_report tables

Based on your snapshot IDs, you can roll back each table version:

  • For acr_iceberg, use the second-oldest snapshot_id (in this example, 5440744662350048750) and replace <Type snapshot_id in acr_iceberg table> in the following cell with this snapshot_id.
  • For the acr_iceberg_report table, use the initial snapshot_id (in this example, 7958428388396549892) and replace <Type snapshot_id in acr_iceberg_report table> in the following cell with this snapshot_id.

After you specify the snapshot_id for each rollback query, run the following cells.
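
A sketch of the rollback calls using Iceberg’s Spark procedures, with the example snapshot IDs from above (yours will differ), looks like this:

# Roll back each table to the chosen snapshot via Iceberg's rollback_to_snapshot procedure.
spark.sql(f"CALL {CATALOG}.system.rollback_to_snapshot('{DATABASE}.acr_iceberg', 5440744662350048750)")
spark.sql(f"CALL {CATALOG}.system.rollback_to_snapshot('{DATABASE}.acr_iceberg_report', 7958428388396549892)")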

When this step is complete, you can see the previous and current snapshot IDs of each table.

Each Iceberg table has now been reverted to the specified version.

Reflect changes in acr_iceberg into acr_iceberg_report again

We reflect the acr_iceberg table reversion into the current acr_iceberg_report table. To complete this, run the following cell.

After you rerun the MERGE INTO query, run the following cell to see the new table records. When we compare the table records, we observe that the avg_star value for Industrial_Supplies is lower than it was in the previous version of the table.

You were able to reflect the customer’s request to revert their updated review in the BI report table. Specifically, the avg_star value for the Industrial_Supplies product category now reflects the reverted review.

Clean up

To clean up all resources that you created, delete the CloudFormation stack.

Conclusion

In this post, we walked through using the Apache Iceberg connector with AWS Glue ETL jobs. We created an Iceberg table built on Amazon S3, and ran queries such as reading the Iceberg table data, inserting a record, merging two tables, and time travel.

The operations for the Iceberg table that we demonstrated in this post aren’t all of the operations Iceberg supports. Refer to the Apache Iceberg documentation for information about more operations.

Appendix: Spark configurations to use Apache Iceberg on AWS Glue

As we mentioned earlier, the notebook sets up a Spark configuration to integrate Iceberg with AWS Glue. The following table shows what each parameter defines.

Spark configuration key | Value | Description
spark.sql.catalog.{CATALOG} | org.apache.iceberg.spark.SparkCatalog | Specifies a Spark catalog interface that communicates with Iceberg tables.
spark.sql.catalog.{CATALOG}.warehouse | {WAREHOUSE_PATH} | A warehouse path for jobs to write Iceberg metadata and actual data.
spark.sql.catalog.{CATALOG}.catalog-impl | org.apache.iceberg.aws.glue.GlueCatalog | The implementation of the Spark catalog class to communicate between Iceberg tables and the AWS Glue Data Catalog.
spark.sql.catalog.{CATALOG}.io-impl | org.apache.iceberg.aws.s3.S3FileIO | Used for Iceberg to communicate with Amazon S3.
spark.sql.catalog.{CATALOG}.lock-impl | org.apache.iceberg.aws.glue.DynamoLockManager | Used for Iceberg to manage table locks.
spark.sql.catalog.{CATALOG}.lock.table | {DYNAMODB_TABLE} | A DynamoDB table name to store table locks.
spark.sql.extensions | org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions | The implementation that enables Spark to run Iceberg-specific SQL commands.
spark.sql.session.timeZone | UTC | Sets the time zone of the Spark environment to UTC for Iceberg time travel queries. The epoch time is in the UTC time zone.

About the Author

Tomohiro Tanaka is a Cloud Support Engineer at Amazon Web Services. He builds AWS Glue connectors such as the Apache Iceberg connector and the TPC-DS connector. He’s passionate about helping customers build data lakes using ETL workloads. In his free time, he enjoys coffee breaks with his colleagues and making coffee at home.

Resize Amazon Redshift from DC2 to RA3 with minimal or no downtime

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/resize-amazon-redshift-from-dc2-to-ra3-with-minimal-or-no-downtime/

Amazon Redshift is a popular cloud data warehouse that allows you to process exabytes of data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift offers different node types like DC2 (dense compute) and RA3, which you can use for your different workloads and use cases. For more information about the benefits of migrating from DS2 to RA3, refer to Scale your cloud data warehouse and reduce costs with the new Amazon Redshift RA3 nodes with managed storage and Amazon Redshift Benchmarking: Comparison of RA3 vs. DS2 Instance Types.

Many customers use DC2 nodes for their compute-intensive workloads. It’s natural to scale with your growing workload, namely by separating compute from storage so that each is right-sized to your needs. RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. Amazon Redshift managed storage uses large, high-performance SSDs in each RA3 node for fast local storage and Amazon S3 for longer-term durable storage. If the data in a node grows beyond the size of the large local SSDs, Amazon Redshift managed storage automatically offloads that data to Amazon S3. RA3 nodes keep track of the frequency of access for each data block and cache the hottest blocks. If the blocks aren’t cached, the large networking bandwidth to managed storage returns the data in sub-seconds. Also, features like cross-cluster data sharing and cross-Availability Zone cluster relocation are a few more reasons to migrate to RA3. Many customers on DC2 have benefitted from migrating to RA3 to serve their growing performance requirements and business use cases.

As a first step of the migration, we always recommend assessing the actual load on your system and determining the number of RA3 nodes that will meet your workload and give you the best cost-performance benefit. For this evaluation, you can use the Simple Replay tool to conduct a what-if analysis and evaluate how your workload performs in different scenarios. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature, or assess different cluster configurations. To choose the right cluster type, you can compare different node types for your workload and choose the right configuration of RA3 with the Simple Replay utility.

Once you know the cluster type and nodes, the next question is how to migrate your current workload to RA3 with minimum downtime or without disrupting your current workload. In this post, we describe an approach to do this with minimum downtime.

Resizing an Amazon Redshift cluster

There are three ways to resize or migrate an Amazon Redshift cluster from DC2 to RA3:

  • Elastic resize – If it’s available as an option, use elastic resize to change the node type, number of nodes, or both. Note that when you only change the number of nodes, queries are temporarily paused and connections are kept open. An elastic resize takes 10–15 minutes. During a resize operation, the cluster is read-only.
  • Classic resize – Use classic resize to change the node type, number of nodes, or both. Choose this option when you’re resizing to a configuration that isn’t available through elastic resize. A resize operation can take 2 hours or more, or last up to several days depending on your data size. During the resize operation, the source cluster is read-only.
  • Snapshot, restore, and resize – To keep your cluster available during a classic resize, make a copy of the existing cluster, then resize the new cluster. If data is written to the source cluster after a snapshot is taken, the data must be manually copied over after the migration is complete.

Checkpoints for resize

When a cluster is resized using elastic resize with the same node type, the operation doesn’t create a new cluster. As a result, the operation completes quickly. In other cases, one or more of the following challenges can delay the resize:

  • Data volumes – The time required to complete a classic resize or a snapshot and restore operation might vary, depending on factors like the workload on the source cluster, the number and volume of tables being transformed, how evenly data is distributed across the compute nodes and slices, and the node configuration in the source and target clusters.
  • Snapshots – Automated snapshots are automatically deleted when their retention period expires, when you disable automated snapshots, or when you delete a cluster. If you want to keep an automated snapshot, you can copy it to a manual snapshot. You can take a manual snapshot of the cluster before the migration to use for the resize operation, but it won’t include any data written to the source cluster after the snapshot was captured.
  • Cluster unavailable during resize – It’s critical to know roughly how long the resize will take. To do so, you can try creating a cluster from the snapshot in a test account. However, this only gives a ballpark idea because resize times can vary, especially if you intend to query your cluster during the resize. If the cluster is live almost all the time with minimal or zero non-business hours, a resize can be a challenge because the cluster can’t upsert live data and serve read requests on this data during this window.
  • Cluster endpoint retention – Elastic resize and cluster resize allow you to change the node type, number of nodes, or both, but the endpoint is retained. With snapshot resize, a new cluster endpoint is created, which may require a change in your application to replace the endpoint.
  • Reconciliation – Validate the target cluster data against the source to make sure the migration completed without data loss and to ensure data quality. Reconciliation at the table level isn’t sufficient; you also need to ensure records have been copied from the source. You can run a matching record count check followed by data validation using checksums to verify the accuracy of the data.

Solution overview

The steps to prepare for migration are as follows:

  1. Take a snapshot of the existing production Amazon Redshift cluster running on DC2.
  2. Create another Amazon Simple Storage Service (Amazon S3) bucket, where AWS Glue writes the curated data in parallel.
  3. Use the snapshot to create an RA3 cluster.
  4. Configure AWS Database Migration Service (AWS DMS) to load data from the reserve S3 bucket to the newly created RA3 cluster.
  5. After you confirm that the data is synced between the two clusters (DC2 and RA3) and all other downstream applications, stop the DC2 cluster and change the endpoint of your dependent downstream applications to the newly created RA3 cluster.

The following diagram depicts the current architecture with a live workload.

In this solution, data comes from three source systems and is written into a raw S3 bucket:

  • Change data capture (CDC) from an RDS instance via AWS DMS (1 in the preceding diagram)
  • Events captured via an external API (2)
  • CSV files from an external source copied to the raw bucket (3)

These sources don’t have a pattern or an interval of pushing new data.

Every few minutes, the ingested data is picked up by an S3 event trigger to run an AWS Glue workflow (4 in the preceding diagram). It provides an orchestration layer to manage and run jobs and crawlers. This workflow includes a crawler (5) that updates the metadata schema and partitions of the dataset to the AWS Glue Data Catalog. Then the crawler triggers an AWS Glue job that writes the curated data to the S3 curated bucket. From there, another AWS Glue job uploads data into Amazon Redshift (6).

In this scenario, if your workload is critical and you can’t afford a long downtime, then you need to plan your migration accordingly.

Dual write and transient data curation pipeline

As a first step of the migration, you need a parallel data processing pipeline alongside the AWS Glue job that writes data into the curated S3 bucket. Create another S3 bucket, name it migrated-curated-bucket, and modify the AWS Glue transform job to also write data to this new reserve S3 bucket in parallel. Alternatively, you can replicate the transform job and have the copy write to the reserve bucket.

In this scenario, live data ingestion occurs every 30 minutes. When an iteration of the extract, transform, and load (ETL) job is complete, this triggers a manual snapshot of the Amazon Redshift cluster. After the snapshot is captured, a new Amazon Redshift cluster is created using that snapshot. Cluster creation time can vary depending on the snapshot volume.

If snapshot creation takes more than 30 minutes, then the ETL job should be stopped, and resume after the snapshot creation is complete. For example, if the ETL job is triggered at 8:00 AM and finishes at 8:10 AM, then snapshot creation starts at 8:10 AM. If it finishes by 8:30 AM (the next ETL job will run at 8:30 AM as per the half-hour interval), then the ETL process continues according to the schedule. Otherwise, the job stops, and resumes after the snapshot completion.

Now we use the snapshot to launch a new RA3 Amazon Redshift cluster. This process doesn’t pause the existing ETL pipeline; rather, the pipeline starts writing curated data in parallel to the reserve S3 bucket. The following diagram illustrates this updated workflow.

At this point, the existing cluster is still live and continues to process the live workload. Even if creation of the Amazon Redshift cluster takes time (owing to the huge volume of data), you should still be covered. The curated data in the S3 bucket acts as a staging reserve, and this data should be loaded into the RA3 cluster after the cluster is launched.

Backfill the new RA3 cluster with missing data

After the RA3 cluster has been launched, you need to play back the captured live data from the reserve S3 bucket to the newly created cluster. Playback covers only the period from the snapshot capture to the current timestamp. With this process, you bring the RA3 cluster in sync with the existing live DC2 cluster.

You need to configure an AWS DMS migration task with the reserve S3 bucket as the source endpoint and the newly created RA3 cluster as the target endpoint.

AWS DMS captures ongoing changes from the source data store and applies them to the target data store. This process is called ongoing replication or change data capture (CDC). For database sources, this process works by collecting changes to the database logs using the database engine’s native API. The following diagram illustrates this workflow.

Reconciliation and cutover

Data reconciliation is the process of verification of data between source and target. In this process, target data is compared with source data to ensure that the data is transferred completely without any alterations. To ensure reliability in the pipeline and the data processed, you should create an end-to-end reconciliation report. This report verifies the percentage of matching tables, columns, and data records. It also identifies missing records, missing values, incorrect values, badly formatted values, and duplicated records.

You can define the reconciliation process to check whether both clusters are running in sync. For that you can create simple Python scripts or shell scripts to query the source and target clusters, fetch the results, and compare.
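
For example, a minimal Python sketch using the Amazon Redshift Data API could compare row counts between the two clusters; the cluster identifiers, database, user, and table list below are illustrative placeholders, and checksum-based validation would follow the same pattern.

import time
import boto3

client = boto3.client('redshift-data')

def row_count(cluster_id, table):
    # Run a count query through the Redshift Data API and wait for the result.
    resp = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database='dev',
        DbUser='awsuser',
        Sql=f'SELECT count(*) FROM {table}'
    )
    while client.describe_statement(Id=resp['Id'])['Status'] not in ('FINISHED', 'FAILED', 'ABORTED'):
        time.sleep(2)
    return int(client.get_statement_result(Id=resp['Id'])['Records'][0][0]['longValue'])

for table in ['public.orders', 'public.customers']:  # tables to reconcile
    src = row_count('dc2-production-cluster', table)
    tgt = row_count('ra3-new-cluster', table)
    print(table, 'OK' if src == tgt else f'MISMATCH: {src} vs {tgt}')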

Cutover is the final step of migration, and involves switching the existing cluster with the newly launched cluster. At this point, the clusters are running in parallel. Next, you validate that the downstream data consumption flows are up to date. Verify the reconciliation metrics from the DC2 and RA3 clusters such that table updates are in sync.

You can keep dual write while you switch from the migration data pipeline. If you discover any issues after cutting over, you can switch back to the old data pipeline, which is the source of truth until cutover. In this case, cutover involves updating the DC2 cluster endpoint to the new RA3 cluster endpoint in the application. Make sure to identify a relatively quiet window during the day to update the endpoint. To keep the same endpoint for your applications and users, you can rename the new RA3 cluster with the same name as the original DC2 cluster. To rename the cluster, modify the cluster in the Amazon Redshift console or use the ModifyCluster API operation. For more information, see Renaming clusters or ModifyCluster API operation in the Amazon Redshift API Reference.
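
If you prefer to script the rename, a small sketch with the AWS SDK for Python follows; the cluster identifiers are illustrative, and the original DC2 cluster must be renamed or deleted first so that its identifier is free.

import boto3

redshift = boto3.client('redshift')

# Give the new RA3 cluster the identifier that applications already point to.
redshift.modify_cluster(
    ClusterIdentifier='analytics-ra3-new',
    NewClusterIdentifier='analytics-prod'
)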

Up to this point, AWS DMS is continuing to update RA3. After you cut over to RA3, the DC2 cluster is no longer live and you can stop the AWS DMS replication job to RA3. Pause the last snapshot. Delete the reserve S3 bucket and AWS DMS resources used for RA3 load.

Conclusion

In this post, we presented an approach to migrate an existing Amazon Redshift cluster with minimal to no data loss, which also allows the cluster to serve both read and write operations during the resize window. Elastic resize is a quick way to resize your cluster to maintain the same number of slices in the target cluster. Slice mapping reduces the time required to resize a cluster. If you choose a resize configuration that isn’t available on elastic resize, you can choose classic resize or perform a snapshot, restore, and resize.

To learn more about what’s new with RA3 instances, refer to Amazon Redshift RA3 instances with managed storage. Amazon Redshift delivers better price performance and at the same time helps you keep your costs predictable. Amazon Redshift Serverless automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you pay only for the resources you use. This provides greater flexibility to choose either or both based on custom requirements. After you’ve made your choice, try the hands-on labs on Amazon Redshift.


About the Authors

Soujanya Konka is a Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya had stints with companies such as HSBC and Cognizant.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernise their data platforms using AWS Analytics services. He works with customers to design and build analytics solutions that enable businesses to make data-driven decisions.

Implement a CDC-based UPSERT in a data lake using Apache Iceberg and AWS Glue

Post Syndicated from Sakti Mishra original https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/

As the implementation of data lakes and modern data architecture increases, customers’ expectations around its features also increase, which include ACID transaction, UPSERT, time travel, schema evolution, auto compaction, and many more. By default, Amazon Simple Storage Service (Amazon S3) objects are immutable, which means you can’t update records in your data lake because it supports append-only transactions. But there are use cases where you might be receiving incremental updates with change data capture (CDC) from your source systems, and you might need to update existing data in Amazon S3 to have a golden copy. Previously, you had to overwrite the complete S3 object or folders, but with the evolution of frameworks such as Apache Hudi, Apache Iceberg, Delta Lake, and governed tables in AWS Lake Formation, you can get database-like UPSERT features in Amazon S3.

Apache Hudi integration is already supported with AWS analytics services, and recently AWS Glue, Amazon EMR, and Amazon Athena announced support for Apache Iceberg. Apache Iceberg is an open table format originally developed at Netflix, which was open-sourced as an Apache project in 2018 and graduated from the incubator in mid-2020. It’s designed to support ACID transactions and UPSERT on petabyte-scale data lakes, and is getting popular because of its flexible SQL syntax for CDC-based MERGE, full schema evolution, and hidden partitioning features.

In this post, we walk you through a solution to implement CDC-based UPSERT or MERGE in an S3 data lake using Apache Iceberg and AWS Glue.

Configure Apache Iceberg with AWS Glue

You can integrate Apache Iceberg JARs into AWS Glue through its AWS Marketplace connector. The connector supports AWS Glue versions 1.0, 2.0, and 3.0, and is free to use. Configuring this connector is as easy as clicking a few buttons on the user interface.

The following steps guide you through the setup process:

  1. Navigate to the AWS Marketplace connector page.
  2. Choose Continue to Subscribe and then Accept Terms.
  3. Choose Continue to Configuration.
  4. Choose the AWS Glue version and software version.
  5. Choose Continue to Launch.
  6. Choose Usage Instruction, which opens a page that has a link to activate the connector.
  7. Create a connection by providing a name and choosing Create connection and activate connector.

You can confirm your new connection on the AWS Glue Studio Connectors page.

To use this connector, when you create an AWS Glue job, make sure you add this connector to your job. Later in the implementation steps, when you create an AWS Glue job, we show how to use the connector you just configured.
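
If you script your AWS Glue job creation instead of using the console, the following is a minimal boto3 sketch of attaching the connector's connection to a job. The job name, role ARN, script location, and connection name are placeholders for illustration; the connection name must match the connection you created in the previous steps.

import boto3

glue = boto3.client("glue")

# Placeholder names and ARNs -- adjust to your environment.
glue.create_job(
    Name="iceberg-upsert-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="3.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://glue-iceberg-demo/scripts/iceberg_upsert.py",
        "PythonVersion": "3",
    },
    # Attach the Iceberg connector connection created from the AWS Marketplace listing.
    Connections={"Connections": ["iceberg-connection"]},
    DefaultArguments={
        "--iceberg_job_catalog_warehouse": "s3://glue-iceberg-demo/iceberg-output/"
    },
    NumberOfWorkers=5,
    WorkerType="G.1X",
)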

Solution overview

Let’s assume you have a relational database that has product inventory data, and you want to move it into an S3 data lake on a continuous basis, so that your downstream applications or consumers can use it for analytics. After your initial data movement to Amazon S3, you receive incremental updates from the source database as CSV files using AWS DMS or equivalent tools, where each record has an additional column to represent an insert, update, or delete operation. While processing the incremental CDC data, one of the primary requirements is to merge the CDC data into the data lake and provide the capability to query previous versions of the data.

To solve this use case, we present the following simple architecture that integrates Amazon S3 for the data lake, AWS Glue with the Apache Iceberg connector for ETL (extract, transform, and load), and Athena for querying the data using standard SQL. Athena helps in querying the latest product inventory data from the Iceberg table’s latest snapshot, and Iceberg’s time travel feature helps in identifying a product’s price at any previous date.

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

  • Data ingestion:
    • Steps 1.1 and 1.2 use AWS Database Migration Service (AWS DMS), which connects to the source database and moves incremental data (CDC) to Amazon S3 in CSV format.
    • Steps 1.3 and 1.4 consist of the AWS Glue PySpark job, which reads incremental data from the S3 input bucket, performs deduplication of the records, and then invokes Apache Iceberg’s MERGE statements to merge the data with the target UPSERT S3 bucket.
  • Data access:
    • Steps 2.1 and 2.2 represent Athena integration to query data from the Iceberg table using standard SQL and validate the time travel feature of Iceberg.
  • Data Catalog:
    • The AWS Glue Data Catalog is treated as a centralized catalog, which is used by AWS Glue and Athena. An AWS Glue crawler is integrated on top of S3 buckets to automatically detect the schema.

We have referenced AWS DMS as part of the architecture, but while showcasing the solution steps, we assume that the AWS DMS output is already available in Amazon S3, and focus on processing the data using AWS Glue and Apache Iceberg.

To demo the implementation steps, we use sample product inventory data that has the following attributes:

  • op – Represents the operation on the source record. This shows values I to represent insert operations, U to represent updates, and D to represent deletes. You need to make sure this attribute is included in your CDC incremental data before it gets written to Amazon S3. AWS DMS enables you to include this attribute, but if you’re using other mechanisms to move data, make sure you capture this attribute, so that your ETL logic can take appropriate action while merging it.
  • product_id – This is the primary key column in the source database’s products table.
  • category – This column represents the product’s category, such as Electronics or Cosmetics.
  • product_name – This is the name of the product.
  • quantity_available – This is the quantity available in the inventory for a product. When we showcase the incremental data for UPSERT or MERGE, we reduce the quantity available for the product to showcase the functionality.
  • last_update_time – This is the time when the product record was updated at the source database.

If you’re using AWS DMS to move data from your relational database to Amazon S3, then by default AWS DMS includes the op attribute for incremental CDC data, but it’s not included by default for the initial load. If you’re using CSV as your target file format, you can include IncludeOpForFullLoad as true in your S3 target endpoint setting of AWS DMS to have the op attribute included in your initial full load file. To learn more about the Amazon S3 settings in AWS DMS, refer to S3Settings.
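
If you create the AWS DMS S3 target endpoint programmatically, the following is a minimal boto3 sketch that enables IncludeOpForFullLoad; the endpoint identifier, role ARN, and bucket details are placeholders, not values created elsewhere in this post.

import boto3

dms = boto3.client("dms")

dms.create_endpoint(
    EndpointIdentifier="s3-target-iceberg-demo",  # placeholder identifier
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-target-role",
        "BucketName": "glue-iceberg-demo",
        "BucketFolder": "raw-csv-input",
        "DataFormat": "csv",
        # Adds the op column (I/U/D) to the initial full-load files as well,
        # so downstream merge logic can treat full load and CDC data uniformly.
        "IncludeOpForFullLoad": True,
    },
)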

To implement the solution, we create AWS resources such as an S3 bucket and an AWS Glue job, and integrate the Iceberg code for processing. Before we run the AWS Glue job, we have to upload the sample CSV files to the input bucket and process it with AWS Glue PySpark code for the output.

Prerequisites

Before getting started on the implementation, make sure you have the required permissions to perform the following in your AWS account:

  • Create AWS Identity and Access Management (IAM) roles as needed
  • Read or write to an S3 bucket
  • Create and run AWS Glue crawlers and jobs
  • Manage a database, table, and workgroups, and run queries in Athena

For this post, we use the us-east-1 Region, but you can integrate it in your preferred Region if the AWS services included in the architecture are available in that Region.

Now let’s dive into the implementation steps.

Create an S3 bucket for input and output

To create an S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Specify the bucket name as glue-iceberg-demo, and leave the remaining fields as default.
    S3 bucket names are globally unique. While implementing the solution, you may get an error saying the bucket name already exists. Make sure to provide a unique name and use that same name throughout the rest of the implementation steps. Formatting the bucket name as <Bucket-Name>-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE} might help you get a unique name.
  4. Choose Create bucket.
  5. On the bucket details page, choose Create folder.
  6. Create two subfolders: raw-csv-input and iceberg-output.
  7. Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.

The following screenshot provides a sample of the input dataset.

Create input and output tables using Athena

To create input and output Iceberg tables in the AWS Glue Data Catalog, open the Athena console and run the following queries in sequence:

-- Create database for the demo
CREATE DATABASE iceberg_demo;
-- Create external table in input CSV files. Replace the S3 path with your bucket name
CREATE EXTERNAL TABLE iceberg_demo.raw_csv_input(
  op string, 
  product_id bigint, 
  category string, 
  product_name string, 
  quantity_available bigint, 
  last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://glue-iceberg-demo/raw-csv-input/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'typeOfData'='file');
-- Create output Iceberg table with partitioning. Replace the S3 bucket name with your bucket name
CREATE TABLE iceberg_demo.iceberg_output (
  product_id bigint,
  category string,
  product_name string,
  quantity_available bigint,
  last_update_time timestamp) 
PARTITIONED BY (category, bucket(16,product_id)) 
LOCATION 's3://glue-iceberg-demo/iceberg-output/' 
TBLPROPERTIES (
  'table_type'='ICEBERG',
  'format'='parquet',
  'write_target_data_file_size_bytes'='536870912' 
);
-- Validate the input data
SELECT * FROM iceberg_demo.raw_csv_input;

Alternatively, you can integrate an AWS Glue crawler on top of the input to create the table, as sketched below.
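
The following is a minimal boto3 sketch of that crawler; the crawler name and IAM role are placeholders, and the S3 path follows the bucket naming used earlier in this post.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-csv-input-crawler",  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role ARN
    DatabaseName="iceberg_demo",
    Targets={"S3Targets": [{"Path": "s3://glue-iceberg-demo/raw-csv-input/"}]},
)
glue.start_crawler(Name="raw-csv-input-crawler")

Next, let’s create the AWS Glue PySpark job to process the input data.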

Create the AWS Glue job

Complete the following steps to create an AWS Glue job:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose Create job.
  3. Select Spark script editor.
  4. For Options, select Create a new script with boilerplate code.
  5. Choose Create.
  6. Replace the script with the following script:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    
    from pyspark.sql.functions import *
    from awsglue.dynamicframe import DynamicFrame
    
    from pyspark.sql.window import Window
    from pyspark.sql.functions import rank, max
    
    from pyspark.conf import SparkConf
    
    args = getResolvedOptions(sys.argv, ['JOB_NAME', 'iceberg_job_catalog_warehouse'])
    conf = SparkConf()
    
    ## Please make sure to pass runtime argument --iceberg_job_catalog_warehouse with value as the S3 path 
    conf.set("spark.sql.catalog.job_catalog.warehouse", args['iceberg_job_catalog_warehouse'])
    conf.set("spark.sql.catalog.job_catalog", "org.apache.iceberg.spark.SparkCatalog")
    conf.set("spark.sql.catalog.job_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    conf.set("spark.sql.catalog.job_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    conf.set("spark.sql.iceberg.handle-timestamp-without-timezone","true")
    
    sc = SparkContext(conf=conf)
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    
    ## Read Input Table
    IncrementalInputDyF = glueContext.create_dynamic_frame.from_catalog(database = "iceberg_demo", table_name = "raw_csv_input", transformation_ctx = "IncrementalInputDyF")
    IncrementalInputDF = IncrementalInputDyF.toDF()
    
    if not IncrementalInputDF.rdd.isEmpty():
        ## Apply De-duplication logic on input data, to pickup latest record based on timestamp and operation 
        IDWindowDF = Window.partitionBy(IncrementalInputDF.product_id).orderBy(IncrementalInputDF.last_update_time).rangeBetween(-sys.maxsize, sys.maxsize)
                      
        # Add a new column to capture the latest update timestamp for each product_id
        inputDFWithTS= IncrementalInputDF.withColumn("max_op_date",max(IncrementalInputDF.last_update_time).over(IDWindowDF))
        
        # Filter out new records that are inserted, then select latest record from existing records and merge both to get deduplicated output 
        NewInsertsDF = inputDFWithTS.filter("last_update_time=max_op_date").filter("op='I'")
        UpdateDeleteDf = inputDFWithTS.filter("last_update_time=max_op_date").filter("op IN ('U','D')")
        finalInputDF = NewInsertsDF.unionAll(UpdateDeleteDf)
    
        # Register the deduplicated input as temporary table to use in Iceberg Spark SQL statements
        finalInputDF.createOrReplaceTempView("incremental_input_data")
        finalInputDF.show()
        
        ## Perform merge operation on incremental input data with MERGE INTO. This section of the code uses Spark SQL to showcase the expressive SQL approach of Iceberg to perform a Merge operation
        IcebergMergeOutputDF = spark.sql("""
        MERGE INTO job_catalog.iceberg_demo.iceberg_output t
        USING (SELECT op, product_id, category, product_name, quantity_available, to_timestamp(last_update_time) as last_update_time FROM incremental_input_data) s
        ON t.product_id = s.product_id
        WHEN MATCHED AND s.op = 'D' THEN DELETE
        WHEN MATCHED THEN UPDATE SET t.quantity_available = s.quantity_available, t.last_update_time = s.last_update_time 
        WHEN NOT MATCHED THEN INSERT (product_id, category, product_name, quantity_available, last_update_time) VALUES (s.product_id, s.category, s.product_name, s.quantity_available, s.last_update_time)
        """)
    
        job.commit()

  7. On the Job details tab, specify the job name.
  8. For IAM Role, assign an IAM role that has the required permissions to run an AWS Glue job and read and write to the S3 bucket.
  9. For Glue version, choose Glue 3.0.
  10. For Language, choose Python 3.
  11. Make sure Job bookmark has default value of Enable.
  12. Under Connections, choose the Iceberg connector.
  13. Under Job parameters, specify Key as --iceberg_job_catalog_warehouse and Value as your S3 path (e.g. s3://<bucket-name>/<iceberg-warehouse-path>).
  14. Choose Save and then Run, which should write the input data to the Iceberg table with a MERGE statement.

Because the target table is empty in the first run, the Iceberg MERGE statement runs an INSERT statement for all records.

Query the Iceberg table using Athena

After you have successfully run the AWS Glue job, you can validate the output in Athena with the following SQL query:

SELECT * FROM iceberg_demo.iceberg_output limit 10;

The output of the query should match the input, with one difference: The Iceberg output table doesn’t have the op column.

Upload incremental (CDC) data for further processing

After we process the initial full load file, let’s upload the following two incremental files, which include insert, update, and delete records for a few products.

The following is a snapshot of first incremental file (20220302-1134010000.csv).

The following is a snapshot of the second incremental file (20220302-1135010000.csv), which shows that record 102 has another update transaction before the next ETL job processing.

After you upload both incremental files, you should see them in the S3 bucket.

Run the AWS Glue job again to process incremental files

Because we enabled bookmarks on the AWS Glue job, the next job picks up only the two new incremental files and performs a merge operation on the Iceberg table.

To run the job again, complete the following steps:

  • On the AWS Glue console, choose Jobs in the navigation pane.
  • Select the job and choose Run.

As explained earlier, the PySpark script is expected to deduplicate the input data before merging it into the target Iceberg table, which means it only picks up the latest record for product 102.

For this post, we run the job manually, but you can configure your AWS Glue jobs to run as part of an AWS Glue workflow or via AWS Step Functions (for more information, see Manage AWS Glue Jobs with Step Functions).

Query the Iceberg table using Athena, after incremental data processing

After incremental data processing is complete, you can run the same SELECT statement again and validate that the quantity value is updated for record 102 and product record 103 is deleted.

The following screenshot shows the output.

Query the previous version of data with Iceberg’s time travel feature

You can run the following SQL query in Athena, which uses Iceberg’s FOR SYSTEM_TIME AS OF clause to query a previous version of the data:

SELECT * FROM iceberg_demo.iceberg_output FOR SYSTEM_TIME AS OF TIMESTAMP '2022-03-23 18:56:00';

The following screenshot shows the output. As you can see, the quantity value of product ID 102 is 30, which was available during the initial load.

Note that you have to change the AS OF TIMESTAMP value based on your runtime.
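
Time travel is also available from the AWS Glue PySpark job itself. The following is a minimal sketch, assuming it runs inside the job where the spark session and the job_catalog catalog are already configured as shown earlier; the timestamp value is illustrative and must come from your own run.

# Read the table as of a point in time (epoch milliseconds) with Iceberg's Spark reader options
as_of_millis = 1648061760000  # example value; replace with a timestamp from your own run

historical_df = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", str(as_of_millis))
    .load("job_catalog.iceberg_demo.iceberg_output")
)
historical_df.show()

# Alternatively, inspect the table's snapshot history to pick a specific snapshot-id
spark.sql("SELECT * FROM job_catalog.iceberg_demo.iceberg_output.snapshots").show(truncate=False)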

This concludes the implementation steps.

Considerations

The following are a few considerations you should keep in mind while integrating Apache Iceberg with AWS Glue:

  • Athena support for Iceberg became generally available recently, so make sure you review the considerations and limitations of using this feature.
  • AWS Glue provides DynamicFrame APIs to read from different source systems and write to different targets. For this post, we integrated Spark DataFrame instead of AWS Glue DynamicFrame because Iceberg’s MERGE statements aren’t supported with AWS Glue DynamicFrame APIs.
    To learn more about AWS integration, refer to Iceberg AWS Integrations.

Conclusion

This post explains how you can use the Apache Iceberg framework with AWS Glue to implement UPSERT on an S3 data lake. It provides an overview of Apache Iceberg, its features and integration approaches, and explains how you can implement it through a step-by-step guide.

I hope this gives you a great starting point for using Apache Iceberg with AWS analytics services and that you can build on top of it to implement your solution.

Appendix: AWS Glue DynamicFrame sample code to interact with Iceberg tables

  • The following code sample demonstrates how you can integrate the DynamicFrame method to read from an Iceberg table:
IcebergDyF = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "path": "job_catalog.iceberg_demo.iceberg_output",
            "connectionName": "Iceberg Connector for Glue 3.0",
        },
        transformation_ctx="IcebergDyF",
    )
)

## Optionally, convert to Spark DataFrame if you plan to leverage Iceberg’s SQL based MERGE statements
InputIcebergDF = IcebergDyF.toDF()
  • The following sample code shows how you can integrate the DynamicFrame method to write to an Iceberg table for append-only mode:
## Use the following 2 lines to convert Spark DataFrame to DynamicFrame, if you plan to leverage DynamicFrame API to write to final target
from awsglue.dynamicframe import DynamicFrame 
finalDyF = DynamicFrame.fromDF(InputIcebergDF,glueContext,"finalDyF")

WriteIceberg = glueContext.write_dynamic_frame.from_options(
    frame= finalDyF,
    connection_type="marketplace.spark",
    connection_options={
        "path": "job_catalog.iceberg_demo.iceberg_output",
        "connectionName": "Iceberg Connector for Glue 3.0",
    },
    format="parquet",
    transformation_ctx="WriteIcebergDyF",
)

About the Author

Sakti Mishra is a Principal Data Lab Solution Architect at AWS, where he helps customers modernize their data architecture and define an end-to-end data strategy, including data security, accessibility, governance, and more. He is also the author of the book Simplify Big Data Analytics with Amazon EMR. Outside of work, Sakti enjoys learning new technologies, watching movies, and visiting places with family.

Simplify and optimize Python package management for AWS Glue PySpark jobs with AWS CodeArtifact

Post Syndicated from Ashok Padmanabhan original https://aws.amazon.com/blogs/big-data/simplify-and-optimize-python-package-management-for-aws-glue-pyspark-jobs-with-aws-codeartifact/

Data engineers use various Python packages to meet their data processing requirements while building data pipelines with AWS Glue PySpark jobs. Languages like Python and Scala are commonly used in data pipeline development. Developers can take advantage of open-source packages or even build their own to make it easier and faster to perform use cases such as data manipulation and analysis. However, managing standardized packages can be cumbersome when multiple teams use different versions of packages, install non-approved packages, and duplicate development effort due to the lack of visibility of what is available at the enterprise level. This can be especially challenging in large enterprises with multiple data engineering teams.

ETL developers often need additional packages for their AWS Glue ETL jobs. With security being job zero for customers, many restrict egress traffic from their VPC to the public internet, and they need a way to manage the packages used by applications, including their data processing pipelines.

Our proposed solution enables organizations with network egress restrictions to manage packages centrally with AWS CodeArtifact and use their favorite libraries in their AWS Glue ETL PySpark code. In this post, we describe how CodeArtifact can be used for managing packages and modules for AWS Glue ETL jobs, and we demo a solution using Glue PySpark jobs that run within VPC subnets that have no internet access.

Solution overview

The solution uses CodeArtifact as a tool to make it easier for organizations of any size to securely store, publish, and share software packages used in their ETL with AWS Glue. VPC Endpoints will be enabled for CodeArtifact and Glue to enable private link connections. AWS Step Functions makes it easy to coordinate the orchestration of components used in the data processing pipeline. Native integrations with both CodeArtifact and AWS Glue enable the workflow to both authenticate the request to CodeArtifact and start the AWS Glue ETL job.

The following architecture shows an implementation of a solution using AWS Glue, CodeArtifact, and Step Functions to use additional Python modules without egress internet access. The solution is deployed using AWS Cloud Development Kit (AWS CDK), an open-source software development framework to define your cloud application resources using familiar programming languages.

Fig 1: Architecture Diagram for the Solution

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploying an AWS CDK stack to provision the following AWS Resources
    1. CodeArtifact
    2. An AWS Glue job
    3. Step Functions workflow
    4. Amazon Simple Storage Service (Amazon S3) bucket
    5. A VPC with a private Subnet and VPC Endpoints to Amazon S3 and CodeArtifact
  2. Validate the Deployment.
  3. Run a Sample Workflow – This workflow will run an AWS Glue PySpark job that uses a custom Python library, and an upgraded version of boto3.
  4. Cleaning up your resources.

Prerequisites

Make sure that you complete the following steps as prerequisites:

The solution

Launching your AWS CDK Stack

Step 1: Using your device’s command line, check out our Git repository to a local directory on your device:

git clone https://github.com/aws-samples/python-lib-management-without-internet-for-aws-glue-in-private-subnets.git

Step 2: Change directories to the new directory Amazon S3 script location:

cd python-lib-management-without-internet-for-aws-glue-in-private-subnets/scripts/s3

Step 3: Download the following CSV, which contains New York City Taxi and Limousine Commission (TLC) weekly trip data. This will serve as the input source for the AWS Glue job:

aws s3 cp s3://nyc-tlc/misc/FOIL_weekly_trips_apps.csv .

Step 4: Change the directories to the path where the app.py file is located (in reference to the previous step, execute the following step):

cd ../..

Step 5: Create a virtual environment:

macOS/Linux:
python3 -m venv .env

Windows:
python -m venv .env

Step 6: Activate the virtual environment after the init process completes and the virtual environment is created:

macOS/Linux:
source .env/bin/activate

Windows:
.env\Scripts\activate.bat

Step 7: Install the required dependencies:

pip3 install -r requirements.txt

Step 8: Make sure that your AWS profile is set up along with the Region that you want to deploy to, as mentioned in the prerequisites. Synthesize the templates. AWS CDK apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:

cdk synthesize

Step 9: BootStrap the cdk app using the following command:

cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>

Replace the placeholders AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the Region to be deployed to.

This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.

Step 10: Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the --require-approval never flag:

cdk deploy "*" --require-approval never

While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal:

Fig 2: AWS CDK Deployment progress in terminal

Once the deployment is successful, you’ll see the successful status as follows:

Fig 3: AWS CDK Deployment completion success

Step 11: Log in to the AWS Console, go to CloudFormation, and see the output of the ApplicationStack stack:

Fig 4: AWS CloudFormation stack output

Note the values of the DomainName and RepositoryName variables. We’ll use them in the next step to upload our artifacts.

Step 12: We will upload a custom library into the repo that we created. This will be used by our Glue ETL job.

  • Install twine using pip:
python3 -m pip install twine

The custom Python package glueutils-0.2.0.tar.gz can be found under this folder of the cloned repo:

cd scripts/custom_glue_library
  • Configure twine with the login command (additional details here). Refer to step 11 for the DomainName and RepositoryName from the CloudFormation output:
aws codeartifact login --tool twine --domain <DomainName> --domain-owner <AWS_ACCOUNTID> --repository <RepositoryName>
  • Publish Python package assets:
twine upload --repository codeartifact glueutils-0.2.0.tar.gz

Fig 5: Python package publishing using twine

Validate the Deployment

The AWS CDK stack will deploy the following AWS resources:

  1. Amazon Virtual Private Cloud (Amazon VPC)
    1. One Private Subnet
  2. AWS CodeArtifact
    1. CodeArtifact Repository
    2. CodeArtifact Domain
    3. CodeArtifact Upstream Repository
  3. AWS Glue
    1. AWS Glue Job
    2. AWS Glue Database
    3. AWS Glue Connection
  4. AWS Step Function
  5. Amazon S3 Bucket for AWS CDK and also for storing scripts and CSV file
  6. IAM Roles and Policies
  7. Amazon Elastic Compute Cloud (Amazon EC2) Security Group

Step 1: Browse to the AWS account and region via the AWS Console to which the resources are deployed.

Step 2: Browse to the Subnets page (https://<region>.console.aws.amazon.com/vpc/home?region=<region>#subnets:), replacing <region> with the AWS Region to which your resources are deployed.

Step 3: Select the Subnet with name as ApplicationStack/enterprise-repo-vpc/Enterprise-Repo-Private-Subnet1

Step 4: Select the Route Table and validate that there are no Internet Gateway or NAT Gateway for routes to Internet, and that it’s similar to the following image:

Fig 6: Route table validation

Step 5: Navigate to the CodeArtifact console and review the repositories created. The enterprise-repo is your local repository, and pypi-store is the upstream repository connected to the PyPI, providing artifacts from pypi.org.

Fig 7: AWS CodeArtifact repositories created

Step 6: Navigate to enterprise-repo and search for glueutils. This is the custom python package that we published.

Fig 8: AWS CodeArtifact custom Python package published

Step 7: Navigate to Step Functions Console and review the enterprise-repo-step-function as follows:

Fig 9: AWS Step Functions workflow

The diagram shows how the Step Functions workflow will orchestrate the pattern.

  1. The first step CodeArtifactGetAuthorizationToken calls the getAuthorizationToken API to generate a temporary authorization token for accessing repositories in the domain (this token is valid for 15 mins.).
  2. The next step GenerateCodeArtifactURL takes the authorization token from the response and generates the CodeArtifact URL.
  3. Then, this will move into the GlueStartJobRun state, which makes a synchronous API call to run the AWS Glue job.

Step 8: Navigate to the AWS Glue Console and select the Jobs tab, then select enterprise-repo-glue-job.

The AWS Glue job is created with the following script and AWS Glue Connection enterprise-repo-glue-connection. The AWS Glue connection is a Data Catalog object that enables the job to connect to sources and APIs from within the VPC. The network type connection runs the job from within the private subnet to make requests to Amazon S3 and CodeArtifact over the VPC endpoint connection. This enables the job to run without any traffic through the internet.

Note the connections section in the AWS Glue PySpark Job, which makes the Glue job run on the private subnet in the VPC provisioned.

Fig 10: AWS Glue network connections

The job takes an Amazon S3 bucket, Glue Database, Python Job Installer Option, and Additional Python Modules as job parameters. The parameters --additional-python-modules and --python-modules-installer-option are passed to install the selected Python module from a PyPI repository hosted in AWS CodeArtifact.
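
To make the relationship between the token and these job parameters concrete, the following is a minimal boto3 sketch that performs the equivalent of the Step Functions workflow: it requests a CodeArtifact authorization token, builds the pip index URL, and starts the Glue job. The domain, domain owner, repository, and job name are placeholders; use the values from your CloudFormation output.

import boto3

codeartifact = boto3.client("codeartifact")
glue = boto3.client("glue")

# Placeholder values -- use the DomainName and RepositoryName from the stack output.
domain, domain_owner, repository = "enterprise-repo-domain", "123456789012", "enterprise-repo"

# Temporary token for the repository (valid between 15 minutes and 12 hours).
token = codeartifact.get_authorization_token(
    domain=domain, domainOwner=domain_owner, durationSeconds=900
)["authorizationToken"]

region = boto3.session.Session().region_name
index_url = (
    f"https://aws:{token}@{domain}-{domain_owner}.d.codeartifact."
    f"{region}.amazonaws.com/pypi/{repository}/simple/"
)

# Point pip at the CodeArtifact repository for this job run.
glue.start_job_run(
    JobName="enterprise-repo-glue-job",
    Arguments={
        "--additional-python-modules": "boto3,glueutils==0.2.0",
        "--python-modules-installer-option": f"--index-url {index_url}",
    },
)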

The script itself first reads the Amazon S3 input path of the taxi data in CSV format. A light transformation to sum the total trips by year, week, and app is performed. Then the output is written to an Amazon S3 path in Parquet format. A partitioned table in the AWS Glue Data Catalog is either created or updated if it already exists.

You can find the Glue PySpark script here.

Run a sample workflow

The following steps will demonstrate how to run a sample workflow:

Step 1: Navigate to the Step Functions Console and select the enterprise-repo-step-function.

Step 2: Select Start execution and input the following. We’re including the glueutils and latest boto3 libraries as part of the job run. It is always recommended to pin your Python dependencies to avoid any breaking change due to a future version of a dependency. In the following example, the latest available version of boto3 and version 0.2.0 of glueutils will be installed. To pin boto3 to a specific release, you may add boto3==1.24.2 (the latest release at the time of publishing this post).

{"pythonmodules": "boto3,glueutils==0.2.0"}

Step 3: Select Start execution and wait until Execution Status is Succeeded. This may take a few minutes.

Step 4: Navigate to the CodeArtifact Console to review the enterprise-repo repository. You’ll see the cached PyPI packages and all of their dependencies pulled down from PyPI.

Step 5: In the Glue Console under the Runs section of the enterprise-glue-job, you’ll see the parameters passed:

Fig 11: AWS Glue job execution history

Note the --index-url parameter that was passed to the Glue ETL job. The token is valid for only 15 minutes.

Step 6: Navigate to the Amazon CloudWatch Console and go to the /aws/glue-jobs log group to verify that the packages were installed from the local repo.

You will see that the 2 package names passed as parameters are installed with the corresponding versions.

Fig 12: Amazon CloudWatch logs details for the Glue job

Step 7: Navigate to the Amazon Athena console and select Query Editor.

Step 8: Run the following query to validate the output of the AWS Glue job:

SELECT year, app, SUM(total_trips) as sum_of_total_trips 
FROM 
"codeartifactblog_glue_db"."taxidataparquet" 
GROUP BY year, app;

Clean up

Make sure that you clean up all of the other AWS resources that you created in the AWS CDK Stack deployment. You can delete these resources via the AWS CDK Destroy command as follows or the CloudFormation console.

To destroy the resources using AWS CDK, follow these steps:

  1. Follow Steps 1-6 from the ‘Launching your CDK Stack’ section.
  2. Destroy the app by executing the following command:
    cdk destroy

Conclusion

In this post, we demonstrated how CodeArtifact can be used for managing Python packages and modules for AWS Glue jobs that run within VPC Subnets that have no internet access. We also demonstrated how the versions of existing packages can be updated (i.e., boto3) and a custom Python library (glueutils) that is developed locally is also managed through CodeArtifact.

This post enables you to use your favorite Python packages with AWS Glue ETL PySpark jobs by modifying the input to the AWS Step Functions workflow (Step 2 in the Run a sample workflow section).


About the Authors

Bret Pontillo is a Data & ML Engineer with AWS Professional Services. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his free time, Bret enjoys traveling, watching sports, and trying new restaurants.

Gaurav Gundal is a DevOps consultant with AWS Professional Services, helping customers build solutions on the customer platform. When not building, designing, or developing solutions, Gaurav spends time with his family, plays guitar, and enjoys traveling to different places.

Ashok Padmanabhan is a Sr. IOT Data Architect with AWS Professional Services, helping customers build data and analytics platform and solutions. When not helping customers build and design data lakes, Ashok enjoys spending time at the beach near his home in Florida.

Introduction to Amazon QuickSight ML Insights

Post Syndicated from Rashid Sajjad original https://aws.amazon.com/blogs/big-data/introduction-to-amazon-quicksight-ml-insights/

Amazon QuickSight was launched in November 2016 as a fast, cloud-powered business analytics service to build visualizations, perform ad hoc analysis, and quickly get business insights from a variety of data sources. In 2018, ML Insights for QuickSight (Enterprise Edition) was announced to add machine learning (ML)-powered forecasting and anomaly detection with a few clicks. These insights are automatically generated as suggested insights, and you can also add custom insights to your analysis. Because they’re written out in narrative format, they’re easily consumable by any non-technical user and are a great way to increase adoption of your dashboards. Let’s dive deeper on how these insights are built and how to correctly set up your data to maximize the Suggested Insights feature.

What are ML Insights?

QuickSight uses ML to help uncover hidden insights and trends in your data. It does this by using an ML model that, over time and with an increasing volume of data being fed into QuickSight, continually learns and improves its ability to provide three key features (as of this writing):

  • ML-powered anomaly detection – Detect outliers that show significant variance from the dataset. This can help identify significant changes in your business metrics, such as low-performing stores or products, or top-selling items.
  • ML-powered forecasting – Detect trends and seasonality to forecast based on historical data. This can help project sales, orders, website traffic, and more.
  • Autonarratives – Embed narratives in your dashboard to tell the story of your data in plain language. This can help convey a shared understanding of the data within your organization. You can use either the suggested autonarrative or you can customize the computations and language to meet your organization’s unique requirements.

How does the ML model work?

QuickSight uses a built-in version of the Random Cut Forest (RCF) algorithm. This is a special type of Random Forest (RF) algorithm, a widely used and successful technique in ML. It takes a set of random data points, cuts them down to the same number of points, and then builds a collection of models. Each model corresponds to a decision tree—hence the name “forest.” Because RFs can’t be easily updated in an incremental manner, RCFs were invented with variables in tree construction that are designed to allow incremental updates.

The key takeaway is that RCF is great for finding anomalies and building forecasts. This algorithm is good at finding data points that are outliers or finding trends and patterns to forecast future values.
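
QuickSight manages the RCF model for you, so there is nothing to code. If you want to build intuition for how a tree-ensemble anomaly detector behaves, the following minimal sketch uses scikit-learn’s IsolationForest, a related but not identical open-source algorithm, on a made-up daily order series; it only illustrates the idea of isolating outliers with random trees, not QuickSight’s implementation.

import numpy as np
from sklearn.ensemble import IsolationForest

# Toy daily order totals with one obvious outlier on day 10.
daily_orders = np.array(
    [120, 118, 125, 130, 122, 119, 127, 124, 121, 410, 123, 126, 120, 128, 125],
    dtype=float,
).reshape(-1, 1)

# Points that random splits isolate quickly are scored as anomalies.
model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
labels = model.fit_predict(daily_orders)  # -1 = anomaly, 1 = normal

for day, (value, label) in enumerate(zip(daily_orders.ravel(), labels), start=1):
    if label == -1:
        print(f"Day {day}: {value:.0f} flagged as an anomaly")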

One important thing to know about ML models is that each model is good at a certain set of predictive activities, but no one model is good for all activities.

Now that you understand what the RCF model is good at, namely anomaly detection and forecasting, you need to make sure the data meets certain requirements, so let’s walk through those steps.

Best practices for setting up data

To maximize the RCF model’s efficiency, the data that is being imported needs to contain certain properties:

  • At least one metric – Whatever you’re measuring (sold units, orders, and so on).
  • At least one dimension – The category or slice by which you look at the metric (product category, industry, customer type, and so on).
  • Data volumes – Your dataset requirements depend on your objective:
    • Anomaly detection – Requires at least 15 data points. For example, if you have Bicycles as a product category and want to detect anomalies at a daily level, you need at least 15 days of transactions (you could have multiple rows for multiple transactions in a given day) for Bicycles in the dataset.
    • Forecasting – This works best with a large dataset simply because the more history you have, the better the model can extract patterns and trends and generate future probable values. If you have daily aggregates, you need at least 38 days of data.
  • At least one date column – If we want to analyze anomalies or forecasts in the dataset.

QuickSight supports a wide variety of connections, like Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Apache Spark. For more information about supported connections and some connection examples, refer to Amazon QuickSight Connection examples.

Get started with Suggested Insights

Let’s use a sample dataset and walk through an example of how to use the Suggested Insights feature.

To get started, let’s download a sample dataset from the public domain. For this post, we use House Sales in King County, USA. You need to have a Kaggle account to download the resource.

  1. Download and unzip the file.

If you inspect the CSV file, you will notice it has the right grain (date), metrics (price, bedrooms), and categories (zipcode, waterfront).

Depending on what your analysis needs are, even bedrooms could be a category by which you analyze price. So your metrics and categories ultimately depend on your analysis goals.

  1. Log in to your QuickSight account or sign up for a QuickSight Enterprise Edition account to use ML Insights.

We need to create a dataset first before we can create a QuickSight analysis.

  1. Choose New dataset.
  2. Choose Upload a file.
  3. Choose the unzipped CSV file.
  4. In the pop-up window, confirm the file upload settings, then choose Edit settings and prepare data.

You’re redirected to the data preparation editor. This is one of the most important yet overlooked functions in QuickSight.

This editor allows you to review your imported fields and their data types, specify if the field will be used as a dimension or measure, along with many other important data import functions. For production datasets, you should spend time reviewing how the dataset has been set up here.

For our sample CSV file, it’s imported into a QuickSight SPICE by default. SPICE is an in-memory engine for fast querying of imported data. For more details, see Importing data into SPICE.

  1. Choose Save & publish to start importing the CSV file into the SPICE engine.

The default dataset name is the file name that was imported, so in our case it’s kc_house_data. You can choose the dataset on the Datasets page to see the import stats for the dataset.

  1. Choose Create analysis to start creating your QuickSight analysis.

The analysis editor page starts by showing a blank Sheet 1 on your workspace. On the top right, your dataset’s import stats are shown again (this becomes important when importing or refreshing large datasets because the import job might still be in progress).

Let’s start by creating our first visual. The default visual type is AutoGraph, which will try to pick the best visual type based on the fields being selected.

  1. Choose the date field.

The visual changes to Count of Records by Date, with the date aggregation set to Day.

  1. To change the aggregation to monthly, choose the down arrow next to date on the X axis.
  2. Choose the price field.

The AutoGraph detects that the date is a dimension (blue color) and the price is a measure (green color) because these were set up like that in the dataset editor screen (I mentioned earlier how important the data preparation editor was).

Because these fields are already set up as dimensions and measures, the AutoGraph automatically changes to Sum of Price by Date.

This visualization isn’t very helpful. What we’re really looking for is the average price per month.

  1. For Field wells, choose price for Value and change the aggregate to Average.

We now have a nice visual that shows us the average sale price of homes in Kings County by month.

Now comes the fun part—ML Insights!

  1. In the navigation pane, choose Insights.

Voila! QuickSight has already run the RCF model along with other statistical computations and has generated insights that are ready to be added.

These suggested insights change based on the type of visual and data that is currently in the visual. We look at how suggested insights change later in this post.

Two immediately useful insights are Highest Month and Lowest Month.

Hover over the Highest Month insight and choose the plus sign to add it to the current Sheet 1.

I can start rearranging insights and visuals and format the price field to give my current layout a more polished look.

  1. For this post, change the format of the price field to 1,234 to remove decimals.
  2. You can also add titles for the insights and rename the X axis label date to Aggregate.
  3. To add another sheet, choose the plus sign next to Sheet 1.

By default, we start again with an AutoGraph visual.

  1. Under Visual types¸ choose the vertical bar chart.
  2. Choose the price and zipcode fields.
  3. Change the aggregation of price from Sum to Average.
  4. Choose Insights in the navigation pane.

Suggested Insights now displays a completely different set of data highlights compared to Sheet 1.

Although the vertical bar chart may already tell you the top three and bottom three zip codes, Suggested Insights already recognized the type of analysis and selected the best insights to display.

Although you might eventually build a visual to portray the intended story, Suggested Insights speeds up the process of showcasing the highlights in your data and adding them to your worksheet to quickly give the reader the most important insights from your visuals.

Anomaly detection

An anomaly in QuickSight is described as a data point that falls outside the overall pattern of distribution. ML-powered anomaly detection in QuickSight enables you to identify the causations and correlations to make data-driven decisions.

We already talked about data preparation for anomaly detection earlier. QuickSight already ran the RCF model during data import. As soon as a visual is added, QuickSight notifies you on the visual if it has detected an “Anomaly Insight.” This is part of Suggested Insights. You can choose Setup anomaly detection to add this to your sheet.

You can also manually add an ML insight to detect anomalies.

  1. Let’s go back to Sheet 1 with the line chart displayed.
  2. When you choose the first suggested insight, it starts creating a widget for anomaly detection.

You can add up to five dimensions fields (not calculated fields, unless they were created in the data prep screen). QuickSight splits the metrics using the fields in the Categories section. We use the date field (our time dimension), price (our metric), and yr_built (our category) to create an anomaly detection insight. The question we are trying to answer is “Were there any monthly outliers in price based on the year built?”

  1. Choose Get started to set up anomaly detection.
  2. For Combinations to be analyzed, choose your field combinations.

Choosing Exact means that the date and price are analyzed against the yr_built dimension. You can also choose Hierarchical or All. These latter options become relevant when you choose multiple dimensions in the Categories list. For more information about these options, refer to Adding an ML insight to detect outliers and key drivers.

  1. Choose Save to return to Sheet 1.

Our widget is configured at this point.

  1. Choose Run now to start analyzing the data for anomalies.

Based on the volume of data and the number of data points in the analysis, it may take a while to run the anomaly detection.

Keep in mind that at least 15 data points are needed to run anomaly detection, but you can change the aggregation of a field to get a zoomed-out view and therefore see anomalies at a higher level.

For example, if you choose the date field and change Aggregate to Monthly, you get the top anomalies at the monthly level.

In our test case, QuickSight identified a top anomaly. This is a great widget that immediately draws the reader to highlights in data that are outliers and might require further investigation.

Forecast

With ML-powered forecasting, you can forecast your key business metrics in QuickSight easily. The ML algorithm in QuickSight is designed to handle complex real-world scenarios. Not only does QuickSight provide the capability to create forecasts, it also provides Forecast as a Suggested Insight.

  1. Going back to Sheet 1, choose the line chart and expand Insights.

At the bottom you will see a suggested forecast insight. Forecast insights, along with all other suggested insights, are dynamic in the sense that when your data updates or when a user applies filters, the values in the insight update immediately. Once you add this to your sheet, you can even customize how many periods in the future you want the insight to display for the forecast by editing the Narrative and then editing the forecast Calculation.

What if we wanted to customize the price forecasting on this line chart and add it in the visual?

  1. Choose the options menu (three dots) at the top right of the visual and choose Add forecast.
  2. For Periods forward, enter 6.

That is the time interval selected for the visual.

  1. Set Prediction interval to 70.

This controls the width of the band around the forecasted values. A higher prediction interval produces a wider band that captures more uncertainty, and a lower value produces a narrower band.

  1. Leave Seasonality set to Automatic.

Seasonality takes into account complex seasonal trends in your data. You can experiment with both settings to see how it affects the forecast. For our scenario, because house sales are seasonal, we chose Automatic.

  1. Choose Apply.

With just a few clicks, we have added a forecast to our visual, as shown in the following screenshot. The orange shaded area represents the upper and lower bound of the forecasted price.

This is another great way to add intelligence to your data and quickly let analysts focus on key data points and trends.

Conclusion

The Suggested Insights feature in QuickSight allows you to speed up the discovery and highlighting of key data elements. You can find insights in your data faster, and because they’re written out in narrative format, they’re very easy for non-technical users to quickly gain insight into the most interesting trends in the data with no ML training needed.

For more details on QuickSight ML Insights, refer to the QuickSight documentation or interact with the QuickSight Community.

As always, AWS is customer obsessed and we are ready to help with any specific questions.


About the Author

Rashid Sajjad is a Partner Management Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He works with APN Partners to help develop their migration, data and analytics, and AI/ML practices with enterprise, mission-critical solutions for their end customers.

A serverless operational data lake for retail with AWS Glue, Amazon Kinesis Data Streams, Amazon DynamoDB, and Amazon QuickSight

Post Syndicated from Gandhi Raketla original https://aws.amazon.com/blogs/big-data/a-serverless-operational-data-lake-for-retail-with-aws-glue-amazon-kinesis-data-streams-amazon-dynamodb-and-amazon-quicksight/

Do you want to reduce stockouts at stores? Do you want to improve order delivery timelines? Do you want to provide your customers with accurate product availability, down to the millisecond? A retail operational data lake can help you transform the customer experience by providing deeper insights into a variety of operational aspects of your supply chain.

In this post, we demonstrate how to create a serverless operational data lake using AWS services, including AWS Glue, Amazon Kinesis Data Streams, Amazon DynamoDB, Amazon Athena, and Amazon QuickSight.

Retail operations is a critical functional area that gives retailers a competitive edge. An efficient retail operation can optimize the supply chain for a better customer experience and cost reduction. An optimized retail operation can reduce frequent stockouts and delayed shipments, and provide accurate inventory and order details. Today, a retailer’s channels aren’t just store and web—they include mobile apps, chatbots, connected devices, and social media channels. The data is both structured and unstructured. This, coupled with multiple fulfillment options like buy online and pick up at store, ship from store, or ship from distribution centers, increases the complexity of retail operations.

Most retailers use a centralized order management system (OMS) for managing orders, inventory, shipments, payments, and other operational aspects. These legacy OMSs are unable to scale in response to the rapid changes in retail business models. The enterprise applications that are key for efficient and smooth retail operations rely on a central OMS. Applications for ecommerce, warehouse management, call centers, and mobile all require an OMS to get order status, inventory positions of different items, shipment status, and more. Another challenge with legacy OMSs is they’re not designed to handle unstructured data like weather data and IoT data that could impact inventory and order fulfillment. A legacy OMS that can’t scale prohibits you from implementing new business models that could transform your customer experience.

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. An operational data lake addresses this challenge by providing easy access to structured and unstructured operational data in real time from various enterprise systems. You can store your data as is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning (ML)—to guide better decisions. This can ease the burden on OMSs that can instead focus on order orchestration and management.

Solution overview

In this post, we create an end-to-end pipeline to ingest, store, process, analyze, and visualize operational data like orders, inventory, and shipment updates. We use the following AWS services as key components:

  • Kinesis Data Streams to ingest all operational data in real time from various systems
  • DynamoDB, Amazon Aurora, and Amazon Simple Storage Service (Amazon S3) to store the data
  • AWS Glue DataBrew to clean and transform the data
  • AWS Glue crawlers to catalog the data
  • Athena to query the processed data
  • A QuickSight dashboard that provides insights into various operational metrics

The following diagram illustrates the solution architecture.

The data pipeline consists of stages to ingest, store, process, analyze, and finally visualize the data, which we discuss in more detail in the following sections.

Data ingestion

Orders and inventory data is ingested in real time from multiple sources like web applications, mobile apps, and connected devices into Kinesis Data Streams. Kinesis Data Streams is a massively scalable and durable real-time data streaming service. Kinesis Data Streams can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as web applications, database events, inventory transactions, and payment transactions. Frontend systems like ecommerce applications and mobile apps ingest the order data as soon as items are added to a cart or an order is created. The OMS ingests orders when the order status changes. OMSs, stores, and third-party suppliers ingest inventory updates into the data stream.

To simulate orders, an AWS Lambda function is triggered by a scheduled Amazon CloudWatch event every minute to ingest orders to a data stream. This function simulates the typical order management system lifecycle (order created, scheduled, released, shipped, and delivered). Similarly, a second Lambda function is triggered by a CloudWatch event to generate inventory updates. This function simulates different inventory updates such as purchase orders created from systems like the OMS or third-party suppliers. In a production environment, this data would come from frontend applications and a centralized order management system.
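
The following is a minimal sketch of what such a simulation function could look like; the stream name, order fields, and status values are illustrative assumptions, not the exact code deployed by the CloudFormation stack used later in this post.

import json
import random
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "retail-operational-data-stream"  # placeholder stream name
ORDER_STATUSES = ["CREATED", "SCHEDULED", "RELEASED", "SHIPPED", "DELIVERED"]

def lambda_handler(event, context):
    """Triggered every minute by a scheduled CloudWatch (EventBridge) rule."""
    order = {
        "order_id": str(uuid.uuid4()),
        "channel": random.choice(["WEB", "MOBILE", "STORE"]),
        "item_id": random.randint(1000, 1999),
        "quantity": random.randint(1, 5),
        "status": random.choice(ORDER_STATUSES),
        "event_time": datetime.now(timezone.utc).isoformat(),
    }
    # Partition by order_id so all updates for the same order land on the same shard.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(order).encode("utf-8"),
        PartitionKey=order["order_id"],
    )
    return {"ingested": 1}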

Data storage

There are two types of data: hot and cold data. Hot data is consumed by frontend applications like web applications, mobile apps, and connected devices. The following are some example use cases for hot data:

  • When a customer is browsing products, the real-time availability of the item must be displayed
  • Customers interacting with Alexa to know the status of the order
  • A call center agent interacting with a customer needs to know the status of the customer order or its shipment details

The systems, APIs, and devices that consume this data need the data within seconds or milliseconds of the transactions.

Cold data is used for long-term analytics like orders over a period of time, orders by channel, top 10 items by number of orders, or planned vs. available inventory by item, warehouse, or store.

For this solution, we store orders hot data in DynamoDB. DynamoDB is a fully managed NoSQL database that delivers single-digit millisecond performance at any scale. A Lambda function processes records in the Kinesis data stream and stores it in a DynamoDB table.
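
A minimal sketch of such a consumer function is shown below; the table name and the assumption that order_id is the partition key are illustrative, not taken from the CloudFormation stack.

import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders-hot-data")  # placeholder table name

def lambda_handler(event, context):
    """Invoked with batches of records from the Kinesis data stream."""
    for record in event["Records"]:
        # Kinesis delivers each payload base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Upsert the latest order state, keyed on order_id.
        table.put_item(Item=payload)
    return {"processed": len(event["Records"])}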

Inventory hot data is stored in an Amazon Aurora MySQL-Compatible Edition database. Inventory is transactional data that requires high consistency so that customers aren’t over-promised or under-promised when they place orders. Aurora MySQL is a fully managed database that is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases. It provides the security, availability, and reliability of commercial databases at a tenth of the cost.

Amazon S3 is object storage built to store and retrieve any amount of data from anywhere. It’s a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost. Order and inventory cold data is stored in Amazon S3.

Amazon Kinesis Data Firehose reads the data from the Kinesis data stream and stores it in Amazon S3. Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk, enabling near-real-time analytics.
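
As a minimal sketch, a delivery stream with the Kinesis data stream as its source could be created as follows; the stream, role, and bucket ARNs are placeholders for illustration.

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="orders-to-s3",  # placeholder name
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/retail-operational-data-stream",
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-kinesis-read-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-write-role",
        "BucketARN": "arn:aws:s3:::retail-operational-data-lake",
        "Prefix": "orders/raw/",
        # Buffer up to 5 MiB or 60 seconds before writing an object to Amazon S3.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)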

Data processing

The data processing stage involves cleaning, preparing, and transforming the data to help downstream analytics applications easily query the data. Each frontend system might have a different data format. In the data processing stage, data is cleaned and converted into a common canonical form.

For this solution, we use DataBrew to clean and convert orders into a common canonical form. DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to prepare data with an interactive, point-and-click visual interface without writing code. DataBrew provides over 250 built-in transformations to combine, pivot, and transpose the data without writing code. The cleaning and transformation steps in DataBrew are called recipes. A scheduled DataBrew job applies the recipes to the data in an S3 bucket and stores the output in a different bucket.

AWS Glue crawlers can access data stores, extract metadata, and create table definitions in the AWS Glue Data Catalog. You can schedule a crawler to crawl the transformed data and create or update the Data Catalog. The AWS Glue Data Catalog is your persistent metadata store. It’s a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. We use crawlers to populate the Data Catalog with tables.

Data analysis

We can query orders and inventory data from S3 buckets using Athena. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Views are created in Athena that can be consumed by business intelligence (BI) services like QuickSight.
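The following sketch shows how such a query could be run programmatically with the Athena API. The database and table names match the ones used later in this post, and the query results bucket is a placeholder.

import time

import boto3

athena = boto3.client('athena')

# Database and table created by the crawlers; results bucket is a placeholder
query = athena.start_query_execution(
    QueryString='SELECT status, count(distinct ordernumber) AS orders FROM orders_clean GROUP BY status',
    QueryExecutionContext={'Database': 'orderdatalake'},
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'},
)
query_id = query['QueryExecutionId']

# Wait for the query to complete, then fetch the results
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
    print([col.get('VarCharValue') for col in row['Data']])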

Data visualization

We generate dashboards using QuickSight. QuickSight is a scalable, serverless, embeddable BI service powered by ML and built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights.

QuickSight also has features to forecast orders, detect anomalies in the order data, and provide ML-powered insights. We can create analyses such as orders over a period of time, orders split by channel, top 10 locations for orders, or order fulfillment timelines (the time it took from order creation to order delivery).

Walkthrough overview

To implement this solution, you complete the following high-level steps:

  1. Create solution resources using AWS CloudFormation.
  2. Connect to the inventory database.
  3. Load the inventory database with tables.
  4. Create a VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC).
  5. Create a gateway endpoint for Amazon S3 on the default VPC.
  6. Enable CloudWatch rules via Amazon EventBridge to ingest the data.
  7. Transform the data using AWS Glue.
  8. Visualize the data with QuickSight.

Prerequisites

Complete the following prerequisite steps:

  1. Create an AWS account if you don’t already have one.
  2. Sign up for QuickSight if you’ve never used QuickSight in this account before. To use the forecast ability in QuickSight, sign up for the Enterprise Edition.

Create resources with AWS CloudFormation

To launch the provided CloudFormation template, complete the following steps:

  1. Choose Launch Stack:
  2. Choose Next.
  3. For Stack name, enter a name.
  4. Provide the following parameters:
    1. The name of the S3 bucket that holds all the data for the data lake.
    2. The name of the database that holds the inventory tables.
    3. The database user name.
    4. The database password.
  5. Enter any tags you want to assign to the stack and choose Next.
  6. Select the acknowledgement check boxes and choose Create stack.

The stack takes 5–10 minutes to complete.

On the AWS CloudFormation console, you can navigate to the stack’s Outputs tab to review the resources you created.

If you open the S3 bucket you created, you can observe its folder structure. The stack creates sample order data for the last 7 days.

Connect to the inventory database

To connect to your database in the query editor, complete the following steps:

  1. On the Amazon RDS console, choose the Region you deployed the stack in.
  2. In the navigation pane, choose Query Editor.

    If you haven’t connected to this database before, the Connect to database page opens.
  3. For Database instance or cluster, choose your database.
  4. For Database username, choose Connect with a Secrets Manager ARN.
    The database user name and password provided during stack creation are stored in AWS Secrets Manager. Alternatively, you can choose Add new database credentials and enter the database user name and password you provided when creating the stack.
  5. For Secrets Manager ARN, enter the value for the key InventorySecretManager from the CloudFormation stack outputs.
  6. Optionally, enter the name of your database.
  7. Choose Connect to database.

Load the inventory database with tables

Enter the following DDL statement in the query editor and choose Run:

CREATE TABLE INVENTORY (
    ItemID varchar(25) NOT NULL,
    ShipNode varchar(25) NOT NULL,
    SupplyType varchar(25) NOT NULL,
    SupplyDemandType varchar(25) NOT NULL,
    ItemName varchar(25),
    UOM varchar(10),
    Quantity int(11) NOT NULL,
    ETA varchar(25),
    UpdatedDate DATE,
    PRIMARY KEY (ItemID,ShipNode,SupplyType)
);
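Because the query editor uses the RDS Data API, you can run the same kind of statements from code, assuming the Data API is enabled on your cluster. The following sketch uses placeholder ARNs; take the actual values (including the InventorySecretManager ARN) from the CloudFormation stack outputs.

import boto3

rds_data = boto3.client('rds-data')

# Placeholders; use the values from the CloudFormation stack outputs
CLUSTER_ARN = 'arn:aws:rds:us-east-1:111122223333:cluster:inventory-cluster'
SECRET_ARN = 'arn:aws:secretsmanager:us-east-1:111122223333:secret:InventorySecretManager-abc123'
DATABASE = 'inventory'  # the database name you provided during stack creation

# Confirm the table was created and inspect a few rows
response = rds_data.execute_statement(
    resourceArn=CLUSTER_ARN,
    secretArn=SECRET_ARN,
    database=DATABASE,
    sql='SELECT ItemID, ShipNode, Quantity FROM INVENTORY LIMIT 10',
)
for row in response['records']:
    print(row)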

Create a VPC endpoint

To create your VPC endpoint, complete the following steps:

  1. On the Amazon VPC console, choose VPC Dashboard.
  2. Choose Endpoints in the navigation pane.
  3. Choose Create Endpoint.
  4. For Service category, select AWS services.
  5. For Service name, search for rds and choose the service name ending with rds-data.
  6. For VPC, choose the default VPC.
  7. Leave the remaining settings at their default and choose Create endpoint.

Create a gateway endpoint for Amazon S3

To create your gateway endpoint, complete the following steps:

  1. On the Amazon VPC console, choose VPC Dashboard.
  2. Choose Endpoints in the navigation pane.
  3. Choose Create Endpoint.
  4. For Service category, select AWS services.
  5. For Service name, search for S3 and choose the service name with type Gateway.
  6. For VPC, choose the default VPC.
  7. For Configure route tables, select the default route table.
  8. Leave the remaining settings at their default and choose Create endpoint.

Wait for both the gateway endpoint and VPC endpoint status to change to Available.
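If you prefer to script these steps, the same endpoints can be created and checked with the Amazon EC2 API. The VPC, subnet, and route table IDs below are placeholders, and the service names assume the us-east-1 Region.

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

VPC_ID = 'vpc-0123456789abcdef0'             # placeholder: your default VPC
SUBNET_IDS = ['subnet-0123456789abcdef0']    # placeholder subnet(s)
ROUTE_TABLE_IDS = ['rtb-0123456789abcdef0']  # placeholder: default route table

# Interface endpoint for the RDS Data API (used by the query editor)
ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId=VPC_ID,
    ServiceName='com.amazonaws.us-east-1.rds-data',
    SubnetIds=SUBNET_IDS,
)

# Gateway endpoint for Amazon S3
ec2.create_vpc_endpoint(
    VpcEndpointType='Gateway',
    VpcId=VPC_ID,
    ServiceName='com.amazonaws.us-east-1.s3',
    RouteTableIds=ROUTE_TABLE_IDS,
)

# List the endpoints in the VPC and report their state
endpoints = ec2.describe_vpc_endpoints(
    Filters=[{'Name': 'vpc-id', 'Values': [VPC_ID]}]
)['VpcEndpoints']
for endpoint in endpoints:
    print(endpoint['ServiceName'], endpoint['State'])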

Enable CloudWatch rules to ingest the data

We created two CloudWatch rules via the CloudFormation template to ingest the order and inventory data to Kinesis Data Streams. To enable the rules via EventBridge, complete the following steps:

  1. On the CloudWatch console, under Events in the navigation pane, choose Rules.
  2. Make sure you’re in the Region where you created the stack.
  3. Choose Go to Amazon EventBridge.
  4. Select the rule Ingest-Inventory-Update-Schedule-Rule and choose Enable.
  5. Select the rule Ingest-Order-Schedule-Rule and choose Enable.

After 5–10 minutes, the Lambda functions start ingesting orders and inventory updates to their respective streams. You can check the S3 buckets orders-landing-zone and inventory-landing-zone to confirm that the data is being populated.
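You can also enable (or later disable) the two rules from code, which is handy when you want to pause data generation. The rule names below match the ones described above; if your stack prefixes the names, adjust accordingly.

import boto3

events = boto3.client('events')

# Rule names as created by the CloudFormation stack; adjust if yours differ
for rule_name in ['Ingest-Inventory-Update-Schedule-Rule', 'Ingest-Order-Schedule-Rule']:
    events.enable_rule(Name=rule_name)
    print(f'enabled {rule_name}')

# To stop generating data later:
# events.disable_rule(Name='Ingest-Order-Schedule-Rule')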

Perform data transformation

Our CloudFormation stack included a DataBrew project, a DataBrew job that runs every 5 minutes, and two AWS Glue crawlers. To perform data transformation using our AWS Glue resources, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose the project OrderDataTransform.

    You can review the project and its recipe on this page.
  3. In the navigation pane, choose Jobs.
  4. Review the job status to confirm it’s complete.
  5. On the AWS Glue console, choose Crawlers in the navigation pane.
    The crawlers crawl the transformed data and update the Data Catalog.
  6. Review the status of the two crawlers, which run every 15 minutes.
  7. Choose Tables in the navigation pane to view the two tables the crawlers created.
    If you don’t see these tables, you can run the crawlers manually to create them.

    You can query the data in the tables with Athena.
  8. On the Athena console, choose Query editor.
    If you haven’t created a query result location, you’re prompted to do that first.
  9. Choose View settings or choose the Settings tab.
  10. Choose Manage.
  11. Select the S3 bucket to store the results and choose Choose.
  12. Choose Query editor in the navigation pane.
  13. Choose either table (right-click) and choose Preview Table to view the table contents.

Visualize the data

If you have never used QuickSight in this account before, complete the prerequisite step to sign up for QuickSight. To use the ML capabilities of QuickSight (such as forecasting), sign up for the Enterprise Edition by following the steps in the documentation.

While signing up for QuickSight, make sure to use the same Region where you created the CloudFormation stack.

Grant QuickSight permissions

To visualize your data, you must first grant relevant permissions to QuickSight to access your data.

  1. On the QuickSight console, on the Admin drop-down menu, choose Manage QuickSight.
  2. In the navigation pane, choose Security & permissions.
  3. Under QuickSight access to AWS services, choose Manage.
  4. Select Amazon Athena.
  5. Select Amazon S3 to edit QuickSight access to your S3 buckets.
  6. Select the bucket you specified during stack creation (for this post, operational-datalake).
  7. Choose Finish.
  8. Choose Save.

Prepare the datasets

To prepare your datasets, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter retail-analysis.
  5. Choose Validate connection.
  6. After your connection is validated, choose Create data source.
  7. For Database, choose orderdatalake.
  8. For Tables, select orders_clean.
  9. Choose Edit/Preview data.
  10. For Query mode, select SPICE.
    SPICE (Super-fast, Parallel, In-memory Calculation Engine) is the robust in-memory engine that QuickSight uses.
  11. Choose the orderdatetime field (right-click), choose Change data type, and choose Date.
  12. Enter the date format as MM/dd/yyyy HH:mm:ss.
  13. Choose Validate and Update.
  14. Change the data types of the following fields to QuickSight geospatial data types:
    1. billingaddress.zipcode – Postcode
    2. billingaddress.city – City
    3. billingaddress.country – Country
    4. billingaddress.state – State
    5. shippingaddress.zipcode – Postcode
    6. shippingaddress.city – City
    7. shippingaddress.country – Country
    8. shippingaddress.state – State
  15. Choose Save & publish.
  16. Choose Cancel to exit this page.

    Let’s create another dataset for the Athena table inventory_landing_zone.
  17. Follow steps 1–7 to create a new dataset. For Table selection, choose inventory_landing_zone.
  18. Choose Edit/Preview data.
  19. For Query mode, select SPICE.
  20. Choose Save & publish.
  21. Choose Cancel to exit this page.

    Both datasets should now be listed on the Datasets page.
  22. Choose each dataset and choose Refresh now.
  23. Select Full refresh and choose Refresh.

To set up a scheduled refresh, choose Schedule a refresh and provide your schedule details.

Create an analysis

To create an analysis in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Analyses in the navigation pane.
  2. Choose New analysis.
  3. Choose the orders_clean dataset.
  4. Choose Create analysis.
  5. To adjust the theme, choose Themes in the navigation pane, choose your preferred theme, and choose Apply.
  6. Name the analysis retail-analysis.

Add visualizations to the analysis

Let’s start creating visualizations. The first visualization shows orders created over time.

  1. Choose the empty graph on the dashboard and for Visual type, choose the line chart.
    For more information about visual types, see Visual types in Amazon QuickSight.
  2. Under Field wells, drag orderdatetime to X axis and ordernumber to Value.
  3. Set ordernumber to Aggregate: Count distinct.

    Now we can filter these orders by Created status.
  4. Choose Filter in the navigation pane and choose Create one.
  5. Search for and choose status.
  6. Choose the status filter you just created.
  7. Select Created from the filter list and choose Apply.
  8. Choose the graph (right-click) and choose Add forecast.
    The forecasting ability is only available in the Enterprise Edition. QuickSight uses a built-in version of the Random Cut Forest (RCF) algorithm. For more information, refer to Understanding the ML algorithm used by Amazon QuickSight.
  9. Leave the settings as default and choose Apply.
  10. Rename the visualization to “Orders Created Over Time.”

If the forecast is applied successfully, the visualization shows the expected number of orders as well as upper and lower bounds.

If you get the following error message, allow the data to accumulate for a few days before adding the forecast.

Let’s create a visualization on orders by location.

  1. On the Add menu, choose Add visual.
  2. Choose the points on map visual type.
  3. Under Field wells, drag shippingaddress.zipcode to Geospatial and ordernumber to Size.
  4. Change ordernumber to Aggregate: Count distinct.

    You should now see a map indicating the orders by location.
  5. Rename the visualization accordingly.

    Next, we create a drill-down visualization on the inventory count.
  6. Choose the pencil icon.
  7. Choose Add dataset.
  8. Select the inventory_landing_zone dataset and choose Select.
  9. Choose the inventory_landing_zone dataset.
  10. Add the vertical bar chart visual type.
  11. Under Field wells, drag itemname, shipnode, and invtype to X axis, and quantity to Value.
  12. Make sure that quantity is set to Sum.

    The following screenshot shows an example visualization of order inventory.
  13. To determine how many face masks were shipped out from each ship node, choose Face Masks (right-click) and choose Drill down to shipnode.
  14. You can drill down even further to invtype to see how many face masks in a specific ship node are in which status.

The following screenshot shows this drilled-down inventory count.

As a next step, you can create a QuickSight dashboard from the analysis you created. For instructions, refer to Tutorial: Create an Amazon QuickSight dashboard.

Clean up

To avoid any ongoing charges, on the AWS CloudFormation console, select the stack you created and choose Delete. This deletes all the created resources. On the stack’s Events tab, you can track the progress of the deletion, and wait for the stack status to change to DELETE_COMPLETE.

The Amazon EventBridge rules generate orders and inventory data every 15 minutes. To avoid generating a huge amount of data, make sure to delete the stack after you finish testing the solution.

If the deletion of any resources fails, delete them manually. To delete the Amazon QuickSight datasets, you can follow these instructions. You can delete the QuickSight analysis using these steps. To delete the QuickSight subscription and close the account, you can follow these instructions.

Conclusion

In this post, we showed you how to use AWS analytics and storage services to build a serverless operational data lake. Kinesis Data Streams lets you ingest large volumes of data, and DataBrew lets you cleanse and transform the data visually. We also showed you how to analyze and visualize the order and inventory data using AWS Glue, Athena, and QuickSight. For more information and resources for data lakes on AWS, visit Analytics on AWS.


About the Authors

Gandhi Raketla is a Senior Solutions Architect for AWS. He works with AWS customers and partners on cloud adoption, as well as architecting solutions that help customers foster agility and innovation. He specializes in the AWS data analytics domain.

Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS Cloud and specializes in the containers and data analytics domains.

Automate your validated dataset deployment using Amazon QuickSight and AWS CloudFormation

Post Syndicated from Jeremy Winters original https://aws.amazon.com/blogs/big-data/automate-your-validated-dataset-deployment-using-amazon-quicksight-and-aws-cloudformation/

A lot of the power behind business intelligence (BI) and data visualization tools such as Amazon QuickSight comes from the ability to work interactively with data through a GUI. Report authors create dashboards using GUI-based tools, then in just a few clicks can share the dashboards with business users and decision-makers. This workflow empowers authors to create and manage the QuickSight resources and dashboards they’re responsible for providing.

Developer productivity is a great benefit of UI-based development, but enterprise customers often need to consider additional factors in their BI implementation:

  • Promoting objects through environments (development, testing, production, and so on)
  • Scaling for hundreds of authors and thousands of users
  • Implementing data security, such as row-level and column-level rules to filter the data elements visible to specific users
  • Regulatory requirements, processes, and compliance controls

Approaches such as source control-backed CI/CD pipelines allow you to address compliance requirements and security gates with automation. For example, a hypothetical build pipeline for a Java Spring Boot application may enable developers to build and deploy freely to a dev environment, but the code must pass tests and vulnerability scans before being considered for promotion to upper environments. A human approval step then takes place before the code is released into production. Processes such as this provide quality, consistency, auditability, and accountability for the code being released.

The QuickSight API provides functionality for automation pipelines. Pipeline developers can use the API to migrate QuickSight resources from one environment to another. The API calls that handle QuickSight datasets enable inspection of the JSON representation of the dataset definition.

This post presents an example of how a QuickSight administrator can automate data resource management and security validation through use of the QuickSight API and AWS CloudFormation.

Solution overview

The security model in this post relies on naming conventions for tables and columns as an integral part of the security rules. Instead of relying on naming conventions, you may want to use a lookup table or similar approach to store the relationships between data tables and security tables.

We guide you through the following steps:

  1. Create relational database tables to be secured.
  2. Create a QuickSight dataset in your dev account.
  3. Generate a CloudFormation template using a Python script that allows you to enforce row-level and column-level security in each environment. You can customize this script to the needs of your organization.
  4. Use the generated CloudFormation template to deploy through dev, test, and prod using your change management process.

You can use AWS CloudFormation to manage several types of QuickSight resources, but dataset resources are a critical junction for security, so they are our focus in this post.

To implement data security rules in a large organization, controls must be in place to agree upon and implement the rules from a process perspective. This post dives deep into using code to validate security aspects of your QuickSight deployment, but data security requires more than code. The approaches put forward are intended as a part of a larger change management process, much of which is based around human review and approval.

In addition to having a change management process in place, we suggest managing your AWS resources using a CI/CD pipeline. The nature of change management and CI/CD processes can vary greatly, and are outside the scope of this post.

Prerequisites

This post assumes a basic command of the following:

We don’t go into the broader picture of integrating into a full CI/CD process, so an understanding of CI/CD is helpful, but not required.

Security rules for your organization

Before we can write a script to confirm security rules have been applied correctly, we need to know what the security rules actually are. This means we need to determine the following:

  • What – What is the data we are trying to secure? Which fields in the database are sensitive? Which field values will be used to filter access?
  • Who – Who are the users and groups that should be provided access to the data and fields we have identified?

In concrete terms, we need to match identities (users and groups) to actual data values (used in row-level security) and sensitive fields (for column-level security). Identities such as users and groups typically correlate to entities in external systems such as Active Directory, but you can use native QuickSight users and groups.

For this post, we define the following rules that indicate the relationship between database objects (tables and fields) and how they should be secured. Keep in mind that these example rules may not apply to every organization. Security should be developed to match your requirements and processes.

  • Any field name with _sensitive appended to it is identified as containing sensitive data. For example, a column named salary_usd_sensitive should be restricted. For our scenario, we say that the user should be a member of the QuickSight restricted group in order to access sensitive fields. No other groups are allowed access to these fields.
  • For a given table, a companion table with _rls appended to the name contains the row-level security rules used to secure the table. In this model, the row-level security rules for the employees table are found in the employees_rls table.
  • Row-level security rules must be sourced 100% from the underlying data store. This means that you can’t upload rules via the QuickSight console, or use custom SQL in QuickSight to create the rules. Rules can be provided as views (if supported by the underlying data store) as long as the view definition is managed using a change management process.
  • The dataset name should match the name of the underlying database table.

These rules rely on a well-structured change management process for the database. If users and developers have access to change database objects in production, the rules won’t carry much weight. For examples of automated schema management using open-source CI/CD tooling, refer to Deploy, track, and roll back RDS database code changes using open source tools Liquibase and Jenkins and How to Integrate Amazon RDS Schema Changes into CI/CD Pipelines with GitLab and Liquibase.

From the QuickSight perspective, our database becomes the source of the “what” and “who” we discussed earlier. QuickSight doesn’t own the security rules, it merely implements the rules as defined in the database.

Security rule management with database objects

For this post, we source data from a Postgres database using a read-only user created for QuickSight.

First, we create our schema and a data table with a few rows inserted:

create schema if not exists ledger;

--the table we are securing
drop table if exists ledger.transactions;
create table if not exists ledger.transactions (
    txn_id integer,
    txn_type varchar(100),
    txn_desc varchar(100),
    txn_amt float,
    department varchar(100),
    discount_sensitive float
);

insert into ledger.transactions (
    txn_id,
    txn_type,
    txn_desc,
    txn_amt,
    department,
    discount_sensitive
) 
values
(1, 'expense', 'commission', -1000.00, 'field sales', 0.0),
(2, 'revenue', 'widgets',  15000.00, 'field sales', 1000.00),
(3, 'revenue', 'interest', 1000.00, 'corporate', 0.0),
(4, 'expense', 'taxes', -1234.00, 'corporate', 0.0),
(5, 'revenue', 'doodads', 1000.00, 'field sales', 100.0)
;

Note the field discount_sensitive. In our security model, any field name with _sensitive appended to it is identified as containing sensitive data. This information is used later when we implement column-level security. In our example, we have the luxury of using naming conventions to tag the sensitive fields, but that isn’t always possible. Other options could involve the use of SQL comments, or creating a table that provides a lookup for sensitive fields. Which method you choose depends upon your data and requirements, and should be supported by a change management process.

Row-level security table

The following SQL creates a table containing the row-level security rules for the ledger.transactions table, then inserts rules that match the example discussed earlier:

drop table if exists ledger.transactions_rls;
create table ledger.transactions_rls (
    groupname varchar(100),
    department varchar(1000)
);


insert into ledger.transactions_rls (groupname, department) 
values
('restricted', null), --null indicates all values
('anybody', 'field sales');

For more information about how to restrict access to a dataset using row-level security, refer to Using row-level security (RLS) with user-based rules to restrict access to a dataset.

These rules match the specified QuickSight user groups to values in the department field of the transactions table.

Our last step in Postgres is to create a user that has read-only access to our tables. All end-user or SPICE refresh queries from QuickSight are run using this user. See the following code:

drop role if exists qs_user;
create role qs_user login password 'GETABETTERPASSSWORD';
grant connect on database quicksight TO qs_user;
grant usage on schema ledger to qs_user;
grant select on ledger.transactions to qs_user;
grant select on ledger.transactions_rls to qs_user;

Create user groups

Our security model provides permissions based on group membership. Although QuickSight allows for these groups to be sourced from external systems such as Active Directory, our example uses native QuickSight groups.

We create our groups using the following AWS Command Line Interface (AWS CLI) commands. Take note of the restricted group we’re creating; this is the group we use to grant access to sensitive data columns.

aws quicksight create-group \
--aws-account-id YOUR_AWS_ACCOUNT_ID_HERE \
--namespace default \
--group-name restricted

aws quicksight create-group \
--aws-account-id YOUR_AWS_ACCOUNT_ID_HERE \
--namespace default \
--group-name anybody

You can also add a user to your group with the following code:

aws quicksight create-group-membership \
--aws-account-id YOUR_AWS_ACCOUNT_ID_HERE \
--namespace default \
--group-name anybody \
--member-name [email protected]

The Python script

Now that we have set up our database and groups, we switch focus to the Python script used for the following actions:

  • Extracting the definition of a manually created dataset using the QuickSight API
  • Ensuring that the dataset definition meets security standards
  • Restructuring the dataset definition into the format of a CloudFormation template
  • Writing the CloudFormation template to a JSON file

In the header of the script, you can see the following variables, which you should set to values in your own AWS environment:

# Parameters for the source data set
region_name = 'AWS_REGION_NAME'
aws_account_id = "AWS_ACCOUNT_ID"
source_data_set_id = "ID_FOR_THE_SOURCE_DATA_SET"

# Parameters used when creating the CloudFormation template
target_data_set_name = "DATA_SET_DISPLAY_NAME"
target_data_set_id = "NEW_DATA_SET_ID"
template_file_name = "dataset.json"

QuickSight datasets have a name and an ID. The name is displayed in the QuickSight UI, and the ID is used to reference the dataset behind the scenes. The ID must be unique for a given account and Region, which is why QuickSight uses UUIDs by default, but you can use any unique string.

Create the datasets

You can use the QuickSight GUI or Public API to create a dataset for the transactions_rls and transactions tables. For instructions, refer to Creating a dataset from a database. Connect to the database, create the datasets, then apply transactions_rls as the row-level security for the transactions dataset. You can use the following list-data-sets AWS CLI call to verify that your tables were created successfully:

$ aws quicksight list-data-sets --aws-account-id YOURACCOUNT            
{
    "DataSetSummaries": [
       {
            "Arn": "arn:aws:quicksight:us-west-2:YOURACCOUNT:dataset/<ID>",
            "DataSetId": "<ID>",
            "Name": "transactions",
            "CreatedTime": "2021-09-15T15:41:56.716000-07:00",
            "LastUpdatedTime": "2021-09-15T16:38:03.658000-07:00",
            "ImportMode": "SPICE",
            "RowLevelPermissionDataSet": {
                "Namespace": "default",
                "Arn": "arn:aws:quicksight:us-west-2: YOURACCOUNT:dataset/<RLS_ID>",
                "PermissionPolicy": "GRANT_ACCESS",
                "FormatVersion": "VERSION_1",
                "Status": "ENABLED"
            },
            "RowLevelPermissionTagConfigurationApplied": false,
            "ColumnLevelPermissionRulesApplied": true
        },
        {
            "Arn": "arn:aws:quicksight:us-west-2: YOURACCOUNT:dataset/<RLS_ID>",
            "DataSetId": "<RLS_ID>",
            "Name": "transactions_rls",
            "CreatedTime": "2021-09-15T15:42:37.313000-07:00",
            "LastUpdatedTime": "2021-09-15T15:42:37.520000-07:00",
            "ImportMode": "SPICE",
            "RowLevelPermissionTagConfigurationApplied": false,
            "ColumnLevelPermissionRulesApplied": false
        }
    ]
}

Script overview

Our script is based around the describe_data_set method of the Boto3 QuickSight client. This method returns a Python dictionary containing all the attributes associated with a dataset resource. Our script analyzes these dictionaries, then coerces them into the structure required for dataset creation using AWS CloudFormation. The structure of the describe_data_set method and the AWS::QuickSight::DataSet CloudFormation resource are very similar, but not quite identical.
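As a reference point, retrieving the dataset definition is a single call. The following is a minimal sketch; the account ID and dataset ID are placeholders.

import boto3

# Values correspond to the parameters set at the top of the script
region_name = 'us-west-2'
aws_account_id = '111122223333'
source_data_set_id = 'SOURCE_DATA_SET_ID'

client = boto3.client('quicksight', region_name=region_name)
describe_response = client.describe_data_set(
    AwsAccountId=aws_account_id,
    DataSetId=source_data_set_id,
)
print(describe_response['DataSet']['Name'])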

The following are the top-level fields in the response for the Boto3 QuickSight client describe_data_set method:

{
    'DataSet': {
        'Arn': 'string',
        'DataSetId': 'string',
        'Name': 'string',
        'CreatedTime': datetime(2015, 1, 1),
        'LastUpdatedTime': datetime(2015, 1, 1),
        'PhysicalTableMap': {},
        'LogicalTableMap': {...},
        'OutputColumns': [...],
        'ImportMode': 'SPICE'|'DIRECT_QUERY',
        'ConsumedSpiceCapacityInBytes': 123,
        'ColumnGroups': [...],
        'FieldFolders': {...},
        'RowLevelPermissionDataSet': {...},
        'ColumnLevelPermissionRules': [...]
    },
    'RequestId': 'string',
    'Status': 123
}

Our script converts the response from the API to the structure required for creating a dataset using AWS CloudFormation.

The following are the top-level fields in the AWS::QuickSight::DataSet CloudFormation resource:

{
  "Type" : "AWS::QuickSight::DataSet",
  "Properties" : {
      "AwsAccountId" : String,
      "ColumnGroups" : [ ColumnGroup, ... ],
      "ColumnLevelPermissionRules" : [ ColumnLevelPermissionRule, ... ],
      "DataSetId" : String,
      "FieldFolders" : {Key : Value, ...},
      "ImportMode" : String,
      "IngestionWaitPolicy" : IngestionWaitPolicy,
      "LogicalTableMap" : {Key : Value, ...},
      "Name" : String,
      "Permissions" : [ ResourcePermission, ... ],
      "PhysicalTableMap" : {Key : Value, ...},
      "RowLevelPermissionDataSet" : RowLevelPermissionDataSet,
      "Tags" : [ Tag, ... ]
    }
}

The key differences between both JSON structures are as follows:

  • describe_data_set contains Arn, CreatedTime, and LastUpdatedTime, which are useful fields but only relevant to an existing resource
  • AWS CloudFormation requires AwsAccountId when creating the resource
  • AWS CloudFormation accepts tags for the dataset, but describe_data_set doesn’t provide them
  • The AWS CloudFormation Permissions property allows for assigning AWS Identity and Access Management (IAM) permissions at the time of creation

Our script is able to selectively choose the top-level properties we want from the describe_data_set response, then add the fields that AWS CloudFormation requires for resource creation.

Validate security

Before the script creates the CloudFormation template, it performs validations to ensure that our dataset conforms to the defined security rules.

The following is the snippet from our script that performs validation for row-level security:

# row-level security must be applied (the key is absent when no RLS is configured)
if describe_response['DataSet'].get('RowLevelPermissionDataSet') is None:
    raise Exception("row level permissions must be applied!")
else:
    # now we look up the rls data set so that we can confirm that it conforms to our rules
    rls_dataset_id = describe_response['DataSet']['RowLevelPermissionDataSet']['Arn'].split('/')[-1]
    rls_response = client.describe_data_set(
        AwsAccountId = aws_account_id,
        DataSetId = rls_dataset_id
    )

    rls_table_map = rls_response['DataSet']['PhysicalTableMap']

    # the rls table must not be custom SQL
    if 'CustomSql' in rls_table_map[list(rls_table_map.keys())[0]]:
        raise Exception("RLS data set can not contain custom SQL!")

    # confirm that the rls data set name is what we expect it to be
    if rls_response['DataSet']['Name'] != describe_response['DataSet']['Name'] + '_rls':
        raise Exception("RLS data set name must match pattern tablename_rls!")

The steps in the code are as follows:

  1. Ensure that any row-level security is applied (this is the bare minimum).
  2. Look up the dataset that contains the row-level security rules using another Boto3 call.
  3. Confirm that the row-level security dataset is not custom SQL.
  4. Confirm that the name of the table is as expected, with _rls appended to the name of the table being secured.

The use of custom SQL for sourcing row-level security rules isn’t secure in our case, because a QuickSight developer could use SQL to alter the underlying rules. Because of this, our model requires that a physical table from the dataset is used as the row-level security rule source. Of course, it’s possible to use a view in the database to provide the rules. A view is okay because the definition (in our scenario) is governed by a change management process, as opposed to the custom SQL, which the QuickSight developer can create.

The rules being implemented for your specific organization will be different. You may need to connect to a database directly from your Python script in order to validate the dataset was created in a secure manner. Regardless of your actual rules, the describe_data_set API method provides you the details you need to begin validation of the dataset.

Column-level security

Our model for column-level security indicates that any database field name that ends in _sensitive should only be accessible to members of a QuickSight group named restricted. Instead of validating that the dataset has the column-level security rules applied correctly, we simply enforce the rules directly in two steps:

  1. Identify the sensitive fields.
  2. Create a dictionary and add it to our dataset with the key ColumnLevelPermissionRules.

To identify the sensitive fields, we create a list and iterate through the input columns of the physical table:

sensitive_fields = []
input_columns = physical_table_map[list(physical_table_map.keys())[0]]["RelationalTable"]["InputColumns"]
for input_column in input_columns:
    field_name = input_column['Name']
    if field_name.endswith('_sensitive'):
        sensitive_fields.append(field_name)

The result is a list of sensitive fields. We can then take this list and integrate it into the dataset through the use of a dictionary:

if len(sensitive_fields) > 0:
    data_set["ColumnLevelPermissionRules"] = [
        {
            "Principals": [
                {"Ref": "RestrictedUserGroupArn"}
            ],
            "ColumnNames": sensitive_fields
        }
    ]

Instead of specifying a specific principal, we reference the CloudFormation template parameter RestrictedUserGroupArn. The ARN for the restricted group is likely to vary, especially if you’re deploying to another AWS account. Using a template parameter allows us to specify the ARN at the time of dataset creation in the new environment.

Access to the dataset QuickSight resources

The Permissions structure is added to the definition for each dataset:

"Permissions": [
    {
        "Principal": {
            "Ref": "QuickSightAdminPrincipal"
        },
        "Actions": [
            "quicksight:DescribeDataSet",
            "quicksight:DescribeDataSetPermissions",
            "quicksight:PassDataSet",
            "quicksight:DescribeIngestion",
            "quicksight:ListIngestions",
            "quicksight:UpdateDataSet",
            "quicksight:DeleteDataSet",
            "quicksight:CreateIngestion",
            "quicksight:CancelIngestion",
            "quicksight:UpdateDataSetPermissions"
        ]
    }
]

A value for the QuickSightAdminPrincipal CloudFormation template parameter is provided at the time of stack creation. The preceding structure provides the principal access to manage the QuickSight dataset resource itself. Note that this is not the same as data access (though an admin user could manually remove the row-level security rules). Row-level and column-level security rules indicate whether a given user has access to specific data, whereas these permissions allow for actions on the definition of the dataset, such as the following:

  • Updating or deleting the dataset
  • Changing the security permissions
  • Initiating and monitoring SPICE refreshes

End-users don’t require this access in order to use a dashboard created from the dataset.

Run the script

Our script requires you to specify the dataset ID, which is not the same as the dataset name. To determine the ID, use the AWS CLI list-data-sets command.
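If you would rather look up the ID in Python, a small sketch like the following works; the account ID and the dataset name to match on are examples.

import boto3

client = boto3.client('quicksight', region_name='us-west-2')
aws_account_id = '111122223333'  # placeholder account ID

# Page through the account's datasets and match on the display name
next_token = None
while True:
    kwargs = {'AwsAccountId': aws_account_id}
    if next_token:
        kwargs['NextToken'] = next_token
    page = client.list_data_sets(**kwargs)
    for summary in page['DataSetSummaries']:
        if summary['Name'] == 'transactions':  # example dataset name
            print(summary['DataSetId'])
    next_token = page.get('NextToken')
    if not next_token:
        break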

To set the script parameters, you can edit the following lines to match your environment:

# parameters for the source data set
region_name = 'us-west-2'
aws_account_id = "<YOUR_ACCOUNT_ID>"
source_data_set_id = "<SOURCE_DATA_SET_ID>"

# parameters for the target data set
target_data_set_name = "DATA_SET_PRESENTATION_NAME"
target_data_set_id = "NEW_DATA_SET_ID"

The following snippet runs the Python script:

quicksight_security % python3 data_set_to_cf.py
row level security validated!
the following sensitive fields were found: ['discount_sensitive']
cloudformation template written to dataset.json
cli-input-json file written to params.json

CloudFormation template

Now that the security rules have been validated, our script can generate the CloudFormation template. The describe_response_to_cf_data_set method accepts a describe_data_set response as input (along with a few other parameters) and returns a dictionary that reflects the structure of an AWS::QuickSight::DataSet CloudFormation resource. Our code uses this method once for the primary dataset, and again for the _rls rules. This method handles selecting values from the response, prunes some unnecessary items (such as empty tag lists), and replaces a few values with CloudFormation references. These references allow us to provide parameter values to the template, such as QuickSight principals and the data source ARN.
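The exact details of the method depend on your requirements, but at its core it copies the properties AWS CloudFormation accepts and swaps in references. The following is a simplified sketch of the idea, not the full script; in particular, the real script's handling of the data source ARN and the _rls reference is more involved.

def describe_response_to_cf_data_set(describe_response, data_set_id, data_set_name):
    """Simplified sketch: convert a describe_data_set response into a
    CloudFormation AWS::QuickSight::DataSet resource definition."""
    data_set = describe_response['DataSet']

    # Copy only the properties AWS CloudFormation accepts for dataset creation
    properties = {
        'DataSetId': data_set_id,
        'Name': data_set_name,
        'AwsAccountId': {'Ref': 'AWS::AccountId'},
        'ImportMode': data_set['ImportMode'],
        'PhysicalTableMap': data_set['PhysicalTableMap'],
        'LogicalTableMap': data_set['LogicalTableMap'],
        'FieldFolders': data_set.get('FieldFolders', {}),
    }

    # Replace the concrete data source ARN with a template parameter reference
    for table in properties['PhysicalTableMap'].values():
        if 'RelationalTable' in table:
            table['RelationalTable']['DataSourceArn'] = {'Ref': 'DataSourceArn'}

    return {'Type': 'AWS::QuickSight::DataSet', 'Properties': properties}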

You can view the template using the cat command:

quicksight_security % cat dataset.json
{
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Creates a QuickSight Data Set",
    "Parameters": {
        "DataSourceArn": {
            "Type": "String",
            "Description": "ARN for Postgres data source resource"
        },
        "QuickSightOwnerPrincipal": {
            "Type": "String",
            "Description": "ARN for a QuickSight principal who will be granted API access to the datasets"
        },
        "RestrictedUserGroupArn": {
            "Type": "String",
            "Description": "ARN for a QuickSight principal who will be granted access to sensitive fields"
        }
    },
    "Resources": {
        "NewDataSet": {
            "Type": "AWS::QuickSight::DataSet",
            "Properties": {
                "DataSetId": "NEW_DATA_SET_ID",
                "Name": "DATA_SET_PRESENTATION_NAME",
                "AwsAccountId": {
                    "Ref": "AWS::AccountId"
                },
                "Permissions": [
                    {
                        "Principal": {
                            "Ref": "QuickSightAdminPrincipal"
                        },
                        "Actions": [
                            "quicksight:DescribeDataSet",
                            "quicksight:DescribeDataSetPermissions",
                            "quicksight:PassDataSet",
                            "quicksight:DescribeIngestion",
                            "quicksight:ListIngestions",
                            "quicksight:UpdateDataSet",
                            "quicksight:DeleteDataSet",
                            "quicksight:CreateIngestion",
                            "quicksight:CancelIngestion",
                            "quicksight:UpdateDataSetPermissions"
                        ]
                    }
                ],
                "FieldFolders": {},
                "ImportMode": "DIRECT_QUERY",
                "LogicalTableMap": {
                    "e2305db4-2c79-4ac4-aff5-224b8c809767": {
                        "Alias": "transactions",
                        "DataTransforms": [
                            {
                                "ProjectOperation": {
                                    "ProjectedColumns": [
                                        "txn_id",
                                        "txn_type",
                                        "txn_desc",
                                        "txn_amt",
                                        "department",
                                        "discount_sensitive"
                                    ]
                                }
                            }
                        ],
                        "Source": {
                            "PhysicalTableId": "someguid-2c79-4ac4-aff5-224b8c809767"
                        }
                    }
                },
                "PhysicalTableMap": {
                    "e2305db4-2c79-4ac4-aff5-224b8c809767": {
                        "RelationalTable": {
                            "DataSourceArn": {
                                "Ref": "DataSourceArn"
                            },
                            "Schema": "ledger",
                            "Name": "transactions",
                            "InputColumns": [
                                {
                                    "Name": "txn_id",
                                    "Type": "INTEGER"
                                },
                                {
                                    "Name": "txn_type",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "txn_desc",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "txn_amt",
                                    "Type": "DECIMAL"
                                },
                                {
                                    "Name": "department",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "discount_sensitive",
                                    "Type": "DECIMAL"
                                }
                            ]
                        }
                    }
                },
                "RowLevelPermissionDataSet": {
                    "Namespace": "default",
                    "Arn": {
                        "Fn::GetAtt": [
                            "NewDataSetRLS",
                            "Arn"
                        ]
                    },
                    "PermissionPolicy": "GRANT_ACCESS",
                    "FormatVersion": "VERSION_1"
                },
                "ColumnLevelPermissionRules": [
                    {
                        "Principals": [
                            {
                                "Ref": "RestrictedUserGroupArn"
                            }
                        ],
                        "ColumnNames": [
                            "discount_sensitive"
                        ]
                    }
                ]
            }
        },
        "NewDataSetRLS": {
            "Type": "AWS::QuickSight::DataSet",
            "Properties": {
                "DataSetId": "NEW_DATA_SET_ID_rls",
                "Name": "DATA_SET_PRESENTATION_NAME_rls",
                "AwsAccountId": {
                    "Ref": "AWS::AccountId"
                },
                "Permissions": [
                    {
                        "Principal": {
                            "Ref": "QuickSightAdminPrincipal"
                        },
                        "Actions": [
                            "quicksight:DescribeDataSet",
                            "quicksight:DescribeDataSetPermissions",
                            "quicksight:PassDataSet",
                            "quicksight:DescribeIngestion",
                            "quicksight:ListIngestions",
                            "quicksight:UpdateDataSet",
                            "quicksight:DeleteDataSet",
                            "quicksight:CreateIngestion",
                            "quicksight:CancelIngestion",
                            "quicksight:UpdateDataSetPermissions"
                        ]
                    }
                ],
                "FieldFolders": {},
                "ImportMode": "SPICE",
                "LogicalTableMap": {
                    "someguid-51d7-43c4-9f8c-c60a286b0507": {
                        "Alias": "transactions_rls",
                        "DataTransforms": [
                            {
                                "ProjectOperation": {
                                    "ProjectedColumns": [
                                        "groupname",
                                        "department"
                                    ]
                                }
                            }
                        ],
                        "Source": {
                            "PhysicalTableId": "someguid-51d7-43c4-9f8c-c60a286b0507"
                        }
                    }
                },
                "PhysicalTableMap": {
                    "someguid-51d7-43c4-9f8c-c60a286b0507": {
                        "RelationalTable": {
                            "DataSourceArn": {
                                "Ref": "DataSourceArn"
                            },
                            "Schema": "ledger",
                            "Name": "transactions_rls",
                            "InputColumns": [
                                {
                                    "Name": "groupname",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "department",
                                    "Type": "STRING"
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
}

You can deploy this template directly into AWS via the CloudFormation console. You are required to provide the following parameters:

  • DataSourceArn – A QuickSight dataset is a reference to a table or other database object. In order for this object to be accessed, we need to specify a QuickSight data source resource that facilitates the connection.
  • QuickSightAdminPrincipal – The IAM principal allowing access to the data source resource via AWS API calls. You can exclude the IAM permissions from this script and template if your existing security policies automatically provide access to the appropriate users and groups.
  • RestrictedUserGroupArn – The ARN of the QuickSight group that is granted access to the sensitive columns.

You can also deploy the template using the AWS CLI. Although it’s possible to pass in all the parameters directly via the command line, you may find it a bit clunky when entering long values. To simplify this, our script generates a params.json file structured to capture all the parameters required by the template:

{
    "Parameters": [
        {
            "ParameterKey": "DataSourceArn",
            "ParameterValue": "YOUR_DATA_SOURCE_ARN_HERE"
        },
        {
            "ParameterKey": "QuickSightAdminPrincipal",
            "ParameterValue": "YOUR_ADMIN_GROUP_PRINCIPAL_HERE"
        },
        {
            "ParameterKey": "RestrictedUserGroupArn",
            "ParameterValue": "YOUR_RESTRICTED_USER_GROUP_ARN_HERE"
        }
    ]
}

Use the following command to build the stack, with params.json as input:

aws cloudformation create-stack \
--stack-name SecuredDataSet \
--template-body file://dataset.json \
--cli-input-json file://params.json

You can use the AWS CloudFormation console to monitor the stack progress. When the creation is complete, you should see your new dataset in QuickSight!

Conclusion

Though the functionality is relatively new, I consider the API and AWS CloudFormation capabilities to be one of QuickSight’s biggest strengths. Automated validation and enforcement of security rules allows for scale and better security. Being able to manage dataset definitions using AWS CloudFormation provides repeatability, and all of this sets you up for automation. The API and AWS CloudFormation provide tooling to customize QuickSight to suit your workflow, bringing BI into your organization’s cloud management strategy.

If you are looking for related information about dashboard management and migration in QuickSight, refer to Migrate Amazon QuickSight across AWS accounts.


About the Author

Jeremy Winters is an Architect in the AWS Data Lab, where he helps customers design and build data applications to meet their business needs. Prior to AWS, Jeremy built cloud and data applications for consulting customers across a variety of industries.

Trigger an AWS Glue DataBrew job based on an event generated from another DataBrew job

Post Syndicated from Nipun Chagari original https://aws.amazon.com/blogs/big-data/trigger-an-aws-glue-databrew-job-based-on-an-event-generated-from-another-databrew-job/

Organizations today have continuous incoming data, and analyzing this data in a timely fashion is becoming a common requirement for data analytics and machine learning (ML) use cases. As part of this, you need clean data in order to gain insights that can enable enterprises to get the most out of their data for business growth and profitability. You can now use AWS Glue DataBrew, a visual data preparation tool that makes it easy to transform and prepare datasets for analytics and ML workloads.

As we build these data analytics pipelines, we can decouple the jobs by building event-driven analytics and ML workflow pipelines. In this post, we walk through how to trigger a DataBrew job automatically on an event generated from another DataBrew job using Amazon EventBridge and AWS Step Functions.

Overview of solution

The following diagram illustrates the architecture of the solution. We use AWS CloudFormation to deploy an EventBridge rule, an Amazon Simple Queue Service (Amazon SQS) queue, and Step Functions resources to trigger the second DataBrew job.

The steps in this solution are as follows:

  1. Import your dataset to Amazon Simple Storage Service (Amazon S3).
  2. DataBrew queries the data from Amazon S3 by creating a recipe and performing transformations.
  3. The first DataBrew recipe job writes the output to an S3 bucket.
  4. When the first recipe job is complete, it triggers an EventBridge event.
  5. A Step Functions state machine is invoked based on the event, which in turn invokes the second DataBrew recipe job for further processing.
  6. The event is delivered to the dead-letter queue if the rule in EventBridge can’t invoke the state machine successfully.
  7. DataBrew queries data from an S3 bucket by creating a recipe and performing transformations.
  8. The second DataBrew recipe job writes the output to the same S3 bucket.

Prerequisites

To use this solution, you need the following prerequisites:

Load the dataset into Amazon S3

For this post, we use the Credit Card customers sample dataset from Kaggle. This data consists of 10,000 customers, including their age, salary, marital status, credit card limit, credit card category, and more. Download the sample dataset and follow the instructions. We recommend creating all your resources in the same account and Region.

Create a DataBrew project

To create a DataBrew project, complete the following steps:

  1. On the DataBrew console, choose Projects and choose Create project.
  2. For Project name, enter marketing-campaign-project-1.
  3. For Select a dataset, select New dataset.
  4. Under Data lake/data store, choose Amazon S3.
  5. For Enter your source from S3, enter the S3 path of the sample dataset.
  6. Select the dataset CSV file.
  7. Under Permissions, for Role name, choose an existing IAM role created during the prerequisites or create a new role.
  8. For New IAM role suffix, enter a suffix.
  9. Choose Create project.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Create the DataBrew jobs

Now we can create the recipe jobs.

  1. On the DataBrew console, in the navigation pane, choose Projects.
  2. On the Projects page, select the project marketing-campaign-project-1.
  3. Choose Open project and choose Add step.
  4. In this step, we choose Delete to drop the columns from our dataset that aren’t required for this exercise.

You can choose from over 250 built-in functions to merge, pivot, and transpose the data without writing code.

  1. Select the columns to delete and choose Apply.
  2. Choose Create job.
  3. For Job name, enter marketing-campaign-job1.
  4. Under Job output settings, for File type, choose your final storage format (for this post, we choose CSV).
  5. For S3 location, enter your final S3 output bucket path.
  6. Under Settings, for File output storage, select Replace output files for each job run.
  7. Choose Save.
  8. Under Permissions, for Role name, choose an existing role created during the prerequisites or create a new role.
  9. Choose Create job.

Now we repeat the same steps to create another DataBrew project and DataBrew job.

  1. For this post, I named the second project marketing-campaign-project2 and named the job marketing-campaign-job2.
  2. When you create the new project, this time use the job1 output file location as the new dataset.
  3. For this job, we deselect Unknown and Uneducated in the Education_Level column.

Deploy your resources using CloudFormation

For a quick start of this solution, we deploy the resources with a CloudFormation stack. The stack creates the EventBridge rule, SQS queue, and Step Functions state machine in your account to trigger the second DataBrew job when the first job runs successfully.
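For reference, the rule the stack creates matches DataBrew job state change events for the first job and targets the state machine. A rough equivalent in Python is shown below; the rule name, state machine ARN, and role ARN are placeholders, and you should confirm the event pattern against the events DataBrew emits in your account.

import json

import boto3

events = boto3.client('events')

# Match successful runs of the first recipe job
event_pattern = {
    'source': ['aws.databrew'],
    'detail-type': ['DataBrew Job State Change'],
    'detail': {
        'jobName': ['marketing-campaign-job1'],
        'state': ['SUCCEEDED'],
    },
}

events.put_rule(
    Name='databrew-job1-succeeded',
    EventPattern=json.dumps(event_pattern),
    State='ENABLED',
)

# Invoke the state machine that starts the second job (placeholder ARNs)
events.put_targets(
    Rule='databrew-job1-succeeded',
    Targets=[{
        'Id': 'start-second-job',
        'Arn': 'arn:aws:states:us-east-1:111122223333:stateMachine:TriggerDataBrewJob2',
        'RoleArn': 'arn:aws:iam::111122223333:role/eventbridge-invoke-stepfunctions',
    }],
)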

  1. Choose Launch Stack:
  2. For DataBrew source job name, enter marketing-campaign-job1.
  3. For DataBrew target job name, enter marketing-campaign-job2.
  4. For both IAM role configurations, make the following choice:
    1. If you choose Create a new Role, the stack automatically creates a role for you.
    2. If you choose Attach an existing IAM role, you must populate the IAM role ARN manually in the following field or else the stack creation fails.
  5. Choose Next.
  6. Select the two acknowledgement check boxes.
  7. Choose Create stack.

Test the solution

To test the solution, complete the following steps:

  1. On the DataBrew console, choose Jobs.
  2. Select the job marketing-campaign-job1 and choose Run job.

This action automatically triggers the second job, marketing-campaign-job2, via EventBridge and Step Functions.

  1. When both jobs are complete, open the output link for marketing-campaign-job2.

You’re redirected to the Amazon S3 console to access the output file.

In this solution, we created a workflow that required minimal code. The first job triggers the second job, and both jobs deliver the transformed data files to Amazon S3.

Clean up

To avoid incurring future charges, delete all the resources created during this walkthrough:

  • IAM roles
  • DataBrew projects and their associated recipe jobs
  • S3 bucket
  • CloudFormation stack

Conclusion

In this post, we walked through how to use DataBrew along with EventBridge and Step Functions to run a DataBrew job that automatically triggers another DataBrew job. We encourage you to use this pattern for event-driven pipelines where you can build sequence jobs to run multiple jobs in conjunction with other jobs.


About the Authors

Nipun Chagari is a Senior Solutions Architect at AWS, where he helps customers build highly available, scalable, and resilient applications on the AWS Cloud. He is passionate about helping customers adopt serverless technology to meet their business objectives.

Prarthana Angadi is a Software Development Engineer II at AWS, where she has been expanding what is possible with code in order to make life more efficient for AWS customers.

Integrate AWS Glue Schema Registry with the AWS Glue Data Catalog to enable effective schema enforcement in streaming analytics use cases

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/big-data/integrate-aws-glue-schema-registry-with-the-aws-glue-data-catalog-to-enable-effective-schema-enforcement-in-streaming-analytics-use-cases/

Metadata is an integral part of data management and governance. The AWS Glue Data Catalog can provide a uniform repository to store and share metadata. The main purpose of the Data Catalog is to provide a central metadata store where disparate systems can store, discover, and use that metadata to query and process the data.

Another important aspect of data governance is serving and managing the relationship between data stores and external clients, which are the producers and consumers of data. As the data evolves, especially in streaming use cases, we need a central framework that provides a contract between producers and consumers to enable schema evolution and improved governance. The AWS Glue Schema Registry provides a centralized framework to help manage and enforce schemas on data streaming applications using convenient integrations with Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Apache Flink and Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

In this post, we demonstrate how to integrate Schema Registry with the Data Catalog to enable efficient schema enforcement in streaming analytics use cases.

Stream analytics on AWS

There are many different scenarios where customers want to run stream analytics on AWS while managing schema evolution effectively. Managing the end-to-end stream analytics life cycle involves many different applications for data production, processing, analytics, routing, and consumption, and it can be quite hard to coordinate changes across all of them. Adding or removing a data field across different stream analytics applications can lead to data quality issues or downstream application failures if the change isn’t managed appropriately.

For example, a large grocery store may want to send order information to its backend systems using Amazon Kinesis Data Streams. While sending the order information, the customer may want to apply some data transformations or run analytics on it. The orders may be routed to different targets depending on the order type, and may be integrated with many backend applications that expect order stream data in a specific format. However, the order details schema can change for many different reasons, such as new business requirements, technical changes, or source system upgrades.

These changes are inevitable, but customers want a mechanism to manage them effectively while running their stream analytics workloads. To support stream analytics use cases on AWS and enforce schemas and governance, customers can use the AWS Glue Schema Registry along with AWS streaming analytics services.

You can use Amazon Kinesis Data Firehose data transformation to ingest data from Kinesis Data Streams, run a simple data transformation on a batch of records via a Lambda function, and deliver the transformed records to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, Splunk, Datadog, New Relic, Dynatrace, Sumo Logic, LogicMonitor, MongoDB, and an HTTP endpoint. The Lambda function transforms the current batch of records with no information or state from previous batches.
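To illustrate that transformation model, the following is a minimal sketch of a Kinesis Data Firehose transformation Lambda handler; it assumes each record carries JSON and simply tags it before returning it, and isn’t tied to the dataset used later in this post.

import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        # Each record's data arrives base64-encoded
        payload = json.loads(base64.b64decode(record['data']))

        # Example transformation: add a marker field (assumed for illustration)
        payload['processed'] = True

        output.append({
            'recordId': record['recordId'],
            'result': 'Ok',  # or 'Dropped' / 'ProcessingFailed'
            'data': base64.b64encode(json.dumps(payload).encode('utf-8')).decode('utf-8')
        })
    return {'records': output}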

Lambda also provides stream analytics capabilities for Amazon Kinesis Data Streams and Amazon DynamoDB Streams. This feature enables data aggregation and state management across multiple function invocations. This capability uses a tumbling window, which is a fixed-size, non-overlapping time interval of up to 15 minutes. When you apply a tumbling window to a stream, records in the stream are grouped by window and sent to the processing Lambda function. The function returns a state value that is passed to the next tumbling window.
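A sketch of how that tumbling-window state might be used in a handler follows; the event shape (the state and isFinalInvokeForWindow fields) is as we understand the Lambda tumbling-window contract for Kinesis event sources, and the per-record aggregation is an assumption for illustration.

import base64
import json

def lambda_handler(event, context):
    # State carried over from the previous invocation in the same window
    state = event.get('state') or {'count': 0}

    for record in event.get('Records', []):
        payload = json.loads(base64.b64decode(record['kinesis']['data']))
        state['count'] += 1  # aggregate something meaningful in practice

    if event.get('isFinalInvokeForWindow'):
        # Window closed: emit or persist the aggregate here
        print(f"Window total: {state['count']}")

    # The returned state is passed to the next invocation in the window
    return {'state': state}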

Kinesis Data Analytics provides SQL-based stream analytics against streaming data. This service also enables you to use an Apache Flink application to process stream data. Data can be ingested from Kinesis Data Streams and Kinesis Data Firehose, and the supported destinations are Kinesis Data Firehose (delivering to Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk), Lambda, and Kinesis Data Streams.

Finally, you can use the AWS Glue streaming extract, transform, and load (ETL) capability as a serverless method to consume data from Kinesis and Apache Kafka or Amazon MSK. The job aggregates, transforms, and enriches the data using Spark streaming, then continuously loads the results into Amazon S3-based data lakes, data warehouses, DynamoDB, JDBC, and more.

Managing stream metadata and schema evolution is becoming more important for stream analytics use cases. To enable these on AWS, the Data Catalog and Schema Registry allow you to centrally control and discover schemas. Before the release of schema referencing in the Data Catalog, you relied on managing schema evolution separately in the Data Catalog and Schema Registry, which usually leads to inconsistencies between these two. With the new release of the Data Catalog and Schema Registry integration, you can now reference schemas stored in the schema registry when creating or updating AWS Glue tables in the Data Catalog. This helps avoid inconsistency between the schema registry and Data Catalog, which results in end-to-end data quality enforcement.

In this post, we walk you through a streaming ETL example in AWS Glue to better showcase how this integration can help. This example includes reading streaming data from Kinesis Data Streams, schema discovery with Schema Registry, using the Data Catalog to store the metadata, and writing out the results to an Amazon S3 as a sink.

Solution overview

The following high-level architecture diagram shows the components to integrate Schema Registry and the Data Catalog to run streaming ETL jobs. In this architecture, Schema Registry helps centrally track and evolve Kinesis Data Streams schemas.

At a high level, we use the Amazon Kinesis Data Generator (KDG) to stream data to a Kinesis data stream, use AWS Glue to run streaming ETL, and use Amazon Athena to query the data.

In the following sections, we walk you through the steps to build this architecture.

Create a Kinesis data stream

To set up a Kinesis data stream, complete the following steps:

  1. On the Kinesis console, choose Data streams.
  2. Choose Create data stream.
  3. Give the stream a name, such as ventilator_gsr_stream.
  4. Complete stream creation.
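If you’d rather script this step, a minimal boto3 sketch (assuming a single shard is enough for the sample data) looks like this:

import boto3

kinesis = boto3.client('kinesis')

# One shard is enough for the KDG sample data in this walkthrough
kinesis.create_stream(StreamName='ventilator_gsr_stream', ShardCount=1)

# Wait until the stream is ACTIVE before sending data to it
kinesis.get_waiter('stream_exists').wait(StreamName='ventilator_gsr_stream')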

Configure Kinesis Data Generator to generate sample data

You can use the KDG with the ventilator template available on the GitHub repo to generate sample data. The following diagram shows the template on the KDG console.

Add a new AWS Glue schema registry

To add a new schema registry, complete the following steps:

  1. On the AWS Glue console, under Data catalog in the navigation pane, choose Schema registries.
  2. Choose Add registry.
  3. For Registry name, enter a name (for example, MyDemoSchemaReg).
  4. For Description, enter an optional description for the registry.
  5. Choose Add registry.

Add a schema to the schema registry

To add a new schema, complete the following steps:

  1. On the AWS Glue console, under Schema registries in the navigation pane, choose Schemas.
  2. Choose Add schema.
  3. Provide the schema name (ventilatorstream_schema_gsr) and attach the schema to the schema registry defined in the previous step.
  4. AWS Glue schemas currently support Avro or JSON formats; for this post, select JSON.
  5. Use the default Compatibility mode and provide the necessary tags as per your tagging strategy.

Compatibility modes allow you to control how schemas can or cannot evolve over time. These modes form the contract between applications producing and consuming data. When a new version of a schema is submitted to the registry, the compatibility rule applied to the schema name is used to determine if the new version can be accepted. For more information on different compatibility modes, refer to Schema Versioning and Compatibility.

  1. Enter the following sample JSON:
    {
      "$id": "https://example.com/person.schema.json",
      "$schema": "http://json-schema.org/draft-07/schema#",
      "title": "Ventilator",
      "type": "object",
      "properties": {
        "ventilatorid": {
          "type": "integer",
          "description": "Ventilator ID"
        },
        "eventtime": {
          "type": "string",
          "description": "Time of the event."
        },
        "serialnumber": {
          "description": "Serial number of the device.",
          "type": "string",
          "minimum": 0
        },
        "pressurecontrol": {
          "description": "Pressure control of the device.",
          "type": "integer",
          "minimum": 0
        },
        "o2stats": {
          "description": "O2 status.",
          "type": "integer",
          "minimum": 0
        },
        "minutevolume": {
          "description": "Volume.",
          "type": "integer",
          "minimum": 0
        },
        "manufacturer": {
          "description": "Volume.",
          "type": "string",
          "minimum": 0
        }
      }
    }

  2. Choose Create schema and version.
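The registry and schema can also be created through the AWS Glue API. The following boto3 sketch mirrors the console steps above; the BACKWARD compatibility value is an assumption standing in for whatever default you kept, and schema_definition is the JSON Schema document shown earlier (abbreviated here).

import json
import boto3

glue = boto3.client('glue')

glue.create_registry(
    RegistryName='MyDemoSchemaReg',
    Description='Registry for the ventilator streaming demo'
)

# schema_definition is the JSON Schema document shown above
schema_definition = json.dumps({
    "$id": "https://example.com/person.schema.json",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Ventilator",
    "type": "object",
    "properties": {
        "ventilatorid": {"type": "integer", "description": "Ventilator ID"}
        # ... remaining properties as listed earlier
    }
})

glue.create_schema(
    RegistryId={'RegistryName': 'MyDemoSchemaReg'},
    SchemaName='ventilatorstream_schema_gsr',
    DataFormat='JSON',
    Compatibility='BACKWARD',  # assumed default; match your console choice
    SchemaDefinition=schema_definition
)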

Create a new Data Catalog table

To add a new table in the Data Catalog, complete the following steps:

  1. On the AWS Glue Console, under Data Catalog in the navigation pane, choose Tables.
  2. Choose Add table.
  3. Select Add tables from existing schema.
  4. Enter the table name and choose the database.
  5. Select the source type as Kinesis and choose a data stream in your own account.
  6. Choose the respective Region and choose the stream ventilator_gsr_stream.
  7. Choose the MyDemoSchemaReg registry created earlier and the schema (ventilatorstream_schema_gsr) with its respective version.

You should be able to preview the schema.

  1. Choose Next and then choose Finish to create your table.

Create the AWS Glue job

To create your AWS Glue job, complete the following steps:

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Select Visual with a source and target.
  3. Under Source, select Amazon Kinesis and under Target, select Amazon S3.
  4. Choose Create.
  5. Choose Data source.
  6. Configure the job properties such as name, AWS Identity and Access Management (IAM) role, type, and AWS Glue version.

For the IAM role, specify a role that is used for authorization to resources used to run the job and access data stores. Because streaming jobs require connecting to sources and sinks, you need to make sure that the IAM role has permissions to read from Kinesis Data Streams and write to Amazon S3.

  1. For This job runs, select A new script authored by you.
  2. Under Advanced properties, keep Job bookmark disabled.
  3. For Log Filtering, select Standard filter and Spark UI.
  4. Under Monitoring options, enable Job metrics and Continuous logging with Standard filter.
  5. Enable the Spark UI and provide the S3 bucket path to store the Spark event logs.
  6. For Job parameters, enter the following key-values:
    • –output_path – The S3 path where the final aggregations are persisted
    • –aws_region – The Region where you run the job
  7. Leave Connections empty and choose Save job and edit script.
  8. Use the following code for the AWS Glue job (update the values for database, table_name, and checkpointLocation):
import sys
import datetime
import boto3
import base64
from pyspark.sql import DataFrame, Row
from pyspark.context import SparkContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'aws_region', 'output_path'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# S3 sink locations
aws_region = args['aws_region']
output_path = args['output_path']

s3_target = output_path + "ventilator_metrics"
checkpoint_location = output_path + "cp/"
temp_path = output_path + "temp/"


def processBatch(data_frame, batchId):
    now = datetime.datetime.now()
    year = now.year
    month = now.month
    day = now.day
    hour = now.hour
    minute = now.minute
    if data_frame.count() > 0:
        dynamic_frame = DynamicFrame.fromDF(data_frame, glueContext, "from_data_frame")
        apply_mapping = ApplyMapping.apply(
            frame=dynamic_frame,
            mappings=[
                ("ventilatorid", "long", "ventilatorid", "long"),
                ("eventtime", "string", "eventtime", "timestamp"),
                ("serialnumber", "string", "serialnumber", "string"),
                ("pressurecontrol", "long", "pressurecontrol", "long"),
                ("o2stats", "long", "o2stats", "long"),
                ("minutevolume", "long", "minutevolume", "long"),
                ("manufacturer", "string", "manufacturer", "string")],
            transformation_ctx="apply_mapping")

        dynamic_frame.printSchema()

        # Write to S3 Sink
        s3path = s3_target + "/ingest_year=" + "{:0>4}".format(str(year)) + "/ingest_month=" + "{:0>2}".format(str(month)) + "/ingest_day=" + "{:0>2}".format(str(day)) + "/ingest_hour=" + "{:0>2}".format(str(hour)) + "/"
        s3sink = glueContext.write_dynamic_frame.from_options(frame=apply_mapping, connection_type="s3", connection_options={"path": s3path}, format="parquet", transformation_ctx="s3sink")


# Read from Kinesis Data Stream
sourceData = glueContext.create_data_frame.from_catalog(
    database="kinesislab",
    table_name="ventilator_gsr_new",
    transformation_ctx="datasource0",
    additional_options={"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})

sourceData.printSchema()

glueContext.forEachBatch(frame=sourceData, batch_function=processBatch, options={"windowSize": "100 seconds", "checkpointLocation": "s3://<bucket name>/ventilator_gsr/checkpoint/"})
job.commit()

Our AWS Glue job is ready to read the data from the Kinesis data stream and send it to Amazon S3 in Parquet format.

Query the data using Athena

The processed streaming data is written in Parquet format to Amazon S3. Run an AWS Glue crawler on the Amazon S3 location where the streaming data is written; the crawler updates the Data Catalog. You can then run queries using Athena to start deriving relevant insights from the data.
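For example, once the crawler has populated a table (assumed here to be named ventilator_metrics in the kinesislab database, matching the S3 prefix used by the job), you could run a quick aggregation from code as well as from the Athena console. The table name and output location below are assumptions based on this walkthrough.

import boto3

athena = boto3.client('athena')

query = """
SELECT manufacturer,
       avg(o2stats)         AS avg_o2stats,
       avg(pressurecontrol) AS avg_pressurecontrol
FROM kinesislab.ventilator_metrics
GROUP BY manufacturer
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://<bucket name>/athena-results/'}
)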

Clean up

It’s always a good practice to clean up all the resources created as part of this post to avoid any undue cost. To clean up your resources, delete the AWS Glue database, tables, crawlers, jobs, service role, and S3 buckets.

Additionally, be sure to clean up all other AWS resources that you created using AWS CloudFormation. You can delete these resources on the AWS CloudFormation console by deleting the stack used for the Kinesis Data Generator.

Conclusion

This post demonstrated the importance of centrally managing metadata and schema evolution in stream analytics use cases. It also described how the integration of the Data Catalog and Schema Registry can help you achieve this on AWS. We used a streaming ETL example in AWS Glue to better showcase how this integration can help to enforce end-to-end data quality.

To learn more and get started, you can check out AWS Glue Data Catalog and AWS Glue Schema Registry.


About the Authors

Dr. Sam Mokhtari is a Senior Solutions Architect at AWS. His main area of depth is data and analytics, and he has published more than 30 influential articles in this field. He is also a respected data and analytics advisor, and has led several large-scale implementation projects across different industries, including energy, health, telecom, and transport.

Amar Surjit is a Sr. Solutions Architect based in the UK who has been working in IT for over 20 years designing and implementing global solutions for enterprise customers. He is passionate about streaming technologies and enjoys working with customers globally to design and build streaming architectures and drive value by analyzing their streaming data.

Supercharging Dream11’s Data Highway with Amazon Redshift RA3 clusters

Post Syndicated from Dhanraj Gaikwad original https://aws.amazon.com/blogs/big-data/supercharging-dream11s-data-highway-with-amazon-redshift-ra3-clusters/

This is a guest post by Dhanraj Gaikwad, Principal Engineer on Dream11 Data Engineering team.

Dream11 is the world’s largest fantasy sports platform, with over 120 million users playing fantasy cricket, football, kabaddi, basketball, hockey, volleyball, handball, rugby, futsal, American football, and baseball. Dream11 is the flagship brand of Dream Sports, India’s leading Sports Technology company, and has partnerships with several national and international sports bodies and cricketers.

In this post, we look at how we supercharged our data highway, the backbone of our major analytics pipeline, by migrating our Amazon Redshift clusters to RA3 nodes. We also look at why we were excited about this migration, the challenges we faced during the migration and how we overcame them, as well as the benefits accrued from the migration.

Background

The Dream11 Data Engineering team runs the analytics pipelines (what we call our Data Highway) across Dream Sports. In near-real time, we analyze various aspects that directly impact the end-user experience, which can have a profound business impact for Dream11.

Initially, we were analyzing upwards of terabytes of data per day with Amazon Redshift clusters that ran mainly on dc2.8xlarge nodes. However, due to a rapid increase in our user participation over the last few years, we observed that our data volumes increased multi-fold. Because we were using dc2.8xlarge clusters, this meant adding more nodes of dc2.8xlarge instance types to the Amazon Redshift clusters. Not only was this increasing our costs, it also meant that we were adding additional compute power when what we really needed was more storage. Because we anticipated significant growth during the Indian Premier League (IPL) 2021, we actively explored various options using our AWS Enterprise Support team. Additionally, we were expecting more data volume over the next few years.

The solution

After discussions with AWS experts and the Amazon Redshift product team, we at Dream11 concluded that the most viable option was to migrate our Amazon Redshift clusters from dc2.8xlarge to the newer RA3 nodes. The most obvious reason for this was the decoupling of storage from compute. As a result, we could use fewer nodes and move our storage to Amazon Redshift managed storage. This allowed us to respond to data volume growth in the coming years as well as reduce our costs.

To start off, we conducted a few elementary tests using an Amazon Redshift RA3 test cluster. After we were convinced that this wouldn’t require many changes in our Amazon Redshift queries, we decided to carry out a complete head-to-head performance test between the two clusters.

Validating the solution

Because the user traffic on the Dream11 app tends to spike during big-ticket tournaments like the IPL, we wanted to ensure that the RA3 clusters could handle the same traffic that we usually experience during our peak. The AWS Enterprise Support team suggested using the Simple Replay tool, an open-source tool released by AWS that you can use to record and replay the queries from one Amazon Redshift cluster to another. This tool allows you to capture queries on a source Amazon Redshift cluster, and then replay the same queries on a destination Amazon Redshift cluster (or clusters). We decided to use this tool to capture our performance test queries on the existing dc2.8xlarge clusters and replay them on a test Amazon Redshift cluster composed of RA3 nodes. At the time of our experimentation, the newer version of the automated AWS CloudFormation-based toolset (now on GitHub) was not available.

Challenges faced

The first challenge came up when using the Simple Replay tool because there was no easy way to compare the performance of like-to-like queries on the two types of clusters. Although Amazon Redshift provides various statistics using meta-tables about individual queries and their performance, the Simple Replay tool adds additional comments in each Amazon Redshift query on the target cluster to make it easier to know if these queries were run by the Simple Replay tool. In addition, the Simple Replay tool drops comments from the queries on the source cluster.

Comparing each query’s performance with the Amazon Redshift performance test suite would mean writing additional scripts for easy performance comparison. An alternative would have been to modify the Simple Replay tool code, because it’s open source on GitHub. However, with IPL 2021 beginning in just a few days, we had to explore another option urgently.

After further discussions with the AWS Enterprise Support team, we decided to use two test clusters: one with the old dc2.8xlarge nodes, and another with the newer RA3 nodes. The idea was to use the Simple Replay tool to run the captured queries from our original cluster on both test clusters. This meant that the queries would be identical on both test clusters, making it easier to compare. Although this meant running an additional test cluster for a few days, we went ahead with this option. As a side note, the newer automated AWS CloudFormation-based toolset does exactly the same in an automated way.

After we were convinced that most of our Amazon Redshift queries performed satisfactorily, we noticed that certain queries were performing slower on the RA3-based cluster than on the dc2.8xlarge cluster. We narrowed down the problem to SQL queries with full table scans. We rectified it by following proper data modeling practices in the ETL workflow. Then we were ready to migrate to the newer RA3 nodes.

The migration to RA3

The migration from the old cluster to the new cluster was smoother than we thought. We used the elastic resize approach, which meant we only had a few minutes of Amazon Redshift downtime. We completed the migration successfully with a sufficient buffer timeline for more tests. Additional tests indicated that the new cluster performed how we wanted it to.

The trial by fire

The new cluster performed satisfactorily during our peak performance loads in the IPL as well as the following ICC T20 Cricket World Cup. We’re excited that the new RA3 node-based Amazon Redshift cluster can support our data volume growth needs without needing to increase the number of instance nodes.

We migrated from dc2 to RA3 in April 2021. The data volume has grown by 50% since then. If we had continued with dc2 instances, the cluster cost would have increased by 50%. However, because of the migration to RA3 instances, even with an increase in data volume by 50% since April 2021, the cluster cost has increased by 0.7%, which is attributed to an increase in storage cost.

Conclusion

Migrating to the newer RA3-based Amazon Redshift cluster helped us decouple our computing needs from our storage needs, and now we’re prepared for our expected data volume growth for the next few years. Moreover, we don’t need to add compute nodes if we only need storage, which is expected to bring down our costs in the long run. We did need to fine-tune some of our queries on the newer cluster. With the Simple Replay tool, we could do a direct comparison between the older and the newer cluster. You can also use the newer automated AWS CloudFormation-based toolset if you want to follow a similar approach.

We highly recommend RA3 instances. They give you the flexibility to size your RA3 cluster based on the amount of data stored without increasing your compute costs.


About the Authors

Dhanraj Gaikwad is a Principal Data Engineer at Dream11. Dhanraj has more than 15 years of experience in the field of data and analytics. In his current role, Dhanraj is responsible for building the data platform for Dream Sports and is specialized in data warehousing, including data modeling, building data pipelines, and query optimizations. He is passionate about solving large-scale data problems and taking unique approaches to deal with them.

Sanket Raut is a Principal Technical Account Manager at AWS based in Vasai, India. Sanket has more than 16 years of industry experience, including roles in cloud architecture, systems engineering, and software design. He currently focuses on enabling large startups to streamline their cloud operations and optimize their cloud spend. His area of interest is in serverless technologies.

Visualize MongoDB data from Amazon QuickSight using Amazon Athena Federated Query

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/visualize-mongodb-data-from-amazon-quicksight-using-amazon-athena-federated-query/

In this post, you will learn how to use Amazon Athena Federated Query to connect a MongoDB database to Amazon QuickSight in order to build dashboards and visualizations.

Amazon Athena is a serverless interactive query service, based on Presto, that provides full ANSI SQL support to query a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet, that are stored on Amazon Simple Storage Service (Amazon S3). For data that isn’t stored on Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store it in Amazon S3. With Athena Federated Query, you can run SQL queries across data that is stored in relational, non-relational, object, and custom data sources.

MongoDB is a popular NoSQL database option for websites and API endpoints. You can choose to deploy MongoDB as a self-hosted or fully-managed database. Databases are a popular choice for UI applications for managing user profiles, product catalogs, profile views, clickstream events, events from a connected device, and so on. QuickSight is a serverless business analytics service with built-in machine learning (ML) capabilities that can automatically look for patterns and outliers, and has the flexibility to embed dashboards in applications for a data-driven experience. You can also use QuickSight Q to allow users to ask questions using natural language and find answers to business questions immediately.

Overview of Athena Federated Query

Athena Federated Query uses data source connectors that run on AWS Lambda to run federated queries to other data sources. Prebuilt data source connectors are available for native stores, like Amazon Timestream, Amazon CloudWatch Logs, Amazon DynamoDB, and external sources like Vertica and SAP Hana. You can also write a connector by using the Athena Query Federation SDK. You can customize Athena’s prebuilt connectors for your own use, or modify a copy of the source code to create your own AWS Serverless Application Repository package.

Solution overview

The following architecture diagram shows the components of the Athena Federated Query MongoDB connector. It contains the following components:

  • A virtual private cloud (VPC) configured with public and private subnets across three Availability Zones.
  • A MongoDB cluster with customizable Amazon Elastic Block Store (Amazon EBS) storage deployed in private subnets and NAT gateways in a public subnet for outbound internet connectivity for MongoDB instances.
  • Bastion hosts in an auto scaling group with Elastic IP addresses to allow inbound SSH access.
  • An AWS Identity and Access Management (IAM) MongoDBnode role with Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3 permissions.
  • Security groups to enable communication within the VPC.
  • Lambda functions deployed in a private subnet accessing S3 buckets. Athena invokes the Lambda function, which in turn fetches the data from MongoDB and maps the response back to Athena.
  • AWS Secrets Manager through a VPC endpoint.

Prerequisites

To implement the solution, you need the following:

  • An AWS account to access AWS services.
  • An IAM user with permission to CreateRole, ListRoles, GetPolicy, and AttachRolePolicy.
  • An IAM user with an access key and secret key to configure an integrated development environment (IDE).
  • A MongoDB database. You can deploy a hosted MongoDB on Amazon EC2 or MongoDB Atlas in a VPC.
  • If you don’t have a QuickSight subscription configured, sign up for one. You can access the QuickSight free trial as part of the AWS Free Tier option.
  • A new secret in Secrets Manager to store your MongoDB user name and password.
  • Data loaded into your MongoDB database. For this example, we used an airline dataset. Load the sample data either from the MongoDB command line or the MongoDB Atlas user interface, if using MongoDB Atlas.

Configure a Lambda connector

The first step in the deployment is to set up the connector environment. Athena uses data source connectors that run on Lambda to run federated queries. To connect with MongoDB, use the Amazon Athena DocumentDB Connector, which also works with any endpoint that is compatible with MongoDB.

To configure a Lambda connector, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. To view a published list of data sources for Athena, select Amazon DocumentDB.
  3. Choose Next.
  4. In the Data source details section, give your data source a unique name; for example, ds_mongo.
    This will be the connection name that appears under Data sources for Athena.
  5. Choose Create Lambda function.
    This launches the Create function page in Lambda. The connector is deployed by using AWS Serverless Application Repository.
  6. For SecretNameOrPrefix, enter mongo.
  7. For SpillBucket, enter spl-mongo-athena-test.
  8. For AthenaCatalogName, enter us-west-mongo-cat.
  9. For DocDBConnectionString (the MongoDB connection), enter the following:
    mongodb://${docdb_instance_1_creds}@replace_with_mongodb_private_ip:27017/?authSource=admin&readPreference=secondaryPreferred&retryWrites=false; 

  10. For SecurityGroupIds, choose the security group that you want to associate with the function. Make sure that the security group of the MongoDB instance allows traffic from the Lambda function.
  11. For SpillPrefix, enter athena-spill.
  12. For Subnetids, enter the subnet IDs of subnets with MongoDB instances.
    In this case, LambdaMemory and LambdaTimeout have been set to the maximum values, but these can vary depending on the query run and memory requirements. SpillBucket is an S3 bucket in your account to store data that exceeds the Lambda function response size limits.
  13. Keep the rest as defaults.
  14. Select the acknowledgement check box and choose Deploy.
    The connection function is launched based on the given parameters.
  15. Create a VPC endpoint to allow the Lambda function to access Amazon S3 through an endpoint.
    This is for the spill bucket. The spill bucket is a staging area for copying the results of the queries that are performed on MongoDB via Athena federation. This is so that the Lambda function in the VPC can access Amazon S3.
  16. Go back to the Athena console.
  17. Under Connection details, for Lambda function, choose the newly created Lambda function.
  18. Choose Next.

  19. To verify the connection, on the Athena console, choose Data sources, then choose ds_mongo.
    Associated databases from the connection should be listed.

    You should now be able to query the datasets from the Athena query editor by using SQL.
  20. In the query editor, for Data Source, choose ds_mongo.

Athena federates the query using the connector, which invokes the Lambda function. Then the query is performed by the function on MongoDB, and the query results are translated back to Athena. The following is a sample query that was performed on the airlines dataset.
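A representative query might look like the following, run either in the Athena query editor or through the API; the database name, collection name, and results location are hypothetical placeholders, since they depend on how you loaded the airline dataset into MongoDB.

import boto3

athena = boto3.client('athena')

# Fully qualified as <data source>.<database>.<collection>; "air" and
# "airlines" are hypothetical names for the sample airline dataset.
query = 'SELECT * FROM "ds_mongo"."air"."airlines" LIMIT 10'

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={'OutputLocation': 's3://<your-athena-results-bucket>/'}
)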

Create a dataset on QuickSight to read the data from MongoDB

Before you launch QuickSight for the first time in an AWS account, you must set up an account. For instructions, see Signing in to Amazon QuickSight.

After the initial setup, you can create a dataset with Athena as the source. The QuickSight service role needs permission to invoke the Lambda function that connects MongoDB. The aws-quicksight-service-role-v0 service role is automatically created with the QuickSight account.

To create a dataset in QuickSight, complete the following steps:

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Search for the role aws-quicksight-service-role-v0 and attach the AWSLambda_FullAccess permission (or an equivalent policy that allows invoking the connector function).
    In an organization, there could be different data stores based on data load and consumption patterns. Examples include catalog or manual data that is associated with products in a MongoDB or key-value index store, transactions or sales data in a SQL database, and images or video clips that are associated with the product in an object store.
    In this case, an airlines table from MongoDB is joined with a flat file that contains information on the airports.
  3. Use the QuickSight cross-data store feature to join data from different sources on common fields.
  4. We then update the data types for our geographic fields, like city, country, latitude, and longitude, so we can build maps later.
  5. You can also create calculated fields while preparing your dataset, which allows you to reuse them in other QuickSight analyses.

With a few clicks, you should be able to create a dashboard with the published dataset. For instance, you can plot your data on a map, show trends in a line chart, and add autonarratives from the list of Suggested Insights to create the analysis shown in the following screenshot.

Clean up

Make sure to clean up your resources to avoid resource spend and associated costs. You need to delete the EC2 instances with MongoDB. In the case of MongoDB Atlas, you can delete the databases and tables. Delete the Athena data source ds_mongo and unsubscribe your QuickSight account from the Manage QuickSight admin page.

Conclusion

With QuickSight and Athena Federated Query, organizations can access additional data sources beyond those already supported by QuickSight. If you have data in sources other than Amazon S3, you can use Athena Federated Query to analyze the data in place or build pipelines that extract and store data in Amazon S3. Athena now also supports cross-account federated queries to enable teams of analysts, data scientists, and data engineers to query data stored in other AWS accounts. Try connecting to proprietary data formats and sources, or build new user-defined functions, with the Athena Query Federation SDK.


About the Author

Soujanya Konka is a Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya had stints with companies such as HSBC and Cognizant.

Nilesh Parekh is a Partner Solution Architect with the ISV India segment. Nilesh helps partners review and remediate their workloads running on AWS based on the AWS Well-Architected and Foundational Technical Review best practices. He also helps partners with application modernization and delivering POCs.

Synchronize your AWS Glue Studio Visual Jobs to different environments 

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/synchronize-your-aws-glue-studio-visual-jobs-to-different-environments/

AWS Glue has become a popular option for integrating data from disparate data sources due to its ability to integrate large volumes of data using distributed data processing frameworks. Many customers use AWS Glue to build data lakes and data warehouses. Data engineers who prefer to develop data processing pipelines visually can use AWS Glue Studio to create data integration jobs. This post introduces the Glue Job Visual API for authoring Glue Studio Visual Jobs programmatically, and the Glue Job Sync utility, which uses the API to synchronize Glue jobs to different environments without losing the visual representation.

Glue Job Visual API

AWS Glue Studio has a graphical interface called the Visual Editor that makes it easy to author extract, transform, and load (ETL) jobs in AWS Glue. The Glue jobs created in the Visual Editor contain a visual representation that describes the data transformation. In this post, we call these jobs Glue Studio Visual Jobs.

For example, it’s common to develop and test AWS Glue jobs in a dev account, and then promote the jobs to a prod account. Previously, when you copied an AWS Glue Studio Visual job to a different environment, there was no mechanism to copy the visual representation along with it. This means that the visual representation of the job was lost and you could only copy the code produced with Glue Studio. It can be time-consuming and tedious to either copy the code or recreate the job.

The AWS Glue Job Visual API lets you programmatically create and update Glue Studio Visual Jobs by providing a JSON object that describes the visual representation, and also retrieve the visual representation from existing Glue Studio Visual Jobs. A Glue Studio Visual Job consists of data source nodes for reading the data, transform nodes for modifying the data, and data target nodes for writing the data.
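To give a feel for the API, the following boto3 sketch retrieves a visual job’s definition, including its CodeGenConfigurationNodes (the JSON object that carries the visual representation), and recreates it under another name. Treat it as a simplified illustration, as we understand the Job APIs, rather than a replacement for the sync utility described below.

import boto3

glue = boto3.client('glue')

# Retrieve an existing Glue Studio Visual Job, including its visual nodes
job = glue.get_job(JobName='test1')['Job']

# Recreate it (here in the same account, under a new name) while keeping
# the visual representation; a real migration would also remap roles,
# script locations, and other environment-specific values.
glue.create_job(
    Name='test1-copy',
    Role=job['Role'],
    Command=job['Command'],
    GlueVersion=job.get('GlueVersion', '3.0'),
    DefaultArguments=job.get('DefaultArguments', {}),
    CodeGenConfigurationNodes=job.get('CodeGenConfigurationNodes', {})
)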

There are some typical use cases for Glue Visual Job API:

  • Automate creation of Glue Visual Jobs.
  • Migrate your ETL jobs from third-party or on-premises ETL tools to AWS Glue. Many AWS partners, such as Bitwise, Bladebridge, and others have built convertors from the third-party ETL tools to AWS Glue.
  • Synchronize AWS Glue Studio Visual jobs from one environment to another without losing visual representation.

In this post, we focus on a utility that uses Glue Job Visual APIs to achieve the mass synchronization of your Glue Studio Visual Jobs without losing the visual representation.

Glue Job Sync Utility

There are common requirements to synchronize Glue Visual Jobs between different environments:

  • Promote Glue Visual Jobs from a dev account to a prod account.
  • Transfer ownership of Glue Visual Jobs between different AWS accounts.
  • Replicate Glue Visual Job configurations from one Region to another for disaster recovery purposes.

The Glue Job Sync Utility is a Python application built on top of the Glue Job Visual API that enables you to synchronize your AWS Glue Studio Visual jobs to different environments (for example, different accounts) without losing the visual representation. This utility requires that you provide source and target AWS environment profiles. Optionally, you can provide a list of jobs that you want to synchronize, and specify how the utility should replace your environment-specific objects using a mapping file. For example, Amazon Simple Storage Service (Amazon S3) locations and roles in your development environment can be different from those in your production environment. The mapping config file is used to replace these environment-specific objects.

How to use Glue Job Sync Utility

In this example, we’re synchronizing two AWS Glue Studio Visual jobs, test1 and test2, from the development environment to the production environment in a different account.

  • Source environment (dev environment)
    • AWS Account ID: 123456789012
    • AWS Region: eu-west-3 (Paris)
    • AWS Glue Studio Visual jobs: test1, test2
    • AWS Identity and Access Management (IAM) Role ARN for Glue job execution role: arn:aws:iam::123456789012:role/GlueServiceRole
    • Amazon S3 bucket for Glue job script and other asset location: s3://aws-glue-assets-123456789012-eu-west-3/
    • Amazon S3 bucket for data location: s3://dev-environment/
  • Destination environment (prod environment)
    • AWS Account ID: 234567890123
    • AWS Region: eu-west-3 (Paris)
    • IAM Role ARN for Glue job execution role: arn:aws:iam::234567890123:role/GlueServiceRole
    • Amazon S3 bucket for Glue job script and other asset location: s3://aws-glue-assets-234567890123-eu-west-3/
    • Amazon S3 bucket for data location: s3://prod-environment/

Set up the utility in your local environment

You will need the following prerequisites for this utility:

  • Python 3.6 or later.
  • Latest version of boto3.
  • Create two AWS named profiles, dev and prod, with the corresponding credentials in your environment. Follow this instruction.

Download the Glue Job Sync Utility

Download the sync utility from the GitHub repository to your local machine.

Create AWS Glue Studio Visual Jobs

  1. Create two AWS Glue Studio Visual jobs, test1, and test2, in the source account.
    • If you don’t have any AWS Glue Studio Visual jobs, then follow this instruction to create the Glue Studio Visual jobs.

  2. Open AWS Glue Studio in the destination account and verify that the test1 and test2 jobs aren’t present.

Run the Job Sync Utility

  1. Create a new file named mapping.json, and enter the following JSON code. With the configuration in line 1, the sync utility will replace all of the Amazon S3 references within the job (in this case s3://aws-glue-assets-123456789012-eu-west-3) with the mapped location (in this case s3://aws-glue-assets-234567890123-eu-west-3). Then, the utility will create the job in the destination environment. Along the same lines, line 2 and line 3 will trigger the appropriate substitutions in the job. Note that these are example values and you’ll need to substitute the right values that match your environment.

    {
        "s3://aws-glue-assets-123456789012-eu-west-3": "s3://aws-glue-assets-234567890123-eu-west-3",
        "arn:aws:iam::123456789012:role/GlueServiceRole": "arn:aws:iam::234567890123:role/GlueServiceRole",
        "s3://dev-environment": "s3://prod-environment"
    }

  2. Execute the utility by running the following command:
    $ python3 sync.py --src-profile dev --src-region eu-west-3 --dst-profile prod --dst-region eu-west-3 --src-job-names test1,test2 --config-path mapping.json

  3. Verify successful synchronization by opening AWS Glue Studio in the destination account:
  4. Open the Glue Studio Visual jobs, test1, and test2, and verify the visual representation of the DAG.

The screenshot above shows that you were able to copy the jobs test1 and test2 while keeping the DAG intact in the destination account.

Conclusion

AWS Glue Job Visual API and the AWS Glue Sync Utility simplify how you synchronize your jobs to different environments. These are designed to easily integrate into your Continuous Integration pipelines while retaining the visual representation that improves the readability of the ETL pipeline.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping customer architectures. In his spare time, he enjoys watching anime in Prime Video.

Aaron Meltzer is a Software Engineer on the AWS Glue Studio team. He leads the design and implementation of features to simplify the management of AWS Glue jobs. Outside of work, Aaron likes to read and learn new recipes.

Mohamed Kiswani is the Software Development Manager on the AWS Glue team.

Shiv Narayanan is a Senior Technical Product Manager on the AWS Glue team.

Enable Amazon QuickSight federation with Google Workspace

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/enable-amazon-quicksight-federation-with-google-workspace/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning (ML)-powered business intelligence (BI) service built for the cloud that supports identity federation in both Standard and Enterprise editions. Organizations are working towards centralizing their identity and access strategy across all of their applications, including on-premises, third-party, and applications on AWS. Many organizations use Google Workspace to control and manage user authentication and authorization centrally. You can enable federation to QuickSight accounts without needing to create and manage users. This authorizes users to access QuickSight assets—analyses, dashboards, folders, and datasets—through centrally managed Google Workspace Identities.

In this post, we go through the steps to configure federated single sign-on (SSO) between a Google Workspace instance and QuickSight account. We demonstrate registering an SSO application in Google Workspace, and map QuickSight roles (admin, author, and reader) to Google Workspace Identities. These QuickSight roles represent three different personas supported in QuickSight. Administrators can publish the QuickSight app in a Google Workspace Dashboard to enable users to SSO to QuickSight using their Google Workspace credentials.

Solution overview

In your organization, the portal is typically a function of your identity provider (IdP), which handles the exchange of trust between your organization and QuickSight.

On the Google Workspace Dashboard, you can review a list of apps. This post shows you how to configure the custom app for AWS.

The user flow consists of the following steps:

  1. The user logs in to your organization’s portal and chooses the option to go to the QuickSight console.
  2. The portal verifies the user’s identity in your organization.
  3. The portal generates a SAML authentication response that includes assertions that identify the user and include attributes about the user. The portal sends this response to the client browser. Although not discussed here, you can also configure your IdP to include a SAML assertion attribute called SessionDuration that specifies how long the console session is valid.
  4. The client browser is redirected to the AWS single sign-on endpoint and posts the SAML assertion.
  5. The endpoint requests temporary security credentials on behalf of the user, and creates a QuickSight sign-in URL that uses those credentials.
  6. AWS sends the sign-in URL back to the client as a redirect.
  7. The client browser is redirected to the QuickSight console. If the SAML authentication response includes attributes that map to multiple AWS Identity and Access Management (IAM) roles, the user is first prompted to select the role for accessing the console.

The following diagram illustrates the solution architecture.

The following are the high-level steps to set up federated single sign-on access via Google Workspace:

  1. Download the Google IdP information.
  2. Create an IAM IdP with Google as SAML IdP.
  3. Configure IAM policies for QuickSight roles.
  4. Configure IAM QuickSight roles for federated users.
  5. Create a custom user attribute in Google Workspace.
  6. Add the AWS SAML attributes to your Google Workspace user profile.
  7. Set up the AWS SAML app in Google Workspace.
  8. Grant access to users in Google Workspace.
  9. Verify federated access to your QuickSight instance.

Detailed procedures for each of these steps comprise the remainder of this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • A Google Workspace subscription
  • An AWS account with QuickSight subscription
  • Basic understanding of QuickSight roles—admin, author, and reader
  • Basic understanding of IAM and privileges required to create an IAM identity provider, roles, policies, and users

Download the Google IdP information

First, let’s get the SAML metadata that contains essential information to enable your AWS account to authenticate the IdP and locate the necessary communication endpoint locations. Complete the following steps:

  1. Log in to the Google Workspace Admin console.
  2. On the Admin console home page, under Security in the navigation pane, choose Authentication and SSO with SAML applications.
  3. Under IdP metadata, choose Download Metadata.

Create an IAM IdP with Google as SAML IdP

You now configure Google as your SAML IdP via the IAM console. Complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Add provider.
  3. For Configure provider, select SAML.
  4. For Provider name, enter a name for the IdP (such as Google).
  5. For Metadata document, choose Choose file and specify the SAML metadata document that you downloaded.
  6. Choose Add provider.
  7. Document the Amazon Resource Name (ARN) by viewing the IdP you just created.

The ARN should look similar to arn:aws:iam::<YOURACCOUNTNUMBER>:saml-provider/Google. We need this ARN to configure claim rules later in this post.

Configure IAM policies for QuickSight roles

In this step, we create three IAM policies for different role permissions in QuickSight:

  • QuickSight-Federated-Admin
  • QuickSight-Federated-Author
  • QuickSight-Federated-Reader

Use the following steps to set up the QuickSight-Federated-Admin policy. This policy grants admin privileges in QuickSight to the federated user:

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Choose JSON and replace the existing text with the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateAdmin",
                "Resource": "*"
            }
        ]
    }

  4. Choose Review policy.
  5. For Name, enter QuickSight-Federated-Admin.
  6. Choose Create policy.
  7. Repeat these steps to create QuickSight-Federated-Author, and use the following policy to grant author privileges in QuickSight to the federated user:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateUser",
                "Resource": "*"
            }
        ]
    }

  8. Repeat the steps to create QuickSight-Federated-Reader, and use the following policy to grant reader privileges in QuickSight to the federated user:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateReader",
                "Resource": "*"
            }
        ]
    }

Configure IAM QuickSight roles for federated users

Next, create the roles that Google IdP users assume when federating into QuickSight. The following steps set up the admin role:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created earlier (Google).
  5. For Attribute, choose SAML:aud.
  6. For Value, enter https://signin.aws.amazon.com/saml.
  7. Choose Next.
  8. On the Add permissions page, select the QuickSight-Federated-Admin IAM policy you created earlier.
  9. Choose Next.
  10. For Role name, enter QuickSight-Admin-Role.
  11. For Role description, enter a description.
  12. Choose Create role.
  13. On the IAM console, in the navigation pane, choose Roles.
  14. Choose the QuickSight-Admin-Role role you created to open the role’s properties.
  15. On the Trust relationships tab, choose Edit trust relationship.
  16. Under Trusted entities, verify that the IdP you created is listed.
  17. Under Condition, verify that SAML:aud with a value of https://signin.aws.amazon.com/saml is present.
  18. Repeat these steps to create author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Federated-Author.
    2. For QuickSight-Reader-Role, use the policy QuickSight-Federated-Reader.
  19. Navigate to the newly created roles and note the ARNs for them.

We use these ARNs to configure claims rules later in this post. They are in the following format:

  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Admin-Role
  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Author-Role
  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Reader-Role

Create a custom user attribute in Google Workspace

Now let’s create a custom user attribute in your Google Workspace. This allows us to add the SAML attributes that the AWS Management Console expects in order to allow a SAML-based authentication.

  1. Log in to Google Admin console with admin credentials.
  2. Under Directory, choose Users.
  3. On the More options menu, choose Manage custom attributes.
  4. Choose Add Custom Attribute.
  5. Configure the custom attribute as follows:
    1. Category: Amazon
    2. Description: Amazon Custom Attributes
  6. For Custom fields, enter the following:
    1. Name: Role
    2. Info type: Text
    3. Visibility: Visible to user and admin
    4. No. of values: Multi-value
  7. Choose Add.

The new category appears on the Manage user attributes page.

Add the AWS SAML attributes to the Google Workspace user profile

Now that we have configured a custom user attribute, let’s add the SAML attributes that we noted earlier to the Google Workspace user profile.

  1. While logged in to the Google Admin console with admin credentials, navigate to the Users page.
  2. In the Users list, find the user. If you need help, see Find a user account.
  3. Choose the user’s name to open their account page.
  4. Choose User information.
  5. Choose the custom attribute you recently created, named Amazon.
  6. Add a value to this custom attribute, using the ARNs noted earlier, in the following format: <AWS Role ARN>,<AWS provider/IdP ARN>.
  7. Choose Save.
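For example, using the admin role and the Google IdP created earlier, the attribute value takes the comma-separated form of role ARN followed by IdP ARN:

    arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Admin-Role,arn:aws:iam::<YOURACCOUNTNUMBER>:saml-provider/Google

Add one such value for each QuickSight role the user should be able to assume.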

Set up the AWS SAML app in Google Workspace

Now that we have everything in place, we’re ready to create a SAML app within our Google Workspace account and provide the QuickSight instance starting URL. This provides the entry point for Google Workspace users to SSO into the QuickSight instance.

  1. While logged in to Google Admin console with admin credentials, under Apps, choose Web and mobile apps.
  2. Choose Add App, and Search for apps.
  3. Enter Amazon Web Services in the search field.
  4. In the search results, hover over the Amazon Web Services SAML app and choose Select.
  5. On the Google Identity Provider details page, choose Continue.
  6. On the Service provider details page, the ACS URL and Entity ID values for Amazon Web Services are configured by default.
  7. For Start URL, enter https://quicksight.aws.amazon.com.
  8. On the Attribute Mapping page, choose the Select field menu and map the following Google directory attributes to their corresponding Amazon Web Services attributes:

    • Basic Information > Primary Email maps to https://aws.amazon.com/SAML/Attributes/RoleSessionName
    • Amazon > Role maps to https://aws.amazon.com/SAML/Attributes/Role

  1. Choose Finish.

Grant access to users in Google Workspace

When the SAML app is created in Google Workspace, it’s turned off by default. This means that the SAML app isn’t visible to users logged in to their Google Workspace account. We now enable the AWS SAML app for your Google Workspace users.

  1. While logged in to the Google Admin console with admin credentials, navigate to the Web and mobile apps page.
  2. Choose Amazon Web Services.

  3. Choose User access.
  4. To turn on a service for everyone in your organization, choose ON for everyone.
  5. Choose Save.

If you don’t want to activate this application for all users, you can alternatively grant access to a subset of users by using Google Workspace organizational units.

Verify federated access to the QuickSight instance

To test your SAML 2.0-based authentication with QuickSight for users in your existing IdP (Google Workspace), complete the following steps:

  1. Open a new browser session, for example, using Chrome, in a new incognito window.
  2. Log in to your Google Workspace account (for the purpose of this demo, we use the Google Workspace admin account).
  3. Choose Amazon Web Services from the list of Google apps.

Conclusion

This post provided a step-by-step guide for configuring Google Workspace as your IdP, and using IAM roles to enable SSO to QuickSight. Now your users have a seamless sign-in experience to QuickSight and have the appropriate level of access related to their role.

Although this post demonstrated the integration of IAM and Google Workspace, you can replicate this solution using your choice of SAML 2.0 IdPs. For other supported federation options, see Using identity federation and single sign-on (SSO) with Amazon QuickSight.

To get answers to your questions related to QuickSight, refer to the QuickSight Community.

If you have any questions or feedback, please leave a comment.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high-traffic web applications and highly scalable and maintainable data pipelines for reporting platforms using AWS services and serverless computing.

Tips and tricks for high-performant dashboards in Amazon QuickSight

Post Syndicated from Shekhar Kopuri original https://aws.amazon.com/blogs/big-data/tips-and-tricks-for-high-performant-dashboards-in-amazon-quicksight/

Amazon QuickSight is a cloud-native business intelligence (BI) service. QuickSight automatically optimizes queries and execution to help dashboards load quickly, but you can make your dashboards load even faster and make sure you’re getting the best possible performance by following the tips and tricks outlined in this post.

Data flow and execution of QuickSight dashboard loads

The data flow in QuickSight starts from the client browser to the web server and then flows to the QuickSight engine, which in some cases executes queries against SPICE—a Super-fast, Parallel, In-memory Calculation Engine—or in other cases directly against the database. SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, and machine code generation to run interactive queries on large datasets and get rapid responses.

The web server, QuickSight engine, and SPICE are auto scaled by QuickSight. This is a fully managed service—you don’t need to worry about provisioning or managing infrastructure when you want to scale up a particular dashboard from tens to thousands of users on SPICE. Dashboards built against direct query data sources may require provisioning or managing infrastructure on the customer side.

The following diagram illustrates the data flow:

Let’s look at the general execution process to understand the implications:

  • A request is triggered in the browser, leading to several static assets such as JavaScript, fonts, and images being downloaded.
  • All the metadata (such as visual configurations and layout) is fetched for the dashboard.
  • Queries are performed, which may include setting up row-level and column-level security, or fetching dynamic control values, default parameters, and all values of drop-downs in filter controls.
  • Up to your concurrency limit, the queries to render your visuals run in a specific sequence (described later in this post). If you’re using SPICE, the concurrency of queries is much higher. Pagination within visuals may lead to additional queries.

The actual execution is more complex and depends on how dashboards are configured, as well as other factors such as the data source type, direct query vs. SPICE, the cardinality of fields, and how often the data is refreshed. Many operations run in parallel, and all visual-related queries run via WebSocket, as shown in the following screenshot. Many of the steps run in the end-user’s browser, so there are limits on the number of sequences and the workload that can be pushed onto the browser. Performance may also differ slightly by browser type because each browser handles contention differently.

Now let’s look at many great tips that can improve your dashboard’s performance!

SPICE

Utilizing the capabilities of SPICE when possible is a great way to boost overall performance because SPICE manages scaling as well as caching results for you. We recommend using SPICE whenever possible.

Metadata

As seen in the preceding execution sequence, QuickSight fetches metadata up front for a given dashboard during the initial load. We recommend the following actions regarding metadata.

Remove unused datasets from analysis

Datasets that were used in the past but no longer have any visuals associated with the dashboard add to the metadata payload unnecessarily and are likely to impact dashboard performance.

Make sure your row-level and column-level security is performant

Row-level security, column-level security, and dynamic default parameters each require lookups to take place before the visual queries are issued. When possible, try to limit the number and complexity of your rules datasets to help these lookups run faster. Use SPICE for your rules datasets when possible. If you must use a direct query, make sure that the queries are optimal and that the data source you’re querying is scaled appropriately up front.

For embedded dashboards, a great way to optimize row-level security lookups is by utilizing session tags for row-level security paired with an anonymous identity. Similarly, dynamic default parameters, if used, can be evaluated in the host application up front and passed using the embedding SDK.

Calculated functions

In this section, we offer tips regarding calculated functions.

Move calculations to the data prep stage

QuickSight allows you to add calculated fields in either the data prep or the analysis experience. We strongly encourage you to move as many calculations as possible to the data prep stage, which allows QuickSight to materialize calculations that don’t contain aggregations or parameters into the SPICE dataset. Materializing calculated fields in the dataset helps reduce runtime calculations, which improves query performance. Even if you’re using aggregations or parameters in your calculation, it might still be possible to move parts of the calculation to data prep. For instance, suppose you have a formula like the following:
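
(A sketch using hypothetical field names promotion and sales:)

sum(ifelse(promotion = 'Yes', sales, 0))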

You can remove the sum() and just keep the ifelse(), which allows QuickSight to materialize (precompute) it and save it as a real field in your SPICE dataset. Then you can either add another calculation that sums it up, or simply use the sum aggregation once you add it to your visuals.

Generally materializing calculations that use complex ifelse logic or do string manipulation/lookups will result in the greatest improvements in dashboard performance.

Implement the simplified ifelse syntax

The ifelse function supports simplified statements. For example, you might start with the following statement:
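
(A sketch of a nested statement; the conditions and results here are placeholders:)

ifelse(condition1, result1, ifelse(condition2, result2, ifelse(condition3, result3, default_result)))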

The following simplified statement is more performant:
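
(The same logic written with the simplified syntax, as a single ifelse with multiple condition-result pairs:)

ifelse(condition1, result1, condition2, result2, condition3, result3, default_result)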

Use the toString() function judiciously

The toString() function has much lower performance and is much heavier on the database engine than simple integer or number-based arithmetic calculations. Therefore, you should use it sparingly.

Know when nulls are returned by the system and use null value customization

Most authors make sure that null conditions on calculated fields are handled gracefully. QuickSight often handles nulls gracefully for you. You can use that to your advantage and make the calculations simpler. In the following example, the division by 0 is already handled by QuickSight:
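
(A sketch with hypothetical fields revenue and quantity, where the explicit guard against dividing by 0 is unnecessary:)

ifelse(quantity = 0, NULL, revenue / quantity)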

You can write the preceding code as the following:
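
(The same calculation without the explicit null handling:)

revenue / quantity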

If you need to represent nulls on visuals with a static string, QuickSight allows you to set custom values when a null value is returned in a visual configuration. In the preceding example, you could just set a custom value of 0 in the formatting option. Removing such handling from the calculated fields can significantly help query performance.

On-sheet filters vs. parameters

Parameters are seemingly a very simple construct, but they can quickly get complicated, especially when used in nested calculation functions or in controls. Parameters are all evaluated on the fly, forcing all the dependencies to be handled in real time. Ask yourself whether each parameter is really required. In some cases, you may be able to replace them with a simple dropdown filter control, as shown in the following example for $market.

Instead of creating a control parameter to use in a calculated field, you might be able to use the field with a dropdown filter control.

Text field vs. Dropdown (or List) filter controls

When you design an analysis, you can add a filter control for the visuals you want to filter. If the data type of the field is string, you have several choices for the type of filter control. A Text field control, which displays a text box where you can enter one or more entries, is suggested for better performance over a Dropdown (or List) control, which must fetch values to populate a list from which you select one or more values.

On-sheet controls

The control panel at the top of the dashboard is collapsed by default, but a setting allows you to publish the dashboard with the panel expanded. If this setting is enabled, QuickSight prioritizes the calls that fetch the controls’ values before the visuals load. If any of the controls have high cardinality, this could impact the performance of loading the dashboard. Weigh this need against the fact that QuickSight persists the last-used control values, so the reader might not actually need to adjust controls as a first step.

Visual types: Charts

In this section, we provide advice when using Charts.

Use ‘Hide the “other” category’ when your dimension has fewer values than the cutoff limit

You can choose to limit how many data points to display in your visual before they are added to the other category. This category contains the aggregated data for all the data beyond the cutoff limit for the visual type you’re using (either the limit you impose or the one based on display limits). If you know your dimension has fewer values than the cutoff limit, use this option to improve your dashboard performance.

The other category does not show on scatter plots, heat maps, maps, tables (tabular reports), or key performance indicators (KPIs). It also doesn’t show on line charts when the x-axis is a date.

Visual types: Tables and pivot tables

In this section, we provide advice when using tables and pivot tables.

Use the Values field well when displaying a raw table view

If you want to output all the raw data into a table, you can use Group by fields, Values fields, or a mix of them. The most performant approach is to place every field into Values. When you use Group by, a query first runs under the hood and is then followed by the Group by operation, so all the data is pulled from the database, which is expensive.

Deploy a minimal set of rows, columns, metrics, and table calculations

If you include too many combinations of rows, columns, metrics, and table calculations in one pivot table, you risk overwhelming the viewer. You can also run into the computational limitations of the underlying database. To reduce the level of complexity and potential errors, you can take the following actions:

  • Apply filters to reduce the data included in the visual
  • Use fewer fields in the Row and Column field wells
  • Use as few fields as possible in the Values field well
  • Create additional pivot tables so that each displays fewer metrics
  • Reduce subtotals, totals and conditional formatting when possible

Uncollapsed columns are always the simplest case and will likely remain more performant outside of a few cases.

Visual queries sequence

Individual visuals run left to right, then top to bottom. Understanding this sequence of execution can be helpful: you can rearrange visuals on your dashboard without losing context. Place heavier visuals further down in the dashboard, and place lightweight KPI and insight visuals near the top to display “above-the-fold” content sooner, which improves your readers’ perception of the dashboard’s performance.

Embedding

Our final set of recommendations are in regards to embedding.

Remove user management flows from the critical path

In most cases, user management and authentication flows (such as the DescribeUser and RegisterUser APIs) can run asynchronously on the host application.

Consider registering the user in advance before the actual embedding, so that the overhead is removed from every analytics page visit.

Authenticate the user on your website in advance, and acquire any Amazon Cognito or AWS Security Token Service (Amazon STS) session tokens (if required) in advance (for example, at user login time or home page visit). This reduces additional runtime latency overhead when a user visits an analytics page.

Move workloads from clients to the web server or backend services

If a QuickSight dashboard is embedded on a webpage in the host application that also performs other activities, pay close attention to the sequence of API calls on the host. The QuickSight dashboard load might be gated by other heavy API calls on the host application. Move as much logic as possible to the web server or backend services to limit contention in the browser.

Don’t tear down the embedding iFrame when the user navigates away from the analytics section

When the user moves temporarily to a non-analytics page of your web application (especially in single-page applications), instead of removing the embedding iFrame from the DOM, you can hide it from the user while keeping it in the page’s DOM elements. This allows you to resume the same session when the user navigates back to the analytics section of your application, so they don’t need to wait for a reload.

Use navigateToDashboard() and navigateToSheet() whenever possible

If you have multiple dashboards on your host application that don’t need to load concurrently, you can optimize the authentication flow by utilizing two APIs we expose in our JavaScript SDK, navigateToDashboard() and navigateToSheet(). These APIs reuse the same iFrame for each load, while reusing the authentication token.

This technique has proven to be very effective for many of our embedding users.

For more information about these APIs, refer to Amazon QuickSight Embedding SDK.

Conclusion

In this post, we shared some tips and tricks for tuning the performance of your QuickSight dashboards. In 2021, we doubled our SPICE data limits to 500 million rows of data per dataset. In addition, incremental data refresh is available for SQL-based data sources such as Amazon Redshift, Amazon Athena, Amazon RDS, Amazon Aurora, PostgreSQL, MySQL, Oracle, SQL Server, MariaDB, Presto, Teradata, or Snowflake up to every 15 minutes, which cuts down the time between data updates by 75%. In 2022, we continue to innovate on your behalf to make QuickSight dashboard loads even more performant.

We look forward to your feedback on how these tips and tricks helped your dashboards load faster.


About the Authors

Shekhar Kopuri is a Senior Software Development Manager for Amazon QuickSight. He leads the front-end platform engineering team that focuses on various aspects of the front-end experience, including website performance. Before joining AWS, Shekhar led development of multiple provisioning and activation network OSS applications for a large global telecommunications service provider.

Blake Carroll is a Senior Frontend Engineer for Amazon QuickSight. He works with the frontend platform engineering team with a focus on website performance and has previously been the frontend lead for initial reporting and theming functionality in QuickSight. Prior to joining Amazon, Blake was a co-founder in the digital interactive agency space working with national brands to produce creative web experiences.

Vijay Chaudhari is a Senior Software Development Engineer for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Vijay started his career with IBM, writing software for the Information Management group. At Amazon, he has built backend applications for retail systems, and near real-time data pre-computation, reporting and analytics systems at Amazon scale. He is passionate about learning and solving new customer problems, and helping them adopt cloud native technologies.

Wakana Vilquin-Sakashita is a Specialist Solutions Architect for Amazon QuickSight. She works closely with customers to help them make sense of their data through visualization. Previously, Wakana worked for S&P Global, assisting customers in accessing data, insights, and research relevant to their business.

Analyze Amazon Ion datasets using Amazon Athena

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/analyze-amazon-ion-datasets-using-amazon-athena/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Amazon Ion is a richly typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format extends JSON (meaning all JSON files are valid Ion files), and is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of data that can survive multiple generations of software evolution.

Athena now supports querying and writing data in Ion format. The Ion format is currently used by internal Amazon teams, by external services such as Amazon Quantum Ledger Database (Amazon QLDB) and Amazon DynamoDB (which can be exported into Ion), and in the open-source SQL query language PartiQL.

In this post, we discuss use cases and the unique features Ion offers, followed by examples of querying Ion with Athena. For demonstration purposes, we use the transformed version of the City Lots San Francisco dataset.

Features of Ion

In this section, we discuss some of the unique features that Ion offers:

  • Type system
  • Dual format
  • Efficiency gains
  • Skip scanning

Type system

Ion extends JSON, adding support for more precise data types to improve interpretability, simplify processing, and avoid rounding errors. These high-precision numeric types are essential for financial services, where fractions of a cent on every transaction add up. The added data types are arbitrary-size integers, binary floating-point numbers, infinite-precision decimals, timestamps, CLOBs, and BLOBs.

Dual format

Users can be presented with a familiar text-based representation while benefiting from the performance efficiencies of a binary format. The interoperability between the two formats enables you to rapidly discover, digest, and interpret data in a familiar JSON-like representation, while underlying applications benefit from a reduction in storage, memory, network bandwidth, and latency from the binary format. This means you can write plain text queries that run against both text-based and binary-based Ion. You can rewrite parts of your data in text-based Ion when you need human readable data during development and switch to binary in production.

When debugging a process, the ability for systems engineers to locate data and understand it as quickly as possible is vital. Ion provides mechanisms to move between binary and a text-based representation, optimizing for both the human and the machine. Athena supports querying and writing data in both of these Ion formats. The following is an example Ion text document taken from the transformed version of the citylots dataset:

{ "type": "Feature"
, "properties": { "MAPBLKLOT": "0004002"
                 ,"BLKLOT": "0004002"
                 ,"BLOCK_NUM": "0004"
                 , "LOT_NUM": "002"
                 , "FROM_ST": "0"
                 , "TO_ST": "0"
                 , "STREET": "UNKNOWN"
                 , "ST_TYPE": null
                 , "ODD_EVEN": "E" }
, "geometry": { "type": "Polygon"
               , "coordinates": [ [ [ -122.415701204606876, 37.808327252671461, 0.0 ],
                                    [ -122.415760743593196, 37.808630700240904, 0.0 ],
                                    [ -122.413787891332404, 37.808566801319841, 0.0 ],
                                    [ -122.415701204606876, 37.808327252671461, 0.0 ] ] ] } }

Efficiency gains

Binary-encoded Ion reduces file size by moving repeated values, such as field names, into a symbol table. Symbol tables reduce CPU and read latency by limiting the validation of character encoding to the single instance of the value in the symbol table.

For example, a company that operates at Amazon’s scale can produce large volumes of application logs. When compressing Ion and JSON logs, we noticed approximately 35% less CPU time to compress the log, which produced an average of roughly 26% smaller files. Log files are critical when needed but costly to retain, so the reduction in file sizes combined with the read performance gains from symbol tables helps when handling these logs. The following is an example of file size reduction with the citylots JSON dataset when converted to Ion binary with GZIP and ZSTD compression:

77MB    citylots.ion
 17MB    citylots.ion.gz
 15MB    citylots.ion.zst
181MB    citylots.json
 22MB    citylots.json.gz
 18MB    citylots.json.zst

Skip-scanning

In a textual format, every byte must be read and interpreted, but because Ion’s binary format is a TLV (type-length-value) encoding, an application may skip over elements that aren’t needed. This reduces query and application processing costs correlated with the proportion of unexamined fields.

For example, forensic analysis of application log data involves reading large volumes of data where only a fraction of the data is needed for diagnosis. In these scenarios, skip-scanning allows the binary Ion reader to move past irrelevant fields without the cost of reading the element stored within a field. This results in users experiencing lower resource usage and quicker response times.

Query Ion datasets using Athena

Athena now supports querying and creating Ion-formatted datasets via an Ion-specific SerDe, which in conjunction with IonInputFormat and IonOutputFormat allows you to read and write valid Ion data. Deserialization allows you to run SELECT queries on the Ion data so that it can be queried to gain insights. Serialization through CTAS or INSERT INTO queries allows you to copy datasets from existing tables’ values or generate new data in the Ion format.

The interchangeable nature of Ion text and Ion binary means that Athena can read datasets that contain both types of files. Because Ion is a superset of JSON, a table using the Ion SerDe can also include JSON files. Unlike the JSON SerDe, where every new line character indicates a new row, the Ion SerDe uses a combination of closing brackets and new line characters to determine new rows. This means that if each JSON record in your source documents isn’t on a single line, these files can now be read in Athena via the Ion SerDe.

Create external tables

Athena supports querying Ion-based datasets by defining AWS Glue tables with the user-defined metadata. Let’s start with an example of creating an external table for a dataset stored in Ion text. The following is a sample row from the citylots dataset:

{
    type:"Feature",
    properties:{
        mapblklot:"0579021",
        blklot:"0579024",
        block_num:"0579",
        lot_num:"024",
        from_st:"2160",
        to_st:"2160",
        street:"PACIFIC",
        st_type:"AVE",
        odd_even:"E"
    },
    geometry:{
        type:"Polygon",coordinates:[[[-122.4308798855922, ...]]]
    }
}

To create an external table that has its data stored in Ion, you have two syntactic options.

First, you can specify STORED AS ION. This is a more concise method, and is best used for simple cases, when no additional properties are required. See the following code:

CREATE EXTERNAL TABLE city_lots_ion1 (
  type STRING, 
  properties struct<
    mapblklot:string,
    blklot:string,
    block_num:string,
    lot_num:string,
    from_st:string,
    to_st:string,
    street:string,
    st_type:string,
    odd_even:string>, 
  geometry struct<
    type:string,
    coordinates:array<array<array<decimal(18,15)>>>,
    multi_coordinates:array<array<array<array<decimal(18,15)>>>>>
)
STORED AS ION
LOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'

Alternatively, you can explicitly specify the Ion classpaths in ROW FORMAT SERDE, INPUTFORMAT, and OUTPUTFORMAT. Unlike the first method, you can specify a SERDEPROPERTIES clause here. In our example DDL, we added a SerDe property that allows values that are outside of the Hive data type ranges to overflow rather than fail the query:

CREATE EXTERNAL TABLE city_lots_ion2(
  type STRING, 
  properties struct<
    mapblklot:string,
    blklot:string,
    block_num:string,
    lot_num:string,
    from_st:string,
    to_st:string,
    street:string,
    st_type:string,
    odd_even:string>, 
  geometry struct<
    type:string,
    coordinates:array<array<array<decimal(18,15)>>>,
    multi_coordinates:array<array<array<array<decimal(18,15)>>>>>
)
ROW FORMAT SERDE 
  'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.fail_on_overflow'='false'
 )
STORED AS INPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonInputFormat' 
OUTPUTFORMAT 
  'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION
  's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'

Athena converts STORED AS ION into the explicit classpaths, so both tables look similar in the metastore. If we look in AWS Glue, we see both tables we just created have the same input format, output format, and SerDe serialization library.

Now that our tables are created, we can run standard SELECT queries against them. Let’s run a query that specifies the block_num from our example row of Ion data to verify that we can read from the city_lots_ion1 table:

-- QUERY
SELECT * FROM city_lots_ion1 WHERE properties.block_num='0579';

The following screenshot shows our results.

Use path extraction to read from specific fields

Athena supports further customization of how data is interpreted via SerDe properties. To specify these, you can add a WITH SERDEPROPERTIES clause, which is a subfield of the ROW FORMAT SERDE field.

In some situations, we may only care about some parts of the information. Let’s suppose we don’t want any of the geometry info from the citylots dataset, and only need a few of the fields in properties. One solution is to specify a search path using the path extractor SerDe property:

-- Path Extractor property
ion.<column>.path_extractor = <search path>

Path extractors are search paths that Athena uses to map the table columns to locations in the individual document. Full information on what can be done with path extractors is available on GitHub, but for our example, we focus on creating simple paths that use the names of each field as an index. In this case, the search path takes the form of a space-delimited set of indexes wrapped in parentheses that indicates the location of each desired piece of information. We map the search paths to table columns by using the path extractor property.

By default, Athena builds path extractors dynamically based on column names unless overridden. This means that when we run our SELECT query on our city_lots_ion1 table, Athena builds the following search paths:

Default Extractors generated by Athena for city_lots_ion1.
-- Extracts the 'type' field to the 'type' column
    'ion.type.path_extractor' = '(type)'

-- Extracts the 'properties' field to the 'properties' column
    'ion.properties.path_extractor' = '(properties)'

-- Extracts the 'geometry' field to the 'geometry' column
    'ion.geometry.path_extractor' = '(geometry)'

Assuming we only care about the block and lot information from the properties struct, and the geometry type from the geometry struct, we can build search paths that map the desired fields from the row of data to table columns. First let’s build the search paths:

(properties mapblklot) - Search path for the mapblklot field in the properties struct
(properties blklot) - Search path for the blklot field in the properties struct
(properties block_num) - Search path for the block_num field in the properties struct
(properties lot_num) - Search path for the lot_num field in the properties struct
(geometry type) - Search path for the type field in the geometry struct

Now let’s map these search paths to table columns using the path extractor SerDe property. Because the search paths specify where to look for data, we are able to flatten and rename our datasets to better serve our purpose. For this example, let’s rename the mapblklot field to map_block_lot, blklot to block_lot, and the geometry type to shape:

 'ion.map_block_lot.path_extractor' = '(properties mapblklot)'
 'ion.block_lot.path_extractor' = '(properties blklot)'
 'ion.block_num.path_extractor' = '(properties block_num)'
 'ion.lot_num.path_extractor' = '(properties lot_num)'
 'ion.shape.path_extractor' = '(geometry type)'

Let’s put all of this together and create the city_blocks table:

CREATE EXTERNAL TABLE city_blocks (
    map_block_lot STRING,
    block_lot STRING,
    block_num STRING,
    lot_num STRING,
    shape STRING
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.map_block_lot.path_extractor' = '(properties mapblklot)',
 'ion.block_lot.path_extractor' = '(properties blklot)', 
 'ion.block_num.path_extractor' = '(properties block_num)',
 'ion.lot_num.path_extractor' = '(properties lot_num)',
 'ion.shape.path_extractor' = '(geometry type)'
 )
STORED AS ION
LOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_ion_binary/'

Now we can run a select query on the city_blocks table, and see the results:

-- Select Query
SELECT * FROM city_blocks WHERE block_num='0579';

Utilizing search paths in this way enables skip-scan parsing when reading from Ion binary files, which allows Athena to skip over the unneeded fields and reduces the overall time it takes to run the query.

Use CTAS and UNLOAD for data transformation

Athena supports CREATE TABLE AS SELECT (CTAS), which creates a new table in Athena from the results of a SELECT statement from another query. Athena also supports UNLOAD, which writes query results to Amazon S3 from a SELECT statement to the specified data format.

Both CTAS and UNLOAD have a property to specify a format and a compression type. This allows you to easily convert Ion datasets to other data formats, such as Parquet or ORC, and vice versa, without needing to set up a complex extract, transform, and load (ETL) job. This is beneficial for situations when you want to transform your data, or know you will run repeated queries on a subset of your data and want to use some of the benefits inherent to columnar formats. Combining it with path extractors is especially helpful, because we’re only storing the data that we need in the new format.

Let’s use CTAS to convert our city_blocks table from Ion to Parquet, and compress it via GZIP. Because we have path extractors set up for the city_blocks table, we only need to convert a small portion of the original dataset:

CREATE TABLE city_blocks_parquet_gzip
WITH (format = 'PARQUET', write_compression='GZIP')
AS SELECT * FROM city_blocks;

We can now run queries against the city_blocks_parquet_gzip table and should see the same results. To test this, let’s run the same SELECT query we ran before on the Parquet table:

SELECT * FROM city_blocks_parquet_gzip WHERE block_num='0579';

When converting tables from another format to Ion, Athena supports the following compression codecs: ZSTD, BZIP2, GZIP, SNAPPY, and NONE. In addition to adding Ion as a new format for CTAS, we added the ion_encoding property, which allows you to choose whether the output files are created in Ion text or Ion binary. This allows for serialization of data from other formats back into Ion.

Let’s convert the original city_lots JSON file back to Ion, but this time we specify that we want to use ZSTD compression and a binary encoding.

The JSON file can be found at following location: s3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/

Because Ion is a superset of JSON, we can use the Ion SerDe to read this file:

CREATE EXTERNAL TABLE city_blocks_json_ion_serde (
    map_block_lot STRING,
    block_lot STRING,
    block_num STRING,
    lot_num STRING,
    shape STRING
)
ROW FORMAT SERDE
'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
'ion.map_block_lot.path_extractor' = '(properties mapblklot)',
'ion.block_lot.path_extractor' = '(properties blklot)',
'ion.block_num.path_extractor' = '(properties block_num)',
'ion.lot_num.path_extractor' = '(properties lot_num)',
'ion.shape.path_extractor' = '(geometry type)'
)
STORED AS ION
LOCATION 's3://aws-bigdata-blog/artifacts/athena-ion-blog/city_lots_json/'

Now let’s copy this table into our desired Ion binary form:

CREATE TABLE city_blocks_ion_zstd
WITH (format = 'ION', write_compression='ZSTD', ion_encoding='BINARY')
AS SELECT * FROM city_blocks_json_ion_serde

Finally, let’s run our verification SELECT statement to verify everything was created properly:

SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579'; 

Use UNLOAD to store Ion data in Amazon S3

Sometimes we just want to reformat the data and don’t need to store the additional metadata to query the table. In this case, we can use UNLOAD, which stores the results of the query in the specified format in an S3 bucket.

Let’s test it out, using UNLOAD to convert data from the city_blocks_ion_zstd table from Ion to ORC, compress it via ZLIB, and store it in an S3 bucket:

UNLOAD (SELECT * FROM city_blocks_ion_zstd WHERE block_num='0579') 
TO 's3://<your-s3-bucket>/athena-ion-blog/unload/orc_zlib/'
WITH (format = 'ORC', compression='ZLIB')

When you check in Amazon S3, you can find a new file in the ORC format.

Conclusion

This post talked about the new feature in Athena that allows you to query and create Ion datasets using standard SQL. We discussed use cases and unique features of the Ion format like type system, dual formats (Ion text and Ion binary), efficiency gains, and skip-scanning. You can get started with querying an Ion dataset stored in Amazon S3 by simply creating a table in Athena, and also converting existing datasets to Ion format and vice versa using CTAS and UNLOAD statements.

To learn more about querying Ion using Athena, refer to Amazon Ion Hive SerDe.


About the Authors

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing in the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Jacob Stein works on the Amazon Athena team as a Software Development Engineer. He led the project to add support for Ion in Athena. He loves working on technical problems unique to internet scale data, and is passionate about developing scalable solutions for distributed systems.

Giovanni Matteo Fumarola is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. He has been focusing in the big data analytics space since 2013.

Pete Ford is a Sr. Technical Program Manager at Amazon.

Use Amazon Redshift RA3 with managed storage in your modern data architecture

Post Syndicated from Bhanu Pittampally original https://aws.amazon.com/blogs/big-data/use-amazon-redshift-ra3-with-managed-storage-in-your-modern-data-architecture/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers.

Over the years, Amazon Redshift has evolved a lot to meet our customer demands. Its journey started as a standalone data warehousing appliance that provided a low-cost, high-performance, cloud-based data warehouse. Support for Amazon Redshift Spectrum compute nodes was later added to extend your data warehouse to data lakes, and the concurrency scaling feature was added to support burst activity and scale your data warehouse to support thousands of queries concurrently. In its latest offering, Amazon Redshift runs on third-generation architecture where storage and compute layers are decoupled and scaled independent of each other. This latest generation powers the several modern data architecture patterns our customers are actively embracing to build flexible and scalable analytics platforms.

When spinning up a new instance of Amazon Redshift, you get to choose either Amazon Redshift Serverless, for when you need a data warehouse that can scale seamlessly and automatically as your demand evolves unpredictably, or you can choose an Amazon Redshift provisioned cluster for steady-state workloads and greater control over your Amazon Redshift cluster’s configuration.

An Amazon Redshift provisioned cluster is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs the Amazon Redshift engine and contains one or more databases. Creating an Amazon Redshift cluster is the first step in your process of building an Amazon Redshift data warehouse. While launching a provisioned cluster, one option that you specify is the node type. The node type determines the CPU, RAM, storage capacity, and storage drive type for each node.

In this post, we cover the current generation node RA3 architecture, different RA3 node types, important capabilities that are available only on RA3 node types, and how you can upgrade your current Amazon Redshift node types to RA3.

Amazon Redshift RA3 nodes

RA3 nodes with managed storage enable you to optimize your data warehouse by scaling and paying for compute and managed storage independently. RA3 node types are the latest node type for Amazon Redshift. With RA3, you choose the number of nodes based on your performance requirements and only pay for the managed storage that you use. RA3 architecture gives you the ability to size your cluster based on the amount of data you process daily or the amount of data that you want to store in your warehouse; there is no need to account for both storage and processing needs together.

Other node types that we previously offered include the following:

  • Dense compute – DC2 nodes enable you to have compute-intensive data warehouses with local SSD storage included. You choose the number of nodes you need based on data size and performance requirements.
  • Dense storage (deprecated) – DS2 nodes enable you to create large data warehouses using hard disk drives (HDDs). If you’re using the DS2 node type, we strongly recommend that you upgrade to RA3 to get twice as much storage and improved performance for the same on-demand cost.

When you use the RA3 node size and choose your number of nodes, you can provision the compute independent of storage. RA3 nodes are built on the AWS Nitro System and feature high bandwidth networking and large high-performance SSDs as local caches. RA3 nodes use your workload patterns and advanced data management techniques to deliver the performance of local SSD while scaling storage automatically to Amazon Simple Storage Service (Amazon S3).

RA3 node types come in three different sizes to accommodate your analytical workloads. You can quickly start experimenting with the RA3 node type by creating a single-node ra3.xlplus cluster and explore various features that are available. If you’re running a medium-sized data warehouse, you can size your cluster with ra3.4xlarge nodes. For large data warehouses, you can start with ra3.16xlarge. The following table gives more information about RA3 node types and their specifications as of this writing.

Node Type | vCPU | RAM (GiB) | Default Slices Per Node | Managed Storage Quota Per Node | Node Range with Create Cluster | Total Managed Storage Capacity
ra3.xlplus | 4 | 32 | 2 | 32 TB | 1-16 | 1024 TB
ra3.4xlarge | 12 | 96 | 4 | 128 TB | 2-32 | 8192 TB
ra3.16xlarge | 48 | 384 | 16 | 128 TB | 2-128 | 16384 TB

Amazon Redshift with managed storage

Amazon Redshift managed storage (RMS) still boasts the same resiliency and industry-leading hardware. With managed storage, Amazon Redshift uses intelligent data prefetching and data evictions based on the temperature of your data to decide where your most-queried data is stored. The most frequently used blocks (hot data) are cached locally on SSD, and infrequently used blocks (cold data) are stored on an RMS layer backed by Amazon S3. The following diagram depicts the leader node, compute nodes, and Amazon Redshift managed storage.

In the following sections, we discuss the capabilities that Amazon Redshift RA3 with managed storage can provide.

Independently scale compute and storage

As the scale of an organization grows, data continues to grow, reaching petabytes. The amount of data you ingest into your Amazon Redshift data warehouse also grows. You may be looking for ways to cost-effectively analyze all your data and at the same time have control over choosing the right compute or storage resource at the right time. For customers who are looking to be cost-conscious and cost-effective, the RA3 platform provides the option to scale and pay for your compute and storage resources separately.

With RA3 instances with managed storage, you can choose the number of nodes based on your performance requirements, and only pay for the managed storage that you use. This gives you the flexibility to size your RA3 cluster based on the amount of data you process daily without increasing your storage costs. It allows you to pay per hour for the compute and separately scale your data warehouse storage capacity without adding any additional compute resources and paying only for what you use.

Another benefit of RMS is that Amazon Redshift manages which data should be stored locally for fastest access, and data that is slightly colder is still kept within fast-access reach.

Advanced hardware

RA3 instances use high-bandwidth networking built on the AWS Nitro System to further reduce the time taken for data to be offloaded to and retrieved from Amazon S3. Managed storage uses high-performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance.

Additional security options

Amazon Redshift managed VPC endpoints enable you to set up a private connection to securely access your Amazon Redshift cluster within your virtual private cloud (VPC) from client applications in another VPC within the same AWS account, another AWS account, or a subnet without using public IPs and without requiring the traffic to traverse across the internet.

The following scenarios describe common reasons to allow access to a cluster using an Amazon Redshift managed VPC endpoint:

  • AWS account A wants to allow a VPC in AWS account B to have access to a cluster
  • AWS account A wants to allow a VPC that is also in AWS account A to have access to a cluster
  • AWS account A wants to allow a different subnet in the cluster’s VPC within AWS account A to have access to a cluster

For information about access options to another VPC, refer to Enable private access to Amazon Redshift from your client applications in another VPC.

Further optimize your workload

In this section, we discuss two ways to further optimize your workload.

AQUA

AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10 times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA is available with the ra3.16xlarge, ra3.4xlarge, or ra3.xlplus nodes at no additional charge and with no code changes.

AQUA is an analytics query accelerator for Amazon Redshift that uses custom-designed hardware to speed up queries that scan large datasets. AQUA automatically optimizes query performance on subsets of the data that require extensive scans, filters, and aggregation. With this approach, you can use AQUA to run queries that scan, filter, and aggregate large datasets.

For more information about using AQUA, refer to How to evaluate the benefits of AQUA for your Amazon Redshift workloads.

Concurrency scaling for write operations

With RA3 nodes, you can take advantage of concurrency scaling for write operations, such as extract, transform, and load (ETL) statements. Concurrency scaling for write operations is especially useful when you want to maintain consistent response times when your cluster receives a large number of requests. It improves throughput for write operations contending for resources on the main cluster.

Concurrency scaling supports COPY, INSERT, DELETE, and UPDATE statements. In some cases, you might follow DDL statements, such as CREATE, with write statements in the same commit block. In these cases, the write statements are not sent to the concurrency scaling cluster.

When you accrue credit for concurrency scaling, this credit accrual applies to both read and write operations.

Increased agility to scale compute resources

Elastic resize allows you to scale your Amazon Redshift cluster up and down in minutes to get the performance you need, when you need it. However, there are limits on the nodes that you can add to a cluster. With some RA3 node types, you can increase the number of nodes up to four times the existing count. All RA3 node types support a decrease in the number of nodes to a quarter of the existing count. The following table lists growth and reduction limits for each RA3 node type.

Node Type | Growth Limit | Reduction Limit
ra3.xlplus | 2 times (from 4 to 8 nodes, for example) | To a quarter of the number
ra3.4xlarge | 4 times (from 4 to 16 nodes, for example) | To a quarter of the number (from 16 to 4 nodes, for example)
ra3.16xlarge | 4 times (from 4 to 16 nodes, for example) | To a quarter of the number (from 16 to 4 nodes, for example)

RA3 node types also have a shorter duration of snapshot restoration time because of the separation of storage and compute.

Improved resiliency

Amazon Redshift employs extensive fault detection and auto remediation techniques in order to maximize the availability of a cluster. With the RA3 architecture, you can enable cluster relocation, which provides additional resiliency by giving you the ability to relocate a cluster across Availability Zones without losing any data (RPO is zero) or having to change your client applications. The cluster’s endpoint remains the same after the relocation occurs, so applications can continue operating without modification. If the existing cluster fails, a new cluster is created on demand in another Availability Zone, so the cost of a standby replica cluster is avoided.

Accelerate data democratization

In this section, we share two techniques to accelerate data democratization.

Data sharing

Data sharing provides instant, granular, and high-performance access without copying data and data movement. You can query live data constantly across all consumers on different RA3 clusters in the same AWS account, in a different AWS account, or in a different AWS Region. Data is shared securely and provides governed collaboration. You can provide access in different granularity, including schema, database, tables, views, and user-defined functions.

This opens up various new use cases where you may have one ETL cluster that is producing data and have multiple consumers such as ad-hoc querying, dashboarding, and data science clusters to view the same data. This also enables bi-directional collaboration where groups such as marketing and finance can share data with one another. Queries accessing shared data use the compute resources of the consumer Amazon Redshift cluster and don’t impact the performance of the producer cluster.
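
The following is a minimal sketch of the SQL involved; the datashare, schema, table, and namespace values are placeholders that you would replace with your own:

-- On the producer cluster: create a datashare, add objects, and grant access
CREATE DATASHARE marketing_share;
ALTER DATASHARE marketing_share ADD SCHEMA public;
ALTER DATASHARE marketing_share ADD TABLE public.campaign_results;
GRANT USAGE ON DATASHARE marketing_share TO NAMESPACE '<consumer-namespace-guid>';

-- On the consumer cluster: create a database from the datashare and query it
CREATE DATABASE marketing_share_db FROM DATASHARE marketing_share OF NAMESPACE '<producer-namespace-guid>';
SELECT * FROM marketing_share_db.public.campaign_results;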

For more information about data sharing, refer to Sharing Amazon Redshift data securely across Amazon Redshift clusters for workload isolation.

AWS Data Exchange for Amazon Redshift

AWS Data Exchange for Amazon Redshift enables you to find and subscribe to third-party data in AWS Data Exchange that you can query in an Amazon Redshift data warehouse in minutes. You can also license your data in Amazon Redshift through AWS Data Exchange. Access is automatically granted when a customer subscribes to your data and is automatically revoked when their subscription ends. Invoices are automatically generated, and payments are automatically collected and disbursed through AWS. This feature empowers you to quickly query, analyze, and build applications with third-party data.

For details on how to publish a data product and subscribe to a data product using AWS Data Exchange for Amazon Redshift, refer to New – AWS Data Exchange for Amazon Redshift.

Cross-database queries for Amazon Redshift

Amazon Redshift supports the ability to query across databases in a Redshift cluster. With cross-database queries, you can seamlessly query data from any database in the cluster, regardless of which database you are connected to. Cross-database queries can eliminate data copies and simplify your data organization to support multiple business groups on the same cluster.

One of many use cases where cross-database queries help you is when data is organized across multiple databases in a Redshift cluster to support multi-tenant configurations. For example, different business groups and teams that own and manage datasets in their specific database in the same data warehouse need to collaborate with other groups. You might want to perform common ETL staging and processing while your raw data is spread across multiple databases. Organizing data in multiple Redshift databases is also a common scenario when migrating from traditional data warehouse systems. With cross-database queries, you can now access data from any of the databases on the Redshift cluster without having to connect to that specific database. You can also join datasets from multiple databases in a single query, as shown in the following example.
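
The following sketch assumes two hypothetical databases, sales_db and marketing_db, on the same cluster, each with a public schema:

SELECT o.order_id, c.campaign_name
FROM sales_db.public.orders AS o
JOIN marketing_db.public.campaigns AS c
  ON o.campaign_id = c.campaign_id;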

You can read more about cross-database queries in the Amazon Redshift documentation.

Upgrade to RA3

You can upgrade to RA3 instances within minutes no matter the size of your current Amazon Redshift clusters. Simply take a snapshot of your cluster and restore it to a new RA3 cluster. For more information, refer to Upgrading to RA3 node types.

You can also simplify your migration efforts with Amazon Redshift Simple Replay. For more information, refer to Simplify Amazon Redshift RA3 migration evaluation with Simple Replay utility.

Summary

In this post, we talked about the RA3 node types, the benefits of Amazon Redshift managed storage, and the additional capabilities that you get by using Amazon Redshift RA3 with managed storage. Migrating to RA3 node types isn’t a complicated effort; you can get started today.


About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytic solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytics field for over 13 years.

Jason Pedreza is an Analytics Specialist Solutions Architect at AWS with data warehousing experience handling petabytes of data. Prior to AWS, he built data warehouse solutions at Amazon.com. He specializes in Amazon Redshift and helps customers build scalable analytic solutions.

Ingest Stripe data in a fast and reliable way using Stripe Data Pipeline for Amazon Redshift

Post Syndicated from Jessica Ho original https://aws.amazon.com/blogs/big-data/ingest-stripe-data-in-a-fast-and-reliable-way-using-stripe-data-pipeline-for-amazon-redshift/

Enterprises typically host a myriad of business applications for varying data needs. As companies grow, so does the demand for insights from a complete set of business data. Having data from various applications that store data in disparate silos can delay the decision-making process. However, building and maintaining an API integration or a third-party extract, transform, and load (ETL) pipeline to move data into a destination data store can be time-consuming and expensive.

Today we’re delighted to introduce Stripe Data Pipeline for Amazon Redshift to help you access your Stripe data and extract insight securely and easily from Amazon Redshift. This data, including billing, issuing, and payment records, can be shared in a consistent and automated fashion. You can integrate your Stripe data with data from other sources in your Amazon Redshift clusters to create a single source of truth.

In this post, we discuss the benefits of Stripe Data Pipeline and some of its use cases.

Solution overview

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehousing service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence (BI) tools. It’s optimized for datasets ranging from a few hundred gigabytes to petabytes or more. This columnar data warehouse provides provisioned as well as serverless deployment options and uses an industry-standard SQL interface to analyze structured and semi-structured data with fast query performance.

Stripe Data Pipeline is powered by Amazon Redshift’s latest RA3 instances, which provide cross-account data sharing capability. RA3 takes a performant, cost-effective approach to address rapidly growing data volume by decoupling data processing from managed storage. You can then scale compute and storage independently and only pay for what you use. Data sharing provides read access directly to data stored across Amazon Redshift clusters without data movement. This capability removes the complexity and delays that are often associated with managing large distributed datasets across multiple accounts.

The solution provides the following core features and benefits:

  • Scalable and managed data pipeline – You don’t need to build, maintain, and scale custom ETL jobs. You can set up Stripe Data Pipeline in minutes, and it scales automatically to handle increased business activity and data volume.
  • Up-to-date financial data – You automatically receive and refresh a complete set of your Stripe data and reports in Amazon Redshift on a low-latency schedule. Stripe Data Pipeline is built into Stripe and always provides accurate data.
  • Security and compliance – Data is shared directly from Stripe with your Amazon Redshift cluster, and confidentiality of the data is protected in transit and at rest. Amazon Redshift offers comprehensive security controls and monitoring via native integration with AWS CloudTrail and Amazon CloudWatch (for more information, see Logging Amazon Redshift API calls with AWS CloudTrail and Monitoring Amazon Redshift using CloudWatch metrics, respectively). You can define and audit who has access to what and ensure the compliance requirements are met.
  • Extensibility – Once the data is accessible in AWS, you benefit from the breadth of native integrations Amazon Redshift supports. You can join datasets from other data stores in operational databases, build reports and dashboards with BI tools, or identify patterns and generate prediction using Amazon Redshift ML.

The following architecture diagram provides a quick overview of how data sharing works and how other AWS services can be used together. We dive deeper into different use cases in the following sections.

Accept datashares from Stripe

You can configure the solution in a few steps with no code necessary.

Once Stripe creates a datashare from the producer cluster and authorizes your AWS account, you can view this datashare on your Amazon Redshift console. As the consumer, you associate the datashare with specific clusters or all clusters in your AWS account; clusters are identified by their namespaces, which are globally unique identifiers. Next, you create a database from the datashare to start querying the data, as sketched below.
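
The last step can also be done with SQL. The following is a minimal sketch; the datashare name, account ID, and namespace GUID are hypothetical placeholders, so substitute the values shown for the Stripe datashare in your account.

-- Inspect the objects included in the datashare (hypothetical identifiers)
DESC DATASHARE stripe_datashare OF ACCOUNT '<stripe-account-id>' NAMESPACE '<producer-namespace-guid>';

-- Create a local database from the datashare so its tables can be queried
CREATE DATABASE stripe_data
FROM DATASHARE stripe_datashare
OF ACCOUNT '<stripe-account-id>' NAMESPACE '<producer-namespace-guid>';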

Query data from the consumer Amazon Redshift cluster

You can now access your Stripe data and schema directly from Amazon Redshift’s web-based query editor. This direct connection enables teams to pull accurate analysis of various functions of the business. For example:

  • Finance – “How does my cash flow change based on seasonality?”
  • Sales – “How many customers do we have in the US?”
  • Product – “How many active users do we have on each subscription plan?”
  • Sales operations – “Which customers haven’t paid their invoices?”

The following screenshot shows an example in which the query editor displays the number of charges blocked per Stripe’s connected account.
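
As an illustration of the sales question above, a query along the following lines could count US customers. The schema, table, and column names here are hypothetical; refer to the Stripe Data Pipeline schema documentation for the actual names in your datashare.

-- Hypothetical schema, table, and column names for illustration only
SELECT count(distinct id) AS us_customers
FROM stripe_data.stripe.customers
WHERE address_country = 'US';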

Use federated queries

The modern data architecture of Amazon Redshift enables you to store data in purpose-built data stores based on specific use cases, and allows querying external databases on Amazon Relational Database Service (Amazon RDS) or datasets in an Amazon Simple Storage Service (Amazon S3) data lake without moving these datasets to Amazon Redshift clusters. You can dive deeper by incorporating data from Amazon RDS, or from an S3 data lake through Amazon Redshift Spectrum. This capability provides a native integration without requiring additional ETL jobs.

The following syntax allows you to create an external schema from an Amazon Aurora MySQL-Compatible Edition database to an Amazon Redshift cluster. Amazon Redshift assumes an AWS Identity and Access Management (IAM) role and uses AWS Secrets Manager to access external data stores. For more information and examples with other supported data stores, refer to Querying data with federated queries in Amazon Redshift.

CREATE EXTERNAL SCHEMA auroram
FROM MYSQL
DATABASE 'example_database' SCHEMA 'example_schema' -- schema is optional
URI 'hostname'
IAM_ROLE 'iam_role_arn'
SECRET_ARN 'aws_secrets_manager_arn';

Coming back to Stripe Data Pipeline, you can now combine the Stripe data with data from an Aurora table for further analysis. For example, you can correlate customer acquisition trends with sales campaigns by region, so you can gauge campaign effectiveness and adjust your marketing strategies, as sketched below.
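
The following sketch shows what such a query could look like, joining the datashare database with the auroram external schema created earlier. The table and column names (customers, campaigns, and their columns) are hypothetical placeholders, and the exact qualification of the Stripe tables depends on how you named the datashare database.

-- Hypothetical tables: stripe_data.stripe.customers (from the Stripe datashare)
-- and campaigns in the Aurora database exposed through the auroram schema
SELECT camp.region,
       camp.campaign_name,
       count(distinct cust.id) AS new_customers
FROM stripe_data.stripe.customers cust
JOIN auroram.campaigns camp
  ON cust.campaign_id = camp.campaign_id
GROUP BY camp.region, camp.campaign_name
ORDER BY new_customers DESC;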

Create visualizations and dashboards

Now that your complete set of business data is accessible from Amazon Redshift, you can start to explore the data and create visualizations. Amazon QuickSight is a serverless BI service that allows you to easily connect to a data source, create analyses, publish dashboards, and share between teams. QuickSight seamlessly integrates with AWS services such as Amazon Redshift, Amazon S3, and many more.

The following screenshot illustrates how straightforward it is to connect an Amazon Redshift instance to QuickSight as a new data source.

The following screenshot is of a sample QuickSight dashboard pulling data from Amazon Redshift.

Key considerations

When using Stripe Data Pipeline, consider the following:

  • Instance type – This solution is available for all RA3 node types. If you run an existing DS2 or DC2 cluster, there are multiple options to migrate to RA3, including elastic resize, snapshot and restore, and classic resize. For more information, including an upgrade sizing reference between different node types, refer to Upgrading to RA3 node types.
  • RI migration – If you have Amazon Redshift Reserved Instances (RIs), you can use the RI migration feature to migrate the DS2 RI clusters to equivalent RA3 RI clusters as part of a cross-instance resize or cross-instance snapshot restore operation. The RA3 RI covering the new cluster will be the same cost and on the same calendar terms as the original DS2 RI for supported configurations.
  • Encryption – The consumer cluster must be encrypted as part of the enhanced security controls for cross-account sharing. You can enable encryption at cluster creation time, or modify an unencrypted cluster to use encryption with either AWS Key Management Service (AWS KMS) or AWS CloudHSM.
  • Federated queries – This capability works with external DB instances, including Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Aurora MySQL-Compatible Edition. You should also ensure that you have an Amazon Redshift cluster with a cluster maintenance version that supports federated queries.

Conclusion

In this post, we introduced Stripe Data Pipeline for Amazon Redshift and discussed options to further integrate with AWS services. Stripe Data Pipeline removes the need to build a custom API integration or adopt a third-party ETL pipeline, making data accessible in a few clicks with no code required. Businesses can automatically receive up-to-date data from Stripe in their data warehouse on AWS, reduce data silos, and extract deep insights to address business needs.

Check out Stripe Data Pipeline for more information about the solution and how to get started.


About the Authors

Jessica Ho is a Sr. Partner Solutions Architect at AWS supporting ISV partners who build business applications. She is passionate about creating differentiated solutions that promote cloud adoption. Outside of work, she enjoys spoiling her garden into a mini jungle.

Alexander Mahabir is a Sr. Partner Solutions Architect at AWS based in the D.C. metropolitan area. Alex has over 16 years of experience building cloud and on-premises solutions for small, medium, and large enterprises. Alex currently works with ISV partners in the Digital Customer Experience segment.

Coming June 2022: An updated Amazon QuickSight dashboard experience

Post Syndicated from Rushabh Vora original https://aws.amazon.com/blogs/big-data/coming-june-2022-an-updated-amazon-quicksight-dashboard-experience/

Starting June 30, 2022, Amazon QuickSight is introducing the new look and feel for your dashboards. In this post, we walk through the changes to expect with the new look. The new dashboard experience includes the following improvements:

  • Simplified toolbar
  • Discoverable visual menu
  • Polished controls, menu, and submenus
  • Non-blocking right pane for secondary experiences like filters, threshold alerts, and downloads

Simplified toolbar

The simplified toolbar experience offers updated icons for key actions for better visual clarity.

The following screenshot shows the old look.

The following screenshot shows the new look with updated icons.

Discoverable visual menu

The visual menu is visible on-hover to improve the discoverability of drills, export, and filter restatements.

To use the visual menu with the old version, you needed to select the visual.

With the new look, you can view the menu by hovering over the visual.

Polished controls, menu, and submenus

The new look features improved controls, menus, submenus, toast notifications, and other dashboard elements to provide a more polished visual experience.

For example, the following screenshot shows the old look for calendar controls.

The following screenshot shows the new look.

The following screenshot shows the old look for the menu and submenus.

The following screenshot shows the new look.

Non-blocking right pane

The new look features a non-blocking right pane for secondary experiences like filters, threshold alerts, and downloads, to improve focus on the content of the dashboard.

The following animation shows the old look for ad-hoc filtering.

The filters are now moved to the right pane.

The following animation shows the old look for creating threshold alerts.

Threshold alert creation is now in the right pane.

The following animation shows the old look for the downloading experience.

The new look offers a downloads pane.

Summary

The new look for the QuickSight dashboard experience will be available starting June 30, 2022. If you have any questions or feedback, please reach out to us by leaving a comment.


About the Author

Rushabh Vora is a Senior Technical Product Manager for Amazon QuickSight, Amazon Web Services’ cloud-native, fully managed BI service. He is passionate about data visualization. Prior to QuickSight, he was working with Amazon Business as a Product Manager.

Use a linear learner algorithm in Amazon Redshift ML to solve regression and classification problems

Post Syndicated from Phil Bates original https://aws.amazon.com/blogs/big-data/use-a-linear-learner-algorithm-in-amazon-redshift-ml-to-solve-regression-and-classification-problems/

Amazon Redshift is the fastest, most widely used, fully managed, and petabyte-scale cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Amazon Redshift ML, powered by Amazon SageMaker, makes it easy for SQL users such as data analysts, data scientists, and database developers to create, train, and deploy machine learning (ML) models using familiar SQL commands and then use these models to make predictions on new data for use cases such as churn prediction, customer lifetime value prediction, and product recommendations. Redshift ML makes the model available as a SQL function within the Amazon Redshift data warehouse so you can easily use it in queries and reports. Customers across all verticals are using Redshift ML to derive better insights from their data. For example, Jobcase uses Redshift ML to recommend job search at scale. Magellan RX Management uses Redshift ML to predict drug therapeutic use conditions.

Amazon Redshift supports supervised learning, including regression, binary classification, and multi-class classification, as well as unsupervised learning using K-Means. You can optionally specify XGBoost, MLP, and now linear learner model types, which are supervised learning algorithms used to solve either classification or regression problems; linear learner in particular provides a significant increase in speed over traditional hyperparameter optimization techniques. Amazon Redshift also supports bring-your-own-model to invoke remote SageMaker endpoints.

In this post, we show you how to use Redshift ML to solve regression and classification problems using the SageMaker linear learner algorithm, which explores different training objectives and chooses the best solution from a validation set.

Solution overview

We first solve a linear regression problem, followed by a multi-class classification problem.

The following are some common use cases and the problem types used to address them:

  • Customer churn prediction – Classification
  • Predict if a sales lead will close – Classification
  • Fraud detection – Classification
  • Price and revenue prediction – Linear regression
  • Customer lifetime value prediction – Linear regression

To use the linear learner algorithm, you need to provide inputs or columns representing dimensional values and also the label or target, which is the value you’re trying to predict. The linear learner algorithm trains many models in parallel, and automatically determines the most optimized model.

Prerequisites

To get started, we need an Amazon Redshift cluster or an Amazon Redshift Serverless endpoint with an attached AWS Identity and Access Management (IAM) role that provides access to SageMaker and permissions to an Amazon Simple Storage Service (Amazon S3) bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster with a default IAM role, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.

Use case 1: Linear regression

In this use case, we analyze the Abalone dataset to determine the relationship between physical measurements and the age of abalone, and then use that relationship to predict the age. The age of abalone is traditionally determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a time-consuming task. We want to predict the age from physical measurements that are easier to obtain. The age of abalone is (number of rings + 1.5) years.

Prepare the data

Load the Abalone dataset into Amazon Redshift using the following SQL. You can use the Amazon Redshift query editor v2 or your preferred SQL tool to run these commands.

To create the table, use the following commands:

create table abalone_dataset
( 
id INT IDENTITY(1,1),
Sex     CHAR(1),
Length float,
Diameter float,
Height float,
Whole  float,
Shucked  float,
Viscera  float,
Shell float,
Rings  integer   
); 

To load data into Amazon Redshift, use the following COPY command:

COPY abalone_dataset 
from 's3://redshift-ml-multiclass/abalone.csv' 
REGION 'us-east-1'
IAM_ROLE default
CSV IGNOREHEADER 1
NULL AS 'NULL'; 

To train the model, we use 80% of the data in the abalone table, and then test the accuracy of that model by checking how well it predicts the Rings label attribute on the remaining 20% of the data. Run the following commands to create the training and validation tables:

create table abalone_training as
  select * from abalone_dataset where mod(id,10) < 8 ; 

create table abalone_validation as
  select * from abalone_dataset where mod(id,10) >= 8; 

Create a model in Redshift ML

To create a model in Amazon Redshift, use the following command:

CREATE MODEL model_abalone_ring_prediction
FROM (
 SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings  as target_label
FROM abalone_training
) TARGET target_label
FUNCTION f_abalone_ring_prediction
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER 
PROBLEM_TYPE REGRESSION
OBJECTIVE 'MSE'
SETTINGS (
  S3_BUCKET '<your-s3-bucket>',
  MAX_RUNTIME 15000
  ); 

We define the following parameters in the CREATE MODEL statement:

  • Model type – We use the linear learner model type, which is newly added and extends typical linear models by training many models in parallel, in a computationally efficient manner.
  • Problem type – We use REGRESSION because the ring count we want to predict is a continuous numeric value.
  • Objective – We specified MSE (mean squared error) as our objective, which is a common metric for evaluating regression models.
  • Max runtime – This parameter denotes how long the model training can run. Specifying a larger value may help create a better tuned model. The default value for this parameter is 5400 (90 minutes). For this example, we set it to 15000.

The preceding statement takes a few seconds to complete. It initiates an Amazon SageMaker Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster or Amazon Redshift Serverless as a user-defined function (UDF). You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.

To check the status of the model, use the following command:

show model model_abalone_ring_prediction;

The following is the tabular outcome for the preceding command after model training was done. It took approximately 120 minutes to train the model.

Key Value
Model Name model_abalone_ring_prediction
Schema Name public
Owner awsuser
Creation Time Tue, 10.05.2022 19:42:33
Model State READY
validation:mse 4.082088
Estimated Cost 5.423719
. .
TRAINING DATA: .
Query SELECT SEX , LENGTH , DIAMETER , HEIGHT , WHOLE , SHUCKED , VISCERA , SHELL, RINGS AS TARGET_LABEL
. FROM ABALONE_TRAINING
Target Column TARGET_LABEL
. .
PARAMETERS: .
Model Type linear_learner
Problem Type Regression
Objective MSE
AutoML Job Name redshiftml-20220510194233380173
Function Name f_abalone_ring_prediction
Function Parameters sex length diameter height whole shucked viscera shell
Function Parameter Types bpchar float8 float8 float8 float8 float8 float8 float8
IAM Role default-aws-iam-role
S3 Bucket poc-generic-bkt
Max Runtime 15000

Model validation

We notice from the preceding table that the MSE reported during training (validation:mse) is 4.08. Now let’s run the prediction query and validate the accuracy of the model on the abalone_validation dataset:

select
ROUND(AVG(POWER(( tgt_label - predicted ),2)),2) mse
, ROUND(SQRT(AVG(POWER(( tgt_label - predicted ),2))),2) rmse
from 
(
SELECT   Sex ,
Length ,
Diameter ,
Height ,
Whole  ,
Shucked  ,
Viscera  ,
Shell,
Rings as tgt_label,
f_abalone_ring_prediction(Sex ,Length ,Diameter ,Height ,Whole  ,Shucked  ,Viscera  ,Shell) as predicted,
case when tgt_label = predicted then 1
      else 0 end as match,
  case when tgt_label  <> predicted then 1
    else 0 end as nonmatch
FROM abalone_validation
)t1 

The following is the outcome from the query:

mse rmse
5.08 2.25

The MSE and RMSE values from the preceding query indicate that our model predicts ring counts reasonably close to the actual values in our validation dataset.
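
Because the age of abalone is (number of rings + 1.5) years, you can also report the prediction as an estimated age. The following is a quick example using the function we created:

-- Estimate age in years from the predicted ring count
select id,
       f_abalone_ring_prediction(Sex, Length, Diameter, Height, Whole, Shucked, Viscera, Shell) + 1.5 as predicted_age_years
from abalone_validation
limit 10;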

We can also observe that Redshift ML is able to identify the right combination of features to come up with a usable prediction model. We can further check the impact of each attribute and its contribution and weight in the model using the following command:

select explain_model ('model_abalone_ring_prediction');

The following is the outcome, where each attribute’s weight is representative of its role in the model’s decision-making:

{"explanations":
    {"kernel_shap":
      {"label0":
        {"expected_value":10.06938362121582,
          "global_shap_values":
          {
            "diameter":0.6897614190439705,
            "height":0.38391156156643987,
            "length":0.29646334630067408,
            "sex":0.5516722137411109,
            "shell":1.5679368167990147,
            "shucked":2.222549468867254,
            "viscera":0.2879883139361144,
            "whole":0.8085603201751219
            }
        }
       }
      }
      ,"version":"1.0"
    }

Use case 2: Multi-class classification

For this use case, we use the Covertype dataset (copyright Jock A. Blackard and Colorado State University), which contains information collected by the US Geological Survey and the US Forest Service about wilderness areas in northern Colorado. This has been downloaded to an S3 bucket to make it simple to create the model. You may want to download the dataset description. This dataset contains various measurements such as elevation, distance to waters and roadways, as well as the wilderness area designation and the soil type. Our ML task is to create a model to predict the cover type for a given area.

Prepare the data

To prepare the data for this model, you need to create and populate the table public.covertype_data in Amazon Redshift using the Covertype dataset. You may use the following SQL in Amazon Redshift query editor v2 or your preferred SQL tool:

CREATE TABLE public.covertype_data (
    elevation bigint ENCODE az64,
    aspect bigint ENCODE az64,
    slope bigint ENCODE az64,
    horizontal_distance_to_hydrology bigint ENCODE az64,
    vertical_distance_to_hydrology bigint ENCODE az64,
    horizontal_distance_to_roadways bigint ENCODE az64,
    hillshade_9am bigint ENCODE az64,
    hillshade_noon bigint ENCODE az64,
    hillshade_3pm bigint ENCODE az64,
    horizontal_distance_to_fire_points bigint ENCODE az64,
    wilderness_area1 bigint ENCODE az64,
    wilderness_area2 bigint ENCODE az64,
    wilderness_area3 bigint ENCODE az64,
    wilderness_area4 bigint ENCODE az64,
    soil_type1 bigint ENCODE az64,
    soil_type2 bigint ENCODE az64,
    soil_type3 bigint ENCODE az64,
    soil_type4 bigint ENCODE az64,
    soil_type5 bigint ENCODE az64,
    soil_type6 bigint ENCODE az64,
    soil_type7 bigint ENCODE az64,
    soil_type8 bigint ENCODE az64,
    soil_type9 bigint ENCODE az64,
    soil_type10 bigint ENCODE az64,
    soil_type11 bigint ENCODE az64,
    soil_type12 bigint ENCODE az64,
    soil_type13 bigint ENCODE az64,
    soil_type14 bigint ENCODE az64,
    soil_type15 bigint ENCODE az64,
    soil_type16 bigint ENCODE az64,
    soil_type17 bigint ENCODE az64,
    soil_type18 bigint ENCODE az64,
    soil_type19 bigint ENCODE az64,
    soil_type20 bigint ENCODE az64,
    soil_type21 bigint ENCODE az64,
    soil_type22 bigint ENCODE az64,
    soil_type23 bigint ENCODE az64,
    soil_type24 bigint ENCODE az64,
    soil_type25 bigint ENCODE az64,
    soil_type26 bigint ENCODE az64,
    soil_type27 bigint ENCODE az64,
    soil_type28 bigint ENCODE az64,
    soil_type29 bigint ENCODE az64,
    soil_type30 bigint ENCODE az64,
    soil_type31 bigint ENCODE az64,
    soil_type32 bigint ENCODE az64,
    soil_type33 bigint ENCODE az64,
    soil_type34 bigint ENCODE az64,
    soil_type35 bigint ENCODE az64,
    soil_type36 bigint ENCODE az64,
    soil_type37 bigint ENCODE az64,
    soil_type38 bigint ENCODE az64,
    soil_type39 bigint ENCODE az64,
    soil_type40 bigint ENCODE az64,
    cover_type bigint ENCODE az64 
)
DISTSTYLE AUTO;	


Copy public.covertype_data
From 's3://redshift-ml-multiclass/covtype.data.gz'
iam_role default 
gzip
delimiter ','
region 'us-east-1';

Now that our dataset is loaded, we run the following SQL statements to split the data into three sets for training (80%), validation (10%), and prediction (10%). Note that Redshift ML Autopilot automatically splits the data into training and validation, but by splitting it here, you’re able to verify the accuracy of your model.

To prepare the dataset, assign random values to split the data:

create table  public.covertype_data_prep  as
select a.*,
cast (random() * 100 as int) as data_group_id
from public.covertype_data a;

Use the following code for the training set:

Create table public.covertype_training as
Select * from public.covertype_data_prep
Where data_group_id < 80;

Use the following code for the validation set:

Create table public.covertype_validation as
Select * from public.covertype_data_prep
Where data_group_id between 80 and 89;

Use the following code for the test set:

Create table public.covertype_test as
Select * from public.covertype_data_prep
Where data_group_id > 89; 

Now that we have our datasets, it’s time to create the model.

Create a model in Redshift ML using linear learner

Run the following SQL command to create your model—note our target is cover_type and we use all the inputs from our training set:

CREATE  MODEL forest_cover_type_model
FROM (select Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
  Cover_type from public.covertype_training)
TARGET cover_type
FUNCTION predict_cover_type
IAM_ROLE default
MODEL_TYPE LINEAR_LEARNER              
PROBLEM_TYPE MULTICLASS_CLASSIFICATION 
 OBJECTIVE 'Accuracy'
SETTINGS (
  S3_BUCKET '<<your-amazon-s3-bucket-name>>',
  S3_GARBAGE_COLLECT OFF,
  MAX_RUNTIME 9600
) ;

You can use the SHOW MODEL command to view the status of the model.
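
For example, the following command returns the current state of the model and, when training is complete, its validation metrics:

show model forest_cover_type_model;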

You can see that the model has an accuracy score of .730279 and is in the READY state. Now let’s run a query to do some validation of our own.

Model validation

Run the following SQL query against the validation table, using the function created by our model:

select 
 cast(sum(t1.match)as decimal(7,2)) as predicted_matches
,cast(sum(t1.nonmatch) as decimal(7,2)) as predicted_non_matches
,cast(sum(t1.match + t1.nonmatch) as decimal(7,2))  as total_predictions
,predicted_matches / total_predictions as pct_accuracy
from 
(select 
Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   Cover_type as actual_cover_type,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type,
case when actual_cover_type = predicted_cover_type then 1
      else 0 end as match,
  case when actual_cover_type <>  predicted_cover_type then 1
    else 0 end as nonmatch

from public.covertype_validation
) t1;

You can see that our accuracy is very close to our score from the SHOW MODEL output.
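
To dig deeper than the overall accuracy, you can also compute per-class accuracy. The following is a sketch that assumes you saved the actual and predicted values from the preceding query into a table, here given the hypothetical name covertype_scored:

-- covertype_scored is a hypothetical table holding the actual_cover_type and
-- predicted_cover_type pairs produced by the validation query above
select actual_cover_type,
       count(*) as total_rows,
       sum(case when actual_cover_type = predicted_cover_type then 1 else 0 end) as correct_rows,
       round(100.0 * sum(case when actual_cover_type = predicted_cover_type then 1 else 0 end) / count(*), 2) as pct_accuracy
from covertype_scored
group by actual_cover_type
order by actual_cover_type;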

Run a prediction query

Let’s run a prediction query in Amazon Redshift ML using our function against our test dataset to see the most common class of cover type for the Neota Wilderness Area. We can identify this area by checking wilderness_area2 for a value of 1.

The dataset includes the following wilderness areas:

  1. Rawah Wilderness Area
  2. Neota Wilderness Area
  3. Comanche Peak Wilderness Area
  4. Cache la Poudre Wilderness Area

The cover types are in seven different classes:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas Fir
  7. Krummholz

There are also 40 different soil type definitions, which you can see in the dataset description, with a value of 0 or 1 to note if it’s applicable for a particular row. The following are a few example soil types:

  1. Cathedral family – Rock outcrop complex, extremely stony
  2. Vanet-Ratake families – Rocky outcrop complex, very stony
  3. Haploborolis family – Rock outcrop complex, rubbly
  4. Ratake family – Rock outcrop complex, rubbly
  5. Vanet family – Rock outcrop complex, rubbly
  6. Vanet-Wetmore families – Rock outcrop complex, stony

Use the following query to find the most common predicted cover type in the Neota Wilderness Area:

select t1.predicted_cover_type, count(*) 
from 
(     
select 
   Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10 , 
   Soil_Type11,  
   Soil_Type12 , 
   Soil_Type13 ,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40,
   predict_cover_type( Elevation, 
   Aspect,  
   Slope,  
   Horizontal_distance_to_hydrology,  
   Vertical_distance_to_hydrology,   
   Horizontal_distance_to_roadways,  
   HIllshade_9am,  
   Hillshade_noon,  
   Hillshade_3pm , 
   Horizontal_Distance_To_Fire_Points,  
   Wilderness_Area1,  
   Wilderness_Area2,   
   Wilderness_Area3,  
   Wilderness_Area4,  
   soil_type1,
   Soil_Type2,
   Soil_Type3, 
   Soil_Type4, 
   Soil_Type5,  
   Soil_Type6, 
   Soil_Type7,  
   Soil_Type8, 
   Soil_Type9, 
   Soil_Type10,
   Soil_Type11,  
   Soil_Type12, 
   Soil_Type13,
   Soil_Type14, 
   Soil_Type15, 
   Soil_Type16,  
   Soil_Type17, 
   Soil_Type18,  
   Soil_Type19,  
   Soil_Type20,  
   Soil_Type21,  
   Soil_Type22,  
   Soil_Type23,  
   Soil_Type24, 
   Soil_Type25,  
   Soil_Type26, 
   Soil_Type27,  
   Soil_Type28,  
   Soil_Type29,  
   Soil_Type30,  
   Soil_Type31, 
   Soil_Type32, 
   Soil_Type33,  
   Soil_Type34, 
   Soil_Type36, 
   Soil_Type37,  
   Soil_Type38,  
   Soil_Type39, 
   Soil_Type40) as predicted_cover_type 

from public.covertype_test
where wilderness_area2 = 1)
t1
group by 1;

Our model has predicted that the majority of the cover in this area is Spruce/Fir.

You can experiment with various combinations, such as determining which soil types are most likely to occur in a predicted cover type.
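
For example, the following sketch assumes you saved the test-set predictions together with the soil type columns into a table with the hypothetical name covertype_test_scored, and counts how often a few soil types occur for each predicted cover type:

-- covertype_test_scored is a hypothetical table containing the soil type columns
-- plus the predicted_cover_type values returned by the query above
select predicted_cover_type,
       sum(soil_type1)  as soil_type1_rows,
       sum(soil_type2)  as soil_type2_rows,
       sum(soil_type10) as soil_type10_rows,
       sum(soil_type29) as soil_type29_rows
from covertype_test_scored
group by predicted_cover_type
order by predicted_cover_type;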

Conclusion

Redshift ML makes it easy for users of all levels to create, train, and tune models using a SQL interface. In this post, we walked you through how to use the linear learner algorithm to create regression and multi-class classification models. You can then use those models to make predictions using simple SQL commands and gain valuable insights.

To learn more about Redshift ML, visit Amazon Redshift ML.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Sohaib Katariwala is an Analytics Specialist Solutions Architect at AWS. He has 12+ years of experience helping organizations derive insights from their data.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Jiayuan Chen is a Senior Software Development Engineer at AWS. He is passionate about designing and building data-intensive applications, and has been working in the areas of data lake, query engine, ingestion, and analytics. He keeps up with latest technologies and innovates things that spark joy.

Debu Panda, a Senior Manager, Product Management at AWS, is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle OpenWorld, and JavaOne. He is the lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).