Tag Archives: Analytics

Analyze Amazon SES events at scale using Amazon Redshift

Post Syndicated from Manash Deb original https://aws.amazon.com/blogs/big-data/analyze-amazon-ses-events-at-scale-using-amazon-redshift/

Email is one of the most important methods for business communication across many organizations. It’s also one of the primary methods for many businesses to communicate with their customers. With the ever-increasing necessity to send emails at scale, monitoring and analysis have become a major challenge.

Amazon Simple Email Service (Amazon SES) is a cost-effective, flexible, and scalable email service that enables you to send and receive emails from your applications. You can use Amazon SES for several use cases, such as transactional, marketing, or mass email communications.

An important benefit of Amazon SES is its native integration with other AWS services, such as Amazon CloudWatch and Amazon Redshift, which allows you to monitor and analyze your email sending at scale seamlessly. You can store your email events in Amazon Redshift, which is a widely used, fast, and fully managed cloud data warehouse. You can then analyze these events using SQL to gain business insights such as marketing campaign success, email bounces, complaints, and so on.

In this post, you will learn how to implement an end-to-end solution to automate this email analysis and monitoring process.

Solution overview

The following architecture diagram highlights the end-to-end solution, which you can provision automatically with an AWS CloudFormation template.

In this solution, you publish Amazon SES email events to an Amazon Kinesis Data Firehose delivery stream that publishes data to Amazon Redshift. You then connect to the Amazon Redshift database and use a SQL query tool to analyze Amazon SES email events that meet the given criteria. We use the Amazon Redshift SUPER data type to store the event (JSON data) in Amazon Redshift. The SUPER data type handles semi-structured data, which can have varying table attributes and types.

The alarm system uses the Amazon CloudWatch logs that Kinesis Data Firehose generates when a data load to Amazon Redshift fails. A metric filter pattern-matches these log events to detect the error condition and drives a CloudWatch alarm, which in turn sends out email notifications using Amazon Simple Notification Service (Amazon SNS).
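If you want to see what this alarm setup looks like outside of the CloudFormation template, the following is a minimal boto3 sketch of the same pattern. The log group name, filter pattern, metric names, and SNS topic ARN are illustrative assumptions, not values taken from the template.

import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical names; the CloudFormation template in this post creates its own resources.
LOG_GROUP = "/aws/kinesisfirehose/ses-events-stream"
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ses-load-failures"

# Metric filter that counts delivery errors in the Firehose destination error logs
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="RedshiftDeliveryErrors",
    filterPattern='"DeliveryToRedshift"',  # assumption: match Redshift delivery error events
    metricTransformations=[{
        "metricName": "RedshiftDeliveryErrorCount",
        "metricNamespace": "SESEventPipeline",
        "metricValue": "1",
    }],
)

# Alarm that notifies the SNS topic when any error is counted in a 5-minute window
cloudwatch.put_metric_alarm(
    AlarmName="ses-redshift-load-failure",
    Namespace="SESEventPipeline",
    MetricName="RedshiftDeliveryErrorCount",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[SNS_TOPIC_ARN],
)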

Prerequisites

As a prerequisite for deploying the solution in this post, you need to set up Amazon SES in your account. For more information, see Getting Started with Amazon Simple Email Service.

Solution resources and features

The architecture built by AWS CloudFormation supports AWS best practices for high availability and security. The CloudFormation template takes care of the following key resources and features:

  • Amazon Redshift cluster – An Amazon Redshift cluster with encryption at rest enabled using an AWS Key Management Service (AWS KMS) customer managed key (CMK). This cluster acts as the destination for Kinesis Data Firehose and stores all the Amazon SES email sending events in the table ses, as shown in the following screenshot.
  • Kinesis Data Firehose configuration – A Kinesis Data Firehose delivery stream that acts as the event destination for all Amazon SES email sending metrics. The delivery stream is set up with Amazon Redshift as the destination. Server-side encryption is enabled using an AWS KMS CMK, and destination error logging has been enabled as per best practices.
  • Amazon SES configuration – A configuration set in Amazon SES that is used to map Kinesis Data Firehose as the event destination to publish email metrics.

To use the configuration set when sending emails, you can specify a default configuration set for your verified identity, or include a reference to the configuration set in the headers of the email.

  • Exploring and analyzing the data – We use Amazon Redshift query editor v2 for exploring and analyzing the data.
  • Alarms and notifications for ingestion failures – A data load error notification system using CloudWatch and Amazon SNS generates email-based notifications in the event of a failure during data load from Kinesis Data Firehose to Amazon Redshift. The setup creates a CloudWatch log metric filter, as shown in the following screenshot.

A CloudWatch alarm based on the metric filter triggers an SNS notification when in alarm state. For more information, see Using Amazon CloudWatch alarms.

Deploy the CloudFormation template

The provided CloudFormation template automatically creates all the required resources for this solution in your AWS account. For more information, see Getting started with AWS CloudFormation.

  1. Sign in to the AWS Management Console.
  2. Choose Launch Stack to launch AWS CloudFormation in your AWS account:
  3. For Stack name, enter a meaningful name for the stack, for example, ses_events.
  4. Provide the following values for the stack parameters:
    1. ClusterName – The name of the Amazon Redshift cluster.
    2. DatabaseName – The name of the first database to be created when the Amazon Redshift cluster is created.
    3. DeliveryStreamName – The name of the Firehose delivery stream.
    4. MasterUsername – The user name that is associated with the primary user account for the Amazon Redshift cluster.
    5. NodeType – The type of node to be provisioned. (Default dc2.large)
    6. NotificationEmailId – The email notification list that is used to configure an SNS topic for sending CloudWatch alarm and event notifications.
    7. NumberofNodes – The number of compute nodes in the Amazon Redshift cluster. For multi-node clusters, the NumberofNodes parameter must be greater than 1.
    8. OnPremisesCIDR – IP range (CIDR notation) for your existing infrastructure to access the Amazon Redshift cluster.
    9. SESConfigSetName – Name of the Amazon SES configuration set.
    10. SubnetId – Subnet ID where source Amazon Redshift cluster is created.
    11. Vpc – VPC in which Amazon Redshift cluster is launched.
  5. Choose Next.
  6. Review all the information and select I acknowledge that AWS CloudFormation might create IAM resources.
  7. Choose Create stack.

You can track the progress of the stack creation on the Events tab. Wait for the stack to complete and show the status CREATE_COMPLETE.
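If you prefer to launch the stack with the AWS SDK instead of the console, the following is a minimal boto3 sketch. The template URL and parameter values are placeholders you would replace with your own; the parameter keys mirror the list above, and NodeType and NumberofNodes are omitted on the assumption that the template defaults apply.

import boto3

cfn = boto3.client("cloudformation")

response = cfn.create_stack(
    StackName="ses-events",
    TemplateURL="https://your-bucket.s3.amazonaws.com/ses-events-template.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "ClusterName", "ParameterValue": "ses-events-cluster"},
        {"ParameterKey": "DatabaseName", "ParameterValue": "dev"},
        {"ParameterKey": "DeliveryStreamName", "ParameterValue": "ses-events-stream"},
        {"ParameterKey": "MasterUsername", "ParameterValue": "awsuser"},
        {"ParameterKey": "NotificationEmailId", "ParameterValue": "you@example.com"},
        {"ParameterKey": "OnPremisesCIDR", "ParameterValue": "10.0.0.0/16"},
        {"ParameterKey": "SESConfigSetName", "ParameterValue": "ses-events-config-set"},
        {"ParameterKey": "SubnetId", "ParameterValue": "subnet-0123456789abcdef0"},
        {"ParameterKey": "Vpc", "ParameterValue": "vpc-0123456789abcdef0"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # matches the IAM acknowledgment in the console flow
)

# Block until the stack reaches CREATE_COMPLETE (or raise on failure)
cfn.get_waiter("stack_create_complete").wait(StackName=response["StackId"])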

Test the solution

To send a test email, we use the Amazon SES mailbox simulator. Set the configuration-set header to the one created by the CloudFormation template.
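As an alternative to the console, you can send the test message programmatically. The following boto3 sketch sends to the Amazon SES mailbox simulator and attaches a configuration set; the sender address and configuration set name are assumptions you would replace with your verified identity and the name output by the stack.

import boto3

ses = boto3.client("ses")

ses.send_email(
    Source="sender@your-verified-domain.com",  # must be a verified SES identity (assumption)
    Destination={"ToAddresses": ["success@simulator.amazonses.com"]},  # SES mailbox simulator
    Message={
        "Subject": {"Data": "SES event pipeline test"},
        "Body": {"Text": {"Data": "Test message for the Amazon SES to Amazon Redshift pipeline."}},
    },
    ConfigurationSetName="ses-events-config-set",  # configuration set created by the stack (name assumed)
)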

We use the Amazon Redshift query editor v2 to query the Amazon Redshift table (created by the CloudFormation template) and check whether the events have arrived.
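You can also query the table programmatically with the Amazon Redshift Data API, which is convenient for automated checks. The following is a minimal sketch; the cluster identifier, database, and user are assumptions, and the table name ses is the one created by the template.

import time
import boto3

rsd = boto3.client("redshift-data")

# Submit the query (cluster identifier, database, and user are placeholders)
stmt = rsd.execute_statement(
    ClusterIdentifier="ses-events-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT * FROM ses LIMIT 10;",
)

# Poll until the statement finishes, then fetch the rows
while True:
    desc = rsd.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    result = rsd.get_statement_result(Id=stmt["Id"])
    print(result["Records"])
else:
    print("Query did not finish:", desc.get("Error"))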

If the data load of the event stream fails from Kinesis Data Firehose to Amazon Redshift, the failure notification system is triggered, and you receive an email notification via Amazon SNS.

Clean up

Some of the AWS resources deployed by the CloudFormation stacks in this post incur a cost as long as you continue to use them.

You can delete the CloudFormation stack to delete all AWS resources created by the stack. To clean up all your stacks, use the AWS CloudFormation console to remove the stacks that you created in reverse order.

  1. On the Stacks page on the AWS CloudFormation console, choose the stack to delete.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted.

After stack deletion begins, you can’t stop it. The stack proceeds to the DELETE_IN_PROGRESS state. When the stack deletion is complete, the stack changes to the DELETE_COMPLETE state. The AWS CloudFormation console doesn’t display stacks in the DELETE_COMPLETE state by default. To display deleted stacks, you must change the stack view filter. For more information, see Viewing deleted stacks on the AWS CloudFormation console.

If the delete fails, the stack enters the DELETE_FAILED state. For solutions, see Delete stack fails.

Conclusion

In this post, we walked through the process of setting up Amazon SES and Amazon Redshift to deploy an email reporting service that can scale to support millions of events. We used Amazon Redshift to store semi-structured messages using the SUPER data type in database tables to support varying message sizes and formats. With this solution, you can easily run analytics at scale and analyze your email event data for deliverability-related issues such as bounces or complaints.

Use the CloudFormation template provided to speed up provisioning of the cloud resources required for the solution (Amazon SES, Kinesis Data Firehose, and Amazon Redshift) in your account while following security best practices. Then you can analyze Amazon SES events at scale using Amazon Redshift.


About the Authors

Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different database technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.

Arnab Ghosh is a Solutions Architect for AWS in North America helping enterprise customers build resilient and cost-efficient architectures. He has over 13 years of experience in architecting, designing, and developing enterprise applications that solve complex business problems.

Sanjoy Thanneer is a Sr. Technical Account Manager with AWS based out of New York. He has over 20 years of experience working in database and analytics domains. He is passionate about helping enterprise customers build scalable, resilient, and cost-efficient applications.

Justin Morris is an Email Deliverability Manager for the Simple Email Service team. With over 10 years of experience in the IT industry, he has developed a natural talent for diagnosing and resolving customer issues and continuously looks for growth opportunities to learn new technologies and services.

Simplify your ETL and ML pipelines using the Amazon Athena UNLOAD feature

Post Syndicated from Dylan Qu original https://aws.amazon.com/blogs/big-data/simplify-your-etl-and-ml-pipelines-using-the-amazon-athena-unload-feature/

Many organizations prefer SQL for data preparation because they already have developers for extract, transform, and load (ETL) jobs and analysts preparing data for machine learning (ML) who understand and write SQL queries. Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

By default, Athena automatically writes SELECT query results in CSV format to Amazon S3. However, you might often have to write SELECT query results in non-CSV formats such as JSON, Parquet, and ORC for various use cases. In this post, we walk you through the UNLOAD statement in Athena and how it helps you implement several use cases, along with code snippets that you can use.

Athena UNLOAD overview

By default, CSV is the only output format for an Athena SELECT query, but you can use UNLOAD to write the output of a SELECT query to the formats and compression types that UNLOAD supports. When you wrap a SELECT statement in UNLOAD, Athena writes the results into Amazon S3 in your choice of Apache Parquet, ORC, Apache Avro, TEXTFILE, or JSON format.

Although you can use the CTAS statement to output data in formats other than CSV, those statements also require the creation of a table in Athena. The UNLOAD statement is useful when you want to output the results of a SELECT query in a non-CSV format but don’t require the associated table. For example, a downstream application might require the results of a SELECT query to be in JSON format, and Parquet or ORC might provide a performance advantage over CSV if you intend to use the results of the SELECT query for additional analysis.
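An UNLOAD statement is submitted like any other Athena query. The following boto3 sketch shows one way to run an UNLOAD and wait for it to finish; the database, table, bucket names, and workgroup are placeholders, not values from this post.

import time
import boto3

athena = boto3.client("athena")

unload_sql = """
UNLOAD (SELECT * FROM my_database.my_table)
TO 's3://your-bucket/temp/athenaunload/example/'
WITH (format = 'PARQUET', compression = 'SNAPPY')
"""

# Start the query; Athena still needs an output location for query metadata
query = athena.start_query_execution(
    QueryString=unload_sql,
    WorkGroup="primary",  # placeholder workgroup
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-query-metadata/"},
)

# Poll until the UNLOAD succeeds or fails
while True:
    state = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print("UNLOAD finished with status:", status)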

In this post, we walk you through the following use cases for the UNLOAD feature:

  • Compress Athena query results to reduce storage costs and speed up performance for downstream consumers
  • Store query results in JSON file format for specific downstream consumers
  • Feed downstream Amazon SageMaker ML models that require files as input
  • Simplify ETL pipelines with AWS Step Functions without creating a table

Use case 1: Compress Athena query results

When you’re using Athena to process and create large volumes of data, storage costs can increase significantly if you don’t compress the data. Furthermore, uncompressed formats like CSV and JSON require you to store and transfer a large number of files across the network, which can increase IOPS and network costs. To reduce costs and improve the performance of downstream big data processing applications such as Spark applications, a best practice is to store Athena output in compressed columnar file formats such as ORC and Parquet.

You can use the UNLOAD statement in your Athena SQL statement to create compressed ORC and Parquet file formats. In this example, we use a 3 TB TPC-DS dataset to find all items returned between a store and a website. The following query joins the four tables: item, store_returns, web_returns, and customer_address:

UNLOAD (
	select *
	from store_returns, item, web_returns, customer_address
	where sr_item_sk = i_item_sk and
		wr_item_sk = i_item_sk and
		wr_refunded_addr_sk = ca_address_sk
) to 's3://your-bucket/temp/athenaunload/usecase1/' with (
	format = 'PARQUET',
	compression = 'SNAPPY',
	partitioned_by = ARRAY['ca_location_type']
)

When compressed with Snappy and stored in Parquet format, the query output was a 62 GB dataset. The same output in uncompressed CSV format was 248 GB. The Snappy-compressed Parquet output was therefore 75% smaller, saving storage costs and improving query performance for downstream consumers.

Use case 2: Store query results in JSON file format

Some systems downstream of Athena, such as web applications or third-party systems, require the data to be in JSON format. The JSON file format is a text-based, self-describing representation of structured data based on key-value pairs. It’s lightweight, and is widely used as a data transfer mechanism by different services, tools, and technologies. In these use cases, the UNLOAD statement with the format parameter set to JSON can unload files in JSON format to Amazon S3.

The following SQL extracts the returns data for a specific customer within a specific data range against the 3 TB catalog_returns table and stores it to Amazon S3 in JSON format:

UNLOAD (
		select cr_returned_date_sk, cr_returning_customer_sk, cr_catalog_page_sk, cr_net_loss
		from catalog_returns
		where cr_returned_date_sk = 2450821 and cr_returning_customer_sk = 11026691
	) to 's3://your-bucket/temp/athenaunload/usecase2/' with (
		format = 'JSON', compression = 'NONE'
	)

By default, Athena uses Gzip for JSON and TEXTFILE formats. You can set the compression to NONE to store the UNLOAD result without any compression. The query result is stored as the following JSON file:

{"cr_returned_date_sk":2450821,"cr_returning_customer_sk":11026691,"cr_catalog_page_sk":20.8,"cr_net_loss":53.31}

The query result can now be consumed by a downstream web application.
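As an illustration of the consumption side, the following sketch reads the unloaded JSON files back from Amazon S3 and parses them line by line. The bucket is the placeholder from the preceding query, and with compression set to NONE the objects are plain newline-delimited JSON.

import json
import boto3

s3 = boto3.client("s3")
bucket = "your-bucket"
prefix = "temp/athenaunload/usecase2/"

records = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        # Each object contains one JSON document per line
        for line in body.decode("utf-8").splitlines():
            if line.strip():
                records.append(json.loads(line))

print(f"Loaded {len(records)} catalog return records")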

Use case 3: Feed downstream ML models

Analysts and data scientists rely on Athena for ad hoc SQL queries, data discovery, and analysis. They often like to quickly create derived columns such as aggregates or other features. These need to be written as files in Amazon S3 so a downstream ML model can directly read the files without having to rely on a table.

You can also parameterize repetitive queries using Athena prepared statements. Using the UNLOAD statement inside a prepared statement gives less technical users, analysts, and data scientists a self-service way to export the files needed for their downstream analysis without having to write queries.

In the following example, we create derived columns and engineer features for a downstream SageMaker ML model that predicts the best discount for catalog items in future promotions. We derive averages for quantity, list price, discount, and sales price for promotional items sold in stores where the promotion is not offered by mail or a special event. Then we restrict the results to a specific gender, marital status, and education level. We use the following query:

UNLOAD(
		Select i_item_id, 
	        avg(ss_quantity) avg_sales,
	        avg(ss_list_price) avg_list_price,
	        avg(ss_coupon_amt) avg_coupon_amt,
	        avg(ss_sales_price) avg_sales_price 
	 from store_sales, customer_demographics, date_dim, item, promotion
	 where cast(ss_sold_date_sk AS int) = d_date_sk and
	       ss_item_sk = i_item_sk and
	       ss_cdemo_sk = cd_demo_sk and
	       ss_promo_sk = p_promo_sk and
	       cd_gender = 'M' and 
	       cd_marital_status = 'M' and
	       cd_education_status = '4 yr Degree' and
	       (p_channel_email = 'N' or p_channel_event = 'N') and
	       d_year = 2001 
	 group by i_item_id
	 order by i_item_id
	) to 's3://your-bucket/temp/athenaunload/usecase3/' with (
		format = 'PARQUET',compression = 'SNAPPY'
	)

The output is written as Parquet files in Amazon S3 for a downstream SageMaker model training job to consume.

Use case 4: Simplify ETL pipelines with Step Functions

Step Functions is integrated with the Athena console to facilitate building workflows that include Athena queries and data processing operations. This helps you create repeatable and scalable data processing pipelines as part of a larger business application and visualize the workflows on the Athena console.

In this use case, we provide an example query result in Parquet format for downstream consumption. In this example, the raw data is in TSV format and gets ingested on a daily basis. We use the Athena UNLOAD statement to convert the data into Parquet format. After that, we send the location of the Parquet file as an Amazon Simple Notification Service (Amazon SNS) notification. Downstream applications can be notified via SNS to take further actions. One common example is to initiate an AWS Lambda function that uploads the Athena transformation result into Amazon Redshift.

The following diagram illustrates the ETL workflow.

The workflow includes the following steps:

  1. Start an AWS Glue crawler pointing to the raw S3 bucket. The crawler updates the metadata of the raw table with new files and partitions.
  2. Invoke a Lambda function to clean up the previous UNLOAD result. This step is required because UNLOAD doesn’t write data to the specified location if the location already has data in it (UNLOAD doesn’t overwrite existing data). To reuse a bucket location as a destination for UNLOAD, delete the data in the bucket location, and then run the query again. Another common pattern is to UNLOAD data to a new partition with incremental data processing. A minimal sketch of such a cleanup function is shown after this list.
  3. Start an Athena UNLOAD query to convert the raw data into Parquet.
  4. Send a notification to downstream data consumers when the file is updated.
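The following is a minimal sketch of the cleanup Lambda function described in step 2, written against the bucket_name and prefix payload keys used later in the Step Functions configuration. It is illustrative only; the CloudFormation template in this post provisions its own CleanUpS3Folder function.

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Delete all objects under the given prefix so UNLOAD can write to it again."""
    bucket = event["bucket_name"]
    prefix = event.get("prefix", "parquet/")

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})

    return {"cleaned": f"s3://{bucket}/{prefix}"}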

Set up resources with AWS CloudFormation

To set up the resources required for this use case, launch the provided AWS CloudFormation template:

Keep all the provided parameters and choose Create stack.

The CloudFormation template creates the following resources:

  • An Athena workgroup etl-workgroup, which holds the Athena UNLOAD queries.
  • A data lake S3 bucket that holds the raw table. We use the Amazon Customer Reviews Dataset in this post.
  • An Athena output S3 bucket that holds the UNLOAD result and query metadata.
  • An AWS Glue database.
  • An AWS Glue crawler pointing to the data lake S3 bucket.
  • A LoadDataBucket Lambda function to load the Amazon Customer Reviews raw data into the S3 bucket.
  • A CleanUpS3Folder Lambda function to clean up previous Athena UNLOAD result.
  • An SNS topic to notify downstream systems when the UNLOAD is complete.

When the stack is fully deployed, navigate to the Outputs tab of the stack on the AWS CloudFormation console and note the value of the following resources:

  • AthenaWorkgroup
  • AthenaOutputBucket
  • CleanUpS3FolderLambda
  • GlueCrawler
  • SNSTopic

Build a Step Functions workflow

We use the Athena Workflows feature to build the ETL pipeline.

  1. On the Athena console, under Jobs in the navigation pane, choose Workflows.
  2. Under Create Athena jobs with Step Functions workflows, for Query large datasets, choose Get started.
  3. Choose Create your own workflow.
  4. Choose Continue.

The following is a screenshot of the default workflow. Compare the default workflow against the earlier ETL workflow we described. The default workflow doesn’t contain a Lambda function invocation and has an additional GetQueryResult step.

Next, we add a Lambda Invoke step.

  1. Search for Lambda Invoke in the search bar.
  2. Choose the Lambda:Invoke step and drag it above the Athena: StartQueryExecution step.
  3. Choose the Athena:GetQueryResults step (right-click) and choose Delete state.

  4. Now the workflow aligns with the earlier design.
  5. Choose the step Glue: StartCrawler.
  6. In the Configuration section, under API Parameters, enter the following JSON (provide the AWS Glue crawler name from the CloudFormation stack output):
    {
      "Name": "GlueCrawler"
    }

  7. Choose the step Glue: GetCrawler.
  8. In the Configuration section, under API Parameters, enter the following JSON:
    {
      "Name": "GlueCrawler"
    }

  9. Choose the step Lambda: Invoke.
  10. In the Configuration section, under API Parameters, for Function name, choose the function -CleanUpS3FolderLambda-.
  11. In the Payload section, enter the following JSON (include the Athena output bucket from the stack output):
    {
      "bucket_name": “AthenaOutputBucket”,
      "prefix": "parquet/"
    }

  12. Choose the step Athena: StartQueryExecution.
  13. In the right Configuration section, under API Parameters, enter the following JSON (provide the Athena output bucket and workgroup name):
    {
      "QueryString": "UNLOAD (SELECT * FROM \"athena_unload_blog\".\"reviews\" )  TO 's3://AthenaOutputBucket/parquet' WITH (format = 'PARQUET',compression = 'SNAPPY')",
      "WorkGroup": “AthenaWorkgroup”
    }

Notice the Wait for task to complete check box is selected. This pauses the workflow while the Athena query is running.

  1. Choose the step SNS: Publish.
  2. In the Configuration section, under API Parameters, for Topic, pick the SNSTopic created by the CloudFormation template.
  3. In the Message section, enter the following JSON to pass the data manifest file location to the downstream consumer:
    {
      "Input.$": "$.QueryExecution.Statistics.DataManifestLocation"
    }

For more information, refer to the GetQueryExecution response syntax.

  1. Choose Next.
  2. Review the generated code and choose Next.
  3. In the Permissions section, choose Create new role.
  4. Review the auto-generated permissions and choose Create state machine.
  5. In the Add additional permissions to your new execution role section, choose Edit role in IAM.
  6. Add permissions and choose Attach policies.
  7. Search for the AWSGlueConsoleFullAccess managed policy and attach it.

This policy grants full access to AWS Glue resources when using the AWS Management Console. In production, generate a policy based on access activity, following the principle of least privilege.

Test the workflow

Next, we test out the Step Functions workflow.

  1. On the Athena console, under Jobs in the navigation pane, choose Workflows.
  2. Under State machines, choose the workflow we just created.
  3. Choose Execute, then choose Start execution to start the workflow.
  4. Wait until the workflow completes, then verify there are UNLOAD Parquet files in the bucket AthenaOutputBucket.

Clean up

To help prevent unwanted charges to your AWS account, delete the AWS resources that you used in this post.

  1. On the Amazon S3 console, choose the -athena-unload-data-lake bucket.
  2. Select all files and folders and choose Delete.
  3. Enter permanently delete as directed and choose Delete objects.
  4. Repeat these steps to remove all files and folders in the -athena-unload-output bucket.
  5. On the AWS CloudFormation console, delete the stack you created.
  6. Wait for the stack status to change to DELETE_COMPLETE.

Conclusion

In this post, we introduced the UNLOAD statement in Athena with some common use cases. We demonstrated how to compress Athena query results to reduce storage costs and improve performance, store query results in JSON file format, feed downstream ML models, and create and visualize ETL pipelines with Step Functions without creating a table.

To learn more, refer to the Athena UNLOAD documentation and Visualizing AWS Step Functions workflows from the Amazon Athena console.


About the Authors

Dylan Qu is a Specialist Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Harsha Tadiparthi is a Principal Solutions Architect focused on providing analytics and AI/ML strategies and solution designs to customers.

Use Amazon Kinesis Data Firehose to extract data insights with Coralogix

Post Syndicated from Tal Knopf original https://aws.amazon.com/blogs/big-data/use-amazon-kinesis-data-firehose-to-extract-data-insights-with-coralogix/

This is a guest blog post co-written by Tal Knopf at Coralogix.

Digital data is expanding exponentially, and the existing limitations to store and analyze it are constantly being challenged and overcome. In the spirit of Moore’s Law, digital storage becomes larger, cheaper, and faster with each successive year. The advent of cloud databases is just one example of how this is happening. Previous hard limits on storage size have become obsolete since their introduction.

In recent years, the amount of available data storage in the world has increased rapidly, reflecting this new reality. If you took all the information just from US academic research libraries and lumped it together, it would add up to 2 petabytes.

Coralogix has worked with AWS to bring you a solution to allow for the flawless integration of high volumes of data with the Coralogix platform for analysis, using Amazon Kinesis Data Firehose.

Kinesis Data Firehose and Coralogix

Kinesis Data Firehose delivers real-time streaming data to destinations like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), and now supports delivering streaming data to Coralogix. There is no limit on the number of delivery streams, so you can use it to get data from multiple AWS services.

Kinesis Data Firehose provides built-in, fully managed error handling, transformation, conversion, aggregation, and compression functionality, so you don’t need to write applications to handle these complexities.

Coralogix is an AWS Partner Network (APN) Advanced Technology Partner with AWS DevOps Competency. The platform enables you to easily explore and analyze logs to gain deeper insights into the state of your applications and AWS infrastructure. You can analyze all your AWS service logs while storing only the ones you need, and generate metrics from aggregated logs to uncover and alert on trends in your AWS services.

Solution overview

Imagine a pipe flowing with data—messages, to be more specific. These messages can contain log lines, metrics, or any other type of data you want to collect.

Clearly, there must be something pushing data into the pipe; this is the provider. There must also be a mechanism for pulling data out of the pipe; this is the consumer.

Kinesis Data Firehose makes it easy to collect, process, and analyze real-time, streaming data by grouping the pipes together in the most efficient way to help with management and scaling.

It offers a few significant benefits compared to other solutions:

  • Keeps monitoring simple – With this solution, you can configure AWS Web Application Firewall (AWS WAF), Amazon Route 53 Resolver Query Logs, or Amazon API Gateway to deliver log events directly to Kinesis Data Firehose.
  • Integrates flawlessly – Most AWS services use Amazon CloudWatch by default to collect logs, metrics, and additional events data. CloudWatch logs can easily be sent using the Firehose delivery stream.
  • Flexible with minimum maintenance – To configure Kinesis Data Firehose with the Coralogix API as a destination, you only need to set up the authentication in one place, regardless of the amount of services or integrations providing the actual data. You can also configure an S3 bucket as a backup plan. You can back up all log events or only those exceeding a specified retry duration.
  • Scale, scale, scale – Kinesis Data Firehose scales up to meet your needs with no need for you to maintain it. The Coralogix platform is also built for scale and can meet all your monitoring needs as your system grows.

Prerequisites

To get started, you must have the following:

  • A Coralogix account. If you don’t already have an account, you can sign up for one.
  • A Coralogix private key.

To find your key, in your Coralogix account, choose API Keys on the Data Flow menu.

Locate the key for Send Your Data.

Set up your delivery stream

To configure your delivery stream, complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. Under Choose source and destination, for Source, choose Direct PUT.
  3. For Destination, choose Coralogix.
  4. For Delivery stream name, enter a name for your stream.
  5. Under Destination settings, for HTTP endpoint name, enter a name for your endpoint.
  6. For HTTP endpoint URL, enter your endpoint URL based on your Region and Coralogix account configuration.
  7. For Access key, enter your Coralogix private key.
  8. For Content encoding, select GZIP.
  9. For Retry duration, enter 60.

To override the logs applicationName, subsystemName, or computerName, complete the optional steps under Parameters.

  1. For Key, enter the log name.
  2. For Value, enter a new value to override the default.
  3. For this post, leave the configuration under Buffer hints as is.
  4. In the Backup settings section, for Source record in Amazon S3, select Failed data only (recommended).
  5. For S3 backup bucket, choose an existing bucket or create a new one.
  6. Leave the settings under Advanced settings as is.
  7. Review your settings and choose Create delivery stream.

Logs subscribed to your delivery stream are immediately sent and available for analysis within Coralogix.
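With the Direct PUT source, any producer that can call the Kinesis Data Firehose API can feed the stream. The following boto3 sketch pushes a single test log record; the delivery stream name and log fields are assumptions for illustration.

import json
import boto3

firehose = boto3.client("firehose")

record = {
    "applicationName": "my-app",  # example fields; adjust to your log schema
    "subsystemName": "checkout",
    "severity": 3,
    "text": "Test log line sent via Kinesis Data Firehose to Coralogix",
}

firehose.put_record(
    DeliveryStreamName="coralogix-delivery-stream",  # placeholder stream name
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)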

Conclusion

Coralogix provides you with full visibility into your logs, metrics, tracing, and security data without relying on indexing to provide analysis and insights. When you use Kinesis Data Firehose to send data to Coralogix, you can easily centralize all your AWS service data for streamlined analysis and troubleshooting.

To get the most out of the platform, check out Getting Started with Coralogix, which provides information on everything from parsing and enrichment to alerting and data clustering.


About the Authors

Tal Knopf is the Head of Customer DevOps at Coralogix. He uses his vast experience in designing and building customer-focused solutions to help users extract the full value from their observability data. Previously, he was a DevOps engineer at Akamai and other companies, where he specialized in large-scale systems and CDN solutions.

Ilya Rabinov is a Solutions Architect at AWS. He works with ISVs at late stages of their journey to help build new products, migrate existing applications, or optimize workloads on AWS. His areas of interest include machine learning, artificial intelligence, security, DevOps culture, CI/CD, and containers.

Top Amazon QuickSight features and updates launched Q1 2022

Post Syndicated from Mia Heard original https://aws.amazon.com/blogs/big-data/top-amazon-quicksight-features-and-updates-launched-q1-2022/

Amazon QuickSight is a serverless, cloud-based business intelligence (BI) service that brings data insights to your teams and end users through machine learning (ML) powered dashboards and data visualizations, which can be accessed via QuickSight or embedded in apps and portals that your users access. This post shares the top QuickSight features and updates launched in Q1 2022.

Amazon QuickSight Community

In the new Amazon QuickSight Community, you can ask and answer questions, network with and learn from other BI users from across the globe, access learning content, and stay up to date with what’s new on QuickSight all in one place!

Join the global QuickSight Community today!

Groups Management UI

QuickSight now provides a user interface to manage user groups, allowing admins to efficiently and easily manage user groups via the QuickSight admin console. Groups Management UI is available to administrators with access to QuickSight admin console pages via AWS Identity and Access Management (IAM) credentials.

To learn more, see Creating and managing groups in Amazon QuickSight.

Comparative and cumulative date and time calculations

Amazon QuickSight authors can now quickly implement advanced date/time calculations without having to use complicated row offsets or pre-computed columns. You can add these calculations in regular business reporting, trend analysis, and time series analysis.

To learn more about the new period functions and their capabilities in various use cases, see Add comparative and cumulative date/time calculations in Amazon QuickSight.

Rich text formatting on visual titles and subtitles 

QuickSight authors can now add rich context to their visuals by choosing from various formatting options such as font type, size, style, and color. You can also better organize the text by choosing from various alignment and ordering options. Visual titles and subtitles now also support hyperlinks as well as parameter-based dynamic text.

To learn more, see Formatting a visual title and subtitle.

Custom subtotals at all levels on pivot table

QuickSight allows you to customize how subtotals are displayed in pivot tables, with options for display at multiple levels and for both rows and columns.

To learn more, see Displaying Totals and Subtotals.

Auto refresh direct query controls

QuickSight now supports automatic refreshes of values displayed in drop-down, multi-select and other controls in dashboards that are in direct query mode. Values within controls are updated every 24 hours to ensure the data is automatically updated without any end-user intervention.

For further details, see Refreshing data in Amazon QuickSight.

Conclusion

QuickSight serves millions of dashboard views weekly, enabling data-driven decision-making in organizations of all sizes, including customers like the NFL, 3M, Accenture, and more.

To stay up to date on all things new with QuickSight, visit What’s New with Analytics!


About the Author

Mia Heard is a product marketing manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service.

Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon EMR cluster

Post Syndicated from Bharat Gamini original https://aws.amazon.com/blogs/big-data/access-apache-livy-using-a-network-load-balancer-on-a-kerberos-enabled-amazon-emr-cluster/

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR supports Kerberos for authentication; you can enable Kerberos on Amazon EMR and put the cluster in a private subnet to maximize security.

To access the cluster, the best practice is to use a Network Load Balancer (NLB) to expose only specific ports, which are access-controlled via security groups. By default, however, Kerberos ticket authentication to Amazon EMR services fails when clients connect through the NLB, because the Kerberos service principals are tied to the cluster host name rather than the NLB host name.

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as SparkContext management, all via a simple REST interface or an RPC client library.

In this post, we discuss how to provide Kerberos ticket access to Livy for external systems like Airflow and Notebooks using an NLB. You can apply this process to other Amazon EMR services beyond Livy, such as Trino and Hive.

Solution overview

The following are the high-level steps required:

  1. Create an EMR cluster with Kerberos security configuration.
  2. Create an NLB with required listeners and target groups.
  3. Update the Kerberos Key Distribution Center (KDC) to create a new service principal and keytab changes.
  4. Update the Livy configuration file.
  5. Verify Livy is accessible via the NLB.
  6. Run the Python Livy test case.

Prerequisites

The advanced configuration presented in this post assumes familiarity with Amazon EMR, Kerberos, Livy, Python, and bash.

Create an EMR cluster

Create the Kerberos security configuration using the AWS Command Line Interface (AWS CLI) as follows (this creates the KDC on the EMR primary node):

aws emr create-security-configuration --name kdc-security-config --security-configuration '{
   "EncryptionConfiguration":{
      "InTransitEncryptionConfiguration":{
         "TLSCertificateConfiguration":{
            "CertificateProviderType":"PEM",
            "S3Object":"s3://${conf_bucket}/${certs.zip}"
         }
      },
      "AtRestEncryptionConfiguration":{
         "S3EncryptionConfiguration":{
            "EncryptionMode":"SSE-S3"
         }
      },
      "EnableInTransitEncryption":true,
      "EnableAtRestEncryption":true
   },
   "AuthenticationConfiguration":{
      "KerberosConfiguration":{
         "Provider":"ClusterDedicatedKdc",
         "ClusterDedicatedKdcConfiguration":{
            "TicketLifetimeInHours":24
         }
      }
   }
}'

It’s a security best practice to keep passwords in AWS Secrets Manager. You can use a bash function like the following as the argument to the --kerberos-attributes option so no passwords are stored in the launch script or command line. The function outputs the required JSON for the --kerberos-attributes option after retrieving the password from Secrets Manager.

krbattrs() { # Pull the KDC password from Secrets Manager without saving to disk or var
cat << EOF
  {
    "Realm": "EC2.INTERNAL",
    "KdcAdminPassword": "$(aws secretsmanager get-secret-value \
        --secret-id KDCpasswd  |jq -r .SecretString)"
  }
EOF
}

Create the cluster using the AWS CLI as follows:

aws emr create-cluster \
  --name "<your-cluster-name>" \
  --release-label emr-6.4.0 \
  --log-uri "s3://<your-log-bucket>" \
  --applications Name=Hive Name=Spark \
  --ec2-attributes "KeyName=<your-key-name>,SubnetId=<your-private-subnet>" \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge \
  --security-configuration <kdc-security-config> \
  --kerberos-attributes $(krbattrs) \
  --use-default-roles

Create an NLB

Create an internet-facing NLB with TCP listeners in your VPC and subnet. An internet-facing load balancer routes requests from clients to targets over the internet. Conversely, an internal NLB routes requests to targets using private IP addresses. For instructions, refer to Create a Network Load Balancer.

The following screenshot shows the listener details.

Create target groups and register the EMR primary instance (Livy3) and KDC instance (KDC3). For this post, these instances are the same; use the respective instances if KDC is running on a different instance.

The KDC and EMR security groups must allow the NLB’s private IP address to access ports 88 and 8998, respectively. You can find the NLB’s private IP address by searching the elastic network interfaces for the NLB’s name. For access control instructions, refer to this article on the knowledge center. Now that the security groups allow access, the NLB health check should pass, but Livy isn’t usable via the NLB until you make further changes (detailed in the following sections). The NLB is actually being used as a proxy to access Livy rather than doing any load balancing.
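One way to look up those private IP addresses programmatically is to filter the elastic network interfaces by the NLB’s description, which contains the load balancer name. The following boto3 sketch assumes an NLB named mynlb; the description format shown is the convention observed for NLB-managed interfaces.

import boto3

ec2 = boto3.client("ec2")

# NLB-managed ENIs carry a description like "ELB net/<nlb-name>/<id>"
response = ec2.describe_network_interfaces(
    Filters=[{"Name": "description", "Values": ["ELB net/mynlb/*"]}]
)

for eni in response["NetworkInterfaces"]:
    print(eni["AvailabilityZone"], eni["PrivateIpAddress"])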

Update the Kerberos KDC

The KDC used by the Livy service must contain a new HTTP Service Principal Name (SPN) using the public NLB host name.

  • You can create the new principal from the EMR primary host using the full NLB public host name:
sudo kadmin.local addprinc HTTP/<nlb-public-fqdn>@EC2.INTERNAL

Replace the fully qualified domain name (FQDN) and Kerberos realm as needed. Ensure the NLB hostname is all lowercase.

After the new SPN exists, you create two keytabs containing that SPN. The first keytab is for the Livy service. The second keytab, which must use the same KVNO number as the first keytab, is for the Livy client.

  • Create Livy service keytab as follows:
sudo kadmin.local ktadd -norandkey -k /etc/livy2.keytab livy/<emr-primary-fqdn>@EC2.INTERNAL
sudo kadmin.local ktadd -norandkey -k /etc/livy2.keytab HTTP/<nlb-public-fqdn>@EC2.INTERNAL
sudo chown livy:livy /etc/livy2.keytab
sudo chmod 600 /etc/livy2.keytab
sudo -u livy klist -e -kt /etc/livy2.keytab

Note the key version number (KVNO) for the HTTP principal in the output of the preceding klist command. The KVNO numbers for the HTTP principal must match the KVNO numbers in the user keytab. Copy the livy2.keytab file to the EMR cluster Livy host if it’s not already there.

  • Create a user or client keytab as follows:
sudo kadmin.local ktadd -norandkey -k /var/tmp/user1.keytab user1@EC2.INTERNAL
sudo kadmin.local ktadd -norandkey -k /var/tmp/user1.keytab HTTP/<nlb-public-fqdn>@EC2.INTERNAL

Note the -norandkey option used when adding the SPN. That preserves the KVNO created in the preceding livy2.keytab.

  • Copy the user1.keytab to the client machine running the Python test case as user1.

Replace the FQDN, realm, and keytab path as needed.

Update the Livy configuration file

The primary change on the EMR cluster primary node running the Livy service is to the /etc/livy/conf/livy.conf file. You change the authentication principal that Livy uses, as well as the associated Kerberos keytab created earlier.

  • Make the following changes to the livy.conf file with sudo:
livy.server.auth.kerberos.principal = HTTP/<nlb-public-fqdn>@EC2.INTERNAL
livy.server.auth.kerberos.keytab = /etc/livy2.keytab

Don’t change the livy.server.launch.kerberos.* values.

  • Restart and verify the Livy service:
sudo systemctl restart livy-server
sudo systemctl status livy-server
  • Verify the Livy port is listening:
sudo lsof -Pi :8998

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 30106 livy 196u IPv6 224853844 0t0 TCP *:8998 (LISTEN)

You can automate these steps (modifying the KDC and Livy config file) by adding a step to the EMR cluster. For more information, refer to Tutorial: Configure a cluster-dedicated KDC.
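One way to add such a step programmatically is with the add_job_flow_steps API, running a script that applies the kadmin and livy.conf changes. The following is a rough sketch; the cluster ID and script location are placeholders, and the configuration script itself is assumed to exist.

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster ID
    Steps=[{
        "Name": "Configure Livy Kerberos for NLB access",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            # Download and run a script (assumed to exist) that creates the SPN,
            # builds the keytabs, edits livy.conf, and restarts the Livy service
            "Args": [
                "bash", "-c",
                "aws s3 cp s3://your-bucket/configure-livy-nlb.sh /tmp/ && "
                "sudo bash /tmp/configure-livy-nlb.sh",
            ],
        },
    }],
)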

Verify Livy is accessible via the NLB

You can now use user1.keytab to authenticate against the Livy REST endpoint. Copy the user1.keytab you created earlier to the host and user login, which run the Livy test case. The host running the test case must be configured to connect to the modified KDC.

  • Create a Linux user (user1) on the client host and the EMR cluster.

If the client host has a terminal available that the user can run commands in, you can use the following commands to verify network connectivity to Livy before running the actual Livy Python test case.

  • Verify the NLB host and port are reachable (no data will be returned by the nc command):
$ nc -vz mynlb-a0ac4b07f9f7f0a1.elb.us-west-2.amazonaws.com 8998
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 44.242.1.1:8998.
Ncat: 0 bytes sent, 0 bytes received in 0.02 seconds.
  • Create a TLS connection, which returns the server’s TLS certificate and TCP packets:
openssl s_client -connect mynlb-a0ac4b07f9f7f0a1.elb.us-west-2.amazonaws.com:8998

If the openssl command doesn’t return a TLS server certificate, the rest of the verification doesn’t succeed. You may have a proxy or firewall interfering with the connection. Investigate your network environment, resolve the issue, and repeat the openssl command to ensure connectivity.

  • Verify the Livy REST endpoint using curl. This verifies Livy REST but not Spark.
kinit -kt user1.keytab user1@EC2.INTERNAL
curl -k -u : --negotiate https://mynlb-a0ac4b07f9f7f0a1.elb.us-west-2.amazonaws.com:8998/sessions
{"from":0,"total":0,"sessions":[]}

curl arguments:
  "-k"   - Skip the TLS certificate check (self-signed certificate)
  "-u :" - Use an empty user name and password; credentials come from the Kerberos ticket cache
  "--negotiate" - Enable negotiate (SPNEGO) authentication

Run the Python Livy test case

The Livy test case is a simple Python3 script named livy-verify.py. You can run this script from a client machine to run Spark commands via Livy using the NLB. The script is as follows:

#!/usr/bin/env python3
# pylint: disable=invalid-name,no-member

"""
Verify Livy (REST) service using pylivy module, install req modules or use PEX:
  sudo yum install python3-devel
  sudo python3 -m pip install livy==0.8.0 requests==2.27.1 requests_gssapi==1.2.3
  https://pypi.org/project/requests-gssapi/

Kerberos authN implicitly uses TGT from kinit in users login env
"""

import shutil
import requests
from requests_gssapi import HTTPSPNEGOAuth
from livy import LivySession

# Disable ssl-warnings for self-signed certs when testing
requests.packages.urllib3.disable_warnings()

# Set the base URI(use FQDN and TLS) to target a Livy service
# Redefine remote_host to specify the remote Livy hostname to connect to
remote_host="mynlb-a0ac4b07f9f7f0a1.elb.us-west-2.amazonaws.com"
livy_uri = "https://" + remote_host + ":8998"

def livy_session():
    ''' Connect to Livy (REST) and run trivial pyspark command '''
    sconf = {"spark.master": "yarn", "spark.submit.deployMode": "cluster"}

    if shutil.which('kinit'):
        kauth = HTTPSPNEGOAuth()
        # Over-ride with an explicit user principal
        #kauth = HTTPSPNEGOAuth(principal="user1@EC2.INTERNAL")
        print("kinit found, kauth using Krb cache")
    else:
        kauth = None
        print("kinit NOT found, kauth set to None")

    with LivySession.create(livy_uri, verify=False, auth=kauth, spark_conf=sconf) as ls:
        ls.run("rdd = sc.parallelize(range(100), 2)")
        ls.run("rdd.take(3)")

    return 'LivySession complete'

def main():
    """ Starting point """
    livy_session()

if __name__ == '__main__':
    main()

The test case requires the new SPN to be in the user’s Kerberos ticket cache. To get the service principal into the Kerberos cache, use the kinit command with the -S option:

kinit -kt user1.keytab -S HTTP/<nlb-public-fqdn>@EC2.INTERNAL user1@EC2.INTERNAL

Note the SPN and the User Principal Name (UPN) are both used in the kinit command.

The Kerberos cache should look like the following code, as revealed by the klist command:

klist
Ticket cache: FILE:/tmp/krb5cc_1001
Default principal: user1@EC2.INTERNAL

Valid starting Expires Service principal
01/20/2022 01:38:06 01/20/2022 11:38:06 HTTP/<nlb-public-fqdn>@EC2.INTERNAL
renew until 01/21/2022 01:38:06

Note the HTTP service principal in the klist ticket cache output.

After the SPN is in the cache as verified by klist, you can run the following command to verify that Livy accepts the Kerberos ticket and runs the simple PySpark script. It generates a simple array, [0,1,2], as the output. The preceding Python script has been copied to the /var/tmp/user1/ folder in this example.

/var/tmp/user1/livy-verify.py
kinit found, kauth using Krb cache
[0, 1, 2]

It can take a minute or so to generate the result. Any authentication errors will happen in seconds. If the test in the new environment generates the preceding array, the Livy Kerberos configuration has been verified.

Any other client program that needs to have Livy access must be a Kerberos client of the KDC that generated the keytabs. It must also have a client keytab (such as user1.keytab or equivalent) and the service principal key in its Kerberos ticket cache.

In some environments, a simple kinit as follows may be sufficient:

kdestroy
kinit -kt user1.keytab user1@EC2.INTERNAL

Summary

If you have an existing EMR cluster running Livy and using Kerberos (even in a private subnet), you can add an NLB to connect to the Livy service and still authenticate with Kerberos. For simplicity, we used a cluster-dedicated KDC in this post, but you can use any KDC architecture option supported by Amazon EMR. This post documented all the KDC and Livy changes to make it work; the script and procedure have been run successfully in multiple environments. You can modify the Python script as needed and try running the verification script in your environment.

For more details about the systems and processes described in this post, refer to the Amazon EMR, Kerberos, and Apache Livy documentation.


About the Authors

John Benninghoff is an AWS Professional Services Sr. Data Architect, focused on data lake architecture and implementation.

Bharat Gamini is a Data Architect focused on Big Data & Analytics at Amazon Web Services. He helps customers architect and build highly scalable, robust and secure cloud-based analytical solutions on AWS. Besides family time, he likes watching movies and sports.

Real-time analytics with Amazon Redshift streaming ingestion

Post Syndicated from Sam Selvan original https://aws.amazon.com/blogs/big-data/real-time-analytics-with-amazon-redshift-streaming-ingestion/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL. Amazon Redshift offers up to three times better price performance than any other cloud data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day and power analytics workloads such as high-performance business intelligence (BI) reporting, dashboarding applications, data exploration, and real-time analytics.

We’re excited to launch Amazon Redshift streaming ingestion for Amazon Kinesis Data Streams, which enables you to ingest data directly from the Kinesis data stream without having to stage the data in Amazon Simple Storage Service (Amazon S3). Streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data into your Amazon Redshift cluster.

In this post, we walk through the steps to create a Kinesis data stream, generate and load streaming data, create a materialized view, and query the stream to visualize the results. We also discuss the benefits of streaming ingestion and common use cases.

The need for streaming ingestion

We hear from our customers that you want to evolve your analytics from batch to real time, and access your streaming data in your data warehouses with low latency and high throughput. You also want to enrich your real-time analytics by combining them with other data sources in your data warehouse.

Use cases for Amazon Redshift streaming ingestion center around working with data that is generated continually (streamed) and needs to be processed within a short period (latency) of its generation. Sources of data can vary, from IoT devices to system telemetry, utility service usage, geolocation of devices, and more.

Before the launch of streaming ingestion, if you wanted to ingest real-time data from Kinesis Data Streams, you needed to stage your data in Amazon S3 and use the COPY command to load your data. This usually involved latency in the order of minutes and needed data pipelines on top of the data loaded from the stream. Now, you can ingest data directly from the data stream.

Solution overview

Amazon Redshift streaming ingestion allows you to connect to Kinesis Data Streams directly, without the latency and complexity associated with staging the data in Amazon S3 and loading it into the cluster. You can now connect to and access the data from the stream using SQL and simplify your data pipelines by creating materialized views directly on top of the stream. The materialized views can also include SQL transforms as part of your ELT (extract, load and transform) pipeline.

After you define the materialized views, you can refresh them to query the most recent stream data. This means that you can perform downstream processing and transformations of streaming data using SQL at no additional cost and use your existing BI and analytics tools for real-time analytics.

Amazon Redshift streaming ingestion works by acting as a stream consumer. A materialized view is the landing area for data that is consumed from the stream. When the materialized view is refreshed, Amazon Redshift compute nodes allocate each data shard to a compute slice. Each slice consumes data from the allocated shards until the materialized view attains parity with the stream. The very first refresh of the materialized view fetches data from the TRIM_HORIZON of the stream. Subsequent refreshes read data from the last SEQUENCE_NUMBER of the previous refresh until it reaches parity with the stream data. The following diagram illustrates this workflow.

Setting up streaming ingestion in Amazon Redshift is a two-step process. You first need to create an external schema to map to Kinesis Data Streams and then create a materialized view to pull data from the stream. The materialized view must be incrementally maintainable.

Create a Kinesis data stream

First, you need to create a Kinesis data stream to receive the streaming data.

  1. On the Amazon Kinesis console, choose Data streams.
  2. Choose Create data stream.
  3. For Data stream name, enter ev_stream_data.
  4. For Capacity mode, select On-demand.
  5. Provide the remaining configurations as needed to create your data stream.

Generate streaming data with the Kinesis Data Generator

You can synthetically generate data in JSON format using the Amazon Kinesis Data Generator (KDG) utility and the following template:

{
    
   "_id" : "{{random.uuid}}",
   "clusterID": "{{random.number(
        {   "min":1,
            "max":50
        }
    )}}", 
    "connectionTime": "{{date.now("YYYY-MM-DD HH:mm:ss")}}",
    "kWhDelivered": "{{commerce.price}}",
    "stationID": "{{random.number(
        {   "min":1,
            "max":467
        }
    )}}",
      "spaceID": "{{random.word}}-{{random.number(
        {   "min":1,
            "max":20
        }
    )}}",
 
   "timezone": "America/Los_Angeles",
   "userID": "{{random.number(
        {   "min":1000,
            "max":500000
        }
    )}}"
}

The following screenshot shows the template on the KDG console.
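If you prefer not to use the KDG, a small boto3 producer can generate equivalent records directly into the stream. The following sketch mirrors the fields in the template above and is illustrative only.

import json
import random
import uuid
from datetime import datetime

import boto3

kinesis = boto3.client("kinesis")

def make_record():
    """Build one synthetic EV charging event matching the KDG template fields."""
    return {
        "_id": str(uuid.uuid4()),
        "clusterID": str(random.randint(1, 50)),
        "connectionTime": datetime.now().strftime("%Y-%m-%d %H:%M:%S"),
        "kWhDelivered": str(round(random.uniform(1, 100), 2)),
        "stationID": str(random.randint(1, 467)),
        "spaceID": f"space-{random.randint(1, 20)}",
        "timezone": "America/Los_Angeles",
        "userID": str(random.randint(1000, 500000)),
    }

for _ in range(100):
    record = make_record()
    kinesis.put_record(
        StreamName="ev_stream_data",
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["stationID"],
    )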

Load reference data

In the previous step, we showed you how to load synthetic data into the stream using the Kinesis Data Generator. In this section, you load reference data related to electric vehicle charging stations to the cluster.

Download the Plug-In EVerywhere Charging Station Network data from the City of Austin’s open data portal. Split the latitude and longitude values in the dataset and load it into a table with the following schema.

CREATE TABLE ev_station
  (
     siteid                INTEGER,
     station_name          VARCHAR(100),
     address_1             VARCHAR(100),
     address_2             VARCHAR(100),
     city                  VARCHAR(100),
     state                 VARCHAR(100),
     postal_code           VARCHAR(100),
     no_of_ports           SMALLINT,
     pricing_policy        VARCHAR(100),
     usage_access          VARCHAR(100),
     category              VARCHAR(100),
     subcategory           VARCHAR(100),
     port_1_connector_type VARCHAR(100),
     voltage               VARCHAR(100),
     port_2_connector_type VARCHAR(100),
     latitude              DECIMAL(10, 6),
     longitude             DECIMAL(10, 6),
     pricing               VARCHAR(100),
     power_select          VARCHAR(100)
  ) DISTSTYLE ALL

Create a materialized view

You can access your data from the data stream using SQL and simplify your data pipelines by creating materialized views directly on top of the stream. Complete the following steps:

  1. Create an external schema to map the data from Kinesis Data Streams to an Amazon Redshift object:
    CREATE EXTERNAL SCHEMA evdata FROM KINESIS
    IAM_ROLE 'arn:aws:iam::0123456789:role/redshift-streaming-role';

  2. Create an AWS Identity and Access Management (IAM) role (for the policy, see Getting started with streaming ingestion).

Now you can create a materialized view to consume the stream data. You can choose to use the SUPER datatype to store the payload as is in JSON format or use Amazon Redshift JSON functions to parse the JSON data into individual columns. For this post, we use the second method because the schema is well defined.

  1. Create the materialized view so it’s distributed on the UUID value from the stream and is sorted by the approximatearrivaltimestamp value:
    CREATE MATERIALIZED VIEW ev_station_data_extract DISTKEY(5) sortkey(1) AS
        SELECT approximatearrivaltimestamp,
        partitionkey,
        shardid,
        sequencenumber,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'_id')::character(36) as ID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'clusterID')::varchar(30) as clusterID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'connectionTime')::varchar(20) as connectionTime,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'kWhDelivered')::DECIMAL(10,2) as kWhDelivered,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'stationID')::DECIMAL(10,2) as stationID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'spaceID')::varchar(100) as spaceID,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'timezone')::varchar(30) as timezone,
        json_extract_path_text(from_varbyte(data, 'utf-8'),'userID')::varchar(30) as userID
        FROM evdata."ev_station_data";

  2. Refresh the materialized view:
    REFRESH MATERIALIZED VIEW ev_station_data_extract;

The materialized view doesn’t auto-refresh while in preview, so you should schedule a query in Amazon Redshift to refresh the materialized view once every minute. For instructions, refer to Scheduling SQL queries on your Amazon Redshift data warehouse.
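
The linked post uses the Amazon Redshift console scheduler; as one alternative, the following sketch shows an AWS Lambda-style handler that issues the refresh through the Amazon Redshift Data API, which you could invoke from an Amazon EventBridge rule that fires every minute. The cluster identifier, database, and user are placeholders:

import boto3

rsd = boto3.client("redshift-data")

def handler(event, context):
    # Kick off an asynchronous refresh of the streaming materialized view
    return rsd.execute_statement(
        ClusterIdentifier="your-redshift-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql="REFRESH MATERIALIZED VIEW ev_station_data_extract;",
    )["Id"]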

Query the stream

You can now query the refreshed materialized view to get usage statistics:

SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
,SUM(kWhDelivered) AS Energy_Consumed
,count(distinct userID) AS #Users
from ev_station_data_extract
group by to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS')
order by 1 desc;

The following table contains the results.

connectiontime energy_consumed #users
2022-02-27 23:52:07+00 72870 131
2022-02-27 23:52:06+00 510892 998
2022-02-27 23:52:05+00 461994 934
2022-02-27 23:52:04+00 540855 1064
2022-02-27 23:52:03+00 494818 999
2022-02-27 23:52:02+00 491586 1000
2022-02-27 23:52:01+00 499261 1000
2022-02-27 23:52:00+00 774286 1498
2022-02-27 23:51:59+00 505428 1000
2022-02-27 23:51:58+00 262413 500
2022-02-27 23:51:57+00 486567 1000
2022-02-27 23:51:56+00 477892 995
2022-02-27 23:51:55+00 591004 1173
2022-02-27 23:51:54+00 422243 823
2022-02-27 23:51:53+00 521112 1028
2022-02-27 23:51:52+00 240679 469
2022-02-27 23:51:51+00 547464 1104
2022-02-27 23:51:50+00 495332 993
2022-02-27 23:51:49+00 444154 898
2022-02-27 23:51:24+00 505007 998
2022-02-27 23:51:23+00 499133 999
2022-02-27 23:29:14+00 497747 997
2022-02-27 23:29:13+00 750031 1496

Next, you can join the materialized view with the reference data to analyze the charging station consumption data for the last 5 minutes and break it down by station category:

SELECT to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS') as connectiontime
,SUM(kWhDelivered) AS Energy_Consumed
,count(distinct userID) AS #Users
,st.category
from ev_station_data_extract ext
join ev_station st on
ext.stationID = st.siteid
where approximatearrivaltimestamp > current_timestamp -interval '5 minutes'
group by to_timestamp(connectionTime, 'YYYY-MM-DD HH24:MI:SS'),st.category
order by 1 desc, 2 desc

The following table contains the results.

connectiontime energy_consumed #users category
2022-02-27 23:55:34+00 188887 367 Workplace
2022-02-27 23:55:34+00 133424 261 Parking
2022-02-27 23:55:34+00 88446 195 Multifamily Commercial
2022-02-27 23:55:34+00 41082 81 Municipal
2022-02-27 23:55:34+00 13415 29 Education
2022-02-27 23:55:34+00 12917 24 Healthcare
2022-02-27 23:55:34+00 11147 19 Retail
2022-02-27 23:55:34+00 8281 14 Parks and Recreation
2022-02-27 23:55:34+00 5313 10 Hospitality
2022-02-27 23:54:45+00 146816 301 Workplace
2022-02-27 23:54:45+00 112381 216 Parking
2022-02-27 23:54:45+00 75727 144 Multifamily Commercial
2022-02-27 23:54:45+00 29604 55 Municipal
2022-02-27 23:54:45+00 13377 30 Education
2022-02-27 23:54:45+00 12069 26 Healthcare

Visualize the results

You can set up a simple visualization using Amazon QuickSight. For instructions, refer to Quick start: Create an Amazon QuickSight analysis with a single visual using sample data.

We create a dataset in QuickSight to join the materialized view with the charging station reference data.

Then, create a dashboard showing energy consumption and number of connected users over time. The dashboard also shows the list of locations on the map by category.

Streaming ingestion benefits

In this section, we discuss some of the benefits of streaming ingestion.

High throughput with low latency

Amazon Redshift can handle and process several gigabytes of data per second from Kinesis Data Streams. (Throughput is dependent on the number of shards in the data stream and the Amazon Redshift cluster configuration.) This allows you to experience low latency and high bandwidth when consuming streaming data, so you can derive insights from your data in seconds instead of minutes.

As we mentioned earlier, the key differentiator with the direct ingestion pull approach in Amazon Redshift is lower latency, which is in seconds. Contrast this to the approach of creating a process to consume the streaming data, staging the data in Amazon S3, and then running a COPY command to load the data into Amazon Redshift. This approach introduces latency in minutes due to the multiple steps involved in processing the data.

Straightforward setup

Getting started is easy. All the setup and configuration in Amazon Redshift uses SQL, which most cloud data warehouse users are already familiar with. You can get real-time insights in seconds without managing complex pipelines. Amazon Redshift with Kinesis Data Streams is fully managed, and you can run your streaming applications without requiring infrastructure management.

Increased productivity

You can perform rich analytics on streaming data within Amazon Redshift and using existing familiar SQL without needing to learn new skills or languages. You can create other materialized views, or views on materialized views, to do most of your ELT data pipeline transforms within Amazon Redshift using SQL.

Streaming ingestion use cases

With near-real-time analytics on streaming data, many use cases across industry verticals become possible. The following are just some of the many application use cases:

  • Improve the gaming experience – You can focus on in-game conversions, player retention, and optimizing the gaming experience by analyzing real-time data from gamers.
  • Analyze clickstream user data for online advertising – The average customer visits dozens of websites in a single session, yet marketers typically analyze only their own websites. You can analyze authorized clickstream data ingested into the warehouse to assess your customer’s footprint and behavior, and target ads to your customers just-in-time.
  • Real-time retail analytics on streaming POS data – You can access and visualize all your global point of sale (POS) retail sales transaction data for real-time analytics, reporting, and visualization.
  • Deliver real-time application insights – With the ability to access and analyze streaming data from your application log files and network logs, developers and engineers can conduct real-time troubleshooting of issues, deliver better products, and alert systems for preventative measures.
  • Analyze IoT data in real time – You can use Amazon Redshift streaming ingestion with Amazon Kinesis services for real-time applications such as device status and attributes such as location and sensor data, application monitoring, fraud detection, and live leaderboards. You can ingest streaming data using Kinesis Data Streams, process it using Amazon Kinesis Data Analytics, and emit the results to any data store or application using Kinesis Data Streams with low end-to-end latency.

Conclusion

This post showed how to create Amazon Redshift materialized views to ingest data from Kinesis data streams using Amazon Redshift streaming ingestion. With this new feature, you can easily build and maintain data pipelines to ingest and analyze streaming data with low latency and high throughput.

The streaming ingestion preview is now available in all AWS Regions where Amazon Redshift is available. To get started with Amazon Redshift streaming ingestion, provision an Amazon Redshift cluster on the current track and verify your cluster is running version 1.0.35480 or later.
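
One quick way to verify the cluster version is to run SELECT version() through the Amazon Redshift Data API, as in the following sketch; the cluster identifier, database, and user are placeholders:

import time

import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

resp = rsd.execute_statement(
    ClusterIdentifier="your-redshift-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT version();",
)

# Poll until the statement completes, then read the single-row result,
# which ends with the cluster patch version (for example, Redshift 1.0.35480)
while rsd.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
print(rsd.get_statement_result(Id=resp["Id"])["Records"][0][0]["stringValue"])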

For more information, refer to Streaming ingestion (preview), and check out the demo Real-time Analytics with Amazon Redshift Streaming Ingestion on YouTube.


About the Author

Sam Selvan is a Senior Analytics Solution Architect with Amazon Web Services.

Amazon EMR on Amazon EKS provides up to 61% lower costs and up to 68% performance improvement for Spark workloads

Post Syndicated from Melody Yang original https://aws.amazon.com/blogs/big-data/amazon-emr-on-amazon-eks-provides-up-to-61-lower-costs-and-up-to-68-performance-improvement-for-spark-workloads/

Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less.

In our benchmark tests using TPC-DS datasets at 3 TB scale, we observed that Amazon EMR on EKS provides up to 61% lower costs and up to 68% improved performance compared to running open-source Apache Spark on Amazon EKS via equivalent configurations. In this post, we walk through the performance test process, share the results, and discuss how to reproduce the benchmark. We also share a few techniques to optimize job performance that could lead to further cost-optimization for your Spark workloads.

How does Amazon EMR on EKS reduce cost and improve performance?

The EMR runtime for Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark. It’s enabled by default with Amazon EMR on EKS. It helps run Spark workloads faster, leading to lower running costs. It includes multiple performance optimization features, such as Adaptive Query Execution (AQE), dynamic partition pruning, flattening scalar subqueries, bloom filter join, and more.

In addition to the cost benefit brought by the EMR runtime for Spark, Amazon EMR on EKS can take advantage of other AWS features to further optimize cost. For example, you can run Amazon EMR on EKS jobs on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances, providing up to 90% cost savings when compared to On-Demand Instances. Also, Amazon EMR on EKS supports Arm-based Graviton EC2 instances, which provides a 15% performance improvement and up to 30% cost savings when comparing a Graviton2-based M6g instance to an M5 instance type.

The recent graceful executor decommissioning feature makes Amazon EMR on EKS workloads more robust by enabling Spark to anticipate Spot Instance interruptions. Without the need to recompute or rerun impacted Spark jobs, Amazon EMR on EKS can further reduce job costs via critical stability and performance improvements.

Additionally, through container technology, Amazon EMR on EKS offers more options to debug and monitor Spark jobs. For example, you can choose Spark History Server, Amazon CloudWatch, or Amazon Managed Prometheus and Amazon Managed Grafana (for more details, refer to the Monitoring and Logging workshop). Optionally, you can use familiar command line tools such as kubectl to interact with a job processing environment and observe Spark jobs in real time, which provides a fail-fast and productive development experience.

Amazon EMR on EKS supports multi-tenant needs and offers application-level security control via a job execution role. It enables seamless integrations to other AWS native services without a key-pair set up in Amazon EKS. The simplified security design can reduce your engineering overhead and lower the risk of data breach. Furthermore, Amazon EMR on EKS handles security and performance patches so you can focus on building your applications.

Benchmarking

This post provides an end-to-end Spark benchmark solution so you can get hands-on with the performance test process. The solution uses unmodified TPC-DS data schema and table relationships, but derives queries from TPC-DS to support the Spark SQL test case. It is not comparable to other published TPC-DS benchmark results.

Key concepts

Transaction Processing Performance Council-Decision Support (TPC-DS) is a decision support benchmark that is used to evaluate the analytical performance of big data technologies. Our test data is a TPC-DS compliant dataset based on the TPC-DS Standard Specification, Revision 2.4 document, which outlines the business model and data schema, relationship, and more. As the whitepaper illustrates, the test data contains 7 fact tables and 17 dimension tables, with an average of 18 columns. The schema consists of essential retailer business information, such as customer, order, and item data for the classic sales channels: store, catalog, and internet. This source data is designed to represent real-world business scenarios with common data skews, such as seasonal sales and frequent names. Additionally, the TPC-DS benchmark offers a set of discrete scaling points (scale factors) based on the approximate size of the raw data. In our test, we chose the 3 TB scale factor, which produces 17.7 billion records, approximately 924 GB compressed data in Parquet file format.

Test approach

A single test session consists of 104 Spark SQL queries that were run sequentially. To get a fair comparison, each session of different deployment types, such as Amazon EMR on EKS, was run three times. The average runtime per query from these three iterations is what we analyze and discuss in this post. Most importantly, it derives two summarized metrics to represent our Spark performance:

  • Total execution time – The sum of the average runtime from three iterations
  • Geomean – The geometric mean of the average runtime
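
For reference, both metrics are straightforward to compute from the per-query averages. The following Python sketch uses a handful of illustrative values rather than the full 104-query result set (statistics.geometric_mean requires Python 3.8 or later):

from statistics import geometric_mean

# Average runtime per TPC-DS query over three iterations, in seconds (illustrative values)
avg_runtime_s = {"q1": 12.4, "q67": 150.02, "q72": 26.86}

total_execution_time = sum(avg_runtime_s.values())
geomean = geometric_mean(avg_runtime_s.values())
print(f"Total execution time: {total_execution_time:.2f} s, geomean: {geomean:.2f} s")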

Test results

In the test result summary (see the following figure), we discovered that the Amazon EMR-optimized Spark runtime used by Amazon EMR on EKS is approximately 2.1 times faster than open-source Spark on Amazon EKS by geometric mean and 3.5 times faster by total runtime.

The following figure breaks down the performance summary by query. We observed that EMR runtime for Spark was faster in every query compared to open-source Spark. Query q67 was the longest query in the performance test. The average runtime with open-source Spark was 1019.09 seconds. However, it took 150.02 seconds with Amazon EMR on EKS, which is 6.8 times faster. The highest performance gain in these long-running queries was q72—319.70 seconds (open-source Spark) vs. 26.86 seconds (Amazon EMR on EKS), an 11.9 times improvement.

Test cost

Amazon EMR pricing on Amazon EKS is calculated based on the vCPU and memory resources used from the time you start to download your EMR application Docker image until the Amazon EKS pod terminates. As a result, you don’t pay any Amazon EMR charges until your application starts to run, and you only pay for the vCPU and memory used during a job—you don’t pay for the full amount of compute resources in an EC2 instance.

Overall, the estimated benchmark cost in the US East (N. Virginia) Region is $22.37 per run for open-source Spark on Amazon EKS and $8.70 per run for Amazon EMR on EKS – that’s 61% cheaper due to the 68% quicker job runtime. The following table provides more details.

Benchmark Job Runtime (Hour) Estimated Cost Total EC2 Instances Total vCPU Total Memory (GiB) Root Device (EBS)
Amazon EMR on EKS 0.68 $8.70 6 216 432 20 GiB gp2
Open-Source Spark on Amazon EKS 2.13 $22.37 6 216 432 20 GiB gp2
Amazon EMR on Amazon EC2 (1 primary and 5 core nodes) 0.80 $8.80 6 196 424 20 GiB gp2

The cost estimate doesn’t account for Amazon Simple Storage Service (Amazon S3) storage, or PUT and GET requests. The Amazon EMR on EKS uplift calculation is based on the hourly billing information provided by AWS Cost Explorer.

Cost breakdown

The following is the cost breakdown for the Amazon EMR on EKS job ($8.70): 

  • Total uplift on vCPU – (126.93 * $0.01012) = (total number of vCPU used * per vCPU-hours rate) = $1.28
  • Total uplift on memory – (258.7 * $0.00111125) = (total amount of memory used * per GB-hours rate) = $0.29
  • Total Amazon EMR uplift cost – $1.57
  • Total Amazon EC2 cost – (6 * $1.728 * 0.68) = (number of instances * c5d.9xlarge hourly rate * job runtime in hour) = $7.05
  • Other costs – ($0.1 * 0.68) + ($0.1/730 * 20 * 6 * 0.68) = (shared Amazon EKS cluster charge per hour * job runtime in hour) + (EBS per GB-hourly rate * root EBS size * number of instances * job runtime in hour) = $0.08 
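
The same arithmetic can be scripted if you want to rerun the estimate with your own runtimes; the rates below are the us-east-1 figures used above:

# Reproduce the Amazon EMR on EKS cost estimate from the figures above
vcpu_hours = 126.93
memory_gb_hours = 258.7
emr_vcpu_rate = 0.01012        # EMR uplift per vCPU-hour
emr_memory_rate = 0.00111125   # EMR uplift per GB-hour
ec2_hourly = 1.728             # c5d.9xlarge On-Demand rate
instances = 6
runtime_h = 0.68
ebs_gb = 20

emr_uplift = vcpu_hours * emr_vcpu_rate + memory_gb_hours * emr_memory_rate
ec2_cost = instances * ec2_hourly * runtime_h
other = 0.1 * runtime_h + (0.1 / 730) * ebs_gb * instances * runtime_h
print(round(emr_uplift + ec2_cost + other, 2))  # approximately 8.70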

The following is the cost breakdown for the open-source Spark on Amazon EKS job ($22.37): 

  • Total Amazon EC2 cost – (6 * $1.728 * 2.13) = (number of instances * c5d.9xlarge hourly rate * job runtime in hour) = $22.12
  • Other costs – ($0.1 * 2.13) + ($0.1/730 * 20 * 6 * 2.13) = (shared EKS cluster charge per hour * job runtime in hour) + (EBS per GB-hourly rate * root EBS size * number of instances * job runtime in hour) = $0.25

The following is the cost breakdown for the Amazon EMR on Amazon EC2 ($8.80):

  • Total Amazon EMR cost – (5 * $0.27 * 0.80) + (1 * $0.192 * 0.80) = (number of core nodes * c5d.9xlarge Amazon EMR price * job runtime in hour) + (number of primary nodes * m5.4xlarge Amazon EMR price * job runtime in hour) = $1.23
  • Total Amazon EC2 cost – (5 * $1.728 * 0.80) + (1 * $0.768 * 0.80) = (number of core nodes * c5d.9xlarge instance price * job runtime in hour) + (number of primary nodes * m5.4xlarge instance price * job runtime in hour) = $7.53
  • Other costs – ($0.1/730 * 20 GiB * 6 * 0.80) + ($0.1/730 * 256 GiB * 1 * 0.80) = (EBS per GB-hourly rate * root EBS size * number of instances * job runtime in hour) + (EBS per GB-hourly rate * default EBS size for m5.4xlarge * number of instances * job runtime in hour) = $0.041

Benchmarking considerations

In this section, we share some techniques and considerations for the benchmarking.

Set up an Amazon EKS cluster with Availability Zone awareness

Our Amazon EKS cluster configuration looks as follows:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $EKSCLUSTER_NAME
  region: us-east-1
availabilityZones: ["us-east-1a", "us-east-1b"]
managedNodeGroups: 
  - name: mn-od
    availabilityZones: ["us-east-1b"] 

In the cluster configuration, the mn-od managed node group is pinned to the single Availability Zone us-east-1b, where we run the test.

Availability Zones are physically separated by a meaningful distance from other Availability Zones in the same AWS Region. This produces round trip latency between two compute instances located in different Availability Zones. Spark implements distributed computing, so exchanging data between compute nodes is inevitable when performing data joins, windowing, and aggregations across multiple executors. Shuffling data between multiple Availability Zones adds extra latency to the network I/O, which therefore directly impacts Spark performance. Additionally, when data is transferred between two Availability Zones, data transfer charges apply in both directions.

For this benchmark, which is a time-sensitive workload, we recommend running in a single Availability Zone and using On-Demand instances (not Spot) to have a dedicated compute resource. In an existing Amazon EKS cluster, you may have multiple instance types and a Multi-AZ setup. You can use the following Spark configuration to achieve the same goal:

--conf spark.kubernetes.node.selector.eks.amazonaws.com/capacityType=ON_DEMAND
--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone=us-east-1b

Use instance store volume to increase disk I/O

Spark data shuffle, the process of reading and writing intermediate data to disk, is a costly operation. Besides network I/O speed, Spark demands high-performance disks to support a large amount of data redistribution activity. I/O operations per second (IOPS) is an equally important measure to baseline Spark performance. For instance, the SQL queries 23a, 23b, 50, and 93 are shuffle-intensive Spark workloads in TPC-DS, so choosing an optimal storage strategy can significantly shorten their runtime. Generally speaking, the recommended options are either attaching multiple EBS volumes per node in Amazon EKS or using the d series EC2 instance types, which offer high disk I/O performance within a compute family (for example, c5d.9xlarge is the d series in the c5 compute optimized family).

The following table summarizes the hardware specification we used:

Instance On-Demand Hourly Price vCPU Memory (GiB) Instance Store Networking Performance (Gbps) 100% Random Read IOPS Write IOPS
c5d.9xlarge $1.73 36 72 1 x 900GB NVMe SSD 10 350,000 170,000

To simplify our hardware configuration, we chose the AWS Nitro System EC2 instance type c5d.9xlarge, which comes with an NVMe-based SSD instance store volume. As of this writing, the built-in NVMe SSD disk requires less effort to set up and provides the optimal disk performance we need. In the following code, the one-off preBootstrapCommands section is triggered to mount the instance store on a node in Amazon EKS:

managedNodeGroups: 
  - name: mn-od
    preBootstrapCommands:
      - "sleep 5; sudo mkfs.xfs /dev/nvme1n1;sudo mkdir -p /local1;sudo echo /dev/nvme1n1 /local1 xfs defaults,noatime 1 2 >> /etc/fstab"
      - "sudo mount -a"
      - "sudo chown ec2-user:ec2-user /local1"

Run as a predefined job user, not a root user

For security, it’s not recommended to run Spark jobs as a root user. But how can you access the NVMe SSD volume mounted to the Amazon EKS cluster as a non-root Spark user?

An init container is created for each Spark driver and executor pod in order to set the volume permissions and control data access. Let's check out the Spark driver pod via the kubectl exec command, which lets us open an interactive session inside the running container. We can observe the following:

  • The init container is called volume-permission.
  • The SSD disk is called /ossdata1. The Spark driver has stored some data to the disk.
  • The non-root Spark job user is called hadoop.

This configuration is provided in the form of a pod template file for Amazon EMR on EKS, so you can dynamically tailor job pods when the Spark configuration doesn't cover your needs. Be aware that the predefined user's UID in the EMR runtime for Spark is 999, but it's set to 1000 in open-source Spark. The following is a sample Amazon EMR on EKS driver pod template:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    app: sparktest
  volumes:
    - name: spark-local-dir-1
      hostPath:
        path: /local1  
  initContainers:  
  - name: volume-permission
    image: public.ecr.aws/y4g4v0z7/busybox
    # grant volume access to "hadoop" user with uid 999
    command: ['sh', '-c', 'mkdir /data1; chown -R 999:1000 /data1'] 
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /data1
  containers:
  - name: spark-kubernetes-driver
    volumeMounts:
      - name: spark-local-dir-1
        mountPath: /data1

In the job submission, we map the pod templates via the Spark configuration:

"spark.kubernetes.driver.podTemplateFile": "s3://'$S3BUCKET'/pod-template/driver-pod-template.yaml",
"spark.kubernetes.executor.podTemplateFile": "s3://'$S3BUCKET'/pod-template/executor-pod-template.yaml",

Spark on k8s operator is a popular tool to deploy Spark on Kubernetes. Our open-source Spark benchmark uses the tool to submit the job to Amazon EKS. However, the Spark operator currently doesn’t support file-based pod template customization, due to the way it operates. So we embed the disk permission setup into the job definition, as in the example on GitHub.

Disable dynamic resource allocation and enable Adaptive Query Execution in your application

Spark provides a mechanism to dynamically adjust compute resources based on workload. This feature is called dynamic resource allocation. It provides flexibility and efficiency to manage compute resources. For example, your application may give resources back to the cluster if they're no longer used, and may request them again later when there is demand. It's quite useful when your data traffic is unpredictable and an elastic compute strategy is needed at your application level. For this benchmark, our source data volume (3 TB) is fixed and the jobs were run on a fixed-size Spark cluster that consists of six EC2 instances. You can turn off dynamic allocation in EMR on EC2 as shown in the following code, because it doesn't suit our purpose and might add latency to the test result. The remaining Spark deployment options, such as Amazon EMR on EKS, have dynamic allocation off by default, so we can ignore these settings.

--conf spark.dynamicAllocation.enabled=false 
--conf spark.shuffle.service.enabled=false

Dynamic resource allocation is a different concept from automatic scaling in Amazon EKS, such as the Cluster Autoscaler. Disabling the dynamic allocation feature only fixes our 6-node Spark cluster size per job, but doesn’t stop the Amazon EKS cluster from expanding or shrinking automatically. It means our Amazon EKS cluster is still able to scale between 1 and 30 EC2 instances, as configured in the following code:

managedNodeGroups: 
  - name: mn-od
    availabilityZones: ["us-east-1b"] 
    instanceType: c5d.9xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 30

Spark Adaptive Query Execution (AQE) is an optimization technique in Spark SQL since Spark 3.0. It dynamically re-optimizes the query execution plan at runtime, which supports a variety of optimizations, such as the following:

  • Dynamically switch join strategies
  • Dynamically coalesce shuffle partitions
  • Dynamically handle skew joins

The feature is enabled by default in EMR runtime for Spark, but disabled by default in open-source Apache Spark 3.1.2. To provide the fair comparison, make sure it’s set in the open-source Spark benchmark job declaration:

  sparkConf:
    # Enable AQE
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.localShuffleReader.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"

Walkthrough overview

With these considerations in mind, we run three Spark jobs in Amazon EKS. This helps us compare Spark 3.1.2 performance in various deployment scenarios. For more details, check out the GitHub repository.

In this walkthrough, we show you how to do the following:

  • Produce a 3 TB TPC-DS compliant dataset
  • Run a benchmark job with the open-source Spark operator on Amazon EKS
  • Run the same benchmark application with Amazon EMR on EKS

We also provide information on how to benchmark with Amazon EMR on Amazon EC2.

Prerequisites

Install the following tools for the benchmark test:

Provision resources

The provision script creates the following resources:

  • A new Amazon EKS cluster
  • Amazon EMR on EKS enabler
  • The required AWS Identity and Access Management (IAM) roles
  • The S3 bucket emr-on-eks-nvme-${ACCOUNTID}-${AWS_REGION}, referred to as <S3BUCKET> in the following steps

The provisioning process takes approximately 30 minutes.

  1. Download the project with the following command:
    git clone https://github.com/aws-samples/emr-on-eks-bencharmk.git
    cd emr-on-eks-bencharmk

  2. Create a test environment (change the Region if necessary):
    export EKSCLUSTER_NAME=eks-nvme
    export AWS_REGION=us-east-1
    
    ./provision.sh

Modify the script if needed for testing against an existing Amazon EKS cluster. Make sure the existing cluster has the Cluster Autoscaler and Spark Operator installed. Examples are provided by the script.

  1. Validate the setup:
    # should return results
    kubectl get pod -n oss | grep spark-operator
    kubectl get pod -n kube-system | grep nodescaler

Generate TPC-DS test data (optional)

In this optional task, you generate TPC-DS test data in s3://<S3BUCKET>/BLOG_TPCDS-TEST-3T-partitioned. The process takes approximately 80 minutes.

The job generates TPC-DS compliant datasets with your preferred scale. In this case, it creates 3 TB of source data (approximately 924 GB compressed) in Parquet format. We have pre-populated the source dataset in the S3 bucket blogpost-sparkoneks-us-east-1 in Region us-east-1. You can skip the data generation job if you want to have a quick start.

Be aware that cross-Region data transfer latency will impact your benchmark result. We recommend generating the source data in your own S3 bucket if your test Region is different from us-east-1.

  1. Start the job:
    kubectl apply -f examples/tpcds-data-generation.yaml

  2. Monitor the job progress:
    kubectl get pod -n oss
    kubectl logs tpcds-data-generation-3t-driver -n oss

  3. Cancel the job if needed:
    kubectl delete -f examples/tpcds-data-generation.yaml

The job runs in the namespace oss with a service account called oss in Amazon EKS, which grants the minimum permissions needed to access the S3 bucket via an IAM role. Update the example .yaml file if you have a different setup in Amazon EKS.

Benchmark for open-source Spark on Amazon EKS

Wait until the data generation job is complete, then update the default input location parameter (s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned) in the tpcds-benchmark.yaml file to your S3 bucket. You can also adjust other parameters in the application; check out the comments in the YAML file for details. This process takes approximately 130 minutes.

If the data generation job is skipped, run the following steps without waiting.

  1. Start the job:
    kubectl apply -f examples/tpcds-benchmark.yaml

  2. Monitor the job progress:
    kubectl get pod -n oss
    kubectl logs tpcds-benchmark-oss-driver -n oss

  3. Cancel the job if needed:
    kubectl delete -f examples/tpcds-benchmark.yaml

The benchmark application outputs a CSV file capturing runtime per Spark SQL query and a JSON file with query execution plan details. You can use the collected metrics and execution plans to compare and analyze performance between different Spark runtimes (open-source Apache Spark vs. EMR runtime for Spark).

Benchmark with Amazon EMR on EKS

Wait for the data generation job to finish before starting the benchmark for Amazon EMR on EKS. Don't forget to change the input location (s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned) to your S3 bucket. The output location is s3://<S3BUCKET>/EMRONEKS_TPCDS-TEST-3T-RESULT. If you use the pre-populated TPC-DS dataset, start the Amazon EMR on EKS benchmark without waiting. This process takes approximately 40 minutes.

  1. Start the job (change the Region if necessary):
    export EMRCLUSTER_NAME=emr-on-eks-nvme
    export AWS_REGION=us-east-1
    
    ./examples/emr6.5-benchmark.sh

Amazon EKS offers multi-tenant isolation and optimized resource allocation features, so it’s safe to run two benchmark tests in parallel on a single Amazon EKS cluster.

  1. Monitor the job progress in real time:
    kubectl get pod -n emr
    #run the command then search "execution time" in the log to analyze individual query's performance
    kubectl logs YOUR_DRIVER_POD_NAME -n emr spark-kubernetes-driver

  2. Cancel the job (get the IDs from the cluster list on the Amazon EMR console):
    aws emr-containers cancel-job-run --virtual-cluster-id <YOUR_VIRTUAL_CLUSTER_ID> --id <YOUR_JOB_ID>
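
If you prefer to look up the virtual cluster and job run IDs programmatically instead of on the console, a boto3 sketch such as the following can help (the Region is assumed to match the walkthrough):

import boto3

emrc = boto3.client("emr-containers", region_name="us-east-1")

# Find running virtual clusters, then list their job runs
for vc in emrc.list_virtual_clusters(states=["RUNNING"])["virtualClusters"]:
    print(vc["id"], vc["name"])
    for run in emrc.list_job_runs(virtualClusterId=vc["id"])["jobRuns"]:
        print("  ", run["id"], run["state"], run.get("name"))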

The following are additional useful commands:

#Check volume status
kubectl exec -it YOUR_DRIVER_POD_NAME -c spark-kubernetes-driver -n emr -- df -h

#Login to a running driver pod
kubectl exec -it YOUR_DRIVER_POD_NAME -c spark-kubernetes-driver -n emr -- bash

#Monitor compute resource usage
watch "kubectl top node"

Benchmark for Amazon EMR on Amazon EC2

Optionally, you can use the same benchmark solution to test Amazon EMR on Amazon EC2. Download the benchmark utility application JAR file from a running Kubernetes container, then submit a job via the Amazon EMR console. More details are available in the GitHub repository.

Clean up

To avoid incurring future charges, delete the resources generated if you don’t need the solution anymore. Run the following cleanup script (change the Region if necessary):

cd emr-on-eks-bencharmk

export EKSCLUSTER_NAME=eks-nvme
export AWS_REGION=us-east-1

./deprovision.sh

Conclusion

Without making any application changes, we can run Apache Spark workloads faster and cheaper with Amazon EMR on EKS when compared to Apache Spark on Amazon EKS. We used a benchmark solution running on a 6-node c5d.9xlarge Amazon EKS cluster and queried a TPC-DS dataset at 3 TB scale. The performance test result shows that Amazon EMR on EKS provides up to 61% lower costs and up to 68% performance improvement over open-source Spark 3.1.2 on Amazon EKS.

If you’re wondering how much performance gain you can achieve with your use case, try out the benchmark solution or the EMR on EKS Workshop. You can also contact your AWS Solution Architects, who can be of assistance alongside your innovation journey.


About the Authors

Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.

Kinnar Kumar Sen is a Sr. Solutions Architect at Amazon Web Services (AWS) focusing on Flexible Compute. As a part of the EC2 Flexible Compute team, he works with customers to guide them to the most elastic and efficient compute options that are suitable for their workload running on AWS. Kinnar has more than 15 years of industry experience working in research, consultancy, engineering, and architecture.

Modernize your healthcare clinical quality data repositories with Amazon Redshift Data Vault

Post Syndicated from Sarathi Balakrishnan original https://aws.amazon.com/blogs/big-data/modernize-your-healthcare-clinical-quality-data-repositories-with-amazon-redshift-data-vault/

With the shift to value-based care, tapping data to its fullest to improve patient satisfaction tops the priority of healthcare executives everywhere. To achieve this, reliance on key technology relevant to their sphere is a must. This is where data lakes can help. A data lake is an architecture that can assist providers in storing, sharing, and utilizing electronic health records and other patient data. Healthcare organizations are already utilizing data lakes to unite heterogeneous data from across hospital systems. The use of data lakes has aided in the effective transformation of the organizational culture, resulting in a data-driven approach to resolving issues. Healthcare clinical quality metric systems stream massive amounts of structured and semi-structured data into a data lake in real time from various sources. However, storing structured data in a data lake and applying schema-on-read leads to quality issues and adds complexity to the simple structured data process.

In this post, we demonstrate how to modernize your clinical quality data repositories with Amazon Redshift Data Vault.

How healthcare systems use data lakes

Data lakes are popular in healthcare systems because they allow the integration and exploratory analysis of diverse data. Data lakes comprise multiple data assets stored within a Hadoop ecosystem with minimal alterations to the original data format (or file). Because of this, the data lake lacks schema-on-write functionality. In data lakes, information is accessed using a methodology called schema-on-read. On the other hand, a data warehouse is a subject-oriented, time-variant, and centrally controlled database system designed for the management of mixed workloads and large queries. The schema of the data warehouse is predefined, and the data written to it must conform to that schema (schema-on-write). It's perfect for structured data storage.

Because the healthcare industry wants to preserve the data in raw format for compliance and regulatory requirements, data lakes are widely used in clinical quality tracking systems even though the incoming data are structured and semi-structured. For business and analytical use cases, the data in a data lake is further converted into a dimensional schema and loaded into a data warehouse. Data warehouses require several months of modeling, mapping, ETL (extract, transform, and load) planning and creation, and testing. The result is consistent, validated data for reporting and analytics.

The following diagram shows a typical clinical quality repository in a current healthcare system.

Data lakes offer information in its raw form, as well as a specialized means to obtaining data that applies the schema-on-read. In general, data lake users are experienced analysts familiar with data wrangling techniques that apply schema-on-read or understand data content from unstructured formats. Business users and data visualization builders struggle in the absence of powerful purpose-built search capabilities and data extraction algorithms. That isn’t to say data lakes are devoid of metadata or rules controlling their use, security, or administration. It’s the exact opposite. A successful data lake structures its data in such a way that it promotes better and more efficient access, and reuses data management methods or provides new tools that increase search and general knowledge of the data content.

Challenges in the data lake

However, data modeling shouldn’t be disregarded while employing the schema-on-read approach to handle data. A lack of meaningful data structure might lead to data quality problems, integration issues, performance issues, and deviations from organizational goals. Storing structured data in a data lake and then applying schema-on-read generates problems with data quality and complicates a basic structured data operation. You can also keep these structured datasets in data warehouses by skipping the data lake, but this increases the complexity of scaling, data transformation, and loading (ETL). But data warehouse ETL jobs apply a large number of business rules and heavily transform the data from its raw form to make the data fit one or more business purposes.

Dimensionally modeled data warehouses sit close to business applications but far from the source. This brings new challenges, such as reverse tracking the data to its source to identify potential data errors and keeping the original source for regulatory validation. In clinical quality reporting and regulatory reporting, accuracy is key. To provide high accuracy and data quality, developers often reverse track the data to its original source. This is very complex and sometimes not possible in a data warehouse model because the data loses its raw form. Compared to data warehouses, data lakes perform poorly in terms of dataset identification, processing, and catalog management. It's complex to securely share data in a data lake between accounts. Cloud data lakes show improvement in this area, but some human intervention is still needed to make sure the data catalogs and security match healthcare requirements and standards.

Design and build a Data Vault model in Amazon Redshift

Amazon Redshift is a relational database system based on the PostgreSQL standard. It’s designed to efficiently run online analytical processing (OLAP) queries on petabyte-scale data. It also offers other query-handling efficiencies such as parallel processing, columnar database design, column data compression, a query optimizer, and compiled query code. A cluster is at the heart of every Amazon Redshift deployment. Each cluster consists of a leader node and one or more compute nodes, all of which are linked by high-speed networks.

Amazon Redshift RA3 instances with managed storage allow you to choose the number of nodes based on your performance requirements. The RA3 instance type can scale your data warehouse storage capacity automatically without manual intervention, and without needing to add additional compute resources. The number of nodes you choose is determined by the quantity of your data and the query processing performance you want.

The following diagram illustrates the Amazon Redshift RA3 instance architecture.

The interface to your business intelligence application is provided by a leader node. It provides conventional PostgreSQL JDBC/ODBC query and response interfaces. It acts as a traffic cop, routing requests from client applications to the proper compute nodes and managing the results that are delivered. It also distributes ingested data to computing nodes so that your databases may be built.

Amazon Redshift features like streaming, RA3’s near-limitless storage, Amazon Redshift ML, the SUPER data type, automatic table optimization, materialized views, and Amazon Redshift Spectrum open up the possibility of modernizing healthcare clinical quality data repositories and standardizing the data model for fast access and high accuracy. These features make RA3 instances the perfect candidate for your clinical quality repository. Additionally, the data sharing capability in Amazon Redshift keeps a single source for truth across the organization and removes the need for data silos. It’s tightly coupled with other AWS services, which allows you to take advantage of the AWS Cloud ecosystem and get the most out of your data with less effort.

Standardizing data modeling has two benefits: it allows you to reuse technical and organizational procedures (for example, ETL and project management), and you can readily merge data from diverse contexts without much complication. Because storage is separated from the computing, the Amazon Redshift RA3 instance retains the classic data lake feature of storing huge amounts of data. On the data side, to keep the data close to its raw form, a Data Vault model is the ideal solution to this problem.

Data Vault is a methodology to speed up the development of data warehouse initiatives and keep the data close to its sources. The Data Vault 2.0 data modeling standards are popular because they stress the business keys by hashing the keys across entities. This eliminates the need for strong relationships between entities and their linkages within the delivery of business processes. Data Vault makes it easy to integrate with versatile data sources for completely different use cases than it’s originally designed for, due to the following features:

  • Entities can stand alone and be built on patterns, each with a well-defined purpose.
  • Because the data represents the source system and delivers business value, data silos are eliminated.
  • You can load data in parallel with the least number of dependencies.
  • Historized data is kept at the coarsest level of granularity possible.
  • You can implement flexible business rules independently of data loading. This reduces the time taken to load the data by 25%.
  • You can add new data sources without affecting the current model.
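
As a small illustration of the business-key hashing mentioned above, the following Python sketch shows one common convention (normalize, concatenate, hash). The entities and keys are hypothetical, and Data Vault 2.0 implementations vary in their choice of hash function and delimiter:

import hashlib

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic Data Vault hash key from one or more business keys."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Hub key for a patient, and a link key joining a patient and an encounter
patient_hk = hash_key("MRN-000123")
patient_encounter_lk = hash_key("MRN-000123", "ENC-987654")
print(patient_hk, patient_encounter_lk)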

The following diagram shows a simple clinical quality data repository with the Data Vault model with Amazon Redshift.

The following diagram shows a simple clinical quality business Data Vault model using Amazon Redshift materialized views.

The Data Vault architecture is divided into four stages:

  • Staging – A copy of the most recent modifications to data from the source systems is created. This layer doesn’t save history, and you can perform numerous alterations to the staged data while populating, such as data type changes or scaling, character set conversion, and the inclusion of metadata columns to enable subsequent processing.
  • Raw vault – This stores a history of all data from numerous source systems. Except for putting the data into source system independent targets, no filters or business transformations happen here.
  • Business vault – This is an optional offering that is frequently developed. It includes business computations and denormalizations with the goal of enhancing the speed and ease of access inside the consumption layer, known as the information mart layer.
  • Data mart layer – This is where data is most typically accessible by users, such as reporting dashboards or extracts. You can build multiple marts from a single Data Vault integration layer, and the most frequent data modeling choice for these marts is Star/Kimball schemas.

The Amazon Redshift RA3 instance capabilities we discussed enable the development of highly performant and cost-effective Data Vault solutions. The staging and raw layers, for example, are populated in micro-batches by one Amazon Redshift cluster 24 hours a day. You can build the business Data Vault layer once a day and pause it to save costs when completed, and any number of consumer Amazon Redshift clusters can access the results.

The following diagram shows the complete architecture of the clinical quality Data Vault.

The architecture has the following components:

  • Batched raw data is loaded into the staging database raw schema.
  • The data quality validated structured data is delivered immediately to the staging database curated schema.
  • Data from devices, EHR systems, and vendor medical systems are also sent directly into the Amazon Redshift staging database.
  • The Data Vault 2.0 approach is used in the Amazon Redshift producer cluster to standardize data structure and assure data integrity.
  • Materialized views make a business Data Vault.
  • The Amazon Redshift SUPER data type allows you to store and query semi-structured data.
  • Amazon Redshift data share capabilities securely exchange the Data Vault tables and views between departments.
  • Support for streaming ingestion in Amazon Redshift eliminates the need to stage data in Amazon S3 before ingesting it into Amazon Redshift.
  • Amazon Redshift allows you to offload the historical data to Amazon S3. Healthcare clinical quality tracking systems usually offload data older than 6–7 years old for archival.
  • Amazon S3 and Amazon Redshift are HIPAA eligible.

Advantages of the Amazon Redshift clinical data repository

A clinical data repository in Amazon Redshift has the following benefits:

  • Data Vault enables the preservation of data as raw in its native form.
  • Amazon Redshift RA3 provides limitless storage, a real-time data feed, structured and semi-structured data, and more.
  • The new streaming functionality and analytics capacity in Amazon Redshift provide near-real-time analysis without the need for extra services.
  • Data quality is assured while maintaining data close to the business and source. You can reduce storage costs while maintaining data fidelity.
  • It’s compatible with business applications and allows for easy adoption when merging or expanding a variety of businesses without complexity.
  • The decoupled storage and compute option in Amazon Redshift allows for cost savings and offers ML capabilities within Amazon Redshift for quicker prediction, forecasting, and anomaly detection.

Conclusion

Data Vault is a one-of-a-kind database platform that promotes openness. It’s quick to adopt, infinitely configurable, and ready to overcome the most demanding data difficulties in current clinical quality metric tracking systems. Combined with the advantages of Amazon Redshift RA3 instances, Data Vault can outperform and deliver more value for cost and time.

To get started, complete the following steps:

  • Create an Amazon Redshift RA3 instance for the primary clinical data repository and the data marts.
  • Build a Data Vault schema for the raw vault and create materialized views for the business vault.
  • Enable Amazon Redshift data sharing to share data between the producer and consumer clusters (see the sketch after this list).
  • Load the structured and semi-structured data into the producer cluster Data Vault for business use.
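
The following is a minimal sketch of the data sharing step, issued from the producer cluster through the Amazon Redshift Data API with boto3; the share name, schema, cluster identifier, database user, and consumer namespace GUID are all placeholders, and the exact grants depend on which Data Vault schemas you want to expose:

import time

import boto3

rsd = boto3.client("redshift-data")

def run(sql):
    # Run one statement and wait for it to finish before issuing the next
    stmt = rsd.execute_statement(
        ClusterIdentifier="producer-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=sql,
    )
    while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

for sql in [
    "CREATE DATASHARE clinical_quality_share;",
    "ALTER DATASHARE clinical_quality_share ADD SCHEMA raw_vault;",
    "ALTER DATASHARE clinical_quality_share ADD ALL TABLES IN SCHEMA raw_vault;",
    "GRANT USAGE ON DATASHARE clinical_quality_share TO NAMESPACE '11111111-2222-3333-4444-555555555555';",
]:
    run(sql)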

About the Authors

Sarathi Balakrishnan is the Global Partner Solutions Architect, specializing in Data, Analytics and AI/ML at AWS. He works closely with AWS partners globally to build solutions and platforms on AWS to accelerate customers’ business outcomes with state-of-the-art cloud technologies and achieve more in their cloud explorations. He helps with solution architecture, technical guidance, and best practices to build cloud-native solutions. He joined AWS with over 20 years of large enterprise experience in the agriculture, insurance, healthcare and life science, marketing, and advertising industries to develop and implement data and AI strategies.

Kasi Muthu is a Senior Partner Solutions Architect based in Houston, TX. He is a trusted advisor in Data, Analytics, AI/ML space. He also helps customers migrate their workloads to AWS with focus on scalability, resiliency, performance and sustainability. Outside of work, he enjoys spending time with his family and spending quite a bit of his free time on YouTube.

Email Routing Insights

Post Syndicated from Joao Sousa Botto original https://blog.cloudflare.com/email-routing-insights/

Have you ever wanted to try a new email service but worried that you might miss emails along the way? If you have, you’re definitely not alone. Some of us email ourselves to make sure messages reach the correct destination; others don’t rely on a new address for anything serious until they’ve seen it work for a few days. In any case, emails often contain important information, and we need to trust that our emails won’t get lost for any reason.

To help reduce these worries about whether emails are being received and forwarded – and for troubleshooting if needed – we are rolling out a new Overview page to Email Routing. On the Overview tab people now have full visibility into our service and can see exactly how we are routing emails on their behalf.

Routing Status and Metrics

The first thing you will see in the new tab is an at a glance view of the service. This includes the routing status (to know if the service is configured and running), whether the necessary DNS records are configured correctly, and the number of custom and destination addresses on the zone.

Below the configuration summary, you will see more advanced statistics about the number of messages received on your custom addresses, and what happened to those messages. You will see information about the number of emails forwarded or dropped by Email Routing (based on the rules you created), and the number that fall under other scenarios such as being rejected by Email Routing (due to errors, not passing security checks or being considered spam) or rejected by your destination mailbox. You now have the exact counts and a chart, so that you can track these metrics over time.

Activity Log

On the Cloudflare Email Routing tab you’ll also see the Activity Log, where you can drill deeper into specific behaviors. These logs show you details about the email messages that reached one of the custom addresses you have configured on your Cloudflare zone.

For each message the logs will show you the Message ID, Sender, Custom Address, when Cloudflare Email Routing received it, and the action that was taken. You can also expand the row to see the SPF, DMARC, and DKIM status of that message along with any relevant error messaging.

And we know looking at every message can be overwhelming, especially when you might be resorting to the logs for troubleshooting purposes, so you have a few options for filtering:

  • Search for specific people (email addresses) that have messaged you.
  • Filter to show only one of your custom addresses.
  • Filter to show only messages where a specific action was taken.

Routes and Settings

Next to the Overview tab, you will find the Routes tab with the configuration UI that is likely already familiar to you. That’s where you create custom addresses, add and verify destination addresses, and create rules with the relationships between the custom and destination addresses.

Lastly, the Settings tab includes less common actions, such as the DNS configuration and the options for offboarding from Email Routing.

We hope you enjoy this update. And if you have any questions or feedback about this product, please come see us in the Cloudflare Community and the Cloudflare Discord.

Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions

Post Syndicated from Kunal Ghosh original https://aws.amazon.com/blogs/big-data/author-aws-glue-jobs-with-pycharm-using-aws-glue-interactive-sessions/

Data lakes, business intelligence, operational analytics, and data warehousing share a common core characteristic—the ability to extract, transform, and load (ETL) data for analytics. Since its launch in 2017, AWS Glue has provided a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue interactive sessions allows programmers to build, test, and run data preparation and analytics applications. Interactive sessions provide access to run fully managed serverless Apache Spark using an on-demand model. AWS Glue interactive sessions also provide advanced users the same Apache Spark engine as AWS Glue 2.0 or AWS Glue 3.0, with built-in cost controls and speed. Additionally, development teams immediately become productive using their existing development tool of choice.

In this post, we walk you through how to use AWS Glue interactive sessions with PyCharm to author AWS Glue jobs.

Solution overview

This post provides a step-by-step walkthrough that builds on the instructions in Getting started with AWS Glue interactive sessions. It guides you through the following steps:

  1. Create an AWS Identity and Access Management (IAM) policy with limited Amazon Simple Storage Service (Amazon S3) read privileges and associated role for AWS Glue.
  2. Configure access to a development environment. You can use a desktop computer or an OS running on the AWS Cloud using Amazon Elastic Compute Cloud (Amazon EC2).
  3. Integrate AWS Glue interactive sessions with an integrated development environment (IDE).

We use the script Validate_Glue_Interactive_Sessions.ipynb for validation, available as a Jupyter notebook.

Prerequisites

You need an AWS account before you proceed. If you don’t have one, refer to How do I create and activate a new AWS account? This guide assumes that you already have installed Python and PyCharm. Python 3.7 or later is the foundational prerequisite.

Create an IAM policy

The first step is to create an IAM policy that limits read access to the S3 bucket s3://awsglue-datasets, which has the AWS Glue public datasets. You use IAM to define the policies and roles for access to AWS Glue.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. On the JSON tab, enter the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*",
                    "s3-object-lambda:Get*",
                    "s3-object-lambda:List*"
                ],
                "Resource": ["arn:aws:s3:::awsglue-datasets/*"]
            }
        ]
    }

  4. Choose Next: Tags.
  5. Choose Next: Review.
  6. For Policy name, enter glue_interactive_policy_limit_s3.
  7. For Description, enter a description.
  8. Choose Create policy.

Create an IAM role for AWS Glue

To create a role for AWS Glue with limited Amazon S3 read privileges, complete the following steps:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type, select AWS service.
  4. For Use cases for other AWS services, choose Glue.
  5. Choose Next.
  6. On the Add permissions page, search and choose the AWS managed permission policies AWSGlueServiceRole and glue_interactive_policy_limit_s3.
  7. Choose Next.
  8. For Role name, enter glue_interactive_role.
  9. Choose Create role.
  10. Note the ARN of the role, arn:aws:iam::<replacewithaccountID>:role/glue_interactive_role.

Set up development environment access

This second level of access configuration needs to occur in the developer’s environment. The development environment can be a desktop computer running Windows, macOS, or Linux, or a similar operating system running on the AWS Cloud using Amazon EC2. The following steps walk through each client access configuration. You can select the configuration path that is applicable to your environment.

Set up a desktop computer

To set up a desktop computer, we recommend completing the steps in Getting started with AWS Glue interactive sessions.
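
As part of that setup, the interactive sessions kernels read the AWS Glue role from your AWS profile. The following is a minimal sketch of a ~/.aws/config entry, assuming the default profile and the role created earlier; confirm the exact key names in the Getting started guide and replace the account ID placeholder:

    [default]
    region = us-east-1
    glue_role_arn = arn:aws:iam::<replacewithaccountID>:role/glue_interactive_role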

Set up an AWS Cloud-based computer with Amazon EC2

This configuration path follows the best practices for providing access to cloud-based resources using IAM roles. For more information, refer to Using an IAM role to grant permissions to applications running on Amazon EC2 instances.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type¸ select AWS service.
  4. For Common use cases, select EC2.
  5. Choose Next.
  6. Add the AWSGlueServiceRole policy to the newly created role.
  7. On the Add permissions menu, choose Create inline policy.
  8. Create an inline policy that allows the instance profile role to pass or assume glue_interactive_role, and save the new role as ec2_glue_demo (a sample inline policy is shown after these steps).

Your new policy is now listed under Permissions policies.
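
The following is a sketch of the inline policy described in the last step; it allows ec2_glue_demo to pass or assume glue_interactive_role (replace the account ID placeholder with your own):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:PassRole",
                    "sts:AssumeRole"
                ],
                "Resource": "arn:aws:iam::<replacewithaccountID>:role/glue_interactive_role"
            }
        ]
    }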

  1. On the Amazon EC2 console, choose (right-click) the instance you want to attach to the newly created role.
  2. Choose Security and choose Modify IAM role.
  3. For IAM role¸ choose the role ec2_glue_demo.
  4. Choose Save.
  5. On the IAM console, open and edit the trust relationship for glue_interactive_role.
  6. Add "AWS": ["arn:aws:iam::<replacewithaccountID>:user/glue_interactive_user", "arn:aws:iam::<replacewithaccountID>:role/ec2_glue_demo"] to the Principal JSON key.
  7. Complete the steps detailed in Getting started with AWS Glue interactive sessions.

You don’t need to provide an AWS access key ID or AWS secret access key as part of the remaining steps.

Integrate AWS Glue interactive sessions with an IDE

You’re now ready to set up and validate your PyCharm integration with AWS Glue interactive sessions.

  1. On the welcome page, choose New Project.
  2. For Location, enter the location of your project glue-interactive-demo.
  3. Expand Python Interpreter.
  4. Select Previously configured interpreter and choose the virtual environment you created earlier.
  5. Choose Create.

The following screenshot shows the New Project page on a Mac computer. A Windows computer setup will have a path beginning with C:\ followed by the PyCharm project location.

  1. Choose the project (right-click) and on the New menu, choose Jupyter Notebook.
  2. Name the notebook Validate_Glue_Interactive_Sessions.

The notebook has a drop-down called Managed Jupyter server: auto-start, which means the Jupyter server automatically starts when any notebook cell is run.

  1. Run the following code:
    print("This notebook will start the local Python kernel")

You can observe that the Jupyter server started running the cell.

  1. On the Python 3 (ipykernel) drop-down, choose Glue PySpark.
  2. Run the following code to start a Spark session:
    spark

  3. Wait to receive the message that a session ID has been created.
  4. Run the following boilerplate AWS Glue code in the next cell:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    glueContext = GlueContext(SparkContext.getOrCreate())

  5. Read the publicly available Medicare Provider payment data in the AWS Glue data preparation sample document:
    medicare_dynamicframe = glueContext.create_dynamic_frame.from_options(
        's3',
        {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
        'csv',
        {'withHeader': True})
    print("Count:",medicare_dynamicframe.count())
    medicare_dynamicframe.printSchema()

  6. Change the data type of the provider ID to long to resolve all incoming data to long:
    medicare_res = medicare_dynamicframe.resolveChoice(specs = [('Provider Id','cast:long')])
    medicare_res.printSchema()

  7. Display the providers:
    medicare_res.toDF().select('Provider Name').show(10,truncate=False)

Clean up

You can run %delete_session, which deletes the current session and stops the cluster so that you’re no longer charged. Have a look at the AWS Glue interactive sessions magics for other useful commands. Also remember to delete the IAM policy and role once you’re done.
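
For example, run the magic in a notebook cell while the Glue PySpark kernel is selected:

    %delete_session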

Conclusion

In this post, we demonstrated how to configure PyCharm to integrate and work with AWS Glue interactive sessions. The post builds on the steps in Getting started with AWS Glue interactive sessions to enable AWS Glue interactive sessions to work with Jupyter notebooks. We also provided ways to validate and test the functionality of the configuration.


About the Authors

Kunal Ghosh is a Sr. Solutions Architect at AWS. His passion is building efficient and effective solutions on cloud, especially involving analytics, AI, data science, and machine learning. Besides family time, he likes reading and watching movies. He is a foodie.

Sebastian Muah is a Solutions Architect at AWS focused on analytics, AI/ML, and big data. He has over 25 years of experience in information technology and helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. He enjoys cycling and building things around his home.

Query 10 new data sources with Amazon Athena

Post Syndicated from Scott Rigney original https://aws.amazon.com/blogs/big-data/query-10-new-data-sources-with-amazon-athena/

When we first launched Amazon Athena, our mission was to make it simple to query data stored in Amazon Simple Storage Service (Amazon S3). Athena customers found it easy to get started and develop analytics on petabyte-scale data lakes, but told us they needed to join their Amazon S3 data with data stored elsewhere. We added connectors to sources including Amazon DynamoDB and Amazon Redshift to give data analysts, data engineers, and data scientists the ability to run SQL queries on data stored in databases running on-premises or in the cloud alongside data stored in Amazon S3.

Today, thousands of AWS customers from nearly every industry use Athena federated queries to surface insights and make data-driven decisions from siloed enterprise data—using a single AWS service and SQL dialect.

We’re excited to expand your ability to derive insights from more of your data with today’s release of 10 new data source connectors, which include some of the most widely used data stores on the market.

New data sources for Athena

You can now use Athena to query and surface insights from 10 new data sources:

  • SAP HANA (Express Edition)
  • Teradata
  • Cloudera
  • Hortonworks
  • Snowflake
  • Microsoft SQL Server
  • Oracle
  • Azure Data Lake Storage (ADLS) Gen2
  • Azure Synapse
  • Google BigQuery

Today’s release greatly expands the number of data sources supported by Athena. For a complete list of supported data sources, see Using Athena Data Source Connectors.

To coincide with this release, we enhanced the Athena console to help you browse available sources and connect to your data in fewer steps. You can now search, sort, and filter the available connectors on the console, and then follow the guided setup wizard to connect to your data.

Just as before, we’ve open-sourced the new connectors to invite contributions from the developer community. For more information, see Writing a Data Source Connector Using the Athena Query Federation SDK.

Connect the dots in your analytics strategy with Athena

With the breadth of data storage options available today, it’s common for data-driven organizations to choose a data store that meets the requirements of specific use cases and applications. Although this flexibility is ideal for architects and developers, it can add complexity for analysts, data scientists, and data engineers, which prevents them from accessing the data they need. To get around this, many users resort to workarounds that often involve learning new programming languages and database concepts or building data pipelines to prepare the data before it can be analyzed. Athena helps cut through this complexity with support for over 25 data sources and its simple-to-use, pay-as-you-go, serverless design.

With Athena, you can use your existing SQL knowledge to extract insights from a wide range of data sources without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure. Athena allows you to do the following:

  • Run on-demand analysis on data spread across multiple cloud providers and systems of record using a single tool and single SQL dialect
  • Visualize data in business intelligence applications that use Athena to perform complex, multi-source joins
  • Design self-service extract, transform, and load (ETL) pipelines and event-based data processing workflows with Athena’s integration with AWS Step Functions
  • Unify diverse data sources to produce rich input features for machine learning model training workflows
  • Develop user-facing data-as-a-product applications that surface insights across data mesh architectures
  • Support analytics use cases while your organization migrates on-premises sources to the AWS Cloud

Get started with Athena’s data source connectors

To get started with federated queries for Athena, on the Athena console, choose Data Sources in the navigation pane, choose a data source, and follow the guided setup experience to configure your connector. After the connection is established and the source is registered with Athena, you can query the data via the Athena console, API, AWS SDK, and compatible third-party applications. To learn more, see Using Amazon Athena Federated Query and Writing Federated Queries.
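
For example, after a connector is registered, a federated query can join data from the new source with a table in the default AWS Glue Data Catalog (awsdatacatalog). The following is a sketch only; the data source name snowflake_sales and the databases, tables, and columns are hypothetical placeholders:

    SELECT o.order_id, o.order_total, c.segment
    FROM "snowflake_sales"."analytics"."orders" o
    JOIN "awsdatacatalog"."marketing"."customers" c
      ON o.customer_id = c.customer_id
    LIMIT 10;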

You can also share a data source connection with team members, allowing them to use their own AWS account to query the data without setting up a duplicate connector. To learn more, see Enabling Cross-Account Federated Queries.

Conclusion

We encourage you to evaluate Athena and federated queries on your next analytics project. For help getting started, we recommend the Athena federated query documentation and the Athena Query Federation SDK resources referenced earlier in this post.


About the Authors

Scott Rigney is a Senior Technical Product Manager with Amazon Web Services (AWS) and works with the Amazon Athena team based out of Arlington, Virginia. He is passionate about building analytics products that enable enterprises to make data-driven decisions.

Jean-Louis Castro-Malaspina is a Senior Product Marketing Manager with Amazon Web Services (AWS) based in Hershey, Pennsylvania. He enjoys highlighting how customers use Analytics and Amazon Athena to unlock innovation. Outside of work, Jean-Louis enjoys spending time with his wife and daughter, running, and following international soccer.

Suresh Akena is a Principal WW GTM Leader for Amazon Athena. He works with startup, enterprise, and strategic customers to provide leadership on large-scale data strategies, including migration to the AWS platform, big data and analytics, and ML initiatives, and helps them optimize and improve time to market for data-driven applications on AWS.

Build your data pipeline in your AWS modern data platform using AWS Lake Formation, AWS Glue, and dbt Core

Post Syndicated from Benjamin Menuet original https://aws.amazon.com/blogs/big-data/build-your-data-pipeline-in-your-aws-modern-data-platform-using-aws-lake-formation-aws-glue-and-dbt-core/

dbt has established itself as one of the most popular tools in the modern data stack, and is aiming to bring analytics engineering to everyone. The dbt tool makes it easy to develop and implement complex data processing pipelines, with mostly SQL, and it provides developers with a simple interface to create, test, document, evolve, and deploy their workflows. For more information, see docs.getdbt.com.

dbt primarily targets cloud data warehouses such as Amazon Redshift or Snowflake. Now, you can use dbt against AWS data lakes, thanks to the following two services:

  • AWS Glue interactive sessions
  • AWS Lake Formation

In this post, you’ll learn how to deploy a data pipeline in your modern data platform using the dbt-glue adapter built by the AWS Professional Services team in collaboration with dbtlabs.

With this new open-source, battle-tested dbt AWS Glue adapter, developers can now use dbt for their data lakes, paying for just the compute they need, with no need to shuffle data around. They still have access to everything that makes dbt great, including the local developer experience, documentation, tests, incremental data processing, Git integration, CI/CD, and more.

Solution overview

The following diagram shows the architecture of the solution.

The steps in this workflow are as follows:

  1. The data team configures a local Python virtual environment and creates a data pipeline with dbt.
  2. The dbt-glue adapter uses Lake Formation to perform all structure manipulation, such as the creation of databases, tables, or views.
  3. The dbt-glue adapter uses AWS Glue interactive sessions as the backend for processing your data.
  4. All data is stored in Amazon Simple Storage Service (Amazon S3) in the open Parquet file format.
  5. The data team can now query all data stored in the data lake using Amazon Athena.

Walkthrough overview

For this post, you run a data pipeline that creates indicators based on NYC taxi data by following these steps:

  1. Deploy the provided AWS CloudFormation stack in Region us-east-1.
  2. Configure your Amazon CloudShell environment.
  3. Install dbt, the dbt CLI, and the dbt adapter.
  4. Use CloudShell to clone the project and configure it to use your account’s configuration.
  5. Run dbt to implement the data pipeline.
  6. Query the data with Athena.

For our use case, we use the data from the New York City Taxi Records dataset. This dataset is available in the Registry of Open Data on AWS (RODA), which is a repository containing public datasets from AWS resources.

The CloudFormation template creates the nyctaxi database in your AWS Glue Data Catalog and a table (records) that points to the public dataset. You don’t need to host the data in your account.

Prerequisites

The CloudFormation template used by this project configures the AWS Identity and Access Management (IAM) role GlueInteractiveSessionRole with all the mandatory permissions.

For more details on permissions for AWS Glue interactive sessions, refer to Securing AWS Glue interactive sessions with IAM.

Deploy resources with AWS CloudFormation

The CloudFormation stack deploys all the required infrastructure:

  • An IAM role with all the mandatory permissions to run an AWS Glue interactive session and the dbt-glue adapter.
  • An AWS Glue database and table to store the metadata related to the NYC taxi records dataset
  • An S3 bucket to use as output and store the processed data
  • An Athena configuration (a workgroup and S3 bucket to store the output) to explore the dataset
  • An AWS Lambda function as an AWS CloudFormation custom resource that updates all the partitions in the AWS Glue table

To create these resources, choose Launch Stack and follow the instructions:

Configure the CloudShell environment

To start working with the shell, complete the following steps:

  1. Sign in to the AWS Management Console and launch CloudShell using either one of the following two methods:
    1. Choose the CloudShell icon on the console navigation bar.
    2. Enter cloudshell in the Find Services box and then choose the CloudShell option.
  2. Because dbt and the dbt-glue adapter are compatible with Python versions 3.7, 3.8, and 3.9, check the version of Python:
    $ python3 --version

  3. Configure a Python virtual environment to isolate the package version and code dependencies:
    $ sudo yum install git -y
    $ python3 -m venv dbt_venv
    $ source dbt_venv/bin/activate
    $ python3 -m pip install --upgrade pip

  4. Configure the aws-glue-sessions package:
    $ sudo yum install gcc krb5-devel.x86_64 python3-devel.x86_64 -y
    $ pip3 install --no-cache-dir --upgrade boto3
    $ pip3 install --no-cache-dir --upgrade aws-glue-sessions

Install dbt, the dbt CLI, and the dbt adapter

The dbt CLI is a command-line interface for running dbt projects. It’s free to use and available as an open-source project. Install dbt and the dbt CLI with the following code:

$ pip3 install --no-cache-dir dbt-core

For more information, refer to How to install dbt, What is dbt?, and Viewpoint.

Install the dbt adapter with the following code:

$ pip3 install --no-cache-dir dbt-glue

Clone the project

The dbt AWS Glue interactive session demo project contains an example of a data pipeline that produces metrics based on the NYC taxi dataset. Clone the project with the following code:

$ git clone https://github.com/aws-samples/dbtgluenyctaxidemo

This project comes with the following configuration example:

dbtgluenyctaxidemo/profile/profiles.yml

The following list summarizes the parameter options for the adapter.

  • project_name (mandatory) – The dbt project name. This must be the same as the one configured in the dbt project.
  • type (mandatory) – The driver to use.
  • query-comment (optional) – A string to inject as a comment in each query that dbt runs.
  • role_arn (mandatory) – The ARN of the interactive session role created as part of the CloudFormation template.
  • region (mandatory) – The AWS Region where you run the data pipeline.
  • workers (mandatory) – The number of workers of a defined workerType that are allocated when a job runs.
  • worker_type (mandatory) – The type of predefined worker that is allocated when a job runs. Accepts a value of Standard, G.1X, or G.2X.
  • schema (mandatory) – The schema used to organize data stored in Amazon S3.
  • database (mandatory) – The database in Lake Formation. The database stores metadata tables in the Data Catalog.
  • session_provisioning_timeout_in_seconds (mandatory) – The timeout in seconds for AWS Glue interactive session provisioning.
  • location (mandatory) – The Amazon S3 location of your target data.
  • idle_timeout (optional) – The AWS Glue session idle timeout in minutes. (The session stops after being idle for the specified amount of time.)
  • glue_version (optional) – The version of AWS Glue for this session to use. Currently, the only valid options are 2.0 and 3.0. The default value is 2.0.
  • security_configuration (optional) – The security configuration to use with this session.
  • connections (optional) – A comma-separated list of connections to use in the session.
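
To make these options concrete, the following is a minimal sketch of what profile/profiles.yml could look like for this demo. The profile name, metric database, and numeric values shown here are illustrative assumptions; use the file shipped with the project and adjust it for your account. The role_arn and location values reference the DBT_ROLE_ARN and DBT_S3_LOCATION environment variables that you export in a later step:

    dbtgluenyctaxidemo:
      target: dev
      outputs:
        dev:
          type: glue
          project_name: dbtgluenyctaxidemo
          query-comment: dbt-glue-nyctaxi-demo
          role_arn: "{{ env_var('DBT_ROLE_ARN') }}"
          region: us-east-1
          workers: 3
          worker_type: G.1X
          schema: dbt_nyc_metrics
          database: dbt_nyc_metrics
          session_provisioning_timeout_in_seconds: 120
          location: "{{ env_var('DBT_S3_LOCATION') }}"
          glue_version: "3.0"
          idle_timeout: 10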

Run the dbt project

The objective of this sample project is to create the following four tables, which contain metrics based on the NYC taxi dataset:

  • silver_nyctaxi_avg_metrics – Basic metrics based on NYC Taxi Open Data for the year 2016
  • gold_nyctaxi_passengers_metrics – Metrics per passenger based on the silver metrics table
  • gold_nyctaxi_distance_metrics – Metrics per distance based on the silver metrics table
  • gold_nyctaxi_cost_metrics – Metrics per cost based on the silver metrics table
  1. To run the dbt project, you should be in the project folder:
    $ cd dbtgluenyctaxidemo

  2. The project requires you to set environment variables in order to run on the AWS account:
    $ export DBT_ROLE_ARN="arn:aws:iam::$(aws sts get-caller-identity --query "Account" --output text):role/GlueInteractiveSessionRole"
    $ export DBT_S3_LOCATION="s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/"

  3. Make sure the profile is set up correctly from the command line:
    $ dbt debug --profiles-dir profile

  4. Run the models with the following code:
    $ dbt run --profiles-dir profile

  5. Generate documentation for the project:
    $ dbt docs generate --profiles-dir profile

  6. View the documentation for the project:
    $ dbt docs serve --profiles-dir profile

Query data via Athena

This section demonstrates how to query the target table using Athena. To query the data, complete the following steps:

  1. On the Athena console, switch the workgroup to athena-dbt-glue-aws-blog.
  2. If the Workgroup athena-dbt-glue-aws-blog settings dialog box appears, choose Acknowledge.
  3. Use the following query to explore the metrics created by the dbt project:
    SELECT cm.avg_cost_per_minute
    , cm.avg_cost_per_distance
    , dm.avg_distance_per_duration
    , dm.year
    , dm.month
    , dm.type
    FROM "dbt_nyc_metrics"."gold_nyctaxi_distance_metrics" dm
    LEFT JOIN "dbt_nyc_metrics"."gold_nyctaxi_cost_metrics" cm
    ON dm.type = cm.type
    AND dm.year = cm.year
    AND dm.month = cm.month
    WHERE dm.type = 'yellow'
    AND dm.year = '2016'
    AND dm.month = '6'

The following screenshot shows the results of this query.

Clean up

To clean up your environment, complete the following steps in CloudShell:

  1. Delete the database created by dbt:
    $ aws glue delete-database --name dbt_nyc_metrics

  2. Delete all generated data:
    $ aws s3 rm s3://aws-dbt-glue-datalake-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive
    $ aws s3 rm s3://aws-athena-dbt-glue-query-results-$(aws sts get-caller-identity --query "Account" --output text)-us-east-1/ --recursive

  3. Delete the CloudFormation stack:
    $ aws cloudformation delete-stack --stack-name dbt-demo

Summary

This post demonstrates how AWS managed services are key enablers and accelerators to build a modern data platform at scale or take advantage of an existing one.

With the introduction of dbt and the dbt-glue adapter, data teams can use SQL statements to access data stored in your modern data platform and extract value from it.

To report a bug or request a feature, please open an issue on GitHub. If you have any questions or suggestions, leave your feedback in the comment section. If you need further assistance to optimize your modern data platform, contact your AWS account team or a trusted AWS Partner.


About the Authors

Benjamin Menuet is a Data Architect with AWS Professional Services. He helps customers develop big data and analytics solutions to accelerate their business outcomes. Outside of work, Benjamin is a trail runner and has finished some mythic races like the UTMB.

Armando Segnini is a Data Architect with AWS Professional Services. He spends his time building scalable big data and analytics solutions for AWS Enterprise and Strategic customers. Armando also loves to travel with his family all around the world and take pictures of the places he visits.

Moshir Mikael is a Senior Practice Manager with AWS Professional Services.  He led development of large enterprise data platforms in EMEA and currently leading the Professional Services teams in EMEA for analytics.

Anouar Zaaber is a Senior Engagement Manager in AWS Professional Services. He leads internal AWS teams, external partners, and customer teams to deliver AWS cloud services that enable customers to realize their business outcomes.

Amazon QuickSight 1-click public embedding available in preview

Post Syndicated from Kareem Syed-Mohammed original https://aws.amazon.com/blogs/big-data/amazon-quicksight-1-click-public-embedding-available-in-preview/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards, and share these with tens of thousands of users, either directly within a QuickSight application, or embedded in web apps and portals.

QuickSight Enterprise Edition now supports 1-click public embedding, a feature that allows you to embed your dashboards into public applications, wikis, and portals without any coding or development needed. Anyone on the internet can start accessing these embedded dashboards with up-to-date information instantly, without any server deployments or infrastructure licensing needed! 1-click public embedding allows you to empower your end-users with access to insights.

In this post, we walk you through the steps to use this feature, demonstrate the end-user experience, and share sample use cases.

Solution overview

1-click public embedding requires an administrator of the QuickSight account to enable this feature and use session-based pricing. After you complete the prerequisites (see the following section), the 1-click embedding process involves three simple steps:

  1. Enable public access on the dashboard
  2. Allow list the domain where you want to embed the dashboard in QuickSight
  3. Embed the dashboard

Prerequisites

  1. The account needs to have session capacity pricing enabled
  2. As a prerequisite, make sure your QuickSight account is enabled for dashboards to be shared with the public. Because this feature allows dashboards to be accessed by any user, it must be enabled by an administrator of your QuickSight account with the following AWS Identity and Access Management (IAM) permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "quicksight:UpdatePublicSharingSettings",
                "Resource": "*",
                "Effect": "Allow"
            }
         ]
    }

After you enable this policy, you can activate the account-level settings.

  1. Sign in to your QuickSight account.
  2. On the user name drop-down menu, choose Manage QuickSight.
  3. Choose Security & permissions in the navigation pane.
  4. Under Public access to dashboards, choose Manage.
  5. Enable the setting Everyone on the internet to allow public access.
  6. Confirm this setting and choose Accept.

Enabling this setting doesn’t automatically enable all the dashboards to be accessed by anyone on the internet. It gives the ability for authors of the dashboards to individually enable the dashboard to be accessed by anyone on the internet via the share link or when embedded.

Enable public access permissions on the dashboard

After you create a QuickSight dashboard, to enable public access, complete the following steps:

  1. On the QuickSight dashboard, on the Share menu, choose Share dashboard.

Only owners and co-owners of the dashboard can perform this action.

  1. Under Enable access for, turn on Anyone on the internet (public).

This setting allows you to share this dashboard with anyone on the internet via the share link or when embedded.

  1. Accept this setting and choose Confirm.

You now have the option to copy the link or the embed code to share the dashboard. Note that when this setting is enabled, the dashboard can only be accessed using the link or when embedded using the embed code.

After you enable the dashboard for public access, you can see badges on the dashboard as follows.

The dashboard toolbar has a PUBLIC badge.

The dashboards grid view has an eye icon for each dashboard.

The dashboards list view has an eye icon for each dashboard.

Disable the public access

You can disable the public access of your dashboards in two ways:

  1. Disable the Anyone on the internet setting for the dashboard (to be done by the owners and co-owners of the dashboard).
  2. Disable the Public access to dashboard setting at the account level (to be done by the IAM admin). You can disable the account-level settings even if the dashboard-level setting is enabled. This still prevents the dashboards from being accessed publicly.

Allow list the domain where you want to embed the dashboard in QuickSight

The domain where the dashboard is to be embedded must be allow listed in QuickSight. For instructions, see Adding domains for embedded users.

Embed the dashboard

After you set your desired access to the dashboard, you can choose Copy embed code, which copies the embed code for that dashboard. This code embeds the dashboard when added to an internal application.

The copied embed code is similar to the following code:

    <iframe
        width="960"
        height="720"
        src="https://quicksight.aws.amazon.com/sn/
        embed/share/accounts/accountid/dashboards/dashboardid">
    </iframe>

To embed the dashboard in an HTML page, open the HTML of the page where you want to embed the dashboard and enter the copied embed code into the HTML code.
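
For example, a minimal public page could look like the following sketch, where the iframe is the embed code copied from QuickSight (accountid and dashboardid are placeholders):

    <!DOCTYPE html>
    <html>
      <head>
        <title>Sales insights</title>
      </head>
      <body>
        <h1>Sales insights</h1>
        <!-- Paste the embed code copied from QuickSight below -->
        <iframe
            width="960"
            height="720"
            src="https://quicksight.aws.amazon.com/sn/embed/share/accounts/accountid/dashboards/dashboardid">
        </iframe>
      </body>
    </html>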

If your public-facing applications are built on Google Sites, to embed your dashboard, open the page on Google Sites, then choose Insert and Embed. A pop-up window appears with a prompt to enter a URL or embed code. Choose Embed code and enter the copied embed code in the text box.

Make sure to allow list the following domains in QuickSight when embedding in Google Sites: https://googleusercontent.com (enable subdomains), https://www.gstatic.com, and https://sites.google.com.

After you embed the dashboard in your application, anyone who can access your application can now access the embedded dashboard.

Sample use cases

1-click public embedding enables you to embed your dashboards into public applications, wikis, and portals without any coding or development needed. In this section, we present two sample use cases.

For our first use case, a fictitious school district uses 1-click public embedding to report on the teachers’ enrollment in the district. They built a dashboard and used this feature to embed it on their public-facing site.

For our second use case, a fictitious fintech that provides investment solutions is using 1-click public embedding to show how their investment compares against other well-known indexes and commodities. They used this feature to add this comparison dashboard on their public-facing marketing pages.

Try out 1-click public embedding

To try out this feature, see Embed Amazon QuickSight dashboard in seconds. In this demo, you can change the dashboard between a logistics or sales dashboard by choosing Change Dashboard and entering the embed code for the dashboard you want to render on the site.

Conclusion

With 1-click public embedding, you can now embed rich and interactive QuickSight dashboards quickly and easily. Enable your end-users to dive deeper into their data through embedded dashboards with the click of a button, and scale to millions of users with no infrastructure setup or management. 1-click public embedding is now in preview; to access this feature, please contact [email protected].

QuickSight also supports embedding in SaaS apps without any user management needed. For more information, refer to Embed multi-tenant dashboards in SaaS apps using Amazon QuickSight without provisioning or managing users.

To stay up to date on QuickSight embedded analytics, check out what’s new with the QuickSight User Guide.


About the Authors

Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight he has been with AWS Marketplace and Amazon retail as a PM. Kareem started his career as a developer and then PM for call center technologies, Local Expert and Ads for Expedia. He worked as a consultant with McKinsey and Company for a short while.

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences and eResearch Technology Inc., where he was responsible for designing and developing high-traffic web applications and highly scalable and maintainable data pipelines for reporting platforms using AWS services and serverless computing.

Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-auto-scaling-automatically-resize-serverless-computing-resources-for-lower-cost-with-optimized-apache-spark/

Data created in the cloud is growing rapidly, so scalability is a key factor in distributed data processing. Many customers benefit from the scalability of the AWS Glue serverless Spark runtime. Today, we’re pleased to announce the release of AWS Glue Auto Scaling, which helps you scale your AWS Glue Spark jobs automatically based on requirements calculated dynamically during the job run, and accelerates job runs at lower cost without detailed capacity planning.

Before AWS Glue Auto Scaling, you had to predict workload patterns in advance. For example, in cases when you don’t have expertise in Apache Spark, when it’s the first time you’re processing the target data, or when the volume or variety of the data is significantly changing, it’s not so easy to predict the workload and plan the capacity for your AWS Glue jobs. Under-provisioning is error-prone and can lead to either missed SLA or unpredictable performance. On the other hand, over-provisioning can cause underutilization of resources and cost overruns. Therefore, it was a common best practice to experiment with your data, monitor the metrics, and adjust the number of AWS Glue workers before you deployed your Spark applications to production.

With AWS Glue Auto Scaling, you no longer need to plan AWS Glue Spark cluster capacity in advance. You can just set the maximum number of workers and run your jobs. AWS Glue monitors the Spark application execution, and allocates more worker nodes to the cluster in near-real time after Spark requests more executors based on your workload requirements. When there are idle executors that don’t have intermediate shuffle data, AWS Glue Auto Scaling removes the executors to save the cost.

AWS Glue Auto Scaling is available with the optimized Spark runtime on AWS Glue version 3.0, and you can start using it today. This post describes possible use cases and how it works.

Use cases and benefits for AWS Glue Auto Scaling

Traditionally, AWS Glue launches a serverless Spark cluster of a fixed size. The computing resources are held for the whole job run until it is complete. With the new AWS Glue Auto Scaling feature, after you enable it for your AWS Glue Spark jobs, AWS Glue dynamically allocates compute resources up to the given maximum number of workers. It also supports dynamic scale-out and scale-in of the AWS Glue Spark cluster size over the course of the job. As Spark requests more executors, more AWS Glue workers are added to the cluster. When an executor has been idle for a period of time, with no active computation tasks and no associated shuffle dependencies, the executor and its corresponding worker are removed.

AWS Glue Auto Scaling makes it easy to run your data processing in the following typical use cases:

  • Batch jobs to process unpredictable amounts of data
  • Jobs containing driver-heavy workloads (for example, processing many small files)
  • Jobs containing multiple stages with uneven compute demands or due to data skews (for example, reading from a data store, repartitioning it to have more parallelism, and then processing further analytic workloads)
  • Jobs to write large amounts of data into data warehouses such as Amazon Redshift or read and write from databases

Configure AWS Glue Auto Scaling

AWS Glue Auto Scaling is available with the optimized Spark runtime on Glue version 3.0. To enable Auto Scaling on the AWS Glue Studio console, complete the following steps:

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose your job.
  4. Choose the Job details tab.
  5. For Glue version, choose Glue 3.0 – Supports spark 3.1, Scala 2, Python.
  6. Select Automatically scale the number of workers.
  7. For Maximum number of workers, enter the maximum workers that can be vended to the job run.
  8. Choose Save.

To enable Auto Scaling in the AWS Glue API or AWS Command Line Interface (AWS CLI), set the following job parameters:

  • Key: --enable-auto-scaling
  • Value: true
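
For example, the following AWS CLI sketch creates a job with Auto Scaling enabled and a cap of 20 workers; the job name, role, and script location are placeholders:

    aws glue create-job \
      --name my-autoscaling-job \
      --role my-glue-job-role \
      --glue-version "3.0" \
      --worker-type G.1X \
      --number-of-workers 20 \
      --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/my-job.py \
      --default-arguments '{"--enable-auto-scaling":"true"}'

With Auto Scaling enabled, the number of workers acts as the upper bound rather than a fixed cluster size.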

Monitor AWS Glue Auto Scaling

In this section, we discuss three ways to monitor AWS Glue Auto Scaling: Amazon CloudWatch metrics, the AWS Glue Studio Monitoring page, and the Spark UI.

CloudWatch metrics

After you enable AWS Glue Auto Scaling, Spark dynamic allocation is enabled and the executor metrics are visible in CloudWatch. You can review the following metrics to understand the demand and optimized usage of executors in their Spark applications enabled with Auto Scaling:

  • glue.driver.ExecutorAllocationManager.executors.numberAllExecutors
  • glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors
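
You can also retrieve these metrics with the AWS CLI, as in the following sketch. The job name is a placeholder, and the dimension values (JobRunId=ALL and Type=gauge) are assumptions based on the standard AWS Glue job metrics dimensions, so confirm them for your job runs:

    aws cloudwatch get-metric-statistics \
      --namespace Glue \
      --metric-name glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors \
      --dimensions Name=JobName,Value=my-autoscaling-job Name=JobRunId,Value=ALL Name=Type,Value=gauge \
      --start-time 2022-03-01T00:00:00Z \
      --end-time 2022-03-02T00:00:00Z \
      --period 60 \
      --statistics Maximum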

AWS Glue Studio Monitoring page

On the Monitoring page in AWS Glue Studio, you can monitor the DPU hours spent for a specific job run. The following screenshot shows two job runs that processed the same dataset: one without Auto Scaling, which spent 8.71 DPU hours, and one with Auto Scaling enabled, which spent only 1.48 DPU hours. The DPU hour values per job run are also available in GetJobRun API responses.

Spark UI

With the Spark UI, you can monitor that the AWS Glue Spark cluster dynamically scales out and scales in with AWS Glue Auto Scaling. The event timeline shows when each executor is added and removed gradually over the Spark application run.

In the following sections, we demonstrate AWS Glue Auto Scaling with two use cases: jobs with driver-heavy workloads, and jobs with multiple stages.

Example 1: Jobs containing driver-heavy workloads

A typical workload for AWS Glue Spark jobs is to process many small files to prepare the data for further analysis. For such workloads, AWS Glue has built-in optimizations, including file grouping, a Glue S3 Lister, partition pushdown predicates, partition indexes, and more. For more information, see Optimize memory management in AWS Glue. All those optimizations execute on the Spark driver and speed up the planning phase on Spark driver to compute and distribute the work for parallel processing with Spark executors. However, without AWS Glue Auto Scaling, Spark executors are idle during the planning phase. With Auto Scaling, Glue jobs only allocate executors when the driver work is complete, thereby saving executor cost.

Here’s the example DAG shown in AWS Glue Studio. This AWS Glue job reads from an Amazon Simple Storage Service (Amazon S3) bucket, performs the ApplyMapping transformation, runs a simple SELECT query repartitioning data to have 800 partitions, and writes back to another location in Amazon S3.

Without AWS Glue Auto Scaling

The following screenshot shows the executor timeline in Spark UI when the AWS Glue job ran with 20 workers without Auto Scaling. You can confirm that all 20 workers started at the beginning of the job run.

With AWS Glue Auto Scaling

In contrast, the following screenshot shows the executor timeline of the same job with Auto Scaling enabled and the maximum workers set to 20. The driver and one executor started at the beginning, and other executors started only after the driver finished its computation for listing 367,920 partitions on the S3 bucket. You weren’t charged for these 19 workers during the long-running driver task.

Both jobs completed in 44 minutes. With AWS Glue Auto Scaling, the job completed in the same amount of time with lower cost.

Example 2: Jobs containing multiple stages

Another typical workload in AWS Glue is to read from the data store or large compressed files, repartition it to have more parallelism for downstream processing, and process further analytic queries. For example, when you want to read from a JDBC data store, you may not want to have many concurrent connections, so you can avoid impacting source database performance. For such workloads, you can have a small number of connections to read data from the JDBC data store, then repartition the data with higher parallelism for further analysis.

Here’s the example DAG shown in AWS Glue Studio. This AWS Glue job reads from the JDBC data source, runs a simple SELECT query adding one more column (mod_id) calculated from the column ID, performs the ApplyMapping node, then writes to an S3 bucket with partitioning by this new column mod_id. Note that the JDBC data source was already registered in the AWS Glue Data Catalog, and the table has two parameters, hashfield=id and hashpartitions=5, to read from JDBC through five concurrent connections.

Without AWS Glue Auto Scaling

The following screenshot shows the executor timeline in the Spark UI when the AWS Glue job ran with 20 workers without Auto Scaling. You can confirm that all 20 workers started at the beginning of the job run.

With AWS Glue Auto Scaling

The following screenshot shows the same executor timeline in the Spark UI with Auto Scaling enabled and a maximum of 20 workers. The driver and two executors started at the beginning, and other executors started later. The first two executors read data from the JDBC source using a small number of concurrent connections. Later, the job increased parallelism and more executors were started. You can also observe that only 16 executors were used, not 20, which further reduced cost.

Conclusion

This post discussed AWS Glue Auto Scaling, which automatically resizes the computing resources of your AWS Glue Spark jobs and reduces cost. You can start using AWS Glue Auto Scaling for both existing and new workloads today. For more information about AWS Glue Auto Scaling, see Using Auto Scaling for AWS Glue. Migrate your jobs to AWS Glue version 3.0 to get the benefits of Auto Scaling.

Special thanks to everyone who contributed to the launch: Raghavendhar Thiruvoipadi Vidyasagar, Ping-Yao Chang, Shashank Bhardwaj, Sampath Shreekantha, Vaibhav Porwal, and Akash Gupta.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data platforms, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing the knowledge in AWS Big Data blog posts. In his spare time, he enjoys taking care of killifish, hermit crabs, and grubs with his children.

Bo Li is a Software Development Engineer on the AWS Glue team. He is devoted to designing and building end-to-end solutions to address customers’ data analytic and processing needs with cloud-based, data-intensive technologies.

Rajendra Gujja is a Software Development Engineer on the AWS Glue team. He is passionate about distributed computing and everything and anything about data.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team works on distributed systems for efficiently managing data lakes on AWS and optimizes Apache Spark for performance and reliability.

Enhance analytics with Google Trends data using AWS Glue, Amazon Athena, and Amazon QuickSight

Post Syndicated from Drew Philip original https://aws.amazon.com/blogs/big-data/enhance-analytics-with-google-trends-data-using-aws-glue-amazon-athena-and-amazon-quicksight/

In today’s market, business success often lies in the ability to glean accurate insights and predictions from data. However, data scientists and analysts often find that the data they have at their disposal isn’t enough to make accurate predictions for their use cases. A variety of factors can alter an outcome and should be taken into account when building a prediction model. Google Trends is one available option: it presents a broad source of data that reflects global trends more comprehensively, which can help enrich a dataset to yield a better model.

You can use Google Trends data for a variety of analytical use cases. For example, you can use it to learn about how your products or brands are faring among targeted audiences. You can also use it to monitor competitors and see how well they’re performing against your brand.

In this post, we show how to get Google Trends data programmatically, integrate it into a data pipeline, and use it to analyze data with Amazon Simple Storage Service (Amazon S3), AWS Glue, Amazon Athena, and Amazon QuickSight. We use an example dataset of movies and TV shows and demonstrate how to use search queries from Google Trends to analyze their popularity.

Solution overview

The following diagram shows a high-level architecture of the solution using Amazon S3, AWS Glue, the Google Trends API, Athena, and QuickSight.

The solution consists of the following components:

  1. Amazon S3 – The storage layer that stores the list of topics for which Google Trends data has to be gathered. It also stores the results returned by Google Trends.
  2. AWS Glue – The serverless data integration service that calls Google Trends for the list of topics to get the search results, aggregates the data, and loads it to Amazon S3.
  3. Athena – The query engine that allows you to query the data stored in Amazon S3. You can use it for supporting one-time SQL queries on Google Trends data and for building dashboards using tools like QuickSight.
  4. QuickSight – The reporting tool used for building visualizations.

In the following sections, we walk through the steps to set up the environment, download the libraries, create and run the AWS Glue job, and explore the data.

Set up your environment

Complete the following steps to set up your environment:

  1. Create an S3 bucket where you upload the list of movies and TV shows. For this post, we use a Netflix Movies and TV Shows public dataset from Kaggle.
  2. Create an AWS Identity and Access Management (IAM) service role that allows AWS Glue to read and write data to the S3 buckets you just created.
  3. Create a new QuickSight account with the admin/author role and access granted to Athena and Amazon S3.

Download the external libraries and dependencies for the AWS Glue Job

The AWS Glue job needs the following two external Python libraries: pytrends and awswrangler. pytrends is a library that provides a simple interface for automating the downloading of reports from Google Trends. awswrangler is a library provided by AWS to integrate data between a Pandas DataFrame and AWS repositories like Amazon S3.

Download the following .whl files for the libraries and upload them to Amazon S3:
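
One way to obtain the wheel files is with pip, as in the following sketch; the package versions, bucket name, and prefix are placeholders, and you can add --only-binary=:all: if pip fetches a source distribution instead of a wheel:

    pip3 download pytrends awswrangler --no-deps -d ./wheels
    aws s3 cp ./wheels/ s3://<your-bucket>/glue-python-libs/ --recursive --exclude "*" --include "*.whl"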

Create and configure an AWS Glue job

To set up your AWS Glue job, complete the following steps:

  1. On the AWS Glue console, under ETL in the navigation pane, choose Jobs – New.
  2. For Create job, select Python Shell script editor.
  3. For Options, select Create a new script with boilerplate code.
  4. Choose Create.
  5. On the Script tab, enter the following script, replacing the source and target buckets with your bucket names:
    # Import external library TrendReq needed to connect to Google Trends API and library awswrangler to read/write from pandas to Amazon S3.
    
    from pytrends.request import TrendReq
    pytrend = TrendReq(hl='en-US', tz=360, timeout=10) 
    import pandas as pd
    import awswrangler as wr
    
    # Function get_gtrend, accepts a list of terms as input, calls Google Trends API for each term to get the search trends 
    def get_gtrend(terms):
      trends =[]
      for term in terms:
    # Normalizing the data using popular movie Titanic as baseline to get trends over time.
        pytrend.build_payload(kw_list=["Titanic",term.lower()])
        df = pytrend.interest_over_time()
        df["google_trend"] = round((df[term.lower()] /df['Titanic']) *100)
        
    # Transforming and filtering trends results to align with Analytics use case
        df_trend = df.loc[df.index >= "2018-1-1", "google_trend"].resample(rule="M").max().to_frame()
        df_trend["movie"] = term
        trends.append(df_trend.reset_index())
    
    # Last step in function to concatenate the results for each term and return an aggregated dataset 
      concat_df = pd.concat(trends)
      return concat_df
    
    def main():
      
    # Change the bucket and prefix name to Amazon S3 location where movie titles file from Kaggle has been downloaded. 
      source_bucket = "source_bucket"
      source_prefix = "source_prefix"
    
    # Awswrangler method s3.read_csv is called to load the titles from S3 location into a DataFrame and convert it to a list.
      df = wr.s3.read_csv(f's3://{source_bucket}/{source_prefix}/')
      movies = df['title'].head(20).values.tolist()
    
    #  Call the get_trends function and pass the list of movies as an input. Pandas dataframe is returned with trend data for movies.
      df = get_gtrend(terms=movies)
    
    # Change the prefix name to location where you want to store results. 
      target_bucket = "target_bucket" 
      target_prefix = "target_prefix" 
    
    # Use awswrangler to save pandas dataframe to Amazon S3. 
      wr.s3.to_csv(df,f's3://{target_bucket}/{target_prefix}/trends.csv',index= False)
    
    
    # Invoke the main function
    main()

  6. On the Job details tab, for Name, enter the name of the AWS Glue job.
  7. For IAM Role, choose the role that you created earlier with permissions to run the job and access Amazon S3.
  8. For Type, enter Python Shell to run the Python code.
  9. For Python Version, specify the Python version as Python 3.6.
  10. For Data processing units, choose 1 DPU.
  11. For Number of retries, enter .
  12. Expand Advanced properties and under Libraries, enter the location of the S3 bucket where the pytrends and awswrangler files were downloaded.
  13. Choose Save to save the job.

Run the AWS Glue job

Navigate to the AWS Glue console and run the AWS Glue job you created. When the job is complete, a CSV file with the Google Trends values is created in the target S3 bucket with the prefix specified in the main() function. In the next step, we create an AWS Glue table referring to the target bucket and prefix to allow queries to be run against the Google Trends data.

Create an AWS Glue table on the Google Trends data

In this step, we create a table in the AWS Glue Data Catalog using Athena. The table is created on top of the Google Trends data saved in the target S3 bucket.

In the Athena query editor, select default as the database and enter the following DDL command to create a table named trends. Replace the target bucket and prefix with your own values.

CREATE EXTERNAL TABLE `trends`(
  `date` date, 
  `google_trend` double, 
  `title` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<< target_bucket >>/<<target_prefix >>/'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1')

This table has three columns:

  • date – The time dimension for aggregating the data. In this example, the time period is monthly.
  • google_trend – The count of Google Trends values normalized on a scale of 0–100.
  • title – The name of the movie or TV show.

Query the data using Athena

Now you can run one-time queries to find the popularity of movies and TV shows.

In the first example, we find the top 10 most popular movies and TV shows for November 2021. In the Athena query editor, enter the following SQL command to query the trends table created in the previous step:

select title,google_trend
from trends 
where date = date_parse('2021-11-30','%Y-%m-%d')
order by google_trend desc
limit 10

In the following example, we find the top 10 most popular movies and TV shows that have grown most in popularity in 2021 until November 30. In the Athena query editor, enter the following SQL command to query the trends table:

select  title,max(google_trend)-min(google_trend) trend_diff
from trends
where date between date_parse('2021-01-31','%Y-%m-%d') and date_parse('2021-11-30','%Y-%m-%d')
group by title
order by 2 desc
limit 10

Build a dashboard to visualize the data using QuickSight

We can use QuickSight to build a dashboard on the data downloaded from Google Trends to identify top movies and TV shows. Complete the following steps:

  1. Sign in to your QuickSight account.
  2. On the QuickSight console, choose Datasets and choose New dataset.
  3. Choose Athena as your data source.
  4. For Data source name, enter a name.
  5. For Athena workgroup, choose [primary].
  6. Choose Create data source.
  7. For Database, choose default.
  8. For Tables, select the trends table.
  9. Choose Select.
  10. Select Directly query your data.
  11. Choose Visualize.

For the first visual, we create a bar chart of the top movies or TV shows by title sorted in ascending order of aggregated Google Trends values.

  1. Choose the horizontal bar chart visual type.
  2. For Y axis, choose title.
  3. For Value, choose google_trend (Average).

Next, we create a time series plot of Google Trends count by month for titles.

  1. Add a new visual and choose the autograph visual type.
  2. For X axis, choose date.
  3. For Value, choose google_trend (Sum).
  4. For Color¸ choose title.

Clean up

To avoid incurring future charges, delete the resources you created for AWS Glue, Amazon S3, IAM, and QuickSight.

  1. AWS Glue Catalog table
    • On the AWS Glue console, choose Tables under Databases in the navigation pane.
    • Select the AWS Glue Data Catalog table that you created.
    • On the Actions drop-down menu, choose Delete.
    • Choose Delete to confirm.
  2. AWS Glue Job
    • Choose Jobs in the navigation pane.
    • Select the AWS Glue job you created.
    • On the Actions drop-down menu, choose Delete.
  3. S3 bucket
    • On the Amazon S3 console, choose Buckets in the navigation pane.
    • Choose the bucket you created.
    • Choose Empty and enter your bucket name.
    • Choose Confirm.
    • Choose Delete and enter your bucket name.
    • Choose Delete bucket.
  4. IAM Role
    • On the IAM console, choose Roles in the navigation pane.
    • Choose the role you attached to AWS Glue job.
    • Choose Delete role.
    • Choose Yes.
  5. Amazon QuickSight
    • If you created a QuickSight user for trying out this blog and do not want to retain that access, please ask your QuickSight admin to delete your user.
    • If you created the QuickSight account itself just for trying this blog and no longer want to retain it, use the following steps to delete it.
    • Choose your user name on the application bar, and then choose Manage QuickSight.
    • Choose Account settings.
    • Choose Delete Account.

You can only have one QuickSight account active for each AWS account. Make sure that other users aren’t using QuickSight before you delete the account.

Conclusion

Integrating external data sources such as Google Trends via AWS Glue, Athena, and QuickSight can help you enrich your datasets to yield greater insights. You can use it in a data science context when the model is under-fit and requires more relevant data in order to make better predictions. In this post, we used movies as an example, but the solution extends to a wide breadth of industries, such as products in a retail context or commodities in a finance context. If the simple inventory histories or the transaction dates are available, you may find little correlation to future demand or prices. But with an integrated data pipeline using external data, new relationships in the dataset make the model more reliable.

In a business context, whether your team wants to test out a machine learning (ML) proof of concept more quickly or have limited access to pertinent data, Google Trends integration is a relatively quick way to enrich your data for the purposes of ML and data insights.

You can also extend this concept to other third-party datasets, such as social media sentiment, as your team’s expertise grows and your ML and analytics operations mature. Integrating external datasets such as Google Trends is just one part of the feature and data engineering process, but it’s a great place to start and, in our experience, most often leads to better models that businesses can innovate from.


About the Authors

Drew Philip is a Sr. Solutions Architect with AWS Private Equity. He has held senior technical leadership positions within key AWS partners such as Microsoft, Oracle, and Rackspace. Drew focuses on applied engineering that leverages AI-enabled digital innovation and development, application modernization, resiliency and operational excellence for workloads at scale in the public and private sector. He sits on the board of Calvin University’s computer science department and is a contributing member of the AWS Machine Learning Technical Focus Community.

Gautam Prothia is a Senior Solution Architect within AWS dedicated to Strategic Accounts. Gautam has more than 15 years of experience designing and implementing large-scale data management and analytical solutions. He has worked with many clients across industries to help them modernize their data platforms on the cloud.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Scale Amazon Redshift to meet high throughput query requirements

Post Syndicated from Erik Anderson original https://aws.amazon.com/blogs/big-data/scale-amazon-redshift-to-meet-high-throughput-query-requirements/

Many enterprise customers have demanding query throughput requirements for their data warehouses. Some may be able to address these requirements through horizontally or vertically scaling a single cluster. Others may have a short duration where they need extra capacity to handle peaks that can be addressed through Amazon Redshift concurrency scaling. However, enterprises with consistently high demand that can’t be serviced by a single cluster need another option. These enterprise customers require large datasets to be returned from queries at a high frequency. These scenarios are also often paired with legacy business intelligence (BI) tools where data is further analyzed.

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics platform. These customers range from small startups to some of the world’s largest enterprises. Users such as data analysts, database developers, and data scientists use Amazon Redshift to analyze their data to make better business decisions.

This post provides an overview of the available scaling options for Amazon Redshift and also shares a new design pattern that enables query processing in scenarios where multiple leader nodes are required to extract large datasets for clients or BI tools without introducing additional overhead.

Common Amazon Redshift scaling patterns

Because Amazon Redshift is a managed cloud data warehouse, you only pay for what you use, so sizing your cluster appropriately is critical for getting the best performance at the lowest cost. This process begins with choosing the appropriate instance family for your Amazon Redshift nodes. For new workloads that are planning to scale, we recommend starting with our RA3 nodes, which allow you to independently tailor your storage and compute requirements. The RA3 nodes provide three instance types to build your cluster with: ra3.xlplus, ra3.4xlarge, and ra3.16xlarge.

Horizontal cluster scaling

Let’s assume for this example, you build your cluster with four ra3.4xlarge nodes. This configuration provides 48 vCPUs and 384 GiB RAM. Your workload is consistent throughout the day, with few peaks and valleys. As adoption increases and more users need access to the data, you can add nodes of the same node type to your cluster to increase the amount of compute power available to handle those queries. An elastic resize is the fastest way to horizontally scale your cluster to add nodes as a consistent load increases.

Vertical cluster scaling

Horizontal scaling has its limits, however. Each node type has a limit to the number of nodes that can be managed in a single cluster. To continue with the previous example, ra3.4xlarge nodes have a maximum of 64 nodes per cluster. If your workload continues to grow and you’re approaching this limit, you may decide to vertically scale your cluster. Vertically scaling increases the resources given to each node. Based on the additional resources provided by the larger nodes, you will likely decrease the quantity of nodes at the same time.

Rather than running a cluster with 64 ra3.4xlarge nodes, you could elastically resize your cluster to use 16 ra3.16xlarge nodes and have the equivalent resources to host your cluster. The transition to a larger node type allows you to horizontally scale with those larger nodes. You can create an Amazon Redshift cluster with up to 16 nodes. However, after creation, you can resize your cluster to contain up to 32 ra3.xlplus nodes, up to 64 ra3.4xlarge nodes, or up to 128 ra3.16xlarge nodes.
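
If you want to script these resize operations rather than use the console, the following boto3 sketch shows what an elastic resize might look like for both cases: adding nodes of the same type (horizontal scaling) and moving to a larger node type (vertical scaling). The cluster identifier and node counts are placeholder values for illustration, not part of the scenario above; you would run one call or the other depending on which direction you're scaling.

import boto3

redshift = boto3.client("redshift")

# Horizontal scaling: elastic resize that adds more nodes of the same node type.
# "my-redshift-cluster" and the node counts are placeholders.
redshift.resize_cluster(
    ClusterIdentifier="my-redshift-cluster",
    NodeType="ra3.4xlarge",
    NumberOfNodes=8,
    Classic=False,  # False requests an elastic resize rather than a classic resize
)

# Vertical scaling: elastic resize to a larger node type with fewer nodes,
# for example moving from 64 x ra3.4xlarge to 16 x ra3.16xlarge.
# Run this instead of (not after) the call above; only one resize can be in progress.
redshift.resize_cluster(
    ClusterIdentifier="my-redshift-cluster",
    NodeType="ra3.16xlarge",
    NumberOfNodes=16,
    Classic=False,
)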

Concurrency scaling

In March 2019, AWS announced the availability of Amazon Redshift concurrency scaling. Concurrency scaling allows you to add more query processing power to your cluster, but only when you need it. Rather than a consistent volume of workload throughout the day, perhaps there are short periods of time when you need more resources. When you choose concurrency scaling, Amazon Redshift automatically and transparently adds more processing power for just those times when you need it. This is a cost-effective, low-touch option for burst workloads. You only pay for what you use on a per-second basis, and you accumulate 1 hour’s worth of concurrency scaling credits every 24 hours. Those free credits have met the needs of 97% of our Amazon Redshift customers’ concurrency scaling requirements, meaning that most customers get the benefits of concurrency scaling without increasing their costs.

The size of your concurrency scaling cluster is directly proportional to your cluster size, so it also scales as your cluster does. By right-sizing your base cluster and using concurrency scaling, you can address the vast majority of performance requirements.
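
Concurrency scaling is enabled per WLM queue (by setting the queue's concurrency scaling mode to auto) and capped by the max_concurrency_scaling_clusters parameter in your cluster's parameter group. As a hedged sketch, the following boto3 call raises that cap; the parameter group name is a placeholder, and per-queue enablement is still done through your WLM configuration.

import boto3

redshift = boto3.client("redshift")

# Raise the maximum number of concurrency scaling clusters that can be attached.
# "my-custom-parameter-group" is a placeholder parameter group name.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-custom-parameter-group",
    Parameters=[
        {
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "4",
        }
    ],
)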

Multi-cluster scaling

Although the previous three scaling options work together to address the needs of the vast majority of our customers, some customers need another option. These use cases require large datasets to be returned from queries at a high frequency and perform further analysis on them using legacy BI tools.

While working with customers to address these use cases, we have found that in these scenarios, multiple medium-sized clusters can perform better than a single large cluster. This phenomenon mostly relates to the single Amazon Redshift leader node’s throughput capacity.

This last scaling pattern uses multiple Amazon Redshift clusters, which allows you to achieve near-limitless read scalability. Rather than relying on a single cluster, a single leader node, and concurrency scaling, this architecture allows you to add as many resources as needed to address your high throughput query requirements. This pattern relies on Amazon Redshift data sharing abilities to enable a seamless multi-cluster experience.

The remainder of this post covers the details of this architecture.

Solution overview

The following diagram outlines a multi-cluster architecture.

The first supporting component for this architecture is Amazon Redshift managed storage. Managed storage is available for RA3 nodes and allows the complete decoupling of compute and storage resources. This decoupling supports another feature that was announced at AWS re:Invent 2020—data sharing. Data sharing is primarily intended to let you share data amongst different data warehouse groups so that you can retain a single set of data to remove duplication. Data sharing ensures that the users accessing the data are using compute on their clusters rather than using compute on the owning cluster, which better aligns cost to usage.

In this post, we introduce another use case of data sharing: horizontal cluster scaling. This architecture allows you to create two or more clusters to handle high throughput query requirements while maintaining a single data source.

An important component in this design is the Network Load Balancer (NLB). The NLB serves as a single access point for clients to connect to the backend data warehouse for performing reads. It also allows changing the number of underlying clusters transparently to users. If you decide to add or remove clusters, all you need to do is add or remove targets in your NLB. It’s also important to note that this design can use any of the previous three scaling options (horizontal, vertical, and concurrency scaling) to fine-tune the number of resources available to service your particular workload.

Prerequisites

Let’s start by creating two Amazon Redshift clusters of RA3 instance type, and name them producer_cluster and consumer_cluster. For instructions, refer to Create a cluster.

In this post, our producer cluster is a central ETL cluster hosting enterprise sales data using a 3 TB Cloud DW dataset based on the TPC-DS benchmark.

The next step is to configure data sharing between the producer and consumer clusters.

Set up data sharing at the producer cluster

In this step, you need a cluster namespace from the consumer_cluster. One way to find the namespace value of a cluster is to run the SQL statement SELECT CURRENT_NAMESPACE when connected to the consumer_cluster. Another way is through the Amazon Redshift console. Navigate to your Amazon Redshift consumer_cluster, and find the cluster namespace located in the General information section.
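
If you prefer to look up the namespace programmatically rather than on the console, a minimal sketch using the Amazon Redshift Data API might look like the following; the cluster identifier, database, and user are placeholders.

import time
import boto3

client = boto3.client("redshift-data")

# Run SELECT CURRENT_NAMESPACE on the consumer cluster (identifiers are placeholders).
res = client.execute_statement(
    ClusterIdentifier="consumer-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="select current_namespace;",
)

# Poll until the statement finishes, then print the namespace value.
while client.describe_statement(Id=res["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = client.get_statement_result(Id=res["Id"])
print(result["Records"][0][0]["stringValue"])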

After you connect to the producer cluster, create the data share and add the schema and tables to the data share. Then, grant usage to the consumer namespace by providing the namespace value. See the following code:

/* Create Datashare and add objects to the share */ 
CREATE DATASHARE producertpcds3tb;

ALTER DATASHARE producertpcds3tb ADD SCHEMA order_schema;
ALTER DATASHARE producertpcds3tb ADD ALL TABLES IN SCHEMA order_schema;

GRANT USAGE ON DATASHARE producertpcds3tb TO NAMESPACE '<consumer namespace>';

You can validate that data sharing was correctly configured by querying these views from the producer cluster:

SELECT * FROM SVV_DATASHARES;
SELECT * FROM SVV_DATASHARE_OBJECTS;

Set up data sharing at the consumer cluster

Get the cluster namespace of the producer cluster by following same steps for the consumer cluster. After you connect to the consumer cluster, you can create a database referencing the data share of the producer cluster. Then you create an external schema and set the search path in the consumer cluster, which allows schema-level access control within the consumer cluster and uses a two-part notation when referencing shared data objects. Finally, you grant usage on the database to a user, and run a query to check if objects as part of data share are accessible. See the following code:

/* Create a local database and schema reference */

CREATE DATABASE tpcds_3tb FROM DATASHARE producertpcds3tb OF NAMESPACE '<producer namespace>';


/*Create External schema */
CREATE EXTERNAL SCHEMA order_schema FROM REDSHIFT DATABASE 'tpcds_3tb' SCHEMA 'order_schema';

SET SEARCH_PATH TO order_schema,public;


/* Grant usage on database to a user */ 

GRANT USAGE ON DATABASE tpcds_3tb TO awsuser;

/* Query to check objects accessible from the consumer cluster */

SELECT * FROM SVV_DATASHARE_OBJECTS;

Set up the Network Load Balancer

After you set up data sharing at both the producer_cluster and consumer_cluster, the next step is to configure a Network Load Balancer to accept connections through a single endpoint and forward the connections to both clusters for reading data via queries.

As a prerequisite, collect the following information for the producer and consumer clusters from the cluster properties section of the Amazon Redshift console. Unless a parameter explicitly refers to the consumer cluster, use the producer cluster's values.

  • VPCid – The VPC of the Amazon Redshift cluster
  • NLBSubnetid – The subnet where the NLB ENI is created (the NLB and Amazon Redshift subnets need to be in the same Availability Zone)
  • NLBSubnetCIDR – Used for allowlisting inbound access in the Amazon Redshift security group
  • NLBPort – The port to be used by the NLB listener, usually the same as the Amazon Redshift port (5439)
  • RedshiftPrivateIP (producer) – The IP address of the Amazon Redshift leader node of the producer cluster
  • RedshiftPrivateIP (consumer) – The IP address of the Amazon Redshift leader node of the consumer cluster
  • RedshiftPort – The port used by the Amazon Redshift clusters, usually 5439
  • RedshiftSecurityGroup – The security group that allows connectivity to the Amazon Redshift cluster

After you collect this information, run the AWS CloudFormation script NLB.yaml to set up the Network Load Balancer for the producer and consumer clusters. The following screenshot shows the stack parameters.
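
If you prefer the AWS CLI or an SDK over the console, the stack can also be created programmatically. The following boto3 sketch assumes you have uploaded NLB.yaml to an S3 location of your own; the parameter keys mirror the table above and should be verified against the actual template, and all values shown are placeholders.

import boto3

cfn = boto3.client("cloudformation")

# All values below are placeholders; replace them with the details collected
# from your producer and consumer clusters. Remaining parameters from the
# preceding table (private IPs, security group, and so on) are omitted here.
cfn.create_stack(
    StackName="redshift-nlb-setup",
    TemplateURL="https://s3.amazonaws.com/<your-bucket>/NLB.yaml",
    Parameters=[
        {"ParameterKey": "VPCid", "ParameterValue": "vpc-0123456789abcdef0"},
        {"ParameterKey": "NLBSubnetid", "ParameterValue": "subnet-0123456789abcdef0"},
        {"ParameterKey": "NLBSubnetCIDR", "ParameterValue": "10.0.1.0/24"},
        {"ParameterKey": "NLBPort", "ParameterValue": "5439"},
        {"ParameterKey": "RedshiftPort", "ParameterValue": "5439"},
    ],
)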

After you create the CloudFormation stack, note the NLB endpoint on the stack’s Outputs tab. You use this endpoint to connect to the Amazon Redshift clusters.

This NLB setup is done for both the producer and consumer clusters by the CloudFormation stack. If needed, you can add additional Amazon Redshift clusters to an existing NLB by navigating to the Target groups page of the Amazon EC2 console. Then navigate to rsnlbsetup-target and add the Amazon Redshift cluster leader node private IP and port.
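
The same target registration can be scripted. In the following boto3 sketch, the target group name rsnlbsetup-target comes from the stack, while the IP address is a placeholder; it registers an additional cluster's leader node IP and shows how you would deregister it later when demand drops.

import boto3

elbv2 = boto3.client("elbv2")

# Look up the target group created by the CloudFormation stack.
tg = elbv2.describe_target_groups(Names=["rsnlbsetup-target"])["TargetGroups"][0]
tg_arn = tg["TargetGroupArn"]

# Register the leader node private IP of an additional Redshift cluster (placeholder IP).
elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "10.0.1.25", "Port": 5439}],
)

# Later, when demand reduces, deregister the same target.
elbv2.deregister_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "10.0.1.25", "Port": 5439}],
)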

Validate the connections to the Amazon Redshift clusters

After you set up the NLB, the next step is to validate the connectivity to the Amazon Redshift clusters. You can do this by first configuring SQL tools like SQL Workbench, DBeaver, or Aginity Workbench and setting the host name and endpoint to the Amazon Redshift cluster’s NLB endpoint, as shown in the following screenshot. For additional configuration information, see Connecting to an Amazon Redshift cluster using SQL client tools.

Repeat this process a few times to validate that connections are made to both clusters. You can use the same NLB endpoint as the host name when configuring other SQL client tools.

As a next step, we use JMeter to show how the NLB is connecting to each of the clusters. The Apache JMeter application is open-source software, a 100% pure Java application designed to load test functional behavior and measure performance. Our NLB connects to each cluster in a round-robin manner, which enables even distribution of read load on Amazon Redshift clusters.

Setting up JMeter is out of scope of this post; refer to Building high-quality benchmark tests for Amazon Redshift using Apache JMeter to learn more about setting up JMeter and performance testing on an Amazon Redshift cluster.

The following screenshot shows the HTML output of the response data from JMeter testing. It shows that requests go to both the Amazon Redshift producer and consumer clusters in a round-robin manner.

The preceding screenshot shows a sample output from running 20 SQL queries. We also tested with over 1,000 SQL runs across four or more Amazon Redshift clusters, and the NLB distributed them as evenly as possible across all of those clusters.

With this setup, you have the flexibility to add Amazon Redshift clusters to your NLB as needed and can configure data sharing to enable horizontal scaling of Amazon Redshift clusters. When demand reduces, you can either de-register some of the Amazon Redshift clusters at the NLB configuration or simply pause the Amazon Redshift cluster and the NLB automatically connects to only those clusters that are available at the time.

Conclusion

In this post, you learned about the different ways that Amazon Redshift can scale to meet your needs as they adjust over time. Use horizontal scaling to increase the number of nodes in your cluster. Use vertical scaling to increase the size of each node. Use concurrency scaling to dynamically address peak workloads. Use multiple clusters with data sharing behind an NLB to provide near-endless scalability. You can use these architectures independently or in combination with each other to build your high-performing, cost-effective data warehouse using Amazon Redshift.

To learn more about some of the foundational features used in the architecture mentioned in this post, refer to:


About the Authors

Erik Anderson is a Principal Solutions Architect at AWS. He has nearly two decades of experience guiding numerous Fortune 100 companies along their technology journeys. He is passionate about helping enterprises build scalable, performant, and cost-effective solutions in the cloud. In his spare time, he loves spending time with his family, home improvement projects, and playing sports.

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to build next-generation analytics solutions using other AWS Analytics services.

Amazon Redshift continues its price-performance leadership

Post Syndicated from Stefan Gromoll original https://aws.amazon.com/blogs/big-data/amazon-redshift-continues-its-price-performance-leadership/

Data is a strategic asset. Getting timely value from data requires high-performance systems that can deliver performance at scale while keeping costs low. Amazon Redshift is the most popular cloud data warehouse that is used by tens of thousands of customers to analyze exabytes of data every day. We continue to add new capabilities to improve the price-performance ratio for our customers as you bring more data to your Amazon Redshift environments.

This post goes into detail on the analytic workload trends we’re seeing from the Amazon Redshift fleet’s telemetry data, new capabilities we have launched to improve Amazon Redshift’s price-performance, and the results from the latest benchmarks derived from TPC-DS and TPC-H, which reinforce our leadership.

Data-driven performance optimization

We relentlessly focus on improving Amazon Redshift’s price-performance so that you continue to see improvements in your real-world workloads. To this end, the Amazon Redshift team takes a data-driven approach to performance optimization. Werner Vogels discussed our methodology in Amazon Redshift and the art of performance optimization in the cloud, and we have continued to focus our efforts on using performance telemetry from our large customer base to drive the Amazon Redshift performance improvements that matter most to our customers.

At this point, you might ask why price-performance matters. One critical aspect of a data warehouse is how it scales as your data grows. Will you be paying more per TB as you add more data, or will your costs remain consistent and predictable? We work to make sure that Amazon Redshift delivers not only strong performance as your data grows, but also consistent price-performance.

Optimizing high-concurrency, low-latency workloads

One of the trends that we have observed is that customers are increasingly building analytics applications that require high concurrency of low-latency queries. In the context of data warehousing, this can mean hundreds or even thousands of users running queries with response time SLAs of under 5 seconds.

A common scenario is an Amazon Redshift-powered business intelligence dashboard that serves analytics to a very large number of analysts. For example, one of our customers processes foreign exchange rates and delivers insights based on this data to their users using an Amazon Redshift-powered dashboard. These users generate an average of 200 concurrent queries to Amazon Redshift that can spike to 1,200 concurrent queries at the open and close of the market, with a P90 query SLA of 1.5 seconds. Amazon Redshift is able to meet this requirement, so this customer can meet their business SLAs and provide the best service possible to their users.

A specific metric we track is the percentage of runtime across all clusters that is spent on short-running queries (queries with runtime less than 1 second). Over the last year, we’ve seen a significant increase in short query workloads in the Amazon Redshift fleet, as shown in the following chart.

As we started to look deeper into how Amazon Redshift ran these kinds of workloads, we discovered several opportunities to optimize performance to give you even better throughput on short queries:

  • We significantly reduced Amazon Redshift’s query-planning overhead. Although this overhead is small in absolute terms, it can be a significant portion of the runtime of short queries.
  • We improved the performance of several core components for situations where many concurrent processes contend for the same resources. This further reduced our query overhead.
  • We made improvements that allowed Amazon Redshift to more efficiently burst these short queries to concurrency scaling clusters to improve query parallelism.

To see where Amazon Redshift stood after making these engineering improvements, we ran an internal test using the Cloud Data Warehouse Benchmark derived from TPC-DS (see a later section of this post for more details on the benchmark, which is available in GitHub). To simulate a high-concurrency, low-latency workload, we used a small 10 GB dataset so that all queries ran in a few seconds or less. We also ran the same benchmark against several other cloud data warehouses. We didn’t enable auto scaling features such as concurrency scaling on Amazon Redshift for this test because not all data warehouses support it. We used an ra3.4xlarge Amazon Redshift cluster, and sized all other warehouses to the closest matching price-equivalent configuration using on-demand pricing. Based on this configuration, we found that Amazon Redshift can deliver up to 8x better performance on analytics applications that predominantly required short queries with low latency and high concurrency, as shown in the following chart.

With Concurrency Scaling on Amazon Redshift, throughput can be seamlessly and automatically scaled to additional Amazon Redshift clusters as user concurrency grows. We increasingly see customers using Amazon Redshift to build such analytics applications based on our telemetry data.

This is just a small peek into the behind-the-scenes engineering improvements our team is continually making to help you improve performance and save costs using a data-driven approach.

New features improving price-performance

With the constantly evolving data landscape, customers want high-performance data warehouses that continue to launch new capabilities to deliver the best performance at scale while keeping costs low for all workloads and applications. We have continued to add features that improve Amazon Redshift’s price-performance out of the box at no additional cost to you, allowing you to solve business problems at any scale. These features include the use of best-in-class hardware through the AWS Nitro System, hardware acceleration with AQUA, auto-rewriting queries so that they run faster using materialized views, Automatic Table Optimization (ATO) for schema optimization, Automatic Workload Management (WLM) to offer dynamic concurrency and optimize resource utilization, short query acceleration, automatic materialized views, vectorization and single instruction/multiple data (SIMD) processing, and much more. Amazon Redshift has evolved to become a self-learning, self-tuning data warehouse, abstracting away the performance management effort needed so you can focus on high-value activities like building analytics applications.

To validate the impact of the latest Amazon Redshift performance enhancements, we ran price-performance benchmarks comparing Amazon Redshift with other cloud data warehouses. For these tests, we ran both a TPC-DS-derived benchmark and a TPC-H-derived benchmark using a 10-node ra3.4xlarge Amazon Redshift cluster. To run the tests on other data warehouses, we chose warehouse sizes that most closely matched the Amazon Redshift cluster in price ($32.60 per hour), using published on-demand pricing for all data warehouses. Because Amazon Redshift is an auto-tuning warehouse, all tests are “out of the box,” meaning no manual tunings or special database configurations are applied—the clusters are launched and the benchmark is run. Price-performance is then calculated as cost per hour (USD) times the benchmark runtime in hours, which is equivalent to the cost to run the benchmark.
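
As a concrete illustration of that formula, the short Python sketch below simply multiplies hourly cost by runtime. The $32.60 per hour figure matches the cluster price quoted above; the runtime is a made-up value, not a benchmark result.

# Price-performance = cost per hour (USD) x benchmark runtime in hours,
# which is the cost to run the benchmark. The runtime below is illustrative only.
cost_per_hour_usd = 32.60          # 10-node ra3.4xlarge on-demand price used in the tests
runtime_seconds = 5_400            # hypothetical total runtime of the benchmark queries
runtime_hours = runtime_seconds / 3600

price_performance_usd = cost_per_hour_usd * runtime_hours
print(f"Cost to run the benchmark: ${price_performance_usd:.2f}")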

For both the TPC-DS-derived and TPC-H-derived tests, we find that Amazon Redshift consistently delivers the best price-performance. The following chart shows the results for the TPC-DS-derived benchmark.

The following chart shows the results for the TPC-H-derived benchmark.

Although these benchmarks reaffirm Amazon Redshift’s price-performance leadership, we always encourage you to try Amazon Redshift using your own proof-of-concept workloads as the best way to see how Amazon Redshift can meet your data needs.

Find the best price-performance for your workloads

The benchmarks used in this post are derived from the industry-standard TPC-DS and TPC-H benchmarks, and have the following characteristics:

  • The schema and data are used unmodified from TPC-DS and TPC-H.
  • The queries are used unmodified from TPC-DS and TPC-H. TPC-approved query variants are used for a warehouse if the warehouse doesn’t support the SQL dialect of the default TPC-DS or TPC-H query.
  • The test includes only the 99 TPC-DS and 22 TPC-H SELECT queries. It doesn’t include maintenance and throughput steps.
  • Three power runs (single stream) were run with query parameters generated using the default random seed of the TPC-DS and TPC-H kits.
  • The primary metric of total query runtime is used when calculating price-performance. The runtime is taken as the best of the three runs.
  • Price-performance is calculated as cost per hour (USD) times the benchmark runtime in hours, which is equivalent to cost to run the benchmark. Published on-demand pricing is used for all data warehouses.

We call this benchmark the Cloud Data Warehouse Benchmark, and you can easily reproduce the preceding benchmark results using the scripts, queries, and data available on GitHub. It is derived from the TPC-DS and TPC-H benchmarks as described earlier, and as such is not comparable to published TPC-DS or TPC-H results, because the results of our tests don’t comply with the specification.

Each workload has unique characteristics, so if you’re just getting started, a proof of concept is the best way to understand how Amazon Redshift performs for your requirements. When running your own proof of concept, it’s important to focus on the right metrics—query throughput (number of queries per hour) and price-performance. You can make a data-driven decision by running a proof of concept on your own or with assistance from AWS or a system integration and consulting partner.

Conclusion

This post discussed the analytic workload trends we’re seeing from Amazon Redshift customers, new capabilities we have launched to improve Amazon Redshift’s price-performance, and the results from the latest benchmarks.

If you’re an existing Amazon Redshift customer, connect with us for a free optimization session and briefing on the new features announced at AWS re:Invent 2021. To stay up to date with the latest developments in Amazon Redshift, follow the What’s New in Amazon Redshift feed.


About the Authors

Stefan Gromoll is a Senior Performance Engineer with Amazon Redshift where he is responsible for measuring and improving Redshift performance. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.

Ravi Animi is a Senior Product Management leader in the Redshift Team and manages several functional areas of the Amazon Redshift cloud data warehouse service including performance, spatial analytics, streaming ingestion and migration strategies. He has experience with relational databases, multi-dimensional databases, IoT technologies, storage and compute infrastructure services and more recently as a startup founder using AI/deep learning, computer vision, and robotics.

Florian Wende is a Performance Engineer with Amazon Redshift.

Automate notifications on Slack for Amazon Redshift query monitoring rule violations

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/automate-notifications-on-slack-for-amazon-redshift-query-monitoring-rule-violation/

In this post, we walk you through how to set up automatic notifications of query monitoring rule (QMR) violations in Amazon Redshift to a Slack channel, so that Amazon Redshift users can take timely action.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. With Amazon Redshift, you can analyze your data to derive holistic insights about your business and your customers. One of the challenges is to protect the data warehouse workload from poorly written queries that can consume significant resources. Amazon Redshift query monitoring rules are a feature of workload management (WLM) that allow automatic handling of poorly written queries. Rules that are applied to a WLM queue allow queries to be logged, canceled, hopped (only available with manual WLM), or to change priority (only available with automatic WLM). The reason to use QMRs is to protect against wasteful use of the cluster. You can also use these rules to log resource-intensive queries, which provides the opportunity to establish governance for ad hoc workloads.

The Amazon Redshift cluster automatically collects query monitoring rules metrics. This convenient mechanism lets you view attributes like the following:

  • Query runtime, in seconds
  • Query return row count
  • The CPU time for a SQL statement

It also makes Amazon Redshift Spectrum metrics available, such as the number of Redshift Spectrum rows and MBs scanned by a query.

When a query violates a QMR, Amazon Redshift logs the violation into the STL_WLM_RULE_ACTION system view. If the rule’s action is abort, end-users see an error indicating that the query failed due to a QMR violation. We recommend that administrative team members periodically examine the violations listed in the STL_WLM_RULE_ACTION table and coach the involved end-users on how to avoid future rule violations.

Alternatively, a centralized team that uses a Slack channel for collaboration and monitoring can have Amazon Redshift events and alarms sent to that channel, so that they can take timely action. In the following sections, we walk you through how to set up automatic notifications of QMR violations to a Slack channel by using Amazon SNS, AWS Lambda, and a Slack webhook. This allows Amazon Redshift users to be notified and take timely action without the need to query the system view.

Solution overview

To demonstrate how you can receive automatic notification to a Slack channel for QMR violation, we have designed the following architecture. As shown in the following diagram, we have mixed workload extract, transform, and load (ETL), business intelligence (BI) dashboards, and analytics applications that are powered by an Amazon Redshift cluster. The solution relies on AWS Lambda and Amazon Simple Notification Service (Amazon SNS) to send notifications of Amazon Redshift QMR violations to Slack.

To implement this solution, you create an Amazon Redshift cluster and attach a custom defined parameter group.

Amazon Redshift provides one default parameter group for each parameter group family. The default parameter group has preset values for each of its parameters, and it can’t be modified. If you want to use different parameter values than the default parameter group, you must create a custom parameter group and then associate your cluster with it.

In the parameter group, you can use automatic WLM and define a few workload queues, such as a queue for processing ETL workloads and a reporting queue for user queries. You can name the default queue adhoc. With automatic WLM, Amazon Redshift determines the optimal concurrency and memory allocation for each query that is running in each queue.

For each workload queue, you can define one or more QMRs. For example, you can create a rule to abort a user query if it runs for more than 300 seconds or returns more than 1 billion rows. Similarly, you can create a rule to log a Redshift Spectrum query that scans more than 100 MB.

The Amazon Redshift WLM evaluates metrics every 10 seconds. It records details about actions that result from QMR violations associated with user-defined queues in the STL_WLM_RULE_ACTION system table. In this solution, a Lambda function is scheduled to monitor the STL_WLM_RULE_ACTION system table every few minutes. When the function is invoked, if it finds a new entry, it publishes a detailed message to an SNS topic. A second Lambda function, created as the target subscriber to the SNS topic, is invoked whenever any message is published to the SNS topic. This second function invokes a pre-created Slack webhook, which sends the message that was received through the SNS topic to the Slack channel of your choice. (For more information on publishing messages by using Slack webhooks, see Sending messages using incoming webhooks.)

To summarize, the solution involves the following steps:

  1. Create an Amazon Redshift custom parameter group and add workload queues.
  2. Configure query monitoring rules.
  3. Attach the custom parameter group to the cluster.
  4. Create an SNS topic.
  5. Create a Lambda function and schedule it to run every 5 minutes by using an Amazon EventBridge rule.
  6. Create the Slack resources.
  7. Add an incoming webhook and authorize the Slack app to post messages to a Slack channel.
  8. Create the second Lambda function and subscribe to the SNS topic.
  9. Test the solution.

Create an Amazon Redshift custom parameter group and add workload queues

In this step, you create an Amazon Redshift custom parameter group with automatic WLM enabled. You also create the following queues to separate the workloads in the parameter group:

  • reporting – The reporting queue runs BI reporting queries that are performed by any user who belongs to the Amazon Redshift database group named reporting_group
  • adhoc – The default queue, renamed adhoc, performs any query that is not sent to any other queue

Complete the following steps to create your parameter group and add workload queues:

  1. Create a parameter group, named csblog, with automatic WLM enabled.
  2. On the Amazon Redshift console, select the custom parameter group you created.
  3. Choose Edit workload queues.
  4. On the Modify workload queues page, choose Add queue.
  5. Fill in the Concurrency scaling mode and Query priority fields as needed to create the reporting queue.
  6. Repeat these steps to add the adhoc queue.

For more information about WLM queues, refer to Configuring workload management.

Configure query monitoring rules

In this step, you add QMRs to each workload queue. For instructions, refer to Creating or modifying a query monitoring rule using the console.

For the reporting queue, add the following QMRs:

  • nested_loop – Logs any query involved in a nested loop join that produces a row count of more than 10,000,000.
  • long_running – Stops queries that run for more than 300 seconds (5 minutes).

For the adhoc queue, add the following QMRs:

  • returned_rows – Stops any query that returns more than 1,000,000 rows to the calling client application (returning this many rows isn’t practical and can degrade the end-to-end performance of the application).
  • spectrum_scan – Stops any query that scans more than 1000 MB of data from an Amazon Simple Storage Service (Amazon S3) data lake by using Redshift Spectrum.

Attach the custom parameter group to the cluster

To attach the custom parameter group to your provisioned Redshift cluster, follow the instructions in Associating a parameter group with a cluster. If you don’t already have a provisioned Redshift cluster, refer to Create a cluster.

For this post, we attached our custom parameter group csblog to an already created provisioned Amazon Redshift cluster.

Create an SNS topic

In this step, you create an SNS topic that receives a detailed message of QMR violation from the Lambda function that checks the Amazon Redshift system table for QMR violation entries. For instructions, refer to Creating an Amazon SNS topic.

For this post, we created an SNS topic named redshiftqmrrulenotification.

Create a Lambda function to monitor the system table

In this step, you create a Lambda function that monitors the STL_WLM_RULE_ACTION system table. Whenever any record is found in the table since the last time the function ran, the function publishes a detailed message to the SNS topic that you created earlier. You also create an EventBridge rule to invoke the function every 5 minutes.

For this post, we create a Lambda function named redshiftqmrrule that is scheduled to run every 5 minutes via an EventBridge rule named Redshift-qmr-rule-Lambda-schedule. For instructions, refer to Building Lambda functions with Python.

The following screenshot shows the function that checks the pg_catalog.stl_wlm_rule_action table.

To create an EventBridge rule and associate it with the Lambda function, refer to Create a Rule.
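
If you want to create the schedule programmatically instead of on the console, a hedged boto3 sketch might look like the following; the account ID and function ARN are placeholders, while the rule and function names match the ones used in this post.

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

function_arn = "arn:aws:lambda:us-east-1:123456789012:function:redshiftqmrrule"  # placeholder ARN

# Create (or update) a rule that fires every 5 minutes.
rule_arn = events.put_rule(
    Name="Redshift-qmr-rule-Lambda-schedule",
    ScheduleExpression="rate(5 minutes)",
    State="ENABLED",
)["RuleArn"]

# Point the rule at the Lambda function that checks STL_WLM_RULE_ACTION.
events.put_targets(
    Rule="Redshift-qmr-rule-Lambda-schedule",
    Targets=[{"Id": "redshiftqmrrule-target", "Arn": function_arn}],
)

# Allow EventBridge to invoke the function.
lambda_client.add_permission(
    FunctionName="redshiftqmrrule",
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)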

The following screenshot shows the EventBridge rule Redshift-qmr-rule-Lambda-schedule, which calls the function every 5 minutes.

We use the following Python 3.9 code for this Lambda function. The function uses an Amazon Redshift Data API call that uses GetClusterCredentials for temporary credentials.

import time
import sys
import os
import boto3
from datetime import datetime

# initiate redshift-data client in boto3
client = boto3.client("redshift-data")

# query new QMR violations recorded in the last 5 minutes (system queries are excluded)
query = "select userid,query,service_class,trim(rule) as rule,trim(action) as action,recordtime from stl_wlm_rule_action WHERE userid > 1 AND recordtime >= current_timestamp AT TIME ZONE 'UTC' - INTERVAL '5 minute' order by recordtime desc;"

# SNS topic used to publish notifications; the ARN is passed in through the sns_arn environment variable
sns = boto3.resource('sns')
sns_arn = os.environ['sns_arn']
platform_endpoint = sns.PlatformEndpoint('{sns_arn}'.format(sns_arn = sns_arn))

def status_check(client, query_id):
    desc = client.describe_statement(Id=query_id)
    status = desc["Status"]
    if status == "FAILED":
        raise Exception('SQL query failed:' + query_id + ": " + desc["Error"])
    return status.strip('"')

def execute_sql(sql_text, redshift_database, redshift_user, redshift_cluster_id):
    print("Executing: {}".format(sql_text))
    res = client.execute_statement(Database=redshift_database, DbUser=redshift_user, Sql=sql_text,
                                   ClusterIdentifier=redshift_cluster_id)
    
    query_id = res["Id"]
    print("query id")
    print(query_id)
    done = False
    while not done:
        time.sleep(1)
        status = status_check(client, query_id)
        if status in ("FAILED", "FINISHED"):
            print("status is: {}".format(status))
            break
    return query_id

def publish_to_sns(message):
    try:
        # Publish a message.
        response = platform_endpoint.publish(
                  Subject='Redshift Query Monitoring Rule Notifications',
                  Message=message,
                  MessageStructure='string'

            )
        return  response

    except:
        print(' Failed to publish messages to SNS topic: exception %s' % sys.exc_info()[1])
        return 'Failed'

def lambda_handler(event, context):
    
    rsdb = os.environ['rsdb']
    rsuser = os.environ['rsuser']
    rscluster = os.environ['rscluster']
    #print(query)
    res = execute_sql(query, rsdb, rsuser, rscluster)
    print("res")
    print(res)
    response = client.get_statement_result(
        Id = res
    )
    # datetime object containing current date and time
    now = datetime.now()
    dt_string = now.strftime("%d-%b-%Y %H:%M:%S")
    print(response) 
    if response['TotalNumRows'] > 0:
        messageText = '################## Reporting Begin' + ' [' + str(dt_string) + ' UTC] ##################\n\n'
        messageText = messageText + 'Total number of queries affected by QMR Rule violation for Redshift cluster "' + rscluster + '" is ' + str(len(response['Records'])) + '.' + '\n' + '\n'
        for i in range(len(response['Records'])):
            messageText = messageText + 'It was reported at ' + str(response['Records'][i][5]['stringValue'])[11:19] + ' UTC on ' + str(response['Records'][i][5]['stringValue'])[0:10] + ' that a query with Query ID - ' + str(response['Records'][i][1]['longValue']) + ' had to ' +  str(response['Records'][i][4]['stringValue']) + ' due to violation of QMR Rule "' + str(response['Records'][i][3]['stringValue']) + '".\n'
        messageText = messageText + '\n########################### Reporting End ############################\n\n'
        query_result_json = messageText
        response = publish_to_sns(query_result_json)
    else:
        print('No rows to publish to SNS')

We use four environment variables for this Lambda function, which you can set as shown in the sketch after this list:

  • rscluster – The Amazon Redshift provisioned cluster identifier
  • rsdb – The Amazon Redshift database where you’re running these tests
  • rsuser – The Amazon Redshift user who has the privilege to run queries on pg_catalog.stl_wlm_rule_action
  • sns_arn – The Amazon Resource Name (ARN) of the SNS topic that we created earlier
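
These variables can be set on the function's Configuration tab or programmatically. The following boto3 sketch uses placeholder values for the cluster, database, user, and topic ARN.

import boto3

lambda_client = boto3.client("lambda")

# All values are placeholders; substitute your cluster, database, user, and SNS topic ARN.
lambda_client.update_function_configuration(
    FunctionName="redshiftqmrrule",
    Environment={
        "Variables": {
            "rscluster": "my-redshift-cluster",
            "rsdb": "dev",
            "rsuser": "awsuser",
            "sns_arn": "arn:aws:sns:us-east-1:123456789012:redshiftqmrrulenotification",
        }
    },
)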

Create Slack resources

In this step, you create a new Slack workspace (if you don’t have one already), a new private Slack channel (only if you don’t have one or don’t want to use an existing one), and a new Slack app in the Slack workspace. For instructions, refer to Create a Slack workspace, Create a channel, and Creating an app.

For this post, we created the following resources in the Slack website and Slack desktop app:

  • A Slack workspace named RedshiftQMR*****
  • A private channel, named redshift-qmr-notification-*****-*******, in the newly created Slack workspace
  • A new Slack app in the Slack workspace, named RedshiftQMRRuleNotification (using the From Scratch option)

Add an incoming webhook and authorize Slack app

In this step, you enable and add an incoming webhook to the Slack workspace that you created. For full instructions, refer to Enable Incoming Webhooks and Create an Incoming Webhook. You also authorize your Slack app so that it can post messages to the private Slack channel.

  1. In the Slack app, under Settings in the navigation pane, choose Basic Information.
  2. Choose Incoming Webhooks.
  3. Turn on Activate Incoming Webhooks.
  4. Choose Add New Webhook to Workspace.
  5. Authorize the Slack app RedshiftQMRRuleNotification so that it can post messages to the private Slack channel redshift-qmr-notification-*****-*******.

The following screenshot shows the details of the newly added incoming webhook.

Create a second Lambda function and subscribe to the SNS topic

In this step, you create a second Lambda function that is subscribed to the SNS topic that you created earlier. For full instructions, refer to Building Lambda functions with Python and Subscribing a function to a topic.

For this post, we create a second function named redshiftqmrrulesnsinvoke, which is subscribed to the SNS topic redshiftqmrrulenotification. The second function sends a detailed QMR violation message (received from the SNS topic) to the designated Slack channel named redshift-qmr-notification-*. This function uses the incoming Slack webhook that we created earlier.

We also create an SNS subscription of the second Lambda function to the SNS topic that we created previously.
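
The subscription itself can also be created with a couple of API calls. In the sketch below, the account ID in the topic and function ARNs is a placeholder; the add_permission call lets Amazon SNS invoke the function.

import boto3

sns = boto3.client("sns")
lambda_client = boto3.client("lambda")

topic_arn = "arn:aws:sns:us-east-1:123456789012:redshiftqmrrulenotification"              # placeholder
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:redshiftqmrrulesnsinvoke"  # placeholder

# Subscribe the second Lambda function to the SNS topic.
sns.subscribe(TopicArn=topic_arn, Protocol="lambda", Endpoint=function_arn)

# Grant Amazon SNS permission to invoke the function.
lambda_client.add_permission(
    FunctionName="redshiftqmrrulesnsinvoke",
    StatementId="AllowSNSInvoke",
    Action="lambda:InvokeFunction",
    Principal="sns.amazonaws.com",
    SourceArn=topic_arn,
)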

The following is the Python 3.9 code used for the second Lambda function:

import urllib3
import json
import os

http = urllib3.PoolManager()
def lambda_handler(event, context):
    
    url = os.environ['webhook']
    channel = os.environ['channel']
    msg = {
        "channel": channel,
        "username": "WEBHOOK_USERNAME",
        "text": event['Records'][0]['Sns']['Message'],
        "icon_emoji": ""
    }
    
    encoded_msg = json.dumps(msg).encode('utf-8')
    resp = http.request('POST',url, body=encoded_msg)
    print({
        "message": event['Records'][0]['Sns']['Message'], 
        "status_code": resp.status, 
        "response": resp.data
    })

We use two environment variables for the second Lambda function:

  • channel – The Slack channel that we created
  • webhook – The Slack webhook that we created

Test the solution

To show the effect of the QMRs, we ran queries that violate the QMRs we set up.

Test 1: Returned rows

Test 1 looks for violations of the returned_rows QMR, in which the return row count is over 1,000,000 for a query that ran in the adhoc queue.

We created and loaded a table named lineitem in a schema named aquademo, which has more than 18 billion records. You can refer to the GitHub repo to create and load the table.

We ran the following query, which violated the returned_rows QMR, and the query was stopped as specified in the action set in the QMR.

select * from aquademo.lineitem limit 1000001;

The following screenshot shows the view from the Amazon Redshift client after running the query.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 2: Long-running queries

Test 2 looks for violations of the long_running QMR, in which query runtime is over 300 seconds for a user who belongs to reporting_group.

In the following code, we created a new Amazon Redshift group named reporting_group and added a new user, named reporting_user, to the group. reporting_group is assigned USAGE and SELECT privileges on all tables in the retail and aquademo schemas.

create group reporting_group;
create user reporting_user in group reporting_group password 'Test12345';
grant usage on schema retail,aquademo to group reporting_group;
grant select on all tables in schema retail,aquademo to group reporting_group;

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the long_running QMR, and the query was stopped as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session  to off;
select * from aquademo.lineitem;

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Test 3: Nested loops

Test 3 looks for violations of the nested_loop QMR, in which the nested loop join row count is over 10,000,000 for a user who belongs to reporting_group.

We set the session authorization to reporting_user so the query runs in the reporting queue. We ran the following query, which violated the nested_loop QMR, and the query logged the violation as specified in the action set in the QMR:

set session authorization reporting_user;
set enable_result_cache_for_session  to off;
select ss.*,cd.* 
from retail.store_sales ss
, retail.customer_demographics cd;

Before we ran the original query, we also checked the explain plan and noted that this nested loop will return more than 10,000,000 rows. The following screenshot shows the query explain plan.

The following screenshot shows the notification we received in our Slack channel.

Test 4: Redshift Spectrum scans

Test 4 looks for violations of the spectrum_scan QMR, in which Redshift Spectrum scans exceed 1000 MB for a query that ran in the adhoc queue.

For this example, we used store_sales data (unloaded from an Amazon Redshift table that was created by using the TPC-DS benchmark data) loaded in an Amazon S3 location. Data in Amazon S3 is non-partitioned under one prefix and has a volume around 3.9 GB. We created an external schema (qmr_spectrum_rule_test) and external table (qmr_rule_store_sales) in Redshift Spectrum.

We used the following steps to run this test with the sample data:

  1. Run an unload SQL command:
    unload ('select * from store_sales')
    to 's3://<<Your Amazon S3 Location>>/store_sales/' 
    iam_role default;

  2. Create an external schema from Redshift Spectrum:
    CREATE EXTERNAL SCHEMA if not exists qmr_spectrum_rule_test
    FROM DATA CATALOG DATABASE 'qmr_spectrum_rule_test' region 'us-east-1' 
    IAM_ROLE default
    CREATE EXTERNAL DATABASE IF NOT exists;

  3. Create an external table in Redshift Spectrum:
    create external table qmr_spectrum_rule_test.qmr_rule_store_sales
    (
    ss_sold_date_sk int4 ,            
      ss_sold_time_sk int4 ,     
      ss_item_sk int4  ,      
      ss_customer_sk int4 ,           
      ss_cdemo_sk int4 ,              
      ss_hdemo_sk int4 ,         
      ss_addr_sk int4 ,               
      ss_store_sk int4 ,           
      ss_promo_sk int4 ,           
      ss_ticket_number int8 ,        
      ss_quantity int4 ,           
      ss_wholesale_cost numeric(7,2) ,          
      ss_list_price numeric(7,2) ,              
      ss_sales_price numeric(7,2) ,
      ss_ext_discount_amt numeric(7,2) ,             
      ss_ext_sales_price numeric(7,2) ,              
      ss_ext_wholesale_cost numeric(7,2) ,           
      ss_ext_list_price numeric(7,2) ,               
      ss_ext_tax numeric(7,2) ,                 
      ss_coupon_amt numeric(7,2) , 
      ss_net_paid numeric(7,2) ,   
      ss_net_paid_inc_tax numeric(7,2) ,             
      ss_net_profit numeric(7,2)                     
    ) ROW FORMAT DELIMITED 
      FIELDS TERMINATED BY '|' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
      's3://<<Your Amazon S3 Location>>/store_sales/'
    TABLE PROPERTIES (
      'averageRecordSize'='130', 
      'classification'='csv', 
      'columnsOrdered'='true', 
      'compressionType'='none', 
      'delimiter'='|', 
      'recordCount'='11083990573', 
      'sizeKey'='1650877678933', 
      'typeOfData'='file');

  4. Run the following query:
    select * 
    FROM qmr_spectrum_rule_test.qmr_rule_store_sales 
    where ss_sold_date_sk = 2451074;

The query violated the spectrum_scan QMR, and the query was stopped as specified in the action set in the QMR.

The following screenshot shows the view from the Amazon Redshift client.

The following screenshot shows the view on the Amazon Redshift console.

The following screenshot shows the notification we received in our Slack channel.

Clean up

When you’re finished with this solution, we recommend deleting the resources you created to avoid incurring any further charges.

Conclusion

Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. In this post, we discussed how you can automate notification of misbehaving queries on Slack by using query monitoring rules. QMRs can help you maximize cluster performance and throughput when supporting mixed workloads. Use these instructions to set up your Slack channel to receive automatic notifications from your Amazon Redshift cluster for any violation of QMRs.


About the Authors

Dipankar Kushari is a Senior Specialist Solutions Architect in the Analytics team at AWS.

Harshida Patel is a Specialist Senior Solutions Architect in the Analytics team at AWS.

Write prepared data directly into JDBC-supported destinations using AWS Glue DataBrew

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/write-prepared-data-directly-into-jdbc-supported-destinations-using-aws-glue-databrew/

AWS Glue DataBrew offers over 250 pre-built transformations to automate data preparation tasks (such as filtering anomalies, standardizing formats, and correcting invalid values) that would otherwise require days or weeks writing hand-coded transformations.

You can now write cleaned and normalized data directly into JDBC-supported databases and data warehouses without having to move large amounts of data into intermediary data stores. In just a few clicks, you can configure recipe jobs to specify the following output destinations: Amazon Redshift, Snowflake, Microsoft SQL Server, MySQL, Oracle Database, and PostgreSQL.

In this post, we walk you through how to connect and transform data from an Amazon Simple Storage Service (Amazon S3) data lake and write prepared data directly into an Amazon Redshift destination on the DataBrew console.

Solution overview

The following diagram illustrates our solution architecture.

In our solution, DataBrew queries sales order data from an Amazon S3 data lake and performs the data transformation. Then the DataBrew job writes the final output to Amazon Redshift.

To implement the solution, you complete the following high-level steps:

  1. Create your datasets.
  2. Create a DataBrew project with the datasets.
  3. Build a transformation recipe in DataBrew.
  4. Run the DataBrew recipe.

Prerequisites

To complete this solution, you should have an AWS account. Make sure you have the required permissions to create the resources required as part of the solution.

For our use case, we use a mock dataset. You can download the data files from GitHub.

Complete the following prerequisite steps:

  1. On the Amazon S3 console, upload all three CSV files to an S3 bucket.
  2. Create the Amazon Redshift cluster to capture the product-wise sales data.
  3. Set up a security group for Amazon Redshift.
  4. Create a schema in Amazon Redshift if required. For this post, we use the existing public schema.

Create datasets

To create the datasets, complete the following steps:

  1. On the Datasets page of the DataBrew console, choose Connect new dataset.
  2. For Dataset name, enter a name (for example, order).
  3. Enter the S3 bucket path where you uploaded the data files as part of the prerequisites.
  4. Choose Select the entire folder.
  5. For Selected file type, select CSV.
  6. For CSV delimiter, choose Comma.
  7. For Column header values, select Treat first row as header.
  8. Choose Create dataset.

Create a project using the datasets

To create your DataBrew project, complete the following steps:

  1. On the DataBrew console, on the Projects page, choose Create project.
  2. For Project Name, enter order-proj.
  3. For Attached recipe, choose Create new recipe.

The recipe name is populated automatically.

  4. For Select a dataset, select My datasets.
  5. Select the order dataset.
  6. For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
  7. Choose Create project.

You can see a success message along with our Amazon S3 order table with 500 rows.

After the project is opened, a DataBrew interactive session is created. DataBrew retrieves sample data based on your sampling configuration selection.

Build a transformation recipe

In a DataBrew interactive session, you can cleanse and normalize your data using over 250 pre-built transformations. In this post, we use DataBrew to perform a few transforms and filter only valid orders with order amounts greater than $0.

To do this, you perform the following steps:

  1. On the Column menu, choose Delete.
  2. For Source columns, choose the columns order_id, timestamp, and transaction_date.
  3. Choose Apply.
  4. We filter the rows based on an amount value greater than $0.
  5. Choose Add to recipe to add the condition as a recipe step.
  6. To perform a custom sort based on state, on the Sort menu, choose Ascending.
  7. For Source, choose state_name.
  8. Select Sort by custom values.
  9. Specify an ordered list of state names separated by commas.
  10. Choose Apply.

The following screenshot shows the full recipe that we applied to our dataset.

Run the DataBrew recipe job on the full data

Now that we have built the recipe, we can create and run a DataBrew recipe job using the following console steps (a programmatic sketch follows the steps).

  1. On the project details page, choose Create job.
  2. For Job name, enter product-wise-sales-job.
  3. For Output to, choose JDBC.
  4. For connection name, choose Browse.
  5. Choose Add JDBC connection.
  6. For Connection name, enter a name (for example, redshift-connection).
  7. Provide details like the host, database name, and login credentials of your Amazon Redshift cluster.
  8. In the Network options section, choose the VPC, subnet, and security groups of your Amazon Redshift cluster.
  9. Choose Create connection.
  10. Provide a table prefix with schema name (for example, public.product_wise_sales).
  11. For Role name, choose the IAM role to be used with DataBrew.
  12. Choose Create and run job.
  13. Navigate to the Jobs page and wait for the product-wise-sales-job job to complete.
  14. Navigate to the Amazon Redshift cluster to confirm the output table starts with product_wise_sales_*.
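
As an alternative to the console steps above, you can create and start the same recipe job programmatically. The following boto3 sketch is a rough outline under stated assumptions: the role ARN is a placeholder, the recipe and connection names match the ones used in this post, and the DatabaseOutputs shape should be verified against the current DataBrew API documentation.

import boto3

databrew = boto3.client("databrew")

# Create a recipe job that writes directly to Amazon Redshift through a Glue connection.
# The role ARN is a placeholder; dataset, recipe, and connection names follow this post.
databrew.create_recipe_job(
    Name="product-wise-sales-job",
    RoleArn="arn:aws:iam::123456789012:role/DataBrewServiceRole",
    DatasetName="order",
    RecipeReference={"Name": "order-proj-recipe"},
    DatabaseOutputs=[
        {
            "GlueConnectionName": "redshift-connection",
            "DatabaseOptions": {"TableName": "public.product_wise_sales"},
            "DatabaseOutputMode": "NEW_TABLE",
        }
    ],
)

# Start the job run.
databrew.start_job_run(Name="product-wise-sales-job")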

Clean up

Delete the following resources that might accrue cost over time:

  • The Amazon Redshift cluster
  • The recipe job product-wise-sales-job
  • Input files stored in your S3 bucket
  • The job output stored in your S3 bucket
  • The IAM roles created as part of projects and jobs
  • The DataBrew project order-proj and its associated recipe order-proj-recipe
  • The DataBrew datasets

Conclusion

In this post, we saw how to connect to and transform data from an Amazon S3 data lake and create a DataBrew dataset. We also saw how easily we can bring data from a data lake into DataBrew, seamlessly apply transformations, and write prepared data directly into an Amazon Redshift destination.

To learn more, refer to the DataBrew documentation.


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Amit Mehrotra is a Solution Architecture leader with Amazon Web Services. He leads an organization that helps customers with their cloud journey.

Develop and test AWS Glue version 3.0 jobs locally using a Docker container

Post Syndicated from Subramanya Vajiraya original https://aws.amazon.com/blogs/big-data/develop-and-test-aws-glue-version-3-0-jobs-locally-using-a-docker-container/

AWS Glue is a fully managed serverless service that allows you to process data coming through different data sources at scale. You can use AWS Glue jobs for various use cases such as data ingestion, preprocessing, enrichment, and data integration from different data sources. AWS Glue version 3.0, the latest version of AWS Glue Spark jobs, provides a performance-optimized Apache Spark 3.1 runtime experience for batch and stream processing.

You can author AWS Glue jobs in different ways. If you prefer coding, AWS Glue allows you to write Python/Scala source code with the AWS Glue ETL library. If you prefer interactive scripting, AWS Glue interactive sessions and AWS Glue Studio notebooks help you write scripts in notebooks by inspecting and visualizing the data. If you prefer a graphical interface rather than coding, AWS Glue Studio helps you author data integration jobs visually without writing code.

For a production-ready data platform, a development process and CI/CD pipeline for AWS Glue jobs are key. We understand the strong demand for developing and testing AWS Glue jobs wherever you prefer: on a local laptop, in a Docker container on Amazon Elastic Compute Cloud (Amazon EC2), and so on. You can achieve that by using AWS Glue Docker images hosted on Docker Hub or the Amazon Elastic Container Registry (Amazon ECR) Public Gallery. The Docker images help you set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL with the AWS Glue ETL library.

This post is a continuation of the blog post "Developing AWS Glue ETL jobs locally using a container". While the earlier post introduced the pattern of developing AWS Glue ETL jobs on a Docker container using a Docker image, this post focuses on how to develop and test AWS Glue version 3.0 jobs using the same approach.

Solution overview

The following Docker images are available for AWS Glue on Docker Hub:

  • AWS Glue version 3.0: amazon/aws-glue-libs:glue_libs_3.0.0_image_01
  • AWS Glue version 2.0: amazon/aws-glue-libs:glue_libs_2.0.0_image_01

You can also obtain the images from the Amazon ECR Public Gallery:

  • AWS Glue version 3.0: public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01
  • AWS Glue version 2.0: public.ecr.aws/glue/aws-glue-libs:glue_libs_2.0.0_image_01

Note: AWS Glue Docker images are x86_64 compatible and arm64 hosts are currently not supported.

In this post, we use amazon/aws-glue-libs:glue_libs_3.0.0_image_01 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue version 3.0 Spark jobs. The image contains the following:

  • Amazon Linux
  • AWS Glue ETL Library (aws-glue-libs)
  • Apache Spark 3.1.1
  • Spark history server
  • JupyterLab
  • Livy
  • Other library dependencies (the same as those in the AWS Glue job system)

To set up your container, you pull the image from Docker Hub and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:

  • spark-submit
  • REPL shell (pyspark)
  • pytest
  • JupyterLab
  • Visual Studio Code

Prerequisites

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.

For more information about restrictions when developing AWS Glue code locally, see Local Development Restrictions.

Configure AWS credentials

To enable AWS API calls from the container, set up your AWS credentials with the following steps:

  1. Create an AWS named profile.
  2. Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
    PROFILE_NAME="profile_name"

In the following sections, we use this AWS named profile.
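
Before mounting the profile into the container, you can optionally confirm that it resolves valid credentials. The following is a small convenience check with the AWS SDK for Python (Boto3), not part of the original setup, that assumes the profile name you exported above:

import boto3

# Resolve credentials from the named profile that the container will use
# through the AWS_PROFILE environment variable.
session = boto3.Session(profile_name="profile_name")
identity = session.client("sts").get_caller_identity()
print("Using account", identity["Account"], "as", identity["Arn"])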

Pull the image from Docker Hub

If you’re running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers… before pulling the image.

Run the following command to pull the image from Docker Hub:

docker pull amazon/aws-glue-libs:glue_libs_3.0.0_image_01

Run the container

Now you can run a container using this image. You can choose any of the following methods based on your requirements.

spark-submit

You can run an AWS Glue job script by running the spark-submit command on the container.

Write your ETL script (sample.py in the example below) and save it under the /local_path_to_workspace/src/ directory using the following commands:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/src
$ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}

These variables are used in the docker run command below. The sample code (sample.py) used in the spark-submit command below is included in the appendix at the end of this post.

Run the following command to run the spark-submit command on the container to submit a new Spark application:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_spark_submit amazon/aws-glue-libs:glue_libs_3.0.0_image_01 spark-submit /home/glue_user/workspace/src/$SCRIPT_FILE_NAME
...22/01/26 09:08:55 INFO DAGScheduler: Job 0 finished: fromRDD at DynamicFrame.scala:305, took 3.639886 s
root
|-- family_name: string
|-- name: string
|-- links: array
| |-- element: struct
| | |-- note: string
| | |-- url: string
|-- gender: string
|-- image: string
|-- identifiers: array
| |-- element: struct
| | |-- scheme: string
| | |-- identifier: string
|-- other_names: array
| |-- element: struct
| | |-- lang: string
| | |-- note: string
| | |-- name: string
|-- sort_name: string
|-- images: array
| |-- element: struct
| | |-- url: string
|-- given_name: string
|-- birth_date: string
|-- id: string
|-- contact_details: array
| |-- element: struct
| | |-- type: string
| | |-- value: string
|-- death_date: string

...

REPL shell (pyspark)

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark
...
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1-amzn-0
      /_/

Using Python version 3.7.10 (default, Jun 3 2021 00:02:01)
Spark context Web UI available at http://56e99d000c99:4040
Spark context available as 'sc' (master = local[*], app id = local-1643011860812).
SparkSession available as 'spark'.
>>> 
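
From the >>> prompt, you can exercise the AWS Glue ETL library directly. As a quick example, the following sketch reads the same public us-legislators dataset used by the sample script in the appendix (Amazon S3 read permissions are required) and inspects it as a DynamicFrame:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

# The REPL already provides sc and spark; GlueContext wraps the SparkContext.
glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"]},
    format="json",
)
dyf.printSchema()
print(dyf.count())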

pytest

For unit testing, you can use pytest for AWS Glue Spark job scripts.

Run the following commands for preparation:

$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}

Run the following command to run pytest on the test suite:

$ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pytest amazon/aws-glue-libs:glue_libs_3.0.0_image_01 -c "python3 -m pytest"
starting org.apache.spark.deploy.history.HistoryServer, logging to /home/glue_user/spark/logs/spark-glue_user-org.apache.spark.deploy.history.HistoryServer-1-5168f209bd78.out
============================================================= test session starts =============================================================
platform linux -- Python 3.7.10, pytest-6.2.3, py-1.11.0, pluggy-0.13.1
rootdir: /home/glue_user/workspace
plugins: anyio-3.4.0
collected 1 item  

tests/test_sample.py . [100%]

============================================================== warnings summary ===============================================================
tests/test_sample.py::test_counts
 /home/glue_user/spark/python/pyspark/sql/context.py:79: DeprecationWarning: Deprecated in 3.0.0. Use SparkSession.builder.getOrCreate() instead.
 DeprecationWarning)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================================== 1 passed, 1 warning in 21.07s ========================================================
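
You can extend the suite with more assertions against the same DynamicFrame. For example, a hypothetical extra test appended to tests/test_sample.py (reusing the glue_context fixture and the sample module shown in the appendix) could check that expected columns are present:

def test_schema_has_expected_columns(glue_context):
    # Load the same public dataset and compare its columns against expectations.
    dyf = sample.read_json(
        glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
    )
    columns = [field.name for field in dyf.toDF().schema.fields]
    assert "name" in columns
    assert "birth_date" in columns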

JupyterLab

You can start Jupyter for interactive development and ad hoc queries on notebooks. Complete the following steps:

  1. Run the following command to start JupyterLab:
    $ JUPYTER_WORKSPACE_LOCATION=/local_path_to_workspace/jupyter_workspace/
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $JUPYTER_WORKSPACE_LOCATION:/home/glue_user/workspace/jupyter_workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 -p 8998:8998 -p 8888:8888 --name glue_jupyter_lab amazon/aws-glue-libs:glue_libs_3.0.0_image_01 /home/glue_user/jupyter/jupyter_start.sh
    ...
    [I 2022-01-24 08:19:21.368 ServerApp] Serving notebooks from local directory: /home/glue_user/workspace/jupyter_workspace
    [I 2022-01-24 08:19:21.368 ServerApp] Jupyter Server 1.13.1 is running at:
    [I 2022-01-24 08:19:21.368 ServerApp] http://faa541f8f99f:8888/lab
    [I 2022-01-24 08:19:21.368 ServerApp] or http://127.0.0.1:8888/lab
    [I 2022-01-24 08:19:21.368 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

  2. Open http://127.0.0.1:8888/lab in the web browser on your local machine to access the JupyterLab UI.
  3. Choose Glue Spark Local (PySpark) under Notebook.

Now you can start developing code in the interactive Jupyter notebook UI.
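
A typical first cell initializes the Glue and Spark contexts, after which you can read data and iterate on transformations interactively. The following is a minimal sketch (using the same public dataset as the appendix sample) that runs a quick aggregation:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read the public sample dataset and aggregate the records by gender.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"]},
    format="json",
)
dyf.toDF().groupBy("gender").count().show()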

Visual Studio Code

To set up the container with Visual Studio Code, complete the following steps:

  1. Install Visual Studio Code.
  2. Install Python.
  3. Install Visual Studio Code Remote – Containers.
  4. Open the workspace folder in Visual Studio Code.
  5. Choose Settings.
  6. Choose Workspace.
  7. Choose Open Settings (JSON).
  8. Enter the following JSON and save it:
    {
        "python.defaultInterpreterPath": "/usr/bin/python3",
        "python.analysis.extraPaths": [
            "/home/glue_user/aws-glue-libs/PyGlue.zip:/home/glue_user/spark/python/lib/py4j-0.10.9-src.zip:/home/glue_user/spark/python/",
        ]
    }

Now you’re ready to set up the container.

  1. Run the Docker container:
    $ docker run -it -v ~/.aws:/home/glue_user/.aws -v $WORKSPACE_LOCATION:/home/glue_user/workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 18080:18080 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark

  2. Start Visual Studio Code.
  3. Choose Remote Explorer in the navigation pane, and choose the container amazon/aws-glue-libs:glue_libs_3.0.0_image_01.
  4. Right-click and choose Attach to Container.
  5. If the following dialog appears, choose Got it.
  6. Open /home/glue_user/workspace/.
  7. Create an AWS Glue PySpark script and choose Run.

You should see the successful run on the AWS Glue PySpark script.

Conclusion

In this post, we learned how to get started with AWS Glue Docker images. AWS Glue Docker images help you develop and test your AWS Glue job scripts anywhere you prefer, and they are available on Docker Hub and the Amazon ECR Public Gallery. Check them out; we look forward to your feedback.

Appendix: AWS Glue job sample codes for testing

This appendix provides AWS Glue job sample code for testing purposes. You can use any of these scripts in the tutorial.

The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM). You need to grant the IAM-managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to make ListBucket and GetObject API calls for the S3 path.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if '--JOB_NAME' in sys.argv:
            params.append('JOB_NAME')
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)

        if 'JOB_NAME' in args:
            jobname = args['JOB_NAME']
        else:
            jobname = "test"
        self.job.init(jobname, args)

    def run(self):
        dyf = read_json(self.context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
        dyf.printSchema()

        self.job.commit()


def read_json(glue_context, path):
    dynamicframe = glue_context.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [path],
            'recurse': True
        },
        format='json'
    )
    return dynamicframe


if __name__ == '__main__':
    GluePythonSampleTest().run()

The following test_sample.py code is a sample for a unit test of sample.py:

import pytest
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import sys
from src import sample


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    sys.argv.append('--JOB_NAME')
    sys.argv.append('test_count')

    args = getResolvedOptions(sys.argv, ['JOB_NAME'])
    context = GlueContext(SparkContext.getOrCreate())
    job = Job(context)
    job.init(args['JOB_NAME'], args)

    yield(context)

    job.commit()


def test_counts(glue_context):
    dyf = sample.read_json(glue_context, "s3://awsglue-datasets/examples/us-legislators/all/persons.json")
    assert dyf.toDF().count() == 1961

About the Authors

Subramanya Vajiraya is a Cloud Engineer (ETL) at AWS Sydney specialized in AWS Glue. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie, a 1-year-old Corgi.

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys learning different use cases from customers and sharing knowledge about big data technologies with the wider community.