Tag Archives: AWS Big Data

New charts, formatting, and layout options in Amazon QuickSight

2020-12-16 Sapna Maheshwari

Post Syndicated from Sapna Maheshwari original https://aws.amazon.com/blogs/big-data/new-charts-formatting-and-layout-options-in-amazon-quicksight/

Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to create and deliver insights to everyone in your organization. In this post, we explore how authors of QuickSight dashboards can use some of the new chart types, layout options, and dashboard formatting controls to deliver dashboards that intuitively deliver insights to all your users, whether within QuickSight or embedded in your websites or multi-tenant apps.

This blog post explores some of the visualization and dashboard customization features offered in Amazon QuickSight with the following datasets:

Ads dataset for stacked area and funnel charts
P&L dataset for waterfall charts
Student Performance dataset for histograms and box plots
Patient Info dataset for filled/choropleth maps

Picking the right chart for your use case

In this section, we explore some of the new charts QuickSight introduced in 2020 and how these help with various use cases.

Funnel charts

Funnel charts help visualize the progressive reduction of data as it passes from one phase to another. Data in each of these phases is represented as different portions of 100% (the whole). The most common use of the funnel chart is in visualizing conversion data. For example, you can represent sales lead generation showing different stages of sales conversion from first contact to lead generation.

To build a funnel chart with our Ads dataset, complete the following steps:

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual types pane, choose the funnel chart icon.
For Group by, choose Stage.
For Value, choose Ad name.
To change default configuration, choose the gear icon.
In the Data labels section, for Metric label style, choose Value and percent of first.

The video below demonstrates these steps.

Stacked area charts

Stacked area charts are best used to visualize part-to-whole relationships, to show how each category contributes to the cumulative total. For this post, we create a stacked area chart with the Ads dataset.

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual types pane, choose the stacked area chart icon.
For X axis, choose Date (MONTH).
For Value, choose Cost (Sum).
For Color, choose Segment.
Choose the gear icon.
Under Legend, deselect Legend title.
Under Y-Axis, select Show Y axis label.
Under Data labels, select Show data labels.
Choose your desired position, font size, font color, and label pattern.

Histograms

Histograms help visualize the frequency distribution of a dataset and display numerical data by grouping data into bins of equal width. Each bin is plotted as a bar whose height corresponds to the number of data points within the bin.

For this post, we use the Student Performance dataset to create a histogram.

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual types pane, choose the histogram icon.
For Value, choose math score.

You can customize the histogram to show bins by bin count, bin width, or a custom start value. For this post, we sort by bin width.

Under Histogram, select Bin width.
For Bin width, enter 5.

Box plots

Box plot (also called box or whisker plot) is a standardized way of displaying distribution of data based on a five-number summary (minimum, first quartile (Q1), median, third quartile (Q3), and maximum). This is useful to determine if data is symmetrical, skewed, or tightly grouped. Box plots also show outliers.

For this post, we create a box plot on the Student Performance dataset.

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual types pane, choose the box plot icon.
For Group by, choose Gender.
For Value, choose writing score and reading score.
In the visual settings, under Box plot, select Show outliers and Show all data points.
Under Legend, deselect Show legend title.

Waterfall charts

Waterfall charts help you understand the cumulative effect of sequentially introduced positive or negative values. This is great to understand contributions to a whole, for example the main contributors to a monthly increase in revenue, or the breakdown of revenue vs costs.

We use the P&L dataset to create a waterfall chart.

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual selection pane, choose the waterfall chart icon.
For Category, choose Line item.
For Value¸ choose Value (Sum).
Under Legend, deselect Show legend title.
For Position, select Bottom.
Under Title, deselect Show title.

Choropleth maps

Choropleth maps use differences in shading or coloring within geographical areas or regions to indicate the value of a variable in those areas

We use the Patient Info dataset to create a choropleth map.

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual types pane, choose the funnel chart icon.
For Location, choose State.
For Color, choose Revenue (Sum).
Choose the menu options icon (…) and choose Conditional formatting.
For Column, choose Revenue.
For Fill type, select Gradient.
For Max value, choose a color (for this post, we choose blue).

The video below demonstrates these steps.

You can also control the color and shading of the geographic areas using conditional formatting as shown below.

The video below demonstrates these steps.

Additionally, you can configure the alignment of column headers and content within. You can change the vertical alignment – top, middle, bottom and also the horizontal alignment – left, center and right.

Customization and formatting options

QuickSight also supports several formatting options that allow you to streamline visualizations and convey additional information in your dashboards.

Table/pivot table formatting

Pin or unpin and add custom text totals

You can now pin totals to the top or bottom of tables and pivot tables in QuickSight. This feature helps you view the totals even while scrolling through the tables.

Go to visual setting (gear icon on the visual menu).
Under Total, select Pin totals.
For Position, choose a position (for this post, we choose Bottom).

Additionally, you can edit the text you want to show on totals and subtotals.

For Total label, enter your custom text (for this post, we enter Grand Total).

Table alignment and wrapping

You can now align horizontal (left, right, center, auto) and vertical (top, middle, bottom) alignment on column headers and cell values in a table visual. Additionally, you can apply text wrapping on table and pivot table headers so that long headers are readable versus having to scroll over the header.

These options are available under Table options.

Hide +/- buttons on pivot tables

You can now show or hide +/- buttons on pivot tables. This allows you to improve presentation of pivot tables by removing these icons and keeping the pivot table simple. This option is available under Styling.

Visual customization options

In this section, we discuss additional customization options in QuickSight.

Custom sorting

If you want to sort your charts in a custom defined order different from the default alphabetic order, you can now do so on QuickSight. For example, you can sort geographical regions in the order of East, West, Central, and South, by ranking these regions 1– 4 and then sorting on this rank field. See video below to learn how to.

You can also therefore sort using any other critical metric fields that aren’t part of the visual. Choose your field well and choose Sort options to see available sort order options.

You can also therefore sort on any other critical metric fields that aren’t part of the visual. Choose your field well and choose Sort options to see available sort order options.

The following screenshot shows your sorted visualizations.

Adding descriptive text, images, and links

You can add images or logos to your dashboard using QuickSight’s narrative component available in Enterprise Edition.

On the analysis page, choose Visualize.
Choose Add, then choose Add visual.
In the Visual types pane, choose the insights icon.
Choose Customize insight.
Remove any existing text and add your custom text.

You can also add hyperlinks to text and images. Upload your image to a secure location where the image is accessible to QuickSight and your users.

The video below demonstrates these steps.

Customizing colors and fonts

QuickSight offers easy-to-build themes that allow customization of the color palette, background and foreground colors, spacing, fonts, and more. Themes can be created by authors and shared within an account, and are also accessible via APIs for programmatic management. Themes can also be defaulted for all users in the organization using APIs.

You can also prioritize the colors that you want to use in your dashboard by prioritizing them within your theme’s color palette.

You can apply predefined themes available out of the box or create your own themes that fit your corporate branding. The following screenshots show how a dashboard looks in both dark and light themes.

For more information, see Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities.

Null value customization

You may occasionally have null values in your data and want to represent nulls with different annotations. For each of the dimensions and metrics in the dataset, you can provide custom text for null values. Go to More formatting options for any field.

The option is available under Null values.

Reference lines

You can draw reference lines based on a calculated field or a constant value. Choose the gear icon and navigate to reference lines section. In the following screenshot, the orange reference line is based off a calculation (average profit) and the black reference line is plotted on a constant value of 1,000.

You can also link reference lines to parameters via a calculated field, which allows you to create what-if scenarios within your dashboard.

Custom colors on heat and tree maps

Color gradient customization on heat and tree maps allows you to select colors for lower, intermediate, and upper limits so that the gradient is applied within these colors. You can configure this under Color in the visual settings.

Using logarithmic scale

If your metric numbers aren’t in the same order of magnitude, where some are extremely on the higher end and others on the lower end (for example, representing stock price for different entities or pandemic spread rates), you can represent them on a logarithmic scale so they’re normalized, yet relative. To use logarithmic scale, go to visual setting and under Y-Axis, for Scale, select Logarithmic.

The following screenshot shows your visualization after applying logarithmic scale.

Adjustable font size

You can now apply different font sizes on all visual content and visuals titles. In pivot tables and tables, you can see font sizes for table headers, cells, totals, and subtotals. In key performance indicators (KPIs), you can set font sizes for primary and comparison values, which allows you to keep dashboards dense and add more KPIs.

Actions

Finally, to all these charts, you can apply the following actions:

Filter actions– Select points on a chart to filter across the dashboard. QuickSight supports hierarchical filter actions that allow you to trigger one filter action from more than one chart. For more information, see Enhancing dashboard interactivity with Amazon QuickSight Actions.
URL actions – Trigger navigation from the dashboard to an external website and pass dynamic values within a URL.

Layout enhancements

QuickSight dashboards default to auto-fit mode, which makes them responsive based on screen size. However, in many situations, it’s preferable that the view you design is exactly what end-users see, whether on a laptop or a large monitor. QuickSight offers optimized layouts that allow you to pick a specific screen resolution to optimize for (such as the screen size most of your users use on a daily basis), and QuickSight automatically scales the dashboard view to render appropriately on larger or smaller screens. This doesn’t affect mobile devices—QuickSight automatically optimizes for mobile devices using a single-column layout. To adjust the scaling mode, choose Settings in the navigation pane while in dashboard authoring (analysis) mode.

If you build your dashboard for a 1024 px screen, for example, QuickSight scales that view to a larger or smaller screen to ensure that all users see the same content (mobile devices continue to fall back to a single-column, mobile-specific layout to ensure usability). Opting for the optimized mode also makes sure that your email reports look exactly like the dashboard that your viewers interact with.

On-sheet filter controls

You can now add filters to your dashboard directly without having to create parameters. Choose the field that you need to filter on choose Add filter for this field. Choose the newly added filter and choose Add to sheet.

If you need to pin it to the controls section, choose the filter and choose Pin to top.

The video below demonstrates these steps.

Thus, QuickSight allows you to choose from any of these control types to add to dashboards – single-select drop-downs, multi-select drop-downs, date and time picker, single-sided slider, single-line text box, time range picker, relative date selection, and numeric range slider. Learn more about on-sheet controls from the blog post here.

Other launches in 2020

While this blog covers all key charting and visualization launches in 2020, you can take a look at all new features enabled across other areas within QuickSight from this blog post here.

Conclusion

With these new QuickSight feature releases, you can now choose the chart type that is best suited to represent your data. You can provide richer dashboards for your readers by using the new formatting table options, dynamic titles, and reference lines. For more information about authoring dashboards in QuickSight, watch the virtual workshop Build Advanced Analytics and Dashboards with Amazon QuickSight and consider subscribing to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs.

About the Author

Sapna Maheshwari is a Specialist Solutions Architect for Amazon QuickSight. She is passionate about telling stories with data. In her previous roles at American Express and Early Warning services , she managed and led several projects in the data and analytics space.She enjoys helping customers unearth actionable insights from their data.

Boosting your data lake insights using the Amazon Athena Query Federation SDK

2020-12-15 Adir Sharabi

Post Syndicated from Adir Sharabi original https://aws.amazon.com/blogs/big-data/boosting-your-data-lake-insights-using-the-amazon-athena-query-federation-sdk/

Today’s modern applications use multiple purpose-built database engines, including relational, key-value, document, and in-memory databases. This purpose-built approach improves the way applications use data by providing better performance and reducing cost. However, the approach raises some challenges for data teams that need to provide a holistic view on top of these database engines, and especially when they need to merge the data with datasets in the organization’s data lake.

In this post, we show how to use the Amazon Athena Query Federation SDK to easily enrich your data in Amazon Simple Storage Service (Amazon S3) with data from external datastores, apply complex transformations, and get predictive insights by inferencing machine learning (ML) models.

We start by running a query to enrich an S3 backed table that holds features extracted from breast cancer images with fictional patient personal information stored in Amazon DynamoDB. We then use the Athena UDF functionality to decrypt sensitive information stored in the table. Next, we select these features and use Athena integration with Amazon SageMaker to pass them to a linear learner model to predict whether breast cancer is present. Lastly, we show an Amazon QuickSight dashboard to visualize the results.

For this post, we use the following resources:

A dataset of features computed from a digitized image of a fine needle aspirate of a breast mass. Such features include radius, texture, perimeter, area, and smoothness. The dataset is stored as a CSV file in Amazon S3.
Patient’s personal information (such as age range, email, and country) stored in DynamoDB.
A linear learner model deployed into a SageMaker endpoint to predict breast cancer. For more information, see Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda.

Prerequisites

Athena Query Federation is now generally available in US East (Ohio), US East (N. Virginia), and US West (Oregon). To use this feature, upgrade your engine version to Athena engine version 2 in your workgroup settings. To enable this feature in other Regions, you need to create an Athena workgroup named AmazonAthenaPreviewFunctionality and join that workgroup. Workgroups allows us to:

Isolate users, teams, applications, or workloads into groups
Enforce costs constraints per query or workgroup
Track query-related metrics for all workgroup queries in Amazon CloudWatch

For more information, see Managing Workgroups.

Creating a DynamoDB table and SageMaker endpoint

For the post, we create a new DynamoDB table with synthetic patient data. For the inference requests, we create a new SageMaker endpoint.

Deploy the following AWS CloudFormation stack in us-east-1:

The stack performs the following:

Creates a DynamoDB table
Creates and triggers an AWS Lambda function to load the table with data
Creates a SageMaker endpoint for inference requests

It can take up to 10 minutes for the CloudFormation stack to create the resources.

After the resource creation is complete, navigate to the AWS CloudFormation console.
Choose the boost-your-datalake-insights stack
On the Outputs tab, copy the values for DynamoTableName and SMEndpointName. You use these values later in the post.

Downloading the dataset

Under new or existing bucket create a folder named breast_cancer_features. Download the breast_cancer_raw_data.csv file and upload it to breast_cancer_features folder in your bucket. This file contains the Breast Cancer Wisconsin (Diagnostic) Data Set, available at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.

The file holds 570 records with 30 columns of values computed for each cell nucleus. Such features include radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The following screenshot displays a few records from the file.

Creating the features table in Athena

To query this dataset from Athena, you need to create an external table that points to that file. There are several ways to create table in Athena. For this post, we use a Hive data definition language (DDL) statement.

On the Athena console, if using a non-GA Region, switch to the AmazonAthenaPreviewFunctionality workgroup created earlier.

Enter the following statement:

CREATE EXTERNAL TABLE breast_cancer_features(
id string, radius_mean double, texture_mean double, perimeter_mean double, area_mean double, smoothness_mean double, compactness_mean double, concavity_mean double, concave_points_mean double, symmetry_mean double, fractal_dimension_mean double, radius_se double, texture_se double, perimeter_se double, area_se double, smoothness_se double, compactness_se double, concavity_se double, concave_points_se double, symmetry_se double, fractal_dimension_se double, radius_worst double, texture_worst double, perimeter_worst double, area_worst double, smoothness_worst double, compactness_worst double, concavity_worst double, concave_points_worst double, symmetry_worst double, fractal_dimension_worst double)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<bucket>/breast_cancer_features/'
TBLPROPERTIES (
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',', 
  'skip.header.line.count'='1')

Choose Run Query or press CTRL + Enter.

After the table is created successfully, we can run queries such as the following, to profile your data and better understand how it’s distributed:

SELECT variance(radius_mean) AS variance,
        stddev(radius_mean) AS stddev ,
        min(radius_mean) AS min,
         max(radius_mean) AS max,
         avg(radius_mean) AS avg
FROM "default"."breast_cancer_features"

The following screenshot shows our results.

Joining the features table with data stored in DynamoDB

In this step, we enrich the features table by joining it with the personal information stored in DynamoDB. With Athena Query Federation, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

The following screenshot displays a few records from the DynamoDB table.

To join the features table stored in Amazon S3 with the personal data stored in DynamoDB, we need to create a data source connector that runs on Lambda to run the federated query. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of the query engine in Athena.

Prebuilt Athena data source connectors exist for data sources like Amazon CloudWatch Logs, DynamoDB, Amazon DocumentDB (with MongoDB capability), and Amazon Relational Database Service (Amazon RDS), and JDBC-compliant relational data sources such as MySQL and PostgreSQL. You can also use the Athena Query Federation SDK to write custom connectors.

Preparing to create federated queries is a two-part process: deploying a Lambda function data source connector, and connecting the Lambda function to a data source.

Deploying a Lambda function data source connector

To deploy the data source connector, complete the following steps

On the Athena console, choose Data sources.
Choose Connect data source.
Choose Query a data source.
For Choose a data source, choose the data source that you want to query with Athena, such as Amazon DynamoDB.
Choose Next.
For Lambda function, choose Configure new AWS Lambda function.

The function page for the connector that you chose opens on the Lambda console. The page includes detailed information about the connector.

Under Application settings, specify the required information. At a minimum, this includes:
- AthenaCatalogName – A name for the Lambda function that indicates the data source that it targets, such as athena_dynamodb_connector.
- SpillBucket – An S3 bucket in your account to store data that exceeds Lambda function response size limits.

For more information about the remaining configurable options, see Available Connectors on GitHub.

Select I acknowledge that this app creates custom IAM roles.
Choose Deploy.

The Resources section on the Lambda console shows the deployment status of the connector and informs you when the deployment is complete.

Connecting to a data source

After you deploy the data source connector to your account, you can connect it to a data source from Athena.

On the Athena console, choose Connect data source.
Choose Query a data source.
Choose the data source for the connector that you just deployed, such as Amazon DynamoDB. If you used the Athena Query Federation SDK to create your own connector and have deployed it to your account, choose All other data sources.
Choose Next.
For Choose Lambda function, choose the function that you named previously (athena_dynamodb_connector).
For Catalog name, enter a unique name to use for the data source in your SQL queries, such as dynamo_db.

The name can be up to 127 characters and must be unique within your account. It can’t be changed after creation. Valid characters are a–z, A–Z, 0–9, _, @, and -. The names awsdatacatalog, hive, jmx, and system are reserved by Athena and can’t be used for custom catalog names.

Choose Connect.

The Data sources page shows your connector in the list of catalog names. You can now use the connector in your queries.

Querying the external data source

To query your external data source, complete the following steps:

In the Athena Query editor, for Data source, choose dynamo_db.

The DynamoDB table appears in the Tables list.

Choose … (three dots) and choose Preview table.

Now we can run queries like the following to enrich and get more insights on our data by joining it with additional data that was stored in DynamoDB:

SELECT age, count(*)
FROM breast_cancer_features d
JOIN "dynamo_db"."default"."<DynamoTableName>" p
ON d.id = p.patient_id
GROUP BY age

The following screenshot shows our results.

Using custom UDFs to decrypt the patient’s email

When looking at our patient’s personal information stored in the DynamoDB table, we can see that the patient’s email is encrypted. We want to allow our users to get the decrypted email while querying with Athena without needing to run custom ETL, which requires us to store it decrypted.

User Defined Functions (UDFs) in Athena allow you to create custom functions to process records or groups of records. A UDF accepts parameters, performs the work, and returns a result. However, UDFs in Athena have the following limitations:

Scalar UDFs only – Athena only supports scalar UDFs, which process one row at a time and return a single column value. Athena passes a batch of rows to the UDF each time it invokes a Lambda function.
Java runtime only – As of this writing, Athena UDFs support only the Java 8 runtime for Lambda.

To use this feature in preview, you must create an Athena workgroup named AmazonAthenaPreviewFunctionality and join that workgroup, as specified in prerequisites section. For more information, see Querying with User Defined Functions.

Deploying a custom UDF

In this post, we use the decrypt method from the example UDF we published. The same encryption key that we used for the encryption should also be used for the decryption. We store that string in AWS Secrets Manager and use the Athena Query Federation SDKs to retrieve the stored key when the function is called.

On the Secrets Manager console, choose Store a new secret.
For Select secret type, select Other type of secrets.
Choose Plaintext.
Remove all the JSON brackets and enter a base64 encoded string as data key, such as AQIDBAUGBwgJAAECAwQFBg==
For Select the encryption key, choose DefaultEncryptionKey.
Choose Next.

Enter athena_encrypt_udf_key as the secret name.
Choose Next.
Chose Next again.
Chose Store.

Next, we deploy the Lambda function that runs the UDF.

On the AWS Serverless Application Repository console, in the navigation pane, choose Available applications.
Select Show apps that create custom IAM roles or resource policies.
In the search box, enter AthenaUserDefinedFunctions.
Choose the application from the result pane.

The Lambda function’s Application details page opens on the Lambda console.

For SecretNameOrPrefix, enter the name or prefix of a set of names within Secrets Manager that this function should have access to, such as athena_encrypt_udf_key* (make sure to include an asterisk at the end).
For LambdaFunctionName, enter the name of the Lambda function that runs your UDFs, such as athena_udf.
Select I acknowledge that this app creates custom IAM roles.
Choose Deploy.

The Resources section of the Lambda console shows the deployment status of the connector and informs you when the deployment is complete.

Querying with UDFs

The USING FUNCTION clause specifies a UDF or multiple UDFs that can be referenced by a subsequent SELECT statement in the query. You need the method name for the UDF and the name of the Lambda function that hosts the UDF.

In our example, the decrypt method gets two parameters: encrypted_col and secretName. We define the function and run the following query:

USING FUNCTION decrypt(encrypted_col VARCHAR, secretName VARCHAR)
RETURNS VARCHAR
TYPE LAMBDA_INVOKE WITH (lambda_name ='athena_udf')

SELECT encrypted_email, 
    decrypt(encrypted_email,'athena_encrypt_udf_key') AS decrypted_email
FROM "dynamo_db"."default"."<DynamoTableName>"

The following screenshot shows the query results.

Using Athena ML to predict breast cancer

To get predictive insights from our data using ML models, we need to write a code to enable inference against a deployed model. Data analysts are often skilled in using SQL, but may not have expertise in programming. ML with Athena bridges this gap and lets you run inference on models deployed on SageMaker by writing SQL statements in Athena. This feature simplifies access for data analysis to ML models such as anomaly detection, customer cohort analysis, and sales predictions, and improves productivity. This eliminates the need to use complex programming methods or offload data and jobs orchestration to run inference.

SageMaker is a fully managed ML service, where data scientists and developers can quickly and easily build and train ML models. In addition, SageMaker allows you to deploy your model to get predictions in one of two ways:

Hosting services – Deploys the model behind an HTTPS endpoint, which allows synchronous inference from your application. For more information, see Deploy a model on SageMaker Hosting Services.
Batch transform – Provides asynchronous predictions for an entire dataset located on Amazon S3. For more information, see Get Inferences for an Entire Dataset with Batch Transform.

Running a SQL statement with SageMaker inference to predict breast cancer

To perform inference from Athena, you need to train the model based on historical labeled data and deploy it into the SageMaker hosting service. In this post, we use a linear learner model to predict breast cancer out of features we used before. The model was already deployed as part of the CloudFormation stack as you can see in the screenshot below from the SageMaker console.

To use ML with Athena, you define a function with the USING FUNCTION clause. The function points to the SageMaker model endpoint that you want to use and specifies the variable names and data types to pass to the model. Subsequent clauses in the query reference the function to pass values to the model. The model runs inference based on the values that the query passes and then returns inference results.

In our example, the model endpoint is SMEndpoint-odPIqmf9LjUh and the variables are the feature columns from the breast_cancer_features table. We define the function and run the following query:

USING FUNCTION predict_breast_cancer(radius_mean double, texture_mean double, perimeter_mean double, area_mean double, smoothness_mean double, compactness_mean double, concavity_mean double, concave_points_mean double, symmetry_mean double, fractal_dimension_mean double, radius_se double, texture_se double, perimeter_se double, area_se double, smoothness_se double, compactness_se double, concavity_se double, concave_points_se double, symmetry_se double, fractal_dimension_se double, radius_worst double, texture_worst double, perimeter_worst double, area_worst double, smoothness_worst double, compactness_worst double, concavity_worst double, concave_points_worst double, symmetry_worst double, fractal_dimension_worst double) 
RETURNS DOUBLE 
TYPE SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = '<SMEndpointName>')

SELECT id, prediction, CASE WHEN round(prediction)=1 THEN 'B' ELSE 'M' END AS diagnosis
FROM( SELECT id, predict_breast_cancer(radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave_points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave_points_worst, symmetry_worst, fractal_dimension_worst) AS prediction FROM breast_cancer_features)a

In the following query results, you can see the prediction column returned by the model. The diagnosis column that derives from it is either B for benign or M for malignant.

Visualizing the data using QuickSight

Because many decision-makers aren’t familiar with SQL syntax, they want to consume the data in more graphic way, such as through charts and dashboards. Visualizing the data is also required by data scientists to identify trends and anomalies, and understand how our data behaves.

Before we visualize our data with QuickSight, in order to get the best performances, we create a new dataset in Amazon S3 that holds all the patient’s personal information joined with the breast cancer diagnostic we got from the ML model inference. We use Athena’s CREATE TABLE AS SELECT (CTAS) statement to create the table and populate it with the joint data:

CREATE TABLE breast_cancer_final
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
external_location = 's3://<bucket>/breast_cancer_final/')
AS
USING FUNCTION predict_breast_cancer(radius_mean double, texture_mean double, perimeter_mean double, area_mean double, smoothness_mean double, compactness_mean double, concavity_mean double, concave_points_mean double, symmetry_mean double, fractal_dimension_mean double, radius_se double, texture_se double, perimeter_se double, area_se double, smoothness_se double, compactness_se double, concavity_se double, concave_points_se double, symmetry_se double, fractal_dimension_se double, radius_worst double, texture_worst double, perimeter_worst double, area_worst double, smoothness_worst double, compactness_worst double, concavity_worst double, concave_points_worst double, symmetry_worst double, fractal_dimension_worst double) 
RETURNS DOUBLE 
TYPE SAGEMAKER_INVOKE_ENDPOINT WITH (sagemaker_endpoint = '<SMEndpointName>')

SELECT *, CASE WHEN round(prediction)=1 THEN 'B' ELSE 'M' END AS diagnosis
FROM( SELECT id, p.*, predict_breast_cancer(radius_mean, texture_mean, perimeter_mean, area_mean, smoothness_mean, compactness_mean, concavity_mean, concave_points_mean, symmetry_mean, fractal_dimension_mean, radius_se, texture_se, perimeter_se, area_se, smoothness_se, compactness_se, concavity_se, concave_points_se, symmetry_se, fractal_dimension_se, radius_worst, texture_worst, perimeter_worst, area_worst, smoothness_worst, compactness_worst, concavity_worst, concave_points_worst, symmetry_worst, fractal_dimension_worst) AS prediction FROM breast_cancer_features d JOIN "dynamo_db"."default"."<DynamoTableName>" p
ON d.id = p.patient_id)a

When the query successfully finishes, we can start building our dashboards using QuickSight.

First, we create a dataset in QuickSight for our table in Athena. For instructions, see Creating a Dataset Using Amazon Athena Data.

Next, we create QuickSight visuals.

The following dashboard shows the distribution of age range using a pie visual, a patient count by diagnostic per week, the top five countries, and patient distribution over time with forecast enabled.

Clean up

Now to the final step, cleaning up the resources.

To avoid unnecessary charges on your AWS account, do the following:

Destroy all of the resources created by the CloudFormation stack in create a DynamoDB table and SageMaker endpoint set up by deleting the stack after you’re done experimenting with it. You can follow the steps here to delete the stack.
You have to manually delete the S3 bucket you created with the data uploaded and generated with Athena.

Conclusions

This post demonstrated how the Athena Query Federation SDK allows you to implement serverless ETL to get more out of your data in your Amazon S3 data lake. We showed how simple Athena SQL syntax allows you to enrich your data with an external datastore, perform business logic using custom UDFs, and get insights by running ML inference. We also visualized our enriched dataset by creating a dashboard in QuickSight.

If you have feedback about this post, please share it in the comments. If you have questions about implementing the solution used in this post, comment or open a thread on the Developer Tools forum.

About the Authors

Adir Sharabi is a Solutions Architect with Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud. He is also passionate about Data and helping customers to get the most out of it.

Eitan Sela is a Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Eitan also helps customers build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.

Announcing Amazon Redshift federated querying to Amazon Aurora MySQL and Amazon RDS for MySQL

2020-12-15 BP Yau

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/announcing-amazon-redshift-federated-querying-to-amazon-aurora-mysql-and-amazon-rds-for-mysql/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in April 2020, we announced general availability for federated querying to Amazon Aurora PostgreSQL and Amazon Relational Database Service (Amazon RDS) for PostgreSQL to enable you to query data across your operational databases, your data warehouse, and your data lake to gain faster and deeper insights not possible otherwise.

Today, we’re launching a new feature of Amazon Redshift federated query to Amazon Aurora MySQL and Amazon RDS for MySQL to help you expand your operational databases in the MySQL family. With this lake house architecture expansion to support more operational data stores, you can query and combine data more easily in real time and store data in open file formats in your Amazon Simple Storage Service (Amazon S3) data lake. Your data can then be more available to other analytics and machine learning (ML) tools, rather than siloed in disparate data stores.

In this post, we share information about how to get started with this new federated query feature to MySQL.

Prerequisites

To try this new feature, create a new Amazon Redshift cluster in a sql_preview maintenance track and Aurora MySQL instance and load sample TPC data into both data stores. To make sure both Aurora MySQL DB instances can accept connections from the Amazon Redshift cluster, you should make sure that both your Amazon Redshift cluster and Aurora MySQL instances are in the same Amazon Virtual Private Cloud (Amazon VPC) and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for the Aurora MySQL DB instance.

If your Amazon Redshift cluster and Aurora MySQL instances are in the different VPC, you can set up VPC peering or other networking to allow Amazon Redshift to make connections to your Aurora MySQL instances. For more information about VPC networking, see Working with a DB instance in a VPC.

Configuring AWS Secrets Manager for remote database credentials

Amazon Redshift needs database credentials to issue a federated query to a MySQL database. AWS Secrets Manager provides a centralized service to manage secrets and can be used to store your MySQL database credentials. Because Amazon Redshift retrieves and uses these credentials, they are transient, not stored in any generated code, and discarded after the query runs.

Storing credentials in Secrets Manager takes only a few minutes. To store a new secret, complete the following steps:

On the Secrets Manager console, choose Secrets.
Choose Store a new secret.
For Select secret type, select Credentials for RDS database.
For User name, enter a name.
For Password, enter a password.
For Select the encryption key, choose DefaultEncryptionkey.
For Select which RDS database this secret will access, choose your database.

Storing credentials in Secrets Manager takes only a few minutes.

Optionally, copy programmatic code for accessing your secret using your preferred programming languages (which is not needed for this post).
Choose Next.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console. The secret ARN is needed in the subsequent step.

After you create the secret, you can retrieve the secret ARN by choosing the secret on the Secrets Manager console.

Setting up IAM role

You can now pull everything together by embedding the secret ARN into an AWS Identity and Access Management (IAM) policy, naming the policy, and attaching it to an IAM role. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "<SecretARN>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Finally, attach the same IAM role to your Amazon Redshift cluster.

On the Amazon Redshift console, choose Clusters.
Choose your cluster.
On the Actions drop-down menu, choose Manage IAM roles.

Finally, attach the same IAM role to your Amazon Redshift cluster.

Choose and add the IAM role you just created.

Setting up external schema

The final step is to create an external schema to connect to your Aurora MySQL instance. The following example code creates the external schema statement that you need to run on your Amazon Redshift cluster to complete this step:

CREATE EXTERNAL SCHEMA IF NOT EXISTS mysqlfq 
FROM MYSQL 
DATABASE 'tpc' 
URI '<AuroraClusterEndpoint>' 
PORT 3306 
IAM_ROLE '<IAMRole>' 
SECRET_ARN '<SecretARN>'

Use the following parameters:

URI – Aurora MySQL cluster endpoint
IAM_Role – IAM role created from the previous step
Secret_ARN – Secret ARN

After you set up the external schema, you’re ready to run some queries to test different use cases.

Querying live operational data

You can now query real-time operational data in your Aurora MySQL instance from Amazon Redshift. Note that isolation level is read committed for MySQL. See the following code:

dev=# select top 10 ws_order_number from mysqlfq.web_sales;
 ws_order_number 
-----------------
        93628990
       157020207
         4338647
        41395871
        58468186
       171095867
        12514566
        74946143
         3418243
        67054239
(10 rows)

Querying mysqlfq.web_sales in Amazon Redshift routes the request to MySQL tpc database and web_sales table. If you examine the query plan, you can see the query runs at the MySQL instance as shown by the step Remote MySQL Seq Scan:

dev=# explain select top 10 ws_order_number from mysqlfq.web_sales;
                                          QUERY PLAN                                           
-----------------------------------------------------------------------------------------------
 XN Limit  (cost=0.00..0.20 rows=10 width=8)
   ->  XN MySQL Query Scan web_sales  (cost=0.00..6869.28 rows=343464 width=8)
         ->  Remote MySQL Seq Scan mysqlfq.web_sales  (cost=0.00..3434.64 rows=343464 width=8)
(3 rows)

Simplifying ELT and ETL

You can also extract operational data directly from your Aurora MySQL instance and load it into Amazon Redshift. See the following code:

dev=# create table staging_customer as select c_customer_id from mysqlfq.customer where c_customer_id not in (select c_customer_id from customer);
SELECT
dev=# select count(*) from staging_customer;
 count  
--------
 350000
(1 row)

The preceding code uses CTAS to create and load incremental data from your operational MySQL instance into a staging table in Amazon Redshift. You can then perform transformation and merge operations from the staging table to the target table. For more information, see Updating and inserting new data.

Combining operational data with data from your data warehouse and data lake

You can combine live operational data from your Aurora MySQL instance with data from your Amazon Redshift data warehouse and S3 data lake by creating a late binding view.

To access your S3 data lake historical data via Amazon Redshift Spectrum, create an external table:

create external schema mysqlspectrum
from data catalog
database 'spectrumdb'
iam_role '<IAMRole>'
create external database if not exists;
 
create external table mysqlspectrum.customer 
stored as parquet 
location 's3://<yourS3bucket>/customer/'
as select * from customer where c_customer_sk <= 100000;

You can then run queries on the view to gain insight on data across the three sources:

drop view vwCustomer;
create view vwCustomer as
select c_customer_sk, 'redshift' as source from public.customer where c_customer_sk > 100000
union all
select c_customer_sk, 'mysql' as source from mysqlfq.customer
union all
select c_customer_sk, 's3' as source from mysqlspectrum.customer
with no schema binding;

select * from vwCustomer where c_customer_sk in (1, 149712,29033279);

You should the following three records as output:

dev=# select * from vwCustomer where c_customer_sk in (1, 149712,29033279);
 c_customer_sk |  source  
---------------+----------
      29033279 | mysql
             1 | s3
        149712 | redshift
(3 rows)

If you examine the query plan, you can see that the predicates are pushed down to your MySQL instance to run:

dev=# explain select * from vwCustomer where c_customer_sk in (1,149712,29033279);
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 XN Subquery Scan vwcustomer  (cost=0.00..48398.40 rows=6988 width=36)
   ->  XN Append  (cost=0.00..48328.52 rows=6988 width=4)
         ->  XN Subquery Scan "*SELECT* 1"  (cost=0.00..40000.03 rows=3 width=4)
               ->  XN Seq Scan on customer  (cost=0.00..40000.00 rows=3 width=4)
                     Filter: (((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279)) AND (c_customer_sk > 100000))
         ->  XN Subquery Scan "*SELECT* 2"  (cost=0.00..6548.63 rows=5492 width=4)
               ->  XN MySQL Query Scan customer  (cost=0.00..6493.71 rows=5492 width=4)
                     ->  Remote MySQL Seq Scan mysqlfq.customer  (cost=0.00..6438.79 rows=5492 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
         ->  XN Subquery Scan "*SELECT* 3"  (cost=0.00..1779.86 rows=1493 width=4)
               ->  XN S3 Query Scan customer  (cost=0.00..1764.93 rows=1493 width=4)
                     ->  S3 Seq Scan mysqlspectrum.customer location:"s3://<yourS3bucket>/customer" format:PARQUET  (cost=0.00..1750.00 rows=1493 width=4)
                           Filter: ((c_customer_sk = 1) OR (c_customer_sk = 149712) OR (c_customer_sk = 29033279))
(13 rows)

Available Now

Amazon Redshift federated querying to Aurora MySQL and Amazon RDS for MySQL is now available for public preview with Amazon Redshift release version 1.0.21591 or later. Refer to the AWS Region Table for Amazon Redshift availability and to check the version of your clusters.

About the Authors

BP Yau is an Analytics Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next-generation big data analytics platform using AWS technologies.

Zhouyi Yang is a Software Development Engineer for the Amazon Redshift Query Processing team. He’s passionate about gaining new knowledge about large databases and has worked on SQL language features such as federated query and IAM role privilege control. In his spare time, he enjoys swimming, tennis, and reading.

Entong Shen is a Senior Software Development Engineer for Amazon Redshift. He has been working on MPP databases for over 8 years and has focused on query optimization, statistics, and SQL language features such as stored procedures and federated query. In his spare time, he enjoys listening to music of all genres and working in his succulent garden.

Building high-quality benchmark tests for Amazon Redshift using SQLWorkbench and psql

2020-12-15 Asser Moustafa

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/building-high-quality-benchmark-tests-for-amazon-redshift-using-sqlworkbench-and-psql/

In the introductory post of this series, we discussed benchmarking benefits and best practices common across different open-source benchmarking tools. In this post, we discuss benchmarking Amazon Redshift with the SQLWorkbench and psql open-source tools. Let’s first start with a quick review of the introductory installment.

When you use Amazon Redshift to scale compute and storage independently, a need arises to profile the compute requirements of various production workloads so that your Amazon Redshift cluster configuration reflects an appropriately balanced compute layer. You also need an approach to scale up with eyes wide open, and benchmarking different Amazon Redshift cluster configurations against various production workloads can help you appropriately accommodate workload expansion. In addition, you may also use benchmark tests to proactively monitor a production cluster’s performance in real time.

For prospective Amazon Redshift customers, benchmarking Amazon Redshift is often one of the main components of their evaluation and a key source of insight into the price-to-performance ratio of different Amazon Redshift configurations.

Open-source tools, with their cost-efficiency and vendor neutrality, are often the preferred choice for profiling production workloads and benchmark tests. However, best practices for using these tools are scarce, possibly resulting in flawed compute profiles, flawed benchmark results, customer frustration, and bloated timelines.

One final point before we get started: there is a lot that could be said about benchmarking—more than can be accommodated in a single post. Analytics Specialists Solutions Architects such as myself frequently and happily engage with current and prospective customers to help you evaluate your benchmarking strategy and approach at no charge. I highly recommend you take advantage of that benefit by reaching out to your AWS account Solutions Architect.

SQLWorkbench

SQLWorkbench, also referred to as SQLWorkbench/J, is an open-source SQL query tool that you can freely download as a .zip file. It’s written in Java so it runs on Windows, Linux/UNIX, and macOS, and naturally requires a supported Java runtime environment (JRE). SQLWorkbench also requires a JDBC driver for the database (to download the latest Amazon Redshift JDBC driver, see Configuring a JDBC driver version 1.0 connection).

SQLWorkbench can run in GUI or console mode. I discuss both in this section, but in my experience, customers typically default to the GUI mode, so we explore that version first. Also, I have found that customers that use SQLWorkbench often use it in a Windows environment (something to keep in mind if operating system has a determination on which open-source tool you use).

Typically, you stand up a Windows EC2 instance to serve as your benchmark host, and install SQLWorkbench on that machine. When you have SQLWorkbench running, setting up a connection to your Amazon Redshift cluster is quite easy. For this post, I assume you’re familiar with the basics of JDBC connections. The following screenshot shows what the SQLWorkbench connection dialog box might look like when populated with connection information.

After establishing a successful connection to your Amazon Redshift cluster, a query tab opens, in which you can write and run SQL queries similar to that shown in the following screenshot.

For benchmark tests, it’s highly recommended to set the maxrows field to a relatively low number to avoid noise from long transmission times of large result sets.

Unlike the LIMIT clause in a SQL SELECT statement, which can alter (short-circuit) Amazon Redshift query processing, setting the maxrows field (whether to a value as low as 1 or something much higher) has no impact on query processing in Amazon Redshift; maxrows only impacts SQLWorkbench’s rendering workload and overhead. You can easily verify this by running the same query multiple times with different maxrows settings and observing that the number of rows returned for each query on the Amazon Redshift console query history page doesn’t change. Although the resulting query times should still be considered as query runtimes, they certainly help you get closer to a query’s execution time. Setting the maxrows field to a relatively low number also reduces the risk of SQLWorkbench running into an out-of-memory error from very large result sets.

This straightforward GUI interface is appealing because it has a minimal learning curve and quickly enables you to start submitting benchmark tests against your Amazon Redshift cluster. SQLWorkbench is a very useful tool, and it may be a good fit for informal or simple benchmark tests that deal with a handful of benchmark queries, relatively small tables (such as under 50 million rows in a fact table), and are focused more on determining general directionality of query runtimes (for example, cluster A was faster than cluster B at running business query 123), rather than capturing accurate query runtimes. The GUI interface can also be helpful for quickly and easily tweaking test queries to be more or less intense, or to correct SQL syntax if the query originated from a different platform.

However, for more formal and complex benchmark tests that deal with large tables and must capture accurate query runtimes, SQLWorkbench’s straightforward GUI interface faces a scalability challenge: inputting potentially hundreds or thousands of benchmark queries, running them sequentially or simultaneously, and capturing their runtimes in a practical manner can prove to be a huge challenge.

In addition, SQLWorkBench’s rendering and processing times for query result sets are added to a query’s runtime, and so even moderately sized query result sets can lead to potentially significant noise in query runtimes. For example, I recently observed a customer reduce their query runtimes by several orders of magnitude by switching to a command line tool while keeping all other aspects of their benchmark tests and environment constant. Some of the queries were straightforward filter queries with no joins, returning 400,000 rows from a 2 billion-row fact table with approximately 30 mostly integer columns.

Using console mode

One way to minimize the scale problem and rendering noise is to switch to SQLWorkbench console mode (the command line interface), which comes bundled with the GUI version of SQLWorkbench in the same downloadable .zip file.

In this section, we show one way to enter console mode from the Windows command line prompt (note the -showTiming=true flag that enables query execution times print on the screen) and connect to an Amazon Redshift cluster.

The following code starts SQLWorkbench in console mode:

c:\ sqlwbconsole64.exe -showTiming=true

When you’re in console mode, use the following command to connect to an Amazon Redshift cluster:

SQL> WbConnect -username=<Redshift User> -password=<Redshift User Password> -url=<fully qualified Redshift JDBC URL with port and database> -driver=<Redshift JDBC driver class name>

For example:

SQL> WbConnect -username=demouser -password=******* -url=jdbc:redshift://demo-poc-redshift-cluster.xxxxxx.us-west-2.redshift.amazonaws.com:8192/dev -driver=com.amazon.redshift.jdbc.Driver

The following screenshot shows our output.

Again, it’s recommended to set the maximum rows for the results sets to a relatively low number, using the following command:

SQL> set maxrows <number>;

Although console mode may have a slightly higher learning curve, it can significantly reduce potential rendering noise in a query’s runtime. In addition, SQLWorkbench’s console mode lends itself to scripting, which opens the door to many more sophisticated benchmarking scenarios, particularly when simulating concurrent users and capturing sophisticated metrics.

Comparing performance of SQLWorkbench modes

Let’s use an example use case to demonstrate the potential performance differences of both modes of SQLWorkbench. Although Example Corp is a hypothetical company, the use case is quite typical and realistic, and the benchmark results presented are based on actual customer experiences.

Example Corp has onboarded terabytes of data, over 100 ETL jobs, and thousands of business users to our Amazon Redshift deployment over the past quarter. Data architects and engineers have observed the Amazon Redshift cluster’s average CPU utilization steadily increase, and now wish to scale up the cluster before onboarding additional data, ETL jobs, and users waiting in the project pipeline.

To determine the optimal cluster size, we perform a few simple benchmark tests on different cluster configurations. We first identify five or so sufficiently complex production queries for benchmarking clusters of different sizes and instance types. We decide query runtime is a sufficient measure of the optimal cluster size, because we’re mainly interested in directional guidance (for example, query runtimes improved significantly with 1.5x cluster size, but only marginally with larger than 1.5x cluster sizes).

We can use an Amazon Redshift snapshot from our production cluster to quickly stand up a few differently configured clusters varying in node size or node type (such as ra3.4xl vs. ra3.16xl). We use a production snapshot to create the benchmark clusters so we can keep the cluster data identical.

However, manually running the benchmark queries individually using the SQLWorkbench GUI shows query runtimes actually increased in most cases (compared to the original production cluster) despite the more powerful clusters! Upon a closer look, we realize internet transport noise has not been isolated from the query runtimes. We stand up a dedicated test EC2 machine in the same VPC and Availability Zone as our benchmark Amazon Redshift clusters and install a SQLWorkbench GUI client.

Running the benchmark queries using the SQLWorkbench GUI provides similar query runtimes as the original cluster configuration. Again, not what was expected. Upon switching to SQLWorkbench console mode, however, we observe an improvement in query runtimes by several orders of magnitude.

psql

In my experience, psql is the preferred open-source command line query tool for customers running in a Linux/UNIX environment, so in this post, I assume a Linux EC2 instance is being used to run psql. If the standard Amazon Linux AMI was chosen (usually the first one in the list) during EC2 creation, you can use the following commands to update and verify psql v9.2 on the Linux EC2 instance:

> sudo yum update
> sudo yum install postgresql
> psql --help

Feel free to also search the freely available community AMIs, which might have newer versions of PostGreSQL server and the psql client pre-installed.

After psql is installed, connecting to an Amazon Redshift cluster is pretty straightforward by specifying a few command line parameters:

psql -h <Redshift JDBC endpoint> -p <Redshift port> -U <Redshift user> -d <Redshift database>

The standard Amazon Redshift port is 5439, but I use port 8192 in the following code because of certain firewall requirements in my environment:

psql -h benchmark-redshift-cluster1.xxxxx.us-west-2.redshift.amazonaws.com -p 5439 -U masteruser -d dev

The following screenshot shows our output.

After you connect to the Amazon Redshift cluster, be sure to run the \timing on command to enable query timing.

It’s also highly recommended that you consider setting the FETCH_COUNT variable to a relatively low number on the psql console to avoid long transmission times for large result sets:

\set FETCH_COUNT 500

By setting this variable, database cursors and the FETCH command are used in conjunction with queries. Setting this variable has no impact on query processing in Amazon Redshift, but rather the number of rows returned to the client application from the fully materialized result set.

Although the command line nature of psql may have a slightly higher learning curve than similar GUI applications, it also helps keep it lightweight and introduces minimal processing noise into a query’s runtime. For example, I observed a customer’s query runtime improve by several orders of magnitude by simply switching from a GUI tool to command line psql, while keeping all other aspects of the benchmark test and environment constant.

In addition, psql’s command line interface lends itself to scripting, which opens the door to many more sophisticated benchmarking scenarios, particularly when simulating concurrent users and capturing sophisticated concurrency metrics. In fact, a number of customizable, parameter-driven scripts have already been written by AWS Analytics Specialists such as myself for sophisticated benchmarking compute and concurrency scenarios, and are freely available to current and prospective customers.

Another utility that you can use in combination with such scripts is Simple Replay, a utility that is freely available on the Amazon Redshift Utilities GitHub repo. Simply Replay can extract workload histories from a source Amazon Redshift cluster and replay those workloads (using the psql command line client) with high fidelity on a different (such as a benchmark test) Amazon Redshift cluster.

For Simple Replay to extract workload details from an Amazon Redshift cluster, audit logging must be enabled in the cluster, and it may take about an hour for the most recent workloads to propagate to the audit logs.

After we run the extract command, Simple Replay extracts workload information such as the connection patterns (for example, number of users and their connection timing), COPY and UNLOAD commands, and other SQL queries so that they can be replayed on a different Amazon Redshift cluster with high fidelity (and, in our case, using psql command line as the SQL client). The following screenshot shows our output.

The workload details are typically stored in an Amazon Simple Storage Service (Amazon S3) bucket, which is specified in the Simple Replay configuration file, among other properties. See the following screenshot.

After running the python3 Extraction.py extraction.yaml command on the command line, we can review the workload details in our target S3 bucket to verify that the expected complexity was captured. The following screenshot shows the workload details on the Amazon S3 console.

The next step is to replay the extracted workload on a baseline cluster that mirrors our production cluster configuration (to establish a baseline runtime profile) and one or more target clusters using Simple Replay’s replay capability, as shown in the following screenshot.

Now let’s take another look at the example scenario presented in the previous section to demonstrate using the psql command line client with Simple Replay. Again, Example Corp has onboarded terabytes of data, over 100 ETL jobs, and thousands of business users to our Amazon Redshift deployment over the past quarter. Data architects and engineers have observed the Amazon Redshift cluster’s average CPU utilization steadily increase, and now wish to scale the cluster up (again) before onboarding additional data, ETL jobs, and users waiting in the project pipeline.

To determine the optimal cluster size, we first use the Simple Replay utility to extract information on all concurrent workloads that have occurred in the past 48 hours, from one-time user queries to BI reporting queries to ETL transformations. After we extract the information from the logs of the source Amazon Redshift cluster, we replay the same workloads on various benchmark cluster configurations. We may repeat this process for other timeframes in the past, such as month-end reporting or timeframes that exhibited unexpected workload spikes. To determine the optimal cluster size, the Example Corp team observes the CPU utilization of each benchmark cluster configuration and chooses the best cluster offering the best price-to-performance ratio.

For other capabilities and functionality in psql scripts, I recommend you reach out to your AWS account SA to evaluate available benchmarking scripts in relation to your needs and perhaps avoid “reinventing the wheel.”

Conclusion

In this series of posts, we discussed a number of recommended best practices for conducting high-quality benchmark tests. Some of the best practices represented core principles that span all the open-source tools discussed (such as consistency in testing methodology). In this post, we reviewed the strengths and appropriateness of SQLWorkbench and psql for conducting benchmark tests. I hope this series has been helpful, and strongly encourage current and prospective customers to reach out to me or other AWS colleagues if you wish to delve deeper.

About the Author

Asser Moustafa is an Analytics Specialist Solutions Architect at AWS based out of Dallas, Texas. He advises customers in the Americas on their Amazon Redshift and data lake architectures and migrations, starting from the POC stage to actual production deployment and maintenance.

Getting the most out of your analytics stack with Amazon Redshift

2020-12-14 Rajiv Gupta

Post Syndicated from Rajiv Gupta original https://aws.amazon.com/blogs/big-data/getting-the-most-out-of-your-analytics-stack-with-amazon-redshift/

Analytics environments today have seen an exponential growth in the volume of data being stored. In addition, analytics use cases have expanded, and data users want access to all their data as soon as possible. The challenge for IT organizations is how to scale your infrastructure, manage performance, and optimize for cost while meeting these growing demands.

As a Sr. Analytics Solutions Architect at AWS, I get to learn firsthand about these challenges and work with our customers to help them design and optimize their architecture. Amazon Redshift is often a key component in that analytics stack. Amazon Redshift has several built-in features to give you out-of-the-box performance, such as automatic workload management, automatic ANALYZE, automatic VACUUM DELETE, and automatic VACUUM SORT. These tuning features enable you to get the performance you need with fewer resources. In addition, Amazon Redshift provides the Amazon Redshift Advisor, which continuously scans your Amazon Redshift cluster and provides recommendations based on best practices. All you need to do is review the recommendations and apply the ones that provide the most benefit.

In this post, we examine a few challenges to your continually evolving analytics environment and how you can configure it to get the most out of your analytics stack by using the new innovations in Amazon Redshift.

Choosing the optimal hardware

The first area to consider is to ensure that you’ve chosen the optimal node type for your workload. Amazon Redshift has three families of node types; RA3, DC2, and DS2. The newest node type, RA3, was built for compute and storage separation so you can cost-effectively scale storage and compute to handle most analytics workloads, and should be chosen for most customers. If you have smaller datasets (< 640 GB of compressed data) or heavier compute needs, you may want to consider the DC2 nodes. Finally, the DS2 node type, while still available, is considered a legacy node type. If you’re using the DS2 node type, you should migrate to RA3 to optimize costs and performance. One of the key benefits of cloud computing is that you’re not tied to one node type. It’s very fast and efficient to migrate between one node type to another using the elastic resize functionality. Because all node types are compatible, no code changes are necessary.

For each node type, there are different sizes to consider. For the RA3 node type, there are the XLPlus, 4XLarge and 16XLarge sizes, for the DC2 node types, the Large and 8XLarge sizes and for for DS2, the XLarge and 8XLarge sizes. The following table summarizes the allocated resources for each instance type as of December 11, 2020.

Instance type	Disk type	Size	Memory	vCPUs	Maximum Nodes
RA3 xlplus	Managed Storage	Scales to 32 TB*	32 GB	4	16
RA3 4xlarge	Managed Storage	Scales to 64 TB*	96 GB	12	32
RA3 16xlarge	Managed Storage	Scales to 128 TB*	384 GB	48	128
DC2 large	SSD	160 GB	16 GB	2	32
DC2 8xlarge	SSD	2.56 TB	244 GB	32	128
DS2 xlarge	Magnetic	2 TB	32 GB	4	32
DS2 8xlarge	Magnetic	16 TB	244 GB	36	128

When determining the node size and the number of nodes needed in your cluster, consider your processing needs. For the node type you’re using, start with the smallest node size and consider the larger node sizes when you exceed the threshold of number of nodes. For example, for the RA3.XLPlus node type, the maximum number of nodes is 16 nodes. When you exceed this, consider 6 or more of the RA3.4XLarge node. When you exceed 32 nodes of the RA3.4XLarge node type, consider 8 or more of the RA3.16XLarge node. The Amazon Redshift console (see the following screenshot) provides a helpful tool to help you size your cluster taking into consideration parameters such as the amount of storage you need as well as your workload.

Reserving compute power

Building and managing a data warehouse environment is a large cross-functional effort often involving an investment in both time and resources. Amazon Redshift provides deep discounts on the hardware needed to run your data warehouse if you reserve your instances. After you’ve evaluated your workloads and have a configuration you like, purchase Reserved Instances (RIs) for discounts from 20% to 75% when compared to on-demand. You can purchase RIs using a Full Upfront, Partial Upfront, or sometimes a No Upfront payment plan. Reserved Instances are not tied to a particular cluster and are pooled across your account. If your needs expand and you decide to increase your cluster size, simply purchase additional Reserved Instances.

In the following chart, you can compare the yearly on-demand cost of a Redshift cluster to the equivalent cost of a 1-year RI and a 3-year RI (sample charges and discounts are based on 1 node of dc2.large all upfront commitments in the us-east-1 Region as published on November 1st, 2020). If you use your cluster for more than 7.5 months of the year, you save money with the purchase of a 1-year RI. If you use your cluster for more than 4.5 months on-demand, you can save even more money with the purchase of a 3-year RI.

Managing intermittent workloads

If your workload is infrequently accessed, Amazon Redshift allows you to pause and resume your cluster. When your cluster is paused, you’re only charged for backup and storage costs. You can pause and resume your cluster using an API command, the Amazon Redshift console, or through a scheduler. One use case where this feature is useful is if you’re using Amazon Redshift as a compute engine that reads data from the Amazon Simple Storage Service (Amazon S3) data lake and unloads the results back to the data lake. In that use case, you only need the cluster running during the data curation process. Another use case where this feature may be useful is if the cluster only needs to be available when the data pipeline runs and for a reporting platform to refresh its in-memory storage.

When deciding to pause and resume your cluster, keep in mind the cost savings from Reserved Instances. In the following chart, we can compare the daily on-demand cost of an Amazon Redshift cluster to the equivalent cost of a 1-year RI and a 3-year RI when divided by the number of days in the RI (sample charges and discounts are based on 1 node of dc2.large all upfront commitments in the us-east-1 Region as published on November 1st, 2020). If you use your cluster for more than 15 hours in a day, you save money with the purchase of a 1-year RI. If you use your cluster for more than 9 hours in a day, you can save even more money with the purchase of a 3-year RI.

Managing data growth with RA3

With use cases expanding, there is more and more demand for data within the analytics environment. In many cases, data growth outpaces the compute needs. In traditional MPP systems, to manage data growth, you could add nodes to your cluster, archive old data, or make choices on which data to include. With Amazon Redshift RA3 with managed storage, instead of the primary storage being on your compute nodes, your primary storage is backed by Amazon S3, which allows for much greater storage elasticity. The compute nodes contain a high-performance SSD backed local cache of your data. Amazon Redshift automatically moves hot data to cache so that processing the hottest data is fast and efficient. This removes the need to worry about storage so you can scale your cluster based on your compute needs. Migrating to RA3 is fast and doesn’t require making any configuration changes. You simply resize your cluster and choose the node type and number of nodes for your target configuration. The following diagram illustrates this architecture.

Managing real-time analytics with federated query

We often see the use case of wanting a unified view of your data that is accessible within your analytics environment. That data might be high-volume log or sensor data that is being streamed into the data lake, or operational data that is generated in an OLTP database. With Amazon Redshift, instead of building pipelines to ingest that data into your data warehouse, you can use you can use the federated query feature and Amazon Redshift Spectrum to expose this data as external schemas and tables for direct querying, joining, and processing. Querying data in place reduces both the storage needs and compute needs of your data warehouse. The query engine de-composes the query and determines which parts of the query processing can be run in the source system. When possible, filters and transformations are pushed down to the source. In the case of an OLTP database, any applicable filters are applied to the query in Amazon Aurora PostgreSQL or Amazon RDS for PostgreSQL. In the case of the data lake, the processing occurs in the Amazon Redshift Spectrum compute layer. The following diagram illustrates this architecture.

Managing data growth with compression

When you analyze the steps involved in running a query, I/O operations are usually the most time-consuming step. You can reduce the number of I/O operations and maximize resources in your analytics environment by optimizing storage. Amazon Redshift is a columnar store database and organizes data in 1 MB blocks. The more data that can fit in a 1 MB block, the less I/O operations are needed for reads and writes.

For every column in Amazon Redshift, you can choose the compression encoding algorithm. Because the column compression is so important, Amazon Redshift developed a new encoding algorithm: AZ64. This proprietary algorithm is intended for numeric and data/time data types. Benchmarking AZ64 against other popular algorithms (ZSTD and LZO) showed better performance and sometimes better storage savings. To apply a column encoding, you typically specify the encoding in the CREATE TABLE statement. If you don’t specifically set a column encoding, Amazon Redshift chooses the most optimal based on the data type you specified either at table creation or when it is first loaded. If you have older tables, they may not be taking advantage of the latest encoding algorithms. You can modify the encoding using the ALTER TABLE statement. The following table summarizes your storage savings and performance improvements.

Managing spiky workloads with concurrency scaling

Analytic workloads rarely have even compute requirements 24/7. Instead, spikes appear throughout the day, whether it’s because of an ingestion pipeline or a spike in user activity related to a business event that is out of your control.

When user demand is unpredictable, you can use the concurrency scaling feature to automatically scale your cluster. When the cluster sees a spike in user activity, concurrency scaling detects that spike and automatically routes queries to a new cluster within seconds. Queries run on the concurrent cluster without any change to your application and don’t require data movement. You can configure concurrency scaling to use up to 10 concurrent clusters, but it only uses the clusters it needs for the time it needs them. When your query runs against the concurrent cluster, you only pay for the amount of time the query is run and billed per second. The following diagram illustrates this architecture.

Each cluster earns up to 1 hour of free concurrency scaling credits per day, which is sufficient for 97% of customers. You can also set up costs controls using the usage limits. This feature can alert you or even disable the feature if you exceed a certain amount of usage.

Managing spiky workloads with elastic resize

When the user demand is predictable, you can use the elastic resize feature to easily scale your cluster up and down using an API command, the console, or based on a schedule. For example, if you have an ETL workload every night that requires additional I/O capacity, you can schedule a resize to occur every evening during your ETL workload. During an elastic resize, the endpoint doesn’t change and it happens within minutes. If a session connection is running, it’s paused until the resize completes. You can then scale back down at the end of the ETL workload. The following diagram illustrates this process.

Whether it’s through elastic resize or concurrency scaling, you want to size based on your steady state compute needs, not the peaks, and use features like elastic resize and concurrency scaling.

Providing access to shared data through multiple clusters

You may have multiple groups within your organization who want to access the analytics data. One option is to load all the data into one Amazon Redshift cluster and size the cluster to meet the compute needs of all users. However, that option can be costly. Also, isolating the workloads for some of your groups provides a few benefits. Each organization can be responsible for their own cluster charges and if either group has a tight SLA, they can ensure that the other’s queries don’t cause resource contention. One solution for sharing data and isolating workloads is by using the lake house architecture. When you manage your data in a data lake, you can keep it in open formats that are easily transportable and readable by any number of analytics services. Capabilities such as Amazon Redshift Spectrum, data lake export, and INSERT (external table), enable you to easily read and write data from a shared data lake within Amazon Redshift. Each group can live query the external data and join it to any local data they may have. Each group may even consider pausing and resuming their cluster when it is not in use. The following diagram illustrates this architecture.

Amazon Redshift Spectrum even supports reading data in Apache Hudi and Delta Lake.

Summary

Tens of thousands of customers choose Amazon Redshift to power analytics across their organization, and we’re constantly innovating to meet your growing analytics needs. For more information about these capabilities and see demos of many of the optimizations, see our AWS Online Tech Talks and check out What’s New in Amazon Redshift.

About the Author

Rajiv Gupta is a data warehouse specialist solutions architect with Amazon Web Services.

Working with timestamp with time zone in your Amazon S3-based data lake

2020-12-10 Thiyagarajan Arumugam

Post Syndicated from Thiyagarajan Arumugam original https://aws.amazon.com/blogs/big-data/working-with-timestamp-with-time-zone-in-your-amazon-s3-based-data-lake/

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can use the purpose-built analytics services for a range of use cases, from analyzing petabyte-scale datasets to querying the metadata of a single object. AWS analytics services support open file formats such as Parquet, ORC, JSON, Avro, CSV, and more, so it’s convenient to analyze with the tool that is most appropriate for your use case. For more information, see Amazon S3 as the Data Lake Storage Platform.

The TIMESTAMP and TIMESTAMPTZ (TIMESTAMP with time zone) data types are key data elements associated with many time-based datasets (for example clickstream, historical sales, and forecasting) in your data lake. But when you access the data across different analytical services, such as Amazon EMR-based ETL outputs being read by Amazon Redshift Spectrum, you may not know how the data will behave. Furthermore, lack of proper handling may cause accuracy issues in timestamp with time zone data types. This post delves into handling the TIMESTAMP and TIMESTAMPTZ data types in the context of a data lake by using a centralized data architecture. Because AWS analytical services cover a broad spectrum, we primarily focus on handing timestamps using Apache Hive, Apache Spark, Apache Parquet (using Amazon EMR and Amazon Athena), and Amazon Redshift to cover both the data lake and data warehouse.

Overview of TIMESTAMP and TIMESTAMPTZ data types in your data lake

Let’s start with some common definitions of the TIMESTAMP and TIMESTAMPTZ data types.

The TIMESTAMP data type stores values that include the date and time of day. For example, 12/17/1997 17:37:16. Timestamps are presented without time zone information.
The TIMESTAMPTZ data type to stores values with the date, time of day, and time zone. For example, 12/17/1997 17:37:16 (PST).

Internally, the timestamp is as an integer, representing seconds in UTC since the epoch (1970-01-01 00:00:00 UTC) and TIMESTAMPTZ values also stored as integers with respect to Coordinated Universal Time (UTC).

When working with the TIMESTAMPTZ data type, reads and writes use the time zone of the client user machine. When no time zone is set up or if left at the default values (such as the JVM/SQL client), it defaults to UTC.

Timestamp behavior when accessed across the analytical services

For this post, we discuss handling the timestamp with time zone data when accessed individually within the services and as well as between the services. The following diagram shows the architecture for this setup.

In this architecture, Parquet objects are stored in a centralized Amazon S3-based data lake, and Amazon EMR, Athena, and Amazon Redshift are used to access this centralized data. Data is also processed by these individual engines and accessed across these services through the Amazon S3 storage.

In this post, we illustrate the behavior of the different data types when data moves across different services from the Amazon S3 Parquet files.

Processing data in Amazon EMR (ETL) and accessing it with Amazon Redshift

In this use case, the Spark or Hive data pipeline generates Parquet files in the data lake and stores it in Amazon S3. Parquet files that are stored in Amazon S3 are loaded to Amazon Redshift using the COPY command. The following diagram illustrates this workflow.

To test this setup, complete the following steps:

Create a Hive table and insert a sample row (for this post, we use an EMR cluster spun up in us-west-2, PST):

CREATE EXTERNAL TABLE clickstream_dwh.clickstream_hive
(
  sessionid             BIGINT,
  click_region       STRING,
  click_datetime_utc       TIMESTAMP,
  pageid                INT,
  productid          INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION s3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/';


insert into clickstream_dwh.clickstream_hive values (9074420482 ,'SEATTLE, US' ,'2014-04-06 02:40:13' ,3365,183876);

Verify the Parquet file content using the Parquet tool in Amazon S3:

$ parq 000000_0 --head 10
    sessionid click_region  click_datetime_utc  pageid  productid
0  9074420482        SEATTLE, US 2014-04-06 09:40:13    3365     183876

In the preceding output, the Hive client running Amazon EMR interprets the time zone with respect to the end-user client (in PST), and converts it to UTC when writing to the Parquet file.

Read through Hive and Spark (in Pacific time):

Read through Hive(Pacific):

PST:
select * from clickstream_dwh.clickstream_hive; 

+-------------+---------------+------------------------+---------+------------+
|  sessionid  | click_region  |   click_datetime_utc   | pageid  | productid  |
+-------------+---------------+------------------------+---------+------------+
| 9074420482  | SEATTLE, US         | 2014-04-06 02:40:13.0  | 3365    | 183876     |

Amazon EMR Hive and Spark convert the underlying UTC stored timestamp values in Parquet to the client user machine’s relative time (PST) when displaying the results.

Copy the Parquet file to an Amazon Redshift table with the TIMESTAMP column data type (in UTC). We use the SQL command line client tool psql to query the results in Amazon Redshift.

COPY the parquet file to Redshift table with timestamp column data type(UTC):


CREATE TABLE clickstream_dwh.clickstream_ts
(
  sessionid             BIGINT,
  click_region       VARCHAR(100),
  click_datetime_utc       TIMESTAMP,
  pageid                INT,
  productid          INT
);

dev=# SHOW TIMEZONE;
 TimeZone 
----------
 UTC
(1 row)

dev=# select * from clickstream_dwh.clickstream_ts;
 sessionid  | click_region | click_datetime_utc  | pageid | productid 
------------+--------------+---------------------+--------+-----------
 9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876
(1 row)


dev=# SET timezone = 'America/Los_Angeles';
SET

dev=# SHOW TIMEZONE;
      TimeZone       
---------------------
 America/Los_Angeles
(1 row)


dev=# select * from clickstream_dwh.clickstream_ts;
 sessionid  | click_region | click_datetime_utc  | pageid | productid 
------------+--------------+---------------------+--------+-----------
 9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876
(1 row)

Note: SET timezone = 'America/Los_Angeles' , does not affect the TIMESTAMP column.

In the preceding output, the timestamp doesn’t have a time zone. All data is interpreted in UTC or whatever raw format it was when loaded into Amazon Redshift.

Copy the Parquet file to the Amazon Redshift table using TIMESTAMPTZ (UTC & Pacific):

COPY the parquet file to Redshift table with TIMESTAMPTZ(UTC & Pacific):


CREATE TABLE clickstream_dwh.clickstream_tz
(
  sessionid             BIGINT,
  click_region       VARCHAR(100),
  click_datetime_utc       TIMESTAMPTZ,
  pageid                INT,
  productid          INT
);


COPY clickstream_dwh.clickstream_tz
FROM 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_hive/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
parquet;



dev=# SHOW TIMEZONE;
 TimeZone 
----------
 UTC
(1 row)

dev=# select * from clickstream_dwh.clickstream_tz;
 sessionid  | click_region |   click_datetime_utc    | pageid | productid 
------------+--------------+------------------------+--------+-----------
 9074420482 | SEATTLE, US        | 2014-04-06 09:40:13+00 |   3365 |    183876
(1 row)

dev=# SET timezone = 'America/Los_Angeles';
SET
dev=# 
dev=# 
dev=# SHOW TIMEZONE;
      TimeZone       
---------------------
 America/Los_Angeles
(1 row)

dev=# select * from clickstream_dwh.clickstream_tz;
 sessionid  | click_region |   click_datetime_utc    | pageid | productid 
------------+--------------+------------------------+--------+-----------
 9074420482 | SEATTLE, US        | 2014-04-06 02:40:13-07 |   3365 |    183876
(1 row)

The output shows that TIMESTAMPTZ can interpret the client time zone and convert the value with respect to the end-user client (PST), though the actual values are stored in UTC.

Processing data from Amazon Redshift and moving it to an Amazon S3 data lake

In the following use case, we copy data from Amazon Redshift to a data lake. Amazon Redshift stores the TIMESTAMP and TIMESTAMPTZ columns data types in a table. The table data is exported to Amazon S3 as Parquet files with the UNLOAD command. The following diagram illustrates this architecture.

To experiment with this setup, complete the following steps:

Unload the Amazon Redshift table data to Amazon S3 (in UTC):

UNLOAD ('select * from clickstream_dwh.clickstream_tz;')
TO 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDemoRole'
parquet;

Verify the Parquet file content:

Check the parquet file content:

$ parq 0016_part_00.parquet --head 10
    sessionid click_region   click_datetime_utc  pageid  productid
0  9074420482        SEATTLE, US 2014-04-06 09:40:13    3365     183876

Create a table in Hive and query it (in UTC):

CREATE EXTERNAL TABLE clickstream_dwh.clickstream_rs
(
  sessionid             BIGINT,
  click_region       STRING,
  click_datetime_utc       TIMESTAMP,
  pageid                INT,
  productid          INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/';



> set hive.parquet.timestamp.skip.conversion=false;


> select * from clickstream_dwh.clickstream_rs;


+-------------+---------------+------------------------+---------+------------+
|  sessionid  | click_region  |   click_datetime_utc   | pageid  | productid  |
+-------------+---------------+------------------------+---------+------------+
| 9074420482  | SEATTLE, US         | 2014-04-06 02:40:13.0  | 3365    | 183876     |
+-------------+---------------+------------------------+---------+------------+
1 row selected (0.888 seconds)

In the preceding output, the actual data in the Parquet file is stored in UTC, but Hive can read and display the local time zone using client settings.

When using Hive, set hive.parquet.timestamp.skip.conversion=false. Pre-3.1.2 Hive implementation of Parquet stores timestamps in UTC on-file; this flag allows you to skip the conversion when reading Parquet files created from other tools that may not have done so. Setting it to false treats legacy timestamps as UTC-normalized. For more information, see hive.parquet.timestamp.skip.conversion.

Query using Spark-SQL (in Pacific time):

spark-sql> select *  from clickstream_dwh.clickstream_rs;


9074420482	SEATTLE, US	2014-04-06 02:40:13	3365	183876

In the preceding output, Spark converts the values with respect to the end-user client (PST), though the actual values are stored in UTC.

Accessing data through Athena

To access the data through Athena, you need to create the external table either in the AWS Glue Data Catalog or Hive metastore. In this example, we populate the Data Catalog.

To create a table using the Data Catalog, sign in to the Athena console and run the following DDL:

CREATE EXTERNAL TABLE IF NOT EXISTS default.blog_clickstream(
  `sessionid` bigint,
  `click_region` string,
  `click_datetime_utc` timestamp,
  `pageid` int,
  `productid` int 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
) LOCATION 's3://clickstream-dwh-us-west-2/warehouse/clickstream_dwh.db/clickstream_rs/'
TBLPROPERTIES ('has_encrypted_data'='false');

The following screenshot shows the query results.

The results show that Athena converts click_datetime_utc to the user’s local time zone (in this case, PST).

Accessing data through Amazon Redshift Spectrum

To access the data through Amazon Redshift Spectrum, you need to create the following:

The external database and table in the Data Catalog or a Hive metastore. We can use the same table we already created in the preceding use case (default.blog_clickstream).
An Amazon Redshift external schema for the external database in the Data Catalog.

See the following code:

dev=# CREATE EXTERNAL SCHEMA ext_clickstream_blog
      from data catalog
      database 'default' 
      region 'us-west-2' 
      iam_role 'arn:aws:iam::123456789012:role/RedshiftDemoRole';


dev=#  select * from ext_clickstream_blog.blog_clickstream;

sessionid  | click_region | click_datetime_utc  | pageid | productid
-----------+--------------+---------------------+--------+----------
9074420482 | SEATTLE, US        | 2014-04-06 09:40:13 |   3365 |    183876

The output shows that Amazon Redshift Spectrum can convert click_datetime_utc to the local time zone of the user (PST).

Use cases for handling TIMESTAMP AND TIMESTAMPTZ data types

When implementing the data model for your data lake, the choice between selecting the TIMESTAMP or TIMESTAMPTZ data type depends on how your end-users consume the data. In this section, we discuss two different use cases.

Using TIMESTAMP for a uniform display of one normalized time

When you want a uniform display of a standard time (in a particular time zone), use the TIMESTAMP data type and baseline the values that are stored into a particular time zone. For example, the following table shows collected clickstream data from a global website.

sessionid	click_region	click_datetime_utc	pageid	productid
3682484416	Chennai	2014-04-06T 18:44:58	7156	309743
6367587374	London	2014-04-06T 18:44:58	5298	749625
9074420482	Los Angles	2014-04-06T 09:40:13	3365	183876
1746153004	Perth	2014-04-04T 06:13:28	3761	725195
1449344779	Singapore	2014-04-04T 06:13:28	3527	140229
4115543521	New York	2014-04-07T 09:22:28	3712	831655
2748081381	Paris	2014-04-07T 09:22:28	8474	347742
1120684200	New York	2014-04-07T 09:22:28	2731	568755

The clicks for the website come from users across the globe and are normalized for the UTC time zone using the click_datetime_utc column and a TIMESTAMP data type. You can accomplish this step during the data transformation process. Normalizing the data avoids confusion when data is analyzed across the different Regions without needing explicit conversion.

Using TIMESTAMPTZ for a contextual display of data depending on the user’s local time zone

When you need a contextual display of date and time for users accessing the data, choose the TIMESTAMPTZ data type. For example, let’s consider a customer service application that is accessing data from a centralized data warehouse. Individual users of the application are interested in analyzing data with respect which location the issue happened, rather than a normalized time zone. See the following code:

create table public.customer_issue_log
(
customer_id	BIGINT NOT NULL,
customer_location	VARCHAR(50) NOT NULL,
customer_timezone	VARCHAR(10) NOT NULL,
issue_create_time timestamptz NOT NULL,
issue_create_time_utc timestamp  NULL,
issue_id	INTEGER NOT NULL,
issue_severity INTEGER NOT NULL
);

The following table summarizes the output.

customer_id	customer_location	customer_timezone	issue_create_time	issue_create_time_utc	issue_id	issue_severity
9589430063	Chennai, India	IST	4/5/2014 11:44:58 PM	4/6/2014 04:44:58 AM	2343	3
2796599493	New York, US	EST	4/6/2014 06:44:58 AM	4/6/2014 11:44:58 AM	2780	4
1836626118	Toronto, CA	EDT	4/4/2014 05:13:28 AM	4/4/2014 10:13:28 AM	7821	1
6790206978	Sydney, Australia	AEDT	4/5/2014 05:40:13 PM	4/5/2014 10:40:13 PM	3135	5

The issue_create_time column stores the date and time values and the time zone. When you query this table, you can view issue_create_time in your local time zone automatically (without any explicit conversion) by configuring set timezone or using a SQL client (such as SQL Workbench) that automatically adjusts this with respect your local computer settings.

In addition, you can also introduce optional redundant columns such as issue_create_time_utc for ease of use when users try to analyze the data across different Regions.

Independent of the approach taken for the implementation, there is no loss of timestamp or time zone values when using the preceding approach, and you can perform data aggregation on both columns without needing explicit conversion because all data is stored in UTC in the underlying storage (Parquet in Amazon Redshift). For example, you can roll up data into weekly or monthly aggregates across the Regions without any explicit conversion. The following example code calculates the weekly number of issues by priority across all locations:

select date_trunc('week', issue_create_time) wk, issue_severity, count(issue_id) from public.customer_issue_log
where
group by date_trunc('week', issue_create_time), issue_severity;

Best practices for handling timestamps and time zones with data types

You should handle dates as either DATE, TIMESTAMP, or TIMESTAMPTZ data types and not convert them to strings. When dates are interpreted from strings, you lose all the features and flexiblity of working with date fields and date calculations, and also lose efficiency of processing. Moreover, casting or converting at runtime can be expensive.

When using TIMESTAMP or TIMESTAMPTZ data types, be aware of the client tools that access them. Client tool behavior largely depends on the local setting of the drivers and JVM. But it’s possible to override the behavior and always check for client tool-specific default behavior.

Use TIMESTAMPTZ only when absolutely necessary in the data model. In most use cases, TIMESTAMP simplifies data handling and avoids ambiguity when users access them.

Summary

In this post, we talked about handling and using TIMESTAMP and TIMESTAMPTZ data types with an Amazon S3-backed data lake. Most importantly, we covered how different AWS services like Amazon Redshift, Amazon EMR, Hive, and many other client tools interpret and interact with these data types. Choosing between using TIMESTAMP or TIMESTAMPTZ depends on the use case and how the end-user wants to visualize the data (a uniform display with one normalized time or a contextual display depending on time zone, respectively). Happy timestamping!

About the Authors

Thiyagarajan Arumugam is a Principal Solutions Architect at Amazon Web Services and designs customer architectures to process data at scale. Prior to AWS, he built data warehouse solutions at Amazon.com. In his free time, he enjoys all outdoor sports and practices the Indian classical drum mridangam.

Srinivasan Krishnasamy is a ‘Senior Big Data Consultant’ at Amazon Web Services. He joined AWS in 2015 and specializes in building and supporting Big Data solutions that help customers to ingest, process and analyze data at scale.

Satish Sathiya is a Product Engineer at Amazon Redshift. He is an avid big data enthusiast who collaborates with customers around the globe to achieve success and meet their data warehousing and data lake architecture needs.

Using pipes to explore, discover and find data in Amazon ES with Piped Processing Language

2020-12-09 Viraj Phanse

Post Syndicated from Viraj Phanse original https://aws.amazon.com/blogs/big-data/using-pipes-to-explore-discover-and-find-data-in-amazon-es-with-piped-processing-language/

System developers, DevOps engineers, support engineers, site reliability engineers (SREs), and IT managers make sure that the underlying infrastructure powering the applications and systems within an organization is available, reliable, secure, and scalable. To achieve these goals, you need to perform a fast and deep analysis on the underlying logs, monitoring, and observability data. Amazon Elasticsearch Service (Amazon ES) is a popular choice to store and analyze such data. However, extracting insights from Elasticsearch isn’t easy. Although Query DSL (the language used to query data stored in Elasticsearch) is powerful, it has a steep learning curve, and wasn’t designed as a human interface to easily create one-time queries and explore user data.

In this post, we discuss the newly supported Piped Processing Language (PPL) feature, powered by Open Distro for Elasticsearch, which enables you to form complex queries and quickly explore and discover data with the help of pipes.

What is Piped Processing Language?

Piped Processing Language is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch. PPL enables you to explore, discover, and find data stored in Elasticsearch, using a set of commands delimited by pipes ( | ).

Pipes allow you to combine two or more commands as a chain, such that the output of one command acts as an input for the next command, very similar to Unix pipes. With PPL, you can now search for keywords and feed the results from the command on the left of the pipe to the command on the right of the pipe, effectively creating a command pipeline.

Use case

As an illustration, consider a use case where you want to find out the number of hosts that are responding with HTTP 404 (Page not found) and HTTP 503 (Server Unavailability) errors, aggregate the error responses per host, and sort in the order of impact.

Using Query DSL

When you use Query DSL, the query looks similar to the following code:

GET kibana_sample_data_logs/_search
{"from":0,"size":0,"timeout":"1m","query":{"bool":{"should":[{"term":{"response.keyword":{"value":"404","boost":1}}},{"term":{"response.keyword":{"value":"503","boost":1}}}],"adjust_pure_negative":true,"boost":1}},"sort":[{"_doc":{"order":"asc"}}],"aggregations":{"composite_buckets":{"composite":{"size":1000,"sources":[{"host":{"terms":{"field":"host.keyword","missing_bucket":true,"order":"asc"}}},{"response":{"terms":{"field":"response.keyword","missing_bucket":true,"order":"asc"}}}]},"aggregations":{"request_count":{"value_count":{"field":"request.keyword"}},"sales_bucket_sort":{"bucket_sort":{"sort":[{"request_count":{"order":"desc"}}],"size":10}}}}}}

The following screenshot shows the query results.

Using PPL

You can replace the entire DSL query with a single PPL command:

source = kibana_sample_data_logs | where response='404' or response='503' | stats count(request) as request_count by host, response | sort -request_count

The following screenshot shows the query results.

Commands and functions supported by PPL

PPL supports a comprehensive set of commands, including search, where, field, rename, dedup, sort, stats, eval, head, top, and rare. These commands are read-only requests to process data and return results. The following table summarizes the purpose of each command.

Command	What does it do?	Example	Result
`search source`	Retrieves documents from the index. The keyword search can be ignored.	`source=accounts;`	Retrieves all documents from the accounts index.
`field`	Keeps or removes fields from the search result.	`source=accounts \| fields account_number`, `firstname`, `lastname`;	Gets account_number, firstname, and lastname fields from the search result.
`dedup`	Removes duplicate documents defined by a field from the search result.	`source=accounts \| dedup gender \| fields account_number`, `gender`;	Removes duplicate documents with the same gender.
`stats`	Aggregates the search results using `sum`, `count`, `min`, `max`, and `avg`.	`source=accounts \| stats avg(age)`;	Calculates the average age of all accounts.
`eval`	Evaluates an expression and appends its result to the search result.	`search source=accounts \| eval doubleAge = age * 2 \| fields age`, `doubleAge`;	Creates a new doubleAge field for each document that is age * 2.
`head`	Returns the first N number of results in a specified search order.	`search source=accounts \| fields firstname, age \| head`;	Fetches the first 10 results.
`top`	Finds the most common values of all fields in the field list.	`search source=accounts \| top gender`;	Finds the most common value of gender.
`rare`	Finds the least common values of all fields in a field list.	`search source=accounts \| rare gender`;	Finds the least common value of gender.
`where`	Filters the search result.	`search source=accounts \| where account_number=1` or `gender="F" \| fields account_number, gender`;	Gets all the documents from the account index.
`rename`	Renames one or more fields in a search result.	`search source=accounts \| rename account_number as an \| fields acc`;	Renames the account field as acc.
`sort`	Sorts results in a specified field.	`search source=accounts \| sort age \| fields account_number, age`;	Sorts all documents by age field in ascending order.

PPL also supports functions including date-time, mathematical, string, aggregate, and trigonometric, and operators and expressions.

Summary

Piped Processing Language, powered by Open Distro for Elasticsearch, has a comprehensive set of commands and functions that enable you to quickly begin extracting insights from your data in Elasticsearch. It’s supported on all Amazon ES domains running Elasticsearch 7.9 or greater. PPL also expands the capabilities of the Query Workbench in Kibana in addition to SQL. For more information, see Piped Processing Language.

About the Author

Viraj Phanse is a product management leader at Amazon Web Services for Search Services/Analytics. An avid foodie, he loves trying cuisines from around the globe. In his free time, he loves to play his keyboard and travel.

Introducing Amazon Redshift RA3.xlplus nodes with managed storage

2020-12-09 BP Yau

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/introducing-amazon-redshift-ra3-xlplus-nodes-with-managed-storage/

Since we launched Amazon Redshift as a cloud data warehouse service more than seven years ago, tens of thousands of customers have built analytics workloads using it. We’re always listening to your feedback and, in December 2019, we announced our third-generation RA3 node type to provide you the ability to scale and pay for compute and storage independently. In this post, I share more about RA3, including a new smaller size node type, and information about how to get started.

RA3 nodes in Amazon Redshift

The new RA3 nodes let you determine how much compute capacity you need to support your workload and then scale the amount of storage based on your needs. The managed storage of Amazon Redshift automatically uses high-performance, SSD-based local storage as its tier-1 cache. The managed storage takes advantage of optimizations such as data block temperature, data block age, and workload patterns to optimize Amazon Redshift performance and manage data placement across tiers of storage automatically. No action or changes to existing workflows are necessary on your part.

The first member of the RA3 family was the RA3.16xlarge, and we subsequently added the RA3.4xlarge to cater to the needs of customers with a large number of workloads.

We’re now adding a new smaller member of the RA3 family, the RA3.xlplus.

This allows Amazon Redshift to deliver up to three times better price performance than other cloud data warehouses. For existing Amazon Redshift customers using DS2 instances, you get up to two times better performance and double the storage at the same cost when you upgrade to RA3. RA3 also includes AQUA (Advanced Query Accelerator) for Amazon Redshift at no additional cost. AQUA is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to ten times faster than other cloud data warehouses by automatically boosting certain types of queries. The preview is open to all customers now, and it will be generally available in January 2021.

RA3 nodes with managed storage are a great fit for analytics workloads that require best price per performance, with massive storage capacity and the ability to scale and pay for compute and storage independently. In the past, there was pressure to offload or archive old data to other storage because of fixed storage limits, which made maintaining the operational analytics dataset and querying the larger historical dataset when needed difficult. With Amazon Redshift managed storage, we’re meeting the needs of customers that want to store more data in their data warehouse.

The new RA3.xlplus node provides 4 vCPUs and 32 GiB of RAM and addresses up to 32 TB of managed storage. A cluster with RA3.xlplus node-type can contain up to 32 of these instances, for a total storage of 1024 TB (that’s 1 petabyte!). With the new smaller RA3.xlplus instance type, it’s even easier to get started with Amazon Redshift.

The differences between RA3 nodes are summarized in the following table.

	vCPU	Memory	Storage Quota	I/O	Price (US East (N. Virginia))
ra3.xlplus	4	32 GiB	32TB RMS	0.65 GB/sec	$1.086 per hour
ra3.4xlarge	12	96 GiB	64TB RMS	2 GB/sec	$3.26 per hour
ra3.16xlarge	48	384 GiB	64TB RMS	8 GB/sec	$13.04 per hour

Creating a new Amazon Redshift cluster

You can create a new cluster on the Amazon Redshift console or the AWS Command Line Interface (AWS CLI). For this post, I walk you through using the Amazon Redshift console.

On the Amazon Redshift console, choose Clusters.
Choose Create cluster.
For Choose the size of the cluster, choose I’ll choose.
For Node type, select your preferred node type (for this post, we select xlplus).

The following screenshot shows the Cluster configuration page for the US East (N. Virginia) Region. The price may vary slightly in different Regions.

If you have a DS2 or DC2 instance-based cluster, you can create an RA3 cluster to evaluate the new instance with managed storage. You use a recent snapshot of your Amazon Redshift DS2 or DC2 cluster to create a new cluster based on RA3 instances. We recommend using 2 nodes of RA3.xlplus for every 3 nodes of DS2.xl or 3 nodes of RA3.xlplus for every 8 nodes of DC2.large. For more information about upgrading sizes, see Upgrading to RA3 node types. You can always adjust the compute capacity by adding or removing nodes with elastic resize.

If you’re migrating to Amazon Redshift from on-premises data warehouses, you can size your Amazon Redshift cluster using the sizing widget on the Amazon Redshift console. On the Cluster configuration page, for Choose the size of the cluster, choose Help me choose.

Answer the subsequent questions on total storage size and your data access pattern in order to size your cluster’s compute and storage resource.

The sizing widget recommends the cluster configuration in the Calculated configuration summary section.

Conclusion

RA3 instances are now available in 16 AWS Regions. For the latest RA3 node type availability, see RA3 node type availability in AWS Regions.

The price varies from Region to Region, starting at $1.086 per hour per node in US East (N. Virginia). For more information, see Amazon Redshift pricing.

About the Author

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Amazon EMR Studio (Preview): A new notebook-first IDE experience with Amazon EMR

2020-12-09 Fei Lang

Post Syndicated from Fei Lang original https://aws.amazon.com/blogs/big-data/amazon-emr-studio-preview-a-new-notebook-first-ide-experience-with-amazon-emr/

We’re happy to announce Amazon EMR Studio (Preview), an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug applications written in R, Python, Scala, and PySpark. EMR Studio provides fully managed Jupyter notebooks and tools like Spark UI and YARN Timeline Service to simplify debugging. EMR Studio uses AWS Single Sign-On (AWS SSO), and allows you to log in directly with your corporate credentials without signing in to the AWS Management Console.

With EMR Studio, you can run notebook code on Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), and debug your applications. For more information about Amazon EMR on Amazon EKS, see What is Amazon EMR on EKS.

EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing with the performance-optimized Apache Spark runtime that Amazon EMR provides. You can also install custom kernels and libraries, collaborate with peers using code repositories such as GitHub and Bitbucket, or run parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow or Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Administrators can set up EMR clusters that can be used by EMR Studio users, or create predefined AWS CloudFormation templates for Amazon EMR and allow you to simply choose a template for creating your own cluster.

In this post, we discuss the benefits that EMR Studio offers and we introduce to you some of its capabilities. To learn more about creating and using EMR Studios, see Use Amazon EMR Studio.

Benefits of using EMR Studio

EMR Studio offers the following benefits:

Set up a unified experience to develop and diagnose EMR Spark applications – Administrators can set up EMR Studio to allow you to log in using your corporate credentials without having to sign in to the AWS console. You get a single unified environment to interactively explore, process, and visualize data using notebooks, build and schedule pipelines, and debug applications without having to log in to EMR clusters.
Use fully managed Jupyter notebooks – With EMR Studio, you can develop analytics and data science applications in R, Python, Scala, and PySpark with fully managed Jupyter notebooks. You can take advantage of distributed processing using the performance-optimized Amazon EMR runtime for Apache Spark with Jupyter kernels and applications running on EMR clusters. you can attach notebooks to an existing cluster that uses Amazon EC2 instances, or to an EMR on EKS virtual cluster. You can also start your own clusters using templates pre-configured by administrators.
Collaborate with others using code repositories – From the EMR Studio notebooks environment, you can connect to code repositories such as AWS CodeCommit, GitHub, and Bitbucket to collaborate with peers.
Run custom Python libraries and kernels – From EMR Studio, you can install custom Python libraries or Jupyter kernels required for your applications directly to the EMR clusters.
Automate workflows using pipelines – EMR Studio makes it easy to move from prototyping to production. You can create EMR Studio notebooks that can be programmatically invoked with parameters, and use APIs to run the parameterized notebooks. You can also use orchestration tools such as Apache Airflow or Amazon MWAA to run notebooks in automated workflows.
Simplified debugging – With EMR Studio, you can debug jobs and access logs without logging in to the cluster. EMR Studio provides native application interfaces such as Spark UI and YARN Timeline. When a notebook is run in EMR Studio, the application logs are uploaded to Amazon Simple Storage Service (Amazon S3). As a result, you can access logs and diagnose applications even after your EMR cluster is terminated. You can quickly locate the job to debug by filtering based on the cluster or time when the application was run.

In the following section, we demonstrate some of the capabilities of Amazon EMR Studio using a sample notebook. For our sample notebook, we use the open-source, real-time COVID-19 US daily case reports provided by Johns Hopkins University CSSE from the following GitHub repo.

Notebook-first IDE experience with AWS SSO integration

EMR Studio makes it simple to interact with applications on an EMR cluster. After an administrator sets up EMR Studio and provides the access URL (which looks like https://es-*************************.emrstudio.us-east-1.amazonaws.com), you can log in to EMR Studio with your corporate credentials.

After you log in to EMR Studio, you get started by creating a Workspace. A Workspace is a collection of one or more notebooks for a project. The Workspaces and the notebooks that you create in EMR Studio are automatically saved in an Amazon S3 location.

Now, we create a Workspace by completing the following steps:

On the EMR Studio Dashboard page, choose Create Workspace.
On the Create a Workspace page, enter a Workspace name and a Description.

Naming the Workspace helps identify your project. Your workspace is automatically saved, and you can find it later on the Workspaces page. For this post, we name our Workspace EMR-Studio-WS-Demo1.

On the Subnet drop-down menu, choose a subnet for your Workspace.

Each subnet belongs to the same Amazon Virtual Private Cloud (Amazon VPC) as your EMR Studio. Your administrator may have set up one or more subnets to use for your EMR clusters. You should choose a subnet that matches the subnet where you use EMR clusters. If you’re not sure about which subnet to use, contact your administrator.

For S3 location, choose the Amazon S3 location where EMR Studio backs up all notebook ﬁles in the Workspace.

This location is where your Workspace and all the notebooks in the Workspace are automatically saved.

In the Advanced configuration section, you can attach an EMR cluster to your Workspace.

For this post, we skip this step. EMR Studio allows you to create Workspaces and notebooks without attaching to an EMR cluster. You can attach an EMR cluster later when you’re ready to run your notebooks.

Choose Create Workspace.

Fully managed environment for managing and running Jupyter-based notebooks

EMR Studio provides a fully managed environment to help organize and manage Workspaces. Workspaces are the primary building blocks of EMR Studio, and they preserve the state of your notebooks. You can create different Workspaces for each project. From within a Workspace, you can create notebooks, link your Workspace to a code repository, and attach your Workspace to an EMR cluster to run notebooks. Your Workspaces and the notebooks and settings it contains are automatically saved in the Amazon S3 location that you specify.

If you created the workspace EMR-Studio-WS-Demo1 by following the preceding steps, it appears on the Workspaces page with the name EMR-Studio-WS-Demo1 along with status Ready, creation time, and last modified timestamp.

The following table describes each possible Workspace status.

Status	Meaning
`Starting`	The Workspace is being prepared, but is not yet ready to use.
`Ready`	You can open the Workspace to use the notebook editor. When a Workspace has a `Ready` status, you can open or delete it.
`Attaching`	The Workspace is being attached to a cluster.
`Attached`	The Workspace is attached to an EMR cluster. If a Workspace is not attached to an EMR cluster, you need to attach it to an EMR cluster before you can run any notebook code in the Workspace.
`Idle`	The Workspace is stopped and currently idle. When you launch an idle Workspace, the Workspace status changes from `Idle` to `Starting` to `Ready`.
`Stopping`	The Workspace is being stopped.
`Deleting`	When you delete a Workspace, it’s marked for deletion. EMR Studio automatically deletes Workspaces marked for deletion. After a Workspace is deleted, it no longer shows in the list of Workspaces.

You can choose the Workspace that you created (EMR-Studio-WS-Demo1) to open it. This opens a new web browser tab with the JupyterLab interface. The icon-denoted tabs on the left sidebar allow you to access tool panels such as the ﬁle browser or JupyterLab command palette. To learn more about the EMR Studio Workspace interface, see Understand the Workspace User Interface.

EMR Studio automatically creates an empty notebook with the same name as the Workspace. For this post, we the Workspace that we created, it automatically creates EMR-Studio-WS-Demo1.ipynb. In the following screenshot, no cluster or kernel is specified in the top right corner, because we didn’t choose to attach any cluster while creating the Workspace. You can write code in your new notebook, but before you run your code, you need to attach it to an EMR cluster and specify a kernel. To attach your workspace to a cluster, choose the EMR clusters icon on the left panel.

Linking Git-based code repositories with your Workspace

You can collaborate with your peers by sharing notebooks as code via code repositories. EMR Studio supports the following Git-based services:

AWS CodeCommit
GitHub
Bitbucket

This capability provides the following beneﬁts:

Version control – Record code changes in a version control system so you can review the history of your changes and selectively revert them.
Collaboration – Share code with team members working in diﬀerent Workspaces through remote Git-based repositories. Workspaces can clone or merge code from remote repositories and push changes back to those repositories.
Code reuse – Many Jupyter notebooks that demonstrate data analysis or machine learning techniques are available in publicly hosted repositories, such as GitHub. You can associate your Workspace with a GitHub repository to reuse the Jupyter notebooks contained in a repository.

To link Git repositories to your Workspace, you can link an existing repository or create a new one. When you link an existing repository, you choose from a list of Git repositories associated with the AWS account in which your EMR Studio was created.

We add a new repository by completing the following steps:

Choose the Git icon.
For Repository name¸ enter a name (for example, emr-notebook).
For Git repository URL, enter the URL for the Git repo (for this post, we use the sample notebook at https://github.com/emrnotebooks/notebook_execution).
For Git credentials, select your credentials. Because we’re using a public repo, we select Use a public repository without credentials.
Choose Add repository.

After it’s added, we can see the repo on the Git repositories drop-down menu.

Choose the repo to link to the Workspace.

You can link up to three Git repositories with an EMR Studio Workspace. For more information, see Link Git-Based Repositories to an EMR Studio Workspace.

Choose the File browser icon to locate the Git repo we just linked.

Attaching and detaching Workspaces to and from EMR clusters

EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance-optimized EMR runtime for Apache Spark. You can attach your Workspace to an EMR cluster and get distributed data processing using Spark or custom kernels. You can use primary node capacity to run non-distributed applications.

In addition to using Amazon EMR clusters running on Amazon EC2, you can attach a Workspace to an Amazon EMR on EKS virtual cluster to run notebook code. For more information about how to use an Amazon EMR on EKS cluster in EMR Studio, see Use an Amazon EMR on EKS Cluster to Run Notebook Code.

Before you can run your notebooks, you must attach your Workspace to an EMR cluster. For more information about clusters, see Create and Use Clusters with EMR Studio.

To run the Git repo notebooks that we linked in the previous step, complete the following steps:

Choose the EMR cluster
Attach the Workspace to an existing EMR cluster running on Amazon EC2 instances.
Open the notebook demo_pyspark.ipynb from the Git repo emr-notebook that we linked to the Workspace.

In the upper right corner of the Workspace UI, we can see the ID of the EMR cluster being attached to our Workspace, as well as the kernel selected to run the notebook.

Record the value of the cluster ID (for example, <j-*************>).

We use this value later to locate the EMR cluster for application debugging purposes.

You can also detach the Workspace from the cluster in the Workspace UI and re-attach it to another cluster. For more information, see Detach a Cluster from Your Workspace.

Being able to easily attach and detach to and from any EMR cluster allows you to move any workload from prototyping into production. For example, you can start your prototype development by attaching your workspace to a development EMR cluster and working with test datasets. When you’re ready to run your notebook with larger production datasets, you can detach your workspace from the development EMR cluster and attach it to a larger production EMR cluster.

Installing and loading custom libraries and kernels

You can install notebook-scoped libraries with a PySpark kernel in EMR Studio. The libraries installed are isolated to your notebook session and don’t interfere with libraries installed via EMR bootstrap actions, or libraries installed by other EMR Studio notebook sessions that may be running on the same EMR cluster. After you install libraries for your Workspace, they’re available for other notebooks in the Workspace in the same session.

Our sample notebook demo_pyspark.ipynb is a Python script. It uses real-time COVID-19 US daily case reports as input data. The following parameters are defined in the first cell:

DATE – The given date used when the notebook job is started.
TOP_K – The top k US states with confirmed COVID-19 cases. We use this to plot Graph a.
US_STATES – The names of the specific US states being checked for the fatality rates of COVID-19 patients. We use this plot Graph b.

The parameters can be any of the Python data types.

Running this notebook plots two graphs:

Graph a – Visualizes the top k US states with most the COVID-19 cases on a given date
Graph b – Visualizes the fatality rates among specific US states on a given date

In our notebook, we install notebook-scoped libraries by running the following code from within a notebook cell:

sc.install_pypi_package("pandas==0.25.1")
sc.install_pypi_package("requests==2.24.0")
sc.install_pypi_package("numpy==1.19.1")
sc.install_pypi_package("kiwisolver==1.2.0")
sc.install_pypi_package("matplotlib==3.3.0")

We use these libraries in the subsequent cells for the further data analysis and visualization steps in the notebook.

The following set of parameters is used to run the notebook:

{"DATE": "10-15-2020",
 "TOP_K": 6,
"US_STATES": ["Wisconsin", "Texas", "Nevada"]}

Running all the notebook cells generates two graphs. Graph a shows the top six US states with confirmed COVID-19 cases on October 15, 2020.

Graph b shows the fatality rates of COVID-19 patients in Texas, Wisconsin, and Nevada on October 15, 2020.

EMR Studio also allows you to install Jupyter notebook kernels and Python libraries on a cluster primary node, which makes your custom environment available to any EMR Studio Workspace attached the cluster. To install the sas_kernel kernel on a cluster primary node, run the following code within a notebook cell:

!/emr/notebook-env/bin/pip install sas_kernel

The following screenshot shows your output.

For more information about how to install kernels and use libraries, see Installing and Using Kernels and Libraries.

Diagnosing applications and jobs with EMR Studio

In EMR Studio, you can quickly debug jobs and access logs without logging in to the cluster, such as setting up a web proxy through an SSH connection, for both active and stopped clusters. You can use native application interfaces such as Spark UI and YARN Timeline Service directly from EMR Studio. EMR Studio also allows you to quickly locate the cluster or job to debug by using filters such as cluster state, creation time, and cluster ID. For more information, see Diagnose Applications and Jobs with EMR Studio.

Now, we show you how to open a native application interface to debug the notebook job that already finished.

On the EMR Studio page, choose Clusters.

A list appears with all the EMR clusters launched under the same AWS account. You can filter the list by cluster state, cluster ID, or creation time range by entering values in the provided ﬁelds.

Choose the cluster ID of the EMR cluster that we attached to the Workspace EMR-Studio-WS-Demo1 for running notebook demo_pyspark.ipynb.
For Spark job debugging, on the Launch application UIs menu, choose Spark History Server.

The following screenshot shows you the Spark job debugging UI.

We can traverse the details for our notebook application by checking actual logs from the Spark History Server, as in the following screenshot.

For Yarn application debugging, on the Launch application UIs menu, choose Yarn Timeline Server.

The following screenshot shows the Yarn debugging UI.

Orchestrating analytics notebook jobs to build ETL production pipelines

EMR Studio makes it easy for you to move any analytics workload from prototyping to production. With EMR Studio, you can run parameterized notebooks as part of scheduled workflows using orchestration services like AWS Step Functions and Apache Airflow or Amazon MWAA.

In this section, we show a simple example of how to orchestrate running notebook workflows using Apache Airflow.

We have a fully tested notebook under an EMR Studio Workspace, and want to schedule a workflow that runs the notebook on an on-demand EMR cluster every 10 minutes.

Record the value of the Workspace ID (for example, e-*****************************) and the notebook file path relative to the home directory within the Workspace (for example, demo.ipynb or my_folder/demo.ipynb)

The workflow that we create takes care of the following tasks:

Create an EMR cluster.
Wait until the cluster is ready.
Start running a notebook defined by the Workspace ID, notebook file path, and the cluster created.
Wait until the notebook is complete.

The following screenshot is the tree view of this example DAG. The DAG definition is available on the GitHub repo. Make sure you replace any placeholder values with the actual ones before using.

When you open the Gantt chart of one of the successful notebooks, we can see the timeline of our workflow. The time spent creating the cluster and creating a notebook execution is negligible compared to the time spent waiting for the cluster to be ready and waiting for the notebook to finish, which meets the expectation of our SLA.

This example is a just starting point. Try it out and extend it with more sophisticated workflows that suit your needs.

Summary

In this post, we highlighted some of the capabilities of EMR Studio, such as the ability to log in via AWS SSO, access fully managed Jupyter notebooks, link Git-based code repositories, change clusters, load custom Python libraries and kernels, diagnose clusters and jobs using native application UIs, and orchestrate notebook jobs using Apache Airflow or Amazon MWAA.

There is no additional charge for using EMR Studio in public preview, and you only pay for the use of the EMR cluster or other AWS services such as AWS Service Catalog. For more information, see the EMR Studio FAQs.

EMR Studio is available on Amazon EMR release version 6.2 and later, in the US East (N. Virginia), US West (Oregon), and EU (Ireland) Regions for public preview. For the latest Region availability for the public preview, see Considerations.

If you have questions or suggestions, feel free to leave a comment.

About the Authors

Fei Lang is a Senior Big Data Architect at Amazon Web Services. She is passionate about building the right big data solution for customers. In her spare time, she enjoys the scenery of the Pacific Northwest, going for a swim, and spending time with her family.

Shuang Li is a Senior Product Manager for Amazon EMR at AWS. She holds a doctoral degree in Computer Science and Engineering from Ohio State University.

Ray Liu is a Software Development Engineer at AWS. Besides work, he enjoys traveling and spending time with family.

Kendra Ellis is a Programmer Writer at AWS.

Get up to 3x better price performance with Amazon Redshift than other cloud data warehouses

2020-12-08 Eugene Kawamoto

Post Syndicated from Eugene Kawamoto original https://aws.amazon.com/blogs/big-data/get-up-to-3x-better-price-performance-with-amazon-redshift-than-other-cloud-data-warehouses/

Since we announced Amazon Redshift in 2012, tens of thousands of customers have trusted us to deliver the performance and scale they need to gain business insights from their data. Amazon Redshift customers span all industries and sizes, from startups to Fortune 500 companies, and we work to deliver the best price performance for any use case. Earlier in 2020, we published a blog post about improved speed and scalability in Amazon Redshift. This includes optimizations such as dynamically adding cluster capacity when you need it with concurrency scaling, making sure you use cluster resources efficiently with automatic workload management (WLM), and automatically adjusting data layout, distribution keys, and query plans to provide optimal performance for a given workload. We also described how customers, including Codeacademy, OpenVault, Yelp, and Nielsen, have taken advantage of Amazon Redshift RA3 nodes with managed storage to scale their cloud data warehouses and reduce costs.

In addition to improving performance and scale, we are constantly looking at how to also improve the price performance that Amazon Redshift provides. One of the ways we ensure that we provide the best value for customers is to measure the performance of Amazon Redshift and other cloud data warehouses regularly using queries derived from industry-standard benchmarks such as TPC-DS. We completed our most recent tests based on the TPC-DS benchmark in November using the latest version of the products available across the vendors tested at that time. For Amazon Redshift, this includes more than 15 new capabilities released this year prior to November, but not new capabilities announced during AWS re:Invent 2020.

Best Out-of-the-Box and Tuned Price Performance

The test completed in November showed that Amazon Redshift delivers up to three times better price performance out-of-the-box than other cloud data warehouses. The following chart illustrates these findings.

For this test, we ran all 99 queries from the TPC-DS benchmark against a 3 TB data set. We calculated price performance by multiplying the time required to run all queries in hours by the price per hour for each cloud data warehouse. We used clusters with comparable hardware characteristics for each data warehouse. We also used default settings for each cloud data warehouse, except we enabled encryption for all four services because it’s on for two by default, and we disabled result caching where applicable. The default settings allowed us to determine the price performance delivered with no manual tuning effort. We selected the best result out of three runs for each query in order to take advantage of optimizations provided by each service. Finally, to ensure an apples-to-apples comparison, we used public pricing, and compared price performance rather than performance alone. For Amazon Redshift specifically, we used on-demand pricing; Amazon Redshift Reserved Instance pricing provides up to a 60% discount vs. on-demand pricing.

These results show that Amazon Redshift provides the best price performance out-of-the-box, even for a comparatively small 3 TB dataset. This means that you can benefit from Amazon Redshift’s leading price performance from the start without manual tuning.

You can also take advantage of performance tuning techniques for Amazon Redshift to achieve even better results for your workloads. We repeated the benchmark test using tuning best practices provided by each cloud data warehouse vendor. After all cloud data warehouses are tuned, Amazon Redshift has 1.5 times better price performance than the nearest cloud data warehouse competitor, as shown in the following chart.

As with all benchmarks, transparency and reproducibility are crucial. For this reason, we have made the data and queries available on GitHub for anyone to use. See the README in GitHub for detailed instructions on re-running these benchmarks.

Tuned price performance improves as your data warehouse grows

One critical aspect of a data warehouse is how it scales as your data grows. Will you be paying more per TB as you add more data, or will your costs remain consistent and predictable? We work to make sure that Amazon Redshift delivers not only strong performance as your data grows, but also consistent price performance. We tested Amazon Redshift price performance using TPC-DS with 3 TB, 30 TB, and 100 TB datasets on three different cluster sizes. As shown in the following graph, Amazon Redshift tuned price performance improved (from $2.80 to $2.41 per TB per run) as the datasets grew. Tuning reduces the amount of network and disk I/O required for a given workload, and has varying impact depending on the combination of workload and cluster size.

In addition, as shown in the following table, Amazon Redshift out-of-the-box price performance is nearly the same ($4.80 to $5.01 per TB per run) for all three dataset sizes. This linear scaling of price performance across data size and cluster size, both out-of-the-box and tuned, makes sure that Amazon Redshift will scale predictably as your data and workloads grow.

Amazon Redshift TPC-DS benchmark results, November 2020
		Out-of-Box		Tuned
Data set (TB)	Cluster	Runtime (sec)	Price per TB per run	Runtime (sec)	Price per TB per run
3	10 node ra3.4xlarge	1591	$4.80	926	$2.80
30	5 node ra3.16xlarge	8291	$5.01	4198	$2.53
100	10 node ra3.16xlarge	13,343	$4.83	6644	$2.41

You can learn more about Amazon Redshift’s performance on large datasets in How Amazon Redshift powers large-scale analytics for Amazon.com. This AWS re:Invent 2020 session shows how Amazon.com is using Amazon Redshift to keep up with exploding data growth, and how you can upgrade your existing data warehouse workloads to RA3 nodes to get scale and performance at great value.

Up to 10x better query performance with AQUA

We’re investing to make sure Amazon Redshift continues to improve as your data warehouse needs grow. As noted earlier, these benchmark results reflect the latest version of Amazon Redshift as of November, 2020. This version includes more than 15 new features released earlier this year, such as distributed bloom filters, vectorized queries, and automatic WLM, but doesn’t include the benefits from new capabilities announced during AWS re:Invent 2020. You can join What’s new with Amazon Redshift at AWS re:Invent 2020 to learn more about the new capabilities.

These new capabilities include AQUA (Advanced Query Accelerator) for Amazon Redshift. AQUA is a new distributed and hardware-accelerated cache for Amazon Redshift that delivers up to 10x better query performance than other cloud data warehouses. AQUA takes a new approach to cloud data warehousing. AQUA brings the compute to storage by doing a substantial share of data processing in-place on the innovative cache. In addition, it uses AWS-designed processors and a scale-out architecture to accelerate data processing beyond anything traditional CPUs can do today. AQUA’s preview is now open to all customers, and AQUA will be generally available in January 2021. You can learn more about AQUA and other new Amazon Redshift capabilities by joining What’s new with Amazon Redshift at AWS re:Invent 2020.

Price performance continues to improve

Find the best price performance for your workloads

You can reproduce the results above using the data and queries available on GitHub.

Each workload has unique characteristics, so if you’re just getting started, a proof of concept is the best way to understand how Amazon Redshift performs for your requirements. When running your own proof of concept, it’s important that you focus on proper cluster sizing and the right metrics—query throughput (number of queries per hour) and price performance. You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner.

If you’re an existing Amazon Redshift customer, connect with us for a free optimization session and briefing on the new features announced at AWS re:Invent 2020.

To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.

About the Authors

Eugene Kawamoto is a director of product management for Amazon Redshift. Eugene leads the product management and database engineering teams at AWS. He has been with AWS for ~8 years supporting analytics and database services both in Seattle and in Tokyo. In his spare time, he likes running trails in Seattle, loves finding new temples and shrines in Kyoto, and enjoys exploring his travel bucket list.

Stefan Gromoll is a Senior Performance Engineer with Amazon Redshift where he is responsible for measuring and improving Redshift performance. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.

Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML

2020-12-08 Debu Panda

Post Syndicated from Debu Panda original https://aws.amazon.com/blogs/big-data/create-train-and-deploy-machine-learning-models-in-amazon-redshift-using-sql-with-amazon-redshift-ml/

Amazon Redshift is the most popular, fully managed, and petabyte-scale data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to leverage this data to train machine learning (ML) models, which can then be used to generate insights on new data for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. Amazon Redshift ML allows you to use your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service, without requiring you to become experts in ML. This post shows you how you use familiar SQL statements to create and train ML models from data in Amazon Redshift and use these models to make in-database predictions on new data for use cases such as churn prediction and fraud risk scoring.

ML use cases relevant to data warehousing

You may use different ML approaches according to what’s relevant for your business, such as supervised, unsupervised, and reinforcement learning. With this release, Amazon Redshift ML supports supervised learning, which is most commonly used in enterprises for advanced analytics. As evident in the following diagram, supervised learning is preferred when you have a training dataset and an understanding of how specific input data predicts various business outcomes. The inputs used for the ML model are often referred to as features, and the outcomes or results are called targets or labels. Your training dataset is a table or a query whose attributes or columns comprise features, and targets are extracted from your data warehouse. The following diagram illustrates this architecture.

You can use supervised training for advanced analytics use cases ranging from forecasting and personalization to customer churn prediction. Let’s consider a customer churn prediction use case. The columns that describe customer information and usage are features, and the customer status (active vs. inactive) is the target or label.

The following table shows different types of use cases and algorithms used.

Use Case	Algorithm / Problem Type
Customer churn prediction	Classification
Predict if a sales lead will close	Classification
Fraud detection	Classification
Price and revenue prediction	Linear regression
Customer lifetime value prediction	Linear regression
Detect if a customer is going to default a loan	Logistic regression

Current ways to use ML in your data warehouse

You may rely on ML experts to build and train models on your behalf or invest a lot of time learning new tools and technology to do so yourself. For example, you might need to identify the appropriate ML algorithms in SageMaker or use Amazon SageMaker Autopilot for your use case, and then export the data from your data warehouse and prepare the training data to work with these model types.

Data analysts and database developers are familiar with SQL. Unfortunately, you often have to learn a new programming language (such as Python or R) to build, train, and deploy ML models in SageMaker. When the model is deployed and you want to use it with new data for making predictions (also known as inference), you need to repeatedly move the data back and forth between Amazon Redshift and SageMaker through a series of manual and complicated steps:

Export training data to Amazon Simple Storage Service (Amazon S3).
Train the model in SageMaker.
Export prediction input data to Amazon S3.
Use prediction in SageMaker.
Import predicted columns back into the database.

The following diagram illustrates this workflow.

This iterative process is time-consuming and prone to errors, and automating the data movement can take weeks or months of custom coding that then needs to be maintained. Amazon Redshift ML enables you to use ML with your data in Amazon Redshift without this complexity.

Introducing Amazon Redshift ML

To create an ML model, as a data analyst, you can use a simple SQL query to specify the data in Amazon Redshift you want to use as the data inputs to train your model and the output you want to predict. For example, to create a model that predicts customer churn, you can query columns in one or more tables in Amazon Redshift that include the customer profile information and historical account activity as the inputs, and the column showing whether the customer is active or inactive as the output you want to predict.

When you run the SQL command to create the model, Amazon Redshift ML securely exports the specified data from Amazon Redshift to Amazon S3 and calls SageMaker Autopilot to automatically prepare the data, select the appropriate pre-built algorithm, and apply the algorithm for model training. Amazon Redshift ML handles all the interactions between Amazon Redshift, Amazon S3, and SageMaker, abstracting the steps involved in training and compilation. After the model is trained, Amazon Redshift ML makes it available as a SQL function in your Amazon Redshift data warehouse by compiling it via Amazon SageMaker Neo. The following diagram illustrates this solution.

Benefits of Amazon Redshift ML

Amazon Redshift ML provides the following benefits:

Allows you to create and train ML models with simple SQL commands without having to learn external tools
Provides you with flexibility to use automatic algorithm selection
Automatically preprocesses data and creates, trains, and deploys models
Enables advanced users to specify problem type
Enables ML experts such as data scientists to select algorithms such as XGBoost and specify hyperparameters and preprocessors
Enables you to generate predictions using SQL without having to ship data outside your data warehouse
Allows you to pay only for training; prediction is included with the costs of your cluster (typically, ML predictions drive cost in production)

In this post, we look at a simple example that you can use to get started with Amazon Redshift ML.

To train data for a model that predicts customer churn, SageMaker Autopilot preprocesses the training data, finds the algorithm that provides the best accuracy, and applies it to the training data to build a performant model.

We provide step-by-step guidance to create a cluster, create sample schema, load data, create your first ML model in Amazon Redshift, and invoke the prediction function from your queries.

Prerequisites for enabling Amazon Redshift ML

As an Amazon Redshift administrator, the following steps are required to create your Amazon Redshift cluster for using Amazon Redshift ML:

On the Amazon S3 console, create an S3 bucket that Amazon Redshift ML uses for uploading the training data that SageMaker uses to train the model. For this post, we name the bucket redshiftml-<your_account_id>. Ensure that you create your S3 bucket in the same AWS region where you will create your Amazon Redshift cluster.
Create an AWS Identity and Access Management (IAM role) named RedshiftML with the policy that we provided below. While it is easy to get started with AmazonS3FullAccess and AmazonSageMakerFullAccess, we recommend using a minimal policy that we provided below (If you already have an existing IAM role, then just add these to that role):

To use or modify this policy, replace <your-account-id> with your AWS account number. Note that the policy assumes that you have created the IAM role with the name RedshiftML and the S3 bucket with the name redshiftml-<your_account_id>. The S3 bucket redshift-downloads is from where we will load the sample data used in this blog.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetAuthorizationToken",
                "ecr:GetDownloadUrlForLayer",
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:PutLogEvents",
                "sagemaker:*Job*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole",
                "s3:AbortMultipartUpload",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:iam::<your-account-id>:role/Redshift-ML",
                "arn:aws:s3:::redshiftml-<your-account-id>/*",
                "arn:aws:s3:::redshift-downloads/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::redshiftml-<your-account-id>",
                "arn:aws:s3:::redshift-downloads"
            
            ]
        }
    ]
}

For instructions, see Creating IAM roles.

Choose Edit trust relationship

Enter the following trust relationship definition to trust SageMaker:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "redshift.amazonaws.com",
          "sagemaker.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

On the Amazon Redshift console, create a new Amazon Redshift cluster.
Attach the IAM policy that you created earlier (RedshiftML).
Create the cluster with the preview track (SQL_PREVIEW).
You can select the preview by turning off default configuration and choosing the maintenance option.

When your cluster creation is complete and the cluster is up and running, you can create accounts for data analysts on an Amazon Redshift cluster. For this post, we create a user named demouser.

Use the Amazon Redshift query editor or your preferred SQL client to connect to Amazon Redshift as an administrator and run the following command:

create user demouser with password '<yourpassword>';

Grant CREATE MODEL privileges to your users. The following code grants privileges to the demouser user for creating a model:
```
GRANT CREATE MODEL TO demouser;
```

Loading sample data

We use a customer churn model in this post. As an admin or database developer, you have to first create the schema and load data into Amazon Redshift. This dataset is attributed to the University of California Irvine Repository of Machine Learning Datasets. We have modified this data for use with Amazon Redshift ML.

Create a schema named demo_ml that stores the example table and the ML model that we create:
```
CREATE SCHEMA DEMO_ML;
```

In the next steps, we create the sample table and load data into the table that we use to train the ML model.

Create the table in the demo_ml schema:

CREATE TABLE demo_ml.customer_activity (
state varchar(2), 
account_length int, 
area_code int,
phone varchar(8), 
intl_plan varchar(3), 
vMail_plan varchar(3),
vMail_message int, 
day_mins float, 
day_calls int, 
day_charge float,
total_charge float,
eve_mins float, 
eve_calls int, 
eve_charge float, 
night_mins float,
night_calls int, 
night_charge float, 
intl_mins float, 
intl_calls int,
intl_charge float, 
cust_serv_calls int, 
churn varchar(6),
record_date date);

Load the sample data by using the following command. Replace your IAM role and account ID appropriate for your environment.

COPY DEMO_ML.customer_activity 
FROM 's3://redshift-downloads/redshift-ml/customer_activity/' 
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML' delimiter ',' IGNOREHEADER 1  
region 'us-east-1';

The demouser user should also have the usual SELECT access to the tables with the data used for training:
```
GRANT SELECT on demo_ml.customer_activity TO demouser;
```

You need to also grant CREATE and USAGE on the schema to allow users to create models and query using the ML inference functions on the demo_ml schema:
```
GRANT CREATE, USAGE ON SCHEMA demo_ml TO demouser;
```

Now the analyst (demouser) can train a model.

Creating and training your first ML model

Use your favorite SQL client to connect to your Amazon Redshift cluster as the demouser user that your admin created. You have to run the following command to create your model named customer_churn_model:

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend, 
             cust_serv_calls/account_length AS average_daily_cases,
             churn 
      FROM demo_ml.customer_activity
         WHERE record_date < '2020-01-01' 

     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<accountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml-<your-account-id>'
)
;

The SELECT query in the FROM clause specifies the training data. The TARGET clause specifies which column is the label that the CREATE MODEL builds a model to predict. The other columns in the training query are the features (input) used for the prediction. In this example, the training data provides the features state, area code, average daily spend, and average daily cases for the customers that have been active accounts for earlier than January 1, 2020. The target column churn indicates whether the customer still has an active membership or has suspended their membership. For more information about CREATE MODEL syntax, see Amazon Redshift developers guide. After the model is created, you can run queries to make predictions.

Checking the status of your ML model

You can check the status of your models by running the SHOW MODEL command from your SQL prompt.

Enter the SHOW MODEL ALL command to see all the models that you have access to:

SHOW MODEL ALL

`SchemaName`	`ModelName`
`demo_ml`	`customer_churn_model`

Enter the SHOW MODEL command with your model name to see the status for a specific model:

SHOW MODEL demo_ml.customer_churn_model

The following output provides the status of your model:

Key						Value
Model Name				customer_churn_model
Schema Name				demo_ml
Owner					awsuser
Creation Time			"Tue, 24.11.2020 07:02:51"
Model State				READY
validation:			
f1,						0.681240
Estimated Cost			0.990443
TRAINING DATA:,
Query	"SELECT STATE, AREA_CODE, TOTAL_CHARGE/ACCOUNT_LENGTH AS AVERAGE_DAILY_SPEND, CUST_SERV_CALLS/ACCOUNT_LENGTH AS AVERAGE_DAILY_CASES, CHURN"
FROM DEMO_ML.CUSTOMER_ACTIVITY
WHERE ACCOUNT_LENGTH > 120
Target Column,			CHURN

PARAMETERS:,
Model Type					auto
Problem Type				BinaryClassification
Objective					F1
Function Name				predict_customer_churn
Function Parameters,		"state area_code average_daily_spend  
average_daily_cases "
Function Parameter Types 	"varchar int4 float8 int4 "
IAM Role					arn:aws:iam::99999999999:role/RedshiftML
s3 Bucket					redshiftml

Testing

Evaluating your model performance

You can see the F1 value for the example model customer_churn_model in the output of the SHOW MODEL command. The F1 amount signifies the statistical measure of the precision and recall of all the classes in the model. The value ranges between 0–1; the higher the score, the better the accuracy of the model.

You can use the following example SQL query as an illustration to see which predictions are incorrect based on the ground truth:

WITH infer_data AS (
  SELECT area_code ||phone  accountid, churn,
    demo_ml.predict_customer_churn( 
          state,
          area_code, 
          total_charge/account_length , 
          cust_serv_calls/account_length ) AS predicted
  FROM demo_ml.customer_activity
WHERE record_date <  '2020-01-01'

)
SELECT *  FROM infer_data where churn!=predicted;

Invoking your ML model for inference

You can use your SQL function to apply the ML model to your data in queries, reports, and dashboards. For example, you can run the predict_customer_churn SQL function on new customer data in Amazon Redshift regularly to predict customers at risk of churning and feed this information to sales and marketing teams so they can take preemptive actions, such as sending these customers an offer designed to retain them.

For example, you can run the following query to predict which customers in area_code 408 might churn:

SELECT area_code ||phone  accountid, 
       demo_ml.predict_customer_churn( 
          state,
          area_code, 
          total_charge/account_length , 
          cust_serv_calls/account_length )
          AS "predictedActive"
FROM demo_ml.customer_activity
WHERE area_code='408' and record_date > '2020-01-01';

The following output shows the account ID and whether the account is predicted to remain active.

`accountId`	`predictedActive`
`408393-7984`	`False.`
`408357-3817`	`True.`
`408418-6412`	`True.`
`408395-2854`	`True.`
`408372-9976`	`False.`
`408353-3061`	`True.`

Providing privileges to invoke the prediction function

As the model owner, you can grant EXECUTE on the prediction function to business analysts to use the model. The following code grants the EXECUTE privilege to marketing_analyst_grp. The marketing_analyst_grp should have the USAGE granted on the demo_ml schema:

GRANT EXECUTE demo_ml.predict_customer_churn TO marketing_analyst_grp

Cost control

Amazon Redshift ML leverages your existing cluster resources for prediction so you can avoid additional Amazon Redshift charges. There is no additional Amazon Redshift charge for creating or using a model, and prediction happens locally in your Amazon Redshift cluster, so you don’t have to pay extra unless you need to resize your cluster.

The CREATE MODEL request uses SageMaker for model training and Amazon S3 for storage, and incurs additional expense. The cost depends on the number of cells in your training data, where the number of cells is the product of the number of records (in the training query or table) times the number of columns. For example, if the SELECT query of the CREATE MODEL produces 10,000 records for training and each record has five columns, then the number of cells in the training data is 50,000. You can control the training cost by setting the MAX_CELLS. If you don’t, the default value of MAX_CELLS is 1 million.

If the training data produced by the SELECT query of the CREATE MODEL exceeds the MAX_CELLS limit you provided (or the default one million, in case you didn’t provide one) the CREATE MODEL randomly chooses approximately MAX_CELLS divided by number of columns records from the training dataset and trains using these randomly chosen tuples. The random choice ensures that the reduced training dataset doesn’t have any bias. Therefore, by setting the MAX_CELLS, you can keep your cost within your limits. See the following code:

CREATE MODEL demo_ml.customer_churn_model
FROM (SELECT state,
             area_code,
             total_charge/account_length AS average_daily_spend, 
             cust_serv_calls/account_length AS average_daily_cases,
             churn 
      FROM demo_ml.customer_activity
      WHERE account_length > 120 
     )
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<acountID>:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'redshiftml_<your_account_id>',
   MAX_CELLS 10000
)
;

For more information about costs associated with various cell numbers and free trial details, see Amazon Redshift pricing.

An alternate method of cost control is the MAX_RUNTIME parameter, also specified as a CREATE MODEL setting. If the training job in SageMaker exceeds the specified MAX_RUNTIME seconds, the CREATE MODEL ends the job.

The prediction functions run within your Amazon Redshift cluster, and you don’t incur additional expense there.

Customer feedback

“At Rackspace Technology we help companies elevate their AI/ML operations. We’re excited about the new Amazon Redshift ML feature because it will make it easier for our mutual Redshift customers to use ML on their Redshift with a familiar SQL interface. The seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.” – Nihar Gupta, General Manager for Data Solutions, Rackspace Technology

“We have always been looking for a unified platform that will enable both data processing and machine learning model training/scoring. Amazon Redshift has been our preferred data warehouse for processing large volumes of customer transactional data and we are increasingly leveraging Amazon SageMaker for model training and scoring. Until now, we had to move the data back and forth between the two for the ML steps in pipelines, which is quite time consuming and error prone. With the ML feature embedded, Amazon Redshift becomes that unified platform we have been looking for which will significantly simplify our ML pipelines.” – Srinivas Chilukuri, Principal – AI Center of Excellence, ZS Associates

Conclusion

In this post, we briefly discussed ML use cases relevant for data warehousing. We introduced Amazon Redshift ML and outlined how it enables SQL users to create, train, deploy, and use ML with simple SQL commands without learning external tools. We also provided an example of how to get started with Amazon Redshift ML.

Amazon Redshift ML also enables ML experts such as data scientists to quickly create ML models to simplify their pipeline and eliminate the need to export data from Amazon Redshift. We will discuss how data scientists can use Amazon Redshift ML in a future post.

About the Authors

Debu Panda, a senior product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 20 years of experience in the IT world.

Yannis Papakonstantinou is a senior principal scientist at AWS and professor (on leave) of University of California at San Diego whose research on querying nested and semi-structured data, data integration, and the use and maintenance of materialized views has received over 16,500 citations.

Murali Balakrishnan Narayanaswamy is a senior machine learning scientist at AWS and received a PhD from Carnegie Mellon University on the intersection of AI, optimization, learning and inference to combat uncertainty in real-world applications.

Sriram Krishnamurthy is a software development manager for the Amazon Redshift query processing team and has been working on semi-structured data processing and SQL compilation and execution for over 15 years.

Sudipta Sengupta is a senior principal technologist at AWS who leads new initiatives in AI/ML, databases, and analytics and holds a Ph.D. in electrical engineering and computer science from Massachusetts Institute of Technology.

Stefano Stefani is a VP and distinguished engineer at AWS and has served as chief technologist for Amazon DynamoDB, Amazon Redshift, Amazon Aurora, Amazon SageMaker, and other services.

How the Allen Institute uses Amazon EMR and AWS Step Functions to process extremely wide transcriptomic datasets

2020-12-07 Gautham Acharya

Post Syndicated from Gautham Acharya original https://aws.amazon.com/blogs/big-data/how-the-allen-institute-uses-amazon-emr-and-aws-step-functions-to-process-extremely-wide-transcriptomic-datasets/

This is a guest post by Gautham Acharya, Software Engineer III at the Allen Institute for Brain Science, in partnership with AWS Data Lab Solutions Architect Ranjit Rajan, and AWS Sr. Enterprise Account Executive Arif Khan.

The human brain is one of the most complex structures in the universe. Billions of neurons and trillions of connections come together to form a labyrinthine network of activity. Understanding the mechanisms that guide our minds is one of the most challenging problems in modern scientific research.

The Allen Institute for Brain Science is dedicated to solving large-scale, fundamental problems in neuroscience. Our mission is to accelerate the rate at which the world understands the inner workings of the human brain and to uncover the essence of what makes us human.

Processing extremely wide datasets

As a part of “big science,” one of our core principles, we seek to tackle scientific challenges at scales no one else has attempted before. One of these challenges is processing large-scale transcriptomic datasets. Transcriptomics is the study of RNA. In particular, we’re interested in the genes that are expressed in individual neurons. The human brain contains almost 100 billion neurons—how do they differ from each other, and what genes do they express? After a series of complex analysis using cutting-edge techniques such as Smart-Seq and 10x Genomics Chromium Sequencing, we produce extremely large matrices of numeric values.

Such matrices are called feature matrices. Each column represents the feature of a cell, which in this case are genes. A genome is over 50,000 genes, so a single matrix can have over 50,000 columns! We expect the number of rows in our matrices to increase over time, reaching tens of millions, if not more. These matrices can reach 500 GB or more in size. Over the next few years, we want to be able to ingest tens or hundreds of such matrices.

Our goal is to provide low-latency visualizations on such matrices, allowing researchers to aggregate, slice, and dissect our data in real time. To do this, we run a series of precomputations that store expensive calculations in a database for future retrieval.

We wanted to create a flexible, scalable pipeline to run computations on these matrices and store the results for visualizations.

The pipeline

We wanted to build a pipeline that takes these large matrices as inputs, runs various Spark jobs, and stores the outputs in an Apache HBase cluster. We wanted to create something flexible so that we could easily add additional Spark transformations.

We decided on AWS Step Functions as our workflow-orchestration tool of choice. Step Functions allows us to create a state machine that orchestrates the dataflow from payload submission to database loading.

After close collaboration with the engineers at the AWS Data Lab, we came up with the following pipeline architecture.

At a high level, our pipeline has the following workflow:

Trigger a state machine from an upload event to an Amazon Simple Storage Service (Amazon S3) bucket.
Copy and unzip the input ZIP file containing a feature matrix into an Amazon S3 working directory.
Run Spark jobs on Amazon EMR to transform input feature matrices into various pre-computed datasets. Store all intermittent results in a working directory on Amazon S3 and output the results of the Spark Jobs as HFile s.
Bulk load the results of our Spark jobs into Apache HBase.

The preceding architecture diagram is deceptively simple. We found a number of challenges during our initial implementation, which we discuss in the following sections.

Lack of transaction support and rollbacks across tables in Apache HBase

The results of our Spark jobs are a number of precomputed views of our original input dataset. Each view is stored as a separate table in Apache HBase. A major drawback of Apache HBase is the lack of a native transactional system. HBase only provides row-level atomicity. Our worst-case scenario is writing partial data—cases where some views are updated, but not others, showing different results for different visualizations and resulting in scientifically incorrect data!

We worked around this by rolling our own blue/green system on top of Apache HBase. We suffix each set of tables related to a dataset with a universally unique identifier (UUID). We use Amazon DynamoDB to track the UUID associated with each individual dataset. When an update to a dataset is being written, the UUID is not switched in DynamoDB until we verify that all the new tables have been successfully written to Apache HBase. We have an API on top of HBase to facilitate reads. This API checks DynamoDB for the dataset UUID before querying HBase, so user traffic is never redirected toward a new view until we confirm a successful write. Our API involves an AWS Lambda function using HappyBase to connect to our HBase cluster, wrapped in an Amazon API Gateway layer to provide a REST interface. The following diagram illustrates this architecture.

The read path has the following steps:

R1 – API Gateway invokes a Lambda function to fetch data from a dataset
R2 – The Lambda function requests and receives the dataset UUID from DynamoDB
R3 – Lambda queries the Apache HBase cluster with the UUID

The write path has the following steps:

W1 – The state machine bulk loads new dataset tables to the Apache HBase cluster suffixed with the new UUID
W2 – After validation, the state machine updates DynamoDB so user traffic is directed towards those changes

Stalled Spark jobs on extremely wide datasets

Although Apache Spark is a fantastic engine for running distributed compute operations, it doesn’t do too well when scaling to extremely wide datasets. We routinely operate on data that surpasses 50,000 columns, which often causes issues such as a stalled JavaToPython step in our PySpark job. Although we have more investigating to do to figure out why our Spark jobs hang on these wide datasets, we found a simple workaround in the short term—batching!

A number of our jobs involve computing simple columnar aggregations on our data. This means that each calculation on a column is completely independent of all the other columns. This lends itself quite well to batching our compute. We can break our input columns into chunks and run our compute on each chunk.

The following code chunks Apache Spark aggregation functions into groups of columns:

def get_aggregation_for_matrix_and_metadata(matrix, metadata, group_by_arg, agg_func, cols_per_write):
   '''
   Performs an aggregation on the joined matrix, aggregating the desired column by the given function.
   agg_func must be a valid Pandas UDF function. Runs in batches so we don't overload the Task Scheduler with 50,000
   columns at once.
   '''
   # Chunk the data
   for col_group in pyspark_utilities.chunks(matrix.columns, cols_per_write):

       # Add the row key to the column group
       col_group.append(matrix.columns[0])

       selected_matrix = matrix.select(pyspark_utilities.escape_column_list(col_group))

       # create argument list for group by and then process
       cast_as_udf = pyspark_functions.pandas_udf(
                       agg_func,
                       pyspark_datatype.FloatType(),
                       pyspark_functions.PandasUDFType.GROUPED_AGG)

       udf_input = [cast_as_udf(selected_matrix [column_name]).alias(column_name)
                    for column_name in selected_matrix .columns
                    if column_name != group_by_arg]

       yield joined.groupby(group_by_arg).agg(*udf_input)

We then write the results of each batch to an HFile, which is then later bulk loaded into HBase.

Because the post-aggregation DataFrame was very small, we found a significant performance increase in coalescing the DataFrame post-aggregation and checkpointing the results before writing the HFiles. This forces Spark to compute the aggregation before writing the HFiles. HFiles need to be sorted by row key, so it’s easier to pass a smaller DataFrame to our HFile converter.

Using Apache Spark to write DataFrames as HFiles

Apache Spark supports writing DataFrames in multiple formats, including as HFiles. However, the documentation for doing so leaves a lot to be desired. To write out our Spark DataFrames as HFiles, we had to take the following steps:

Convert a DataFrame into a HFile-compatible format, assuming that the first column is the HBase rowkey—(row_key, column_family, col, value).
Create a JAR file containing a converter to convert input Python Objects into Java key-value byte classes. This step took a lot of trial and error—we couldn’t find clear documentation on how the Python object was serialized and passed into the Java function.
Call the saveAsNewAPIHadoopFile function, passing in the relevant information: the ZooKeeper Quorum IP, port, and cluster DNS of our Apache HBase on the Amazon EMR cluster; the HBase table name; the class name of our Java converter function; and more.

The following code writes HFiles:

import src.spark_transforms.pyspark_jobs.pyspark_utilities as pyspark_utilities
import src.spark_transforms.pyspark_jobs.output_handler.emr_constants as constants


def csv_to_key_value(row, sorted_cols, column_family):
   '''
   This method is an RDD mapping function that will map each
   row in an RDD to an hfile-formatted tuple for hfile creation
   (rowkey, (rowkey, columnFamily, columnQualifier, value))
   '''
   result = []
   for index, col in enumerate(sorted_cols[constants.ROW_KEY_INDEX + 1:], 1):
       row_key = str(row[constants.ROW_KEY_INDEX])
       value = row[index]

       if value is None:
           raise ValueError(f'Null value found at {row_key}, {col}')

       # We store sparse representations, dropping all zeroes.
       if value != 0:
           result.append((row_key, (row_key, column_family, col, value)))

   return tuple(result)


def get_sorted_df_by_cols(df):
   '''
   Sorts the matrix by column. Retains the row key as the initial column.
   '''
   cols = [df.columns[0]] + sorted(df.columns[1:])
   escaped_cols = pyspark_utilities.escape_column_list(cols)
   return df.select(escaped_cols)


def flat_map_to_hfile_format(df, column_family):
   '''
   Flat maps the matrix DataFrame into an RDD formatted for conversion into HFiles.
   '''
   sorted_df = get_sorted_df_by_cols(df)
   columns = sorted_df.columns
   return sorted_df.rdd.flatMap(lambda row: csv_to_key_value(row, columns, column_family)).sortByKey(True)


def write_hfiles(df, output_path, zookeeper_quorum_ip, table_name, column_family):
   '''
   This method will sort and map the medians psyspark dataFrame and
   then write to hfiles in the output directory using the supplied
   hbase configuration.
   '''
   # sort columns other than the row key (first column)

   rdd = flat_map_to_hfile_format(df, column_family)

   conf = {
           constants.HBASE_ZOOKEEPER_QUORUM: zookeeper_quorum_ip,
           constants.HBASE_ZOOKEEPER_CLIENTPORT: constants.ZOOKEEPER_CLIENTPORT,
           constants.ZOOKEEPER_ZNODE_PARENT: constants.ZOOKEEPER_PARENT,
           constants.HBASE_TABLE_NAME: table_name
           }

   rdd.saveAsNewAPIHadoopFile(output_path,
                              constants.OUTPUT_FORMAT_CLASS,
                              keyClass=constants.KEY_CLASS,
                              valueClass=constants.VALUE_CLASS,
                              keyConverter=constants.KEY_CONVERTER,
                              valueConverter=constants.VALUE_CONVERTER,
                              conf=conf)

The following code describes all the constants configuration:

HBASE_ZOOKEEPER_QUORUM="hbase.zookeeper.quorum"
HBASE_ZOOKEEPER_CLIENTPORT="hbase.zookeeper.property.clientPort"
ZOOKEEPER_ZNODE_PARENT="zookeeper.znode.parent"
HBASE_TABLE_NAME="hbase.mapreduce.hfileoutputformat.table.name"

OUTPUT_FORMAT_CLASS='org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2'
KEY_CLASS='org.apache.hadoop.hbase.io.ImmutableBytesWritable'
VALUE_CLASS='org.apache.hadoop.hbase.KeyValue'
KEY_CONVERTER="org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
VALUE_CONVERTER="KeyValueConverter"

ZOOKEEPER_CLIENTPORT='2181'
ZOOKEEPER_PARENT='/hbase'

ROW_KEY_INDEX = 0

The following code is a Java class to serialize input PySpark RDDs:

import org.apache.spark.api.python.Converter;
import org.apache.hadoop.hbase.KeyValue;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/**
* This class is used to convert a tuple
* supplied by a spark job created in Python
* to the corresponding hbase keyValue type
* which is needed for hfile creation.
*
*/
@SuppressWarnings("rawtypes")
public class KeyValueConverter implements Converter {

  private static final long serialVersionUID = 1L;

  /**
   * this method will take a tuple object supplied
   * by Python spark job and convert and
   * return the corresponding hbase KeyValue object.
   */
  public Object convert(Object obj) {
     KeyValue cell;
     List<?> list = new ArrayList<>();
     if (obj.getClass().isArray()) {
          list = Arrays.asList((Object[])obj);
      } else if (obj instanceof Collection) {
          list = new ArrayList<>((Collection<?>)obj);
      }

     cell = new KeyValue(
           list.get(0).toString().getBytes(),
           list.get(1).toString().getBytes(),
           list.get(2).toString().getBytes(),
           list.get(3).toString().getBytes());

     return cell;
  }
}

Looking ahead

Our computation pipeline was a success, and you can see the resulting visualizations on https://transcriptomics.brain-map.org/.

We’ve been thrilled with AWS’s reliable and feature-rich ecosystem. We used Amazon EMR, Step Functions, and Amazon S3 to build a robust, large-scale data processing pipeline.

Since writing this post, we’ve done much more, including a cross-database transaction system, wide-matrix transposes in Spark, and more. Big Data problems in neuroscience never end, and we’re excited to share more with you in the future!

About the Authors

Gautham Acharya is a Software Engineer at the Allen Institute for Brain Science. He works on the backend data platform team responsible for integrating multimodal neuroscience data into a single cohesive system.

Ranjit Rajan is a Data Lab Solutions Architect with AWS. Ranjit works with AWS customers to help them design and build data and analytics applications in the cloud.

Arif Khan is a Senior Account Executive with Amazon Web Services. He works with nonprofit research customers to help shape and deliver on a strategy that focuses on customer success, building mind share and driving broad use of Amazon’s utility computing services to support their mission.

Transform data and create dashboards simply using AWS Glue DataBrew and Amazon QuickSight

2020-12-04 Prithiviraj Jothikumar

Post Syndicated from Prithiviraj Jothikumar original https://aws.amazon.com/blogs/big-data/transform-data-and-create-dashboards-simply-using-aws-glue-databrew-and-amazon-quicksight/

Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. The range and complexity of data transformation steps required depends on the visuals you would like in your dashboard. Often, the data transformation process is time-consuming and highly iterative, especially when you are working with large datasets.

In this post, we show you how to apply data transformations on COVID-related tweets (with sentiment scores) using AWS Glue DataBrew. AWS Glue DataBrew’s visual and interactive interface allows us to apply data transformations without any coding. Some examples of transformations we apply are: changing date formats, transformation of text strings and performing table pivots. We then use Amazon QuickSight to visualize the transformed data in a dashboard.

See the AWS architecture diagram below for an overview of the complete and serverless data pipeline.

These are seven main steps in the data pipeline:

Upload CSV file with tweets to S3
Trigger AWS Lambda to partition the dataset in case the number of rows exceeds a threshold, and store the output in S3. A threshold is placed to prevent a memory exception failure to compute VADER sentiment score.
Use Amazon SQS and AWS Lambda to compute VADER sentiment scores for tweets and store the VADER output as parquet files in S3
Crawl the VADER output and produce a Data Catalog using AWS Glue, so that the data can be queried easily using Amazon Athena
Use the Data Catalog connector in AWS Glue DataBrew to create a recipe and perform transformations
Setup a Recipe job to perform the transformation and send the output to an S3 location; repeat step 5 to produce a data catalog of the transformed data
Query and submit the data to Amazon QuickSight to create visuals for the dashboard

Each section below corresponds to one (or more) of the seven steps described above.

Steps 1 to 3: Setting and Deploying AWS Lambda pipeline

To test the solution we can use the AWS CloudFormation template found here. The AWS CloudFormation template automatically creates the following for you: the S3 Bucket to store the CSV and parquet files, the Lambda function to partition the files, the SQS queue and the VADER Lambda function.

Deploying the CloudFormation Template

Create an S3 bucket with the name of your choice and note the bucket name. Please refer to this guide for detailed steps on how to create an S3 bucket.
Upload all four individual zip files to the S3 bucket created in step 1. To do so, download the zip file from here, make sure to unzip the downloaded file and place each of the four individual zip files into the S3 bucket created in step 1.
Open the AWS Management Console in the AWS Region you want to deploy the solution to, and on the Services menu, choose CloudFormation.
Choose Create Stack, choose Upload a template to Amazon S3, and then choose the file databrew-cloudformation.yaml included in the solution that you downloaded earlier and if not, click on the link provided on this step to download.
Set Stack name to databrew-stack. Specify the Amazon S3 bucket that contains the compressed version of AWS Lambda function code and layers uploaded in step 2.
For Options, you can specify tags for your stack and an optional IAM role to be used by AWS CloudFormation to create resources. If the role isn’t specified, a new role is created. You can also perform additional configuration for rollback settings and notification options.
The review section shows a recap of the information. Be sure to select the three AWS CloudFormation acknowledgements to allow AWS CloudFormation to create resources with custom names on your behalf. Also, create a change set, because the AWS CloudFormation template includes the AWS::Serverless-2016-10-31
Click Execute.
The Outputs for the stack lists all the resources created.

Downloading the dataset

Download the dataset from here.

Testing the solution

Navigate to the S3 console.
Click on the bucket name (created by CloudFormation template and listed in the Outputs section of the stack).
Create folder named input in the S3 bucket.
Upload the downloaded dataset to the folder created in step 3.
The above will trigger a lambda function to chunk the files into smaller files and proceed to another lambda function triggered by an SQS event to perform the following preprocessing steps: removal of https links, VADER sentiment scores of the 4 categories (compound, positive, negative, and neutral), and saving the file as a parquet file in S3 the path called processedv4.

Step 4: Setup AWS Glue Data Catalog

An AWS Glue Data Catalog will allows us to easily import data into AWS Glue DataBrew. Follow these steps to create a Glue crawler that crawls the the raw data with VADER output in partitioned parquet files in S3 and determines the schema:

Choose a crawler name.
Use the default options for Crawler source type.
Provide the S3 location of the parquet files.
Choose/Create an IAM role that has read/write permissions to S3 and the AWSGlueServiceRole policy attached.
Set the frequency as Run on demand.
Choose a database name for the crawler’s output.
Leave all remaining options as default.

Once the crawler has been created, select and run the new crawler. Upon completion, the table schema is generated and visible within Tables under the Databases section in the AWS Glue console.

Step 5: Creating a DataBrew project

To get started with AWS Glue DataBrew, complete the following steps:

On the DataBrew console, choose Projects.
Choose Create project.
For Project Name, enter covid19-tweets.
For Attached recipe, choose Create new recipe.
For Recipe name, enter covid19-tweets-recipe.
For Select a dataset, select New dataset.
For Dataset name, enter covid19-tweets-data.
Import the AWS Glue table <YOUR-GLUE-TABLE-NAME> from the AWS Glue database <YOUR-GLUE-DATABASE-NAME>.
Create a new AWS Identity and Access Management (IAM) policy and IAM role by following the steps on the AWS Glue DataBrew console, which provides DataBrew the necessary permissions to access Amazon S3, Amazon Athena and AWS Glue.
Select Create Project.

Step 6: Data transformations, creating a AWS Glue DataBrew recipe and recipe job

Exploratory data analysis (EDA) is an essential component of the data transformation workflow. EDA allows us to gain an intuitive understanding of the dataset by summarizing its main characteristics such as the distribution of data across columns, the corresponding data types and summary statistics. For more information on EDA using AWS Glue DataBrew, see the Exploratory Data Analysis section of the post Data preprocessing for machine learning on Amazon EMR made easy with AWS Glue DataBrew.

The set of data transformation steps listed below are based on the EDA and the attributes we would like to visualize in the Amazon QuickSight dashboard.

Delete the following columns user_description, user_created, user_verified, is_retweet, user_followers, user_friends, user_favourites and use_name, because the aforementioned columns are not dependent for Amazon QuickSight dashboard.

Based on the schema view of the dataset, we can observe that the date column is represented as a string. AWS Glue DataBrew provides various transformations date-time formats to convert date types.

Select from the dropdown select Date functions, then DATEFORMAT.
1. Create a new column based on the values in the source column.
2. For source column, select date.
3. For date format, select mm/dd/yy*HH:MM.
Delete the date column and rename the date_DATEFORMAT column to date.

A visual representation of the positive, negative, neutral, and compound scores generated by the VADER AWS Lambda function is helpful to understand the sentiments of the tweets over time. Individual scores could have large variation between rows leading to large fluctuations on the line graph. We can use the rolling window transformation over observations in a column, to create a smoother, less fluctuating line graph. To apply the rolling window function, create a less granular version of the date column that is truncated of the hours and minute.

Select From the dropdown select Date functions, then DATEFORMAT.
1. Create a new column based on the values in the source column
2. For source column, select date
3. For date format, select mm/dd/yy to transform the observations at a date-level
4. For destination column, enter date_level.
Select From the dropdown select Window functions, then ROLLING AVERAGE.
1. For source column, select compound
2. Select 3 rows before and after. This is the size of the rolling window that slides over the column and impacts the smoothness of the average score from one row to another; larger window sizes produce more smoother rolling average scores and vice-versa.
3. Order the observations by the date column because it’s granularity is at the hour- and minute-level
4. Group the observations by the date_level column, which is only represented in date without the time
5. For the destination column name, enter rolling_compound.
Repeat step 5 above for each of the source columns pos, neu and neg to create rolling_pos, rolling_neu and rolling_neg.

To support downstream visuals in Amazon QuickSight, apply the unpivot transformation to convert the selected columns into row values.

Select Pivot transformation
1. For pivot type, select Unpivot: Columns to rows
2. In the unpivot columns dropdown, select the columns rolling_compound, rolling_pos, rolling_neu and rolling_neg
3. For column name, enter Rolling_category and for column value enter VADER_score

The hashtags column contains an array of hashtags. To visualize the hashtags as a word cloud in Amazon QuickSight, create a new row for each hashtag from the array by applying the unnest transformation.

Select unnest
1. For source column, select hashtags
2. For the destination column name, enter unnest_hashtags

Normalize the data in the unnest_hashtags column

Format the data in unnest_hashtags column to lowercase
Remove all special characters by selecting the clean transformation

The data preparation phase is now complete, and the set of 20 transformations that consist of date formatting, rolling window functions and normalized columns are combined into a recipe.

The transformations above were applied on a sample of the first 500 rows of the dataset. AWS Glue DataBrew recipe job provides the ability to scale the set of transformation steps from a sample of data to the entire dataset.

To create a recipe job, complete the following steps:

On the AWS Glue DataBrew console, choose Jobs.
Choose Create recipe job.
For Job name, enter a name.
Create a new folder in Amazon S3 for the recipe job output, select the file type as GLUEPARQUET and compression as Snappy.

If your dataset is updated on a regular basis, AWS Glue DataBrew provides an option to schedule jobs.

To query the newly transformed data from S3 into Amazon QuickSight, create another new crawler/table in AWS Glue similar to steps provided earlier (refer to the following section: Step 4: Setup an AWS Glue Data Catalog).

Use AWS QuickSight to visualize transformed data

Before proceeding, make sure your Amazon QuickSight account has IAM permissions to access Amazon Athena and S3. Add a new dataset by clicking on Datasets from the left panel and click on New dataset. Select Athena from the list of data sources and provide a name for the Data source name and on the next section. When prompted, select the database and table that contains the table post AWS Glue DataBrew transformation and click Use custom SQL followed by changing the New custom SQL name to TableOrderedData and paste the following SQL query before selecting Edit/Preview data:

SELECT * FROM "<db>"."<table_name>" t ORDER BY t.date;

Replace <db> with your database and <table_name> with your table name. Leave the quotes in place.

Click Apply to make some further modifications.

For the date column, change the data type from Stringto Date and provide the format the date as it is presented in the column (i.e. MM/dd/yy HH:mm). Similarly, change date_level column into Date
Click Save and visualize to approach the next step, which is to analyze the data. In the Analyses section, select Horizontal bar chart under Visual types followed by clicking vader_score and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate.
For this visual, select Actions and select Filter same-sheet visuals under Quick create.
Create another panel by selecting Add to create a line graph that allows rolling score. Select Line chart under Visual types and select the following in the presented order: vader_score, date, and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate. Likewise, change date (Day) to date (Hour) as the Aggregate. Similar to the previous panel, create a quick action item.
Further panels can also be created, such as word cloud of hashtags or a pivot table consisting of the different VADER categories with score and date. The following image is based on selecting the rolling_pos bar on the first quadrant, which filters and cascades to the rest of the panels to that of the rolling_pos filter.
First quadrant depicts the overall average of each category of the entirety of the dataset. The second quadrant depicts the rolling average that is aggregated by the hour. The third quadrant is table representative that is depicted as the rolling average that is aggregated by the day. The word cloud panel is from the hashtags column.

Summary

In this post, we showed you how to interactively analyze and visualize a dataset using AWS Lambda, AWS Glue Databrew and Amazon QuickSight. We began with a COVID19 tweets dataset and computed their sentiment scores using the VADER algorithm. We then cleaned and prepared the tweets and their sentiment scores in AWS Glue DataBrew, and finally visualized key characteristics of the dataset in Amazon QuickSight.

We encourage you to extend this pipeline to your data analytics + visualization use case – there are many more pre-built transformations in AWS Glue DataBrew and pre-built visuals in Amazon QuickSight to explore.

Happy Building!

About the Authors

Prithiviraj Jothikumar, PhD, is a Data Scientist with AWS Professional Services, where he helps customers build solutions using machine learning. He enjoys watching movies and sports and spending time to meditate.

Kartik Kannapur is a Data Scientist with AWS Professional Services. He holds a Master’s degree in Applied Mathematics and Statistics from Stony Brook University and focuses on using machine learning to solve customer business problems.

Bala Krishnamoorthy is a Data Scientist with AWS Professional Services, where he helps customers solve problems and run machine learning workloads on AWS. He has worked with customers across diverse industries, including software, finance, and healthcare. In his free time, he enjoys spending time outdoors, running with his dog, beating his family and friends at board games and keeping up with the stock market.

Paritosh Walvekar is a Cloud Application Architect with AWS Professional Services, where he helps customer build cloud native applications. He has a Master’s degree in Computer Science from University at Buffalo. In his free time, he enjoys watching movies and is learning to play the piano.

Amazon EMR now provides up to 30% lower cost and up to 15% improved performance for Spark workloads on Graviton2-based instances

2020-12-04 Peter Gvozdjak

Post Syndicated from Peter Gvozdjak original https://aws.amazon.com/blogs/big-data/amazon-emr-now-provides-up-to-30-lower-cost-and-up-to-15-improved-performance-for-spark-workloads-on-graviton2-based-instances/

Amazon EMR now supports M6g, C6g and R6g instances with Amazon EMR versions 6.1.0, 5.31.0 and later. These instances are powered by AWS Graviton2 processors that are custom designed by AWS using 64-bit Arm Neoverse cores to deliver the best price performance for cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2). On Graviton2 instances, Amazon EMR runtime for Apache Spark provides up to 15% improved performance at up to 30% lower costs relative to equivalent previous generation instances. In our TPC-DS 3 TB benchmark tests, we found that queries run up to 32 times faster using Amazon EMR runtime for Apache Spark. For more information, see Amazon EMR introduces EMR runtime for Apache Spark.

You can use Apache Spark for a wide array of analytics use cases ranging from large-scale transformations to streaming, data science, and machine learning. Amazon EMR provides the latest, stable, open-source innovations, performant storage with Amazon S3, and the unique cost savings capabilities of Spot Instances and Managed Scaling.

Amazon EMR runtime for Apache Spark is a performance-optimized runtime environment for Apache Spark, available and turned on by default on Amazon EMR release 5.28.0 and later. Amazon EMR runtime for Apache Spark provides 100% API compatibility with open-source Apache Spark. This means that your Apache Spark workloads run faster and incur lower compute costs when run on Amazon EMR without requiring any changes to your application.

In this post, we discuss the results that we observed by running Spark workloads on Graviton2 instances.

AWS Graviton2 and Amazon EMR runtime performance improvements

To measure improvements, we ran TPC-DS 3 TB benchmark queries on Amazon EMR 5.30.1 using Amazon EMR runtime for Apache Spark (compatible with Apache Spark version 2.4) with 5-10 node clusters of M6g instances with data in Amazon Simple Storage Service (Amazon S3) and compared it to equivalent configuration using M5 instances. We measured performance improvements using total query execution time and geometric mean of query execution time across the 104 TPC-DS 3 TB benchmark queries.

The results showed improved performance on M6g instance EMR clusters compared to equivalent M5 instance EMR clusters of between 11.61–15.61% improvement in total query runtime for different sizes within the instance family, and between 10.52–12.91% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour and multiplied it by the total query runtime. We observed between 21.58–30.58% reduced instance hour cost on M6g instance EMR clusters compared equivalent M5 instance EMR clusters to execute the 104 TPC-DS benchmark queries.

The following table shows results from running TPC-DS 3 TB benchmark queries using Amazon EMR 5.30.1 over equivalent M5 and M6g instance EMR clusters.

Instance Size	16 XL	12 XL	8 XL	4 XL	2 XL
Number of core instances in EMR cluster	5	5	5	5	10
Total query runtime on M5 (seconds)	6157	6167	6857	10593	10676
Total query runtime on M6g (seconds)	5196	5389	6061	9313	9240
Total query execution time improvement with M6g	15.61%	12.63%	11.61%	12.08%	13.45%
Geometric mean query execution time on M5 (sec)	33	34	35	47	47
Geometric mean query execution time on M6g (sec)	29	30	32	41	42
Geometric mean query execution time improvement with M6g	12.73%	10.79%	10.52%	12.91%	11.24%
EC2 M5 instance price ($ per hour)	$3.072	$2.304	$1.536	0.768	0.384
EMR M5 instance price ($ per hour)	$0.27	$0.27	$0.27	0.192	0.096
(EC2 + EMR) M5 instance price ($ per hour)	$3.342	$2.574	$1.806	$0.960	$0.480
Cost of running on M5 ($ per instance)	$5.72	$4.41	$3.44	$2.82	$1.42
EC2 M6g instance price ($ per hour)	$2.464	$1.848	$1.232	$0.616	$0.308
EMR M6g price ($ per hour per instance)	$0.616	$0.462	$0.308	$0.15	$0.08
(EC2 + EMR) M6g instance price ($ per hour)	$3.080	$2.310	$1.540	$0.770	$0.385
Cost of running on M6g ($ per instance)	$4.45	$3.46	$2.59	$1.99	$0.99
Total cost reduction with M6g including performance improvement	-22.22%	-21.58%	-24.63%	-29.48%	-30.58%

The following graph shows per query improvements observed on M6g 2XL instances with EMR Runtime for Spark on Amazon EMR version 5.30.1 compared to equivalent M5 2XL instances for the 104 queries in the TPC-DS 3 TB benchmark. Performance of 100 of 104 TPC-DS queries improved with M6g 2XL, and performance for 4 queries regressed (q41, q20, q42, and q52 – with the maximum regression of -20.99%). If you are evaluating migrating from M5 instances to M6g instances for EMR Spark workloads, we recommend that you test your workloads to check if any of your queries encounter slower performance.

R6g instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.27-21.50% improvement in total query runtime for 5 different instance sizes within the instance family, and between 12.48-18.95% improvement in geometric mean. On cost comparison, we observed 23.26%-31.66% reduced instance hour cost on R6g instance EMR clusters compared to R5 EMR instance clusters to execute the 104 TPC-DS benchmark queries. We observed 4 benchmark queries (q6, q21, q41 and q26 – with a maximum regression of -18.28%) taking longer to execute on R6g instance clusters compared to R5 instance clusters.

With C6g instances, we observed improved performance compared to C5 instances for Spark workloads on 2XL, 4XL, 12XL and 16XL instance sizes. Query execution performance regressed on 8XL instances by -0.38%. We observed between 16.84-24.15% lower instance hour costs on C6g instance EMR clusters compared equivalent C5 EMR instance clusters to execute the 104 TPC-DS benchmark queries. Performance of 73 of 104 TPC-DS queries improved with C6g 4XL, and performance for 31 queries regressed (with a maximum regression of -31.38% for q78).

Summary

By using Amazon EMR with M6g, C6g and R6g instances powered by Graviton2 processors, we observed improved performance and reduced cost of running 104 TPC-DS benchmark queries. To keep up to date, subscribe to the Big Data blog’s RSS feed to learn about more Apache Spark optimizations, configuration best practices, and tuning advice.

About the Authors

Peter Gvozdjak is a senior engineering manager for EMR at Amazon Web Services.

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Building an ad-to-order conversion engine with Amazon Kinesis, AWS Glue, and Amazon QuickSight

2020-12-04 Gandhi Raketla

Post Syndicated from Gandhi Raketla original https://aws.amazon.com/blogs/big-data/building-an-ad-to-order-conversion-engine-with-aws-glue-amazon-kinesis-data-streams-and-amazon-quicksight/

Businesses in ecommerce have the challenge of measuring their ad-to-order conversion ratio for ads or promotional campaigns displayed on a webpage. Tracking the number of users that clicked on a particular promotional ad and the number of users who actually added items to their cart or placed an order helps measure the ad’s effectiveness. Utilizing promotional ads that have higher conversion rates enables you to effectively utilize limited space on your ecommerce websites and applications.

This post demonstrates how to sessionize and aggregate clickstream and order data, compute the conversion ratio in real time, and generate data visualizations. We use Amazon Kinesis Data Streams to ingest and send data to Amazon Simple Storage Service (Amazon S3), and AWS Glue, Amazon Athena, and Amazon QuickSight to catalog, analyze, and visualize the data, respectively.

Solution overview

To measure ad-to-order conversion, you need two important pieces of data: user clicks and orders. Clickstream data is captured as users navigate through the site, each time users click on the webpage, and the metadata associated with those clicks. Depending on the user base and number of active users at any moment, clickstream data can be a large amount of data generated per second. Typically, every ecommerce system has a centralized order management system that captures orders created from different channels like a web portal or mobile app. To compute an ad-to-order conversion rate, you join clickstream data and order data over time: (total number of orders/total number of clicks) *100.

The following diagram illustrates the architecture of our solution.

The solution has six main categories.

Data generators – Clickstream and order data is generated with the help of an AWS Lambda function. The function is triggered by a scheduled Amazon CloudWatch Events event every minute and generates random clicks for ingestion into a Kinesis data stream. Similarly, another function triggered by a CloudWatch event generates random orders for ingestion into a second data stream. In a production environment, this data comes from clickstream generators and a centralized order management system.
Data ingestion – Kinesis data streams ingest clickstream and order data as they are generated.
Data sessionization – Data sessionization helps group related data. For clickstream data, we can group clicks on an ad by different users or time periods. For order data, we can group orders by different ads. We use Amazon Kinesis Data Analytics for SQL to analyze streaming data in real time with standard SQL. Sessionized clickstream and order data is ingested into another in-application stream.
Data processing and storage – The sessionization stream from Kinesis Data Analytics for SQL is ingested into an Amazon Kinesis Data Firehose delivery stream, which delivers the data to a pre-configured S3 bucket.
Data Catalog – You use AWS Glue to crawl the clickstream and orders data in their respective S3 buckets, as well as build metadata definitions and tables in Athena. AWS Glue crawlers run every hour to update table definitions, and Athena views are built to compute the ad-to-order conversion.
Data visualization – You use QuickSight to generate visualizations.

Prerequisites

Before getting started, you must provision your resources with AWS CloudFormation.

Choose Launch Stack.

Choose Next.
For Stack name, enter a name for the stack.
For Bucket Name for Clicks, enter the name of the S3 bucket that holds clickstream data (for this post, click-stream).
For Bucket Name for Orders, enter the name of the S3 bucket that holds order data (order-stream).
Enter any tags you wish to assign to the stack.
Choose Next.
Verify that the stack has been created successfully.

If you have never used QuickSight in this account before, sign up for QuickSight before moving on to the next step. Keep in mind that admin access to the Enterprise Edition QuickSight instance is needed to complete setup.

Generating and ingesting clickstream data

On the Lambda console, view your function ingest-clickstream for ingesting clickstream data. The clickstream data attributes include UserId, Device, Event, EventType, and Timestamp. The event contains promotional ad information on the webpage clicked by the user. This function generates random clickstreams and ingests it into the data stream ClickStream. The following screenshot shows your function details on the console.

A CloudWatch Events rule invokes this function every minute. The following screenshot shows sample data that was ingested into the data stream. The Event column represents the portion of the webpage the user clicked; every click on the webpage has a unique ID and type assigned (for example, P601 has the event type Promotion, C301 has the event type Checkout).

Generating and ingesting order data

On the AWS Lambda console, view your function ingest-order for ingesting order data. This function ingests random orders.

Each order has order lines, which contain the attributes ItemId, Promotion, UnitPrice, and Quantity (see the following screenshot). The promotion attribute indicates the ad the user clicked before adding the item to their shopping cart. This function generates random orders and ingests it into OrderStream. The Promotion attribute joins clickstream data and order data.

Sessionizing the data

To sessionize the data, complete the following steps:

On the Kinesis Data Analytics console, select <Stack Name>-ClickStreamApplication.
Choose Run.
Repeat the same step for <Stack Name>-OrderAnalysisApp.
When the status changes to Running, choose the application name.
Under Real time analytics, choose Go to SQL results.
Choose the Real-time analytics

The application groups clicks in 1-minute intervals. Let’s take the ad P701 as an example. If this ad is clicked by multiple users, this SQL function adds all the clicks by different users in the last minute. If five users clicked on P701 in the last minute, the function outputs a ClickCount of 5. A stagger window is used because it’s well-suited for analyzing groups of data that arrive at inconsistent times.

On the Kinesis Data Analytics console, choose OrderAnalysisApp.
Choose Go to SQL results.
This application groups orders by Promotion, as shown in the following screenshot.

Processing and storing the data

In the data processing and storage stage, aggregated clickstream and order data is delivered to a Kinesis Data Firehose delivery stream. Kinesis Data Firehose delivers clickstream aggregated records and orders to the click-stream and order-stream buckets, respectively. The data is partitioned by year, month, and day. The following screenshot shows the delivery streams on the console.

Analyzing the data

To analyze your data, complete the following steps:

Verify that the S3 bucket was created for clickstream and orders.

The data in the bucket is partitioned by year, month, date, and hour.

On the AWS Glue console, view the clickstream and orders crawlers.

These two crawlers crawl the click-stream and order-stream buckets every 15 minutes and create tables.

To run the crawlers on demand, choose Run crawler.

When the crawler is finished, the Tables added column displays 1.

In the navigation pane, choose Tables.
Verify that the crawlers created the tables.
On the Athena console, choose Saved queries.

You can see three queries have been created.

Select view_clicks_aggregate to load it in the query editor.
Select ad_to_order_conversion and choose Run Query.

If the Amazon S3 bucket name has -, the crawler replaces - with _ while creating the table.

Replace - with _ in the table name when creating the view.
Repeat the same process for view_orders_aggregate and view_conversion_ratio.

Make sure you run view_clicks_aggregate and view_orders_aggregate before running view_conversion_ratio.

Choose view_conversion_ratio and choose Preview.

Orders and clicks for each promotion and the corresponding conversion ratio are displayed.

Visualizing the data

To visualize your data, you first load it into QuickSight. You can then create visualizations. In this section, we also configure a scheduled data refresh.

Loading the data

To visualize your data, you must first load your data into QuickSight.

On the QuickSight console, from the Admin drop-down menu, choose Manage QuickSight.
In the navigation pane, choose Security & Permissions.
Choose Add or remove.
Select Amazon Athena.
Select Amazon S3 to edit QuickSight access to your S3 buckets.
Choose the Details link next to Amazon S3.
Choose Select S3 buckets.
Select the bucket names you provided for clicks and orders.
Choose Finish.
Choose Update.
Choose the QuickSight icon on the top left of the admin panel to proceed back to the home screen.
In the navigation pane, choose Datasets.
Choose New dataset.
Choose Athena.
For Data source name, enter Ad-To-Order-Conversion.
Choose Validate Connection.
After your connection is validated, choose Create data source.
For Database, choose ad-to-order-conversion.
For Tables, select view_conversion_ratio.
Choose Select.
Choose Visualize.

Creating visualizations

In this section, we create two visualizations of our data. We first make a horizontal bar chart.

From the Add menu, choose Add Calculated Field.
Enter Clicks_to_Orders.
Enter the formula sum(orders)/sum(clicks).
Choose Save.
Choose … next to Click to orders.
For Show as, choose Percent.
For Visual type, choose Horizontal bar chart.
Drag promotion to Y-axis.
Drag clicks_to_orders to Value.
Drag date to Group/Color.

The following screenshot shows our visualization.

We now make our second visualization, a vertical bar chart.

Choose the + icon next to Sheet1.
For Visual types, choose Vertical bar chart.
Drag promotions to Y-axis.
Drag clicks and orders to Value.

This graph displays clicks and orders for each promotion.

Choose Insights on the left panel to see a summary of your insights.

Refreshing the data

We can also set up a scheduled refresh for our data.

Choose Manage Data.
Choose view_conversion_ratio.
Choose Schedule refresh.
Choose Create.
For Repeats, choose Hourly.
Choose Create.

You see a confirmation message that you configured a refresh one time per hour.

Conclusion

In this post, we showed you how to use AWS analytics and storage services to address business challenges that require handling large volumes of data. Kinesis Data Streams and Kinesis Data Analytics let you ingest large volumes of data and sessionize the data. We also showed you how to analyze and visualize the clickstream and order data using AWS Glue, Athena, and QuickSight.

About the Authors

Gandhi Raketla is a Senior Solutions Architect for AWS. He works with AWS customers and partners on cloud adoption, architecting solutions that help customers foster agility and innovation.

Nick Sack is a DevOps Consultant for AWS Professional Services. He is passionate about working with customers and building automated solutions to help customers on their cloud journeys. When not working, Nick enjoys hiking, playing soccer, reading, and learning about technology.

Preparing data for ML models using AWS Glue DataBrew in a Jupyter notebook

2020-12-04 Zayd Simjee

Post Syndicated from Zayd Simjee original https://aws.amazon.com/blogs/big-data/preparing-data-for-ml-models-using-aws-glue-databrew-in-a-jupyter-notebook/

AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning (ML). In this post, we examine a sample ML use case and show how to use DataBrew and a Jupyter notebook to upload a dataset, clean and normalize the data, and train and publish an ML model. We look for anomalies by applying the Amazon SageMaker Random Cut Forest (RCF) anomaly detection algorithm on a public dataset that records power consumption for more than 300 random households.

Deploying your resources

To make it easier for you to get started, we created an AWS CloudFormation template that automatically configures a Jupyter notebook instance with the required libraries and installs the plugin. We used Amazon Deep Learning AMI to configure the out-of-the-box Jupyter server. This easy deployment is intended to get you started on DataBrew from within a Jupyter environment. The source code for the DataBrew plugin and the CloudFormation template are available in the GitHub repo.

To deploy the solution, you must have a subnet that has internet access and an Amazon Simple Storage Service (Amazon S3) bucket where you want store the data for DataBrew. Select the VPC, subnet, security group, and the S3 bucket that you want to use store the data for DataBrew processing. Provide the Amazon Elastic Compute Cloud (Amazon EC2) key pair if you plan to SSH to the instance.

Launch the following stack:
When the template deployment is complete, on the Outputs tab, choose the URL to open JupyterLab.

Because the Jupyter server is configured with a self-signed SSL certificate, your browser warns you and prompts you to avoid continuing to this website. But because you set this up yourself, it’s safe to continue.

Choose Advanced.
Choose Proceed.
Use the password databrew_demo to log in.

For more information about securing and configuring your Jupyter server, see Set up a Jupyter Notebook Server.

In the Jupyter environment’s left panel, choose the DataBrew logo.
Choose Launch AWS Glue DataBrew.
When the extension loads, choose the Datasets tab in the navigation bar.

Preparing your data using DataBrew

Now that the DataBrew extension is ready to go, we can begin to explore how DataBrew can make data preparation easy. Our source dataset contains data points at 15-minute intervals, and is organized as a series of columns for each household. The dataset is really wide, and the RCF algorithm expects data tuples of date/time, client ID, and consumption value. Additionally, we want to normalize our data to 1-hour intervals. All of this is achieved through DataBrew.

Setting up your dataset and project

To get started, you set up your dataset, import your data, and create a new project.

Download power.consumption.csv from the GitHub repo.
On the Datasets page, choose Connect new dataset.
For Dataset name, enter a name.
In the Connect to new dataset section, choose File upload.
Upload power.consumption.csv.
For Enter S3 destination, enter an S3 path where you can save the file.
Choose Create dataset.

The file may take a few minutes to upload, depending on your internet speed.

On the Datasets page, filter for your created dataset.
Select your dataset and choose Create project with this dataset.

In the Create project wizard, give your project a name.
In the Permissions section, choose the AWS Identity and Access Management (IAM) role created from the CloudFormation template.

You can find the role on the CloudFormation stack’s Resources tab. If you use the default stack name, the role should begin with databrew-jupyter-plugin-demo.

After you create the project, the project view loads, and you’re ready to prepare your data.

Building a recipe for data transformation

A recipe is a series of steps that prepare your data for the RCF algorithm. The algorithm requires three columns: date, client ID, and an integer value. To transform our dataset to contain those three columns, we configure our recipe to do the following:

Unpivot the data to collapse measurements from multiple clients into one column.
Apply the window function to average the 15-minute data points into 1-hour data points.
Filter to keep only values at each hour.
Multiply and floor the results.

Unpivoting the data

To unpivot the data and collapse measurements, complete the following steps:

On the toolbar, choose Pivot.

In the Pivot wizard, select Unpivot: Columns to rows.
For Unpivot columns, choose MT_012, MT_013, MT_131, and MT_132.
For Column name, enter client_id.
For Value column name, enter quarter_hour_consumption.

In the Recipe pane, you can see the action that unpivots the columns. This action can be revisited later and changed. The new columns may not be visible immediately.

To see them and narrow down the visible data to only the relevant columns, choose the arrow next to Viewing.
Deselect all items and select only _c0 and our two new columns, client_id and
quarter_hour_consumption.

Applying the window function

To apply the window function to average the 15-minute data points into 1-hour data points, complete the following steps:

Choose the quarter_hour_consumption
Choose Functions.
Choose Window functions.
Choose Rolling average.

In the Create column pane, for Number of rows before, enter 0.
For Number of rows after, enter 3.
For Name of column to order by with, choose client_id.
For Destination column, enter hourly_consumption_raw.
Choose Apply.

Filtering to keep only values at each hour

In this step, you rename the date/time column, convert it to string type so that you can do simple filtering, and filter the dataset on the string column for times ending in :00:00.

For the _c0 column, choose the ellipsis icon (…) and choose Rename.

Rename the column to timestamp.
Choose the clock icon and choose string.

With the column selected, choose Filter.
Choose By condition.
Choose Ends with.
Enter the value :00:00.
Choose Apply.

Filtering the column for only values that end with :00:00 leaves you with hourly averages of power consumption per client for every hour.

Multiplying and flooring the results

In this step, you multiply the data by 100 to increase precision and floor the data so that it can be accepted by the RCF algorithm, which only accepts integers.

Choose Functions.
Choose Math functions.
Choose Multiply.
For Value using¸ choose Source columns and value.
For Source column, choose hourly_consumption_raw.
For Destination column, enter hourly_consumption_raw_times_a_hundred.
Choose Apply.

Choose Functions.
Choose Math functions.
Choose Floor.
For Source column, choose hourly_consumption_raw_times_a_hundred.
For Destination column, enter hourly_consumption.
Choose Apply.

This column contains the final, normalized data.

Running the job to transform the data

You’re now ready to transform the data.

Choose Create job.

Enter a job name and choose the dataset we created.
Specify the S3 bucket you provided in the CloudFormation template.
Choose the IAM role that AWS CloudFormation created (which we used earlier to create the project).
Choose Create and run job.

The job may take up to 5 minutes to complete.

On the Job run history page for the job, view the output on the Amazon S3 console by choosing the link in the table.

That’s it, the data is now ready to use when training and deploying our ML model.

Training and deploying the ML model using prepared data

The data was already prepared using DataBrew via the plugin, so the next step is to train an ML model using that data. We provided a sample anomaly detection notebook that you can download.

In this sample notebook, you need to specify the S3 data location where you stored the output data from DataBrew. The notebook uses the IAM role attached to the EC2 instance profile that was created by AWS CloudFormation. You can follow through the notebook and when you provide the right S3 paths, the first step is to filter the specific columns we’re interested in and visualize the time series power consumption data.

The next step is to train a sample anomaly detection model using the SageMaker Random Cut Forest algorithm. We pick one of the time series available in the input Pandas DataFrame and train the anomaly detection model with the hyperparameter feature_dim set to 1, leaving the default values for other hyperparameters. We then create an estimator for Random Cut Forest and fit the model. In a few minutes, the training should be complete. In the next step, we create a predictor and deploy the model to a SageMaker endpoint.

Using the prepared data, we run the prediction and plot the results.

We use the anomaly detection baseline that is two standard deviations away from the mean score. The data shows an anomaly towards the end of the time series. With this information, that timeframe can be further investigated.

Finally, we clean up by deleting the SageMaker endpoint to prevent any ongoing charges.

Conclusion

We’ve walked you through the process of setting up the AWS Glue DataBrew Jupyter plugin in a Jupyter notebook environment. We used the plugin to prepare data, then trained and deployed an ML model in the same Jupyter environment using SageMaker.

Although we used a DLAMI Jupyter environment in this post, the DataBrew Jupyter extension also works on SageMaker notebooks. For installation instructions, see the GitHub repo.

DataBrew makes it easy to iterate through data preparation workflows. The resultant recipes and jobs are duplicable and can be run over discrete, large datasets. The DataBrew Jupyter plugin allows you to prepare your data seamlessly, in context, within your Jupyter notebook.

About the Authors

Zayd Simjee is a software engineer at Amazon. He’s interested in distributed systems, big data, and simplifying developer experience on the Cloud. He’s worked on a few Big Data services at AWS, and most recently completed work on the AWS Glue DataBrew release.

As a Principal Solutions Architect at Amazon Web Services, Karthik Sonti works with GSI partners to help enterprises realize transformational business outcomes using artificial intelligence, machine learning and data analytics

Enabling self-service data publication to your data lake using AWS Glue DataBrew

2020-12-03 Jason Hunter

Post Syndicated from Jason Hunter original https://aws.amazon.com/blogs/big-data/enable-self-service-data-publication-to-your-data-lake-using-aws-glue-databrew/

Data lakes have been providing a level of flexibility to organizations unparalleled to anything before them. Having the ability to load and query data in place—and in its natural form—has led to an explosion of data lake deployments that have allowed organizations to accelerate against their data strategy faster than ever before.

Most organizations have a latent demand for business data that already exists within the organization—but is not yet broadly advertised or accessible. Building data lakes with AWS Lake Formation allows you to create an open and portable data platform that allows data consumers (users and applications) to discover and source high-quality and trusted data from across business data domains.

In a typical scenario, a data lake is run and managed by a data team that is responsible for the onboarding of datasets, application and enforcement of business rules (ETL/ELT), and publishing the transformed dataset into a company-wide data catalog for consumption. The following diagram illustrates a typical process flow of how data transmits through a data lake.

It most cases, the data team writes the code and uses platform tools to run the pipeline, including running business logic against the data that is to be shared to the wider organization data users.

The ability for users to self-serve the creation of datasets (such as data owners and business units) for general applicability across the business is an ongoing request for most organizations. If you run a centralized operating model across your data lake, you may require that all data pass through this centralized team or where the facility exists for self-serve data publication, this typically requires a high technical bar to participate in the data onboarding process, which is normally outside the capability of data stewards and owners.

AWS Glue DataBrew is a visual data preparation tool that helps solve some of these challenges by allowing you to inspect datasets and apply transformations against them in a visual and no code way that allows users of all skill sets to participate as part of the data preparation pipeline process. An obvious benefit is data owners and SMEs (subject matter experts) can define and implement data quality and governance checks against their data before publishing it into the data lake.

This accelerates and simplifies the process of producing trusted and quality outputs because the data owners are integrated as part of the onboarding process.

For this post, we discuss a use case of a project team in a consultancy organization that wants to combine first-party data they have acquired against an existing dataset that exists within their data lake.

Overview of solution

The workflow includes the following steps:

Step 1: Create a new AWS Glue DataBrew dataset by uploading a file to, or choosing and existing dataset, from their organization data lake.
Step 2: Create a DataBrew project to load and visually inspect and validate the datasets that you will transform.
Step 3: Using DataBrew, visually transform, augment and define the business rules you want to be applied to your new dataset. (These business rules are called steps, which collectively are called a Recipe)
Step 4: Create a DataBrew job to execute your Rules (Recipe) against the datasets in their entirety.
Step 5: Run a AWS Glue crawler to discover the resulting dataset, publish it to the AWS Glue data catalog and apply a secure permission model to it.
Step 6: Use a preferred analysis or machine learning (ML) tool to browse and consume the data in the Data Catalog based on user permissio

Let’s now go through these steps in a bit more detail.

Creating a project

A project in DataBrew defines two key aspects: the primary dataset you’re working with and the steps you want to run against that dataset to produce a trusted and validated dataset that is specific to your business need.

To create your project, complete the following steps:

On the DataBrew console, choose Create project.

For Project name, enter a name that is representative of the source and intent of the output.
For Attached recipe, choose Create new recipe.

In subsequent projects with a similar shape of data, you can reuse your recipe of business rules.

You have several options when connecting your dataset. You can upload a new data source (such as a local Excel or JSON file on my computer) or to reuse an existing dataset that exists within your data lake. If you’re sourcing data from the data lake, you can filter by the origin of the source, such as our cloud data warehouse Amazon Redshift, or transactional systems hosted in an Amazon Relation Database Service (Amazon RDS) instance. For more information about copying data into your data lake using AWS Glue, see Which Data Stores Can I Crawl? For this post, I upload a new dataset.

Select New dataset.

For Dataset name, enter a name.
For File upload, choose Choose file.
Choose the file to upload.
For Enter S3 destination, enter the location in our data lake where the file is saved to.

For Role name, choose a security role that grants access to the data lake location specified.

Typically, this is an area that you have access to that is driven by project or department privileges.

If no role is provided by default, a data platform admin or AWS account admin can create one for you. For more information, see Identity-based policy examples for AWS Glue DataBrew.

Optionally, modify the sample size of the data to inspect.
Choose Create project.

Analyzing and transforming your dataset

With a project created, I can load a sample of my dataset in a familiar tabular grid view and start to explore the data in greater detail.

This view allows for quick inspection of the data and the ability to run some sample and survey methods to determine if the data being viewed has statistical significance to our project. This way, we can make sure we’re making informed, data-driven decisions on behalf of our clients.

The following screenshot shows that the review_id column has duplicate values.

To maintain integrity and trust of the output, I can remove the duplicate rows as a step by choosing Duplicates and Remove duplicates in column.

This adds a new step to my recipe, which is enforced on this column type whenever it runs. DataBrew shows a preview of the change before you apply the changes.

If I want to dig deeper into the shape and validity of my data, I can simply choose the Schema tab to get a coalesced view of the data, or create a data profile job on the Profile tab to get further insights, such as correlations across column types.

The rest of the data looks good, so I now complete the rest of my transformation functions:

Clean the survey result summary text by removing invalid characters and quotes, and enforcing sentence casing.
For downstream ML tasks, apply a hot encode function to the text value of a customer’s sentiment (categorical variable) to that of an integer value (binary vector).
Join to a company dataset that contains our customer data so I can bring in identifiable information of the locations surveyed.
Hide columns not require for the final output.

Cleaning the summary text

We first clean the survey result summary text by removing invalid characters and quotes, and enforce sentence casing.

Choose Format.
Choose Change to sentence casing.

On the recipe bar, choose Clean.
For Specify values to remove, select Special characters.

Applying a hot encode function

Next, apply a hot encode function to the categorical variable (sentiment) to a binary vector for downstream ML purposes.

Choose Encode.
Choose One-Hot-Encode column.
Choose your source column.
Choose Apply.

Joining and enriching our dataset

We now join our dataset against a primary customer dataset so we can enrich our dataset with additional columns of value.

The initial dataset we created contains a number of key data elements, such as a review rating, customer comments, associated sentiment rating, a number of ordinal values specifying the value of the review, emotional attributes from the interviewee, and various restaurant and location identifiers.

As part of our full analysis, we want to reference the additional data elements associated with these restaurant and location identifiers so we can form a complete view of the data.

DataBrew makes this enrichment task easy by letting you join to an additional company dataset that exists within your data lake. In the following steps, I choose to bring across only those columns that I require by deselecting those that are not required. (This existing dataset contains additional data of businesses that are part of the wider survey group.)

On the recipe menu bar, choose Join.
Choose Dataset join.
For Select dataset, choose the dataset to join against.

The page displays a summary view for inspection.

Choose Next.

For this post, we perform a left join to ensure that we’re only bring in matching records against our own dataset (Table A) from the paired dataset (Table B).

Choose the columns between both tables that we want to match against (for this post, business_id).
Select the columns from the joined table to include in our dataset.

Choose Finish.

When this step is complete, we have a complete dataset that is applicable towards our final requirements for analysis.

Removing unnecessary columns

Finally, remove columns not required for the final dataset and change the column names.

At the conclusion of our data preparation, we can see the number and types of steps we have run.

For similar datasets, you can choose Publish in the Recipe pane to reuse this recipe and enforce the steps taken instead of having to complete the steps again. This allows you to build up your own repository of recipes that you can replay against various datasets to ensure a consistent level of data quality and outputs.

You then provide a description and publish the recipe for broader accessibility.

As a final validation step before publishing my dataset, I can choose to visually inspect the data lineage to confirm my sources selected are as intended and to confirm the size of my datasets to process.

For this post, I can confirm that the source and datasets are correct and the recipe being created has the expected number of steps. The data lineage view is useful for when you’re reading unfamiliar data sources because it allows you to understand the origins and transformations that have taken place beforehand.

Creating a job to publish your dataset

Now we have our dataset defined and validated by way of the visual steps we created in our recipe. It’s time to create a job that runs them against the entirety of our dataset. The output is saved back into an area of our data lake for additional processing, leading to publication within our Data Catalog.

Choose Create job.

For Job name, enter a name.

In the Job output settings section, specify the data lake location where the final output is saved to.

If required, we can optionally choose a file format for the dataset. If you have multiple requirements for different file formats, you can simply choose to save multiple outputs by choosing Add another output and specifying the details. For this post, I opt to save the output in two formats: one in CSV and another in the optimized open file format Apache Parquet.

For Role name, choose the security role that has access to our client project folder to make sure DataBrew has permission to save the dataset to the location specified.

The role that you choose here is different than the DataBrew role you chose when creating the project. Each role serves a different purpose. For this post, we choose a role that has permissions to read, run, and save the resulting files into data lake locations (as governed by the role) of which I have been delegated rights to access and save data to.

Choose Create and run job.

The project is now visible on the main page, detailing the recipe used against the dataset and the job name.
style=”margin: 20px 0px 20px 0px; border: 1px solid #CCCCCC;”

To view details about the job run and inspect completion details, choose the value in the Jobs column.

To see additional details about the history of the job, choose the job name.

The output provides details of the job.

The lineage view shows the end-to-end process of all input files, transformation steps, and final output locations.

Consuming your dataset

After you save the new dataset to your data lake, we can use AWS Glue to crawl the new dataset and publish it to our Data Catalog. For more information, see Populating the AWS Glue Data Catalog. After the data is indexed, it’s available and visible through AWS Glue and Lake Formation.

Crawling the output files from DataBrew

To crawl the output files and publish them to our Data Catalog, complete the following steps:

On the AWS Glue console, choose Crawlers.
Choose Add crawler.

For Crawler name, enter a name.
Choose Next.

For Crawler source type, select Data stores.
For Repeat crawls of S3 data stores, select Crawl new folders only.
Choose Next.

For Choose a data store, choose S3.
For Crawl data in, select Specified path in my account.
For Include path, enter the path of the DataBrew output files.
Choose Next.

Select Choose an existing IAM role.
For IAM role, choose or create a role that has access to the location of your DataBrew output.

If you don’t have such a role, your data platform admin can set one up for you.

Choose Next.

For Frequency, specify how often the crawler should check for new output files. For this post, we choose Run on demand.

This normally aligns with the schedule you have set up in DataBrew.

Choose Next.
For Database, choose (or create) a database to publish the business view of your output files.

The crawler creates one or more data tables that users can browse and, if permissions allow, run select queries against.

Choose Next.

Review the summary page and choose Finish.
Select your crawler and choose Run crawler.

When the crawler is complete, it details the number of tables it created and the time it took to discover and index your output dataset. The next step is to view and set up permissions in the Data Catalog.

Publishing and securing your DataBrew output files

In this post, we use the Lake Formation security and permission model enforced for data consumers using supported services. For more information, see Security and Access Control to Metadata and Data in Lake Formation.

On the Lake Formation console, under Data catalog, choose Databases.
Enter the database name you used when you created the crawler earlier (for this post, hospitalitydb).

Select the database and choose View tables.

We can see that a table called results was created in this database.

We can do several things, such as grant or revoke items from the Actions menu, or choose the table name to inspect the schema.

As a data owner, I grant an analyst with the username lfanalyst select permissions to my dataset, and restrict access to only a limited set of columns. For this post, I exclude the userid and reviewid columns.

After I save this change, I can let the analyst know that they have access to my survey results.

Consumption access to DataBrew output files

The analyst user who has been granted limited access to this dataset can query the data using Amazon Athena, either via the console or external analytical tools that support the JDBC or ODBC drivers it provides.

If using the console, the analyst can choose the database hospitalitydb to view the results table.

If the analyst expands the results table to view the columns, userid and reviewid aren’t present.

The following screenshot shows the results of a SQL statement to browse a sample of the dataset.

Conclusion

DataBrew makes it easy for data owners, analysts, and data SMEs to easily curate and publish data of value to their organization data lake in a self-service manner.

This visual interface, combined with ML assistance to help you better understand and validate your data, allows all types of users to easily create, consume, and publish data into an organization’s data lake in a visual, no code, or low code way while working within an existing centralized governance model.

About the Author

Jason Hunter is a Principal Solutions Architect at AWS with a focus on Big Data & Analytics solutions. With over 20 years of experience in information technology, he helps customers architect and build highly scalable, secure and performant cloud-based data platforms, accelerating their journey to become more data driven.

Controlling data lake access across multiple AWS accounts using AWS Lake Formation

2020-12-02 Rafael Suguiura

Post Syndicated from Rafael Suguiura original https://aws.amazon.com/blogs/big-data/controlling-data-lake-access-across-multiple-aws-accounts-using-aws-lake-formation/

When deploying data lakes on AWS, you can use multiple AWS accounts to better separate different projects or lines of business. In this post, we see how the AWS Lake Formation cross-account capabilities simplify securing and managing distributed data lakes across multiple accounts through a centralized approach, providing fine-grained access control to the AWS Glue Data Catalog and Amazon Simple Storage Service (Amazon S3) locations.

Use case

Keeping each business unit’s resources as compute and storage in its own AWS account allows for easier cost allocation and permissions governance. In the other hand, centralizing your Data Catalog into a single account with Lake Formation removes the overhead of managing multiple catalogs in isolated data silos, simplifying the management and data availability.

For this post, we use the example of a company with two separate teams:

The Analytics team is responsible for data ingestion, validation, and cleansing. After processing the income data, they store it on Amazon S3 and use Lake Formation for the Data Catalog, in a primary AWS account.
The Business Analyst team is responsible for generating reports and extracting insight from such data. They use Amazon Athena running in a secondary AWS account.

When a secondary account needs to access data, the data lake administrator use Lake Formation to share data across accounts, avoiding data duplication, silos, and reducing complexity. Data can be shared at the database or table level, and the administrator can define which tables and columns each analyst has access to, establishing a centralized and granular access control. The following diagram illustrates this architecture.

Architecture overview

We provide two AWS CloudFormation templates to set up the required infrastructure for this data lake use case. Each template deploys the resources in one of the accounts (primary and secondary).

In the primary account, the CloudFormation template loads the sample data in the S3 bucket. For this post, we use the publicly available dataset of historic taxi trips collected in New York City in the month of June 2020, in CSV format. The dataset is available from the New York City Taxi & Limousine Commission, via Registry of Open Data on AWS, and contains information on the geolocation and collected fares of individual taxi trips.

The template also creates a Data Catalog configuration by crawling the bucket using an AWS Glue crawler, and updating the Lake Formation Data Catalog on the primary account.

Prerequisites

To follow along with this post, you must have two AWS accounts (primary and secondary), with AWS Identity and Access Management (IAM) administrator access.

Deploying the CloudFormation templates

To get started, launch the first CloudFormation template in the primary account:

After that, deploy the second template in the secondary account:

You now have the deployment as depicted in the following architecture, and are ready to set up Lake Formation with cross-account access.

Setting up Lake Formation in the primary account

Now that you have the basic infrastructure provisioned by the template, we can dive deeper into the steps required for Lake Formation configuration. First sign in to the primary account on the AWS Management Console, using the existing IAM administrator role and account.

Assigning a role to our data lake

Lake Formation administrators are IAM users or roles that can grant and delegate Lake Formation permissions on data locations, databases, and tables. The CloudFormation template created an IAM role with the proper IAM permissions, named LakeFormationPrimaryAdmin. Now we need to assign it to our data lake:

On the Lake Formation console, in the Welcome to Lake Formation pop-up window, choose Add administrators.
1. If the pop-up doesn’t appear, in the navigation pane, under Permissions, choose Admins and database creators.
2. Under Data Lake Administrators, choose Grant.
For IAM users and roles, choose LakeFormationPrimaryAdmin.
Choose Save.

After we assign the Lake Formation administrator, we can assume this role and start managing our data lake.

On the console, choose your user name and choose Switch Roles.

Enter your primary account number and the role LakeFormationPrimaryAdmin.
Choose Switch Role.

For detailed instructions on changing your role, see Switching to a role (console).

Adding the Amazon S3 location as a storage layer

Now you’re the Lake Formation administrator. For Lake Formation to implement access control on the data lake, we need to include the Amazon S3 location as a storage layer. Let’s register our existing S3 bucket that contains sample data.

On the Lake Formation console, in the navigation pane, under Register and Ingest, choose Data lake locations.
For Amazon S3 path, choose Browse.
Choose the S3 bucket in the primary account, referenced in the CloudFormation template outputs as S3BucketPrimary.
Choose Register location.

Configuring access control

When you create the template, an AWS Glue crawler populates the Data Catalog with the database and catalog pointing to our S3 bucket. By default, Lake Formation adds IAMAllowedPrincipals permissions, which isn’t compatible with cross-account sharing. We must disable it on our database and table. For this post, we use Lake Formation access control in conjunction with IAM. For more information, see Change Data Catalog Settings.

On the Lake Formation console, in the navigation pane, under Data Catalog, choose Databases.
Choose gluedatabaseprimary.
Choose Edit.
Deselect Use only IAM access control for new tables in this database.
Choose Save.

On the database details page, on the Actions menu, choose Revoke.
For IAM users and roles, choose IAMAllowedPrincipals.
For Database permissions, select Super.

Choose Revoke.
On the database details page, choose View Tables.
Select the table that starts with lf_table.
On the Actions menu, choose Revoke.
For IAM users and roles, choose IAMAllowedPrincipals.
For Database permissions, select Super.
Choose Revoke.

You can now see the metadata and Amazon S3 data location in the table details. The CloudFormation template ran an AWS Glue crawler that populated the table.

Granting permissions

Now we’re ready to grant permissions to the Business Analyst users. Because they’re in a separate AWS account, we need to share the database across accounts.

On the Lake Formation console, under Data Catalog¸ choose Databases.
Select our database.
On the Actions menu, choose Grant.
Select External account.
For AWS account ID or AWS organization ID, enter the secondary account number.
For Table, choose All tables.
For Table permissions, select Select.
For Grantable permissions, select Select.

Grantable permissions are required to allow the principal to pass this grant to other users and roles. For our use case, the secondary account LakeFormationAdministrator grants access to the secondary account BusinessAnalyst. If this permission is revoked on the primary account in the future, all access granted to BusinessAnalyst and LakeFormationAdministrator on the secondary account is also revoked.

For this post, we share the database with a single account. Lake Formation also allows sharing with an AWS organization.

Choose Grant.

Sharing specific tables across accounts

Optionally, instead of sharing the whole database, you can share specific tables across accounts. You don’t need to share the database to share a table underneath it.

On the Lake Formation console, under Data Catalog, choose Tables.
Select the table that starts with lf_table.
On the Actions menu, choose Grant.
Select External account.
For AWS account ID or AWS organization ID, enter the secondary account number.

You can also choose specific columns to share with the secondary account. For this post, we share five columns.

For Columns, choose Include columns.
For Include columns, choose the following columns
1. vendorid
2. lpep_pickup_datetime
3. lp_dropoff_taketime
4. store_and_forward_flag
5. ratecodeid
For Table permissions, select Select.
For Grantable permissions, select Select.
Choose Grant.

Setting up Lake Formation in the secondary account

Now that the primary account setup is complete, let’s configure the secondary account. We access the resource share and create appropriate resource links, pointing to the databases or tables in the primary account. This allows the data lake administrator to grant proper access to the Business Analyst team, who queries the data through Athena. The following diagram illustrates this architecture.

Assigning a role to our data lake

Similar to the primary account, we need to assign an IAM role as the Lake Formation administrator. To better differentiate the roles, this one is named LakeFormationSecondaryAdmin.

On the Lake Formation console, under Permissions, choose Admins and database creators.
Under Data Lake Administrators, choose Grant.
In the pop-up window, choose LakeFormationSecondaryAdmin.
Choose Save.
On the console, switch to the LakeFormationSecondaryAdmin role.

Sharing resources

Lake Formation shares resources (databases and tables) by using AWS Resource Access Manager. AWS RAM provides a streamlined way to share resources across AWS accounts and also integrates with AWS Organizations. If both primary and secondary accounts are in the same organization with resource sharing enabled, resources shares are accepted automatically and you can skip this step. If not, complete the following steps:

On the AWS RAM console, in the navigation pane, under Shared with me, choose Resource shares.
Choose the Lake Formation share.
Choose Accept resource share.

The resource status switches to Active.

Creating a resource link

With the share accepted, we can create a resource link in the secondary account. Resource links are Data Catalog virtual objects that link to a shared database or table. The resource link lives in your account and the referenced object it points to can be anywhere else.

On the Lake Formation console, under Data Catalog, choose Databases.
Choose Create database.
Select Resource link.
For Resource link name, enter a name, such as lf-primary-database-rl.
For Shared database, choose gluedatabaseprimary.

The shared database’s owner ID is populated automatically.

Choose Create.

You can use this resource link the same way you use database or table references in Lake Formation. The following screenshot shows the resource link listed on the Databases page.

Granting permissions

As the creator of the resource link, at this point only you (IAM role LakeFormationSecondaryAdmin) can view and access this object in the Data Catalog. To grant visibility on the resource link to our Business Analyst users (IAM role LakeFormationSecondaryAnalyst), we need to grant them describe permissions.

On the Lake Formation console, navigate to the database details page.
On the Actions menu, choose Grant.
For IAM users and roles, choose LakeFormationSecondaryAnalyst.
For Resource Link permissions, select Describe and deselect Super.
Choose Grant.

Granting permissions on a resource link doesn’t grant permissions on the target (linked) database or table, so let’s do it now. For our use case, the analysts only need SQL SELECT capabilities, and only to the specific columns of the table.

In the navigation pane, under Data Catalog, choose Databases.
Select lf-primary-database-rl.
On the Actions menu, choose Grant on Target.
In the Grant permissions dialog box, choose My account.
For IAM users and roles, choose LakeFormationSecondaryAnalyst.
Choose the table that starts with lf_table.
Under Columns, select Include Columns and select the first five columns.
For Table permissions, select Select.
Choose Grant.

Accessing the data

With all the Lake Formation grants in place, the users are ready to access the data at the proper level.

In the secondary account, switch to the role LakeFormationSecondaryAnalyst.
On the Athena console, choose Get Started.
On the selection bar, under Workgroup, choose LakeFormationCrossAccount.
Choose Switch workgroup.

The screen refreshes; make sure you are in the right workgroup.

To use Lake Formation cross-account access, you don’t need a separate Athena workgroup. For this post, the CloudFormation template created one to simplify deployment with the proper Athena configuration.

For Data source, choose AwsDataCatalog.
For Database, choose lf-primary-database-rl.
For Tables, choose if_table_<string>.
On the … menu, choose Preview table.

Choose Run query.

You now have a data analyst on the secondary account with access to an S3 bucket in the primary account. The analyst only has access to the five columns we specified earlier.

Data access that is granted by Lake Formation cross-account access is logged in the secondary account AWS CloudTrail log file, and Lake Formation copies the event to the primary account’s log file. For more information and examples of logging messages, see Cross-Account CloudTrail Logging.

Cleaning up

To avoid incurring future charges, delete the CloudFormation templates after you finish testing the solution.

Conclusion

In this post, we went through the process of configuring Lake Formation to share AWS Glue Data Catalog metadata information across AWS accounts.

Large enterprises typically use multiple AWS accounts, and many of those accounts might need access to a data lake managed by a single AWS account. AWS Lake Formation with cross-account access set up enables you to run queries and jobs that can join and query tables across multiple accounts.

About the Authors

Rafael Suguiura is a Principal Solutions Architect at Amazon Web Services. He guides some of the world’s largest financial services companies in their cloud journey. When the weather is nice, he enjoys cycling and finding new hiking trails—and when it’s not, he catches up with sci-fi books, TV series, and video games.

Himanish Kushary is a Senior Big Data Architect at Amazon Web Services. He helps customers across multiple domains build scalable big data analytics platforms. He enjoys playing video games, and watching good movies and TV series.

New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding without user provisioning, and developer portal for embedded analytics

2020-12-01 Jose Kunnackal

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/new-in-amazon-quicksight-embedding-without-user-provisioning-session-capacity-pricing-and-embedded-developer-portal/

Amazon QuickSight Enterprise edition now offers a new, session capacity-based pricing model starting at $250/month, with annual commitment options that provide scalable pricing for embedded analytics and BI rollouts to 100s of 1000s of users. QuickSight now also supports embedding dashboards in apps, websites, and wikis without the need to provision and manage users (readers) in QuickSight, which utilizes this new pricing model. Lastly, we also have a new developer portal for embedded analytics that allows you to learn more about the different embedded solutions available with QuickSight and experience it first-hand.

Session Capacity Pricing

Amazon QuickSight’s new session capacity-based pricing model provides scalable pricing for large scale deployments. Session capacity pricing allows Developers, Independent Software Vendors (ISVs) and Enterprises to benefit from lower per-session rates as they roll out embedded analytics and BI to 100s of 1000s of users. In such scenarios, average session consumption per user is usually low (<10 sessions/month) but aggregate session usage across users within the account is high. With session capacity pricing, sessions are simply charged in 30-minute blocks of usage starting with first access. Session capacity pricing is also required for embedding without user management, where per-user pricing is not meaningful. Unlike traditional options for capacity pricing which require a server with annual commitments, QuickSight’s session capacity pricing allows you to get started easily with a $250/month starter option. QuickSight’s session capacities do not slow down with increased user concurrency or higher analytical complexity of dashboards (both common in a server-based model), but instead automatically scale to ensure a consistent, fast, end-user experience. After starting with the monthly option, you can move to one of QuickSight’s annual session capacity tiers as your sessions usage increases – ensuring that costs scale with growing usage. Any usage beyond the committed levels (monthly or annual) is charged at the overage session rates indicated, with no manual intervention for scaling needed – no more scrambling to add servers as you have bursts in usage or just greater success with your application/website. Annual session capacities are billed for monthly usage of sessions, with consumption of all committed sessions expected by end of the period. Annual session capacities allow sessions to be consumed across the year, providing flexibility in ramping up on production traffic, balancing session needs across busy/lean periods of the year (e.g., first/last week of the month are busy, holidays may be slow or busy depending on the nature of the business). For more details on the new pricing options, visit the QuickSight pricing page.

Embedding without user provisioning

Before this launch, Amazon QuickSight provided embedding dashboards in apps and portals where each end-user could be identified and provisioned in QuickSight, and charged using QuickSight’s per-user pricing model. This works well for situations where each end user can be uniquely identified, and often has permissions associated with their data access levels. The NFL chose QuickSight to embed dashboards in their next-gen stats portal and share insights with clubs, broadcasters and editorial teams. 1000s of customers have since chosen to launch embedded dashboards in QuickSight for both their internal and external end-users, including Blackboard, Comcast, Panasonic Avionics, EHE Health. Customers can scale from 10s of users to 100s of 1000s without any server provisioning or management and also benefit from being able to utilize embedded QuickSight authoring capabilities to enable self-service dashboard creation.

With today’s launch, QuickSight will now also enable use cases with dashboards for 100s of 1000s of readers, where it is not possible to provision and manage users (or is highly inconvenient to do so).

Examples of these situations include embedded dashboards for sharing public statistics within a city/county, dashboards of company-wide statistics available to all users to be embedded on an internal corporate portal, or embedded dashboards with the same data intended for 10s or 1000s of users per customer, department or job location/site within an app.

Let’s take a look at a couple of examples and then how to embed these dashboards. First, a dashboard that shows a live stream of the three main stock indices (S&P 500, DOW Jones, and NASDAQ) that uses QuickSight’s newly launched connector to Amazon Timestream to provide a real-time view of the market index in US Central Time.

Second, a dashboard showing industries and count of firms in those industries using a choropleth map at the state level with the ability to drill down to county-level data.

Both dashboards are setup so that they can be accessed without any user restrictions, and we don’t have to setup users to roll this out. You can see these dashboards accessible for anyone here. To try this out, you can use the AWS Command Line Interface (AWS CLI); note that the AWS CLI used here is simply to illustrate this process, and for actual integration into a website/app you have to use the AWS SDK to obtain an embeddable URL for every new visit to the page.

4 steps to embed a dashboard

Configure an AWS Identity and Access Management (IAM) role for your application to use for embedding
```
arn:aws:iam::xxxxxxxxxxxx:role/YourApplicationRole
```

Attach the following policy to the role so the role can run the GetDashboardEmbedURL API for anonymous identity type:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "quicksight:GetDashboardEmbedUrl",
            "quickSight:GetAnonymousUserEmbedUrl"
         ],
         "Resource":[
            "replace with ARN for dashboard1",
            "replace with ARN for dashboard2",
            "replace with ARN for dashboard3"
         ]
      }
   ]
}

Run the GetDashboardEmbedURL API with IdentityType set to ANONYMOUS. This returns a response with the EmbedURL:

INPUT
aws quicksight get-dashboard-embed-url \
    --region us-west-2 \
    --namespace default \
    --dashboard-id c6e91d20-0909-4836-b87f-e4c115a88b65 \
    --identity-type ANONYMOUS \
    --aws-account-id 123456789012
    
RESPONSE
{
    "Status": 200,
    "EmbedUrl": "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    "RequestId": "1c82321c-6934-45b0-83f4-a8ce2f641067"
}

You can embed the returned URL in the IFRAME code of your application. Make sure your application is added to the allow list in QuickSight. This is a single-use URL and has to be generated dynamically by invoking get-dashboard-embed-url from the backend compute layer upon each load of the parent page.

Embed the dashboard in your application with the following HTML code:

<head>
    <title>Basic Embed</title>
    <!-- You can download the latest QuickSight embedding SDK version from https://www.npmjs.com/package/amazon-quicksight-embedding-sdk -->
    <!-- Or you can do "npm install amazon-quicksight-embedding-sdk", if you use npm for javascript dependencies -->
    <script src="./quicksight-embedding-js-sdk.min.js"></script>
    <script type="text/javascript">
        var dashboard;

        function embedDashboard() {
            var containerDiv = document.getElementById("embeddingContainer");
            var options = {
                // replace this dummy url with the one generated via embedding API
                url: "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/dashboardId?isauthcode=true&identityprovider=quicksight&code=authcode",  
                container: containerDiv,
                scrolling: "no",
                height: "700px",
 
                footerPaddingEnabled: true
            };
            dashboard = QuickSightEmbedding.embedDashboard(options);
        }
    </script>
</head>

<body onload="embedDashboard()">
    <div id="embeddingContainer"></div>
</body>

</html>

If you use NPM to manage you front end dependencies, run npm install amazon-quicksight-embedding-sdk. And then you can add use the following code to embed the URL in your application:

import { embedDashboard } from 'amazon-quicksight-embedding-sdk';

var options = {
    url: "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    container: document.getElementById("embeddingContainer"),
    parameters: {
        country: "United States",
        states: [
            "California",
            "Washington"
        ]
    },
    scrolling: "no",
    height: "700px",
    width: "1000px",
    locale: "en-US",
    footerPaddingEnabled: true
};
const dashboardSession = embedDashboard(options);

If you want to embed multiple dashboards and switch between them, you can pass more dashboard IDs by using the additional-dashboard-ids option while generating the URL. This generates the URL with authorization for all specified dashboard IDs and launches the dashboard specified under the dashboard-id option. See the following code:

INPUT
aws quicksight get-dashboard-embed-url \
    --region us-west-2 \
    --namespace default \
    --dashboard-id c6e91d20-0909-4836-b87f-e4c115a88b65 \
    --identity-type ANONYMOUS \
    --aws-account-id 123456789012
    --additional-dashboard-ids dashboardid1 dashboardid2
    
RESPONSE
{
    "Status": 200,
    "EmbedUrl": "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    "RequestId": "1c82321c-6934-45b0-83f4-a8ce2f641067"
}

In the preceding code, we assign our primary dashboard’s ID to the --dashboard-id option, and the other dashboard IDs to the --additional-dashboard-ids option as a space-separated value. You can pass up to 20 dashboard IDs in this option.

The EmbedURL value in the response is the URL for the primary dashboard. You can embed this dashboard in your app, wiki, or website and use the JavaScript SDK to switch between dashboards. To switch to another dashboard without having to generate a fresh embed URL, invoke the navigateToDashboard function (available in our JavaScript library) with any of the dashboard IDs that were initially included in the get-dashboard-embed-url call. See the following example code:

var options = 
     {
      dashboardId: "dashboardid1", 
      parameters: 
       {
         country: 
          [
            "United States"
          ]
     }
    };
dashboard.navigateToDashboard(options);

For more information about the JavaScript SDK, see the GitHub repo.

Embedding without users is now generally available to all Amazon QuickSight Enterprise Edition users and requires session capacity pricing.

To opt-in for session capacity pricing and embed dashboards without user provisioning, or simply enable discounted QuickSight usage for large scale deployments, you can opt-into session capacity pricing starting with the $250/month option via the “Manage QuickSight” > “Your subscriptions” page accessible to QuickSight administrators.

Embedded Developer Portal

We have a new embedded developer portal at https://developer.quicksight.aws that allows you to quickly interact with three key embedded scenarios – 1) embedded dashboards accessible to anyone accessing a website or portal (no user provisioning required), 2) embedded dashboards accessible to only authenticated users, and 3) embedded dashboard authoring for power users of apps.

The interactive dashboards, code snippets and setup instructions allow you to learn more about the rich capabilities of QuickSight embedding, easily get started with embedding QuickSight.

Summary

With new embedding capability, QuickSight offers a modern, serverless approach to deploying dashboards and visualizations into websites, apps and corporate portals in hours, without any user provisioning or management needed to scale to 100s of 1000s of users. Unlike traditional server-based models, QuickSight’s session capacity model allows you to start at a low $250/month, month-to-month price point. As usage grows, the available annual commitment models offer scalable pricing by reducing the per-session cost – enabling both ISVs and Enterprises to roll out embedded QuickSight dashboards at large scale, whether for internal or external users. Finally, with QuickSight’s new developer portal, you have instant access to samples of interactive embedded QuickSight dashboards as well as steps to integrating various embedded capabilities into your own websites, apps and portals.

About the Authors

Jose Kunnackal John is Sr. Manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight he has been with AWS Marketplace and Amazon retail as a PM. Kareem started his as a career developer and then PM for call center technologies, Local Expert and Ads for Expedia. He worked as a consultant with McKinsey and Company for a short while.

Arun Santhosh is a Specialized World Wide Solution Architect for Amazon QuickSight. Arun started his career at IBM as a developer and progressed on to be an Application Architect. Later, he worked as a Technical Architect at Cognizant. Business Intelligence has been his core focus in these prior roles as well

Keeping your data lake clean and compliant with Amazon Athena

2020-12-01 David Roberts

Post Syndicated from David Roberts original https://aws.amazon.com/blogs/big-data/keeping-your-data-lake-clean-and-compliant-with-amazon-athena/

With the introduction of CTAS support for Amazon Athena (see Use CTAS statements with Amazon Athena to reduce cost and improve performance), you can not only query but also create tables using Athena with the associated data objects stored in Amazon Simple Storage Service (Amazon S3). These tables are often temporary in nature and used to filter or aggregate data that already exists in another Athena table. Although this offers great flexibility to perform exploratory analytics, when tables are dropped, the underlying Amazon S3 data remains indefinitely. Over time, the accumulation of these objects can increase Amazon S3 costs, become administratively challenging to manage, and may inadvertently preserve data that should have been deleted for privacy or compliance reasons. Furthermore, the AWS Glue table entry is purged so there is no convenient way to trace back which Amazon S3 path was mapped to a deleted table.

This post shows how you can automate deleting Amazon S3 objects associated with a table after dropping it using Athena. AWS Glue is required to be the metadata store for Athena.

Overview of solution

The solution requires that the AWS Glue table record (database, table, Amazon S3 path) history is preserved outside of AWS Glue, because it’s removed immediately after a table is dropped. Without this record, you can’t delete the associated Amazon S3 object entries after the fact.

When Athena CTAS statements are issued, AWS Glue generates Amazon CloudWatch events that specify the database and table names. These events are available from Amazon EventBridge and can be used to trigger an AWS Lambda function (autoCleanS3) to fetch the new or updated Amazon S3 path from AWS Glue and write the database, table, and Amazon S3 path into an AWS Glue history table stored in Amazon DynamoDB (GlueHistoryDDB). When Athena drop table queries are detected, CloudWatch events are generated that trigger autoCleanS3 to look up the Amazon S3 path from GlueHistoryDDB and delete all related objects from Amazon S3.

Not all dropped tables should trigger Amazon S3 object deletion. For example, when you create a table using existing Amazon S3 data (not CTAS), it’s not advisable to automatically delete the associated Amazon S3 tables, because other analysts may have other tables referring to the same source data. For this reason, you must include a user-defined comment (--dropstore ) in the Athena drop table query to cause autoCleanS3 to purge the Amazon S3 objects.

Lastly, after objects are successfully deleted, the corresponding entry in GlueHistoryDDB is updated for historical and audit purposes. The overall workflow is described in the following diagram.

The workflow contains the following steps:

A user creates a table either via Athena or the AWS Glue console or API.
AWS Glue generates a CloudWatch event, and an EventBridge rule triggers the Lambda function.
The function creates an entry in DynamoDB containing a copy of the AWS Glue record and Amazon S3 path.
The user drops the table from Athena, including the special comment --dropstore.
The Lambda function fetches the dropped table entry from DynamoDB, including the Amazon S3 path.
The function deletes data from the Amazon S3 path, including manifest files, and marks the DynamoDB entry as purged.

Walkthrough overview

To implement this solution, we complete the following steps:

Create the required AWS Identity and Access Management (IAM) policy and role.
Create the AWS Glue history DynamoDB table.
Create the Lambda autoCleanS3 function.
Create the EventBridge rules.
Test the solution.

If you prefer to use a preconfigured CloudFormation template, launch one of the following stacks depending on your Region.

Region	Launch Button
us-east-1 (N. Virginia)
us-west-2 (Oregon)
eu-west-1 (Ireland)

Prerequisites

Before implementing this solution, create an AWS Glue database and table with the data residing in Amazon S3. Be sure your user has the necessary permissions to access Athena and perform CTAS operations writing in a sample Amazon S3 location.

For more information about building a data lake, see Build a Data Lake Foundation with AWS Glue and Amazon S3.

Creating an IAM policy and role

You need to first create the required IAM policy for the Lambda function role to use to query AWS Glue and write to DynamoDB.

On the IAM console, choose Policies.
Choose Create policy.

On the JSON tab, enter the following code (update the Region, account ID, and S3 bucket accordingly, and the table name GlueHistoryDDB if you choose to change it):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:ListObjectsV2",
                "s3:DeleteObjectVersion",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<athena-query-results-s3-bucket>",
                "arn:aws:s3:::<athena-query-results-s3-bucket>/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "dynamodb:PutItem",
                "dynamodb:Scan",
                "dynamodb:Query",
                "dynamodb:UpdateItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:<region>:<accountId>:table/GlueHistoryDDB"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "glue:GetTable"
            ],
            "Resource": [
                "arn:aws:glue:<region>:<accountId>:catalog",
                "arn:aws:glue:<region>:<accountId>:database/*",
                "arn:aws:glue:<region>:<accountId>:table/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogGroup"
            ],
            "Resource": [
                "arn:aws:logs:<region>:<accountId>:*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:<region>:<accountId>:log-group:/aws/lambda/autoCleanS3:*"
            ],
            "Effect": "Allow"
        }
    ]
}

Choose Review policy.
For Name, enter autoCleanS3-LambdaPolicy.
For Description, enter Policy used by Lambda role to purge S3 objects when an Amazon Athena table is dropped.
Choose Create policy.

Next, you need to create an IAM role and attach this policy.

On the IAM console, choose Roles.
Choose Create role.
Choose AWS service.
Choose Lambda.
Choose Next: Permissions.

For Filter policies, enter autoCleanS3-LambdaPolicy.
Choose Next: Tags.
Choose Next: Review.
For Role name, enter autoCleanS3-LambdaRole.
For Description, enter Role used by Lambda to purge S3 objects when an Amazon Athena table is dropped.
Choose Create role.

Creating the AWS Glue history DynamoDB table

You use this DynamoDB table to hold the current and historical list of AWS Glue tables and their corresponding Amazon S3 path. Create the table as follows:

On the DynamoDB console, choose Dashboard.
Choose Create table.
For Table name, enter GlueHistoryDDB.
For Partition key, enter database (leave type as String).
Select Add sort key.
Enter table_date (leave type as String).
For Table settings, select Use default settings.
Choose Create.

The following table summarizes the GlueHistoryDDB table attributes that the Lambda function creates.

Column	Type	Description
database	partition key	The name of the AWS Glue database.
table_date	sort key	A composite attribute of AWS Glue table name plus date created. Because the same database and table name can be created again, the date must be used to ensure uniqueness.
created_by	attribute	The user or Amazon EC2 instance ARN from which the table was created.
owner	attribute	The owner of the table or account number.
purged	attribute	A boolean indicating whether the Amazon S3 objects have been deleted (True/False).
s3_path	attribute	The Amazon S3 path containing objects associated with the table.
table	attribute	The AWS Glue table name.
update_time	attribute	The last time the table was updated (the Amazon S3 path changed or objects purged).
view_sql	attribute	The view DDL if a view was created.

Creating the Lambda function autoCleanS3

A CloudWatch event triggers the Lambda function autoCleanS3 when a new table is created, updated, or dropped. If the --dropstore keyword is included in the Athena query comments, the associated Amazon S3 objects are also removed.

On the Lambda console, choose Create function.
Select Author from scratch.
For Function name¸ enter autoCleanS3.
For Runtime, choose Python 3.8.
Under Permissions, for Execution role, select Use an existing role.
Choose the role you created (service-role/autoCleanS3-LambdaRole).
Choose Create function.
Scroll down to the Function code section.
If using Region us-west-2, on the Actions menu, choose Upload a file to Amazon S3.

Enter the following:

https://aws-bigdata-blog.s3.amazonaws.com/artifacts/aws-blog-keep-your-data-lake-clean-and-compliant-with-amazon-athena/autoCleanS3.zip

Choose Save.

If using a Region other than us-west-2, download the Lambda .zip file locally. Then choose Upload a .zip file and choose the file from your computer to upload the Lambda function.

In the Environment variables section, choose Edit.
Choose Add environment variable.
Enter the following key-values in the following table (customize as desired):

Key	Value	Purpose
Athena_SQL_Drop_Phrase	`--dropstore`	String to embed in Athena drop table queries to cause associated Amazon S3 objects to be removed
db_list	`Comma-separated regex filter` `<.*>`	Allows you to limit which databases may contain tables that autoCleanS3 is allowed to purge
ddb_history_table	`GlueHistoryDDB`	The name of the AWS Glue history DynamoDB table
disable_s3_cleanup	`False`	If set to True, it disables the Amazon S3 purge, still recording attempts in the history table
log_level	`INFO`	Set to DEBUG to troubleshoot if needed

You must use a standard regex expression, which can be a simple comma-separated list of the AWS Glue databases that you want autoCleanS3 to evaluate.

The following table shows example patterns for db_list.

Example Regex Pattern	Result
.*	Default, includes all databases
`clickstream_web, orders_web, default`	Includes only `clickstream_web, orders_web, default`
`.*_web`	Includes all databases having names ending in `_web`
`.stream.`	Includes all databases containing `stream` in their name

For a complete list or supported patterns, see https://docs.python.org/3/library/re.html#re.Pattern.match

Choose Save.

Creating EventBridge rules

You need to create EventBridge rules that invoke your Lambda function whenever Athena query events and AWS Glue CreateTable and UpdateTable events are generated.

Creating the Athena event rule

To create the Athena query event rule, complete the following steps:

On the EventBridge console, choose Create rule.
For Name, enter autoCleanS3-AthenaQueryEvent.
For Description, enter Amazon Athena event for any query to trigger autoCleanS3.
For Define pattern, choose Event pattern.
For Event matching pattern, choose Custom pattern.

For Event pattern, enter the following:

{
	"detail-type": [
		"AWS API Call via CloudTrail"
	],
	"source": [
		"aws.athena"
	],
	"detail": {
		"eventName": [
			"StartQueryExecution"
		]
	}
}

Choose Save.
For Select targets, choose Lambda function.
For Function¸ choose autoClean3.
Choose Create.

Creating the AWS Glue event rule

To create the AWS Glue table event rule, complete the following steps:

On the EventBridge console, choose Create rule.
For Name, enter autoCleanS3-GlueTableEvent.
For Description, enter AWS Glue event for any table creation or update to trigger autoCleanS3.
For Define pattern, choose Event pattern.
For Event matching pattern, choose Custom pattern.

For Event pattern, enter the following:

{
	"detail-type": [
		"Glue Data Catalog Database State Change"
	],
	"source": [
		"aws.glue"
	],
	"detail": {
		"typeOfChange": [
			"CreateTable",
			"UpdateTable"
		]
	}
}

Choose Save.
For Select targets, choose Lambda function.
For Function¸ choose autoClean3.
Choose Create.

You’re finished!

Testing the solution

Make sure you already have a data lake with tables defined in your AWS Glue Data Catalog and permission to access Athena. For this post, we use NYC taxi ride data. For more information, see Build a Data Lake Foundation with AWS Glue and Amazon S3.

Create a new table using Athena CTAS.

Next, verify that the entry appears in the new GlueHistoryDDB table.

On the DynamoDB console, open the GlueHistoryDDB table.
Choose Items.
Confirm the s3_path value for the table.

You can also view the Amazon S3 table path and objects associated with the table.

On the Amazon S3 console, navigate to the s3_path found in GlueHistoryDDB.
Confirm the table and path containing the data folder and associated manifest and metadata objects.

Drop the table using the keyword --dropstore.

Check the Amazon S3 path to verify both the table folder and associated manifest and metadata files have been removed.

You can also see the purged attribute for the entry in GlueHistoryDDB is now set to True, and update_time has been updated, which you can use if you ever need to look back and understand when a purge event occurred.

Considerations

The Lambda timeout may need to be increased for very large tables, because the object deletion operations may not complete in time.

To prevent accidental data deletion, it’s recommended to carefully limit which databases may participate (Lambda environment variable db_list) and to enable versioning on the Athena bucket path and set up Amazon S3 lifecycle policies to eventually remove older versions. For more information, see Deleting object versions.

Conclusion

In this post, we demonstrated how to automate the process of deleting Amazon S3 objects associated with dropped AWS Glue tables. Deleting Amazon S3 objects that are no longer associated with an AWS Glue table reduces ongoing storage expense, management overhead, and unnecessary exposure of potentially private data no longer needed within the organization, allowing you to meet regulatory requirements.

This serverless solution monitors Athena and AWS Glue table creation and drop events via CloudWatch, and triggers Lambda to perform Amazon S3 object deletion. We use DynamoDB to store the audit history of all AWS Glue tables that have been dropped over time. It’s strongly recommended to enable Amazon S3 bucket versioning to prevent accidental data deletion.

To restore the Amazon S3 objects for the deleted table, you first identify the s3_path value for the relevant table entry in GlueHistoryDDB and either copy or remove the delete marker from objects in that path. For more information, see How do I undelete a deleted S3 object?

About the Author

David Roberts is a Senior Solutions Architect at AWS. His passion is building efficient and effective solutions on the cloud, especially involving analytics and data lake governance. Besides spending time with his wife and two daughters, he likes drumming and watching movies, and is an avid video gamer.