Tag Archives: Intermediate (200)

Visualize AWS WAF logs with an Amazon CloudWatch dashboard

2023-01-26 Diana Alvarado

Post Syndicated from Diana Alvarado original https://aws.amazon.com/blogs/security/visualize-aws-waf-logs-with-an-amazon-cloudwatch-dashboard/

AWS WAF is a web application firewall service that helps you protect your applications from common exploits that could affect your application’s availability and your security posture. One of the most useful ways to detect and respond to malicious web activity is to collect and analyze AWS WAF logs. You can perform this task conveniently by sending your AWS WAF logs to Amazon CloudWatch Logs and visualizing them through an Amazon CloudWatch dashboard.

In this blog post, I’ll show you how to use Amazon CloudWatch to monitor and analyze AWS WAF activity using the options in CloudWatch metrics, Contributor Insights, and Logs Insights. I’ll also walk you through how to deploy this solution in your own AWS account by using an AWS CloudFormation template.

Prerequisites

This blog post builds on the concepts introduced in the blog post Analyzing AWS WAF Logs in Amazon CloudWatch Logs. There we introduced how to natively set up AWS WAF logging to Amazon CloudWatch Logs, and discussed the basic options that are available for visualizing and analyzing the data provided in the logs.

The only AWS services that you need to turn on for this solution are Amazon CloudWatch and AWS WAF. The solution assumes that you’ve previously set up AWS WAF log delivery to Amazon CloudWatch Logs. If you have not done so, follow the instructions for AWS WAF logging destinations – CloudWatch Logs.

You will need to provide the following parameters for the CloudFormation template:

CloudWatch log group name for the AWS WAF logs
The AWS Region for the logs
The name of the AWS WAF web access control list (web ACL)

Solution overview

The architecture of the solution is outlined in Figure 1. The solution takes advantage of the native integration available between AWS WAF and CloudWatch, which simplifies the setup and management of this solution.

Figure 1: Solution architecture

In the solution, the logs are sent to CloudWatch (when you enable log delivery). From there, they’re ready to be consumed by all the different service options that CloudWatch offers, including the ones that we’ll use in this solution: CloudWatch Logs Insights and Contributor Insights.

Deploy the solution

Choose the following Launch stack button to launch the CloudFormation stack in your account.

You’ll be redirected to the CloudFormation service in the AWS US East (N. Virginia) Region, which is the default Region to deploy this solution, although this can vary depending on where your web ACL is located. You can change the Region as preferred. The template will spin up multiple cloud resources, such as the following:

CloudWatch Logs Insights queries
CloudWatch Contributor Insights visuals
CloudWatch dashboard

The solution is quickly deployed to your account and is ready to use in less than 30 minutes. You can use the solution when the status of the stack changes to CREATE_COMPLETE.

As a measure to control costs, you can also choose whether to create the Contributor Insights rules and enable them by default. For more information on costs, see the Cost considerations section later in this post.

Explore and validate the dashboard

When the CloudFormation stack is complete, you can choose the Outputs tab in the CloudFormation console and then choose the dashboard link. This will take you to the CloudWatch service in the AWS Management Console. The dashboard time range presents information for the last hour of activity by default, and can go up to one week, but keep in mind that Contributor Insights has a maximum time range of 24 hours. You can also select a different dashboard refresh interval from 10 seconds up to 15 minutes.

The dashboard provides the following information from CloudWatch.

Rule name	Description
WAF_top_terminating_rules	This rule shows the top rules where the requests are being terminated by AWS WAF. This can help you understand the main cause of blocked requests.
WAF_top_ips	This rule shows the top source IPs for requests. This can help you understand if the traffic and activity that you see is spread across many IPs or concentrated in a small group of IPs.
WAF_top_countries	This rule shows the main source countries for the IPs in the requests. This can help you visualize where the traffic is originating.
WAF_top_user_agents	This rule shows the main user agents that are being used to generate the requests. This will help you isolate problematic devices or identify potential false positives.
WAF_top_uri	This rule shows the main URIs in the requests that are being evaluated. This can help you identify if one specific path is the target of activity.
WAF_top_http	This rule shows the HTTP methods used for the requests examined by AWS WAF. This can help you understand the pattern of behavior of the traffic.
WAF_top_referrer_hosts	This rule shows the main referrer from which requests are being sent. This can help you identify incorrect or suspicious origins of requests based on the known application flow.
WAF_top_rate_rules	This rule shows the main rate rules being applied to traffic. This can help you understand volumetric activity identified by AWS WAF.
WAF_top_labels	This rule shows the top labels found in logs. This can help you visualize the main rules that are matching on the requests evaluated by AWS WAF.
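
To make the rules in the preceding table more concrete, here is a minimal sketch of what a Contributor Insights rule body for something like WAF_top_ips could look like. It is not the definition shipped in the template. The function name and the log group name are hypothetical, and the key `$.httpRequest.clientIp` is an assumption based on the field name used in the Logs Insights queries in this post.

```python
import json


def build_top_ips_rule(log_group_name: str) -> str:
    """Return a Contributor Insights rule body that ranks client IPs by request count.

    Illustrative only: the rule deployed by the CloudFormation template may differ.
    """
    rule = {
        "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
        "LogGroupNames": [log_group_name],
        "LogFormat": "JSON",
        "Contribution": {
            # Each distinct value of this JSON field becomes one contributor.
            "Keys": ["$.httpRequest.clientIp"],
            "Filters": [],
        },
        # Rank contributors by the number of matching log events.
        "AggregateOn": "Count",
    }
    return json.dumps(rule, indent=2)


if __name__ == "__main__":
    # Hypothetical log group name; AWS WAF log groups start with "aws-waf-logs-".
    print(build_top_ips_rule("aws-waf-logs-example"))
```

Swapping the key for another log field (for example, the country or URI field) would give a sketch of the other "top N" rules in the table.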

The dashboard also provides the following information from the default CloudWatch metrics sent by AWS WAF.

Metric name	Description
AllowedvsBlockedRequests	This metric shows the number of all blocked and allowed requests. This can help you understand the number of requests that AWS WAF is actively blocking.
Bot Requests vs non-Bot requests	This visual shows the number of requests identified as bots versus non-bots (if you’re using AWS WAF Bot Control).
All Requests	This metric shows the number of all requests, separated by bot and non-bot origin. This can help you understand all requests that AWS WAF is evaluating.
CountedRequests	This metric shows the number of all counted requests. This can help you understand the requests that are matching a rule but not being blocked, and aid the decision of a configuration change during the testing phase.
CaptchaRequests	This metric shows requests that go through the CAPTCHA rule.
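
If you want to pull the same allowed and blocked counts outside the dashboard, the following sketch builds the query list that you could pass as `MetricDataQueries` to the CloudWatch GetMetricData API. It assumes a regional web ACL (CloudFront web ACLs don't use the Region dimension) and uses `ALL` as the rule dimension to aggregate across rules. The function name and the web ACL name are hypothetical.

```python
def build_waf_metric_queries(web_acl_name: str, region: str, period_seconds: int = 300) -> list:
    """Build GetMetricData queries for allowed and blocked request counts.

    Sketch under assumptions: regional web ACL, totals across all rules.
    """
    queries = []
    for metric_name in ("AllowedRequests", "BlockedRequests"):
        queries.append({
            # Query IDs must start with a lowercase letter.
            "Id": metric_name.lower(),
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/WAFV2",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "WebACL", "Value": web_acl_name},
                        {"Name": "Region", "Value": region},
                        {"Name": "Rule", "Value": "ALL"},
                    ],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": True,
        })
    return queries


if __name__ == "__main__":
    for query in build_waf_metric_queries("example-web-acl", "us-east-1"):
        print(query["Id"], query["MetricStat"]["Metric"]["MetricName"])
```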

Figure 2 shows an example of how the CloudWatch dashboard displays the data within this solution. You can rearrange and customize the elements within the dashboard as needed.

Figure 2: Example dashboard

You can review each of the queries and rules deployed with this solution. You can also customize these baseline queries and rules to provide more detailed information or to add custom queries and rules to the solution code. For more information on how to build queries and use CloudWatch Logs and Contributor Insights, see the CloudWatch documentation.

Use the dashboard for monitoring

After you’ve set up the dashboard, you can monitor the activity of the sites that are protected by AWS WAF. If suspicious activity is reported, you can use the visuals to understand the traffic in more detail, and drive incident response actions as needed.

Let’s consider an example of how to use your new dashboard and its data to drive security operations decisions. Suppose that you have a website that sells custom clothing at a bargain price. It has a sign-up link to receive offers, and you’re getting reports of unusual activity by the application team. By looking at the metrics for the web ACL that protects the site, you can see the main country for source traffic and the contributing URIs, as shown in Figure 3. You can also see that most of the activity is being detected by rules that you have in place, so you can set the rules to block traffic, or if they are already blocking, you can just monitor the activity.

Figure 3: Metrics on website activity

You can use the same visuals to decide whether an AWS WAF rule with high activity can be changed to autoblock suspicious web traffic without affecting valid customer traffic. By looking at the top terminating rules and cross-referencing information, such as source IPs, user agents, top URIs, and other request identifiers, you can understand the traffic pattern and activity of different applications and endpoints. From here, you can investigate further by using specific queries with CloudWatch Logs Insights.

Operational and security management with CloudWatch Logs Insights

You can use CloudWatch Logs Insights to interactively search and analyze log data in Amazon CloudWatch Logs using advanced queries to effectively investigate operational issues and security incidents.

Examine a bot reported as a false positive

You can use CloudWatch Logs Insights to identify requests that have specific labels to understand where the traffic is originating from based on source IP address and other essential event details. A simple example is investigating requests flagged as potential false positives.

Imagine that you have a reported false positive request that was flagged as a non-browser by AWS WAF Bot Control. You can run the non-browser user agent query that was created by the provided template on CloudWatch Logs Insights, as shown in the following example, and then verify the source IPs for the top hits for this rule group. Or you can look for a specific request that has been flagged as a false positive, in order to review the details and make adjustments as needed.

fields @timestamp, httpRequest.clientIp
| filter @message like "awswaf:managed:aws:bot-control:signal:non_browser_user_agent"
| parse @message "\"labels\":[*]" as Labels
| stats count(*) as requestCount by httpRequest.clientIp
| display @timestamp, httpRequest.clientIp, httpRequest.uri, Labels
| sort requestCount desc
| limit 10
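
To see what the aggregation in the preceding query does (filter by label, count per client IP, sort, and limit), you can reproduce it offline on a few exported log records. The following is a minimal sketch, assuming each record is a parsed AWS WAF log entry with a `labels` list of `{"name": ...}` objects. The function name and the sample records are hypothetical.

```python
from collections import Counter


def top_ips_for_label(records, label, limit=10):
    """Count requests per client IP for records that carry the given label.

    Mirrors the filter / stats / sort / limit steps of the Logs Insights query.
    """
    counts = Counter()
    for record in records:
        names = {item.get("name") for item in record.get("labels", [])}
        if label in names:
            counts[record["httpRequest"]["clientIp"]] += 1
    # most_common sorts by count, highest first.
    return counts.most_common(limit)


if __name__ == "__main__":
    # Label name as listed in the AWS WAF Bot Control documentation.
    label = "awswaf:managed:aws:bot-control:signal:non_browser_user_agent"
    sample = [
        {"httpRequest": {"clientIp": "192.0.2.10"}, "labels": [{"name": label}]},
        {"httpRequest": {"clientIp": "192.0.2.10"}, "labels": [{"name": label}]},
        {"httpRequest": {"clientIp": "198.51.100.7"}, "labels": [{"name": label}]},
        {"httpRequest": {"clientIp": "203.0.113.5"}, "labels": []},
    ]
    print(top_ips_for_label(sample, label))
```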

The non-browser user agent query also allows you to confirm whether this request has other rule hits that were in count mode and were non-terminating; you can do this by examining the labels. If there are multiple rules matching the requests, that can be an indicator of suspicious activity.

If you have a CAPTCHA challenge configured on the endpoint, you can also look at CAPTCHA responses. The CaptchaTokenqueryDefinition query provided in this solution uses a variation of the preceding format, and can display the main IPs from which bad tokens are being sent. The following example shows the query, and Figure 4 shows the query results. If you see both signals from non-browser user agents and missing CAPTCHA tokens, that is a strong indicator of suspicious activity.

fields @timestamp, httpRequest.clientIp 
| filter captchaResponse.failureReason = "TOKEN_MISSING" 
| stats count(*) as requestCount by httpRequest.clientIp, httpRequest.country 
| sort requestCount desc 
| limit 10

Figure 4: Main IP addresses and number of counts for CAPTCHA responses
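
You can also run a query like this programmatically. The following sketch only builds the arguments for the CloudWatch Logs StartQuery API, which you could pass to `boto3.client("logs").start_query(**args)` and then poll with `get_query_results`. The function name and log group name are hypothetical, and the one-hour window mirrors the dashboard default rather than anything the template requires.

```python
from datetime import datetime, timedelta, timezone

CAPTCHA_QUERY = """fields @timestamp, httpRequest.clientIp
| filter captchaResponse.failureReason = "TOKEN_MISSING"
| stats count(*) as requestCount by httpRequest.clientIp, httpRequest.country
| sort requestCount desc
| limit 10"""


def build_start_query_args(log_group_name: str, end: datetime, hours: int = 1) -> dict:
    """Build StartQuery arguments covering the `hours` before `end`."""
    start = end - timedelta(hours=hours)
    return {
        "logGroupName": log_group_name,
        # StartQuery expects epoch seconds, not milliseconds.
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": CAPTCHA_QUERY,
    }


if __name__ == "__main__":
    args = build_start_query_args("aws-waf-logs-example", datetime.now(timezone.utc))
    print(args["startTime"], args["endTime"])
```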

This information can provide an indication of the main source of activity. You can then use other visuals, like top user agents or top referrers, to provide more context to the information and inform further actions, such as adding new rules to the AWS WAF configuration.

You can adapt the queries provided in the sample solution to other use cases by using the fields provided in the left-hand pane of CloudWatch Logs Insights.

Cost considerations

Configuring AWS WAF to send logs to Amazon CloudWatch Logs doesn’t have an additional cost. The cost incurred is for the use of the CloudWatch features and services, such as log storage and retention, Contributor Insights rules enabled, Logs Insights queries run, matched log events, and CloudWatch dashboards. For detailed information on the pricing of these features, see the CloudWatch Logs pricing information. You can also get an estimate of potential costs by using the AWS pricing calculator for CloudWatch.

One way to help offset the cost of CloudWatch features and services is to restrict the use of the dashboard and enforce a log retention policy for AWS WAF that makes it cost effective. Using the queries and monitoring only as needed can also help reduce costs: you can limit how often you run Logs Insights queries, and you can enable the Contributor Insights rules only when you need them, which limits the matched log events. AWS WAF also provides the option to filter the logs that are sent when logging is enabled. For more information, see AWS WAF log filtering.
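
As an illustration of enforcing a retention policy, the following sketch builds the arguments for the CloudWatch Logs PutRetentionPolicy API. CloudWatch Logs accepts only specific retention values, so the helper rounds a desired number of days up to the next accepted value. The list reflects the values documented at the time of writing, so check the current API reference before you rely on it. The function and log group names are hypothetical.

```python
import bisect

# Retention values accepted by PutRetentionPolicy (assumption: verify against
# the current CloudWatch Logs API reference).
VALID_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731,
    1096, 1827, 2192, 2557, 2922, 3288, 3653,
]


def build_retention_args(log_group_name: str, desired_days: int) -> dict:
    """Round desired_days up to the nearest accepted value and build the call arguments."""
    if desired_days < 1:
        raise ValueError("desired_days must be at least 1")
    index = bisect.bisect_left(VALID_RETENTION_DAYS, desired_days)
    if index == len(VALID_RETENTION_DAYS):
        raise ValueError("desired_days exceeds the longest accepted retention")
    return {
        "logGroupName": log_group_name,
        "retentionInDays": VALID_RETENTION_DAYS[index],
    }


if __name__ == "__main__":
    # Could be passed to boto3.client("logs").put_retention_policy(**args).
    print(build_retention_args("aws-waf-logs-example", 45))
```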

Conclusion

In this post, you learned how to use a pre-built CloudWatch dashboard to monitor AWS WAF activity by using metrics and Contributor Insights rules. The dashboard can help you identify traffic patterns and activity, and you can use the sample Logs Insights queries to explore the log information in more detail and examine false positives and suspicious activity, for rule tuning.

For more information on AWS WAF and the features mentioned in this post, see the AWS WAF documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS WAF re:Post.


Automate deployment and version updates for Amazon Kinesis Data Analytics applications with AWS CodePipeline

2023-01-26 Anand Shah

Post Syndicated from Anand Shah original https://aws.amazon.com/blogs/big-data/automate-deployment-and-version-updates-for-amazon-kinesis-data-analytics-applications-with-aws-codepipeline/

Amazon Kinesis Data Analytics is the easiest way to transform and analyze streaming data in real time using Apache Flink. Customers are already using Kinesis Data Analytics to perform real-time analytics on fast-moving data generated from data sources like IoT sensors, change data capture (CDC) events, gaming, social media, and many others. Apache Flink is a popular open-source framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Although building Apache Flink applications is typically the responsibility of a data engineering team, automating the deployment and provisioning infrastructure as code (IaC) is usually owned by the platform (or DevOps) team.

The following are typical responsibilities of the data engineering role:

Write code for real-time analytics Apache Flink applications
Roll out new application versions or roll them back (for example, in the case of a critical bug)

The following are typical responsibilities of the platform role:

Write code for IaC
Provision the required resources in the cloud and manage their access

In this post, we show how you can automate deployment and version updates for Kinesis Data Analytics applications and allow both platform and data engineering teams to effectively collaborate and co-own the final solution using AWS CodePipeline with the AWS Cloud Development Kit (AWS CDK).

Solution overview

To demonstrate the automated deployment and version update of a Kinesis Data Analytics application, we use the following example real-time data analytics architecture for this post.

The workflow includes the following steps:

An AWS Lambda function (acting as data source) is the event producer pushing events on demand to Amazon Kinesis Data Streams when invoked.
The Kinesis data stream receives and stores real-time events.
The Kinesis Data Analytics application reads events from the data stream and performs real-time analytics on them.

Generic architecture

You can refer to the following generic architecture to adapt this example to your preferred CI/CD tool (for example, Jenkins). The overall deployment process is divided into three high-level parts:

Infrastructure CI/CD – This portion is highlighted in orange. The infrastructure CI/CD pipeline is responsible for deploying all the real-time streaming architecture components, including the Kinesis Data Analytics application and any connected resources typically deployed using AWS CloudFormation.
ApplicationStack – This portion is highlighted in gray. The application stack is deployed by the infrastructure CI/CD component using AWS CloudFormation.
Application CI/CD – This portion is highlighted in green. The application CI/CD pipeline updates the Kinesis Data Analytics application in three steps:
1. The pipeline builds the Java or Python source code of the Kinesis Data Analytics application and produces the application as a binary file.
2. The pipeline pushes the latest binary file to the Amazon Simple Storage Service (Amazon S3) artifact bucket after a successful build, because Kinesis Data Analytics application binary files are referenced from Amazon S3.
3. The S3 bucket file put event triggers a Lambda function, which updates the version of the Kinesis Data Analytics application by deploying the latest binary.
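
As a rough illustration of the third step, a Lambda function triggered by an S3 put event first needs to work out which binary was uploaded. The following sketch extracts the bucket and key from a standard S3 event notification and keeps only JAR files under a prefix. It is not the function from the sample repository. The function name, the bucket name, and the `jars/` prefix default are assumptions for illustration.

```python
from urllib.parse import unquote_plus


def jar_locations_from_s3_event(event: dict, prefix: str = "jars/") -> list:
    """Return (bucket, key) pairs for JAR files referenced in an S3 event notification."""
    locations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix) and key.endswith(".jar"):
            locations.append((bucket, key))
    return locations


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-artifact-bucket"},
                    "object": {"key": "jars/kinesis-analytics-application-final.jar"}}},
        ]
    }
    print(jar_locations_from_s3_event(sample_event))
```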

The following diagram illustrates this workflow.

CI/CD architecture with CodePipeline

In this post, we implement the generic architecture using CodePipeline. The following diagram illustrates our updated architecture.

The final solution includes the following steps:

The platform (DevOps) team and data engineering team push their source code to their respective code repositories.
CodePipeline deploys the whole infrastructure as three stacks:
1. InfraPipelineStack – Contains a pipeline to deploy the overall infrastructure.
2. ApplicationPipelineStack – Contains a pipeline to build and deploy Kinesis Data Analytics application binaries. In this post, we build a Java source using the JavaBuildPipeline AWS CDK construct. You can use the PythonBuildPipeline AWS CDK construct to build a Python source.
3. ApplicationStack – Contains real-time data analytics pipeline resources including Lambda (data source), Kinesis Data Streams (storage), and Kinesis Data Analytics (Apache Flink application).

Deploy resources using AWS CDK

The following GitHub repository contains the AWS CDK code to create all the necessary resources for the data pipeline. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

Clone the GitHub repository to your local computer using the following command:

git clone https://github.com/aws-samples/automate-deployment-and-version-update-of-kda-application

Download and install the latest Node.js.
Run the following command to install the latest version of AWS CDK:

npm install -g aws-cdk

Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command.

cdk bootstrap aws://123456789012/us-east-1

To learn more about the bootstrapping process, refer to Bootstrapping.
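
If you script the bootstrap step, a small helper can catch a mistyped account ID before the AWS CDK CLI runs. The following sketch builds the `aws://ACCOUNT/REGION` environment string used in the preceding command. The function name is hypothetical, and the Region check is only a loose sanity check, not an authoritative list of Regions.

```python
import re


def cdk_environment(account_id: str, region: str) -> str:
    """Return the environment string for `cdk bootstrap`, for example aws://123456789012/us-east-1."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    # Loose shape check only (for example us-east-1 or ap-southeast-2).
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
        raise ValueError(f"unexpected Region format: {region!r}")
    return f"aws://{account_id}/{region}"


if __name__ == "__main__":
    print("cdk bootstrap", cdk_environment("123456789012", "us-east-1"))
```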

Part 1: Data engineering and platform teams push source code to their code repositories

The data engineering and platform teams begin work in their respective code repositories, as illustrated in the following figure.

In this post, we use two folders instead of two GitHub repositories, which you can find under the root folder of the cloned repository:

kinesis-analytics-application – This folder contains example source code of the Kinesis Data Analytics application. This represents your Kinesis Data Analytics application source code developed by your data engineering team.
infrastructure-cdk – This folder contains example AWS CDK source code of the final solution used for provisioning all the required resources and CodePipeline. You can reuse this code for your Kinesis Data Analytics application deployment.

Application development teams usually store the application source code in Git repositories. For demonstration purposes, we use the source code as a .zip file downloaded from GitHub instead of connecting CodePipeline to the GitHub repository. You may want to connect the source repository directly to CodePipeline. To learn more about how to connect, refer to Create a connection to GitHub.

Part 2: The platform team deploys the application pipeline

The following figure illustrates the next step in the workflow.

In this step, you deploy the first pipeline to build the Java source code from kinesis-analytics-application. Complete the following steps to deploy ApplicationPipelineStack:

Open your terminal, bash, or command window depending on your OS.
Switch the current path to the folder infrastructure-cdk.
Run npm install to download all dependencies.
Run cdk deploy ApplicationPipelineStack to deploy the application pipeline.

This process should take about 5 minutes to complete and deploys the following resources to your AWS account, highlighted in green in the preceding diagram:

CodePipeline, containing stages for AWS CodeBuild and AWS CodeDeploy
An S3 bucket to store binaries
A Lambda function to update the Kinesis Data Analytics application JAR after manual approval

Trigger an automatic build for the application pipeline

After the cdk deploy command is successful, complete the following steps to automatically run the pipeline:

Download the source code .zip file.
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose the stack ApplicationPipelineStack.
On the Outputs tab, choose the link for the key ArtifactBucketLink.

You’re redirected to the S3 artifact bucket.

Choose Upload.
Upload the source code .zip file you downloaded.

The first pipeline run (shown as Auto Build in the following diagram) starts automatically and takes about 5 minutes to reach the manual approval stage. The pipeline automatically downloads the source code from the artifact bucket, builds the Java project kinesis-analytics-application using Maven, and publishes the output binary JAR file back to the artifact bucket under the directory jars.

View the application pipeline run

Complete the following steps to view the application pipeline run:

On the AWS CloudFormation console, navigate to the stack ApplicationPipelineStack.
On the Outputs tab, choose the link for the key ApplicationCodePipelineLink.

You’re redirected to the pipeline details page. You can see a detailed view of the pipeline, including the state of each action in each stage and the state of the transitions.

Do not approve the build for the manual approval stage yet; this is done later.

Part 3: The platform team deploys the infrastructure pipeline

The application pipeline run publishes a JAR file named kinesis-analytics-application-final.jar to the artifact bucket. Next, we deploy the Kinesis Data Analytics architecture. Complete the following steps to deploy the example flow:

Open a terminal, bash, or command window depending on your OS.
Switch the current path to the folder infrastructure-cdk.
Run cdk deploy InfraPipelineStack to deploy the infrastructure pipeline.

This process should take about 5 minutes to complete and deploys a pipeline containing stages for CodeBuild and CodeDeploy to your AWS account, as highlighted in green in the following diagram.

When the cdk deploy is complete, the infrastructure pipeline run starts automatically (shown as Auto Build 1 in the following diagram) and takes about 10 minutes to download the source code from the artifact bucket, build the AWS CDK project infrastructure-stack, and deploy ApplicationStack automatically to your AWS account. When the infrastructure pipeline run is complete, the following resources are deployed to your account (shown in green in the following diagram):

A CloudFormation stack named app-ApplicationStack
A Lambda function acting as a data source
A Kinesis data stream acting as the stream storage
A Kinesis Data Analytics application with the first version of kinesis-analytics-application-final.jar

View the infrastructure pipeline run

Complete the following steps to view the infrastructure pipeline run:

On the AWS CloudFormation console, navigate to the stack InfraPipelineStack.
On the Outputs tab, choose the link for the key InfraCodePipelineLink.

You’re redirected to the pipeline details page. You can see a detailed view of the pipeline, including the state of each action in each stage and the state of the transitions.

Part 4: The data engineering team deploys the application

Now your account has everything in place for the data engineering team to work independently and roll out new versions of the Kinesis Data Analytics application. You can approve the respective application build from the application pipeline to deploy new versions of the application. The following diagram illustrates the full workflow.

The build process starts automatically when it detects changes in the source code. You can test a version update by re-uploading the source code .zip file to the S3 artifact bucket. In a real-world use case, you update the main branch either via a pull request or by merging your changes, and this action triggers a new pipeline run automatically.

View the current application version

To view the current version of the Kinesis Data Analytics application, complete the following steps:

On the AWS CloudFormation console, navigate to the stack InfraPipelineStack.
On the Outputs tab, choose the link for the key KDAApplicationLink.

You’re redirected to the Kinesis Data Analytics application details page. You can find the current application version by looking at Version ID.

Approve the application deployment

Complete the following steps to approve the deployment (or version update) of the Kinesis Data Analytics application:

On the AWS CloudFormation console, navigate to the stack ApplicationPipelineStack.
On the Outputs tab, choose the link for the key ApplicationCodePipelineLink.
Choose Review from the pipeline approval stage.
When prompted, choose Approve to provide approval (optionally adding any comments) for the Kinesis Data Analytics application deployment or version update.
Repeat the steps mentioned earlier to view the current application version.

You should see the application version as defined in Version ID increased by one, as shown in the following screenshot.

Deploying a new version of the Kinesis Data Analytics application will cause a downtime of around 5 minutes because the Lambda function responsible for the version update makes the API call UpdateApplication, which restarts the application after updating the version. However, the application resumes stream processing where it left off after the restart.
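
For reference, here is a hedged sketch of the request that such a Lambda function might send to UpdateApplication. It only builds the arguments that you could pass to `boto3.client("kinesisanalyticsv2").update_application(**args)`. It is not the code from the sample repository. The function, application, and bucket names are hypothetical. The current version ID would normally come from a DescribeApplication call (`ApplicationDetail.ApplicationVersionId`).

```python
def build_update_application_args(application_name: str, current_version_id: int,
                                  bucket_name: str, file_key: str) -> dict:
    """Build UpdateApplication arguments that point the application at a new binary in S3."""
    return {
        "ApplicationName": application_name,
        # The service rejects the update if this doesn't match the current version.
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "ApplicationCodeConfigurationUpdate": {
                "CodeContentTypeUpdate": "ZIPFILE",
                "CodeContentUpdate": {
                    "S3ContentLocationUpdate": {
                        "BucketARNUpdate": f"arn:aws:s3:::{bucket_name}",
                        "FileKeyUpdate": file_key,
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    args = build_update_application_args(
        "example-kda-application", 1,
        "example-artifact-bucket", "jars/kinesis-analytics-application-final.jar",
    )
    print(args["ApplicationConfigurationUpdate"])
```

Passing the current version ID explicitly is what makes the version shown in Version ID increase by one on each successful update.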

Clean up

Complete the following steps to delete your resources and stop incurring costs:

On the AWS CloudFormation console, select the stack InfraPipelineStack and choose Delete.
Select the stack app-ApplicationStack and choose Delete.
Select the stack ApplicationPipelineStack and choose Delete.
On the Amazon S3 console, select the bucket with the name starting with javaappCodePipeline and choose Empty.
Enter permanently delete to confirm the choice.
Select the bucket again and choose Delete.
Confirm the action by entering the bucket name when prompted.
Repeat these steps to delete the bucket with the name starting with infrapipelinestack-pipelineartifactsbucket.
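
If you have many buckets, a short script can help you find the two to clean up. The following sketch filters a list of bucket names by the prefixes mentioned in the preceding steps. Because S3 bucket names are lowercase, the comparison ignores case. The function name and sample bucket names are hypothetical, and the sketch only selects names; it does not delete anything.

```python
CLEANUP_PREFIXES = (
    "javaappcodepipeline",
    "infrapipelinestack-pipelineartifactsbucket",
)


def buckets_to_clean(bucket_names, prefixes=CLEANUP_PREFIXES):
    """Return the bucket names that start with one of the cleanup prefixes."""
    lowered = tuple(prefix.lower() for prefix in prefixes)
    # str.startswith accepts a tuple of candidate prefixes.
    return [name for name in bucket_names if name.lower().startswith(lowered)]


if __name__ == "__main__":
    names = [
        "javaappcodepipeline-example-1a2b3c",
        "infrapipelinestack-pipelineartifactsbucket-example",
        "unrelated-bucket",
    ]
    print(buckets_to_clean(names))
```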

Summary

This post demonstrated how to automate deployment and version updates for your Kinesis Data Analytics applications using CodePipeline and AWS CDK.

For more information, see Continuous integration and delivery (CI/CD) using CDK Pipelines and CodePipeline tutorials.

About the Author

Anand Shah is a Big Data Prototyping Solutions Architect at AWS. He works with AWS customers and their engineering teams to build prototypes using AWS analytics services and purpose-built databases. Anand helps customers solve the most challenging problems using the art of the possible technology. He enjoys beaches in his leisure time.

How to run AWS CloudHSM workloads in container environments

2023-01-25 Derek Tumulak

Post Syndicated from Derek Tumulak original https://aws.amazon.com/blogs/security/how-to-run-aws-cloudhsm-workloads-on-docker-containers/

January 25, 2023: We updated this post to reflect the fact that CloudHSM SDK3 does not support serverless environments and we strongly recommend deploying SDK5.

AWS CloudHSM provides hardware security modules (HSMs) in the AWS Cloud. With CloudHSM, you can generate and use your own encryption keys in the AWS Cloud, and manage your keys by using FIPS 140-2 Level 3 validated HSMs. Your HSMs are part of a CloudHSM cluster. CloudHSM automatically manages synchronization, high availability, and failover within a cluster.

CloudHSM is part of the AWS Cryptography suite of services, which also includes AWS Key Management Service (AWS KMS), AWS Secrets Manager, and AWS Private Certificate Authority (AWS Private CA). AWS KMS, Secrets Manager, and AWS Private CA are fully managed services that are convenient to use and integrate. You’ll generally use CloudHSM only if your workload requires single-tenant HSMs under your own control, or if you need cryptographic algorithms or interfaces that aren’t available in the fully managed alternatives.

CloudHSM offers several options for you to connect your application to your HSMs, including PKCS#11, Java Cryptography Extensions (JCE), OpenSSL Dynamic Engine, or Microsoft Cryptography API: Next Generation (CNG). Regardless of which library you choose, you’ll use the CloudHSM client to connect to HSMs in your cluster.

In this blog post, I’ll show you how to use Docker to develop, deploy, and run applications by using the CloudHSM SDK, and how to manage and orchestrate workloads by using tools and services like Amazon Elastic Container Service (Amazon ECS), Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS), and Jenkins.

Solution overview

This solution demonstrates how to create a Docker container that uses the CloudHSM JCE SDK to generate a key and use it to encrypt and decrypt data.

Note: In this example, you must manually enter the crypto user (CU) credentials as environment variables when you run the container. For production workloads, you’ll need to consider how to secure and automate the handling and distribution of these credentials. You should work with your security or compliance officer to ensure that you’re using an appropriate method of securing HSM login credentials. For more information on securing credentials, see AWS Secrets Manager.
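
As one small illustration of handling these credentials more carefully, the following sketch builds a `docker run` command that forwards the CU credentials by variable name only. With `--env NAME` (no value), Docker copies the value from the calling environment, so the password doesn't appear in the command line. The variable names `HSM_USER` and `HSM_PASSWORD` and the image name are assumptions for illustration, and this is not a substitute for a secrets management service.

```python
import os

# Assumed variable names for the crypto user credentials.
REQUIRED_VARS = ("HSM_USER", "HSM_PASSWORD")


def docker_run_command(image: str, environ=os.environ) -> list:
    """Build a docker run command that passes credentials through by name."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    command = ["docker", "run", "--rm"]
    for name in REQUIRED_VARS:
        # No "=value": Docker reads the value from the caller's environment.
        command += ["--env", name]
    command.append(image)
    return command


if __name__ == "__main__":
    demo_env = {"HSM_USER": "example_user", "HSM_PASSWORD": "example_password"}
    # Hypothetical image name.
    print(" ".join(docker_run_command("jce_sample_client", demo_env)))
```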

Figure 1 shows the solution architecture. The Java application, running in a Docker container, integrates with JCE and communicates with CloudHSM instances in a CloudHSM cluster through HSM elastic network interfaces (ENIs). The Docker container runs in an EC2 instance, and access to the HSM ENIs is controlled with a security group.

Figure 1: Architecture diagram

Prerequisites

To implement this solution, you need to have working knowledge of the following items:

CloudHSM
Docker 20.10.17 – used at the time of this post
Java 8 or Java 11 – supported at the time of this post
Maven 3.0.5 – used at the time of this post

Here’s what you’ll need to follow along with my example:

An active CloudHSM cluster with at least one active HSM instance. You can follow the CloudHSM getting started guide to create, initialize, and activate a CloudHSM cluster.

Note: For a production cluster, you should have at least two active HSM instances spread across Availability Zones in the Region.
An Amazon Linux 2 EC2 instance in the same virtual private cloud (VPC) in which you created your CloudHSM cluster. The Amazon Elastic Compute Cloud (Amazon EC2) instance must have the CloudHSM cluster security group attached—this security group is automatically created during the cluster initialization and is used to control network access to the HSMs. To learn about attaching security groups to allow EC2 instances to connect to your HSMs, see Create a cluster in the AWS CloudHSM User Guide.
A CloudHSM crypto user (CU) account. You can create a CU by following the steps in the topic Managing HSM users in AWS CloudHSM in the AWS CloudHSM User Guide.

Solution details

In this section, I’ll walk you through how to download, configure, compile, and run a solution in Docker.

To set up Docker and run the application that encrypts and decrypts data with a key in AWS CloudHSM

On your Amazon Linux EC2 instance, install Docker by running the following command.
# sudo yum -y install docker
Start the docker service.
# sudo service docker start
Create a new directory and move to it. In my example, I use a directory named cloudhsm_container. You’ll use the new directory to configure the Docker image.
# mkdir cloudhsm_container
# cd cloudhsm_container
Copy the CloudHSM cluster’s trust anchor certificate (customerCA.crt) to the directory that you just created. You can find the trust anchor certificate on a working CloudHSM client instance under the path /opt/cloudhsm/etc/customerCA.crt. The certificate is created during initialization of the CloudHSM cluster and is required to connect to the CloudHSM cluster. This enables our application to validate that the certificate presented by the CloudHSM cluster was signed by our trust anchor certificate.
In your new directory (cloudhsm_container), create a new file with the name run_sample.sh that includes the following contents. The script runs the Java class that is used to generate an Advanced Encryption Standard (AES) key to encrypt and decrypt your data.
```
#! /bin/bash

# start application
echo -e "\n* Entering AES GCM encrypt/decrypt sample in Docker ... \n"

java -ea -jar target/assembly/aesgcm-runner.jar -method environment

echo -e "\n* Exiting AES GCM encrypt/decrypt sample in Docker ... \n"
```
In the new directory, create another new file and name it Dockerfile (with no extension). This file will specify that the Docker image is built with the following components:
- The CloudHSM client package.
- The CloudHSM Java JCE package.
- OpenJDK 1.8 (Java 8). This is needed to compile and run the Java classes and JAR files.
- Maven, a build automation tool that is needed to assist with building the Java classes and JAR files.
- The AWS CloudHSM Java JCE samples that will be downloaded and built as part of the solution.

Cut and paste the following contents into Dockerfile.

Note: You will need to customize your Dockerfile, as follows:

Make sure to specify the SDK version to replace the one specified in the pom.xml file in the sample code. As of the writing of this post, the most current version is 5.7.0. To find the SDK version, follow the steps in the topic Check your client SDK version. For more information, see the Building section in the README file for the CloudHSM JCE examples.

Make sure to update the HSM_IP line with the IP of an HSM in your CloudHSM cluster. You can get your HSM IPs from the CloudHSM console, or by running the describe-clusters AWS CLI command.

	# Use the Amazon Linux 2 image
	FROM amazonlinux:2

	# Pass HSM IP address as a build argument
	ARG HSM_IP

	# Install the CloudHSM JCE provider package
	RUN yum install -y https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-jce-latest.el7.x86_64.rpm

	# Install Java, Maven, wget, unzip and ncurses-compat-libs
	RUN yum install -y java maven wget unzip ncurses-compat-libs

	# Create a work dir
	WORKDIR /app

	# Download sample code
	RUN wget https://github.com/aws-samples/aws-cloudhsm-jce-examples/archive/refs/heads/sdk5.zip

	# Unzip sample code
	RUN unzip sdk5.zip

	# Change to the created directory
	WORKDIR aws-cloudhsm-jce-examples-sdk5

	# Build JAR files using the installed CloudHSM JCE Provider version
	RUN export CLOUDHSM_CLIENT_VERSION=`rpm -qi cloudhsm-jce | awk -F': ' '/Version/ {print $2}'` \
	        && mvn validate -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION \
	        && mvn clean package -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION

	# Configure cloudhsm-client
	COPY customerCA.crt /opt/cloudhsm/etc/
	RUN /opt/cloudhsm/bin/configure-jce -a $HSM_IP

	# Copy the run_sample.sh script
	COPY run_sample.sh .

	# Run the script
	CMD ["bash","run_sample.sh"]
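
The HSM_IP build argument can be read from the output of the describe-clusters AWS CLI command. As a minimal sketch, assuming that output has been saved as JSON (for example, with `aws cloudhsmv2 describe-clusters`), the following Python helper lists the ENI IP addresses of active HSMs. The helper name `active_hsm_ips` and the sample cluster data are illustrative assumptions; the field names (`Clusters`, `Hsms`, `State`, `EniIp`) follow the shape of the DescribeClusters response.

```python
import json


def active_hsm_ips(describe_clusters_output, cluster_id=None):
    """Return the ENI IP addresses of ACTIVE HSMs in a DescribeClusters response."""
    ips = []
    for cluster in describe_clusters_output.get("Clusters", []):
        # Optionally restrict the search to a single cluster.
        if cluster_id is not None and cluster.get("ClusterId") != cluster_id:
            continue
        for hsm in cluster.get("Hsms", []):
            if hsm.get("State") == "ACTIVE" and hsm.get("EniIp"):
                ips.append(hsm["EniIp"])
    return ips


# Illustrative sample shaped like the CLI output; IDs and addresses are made up.
sample = json.loads("""
{"Clusters": [{"ClusterId": "cluster-example1",
  "Hsms": [{"HsmId": "hsm-a", "State": "ACTIVE", "EniIp": "10.0.1.10"},
           {"HsmId": "hsm-b", "State": "CREATE_IN_PROGRESS", "EniIp": "10.0.2.10"}]}]}
""")
print(active_hsm_ips(sample))
```

Any one of the returned addresses can be passed as the HSM_IP build argument.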

Now you’re ready to build the Docker image. Run the following command, with the name jce_sample. This command will let you use the Dockerfile that you created in step 6 to create the image.
# sudo docker build --build-arg HSM_IP="<your HSM IP address>" -t jce_sample .
To run a Docker container from the Docker image that you just created, run the following command. Make sure to replace the user and password with your actual CU username and password. (If you need help setting up your CU credentials, see prerequisite 3. For more information on how to provide CU credentials to the AWS CloudHSM Java JCE Library, see Providing credentials to the JCE provider in the CloudHSM User Guide.)
# sudo docker run --env HSM_USER=<user> --env HSM_PASSWORD=<password> jce_sample

If successful, the output should look like this:
```
	* Entering AES GCM encrypt/decrypt sample in Docker ... 

	737F92D1B7346267D329C16E
	Successful decryption

	* Exiting AES GCM encrypt/decrypt sample in Docker ...
```
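
For production workloads, the earlier note suggests AWS Secrets Manager rather than typing CU credentials on the command line. The following Python sketch shows one possible shape for that, under stated assumptions: the secret name `cloudhsm/cu-credentials` and its JSON layout (`username`, `password`) are hypothetical, and `FakeSecretsClient` is a stand-in so that the example runs without AWS access. The only Secrets Manager call used is `get_secret_value`.

```python
import json


def load_cu_credentials(secrets_client, secret_id):
    """Read CU credentials from a secret and map them to the container's variables.

    Assumes the secret stores JSON like {"username": "...", "password": "..."}.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {"HSM_USER": secret["username"], "HSM_PASSWORD": secret["password"]}


def docker_run_command(image="jce_sample"):
    """Build a docker run command that forwards the variables by name only.

    With `--env NAME` (no value), Docker copies NAME from the caller's
    environment, so the password never appears in the argument list.
    """
    return ["docker", "run", "--env", "HSM_USER", "--env", "HSM_PASSWORD", image]


class FakeSecretsClient:
    """Stand-in for boto3.client("secretsmanager") so the sketch runs offline."""

    def get_secret_value(self, SecretId):
        payload = {"username": "example_cu", "password": "example-pass"}
        return {"SecretString": json.dumps(payload)}


env = load_cu_credentials(FakeSecretsClient(), "cloudhsm/cu-credentials")
print(sorted(env), docker_run_command())
```

In real use, you would pass `boto3.client("secretsmanager")` instead of the fake client and start the container with something like `subprocess.run(docker_run_command(), env={**os.environ, **env}, check=True)`. Whether this approach meets your requirements is a decision to make with your security or compliance officer.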

Conclusion

This solution provides an example of how to run CloudHSM client workloads in Docker containers. You can use the solution as a reference to implement your cryptographic application in a way that benefits from the high availability and load balancing built in to CloudHSM without compromising the flexibility that Docker provides for developing, deploying, and running applications.

If you have comments about this post, submit them in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

How to improve security incident investigations using Amazon Detective finding groups

2023-01-25 Anna McAbee

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/how-to-improve-security-incident-investigations-using-amazon-detective-finding-groups/

Uncovering the root cause of an Amazon GuardDuty finding can be a complex task, requiring security operations center (SOC) analysts to collect a variety of logs, correlate information across logs, and determine the full scope of affected resources.

Sometimes you need to do this type of in-depth analysis because investigating individual security findings in isolation doesn’t always capture the full impact of affected resources.

With Amazon Detective, you can analyze and visualize various logs and relationships between AWS entities to streamline your investigation. In this post, you will learn how to use a feature of Detective—finding groups—to simplify and expedite the investigation of a GuardDuty finding.

Detective uses machine learning, statistical analysis, and graph theory to generate visualizations that help you to conduct faster and more efficient security investigations. The finding groups feature reduces triage time and provides a clear view of related GuardDuty findings. With finding groups, you can investigate entities and security findings that might have been overlooked in isolation. Finding groups also map GuardDuty findings and their relevant tactics, techniques, and procedures to the MITRE ATT&CK framework. By using MITRE ATT&CK, you can better understand the event lifecycle of a finding group.

Finding groups are automatically enabled for both existing and new customers in AWS Regions that support Detective. There is no additional charge for finding groups. If you don’t currently use Detective, you can start a free 30-day trial.

Use finding groups to simplify an investigation

Because finding groups are enabled by default, you start your investigation by simply navigating to the Detective console. You will see these finding groups in two different places: the Summary and the Finding groups pages. On the Finding groups overview page, you can also use the search capability to look for collected metadata for finding groups, such as severity, title, finding group ID, observed tactics, AWS accounts, entities, finding ID, and status. The entities information can help you narrow down finding groups that are more relevant for specific workloads.

Figure 1 shows the finding groups area on the Summary page in the Amazon Detective console, which provides high-level information on some of the individual finding groups.

Figure 1: Detective console summary page

Figure 2 shows the Finding groups overview page, with a list of finding groups filtered by status. The finding group shown has a status of Active.

Figure 2: Detective console finding groups overview page

You can choose the finding group title to see details like the severity of the finding group, the status, scope time, parent or child finding groups, and the observed tactics from the MITRE ATT&CK framework. Figure 3 shows a specific finding group details page.

Figure 3: Detective console showing a specific finding group details page

Below the finding group details, you can review the entities and associated findings for this finding group, as shown in Figure 4. From the Involved entities tab, you can pivot to the entity profile pages for more details about that entity’s behavior. From the Involved findings tab, you can select a finding to review the details pane.

Figure 4: Detective console showing involved entities of a finding group

In Figure 4, the search functionality on the Involved entities tab is being used to look at involved entities that are of type AWS role or EC2 instance. With such a search filter in Detective, you have more data in a single place to understand which Amazon Elastic Compute Cloud (Amazon EC2) instances and AWS Identity and Access Management (IAM) roles were involved in the GuardDuty finding and what findings were associated with each entity. You can also select these different entities to see more details. With finding groups, you no longer have to craft specific log searches or search for the AWS resources and entities that you should investigate. Detective has done this correlation for you, which reduces the triage time and provides a more comprehensive investigation.

With the release of finding groups, Detective infers relationships between findings and groups them together, providing a more convenient starting point for investigations. Detective has evolved from helping you determine which resources are related to a single entity (for example, what EC2 instances are communicating with a malicious IP), to correlating multiple related findings together and showing what MITRE tactics are aligned across those findings, helping you better understand a more advanced single security event.

Conclusion

In this blog post, we showed how you can use Detective finding groups to simplify security investigations through grouping related GuardDuty findings and AWS entities, which provides a more comprehensive view of the lifecycle of the potential security incident. Finding groups are automatically enabled for both existing and new customers in AWS Regions that support Detective. There is no additional charge for finding groups. If you don’t currently use Detective, you can start a free 30-day trial. For more information on finding groups, see Analyzing finding groups in the Amazon Detective User Guide.

If you have feedback about this post, submit comments in the Comments section below. You can also start a new thread on the Amazon Detective re:Post or contact AWS Support.

Use AWS WAF CAPTCHA to protect your application against common bot traffic

2023-01-19 Abhinav Bannerjee

Post Syndicated from Abhinav Bannerjee original https://aws.amazon.com/blogs/security/use-aws-waf-captcha-to-protect-your-application-against-common-bot-traffic/

In this blog post, you’ll learn how you can use a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) with other AWS WAF controls as part of a layered approach to provide comprehensive protection against bot traffic. We’ll describe a workflow that tracks the number of incoming requests to a site’s store page. The workflow then limits those requests if they exceed a certain threshold. Requests from IP addresses that exceed the threshold will be presented a CAPTCHA challenge to prove that the requests are being made by a human.

Amazon Web Services (AWS) offers many tools and recommendations that companies can use as they face challenges with bot traffic on their websites. Web applications can be compromised through a variety of vectors, including cross-site scripting, SQL injection, path traversal, local file inclusion, and distributed denial-of-service (DDoS) attacks. AWS WAF offers managed rules that are designed to provide protection against common application vulnerabilities or other unwanted web traffic, without requiring you to write your own rules.

There are some web attacks like web scraping, credential stuffing, and layer 7 DDoS attempts conducted by bots (as well as by humans) that target sensitive areas of your website, such as your store page. A CAPTCHA mitigates undesirable traffic by requiring the visitor to complete challenges before they are allowed to access protected resources. You can implement CAPTCHA to help prevent unwanted activities. Last year, AWS introduced AWS WAF CAPTCHA, which allows customers to set up AWS WAF rules that require CAPTCHA challenges to be completed for common targets such as forms (for example, search forms).

Scenario

Consider an attack where the unauthorized user is attempting to overwhelm a site’s store page by repeatedly sending search requests for different items.

Assume that traffic visits a website that is hosted through Amazon CloudFront and attempts the above behavior on the /store URL. In this scenario, there is a rate-based rule in place that will track the number of requests coming in from each IP. This rate-based rule tracks the rate of requests for each originating IP address and invokes the rule action on IPs with rates that go over the limit. With CAPTCHA implemented as the rule action, excessive attempts to search within a 5-minute window will result in a CAPTCHA challenge being presented to the user. This workflow is shown in Figure 1.

Figure 1: User visits a store page and is evaluated by a rate-based rule
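
To make the counting behavior in this scenario concrete, here is a simplified Python model of a rate-based rule with a scope-down statement. It is a sketch for intuition only and does not describe how AWS WAF implements rate tracking internally; the class name, the `ALLOW` and `CAPTCHA` return values, and the sample IP address are assumptions made for illustration.

```python
from collections import defaultdict, deque


class RateBasedRule:
    """Simplified per-IP sliding-window counter with a scope-down path match."""

    def __init__(self, limit=100, window_seconds=300, path_contains="/store"):
        self.limit = limit
        self.window_seconds = window_seconds
        self.path_contains = path_contains
        self._hits = defaultdict(deque)  # source IP -> timestamps of matching requests

    def evaluate(self, ip, path, now):
        """Return "CAPTCHA" once an IP exceeds the limit inside the window, else "ALLOW"."""
        if self.path_contains not in path:
            return "ALLOW"  # outside the scope-down statement, so not counted
        hits = self._hits[ip]
        hits.append(now)
        # Drop timestamps that have aged out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        return "CAPTCHA" if len(hits) > self.limit else "ALLOW"


# A low limit keeps the demonstration short: the fourth request trips the rule.
rule = RateBasedRule(limit=3, window_seconds=300)
actions = [rule.evaluate("203.0.113.7", "/store/search", t) for t in range(5)]
print(actions)
```

Requests from other IP addresses, or to paths outside /store, are counted separately or not at all, which is the effect of aggregating on the source IP and scoping the rule down to one URI path.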

When a user solves a CAPTCHA challenge, AWS automatically generates and encrypts a token and sends it to the client as a cookie. The client requests aren’t challenged again until the token has expired. AWS WAF calculates token expiration by using the immunity time configuration. You can configure the immunity time in a web access control list (web ACL) CAPTCHA configuration and in the configuration for a rule’s action setting. When a user provides an incorrect answer to a CAPTCHA challenge, the challenge informs the user and loads a new puzzle. When the user solves the challenge, the challenge automatically submits the original web request, updated with the CAPTCHA token from the successful puzzle completion.
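
The effect of the immunity time can be sketched in a few lines of Python. This models only the expiry decision described above; the real token is an encrypted cookie that AWS WAF generates and validates, and the `CaptchaToken` type and `needs_challenge` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaptchaToken:
    solved_at: float  # epoch seconds when the puzzle was solved


def needs_challenge(token, now, immunity_seconds=300):
    """Return True when the client has no token or its token has expired."""
    if token is None:
        return True
    return now - token.solved_at > immunity_seconds


token = CaptchaToken(solved_at=1000.0)
print(needs_challenge(None, 1000.0))    # no token yet
print(needs_challenge(token, 1200.0))   # still inside the immunity time
print(needs_challenge(token, 1301.0))   # immunity time has elapsed
```

A shorter immunity time means that clients are challenged more often; a longer one reduces friction for legitimate users but gives a solved token a longer useful life.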

Walkthrough

This workflow will require an AWS WAF rule within a new or existing rule group or web ACL. The rule will define how web requests are inspected and the action to take.

To create an AWS WAF rate-based rule

Open the AWS WAF console and in the left navigation pane, choose Web ACLs.
Choose an existing web ACL, or choose Create web ACL at the top right to create a new web ACL.
Under Rules, choose Add rules, and then in the drop-down list, choose Add my own rules and rule groups.
For Rule type, choose Rule builder.
In the Rule builder section, for Name, enter your rule name. For Type, choose Rate-based rule.
In the Request rate details section, enter your rate limit (for example, 100). For IP address to use for rate limiting, choose Source IP address, and for Criteria to count requests toward rate limit, choose Only consider requests that match criteria in a rule statement.
For Count only the requests that match the following statement, choose Matches the statement from the drop-down list.
In the Statement section, for Inspect, choose URI path. For Match type, choose Contains string.
For String to match, enter the URI path of your web page (for example, /store).
In the Action section, choose CAPTCHA.
(Optional) For Immunity time, choose Set a custom immunity time for this rule, or keep the default value (300 seconds).
To finish, choose Add rule, and then choose Save to add the rule to your web ACL.

After you add the rule, go to the Rules tab of your web ACL and navigate to your rule. Confirm that the output resembles what is shown in Figure 2. You should have a rate-based rule with a scope-down statement that matches the store URI path you entered earlier, and the action should be set to CAPTCHA.

Figure 2: Finished rate-based rule with CAPTCHA action

The following is the JSON for the CAPTCHA rule that you just created. You can use this to validate your configuration. You can also use this JSON in the rule builder while creating the rule.

{
  "Name": "CaptchaOnRBR",
  "Priority": 0,
  "Statement": {
    "RateBasedStatement": {
      "Limit": 100,
      "AggregateKeyType": "IP",
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "SearchString": "/store",
          "FieldToMatch": {
            "UriPath": {}
          },
          "TextTransformations": [
            {
              "Priority": 0,
              "Type": "NONE"
            }
          ],
          "PositionalConstraint": "CONTAINS"
        }
      }
    }
  },
  "Action": {
    "Captcha": {}
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "CaptchaOnRBR"
  },
  "CaptchaConfig": {
    "ImmunityTimeProperty": {
      "ImmunityTime": 60
    }
  }
}
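
If you manage several similar rules, it can help to generate this JSON rather than edit it by hand. The following Python sketch builds a dictionary in the same shape as the rule above; the function name `build_captcha_rate_rule` and its parameters are assumptions for illustration, and the default values simply mirror the example.

```python
import json


def build_captcha_rate_rule(name, uri_path, limit=100, immunity_seconds=60, priority=0):
    """Return a rule dict in the same shape as the CAPTCHA rate-based rule JSON."""
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit,
                "AggregateKeyType": "IP",
                "ScopeDownStatement": {
                    "ByteMatchStatement": {
                        "SearchString": uri_path,
                        "FieldToMatch": {"UriPath": {}},
                        "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                        "PositionalConstraint": "CONTAINS",
                    }
                },
            }
        },
        "Action": {"Captcha": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,
        },
        "CaptchaConfig": {"ImmunityTimeProperty": {"ImmunityTime": immunity_seconds}},
    }


rule = build_captcha_rate_rule("CaptchaOnRBR", "/store")
print(json.dumps(rule, indent=2))
```

The printed JSON can be compared with the rule shown in the console to validate your configuration.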

After you complete this configuration, the rule will be invoked when an IP address unsuccessfully attempts to search the store at a rate that exceeds the threshold. This user will be presented with a CAPTCHA challenge, as shown in Figure 3. If the user is successful, they will be routed back to the store page. Otherwise, they will be served a new puzzle until it is solved.

Figure 3: CAPTCHA challenge presented to a request that exceeded the threshold

Implementing rate-based rules and CAPTCHA also allows you to track IP addresses, limit the number of invalid search attempts, and use the specific IP information available to you within sampled requests and AWS WAF logs to work to prevent that traffic from affecting your resources. Additionally, you have visibility into IP addresses blocked by rate-based rules so that you can later add these addresses to a block list or create custom logic as needed to mitigate false positives.

Conclusion

In this blog post, you learned how to configure and deploy a CAPTCHA challenge with AWS WAF that checks for web requests that exceed a certain rate threshold and requires the client sending such requests to solve a challenge. Note that enabling CAPTCHA on your web ACL incurs an additional charge (see the AWS WAF pricing page). Although CAPTCHA challenges are simple for humans to complete, they should be harder for common bots to complete with any meaningful rate of success. You can use a CAPTCHA challenge when a block action would stop too many legitimate requests, but letting all traffic through would result in unacceptably high levels of unwanted requests, such as from bots.

For more information and guidance on AWS WAF rate-based rules, see the blog post The three most important AWS WAF rate-based rules and the AWS whitepaper AWS Best Practices for DDoS Resiliency. You can also check out these additional resources:

Using AWS WAF with CAPTCHA (YouTube video)
Best practices for using the CAPTCHA and Challenge actions (AWS WAF Developer Guide)
Reduce Unwanted Traffic on Your Website with New AWS WAF Bot Control
Fine-tune and optimize AWS WAF Bot Control mitigation capability
Detect and block advanced bot traffic

If you have feedback about this blog post, submit comments in the Comments section below. You can also start a new thread on AWS WAF re:Post to get answers from the community.

A dive into redBus’s data platform and how they used Amazon QuickSight to accelerate business insights

2023-01-18 Girish Kumar Chidananda

Post Syndicated from Girish Kumar Chidananda original https://aws.amazon.com/blogs/big-data/a-dive-into-redbuss-data-platform-and-how-they-used-amazon-quicksight-to-accelerate-business-insights/

This post is co-authored with Girish Kumar Chidananda from redBus.

redBus is one of the earliest adopters of AWS in India, and most of its services and applications are hosted on the AWS Cloud. AWS provided redBus the flexibility to scale their infrastructure rapidly while keeping costs extremely low. AWS has a comprehensive suite of services to cater to most of their needs, including providing customer support that redBus can vouch for.

In this post, we share redBus’s data platform architecture, and how various components are connected to form their data highway. We also discuss the challenges redBus faced in building dashboards for their real-time business intelligence (BI) use cases, and how they used Amazon QuickSight, a fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within redBus to build visualizations and perform ad hoc analysis to gain business insights from their data, any time, and on any device.

About redBus

redBus is the world’s largest online bus ticketing platform built in India and serving more than 36 million happy customers around the world. Along with its bus ticketing vertical, redBus also runs a rail ticketing service called redRails and a bus and car rental service called rYde. It is part of the GO-MMT group, which is India’s leading online travel company, with an extensive brand portfolio that includes other prominent online travel brands like MakeMyTrip and Goibibo.

redBus’s data highway 1.0

redBus relies heavily on making data-driven decisions at every level, including tracking traveler journeys, forecasting demand during high traffic, identifying and addressing bottlenecks in its bus operator signup process, and more. As redBus’s business started growing in terms of the number of cities and countries they operated in and the number of bus operators and travelers using the service in each city, the amount of incoming data also increased. The need to access and analyze the data in one place required them to build their own data platform, as shown in the following diagram.

redBus data platform 1.0

In the following sections, we look at each component in more detail.

Data ingestion sources

With the data platform 1.0, the data is ingested from various sources:

Real time – Real-time data flows from the redBus mobile apps and the backend microservices whenever a passenger, bus operator, or application performs an operation such as booking bus tickets, searching the bus inventory, uploading a KYC document, and more
Batch mode – Scheduled jobs fetch data from multiple persistent data stores like Amazon Relational Database Service (Amazon RDS), where the OLTP data from all its applications are stored, Apache Cassandra clusters, where the bus inventory from various operators is stored, Arango DB, where the user identity graphs are stored, and more

Data cataloging

The real-time data is ingested into their self-managed Apache NiFi clusters, an open-source data platform that is used to clean, analyze, and catalog the data with its routing capabilities before sending the data to its destination.

Storage and analytics

redBus uses the following services for its storage and analytical needs:

Amazon Simple Storage Service (Amazon S3), an object storage service that provides the foundation for their data lake because of its virtually unlimited scalability and high durability. Real-time data flows in from Apache Druid, and data from the data stores arrives at regular intervals based on the job schedules.
Apache Druid, an OLAP-style data store (data flows via Kafka Druid data loader), which computes facts and metrics against various dimensions during the data loading process.
Amazon Redshift, a cloud data warehouse service that helps you analyze exabytes of data and run complex analytical queries. redBus uses Amazon Redshift to store the processed data from Amazon S3 and the aggregated data from Apache Druid.

Querying and visualization

To make redBus as data-driven as possible, they ensured that the data is accessible to their SRE engineers, data engineers, and business analysts via a visualization layer. This layer features dashboards being served using Apache Superset, an open-source data visualization application, and Amazon Athena, an interactive query service to analyze data in Amazon S3 using standard SQL for ad hoc querying requirements.

The challenges

Initially, redBus handled data that was being ingested at the rate of 10 million events per day. Over time, as its business started growing, so did the data volume (from gigabytes to terabytes to petabytes), data ingestion per day (from 10 million to 320 million events), and its business intelligence dashboard needs. Soon after, they started facing challenges with their self-managed Superset’s BI capabilities, and the increased operational complexities.

Limited BI capabilities

redBus encountered the following BI limitations:

Inability to create visualizations from multiple data sources – Superset doesn’t allow creating visualizations from multiple tables within its data exploration layer. redBus data engineers had to have the tables joined beforehand at the data source level itself. In order to create a 360-degree view for redBus’s business stakeholders, it became inconvenient for data engineers to maintain multiple tables supporting the visualization layer.
No global filter for visuals in a dashboard – A global or primary filter across visuals in a dashboard is not supported in Superset. For example, consider there are visuals like Sales Wins by Region, YTD Revenue Realized by Region, Sales Pipeline by Region, and more in a dashboard, and a filter Region is added to the dashboard with values like EMEA, APAC, and US. The filter Region will only apply to one of the visuals, not the entire dashboard. However, dashboard users expected filtering across the dashboard.
Not a business-user friendly tool – Superset is highly developer centric when it comes to customization. For example, if a redBus business analyst had to customize a timed refresh that automatically re-queries every slice on a dashboard according to a pre-set value, then the analyst has to update the dashboard’s JSON metadata field. Therefore, having knowledge of JSON and its syntax is mandatory for doing any customization on the visuals or dashboard.

Increased operational cost

Although Superset is open source, which means there are no licensing costs, it also means there is more effort in maintaining all the components required for it to function as an enterprise-grade BI tool. redBus has deployed and maintained a web server (Nginx) fronted by an Application Load Balancer to do the load balancing; a metadata database server (MySQL) where Superset stores its internal information like users, slices, and dashboard definitions; an asynchronous task queue (Celery) for supporting long-running queries; a message broker (RabbitMQ); and a distributed caching server (Redis) for caching the results, charting data, and more on Amazon Elastic Compute Cloud (Amazon EC2) instances. The following diagram illustrates this architecture.

Apache Superset Deployment at redBus

redBus’s DevOps team had to do the heavy lifting of provisioning the infrastructure, taking backups, scaling the components manually as needed, upgrading the components individually, and more. It also required a Python web developer to be available to make the configuration changes so that all the components work together seamlessly. All these manual operations increased the total cost of ownership for redBus.

Journey towards QuickSight

redBus started exploring BI solutions primarily around a couple of its dashboarding requirements:

BI dashboards for business stakeholders and analysts, where the data is sourced via Amazon S3 and Amazon Redshift.
A real-time application performance monitoring (APM) dashboard to help their SRE engineers and developers identify the root cause of an issue in their microservices deployment so they can fix the issues before they affect their customers’ experience. In this case, the data is sourced via Druid.

QuickSight fit into most of redBus’s BI dashboard requirements, and in no time their data platform team started with a proof of concept (POC) for a couple of their complex dashboards. At the end of the POC, which spanned a month’s time, the team shared their findings.

First, QuickSight is rich in BI capabilities, including the following:

It’s a self-service BI solution with drag-and-drop features that could help redBus analysts comfortably use it without any coding efforts.
Visualizations from multiple data sources in a single dashboard could help redBus business stakeholders get a 360-degree view of sales, forecasting, and insights in a single pane of glass.
Cascading filters across visuals and across sheets in a dashboard are much-needed features for redBus’s BI requirements.
QuickSight offers Excel-like visuals—tables with calculations, pivot tables with cell grouping, and styling are attractive for the viewers.
The Super-fast, Parallel, In-memory Calculation Engine (SPICE) in QuickSight could help redBus scale to hundreds of thousands of users, who can all simultaneously perform fast interactive analysis across a wide variety of AWS data sources.
Off-the-shelf ML insights and forecasting at no additional cost would allow redBus’s data science team to focus on ML models besides sales forecasting and similar models.
Built-in row-level security (RLS) could allow redBus to grant filtered access for their viewers. For example, redBus has many business analysts who manage different countries. With RLS, each business analyst only sees data related to their assigned country within a single dashboard.
redBus uses OneLogin as its identity provider, which supports Security Assertion Markup Language 2.0 (SAML 2.0). With the help of identity federation and single sign-on support from QuickSight, redBus could provide a simple onboarding flow for their QuickSight users.
QuickSight offers built-in alerts and email notification capabilities.
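
The row-level security item in the list above can be illustrated with a small Python sketch. QuickSight applies RLS through rules attached to a dataset rather than through application code, so this only models the outcome; the user names, countries, and booking numbers are hypothetical.

```python
def visible_rows(rows, rules, user):
    """Return only the rows whose country is granted to the user.

    rules maps a user name to the set of countries that user may see;
    a user with no entry sees nothing.
    """
    allowed = rules.get(user, set())
    return [row for row in rows if row["country"] in allowed]


sales = [
    {"country": "India", "bookings": 120},
    {"country": "Singapore", "bookings": 45},
    {"country": "Peru", "bookings": 30},
]
rules = {"analyst_india": {"India"}, "analyst_latam": {"Peru"}}
print(visible_rows(sales, rules, "analyst_india"))
```

Each analyst opens the same dashboard, but the rows behind the visuals are filtered to the countries assigned to that analyst.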

Secondly, QuickSight is a fully managed, cloud-native, serverless BI service offering from AWS, with the following features:

redBus engineers don’t need to focus on the heavy lifting of provisioning, scaling, and maintaining their BI solution on EC2 instances.
QuickSight offers native integration with AWS services like Amazon Redshift, Amazon S3, and Athena, and other popular frameworks like Presto, Snowflake, Teradata, and more. QuickSight connects to most of the data sources that redBus already has except Apache Druid, because native integration with Druid was not available as of December 2022. For a complete list of the supported data sources, see Supported data sources.

The outcome

Considering all the rich features and lower total cost of ownership, redBus chose QuickSight for their BI dashboard requirements. With QuickSight, redBus’s data engineers have built a number of dashboards in no time to give insights from petabytes of data to business stakeholders and analysts. The redBus data highway evolved to bring business intelligence to a much wider audience in their organization, with better performance and faster time-to-value. As of November 2022, it combines QuickSight for business users and Superset for real-time APM dashboards (at the time of writing, QuickSight doesn’t offer a native connector to Druid), as shown in the following diagram.

redBus data platform 2.0

Sales anomaly detection dashboard

Although there are many dashboards that redBus deployed to production, sales anomaly detection is one of the interesting dashboards that redBus built. It uses redBus’s proprietary sales forecasting model, which in turn is sourced by historical sales data from Amazon Redshift tables and real-time sales data from Druid tables, as shown in the following figure.

Sales anomaly detection data flow

At regular intervals, the scheduled jobs feed the redBus forecasting model with real-time and historical sales data, and then the forecasted data is pushed into an Amazon Redshift table. The sales anomaly detection dashboard in QuickSight is served by the resultant Amazon Redshift table.

The following is one of the visuals from the sales anomaly detection dashboard. It’s built using a line chart representing hourly actual sales, predicted sales, and an alert threshold for a time series for a particular business cohort in redBus.

Sales and Predicted Sales for a particular cohort

In this visual, each bar represents the number of sales anomalies triggered at a particular point in the time series.

redBus’s analysts could further drill down to the sales details and anomalies at the minute level, as shown in the following diagram. This drill-down feature comes out of the box with QuickSight.

Drill-Down Chart - Sales and Predicted Sales for a particular cohort

For more details on adding drill-downs to QuickSight dashboard visuals, see Adding drill-downs to visual data in Amazon QuickSight.

Apart from the visuals, it has become one of viewers’ favorite dashboards at redBus due to the following notable features:

Because filtering across visuals is an out-of-the-box feature in QuickSight, a timestamp-based filter is added to the dashboard. This helps in filtering multiple visuals in the dashboard in a single click.
URL actions configured on the visuals help the viewers navigate to the context-sensitive in-house applications.
Email alerts configured on KPIs and Gauge visuals help the viewers get notifications on time.

Next steps

Apart from building new dashboards for their BI dashboard needs, redBus is taking the following next steps:

Exploring QuickSight Embedded Analytics for a couple of their application requirements to accelerate time to insights for users with in-context data visuals, interactive dashboards, and more directly within applications
Exploring QuickSight Q, which could enable their business stakeholders to ask questions in natural language and receive accurate answers with relevant visualizations that can help them gain insights from the data
Building a unified dashboarding solution using QuickSight covering all their data sources as integrations become available

Conclusion

In this post, we showed you how redBus built its data platform using various AWS services and Apache frameworks, the challenges the platform went through (especially in their BI dashboard requirements and challenges while scaling), and how they used QuickSight and lowered the total cost of ownership.

To learn more about engineering at redBus, check out their Medium blog posts. To learn more about what is happening in QuickSight, or if you have any questions, reach out to the QuickSight Community, which is very active and offers several resources.

About the Authors

Girish Kumar Chidananda works as a Senior Engineering Manager – Data Engineering at redBus, where he has been building various data engineering applications and components for redBus for the last 5 years. Prior to starting his journey in the IT industry, he worked as a Mechanical and Control systems engineer in various organizations, and he holds an MS degree in Fluid Power Engineering from University of Bath.

Kayalvizhi Kandasamy works with digital-native companies to support their innovation. As a Senior Solutions Architect (APAC) at Amazon Web Services, she uses her experience to help people bring their ideas to life, focusing primarily on microservice architectures and cloud-native solutions using AWS services. Outside of work, she likes playing chess and is a FIDE-rated chess player. She also coaches her daughters in the art of playing chess and prepares them for various chess tournaments.

Automate data lineage on Amazon MWAA with OpenLineage

2023-01-18 Stephen Said

Post Syndicated from Stephen Said original https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/

In modern data architectures, datasets are combined across an organization using a variety of purpose-built services to unlock insights. As a result, data governance becomes a key component for data consumers and producers to know that their data-driven decisions are based on trusted and accurate datasets. One aspect of data governance is data lineage, which captures the flow of data as it goes through various systems and allows consumers to understand how a dataset was derived.

In order to capture data lineage consistently across various analytical services, you need to use a common lineage model and a robust job orchestration that is able to tie together diverse data flows. One possible solution is the open-source OpenLineage project. It provides a technology-agnostic metadata model for capturing data lineage and integrates with widely used tools. For job orchestration, it integrates with Apache Airflow, which you can run on AWS conveniently through the managed service Amazon Managed Workflows for Apache Airflow (Amazon MWAA). OpenLineage provides a plugin for Apache Airflow that extracts data lineage from Directed Acyclic Graphs (DAGs).

In this post, we show how to get started with data lineage on AWS using OpenLineage. We provide a step-by-step configuration guide for the openlineage-airflow plugin on Amazon MWAA. Additionally, we share an AWS Cloud Development Kit (AWS CDK) project that deploys a pre-configured demo environment for evaluating and experiencing OpenLineage first-hand.

OpenLineage on Apache Airflow

In the following example, Airflow turns OLTP data into a star schema on Amazon Redshift Serverless.

After staging and preparing source data from Amazon Simple Storage Service (Amazon S3), fact and dimension tables are eventually created. For this, Airflow orchestrates the execution of SQL statements that create and populate tables on Redshift Serverless.

Overview on DAGs in Amazon MWAA

The openlineage-airflow plugin collects metadata about creation of datasets and dependencies between them. This allows us to move from a jobs-centric approach of Airflow to a datasets-centric approach, improving the observability of workflows.

The following screenshot shows parts of the captured lineage for the previous example. It’s displayed in Marquez, an open-source metadata service for collection and visualization of data lineage with support for the OpenLineage standard. In Marquez, you can analyze the upstream datasets and transformations that eventually create the user dimension table on the right.

Data lineage graph in marquez

The example in this post is based on SQL and Amazon Redshift. OpenLineage also supports other transformation engines and data stores such as Apache Spark and dbt.

Solution overview

The following diagram shows the AWS setup required to capture data lineage using OpenLineage.

Solution overview

The workflow includes the following components:

The openlineage-airflow plugin is configured on Airflow as a lineage backend. Metadata about the DAG runs is passed by Airflow core to the plugin, which converts it into OpenLineage format and sends it to an external metadata store. In our demo setup, we use Marquez as the metadata store.
The openlineage-airflow plugin receives its configuration from environment variables. To populate these variables on Amazon MWAA, a custom Airflow plugin is used. First, the plugin reads source values from AWS Secrets Manager. Then, it creates environment variables.
Secrets Manager is configured as a secrets backend. Typically, this type of configuration is stored in Airflow’s native metadata database. However, this approach has limitations. For instance, in case of multiple Airflow environments, you need to track and store credentials across multiple environments, and updating credentials requires you to update all the environments. With a secrets backend, you can centralize configuration.
For demo purposes, we collect data lineage from a data pipeline, which creates a star schema in Redshift Serverless.
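To make the metadata flow described above more concrete, the following minimal sketch builds a dictionary shaped like an OpenLineage run event, using only the Python standard library. The top-level field names follow the OpenLineage specification as we understand it; the namespace, job name, dataset names, and producer URI are hypothetical placeholders, and the events emitted by the real plugin carry additional facets.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs,
                    namespace="demo_namespace"):
    """Return a dict shaped like a (simplified) OpenLineage run event."""
    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # for example START or COMPLETE
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
        # Hypothetical producer URI, for illustration only.
        "producer": "https://example.com/hypothetical-producer",
    }


event = build_run_event(
    "COMPLETE",
    job_name="rs_staging_to_dm.load_user_dimension",  # hypothetical task name
    inputs=["staging.users"],                         # hypothetical tables
    outputs=["dm.user_dimension"],
)
print(json.dumps(event, indent=2))
```

A metadata store such as Marquez can stitch events like this into a graph because each one names both the job and the datasets it reads and writes.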

In the following sections, we walk you through the steps for end-to-end configuration.

Install the openlineage-airflow plugin

Specify the following dependency in the requirements.txt file of the Amazon MWAA environment. Note that the latest Airflow version currently available on Amazon MWAA is 2.4.3; for this post, use the compatible version 0.19.2 of the plugin:

openlineage-airflow==0.19.2

For more details on installing Python dependencies on Amazon MWAA, refer to Installing Python dependencies.

For Airflow < 2.3, configure the plugin’s lineage backend through the following configuration overrides on the Amazon MWAA environment and load it immediately at Airflow start by disabling lazy load of plugins:

AirflowConfigurationOptions:
    core.lazy_load_plugins: False
    lineage.backend: openlineage.lineage_backend.OpenLineageBackend

For more information on configuration overrides, refer to Configuration options overview.
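If you manage the environment programmatically rather than through the console, the same overrides can be passed as a string-to-string mapping. The following sketch assumes the boto3 Amazon MWAA client and its update_environment call with the AirflowConfigurationOptions parameter. The environment name in the usage comment is a hypothetical placeholder, and the client is passed in as an argument so that the helper can be exercised without AWS credentials.

```python
def lineage_overrides():
    """Configuration overrides from this section, as string-to-string pairs."""
    return {
        "core.lazy_load_plugins": "False",
        "lineage.backend": "openlineage.lineage_backend.OpenLineageBackend",
    }


def apply_overrides(mwaa_client, environment_name, overrides):
    """Send the overrides to an existing Amazon MWAA environment.

    mwaa_client is expected to behave like boto3.client("mwaa").
    """
    return mwaa_client.update_environment(
        Name=environment_name,
        AirflowConfigurationOptions=overrides,
    )


# Usage (requires boto3 and AWS credentials; environment name is hypothetical):
#   import boto3
#   apply_overrides(boto3.client("mwaa"), "my-mwaa-environment", lineage_overrides())
```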

Configure the Secrets Manager backend with Amazon MWAA

Using Secrets Manager as a secrets backend for Amazon MWAA is straightforward. First, provide the execution role of Amazon MWAA with read permission to Secrets Manager. You can use the following policy template as a starting point:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "arn:aws:secretsmanager:AWS_REGION:<YOUR_ACCOUNT_ID>:secret:*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:ListSecrets",
      "Resource": "*"
    }
  ]
}
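If you prefer to generate the policy document rather than edit the template by hand, a small helper can fill in the Region and account ID. This is a minimal sketch: the function name is ours, the account ID in the example is a placeholder, and the trailing wildcard on the secret ARN is an assumption that grants read access to every secret in that account and Region, so consider narrowing it.

```python
import json


def secrets_read_policy(region, account_id):
    """Build a read-only Secrets Manager policy for the MWAA execution role."""
    # Assumption: "*" matches every secret in the account and Region; narrow
    # it (for example to "secret:airflow/*") where that is practical.
    secret_arn = f"arn:aws:secretsmanager:{region}:{account_id}:secret:*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetResourcePolicy",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:ListSecretVersionIds",
                ],
                "Resource": secret_arn,
            },
            {
                "Effect": "Allow",
                "Action": "secretsmanager:ListSecrets",
                # ListSecrets is not scoped to individual secrets.
                "Resource": "*",
            },
        ],
    }


# 111122223333 is a placeholder account ID.
print(json.dumps(secrets_read_policy("us-east-1", "111122223333"), indent=2))
```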

Second, configure Secrets Manager as a backend in Amazon MWAA through the following configuration overrides:

AirflowConfigurationOptions:
    secrets.backend: airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
    secrets.backend_kwargs: '{"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}'

For more information on configuring a secrets backend in Amazon MWAA, refer to Configuring an Apache Airflow connection using a Secrets Manager secret and Move your Apache Airflow connections and variables to AWS Secrets Manager.

Deploy a custom envvar plugin to Amazon MWAA

Apache Airflow has a built-in plugin manager through which it can be extended with custom functionality. In our case, this functionality is to populate OpenLineage-specific environment variables based on values in Secrets Manager. Natively, Amazon MWAA allows environment variables with the prefix AIRFLOW__, but the openlineage-airflow plugin expects variables with the prefix OPENLINEAGE_ (for example, OPENLINEAGE_URL).

The following Python code is used in the plugin. We assume the file is called envvar_plugin.py:

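import os

from airflow.models import Variable
from airflow.plugins_manager import AirflowPlugin

# Runs when Airflow loads the plugin: copy the Marquez URL from the Airflow
# variable (resolved through the secrets backend) into the environment
# variable that openlineage-airflow reads. Falls back to an empty string.
os.environ["OPENLINEAGE_URL"] = Variable.get("OPENLINEAGE_URL", default_var="")


class EnvVarPlugin(AirflowPlugin):
    # Minimal plugin class so that Airflow registers this module as a plugin.
    name = "env_var_plugin"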
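The plugin above sets a single variable. If you need to pass further settings, the same idea extends to a loop over keys. The following is a minimal sketch under two assumptions: first, that the plugin version you install also reads OPENLINEAGE_NAMESPACE and OPENLINEAGE_API_KEY (check its documentation); second, that the lookup function is injected, standing in for Variable.get, so that the sketch runs without Airflow installed. The helper name export_variables is ours.

```python
import os

# Assumption: besides OPENLINEAGE_URL, the plugin also reads these variables;
# verify against the openlineage-airflow documentation for your version.
OPENLINEAGE_KEYS = ("OPENLINEAGE_URL", "OPENLINEAGE_NAMESPACE", "OPENLINEAGE_API_KEY")


def export_variables(lookup, keys=OPENLINEAGE_KEYS, environ=os.environ):
    """Copy each value returned by lookup(key) into the given environment.

    lookup stands in for airflow.models.Variable.get. Keys without a value
    are skipped rather than exported as empty strings.
    """
    exported = []
    for key in keys:
        value = lookup(key)
        if value:
            environ[key] = value
            exported.append(key)
    return exported


# Example with an in-memory lookup instead of Secrets Manager.
fake_store = {"OPENLINEAGE_URL": "https://marquez.mysite.com:5000"}
demo_env = {}
print(export_variables(fake_store.get, environ=demo_env))
```

Inside envvar_plugin.py, the call could look like export_variables(lambda key: Variable.get(key, default_var="")).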

Amazon MWAA has a mechanism to install a plugin through a zip archive. You zip your code, upload the archive to an S3 bucket, and pass the URL to the file to Amazon MWAA:

zip plugins.zip envvar_plugin.py

Upload plugins.zip to an S3 bucket and configure the URL in Amazon MWAA. The following screenshot shows the configuration via the Amazon MWAA console.

Configuration of a custom plugin in Amazon MWAA

For more information on installing custom plugins on Amazon MWAA, refer to Creating a custom plugin that generates runtime environment variables.
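If you build the archive as part of a Python-based deployment script instead of calling zip, the standard library is sufficient. The following sketch uses a helper name of our own choosing and places the plugin file at the top level of the archive, which mirrors what the zip command shown earlier produces.

```python
import zipfile
from pathlib import Path


def build_plugins_archive(plugin_file, archive="plugins.zip"):
    """Create a zip archive with the plugin file at its top level."""
    plugin_path = Path(plugin_file)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any directory prefix so the file sits at the zip root.
        zf.write(plugin_path, arcname=plugin_path.name)
    return Path(archive)


# Usage, from the directory that contains the plugin file:
#   build_plugins_archive("envvar_plugin.py")
```

The resulting plugins.zip can then be uploaded to the S3 bucket of your Amazon MWAA environment in the same way as an archive created on the command line.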

Configure connectivity between the openlineage-airflow plugin and Marquez

As a last step, store the URL to Marquez in Secrets Manager. For this, create a secret called airflow/variables/OPENLINEAGE_URL with value <protocol>://<hostname/ip>:<port> (for example, https://marquez.mysite.com:5000).
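The secret name follows from the variables_prefix configured for the secrets backend: a lookup of the Airflow variable OPENLINEAGE_URL is resolved against the prefix, a separator, and the key. The following sketch illustrates that naming rule and adds a simple sanity check for the URL format. The separator default and the restriction to http and https are assumptions of this sketch, and both helper names are ours.

```python
from urllib.parse import urlparse


def secret_name(variables_prefix, key, sep="/"):
    """Name of the secret that backs an Airflow variable lookup."""
    return f"{variables_prefix}{sep}{key}"


def is_valid_lineage_url(url):
    """Check for the <protocol>://<hostname/ip>:<port> shape described above."""
    parsed = urlparse(url)
    try:
        port = parsed.port  # raises ValueError for a malformed port
    except ValueError:
        return False
    # Assumption: only http and https are accepted as protocols.
    return parsed.scheme in ("http", "https") and bool(parsed.hostname) and port is not None


print(secret_name("airflow/variables", "OPENLINEAGE_URL"))
print(is_valid_lineage_url("https://marquez.mysite.com:5000"))
```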

Configuration of OPENLINEAGE_URL as secret

In case you need to spin up Marquez on AWS, you have multiple options to host, including running it on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Compute Cloud (Amazon EC2). Refer to Running Marquez on AWS or check out our infrastructure template in the next section to deploy Marquez on AWS.

Deploy with an AWS CDK-based solution template

Assuming you want to set up a demo infrastructure for all of the above in one step, you can use the following template based on the AWS CDK.

The template has the following prerequisites:

An AWS account.
Amazon Linux 2 with AWS CDK, Docker CLI, and Python3 installed. Alternatively, setting up an AWS Cloud9 environment will satisfy this requirement.

Complete the following steps to deploy the template:

Clone GitHub repository and install Python dependencies. Bootstrap the AWS CDK if required.

git clone https://github.com/aws-samples/aws-mwaa-openlineage
cd aws-mwaa-openlineage
python3 -m venv .env && source .env/bin/activate
python3 -m pip install -r requirements.txt
cdk bootstrap

Update the value for the variable EXTERNAL_IP in constants.py to your outbound IP for connecting to the internet:
```
# Set variable to outbound IP for connecting to the internet.
EXTERNAL_IP = "255.255.255.255"
```
This configures the security groups so that you can access Marquez while other clients are blocked. You can find constants.py in the root folder of the cloned repository.
Deploy the VPC_S3 stack to provision a new VPC dedicated for this solution as well as the security groups that are used by the different components:
```
cdk deploy vpc-s3
```
It creates a new S3 bucket and uploads the source raw data based on the TICKIT sample database. This serves as the landing area from the OLTP database. We then need to parse the metadata of these files through an AWS Glue crawler, which facilitates the native integration between Amazon Redshift and the S3 data lake.
Deploy the lineage stack to create an EC2 instance that hosts Marquez:
```
cdk deploy marquez
```
Access the Marquez web UI through https://{ec2.public_dns_name}:3000/. This URL is also available as part of the AWS CDK outputs for the lineage stack.
Deploy the Amazon Redshift stack to create a Redshift Serverless endpoint:
```
cdk deploy redshift
```
Deploy the Amazon MWAA stack to create an Amazon MWAA environment:
```
cdk deploy mwaa
```
You can access the Amazon MWAA UI through the URL provided in the AWS CDK output.

Test a sample data pipeline

On Amazon MWAA, you can see an example data pipeline deployed that consists of two DAGs. It builds a star schema on top of the TICKIT sample database. One DAG is responsible for loading data from the S3 data lake into an Amazon Redshift staging layer; the second DAG loads data from the staging layer to the dimensional model.

Datamodel of star schema

Open the Amazon MWAA UI through the URL obtained in the deployment steps and launch the following DAGs: rs_source_to_staging and rs_staging_to_dm. As part of the run, the lineage metadata is sent to Marquez.

After both DAGs have run, open the Marquez URL obtained in the deployment steps. In Marquez, you can find the lineage metadata for the computed star schema and related data assets on Amazon Redshift.

Clean up

Delete the AWS CDK stacks to avoid ongoing charges for the resources that you created. Run the following command in the aws-mwaa-openlineage project directory so that all resources are undeployed:

cdk destroy --all

Summary

In this post, we showed you how to automate data lineage with OpenLineage on Amazon MWAA. As part of this, we covered how to install and configure the openlineage-airflow plugin on Amazon MWAA. Additionally, we provided a ready-to-use infrastructure template for a complete demo environment.

We encourage you to explore what else can be achieved with OpenLineage. A job orchestrator like Apache Airflow is only one piece of a data platform and not all possible data lineage can be captured on it. We recommend exploring OpenLineage’s integration with other platforms like Apache Spark or dbt. For more information, refer to Integrations.

Additionally, we recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon MWAA and data governance on AWS.

About the Authors

Stephen Said is a Senior Solutions Architect and works with Digital Native Businesses. His areas of interest are data analytics, data platforms and cloud-native software engineering.

Vishwanatha Nayak is a Senior Solutions Architect at AWS. He works with large enterprise customers helping them design and build secure, cost-effective, and reliable modern data platforms using the AWS cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Paul Villena is an Analytics Solutions Architect with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interest are infrastructure-as-code, serverless technologies, and coding in Python.

Enable cross-account sharing with direct IAM principals using AWS Lake Formation Tags

2023-01-18 Aarthi Srinivasan

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/enable-cross-account-sharing-with-direct-iam-principals-using-aws-lake-formation-tags/

With AWS Lake Formation, you can build data lakes with multiple AWS accounts in a variety of ways. For example, you could build a data mesh, implementing a centralized data governance model and decoupling data producers from the central governance. Such data lakes enable the data-as-an-asset paradigm and open up new possibilities for data discovery and exploration by consumers across the organization. While enabling the power of data in decision-making across your organization, it’s also crucial to secure the data. With Lake Formation, sharing datasets across accounts only requires a few simple steps, and you can control what you share.

Lake Formation has launched Version 3 capabilities for sharing AWS Glue Data Catalog resources across accounts. When moving to Lake Formation cross-account sharing V3, you get several benefits. When moving from V1, you get more optimized usage of AWS Resource Access Manager (AWS RAM) to scale sharing of resources. When moving from V2, you get a few enhancements. First, you don’t have to maintain AWS Glue resource policies to share using LF-tags because Version 3 uses AWS RAM. Second, you can share with AWS Organizations using LF-tags. Third, you can share to individual AWS Identity and Access Management (IAM) users and roles in other accounts, thereby providing data owners control over which individuals can access their data.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes called LF-tags. LF-tags are different from IAM resource tags and are associated only with Lake Formation databases, tables, and columns. LF-TBAC allows you to define the grant and revoke permissions policy by grouping Data Catalog resources, and therefore helps in scaling permissions across a large number of databases and tables. LF-tags are inherited from a database to all its tables and all the columns of each table.
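The inheritance and override behavior can be pictured with a small model. The following sketch is a simplified mental model in plain Python, not Lake Formation's implementation: tags resolve from database to table to column, with the most specific level taking precedence, and a resource matches a policy when every key in the expression carries one of the allowed values. The function names are ours and the column names are used for illustration only.

```python
def effective_tags(database_tags, table_tags=None, column_tags=None):
    """Resolve LF-tags for a column: column overrides table overrides database."""
    resolved = dict(database_tags)
    resolved.update(table_tags or {})
    resolved.update(column_tags or {})
    return resolved


def matches(tags, expression):
    """True when every key in the expression has one of the allowed values."""
    return all(tags.get(key) in allowed for key, allowed in expression.items())


database = {"Sensitivity": "Public"}
columns = {
    "c_customer_id": {},                                 # inherits Public
    "c_email_address": {"Sensitivity": "Confidential"},  # explicit override
}
policy = {"Sensitivity": ["Public"]}

visible = [
    name for name, override in columns.items()
    if matches(effective_tags(database, column_tags=override), policy)
]
print(visible)
```

In this model, a grant on Sensitivity=Public exposes only the columns that still resolve to Public after overrides are applied.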

Version 3 offers the following benefits:

True central governance with cross-account sharing to specific IAM principals in the target account
Ease of use in not having to maintain an AWS Glue resource policy for LF-TBAC
Efficient reuse of AWS RAM shares
Ease of use in scaling to hundreds of accounts with LF-TBAC

In this post, we illustrate the new features of cross-account sharing Version 3 in a producer-consumer scenario using TPC datasets. We walk through the setup of using LF-TBAC to share data catalog resources from the data producer account to direct IAM users in the consumer account. We also go through the steps in the receiving account to accept the shares and query the data.

Solution overview

To demonstrate the Lake Formation cross-account Version 3 features, we use the TPC datasets available at s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/. The solution consists of steps in both accounts.

In account A, complete the following steps:

As a data producer, register the dataset with Lake Formation and create AWS Glue Data Catalog tables.
Create LF-tags and associate them with the database and tables.
Grant LF-tag based permissions on resources directly to personas in consumer account B.

The following steps take place in account B:

The consumer account data lake admin reviews and accepts the AWS RAM invitations.
The data lake admin gives CREATE DATABASE access to the IAM user lf-business-analysts.
The data lake admin creates a database for the marketing team and grants CREATE TABLE access to lf-campaign-manager.
The IAM users create resource links on the shared database and tables and query them in Amazon Athena.

The producer account A has the following personas:

Data lake admin – Manages the data lake in the producer account
lf-producersteward – Manages the data and user access

The consumer account B has the following personas:

Data lake admin – Manages the data lake in the consumer account
lf-business-analysts – The business analysts in the sales team need access to non-PII data
lf-campaign-manager – The manager in the marketing team needs access to data related to products and promotions

Prerequisites

You need the following prerequisites:

Two AWS accounts. For this demonstration of how AWS RAM invites are created and accepted, you should use two accounts that are not part of the same organization.
An admin IAM user in both accounts to launch the AWS CloudFormation stacks.
Lake Formation mode enabled in both the producer and consumer account with cross-account Version 3. For instructions, refer to Change the default permission model.

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin.

Sign in to the AWS Management Console in the us-east-1 Region.
On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
Under Data lake administrators, select Choose administrators.
In the pop-up window Manage data lake administrators, under IAM users and roles, choose IAM admin user and choose Save.
Choose Launch Stack to deploy the CloudFormation template:
Choose Next.
Provide a name for the stack and choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation should take about 2–3 minutes. The stack establishes the producer setup as follows:

Creates an Amazon Simple Storage Service (Amazon S3) data lake bucket
Registers the data lake bucket with Lake Formation
Creates an AWS Glue database and tables
Creates an IAM user (lf-producersteward) who will act as producer steward
Creates LF-tags and assigns them to the created catalog resources as specified in the following table

Database	Table	LF-Tag Key	LF-Tag Value	Resource Tagged
`lftpcdb`	.	`Sensitivity`	`Public`	`DATABASE`
`lftpcdb`	`items`	`HasCampaign`	`true`	`TABLE`
`lftpcdb`	`promotions`	`HasCampaign`	`true`	`TABLE`
`lftpcdb`	`customers table columns = "c_last_name","c_first_name","c_email_address"`	`Sensitivity`	`Confidential`	`TABLECOLUMNS`
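In this post, the CloudFormation stack performs the tag assignments for you. For readers who want to see what equivalent API calls might look like, the following sketch expresses the table above as request payloads for the boto3 Lake Formation client's add_lf_tags_to_resource call. The helper names are ours, the client is passed in so that the sketch runs without AWS credentials, and the LF-tags themselves are assumed to exist already.

```python
def lf_tag_assignments(database="lftpcdb"):
    """Return (resource, lf_tags) pairs mirroring the table above."""
    def tag(key, value):
        return [{"TagKey": key, "TagValues": [value]}]

    return [
        ({"Database": {"Name": database}}, tag("Sensitivity", "Public")),
        ({"Table": {"DatabaseName": database, "Name": "items"}},
         tag("HasCampaign", "true")),
        ({"Table": {"DatabaseName": database, "Name": "promotions"}},
         tag("HasCampaign", "true")),
        ({"TableWithColumns": {
            "DatabaseName": database,
            "Name": "customers",
            "ColumnNames": ["c_last_name", "c_first_name", "c_email_address"],
        }}, tag("Sensitivity", "Confidential")),
    ]


def assign_tags(lakeformation_client, assignments):
    """Apply each assignment through a boto3-like Lake Formation client."""
    for resource, lf_tags in assignments:
        lakeformation_client.add_lf_tags_to_resource(
            Resource=resource, LFTags=lf_tags
        )
```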

Verify permissions in account A

After the CloudFormation stack launches, complete the following steps in account A:

On the AWS CloudFormation console, navigate to the Outputs tab of the stack.

Choose the LFProducerStewardCredentials value to navigate to the AWS Secrets Manager console.
In the Secret value section, choose Retrieve secret value.
Note down the secret value for the password for IAM user lf-producersteward.

You need this to log in to the console later as the user lf-producersteward.

On the Lake Formation console, choose Databases on the navigation pane.
Open the database lftpcdb.
Verify the LF-tags on the database are created.

Choose View tables and choose the items table to verify the LF-tags.

Repeat the steps for the promotions and customers tables to verify the LF-tags assigned.

On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
Select the database lftpcdb and on the Actions menu, choose View Permissions.
Verify that there are no default permissions granted on the database lftpcdb for IAMAllowedPrincipals.
If you find any, select the permission and choose Revoke to revoke the permission.
On the AWS Management Console, choose the AWS CloudShell icon on the top menu.

This opens AWS CloudShell in another tab of the browser. Allow a few minutes for the CloudShell environment to set up.

Run the following AWS Command Line Interface (AWS CLI) command after replacing ${BUCKET_NAME} with the DataLakeBucket value from the stack output.

aws s3 cp s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/ s3://${BUCKET_NAME}/tpcparquet/ --recursive

If CloudShell isn’t available in your chosen Region, run the same AWS CLI command from your preferred AWS CLI environment as the IAM admin user to copy the required dataset.

Verify that your S3 bucket has the dataset copied in it.
Log out as the IAM admin user.

Grant permissions in account A

Next, we continue granting Lake Formation permissions to the dataset as a data steward within the producer account. The data steward grants the following LF-tag-based permissions to the consumer personas.

Consumer Persona	LF-tag Policy
`lf-business-analysts`	`Sensitivity=Public`
`lf-campaign-manager`	`HasCampaign=true`

Log in to account A as user lf-producersteward, using the password you noted from Secrets Manager earlier.
On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
Choose Grant.
Under Principals, select External accounts.
Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-business-analysts) and press Enter.

Under LF-Tags or catalog resources, select Resources matched by LF-Tags.
Choose Add LF-Tag to add a new key-value pair.
For the key, choose Sensitivity and for the value, choose Public.
Under Database permissions, select Describe, and under Table permissions, select Select and Describe.

Choose Grant to apply the permissions.
On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
Choose Grant.
Under Principals, select External accounts.
Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-campaign-manager) and press Enter.
Under LF-Tags or catalog resources, select Resources matched by LF-Tags.
Choose Add LF-Tag to add a new key-value pair.
For the key, choose HasCampaign and for the value, choose true.

Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
Choose Grant to apply the permissions.
Verify on the Data lake permissions tab that the permissions you have granted show up correctly.
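The console steps above can also be expressed as API requests. The following sketch builds keyword arguments for the boto3 Lake Formation client's grant_permissions call with an LFTagPolicy resource: one request for the database-level Describe permission and one for the table-level Select and Describe permissions. The helper names are ours, and the account ID is a placeholder for the account B ID.

```python
def lf_tag_grants(principal_arn, tag_key, tag_value):
    """Build the database- and table-level grant requests for one LF-tag."""
    def request(resource_type, permissions):
        return {
            "Principal": {"DataLakePrincipalIdentifier": principal_arn},
            "Resource": {
                "LFTagPolicy": {
                    "ResourceType": resource_type,
                    "Expression": [{"TagKey": tag_key, "TagValues": [tag_value]}],
                }
            },
            "Permissions": permissions,
        }

    return [
        request("DATABASE", ["DESCRIBE"]),
        request("TABLE", ["SELECT", "DESCRIBE"]),
    ]


def grant_all(lakeformation_client, grant_requests):
    """Send each request through a boto3-like Lake Formation client."""
    for kwargs in grant_requests:
        lakeformation_client.grant_permissions(**kwargs)


# 111122223333 is a placeholder for the account B ID.
analyst_grants = lf_tag_grants(
    "arn:aws:iam::111122223333:user/lf-business-analysts", "Sensitivity", "Public"
)
```

The grants for lf-campaign-manager follow the same pattern with the key HasCampaign and the value true.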

AWS CloudFormation setup in account B

Complete the following steps in the consumer account:

Log in as an IAM admin user in account B and launch the CloudFormation stack:
Choose Next.
Provide a name for the stack, then choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.

Stack creation should take about 2–3 minutes. The stack sets up the following resources in account B:

IAM users datalakeadmin1, lf-business-analysts, and lf-campaign-manager, with relevant IAM and Lake Formation permissions
A database called db_for_shared_tables with Create_Table permissions to the lf-campaign-manager user
An S3 bucket named lfblog-athenaresults-<your-accountB-id>-us-east-1 with ListBucket and write permissions to lf-business-analysts and lf-campaign-manager

Note down the stack output details.

Accept resource shares in account B

After you launch the CloudFormation stack, complete the following steps in account B:

On the CloudFormation stack Outputs tab, choose the link for DataLakeAdminCredentials.

This takes you to the Secrets Manager console.

On the Secrets Manager console, choose Retrieve secret value and copy the password for DataLakeAdmin user.
Use the ConsoleIAMLoginURL value from the CloudFormation template output to log in to account B with the data lake admin user name datalakeadmin1 and the password you copied from Secrets Manager.
Open the AWS RAM console in another browser tab.
In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from the producer account A: one for the database-level share and one for the table-level share.

Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.
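If you need to accept invitations from a script, for example when onboarding several consumer accounts, the boto3 AWS RAM client offers get_resource_share_invitations and accept_resource_share_invitation. The following is a minimal sketch with a helper name of our own; it omits pagination and accepts every pending invitation without review, so treat it as a starting point rather than a recommended practice.

```python
def accept_pending_invitations(ram_client):
    """Accept every pending AWS RAM invitation and return the accepted ARNs.

    ram_client is expected to behave like boto3.client("ram").
    Pagination is omitted for brevity.
    """
    accepted = []
    response = ram_client.get_resource_share_invitations()
    for invitation in response.get("resourceShareInvitations", []):
        if invitation.get("status") == "PENDING":
            arn = invitation["resourceShareInvitationArn"]
            ram_client.accept_resource_share_invitation(
                resourceShareInvitationArn=arn
            )
            accepted.append(arn)
    return accepted
```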

Grant permissions in account B

To grant permissions in account B, complete the following steps:

On the Lake Formation console, under Permissions on the navigation pane, choose Administrative roles and tasks.

Under Database creators, choose Grant.

Under IAM users and roles, choose lf-business-analysts.
For Catalog permissions, select Create database.
Choose Grant.
Log out of the console as the data lake admin user.

Query the shared datasets as consumer users

To validate the lf-business-analysts user’s data access, perform the following steps:

Log in to the console as lf-business-analysts, using the credentials noted from the CloudFormation stack output.
On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.

Select the database lftpcdb and on the Actions menu, choose Create resource link.

Under Resource link name, enter rl_lftpcdb.
Choose Create.
After the resource link is created, select the resource link and choose View tables.

You can now see the four tables in the shared database.

Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
Verify data access using the following query (for more information, refer to Running SQL queries using Amazon Athena):

Select * from rl_lftpcdb.customers limit 10;

The following screenshot shows the query output.

Notice that account A shared the database lftpcdb to account B using the LF-tag expression Sensitivity=Public. For the columns c_first_name, c_last_name, and c_email_address in the customers table, the inherited tag value was overridden with Sensitivity=Confidential. Therefore, these three columns are not visible to user lf-business-analysts.

You can preview the other tables from the database similarly to see the available columns and data.

Log out of the console as lf-business-analysts.
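The verification query shown earlier for lf-business-analysts can also be submitted programmatically. The following sketch builds the keyword arguments for the boto3 Athena client's start_query_execution call, pointing the results at the bucket created by the stack. The helper name is ours and the account ID in the usage comment is a placeholder.

```python
def athena_query_request(account_id, region="us-east-1"):
    """Build keyword arguments for Athena's start_query_execution call."""
    return {
        "QueryString": "SELECT * FROM rl_lftpcdb.customers LIMIT 10;",
        "ResultConfiguration": {
            "OutputLocation": f"s3://lfblog-athenaresults-{account_id}-{region}/"
        },
    }


# Usage (requires boto3 and credentials for lf-business-analysts):
#   import boto3
#   boto3.client("athena").start_query_execution(**athena_query_request("111122223333"))
```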

Now we can validate the lf-campaign-manager user’s data access.

Log in to the console as lf-campaign-manager using the credentials noted from the CloudFormation stack output.
On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
Verify that you can see the database db_for_shared_tables shared by the data lake admin.

Under Data catalog in the navigation pane, choose Tables.

You should be able to see the two tables shared from account A using the LF-tag expression HasCampaign=true. The two tables show the Owner account ID as account A.

Because lf-campaign-manager received table level shares, this user will create table-level resource links for querying in Athena.

Select the promotions table, and on the Actions menu, choose Create resource link.

For Resource link name, enter rl_promotions.

Under Database, choose db_for_shared_tables for the database to contain the resource link.
Choose Create.
Repeat the table resource link creation for the other table items.

Notice that the resource links show account B as owner, whereas the actual tables show account A as the owner.

Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
Query the tables using the resource links.

As shown in the following screenshot, all columns of both tables are accessible to lf-campaign-manager.
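
If you prefer to run the same check from code rather than the console, the following minimal sketch builds the request for the Athena StartQueryExecution API. The helper name and the account ID are hypothetical placeholders, and the sketch assumes that the rl_promotions resource link lives in db_for_shared_tables as created earlier. The boto3 call itself is left as a comment so that the block runs without AWS credentials.

```python
from typing import Any, Dict


def athena_query_request(sql: str, database: str, results_bucket: str) -> Dict[str, Any]:
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": f"s3://{results_bucket}/"},
    }


# Hypothetical account ID; substitute your own account B ID.
ACCOUNT_B_ID = "111122223333"

request = athena_query_request(
    sql="SELECT * FROM rl_promotions LIMIT 10;",
    database="db_for_shared_tables",
    results_bucket=f"lfblog-athenaresults-{ACCOUNT_B_ID}-us-east-1",
)
print(request)

# With boto3 installed and credentials for lf-campaign-manager configured:
#   import boto3
#   boto3.client("athena", region_name="us-east-1").start_query_execution(**request)
```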

In summary, you have seen how LF-tags are used to share a database and select tables from one account to another account’s IAM users.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

First, clean up resources in account A:

Empty the S3 bucket created for this post by deleting the downloaded objects from your S3 bucket.
Delete the CloudFormation stack.

This deletes the S3 bucket, custom IAM roles, policies, and the LF database, tables, and permissions.

You may choose to undo the Lake Formation settings also and add IAM access back from the Lake Formation console Settings page.

Now complete the following steps in account B:

Empty the S3 bucket lfblog-athenaresults-<your-accountB-id>-us-east-1 used as the Athena query results location.
Revoke permission to lf-business-analysts as database creator.
Delete the CloudFormation stack.

This deletes the IAM users, S3 bucket, Lake Formation database db_for_shared_tables, resource links, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we illustrated the benefits of using Lake Formation cross-account sharing Version 3 using LF-tags to direct IAM principals and how to receive the shared tables in the consumer account. We used a two-account scenario in which a data producer account shares a database and specific tables to individual IAM users in another account using LF-tags. In the receiving account, we showed the role played by a data lake admin vs. the receiving IAM users. We also illustrated how to overwrite column tags to mask and share PII data.

With Version 3 of cross-account sharing features, Lake Formation makes possible more modern data mesh models, where a producer can directly share to an IAM principal in another account, instead of the entire account. Data mesh implementation becomes easier for data administrators and data platform owners because they can easily scale to hundreds of consumer accounts using the LF-tags based sharing to organizational units or IDs.

We encourage you to upgrade your Lake Formation cross-account sharing to Version 3 and benefit from the enhancements. For more details, see Updating cross-account data sharing version settings.

About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

How to encrypt sensitive caller voice input in Amazon Lex

2023-01-16 Herbert Guerrero

Post Syndicated from Herbert Guerrero original https://aws.amazon.com/blogs/security/how-to-encrypt-sensitive-caller-authentication-voice-input-in-amazon-lex/

In the telecommunications industry, sensitive authentication and user data are typically received through mobile voice and keypads, and companies are responsible for protecting the data obtained through these channels. The increasing use of voice-driven interactive voice response (IVR) has resulted in a need to provide solutions that can protect user data that is gathered from mobile voice inputs. In this blog post, you’ll see how to protect a caller’s sensitive voice data that was captured through Amazon Lex by using data encryption implemented through AWS Lambda functions. The solution described in this post helps you to protect customer data received through voice channels from inadvertent or unknown access. The solution also includes decryption capabilities, which give an authorized administrator or operator the ability to decrypt user data from a Lambda console.

Solution overview

To demonstrate the IVR solution described in this post, a caller speaks two sensitive pieces of data—credit card number and zip code—from an Amazon Connect contact flow. The spoken values are encrypted and returned to the contact flow to be stored in contact attributes. The encrypted ciphertext is retained as a contact attribute for decryption purposes. Amazon CloudWatch Logs is enabled in the contact flow, but only the encrypted values are logged in log streams.

For this solution, conversation logs for this Amazon Lex bot are not enabled. An operator with assigned AWS Identity and Access Management (IAM) permissions can monitor the logged encrypted entries from CloudWatch Logs. For more information, see Working with log groups and log streams in the Amazon CloudWatch Logs User Guide.

Solution architecture

Figure 1 shows the overview of the solution described in this blog post.

Figure 1: Example of solution architecture

Figure 1 shows the following high-level steps of the solution, and the number labels correspond to the following steps.

A caller places an inbound call.
An Amazon Connect contact flow leverages a Get customer input block, backed by an Amazon Lex bot, to prompt the caller for numerical data.
The Amazon Lex bot invokes the Lambda function dev-encryption-core-EncryptFn.
The Lambda function uses the AWS Encryption SDK to encrypt the caller’s plain text data.
The AWS Encryption SDK obtains encryption keys from AWS Key Management Service (AWS KMS).
The caller’s data is encrypted by using the AWS KMS keys obtained from AWS KMS.
The Lambda function appends the encrypted data to the Amazon Lex bot session attributes.
Amazon Lex returns the fully encrypted data back to Amazon Connect.
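
The steps above can be sketched as a Lambda-style handler. This is an illustrative Python sketch, not the solution's own code (the project is written in TypeScript). It assumes the Amazon Lex V1 (Classic) event shape, a hypothetical intent name, and an injected encrypt callable. In a real deployment that callable would wrap the AWS Encryption SDK with an AWS KMS key; here a base64 stand-in is used so that the block runs locally, and base64 is an encoding, not encryption.

```python
import base64
from typing import Any, Callable, Dict


def make_handler(encrypt: Callable[[str], str]) -> Callable[[Dict[str, Any], Any], Dict[str, Any]]:
    """Return a handler that encrypts every filled slot of a Lex V1 event."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
        slots = event["currentIntent"]["slots"]
        # sessionAttributes can be null on the first turn of a conversation.
        session = dict(event.get("sessionAttributes") or {})
        for name, value in slots.items():
            if value is not None:
                # Only the encrypted value is appended to the session attributes.
                session[name] = encrypt(value)
        return {
            "sessionAttributes": session,
            "dialogAction": {"type": "Close", "fulfillmentState": "Fulfilled"},
        }

    return handler


def fake_encrypt(plaintext: str) -> str:
    """Stand-in for local testing only: base64 is an encoding, NOT encryption."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


# Hypothetical sample event; the slot names match the contact attributes used in this post.
sample_event = {
    "currentIntent": {
        "name": "CaptureInput",
        "slots": {"creditcard": "4111111111111111", "zipcode": None},
    },
    "sessionAttributes": None,
}
response = make_handler(fake_encrypt)(sample_event, None)
print(response["sessionAttributes"])
```

Passing the encryption routine in as a parameter keeps the handler logic testable without access to AWS KMS.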

Overview of a contact flow

Figure 2: Contact flow captures input values using Amazon Lex and returns their encrypted values

Figure 2 shows an overview of the contact flow, which has two main steps:

The first numerical data (in this example, an encrypted credit card number value) is stored in contact attributes.
The second numerical data (in this example, an encrypted zip code value) is stored in contact attributes.

Prerequisites

This solution uses the following AWS services:

The following need to be installed on your local machine:

To implement the solution in this post, you first need the Amazon Connect instance prerequisite in place.

To set up the Amazon Connect instance (if none exists)

Create an Amazon Connect instance with a claimed phone number and a configured Amazon Connect user linked to a basic routing profile. For more information about setting up a contact center, see Set up your contact center in the Amazon Connect Administrator Guide.
Assign the CallCenterManager or Admin security profile to an Amazon Connect user.
In the newly created Amazon Connect instance, under the Overview section, find the access URL with the format
https://<aliasname>.awsapps.com/connect/login
- Make note of the access URL, which you will use later to log in to the Amazon Connect Dashboard.
Log in to your Amazon Connect instance with a Connect user that has Admin or CallCenterManager permissions.

Solution procedures

This solution includes the following procedures:

Clone the project or download the solution zip file.
Create AWS resources needed for encryption and decryption.
Configure the Amazon Lex bot in Amazon Connect.
Create the contact flow in Amazon Connect.
Validate the solution.
Decrypt the collected data.

To clone or download the solution

Log in to the GitHub repo.
Clone or download the solution files to your local machine.

The downloaded file contains the artifacts needed for the deployment.

To create AWS resources needed for encryption and decryption

From the command line, change directory to the project’s root directory.
Run npm install.
Run npm run build to transpile TypeScript to JavaScript and package code and its dependencies before deploying to AWS.
Run cdk deploy CoreStack.

To configure the Amazon Lex bot in your Amazon Connect instance

In the Amazon Connect console, choose Contact flows and scroll to the Amazon Lex section.

Figure 3: Select Contact flows
From the Bot menu, select secure_LexInput(Classic). Then select +Add Amazon Lex Bot.

Figure 4: Configure the Amazon Lex bot to Amazon Connect

To import contact flow into Amazon Connect

In the Amazon Connect console, choose Overview, and then choose Login as administrator.
From the Routing menu on the left side, choose Contact flows to show the list of contact flows.
Choose Create Contact flow.
Choose the arrow to the right of the Save button and choose Import flow (beta). This imports the contact flow that you previously downloaded in the procedure To clone or download the solution.
The contact flow already has the Amazon Lex bot configured.

Figure 5: Select Import flow (beta)
In the upper right corner of the contact flow, choose Save, and then choose OK to save the changes.
Choose Publish to make the contact flow ready for use during the validation steps.
(Optional) Claim a phone number (if none is available), using the following steps:
1. In the Connect Dashboard, on the navigation menu, choose Channels, and then choose Phone numbers.
2. On the right side of the page, choose Claim a number.
3. Select the DID (Direct Inward Dialing) tab. Use the drop-down arrow to choose your country/region. When numbers are returned, choose one.
4. Write down the phone number. You will call it later in this post.
(Optional) On the Edit Phone number page, in the Description box, you can type a note if desired.
To assign the contact flow to your claimed phone number, for Contact flow / IVR, choose the drop-down arrow, and then choose Secure_Lex_Input.
Choose Save.

Figure 6: Under Contact flow / IVR, select the imported contact flow

For more information, see Set up phone numbers for your contact center in the Amazon Connect Administrator Guide.

To validate the solution

Dial the test phone number to go through the voice prompt flow.
When prompted, speak a 16-digit credit card number (you have a maximum of two retries), then speak a 5-digit zip code (also a maximum of two retries).
After you complete your test call, review the log streams in Amazon CloudWatch Logs to confirm that the two values that you spoke, zipcode and creditcard, are encrypted and stored in contact attributes.

Figure 7: Sample log showing encrypted values for zipcode and creditcard
Log in to your Amazon Connect Dashboard as a Supervisor. The URL is provided after the Amazon Connect instance has been created. In the navigation menu, choose Contact search.

Figure 8: Choose Contact search to look for the call information
Locate your inbound call on the Contact search list. Note that it can take up to 60 seconds for data to appear in the Contact search list.
Select the Contact ID for your call.

Figure 9: The Contact search showing the contact details for your test call
Copy the encrypted values for creditcard and zipcode and make note of them; you will use these values in the next procedure.

Figure 10: Contact attributes stored in a contact flow are registered as part of the contact details

To decrypt the collected data

In the AWS Lambda console, choose Functions.
Use the Search bar to look for the dev-encryption-core-DecryptFn Lambda function, and then select the name link to open it.
Under the encryption-master folder, open the test folder. Under the events tab, locate the file decrypt.json.
Use the following steps to create a sample test event in the console by using the contents from decrypt.json. For more details, see Testing Lambda functions in the console.
1. Choose the down arrow on the right side of Test.
2. Choose Configure test event.
3. Choose Create new test event.
4. For Event name, enter decryptTest.
5. Paste the contents from decrypt.json.
```
{
    "Details": {
        "Parameters": {
            "encrypted": "<encrypted-value-here>"
        }
    }
}
```
6. Choose Save.
Replace the values in the recently created test event with the encrypted values that you saved in the To validate the solution procedure.

Figure 11: Replace the creditcard or zipCode values with the ones from the Contact Search page
Choose Test. The output from the test shows the values decrypted by the Lambda function. This is shown in Figure 12 under the Execution result tab.

Figure 12: Result from the decryption operation
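
The test event used in this procedure can also be built programmatically. The following sketch assumes only the decrypt.json shape shown earlier; the helper names are hypothetical, and the decrypt callable is injected because the actual decryption logic of dev-encryption-core-DecryptFn is not reproduced here.

```python
import json
from typing import Any, Callable, Dict


def build_decrypt_event(encrypted_value: str) -> Dict[str, Any]:
    """Wrap a ciphertext copied from Contact search in the decrypt.json event shape."""
    return {"Details": {"Parameters": {"encrypted": encrypted_value}}}


def decrypt_handler(event: Dict[str, Any], decrypt: Callable[[str], str]) -> str:
    """Read the ciphertext from the event and pass it to the supplied decrypt function."""
    ciphertext = event["Details"]["Parameters"]["encrypted"]
    return decrypt(ciphertext)


# Placeholder value; paste the creditcard or zipcode ciphertext from your test call.
event = build_decrypt_event("<encrypted-value-here>")
print(json.dumps(event, indent=4))
```

The printed JSON can be pasted into the Configure test event dialog in place of the contents of decrypt.json.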

Note: Make sure that only the appropriate authorized administrator or operator, application, or AWS service is able to invoke the decryption Lambda function.
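
One way to follow this note is to grant lambda:InvokeFunction on the decryption function only to the roles that need it. The sketch below builds such an identity-based policy document. The Region and account ID are hypothetical placeholders, and how you attach the policy (for example, to an operator role) is left to your own access model.

```python
import json
from typing import Any, Dict


def invoke_only_policy(region: str, account_id: str, function_name: str) -> Dict[str, Any]:
    """Build an IAM policy document that allows invoking a single Lambda function."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowInvokeDecryptFunction",
                "Effect": "Allow",
                "Action": "lambda:InvokeFunction",
                "Resource": f"arn:aws:lambda:{region}:{account_id}:function:{function_name}",
            }
        ],
    }


# Hypothetical Region and account ID; the function name is the one used in this post.
policy = invoke_only_policy("us-east-1", "111122223333", "dev-encryption-core-DecryptFn")
print(json.dumps(policy, indent=2))
```

Because IAM denies by default, principals without a statement like this one cannot invoke the function unless another policy grants them access.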

You have now successfully implemented the solution by encrypting and decrypting the voice input of your test call, which you collected through Amazon Lex.

Cleanup

To avoid incurring future charges, follow these steps to clean up the deployed resources that you created when implementing this solution.

To delete the Amazon Connect instance

In the Amazon Connect console, under Instance alias, select the name of the Amazon Connect instance, and choose Delete.
When prompted, type the name of the instance, and then choose Delete.

To delete the Amazon Lex bot

In the Amazon Lex console, choose the bot that you created in the To configure the Amazon Lex bot procedure.
Choose Delete, and then choose Continue.

To delete the AWS CloudFormation stack

In the AWS CloudFormation console, on the Stacks page, select the stack you created in the procedure To create AWS resources needed for encryption and decryption.
In the stack details pane, choose Delete.
Choose Delete stack when prompted. This deletes the Amazon S3 bucket, IAM roles and AWS Lambda functions you created for testing. This will also schedule a deletion date on the AWS KMS key.

Conclusion

In this post, you learned how an Amazon Connect contact flow can collect voice inputs from a caller by using Amazon Lex, and how you can encrypt these inputs by using your own AWS KMS key. This solution can help improve the security of voice input that is collected through Amazon Connect. For cost information, see the Amazon Connect pricing page.

For more information, see the blog post Creating a secure IVR solution with Amazon Connect and the topic Encrypt customer input (using OpenSSL) in the Amazon Connect Administrator Guide. As previously mentioned, the increasing use of voice-driven IVR has resulted in a need to provide solutions that can protect user data gathered from mobile voice inputs.

Additional resources include the AWS Lambda Developer Guide, the Amazon Lex Developer Guide, the Amazon Connect Administrator Guide, the AWS Nodejs SDK, and the AWS SDK for Python (Boto3).

If you need help with setting up this solution, you can get assistance from AWS Professional Services. You can also seek assistance from Amazon Connect partners available worldwide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.


How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

2023-01-11 Mahesh Vandi Chalil

Post Syndicated from Mahesh Vandi Chalil original https://aws.amazon.com/blogs/big-data/how-bookmyshow-saved-80-in-costs-by-migrating-to-an-aws-modern-data-architecture/

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow.

BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, Indonesia, and the Middle East, BookMyShow also offers an online media streaming service and end-to-end management for virtual and on-ground entertainment experiences across all genres.

The pandemic gave BMS the opportunity to migrate and modernize our 15-year-old analytics solution to a modern data architecture on AWS. This architecture is secure, governed, and cost-optimized, with the ability to scale to petabytes. BMS migrated and modernized from on-premises and other cloud platforms to AWS in just four months. This project was run in parallel with our application migration project and achieved 90% cost savings in storage and 80% cost savings in analytics spend.

The BMS analytics platform caters to business needs for sales and marketing, finance, and business partners (e.g., cinemas and event owners), and provides application functionality for audience, personalization, pricing, and data science teams. The prior analytics solution had multiple copies of data, for a total of over 40 TB, with approximately 80 TB of data in other cloud storage. Data was stored on‑premises and in the cloud in various data stores. Growing organically, the teams had the freedom to choose their technology stack for individual projects, which led to the proliferation of various tools, technology, and practices. Individual teams for personalization, audience, data engineering, data science, and analytics used a variety of products for ingestion, data processing, and visualization.

This post discusses BMS’s migration and modernization journey, and how BMS, AWS, and the AWS Partner Minfy Technologies worked together to complete the migration in four months and save costs. The migration tenets using the AWS modern data architecture made the project a huge success.

Challenges in the prior analytics platform

Varied Technology: Multiple teams used various products, languages, and versions of software.
Larger Migration Project: Because the analytics modernization was a parallel project with application migration, planning was crucial in order to consider the changes in core applications and project timelines.
Resources: Experienced resource churn from the application migration project, and had very little documentation of current systems.
Data: Had multiple copies of data and no single source of truth; each data store provided a view for the business unit.
Ingestion Pipelines: Complex data pipelines moved data across various data stores at varied frequencies. We had multiple approaches in place to ingest data to Cloudera, via over 100 Kafka consumers from transaction systems and MQTT (Message Queue Telemetry Transport messaging protocol) for clickstreams, stored procedures, and Spark jobs. We had approximately 100 jobs for data ingestion across Spark, Alteryx, Beam, NiFi, and more.
Hadoop Clusters: The Hadoop clusters were configured on large dedicated hardware, which incurred fixed costs. The on-premises Cloudera setup catered to most of the data engineering, audience, and personalization batch processing workloads. Teams had their own implementations of HBase and Hive for our audience and personalization applications.
Data warehouse: The data engineering team used TiDB as their on-premises data warehouse. However, each consumer team had their own perspective of data needed for analysis. As this siloed architecture evolved, it resulted in expensive storage and operational costs to maintain these separate environments.
Analytics Database: The analytics team used data sourced from other transactional systems and denormalized data. The team had their own extract, transform, and load (ETL) pipeline, using Alteryx with a visualization tool.

Migration tenets followed which led to project success:

Prioritize by business functionality.
Apply best practices when building a modern data architecture from Day 1.
Move only required data, canonicalize the data, and store it in the most optimal format in the target. Remove data redundancy as much as possible. Mark scope for optimization for the future when changes are intrusive.
Build the data architecture while keeping data formats, volumes, governance, and security in mind.
Simplify ELT and processing jobs by categorizing the jobs as rehosted, rewritten, and retired. Finalize canonical data format, transformation, enrichment, compression, and storage format as Parquet.
Rehost machine learning (ML) jobs that were critical for business.
Work backward to achieve our goals, and clear roadblocks and alter decisions to move forward.
Use serverless options as a first option and pay per use. Assess the cost and effort for rearchitecting to select the right approach. Execute a proof of concept to validate this for each component and service.

Strategies applied to succeed in this migration:

Team – We created a unified team with people from data engineering, analytics, and data science as part of the analytics migration project. Site reliability engineering (SRE) and application teams were involved when critical decisions were needed regarding data or timeline for alignment. The analytics, data engineering, and data science teams spent considerable time planning, understanding the code, and iteratively looking at the existing data sources, data pipelines, and processing jobs. The AWS team, together with the partner team from Minfy Technologies, helped BMS arrive at a migration plan after a proof of concept for each of the components in data ingestion, data processing, data warehouse, ML, and analytics dashboards.
Workshops – The AWS team conducted a series of workshops and immersion days, and coached the BMS team on the technology and best practices to deploy the analytics services. The AWS team helped BMS explore the configuration and benefits of the migration approach for each scenario (data migration, data pipeline, data processing, visualization, and machine learning) via proofs of concept (POCs). The team captured the changes required in the existing code for migration. The BMS team also got acquainted with the following AWS services:
- Amazon EMR
- AWS Glue
- Amazon Athena
- AWS Lake Formation
- Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
- Amazon QuickSight
- Amazon Redshift
- Amazon SageMaker
- Amazon Simple Storage Service (Amazon S3)
- AWS Step Functions
Proof of concept – The BMS team, with help from the partner and AWS team, implemented multiple proofs of concept to validate the migration approach:
- Performed batch processing of Spark jobs in Amazon EMR, in which we checked the runtime, required code changes, and cost.
- Ran clickstream analysis jobs in Amazon EMR, testing the end-to-end pipeline. The team conducted proofs of concept on AWS IoT Core for MQTT protocol and streaming to Amazon S3.
- Migrated ML models to Amazon SageMaker and orchestrated with Amazon MWAA.
- Created sample QuickSight reports and dashboards, in which features and time to build were assessed.
- Configured for key scenarios for Amazon Redshift, in which time for loading data, query performance, and cost were assessed.
Effort vs. cost analysis – The team performed the following assessments:
- Compared the ingestion pipelines, the difference in data structure in each store, the basis of the current business need for the data source, the activity for preprocessing the data before migration, data migration to Amazon S3, and change data capture (CDC) from the migrated applications in AWS.
- Assessed the effort to migrate approximately 200 jobs, determined which jobs were redundant or needed improvement from a functional perspective, and completed a migration list for the target state. Because modernizing the MQTT workflow code to serverless was time-consuming, we decided to rehost it on Amazon Elastic Compute Cloud (Amazon EC2) and move the modernization to Amazon Kinesis into the next phase.
- Reviewed over 400 reports and dashboards, prioritized development in phases, and reassessed business user needs.

AWS cloud services chosen for proposed architecture:

Data lake – We used Amazon S3 as the data lake to store the single source of truth for all raw and processed data, thereby reducing the number of data copies and the storage costs.
Ingestion – Because we had multiple sources of truth in the current architecture, we arrived at a common structure before migration to Amazon S3, and existing pipelines were modified to do preprocessing. These one-time preprocessing jobs were run in Cloudera, because the source data was on-premises, and on Amazon EMR for data in the cloud. We designed new data pipelines for ingestion from transactional systems on the AWS cloud using AWS Glue ETL.
Processing – Processing jobs were segregated based on runtime into two categories: batch and near-real time. Batch processes were further divided into transient Amazon EMR clusters with varying runtimes and Hadoop application requirements like HBase. Near-real-time jobs were provisioned in an Amazon EMR permanent cluster for clickstream analytics, and a data pipeline from transactional systems. We adopted a serverless approach using AWS Glue ETL for new data pipelines from transactional systems on the AWS cloud.
Data warehouse – We chose Amazon Redshift as our data warehouse, and planned on how the data would be distributed based on query patterns.
Visualization – We built the reports in Amazon QuickSight in phases and prioritized them based on business demand. We discussed with business users their current needs and identified the immediate reports required. We defined the phases of report and dashboard creation and built the reports in Amazon QuickSight. We plan to use embedded reports for external users in the future.
Machine learning – Custom ML models were deployed on Amazon SageMaker. Existing Airflow DAGs were migrated to Amazon MWAA.
Governance, security, and compliance – Governance with AWS Lake Formation was adopted from Day 1. We configured the AWS Glue Data Catalog to reference data used as sources and targets. We had to comply with Payment Card Industry (PCI) guidelines because payment information was in the data lake, so we put the necessary security policies in place.

Solution overview

BMS modern data architecture

The following diagram illustrates our modern data architecture.

The architecture includes the following components:

Source systems – These include the following:
- Data from transactional systems stored in MariaDB (booking and transactions).
- User interaction clickstream data via Kafka consumers to DataOps MariaDB.
- Members and seat allocation information from MongoDB.
- SQL Server for specific offers and payment information.
Data pipeline – Spark jobs on an Amazon EMR permanent cluster process the clickstream data from Kafka clusters.
Data lake – Data from source systems was stored in their respective Amazon S3 buckets, with prefixes for optimized data querying. For Amazon S3, we followed a hierarchy to store raw, summarized, and team or service-related data in different parent folders as per the source and type of data. Lifecycle policies were added to logs and temp folders of different services as per teams’ requirements.
Data processing – Transient Amazon EMR clusters are used for processing data into a curated format for the audience, personalization, and analytics teams. Small file merger jobs merge the clickstream data to a larger file size, which saved costs for one-time queries.
Governance – AWS Lake Formation enables the usage of AWS Glue crawlers to capture the schema of data stored in the data lake and version changes in the schema. The Data Catalog and security policy in AWS Lake Formation enable access to data for roles and users in Amazon Redshift, Amazon Athena, Amazon QuickSight, and data science jobs. AWS Glue ETL jobs load the processed data to Amazon Redshift at scheduled intervals.
Queries – The analytics team used Amazon Athena to perform one-time queries raised from business teams on the data lake. Because report development is in phases, Amazon Athena was used for exporting data.
Data warehouse – Amazon Redshift was used as the data warehouse, where the reports for the sales teams, management, and third parties (i.e., theaters and events) are processed and stored for quick retrieval. Views to analyze the total sales, movie sale trends, member behavior, and payment modes are configured here. We use materialized views for denormalized tables, different schemas for metadata, and transactional and behavior data.
Reports – We used Amazon QuickSight reports for various business, marketing, and product use cases.
Machine learning – Some of the models deployed on Amazon SageMaker are as follows:
- Content popularity – Decides the recommended content for users.
- Live event popularity – Calculates the popularity of live entertainment events in different regions.
- Trending searches – Identifies trending searches across regions.
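
As an illustration of the lifecycle policies mentioned for the logs and temp folders, the following sketch builds an S3 lifecycle configuration with one expiration rule per prefix. The prefixes and retention periods are hypothetical, because the post does not state the values BMS used. The resulting dictionary has the shape expected by the S3 PutBucketLifecycleConfiguration API.

```python
from typing import Any, Dict, List


def expiration_rule(prefix: str, days: int) -> Dict[str, Any]:
    """Build one lifecycle rule that expires objects under a prefix after a number of days."""
    rule_id = f"expire-{prefix.strip('/').replace('/', '-')}-after-{days}d"
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def lifecycle_configuration(retention_days_by_prefix: Dict[str, int]) -> Dict[str, List[Dict[str, Any]]]:
    """Build a lifecycle configuration with one expiration rule per prefix."""
    return {
        "Rules": [
            expiration_rule(prefix, days)
            for prefix, days in retention_days_by_prefix.items()
        ]
    }


# Hypothetical prefixes and retention periods.
config = lifecycle_configuration({"logs/": 30, "temp/": 7})
print(config)

# With boto3 installed and credentials configured:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="<your-bucket>", LifecycleConfiguration=config
#   )
```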

Walkthrough

Migration execution steps

We standardized tools, services, and processes for data engineering, analytics, and data science:

Data lake
- Identified the source data to be migrated from Archival DB, BigQuery, TiDB, and the analytics database.
- Built a canonical data model that catered to multiple business teams and reduced the copies of data, and therefore storage and operational costs. Modified existing jobs to facilitate migration to a canonical format.
- Identified the source systems, capacity required, anticipated growth, owners, and access requirements.
- Ran the bulk data migration to Amazon S3 from various sources.
Ingestion
- Transaction systems – Retained the existing Kafka queues and consumers.
- Clickstream data – Successfully conducted a proof of concept to use AWS IoT Core for MQTT protocol. But because we needed to make changes in the application to publish to AWS IoT Core, we decided to implement it as part of mobile application modernization at a later time. We decided to rehost the MQTT server on Amazon EC2.
Processing
- Listed the data pipelines relevant to business and migrated them with minimal modification.
- Categorized workloads into critical jobs, redundant jobs, or jobs that can be optimized:
  - Spark jobs were migrated to Amazon EMR.
  - HBase jobs were migrated to Amazon EMR with HBase.
  - Jobs with metadata stored in Hive were modified to use the AWS Glue Data Catalog.
  - NiFi jobs were simplified and rewritten in Spark to run on Amazon EMR.
- Amazon EMR clusters were configured as one persistent cluster for streaming the clickstream and personalization workloads, and multiple transient clusters for running all other Spark ETL or processing jobs. We used Spot Instances for task nodes to save costs. We optimized data storage with dedicated jobs that merged small files and converted data to compressed file formats.
- AWS Glue crawlers identified new data in Amazon S3. AWS Glue ETL jobs transformed and uploaded processed data to the Amazon Redshift data warehouse.
Data warehouse
- Defined the data warehouse schema by categorizing the critical reports required by the business, keeping in mind the workload and reports required in future.
- Defined the staging area for incremental data loaded into Amazon Redshift, materialized views, and tuning the queries based on usage. The transaction and primary metadata are stored in Amazon Redshift to cater to all data analysis and reporting requirements. We created materialized views and denormalized tables in Amazon Redshift to use as data sources for Amazon QuickSight dashboards and segmentation jobs, respectively.
- Optimally used the Amazon Redshift cluster by loading the last two years of data in Amazon Redshift, and used Amazon Redshift Spectrum to query historical data through external tables. This helped balance the usage and cost of the Amazon Redshift cluster.
Visualization
- Amazon QuickSight dashboards were created for the sales and marketing team in Phase 1:
  - Sales summary report – An executive summary dashboard to get an overview of sales across the country by region, city, movie, theatre, genre, and more.
  - Live entertainment – A dedicated report for live entertainment vertical events.
  - Coupons – A report for coupons purchased and redeemed.
  - BookASmile – A dashboard to analyze the data for BookASmile, a charity initiative.
Machine learning
- Listed the ML workloads to be migrated based on current business needs.
- Priority ML processing jobs were deployed on Amazon EMR. Models were modified to use Amazon S3 as source and target, and new APIs were exposed to use the functionality. ML models were deployed on Amazon SageMaker for movies, live event clickstream analysis, and personalization.
- Existing artifacts in Airflow orchestration were migrated to Amazon MWAA.
Security
- AWS Lake Formation was the foundation of the data lake, with the AWS Glue Data Catalog as the central catalog for the data stored in Amazon S3. This provided access to the data for various functions, including the audience, personalization, analytics, and data science teams.
- Personally identifiable information (PII) and payment data were stored in the data lake and data warehouse, so we had to comply with PCI guidelines. Encryption of data at rest and in transit was considered and configured at each service level (Amazon S3, AWS Glue Data Catalog, Amazon EMR, AWS Glue, Amazon Redshift, and QuickSight). Clear roles, responsibilities, and access permissions for different user groups and privileges were listed and configured in AWS Identity and Access Management (IAM) and individual services.
- Existing single sign-on (SSO) integration with Microsoft Active Directory was used for Amazon QuickSight user access.
Automation
- We used AWS CloudFormation for the creation and modification of all the core and analytics services.
- AWS Step Functions was used to orchestrate Spark jobs on Amazon EMR.
- Scheduled jobs were configured in AWS Glue for loading data into Amazon Redshift based on business needs.
- Monitoring of the analytics services was done using Amazon CloudWatch metrics, and right-sizing of instances and configuration was achieved. Spark job performance on Amazon EMR was analyzed using the native Spark logs and Spark user interface (UI).
- Lifecycle policies were applied to the data lake to optimize the data storage costs over time.

Benefits of a modern data architecture

A modern data architecture offered us the following benefits:

Scalability – We moved from a fixed infrastructure to the minimal infrastructure required, with configuration to scale on demand. Services like Amazon EMR and Amazon Redshift enable us to do this with just a few clicks.
Agility – We use purpose-built managed services instead of reinventing the wheel. Automation and monitoring were key considerations, which enable us to make changes quickly.
Serverless – Adoption of serverless services like Amazon S3, AWS Glue, Amazon Athena, AWS Step Functions, and AWS Lambda supports us when our business has sudden spikes with new movies or events launched.
Cost savings – Our storage size was reduced by 90%. Our overall spend on analytics and ML was reduced by 80%.

Conclusion

In this post, we showed you how a modern data architecture on AWS helped BMS to easily share data across organizational boundaries. This allowed BMS to make decisions with speed and agility at scale; ensure compliance via unified data access, security, and governance; and to scale systems at a low cost without compromising performance. Working with the AWS and Minfy Technologies teams helped BMS choose the correct technology services and complete the migration in four months. BMS achieved the scalability and cost-optimization goals with this updated architecture, which has set the stage for innovation using graph databases and enhanced our ML projects to improve customer experience.

About the Authors

Mahesh Vandi Chalil is Chief Technology Officer at BookMyShow, India’s leading entertainment destination. Mahesh has over two decades of global experience and is passionate about building scalable products that delight customers, while keeping innovation as the top goal that motivates his team to constantly aspire to the same. Mahesh invests his energies in creating and nurturing the next generation of technology leaders and entrepreneurs, both within the organization and outside of it. He is a proud husband and father of two daughters, and plays cricket during his leisure time.

Priya Jathar is a Solutions Architect working in the Digital Native Business segment at AWS. She has more than two decades of IT experience, with expertise in application development, databases, and analytics. She is a builder who enjoys innovating with new technologies to achieve business goals. She currently helps customers migrate, modernize, and innovate in the cloud. In her free time, she likes to paint and hone her gardening and cooking skills.

Vatsal Shah is a Senior Solutions Architect at AWS based out of Mumbai, India. He has more than nine years of industry experience, including leadership roles in product engineering, SRE, and cloud architecture. He currently focuses on enabling large startups to streamline their cloud operations and help them scale on the cloud. He also specializes in AI and Machine Learning use cases.

Code conversion from Greenplum to Amazon Redshift: Handling arrays, dates, and regular expressions

2023-01-09 Jagrit Shrestha

Post Syndicated from Jagrit Shrestha original https://aws.amazon.com/blogs/big-data/code-conversion-from-greenplum-to-amazon-redshift-handling-arrays-dates-and-regular-expressions/

Amazon Redshift is a fully managed service for data lakes, data analytics, and data warehouses for startups, medium enterprises, and large enterprises. Amazon Redshift is used by tens of thousands of businesses around the globe for modernizing their data analytics platform.

Greenplum is an open-source, massively parallel database used for analytics, mostly for on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.

Many customers have found migration to Amazon Redshift from Greenplum an attractive option instead of managing on-premises Greenplum for the following reasons:

The opportunity to modernize the data lake and data warehouse environment
Benefits of other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon CloudWatch, Amazon EMR, Amazon SageMaker, and more

Even though both Greenplum and Amazon Redshift use the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key functions and considerations while performing code conversion from Greenplum to Amazon Redshift. It is focused on the migration of procedures, functions, and views.

Solution overview

AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. But there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of situation, manual conversion of the code is required.

This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:

Arrays
Dates and timestamps
Regular expressions (regex)

Please note that for this post, we use Greenplum 4.3 and Amazon Redshift PostgreSQL 8.2.

Working with array functions

The AWS SCT doesn’t convert array functions while migrating from Greenplum or PostgreSQL to Amazon Redshift. Developers need to extensively convert those functions manually. This post outlines the most common array functions:

ARRAY_UPPER()
JSON_EXTRACT_ARRAY_ELEMENT_TEXT() and JSON_ARRAY_LENGTH()
UNNEST()
STRING_AGG()
ANY ARRAY()

ARRAY_UPPER()

This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum; the following example uses it to retrieve the last element.

The Greenplum code is as follows:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"','NULL'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['222-333-4444','201-301-4001','AAA-BBB-CCCC'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['333-444-5555','301-401-3001','DDD-EEE-FFFF'] as PhoneNumbers
)
Select Firstname, PhoneNumbers[ARRAY_UPPER(PhoneNumbers,1)]
from temp1

There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:

JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string

See the following code:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['"222-333-4444"','"201-301-4001"','"AAA-BBB-CCCC"'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['"333-444-5555"','"301-401-3001"','"DDD-EEE-FFFF"'] as PhoneNumbers
)

Select
FirstName
,('['+array_to_string(phoneNumbers,',')+']') as JSONConvertedField
,JSON_EXTRACT_ARRAY_ELEMENT_TEXT
(
'['+array_to_string(phoneNumbers,',')+']'
,JSON_ARRAY_LENGTH('['+array_to_string(phoneNumbers,',')+']')-1
) as LastElementFromArray
from temp1
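
To make the index arithmetic concrete, here is a minimal Python sketch that models the same idea outside the database. It only illustrates the logic; it is not Amazon Redshift code. The helper name last_array_element is hypothetical, and Python's json module stands in for the two JSON functions. JSON array positions are zero-based, which is why the length is reduced by 1.

```python
import json


def last_array_element(json_text: str) -> str:
    """Model JSON_EXTRACT_ARRAY_ELEMENT_TEXT(json_text, JSON_ARRAY_LENGTH(json_text) - 1)."""
    elements = json.loads(json_text)   # parse the outermost JSON array
    last_index = len(elements) - 1     # positions are zero-based
    return str(elements[last_index])


# Same shape as '[' + array_to_string(phoneNumbers, ',') + ']' in the query above.
phone_numbers = ['"111-222-3333"', '"101-201-3001"', '"XXX-YYY-ZZZZ"']
json_text = "[" + ",".join(phone_numbers) + "]"
print(last_array_element(json_text))
```

The array elements carry their own double quotes so that the concatenated string is valid JSON, which mirrors how the sample data is written in the query above.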

UNNEST()

UNNEST() is PostgreSQL’s system function for semi-structured data; it expands an array, or a combination of arrays, to a set of rows. It was introduced to improve database performance when inserting, updating, or deleting thousands of records.

You can use UNNEST() for a basic array, multiple arrays, and multiple arrays with different lengths.

Some of the Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.

In Greenplum, the UNNEST function is used to expand an array to a set of rows:

Select 'A', unnest(array[1,2])

Output
A 1
A 2

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
)

select
FirstName
,LastName
,unnest(array['Mobile'::text,'HomePhone'::text]) as PhoneType
,unnest(array[MobilePhone::text,HomePhone::text]) as PhoneNumber
from
temp1
order by 1,2,3

Amazon Redshift doesn’t support the UNNEST function; you can use the following workaround:

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
),
ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select
FirstName
,LastName
,split_part('Mobile,Home',',',ns.n::int) as PhoneType
,split_part(MobilePhone|| '&&' || HomePhone, '&&', ns.n::int) as PhoneNumber
from
temp1, ns
where
ns.n<=regexp_count('Mobile,Home',',')+1
order by 1,2,3
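
The workaround pairs every source row with a small series of numbers and uses each number to pick one part out of a delimited string. The following Python sketch models that logic under stated assumptions: split_part here is a simplified stand-in for the Amazon Redshift function (1-based, returning an empty string when the part is out of range), and unnest_phones is a hypothetical helper name.

```python
def split_part(text: str, delimiter: str, part: int) -> str:
    """Simplified model of SPLIT_PART: 1-based, '' when the part does not exist."""
    pieces = text.split(delimiter)
    return pieces[part - 1] if 1 <= part <= len(pieces) else ""


def unnest_phones(rows):
    """Expand (first, last, mobile, home) rows into one row per phone type."""
    labels = "Mobile,Home"
    part_count = labels.count(",") + 1      # models regexp_count(labels, ',') + 1
    expanded = []
    for first, last, mobile, home in rows:
        for n in range(1, part_count + 1):  # stands in for the ns number series
            expanded.append((
                first,
                last,
                split_part(labels, ",", n),
                split_part(mobile + "&&" + home, "&&", n),
            ))
    return sorted(expanded)                 # models ORDER BY 1,2,3


for row in unnest_phones([("John", "Smith", "111-222-3333", "101-201-3001")]):
    print(row)
```

The number series only needs as many rows as there are parts to extract, which is why the query filters ns.n with the delimiter count plus 1.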

When an element of the array is itself an array, use the JSON_EXTRACT_ARRAY_ELEMENT_TEXT() and JSON_ARRAY_LENGTH() functions:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["arrayelement1","arrayelement2"]',ns.n-1)
from ns
where
ns.n<=JSON_ARRAY_LENGTH('["arrayelement1","arrayelement2"]')

STRING_AGG()

The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn’t add the separator at the end of the string. See the following code:

STRING_AGG ( expression, separator [order_by_clause] )

The Greenplum code is as follows:

with temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)

Select dept,STRING_AGG(FirstName||' '||LastName,' ; ') as Employees from temp1 group by dept order by 1

The Amazon Redshift equivalent for the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string:

LISTAGG( [DISTINCT] expression [, 'separator' ] ) [ WITHIN GROUP (ORDER BY order_list) ]

See the following code:

Create temporary Table temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
);

Select dept,LISTAGG(FirstName||' '||LastName,' ; ') as Employees from temp1
group by dept
order by 1
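
As a rough illustration of what both aggregate functions compute, the following Python sketch groups rows by department and joins the names with a separator. The function name listagg is only a label for this sketch, and values are joined in input order, whereas in SQL the order within a group is only guaranteed when an ORDER BY is specified.

```python
from collections import defaultdict


def listagg(rows, separator):
    """Group (dept, first, last) rows by dept and join the full names."""
    groups = defaultdict(list)
    for dept, first, last in rows:
        groups[dept].append(first + " " + last)
    # Sorting the keys models ORDER BY 1 on the department column.
    return {dept: separator.join(names) for dept, names in sorted(groups.items())}


employees = [
    ("Finance", "John", "Smith"),
    ("Finance", "John", "Doe"),
    ("Account", "Phil", "Adams"),
]
print(listagg(employees, " ; "))
```

As with STRING_AGG(), the separator appears only between values, never after the last one.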

ANY ARRAY()

The PostgreSQL ANY ARRAY() construct evaluates and compares the left-hand expression to each element in the array:

Select * from temp1 where DeptName = ANY (ARRAY['10-F','20-F','30-F'])

In Amazon Redshift, the evaluation can be achieved with an IN operator:

Select * from temp1 where DeptName IN ('10-F','20-F','30-F')

Working with date functions

In this section, we compare how to calculate date differences with date_part in Greenplum and datediff in Amazon Redshift.

When an application needs to calculate the difference between the subfields of dates in Greenplum, it uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. For example, you can calculate the number of completion_days as the difference between originated_date and eco_date.

To calculate the difference between the subfields of dates, Amazon Redshift has the function datediff, which can calculate completion_days as the difference between eco_date and originated_date in the same way. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions.

We compare the Greenplum and Amazon Redshift queries as follows:

Difference by year

The following Greenplum query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT date_part('year', TIMESTAMP '2009-01-01') - date_part('year', TIMESTAMP '2008-12-31') as year;

The following Amazon Redshift query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT datediff(year, '2008-12-31', '2009-01-01') as year;

Difference by month

The following Greenplum query returns 1 month between 2009-01-01 and 2008-12-31:

SELECT (date_part('year', '2009-01-01' :: date) - date_part('year', '2008-12-31' :: date)) * 12 +
(date_part('month', '2009-01-01' :: date) - date_part('month', '2008-12-31' :: date)) as month;

The following Amazon Redshift query returns 1 month between 2009-01-01 and 2008-12-31:

SELECT datediff(month, '2008-12-31', '2009-01-01') as month;

Difference by week

The following Greenplum query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT date_part('week', timestamp '2009-01-01') - date_part('week', timestamp '2008-12-31') as week;

The following Amazon Redshift query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT datediff(week, '2008-12-31', '2009-01-01') as week;

Difference by day

The following Greenplum query returns 1 day:

SELECT date_part('day', '2009-01-01 24:00:00' :: timestamp - '2008-12-31 24:00:00' :: timestamp) as day;

The following Amazon Redshift query returns 1 day:

SELECT datediff(day, '2008-12-31', '2009-01-01') as day;

Difference by hour

The following Greenplum query returns 1 hour:

SELECT date_part('hour', '2009-01-01 22:56:10' :: timestamp - '2008-12-31 21:54:55' :: timestamp) as hour;

The following Amazon Redshift query returns 1 hour:

SELECT datediff(hour, '2009-01-01 21:56:10', '2009-01-01 22:56:10') as hour;

Difference by minute

The following Greenplum query returns 3 minutes:

SELECT date_part('minute', '2009-01-01 22:56:10' :: timestamp - '2009-01-01 21:53:10' :: timestamp) as minutes;

The following Amazon Redshift query returns 1 minute:

SELECT datediff(minute, '2009-01-01 21:56:10', '2009-01-01 21:57:55') as minute;

Difference by second

The following Greenplum query returns 40 seconds:

SELECT date_part('second', '2009-01-01 22:56:50' :: timestamp - '2009-01-01 21:53:10' :: timestamp) as seconds;

The following Amazon Redshift query returns 45 seconds:

SELECT datediff(second, '2009-01-01 21:56:10', '2009-01-01 21:56:55') as seconds;

Now let’s look at how we use Amazon Redshift to calculate days and weeks from a difference in seconds.

The following Amazon Redshift query displays 2 days:

SELECT datediff(second, '2008-12-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24) as days;

The following Amazon Redshift query displays 9 weeks:

SELECT datediff(second, '2008-10-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24*7) as weeks;

For Greenplum, the date subfields need to be in single quotes, whereas for Amazon Redshift, we can use date subfields such as year, month, week, day, hour, minute, and second without quotes. For Greenplum, we have to subtract the subfield of one date from the subfield of the other, whereas for Amazon Redshift we pass the two dates to datediff, separated by commas.
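
The two approaches measure different things, which matters when converting code. The following Python sketch models both under simplifying assumptions: datediff counts the boundaries crossed for a given date part, and interval_field returns one field of the elapsed interval, in the style of date_part applied to a timestamp difference. Both function names are labels for this sketch only; week is omitted because its boundary definition is engine specific, and interval_field assumes the end is not earlier than the start.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def datediff(part: str, start: datetime, end: datetime) -> int:
    """Count the date part boundaries crossed between start and end."""
    if part == "year":
        return end.year - start.year
    if part == "month":
        return (end.year - start.year) * 12 + (end.month - start.month)
    unit = timedelta(seconds=UNIT_SECONDS[part])
    # Floor both values to whole units since a fixed epoch, then subtract.
    return (end - EPOCH) // unit - (start - EPOCH) // unit


def interval_field(part: str, start: datetime, end: datetime) -> int:
    """Return one field of (end - start), like date_part(part, end - start)."""
    elapsed = end - start
    fields = {
        "day": elapsed.days,
        "hour": elapsed.seconds // 3600,
        "minute": elapsed.seconds // 60 % 60,
        "second": elapsed.seconds % 60,
    }
    return fields[part]


start = datetime(2009, 1, 1, 21, 53, 10)
end = datetime(2009, 1, 1, 22, 56, 10)
print(interval_field("minute", start, end))  # minutes field of 01:03:00
print(datediff("minute", start, end))        # minute boundaries crossed
```

For the same pair of timestamps, the interval field is 3 while the boundary count is 63, so a converted query should be checked against the value the application actually expects.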

Extract ISOYEAR from date

ISO 8601 defines a week-numbering year (ISOYEAR). It begins with the Monday of the week containing the 4th of January, so for dates in early January or late December, the ISO year may be different from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days). The extra week is called a leap week; a year with such a week is called a leap year.

The following Greenplum query displays the ISOYEAR 2020:

SELECT extract(ISOYEAR from '2019-12-30' :: date) as ISOYEARS;

The following Amazon Redshift query displays the ISOYEAR 2020:

SELECT to_char('2019-12-30' :: date, 'IYYY') as ISOYEARS;
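
To sanity-check expected values while converting such queries, the ISO week-numbering year can be computed independently. The following Python sketch uses the standard library's date.isocalendar(), which follows ISO 8601; iso_year is a hypothetical helper name used only here.

```python
from datetime import date


def iso_year(value: date) -> int:
    """Return the ISO 8601 week-numbering year for a calendar date."""
    return value.isocalendar()[0]


# 2019-12-30 is the Monday that starts ISO week 1 of 2020.
print(iso_year(date(2019, 12, 30)))
# The day before still belongs to ISO year 2019.
print(iso_year(date(2019, 12, 29)))
```

Dates near the turn of the year are the cases worth testing, because that is where the ISO year and the Gregorian year diverge.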

The generate_series() function

Greenplum has adopted the PostgreSQL function generate_series(). But the generate_series function works differently with Amazon Redshift while retrieving records from the table because it’s a leader node-only function.

To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:

SELECT generate_series(1,10);

To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:

SELECT generate_series(0, extract(day from date '2009-01-07') :: int -1);

However, generate_series() doesn’t work in queries that fetch records from a table, join with another table’s rows, and process data on the compute nodes; such queries fail with an Invalid Operation error. The following code is an example of a SQL statement that works for Greenplum but fails for Amazon Redshift:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN generate_series(1,12) gen(fiscal_month)
WHERE condition_1

For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:

SELECT column1
FROM table_t1 t1
JOIN table_t2 t2
ON t2.code = t1.code
CROSS JOIN (select "number" as fiscal_month FROM table_t3 WHERE "number"<=12) gen
WHERE condition_1
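
The series-table pattern can be tried end to end with a small, self-contained sketch. The following Python example uses an in-memory SQLite database purely as a stand-in engine, so it demonstrates the shape of the rewrite rather than Amazon Redshift behavior. The table contents, the choice to load 100 numbers into table_t3, and the function name fiscal_month_rows are all assumptions made for illustration.

```python
import sqlite3


def fiscal_month_rows():
    """Cross join a joined row set with a pre-populated series table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE table_t1 (code TEXT, column1 TEXT);
        CREATE TABLE table_t2 (code TEXT);
        CREATE TABLE table_t3 ("number" INTEGER);  -- hypothetical series table
        INSERT INTO table_t1 VALUES ('A', 'alpha');
        INSERT INTO table_t2 VALUES ('A');
        """
    )
    # Populate the series table once; queries then filter the range they need.
    conn.executemany("INSERT INTO table_t3 VALUES (?)", [(n,) for n in range(1, 101)])
    rows = conn.execute(
        """
        SELECT t1.column1, gen.fiscal_month
        FROM table_t1 t1
        JOIN table_t2 t2 ON t2.code = t1.code
        CROSS JOIN (SELECT "number" AS fiscal_month
                    FROM table_t3 WHERE "number" <= 12) gen
        ORDER BY gen.fiscal_month
        """
    ).fetchall()
    conn.close()
    return rows


print(len(fiscal_month_rows()))
```

Because the series lives in an ordinary table, the join can run wherever the rest of the query runs, which is the property the rewrite relies on.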

Working with regular expressions (regex functions)

Amazon Redshift and Greenplum both support three conditions for pattern matching:

LIKE
SIMILAR TO
POSIX operators

In this post, we don’t discuss all of these pattern-matching conditions in detail. Instead, we discuss a few regex functions and regex escape characters that aren’t supported by Amazon Redshift.

Regexp_split_to_table function

The Regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.

This function has the following syntax:

Regexp_split_to_table(string,pattern [,flags])

For Greenplum, we use the following query:

select regexp_split_to_table ('bat,cat,hat','\,') as regexp_split_table_GP

For Amazon Redshift, the regexp_split_to_table function has to be converted using the Amazon Redshift split_part function. One possible conversion, which reuses the number series from the UNNEST() workaround, is as follows:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select split_part('bat,cat,hat', ',', ns.n::int) as regexp_split_table_RS
from ns
where ns.n<=regexp_count('bat,cat,hat', ',')+1

Another way to convert regexp_split_to_table is to extract each occurrence with regexp_substr, for example:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select regexp_substr('bat,cat,hat', '[^,]+', 1, ns.n::int) as regexp_split_table_RS
from ns
where ns.n<=regexp_count('bat,cat,hat', ',')+1

Substring from regex expressions

Substring (the string from the regex pattern) extracts the substring or value matching the pattern that is passed on. If there is no match, null is returned. For more information, refer to Pattern Matching.

We use the following code in Greenplum:

create temp table data1 ( col1 varchar );
insert into data1 values ('hellohowareyou 12\687687abcd');
select substring( col1 from '[A-Za-z]+$') from data1;

We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:

REGEXP_SUBSTR ( source_string, pattern [, position [, occurrence [, parameters ] ] ] )

select regexp_substr( col1, '[A-Za-z]+$') as substring_from_rs from data1
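
When converting many such expressions, it can help to check the expected match outside the database first. The following Python sketch models the pattern search with the standard re module; Python's regex dialect is not identical to POSIX, so treat it as a rough check. The helper name regexp_substr is reused here only as a label, and returning None for no match is a choice made for this sketch, not a statement about what Amazon Redshift returns.

```python
import re


def regexp_substr(source: str, pattern: str):
    """Return the first substring of source matching pattern, or None."""
    match = re.search(pattern, source)
    return match.group(0) if match else None


# Same sample value as in the queries above (raw string keeps the backslash).
value = r"hellohowareyou 12\687687abcd"
print(regexp_substr(value, "[A-Za-z]+$"))
```

The pattern is anchored with $, so only the trailing run of letters qualifies, even though the value also starts with letters.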

Key points while converting regular expression escapes

The Postgres escape character E doesn’t work in Amazon Redshift. Additionally, the following Greenplum regular expression constraints aren’t supported in Amazon Redshift:

\m – Matches only at the beginning of a word
\y – Matches only at the beginning or end of a word

For Amazon Redshift, use \\< and \\>, or [[:<:]] and [[:>:]] instead.

Use the following code for Greenplum:

select col1,
case
when (col1) ~ E'\\m[0-9]{2}[A-Z]{1}[0-9]{1}' then
regexp_replace(col1, E'([0-9]{2})([A-Z]{1})([0-9]{1})',E'\\2')
else 'nothing'
end as regex_test
from temp1123

Use the following code for Amazon Redshift:

select col1,
case
when (col1) ~ '\\<[0-9]{2}[A-Z]{1}[0-9]{1}\\>' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123

select col1,
case
when (col1) ~ '[[:<:]][0-9]{2}[A-Z]{1}[0-9]{1}[[:>:]]' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1}) (.*)','\\2')
else 'nothing'
end as regex_test
from temp1123

Conclusion

For heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and the AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.

There are some situations in which one function is used for the source environment, and the target environment doesn’t support the same function. In this case, manual conversion is required to produce the same results set and complete the database migration.

In some cases, use of a new window function supported by the target environment proves more efficient for analytical queries to process petabytes of data.

This post included several situations where manual code conversion is required, which can also improve code and query efficiency.

If you have any questions or suggestions, please share your feedback.

About the Authors

Jagrit Shrestha is a Database Consultant at Amazon Web Services (AWS). He works as a database specialist helping customers migrate their on-premises database workloads to AWS and provides technical guidance.

Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is the migration of relational databases from on-premises data centers to the AWS Cloud.

Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). He has expertise in database migration assessment, database migration, and modernizing database environments with purpose-built databases using AWS cloud database services. He is also focused on AWS web services for data analytics. In his spare time, he loves hiking, yoga, and other outdoor activities.

Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases to AWS.

How to query and visualize Macie sensitive data discovery results with Athena and QuickSight

2023-01-06 Keith Rozario

Post Syndicated from Keith Rozario original https://aws.amazon.com/blogs/security/how-to-query-and-visualize-macie-sensitive-data-discovery-results-with-athena-and-quicksight/

Amazon Macie is a fully managed data security service that uses machine learning and pattern matching to help you discover and protect sensitive data in Amazon Simple Storage Service (Amazon S3). With Macie, you can analyze objects in your S3 buckets to detect occurrences of sensitive data, such as personally identifiable information (PII), financial information, personal health information, and access credentials.

In this post, we walk you through a solution to gain comprehensive and organization-wide visibility into which types of sensitive data are present in your S3 storage, where the data is located, and how much is present. Once enabled, Macie automatically starts discovering sensitive data in your S3 storage and builds a sensitive data profile for each bucket. The profiles are organized in a visual, interactive data map, and you can use the data map to run targeted sensitive data discovery jobs. Both automated data discovery and targeted jobs produce rich, detailed sensitive data discovery results. This solution uses Amazon Athena and Amazon QuickSight to deep-dive on the Macie results, and to help you analyze, visualize, and report on sensitive data discovered by Macie, even when the data is distributed across millions of objects, thousands of S3 buckets, and thousands of AWS accounts. Athena is an interactive query service that makes it simpler to analyze data directly in Amazon S3 using standard SQL. QuickSight is a cloud-scale business intelligence tool that connects to multiple data sources, including Athena databases and tables.

This solution is relevant to data security, data governance, and security operations engineering teams.

The challenge: how to summarize sensitive data discovered in your growing S3 storage

Macie issues findings when an object is found to contain sensitive data. In addition to findings, Macie keeps a record of each S3 object analyzed in a bucket of your choice for long-term storage. These records are known as sensitive data discovery results, and they include additional context about your data in Amazon S3. Due to the large size of the results file, Macie exports the sensitive data discovery results to an S3 bucket, so you need to take additional steps to query and visualize the results. We discuss the differences between findings and results in more detail later in this post.

With the increasing number of data privacy guidelines and compliance mandates, customers need to scale their monitoring to encompass thousands of S3 buckets across their organization. The growing volume of data to assess, and the growing list of findings from discovery jobs, can make it difficult to review and remediate issues in a timely manner. In addition to viewing individual findings for specific objects, customers need a way to comprehensively view, summarize, and monitor sensitive data discovered across their S3 buckets.

To illustrate this point, we ran a Macie sensitive data discovery job on a dataset created by AWS. The dataset contains about 7,500 files that have sensitive information, and Macie generated a finding for each sensitive file analyzed, as shown in Figure 1.

Figure 1: Macie findings from the dataset

Your security team could spend days, if not months, analyzing these individual findings manually. Instead, we outline how you can use Athena and QuickSight to query and visualize the Macie sensitive data discovery results to understand your data security posture.

The additional information in the sensitive data discovery results will help you gain comprehensive visibility into your data security posture. With this visibility, you can answer questions such as the following:

What are the top 5 most commonly occurring sensitive data types?
Which AWS accounts have the most findings?
How many S3 buckets are affected by each of the sensitive data types?

Your security team can write their own customized queries to answer questions such as the following:

Is there sensitive data in AWS accounts that are used for development purposes?
Is sensitive data present in S3 buckets that previously did not contain sensitive information?
Was there a change in configuration for S3 buckets containing the greatest amount of sensitive data?

How are findings different from results?

As a Macie job progresses, it produces two key types of output: sensitive data findings (or findings for short), and sensitive data discovery results (or results).

Findings provide a report of potential policy violations with an S3 bucket, or the presence of sensitive data in a specific S3 object. Each finding provides a severity rating, information about the affected resource, and additional details, such as when Macie found the issue. Findings are published to the Macie console, AWS Security Hub, and Amazon EventBridge.

In contrast, results are a collection of records for each S3 object that a Macie job analyzed. These records contain information about objects that do and do not contain sensitive data, including up to 1,000 occurrences of each sensitive data type that Macie found in a given object, and whether Macie was unable to analyze an object because of issues such as permissions settings or use of an unsupported format. If an object contains sensitive data, the results record includes detailed information that isn’t available in the finding for the object.

One of the key benefits of querying results is to uncover gaps in your data protection initiatives—these gaps can occur when data in certain buckets can’t be analyzed because Macie was denied access to those buckets, or was unable to decrypt specific objects. The following table maps some of the key differences between findings and results.

	Findings	Results
Enabled by default	Yes	No
Location of published results	Macie console, Security Hub, and EventBridge	S3 bucket
Details of S3 objects that couldn’t be scanned	No	Yes
Details of S3 objects in which no sensitive data was found	No	Yes
Identification of files inside compressed archives that contain sensitive data	No	Yes
Number of occurrences reported per object	Up to 15	Up to 1,000
Retention period	90 days in Macie console	Defined by customer

Architecture

As shown in Figure 2, you can build out the solution in three steps:

Enable the results and publish them to an S3 bucket
Build out the Athena table to query the results by using SQL
Visualize the results with QuickSight

Figure 2: Architecture diagram showing the flow of the solution

Prerequisites

To implement the solution in this blog post, you must first complete the following prerequisites:

Enable Macie in your account. For instructions, see Getting started with Amazon Macie.
Set your account as the delegated Macie administrator account by using AWS Organizations. Optionally, you can also enable Macie in additional member accounts using AWS Organizations.
Sign up for QuickSight in the account that you set as the delegated Macie administrator. For instructions on how to sign up, see Signing up for an Amazon QuickSight subscription. You can use the QuickSight Standard Edition for this post.
To follow along with the examples in this post, download the sample dataset. The dataset is a single .ZIP file that contains three directories (fk, rt, and mkro). For this post, we used three accounts in our organization, created an S3 bucket in each of them, and then copied each directory to an individual bucket, as shown in Figure 3.

Figure 3: Sample data loaded into three different AWS accounts

Note: All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not relate back, nor is it intended to relate back, to any individual person.

Step 1: Enable the results and publish them to an S3 bucket

Publication of the discovery results to Amazon S3 is not enabled by default. The setup requires that you specify an S3 bucket to store the results (we also refer to this as the discovery results bucket), and use an AWS Key Management Service (AWS KMS) key to encrypt the bucket.

If you are analyzing data across multiple accounts in your organization, then you need to enable the results in your delegated Macie administrator account. You do not need to enable results in individual member accounts. However, if you’re running Macie jobs in a standalone account, then you should enable the Macie results directly in that account.

To enable the results

Open the Macie console.
Select the AWS Region from the upper right of the page.
From the left navigation pane, select Discovery results.
Select Configure now.
Select Create Bucket, and enter a unique bucket name. This will be the discovery results bucket name. Make note of this name because you will use it when you configure the Athena tables later in this post.
Under Encryption settings, select Create new key. This takes you to the AWS KMS console in a new browser tab.
In the AWS KMS console, do the following:
1. For Key type, choose Symmetric, and for Key usage, choose Encrypt and Decrypt.
2. Enter a meaningful key alias (for example, macie-results-key) and description.
3. (Optional) For simplicity, set your current user or role as the Key Administrator.
4. Set your current user/role as a user of this key in the key usage permissions step. This will give you the right permissions to run the Athena queries later.
5. Review the settings and choose Finish.
Navigate to the browser tab with the Macie console.
From the AWS KMS Key dropdown, select the new key.
To view KMS key policy statements that were automatically generated for your specific key, account, and Region, select View Policy. Copy these statements in their entirety to your clipboard.
Navigate back to the browser tab with the AWS KMS console and then do the following:
1. Select Customer managed keys.
2. Choose the KMS key that you created, choose Switch to policy view, and under Key policy, select Edit.
3. In the key policy, paste the statements that you copied. When you add the statements, do not delete any existing statements and make sure that the syntax is valid. Policies are in JSON format.
Navigate back to the Macie console browser tab.
Review the inputs in the Settings page for Discovery results and then choose Save. Macie will perform a check to make sure that it has the right access to the KMS key, and then it will create a new S3 bucket with the required permissions.
If you haven’t run a Macie discovery job in the last 90 days, you will need to run a new discovery job to publish the results to the bucket.

In this step, you created a new S3 bucket and KMS key that you are using only for Macie. For instructions on how to enable and configure the results using existing resources, see Storing and retaining sensitive data discovery results with Amazon Macie. Make sure to review Macie pricing details before creating and running a sensitive data discovery job.

Step 2: Build out the Athena table to query the results using SQL

Now that you have enabled the discovery results, Macie will begin publishing them into your discovery results bucket in the form of jsonl.gz files. Depending on the amount of data, there could be thousands of individual files, with each file containing multiple records. To identify the top five most commonly occurring sensitive data types in your organization, you would need to query all of these files together.
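Before you set up Athena, it can help to inspect a single results file locally to get a feel for the record structure. The following is a minimal Python sketch and not part of the original solution. It assumes that each line of a jsonl.gz file is one JSON record, and that the field names match the Athena table definition used later in this post (classificationDetails, result, sensitiveData, detections, type, and count). The function name count_detections is hypothetical.

```python
import gzip
import io
import json
from collections import Counter


def count_detections(jsonl_gz_bytes):
    """Sum detection counts per sensitive data type in one jsonl.gz results file."""
    totals = Counter()
    with gzip.open(io.BytesIO(jsonl_gz_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Records for objects without sensitive data have no sensitiveData entries.
            details = record.get("classificationDetails") or {}
            result = details.get("result") or {}
            for group in result.get("sensitiveData") or []:
                for detection in group.get("detections") or []:
                    totals[detection["type"]] += int(detection["count"])
    return totals
```

To try it, read a downloaded file in binary mode, pass the bytes to count_detections, and call most_common(5) on the returned counter. This only covers one file at a time, which is why the rest of this step uses Athena to query all of the files together.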

In this step, you will configure Athena so that it can query the results using SQL syntax. Before you can run an Athena query, you must specify a query result bucket location in Amazon S3. This is different from the Macie discovery results bucket that you created in the previous step.

If you haven’t set up Athena previously, we recommend that you create a separate S3 bucket, and specify a query result location using the Athena console. After you’ve set up the query result location, you can configure Athena.

To create a new Athena database and table for the Macie results

Open the Athena console, and in the query editor, enter the following data definition language (DDL) statement. In the context of SQL, a DDL statement is a syntax for creating and modifying database objects, such as tables. For this example, we named our database macie_results.
```
CREATE DATABASE macie_results;
```
After running this step, you’ll see a new database in the Database dropdown. Make sure that the new macie_results database is selected for the next queries.

Figure 4: Create database in the Athena console

Create a table in the database by using the following DDL statement. Make sure to replace <RESULTS-BUCKET-NAME> with the name of the discovery results bucket that you created previously.

CREATE EXTERNAL TABLE maciedetail_all_jobs(
	accountid string,
	category string,
	classificationdetails struct<jobArn:string,result:struct<status:struct<code:string,reason:string>,sizeClassified:string,mimeType:string,sensitiveData:array<struct<category:string,totalCount:string,detections:array<struct<type:string,count:string,occurrences:struct<lineRanges:array<struct<start:string,`end`:string,`startColumn`:string>>,pages:array<struct<pageNumber:string>>,records:array<struct<recordIndex:string,jsonPath:string>>,cells:array<struct<row:string,`column`:string,`columnName`:string,cellReference:string>>>>>>>,customDataIdentifiers:struct<totalCount:string,detections:array<struct<arn:string,name:string,count:string,occurrences:struct<lineRanges:array<struct<start:string,`end`:string,`startColumn`:string>>,pages:array<string>,records:array<string>,cells:array<string>>>>>>,detailedResultsLocation:string,jobId:string>,
	createdat string,
	description string,
	id string,
	partition string,
	region string,
	resourcesaffected struct<s3Bucket:struct<arn:string,name:string,createdAt:string,owner:struct<displayName:string,id:string>,tags:array<string>,defaultServerSideEncryption:struct<encryptionType:string,kmsMasterKeyId:string>,publicAccess:struct<permissionConfiguration:struct<bucketLevelPermissions:struct<accessControlList:struct<allowsPublicReadAccess:boolean,allowsPublicWriteAccess:boolean>,bucketPolicy:struct<allowsPublicReadAccess:boolean,allowsPublicWriteAccess:boolean>,blockPublicAccess:struct<ignorePublicAcls:boolean,restrictPublicBuckets:boolean,blockPublicAcls:boolean,blockPublicPolicy:boolean>>,accountLevelPermissions:struct<blockPublicAccess:struct<ignorePublicAcls:boolean,restrictPublicBuckets:boolean,blockPublicAcls:boolean,blockPublicPolicy:boolean>>>,effectivePermission:string>>,s3Object:struct<bucketArn:string,key:string,path:string,extension:string,lastModified:string,eTag:string,serverSideEncryption:struct<encryptionType:string,kmsMasterKeyId:string>,size:string,storageClass:string,tags:array<string>,embeddedFileDetails:struct<filePath:string,fileExtension:string,fileSize:string,fileLastModified:string>,publicAccess:boolean>>,
	schemaversion string,
	severity struct<description:string,score:int>,
	title string,
	type string,
	updatedat string)
ROW FORMAT SERDE
	'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
	'paths'='accountId,category,classificationDetails,createdAt,description,id,partition,region,resourcesAffected,schemaVersion,severity,title,type,updatedAt')
STORED AS INPUTFORMAT
	'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
	's3://<RESULTS-BUCKET-NAME>/AWSLogs/'

After you complete this step, you will see a new table named maciedetail_all_jobs in the Tables section of the query editor.

Query the results to start gaining insights. For example, to identify the top five most common sensitive data types, run the following query:

select sensitive_data.category,
	detections_data.type,
	sum(cast(detections_data.count as INT)) total_detections
from maciedetail_all_jobs,
	unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
	unnest(sensitive_data.detections) as t(detections_data)
where classificationdetails.result.sensitiveData is not null
and resourcesaffected.s3object.embeddedfiledetails is null
group by sensitive_data.category, detections_data.type
order by total_detections desc
LIMIT 5

Running this query on the sample dataset gives the following output.

Figure 5: Results of a query showing the five most common sensitive data types in the dataset

(Optional) The previous query ran on all of the results available for Macie. You can further query which accounts have the greatest amount of sensitive data detected.
```
select accountid,
	sum(cast(detections_data.count as INT)) total_detections
from maciedetail_all_jobs,
	unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
	unnest(sensitive_data.detections) as t(detections_data)
where classificationdetails.result.sensitiveData is not null
and resourcesaffected.s3object.embeddedfiledetails is null
group by accountid
order by total_detections desc
```
To test this query, we distributed the synthetic dataset across three member accounts in our organization, ran the query, and received the following output. If you enable Macie in just a single account, then you will only receive results for that one account.

Figure 6: Query results for total number of sensitive data detections across all accounts in an organization

For a list of more example queries, see the amazon-macie-results-analytics GitHub repository.
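If you prefer to run these queries from a script instead of the Athena console, the following Python sketch shows one possible approach. It is an illustration under stated assumptions, not part of the original solution: it expects an Athena client created with Boto3 (for example, boto3.client("athena")), the function name run_athena_query is hypothetical, and it reads only the first page of results.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a query with an Athena client and return the rows as lists of strings."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports a terminal state for the query.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {status['State']}")

    # Only the first page is read here; the first row holds the column headers.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

For the database argument, use macie_results. For the output location, use the S3 path of your Athena query result location, which is different from the discovery results bucket.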

Step 3: Visualize the results with QuickSight

In the previous step, you used Athena to query your Macie discovery results. Although the queries were powerful, they only produced tabular data as their output. In this step, you will use QuickSight to visualize the results of your Macie jobs.

Before creating the visualizations, you first need to grant QuickSight the right permissions to access Athena, the results bucket, and the KMS key that you used to encrypt the results.

To allow QuickSight access to the KMS key

Open the AWS Identity and Access Management (IAM) console, and then do the following:
1. In the navigation pane, choose Roles.
2. In the search pane for roles, search for aws-quicksight-s3-consumers-role-v0. If this role does not exist, search for aws-quicksight-service-role-v0.
3. Select the role and copy the role ARN. You will need this role ARN to modify the KMS key policy to grant permissions for this role.
Open the AWS KMS console and then do the following:
1. Select Customer managed keys.
2. Choose the KMS key that you created.
3. Paste the following statement in the key policy. When you add the statement, do not delete any existing statements, and make sure that the syntax is valid. Replace <QUICKSIGHT_SERVICE_ROLE_ARN> and <KMS_KEY_ARN> with your own information. Policies are in JSON format.

	{ "Sid": "Allow Quicksight Service Role to use the key",
		"Effect": "Allow",
		"Principal": {
			"AWS": "<QUICKSIGHT_SERVICE_ROLE_ARN>"
		},
		"Action": "kms:Decrypt",
		"Resource": "<KMS_KEY_ARN>"
	}
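Editing a key policy by hand makes it easy to drop an existing statement or break the JSON syntax. As a hedged illustration, the following Python sketch appends a statement to an existing policy document and fails early if the policy is not valid JSON. The helper name add_policy_statement is hypothetical, and replacing an existing statement that has the same Sid is a design choice of this sketch, not a KMS requirement. The placeholders must still be replaced with your own ARNs.

```python
import copy
import json


def add_policy_statement(policy_json, statement):
    """Return the policy JSON with `statement` added and other statements kept."""
    policy = json.loads(policy_json)  # raises ValueError if the policy is not valid JSON
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    # Replace a statement that has the same Sid instead of adding a duplicate.
    sid = statement.get("Sid")
    merged = [s for s in statements if sid is None or s.get("Sid") != sid]
    merged.append(copy.deepcopy(statement))
    policy["Statement"] = merged
    return json.dumps(policy, indent=2)


# The statement from this step, with the placeholders left for you to fill in.
quicksight_statement = {
    "Sid": "Allow Quicksight Service Role to use the key",
    "Effect": "Allow",
    "Principal": {"AWS": "<QUICKSIGHT_SERVICE_ROLE_ARN>"},
    "Action": "kms:Decrypt",
    "Resource": "<KMS_KEY_ARN>",
}
```

You could paste the output of add_policy_statement into the key policy editor in the AWS KMS console after reviewing it.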

To allow QuickSight access to Athena and the discovery results S3 bucket

In QuickSight, in the upper right, choose your user icon to open the profile menu, and choose US East (N. Virginia). You can only modify permissions in this Region.
In the upper right, open the profile menu again, and select Manage QuickSight.
Select Security & permissions.
Under QuickSight access to AWS services, choose Manage.
Make sure that the S3 checkbox is selected, click on Select S3 buckets, and then do the following:
1. Choose the discovery results bucket.
2. You do not need to check the box under Write permissions for Athena workgroup. The write permissions are not required for this post.
3. Select Finish.
Make sure that the Amazon Athena checkbox is selected.
Review the selections and be careful that you don’t inadvertently disable AWS services and resources that other users might be using.
Select Save.
In QuickSight, in the upper right, open the profile menu, and choose the Region where your results bucket is located.

Now that you’ve granted QuickSight the right permissions, you can begin creating visualizations.

To create a new dataset referencing the Athena table

On the QuickSight start page, choose Datasets.
On the Datasets page, choose New dataset.
From the list of data sources, select Athena.
Enter a meaningful name for the data source (for example, macie_datasource) and choose Create data source.
Select the database that you created in Athena (for example, macie_results).
Select the table that you created in Athena (for example, maciedetail_all_jobs), and choose Select.
You can either import the data into SPICE or query the data directly. We recommend that you use SPICE for improved performance, but the visualizations will still work if you query the data directly.
To create an analysis using the data as-is, choose Visualize.

You can then visualize the Macie results in the QuickSight console. The following example shows a delegated Macie administrator account that is running a visualization, with account IDs on the y axis and the count of affected resources on the x axis.

Figure 7: Visualize query results to identify total number of sensitive data detections across accounts in an organization

You can also visualize the aggregated data in QuickSight. For example, you can view the number of findings for each sensitive data category in each S3 bucket. The Athena table doesn’t provide aggregated data necessary for visualization. Instead, you need to query the table and then visualize the output of the query.

To query the table and visualize the output in QuickSight

On the Amazon QuickSight start page, choose Datasets.
On the Datasets page, choose New dataset.
Select the data source that you created in Athena (for example, macie_datasource) and then choose Create Dataset.
Select the database that you created in Athena (for example, macie_results).

Choose Use Custom SQL, enter the following query, and choose Confirm Query.

	select resourcesaffected.s3bucket.name as bucket_name,
		sensitive_data.category,
		detections_data.type,
		sum(cast(detections_data.count as INT)) total_detections
	from macie_results.maciedetail_all_jobs,
		unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
		unnest(sensitive_data.detections) as t(detections_data)
	where classificationdetails.result.sensitiveData is not null
	and resourcesaffected.s3object.embeddedfiledetails is null
	group by resourcesaffected.s3bucket.name, sensitive_data.category, detections_data.type
	order by total_detections desc

You can either import the data into SPICE or query the data directly.
To create an analysis using the data as-is, choose Visualize.

Now you can visualize the output of the query that aggregates data across your S3 buckets. For example, we used the name of the S3 bucket to group the results, and then we created a donut chart of the output, as shown in Figure 8.

Figure 8: Visualize query results for total number of sensitive data detections across each S3 bucket in an organization

From the visualizations, we can identify which buckets or accounts in our organizations contain the most sensitive data, for further action. Visualizations can also act as a dashboard to track remediation.

If you encounter permissions issues, see Insufficient permissions when using Athena with Amazon QuickSight and Troubleshooting key access for troubleshooting steps.

You can replicate the preceding steps by using the sample queries from the amazon-macie-results-analytics GitHub repo to view data that is aggregated across S3 buckets, AWS accounts, or individual Macie jobs. Using these queries with your Macie results will help you get started with tracking the security posture of your data in Amazon S3.

Conclusion

In this post, you learned how to enable sensitive data discovery results for Macie, query those results with Athena, and visualize the results in QuickSight.

Because Macie sensitive data discovery results provide more granular data than the findings, you can pursue a more comprehensive incident response when sensitive data is discovered. The sample queries in this post provide answers to some generic questions that you might have. After you become familiar with the structure, you can run other interesting queries on the data.

We hope that you can use this solution to write your own queries to gain further insights into sensitive data discovered in S3 buckets, according to the business needs and regulatory requirements of your organization. You can consider using this solution to better understand and identify data security risks that need immediate attention. For example, you can use this solution to answer questions such as the following:

Is financial information present in an AWS account where it shouldn’t be?
Are S3 buckets that contain PII properly hardened with access controls and encryption?

You can also use this solution to understand gaps in your data security initiatives by tracking files that Macie couldn’t analyze due to encryption or permission issues. To further expand your knowledge of Macie capabilities and features, see the following resources:

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Amazon Macie re:Post.


Near-real-time fraud detection using Amazon Redshift Streaming Ingestion with Amazon Kinesis Data Streams and Amazon Redshift ML

2023-01-04 Praveen Kadipikonda

Post Syndicated from Praveen Kadipikonda original https://aws.amazon.com/blogs/big-data/near-real-time-fraud-detection-using-amazon-redshift-streaming-ingestion-with-amazon-kinesis-data-streams-and-amazon-redshift-ml/

The importance of data warehouses and analytics performed on data warehouse platforms has been increasing steadily over the years, with many businesses coming to rely on these systems as mission-critical for both short-term operational decision-making and long-term strategic planning. Traditionally, data warehouses are refreshed in batch cycles, for example, monthly, weekly, or daily, so that businesses can derive various insights from them.

Many organizations are realizing that near-real-time data ingestion along with advanced analytics opens up new opportunities. For example, a financial institution can predict if a credit card transaction is fraudulent by running an anomaly detection program in near-real-time mode rather than in batch mode.

In this post, we show how Amazon Redshift can deliver streaming ingestion and machine learning (ML) predictions all in one platform.

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL.

Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply ML models using familiar SQL commands in Amazon Redshift data warehouses.

We’re excited to launch Amazon Redshift Streaming Ingestion for Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK), which enables you to ingest data directly from a Kinesis data stream or Kafka topic without having to stage the data in Amazon Simple Storage Service (Amazon S3). Amazon Redshift streaming ingestion allows you to achieve low latency in the order of seconds while ingesting hundreds of megabytes of data into your data warehouse.

This post demonstrates how Amazon Redshift, the cloud data warehouse, allows you to build near-real-time ML predictions by using Amazon Redshift streaming ingestion and Redshift ML features with familiar SQL.

Solution overview

By following the steps outlined in this post, you’ll be able to set up a producer streamer application on an Amazon Elastic Compute Cloud (Amazon EC2) instance that simulates credit card transactions and pushes data to Kinesis Data Streams in real time. You set up an Amazon Redshift Streaming Ingestion materialized view on Amazon Redshift, where streaming data is received. You train and build a Redshift ML model to generate real-time inferences against the streaming data.

The following diagram illustrates the architecture and process flow.

The step-by-step process is as follows:

The EC2 instance simulates a credit card transaction application, which inserts credit card transactions into the Kinesis data stream.
The data stream stores the incoming credit card transaction data.
An Amazon Redshift Streaming Ingestion materialized view is created on top of the data stream, which automatically ingests streaming data into Amazon Redshift.
You build, train, and deploy an ML model using Redshift ML. The Redshift ML model is trained using historical transactional data.
You transform the streaming data and generate ML predictions.
You can alert customers or update the application to mitigate risk.

This walkthrough uses credit card transaction streaming data. The credit card transaction data is fictitious and is based on a simulator. The customer dataset is also fictitious and is generated with some random data functions.

Prerequisites

Create an Amazon Redshift cluster.
Configure the cluster to use Redshift ML.
Create an AWS Identity and Access Management (IAM) user.
Update the IAM role attached to the Redshift cluster to include permissions to access the Kinesis data stream. For more information about the required policy, refer to Getting started with streaming ingestion.
Create an m5.4xlarge EC2 instance. We tested the producer application with an m5.4xlarge instance, but you are free to use another instance type. When creating the instance, use the amzn2-ami-kernel-5.10-hvm-2.0.20220426.0-x86_64-gp2 AMI.
To make sure that Python 3 is installed on the EC2 instance, run the following command to verify your Python version (note that the data extraction script only works on Python 3):

python3 --version

Install the following dependent packages to run the simulator program:

sudo yum install python3-pip
pip3 install numpy
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
pip3 install boto3

Configure the EC2 instance with the AWS credentials generated for the IAM user that you created in step 3 (for example, by running aws configure).

Set up Kinesis Data Streams

Amazon Kinesis Data Streams is a massively scalable and durable real-time data streaming service. It can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as website clickstreams, database event streams, financial transactions, social media feeds, IT logs, and location-tracking events. The data collected is available in milliseconds to enable real-time analytics use cases such as real-time dashboards, real-time anomaly detection, dynamic pricing, and more. We use Kinesis Data Streams because it’s a serverless solution that can scale based on usage.

Create a Kinesis data stream

First, you need to create a Kinesis data stream to receive the streaming data:

On the Amazon Kinesis console, choose Data streams in the navigation pane.
Choose Create data stream.
For Data stream name, enter cust-payment-txn-stream.
For Capacity mode, select On-demand.
For the rest of the options, choose the default options and follow through the prompts to complete the setup.
Capture the ARN for the created data stream to use in the next section when defining your IAM policy.


Set up permissions

For a streaming application to write to Kinesis Data Streams, the application needs to have access to Kinesis. You can use the following policy statement to grant the simulator process that you set up in the next section access to the data stream. Use the ARN of the data stream that you saved in the previous step.

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "Stmt123",
			"Effect": "Allow",
			"Action": [
				"kinesis:DescribeStream",
				"kinesis:PutRecord",
				"kinesis:PutRecords",
				"kinesis:GetShardIterator",
				"kinesis:GetRecords",
				"kinesis:ListShards",
				"kinesis:DescribeStreamSummary"
			],
			"Resource": [
				"arn:aws:kinesis:us-west-2:xxxxxxxxxxxx:stream/cust-payment-txn-stream"
			]
		}
	]
}

Configure the stream producer

Before we can consume streaming data in Amazon Redshift, we need a streaming data source that writes data to the Kinesis data stream. This post uses a custom-built data generator and the AWS SDK for Python (Boto3) to publish the data to the data stream. For setup instructions, refer to Producer Simulator. This simulator process publishes streaming data to the data stream created in the previous step (cust-payment-txn-stream).
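To illustrate what such a producer does, here is a minimal Python sketch. It is not the Producer Simulator itself. The field names are taken from the materialized view definition later in this post, while the value ranges, the meaning of TX_TIME_SECONDS and TX_TIME_DAYS (time elapsed since a simulation start), the use of the customer ID as the partition key, and the function names are all assumptions made for illustration. The kinesis argument is expected to be a Boto3 Kinesis client, for example boto3.client("kinesis").

```python
import json
import random
from datetime import datetime, timezone


def make_transaction(transaction_id, start, now=None):
    """Build one fictitious transaction with the fields that the streaming view extracts."""
    now = now or datetime.now(timezone.utc)
    elapsed = int((now - start).total_seconds())
    return {
        "TRANSACTION_ID": transaction_id,
        "TX_DATETIME": now.strftime("%Y-%m-%d %H:%M:%S"),
        "CUSTOMER_ID": random.randint(0, 4999),          # assumed ID range
        "TERMINAL_ID": random.randint(0, 9999),          # assumed ID range
        "TX_AMOUNT": round(random.uniform(1, 500), 2),   # assumed amount range
        "TX_TIME_SECONDS": elapsed,                      # assumed: seconds since start
        "TX_TIME_DAYS": elapsed // 86400,                # assumed: whole days since start
    }


def publish(kinesis, stream_name, transaction):
    """Send one transaction to the data stream as a UTF-8 encoded JSON payload."""
    return kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(transaction).encode("utf-8"),
        PartitionKey=str(transaction["CUSTOMER_ID"]),
    )
```

A loop that calls make_transaction and publish with the stream name cust-payment-txn-stream would approximate a steady flow of transactions. For realistic data, use the simulator referenced above.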

Configure the stream consumer

This section describes how to configure the stream consumer (the Amazon Redshift streaming ingestion view).

Amazon Redshift Streaming Ingestion provides low-latency, high-speed ingestion of streaming data from Kinesis Data Streams into an Amazon Redshift materialized view. You can configure your Amazon Redshift cluster to enable streaming ingestion and create a materialized view with auto refresh, using SQL statements, as described in Creating materialized views in Amazon Redshift. The automatic materialized view refresh process will ingest streaming data at hundreds of megabytes of data per second from Kinesis Data Streams into Amazon Redshift. This results in fast access to external data that is quickly refreshed.

After creating the materialized view, you can access your data from the data stream using SQL and simplify your data pipelines by creating materialized views directly on top of the stream.

Complete the following steps to configure an Amazon Redshift streaming materialized view:

On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Create a new IAM policy called KinesisStreamPolicy. For the streaming policy definition, see Getting started with streaming ingestion.
In the navigation pane, choose Roles.
Choose Create role.
Select AWS service, choose Redshift, and then choose Redshift - Customizable.
Create a new role called redshift-streaming-role and attach the policy KinesisStreamPolicy.
Create an external schema to map to Kinesis Data Streams:

CREATE EXTERNAL SCHEMA custpaytxn
FROM KINESIS IAM_ROLE 'arn:aws:iam::386xxxxxxxxx:role/redshift-streaming-role';

Now you can create a materialized view to consume the stream data. You can use the SUPER data type to store the payload as is, in JSON format, or use Amazon Redshift JSON functions to parse the JSON data into individual columns. For this post, we use the second method because the schema is well defined.

Create the streaming ingestion materialized view cust_payment_tx_stream. By specifying AUTO REFRESH YES in the following code, you can enable automatic refresh of the streaming ingestion view, which saves time by avoiding building data pipelines:

CREATE MATERIALIZED VIEW cust_payment_tx_stream
AUTO REFRESH YES
AS
SELECT approximate_arrival_timestamp ,
partition_key,
shard_id,
sequence_number,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TRANSACTION_ID')::bigint as TRANSACTION_ID,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_DATETIME')::character(50) as TX_DATETIME,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'CUSTOMER_ID')::int as CUSTOMER_ID,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TERMINAL_ID')::int as TERMINAL_ID,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_AMOUNT')::decimal(18,2) as TX_AMOUNT,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_TIME_SECONDS')::int as TX_TIME_SECONDS,
json_extract_path_text(from_varbyte(kinesis_data, 'utf-8'),'TX_TIME_DAYS')::int as TX_TIME_DAYS
FROM custpaytxn."cust-payment-txn-stream"
Where is_utf8(kinesis_data) AND can_json_parse(kinesis_data);

Note that json_extract_path_text has a length limitation of 64 KB. Also, from_varbyte filters out records larger than 65 KB.

Refresh the data.

The Amazon Redshift streaming materialized view is auto refreshed by Amazon Redshift for you. This way, you don’t need to worry about data staleness. With materialized view auto refresh, data is automatically loaded into Amazon Redshift as it becomes available in the stream. If you choose to manually perform this operation, use the following command:

REFRESH MATERIALIZED VIEW cust_payment_tx_stream ;

Now let’s query the streaming materialized view to see sample data:

Select * from cust_payment_tx_stream limit 10;

Let’s check how many records are in the streaming view now:

Select count(*) as stream_rec_count from cust_payment_tx_stream;

Now you have finished setting up the Amazon Redshift streaming ingestion view, which is continuously updated with incoming credit card transaction data. In my setup, I see that around 67,000 records have been pulled into the streaming view at the time when I ran my select count query. This number could be different for you.
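To make the parsing in the materialized view easier to follow, the following Python sketch mirrors its logic for a single record: it skips payloads that are not valid UTF-8 or not parseable JSON (as the WHERE clause does) and casts the seven fields to types comparable to the view's columns. This is an illustration only, and Amazon Redshift does this work for you. The function name parse_stream_record is hypothetical, and the sketch assumes that all seven fields are present in the payload.

```python
import json
from decimal import Decimal


def parse_stream_record(kinesis_data):
    """Return typed columns for one record, or None for payloads the view filters out."""
    try:
        text = kinesis_data.decode("utf-8")   # mirrors is_utf8(kinesis_data)
        payload = json.loads(text)            # mirrors can_json_parse(kinesis_data)
    except ValueError:  # covers both UnicodeDecodeError and JSON decoding errors
        return None
    if not isinstance(payload, dict):
        # Simplification in this sketch: only JSON objects are handled.
        return None
    return {
        "transaction_id": int(payload["TRANSACTION_ID"]),
        "tx_datetime": str(payload["TX_DATETIME"]),
        "customer_id": int(payload["CUSTOMER_ID"]),
        "terminal_id": int(payload["TERMINAL_ID"]),
        # decimal(18,2) in the view: keep two decimal places.
        "tx_amount": Decimal(str(payload["TX_AMOUNT"])).quantize(Decimal("0.01")),
        "tx_time_seconds": int(payload["TX_TIME_SECONDS"]),
        "tx_time_days": int(payload["TX_TIME_DAYS"]),
    }
```

A check like this can also be useful on the producer side, to confirm that a payload will pass the view's filter before it is sent to the stream.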

Redshift ML

With Redshift ML, you can bring a pre-trained ML model or build one natively. For more information, refer to Using machine learning in Amazon Redshift.

In this post, we train and build an ML model using a historical dataset. The data contains a tx_fraud field that flags a historical transaction as fraudulent or not. We build a supervised ML model using Redshift Auto ML, which learns from this dataset and predicts whether incoming transactions are fraudulent when they are run through the prediction functions.

In the following sections, we show how to set up the historical dataset and customer data.

Load the historical dataset

The historical table has more fields than the streaming data source. The additional fields include the customer’s most recent spend and terminal risk scores, such as the number of fraudulent transactions, which are computed by transforming the streaming data. There are also categorical variables, such as whether a transaction occurred on a weekend or at night.
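As a small illustration of the categorical variables, the following Python sketch derives weekend and nighttime flags from a transaction timestamp. The post does not state how these flags are computed, so the definitions here are assumptions: a weekend is Saturday or Sunday, and night is any hour up to and including 06:00. The function name time_flags and the night_end_hour parameter are hypothetical.

```python
from datetime import datetime


def time_flags(tx_datetime, night_end_hour=6):
    """Return (TX_DURING_WEEKEND, TX_DURING_NIGHT) as 0/1 integers."""
    # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
    during_weekend = 1 if tx_datetime.weekday() >= 5 else 0
    # Assumed night window: midnight through night_end_hour, inclusive.
    during_night = 1 if tx_datetime.hour <= night_end_hour else 0
    return during_weekend, during_night
```

For example, a transaction at 03:00 on a Saturday would be flagged as both a weekend and a nighttime transaction under these assumptions.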

To load the historical data, run the commands using the Amazon Redshift query editor.

Create the transaction history table with the following code. The DDL can also be found on GitHub.

CREATE TABLE cust_payment_tx_history
(
TRANSACTION_ID integer,
TX_DATETIME timestamp,
CUSTOMER_ID integer,
TERMINAL_ID integer,
TX_AMOUNT decimal(9,2),
TX_TIME_SECONDS integer,
TX_TIME_DAYS integer,
TX_FRAUD integer,
TX_FRAUD_SCENARIO integer,
TX_DURING_WEEKEND integer,
TX_DURING_NIGHT integer,
CUSTOMER_ID_NB_TX_1DAY_WINDOW decimal(9,2),
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW decimal(9,2),
CUSTOMER_ID_NB_TX_7DAY_WINDOW decimal(9,2),
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW decimal(9,2),
CUSTOMER_ID_NB_TX_30DAY_WINDOW decimal(9,2),
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW decimal(9,2),
TERMINAL_ID_NB_TX_1DAY_WINDOW decimal(9,2),
TERMINAL_ID_RISK_1DAY_WINDOW decimal(9,2),
TERMINAL_ID_NB_TX_7DAY_WINDOW decimal(9,2),
TERMINAL_ID_RISK_7DAY_WINDOW decimal(9,2),
TERMINAL_ID_NB_TX_30DAY_WINDOW decimal(9,2),
TERMINAL_ID_RISK_30DAY_WINDOW decimal(9,2)
);
Copy cust_payment_tx_history
FROM 's3://redshift-demos/redshiftml-reinvent/2022/ant312/credit-card-transactions/credit_card_transactions_transformed_balanced.csv'
iam_role default
ignoreheader 1
csv ;

Let’s check how many transactions are loaded:

select count(1) from cust_payment_tx_history;

Check the monthly fraud and non-fraud transactions trend:

SELECT to_char(tx_datetime, 'YYYYMM') as YearMonth,
sum(case when tx_fraud=1 then 1 else 0 end) as fraud_tx,
sum(case when tx_fraud=0 then 1 else 0 end) as non_fraud_tx,
count(*) as total_tx
FROM cust_payment_tx_history
GROUP BY YearMonth
ORDER BY YearMonth;

Create and load customer data

Now we create the customer table and load data, which contains the email and phone number of the customer. The following code creates the table, loads the data, and samples the table. The table DDL is available on GitHub.

CREATE TABLE public."customer_info"(customer_id bigint NOT NULL encode az64,
job_title character varying(500) encode lzo,
email_address character varying(100) encode lzo,
full_name character varying(200) encode lzo,
phone_number character varying(20) encode lzo,
city varchar(50),
state varchar(50)
);
COPY customer_info
FROM 's3://redshift-demos/redshiftml-reinvent/2022/ant312/customer-data/Customer_Data.csv'
IGNOREHEADER 1
IAM_ROLE default CSV;
Select count(1) from customer_info;

Our test data has about 5,000 customers. The following screenshot shows sample customer data.

Build an ML model

Our historical card transaction table has 6 months of data, which we now use to train and test the ML model.

The model takes the following fields as input:

TX_DURING_WEEKEND ,
TX_AMOUNT,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW

We get tx_fraud as output.

We split this data into training and test datasets. Transactions from 2022-04-01 to 2022-07-31 are for the training set. Transactions from 2022-08-01 to 2022-09-30 are used for the test set.
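
The date-based split described above can be sketched in a few lines of Python. This is only an illustration of the boundaries stated in the text; the function name assign_split and the "unused" label are assumptions made for this example.

```python
from datetime import date

# Split boundaries taken from the text above.
TRAIN_START, TRAIN_END = date(2022, 4, 1), date(2022, 7, 31)
TEST_START, TEST_END = date(2022, 8, 1), date(2022, 9, 30)


def assign_split(tx_date: date) -> str:
    """Return 'train', 'test', or 'unused' for a transaction date."""
    if TRAIN_START <= tx_date <= TRAIN_END:
        return "train"
    if TEST_START <= tx_date <= TEST_END:
        return "test"
    return "unused"
```

Splitting by date rather than at random keeps every test transaction later than every training transaction, which is closer to how the model is used on incoming data.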

Let’s create the ML model using the familiar SQL CREATE MODEL statement. We use a basic form of the Redshift ML command. The following method uses Amazon SageMaker Autopilot, which performs data preparation, feature engineering, model selection, and training automatically for you. In the code, replace the S3_BUCKET placeholder with the name of your own S3 bucket.

CREATE MODEL cust_cc_txn_fd
FROM (
SELECT TX_AMOUNT ,
TX_FRAUD ,
TX_DURING_WEEKEND ,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW
FROM cust_payment_tx_history
WHERE cast(tx_datetime as date) between '2022-04-01' and '2022-07-31'
) TARGET tx_fraud
FUNCTION fn_customer_cc_fd
IAM_ROLE default
SETTINGS (
S3_BUCKET '<replace this with your s3 bucket name>',
s3_garbage_collect off,
max_runtime 3600
);

I name the ML model cust_cc_txn_fd and the prediction function fn_customer_cc_fd. The FROM clause shows the input columns from the historical table public.cust_payment_tx_history. The TARGET parameter is set to tx_fraud, which is the target variable that we’re trying to predict. IAM_ROLE is set to default because the cluster is configured with this role; if yours is not, you have to provide your Amazon Redshift cluster IAM role ARN. I set max_runtime to 3,600 seconds, which is the time we give SageMaker to complete the process. Redshift ML deploys the best model that is identified in this time frame.

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. If you find that your model selection is not completing, increase the value for max_runtime, up to a maximum of 9999.

The CREATE MODEL command is run asynchronously, which means it runs in the background. You can use the SHOW MODEL command to see the status of the model. When the status shows as Ready, it means the model is trained and deployed.

show model cust_cc_txn_fd;

The following screenshots show our output.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metric that considers both precision and recall. It returns a value between 1 (perfect precision and recall) and 0 (lowest possible score). In my case, it’s 0.91. The higher the value, the better the model performance.
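
To make the F1 definition concrete, here is a minimal Python sketch that computes it as the harmonic mean of precision and recall from raw counts. The function name and the counts you pass in are assumptions for illustration; Redshift ML reports the score for you in the SHOW MODEL output.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts.

    tp: fraudulent transactions correctly flagged
    fp: legitimate transactions wrongly flagged
    fn: fraudulent transactions that were missed
    """
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 drops sharply when either precision or recall is low, which is why it is a common objective when one class (here, fraud) needs careful handling.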

Let’s test this model with the test dataset. Run the following command, which retrieves sample predictions:

SELECT
tx_fraud ,
fn_customer_cc_fd(
TX_AMOUNT ,
TX_DURING_WEEKEND ,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW )
FROM cust_payment_tx_history
WHERE cast(tx_datetime as date) >= '2022-08-01'
limit 10 ;

We see that some values are matching and some are not. Let’s compare predictions to the ground truth:

SELECT
tx_fraud ,
fn_customer_cc_fd(
TX_AMOUNT ,
TX_DURING_WEEKEND ,
TX_DURING_NIGHT ,
CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
TERMINAL_ID_NB_TX_1DAY_WINDOW ,
TERMINAL_ID_RISK_1DAY_WINDOW ,
TERMINAL_ID_NB_TX_7DAY_WINDOW ,
TERMINAL_ID_RISK_7DAY_WINDOW ,
TERMINAL_ID_NB_TX_30DAY_WINDOW ,
TERMINAL_ID_RISK_30DAY_WINDOW
) as prediction, count(*) as values
FROM public.cust_payment_tx_history
WHERE cast(tx_datetime as date) >= '2022-08-01'
Group by 1,2 ;
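
The grouped result has one row per (tx_fraud, prediction) pair with a count. As a rough sketch, the rows can be tallied into confusion-matrix cells and turned into precision and recall for the fraud class. The sample counts below are made up for illustration and are not results from this post.

```python
from typing import Iterable, Tuple


def confusion_counts(rows: Iterable[Tuple[int, int, int]]) -> dict:
    """Tally (tx_fraud, prediction, count) rows into confusion-matrix cells."""
    cells = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted, count in rows:
        if actual == 1 and predicted == 1:
            cells["tp"] += count
        elif actual == 0 and predicted == 1:
            cells["fp"] += count
        elif actual == 1 and predicted == 0:
            cells["fn"] += count
        else:
            cells["tn"] += count
    return cells


def precision_recall(cells: dict) -> Tuple[float, float]:
    """Precision and recall for the fraud class (label 1)."""
    predicted_pos = cells["tp"] + cells["fp"]
    actual_pos = cells["tp"] + cells["fn"]
    precision = cells["tp"] / predicted_pos if predicted_pos else 0.0
    recall = cells["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up counts for illustration only; they are not results from the post.
sample_rows = [(1, 1, 80), (1, 0, 20), (0, 1, 10), (0, 0, 890)]
cells = confusion_counts(sample_rows)
precision, recall = precision_recall(cells)
```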

We validated that the model is working and the F1 score is good. Let’s move on to generating predictions on streaming data.

Predict fraudulent transactions

Because the Redshift ML model is ready to use, we can use it to run predictions against the ingested streaming data. The historical dataset has more fields than the streaming data source, but the extra fields are just recency and frequency metrics for the customer, plus terminal risk metrics for fraudulent transactions.
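
To give a feel for what these recency and frequency metrics are, the following Python sketch computes a transaction count and average amount over trailing 1-day, 7-day, and 30-day windows for one customer. It is an assumption about the general shape of the features, not the SQL used in the views; in particular, including the current transaction in its own window is a choice made for this example.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def window_features(
    txs: List[Tuple[datetime, float]], as_of: datetime, days: int
) -> Tuple[int, float]:
    """Count and average amount of transactions in the `days` before `as_of`."""
    start = as_of - timedelta(days=days)
    # Keep transactions after the window start, up to and including as_of.
    amounts = [amount for ts, amount in txs if start < ts <= as_of]
    if not amounts:
        return 0, 0.0
    return len(amounts), sum(amounts) / len(amounts)


# Hypothetical transactions for one customer.
txs = [
    (datetime(2022, 9, 1, 10, 0), 20.0),
    (datetime(2022, 9, 25, 9, 30), 40.0),
    (datetime(2022, 9, 29, 23, 0), 60.0),
    (datetime(2022, 9, 30, 8, 0), 100.0),
]
as_of = datetime(2022, 9, 30, 8, 0)
features = {f"{d}day": window_features(txs, as_of, d) for d in (1, 7, 30)}
```

The same idea applies per terminal, where the risk fields would be derived from the share of fraudulent transactions in each window.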

We can apply the transformations on top of the streaming data very easily by embedding the SQL inside the views. Create the first view, which aggregates streaming data at the customer level. Then create the second view, which aggregates streaming data at terminal level, and the third view, which combines incoming transactional data with customer and terminal aggregated data and calls the prediction function all in one place. The code for the third view is as follows:

CREATE VIEW public.cust_payment_tx_fraud_predictions
as
select a.approximate_arrival_timestamp,
d.full_name , d.email_address, d.phone_number,
a.TRANSACTION_ID, a.TX_DATETIME, a.CUSTOMER_ID, a.TERMINAL_ID,
a.TX_AMOUNT ,
a.TX_TIME_SECONDS ,
a.TX_TIME_DAYS ,
public.fn_customer_cc_fd(a.TX_AMOUNT ,
a.TX_DURING_WEEKEND,
a.TX_DURING_NIGHT,
c.CUSTOMER_ID_NB_TX_1DAY_WINDOW ,
c.CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW ,
c.CUSTOMER_ID_NB_TX_7DAY_WINDOW ,
c.CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW ,
c.CUSTOMER_ID_NB_TX_30DAY_WINDOW ,
c.CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW ,
t.TERMINAL_ID_NB_TX_1DAY_WINDOW ,
t.TERMINAL_ID_RISK_1DAY_WINDOW ,
t.TERMINAL_ID_NB_TX_7DAY_WINDOW ,
t.TERMINAL_ID_RISK_7DAY_WINDOW ,
t.TERMINAL_ID_NB_TX_30DAY_WINDOW ,
t.TERMINAL_ID_RISK_30DAY_WINDOW ) Fraud_prediction
From
(select
Approximate_arrival_timestamp,
TRANSACTION_ID, TX_DATETIME, CUSTOMER_ID, TERMINAL_ID,
TX_AMOUNT ,
TX_TIME_SECONDS ,
TX_TIME_DAYS ,
case when extract(dow from cast(TX_DATETIME as timestamp)) in (0,6) then 1 else 0 end as TX_DURING_WEEKEND,
case when extract(hour from cast(TX_DATETIME as timestamp)) between 00 and 06 then 1 else 0 end as TX_DURING_NIGHT
FROM cust_payment_tx_stream) a
join terminal_transformations t
on a.terminal_id = t.terminal_id
join customer_transformations c
on a.customer_id = c.customer_id
join customer_info d
on a.customer_id = d.customer_id
;
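
The two derived flags in the inner query are meant to mark weekend transactions (Saturday or Sunday) and nighttime transactions (hours 0 through 6). The following Python sketch shows that intent outside of SQL; the function name tx_flags is an assumption for this example. Note that day-of-week numbering differs between systems: Redshift's extract(dow ...) numbers Sunday as 0 and Saturday as 6, while Python's weekday() numbers Monday as 0.

```python
from datetime import datetime


def tx_flags(tx_datetime: datetime) -> dict:
    """Derive the two categorical model inputs from a transaction timestamp."""
    # Python's weekday() numbers Monday as 0, so Saturday is 5 and Sunday is 6.
    during_weekend = 1 if tx_datetime.weekday() >= 5 else 0
    # Night is treated as hours 0 through 6 inclusive.
    during_night = 1 if 0 <= tx_datetime.hour <= 6 else 0
    return {"TX_DURING_WEEKEND": during_weekend, "TX_DURING_NIGHT": during_night}
```

Whatever convention you use, the flags computed on streaming data should follow the same definition as the ones in the training data, otherwise the model sees inputs it was not trained on.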

Run a SELECT statement on the view:

select * from
cust_payment_tx_fraud_predictions
where Fraud_prediction = 1;

As you run the SELECT statement repeatedly, the latest credit card transactions go through transformations and ML predictions in near real time.

This demonstrates the power of Amazon Redshift—with easy-to-use SQL commands, you can transform streaming data by applying complex window functions and apply an ML model to predict fraudulent transactions all in one step, without building complex data pipelines or building and managing additional infrastructure.

Expand the solution

Because the data streams in and ML predictions are made in near real time, you can build business processes for alerting your customer using Amazon Simple Notification Service (Amazon SNS), or you can lock the customer’s credit card account in an operational system.

This post doesn’t go into the details of these operations, but if you’re interested in learning more about building event-driven solutions using Amazon Redshift, refer to the following GitHub repository.
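
As a rough illustration only, and not taken from the referenced repository, an alerting step could look like the following Python sketch. It assumes rows fetched from the cust_payment_tx_fraud_predictions view as dictionaries with lowercase column names, and an SNS client passed in by the caller; the function names and the message wording are hypothetical.

```python
from typing import Iterable, Mapping


def build_alert_message(row: Mapping) -> str:
    """Format a short alert for one flagged transaction."""
    return (
        f"Possible fraud on transaction {row['transaction_id']} "
        f"for customer {row['customer_id']}: amount {row['tx_amount']}"
    )


def publish_alerts(sns_client, topic_arn: str, rows: Iterable[Mapping]) -> int:
    """Publish one SNS message per row predicted as fraudulent.

    Returns the number of messages sent.
    """
    sent = 0
    for row in rows:
        if row.get("fraud_prediction") != 1:
            continue  # Only alert on transactions the model flags as fraud.
        sns_client.publish(
            TopicArn=topic_arn,
            Subject="Possible fraudulent transaction",
            Message=build_alert_message(row),
        )
        sent += 1
    return sent
```

With boto3 installed, sns_client would typically be boto3.client("sns") and topic_arn the ARN of a topic you have created. Passing the client in keeps the function easy to test without AWS access.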

Clean up

To avoid incurring future charges, delete the resources that were created as part of this post.

Conclusion

In this post, we demonstrated how to set up a Kinesis data stream, configure a producer and publish data to streams, and then create an Amazon Redshift Streaming Ingestion view and query the data in Amazon Redshift. After the data was in the Amazon Redshift cluster, we demonstrated how to train an ML model, build a prediction function, and apply it to the streaming data to generate predictions in near real time.

If you have any feedback or questions, please leave them in the comments.

About the Authors

Bhanu Pittampally is an Analytics Specialist Solutions Architect based out of Dallas. He specializes in building analytic solutions. His background is in data warehouses—architecture, development, and administration. He has been in the data and analytics field for over 15 years.

Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked with building databases and data warehouse solutions for over 15 years.

Ritesh Kumar Sinha is an Analytics Specialist Solutions Architect based out of San Francisco. He has helped customers build scalable data warehousing and big data solutions for over 16 years. He loves to design and build efficient end-to-end solutions on AWS. In his spare time, he loves reading, walking, and doing yoga.

Building .NET 7 Applications with AWS CodeBuild

2023-01-04 Tom Moore

Post Syndicated from Tom Moore original https://aws.amazon.com/blogs/devops/building-net-7-applications-with-aws-codebuild/

AWS CodeBuild is a fully managed DevOps service for building and testing your applications. As a fully managed service, there is no infrastructure to manage and you pay only for the resources that you use when you are building your applications. CodeBuild provides a default build image that contains the current Long Term Support (LTS) version of the .NET SDK.

Microsoft released the latest version of .NET in November 2022. This release, .NET 7, includes performance improvements and new functionality, such as native ahead-of-time (Native AOT) compilation. .NET 7 is a Standard Term Support (STS) release of the .NET SDK. At this point, CodeBuild’s default image does not support .NET 7. For customers that want to start using .NET 7 right away in their applications, CodeBuild provides two means of customizing your build environment so that you can take advantage of .NET 7.

The first option for customizing your build environment is to provide CodeBuild with a container image you create and maintain. With this method, customers can define the build environment exactly as they need by including any SDKs, runtimes, and tools in the container image. However, this approach requires customers to maintain the build environment themselves, including patching and updating the tools. This approach will not be covered in this blog post.

A second means of customizing your build environment is by using the install phase of the buildspec file. This method uses the default CodeBuild image, and adds additional functionality at the point that a build starts. This has the advantage that customers do not have the overhead of patching and maintaining the build image.

Complete documentation on the syntax of the buildspec file can be found here:

https://docs.aws.amazon.com/codebuild/latest/userguide/build-spec-ref.html

Your application’s buildspec.yml file contains all of the commands necessary to build your application and prepare it for deployment. For a typical .NET application, the buildspec file will look like this:

```
version: 0.2
phases:
  build:
    commands:
      - dotnet restore Net7TestApp/Net7TestApp.sln
      - dotnet build Net7TestApp/Net7TestApp.sln
```

Note: This buildspec file contains only the commands to build the application; commands for packaging and storing build artifacts have been omitted for brevity.

To add the .NET 7 SDK to CodeBuild so that you can build your .NET 7 applications, we will use the install phase of the buildspec file. The install phase allows you to install any third-party libraries or SDKs prior to beginning your actual build.

```
  install:
    commands:
      - curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --channel STS 
```

The above command downloads the Microsoft install script for .NET and uses that script to download and install the latest version of the .NET SDK, from the Standard Term Support channel. This script will download files and set environment variables within the containerized build environment. You can use this same command to automatically pull the latest Long Term Support version of the .NET SDK by changing the command argument STS to LTS.

Your updated buildspec file will look like this:

```
version: 0.2    
phases:
  install:
    commands:
      - curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --channel STS 
  build:
    commands:
      - dotnet restore Net7TestApp/Net7TestApp.sln
      - dotnet build Net7TestApp/Net7TestApp.sln
```

Once you check in your buildspec file, you can start a build via the CodeBuild console, and your .NET application will be built using the .NET 7 SDK.
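
If you would rather start the build programmatically than from the console, the following Python sketch shows one way to do it with the CodeBuild StartBuild and BatchGetBuilds operations. The client is passed in by the caller, and the helper names are assumptions made for this example.

```python
def start_codebuild(codebuild_client, project_name: str) -> str:
    """Start a build for the given project and return the new build ID."""
    response = codebuild_client.start_build(projectName=project_name)
    return response["build"]["id"]


def get_build_status(codebuild_client, build_id: str) -> str:
    """Return the current status of a build, for example IN_PROGRESS or SUCCEEDED."""
    response = codebuild_client.batch_get_builds(ids=[build_id])
    return response["builds"][0]["buildStatus"]
```

With boto3 installed, codebuild_client would typically be boto3.client("codebuild"), and project_name is the name of your own CodeBuild project.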

As your build runs you will see output similar to this:

```
Welcome to .NET 7.0! 
--------------------- 
SDK Version: 7.0.100 
Telemetry 
--------- 
The .NET tools collect usage data in order to help us improve your experience. It is collected by Microsoft and shared with the community. You can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT environment variable to '1' or 'true' using your favorite shell. 

Read more about .NET CLI Tools telemetry: https://aka.ms/dotnet-cli-telemetry 
---------------- 
Installed an ASP.NET Core HTTPS development certificate. 
To trust the certificate run 'dotnet dev-certs https --trust' (Windows and macOS only). 
Learn about HTTPS: https://aka.ms/dotnet-https 
---------------- 
Write your first app: https://aka.ms/dotnet-hello-world 
Find out what's new: https://aka.ms/dotnet-whats-new 
Explore documentation: https://aka.ms/dotnet-docs 
Report issues and find source on GitHub: https://github.com/dotnet/core 
Use 'dotnet --help' to see available commands or visit: https://aka.ms/dotnet-cli 
-------------------------------------------------------------------------------------- 
Determining projects to restore... 
Restored /codebuild/output/src095190443/src/git-codecommit.us-east-2.amazonaws.com/v1/repos/net7test/Net7TestApp/Net7TestApp/Net7TestApp.csproj (in 586 ms). 
[Container] 2022/11/18 14:55:08 Running command dotnet build Net7TestApp/Net7TestApp.sln 
MSBuild version 17.4.0+18d5aef85 for .NET 
Determining projects to restore... 
All projects are up-to-date for restore. 
Net7TestApp -> /codebuild/output/src095190443/src/git-codecommit.us-east-2.amazonaws.com/v1/repos/net7test/Net7TestApp/Net7TestApp/bin/Debug/net7.0/Net7TestApp.dll 
Build succeeded. 
0 Warning(s) 
0 Error(s) 
Time Elapsed 00:00:04.63 
[Container] 2022/11/18 14:55:13 Phase complete: BUILD State: SUCCEEDED 
[Container] 2022/11/18 14:55:13 Phase context status code: Message: 
[Container] 2022/11/18 14:55:13 Entering phase POST_BUILD 
[Container] 2022/11/18 14:55:13 Phase complete: POST_BUILD State: SUCCEEDED 
[Container] 2022/11/18 14:55:13 Phase context status code: Message:
```

Conclusion

Adding .NET 7 support to AWS CodeBuild is easily accomplished by adding a single line to your application’s buildspec.yml file, stored alongside your application source code. This change allows you to keep up to date with the latest versions of .NET while still taking advantage of the managed runtime provided by the CodeBuild service.

AWS CIRT announces the release of five publicly available workshops

2022-12-22 Steve de Vera

Post Syndicated from Steve de Vera original https://aws.amazon.com/blogs/security/aws-cirt-announces-the-release-of-five-publicly-available-workshops/

Greetings from the AWS Customer Incident Response Team (CIRT)! AWS CIRT is dedicated to supporting customers during active security events on the customer side of the AWS Shared Responsibility Model.

Over the past year, AWS CIRT has responded to hundreds of such security events, including the unauthorized use of AWS Identity and Access Management (IAM) credentials, ransomware and data deletion in an AWS account, and billing increases due to the creation of unauthorized resources to mine cryptocurrency.

We are excited to release five workshops that simulate these security events to help you learn the tools and procedures that AWS CIRT uses on a daily basis to detect, investigate, and respond to such security events. The workshops cover AWS services and tools, such as Amazon GuardDuty, AWS CloudTrail, Amazon CloudWatch, Amazon Athena, and AWS WAF, as well as some open source tools written and published by AWS CIRT.

To access the workshops, you just need an AWS account, an internet connection, and the desire to learn more about incident response in the AWS Cloud! Choose the following links to access the workshops.

Unauthorized IAM Credential Use – Security Event Simulation and Detection

During this workshop, you will simulate the unauthorized use of IAM credentials by using a script invoked within AWS CloudShell. The script will perform reconnaissance and privilege escalation activities that AWS CIRT has commonly seen and that are typically performed during events of this nature. You will also learn some tools and processes that AWS CIRT uses, and how to use these tools to find evidence of unauthorized activity by using IAM credentials.

Ransomware on S3 – Security Event Simulation and Detection

During this workshop, you will use an AWS CloudFormation template to replicate an environment with multiple IAM users and five Amazon Simple Storage Service (Amazon S3) buckets. AWS CloudShell will then run a bash script that simulates data exfiltration and data deletion events that replicate a ransomware-based security event. You will also learn the tools and processes that AWS CIRT uses to respond to similar events, and how to use these tools to find evidence of unauthorized S3 bucket and object deletions.

Cryptominer Based Security Events – Simulation and Detection

During this workshop, you will simulate a cryptomining security event by using a CloudFormation template to initialize three Amazon Elastic Compute Cloud (Amazon EC2) instances. These EC2 instances will mimic cryptomining activity by performing DNS requests to known cryptomining domains. You will also learn the tools and processes that AWS CIRT uses to respond to similar events, and how to use these tools to find evidence of unauthorized creation of EC2 instances and communication with known cryptomining domains.

SSRF on IMDSv1 – Simulation and Detection

During this workshop, you will simulate the unauthorized use of a web application that is hosted on an EC2 instance configured to use Instance Metadata Service Version 1 (IMDSv1) and vulnerable to server-side request forgery (SSRF). You will learn how web application vulnerabilities, such as SSRF, can be used to obtain credentials from an EC2 instance. You will also learn the tools and processes that AWS CIRT uses to respond to this type of access, and how to use these tools to find evidence of the unauthorized use of EC2 instance credentials through web application vulnerabilities such as SSRF.

AWS CIRT Toolkit For Automating Incident Response Preparedness

During this workshop, you will install and experiment with some common tools and utilities that AWS CIRT uses on a daily basis to detect security misconfigurations, respond to active events, and assist customers with protecting their infrastructure.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Monitor AWS workloads without a single line of code with Logz.io and Kinesis Firehose

2022-12-19 Amos Etzion

Post Syndicated from Amos Etzion original https://aws.amazon.com/blogs/big-data/monitor-aws-workloads-without-a-single-line-of-code-with-logz-io-and-kinesis-firehose/

Observability data provides near real-time insights into the health and performance of AWS workloads, so that engineers can quickly address production issues and troubleshoot them before widespread customer impact.

As AWS workloads grow, observability data has been exploding, which requires flexible big data solutions to handle the throughput of large and unpredictable volumes of observability data.

Solution overview

One option is Amazon Kinesis Data Firehose, which is a popular service for streaming huge volumes of AWS data for storage and analytics. By receiving metrics from Amazon CloudWatch metric streams, Amazon Kinesis Data Firehose can deliver data to observability solutions.

Among these observability solutions is Logz.io, which can now ingest metric data from Amazon Kinesis Data Firehose and make it easier to get metrics from your AWS account to your Logz.io account for analysis, alerting, and correlation with logs and traces.

In a few clicks and a few configurations, we’ll see how you can start streaming your metric data (and soon, log data!) to Logz.io for storage and analysis.

Prerequisites

Logz.io account – Create a free trial here
Logz.io shipping token – Learn about metrics tokens here. You need to be a Logz.io administrator.
Access to Amazon CloudWatch and Amazon Kinesis Data Firehose with the appropriate permissions to manage HTTP endpoints.
Appropriate permissions to create an Amazon Simple Storage Service (Amazon S3) bucket

Sending Amazon CloudWatch metric data to Logz.io with an Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is a service for ingesting, processing, and loading data from large, distributed sources such as logs or clickstreams into multiple consumers for storage and real-time analytics. Kinesis Data Firehose supports more than 50 sources and destinations as of today. This integration can be set up in minutes without a single line of code and enables near real-time analytics for observability data generated by AWS services by using Amazon CloudWatch, Amazon Kinesis Data Firehose, and Logz.io.

Once the integration is configured, Logz.io customers can open the Infrastructure Monitoring product to see their data coming in and populating their dashboards. To see some of the data analytics and correlation you get with Logz.io, check out this short demonstration.

Let’s begin a step-by-step tutorial for setting up the integration.

Start by going to Amazon Kinesis Data Firehose and creating a delivery stream with Data Firehose.

Kinesis Firehose Console

Next, select a source and destination. Select Direct Put as the source and Logz.io as the destination.
Next, configure the destination settings. Give the HTTP endpoint a name, which should include logz.io.
Select from the dropdown the appropriate endpoint you would like to use.

If you’re sending data to a European region, then set it to Logz.io Metrics EU. Or you can use the us-east-1 destination by selecting Logz.io Metrics US.

Next, add your Logz.io Shipping Token. You can find this by going to Settings in Logz.io and selecting Manage Tokens, which requires Logz.io administrator access. This ensures that your account is only ingesting data from the defined sources (e.g., this Amazon Kinesis Data Firehose delivery stream).

Kinesis Stream config

Keep Content encoding on Disabled and set your desired Retry Duration.

You can also configure Buffer hints to your preferences.

Next, determine your Backup settings in case something goes wrong. In most cases, it’s only necessary to back up the failed data. Simply choose an Amazon S3 bucket or create a new one to store data if it doesn’t make it to Logz.io. Then, select Create a delivery stream.

Now it’s time to connect Amazon CloudWatch to our Amazon Kinesis Data Firehose Delivery Stream.

Navigate to Amazon CloudWatch and select Streams in the Metrics menu. Select Create metric stream.
Next, you can either select to send all your Amazon CloudWatch metrics to Logz.io, or only metrics from specified namespaces.

In this case, we chose Amazon Elastic Compute Cloud (Amazon EC2), Amazon Relational Database Service (Amazon RDS), AWS Lambda, and Elastic Load Balancing (ELB).

Under Configuration, choose the Select an existing Firehose owned by your account option and choose the Amazon Kinesis Data Firehose you just configured.

Metric Streams Config

If you’d like, you can choose additional statistics in the Add additional statistics box. These provide percentile statistics that are helpful for monitoring metrics such as latency (for example, to see which services have the highest latency). This may increase your costs.

Lastly, give your metric stream a name and choose Create metric stream.

That’s it! Without writing a single line of code, we configured an integration with AWS and Logz.io that enables fast and easy infrastructure monitoring through Amazon CloudWatch data collection.

Your metrics will be stored in Logz.io for 18 months out of the box, without requiring any overhead management.

You can also begin to build dashboards and alerts to begin monitoring – like this Amazon EC2 monitoring dashboard below.

ec2 monitoring dashboard Logz.io

Conclusion

This post demonstrated how to configure an integration with AWS and Logz.io for efficient infrastructure monitoring through Amazon CloudWatch.

To learn more about building metrics dashboards in Logz.io, you can watch this video.

Currently, some users might find that they are sending more data than they really need, which can raise costs. In future versions of this integration, it will be easier to narrow down the metrics to reduce costs.

Want to try it yourself? Create a Logz.io account today, navigate to our infrastructure monitoring product, and start streaming metric data to Logz.io to start monitoring.

About the authors

Amos Etzion – Product Manager at Logz.io

Charlie Klein – Product Marketing Manager at Logz.io

Mark Kriaf – Partner Solutions Architect at AWS

Introducing native Delta Lake table support with AWS Glue crawlers

2022-12-19 Noritaka Sekiyama

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/

Delta Lake is an open-source project that helps implement modern data lake architectures commonly built on Amazon S3 or other cloud storage. With Delta Lake, you can achieve ACID transactions, time travel queries, change data capture (CDC), and other common use cases on the cloud. Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum.

AWS Glue includes Delta crawler, a capability that makes discovering datasets simpler by scanning Delta Lake transaction logs in Amazon Simple Storage Service (Amazon S3), extracting their schema, creating manifest files in Amazon S3, and automatically populating the AWS Glue Data Catalog, which keeps the metadata current. The newly created AWS Glue Data Catalog table has the format SymlinkTextInputFormat. Delta crawler creates a manifest file, which is a text file containing the list of data files that query engines such as Presto, Trino, or Athena can use to query the table rather than finding the files with a directory listing. A previous blog post demonstrated how it works. Manifest files needed to be regenerated periodically to include newer transactions from the original Delta Lake tables, which resulted in expensive I/O operations, longer processing times, and an increased storage footprint.

With today’s launch, AWS Glue crawler is adding support for creating AWS Glue Data Catalog tables for native Delta Lake tables, without requiring manifest files to be generated. This improves the customer experience because you no longer have to regenerate manifest files whenever a new partition becomes available or a table’s metadata changes. Because native Delta Lake tables and automatic schema evolution need no additional manual intervention, this reduces the time to insight by making newly ingested data quickly available for analysis with your preferred analytics and machine learning (ML) tools.

Amazon Athena SQL engine version 3 started supporting the Delta Lake native connector. AWS Glue for Apache Spark also supports the Delta Lake native connector in Glue version 3.0 and later. Amazon EMR supports Delta Lake in EMR release version 6.9.0 and later. This means that you can query the Delta transaction log directly in Amazon Athena, AWS Glue for Apache Spark, and Amazon EMR, which makes the experience of working with native Delta Lake tables seamless across the platforms.

This post demonstrates how AWS Glue crawlers work with native Delta Lake tables and describes typical use cases to query native Delta Lake tables.

How AWS Glue crawler works with native Delta Lake tables

Now AWS Glue crawler has two different options:

Native table: Create a native Delta Lake table definition on AWS Glue Data Catalog.
Symlink table: Create a symlink-based manifest table definition on AWS Glue Data Catalog from a Delta Lake table, and generate its symlink files on Amazon S3.

Native table

Native Delta Lake tables are accessible from Amazon Athena (engine version 3), AWS Glue for Apache Spark (Glue version 3.0 and later), Amazon EMR (release version 6.9.0 and later), and other platforms that support Delta Lake tables. With native Delta Lake tables, you get capabilities such as ACID transactions while maintaining just a single source of truth.

Symlink table

Symlink tables are a consistent snapshot of a native Delta Lake table, represented using the SymlinkTextInputFormat using parquet files. The symlink tables are accessible from Amazon Athena and Amazon Redshift Spectrum.

Since the symlink tables are a snapshot of the original native Delta Lake tables, you need to maintain both the original native Delta Lake tables and the symlink tables. When the data or schema in an original Delta Lake table is updated, the symlink tables in the AWS Glue Data Catalog may become out of sync. This means that you can still query the symlink table and get a consistent result, but the result reflects an earlier point in time.

Crawl native Delta Lake tables using AWS Glue crawler

In this section, let’s go through how to crawl native Delta Lake tables using AWS Glue crawler.

Prerequisites

Here are the prerequisites for this tutorial:

Install and configure AWS Command Line Interface (AWS CLI).
Create your S3 bucket if you do not have it.
Create your IAM role for AWS Glue crawler if you do not have it.
Run the following command to copy the sample Delta Lake table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)

$ aws s3 sync s3://aws-bigdata-blog/artifacts/delta-lake-crawler/sample_delta_table/ s3://your_s3_bucket/data/sample_delta_table

Create a Delta Lake crawler

A Delta Lake crawler can be created through the AWS Glue console, AWS Glue SDK, or AWS CLI. Specify a DeltaTarget with the following configurations:

DeltaTables – A list of S3 DeltaPaths where the Delta Lake tables are located. Note that each path must be the parent of a _delta_log folder. For example, if the Delta transaction log is located at s3://bucket/sample_delta_table/_delta_log, then the path s3://bucket/sample_delta_table/ should be provided.
WriteManifest – A Boolean value indicating whether or not the crawler should write the manifest files for each DeltaPath. This parameter is only applicable for Delta Lake tables created via manifest files.
CreateNativeDeltaTable – A Boolean value indicating whether the crawler should create a native Delta Lake table. If set to False, the crawler would create a symlink table instead. Note that both WriteManifest and CreateNativeDeltaTable options can’t be set to True.
ConnectionName – An optional connection name stored in the Data Catalog that the crawler should use to access Delta Lake tables backed by a VPC.
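
For readers who prefer the SDK over the console, the DeltaTarget settings above can be sketched as a plain Python dictionary. This is a minimal sketch, not the authors' code: the helper name build_delta_crawler_request and the role ARN are hypothetical placeholders, and the two validation checks simply restate the notes in the list above.

```python
from typing import Any, Dict, List, Optional


def build_delta_crawler_request(
    name: str,
    role_arn: str,
    database: str,
    delta_paths: List[str],
    create_native_table: bool = True,
    write_manifest: bool = False,
    connection_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the Glue CreateCrawler API (hypothetical helper)."""
    # The two options are mutually exclusive, as noted in the parameter list.
    if create_native_table and write_manifest:
        raise ValueError("WriteManifest and CreateNativeDeltaTable cannot both be True")
    for path in delta_paths:
        # Each path must be the parent of _delta_log, not the log folder itself.
        if path.rstrip("/").endswith("_delta_log"):
            raise ValueError(f"Pass the table root, not the log folder: {path}")

    delta_target: Dict[str, Any] = {
        "DeltaTables": delta_paths,
        "WriteManifest": write_manifest,
        "CreateNativeDeltaTable": create_native_table,
    }
    if connection_name:
        delta_target["ConnectionName"] = connection_name

    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"DeltaTargets": [delta_target]},
    }


request = build_delta_crawler_request(
    name="delta-lake-native-crawler",
    role_arn="arn:aws:iam::012345678901:role/your_glue_role",  # placeholder role
    database="delta_lake_native",
    delta_paths=["s3://your_s3_bucket/data/sample_delta_table/"],
)
# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

Keeping the request as a dictionary makes it easy to review or unit test before anything is created in your account.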

In this walkthrough, create the crawler through the console. Complete the following steps to create a Delta Lake crawler:

Open the AWS Glue console.
Choose Crawlers.
Choose Create crawler.
For Name, enter delta-lake-native-crawler, and choose Next.
Under Data sources, choose Add a data source.
For Data source, select Delta Lake.
For Include delta lake table path(s), enter s3://your_s3_bucket/data/sample_delta_table/.
For Create tables for querying, choose Create Native tables.
Choose Add a Delta Lake data source.
Choose Next.
For Existing IAM role, choose your IAM role, then choose Next.
For Target database, choose Add database. When the Add database dialog appears, for Database name, enter delta_lake_native, then choose Create. Choose Next.
Choose Create crawler.
The Delta Lake crawler can be triggered to run through the console, or through the SDK or AWS CLI using the StartCrawler API. It can also be scheduled through the console to run at specific times. In this walkthrough, run the crawler through the console.
Select delta-lake-native-crawler, and choose Run.
Wait for the crawler to complete.
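
If you would rather script the last two steps, the following is a minimal sketch of starting the crawler and waiting for it to finish. It assumes a boto3 Glue client is passed in; the function name run_crawler_and_wait, the polling interval, and the timeout are illustrative choices, not part of the original walkthrough.

```python
import time
from typing import Any


def run_crawler_and_wait(glue: Any, name: str, poll_seconds: float = 15.0,
                         timeout_seconds: float = 1800.0) -> str:
    """Start a crawler, wait until it is READY again, and return the last crawl status."""
    glue.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue.get_crawler(Name=name)["Crawler"]
        # The crawler moves through RUNNING and STOPPING before returning to READY.
        if crawler["State"] == "READY":
            return crawler.get("LastCrawl", {}).get("Status", "UNKNOWN")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {name} did not finish within {timeout_seconds} seconds")


# Example call, assuming boto3 is installed and credentials are configured:
#   import boto3
#   status = run_crawler_and_wait(boto3.client("glue"), "delta-lake-native-crawler")
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real account.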

After the crawler has run, you can see the Delta Lake table definition in the AWS Glue console:

You can also verify an AWS Glue table definition through the following AWS CLI command:

$ aws glue get-table --database delta_lake_native --name sample_delta_table
{
    "Table": {
        "Name": "sample_delta_table",
        "DatabaseName": "delta_lake_native",
        "Owner": "owner",
        "CreateTime": "2022-11-08T12:11:20+09:00",
        "UpdateTime": "2022-11-08T13:19:06+09:00",
        "LastAccessTime": "2022-11-08T13:19:06+09:00",
        "Retention": 0,
        "StorageDescriptor": {
            "Columns": [
                {
                    "Name": "product_id",
                    "Type": "string"
                },
                {
                    "Name": "product_name",
                    "Type": "string"
                },
                {
                    "Name": "price",
                    "Type": "bigint"
                },
                {
                    "Name": "currency",
                    "Type": "string"
                },
                {
                    "Name": "category",
                    "Type": "string"
                },
                {
                    "Name": "updated_at",
                    "Type": "double"
                }
            ],
            "Location": "s3://your_s3_bucket/data/sample_delta_table/",
            "AdditionalLocations": [],
            "InputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {
                    "serialization.format": "1",
                    "path": "s3://your_s3_bucket/data/sample_delta_table/"
                }
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {
                "EXTERNAL": "true",
                "UPDATED_BY_CRAWLER": "delta-lake-native-connector",
                "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":[{\"name\":\"product_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"product_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CURRENCY\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"category\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"updated_at\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
                "CrawlerSchemaSerializerVersion": "1.0",
                "CrawlerSchemaDeserializerVersion": "1.0",
                "spark.sql.partitionProvider": "catalog",
                "classification": "delta",
                "spark.sql.sources.schema.numParts": "1",
                "spark.sql.sources.provider": "delta",
                "delta.lastCommitTimestamp": "1653462383292",
                "delta.lastUpdateVersion": "6",
                "table_type": "delta"
            },
            "StoredAsSubDirectories": false
        },
        "PartitionKeys": [],
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "EXTERNAL": "true",
            "UPDATED_BY_CRAWLER": "delta-lake-native-connector",
            "spark.sql.sources.schema.part.0": "{\"type\":\"struct\",\"fields\":[{\"name\":\"product_id\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"product_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"price\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}},{\"name\":\"CURRENCY\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"category\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"updated_at\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}",
            "CrawlerSchemaSerializerVersion": "1.0",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "spark.sql.partitionProvider": "catalog",
            "classification": "delta",
            "spark.sql.sources.schema.numParts": "1",
            "spark.sql.sources.provider": "delta",
            "delta.lastCommitTimestamp": "1653462383292",
            "delta.lastUpdateVersion": "6",
            "table_type": "delta"
        },
        "CreatedBy": "arn:aws:sts::012345678901:assumed-role/AWSGlueServiceRole/AWS-Crawler",
        "IsRegisteredWithLakeFormation": false,
        "CatalogId": "012345678901",
        "IsRowFilteringEnabled": false,
        "VersionId": "1",
        "DatabaseId": "0bd458e335a2402c828108f267bc770c"
    }
}
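
The response above is long, so a small script can pull out the fields that matter when checking that the crawler created a native Delta Lake table. The following is a sketch under stated assumptions: summarize_delta_table is a hypothetical helper, the sample dictionary is a trimmed copy of the output above, and only spark.sql.sources.schema.part.0 is read (a schema split across several parts would need them concatenated).

```python
import json
from typing import Any, Dict


def summarize_delta_table(get_table_response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the Delta-related fields from a Glue get-table response (hypothetical helper)."""
    table = get_table_response["Table"]
    params = table.get("Parameters", {})
    descriptor = table["StorageDescriptor"]
    # The crawler stores the Spark schema as a JSON string in a table parameter.
    spark_schema = json.loads(
        params.get("spark.sql.sources.schema.part.0", '{"fields": []}')
    )
    return {
        "is_native_delta": params.get("table_type") == "delta"
        and params.get("spark.sql.sources.provider") == "delta",
        "location": descriptor["Location"],
        "glue_columns": [(c["Name"], c["Type"]) for c in descriptor["Columns"]],
        "spark_fields": [(f["name"], f["type"]) for f in spark_schema["fields"]],
    }


# Trimmed copy of the response shown above, for illustration only.
sample = {
    "Table": {
        "StorageDescriptor": {
            "Columns": [
                {"Name": "product_id", "Type": "string"},
                {"Name": "price", "Type": "bigint"},
            ],
            "Location": "s3://your_s3_bucket/data/sample_delta_table/",
        },
        "Parameters": {
            "table_type": "delta",
            "spark.sql.sources.provider": "delta",
            "spark.sql.sources.schema.part.0": json.dumps({
                "type": "struct",
                "fields": [
                    {"name": "product_id", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "price", "type": "long", "nullable": True, "metadata": {}},
                ],
            }),
        },
    }
}
summary = summarize_delta_table(sample)
print(summary["is_native_delta"], summary["spark_fields"])
```

Comparing glue_columns with spark_fields is a quick way to spot differences between the two views of the schema. In the full output above, for example, the Spark schema keeps CURRENCY in uppercase while the Glue column list shows currency.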

After you create the table definition on AWS Glue Data Catalog, AWS analytics services such as Athena and AWS Glue Spark jobs are able to query the Delta Lake table.

Query Delta Lake tables using Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale. You can use Athena to query your S3 data lake for use cases such as data exploration for machine learning (ML) and AI, business intelligence (BI) reporting, and ad hoc querying.

There are now two ways to use Delta Lake tables in Athena:

For native table: Use Athena’s newly launched native support for Delta Lake tables. You can learn more in Querying Delta Lake tables. This method no longer requires regenerating manifest files after every transaction. Data updates are available for queries in Athena as soon as they are performed in the original Delta Lake tables, and you get up to 40 percent improvement in query performance over querying manifest files. Since Athena optimizes data scans in native Delta Lake queries using statistics in Delta Lake files, you get the advantage of reduced cost for Athena queries. This post focuses on this approach.
For symlink table: Use SymlinkTextInputFormat to query symlink tables through manifest files generated from Delta Lake tables. This was previously the only manner in which Delta Lake table querying was supported via Athena and is no longer recommended when you use only Athena to query the Delta Lake tables.

To use the native Delta Lake connector in Athena, you need to use Athena engine version 3. If you are using an older engine version, change the engine version.

Complete the following steps to start queries on Athena:

Open the Amazon Athena console.
Run the following query.

SELECT * FROM "delta_lake_native"."sample_delta_table" limit 10;

The following screenshot shows our output:

Query Delta Lake tables using AWS Glue for Apache Spark

AWS Glue for Apache Spark natively supports Delta Lake. AWS Glue version 3.0 (Apache Spark 3.1.1) supports Delta Lake 1.0.0, and AWS Glue version 4.0 (Apache Spark 3.3.0) supports Delta Lake 2.1.0. With this native support, all you need to configure Delta Lake is a single job parameter, --datalake-formats delta. There is no need to configure a separate connector for Delta Lake from AWS Marketplace, which reduces the configuration steps required to use Delta Lake in AWS Glue for Apache Spark.
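
For a regular (non-notebook) job, the same parameter goes into the job's default arguments. The following is a minimal sketch that only assembles the request; the helper name build_delta_job_request, the role ARN, and the script location are hypothetical placeholders.

```python
from typing import Any, Dict


def build_delta_job_request(job_name: str, role_arn: str, script_location: str,
                            glue_version: str = "3.0") -> Dict[str, Any]:
    """Assemble keyword arguments for the Glue CreateJob API with Delta Lake enabled."""
    # Native Delta Lake support starts with Glue version 3.0.
    if float(glue_version) < 3.0:
        raise ValueError("Native Delta Lake support requires Glue version 3.0 or later")
    return {
        "Name": job_name,
        "Role": role_arn,
        "GlueVersion": glue_version,
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        # The single job parameter that turns on the native Delta Lake support.
        "DefaultArguments": {"--datalake-formats": "delta"},
    }


job_request = build_delta_job_request(
    job_name="delta-sql",
    role_arn="arn:aws:iam::012345678901:role/your_glue_role",  # placeholder role
    script_location="s3://your_s3_bucket/scripts/delta_sql.py",  # placeholder path
)
# With boto3 installed and credentials configured, the job could be created with:
#   import boto3
#   boto3.client("glue").create_job(**job_request)
```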

AWS Glue also provides a serverless notebook interface called AWS Glue Studio notebook to query and process data interactively. Complete the following steps to launch AWS Glue Studio notebook and query a Delta Lake table:

On the AWS Glue console, choose Jobs in the navigation pane.
Under Create job, select Jupyter Notebook.
Choose Create a new notebook from scratch, and choose Create.
For Job name, enter delta-sql.
For IAM role, choose your IAM role. If you don’t have your own role for the AWS Glue job, create it by following the steps documented in the AWS Glue Developer Guide.
Choose Start notebook job.
Copy and paste the following code to the first cell and run the cell.
```
%glue_version 3.0
%%configure
{
  "--datalake-formats": "delta"
}
```

Run the existing cell containing the following code.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Copy and paste the following code to the third cell and run the cell.

%%sql
SELECT * FROM `delta_lake_native`.`sample_delta_table` limit 10

The following screenshot shows our output:

Clean up

Now for the final step, cleaning up the resources:

Delete your data under your S3 path: s3://your_s3_bucket/data/sample_delta_table/.
Delete the AWS Glue crawler delta-lake-native-crawler.
Delete the AWS Glue database delta_lake_native.
Delete the AWS Glue notebook job delta-sql.

Conclusion

This post demonstrated how to crawl native Delta Lake tables using an AWS Glue crawler and how to query the crawled tables from Athena and Glue Spark jobs. Start using AWS Glue crawlers for your own native Delta Lake tables.

If you have comments or feedback, please feel free to leave them in the comments.

About the authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He works based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his road bike.

Kyle Duong is a Software Development Engineer on the AWS Glue and Lake Formation team. He is passionate about building big data technologies and distributed systems. In his free time, he enjoys cycling or playing basketball.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Getting started with AWS Glue Data Quality for ETL Pipelines

2022-12-16 Deenbandhu Prasad

Post Syndicated from Deenbandhu Prasad original https://aws.amazon.com/blogs/big-data/getting-started-with-aws-glue-data-quality-for-etl-pipelines/

Today, hundreds of thousands of customers use data lakes for analytics and machine learning. However, data engineers have to cleanse and prepare this data before it can be used. The underlying data has to be accurate and recent for customers to make confident business decisions. Otherwise, data consumers lose trust in the data and make suboptimal or incorrect decisions. Evaluating whether the data is accurate and recent is a common task for data engineers. Various data quality tools exist today, but they usually require manual processes to monitor data quality.

AWS Glue Data Quality is a preview feature of AWS Glue that measures and monitors the data quality of Amazon Simple Storage Service (Amazon S3) data lakes and of AWS Glue extract, transform, and load (ETL) jobs. This is an open preview feature, so it is already enabled in your account in the available Regions. You can easily define and measure data quality checks in the AWS Glue Studio console without writing code, which simplifies managing data quality.

This post is Part 2 of a four-post series to explain how AWS Glue Data Quality works. Check out the previous post in this series:

Getting started with AWS Glue Data Quality

Part 1: Getting started with AWS Glue Data Quality from AWS Glue Data Catalog
Part 2: Getting started with AWS Glue Data Quality for ETL Pipelines

In this post, we show how to create an AWS Glue job that measures and monitors the data quality of a data pipeline. We also show how to take action based on the data quality results.

Solution overview

Let’s consider an example use case in which a data engineer needs to build a data pipeline to ingest the data from a raw zone to a curated zone in a data lake. As a data engineer, one of your key responsibilities—along with extracting, transforming, and loading data—is validating the quality of data. Identifying data quality issues upfront helps you prevent placing bad data in the curated zone and avoid arduous data corruption incidents.

In this post, you’ll learn how to easily set up built-in and custom data validation checks in your AWS Glue job to prevent bad data from corrupting the downstream high-quality data.

The dataset used for this post is synthetically generated; the following screenshot shows an example of the data.

Set up resources with AWS CloudFormation

This post includes an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs.

The CloudFormation template generates the following resources:

An Amazon Simple Storage Service (Amazon S3) bucket (gluedataqualitystudio-*).
The following prefixes and objects in the S3 bucket:
- datalake/raw/customer/customer.csv
- datalake/curated/customer/
- scripts/
- sparkHistoryLogs/
- temporary/
AWS Identity and Access Management (IAM) users, roles, and policies. The IAM role (GlueDataQualityStudio-*) has permission to read and write from the S3 bucket.
AWS Lambda functions and IAM policies required by those functions to create and delete this stack.

To create your resources, complete the following steps:

Sign in to the AWS CloudFormation console in the us-east-1 Region.
Choose Launch Stack:
Select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack and wait for the stack creation step to complete.

Implement the solution

To start configuring your solution, complete the following steps:

On the AWS Glue Studio console, choose Jobs in the navigation pane.
Select Visual with a blank canvas and choose Create.
Choose the Job Details tab to configure the job.
For Name, enter GlueDataQualityStudio.
For IAM Role, choose the role starting with GlueDataQualityStudio-*.
For Glue version, choose Glue 3.0.
For Job bookmark, choose Disable. This allows you to run this job multiple times with the same input dataset.
For Number of retries, enter 0.
In the Advanced properties section, provide the S3 bucket created by the CloudFormation template (starting with gluedataqualitystudio-*).
Choose Save.
After the job is saved, choose the Visual tab and on the Source menu, choose Amazon S3.
On the Data source properties – S3 tab, for S3 source type, select S3 location.
Choose Browse S3 and navigate to prefix /datalake/raw/customer/ in the S3 bucket starting with gluedataqualitystudio-* .
Choose Infer schema.
On the Action menu, choose Evaluate Data Quality.
Choose the Evaluate Data Quality node.

On the Transform tab, you can now start building data quality rules. The first rule you create is to check if Customer_ID is unique and not null using the isPrimaryKey rule.
On the Rule types tab of the DQDL rule builder, search for isprimarykey and choose the plus sign.
On the Schema tab of the DQDL rule builder, choose the plus sign next to Customer_ID.
In the rule editor, delete id.

The next rule we add checks that the First_Name column value is present for all the rows.
You can also enter the data quality rules directly in the rule editor. Add a comma (,) and enter IsComplete "First_Name", after the first rule.

Next, you add a custom rule to validate that no row exists without Telephone or Email.
Enter the following custom rule in the rule editor:
```
CustomSql "select count(*) from primary where Telephone is null and Email is null" = 0
```
The Evaluate Data Quality feature provides actions to manage the outcome of a job based on the data quality results.
For this post, select Fail job when data quality fails and choose Fail job without loading target data actions. In the Data quality output setting section, choose Browse S3 and navigate to prefix dqresults in the S3 bucket starting with gluedataqualitystudio-*.
On the Target menu, choose Amazon S3.
Choose the Data target – S3 bucket node.
On the Data target properties – S3 tab, for Format, choose Parquet, and for Compression Type, choose Snappy.
For S3 Target Location, choose Browse S3 and navigate to the prefix /datalake/curated/customer/ in the S3 bucket starting with gluedataqualitystudio-*.
Choose Save, then choose Run.
You can view the job run details on the Runs tab. In our example, the job fails with the error message “AssertionError: The job failed due to failing DQ rules for node: <node>.”
You can review the data quality result on the Data quality tab. In our example, the custom data quality validation failed because one of the rows in the dataset had no Telephone or Email value. The Evaluate Data Quality results are also written to the S3 bucket in JSON format, based on the data quality result location parameter of the node.
Navigate to the dqresults prefix under the S3 bucket starting with gluedataqualitystudio-*. You will see that the data quality result is partitioned by date.
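
To recap the three rules built in the steps above, the following sketch assembles them into a single DQDL ruleset string and mirrors their intent in plain Python. It is an approximation for small in-memory samples only, not how AWS Glue evaluates the rules; the helper names and the two sample rows are made up for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

# The three rules from this walkthrough, in DQDL syntax.
RULES = [
    'IsPrimaryKey "Customer_ID"',
    'IsComplete "First_Name"',
    'CustomSql "select count(*) from primary where Telephone is null and Email is null" = 0',
]


def to_dqdl(rules: List[str]) -> str:
    """Join individual rules into one DQDL ruleset string."""
    return "Rules = [\n    " + ",\n    ".join(rules) + "\n]"


def evaluate_locally(rows: List[Row]) -> Dict[str, bool]:
    """Plain-Python approximation of the three rules (None stands in for null)."""
    ids = [row.get("Customer_ID") for row in rows]
    missing_contact = sum(
        1 for row in rows
        if row.get("Telephone") is None and row.get("Email") is None
    )
    return {
        # Unique and not null.
        "IsPrimaryKey Customer_ID": None not in ids and len(set(ids)) == len(ids),
        # Present in every row.
        "IsComplete First_Name": all(row.get("First_Name") is not None for row in rows),
        # No row lacks both Telephone and Email.
        "CustomSql missing contact = 0": missing_contact == 0,
    }


# Made-up sample rows; the second row has neither Telephone nor Email.
rows = [
    {"Customer_ID": "1", "First_Name": "Ana", "Telephone": "555-0100", "Email": None},
    {"Customer_ID": "2", "First_Name": "Bo", "Telephone": None, "Email": None},
]
print(to_dqdl(RULES))
print(evaluate_locally(rows))
```

With these sample rows, only the custom rule evaluates to False, which is the same kind of failure the job run in this walkthrough reports.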

The following is the output of the JSON file. You can use this file output to build custom data quality visualization dashboards.

You can also monitor the Evaluate Data Quality node through Amazon CloudWatch metrics and set alarms to send notifications about data quality results. To learn more on how to set up CloudWatch alarms, refer to Using Amazon CloudWatch alarms.

Clean up

To avoid incurring future charges and to clean up unused roles and policies, delete the resources you created:

Delete the GlueDataQualityStudio job you created as part of this post.
On the AWS CloudFormation console, delete the GlueDataQualityStudio stack.

Conclusion

AWS Glue Data Quality offers an easy way to measure and monitor the data quality of your ETL pipeline. In this post, you learned how to take necessary actions based on the data quality results, which helps you maintain high data standards and make confident business decisions.

To learn more about AWS Glue Data Quality, check out the documentation:

Evaluating data quality with AWS Glue Studio
AWS Glue Data Quality (preview)
To dive into the AWS Glue Data Quality APIs, take a look at the documentation: Data Quality API

About the Authors

Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architecture on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.

Yannis Mentekidis is a Senior Software Development Engineer on the AWS Glue team.

Organize your AWS Serverless code to prevent merge conflicts

2022-12-16 Mark Curtis

Post Syndicated from Mark Curtis original https://aws.amazon.com/blogs/devops/organize-your-aws-serverless-code-to-prevent-merge-conflicts/

How do you prevent the most common merge conflicts when your team is working on a Serverless application? How do you make sure that your team stays productive and avoids large merge issues while trying to update the same crucial files simultaneously? The answer to both questions is code organization. You can use cfn-include and swagger-cli to organize, collaborate on, and maintain a large serverless application, as well as support a large or decentralized development team.

Real life inspiration

WRAP Technologies Inc. (WRAP) creates advanced technologies for the protection and security of public safety. Their WRAP Reality product allows law enforcement agencies to train their officers using virtual reality-based scenarios.

Too many cooks in the kitchen

When multiple developers collaborate on a serverless architecture built with AWS CloudFormation, and its extensions such as the AWS Serverless Application Model (SAM), the nature of specifying resources in both the template.yaml and the optional OpenAPI.yaml specification for Amazon API Gateway leads to merge conflicts, such as the one demonstrated in the following figure where two developers are adding different API endpoints at the same time. These conflicts detract from the developer’s time and agility. Furthermore, navigating and maintaining the long template files required for a larger serverless architecture slows development as the developer scans large files to find a particular resource definition.

Figure 1. The frustrating merge conflicts.

By refactoring and organizing the CloudFormation and OpenAPI files, your development team can realize several benefits:

Improve developer efficiency by decomposing large, hard-to-manage files into a series of well-organized and single-purpose files.
Enhance developer productivity by allowing each developer to have ownership of their own code, thereby reducing the need to coordinate merges with teammates.
Eliminate potential merge issues for files that generate the most conflicts during the development of a typical Serverless API application.

Rapid development

WRAP partnered with AWS to develop and host the backend for their new officer training management platform. This entirely new platform was developed, completed, and available for use in a matter of months. Moreover, it’s a collaboration of developers spread across multiple teams worldwide, all contributing to the same code base. By instituting the norms and techniques of this post, WRAP created a large and maintainable serverless application with minimal developer code collisions.

Development of the WRAP Reality training management system was accomplished using CloudFormation for defining Infrastructure as Code (IaC), and an Amazon API Gateway OpenAPI specification for defining API contracts. The development team for the WRAP Reality training management service leveraged agile development for expediency, including the GitHub Flow branching strategy. However, since project contributors were not co-located, several considerations were put in place to keep code development consistent and fast:

The API specifications and contracts were defined in OpenAPI (Swagger) specifications early in the development process, clearly defining the project structure up front, and allowing developers to independently build infrastructure components.
The two code assets central to the entire project – the CloudFormation template and the OpenAPI Specification – were decomposed into small, easily manageable components. This enabled components to be organized in a way that enhanced development productivity and practically eliminated the inevitable merge conflicts that come with large source code files that are being modified on a daily basis.

The development process was accelerated by utilizing OpenAPI integrations with AWS Services, as well as techniques for managing the OpenAPI specification and CloudFormation template files.

Sample project

To demonstrate these techniques, we’ll explore the following sample project comprised of API endpoints for “widget” management, available on GitHub. This project provides the following endpoints:

/widget PUT: Creation of a new widget
/widget GET: Retrieval of a widget
/reports/color GET: Retrieval of a set of widgets based on the widget color
/reports/filterpage GET: Retrieval of widgets based on specified filters

The overall architecture of the application is shown in the following diagram:

Figure 2. Architecture Diagram

The application comprises:

Amazon API Gateway is a fully-managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. In this example, API Gateway serves as the web service for the API endpoints. The mapping of data to and from the API endpoints to the Lambda functions is formally defined by an OpenAPI specification file.
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. In this example, four Lambda functions are used to service each of the four API calls.
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. DynamoDB is used as a persistent data store for widgets and associated properties.

OpenAPI and AWS service integration

When using API Gateway, developers have the option of using proxy Lambda integrations, or formally defining the API interface in an OpenAPI YAML file. The OpenAPI specification can be leveraged to document the API prior to development, and the example/mock features of the OpenAPI specification facilitate concurrent development by quickly establishing a working infrastructure to build upon. Furthermore, API documentation can be automatically generated from the OpenAPI specification.

As the number of endpoints increases, the OpenAPI specification file can grow in size, reaching thousands of lines of code that must be updated and maintained regularly by multiple developers. To aid in management and usability, the OpenAPI file can be decomposed into separate files for endpoints, responses, fields, and schemas.

Start with a “skeleton” file as an entry point for the OpenAPI definition, and then add a separate file for the definition of each endpoint or construct. For example, the sample project entry point is api/apiSkeleton.yaml, which contains the global definitions and effectively defines a simple list of endpoints and the reference ($ref) file path to each endpoint’s definition.

  /reports/color:
    $ref: './paths/reports/reportsColor.yaml'

  /reports/filterpage:
    $ref: './paths/reports/reportsFilterPage.yaml'

Diving into a file referenced by an endpoint, we see that it contains all of the specification details for that endpoint. Looking at the reportsColor.yaml file reveals the full endpoint specification for /reports/color:

get:
  description: Get widgets by color
  parameters:
    - in: path
      $ref: '../../requestParameters/color.yaml'
  responses:
    200:
      description: Get All the Widgets of a color
      content:
        application/json:
          schema:
            $ref: '../../schemas/widgetList.yaml'
    . . .

In turn, this endpoint specification can include further references to yaml files defining common parameters, schemas, and even full gateway responses. For example, color.yaml defines the color path variable:

type: string
description: "The widget's color"
example: "Red"

To paraphrase a common catch phrase, “With a great many files, comes a great responsibility for organization.” To this end, we offer the following organizational structure as a start. Place all of the related API specifications in an “api” subfolder of your project. Have child subfolders for field, metadata, and gateway response definition files. Then, create child subfolder trees for each branch of your endpoints that mirror the endpoint paths. This will result in a highly-organized directory structure, as seen in the sample project:

├── api
│   ├── apiSkeleton.yaml
│   ├── fields
│   │   ├── color.yaml
│   │   ├── metadata
│   │   │   ├── count.yaml
│   │   │   ├── message.yaml
│   │   └── widgetname.yaml
│   ├── gatewayResponses
│   │   ├── error.yaml
│   │   └── notFound.yaml
│   ├── paths
│   │   ├── reports
│   │   │   ├── reportsColor.yaml
│   │   │   └── reportsFilterPage.yaml
│   │   └── widget
│   │       ├── widgetPut.yaml
│   │       └── widgetWidgetnameGet.yaml

We still need a consolidated single OpenAPI file to provide to CloudFormation during deployment to AWS. Therefore, the multiple files are combined and validated using the swagger-cli bundle command, resulting in a single file for deployment. The bundle command must be executed before a CloudFormation build. This command can also be included as a shortcut in the Makefile as the “buildOpenApi” command:

swagger-cli bundle -o api/api.yaml --dereference -t yaml api/apiSkeleton.yaml

make buildOpenApi
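
To make concrete what bundling with --dereference does to the $ref entries, here is a toy Python sketch. It is not how swagger-cli is implemented: the files are held in an in-memory dictionary instead of being read from disk, there is no cycle detection, and merging sibling keys next to a $ref is an assumption made for this illustration.

```python
import posixpath
from typing import Any, Dict


def dereference(node: Any, current_file: str, files: Dict[str, Any]) -> Any:
    """Recursively replace {'$ref': '<relative file>'} with the referenced content."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and not ref.startswith("#"):
            # Resolve the reference relative to the file that contains it.
            target = posixpath.normpath(
                posixpath.join(posixpath.dirname(current_file), ref)
            )
            resolved = dereference(files[target], target, files)
            # Assumption: keys next to $ref (such as "in: path") are kept and merged.
            siblings = {
                key: dereference(value, current_file, files)
                for key, value in node.items() if key != "$ref"
            }
            if isinstance(resolved, dict):
                return {**siblings, **resolved}
            return resolved
        return {key: dereference(value, current_file, files) for key, value in node.items()}
    if isinstance(node, list):
        return [dereference(item, current_file, files) for item in node]
    return node


# In-memory stand-ins for three of the sample project's files (already parsed).
files = {
    "api/apiSkeleton.yaml": {
        "paths": {"/reports/color": {"$ref": "./paths/reports/reportsColor.yaml"}},
    },
    "api/paths/reports/reportsColor.yaml": {
        "get": {
            "description": "Get widgets by color",
            "parameters": [{"in": "path", "$ref": "../../requestParameters/color.yaml"}],
        },
    },
    "api/requestParameters/color.yaml": {
        "type": "string",
        "description": "The widget's color",
        "example": "Red",
    },
}
bundled = dereference(files["api/apiSkeleton.yaml"], "api/apiSkeleton.yaml", files)
print(bundled["paths"]["/reports/color"]["get"]["parameters"][0])
```

The point of the sketch is the path handling: each reference is resolved relative to the file that contains it, which is why endpoint files can be moved as a folder without touching the skeleton's other entries.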

Once compiled, api/api.yaml is then used normally for API Gateway integrations and as a Postman API Collection import. As api/api.yaml is dynamically compiled, it’s included in .gitignore and not checked in to AWS CodeCommit.

cfn-include and nested stacks

The CloudFormation template that defines the infrastructure for even a simple service can grow to considerable length, perhaps thousands of lines. This presents challenges from a support and continued development perspective, as specific code locations become difficult to find and merge conflicts become commonplace.

CloudFormation Nested Stacks are a method of breaking a large CloudFormation template into separate templates. When there are clear delineations between groups of resources in a stack, breaking it into separate nested stacks makes sense. There is also a limit of 500 resources in a single CloudFormation stack, and going above it requires nested or separate stacks. Depending on the complexity of the architecture and the frequency of updates, however, the nested stacks can also become large. Furthermore, in a serverless architecture, the logical separation of architecture layers into separate stacks may not be straightforward, for example when a Lambda function is triggered by an event sent to an EventBridge event bus and then sends a different event back to the same event bus.

In these cases, CloudFormation templates can be decomposed to further leverage cfn-include. With this technique, the top-level CloudFormation template becomes a skeleton file which contains the stack parameters, global specifications, a list of resource names without properties, and the outputs. The properties of each resource are contained in separate files, referenced by an ‘include’ directive.

CloudFormation template organization

To organize your CloudFormation template, deconstruct the template into one-file-per-resource, with one main “skeleton” file as the main entry point. This skeleton file contains the full parameters, global section, conditions, and output specification. The resources are specified by resource name in this skeleton file, and then an ‘include’ directive points to the file that contains the body of the resource declaration. See the following example of the main skeleton file with two resources:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: >
  Widget API Service
Globals:
  Function:
    Handler: app.lambda_handler
    Runtime: python3.8
Resources:

    WidgetApi:
        !Include ./resources/apigw/widgetApiGW.yaml

    WidgetDdbTable:
        !Include ./resources/dynamodb/widgetDdbTable.yaml

Then, the resource files contain the properties of that specific resource. For example, widgetApiGW.yaml defines an API Gateway:

Type: AWS::Serverless::Api
Properties:
  DefinitionBody:
    Fn::Transform:
      Name: AWS::Include
      Parameters:
        Location: api/api.yaml
  EndpointConfiguration:
    Type: REGIONAL
  StageName: prod
  TracingEnabled: true

This approach has the benefit of breaking the CloudFormation template into multiple small files, while still maintaining a top-level holistic view. The resource definitions, which normally comprise the majority of the content and can cause merge conflicts, are moved out of the main template.

For organization, you can create a directory in your project to contain the CloudFormation scripts. This directory also contains the entry-point skeleton file. Create further sub-folders for resources, and then further folders by resource type and architecture. We found that placing applicable AWS Identity and Access Management (IAM) role resource definitions in the same folder with the applied resource facilitated easier navigation. For example:

├── cloudformation
│   ├── resources
│   │   ├── apigw
│   │   │   └── widgetApiGW.yaml
│   │   ├── dynamodb
│   │   │   └── widgetDdbTable.yaml
│   │   └── lambda
│   │       ├── layers
│   │       │   └── lambdaDDBEnv.yaml
│   │       ├── reports
│   │       │   ├── reportsColorLambda.yaml
│   │       │   └── reportsColorLambdaRole.yaml
│   │       └── widget
│   │           ├── widgetGetLambda.yaml
│   │           └── widgetGetLambdaRole.yaml
│   └── templateSkeleton.yaml

The files must be reconstituted to a single template.yaml for CloudFormation build and deployment. This is accomplished with the cfn-include command. A convenience command can optionally be included in the Makefile.

cfn-include --yaml cloudformation/templateSkeleton.yaml > template.yaml

make buildTemplate

As the final template.yaml file is dynamically compiled, it’s included in .gitignore and not checked in to CodeCommit.

Conclusion

This post demonstrates techniques used by WRAP and AWS to rapidly develop and maintain key files in a serverless architecture. The techniques discussed in this post allowed the WRAP and AWS team to do the following:

Improve developer efficiency by decomposing large, hard-to-manage files into a series of well-organized and single purpose files.
Enhance developer productivity by allowing each developer to have ownership of their own piece of the code without having to coordinate with teammates.
Eliminate potential merge issues on the files that typically generate the most conflicts during the development of a typical Serverless API application.

Applying these techniques was one of the key factors in the rapid development of the WRAP Reality training framework.

Amazon EMR Serverless cost estimator

2022-12-16 Radhika Ravirala

Post Syndicated from Radhika Ravirala original https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-cost-estimator/

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run applications using open-source big data analytics frameworks such as Apache Spark and Hive without configuring, managing, and scaling clusters or servers. You get all the features of the latest open-source frameworks with the performance-optimized runtime of Amazon EMR, and without having to plan and operate instances and clusters.

With Amazon EMR, you can run your analytics applications on dedicated EMR clusters, on existing Amazon Elastic Kubernetes Service (Amazon EKS) clusters, or using the new EMR Serverless deployment option where you don’t have to manage clusters or instances. When you build a Spark or Hive application using an Amazon EMR release, say Amazon EMR 6.8, you can run the application on EMR clusters, on EKS clusters using Amazon EMR on EKS, or using EMR Serverless without having to change the application.

To learn about the benefits of each deployment option in EMR Serverless, refer to What are some of the feature differences between EMR Serverless and Amazon EMR on EC2? in the Amazon EMR FAQ. You can also learn about the pricing for these options from the Amazon EMR pricing page. Many customers already run data analytics applications on EMR clusters, and find that the new serverless option is simpler and less expensive.

In this post, we discuss how you can estimate what it may cost to run an application that currently runs on EMR clusters using the new serverless option, and perform this analysis simply by using your current application metrics. This approach helps you evaluate and adopt the deployment option that is most cost effective for the application. However, the Amazon EMR pricing page doesn’t tell you how you can easily estimate the cost of running your existing EMR cluster applications on EMR Serverless. In the following sections, we describe an approach that enables you to do that.

Although the example in this post discusses how you can get a cost estimate for applications running on EMR clusters, you can also use the approach if you’re running a Spark or Hive application elsewhere, and want to estimate the cost of running it on EMR Serverless. For example, if you run self-managed Spark or Hive applications on Amazon Elastic Compute Cloud (Amazon EC2) clusters, or if you run Spark jobs on AWS Glue, we show you how you can use this approach to estimate the cost of running the application on EMR Serverless.

Estimating the cost of running applications on your EMR cluster

When you run applications on Amazon EMR clusters, you’re separately charged for the following:

(a) The Amazon EC2 price of running cluster instances (the price for the underlying servers)
(b) The price for Amazon Elastic Block Store (Amazon EBS) volumes, if you choose to attach EBS volumes
(c) The Amazon EMR price for the cluster instances

The total cost of running the cluster includes all three. There are a variety of Amazon EC2 pricing options you can choose from, including On-Demand Instances, 1-year and 3-year Reserved Instances, Savings Plans, and Spot Instances. The Amazon EC2 pricing option that you choose determines (a), the Amazon EC2 price. The cost of running the application on EMR clusters is the sum of (a), (b), and (c). You can compute this cost for the lifetime of the cluster (from the time the cluster is started to the time it is terminated), or for a specific period of time while the cluster is running. We recommend the former: compute (a), (b), and (c) from the time the cluster is started to the time the cluster is terminated. If you have set up tags for your Amazon EMR cluster, you can get the detailed cost report for your EMR cluster using AWS Cost Explorer.

Estimating the cost of running the same applications using EMR Serverless

When you run the same applications using EMR Serverless, you pay for the amount of vCPU, memory, and storage resources consumed by your applications. There is no separate charge for EC2 instances or EBS volumes. You pay only for the resources that the application actually uses, not for the EC2 instances provisioned. By contrast, when you run applications on EMR clusters and an EC2 instance in the cluster is partially utilized (say, 16 GB of memory is used out of 64 GB available on the instance, or 4 vCPUs are used out of 16 vCPUs available on the instance), or when the EC2 instance is idle (for example, when the instance is initializing or waiting for an application to start), you still incur Amazon EC2, Amazon EMR, and Amazon EBS charges for the full EC2 instance for as long as the instance is active in the EMR cluster. With EMR Serverless, you only pay for the vCPU, memory, and storage resources used from the time workers start to run your Spark or Hive job until the time they stop.

To estimate the cost of running your EMR Spark or Hive application on EMR Serverless, you need to first aggregate the total compute vCore-seconds, memory MB-seconds, and storage GB-seconds consumed by each YARN application that ran on your EMR cluster, from the time the YARN container is started to the time the YARN container is terminated. You can obtain these metrics from YARN resource manager logs accessible from YARN timeline server or YARN CLI tools. You can retrieve the running time, vCore-seconds, and memory MB-seconds used by each of the YARN applications.
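The aggregation step can be sketched in Python as follows. This is a minimal illustration, not part of the original tooling: it assumes that each application report (for example, the text printed by `yarn application -status <application ID>`) contains a line of the form `Aggregate Resource Allocation : N MB-seconds, M vcore-seconds`, and the function name and sample reports are hypothetical. It covers only memory and vCore usage, because that line doesn't include storage.

```python
import re

# Assumed format of the allocation summary in a YARN application report;
# adjust the pattern if your Hadoop version prints it differently.
ALLOCATION_PATTERN = re.compile(r"(\d+)\s+MB-seconds,\s+(\d+)\s+vcore-seconds")


def sum_yarn_allocations(status_outputs):
    """Return (total_mb_seconds, total_vcore_seconds) across application reports."""
    total_mb_seconds = 0
    total_vcore_seconds = 0
    for text in status_outputs:
        match = ALLOCATION_PATTERN.search(text)
        if match is None:
            continue  # this report has no allocation line; skip it
        total_mb_seconds += int(match.group(1))
        total_vcore_seconds += int(match.group(2))
    return total_mb_seconds, total_vcore_seconds


# Hypothetical report excerpts for two YARN applications.
reports = [
    "Aggregate Resource Allocation : 4096000 MB-seconds, 2000 vcore-seconds",
    "Aggregate Resource Allocation : 1024000 MB-seconds, 500 vcore-seconds",
]
print(sum_yarn_allocations(reports))
```

Summing per application rather than per cluster keeps the estimate tied to what each job actually consumed, which is the quantity EMR Serverless bills for.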

If your cluster only runs Spark applications, there is a simpler way to estimate. Instead of obtaining the vCore-seconds, memory MB-seconds, and storage GB-seconds from YARN resource manager logs, you can obtain these metrics from Spark event logs. We have provided the tool EMR Serverless Estimator, which can parse the Spark event logs for your applications and provide the aggregated metrics for your cost estimate.

After you get the usage metrics for your application, you can compute the estimated EMR Serverless cost using EMR Serverless pricing. Simply multiply your aggregated vCore-seconds by the EMR Serverless vCPU price per second, multiply the aggregated memory MB-seconds (converted to GB-seconds) by the EMR Serverless memory price per second, and multiply the storage GB-seconds by the EMR Serverless storage price per second (only if the storage requirements exceed 20 GB per worker). By adding up these costs for vCPU, memory, and storage, you get an estimated EMR Serverless cost that you can compare with the cost of running the same applications on your EMR cluster.
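The multiplication described above can be sketched as a small Python function. This is an illustration under stated assumptions: the per-hour vCPU and memory rates are the ones that appear in this post's example and may differ by Region or change over time, the function and parameter names are hypothetical, and the storage rate is left as a parameter that you would fill in from the EMR Serverless pricing page. The inputs in the usage lines are made-up round numbers.

```python
# Assumed rates (USD), taken from the figures used in this post's example;
# check the EMR Serverless pricing page for your Region before relying on them.
VCPU_PRICE_PER_HOUR = 0.052624
MEMORY_GB_PRICE_PER_HOUR = 0.0057785

SECONDS_PER_HOUR = 3600
MB_PER_GB = 1024


def estimate_serverless_cost(vcore_seconds, memory_mb_seconds,
                             billable_storage_gb_seconds=0.0,
                             storage_gb_price_per_hour=0.0):
    """Estimate EMR Serverless cost from aggregated usage metrics.

    billable_storage_gb_seconds should cover only storage beyond the
    20 GB per worker mentioned above; pass 0 if you stay within it.
    """
    vcpu_hours = vcore_seconds / SECONDS_PER_HOUR
    memory_gb_hours = memory_mb_seconds / MB_PER_GB / SECONDS_PER_HOUR
    storage_gb_hours = billable_storage_gb_seconds / SECONDS_PER_HOUR
    return (vcpu_hours * VCPU_PRICE_PER_HOUR
            + memory_gb_hours * MEMORY_GB_PRICE_PER_HOUR
            + storage_gb_hours * storage_gb_price_per_hour)


# Hypothetical usage: 2 vCPU-hours and 10 GB-hours of memory, no extra storage.
cost = estimate_serverless_cost(vcore_seconds=7200, memory_mb_seconds=36_864_000)
print(f"Estimated EMR Serverless cost: ${cost:.4f}")
```

Because vCore-seconds and MB-seconds are already aggregated over the application's runtime, the function converts them to hours once and doesn't multiply by the runtime again.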

In this approach, we assume that the performance of the application is equivalent. In other words, the number, size (vCPU, memory), and runtime duration of the YARN containers on the EMR cluster are the same as the number, size, and runtime duration of the workers needed to run the application on EMR Serverless. We make this assumption because the EMR runtime for an EMR release is the same regardless of whether the application is run on an EMR cluster or on EMR Serverless.

Example

Let’s do a sample cost comparison of Amazon EMR on EC2 and EMR Serverless using a single cluster.

We ran a Spark application on an EMR cluster with five nodes (one primary, two core, and two task) and gathered YARN metrics using the YARN CLI. The following image shows our aggregate resource allocation.

[Image: aggregate resource allocation]

We computed the Amazon EMR on EC2 costs as follows:

Cluster instances
- Primary: m5.2xlarge:1
- Core: r5.2xlarge:2
- Task: r5.2xlarge:2
Cluster runtime = 8 min
Instance on-demand cost
- m5.2xlarge (8 vCPU, 32 GiB memory)
  - Amazon EC2: $0.384/hr
  - Amazon EMR incremental: $0.096/hr
- r5.2xlarge (8 vCPU, 64 GiB memory)
  - Amazon EC2: $0.504/hr
  - Amazon EMR incremental: $0.126/hr

The following is the EMR on EC2 cost calculation:

Amazon EMR cost = ((1 primary node x $0.096/hr) + (2 core nodes x $0.126/hr) + (2 task nodes x $0.126/hr)) = $0.60/hr
Amazon EC2 cost = ((1 primary node x $0.384/hr) + (2 core nodes x $0.504/hr) + (2 task nodes x $0.504/hr)) = $2.40/hr
Amazon EMR on EC2 cluster cost = ($0.60/hr + $2.40/hr) x 8/60 hr (runtime in hours) = $3.00/hr x 8/60 hr = $0.40

The total Amazon EMR on Amazon EC2 cost is $0.40.
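The same arithmetic can be expressed as a short Python sketch. It uses the On-Demand rates listed above and the 8/60 hr runtime factor from the calculation; the function name and the tuple layout are hypothetical, and EBS charges are left out because the example doesn't list any.

```python
def emr_on_ec2_cost(node_groups, runtime_minutes):
    """Return the cluster cost in USD for the given runtime.

    node_groups: iterable of (node_count, ec2_price_per_hour, emr_price_per_hour).
    EBS volume charges are not included in this sketch.
    """
    hourly_cost = sum(count * (ec2_price + emr_price)
                      for count, ec2_price, emr_price in node_groups)
    return hourly_cost * runtime_minutes / 60


# Node groups from the example: one m5.2xlarge primary node,
# plus two core and two task r5.2xlarge nodes.
node_groups = [
    (1, 0.384, 0.096),  # primary: m5.2xlarge
    (2, 0.504, 0.126),  # core: r5.2xlarge
    (2, 0.504, 0.126),  # task: r5.2xlarge
]
cluster_cost = emr_on_ec2_cost(node_groups, runtime_minutes=8)
print(f"EMR on EC2 cost: ${cluster_cost:.2f}")
```

Note that the full instance-hours are billed for every node, regardless of how much of each instance the application used.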

To calculate EMR Serverless cost, aggregate the vCore-seconds and memory MB-seconds for the same application you ran previously on the EMR cluster. Then multiply those numbers with the EMR Serverless vCPU and memory price. Our calculation results are as follows:

Total_vcore_seconds = 5737
Total_Memory_mb_seconds = 120156631
Convert to vCPU-hours and memory GB-hours:
- Aggregated vCPU-hours: 5737/(60*60) = 1.59
- Aggregated memory GB-hours: 120156631/(60*60*1024) = 32.59
Total vCPU cost = 1.59 vCPU-hours * $0.052624 per vCPU-hour = $0.084
Total memory cost = 32.59 GB-hours * $0.0057785 per GB-hour = $0.188

In this example, the total EMR Serverless cost is $0.272, a 32% reduction.

Conclusion

Amazon EMR Serverless is a recently launched serverless option in Amazon EMR that makes it easy to run open-source frameworks such as Spark and Hive without configuring, managing, and scaling clusters. Customers that already use EMR clusters want to understand how they can estimate the cost of running their EMR applications using EMR Serverless. We have presented an approach that you can use to conduct a cost analysis based on analyzing application metrics from your EMR clusters.

We hope you give this a try, and share your feedback with us!

About the authors

Radhika Ravirala is the Principal Product Manager at AWS.

Matthew Liem is the Senior Solution Architecture Manager at AWS.