
Analyze real-time streaming data in Amazon MSK with Amazon Athena

Post Syndicated from Scott Rigney original https://aws.amazon.com/blogs/big-data/analyze-real-time-streaming-data-in-amazon-msk-with-amazon-athena/

Recent advances in ease of use and scalability have made streaming data easier to generate and use for real-time decision-making. Coupled with market forces that have forced businesses to react more quickly to industry changes, more and more organizations today are turning to streaming data to fuel innovation and agility.

Amazon Managed Streaming for Apache Kafka (MSK) is a fully managed service that makes it easy to build and run applications that use Apache Kafka, an open-source distributed event streaming platform designed for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. With Amazon MSK, you can capture real-time data from a wide range of sources such as database change events or web application user clickstreams. Since Kafka is highly optimized for writing and reading fresh data, it’s a great fit for operational reporting. However, gaining insight from this data often requires a specialized stream processing layer to write streaming records to a storage medium like Amazon S3, where it can be accessed by analysts, data scientists, and data engineers for historical analysis and visualization using tools like Amazon QuickSight.

When you want to analyze data where it lives and without developing separate pipelines and jobs, a popular choice is Amazon Athena. With Athena, you can use your existing SQL knowledge to extract insights from a wide range of data sources without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure. Athena supports over 25 connectors to popular data sources including Amazon DynamoDB and Amazon Redshift which give data analysts, data engineers, and data scientists the flexibility to run SQL queries on data stored in databases running on-premises or in the cloud alongside data stored in Amazon S3. With Athena, there’s no data movement and you pay only for the queries you run.

What’s new

Starting today, you can use Athena to query streaming data in Amazon MSK and self-managed Apache Kafka. This enables you to run analytical queries on real-time data held in Kafka topics and join that data with other Kafka topics as well as other data in your Amazon S3 data lake, all without the need for separate processes to first store the data on Amazon S3.
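For example, a single Athena query can join a Kafka topic exposed through the MSK connector with a table in your S3-backed AWS Glue Data Catalog. The sketch below assumes a data source named msk, the default awsdatacatalog catalog, and hypothetical database, table, and column names:

SELECT o.customer_id, o.order_total, c.customer_segment
FROM msk.customer_schema.orders AS o
JOIN awsdatacatalog.sales_history.customers AS c
  ON o.customer_id = c.customer_id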

Solution overview

In this post, we show you how to get started with real-time SQL analytics using Athena and its connector for MSK. The process involves:

  1. Registering the schema of your streaming data with AWS Glue Schema Registry. Schema Registry is a feature of AWS Glue that allows you to validate and reliably evolve streaming data against JSON schemas. It can also serialize data into a compressed format, which helps you save on data transfer and storage costs.
  2. Creating a new instance of the Amazon Athena MSK Connector. Athena connectors are pre-built applications that run as serverless AWS Lambda applications, so there’s no need for standalone data export processes.
  3. Using the Athena console to run interactive SQL queries on your Kafka topics.

Get started with Athena’s connector for Amazon MSK

In this section, we’ll cover the steps necessary to set up your MSK cluster to work with Athena to run SQL queries on your Kafka topics.

Prerequisites

This post assumes you have a serverless or provisioned MSK cluster set up to receive streaming messages from a producing application. For information, see Setting up Amazon MSK and Getting started using Amazon MSK in the Amazon Managed Streaming for Apache Kafka Developer Guide.

You’ll also need to set up a VPC and a security group before you use the Athena connector for MSK. For more information, see Creating a VPC for a data source connector. Note that with MSK Serverless, VPCs and security groups are created automatically, so you can get started quickly.

Define the schema of your Kafka topics with AWS Glue Schema Registry

To run SQL queries on your Kafka topics, you’ll first need to define the schema of your topics as Athena uses this metadata for query planning. AWS Glue makes it easy to do this with its Schema Registry feature for streaming data sources.

Schema Registry allows you to centrally discover, control, and evolve streaming data schemas for use in analytics applications such as Athena. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications using convenient integrations with Apache Kafka. To learn more, see AWS Glue Schema Registry and Getting started with Schema Registry.

If configured to do so, the producer of data can auto-register its schema and changes to it with AWS Glue. This is especially useful in use cases where the contents of the data are likely to change over time. However, you can also specify the schema manually; it will resemble the following JSON structure.

{
  "tableName": "orders",
  "schemaName": "customer_schema",
  "topicName": "orders",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "customer_id",
        "mapping": "customer_id",
        "type": "VARCHAR"
      },
      {
        "name": "item_id",
        "mapping": "item_id",
        "type": "INTEGER"
      }
    ]
  }
}

When setting up your Schema Registry, be sure to give it an easy-to-remember name, such as customer_schema, because you’ll reference it within SQL queries as you’ll see later on. For additional information on schema set up, see Schema examples for the AWS Glue Schema Registry.

Configure the Athena connector for MSK

With your schema registered with Glue, the next step is to set up the Athena connector for MSK. We recommend using the Athena console for this step. For more background on the steps involved, see Deploying a connector and connecting to a data source.

In Athena, federated data source connectors are applications that run on AWS Lambda and handle communication between your target data source and Athena. When a query runs on a federated source, Athena calls the Lambda function and tasks it with running the parts of your query that are specific to that source. To learn more about the query execution workflow, see Using Amazon Athena Federated Query in the Amazon Athena User Guide.

Start by accessing the Athena console and selecting Data sources on the left navigation, then choose Create data source:

Next, search for and select Amazon MSK from the available connectors and select Next.

In Data source details, give your connector a name, like msk, that’s easy to remember and reference in your future SQL queries. Under the Connection details section, select Create Lambda function. This will bring you to the AWS Lambda console, where you’ll provide additional configuration properties.

In the Lambda application configuration screen (not shown), you’ll provide the Application settings for your connector. To do this, you’ll need a few properties from your MSK cluster and schema registered in Glue.

On another browser tab, use the MSK console to navigate to your MSK cluster and then select the Properties tab. Here you’ll see the VPC subnets and security group IDs from your MSK cluster which you’ll provide in the SubnetIds and SecurityGroupIds fields in the Athena connector’s Application settings form. You can find the value for KafkaEndpoint by clicking View client information.

In the AWS Glue console, navigate to your Schema Registry to find the GlueRegistryArn for the schema you wish to use with this connector.

After providing these and the other required values, click Deploy.

Return to the Athena console and enter the name of the Lambda function you just created in the Connection details box, then click Create data source.

Run queries on streaming data using Athena

With your MSK data connector set up, you can now run SQL queries on the data. Let’s explore a few use cases in more detail.

Use case: interactive analysis

If you want to run queries that aggregate, group, or filter your MSK data, you can run interactive queries using Athena. These queries will run against the current state of your Kafka topics at the time the query was submitted.

Before running any queries, it may be helpful to validate the schema and data types available within your Kafka topics. To do this, run the DESCRIBE command on your Kafka topic, which appears in Athena as a table, as shown below. In this query, the orders table corresponds to the topic you specified in the Schema Registry.

DESCRIBE msk.customer_schema.orders

Now that you know the contents of your topic, you can begin to develop analytical queries. A sample query for a hypothetical Kafka topic containing e-commerce order data is shown below:

SELECT customer_id, SUM(order_total)
FROM msk.customer_schema.orders
GROUP BY customer_id

Because the orders table (and underlying Kafka topic) can contain an unbounded stream of data, the query above is likely to return a different value for SUM(order_total) with each execution of the query.

If you have data in one topic that you need to join with another topic, you can do that too:

SELECT t1.order_id, t2.item_id
FROM msk.customer_schema.orders as t1
JOIN msk.customer_schema.items as t2
ON t1.id = t2.id

Use case: ingesting streaming data to a table on Amazon S3

Federated queries run against the underlying data source which ensures interactive queries, like the ones above, are evaluated against the current state of your data. One consideration is that repeatedly running federated queries can put additional load on the underlying source. If you plan to perform multiple queries on the same source data, you can use Athena’s CREATE TABLE AS SELECT, also known as CTAS, to store the results of a SELECT query in a table on Amazon S3. You can then run queries on your newly created table without going back to the underlying source each time.

CREATE TABLE my_kafka_data
WITH (format = 'Parquet', 
      write_compression = 'SNAPPY')
AS
SELECT order_id, item_id, timestamp
FROM msk.customer_schema.orders

If you plan to do additional downstream analysis on this data, for example within dashboards on Amazon QuickSight, you can enhance the solution above by periodically adding new data to your table. To learn more, see Using CTAS and INSERT INTO for ETL and data analysis. Another benefit of this approach is that you can secure these tables with row-, column-, and table-level data governance policies powered by AWS Lake Formation to ensure only authorized users can access your table.
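For illustration, a periodic ingestion statement that builds on the my_kafka_data table created above might look like the following sketch. The filter on the timestamp column is a hypothetical way to limit each run to records newer than those already stored; adapt it to your own data and deduplication needs.

INSERT INTO my_kafka_data
SELECT order_id, item_id, timestamp
FROM msk.customer_schema.orders
WHERE timestamp > (SELECT MAX(timestamp) FROM my_kafka_data)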

What else can you do?

With Athena, you can use your existing SQL knowledge to run federated queries that generate insights from a wide range of data sources without learning a new language, developing scripts to extract (and duplicate) data, or managing infrastructure. Athena provides additional integrations with other AWS services and popular analytics tools and SQL IDEs that allow you to do much more with your data. For example, you can:

  • Visualize the data in business intelligence applications like Amazon QuickSight
  • Design event-driven data processing workflows with Athena’s integration with AWS Step Functions
  • Unify multiple data sources to produce rich input features for machine learning in Amazon SageMaker

Conclusion

In this post, we learned about the newly released Athena connector for Amazon MSK. With it, you can run interactive queries on data held in Kafka topics running in MSK or self-managed Apache Kafka. This helps you bring real-time insights to dashboards or enable point-in-time analysis of streaming data to answer time-sensitive business questions. We also covered how to periodically ingest new streaming data into Amazon S3 without the need for a separate sink process. This simplifies recurring analysis of your data without incurring round-trip queries to your underlying Kafka clusters and makes it possible to secure the data with access rules powered by Lake Formation.

We encourage you to evaluate Athena and federated queries on your next analytics project. For help getting started, we recommend the following resources:


About the authors

Scott Rigney is a Senior Technical Product Manager with Amazon Web Services (AWS) and works with the Amazon Athena team based out of Arlington, Virginia. He is passionate about building analytics products that enable enterprises to make data-driven decisions.

Kiran Matty is a Principal Product Manager with Amazon Web Services (AWS) and works with the Amazon Managed Streaming for Apache Kafka (Amazon MSK) team based out of Palo Alto, California. He is passionate about building performant streaming and analytical services that help enterprises realize their critical use cases.

Migrate Google BigQuery to Amazon Redshift using AWS Schema Conversion tool (SCT)

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/migrate-google-bigquery-to-amazon-redshift-using-aws-schema-conversion-tool-sct/

Amazon Redshift is a fast, fully-managed, petabyte scale data warehouse that provides the flexibility to use provisioned or serverless compute for your analytical workloads. Using Amazon Redshift Serverless and Query Editor v2, you can load and query large datasets in just a few clicks and pay only for what you use. The decoupled compute and storage architecture of Amazon Redshift enables you to build highly scalable, resilient, and cost-effective workloads. Many customers migrate their data warehousing workloads to Amazon Redshift and benefit from the rich capabilities it offers. The following are just some of the notable capabilities:

  • Amazon Redshift seamlessly integrates with broader analytics services on AWS. This enables you to choose the right tool for the right job. Modern analytics is much wider than SQL-based data warehousing. Amazon Redshift lets you build lake house architectures and then perform any kind of analytics, such as interactive analytics, operational analytics, big data processing, visual data preparation, predictive analytics, machine learning (ML), and more.
  • You don’t need to worry about workloads, such as ETL, dashboards, ad-hoc queries, and so on, interfering with each other. You can isolate workloads using data sharing, while using the same underlying datasets.
  • When users run many queries at peak times, compute seamlessly scales within seconds to provide consistent performance at high concurrency. You get one hour of free concurrency scaling capacity for 24 hours of usage. This free credit meets the concurrency demand of 97% of the Amazon Redshift customer base.
  • Amazon Redshift is easy-to-use with self-tuning and self-optimizing capabilities. You can get faster insights without spending valuable time managing your data warehouse.
  • Fault Tolerance is inbuilt. All of the data written to Amazon Redshift is automatically and continuously replicated to Amazon Simple Storage Service (Amazon S3). Any hardware failures are automatically replaced.
  • Amazon Redshift is simple to interact with. You can access data with traditional, cloud-native, containerized, and serverless web services-based or event-driven applications and so on.
  • Redshift ML makes it easy for data scientists to create, train, and deploy ML models using familiar SQL. They can also run predictions using SQL.
  • Amazon Redshift provides comprehensive data security at no extra cost. You can set up end-to-end data encryption, configure firewall rules, define granular row and column level security controls on sensitive data, and so on.
  • Amazon Redshift integrates seamlessly with other AWS services and third-party tools. You can move, transform, load, and query large datasets quickly and reliably.

In this post, we provide a walkthrough for migrating a data warehouse from Google BigQuery to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and AWS SCT data extraction agents. AWS SCT is a service that makes heterogeneous database migrations predictable by automatically converting the majority of the database code and storage objects to a format that is compatible with the target database. Any objects that can’t be automatically converted are clearly marked so that they can be manually converted to complete the migration. Furthermore, AWS SCT can scan your application code for embedded SQL statements and convert them.

Solution overview

AWS SCT uses a service account to connect to your BigQuery project. First, we create an Amazon Redshift database into which BigQuery data is migrated. Next, we create an S3 bucket. Then, we use AWS SCT to convert BigQuery schemas and apply them to Amazon Redshift. Finally, to migrate data, we use AWS SCT data extraction agents, which extract data from BigQuery, upload it into the S3 bucket, and then copy to Amazon Redshift.

Prerequisites

Before starting this walkthrough, you must have the following prerequisites:

  1. A workstation with AWS SCT, Amazon Corretto 11, and Amazon Redshift drivers.
    1. You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance or your local desktop as a workstation. In this walkthrough, we’re using an Amazon EC2 Windows instance. To create it, use this guide.
    2. To download and install AWS SCT on the EC2 instance that you previously created, use this guide.
    3. Download the Amazon Redshift JDBC driver from this location.
    4. Download and install Amazon Corretto 11.
  2. A GCP service account that AWS SCT can use to connect to your source BigQuery project.
    1. Grant BigQuery Admin and Storage Admin roles to the service account.
    2. Copy the Service account key file, which was created in the Google cloud management console, to the EC2 instance that has AWS SCT.
    3. Create a Cloud Storage bucket in GCP to store your source data during migration.

This walkthrough covers the following steps:

  • Create an Amazon Redshift Serverless Workgroup and Namespace
  • Create the AWS S3 Bucket and Folder
  • Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT
    • Connect to the Google BigQuery source
    • Connect to the Amazon Redshift target
    • Convert the BigQuery schema to the Amazon Redshift format
    • Analyze the assessment report and address the action items
    • Apply converted schema to target Amazon Redshift
  • Migrate data using AWS SCT data extraction agents
    • Generating Trust and Key Stores (Optional)
    • Install and start data extraction agent
    • Register data extraction agent
    • Add virtual partitions for large tables (Optional)
    • Create a local migration task
    • Start the Local Data Migration Task
  • View Data in Amazon Redshift

Create an Amazon Redshift Serverless Workgroup and Namespace

In this step, we create an Amazon Redshift Serverless workgroup and namespace. A workgroup is a collection of compute resources, and a namespace is a collection of database objects and users. To isolate workloads and manage different resources in Amazon Redshift Serverless, you can create namespaces and workgroups and manage storage and compute resources separately.

Follow these steps to create Amazon Redshift Serverless workgroup and namespace:

  • Navigate to the Amazon Redshift console.
  • In the upper right, choose the AWS Region that you want to use.
  • Expand the Amazon Redshift pane on the left and choose Redshift Serverless.
  • Choose Create Workgroup.
  • For Workgroup name, enter a name that describes the compute resources.
  • Verify that the VPC is the same VPC as the EC2 instance that runs AWS SCT.
  • Choose Next.

  • For Namespace name, enter a name that describes your dataset.
  • In the Database name and password section, select the Customize admin user credentials checkbox.
    • For Admin user name, enter a username of your choice, for example awsuser.
    • For Admin user password, enter a password of your choice, for example MyRedShiftPW2022.
  • Choose Next. Note that data in Amazon Redshift Serverless namespace is encrypted by default.
  • In the Review and Create page, choose Create.
  • Create an AWS Identity and Access Management (IAM) role and set it as the default on your namespace, as described in the following. Note that there can only be one default IAM role.
    • Navigate to the Amazon Redshift Serverless Dashboard.
    • Under Namespaces / Workgroups, choose the namespace that you just created.
    • Navigate to Security and encryption.
    • Under Permissions, choose Manage IAM roles.
    • Choose the Manage IAM roles drop-down, and then choose Create IAM role.
    • Under Specify an Amazon S3 bucket for the IAM role to access, choose one of the following methods:
      • Choose No additional Amazon S3 bucket to allow the created IAM role to access only the S3 buckets with a name starting with redshift.
      • Choose Any Amazon S3 bucket to allow the created IAM role to access all of the S3 buckets.
      • Choose Specific Amazon S3 buckets to specify one or more S3 buckets for the created IAM role to access. Then choose one or more S3 buckets from the table.
    • Choose Create IAM role as default. Amazon Redshift automatically creates and sets the IAM role as default.
  • Capture the Endpoint for the Amazon Redshift Serverless workgroup that you just created.

Create the S3 bucket and folder

During the data migration process, AWS SCT uses Amazon S3 as a staging area for the extracted data. Follow these steps to create the S3 bucket:

  • Navigate to the Amazon S3 console
  • Choose Create bucket. The Create bucket wizard opens.
  • For Bucket name, enter a unique DNS-compliant name for your bucket (e.g., uniquename-bq-rs). See rules for bucket naming when choosing a name.
  • For AWS Region, choose the region in which you created the Amazon Redshift Serverless workgroup.
  • Select Create Bucket.
  • In the Amazon S3 console, navigate to the S3 bucket that you just created (e.g., uniquename-bq-rs).
  • Choose “Create folder” to create a new folder.
  • For Folder name, enter incoming and choose Create Folder.

Convert and apply BigQuery Schema to Amazon Redshift using AWS SCT

To convert BigQuery schema to the Amazon Redshift format, we use AWS SCT. Start by logging in to the EC2 instance that we created previously, and then launch AWS SCT.

Follow these steps using AWS SCT:

Connect to the BigQuery Source

  • From the File Menu choose Create New Project.
  • Choose a location to store your project files and data.
  • Provide a meaningful but memorable name for your project, such as BigQuery to Amazon Redshift.
  • To connect to the BigQuery source data warehouse, choose Add source from the main menu.
  • Choose BigQuery and choose Next. The Add source dialog box appears.
  • For Connection name, enter a name to describe BigQuery connection. AWS SCT displays this name in the tree in the left panel.
  • For Key path, provide the path of the service account key file that was previously created in the Google cloud management console.
  • Choose Test Connection to verify that AWS SCT can connect to your source BigQuery project.
  • Once the connection is successfully validated, choose Connect.

Connect to the Amazon Redshift Target

Follow these steps to connect to Amazon Redshift:

  • In AWS SCT, choose Add Target from the main menu.
  • Choose Amazon Redshift, then choose Next. The Add Target dialog box appears.
  • For Connection name, enter a name to describe the Amazon Redshift connection. AWS SCT displays this name in the tree in the right panel.
  • For Server name, enter the Amazon Redshift Serverless workgroup endpoint captured previously.
  • For Server port, enter 5439.
  • For Database, enter dev.
  • For User name, enter the username chosen when creating the Amazon Redshift Serverless workgroup.
  • For Password, enter the password chosen when creating Amazon Redshift Serverless workgroup.
  • Uncheck the “Use AWS Glue” box.
  • Choose Test Connection to verify that AWS SCT can connect to your target Amazon Redshift workgroup.
  • Choose Connect to connect to the Amazon Redshift target.

Note that alternatively you can use connection values that are stored in AWS Secrets Manager. 

Convert the BigQuery schema to the Amazon Redshift format

After the source and target connections are successfully made, you see the source BigQuery object tree on the left pane and target Amazon Redshift object tree on the right pane.

Follow these steps to convert BigQuery schema to the Amazon Redshift format:

  • On the left pane, right-click on the schema that you want to convert.
  • Choose Convert Schema.
  • A dialog box appears with the question The objects might already exist in the target database. Replace? Choose Yes.

Once the conversion is complete, you see a new schema created on the Amazon Redshift pane (right pane) with the same name as your BigQuery schema.

The sample schema that we used has 16 tables, 3 views, and 3 procedures. You can see these objects in the Amazon Redshift format in the right pane. AWS SCT converts all of the BigQuery code and data objects to the Amazon Redshift format. Furthermore, you can use AWS SCT to convert external SQL scripts, application code, or additional files with embedded SQL.

Analyze the assessment report and address the action items

AWS SCT creates an assessment report to assess the migration complexity. AWS SCT can convert the majority of code and database objects. However, some of the objects may require manual conversion. AWS SCT highlights these objects in blue in the conversion statistics diagram and creates action items with a complexity attached to them.

To view the assessment report, switch from the Main view to the Assessment Report view as follows:

The Summary tab shows objects that were converted automatically, and objects that weren’t converted automatically. Green represents automatically converted or with simple action items. Blue represents medium and complex action items that require manual intervention.

The Action Items tab shows the recommended actions for each conversion issue. If you select an action item from the list, AWS SCT highlights the object to which the action item applies.

The report also contains recommendations for how to manually convert the schema item. For example, after the assessment runs, detailed reports for the database/schema show you the effort required to design and implement the recommendations for converting action items. For more information about deciding how to handle manual conversions, see Handling manual conversions in AWS SCT. AWS SCT performs some actions automatically while converting the schema to Amazon Redshift; objects with these actions are marked with a red warning sign.

You can evaluate and inspect the individual object DDL by selecting it from the right pane, and you can also edit it as needed. In the following example, AWS SCT modifies the RECORD and JSON datatype columns in BigQuery table ncaaf_referee_data to the SUPER datatype in Amazon Redshift. The partition key in the ncaaf_referee_data table is converted to the distribution key and sort key in Amazon Redshift.
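For orientation only, a converted table of the shape described above might resemble the following sketch. The column names are hypothetical; AWS SCT generates the actual DDL, which you can review and edit in the right pane.

CREATE TABLE ncaaf_referee_data (
    game_date DATE,          -- former BigQuery partition column (hypothetical name)
    referee_info SUPER,      -- former RECORD/JSON column, mapped to SUPER
    referee_name VARCHAR(256)
)
DISTKEY (game_date)
SORTKEY (game_date);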

Apply converted schema to target Amazon Redshift

To apply the converted schema to Amazon Redshift, select the converted schema in the right pane, right-click, and then choose Apply to database.

Migrate data from BigQuery to Amazon Redshift using AWS SCT data extraction agents

AWS SCT extraction agents extract data from your source database and migrate it to the AWS Cloud. In this walkthrough, we show how to configure AWS SCT extraction agents to extract data from BigQuery and migrate to Amazon Redshift.

First, install the AWS SCT extraction agent. In this walkthrough, we install it on the same Windows instance that has AWS SCT installed; for better performance, we recommend using a separate Linux instance for extraction agents if possible. For big datasets, you can use several data extraction agents to increase the data migration speed.

Generating trust and key stores (optional)

You can use Secure Socket Layer (SSL) encrypted communication with AWS SCT data extractors. When you use SSL, all of the data passed between the applications remains private and integral. To use SSL communication, you must generate trust and key stores using AWS SCT. You can skip this step if you don’t want to use SSL. We recommend using SSL for production workloads.

Follow these steps to generate trust and key stores:

  1. In AWS SCT, navigate to Settings → Global Settings → Security.
  2. Choose Generate trust and key store.
  3. Enter the name and password for trust and key stores and choose a location where you would like to store them.
  4. Choose Generate.

Install and configure Data Extraction Agent

In the installation package for AWS SCT, you find a sub-folder agent (\aws-schema-conversion-tool-1.0.latest.zip\agents). Locate and install the executable file with a name like aws-schema-conversion-tool-extractor-xxxxxxxx.msi.

In the installation process, follow these steps to configure AWS SCT Data Extractor:

  1. For Listening port, enter the port number on which the agent listens. It is 8192 by default.
  2. For Add a source vendor, enter no, as you don’t need drivers to connect to BigQuery.
  3. For Add the Amazon Redshift driver, enter YES.
  4. For Enter Redshift JDBC driver file or files, enter the location where you downloaded Amazon Redshift JDBC drivers.
  5. For Working folder, enter the path where the AWS SCT data extraction agent will store the extracted data. The working folder can be on a different computer from the agent, and a single working folder can be shared by multiple agents on different computers.
  6. For Enable SSL communication, enter yes. Choose No here if you don’t want to use SSL.
  7. For Key store, enter the storage location chosen when creating the trust and key store.
  8. For Key store password, enter the password for the key store.
  9. For Enable client SSL authentication, enter yes.
  10. For Trust store, enter the storage location chosen when creating the trust and key store.
  11. For Trust store password, enter the password for the trust store.
*************************************************
*                                               *
*     AWS SCT Data Extractor Configuration      *
*              Version 2.0.1.666                *
*                                               *
*************************************************
User name: Administrator
User home: C:\Windows\system32\config\systemprofile
*************************************************
Listening port [8192]: 8192
Add a source vendor [YES/no]: no
No one source data warehouse vendors configured. AWS SCT Data Extractor cannot process data extraction requests.
Add the Amazon Redshift driver [YES/no]: YES
Enter Redshift JDBC driver file or files: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\redshift-jdbc42-2.1.0.9.jar
Working folder [C:\Windows\system32\config\systemprofile]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject
Enable SSL communication [YES/no]: YES
Setting up a secure environment at "C:\Windows\system32\config\systemprofile". This process will take a few seconds...
Key store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftKeyStore
Key store password:
Re-enter the key store password:
Enable client SSL authentication [YES/no]: YES
Trust store [ ]: C:\Users\Administrator\Desktop\BQToRedshiftSCTProject\TrustAndKeyStores\BQToRedshiftTrustStore
Trust store password:
Re-enter the trust store password:

Starting Data Extraction Agent(s)

Use the following procedure to start extraction agents. Repeat this procedure on each computer that has an extraction agent installed.

Extraction agents act as listeners. When you start an agent with this procedure, the agent starts listening for instructions. You send the agents instructions to extract data from your data warehouse in a later section.

To start the extraction agent, navigate to the AWS SCT Data Extractor Agent directory. For example, in Microsoft Windows, double-click C:\Program Files\AWS SCT Data Extractor Agent\StartAgent.bat.

  • On the computer that has the extraction agent installed, from a command prompt or terminal window, run the command for your operating system.
  • To check the status of the agent, run the same command but replace start with status.
  • To stop an agent, run the same command but replace start with stop.
  • To restart an agent, run the RestartAgent.bat file.

Register the Data Extraction Agent

Follow these steps to register the Data Extraction Agent:

  1. In AWS SCT, change the view to Data Migration view (other) and choose + Register.
  2. In the connection tab:
    1. For Description, enter a name to identify the Data Extraction Agent.
    2. For Host name, if you installed the Data Extraction Agent on the same workstation as AWS SCT, enter 0.0.0.0 to indicate local host. Otherwise, enter the host name of the machine on which the AWS SCT Data Extraction Agent is installed. It’s recommended to install the Data Extraction Agents on Linux for better performance.
    3. For Port, enter the number entered for the Listening Port when installing the AWS SCT Data Extraction Agent.
    4. Select the checkbox to use SSL (if using SSL) to encrypt the AWS SCT connection to the Data Extraction Agent.
  3. If you’re using SSL, then in the SSL Tab:
    1. For Trust store, choose the trust store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
    2. For Key Store, choose the key store name created when generating Trust and Key Stores (optionally, you can skip this if SSL connectivity isn’t needed).
  4. Choose Test Connection.
  5. Once the connection is validated successfully, choose Register.

Add virtual partitions for large tables (optional)

You can use AWS SCT to create virtual partitions to optimize migration performance. When virtual partitions are created, AWS SCT extracts the data in parallel for partitions. We recommend creating virtual partitions for large tables.

Follow these steps to create virtual partitions:

  1. Deselect all objects on the source database view in AWS SCT.
  2. Choose the table for which you would like to add virtual partitioning.
  3. Right-click on the table, and choose Add Virtual Partitioning.
  4. You can use List, Range, or Auto Split partitions. To learn more about virtual partitioning, refer to Use virtual partitioning in AWS SCT. In this example, we use Auto Split partitioning, which generates range partitions automatically: you specify the start value, the end value, and the partition size, and AWS SCT determines the partitions. For a demonstration, on the Lineorder table:
    1. For Start Value, enter 1000000.
    2. For End Value, enter 3000000.
    3. For Interval, enter 1000000 to indicate partition size.
    4. Choose Ok.

You can see the partitions automatically generated under the Virtual Partitions tab. In this example, AWS SCT automatically created the following five partitions for the field:

    1. <1000000
    2. >=1000000 and <=2000000
    3. >2000000 and <=3000000
    4. >3000000
    5. IS NULL

Create a local migration task

To migrate data from BigQuery to Amazon Redshift, create, run, and monitor the local migration task from AWS SCT. This step uses the data extraction agent to migrate data by creating a task.

Follow these steps to create a local migration task:

  1. In AWS SCT, under the schema name in the left pane, right-click on Standard tables.
  2. Choose Create Local task.
  3. There are three migration modes from which you can choose:
    1. Extract source data and store it on a local PC or virtual machine (VM) where the agent runs.
    2. Extract data and upload it to an S3 bucket.
    3. Extract data, upload it to an S3 bucket, and then copy it to Amazon Redshift. For this walkthrough, choose Extract upload and copy.
  4. In the Advanced tab, for Google CS bucket folder enter the Google Cloud Storage bucket/folder that you created earlier in the GCP Management Console. AWS SCT stores the extracted data in this location.
  5. In the Amazon S3 Settings tab, for Amazon S3 bucket folder, provide the bucket and folder names of the S3 bucket that you created earlier. The AWS SCT data extraction agent uploads the data into the S3 bucket/folder before copying to Amazon Redshift.
  6. Choose Test Task.
  7. Once the task is successfully validated, choose Create.

Start the Local Data Migration Task

To start the task, choose the Start button in the Tasks tab.

  • First, the Data Extraction Agent extracts data from BigQuery into the GCP storage bucket.
  • Then, the agent uploads data to Amazon S3 and launches a COPY command to move the data to Amazon Redshift (a sketch of such a command appears after this list).
  • At this point, AWS SCT has successfully migrated data from the source BigQuery table to the Amazon Redshift table.
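For reference, the COPY command that moves data from the Amazon S3 staging folder into Amazon Redshift generally looks like the following sketch. AWS SCT generates and runs the actual command; the schema name and file format shown here are assumptions for illustration.

COPY myschema.lineorder
FROM 's3://uniquename-bq-rs/incoming/lineorder/'
IAM_ROLE default
FORMAT AS CSV
GZIP;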

View data in Amazon Redshift

After the data migration task executes successfully, you can connect to Amazon Redshift and validate the data.

Follow these steps to validate the data in Amazon Redshift:

  1. Navigate to the Amazon Redshift QueryEditor V2.
  2. Double-click on the Amazon Redshift Serverless workgroup name that you created.
  3. Choose the Federated User option under Authentication.
  4. Choose Create Connection.
  5. Create a new editor by choosing the + icon.
  6. In the editor, write a query to select from the schema name and table name/view name you would like to verify. Explore the data, run ad-hoc queries, and make visualizations and charts and views.

The following is a side-by-side comparison between source BigQuery and target Amazon Redshift for the sports dataset that we used in this walkthrough.

Clean up any AWS resources that you created for this exercise

Follow these steps to terminate the EC2 instance:

  1. Navigate to the Amazon EC2 console.
  2. In the navigation pane, choose Instances.
  3. Select the check-box for the EC2 instance that you created.
  4. Choose Instance state, and then Terminate instance.
  5. Choose Terminate when prompted for confirmation.

Follow these steps to delete the Amazon Redshift Serverless workgroup and namespace:

  1. Navigate to Amazon Redshift Serverless Dashboard.
  2. Under Namespaces / Workgroups, choose the workgroup that you created.
  3. Under Actions, choose Delete workgroup.
  4. Select the checkbox Delete the associated namespace.
  5. Uncheck Create final snapshot.
  6. Enter delete in the delete confirmation text box and choose Delete.

Follow these steps to delete the S3 bucket

  1. Navigate to Amazon S3 console.
  2. Choose the bucket that you created.
  3. Choose Delete.
  4. To confirm deletion, enter the name of the bucket in the text input field.
  5. Choose Delete bucket.

Conclusion

Migrating a data warehouse can be a challenging, complex, and yet rewarding project. AWS SCT reduces the complexity of data warehouse migrations. Following this walkthrough, you can understand how a data migration task extracts, downloads, and then migrates data from BigQuery to Amazon Redshift. The solution that we presented in this post performs a one-time migration of database objects and data. Data changes made in BigQuery when the migration is in progress won’t be reflected in Amazon Redshift. When data migration is in progress, put your ETL jobs to BigQuery on hold or replay the ETLs by pointing to Amazon Redshift after the migration. Consider using the best practices for AWS SCT.

AWS SCT has some limitations when using BigQuery as a source. For example, AWS SCT can’t convert sub queries in analytic functions, geography functions, statistical aggregate functions, and so on. Find the full list of limitations in the AWS SCT user guide. We plan to address these limitations in future releases. Despite these limitations, you can use AWS SCT to automatically convert most of your BigQuery code and storage objects.

Download and install AWS SCT, sign in to the AWS Console, check out Amazon Redshift Serverless, and start migrating!


About the authors

Cedrick Hoodye is a Solutions Architect at AWS with a focus on database migrations using the AWS Database Migration Service (DMS) and the AWS Schema Conversion Tool (SCT). He works on database migration-related challenges and works closely with EdTech, Energy, and ISV business sector customers to help them realize the true potential of the DMS service. He has helped migrate hundreds of databases into the AWS Cloud using DMS and SCT.

Amit Arora is a Solutions Architect with a focus on Database and Analytics at AWS. He works with our Financial Technology and Global Energy customers and AWS certified partners to provide technical assistance and design customer solutions on cloud migration projects, helping customers migrate and modernize their existing databases to the AWS Cloud.

Jagadish Kumar is an Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Anusha Challa is a Senior Analytics Specialist Solution Architect at AWS focused on Amazon Redshift. She has helped many customers build large-scale data warehouse solutions in the cloud and on premises. Anusha is passionate about data analytics and data science and enabling customers to achieve success with their large-scale data projects.

Create, Train and Deploy Multi Layer Perceptron (MLP) models using Amazon Redshift ML

Post Syndicated from Anuradha Karlekar original https://aws.amazon.com/blogs/big-data/create-train-and-deploy-multi-layer-perceptron-mlp-models-using-amazon-redshift-ml/

Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse used by tens of thousands of customers to process exabytes of data every day to power their analytics workloads. Amazon Redshift includes a feature called Amazon Redshift ML, which puts the power of machine learning in the hands of every data warehouse user without requiring them to learn a new programming language, ML concepts, or ML tools. Redshift ML abstracts the intricacies of the traditional ML approach around a data warehouse, which involves repetitive, manual steps to move data back and forth between the data warehouse and ML tools to run long, complex, iterative ML workflows.

Redshift ML uses Amazon SageMaker Autopilot and Amazon SageMaker Neo in the background to make it easy for SQL users such as data analysts, data scientists, BI experts and database developers to create, train, and deploy machine learning (ML) models using familiar SQL commands and then use these models to make predictions on new data for use cases such as customer churn prediction, basket analysis for sales prediction, manufacturing unit lifetime value prediction, and product recommendations. Redshift ML makes the model available as SQL function within the Amazon Redshift data warehouse so you can easily use it in queries and reports.

Amazon Redshift ML supports supervised learning, including regression, binary classification, multi-class classification, and unsupervised learning using K-Means. You can optionally specify XGBoost, MLP, and linear learner model types, which are supervised learning algorithms used for solving either classification or regression problems, and provide a significant increase in speed over traditional hyperparameter optimization techniques. Amazon Redshift ML also supports bring-your-own-model to either import existing SageMaker models that are built using algorithms supported by SageMaker Autopilot, which can be used for local inference; or for the unsupported algorithms, one can alternatively invoke remote SageMaker endpoints for remote inference.

In this blog post, we show you how to use Redshift ML to solve binary classification problem using the Multi Layer Perceptron (MLP) algorithm, which explores different training objectives and chooses the best solution from the validation set.

A multilayer perceptron (MLP) is a deep learning method which deals with training multi-layer artificial neural networks, also called Deep Neural Networks. It is a feedforward artificial neural network that generates a set of outputs from a set of inputs. An MLP is characterized by several layers of input nodes connected as a directed graph between the input and output layers. MLP uses backpropagation for training the network. MLP is widely used for solving problems that require supervised learning as well as research into computational neuroscience and parallel distributed processing. It is also used for speech recognition, image recognition and machine translation.

Note that MLP in Redshift ML (which is powered by Amazon SageMaker Autopilot) currently supports tabular data.

Solution Overview

To use the MLP algorithm, you need to provide inputs or columns representing dimensional values and also the label or target, which is the value you’re trying to predict.

With Redshift ML, you can use MLP on tabular data for regression, binary classification, or multiclass classification problems. What makes MLP distinctive is that its output function can be linear or continuous; it need not be a straight line like the one a general regression model provides.

In this solution, we use binary classification to detect fraud based upon credit card transaction data. The difference between logistic regression and the perceptron is that logistic regression uses a logistic function, while the perceptron uses a step function. Using the multilayer perceptron model, machines can learn weight coefficients that help them classify inputs. This binary classifier is highly effective at arranging and categorizing input data into different classes, allowing probability-based predictions and classification into multiple categories. Multilayer perceptrons have the advantage of learning non-linear models and the ability to train models in real time.

For this solution, we first ingest the data into Amazon Redshift, then split it into training and validation sets, then use Amazon Redshift ML-specific queries to create the model, and finally use the generated SQL function to predict fraudulent transactions.

Prerequisites

To get started, we need an Amazon Redshift cluster or an Amazon Redshift Serverless endpoint and an AWS Identity and Access Management (IAM) role attached that provides access to SageMaker and permissions to an Amazon Simple Storage Service (Amazon S3) bucket.

For an introduction to Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

To create a simple cluster with a default IAM role, see Use the default IAM role in Amazon Redshift to simplify accessing other AWS services.

Dataset used

In this post, we use the Credit Card Fraud Detection dataset to create, train, and deploy an MLP model, which can then be used to identify fraudulent transactions in newly captured transaction records.

The dataset contains transactions made by credit cards in September 2013 by European cardholders.
It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a Principal Component Analysis (PCA) transformation. Due to confidentiality issues, the original features and more background information about the data are not provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount. The feature ‘Class’ is the response variable, and it takes the value 1 in case of fraud and 0 otherwise.

Here are sample records:

Prepare the data

Load the credit card dataset into Amazon Redshift using the following SQL. You can use the Amazon Redshift query editor v2 or your preferred SQL tool to run these commands.

Alternatively, we have provided a notebook, which can be downloaded here, that you can use to run all of the SQL commands. You will find instructions in this blog on how to import and use notebooks.

To create the table, use the following command:

CREATE TABLE creditcardsfrauds (
    txtime integer,
    v1 float8,
    v2 float8,
    v3 float8,
    v4 float8,
    v5 float8,
    v6 float8,
    v7 float8,
    v8 float8,
    v9 float8,
    v10 float8,
    v11 float8,
    v12 float8,
    v13 float8,
    v14 float8,
    v15 float8,
    v16 float8,
    v17 float8,
    v18 float8,
    v19 float8,
    v20 float8,
    v21 float8,
    v22 float8,
    v23 float8,
    v24 float8,
    v25 float8,
    v26 float8,
    v27 float8,
    v28 float8,
    amount float8,
    class integer
);

Load the data

To load data into Amazon Redshift, use the following COPY command:

COPY creditcardsfrauds
FROM 's3://redshift-ml-blog-mlp/creditcard.csv' 
IAM_ROLE default
CSV QUOTE as '\"' delimiter ',' IGNOREHEADER 1 maxerror 100
REGION 'us-east-1';
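Optionally, you can confirm the load and the class imbalance described earlier with a simple aggregation; you should see roughly 284,807 rows in total, with 492 rows where class is 1:

SELECT class, COUNT(*) AS transactions
FROM creditcardsfrauds
GROUP BY class
ORDER BY class;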

Before creating the model, we want to divide our data into two sets by splitting 80% of the dataset for training and 20% for validation, which is a common practice in ML. The training data is input to the ML model to identify the best possible algorithm for the model. After the model is created, we use the validation data to validate the model accuracy.

So, in the creditcardsfrauds table, we check the distribution of data based upon the txtime value and identify the cutoff for around 80% of the data to use for training the model.

With this, the highest txtime value for the training portion comes to 120954 (based upon the distribution of txtime’s minimum and maximum values, ranking with a window function, and ceil(count(*)*0.80)). We therefore use the transaction records with a txtime value less than 120954 as training data, and then validate the model’s accuracy by checking whether it correctly identifies fraudulent transactions (by predicting the class attribute) on the remaining 20% of the data.
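One way to derive this cutoff, following the approach described above (ranking with a window function and ceil(count(*)*0.80)), is a query along these lines; it is an illustrative sketch rather than the only option:

SELECT txtime AS cutoff_txtime
FROM (
    SELECT txtime,
           ROW_NUMBER() OVER (ORDER BY txtime) AS rn,
           COUNT(*) OVER () AS total_rows
    FROM creditcardsfrauds
) ranked
WHERE rn = CEIL(total_rows * 0.80);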

The 80% cutoff need not always be based on ordered time; depending on the use case, the split can also be chosen randomly.

Create a model in Redshift ML

To create the model, use the following command:

CREATE MODEL creditcardsfrauds_mlp
FROM (select * from creditcardsfrauds where txtime < 120954)
TARGET class 
FUNCTION creditcardsfrauds_mlp_fn
IAM_ROLE default
MODEL_TYPE MLP
SETTINGS (
      S3_BUCKET '<<your-amazon-s3-bucket-name>>',
      MAX_RUNTIME 9600
);

Here, in the settings section of the command, you need to set up an S3_BUCKET which is used to export the data that is sent to SageMaker and store model artifacts.

S3_BUCKET setting is a required parameter of the command, whereas MAX_RUNTIME is an optional one which specifies the maximum amount of time to train. The default value of this parameter is 90 minutes (5400 seconds), however you can override it by explicitly specifying it in the command, just like we have done it here by setting it to run for 9600 seconds.

The preceding statement initiates an Amazon SageMaker Autopilot process in the background to automatically build, train, and tune the best ML model for the input data. It then uses Amazon SageMaker Neo to deploy that model locally in the Amazon Redshift cluster or Amazon Redshift Serverless as a user-defined function (UDF).

You can use the SHOW MODEL command in Amazon Redshift to track the progress of your model creation, which should be in the READY state within the max_runtime parameter you defined while creating the model.

To check the status of the model, use the following command:

show model creditcardsfrauds_mlp;

We notice from the SHOW MODEL output that the F1 score for the model is 0.908, which indicates very good performance.

To elaborate, the F1 score is the harmonic mean of precision and recall. It combines precision and recall into a single number using the following formula:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Here, precision means: of all positive predictions, how many are really positive?

And recall means: of all real positive cases, how many are predicted positive?

F1 scores can range from 0 to 1, with 1 representing a model that perfectly classifies each observation into the correct class and 0 representing a model that is unable to classify any observation into the correct class. So higher F1 scores are better.

The following is the detailed tabular outcome for the preceding command after model training was done.

Model Name: creditcardsfrauds_mlp
Schema Name: public
Owner: redshiftml
Creation Time: Sun, 25.09.2022 16:07:18
Model State: READY
validation:binary_f_beta: 0.908864
Estimated Cost: 112.296925

TRAINING DATA:
Query: SELECT * FROM CREDITCARDSFRAUDS WHERE TXTIME < 120954
Target Column: CLASS

PARAMETERS:
Model Type: mlp
Problem Type: BinaryClassification
Objective: F1
AutoML Job Name: redshiftml-20221118035728881011
Function Name: creditcardsfrauds_mlp_fn, creditcardsfrauds_mlp_fn_prob
Function Parameters: txtime v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 v21 v22 v23 v24 v25 v26 v27 v28 amount
Function Parameter Types: int4 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8 float8
IAM Role: default
S3 Bucket: redshift-ml-blog-mlp
Max Runtime: 54000

Redshift ML now supports prediction probabilities for binary classification models. For a classification problem in machine learning, each label predicted for a given record can be associated with a probability that indicates how likely the record really belongs to that label. With probabilities available alongside the label, you can choose to act on a classification result only when the model’s confidence in the chosen label is higher than a certain threshold.

Prediction probabilities are calculated by default for binary classification models, and an additional function is created when the model is created, without impacting the performance of the ML model.

In the output above, you will notice that the prediction probabilities enhancement has added another function, suffixed with _prob and named creditcardsfrauds_mlp_fn_prob, which can be used to get prediction probabilities.

Additionally, you can check the model explainability to understand which inputs contributed effectively to derive the prediction.

Model explainability helps to understand the cause of prediction by answering questions such as:

  • Why did the model predict a negative outcome such as blocking of credit card when someone travels to a different country and withdraws a lot of money in different currency?
  • How does the model make predictions? Credit card data can be put in a tabular format, and because MLP involves a fully connected neural network of several layers, we can tell which input features contributed to the model output and by how much.
  • Why did the model make an incorrect prediction? For example, why is the card blocked even though the transaction is legitimate?
  • Which features have the largest influence on the behavior of the model? Is it just the location where the credit card is swiped, or do the time of day and unusual credit consumption also influence the prediction?

Run the following SQL command to retrieve the values from the explainability report:

SELECT json_table.report.explanations.kernel_shap.label0.global_shap_values 
FROM (select explain_model('creditcardsfrauds_mlp') as report) as json_table;

In the preceding query, we select only the column that contains the Shapley values from the response returned by the explain_model function. The values in the JSON object show the contribution of each feature toward the prediction. For example, the v14 feature influences the prediction the most, while the txtime feature plays no significant role in predicting class.

Model validation

Now let’s run the prediction query and validate the accuracy of the model on the validation dataset:

SELECT actualvspredicted, COUNT(*)
FROM (
  SELECT 
      CASE WHEN class =  
      creditcardsfrauds_mlp_fn(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount) 
      THEN 'PredictedMatchesActual' 
      else 'NoMatch' 
      END as actualvspredicted
    FROM creditcardsfrauds 
    WHERE txtime >= 120954
) 
group by actualvspredicted

We can observe here that Redshift ML is able to identify 99.88 percent of the transactions correctly as fraudulent or non-fraudulent.
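If you prefer a single accuracy figure instead of the per-group counts, a minimal variation of the same comparison (using the same validation filter) could look like the following:

SELECT 100.0 * sum(CASE WHEN class = creditcardsfrauds_mlp_fn(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount) THEN 1 ELSE 0 END) / count(*) AS accuracy_pct
FROM creditcardsfrauds
WHERE txtime >= 120954;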

Now you can continue to use the SQL function creditcardsfrauds_mlp_fn for local inference in any part of a SQL query while analyzing, visualizing, or reporting on newly arriving as well as existing data!

--CREATE A STAGING TABLE TO HOLD NEWLY ARRIVING DATA FROM THE SOURCE, WHICH WILL NOT CONTAIN THE CLASS COLUMN BECAUSE IT IS TO BE PREDICTED
DROP TABLE creditcardsfrauds_staging;
CREATE TABLE creditcardsfrauds_staging as (select * from creditcardsfrauds limit 0);
Alter table creditcardsfrauds_staging drop column class;

--LETS CONSIDER ONLY ONE RECORD HERE WHICH HAS NEWLY ARRIVED
insert into creditcardsfrauds_staging values(174965,-39999.11383160738512,0.58586417180689,-5.39973021073242,1.81709247345531,-0.840618465991056,-2.94354779071974,-2.20800192003372,1.05873267723056,-1.63233334974982,-5000.24598383776964,11.93351953683592,-53046479695456,-1.12745457501155,-666666.41662797597451,0.141237234328704,-2.54949823633632,-4.61471706851594,-10.47813794126038,-0.0354803664667244,0.306270740368093,0.583275998701341,-0.269208637986581,-0.456107772584008,-0.183659129549716,-0.328167759255761,0.606115810329683,0.884875539542905,-0.253700318894381,-2450000000);

--USE THE FUNCTION TO PREDICT THE VALUE OF CLASS
SELECT txtime, creditcardsfrauds_mlp_fn(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount)
FROM creditcardsfrauds_staging;

Here the output 1 means that the newly captured transaction is fraudulent as per the inference.

Additionally, you can change the preceding query to include the prediction probability of the output label for this scenario and decide whether you still want to rely on the model's prediction.

--USE THE FUNCTION TO PREDICT THE VALUE OF CLASS ALONG WITH THE PROBABILITY
Select txtime, predictedactive.labels[0], predictedactive.probabilities[0] 
from (
SELECT txtime, creditcardsfrauds_mlp_fn_prob(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount) as predictedactive
FROM creditcardsfrauds_staging ) temp;

The above screenshot shows that this transaction has 100% likelihood of being fraudulent.
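If you only want to act on high-confidence predictions, the thresholding approach mentioned earlier can be expressed as a filter on the returned probability. The following is a minimal sketch that assumes a 0.9 confidence threshold:

Select txtime, predictedactive.labels[0] as predicted_class, predictedactive.probabilities[0] as confidence
from (
SELECT txtime, creditcardsfrauds_mlp_fn_prob(txtime,v1,v2,v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21,v22,v23,v24,v25,v26,v27,v28,amount) as predictedactive
FROM creditcardsfrauds_staging ) temp
where predictedactive.probabilities[0] > 0.9;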

Clean up

To avoid incurring future charges, stop the Redshift cluster when it's not in use, or terminate it altogether if you ran the exercise in this post only for experimental purposes. If you're using Amazon Redshift Serverless instead, you're charged only when it's in use, so no action is needed; with a provisioned cluster, you must stop or terminate the cluster as mentioned earlier.

Conclusion

Redshift ML makes it easy for users of all levels to create, train, and tune models using a SQL interface. In this post, we walked you through how to use the MLP algorithm to create a binary classification model. You can then use that model to make predictions using simple SQL commands and gain valuable insights.

To learn more about Redshift ML, visit Amazon Redshift ML.


About the authors

Anuradha Karlekar is a Solutions Architect at AWS working primarily with Partners and Startups. She has over 15 years of IT experience spanning full stack development, deployment, building data ETL pipelines, and visualization. She is passionate about data analytics and text search. Outside work, she is a travel enthusiast!

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Abhishek Pan is a Solutions Architect-Analytics at AWS India. He engages with customers to define data-driven strategy, provides deep dive sessions on analytics use cases, and designs scalable and performant analytical applications. He has over 11 years of experience and is passionate about databases, analytics, and solving customer problems with the help of cloud solutions. He is an avid traveller and tries to capture the world through his lens.

Debu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle OpenWorld, and JavaOne. He is lead author of EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt).

Amazon EMR launches support for Amazon EC2 M6A, R6A instances to improve cost performance for Spark workloads by 15–50% 

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-m6a-r6a-instances-to-improve-cost-performance-for-spark-workloads-by-15-50/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over 2x performance improvements over open-source Apache Spark and Presto.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as M6a and R6a, which use third-generation AMD EPYC processors. These instances improve the price performance of running Spark workloads on Amazon EMR by 15–50 percent over previous generation instances. In this blog post, we describe how we estimated this price-performance benefit.

Amazon EMR runtime performance with EC2 M6A instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with M6a instances. Data was stored in Amazon Simple Storage Service (Amazon S3), and results were compared to equivalent clusters with M5a, which is the previous generation instance family. We measured performance improvements using the total query runtime and the geometric mean of query runtime across TPC-DS 3 TB benchmark queries.

Our results showed a 23.6–50.3 percent improvement in total query runtime performance and 22.8–52.4 percent in geometric mean on an EMR cluster with M6a compared to an equivalent EMR cluster with M5a instances. In comparing costs, we observed a 23.2–41.4 percent reduction in cost on the EMR cluster with M6a compared to the equivalent with M5a. M6A 48 XL and 32 XL instances were not benchmarked because the M5A generation does not offer equivalent sizes.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6a and M5a instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Total size of the cluster (1 Leader + 5 core nodes) 6 6 6 6 6 6 6
Total query runtime on M5A (seconds) 6624.1713838714 5466.7251180433 5269.0578151495 5366.1486275129 7753.6218015794 12118.0922180235 21070.6905510002
Total query runtime on M6A (seconds) 3295.2894058371 3063.7807673078 3399.1509249577 3482.8401591909 4906.2216891762 9184.4366036450 16107.9707619002
Total query runtime improvement with M6A 50.25% 43.96% 35.49% 35.10% 36.72% 24.21% 23.55%
Geometric mean query runtime M5A (sec) 51.1422829354 40.9550798753 38.4890223194 35.3863834186 44.8454957416 61.0454658020 92.6414502105
Geometric mean query runtime M6A (sec) 24.3406154481 22.3484713891 22.9913163520 23.0351017440 28.2855683398 46.4363267349 71.5498816854
Geometric mean query runtime improvement with M6A 52.41% 45.43% 40.27% 34.90% 36.93% 23.93% 22.77%
EC2 M5A instance price ($ per hour) $4.12800 $2.75200 $2.06400 $1.37600 $0.68800 $0.34400 $0.17200
EMR M5A instance price ($ per hour) $0.27000 $0.27000 $0.27000 $0.27000 $0.17200 $0.08600 $0.04300
(EC2 + EMR) M5A instance price ($ per hour) $4.39800 $3.02200 $2.33400 $1.64600 $0.86000 $0.43000 $0.21500
Cost of running on M5A ($ per instance) $8.09253 $4.58901 $3.41611 $2.45352 $1.85225 $1.44744 $1.25839
EC2 M6A instance price ($ per hour) $4.14720 $2.76480 $2.07360 $1.38240 $0.69120 $0.34560 $0.17280
EMR M6A price ($ per hour per instance) $1.03680 $0.69120 $0.51840 $0.34560 $0.17280 $0.08640 $0.04320
(EC2 + EMR) M6A instance price ($ per hour) $5.18400 $3.45600 $2.59200 $1.72800 $0.86400 $0.43200 $0.21600
Cost of running on M6A ($ per instance) $4.74522 $2.94123 $2.44739 $1.67176 $1.17749 $1.10213 $0.96648
Total cost reduction with M6A including performance improvement -41.36% -35.91% -28.36% -31.86% -36.43% -23.86% -23.20%

The following graph shows per query improvements observed on M6a 2XL instances compared to equivalent M5a generation. We observed that two queries take longer to execute on M6a instance clusters compared to M5a instance clusters. Q91 regressed up to 6.64 percent and Q55 regressed up to 1.86 percent on 4 XL instance clusters.

Amazon EMR runtime performance with EC2 R6A instances

R6A instances showed a similar performance improvement when running Apache Spark workloads compared to equivalent R5A instances. R6A 32XL and 48XL instances were not benchmarked because R5A instances are not available in those sizes. Our results showed a 16–58.22 percent improvement in total query runtime across seven instance sizes within the instance family and a 20.04–59.59 percent improvement in geometric mean. In comparing costs, we observed a 15.85–50.07 percent reduction in cost on R6A instance EMR clusters compared to R5A instance EMR clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6A and R5A instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Total size of the cluster (1 Leader + 5 core nodes) 6 6 6 6 6 6 6
Total query runtime on R5A (seconds) 6934.22936 5530.74672 5834.32344 5718.72582 7615.58392 11431.37368 20688.58642
Total query runtime on R6A (seconds) 2897.44817 2906.49952 3017.85315 3488.83875 4661.32856 7717.33575 17378.49043
Total query runtime improvement with R6A 58.22% 47.45% 48.27% 38.99% 38.79% 32.49% 16.00%
Geometric mean query runtime R5A (sec) 53.27574 41.76973 42.50324 37.62155 44.58173 58.88182 91.72095
Geometric mean query runtime R6A (sec) 21.52803 21.36831 19.94607 21.59493 26.90097 36.57557 73.3405
Geometric mean query runtime improvement with R6A 59.59% 48.84% 53.07% 42.60% 39.66% 37.88% 20.04%
EC2 R5A instance price ($ per hour) $5.42400 $3.61600 $2.71200 $1.80800 $0.90400 $0.45200 $0.22600
EMR R5A instance price ($ per hour) $0.27000 $0.27000 $0.27000 $0.27000 $0.22600 $0.11300 $0.05700
(EC2 + EMR) R5A instance price ($ per hour) $5.69400 $3.88600 $2.98200 $2.07800 $1.13000 $0.56500 $0.28300
Cost of running on R5A ($ per instance) $10.96764 $5.97013 $4.83276 $3.30098 $2.39045 $1.79409 $1.62635
EC2 R6A instance price ($ per hour) $5.44320 $3.62880 $2.72160 $1.81440 $0.90720 $0.45360 $0.22680
EMR R6A price ($ per hour per instance) $1.36080 $0.90720 $0.68040 $0.45360 $0.22680 $0.11340 $0.05670
(EC2 + EMR) R6A instance price ($ per hour) $6.80400 $4.53600 $3.40200 $2.26800 $1.13400 $0.56700 $0.28350
Cost of running on R6A ($ per instance) $5.47618 $3.66219 $2.85187 $2.19797 $1.46832 $1.21548 $1.36856
Total cost reduction with R6A including performance improvement -50.07% -38.66% -40.99% -33.41% -38.58% -32.25% -15.85%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark and uses queries from the Spark SQL Performance Tests GitHub repo with minor fixes applied.

We calculated TCO by multiplying cost per hour by number of instances in the cluster and time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.
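As a sanity check of that method against the first table, and assuming the per-instance cost rows are the hourly (EC2 + EMR) price multiplied by the total query runtime, the M6a 24xlarge figures work out as follows:

$5.184 per hour × (3,295.2894 seconds ÷ 3,600 seconds per hour) ≈ $4.745

which matches the "Cost of running on M6A ($ per instance)" value of $4.74522 reported in the table. Multiply by the six nodes in the cluster to get the total cluster cost for the run.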

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with M6A and R6A instances compared to using equivalent previous-generation instances. Using these new instances with Amazon EMR improves price performance by 15–50%.


About the authors

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoys playing video games.

Scale read and write workloads with Amazon Redshift

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/scale-read-and-write-workloads-with-amazon-redshift/

Amazon Redshift is a fast, fully managed, petabyte-scale cloud data warehouse that enables you to analyze large datasets using standard SQL. The concurrency scaling feature in Amazon Redshift automatically adds and removes capacity to handle demand from thousands of concurrent users, thereby providing consistent SLAs for unpredictable and spiky workloads such as BI reports, dashboards, and other analytics workloads.

Until now, concurrency scaling only supported auto scaling for read queries; write queries had to run on the main cluster. Now, we are extending concurrency scaling to support auto scaling for common write queries, including COPY, INSERT, UPDATE, and DELETE. This is available on Amazon Redshift RA3 provisioned instance types in the Regions where concurrency scaling is available. Amazon Redshift Serverless comes with a built-in dynamic auto scaling capability for read workload scaling.

In this post, we discuss how to enable concurrency scaling to offer consistent SLAs for concurrent workloads such as data loads, ETL (extract, transform, and load), and data processing with reduced queue times.

Concurrency scaling overview

With concurrency scaling, Amazon Redshift automatically and elastically scales query processing power to provide consistently fast performance for hundreds of concurrent queries. Concurrency scaling resources are added to your Amazon Redshift cluster transparently in seconds, as concurrency increases, to serve sudden spikes in concurrent requests with fast performance without wait time. When the workload demand subsides, Amazon Redshift automatically shuts down concurrency scaling resources to save you cost.

The following diagram shows how concurrency scaling works at a high level.

The workflow contains the following steps:

  1. All queries go to the main cluster.
  2. When queries in the designated workload management (WLM) queue begin queuing, Amazon Redshift automatically routes eligible queries to the new clusters, enabling concurrency scaling.
  3. Amazon Redshift automatically spins up a new cluster, processes waiting queries, and shuts down the concurrency scaling cluster when no longer needed.

Enable Amazon Redshift concurrency scaling

You can manage concurrency scaling at the WLM queue level, where you set concurrency scaling policies for specific queues. When concurrency scaling is enabled for a queue, eligible write and read queries are sent to concurrency scaling clusters without having to wait for resources to free up on the main Amazon Redshift cluster. Amazon Redshift handles spinning up concurrency scaling clusters, routing of the queries to the transient clusters, and relinquishing the concurrency clusters.

You can enable concurrency scaling on both automatic and manual WLM.

You first need to determine which parameter group your cluster is. To do so, complete the following steps:

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose your cluster.
  3. On the Properties tab, note the parameter group associated with the cluster.
    Now you can configure your WLM parameters.
  4. Under Configurations in the navigation pane, choose Workload management.
  5. Choose the parameter group associated with the cluster. If you're using the default parameter group default.redshift-1.0, you need to create a custom parameter group and assign that to the cluster. The default parameter group has preset values for each of its parameters, and it can't be modified.
  6. On the Parameters tab, you can choose between 1–10 for max_concurrency_scaling_clusters. This is the maximum number of concurrency scaling clusters you can have running at the same time. Ten is the soft limit; this limit can be increased by submitting a service limit increase request with a support case.
  7. On the Workload management tab, choose auto mode for the concurrency scaling cluster.

Example use cases

In this section, we use three use cases to help you understand how concurrency scaling for read and write heavy workloads can seamlessly scale to improve workload performance SLAs.

We used a 3 TB Cloud DW benchmark dataset. The test included a total of 103 concurrent queries, with each run using a separate database connection. The 103 queries comprised 60 queries from the 99 TPC-DS queries and 43 write queries, with a mix of COPY, INSERT, UPDATE, and DELETE statements. We used five ra3.4xlarge compute nodes.

The following scenarios showcase how concurrency scaling for reads and writes can seamlessly auto scale and positively impact a heavy concurrent mixed workload:

  • All queries triggered concurrently with concurrency scaling turned off
  • All queries triggered concurrently with concurrency scaling cluster limit set to 5 clusters
  • All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

Scenario 1: All queries triggered concurrently with concurrency scaling turned off

In this benchmark test, all queries completed in 299 minutes. The following are the test details.

The Amazon Redshift query optimizer turned the 103 queries into 257 sub-queries for better performance in this run. Amazon Redshift continues to learn from operational statistics to optimize your workload.

The following screenshot shows how Amazon Redshift auto WLM mode chose to run 16 queries concurrently while queuing the rest. Because concurrency scaling is turned off, no additional clusters are spun up and the queries continue to wait for running queries to complete before they can be processed. Notice the number of queries queued stayed at a higher number for a long period of time and eventually lowered as only a few queries could concurrently run.

No additional concurrent clusters spun up during the window of the workload, as seen in the following screenshot, requiring the primary cluster to process all the queries.

Scenario 2: All queries triggered concurrently with concurrency scaling cluster max limit set to 5 clusters

In this test, all queries completed in 49 minutes.

The following screenshot depicts significant queuing. Within seconds, five additional Amazon Redshift clusters are spun up into ready state, allowing 53 queries to run simultaneously. This number can change in your cluster based on the query types. Notice the number of queries queued starts lowering as more queries are completed using the five additional clusters.

Over time, the concurrency scaling clusters start to wind down progressively to 0 as the queries no longer waited.

Scenario 3: All queries triggered concurrently with concurrency scaling cluster limit set to 10 clusters

In this test, all queries completed in 28 minutes.

The following screenshot depicts significant queuing. Within seconds, 10 additional Amazon Redshift clusters are spun up into ready state, allowing multiple queries to run simultaneously. This number can change in your cluster based on the query types. Notice the number of queries queued starts lowering as more queries are completed using the 10 additional clusters.

Over time, the concurrency scaling clusters start to wind down progressively to 0 as the queries no longer waited.

Test results review

The following table summarizes our test results.

  • Test Scenario 1: total workload completion time of 299 minutes
  • Test Scenario 2: total workload completion time of 49 minutes
  • Test Scenario 3: total workload completion time of 28 minutes

The test results reveal how concurrency scaling for a mixed workload of reads and writes lowered the total workload completion time from 299 minutes to 28 minutes, more than a 10-times improvement in SLAs, while remaining cost-effective because you pay for the additional clusters only when scaling is necessary.

Monitor concurrency scaling

One method to monitor concurrency scaling is via system views. To monitor which queries benefited from concurrency scaling, you can use the concurrency_scaling_status column in stl_query. A value of 1 indicates that the query ran on a concurrency scaling cluster. To monitor concurrency scaling usage, you can use the SVCS_CONCURRENCY_SCALING_USAGE system view.
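For example, the following queries are a minimal sketch of how you might use these views; adjust the projected columns to your needs:

-- Queries that ran on a concurrency scaling cluster
SELECT query, starttime, endtime
FROM stl_query
WHERE concurrency_scaling_status = 1
ORDER BY starttime DESC;

-- Accumulated concurrency scaling usage
SELECT * FROM svcs_concurrency_scaling_usage;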

The Amazon CloudWatch metrics ConcurrencyScalingActiveClusters and ConcurrencyScalingSeconds enable you to set up monitoring of concurrency scaling usage. For more information, refer to Monitoring Amazon Redshift using CloudWatch metrics.

Configure usage limit

For every 24 hours that your main Amazon Redshift cluster is in use, you accrue 1 hour of concurrency scaling credit. This free credit can be used by both read and write queries. For example, a cluster that runs continuously for 30 days accrues 30 hours of free concurrency scaling usage over that period. For any usage that exceeds the accrued free usage credits, you're billed on a per-second basis based on the on-demand rate of your Amazon Redshift cluster. You can apply cost controls for concurrency scaling at the cluster level. You can also create multiple queues for ETL, dashboard, and ad hoc workloads and turn on concurrency scaling only for selected queues.

As shown in the following screenshot, you can choose a time period (daily, weekly, or monthly) and specify the desired usage limit. You can then choose an action option (Alert, Log to system table, or Disable feature). For more details on how to set cost controls for concurrency scaling, refer to Manage and control your cost with Amazon Redshift Concurrency Scaling and Spectrum.

Summary

In this post, we showed how you can enable concurrency scaling to help you meet the SLAs for both read and write workloads by seamlessly scaling out to the maximum number of clusters you configured, thereby increasing your cluster throughput while controlling your costs. Concurrency scaling with read and write capability can enable you to handle a number of scenarios, such as sudden increases in the volume of data in your data pipeline, backfill operations, ad hoc reporting, and month end processing. It’s now time to put this learning into action and begin optimizing your Redshift cluster(s) for both read and write throughput!


About the Authors

Harsha Tadiparthi is a specialist Principal Solutions Architect, Analytics at AWS. He enjoys solving complex customer problems in databases and analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

Harshida Patel is a Specialist Principal Solutions Architect, Analytics with AWS.

Ramu Ponugumati is a Sr. Technical Account Manager, specialist in Analytics and AI/ML at AWS. He works with enterprise customers to modernize and cost optimize workloads, and helps them build reliable and secure applications on the AWS platform. Outside of work, he loves spending time with his family, playing tennis, and gardening.

The AWS Customer Data Platform: overview and architecture

Post Syndicated from Larry Bell original https://aws.amazon.com/blogs/architecture/aws-customer-data-platform-overview-and-architecture/

The deprecation of digital consumer identifiers, such as third-party cookies and mobile advertising IDs, and the rapid growth of data from expanding consumer touchpoints, has created challenges in identifying, managing, and reaching customers in digital channels. Organizations must rethink their strategies for collecting and storing customer data. Customer Data Platforms (CDPs) collect, aggregate, and organize customer data sources, and create individual centralized customer profiles to better manage and understand customers.

Independent Software Vendors (ISVs) in the advertising and marketing industry vertical can aid many companies in achieving these goals. ISVs can help organizations with the heavy lifting required to build, secure, govern, and maintain near real-time, high-volume CDPs. However, building these types of vendor solutions, which must support a large number of customers, data volumes, and use cases, is a complex undertaking with often unforeseen challenges.

This post examines the logical architecture of the CDP to provide guidance to help reduce complexity, increase agility, improve operational excellence, and optimize cost. Although the material can benefit advanced marketers evaluating building a CDP, the intent of this post is to provide guidance to those ISVs that are facing these challenges for multiple clients.

Marketing CDP logical architecture

The principal challenge of the CDP architecture is to integrate data from many disparate sources and types. Envision a data lake-centric approach based on a layered architecture where the sources of customer data flow through six logical layers: Ingestion; Processing; Storage; Unified Governance (and Security); Cataloging; and Consumption.

A layered, component-oriented architecture promotes separation of concerns, decoupling of tasks, and the flexibility required to build each component consistent with best practices. These components provide the agility necessary to quickly integrate new data sources and support new analytics or product capabilities. The components are depicted in the image below in the conceptual logical model of the marketing CDP and are then described in this post.

Figure 1: Marketing CDP logical architecture

CDP components

Logical ingestion layer

The ingestion layer is responsible for collecting data across various customer touchpoints. It provides the ability to connect with internal and external data sources. It can ingest batch, near real-time, and real-time data into the storage layer. This layer aggregates data from multiple source systems and therefore elevates the marketing CDP as the primary repository of marketing and advertising data across an organization. Sources can include cloud or on-premise data sources, streaming data, file stores, third-party Software as a Service (SaaS) connectors and APIs. Separating this component into three layers reduces complexity while providing agility to the process.

Logical storage layer

A scalable, flexible, resilient, and reliable storage layer is critical to the value proposition of a marketing CDP. It consists of three distinct storage areas:

  • Raw Zone – Contains ingested data in its original, immutable format, which can be used to source additional attributes in the future. It can also be used to restore data in certain disaster recovery scenarios. This layer acts as an immutable record of what has happened/been observed historically so that we can use that immutable data to generate a source of fact.
  • Clean Zone – Contains the first transformation of raw data, including conversions to an efficient data format such as Parquet or Avro, as well as basic data quality validations. This layer also acts as an ad hoc layer for developing answers to previously unknown questions in reasonable time frames so that they can later be migrated to the curated zone.
  • Curated Zone – Contains data, organized by subject area, that is ready for consumption by users and applications. For a marketing CDP, this includes identity resolution, data enrichment, customer segmentation, and aggregation.

Automated data archival can be configured individually for each layer, and aligned to compliance requirements set by the organization. Access to these layers is controlled at a granular level to ensure a secure and collaborative data exchange and exploration.

Logical cataloging layer

The cataloging layer provides a centralized governance control, including mechanisms for data access control, versioning, and metadata exploration. It provides the ability to track the schema and the partitioning of datasets. This layer makes the datasets discoverable. The Governance capabilities of the Catalog ensure standardization for audit purposes.

Logical processing layer

This layer is responsible for transforming data into a consumable state by applying business rules for data validation, identity resolution, segmentation, normalization, profile aggregation, and machine learning (ML) processing. This layer comprises custom application logic. The compute resources for this layer are designed to scale independently from storage to handle large data volumes; support schema-on-read, support partitioned data and diverse data formats; and orchestrate event-based data processing pipelines.

Logical consumption layer

The consumption layer is responsible for providing scalable tools to gain insights from the vast amount of data in the marketing CDP.

  • Analytics layer – Enables consumption by all user personas through several purpose-built analytics tools that support analysis methods, including ad-hoc SQL queries, batch analytics, business intelligence (BI) dashboards and ML-based insights. Components in this layer should support schema-on-read, data partitioning, and a variety of formats.
  • Data collaboration layer – Consists of data clean rooms where organizations can aggregate customer data from different marketing channels or lines of business, and combine it with first-party data to gain insights while enforcing security, anonymization, and compliance controls.
  • Activation layer – This layer integrates customer profiles with the organization. It can also integrate with third-party SaaS providers in the advertising and marketing industry, and is capable of enriching data sets for consumption.

Logical security and governance layer

The Security and Governance layer is responsible for providing mechanisms for access control, encryption, auditing, and data privacy. CDP platforms must securely organize and control the flow of customer event and attribute data. The CDP must manage data, regardless of ingestion method, to unify that data to unique customer profiles, centralizing audience segmentation, and forwarding data to your purpose-built data stores.

Privacy regulations, which often vary by region or country, make it necessary to focus on collecting only the vital data for your marketing efforts. The CDP must align to a standards-based security process. There must be procedures in place to audit data collection, follow least privilege data access, and avoid data silos.

A marketing CDP must include the following security and governance aspects:

  • Encryption at rest – Data must be persisted in encrypted format to protect it from unauthorized access.
  • Encryption in transit – To protect data in transit, use encryption protocols such as TLS along with certificates to create a secure HTTPS connection for API requests.
  • Key management – Keys must be managed securely because they grant access to data.
  • Secrets management – Secrets, such as application passwords and login credentials, must be protected from unintended access.
  • Fine-grained access controls – Control data access to only those users that have the right to see the data.
  • Data archival – Users need to take advantage of storage tiers and data lifecycle policies, which automatically move data to lower cost tiers over time, based on expected access patterns.
  • Auditing – It is critical to monitor and record all activity within the environment with the goal of being able to analyze activity down to individual API call level.
  • Data masking – It is important to allow users the ability to automatically detect and optionally mask, substitute, or encrypt/decrypt Personally Identifiable Information (PII). This helps outputs of the CDP to comply with such standards as HIPAA and GDPR.
  • Compliance programs – Compliance frameworks such as SOC2, GDPR, CCPA, and others can be attested by tying together governance-focused, audit-friendly service features with applicable compliance or audit standards.

Conclusion: Using the CDP to better manage customers

In this post, we reviewed a logical CDP data architecture that addresses several complexities at scale, using a decoupled, component-driven architecture. The Data Analytics Lens can provide further guidance when designing, deploying, and architecting analytics solution workloads. In addition, ISVs should consider a serverless model for implementation, which helps optimize cost and scalability while reducing the required maintenance on the system.

Migrate a large data warehouse from Greenplum to Amazon Redshift using AWS SCT – Part 3

Post Syndicated from Jon Roberts original https://aws.amazon.com/blogs/big-data/part-3-migrate-a-large-data-warehouse-from-greenplum-to-amazon-redshift-using-aws-sct/

In this third post of a multi-part series, we explore some of the edge cases in migrating a large data warehouse from Greenplum to Amazon Redshift using AWS Schema Conversion Tool (AWS SCT) and how to handle these challenges. Challenges include how best to use virtual partitioning, edge cases for numeric and character fields, and arrays.

You can check out the first post of this series for guidance on planning, running, and validating the migration. You can also check out the second post for best practices for choosing the optimal Amazon Redshift cluster, data architecture, converting stored procedures, compatible functions and queries widely used for SQL conversions, and recommendations for optimizing the length of data types for table columns.

Unbounded character data type

Greenplum supports creating columns as text and varchar without specifying the length of the field. This works without an issue in Greenplum but doesn’t work well in migrating to Amazon Redshift. Amazon Redshift stores data in columnar format and gets better compression when using shorter column lengths. Therefore, the Amazon Redshift best practice is to use the smallest character length possible.

AWS SCT will convert these unbounded fields as large objects (LOBs) instead of treating the columns as character fields with a specified length. LOBs are implemented differently in each database product on the market, but in general, a LOB is not stored with the rest of the table data. Instead, there is a pointer to the location of the data. When the LOB is queried, the database reconstitutes the data automatically for you, but this typically requires more resources.

Amazon Redshift doesn’t support LOBs, so AWS SCT resolves this by loading the data into Amazon Simple Storage Service (Amazon S3) and in the column, it stores the S3 URL. When you need to retrieve this data, you have to query the table, get the S3 URL, and then fetch the data from Amazon S3. This isn’t ideal because most of the time, the actual maximum length of the field doesn’t require it to be treated as a LOB, and storing the data remotely means it will take much longer to fetch the data for queries.

The current resolution is to calculate the maximum length of these columns and update the Greenplum tables before converting to Amazon Redshift with AWS SCT.

Note that in a future release of AWS SCT, the collection of statistics will include calculating the maximum length for each column, and the conversion of unbounded varchar and text will set the length in Amazon Redshift automatically.

The following code is an example of an unbounded character data type:

CREATE TABLE public.mytable 
(description1 text NOT NULL PRIMARY KEY, description2 varchar);
CREATE INDEX ON public.mytable (description2);

This table uses a primary key column on an unbounded text column. This needs to be converted to varchar(n), where n is the maximum length found in this column.

  1. Drop unique constraints on affected columns:
    ALTER TABLE public.mytable DROP CONSTRAINT mytable_pkey;

  2. Drop indexes on affected columns:
    DROP INDEX public.mytable_description2_idx;

  3. Calculate maximum length of affected columns:
    select coalesce(max(length(description1)), 10),
    coalesce(max(length(description2)), 10) 
    from public.mytable;

Note that in this example, the coalesce function returns a default length of 10 when the description1 or description2 column contains only NULL values or the table has no data; otherwise, the calculated maximum length of the column is returned.

  4. Alter the length of the affected columns:
    ALTER TABLE public.mytable ALTER COLUMN description1 TYPE varchar(10);
    ALTER TABLE public.mytable ALTER COLUMN description2 TYPE varchar(10);

You can now proceed with using AWS SCT to convert the Greenplum schema to Amazon Redshift and avoiding using LOBs to store the column values.

GitHub help

If you have many tables to update and want an automated solution, you can use the add_varchar_lengths.sh script found in the GitHub repo to fix all of the unbounded varchar and text columns in a given schema in Greenplum. The script calculates the appropriate maximum length and then alters the Greenplum tables so the varchar data type is bounded by a length.

Please note that the script also will drop any constraints or indexes on the affected columns.

Empty character data

Greenplum and Amazon Redshift support an empty string value in a field that is different from NULL. The behavior is the same between the two databases. However, AWS SCT defaults to convert empty strings to NULL. This simply needs to be disabled to avoid problems.

  1. In AWS SCT, open your project, choose Settings, Project settings, and Data migration.
  2. Scroll to the bottom and find Use empty as null value.
  3. Deselect this so that AWS SCT doesn’t convert empty strings to NULL.

NaN and Infinity numeric data type

Greenplum supports NaN and Infinity in a numeric field to represent an undefined calculation result and infinity. NaN is very uncommon because when using aggregate functions on a column with a NaN row, the result will also be NaN. Infinity is also uncommon and not useful when aggregating data. However, you may encounter these values in a Greenplum database.

Amazon Redshift doesn’t support NaN and Infinity, and AWS SCT doesn’t check for this in your data. If you do encounter this when using AWS SCT, the task will fail with a numeric conversion error.

To resolve this, it’s suggested to use NULL instead of NaN and Infinity. This allows you to aggregate data and get results other than NaN and, importantly, allow you to convert the Greenplum data to Amazon Redshift.

The following code is an example NaN numeric value:

CREATE TABLE public.mytable2 (id int NOT NULL, amount numeric NOT NULL);
INSERT INTO public.mytable2 VALUES (1, 10), (2, 'NaN'), (3, 20);
  1. Drop the NOT NULL constraint:
    ALTER TABLE public.mytable2 ALTER COLUMN amount DROP NOT NULL;

  2. Update the table:
    UPDATE public.mytable2 SET amount = NULL where amount = 'NaN';

You can now proceed with using AWS SCT to migrate the Greenplum data to Amazon Redshift.

Note that in a future release of AWS SCT, there will be an option to convert NaN and Infinity to NULL so that you won’t have to update your Greenplum data to migrate to Amazon Redshift.

Virtual partitioning on GP_SEGMENT_ID

For large tables, it’s recommended to use virtual partitioning to extract data from Greenplum. Without virtual partitioning, AWS SCT will run a single query to unload data from Greenplum. For example:

SELECT * FROM store_sales;

If this table is very large, it will take a long time to extract the data because this is a single process querying the data. With virtual partitioning, multiple queries are run in parallel so that the extraction of data is completed faster. It also makes it easier to recover if there is an issue with the task.

Virtual partitioning is very flexible, but a simple way to implement it is to utilize the Greenplum hidden column gp_segment_id. This column identifies which segment in Greenplum holds the data, and each segment should have an approximately equal number of rows. Therefore, creating a partition for each gp_segment_id value is an easy way to implement virtual partitioning.

If you’re not familiar with the term segment, it’s similar to an Amazon Redshift slice.

For example:

SELECT * FROM store_sales WHERE gp_segment_id = 0;
SELECT * FROM store_sales WHERE gp_segment_id = 1;
SELECT * FROM store_sales WHERE gp_segment_id = 2;
...
  1. First, determine the number of segments in Greenplum:
    SELECT count(*) 
    FROM gp_segment_configuration 
    WHERE content >= 0 AND preferred_role = 'p';

Now you can configure AWS SCT.

  1. In AWS SCT, go to Data Migration view (other) and choose (right-click) a large table.
  2. Scroll down to Add virtual partitioning.
  3. For the partition type, choose Auto Split and change the column name to GP_SEGMENT_ID.
  4. Use 0 for Start value, the number of segments found in Step 1 as End value, and Interval of 1.

When you create a local task to load this table, the task will have a sub-task for each gp_segment_id value.

Note that in a future release of AWS SCT, there will be an option to automatically virtually partition tables based on GP_SEGMENT_ID. This option will also retrieve the number of segments automatically.

Arrays

Greenplum supports arrays such as bigint[] that are unbounded. Typically, arrays are kept relatively small in Greenplum because arrays consume more memory in Greenplum than using an alternative strategy. However, it’s possible to have a very large array in Greenplum that isn’t supported by Amazon Redshift.

AWS SCT converts a Greenplum array to varchar(65535), but if the converted array is longer than 65,535 characters, then the load will fail.

The following code is an example of a large array:

CREATE TABLE public.sales 
(sales_id int NOT NULL,
customer_id int NOT NULL,
sales_item_ids bigint[]) 
DISTRIBUTED BY (sales_id);

INSERT INTO public.sales values (1, 100, '{1,2,3}'), (2, 100, '{1,2,3}');

In this example, the sales items are stored in an array for each sales_id. If you encounter an error while loading this data into Amazon Redshift with AWS SCT because the converted length is too long, the following is the solution. It's also a more efficient pattern for storing this data in both Greenplum and Amazon Redshift.
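Before migrating, you can check in Greenplum whether any row would exceed that limit. The following is a minimal sketch (Greenplum/PostgreSQL syntax) that compares the text rendering of the array against the 65,535-character limit; the exact rendering AWS SCT produces may differ slightly:

-- Find rows whose array, rendered as text, would not fit in varchar(65535)
SELECT sales_id, length(sales_item_ids::text) AS rendered_length
FROM public.sales
WHERE length(sales_item_ids::text) > 65535;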

  1. Create a new sales table that has all columns from the existing sales table, but exclude the array column:
    CREATE TABLE public.sales_new 
    (sales_id int NOT NULL,
    customer_id int NOT NULL) 
    DISTRIBUTED BY (sales_id);

  2. Populate the new sales table with the existing data except for the array column:
    INSERT INTO public.sales_new (sales_id, customer_id) 
    SELECT sales_id, customer_id FROM public.sales;

We create a new table that is a cross-reference of sales IDs with the sales items. Instead of having a single row for this association, now there will be a row for each relationship.

  1. Create a new sales item table:
    CREATE TABLE public.sales_items 
    (sales_id int NOT NULL, 
    sales_item_id bigint NOT NULL) 
    DISTRIBUTED BY (sales_id);

  2. To unnest the array, create a row for each array element:
    INSERT INTO public.sales_items 
    (sales_id, sales_item_id) 
    SELECT sales_id, unnest(sales_item_ids) 
    FROM public.sales;

  3. Rename the sales tables:
    ALTER TABLE public.sales RENAME TO sales_old;
    ALTER TABLE public.sales_new RENAME TO sales;

In AWS SCT, refresh the tables and migrate the revised sales and the new sales_items table.

The following are some example queries before and after.

Before:

SELECT s.sales_id, unnest(s.sales_item_ids) 
FROM public.sales s 
WHERE s.sales_id = 1;

After:

SELECT s.sales_id, i.sales_item_id 
FROM public.sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.sales_id = 1;

Before:

SELECT s.sales_id 
FROM public.sales s 
WHERE s.customer_id = 100
AND 10 = ANY(s.sales_item_ids);

After:

SELECT s.sales_id
FROM public.sales s 
JOIN public.sales_items i ON s.sales_id = i.sales_id 
WHERE s.customer_id = 100
AND i.sales_item_id = 10;

VACUUM ANALYZE

Greenplum, like Amazon Redshift, supports the VACUUM command, which reclaims storage space after UPDATE and DELETE commands are run on a table. Greenplum also allows you to add the ANALYZE option to run both statements with a single command.

The following code is the Greenplum command:

VACUUM ANALYZE table;

This is not very common, but you’ll see this from time to time. If you’re just inserting data into a table, there is no need to run VACUUM, but for ease of use, sometimes developers will use VACUUM ANALYZE.

The following are the Amazon Redshift commands:

VACUUM table;
ANALYZE table;

Amazon Redshift doesn’t support adding ANALYZE to the VACUUM command, so instead, this needs to be two different statements. Also note that Amazon Redshift performs VACUUM and ANALYZE automatically for you so in most cases, you can remove these commands from your scripts entirely.

DISTINCT ON query

Greenplum supports an unusual shortcut for eliminating duplicates in a table. This feature keeps the first row for each set of rows based on the order of the data being fetched. It’s easiest to understand by looking at an example:

CREATE TABLE customer 
(customer_name varchar(100) not null, 
customer_address varchar(1000) not null, 
lastupdate timestamp not null);

INSERT INTO customer VALUES
('ACME', '123 Main St', '2022-01-01'), 
('ACME', '456 Market St', '2022-05-01'), 
('ACME', '789 Broadway', '2022-08-01');

SELECT DISTINCT ON (customer_name) customer_name, customer_address 
FROM customer 
ORDER BY customer_name, lastupdate DESC;

We get the following results:

customer_name | customer_address 
---------------+------------------
 ACME          | 789 Broadway

The solution for running this in Amazon Redshift is to use the ANSI standard row_number() analytical function, as shown in the following code:

SELECT sub.customer_name, sub.customer_address 
FROM (SELECT customer_name, customer_address, row_number() over (partition by customer_name ORDER BY lastupdate DESC) AS row_number FROM customer) AS sub 
WHERE sub.row_number = 1;

Clean up

The examples in this post create tables in Greenplum. To remove these example tables, run the following commands:

DROP TABLE IF EXISTS public.mytable;
DROP TABLE IF EXISTS public.mytable2;
DROP TABLE IF EXISTS public.sales;
DROP TABLE IF EXISTS public.sales_new;
DROP TABLE IF EXISTS public.sales_items;
DROP TABLE IF EXISTS public.customer;

Conclusion

In this post, we covered some of the edge cases when migrating Greenplum to Amazon Redshift and how to handle these challenges, including easy virtual partitioning, edge cases for numeric and character fields, and arrays. This is not an exhaustive list of migrating Greenplum to Amazon Redshift, but this series should help you navigate modernizing your data platform by moving to Amazon Redshift.

For additional details, see the Amazon Redshift Getting Started Guide and the AWS SCT User Guide.


About the Authors

Jon Roberts is a Sr. Analytics Specialist based out of Nashville, specializing in Amazon Redshift. He has over 27 years of experience working in relational databases. In his spare time, he runs.

Nelly Susanto is a Senior Database Migration Specialist of AWS Database Migration Accelerator. She has over 10 years of technical experience focusing on migrating and replicating databases along with data warehouse workloads. She is passionate about helping customers in their cloud journey.

Suresh Patnam is a Principal BDM – GTM AI/ML Leader at AWS. He works with customers to build IT strategy, making digital transformation through the cloud more accessible by leveraging Data & AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Build an AWS Lake Formation permissions inventory dashboard using AWS Glue and Amazon QuickSight

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/build-an-aws-lake-formation-permissions-inventory-dashboard-using-aws-glue-and-amazon-quicksight/

AWS Lake Formation is an integrated data lake service that makes it easy for you to ingest, clean, catalog, transform, and secure your data and make it available for analysis and machine learning (ML). Lake Formation provides a single place to define fine-grained access control on catalog resources. These permissions are granted to the principals by a data lake admin, and integrated engines like Amazon Athena, AWS Glue, Amazon EMR, and Amazon Redshift Spectrum enforce the access controls defined in Lake Formation. It also allows principals to securely share data catalog resources across multiple AWS accounts and AWS organizations through a centralized approach.

As organizations adopt Lake Formation to scale their permissions, there is a steady increase in the access policies established and managed within the enterprise. However, it becomes more difficult to analyze and understand the permissions for auditing. Therefore, customers are looking for a simple way to collect, analyze, and visualize permissions data so that they can inspect and validate the policies. It also enables organizations to take actions that help them with compliance requirements.

This solution offers the ability to consolidate and create a central inventory of Lake Formation permissions that are registered in the given AWS account and Region. It provides a high-level view of various permissions that Lake Formation manages and aims at answering questions like:

  • Who has select access on a given table?
  • Which tables have the delete permission granted?
  • Which databases or tables does a given principal have select access to?

In this post, we walk through how to set up and collect the permissions granted on resources in a given account using the Lake Formation API. AWS Glue makes it straightforward to set up and run jobs for collecting the permission data and creating an external table on the collected data. We use Amazon QuickSight to create a permissions dashboard using an Athena data source and dataset.

Overview of solution

The following diagram illustrates the architecture of this solution.

In this solution, we walk through the following tasks:

  1. Create an AWS Glue job to collect and store permissions data, and create external tables using Boto3.
  2. Verify the external tables created using Athena.
  3. Sign up for a QuickSight Enterprise account and enable Athena access.
  4. Create a dataset using an Athena data source.
  5. Use the datasets for analysis.
  6. Publish the analyses as a QuickSight dashboard.

The collected JSON data is flattened and written into an Amazon Simple Storage Service (Amazon S3) bucket as Parquet files partitioned by account ID, date, and resource type. After the data is stored in Amazon S3, external tables are created on them and filters are added for different types of resource permissions. These datasets can be imported into SPICE, an in-memory query engine that is part of QuickSight, or queried directly from QuickSight to create analyses. Later, you can publish these analyses as a dashboard and share it with other users.

Dashboards are created for the following use cases:

  • Database permissions
  • Table permissions
  • Principal permissions

Prerequisites

You should have the following prerequisites:

  • An S3 bucket to store the permissions inventory data
  • An AWS Glue database for permissions inventory metadata
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with access to the inventory AWS Glue database and S3 bucket and added as a data lake admin
  • A QuickSight account with access to Athena
  • An IAM role for QuickSight with access to the inventory AWS Glue database and S3 bucket

Set up and run the AWS Glue job

We create an AWS Glue job to collect Lake Formation permissions data for the given account and Region that is provided as job parameters, and the collected data is flattened before storage. Data is partitioned by account ID, date, and permissions type, and is stored as Parquet in an S3 bucket using Boto3. We create external tables on the data and add filters for different types of resource permissions.

To create the AWS Glue job, complete the following steps:

  1. Download the Python script file to local.
  2. On the AWS Glue console, under Data Integration and ETL in the navigation pane, choose Jobs.
  3. Under Create job, select Python Shell script editor.
  4. For Options, select Upload and edit an existing script.
  5. For File upload, choose Choose file.
  6. Choose the downloaded file (lf-permissions-inventory.py).
  7. Choose Create.

  1. After the job is created, enter a name for the job (for this post, lf-inventory-builder) and choose Save.

  1. Choose the Job details tab.
  2. For Name, enter a name for the job.
  3. For IAM Role, choose an IAM role that has access to the inventory S3 bucket and the inventory database, and is registered as a data lake admin.
  4. For Type, choose Python Shell.
  5. For Python version, choose Python 3.9.

  1. You can leave other property values at their default.
  2. Under Advanced properties, configure the following job parameters and values:
    1. catalog-id: with the value as the current AWS account ID whose permissions data are collected.
    2. databasename: with the value as the AWS Glue database where the inventory-related schema objects are created.
    3. region: with the value as the current Region where the job is configured and whose permissions data is collected.
    4. s3bucket: with the value as the S3 bucket where the collected permissions data is written.
    5. createtable: with the value yes, which enables external table creation on the data.

  1. Choose Save to save the job settings.

  1. Choose Run to start the job.

When the job is complete, the run status changes to Succeeded. You can view the log messages in Amazon CloudWatch Logs.

Permissions data is collected and stored in the S3 bucket (under lfpermissions-data) that you provided in the job parameters.

The following external tables are created on the permissions data and can be queried using Athena:

  • lfpermissions – A summary of resource permissions
  • lfpermissionswithgrant – A summary of grantable resource permissions

For both tables, the schema structure is the same and the lftype column indicates what type of permissions the row applies to.
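The exact DDL is generated by the AWS Glue job, but as a rough sketch, the external tables have a shape similar to the following (the column and partition names here are assumptions inferred from the fields used later in this post, and the bucket name is a placeholder):

CREATE EXTERNAL TABLE lfpermissions (
  principal    string,
  permission   string,
  databasename string,
  tablename    string
)
PARTITIONED BY (catalog_id string, dt string, lftype string)
STORED AS PARQUET
LOCATION 's3://<your-inventory-bucket>/lfpermissions-data/';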

Verify the tables using Athena

You can use Athena to verify the data using the following queries.

For more information, refer to Running SQL queries using Amazon Athena.

  • List the database permissions:
Select * from lfpermissions where lftype = 'DATABASE'
  • List the table permissions:
Select * from lfpermissions where lftype = 'TABLE'
  • List the data lake permissions:
Select * from lfpermissions where lftype = 'DATA_LOCATION'
  • List the grantable database permissions:
Select * from lfpermissionswithgrant where lftype = 'DATABASE'
  • List the grantable table permissions:
Select * from lfpermissionswithgrant where lftype = 'TABLE'
  • List grantable data lake permissions:
Select * from lfpermissionswithgrant where lftype = 'DATA_LOCATION'
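
You can also combine these filters to answer the inventory questions posed earlier, for example, which tables a given principal can select from. The principal ARN below is a placeholder for illustration:

Select databasename, tablename
from lfpermissions
where lftype = 'TABLE'
and principal = 'arn:aws:iam::111122223333:role/analyst'
and permission = 'SELECT'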

As the next step, we create a QuickSight dashboard with three sheets, each focused on different sets of permissions (database, table, principal) to slice and dice the data.

Sign up for a QuickSight account

If you haven’t signed up for QuickSight, complete the following steps:

  1. Sign in to the AWS Management Console as Admin, search for QuickSight and choose Sign up for QuickSight.

  1. For Edition, select Enterprise.
  2. Choose Continue.
  3. For Authentication method, select Use IAM federated identities & QuickSight-managed users.
  4. Under QuickSight Region, choose the same Region as your inventory S3 bucket.
  5. Under Account info, enter a QuickSight account name and email address for notification.

  1. In the Quick access to AWS services section, for IAM Role, select Use QuickSight-managed role (default).
  2. Allow access to IAM, Athena, and Amazon S3.
  3. Specify the S3 bucket that contains the permissions data.
  4. Choose Finish to complete the signup process.

Note: If the inventory bucket and database are managed by Lake Formation, grant database and table access to the created QuickSight IAM role. For instructions, refer to Granting and revoking permissions on Data Catalog resources.

Configure your dataset in QuickSight

QuickSight is configured with an Athena data source in the same Region as the S3 bucket. To set up your dataset, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

Quicksight DataSet

  1. Choose Athena as your data source.

QuickSight Datasource

  1. Enter LF_DASHBOARD_DS as the name of your data source.
  2. Choose Create data source.
  3. For Catalog, leave it as AwsDataCatalog.
  4. For Database, choose the database name provided as the databasename parameter to the job.
  5. For Tables, select lfpermissions.
  6. Choose Select.

QuickSight Catalog Info

  1. Select Directly query your data and choose Visualize to take you to the analysis.

Quicksight data mode

Create analyses

We create three sheets for our dashboard to view different levels of permissions.

Sheet 1: Database permission view

To view database permissions, complete the following steps:

  1. On the QuickSight console, choose the plus sign to create a new sheet.
  2. Choose Add, then choose Add title.

QuickSight Title

  1. Name the sheet Database Permissions.
  2. On the Add menu, choose Add parameter.
  3. Enter a name for the parameter.
  4. Leave the other values as default and choose Create.
  5. Repeat the previous three steps to add the following parameters:
    • catalogid
    • databasename
    • permission
    • tablename
  6. Choose Insights in the navigation pane, then choose Add control.

QuickSight Control

  1. Add a control for each parameter:
    1. For each parameter, for Style, choose List, and for Values, select Link to a dataset field.
      QuickSight Dependency
    2. Provide additional information for each parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id
databasename DatabaseName lfpermissions databasename
permission Permission lfpermissions permission
  1. To add a control dependency, for the Database control, choose the options menu and choose Edit.

QuickSight Dependency

  1. Under Format control, choose Control options.
  2. For the relevant values, choose AccountID, and choose Update.
  3. Similarly, for the Permission control, choose Control options.
  4. For the relevant values, choose AccountID, and choose Update.

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and databasename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.

QuickSight Database View1

  1. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  2. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  3. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  4. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  5. Choose Actions in the navigation pane.
  6. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and select the principal visual.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  4. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Database and Permission drop-down menus are populated based on the relevant attributes and change dynamically.

QuickSight Database View2

Sheet 2: Table permission view

Now that we have created the database permissions sheet, we can add a table permissions sheet.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Table Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for each parameter:
    1. For each parameter, for Style, choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for each parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id
databasename DatabaseName lfpermissions databasename
permission Permission lfpermissions permission
tablename TableName lfpermissions tablename

We create two visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id, databasename, and tablename into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop principal into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  6. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  8. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  9. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.
  10. Choose Actions in the navigation pane.
  11. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and select the principal visual.

Now we add our second visual.

  1. Add a second visual and choose the table visual type.
  2. Drag and drop principal to Group by.
  3. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  4. Add a filter on catalog_id with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  5. Add a filter on the databasename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose databasename.
  6. Add a filter on tablename with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose tablename.
  7. Add a filter on permission with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose permission.

Now the Databasename, Tablename, and Permission drop-down menus are populated based on the relevant attributes.

QuickSight Table Permissions

Sheet 3: Principal permission view

Now we add a third sheet for principal permissions.

  1. Choose the plus sign to add a new sheet.
  2. On the QuickSight console, choose Add, then choose Add title.
  3. Name the sheet Principal Permissions.
  4. Choose Insights in the navigation pane, then choose Add control.
  5. Add a control for the catalogid parameter:
    1. For Style, choose List, and for Values, select Link to a dataset field.
    2. Provide the additional information for the parameter according to the following table.
Parameter Display Name Dataset Field
catalogid AccountID lfpermissions catalog_id

We create four visuals for this view.

  1. For the first visual, choose Visualize and choose pivot table as the visual type.
  2. Drag and drop catalog_id and principal into Rows.
  3. Drag and drop permission into Column.
  4. Drag and drop databasename into Values and change the aggregation to Count distinct.
  5. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  6. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  7. Choose Actions in the navigation pane.
  8. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and select the Databasename visual.
  9. For the second visual, choose Visualize and choose table as the visual type.
  10. Drag and drop databasename into Group by.
  11. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. DATABASE as the value.
  12. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  13. For the third visual, choose Visualize and choose pivot table as the visual type.
  14. Drag and drop catalog_id and principal into Rows.
  15. Drag and drop permission into Column.
  16. Drag and drop tablename into Values and change the aggregation to Count distinct.
  17. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  18. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.
  19. Choose Actions in the navigation pane.
  20. Define a new action with the following parameters:
    1. For Activation, select Select.
    2. For Filter action, select All fields.
    3. For Target visuals, select Select visuals and select the Tablename visual.
  21. For the final visual, choose Visualize and choose table as the visual type.
  22. Drag and drop tablename into Group by.
  23. Add a filter on the lftype field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. TABLE as the value.
  24. Add a filter on the catalog_id field with the following options:
    1. Custom filter as the filter type.
    2. Equals as the filter condition.
    3. Select Use parameters and choose catalogid.

The following screenshot shows our sheet.

QuickSight Principal View

Create a dashboard

Now that the analysis is ready, you can publish it as a dashboard and share it with other users. For instructions, refer to the tutorial Create an Amazon QuickSight dashboard.

Clean up

To clean up the resources created in this post, complete the following steps:

  1. Delete the AWS Glue job lf-inventory-builder.
  2. Delete the data stored under the bucket provided as the value of the s3bucket job parameter.
  3. Drop the external tables created under the schema provided as the value of the databasename job parameter.
  4. If you signed up for QuickSight to follow along with this post, you can delete the account.
  5. For an existing QuickSight account, delete the following resources:
    1. lfpermissions dataset
    2. lfpermissions analysis
    3. lfpermissions dashboard

Conclusion

In this post, we provided the design and implementation steps for a solution to collect Lake Formation permissions in a given Region of an account and consolidate them for analysis. We also walked through the steps to create a dashboard using Amazon QuickSight. You can utilize other QuickSight visuals to create more sophisticated dashboards based on your requirements.

You can also expand this solution to consolidate permissions for a multi-account setup. You can use a shared bucket across organizations and accounts and configure an AWS Glue job in each account or organization to write their permission data. With this solution, you can maintain a unified dashboard view of all the Lake Formation permissions within your organization, thereby providing a central audit mechanism to comply with business requirements.

Thanks for reading this post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.


About the Author

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

New! Security Analytics provides a comprehensive view across all your traffic

Post Syndicated from Zhiyuan Zheng original https://blog.cloudflare.com/security-analytics/


An application proxying traffic through Cloudflare benefits from a wide range of easy-to-use security features, including WAF, Bot Management, and DDoS mitigation. To understand whether traffic has been blocked by Cloudflare, we built a powerful Security Events dashboard that allows you to examine any mitigation events. However, application owners often wonder what happened to the rest of their traffic: did they block all traffic that was detected as malicious?

Today, along with our announcement of the WAF Attack Score, we are also launching our new Security Analytics.

Security Analytics gives you a security lens across all of your HTTP traffic, not only mitigated requests, allowing you to focus on what matters most: traffic deemed malicious but potentially not mitigated.

Detect then mitigate

Imagine you just onboarded your application to Cloudflare: without any additional effort, each HTTP request is analyzed by the Cloudflare network, and analytics are enriched with attack analysis, bot analysis, and any other security signal provided by Cloudflare.

Right away, without any risk of causing false positives, you can view the entirety of your traffic to explore what is happening, when and where.

This allows you to dive straight into analyzing the results of these signals, shortening the time taken to deploy active blocking mitigations and boosting your confidence in making decisions.


We are calling this approach “detect then mitigate” and we have already received very positive feedback from early access customers.

In fact, Cloudflare’s Bot Management has been using this model for the past two years. We constantly hear from our customers that, with greater visibility, they have high confidence in our bot scoring solution. To further support this new way of securing your web applications and to bring together all our intelligent signals, we have designed and developed the new Security Analytics, which starts by bringing signals from the WAF and other security products into this model.


New Security Analytics

Built on top of the success of our analytics experiences, the new Security Analytics employs existing components such as top statistics and in-context quick filters, with a new page layout that allows for rapid exploration and validation. The following sections break down this new page layout, forming a high-level workflow.

The key difference between Security Analytics and Security Events is that the former is based on HTTP requests, which gives visibility into your entire site’s traffic, while Security Events uses a different dataset that only captures requests matching an active security rule.

Define a focus

The new Security Analytics visualizes the dataset of sampled HTTP requests across your entire application, the same as bot analytics. When validating the “detect then mitigate” model with selected customers, a common behavior we observed was using the top N statistics to quickly narrow down to either obvious anomalies or certain parts of the application. Based on this insight, the page starts with selected top N statistics covering both request sources and request destinations, which can be expanded to view all the statistics available. Questions like “How well is my application’s admin area protected?” are one or two quick filter clicks away in this area.


After a preliminary focus is defined, the core of the interface is dedicated to plotting trends over time. The time series chart has proven to be a powerful tool for spotting traffic anomalies, and it also allows plotting based on different criteria. Whenever there is a spike, it is likely that an attack or attack attempt has happened.

As mentioned above, unlike Security Events, the dataset used on this page is HTTP requests, which includes both mitigated and not mitigated requests. By mitigated requests here, we mean “any HTTP request that had a ‘terminating’ action applied by the Cloudflare platform”. The requests that have not been mitigated are either served from Cloudflare’s cache or reach the origin. In a case such as a spike in not mitigated requests while mitigated requests stay flat, a reasonable assumption is that an attack occurred that did not match any active WAF rule. In this example, you can filter on not mitigated requests with a single click right in the chart, which updates all the data visualized on this page to support further investigation.

In addition to the default plotting of not mitigated and mitigated requests, you can also choose to plot trends of either attack analysis or bot analysis allowing you to spot anomalies for attack or bot behaviors.


Zoom in with analysis signals

One of the most loved and trusted analysis signals by our customers is the bot score. With the latest addition of WAF Attack Score and content scanning, we are bringing them together into one analytics page, helping you further zoom into your traffic based on some of these signals. The combination of these signals enables you to find answers to scenarios not possible until now:

  • Attack requests made by (definite) automated sources
  • Likely attack requests made by humans
  • Content uploaded with/without malicious content made by bots

Once a scenario is filtered on, the data visualizations across the entire page, including the top N statistics, the HTTP requests trend, and the sampled logs, are updated, allowing you to spot anomalies in either the top N statistics or the time-based HTTP requests trend.


Review sampled logs

After zooming into a specific part of your traffic that may be an anomaly, sampled logs provide a detailed view to verify your finding per HTTP request. This is a crucial step in a security investigation workflow, backed by the high engagement rate we see when such logs are examined in Security Events. As we add more data to each log entry, the expanded log view becomes less readable. We have therefore redesigned the expanded view, starting with how Cloudflare responded to a request, followed by our analysis signals, and lastly the key components of the raw request itself. By reviewing these details, you can validate your hypothesis of an anomaly and decide whether any mitigation action is required.


Handy insights to get started

When testing the prototype of this analytics dashboard internally, we learned that this flexibility comes with a steeper learning curve. To help you get started, we designed a handy insights panel. These insights are crafted to highlight specific perspectives on your total traffic. With a single click on any one of the insights, a preset of filters is applied, zooming directly onto the portion of your traffic that you are interested in. From here, you can review the sampled logs or further fine-tune any of the applied filters. Further internal studies have shown this to be a highly efficient workflow that, in many cases, will be your starting point when using this dashboard.


How can I get it?

The new Security Analytics is being gradually rolled out to all Enterprise customers who have purchased the new Application Security Core or Advanced Bundles. We plan to roll this out to all other customers in the near future. This new view will be alongside the existing Security Events dashboard.


What’s next

We are still at an early stage moving towards the “detect then mitigate” model, empowering you with greater visibility and intelligence to better protect your web applications. While we are working on enabling more detection capabilities, please share your thoughts and feedback with us to help us improve the experience. If you want to get access sooner, reach out to your account team to get started!

How Blueshift integrated their customer data environment with Amazon Redshift to unify and activate customer data for marketing

Post Syndicated from Vijay Chitoor original https://aws.amazon.com/blogs/big-data/how-blueshift-integrated-their-customer-data-environment-with-amazon-redshift-to-unify-and-activate-customer-data-for-marketing/

This post was co-written with Vijay Chitoor, Co-Founder & CEO, and Mehul Shah, Co-Founder and CTO from the Blueshift team, as the lead authors.

Blueshift is a San Francisco-based startup that helps marketers deliver exceptional customer experiences on every channel, delivering relevant personalized marketing. Blueshift’s SmartHub Customer Data Platform (CDP) empowers marketing teams to activate their first-party customer data to drive 1:1 personalization on owned (email, mobile) and paid (Google, Facebook, and so on) website and customer (CX) channels.

In this post, Blueshift’s founding team discuss how they used Amazon Redshift Data API to integrate data from their customer’s Amazon Redshift data warehouse with their CDP environment to help marketers activate their enterprise data and drive growth for their businesses.

Business need

In today’s omnichannel world, marketing teams at modern enterprises are being tasked with engaging customers on multiple channels. To successfully deliver intelligent customer engagement, marketers need to operate with a 360 degree view of their customers that takes into account various types of data, including customer behavior, demographics, consent and preferences, transactions, data from human assisted and digital interactions, and more. However, unifying this data and making it actionable for marketers is often a herculean task. Now, for the first time, with the integration of Blueshift with Amazon Redshift, companies can use more data than ever for intelligent cross-channel engagement.

Amazon Redshift is a fast, fully managed cloud data warehouse. Tens of thousands of customers use Amazon Redshift as their analytics infrastructure. Users such as data analysts, database developers, and data scientists use SQL to analyze their data in Amazon Redshift data warehouses. Amazon Redshift provides a web-based query editor in addition to supporting connectivity via ODBC/JDBC or the Redshift Data API.

Blueshift aims at empowering business users to unlock data in such data warehouses and activate audiences with personalized journeys for segmentation, 1:1 messaging, website, mobile, and paid media use cases. Moreover, Blueshift can help combine this data in Amazon Redshift data warehouses with real-time website and mobile data for real-time profiles and activation, enabling this data to be used by marketers in these businesses.

Although the data in Amazon Redshift is incredibly powerful, marketers are unable to use it in its original form for customer engagement for a variety of reasons. Firstly, querying the data requires knowledge of query languages like SQL, which marketers aren’t necessarily adept at. Furthermore, marketers need to combine the data in the warehouse with additional sources of data that are critical for customer engagement, including real-time events (for example, a website page viewed by a customer), as well as channel-level permissions and identity.

With the new integration, Blueshift customers can ingest multidimensional data tables from Amazon Redshift (for example, a customer table, transactions table, and product catalog table) into Blueshift to build a single customer view that is accessible by marketers. The bi-directional integration also ensures that predictive data attributes computed in Blueshift, as well as campaign engagement data from Blueshift, are written back into Amazon Redshift tables, enabling technology and analytics teams to have a comprehensive view of the data.

In this post, we describe how Blueshift integrates with Amazon Redshift. We highlight the bi-directional integration with data flowing from a customer’s Amazon Redshift data warehouse to Blueshift’s CDP environment and vice versa. These mechanisms are facilitated through the use of the Redshift Data API.

Solution overview

The integration between the two environments is achieved through a connector. We discuss the connector’s core components in this section. Blueshift uses a hybrid approach combining Redshift S3 UNLOAD, Redshift S3 COPY, and the Redshift Data API to simplify the integration between Blueshift and Amazon Redshift, thereby meeting the data needs of marketing teams. The following flow diagram shows an overview of the solution.

Blueshift uses container technology to ingest and process data. The data ingestion and egress containers are scaled up and down depending on the amount of data being processed. One of the key design tenets was to simplify the design by not having to manage connections or active connection pools. The Redshift Data API supports an HTTP-based SQL interface without the need for actively managing connections. As depicted in the process flow, the Redshift Data API lets you access data from Amazon Redshift with various types of traditional, cloud-native, containerized, serverless web service-based applications and event-driven applications. The Blueshift application includes a mix of programming languages, including Ruby (for the customer-facing dashboard), Go (for container workloads), and Python (for data science workloads). The Redshift Data API supports bindings in Python, Go, Java, Node.js, PHP, Ruby, and C++, which makes it simple for developer teams to integrate quickly.
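
As an illustration of this connectionless pattern (a minimal sketch, not Blueshift’s production code), the following example submits a query through the Redshift Data API with the AWS SDK for Python (Boto3). The cluster identifier, database, secret ARN, and table name are placeholder assumptions.

    import time
    import boto3

    # The Data API is called over HTTPS; there is no JDBC/ODBC connection or pool to manage.
    rsd = boto3.client("redshift-data")

    stmt = rsd.execute_statement(
        ClusterIdentifier="example-redshift-cluster",  # placeholder
        Database="dev",                                # placeholder
        SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:example-redshift-creds",  # placeholder
        Sql="SELECT customer_id, email FROM public.customers LIMIT 10",  # placeholder table
    )

    # Statements run asynchronously: poll for completion, then page through the results.
    while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)

    if rsd.describe_statement(Id=stmt["Id"])["Status"] == "FINISHED":
        for record in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
            print(record)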

With the Redshift Data API integration in place in Blueshift’s application, IT users from Blueshift customers can set up and validate the data connection, and Blueshift’s business users (marketers) can then seamlessly extract value from the customer data housed in Amazon Redshift by developing insights and putting those insights into action. Therefore, the process developed by Blueshift using the Redshift Data API significantly lowers the barrier to entry for new users, without needing data warehousing experience or ongoing IT dependencies for the business user.

The solution architecture depicted in the following figure shows how the various components of the CDP environment and Amazon Redshift integrate to provide the end-to-end solution.


Prerequisites

In this section, we describe the requirements of the integration solution between the two infrastructures. A typical data implementation with customers involves data from Amazon Redshift ingesting into the Blueshift CDP environment. This ingestion mechanism must accommodate different data types, such as the following:

  • Customer CRM data (user identifiers and various CRM fields). A typical range for data volume to be supported for this data type is 50–500 GB ingested once initially.
  • Real-time behavior or events data (for example, playing or pausing a movie).
  • Transactions data, such as subscription purchases. Typical data volumes ingested daily for events and transactions are in the 500 GB – 1 TB range.
  • Catalog content (for example, a list of shows or movies for discovery), which is typically about 1 GB in size ingested daily.

The integration also needs to support Blueshift’s CDP platform environment to export data to Amazon Redshift. This includes data such as campaign activities like emails being viewed, which can run into tens of TB, and segment or user exports to support a list of users that are part of a segment definition, typically 50–500 GB exported daily.

Integrate Amazon Redshift with data applications

Amazon Redshift provides several ways to quickly integrate data applications.

For the initial data loads, Blueshift uses Redshift S3 UNLOAD to dump Amazon Redshift data into Amazon Simple Storage Service (Amazon S3). Blueshift natively uses Amazon S3 as a persistent object store and supports bulk ingestion and export from Amazon S3. Data loads from Amazon S3 are ingested in parallel and cut down on data load times, enabling Blueshift clients to quickly onboard.
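
As an illustration of this initial load path, the following sketch shows the kind of UNLOAD statement involved; the bucket, IAM role, table, and Parquet format are assumptions rather than Blueshift specifics, and the statement can be submitted with the same execute_statement call shown earlier.

    # Hypothetical UNLOAD for the initial bulk export to Amazon S3.
    initial_unload_sql = """
        UNLOAD ('SELECT * FROM public.customers')
        TO 's3://example-export-bucket/customers/'
        IAM_ROLE 'arn:aws:iam::111122223333:role/ExampleRedshiftUnloadRole'
        FORMAT AS PARQUET
    """
    # Submit through the Redshift Data API (or any SQL client), for example:
    # rsd.execute_statement(ClusterIdentifier="...", Database="...", SecretArn="...", Sql=initial_unload_sql)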

For incremental data ingestion, Blueshift data import jobs track the last time an import was run, and import new rows of data that have been added or updated since the previous import ran. Blueshift stays in sync with changes (updates or inserts) to the Amazon Redshift data warehouse using the Redshift Data API. Blueshift uses the last_updated_at column in Amazon Redshift tables to determine new or updated rows and subsequently ingest those using the Redshift Data API. Blueshift’s data integration cron job syncs data in near-real time using the Redshift Data API by polling for updates on a regular cadence (for example, every 10 minutes, hourly, or daily). The cadence can be tuned depending on data freshness requirements.
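
A rough sketch of that incremental pattern follows (again an illustration, not Blueshift’s actual implementation): each run selects only the rows whose last_updated_at value is newer than the previous sync, using a parameterized Data API statement, and a scheduler invokes it on the chosen cadence.

    import time
    import boto3

    rsd = boto3.client("redshift-data")

    def fetch_changed_rows(last_sync_timestamp: str):
        """Return rows added or updated since the previous import run."""
        stmt = rsd.execute_statement(
            ClusterIdentifier="example-redshift-cluster",  # placeholder
            Database="dev",                                # placeholder
            SecretArn="arn:aws:secretsmanager:us-east-1:111122223333:secret:example-redshift-creds",  # placeholder
            Sql="SELECT * FROM public.customers WHERE last_updated_at > :last_sync",
            Parameters=[{"name": "last_sync", "value": last_sync_timestamp}],
        )
        while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
            time.sleep(1)
        return rsd.get_statement_result(Id=stmt["Id"])["Records"]

    # A cron job or scheduler would call this every 10 minutes, hourly, or daily,
    # persisting the high-water mark between runs.
    rows = fetch_changed_rows("2023-01-01 00:00:00")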

The following table summarizes the integration types.

Integration type Integration mechanism Advantage
Initial data ingestion from Amazon Redshift to Blueshift Redshift S3 UNLOAD command Initial data is exported from Amazon Redshift via Amazon S3 to allow faster parallel loads into Blueshift using the Amazon Redshift UNLOAD command.
Incremental data ingestion from Amazon Redshift to Blueshift Redshift Data API Incremental data changes are synchronized using the Redshift Data API in near-real time.
Data export from Blueshift to Amazon Redshift Redshift S3 COPY command Blueshift natively stores campaign activity and segment data in Amazon S3, which is loaded into Amazon Redshift using the Redshift S3 COPY command.

Amazon Redshift supports numerous out-of-the-box mechanisms to provide data access. Blueshift was able to cut down data onboarding time for clients by using a hybrid approach that integrates with Amazon Redshift through Redshift S3 UNLOAD, the Redshift Data API, and Redshift S3 COPY. Blueshift is able to cut down the initial data load time and stay updated in near-real time with changes in Amazon Redshift, and vice versa.
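
For the reverse path, the following sketch shows the kind of COPY statement that loads files exported by Blueshift to Amazon S3 into an Amazon Redshift table; the schema, table, bucket, IAM role, and file format are placeholders.

    # Hypothetical COPY for loading campaign activity data exported to Amazon S3.
    copy_sql = """
        COPY analytics.campaign_activity
        FROM 's3://example-blueshift-export-bucket/campaign_activity/'
        IAM_ROLE 'arn:aws:iam::111122223333:role/ExampleRedshiftCopyRole'
        FORMAT AS PARQUET
    """
    # Submit through the Redshift Data API (or any SQL client), for example:
    # rsd.execute_statement(ClusterIdentifier="...", Database="...", SecretArn="...", Sql=copy_sql)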

Conclusion

In this post, we showed how Blueshift integrated with the Redshift Data API to ingest customer data. This integration was seamless and demonstrated how straightforward the Redshift Data API makes integration with external applications, such as Blueshift’s CDP environment for marketing, with Amazon Redshift. The outlined use cases in this post are just a few examples of how to use the Redshift Data API to simplify interactions between users and Amazon Redshift clusters.

Now go build and integrate Amazon Redshift with Blueshift.


About the authors

Vijay Chittoor is the CEO & co-founder of Blueshift. Vijay has a wealth of experience in AI, marketing technology and e-commerce domains. Vijay was previously the co-founder & CEO of Mertado (acquired by Groupon to become Groupon Goods), and an early team member at Kosmix (acquired by Walmart to become @WalmartLabs). A former consultant with McKinsey & Co., Vijay is a graduate of Harvard Business School’s MBA Program. He also holds Bachelor’s and Master’s degrees in Electrical Engineering from the Indian Institute of Technology, Bombay.

Mehul Shah is co-Founder & CTO at Blueshift. Previously, he was a co-founder & CTO at Mertado, which was acquired by Groupon to become Groupon Goods. Mehul was an early employee at Kosmix that was acquired by Walmart to become @WalmartLabs. Mehul is a Y Combinator alumni and a graduate of University of Southern California. Mehul is a co-inventor of 12+ patents, and coaches a middle school robotics team.

Manohar Vellala is a Senior Solutions Architect at AWS working with digital native customers on their cloud native journey. He is based in San Francisco Bay Area and is passionate about helping customers build modern applications that can take the full advantage of the cloud. Prior to AWS he worked at H2O.ai where he helped customers build ML models. His interests are Storage, Data Analytics and AI/ML.

Prashant Tyagi joined AWS in September 2020, where he now manages the solutions architecture team focused on enabling digital native businesses. Prashant worked previously at ThermoFisher Scientific, and GE Digital where he held roles as Sr. Director for their Digital Transformation initiatives. Prashant has enabled digital transformation for customers in the Life Sciences and other industry verticals. He has experience in IoT, Data Lakes and AI/ML technical domains. He lives in the bay area in California.

How dynamic data masking support in Amazon Redshift helps achieve data privacy and compliance

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/how-dynamic-data-masking-support-in-amazon-redshift-helps-achieve-data-privacy-and-compliance/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that offers simple operations and high performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, Amazon Redshift is the most widely used cloud data warehouse.

Dynamic data masking (DDM) support (preview) in Amazon Redshift enables you to simplify the process of protecting sensitive data in your Amazon Redshift data warehouse. You can now use DDM to protect data based on your job role or permission rights and level of data sensitivity through a SQL interface. DDM support (preview) in Amazon Redshift enables you to hide, obfuscate, or pseudonymize column values within the tables in your data warehouse without incurring additional storage costs. It is configurable to allow you to define consistent, format-preserving, and irreversible masked data values.

DDM support (preview) in Amazon Redshift provides a native feature to support your need to mask data for regulatory or compliance requirements, or to increase internal privacy standards. Compared to static data masking where underlying data at rest gets permanently replaced or redacted, DDM support (preview) in Amazon Redshift enables you to temporarily manipulate the display of sensitive data in transit at query time based on user privilege, leaving the original data at rest intact. You control access to data through masking policies that apply custom obfuscation rules to a given user or role. That way, you can respond to changing privacy requirements without altering the underlying data or editing SQL queries.

With DDM support (preview) in Amazon Redshift, you can do the following:

  • Define masking policies that apply custom obfuscation policies (for example, masking policies to handle credit card, PII entries, HIPAA or GDPR needs, and more)
  • Transform the data at query time to apply masking policies
  • Attach masking policies to roles or users
  • Attach multiple masking policies with varying levels of obfuscation to the same column in a table and assign them to different roles with priorities to avoid conflicts
  • Implement cell-level masking by using conditional columns when creating your masking policy
  • Use masking policies to partially or completely redact data, or hash it by using user-defined functions (UDFs)

Here’s what our customers have to say about DDM support (private beta) in Amazon Redshift:

“Baffle delivers data-centric protection for enterprises via a data security platform that is transparent to applications and unique to data security. Our mission is to seamlessly weave data security into every data pipeline. Previously, to apply data masking to an Amazon Redshift data source, we had to stage the data in an Amazon S3 bucket. Now, by utilizing the Amazon Redshift Dynamic Data Masking capability, our customers can protect sensitive data throughout the analytics pipeline, from secure ingestion to responsible consumption reducing the risk of breaches.”

-Ameesh Divatia, CEO & co-founder of Baffle

“EnergyAustralia is a leading Australian energy retailer and generator, with a mission to lead the clean energy transition for customers in a way that is reliable, affordable and sustainable for all. We enable all corners of our business with Data & Analytics capabilities that are used to optimize business processes and enhance our customers’ experience. Keeping our customers’ data safe is a top priority across our teams. In the past, this involved multiple layers of custom built security policies that could make it cumbersome for analysts to find the data they require. The new AWS dynamic data masking feature will significantly simplify our security processes so we continue to keep customer data safe, while also reducing the administrative overhead.”

-William Robson, Data Solutions Design Lead, EnergyAustralia

Use case

For our use case, a retail company wants to control how they show credit card numbers to users based on their privilege. They also don’t want to duplicate the data for this purpose. They have the following requirements:

  • Users from Customer Service should be able to view the first six digits and the last four digits of the credit card for customer verification
  • Users from Fraud Prevention should be able to view the raw credit card number only if it’s flagged as fraud
  • Users from Auditing should be able to view the raw credit card number
  • All other users should not be able to view the credit card number

Solution overview

The solution encompasses creating masking policies with varying masking rules and attaching one or more to the same role and table with an assigned priority to remove potential conflicts. These policies may pseudonymize results or selectively nullify results to comply with retailers’ security requirements. We refer to multiple masking policies being attached to a table as a multi-modal masking policy. A multi-modal masking policy consists of three parts:

  • A data masking policy that defines the data obfuscation rules
  • Roles with different access levels depending on the business case
  • The ability to attach multiple masking policies on a user or role and table combination with priority for conflict resolution

The following diagram illustrates how DDM support (preview) in Amazon Redshift policies work with roles and users for our retail use case.

For a user with multiple roles, the masking policy with the highest attachment priority is used. For example, Ken is part of both the Public and FrdPrvnt roles. Because the FrdPrvnt role has a higher attachment priority, the conditional masking policy (card_number_conditional_mask) will be applied.

Prerequisites

To implement this solution, you need to complete the following prerequisites:

  1. Have an AWS account.
  2. Have an Amazon Redshift cluster provisioned with DDM support (preview) or a serverless workgroup with DDM support (preview).
    1. Navigate to the provisioned or serverless Amazon Redshift console and choose Create preview cluster.
    2. In the create cluster wizard, choose the preview track.
  3. Have Superuser privilege, or the sys:secadmin role on the Amazon Redshift data warehouse created in step 2.

Preparing the data

To set up our use case, complete the following steps:

  1. On the Amazon Redshift console, choose Query editor v2 in Explorer.
    If you’re familiar with SQL Notebooks, you can download the Jupyter notebook for the demonstration, and import it to quickly get started.
  2. Create the table and populate contents.
  3. Create users.
    -- 1- Create the credit cards table
    CREATE TABLE credit_cards (
    customer_id INT,
    is_fraud BOOLEAN,
    credit_card TEXT
    );
    -- 2- Populate the table with sample values
    INSERT INTO credit_cards
    VALUES
    (100,'n', '453299ABCDEF4842'),
    (100,'y', '471600ABCDEF5888'),
    (102,'n', '524311ABCDEF2649'),
    (102,'y', '601172ABCDEF4675'),
    (102,'n', '601137ABCDEF9710'),
    (103,'n', '373611ABCDEF6352')
    ;
    --run GRANT to grant SELECT permission on the table
    GRANT SELECT ON credit_cards TO PUBLIC;
    --create four users
    CREATE USER Kate WITH PASSWORD '1234Test!';
    CREATE USER Ken  WITH PASSWORD '1234Test!';
    CREATE USER Bob  WITH PASSWORD '1234Test!';
    CREATE USER Jane WITH PASSWORD '1234Test!';

Implement the solution

To satisfy the security requirements, we need to make sure that each user sees the same data in different ways based on their granted privileges. To do that, we use user roles combined with masking policies as follows:

  1. Create user roles and grant different users to different roles:
    -- 1. Create User Roles
    CREATE ROLE cust_srvc_role;
    CREATE ROLE frdprvnt_role;
    CREATE ROLE auditor_role;
    -- note that the public role exists by default.
    
    -- Grant Roles to Users
    GRANT ROLE cust_srvc_role to Kate;
    GRANT ROLE frdprvnt_role  to Ken;
    GRANT ROLE auditor_role   to Bob;
    -- note that regular users, such as Jane, are attached to the public role by default.

  2. Create masking policies:
    -- 2. Create Masking policies
    
    -- 2.1 create a masking policy that fully masks the credit card number
    CREATE MASKING POLICY Mask_CC_Full
    WITH (credit_card VARCHAR(256))
    USING ('XXXXXXXXXXXXXXXX');
    
    --2.2- Create a scalar SQL user-defined function(UDF) that partially obfuscates credit card number, only showing the first 6 digits and the last 4 digits
    CREATE FUNCTION REDACT_CREDIT_CARD (text)
      returns text
    immutable
    as $$
      select left($1,6)||'XXXXXX'||right($1,4)
    $$ language sql;
    
    
    --2.3- create a masking policy that applies the REDACT_CREDIT_CARD function
    CREATE MASKING POLICY Mask_CC_Partial
    WITH (credit_card VARCHAR(256))
    USING (REDACT_CREDIT_CARD(credit_card));
    
    -- 2.4- create a masking policy that will display raw credit card number only if it is flagged for fraud 
    CREATE MASKING POLICY Mask_CC_Conditional
    WITH (is_fraud BOOLEAN, credit_card VARCHAR(256))
    USING (CASE WHEN is_fraud 
                     THEN credit_card 
                     ELSE Null 
           END);
    
    -- 2.5- Create masking policy that will show raw credit card number.
    CREATE MASKING POLICY Mask_CC_Raw
    WITH (credit_card varchar(256))
    USING (credit_card);

  3. Attach the masking policies on the table or column to the user or role:
    -- 3. ATTACHING MASKING POLICY
    -- 3.1- make the Mask_CC_Full the default policy for all users
    --    all users will see this masking policy unless a higher priority masking policy is attached to them or their role
    
    ATTACH MASKING POLICY Mask_CC_Full
    ON credit_cards(credit_card)
    TO PUBLIC;
    
    -- 3.2- attach Mask_CC_Partial to the cust_srvc_role role
    --users with the cust_srvc_role role can see partial credit card information
    ATTACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    TO ROLE cust_srvc_role
    PRIORITY 10;
    
    -- 3.3- Attach Mask_CC_Conditional masking policy to frdprvnt_role role
    --    users with frdprvnt_role role can only see raw credit card if it is fraud
    ATTACH MASKING POLICY Mask_CC_Conditional
    ON credit_cards(credit_card)
    USING (is_fraud, credit_card)
    TO ROLE frdprvnt_role
    PRIORITY 20;
    
    -- 3.4- Attach Mask_CC_Raw masking policy to auditor_role role
    --    users with auditor_role role can see raw credit card numbers
    ATTACH MASKING POLICY Mask_CC_Raw
    ON credit_cards(credit_card)
    TO ROLE auditor_role
    PRIORITY 30;

Test the solution

Let’s confirm that the masking policies are created and attached.

  1. Check that the masking policies are created with the following code:
    -- 1.1- Confirm the masking policies are created
    SELECT * FROM svv_masking_policy;

  2. Check that the masking policies are attached:
    -- 1.2- Verify attached masking policy on table/column to user/role.
    SELECT * FROM svv_attached_masking_policy;

    Now we can test that different users can see the same data masked differently based on their roles.

  3. Test that the Customer Service agents can only view the first six digits and the last four digits of the credit card number:
    -- 1- Confirm that customer service agent can only view the first 6 digits and the last 4 digits of the credit card number
    SET SESSION AUTHORIZATION Kate;
    SELECT * FROM credit_cards;

  4. Test that the Fraud Prevention users can only view the raw credit card number when it’s flagged as fraud:
    -- 2- Confirm that Fraud Prevention users can only view fraudulent credit card number
    SET SESSION AUTHORIZATION Ken;
    SELECT * FROM credit_cards;

  5. Test that Auditor users can view the raw credit card number:
    -- 3- Confirm the auditor can view RAW credit card number
    SET SESSION AUTHORIZATION Bob;
    SELECT * FROM credit_cards;

  6. Test that general users can’t view any digits of the credit card number:
    -- 4- Confirm that regular users can not view any digit of the credit card number
    SET SESSION AUTHORIZATION Jane;
    SELECT * FROM credit_cards;

Modify the masking policy

To modify an existing masking policy, you must detach it from the role first and then drop and recreate it.

In our use case, the business changed direction and decided that Customer Service agents should only be allowed to view the last four digits of the credit card number.

  1. Detach and drop the policy:
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
    --detach masking policy from the credit_cards table
    DETACH MASKING POLICY Mask_CC_Partial
    ON                    credit_cards(credit_card)
    FROM ROLE             cust_srvc_role;
    -- Drop the masking policy
    DROP MASKING POLICY Mask_CC_Partial;
    -- Drop the function used in masking
    DROP FUNCTION REDACT_CREDIT_CARD (TEXT);

  2. Recreate the policy and reattach it on the table or column to the intended user or role. Note that this time we created a scalar Python UDF. It’s possible to create a SQL, Python, or Lambda UDF based on your use case.
    -- Re-create the policy and re-attach it to role
    
    -- Create a user-defined function that partially obfuscates credit card number, only showing the last 4 digits
    CREATE FUNCTION REDACT_CREDIT_CARD (credit_card TEXT) RETURNS TEXT IMMUTABLE AS $$
        import re
        regexp = re.compile("^([0-9A-F]{6})[0-9A-F]{5,6}([0-9A-F]{4})")
        match = regexp.search(credit_card)
        if match != None:
            last = match.group(2)
        else:
            last = "0000"
        return "XXXXXXXXXXXX{}".format(last)
    $$ LANGUAGE plpythonu;
    
    --Create a masking policy that applies the REDACT_CREDIT_CARD function
    CREATE MASKING POLICY Mask_CC_Partial
    WITH (credit_card VARCHAR(256))
    USING (REDACT_CREDIT_CARD(credit_card));
    
    -- attach Mask_CC_Partial to the cust_srvc_role role
    -- users with the cust_srvc_role role can see partial credit card information
    ATTACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    TO ROLE cust_srvc_role
    PRIORITY 10;

  3. Test that Customer Service agents can only view the last four digits of the credit card number:
    -- Confirm that customer service agent can only view the last 4 digits of the credit card number
    SET SESSION AUTHORIZATION Kate;
    SELECT * FROM credit_cards;

Clean up

When you’re done with the solution, clean up your resources:

  1. Detach the masking policies from the table:
    -- Cleanup
    --reset session authorization to the default
    RESET SESSION AUTHORIZATION;
    
    --1.	Detach the masking policies from table
    DETACH MASKING POLICY Mask_CC_Full
    ON credit_cards(credit_card)
    FROM PUBLIC;
    DETACH MASKING POLICY Mask_CC_Partial
    ON credit_cards(credit_card)
    FROM ROLE cust_srvc_role;
    DETACH MASKING POLICY Mask_CC_Conditional
    ON credit_cards(credit_card)
    FROM ROLE frdprvnt_role;
    DETACH MASKING POLICY Mask_CC_Raw
    ON credit_cards(credit_card)
    FROM ROLE auditor_role;

  2. Drop the masking policies:
    -- 2.	Drop the masking policies 
    DROP MASKING POLICY Mask_CC_Full;
    DROP MASKING POLICY Mask_CC_Partial;
    DROP MASKING POLICY Mask_CC_Conditional;
    DROP MASKING POLICY Mask_CC_Raw;

  3. Revoke and drop each user and role:
    -- 3.	Revoke/Drop - role/user 
    REVOKE ROLE cust_srvc_role from Kate;
    REVOKE ROLE frdprvnt_role  from Ken;
    REVOKE ROLE auditor_role   from Bob;
    
    DROP ROLE cust_srvc_role;
    DROP ROLE frdprvnt_role;
    DROP ROLE auditor_role;
    
    DROP USER Kate;
    DROP USER Ken;
    DROP USER Bob;
    DROP USER Jane;

  4. Drop the function and table:
    -- 4.	Drop function and table 
    DROP FUNCTION REDACT_CREDIT_CARD (credit_card TEXT);
    DROP TABLE credit_cards;

Considerations and best practices

Consider the following:

  • Always create a default policy attached to the public user. If you create a new user, they will always have a minimum policy attached. It will enforce the intended security posture.
  • Remember that DDM policies in Amazon Redshift always follow the invoker permissions convention, not definer (for more information, refer to Security and privileges for stored procedures). In other words, masking policies are applied based on the user or role running the query.
  • For best performance, create the masking functions using a scalar SQL UDF, if possible. Scalar UDF performance typically ranks SQL first, then Python, then Lambda: SQL UDFs generally outperform Python UDFs, which in turn outperform scalar Lambda UDFs.
  • DDM policies in Amazon Redshift are applied ahead of any predicate or join operations. For example, if you’re running a join on a masked column (per your access policy) to an unmasked column, the join will lead to a mismatch. That’s an expected behavior.
  • Always detach a masking policy from all users or roles before dropping it.
  • As of this writing, the solution has the following limitations:
    • You can apply a mask policy on tables and columns and attach it to a user or role, but groups are not supported.
    • You can’t create a mask policy on views, materialized views, and external tables.
    • The DDM support (preview) in Amazon Redshift is available in the following Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

Performance benchmarks

Based on various tests performed on TPC-H datasets, we’ve found built-in functions to be more performant as compared to functions created externally using scalar Python or Lambda UDFs.

Expand the solution

You can take this solution further and set up a masking policy that restricts SSN and email address access as follows:

  • Customer Service agents accessing pre-built dashboards may only view the last four digits of SSNs and complete email addresses for correspondence
  • Analysts cannot view SSNs or email addresses
  • Auditing services may access raw values for SSNs as well as email addresses

For more information, refer to Use DDM support (preview) in Amazon Redshift for E-mail & SSN Masking.

Conclusion

In this post, we discussed how to use DDM support (preview) in Amazon Redshift to define configuration-driven, consistent, format-preserving, and irreversible masked data values. With DDM support (preview) in Amazon Redshift, you can control your data masking approach using familiar SQL language. You can take advantage of the Amazon Redshift role-based access control capability to implement different levels of data masking. You can create a masking policy to identify which column needs to be masked, and you have the flexibility of choosing how to show the masked data. For example, you can completely hide all the information of the data, replace partial real values with wildcard characters, or define your own way to mask the data using SQL expressions, Python, or Lambda UDFs. Additionally, you can apply a conditional masking based on other columns, which selectively protects the column data in a table based on the values in one or more columns.

We encourage you to create your own user-defined functions for various use cases and accomplish your desired security posture using dynamic data masking support in Amazon Redshift.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, TX. He has more than 16 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with the utmost security and data governance.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based on Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytic solutions.

Variyam Ramesh is a Senior Analytics Specialist Solutions Architect at AWS based in Charlotte, NC. He is an accomplished technology leader helping customers conceptualize, develop, and deliver innovative analytic solutions.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

James Moore is a Technical Lead at Amazon Redshift focused on SQL features and security. His work over the last 10 years has spanned distributed systems, machine learning, and databases. He is passionate about building scalable software that enables customers to solve real-world problems.

Amazon EMR launches support for Amazon EC2 C6i, M6i, I4i, R6i and R6id instances to improve cost performance for Spark workloads by 6–33%

Post Syndicated from Al MS original https://aws.amazon.com/blogs/big-data/amazon-emr-launches-support-for-amazon-ec2-c6i-m6i-i4i-r6i-and-r6id-instances-to-improve-cost-performance-for-spark-workloads-by-6-33/

Amazon EMR provides a managed service to easily run analytics applications using open-source frameworks such as Apache Spark, Hive, Presto, Trino, HBase, and Flink. The Amazon EMR runtime for Spark and Presto includes optimizations that provide over two times performance improvements over open-source Apache Spark and Presto, so that your applications run faster and at lower cost.

With Amazon EMR release 6.8, you can now use Amazon Elastic Compute Cloud (Amazon EC2) instances such as C6i, M6i, I4i, R6i, and R6id, which use the third-generation Intel Xeon scalable processors. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33% over previous generation instances.

In this post, we describe how we estimated the cost-performance benefit from using Amazon EMR with these new instances compared to using equivalent previous generation instances.

Amazon EMR runtime performance improvements with EC2 I4i instances

We ran TPC-DS 3 TB benchmark queries on Amazon EMR 6.8 using the Amazon EMR runtime for Apache Spark (compatible with Apache Spark 3.3) with five node clusters of I4i instances with data in Amazon Simple Storage Service (Amazon S3), and compared it to equivalent sized I3 instances. We measured performance improvements using the total query runtime and geometric mean of query runtime across the TPC-DS 3 TB benchmark queries.
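
As a quick refresher, the geometric mean multiplies the per-query runtimes and takes the nth root, so a handful of very long queries doesn't dominate the metric the way an arithmetic mean would. The following Python snippet uses illustrative values only, not benchmark data:

import math

# Toy per-query runtimes in seconds (illustrative only).
runtimes = [12.0, 48.0, 3.0]

# Geometric mean: nth root of the product of the runtimes.
geomean = math.prod(runtimes) ** (1 / len(runtimes))
print(round(geomean, 2))  # 12.0, because (12 * 48 * 3) ** (1/3) == 12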

Our results showed between 36.41–44.39% improvement in total query runtime performance on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters, and between 36–45.2% improvement in geometric mean. To measure cost improvement, we added up the Amazon EMR and Amazon EC2 cost per instance per hour (on-demand) and multiplied it by the total query runtime. Note that I4i 32XL instances were not benchmarked because I3 instances don’t have the 32 XL size available. We observed between 22.56–33.1% reduced instance hour cost on I4i instance EMR clusters compared to equivalent I3 instance EMR clusters to run the TPC-DS benchmark queries. All TPC-DS queries ran faster on I4i instance clusters compared to I3 instance clusters.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent I3 and I4i instance EMR clusters.

Instance Size 16 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5
Total query runtime on I3 (seconds) 4752.15457 4506.43694 7110.03042 11853.40336 21333.05743
Total query runtime on I4I (seconds) 2642.77407 2812.05517 4415.0023 7537.52779 12981.20251
Total query runtime improvement with I4I 44.39% 37.60% 37.90% 36.41% 39.15%
Geometric mean query runtime on I3 (sec) 34.99551 29.14821 41.53093 60.8069 95.46128
Geometric mean query runtime on I4I (sec) 19.17906 18.65311 25.66263 38.13503 56.95073
Geometric mean query runtime improvement with I4I 45.20% 36.01% 38.21% 37.29% 40.34%
EC2 I3 instance price ($ per hour) $4.990 $2.496 $1.248 $0.624 $0.312
EMR I3 instance price ($ per hour) $0.270 $0.270 $0.270 $0.156 $0.078
(EC2 + EMR) I3 instance price ($ per hour) $5.260 $2.766 $1.518 $0.780 $0.390
Cost of running on I3 ($ per instance) $6.943 $3.462 $2.998 $2.568 $2.311
EC2 I4I instance price ($ per hour) $5.491 $2.746 $1.373 $0.686 $0.343
EMR I4I price ($ per hour per instance) $1.373 $0.687 $0.343 $0.172 $0.086
(EC2 + EMR) I4I instance price ($ per hour) $6.864 $3.433 $1.716 $0.858 $0.429
Cost of running on I4I ($ per instance) $5.039 $2.681 $2.105 $1.795 $1.546
Total cost reduction with I4I including performance improvement -27.43% -22.56% -29.79% -30.09% -33.10%

The following graph shows per query improvements we observed on I4i 2XL instances with EMR Runtime for Spark on Amazon EMR version 6.8 compared to equivalent I3 2XL instances for the TPC-DS 3 TB benchmark.

Amazon EMR runtime performance improvements with EC2 M6i instances

M6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent M5 instances. Our test results showed between 13.45–29.52% improvement in total query runtime for seven different instance sizes within the instance family, and between 15.29–32.62% improvement in geometric mean. On cost comparison, we observed 7.98–25.37% reduced instance hour cost on M6i instance EMR clusters compared to M5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent M6i and M5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on M5 (seconds) 4027.58043 3782.10766 3348.05362 3516.4308 5621.22532 10075.45109 17278.15146
Total query runtime on M6I (seconds) 3106.43834 2665.70607 2714.69862 3043.5975 4195.02715 8226.88301 14515.50394
Total query runtime improvement with M6I 22.87% 29.52% 18.92% 13.45% 25.37% 18.35% 15.99%
Geometric mean query runtime M5 (sec) 30.45437 28.5207 23.95314 23.55958 32.95975 49.43178 75.95984
Geometric mean query runtime M6I (sec) 23.76853 19.21783 19.16869 19.9574 24.23012 39.09965 60.79494
Geometric mean query runtime improvement with M6I 21.95% 32.62% 19.97% 15.29% 26.49% 20.90% 19.96%
EC2 M5 instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M5 instance price ($ per hour) $0.27 $0.27 $0.27 $0.27 $0.19 $0.10 $0.05
(EC2 + EMR) M5 instance price ($ per hour) $4.88 $3.34 $2.57 $1.81 $0.96 $0.48 $0.24
Cost of running on M5 ($ per instance) $5.46 $3.51 $2.39 $1.76 $1.50 $1.34 $1.15
EC2 M6I instance price ($ per hour) $4.61 $3.07 $2.30 $1.54 $0.77 $0.38 $0.19
EMR M6I price ($ per hour per instance) $1.15 $0.77 $0.58 $0.38 $0.19 $0.10 $0.05
(EC2 + EMR) M6I instance price ($ per hour) $5.76 $3.84 $2.88 $1.92 $0.96 $0.48 $0.24
Cost of running on M6I ($ per instance) $4.97 $2.84 $2.17 $1.62 $1.12 $1.10 $0.97
Total cost reduction with M6I including performance improvement -8.92% -19.02% -9.28% -7.98% -25.37% -18.35% -15.99%

Amazon EMR runtime performance improvements with EC2 R6i instances

R6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5 instances. Our test results showed between 14.25–32.23% improvement in total query runtime for six different instance sizes within the instance family, and between 16.12–36.5% improvement in geometric mean. R5.xlarge instances didn’t have sufficient memory to run TPC-DS benchmark queries, and weren’t included in this comparison. On cost comparison, we observed 5.48–23.5% reduced instance hour cost on R6i instance EMR clusters compared to R5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6i and R5 instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2XL
Number of core instances in EMR cluster 5 5 5 5 5 5
Total query runtime on R5 (seconds) 4024.4737 3715.74432 3552.97298 3535.69879 5379.73168 9121.41532
Total query runtime on R6I (seconds) 2865.83169 2518.24192 2513.4849 3031.71973 4544.44854 6977.9508
Total query runtime improvement with R6I 28.79% 32.23% 29.26% 14.25% 15.53% 23.50%
Geometric mean query runtime R5 (sec) 30.59066 28.30849 25.30903 23.85511 32.33391 47.28424
Geometric mean query runtime R6I (sec) 21.87897 17.97587 17.54117 20.00918 26.6277 34.52817
Geometric mean query runtime improvement with R6I 28.48% 36.50% 30.69% 16.12% 17.65% 26.98%
EC2 R5 instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R5 instance price ($ per hour) $0.2700 $0.2700 $0.2700 $0.2700 $0.2520 $0.1260
(EC2 + EMR) R5 instance price ($ per hour) $6.3180 $4.3020 $3.2940 $2.2860 $1.2600 $0.6300
Cost of running on R5 ($ per instance) $7.0630 $4.4403 $3.2510 $2.2452 $1.8829 $1.5962
EC2 R6I instance price ($ per hour) $6.0480 $4.0320 $3.0240 $2.0160 $1.0080 $0.5040
EMR R6I price ($ per hour per instance) $1.5120 $1.0080 $0.7560 $0.5040 $0.2520 $0.1260
(EC2 + EMR) R6I instance price ($ per hour) $7.5600 $5.0400 $3.7800 $2.5200 $1.2600 $0.6300
Cost of running on R6I ($ per instance) $6.0182 $3.5255 $2.6392 $2.1222 $1.5906 $1.2211
Total cost reduction with R6I including performance improvement -14.79% -20.60% -18.82% -5.48% -15.53% -23.50%

Amazon EMR runtime performance improvements with EC2 C6i instances

C6i instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent C5 instances. Our test results showed between 12.61–21.09% improvement in total query runtime for four different instance sizes within the instance family, and between 14.57–20.35% improvement in geometric mean. Only C6i 24, 12, 4, and 2xlarge sizes were benchmarked because C5 doesn't have 32, 16, and 8 xlarge sizes. C5.xlarge instances didn't have sufficient memory to run TPC-DS benchmark queries, and weren't included in this comparison. On cost comparison, we observed 5.93–13.62% reduced instance hour cost on C6i instance EMR clusters compared to C5 EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent C6i and C5 instance EMR clusters.

Instance Size * 24 XL 12 XL 4 XL 2 XL
Number of core instances in EMR cluster 5 5 5 5
Total query runtime on C5 (seconds) 3435.59808 2900.84981 5945.12879 10173.00757
Total query runtime on C6I (seconds) 2711.16147 2471.86778 5195.30093 8787.43422
Total query runtime improvement with C6I 21.09% 14.79% 12.61% 13.62%
Geometric mean query runtime C5 (sec) 25.67058 20.06539 31.76582 46.78632
Geometric mean query runtime C6I (sec) 20.4458 17.14133 26.92196 39.32622
Geometric mean query runtime improvement with C6I 20.35% 14.57% 15.25% 15.95%
EC2 C5 instance price ($ per hour) $4.080 $2.040 $0.680 $0.340
EMR C5 instance price ($ per hour) $0.270 $0.270 $0.170 $0.085
(EC2 + EMR) C5 instance price ($ per hour) $4.35000 $2.31000 $0.85000 $0.42500
Cost of running on C5 ($ per instance) $4.15135 $1.86138 $1.40371 $1.20098
EC2 C6I instance price ($ per hour) $4.0800 $2.0400 $0.6800 $0.3400
EMR C6I price ($ per hour per instance) $1.02000 $0.51000 $0.17000 $0.08500
(EC2 + EMR) C6I instance price ($ per hour) $5.10000 $2.55000 $0.85000 $0.42500
Cost of running on C6I ($ per instance) $3.84081 $1.75091 $1.22667 $1.03741
Total cost reduction with C6I including performance improvement -7.48% -5.93% -12.61% -13.62%

Amazon EMR runtime performance improvements with EC2 R6id instances

R6id instances showed a similar performance improvement while running Apache Spark workloads compared to equivalent R5D instances. Our test results showed between 11.8–28.7% improvement in total query runtime for seven different instance sizes within the instance family, and between 15.1–32.0% improvement in geometric mean. R6ID 32 XL instances were not benchmarked because R5D instances don't have this size available. On cost comparison, we observed 6.8–11.5% reduced instance hour cost on R6ID instance EMR clusters compared to R5D EMR instance clusters to run the TPC-DS benchmark queries.

The following table shows the results from running TPC-DS 3 TB benchmark queries using Amazon EMR 6.8 over equivalent R6id and R5d instance EMR clusters.

Instance Size 24 XL 16 XL 12 XL 8 XL 4 XL 2 XL XL
Number of core instances in EMR cluster 5 5 5 5 5 5 5
Total query runtime on R5D (seconds) 4054.4492975042 3691.7569385583 3598.6869168064 3532.7398928104 5397.5330161574 9281.2627059927 16862.8766838096
Total query runtime on R6ID (seconds) 2992.1198446983 2633.7131630720 2632.3186613402 2729.8860537867 4583.1040980373 7921.9960917943 14867.5391541445
Total query runtime improvement with R6ID 26.20% 28.66% 26.85% 22.73% 15.09% 14.65% 11.83%
Geometric mean query runtime R5D (sec) 31.0238156851 28.1432927726 25.7532157307 24.0596427675 32.5800246829 48.2306670294 76.6771994376
Geometric mean query runtime R6ID (sec) 22.8681174894 19.1282742957 18.6161830746 18.0498249257 25.9500918360 39.6580341258 65.0947323858
Geometric mean query runtime improvement with R6ID 26.29% 32.03% 27.71% 24.98% 20.35% 17.77% 15.11%
EC2 R5D instance price ($ per hour) $6.912000 $4.608000 $3.456000 $2.304000 $1.152000 $0.576000 $0.288000
EMR R5D instance price ($ per hour) $0.270000 $0.270000 $0.270000 $0.270000 $0.270000 $0.144000 $0.072000
(EC2 + EMR) R5D instance price ($ per hour) $7.182000 $4.878000 $3.726000 $2.574000 $1.422000 $0.720000 $0.360000
Cost of running on R5D ($ per instance) $8.088626 $5.002331 $3.724641 $2.525909 $2.132026 $1.856253 $1.686288
EC2 R6ID instance price ($ per hour) $7.257600 $4.838400 $3.628800 $2.419200 $1.209600 $0.604800 $0.302400
EMR R6ID price ($ per hour per instance) $1.814400 $1.209600 $0.907200 $0.604800 $0.302400 $0.151200 $0.075600
(EC2 + EMR) R6ID instance price ($ per hour) $9.072000 $6.048000 $4.536000 $3.024000 $1.512000 $0.756000 $0.378000
Cost of running on R6ID ($ per instance) $7.540142 $4.424638 $3.316722 $2.293104 $1.924904 $1.663619 $1.561092
Total cost reduction with R6ID including performance improvement -6.78% -11.55% -10.95% -9.22% -9.71% -10.38% -7.42%

Benchmarking methodology

The benchmark used in this post is derived from the industry-standard TPC-DS benchmark, and uses queries from the Spark SQL Performance Tests GitHub repo with fixes applied.

We calculated total cost of ownership (TCO) by multiplying the cost per hour by the number of instances in the cluster and the time taken to run the queries on the cluster. We used the on-demand pricing in the US East (N. Virginia) Region for all instances.
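
To make the arithmetic concrete, the following Python snippet reproduces one cell of the I4i table shown earlier, using the per-hour prices and total query runtime from that table. It's illustrative only and assumes the on-demand pricing stated above.

# I4i 16 XL column from the earlier table.
ec2_plus_emr_price_per_hour = 5.491 + 1.373   # (EC2 + EMR) I4I instance price ($ per hour)
total_query_runtime_seconds = 2642.77407      # total TPC-DS runtime on the I4i cluster

cost_per_instance = ec2_plus_emr_price_per_hour * total_query_runtime_seconds / 3600
print(round(cost_per_instance, 3))            # ~5.039, matching "Cost of running on I4I"

# Cluster-level cost scales this by the 5 core instances; the percentage
# comparison between instance generations is unaffected by that factor.
print(round(cost_per_instance * 5, 2))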

Conclusion

In this post, we described how we estimated the cost-performance benefit from using Amazon EMR with C6i, M6i, I4i, R6i, and R6id instances compared to using equivalent previous generation instances. Using these new instances with Amazon EMR improves cost-performance by an additional 5–33%.


About the authors

Al MS is a product manager for Amazon EMR at Amazon Web Services.

Kyeonghyun Ryoo is a Software Development Engineer for EMR at Amazon Web Services. He primarily works on designing and building automation tools for internal teams and customers to maximize their productivity. Outside of work, he is a retired world champion in professional gaming who still enjoys playing video games.

Perform multi-cloud analytics using Amazon QuickSight, Amazon Athena Federated Query, and Microsoft Azure Synapse

Post Syndicated from Harish Rajagopalan original https://aws.amazon.com/blogs/big-data/perform-multi-cloud-analytics-using-amazon-quicksight-amazon-athena-federated-query-and-microsoft-azure-synapse/

In this post, we show how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data that is stored in Microsoft Azure Synapse databases.

Organizations today use data stores that are best suited for the applications they build. Additionally, they may also continue to use some of their legacy data stores as they modernize and migrate to the cloud. These disparate data stores might be spread across on-premises data centers and different cloud providers. This presents a challenge for analysts to be able to access, visualize, and derive insights from the disparate data stores.

QuickSight is a fast, cloud-powered business analytics service that enables employees within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data on their devices anytime. Amazon Athena is a serverless interactive query service that provides full ANSI SQL support to query a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet, that are stored on Amazon Simple Storage Service (Amazon S3). For data that isn’t stored on Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store it in Amazon S3.

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of Athena’s query engine. In this post, we use the Athena connector for Azure Synapse Analytics, which enables Athena to run SQL queries on your Azure Synapse databases using JDBC.

Solution overview

Consider the following reference architecture for visualizing data from Azure Synapse Analytics.

In order to explain this architecture, let’s walk through a sample use case of analyzing fitness data of users. Our sample dataset contains users’ fitness information like age, height, and weight, and daily run stats like miles, calories, average heart rate, and average speed, along with hours of sleep.

We run queries on this dataset to derive insights using visualizations in QuickSight. With QuickSight, you can create trends of daily miles run, keep track of the average heart rate over a period of time, and detect anomalies, if any. You can also track your daily sleep patterns and compare how rest impacts your daily activities. The out-of-the-box insights feature gives vital weekly insights that can help you be on top of your fitness goals. The following screenshot shows sample rows of our dataset stored in Azure Synapse.


Prerequisites

Make sure you have the following prerequisites:

  • An AWS account set up with QuickSight enabled. If you don’t have a QuickSight account, you can sign up for one. You can access the QuickSight free trial as part of the AWS Free Tier option.
  • An Azure account with data pre-loaded in Synapse. We use a sample fitness dataset in this post. We used a data generator to generate this data.
  • A virtual private network (VPN) connection between AWS and Azure.

Note that the AWS resources for the steps in this post need to be in the same Region.

Configure a Lambda connector

To configure your Lambda connector, complete the following steps:

  1. Load the data.
    In the Azure account, the sample data for fitness devices is stored and accessed in an Azure Synapse Analytics workspace using a dedicated SQL pool table. The firewall settings for Synapse should allow access from your VPC through the VPN. You can use your own data or tables that you want to connect QuickSight to in this step.
  2. On the Amazon S3 console, create a spillover bucket and note the name to use in a later step.
    This bucket is used for storing the spillover data for the Synapse connector.
  3. On the AWS Serverless Application Repository console, choose Available applications in the navigation pane.
  4. On the Public applications tab, search for synapse and choose AthenaSynapseConnector with the AWS verified author tag.
  5. Create the Lambda function with the following configuration:
    1. For Name, enter AthenaSynapseConnector.
    2. For SecretNamePrefix, enter AthenaJdbcFederation.
    3. For SpillBucket, enter the name of the S3 bucket you created.
    4. For DefaultConnectionString, enter the JDBC connection string from the Azure SQL pool connection strings property.
    5. For LambdaFunctionName, enter a function name.
    6. For SecurityGroupIds and SubnetIds, enter the security group and subnet for your VPC (this is needed for the template to run successfully).
    7. Leave the remaining values as their default.
  6. Choose Deploy.
  7. After the function is deployed successfully, navigate to the athena_hybrid_azure function.
  8. On the Configurations tab, choose Environment variables in the navigation pane.
  9. Add the key azure_synapse_demo_connection_string with the same value as the default key (the JDBC connection string from the Azure SQL pool connection strings property).

    For this post, we removed the VPC configuration.
  10. Choose VPC in the navigation pane and choose None to remove the VPC configuration.
    Now you’re ready to configure the data source.
  11. On the Athena console, choose Data sources in the navigation pane.
  12. Choose Create data source.
  13. Choose Microsoft Azure Synapse as your data source.
  14. Choose Next.
  15. Create a data source with the following parameters:
    1. For Data source name, enter azure_synapse_demo.
    2. For Connection details, choose the Lambda function athena_hybrid_azure.
  16. Choose Next.
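
Before moving on to QuickSight, you can optionally confirm the new data source works by running a query against it. The following sketch uses the AWS SDK for Python (Boto3); the catalog (azure_synapse_demo) and database (dbo) come from the preceding steps, while the table name and S3 output location are hypothetical placeholders. You can also run the same SQL directly in the Athena query editor.

import boto3

athena = boto3.client("athena")

# Run a federated query against the Azure Synapse data source created above.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "azure_synapse_demo"."dbo"."fitness_data" LIMIT 10',
    QueryExecutionContext={"Catalog": "azure_synapse_demo", "Database": "dbo"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results-bucket/"},
)
print(response["QueryExecutionId"])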

Create a dataset on QuickSight to read the data from Azure Synapse

Now that the configuration on the Athena side is complete, let’s configure QuickSight.

  1. On the QuickSight console, on the user name menu, choose Manage QuickSight.
  2. Choose Security & permissions in the navigation pane.
  3. Under QuickSight access to AWS services, choose Manage.
  4. Choose Amazon Athena and in the pop-up permissions box, choose Next.
  5. On the S3 Bucket tab, select the spill bucket you created earlier.
  6. On the Lambda tab, select the athena_hybrid_azure function.
  7. Choose Finish.
  8. If the QuickSight access to AWS services window appears, choose Save.
  9. Choose the QuickSight icon on the top left and choose New dataset.
  10. Choose Athena as the data source.
  11. For Data source name, enter a name.
  12. Check the Athena workgroup settings where the Athena data source was created.
  13. Choose Create data source.
  14. Choose the catalog azure_synapse_demo and the database dbo.
  15. Choose Edit/Preview data.
  16. Change the query mode to SPICE.
  17. Choose Publish & Visualize.
  18. Create an analysis in QuickSight.
  19. Publish a QuickSight dashboard.
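
If you prefer to script the Athena data source instead of using the console steps above, the following Boto3 sketch shows one way to do it. It's illustrative only; the account ID, data source ID, name, and workgroup are placeholders, and it omits the permissions you would typically grant on the data source.

import boto3

quicksight = boto3.client("quicksight")

# Create a QuickSight data source backed by Athena (placeholder values).
quicksight.create_data_source(
    AwsAccountId="111122223333",
    DataSourceId="athena-azure-synapse-demo",
    Name="Athena - Azure Synapse demo",
    Type="ATHENA",
    DataSourceParameters={"AthenaParameters": {"WorkGroup": "primary"}},
)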

If you’re new to QuickSight or looking to build stunning dashboards, this workshop provides step-by-step instructions to grow your dashboard building skills from basic to advanced level. The following screenshot is an example dashboard to give you some inspiration.

Clean up

To avoid ongoing charges, complete the following steps:

  1. Delete the S3 bucket created for the Athena spill data.
  2. Delete the Athena data source.
  3. On the AWS CloudFormation console, select the stack you created for AthenaSynapseConnector and choose Delete.
    This will delete the created resources, such as the Lambda function. Check the stack’s Events tab to track the progress of the deletion, and wait for the stack status to change to DELETE_COMPLETE.
  4. Delete the QuickSight datasets.
  5. Delete the QuickSight analysis.
  6. Delete your QuickSight subscription and close the account.

Conclusion

In this post, we showed you how to overcome the challenges of connecting to and analyzing data in other clouds by using AWS analytics services to connect to Azure Synapse Analytics with Athena Federated Query and QuickSight. We also showed you how to visualize and derive insights from the fitness data using QuickSight. With QuickSight and Athena Federated Query, organizations can now access additional data sources beyond those already supported natively by QuickSight. If you have data in sources other than Amazon S3, you can use Athena Federated Query to analyze the data in place or build pipelines that extract and store data in Amazon S3.

For more information and resources for QuickSight and Athena, visit Analytics on AWS.


About the authors

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

Salim Khan is a Specialist Solutions Architect for Amazon QuickSight. Salim has over 16 years of experience implementing enterprise business intelligence (BI) solutions. Prior to AWS, Salim worked as a BI consultant catering to industry verticals like Automotive, Healthcare, Entertainment, Consumer, Publishing and Financial Services. He has delivered business intelligence, data warehousing, data integration and master data management solutions across enterprises.

Sriram Vasantha is a Senior Solutions Architect at AWS in Central US helping customers innovate on the cloud. Sriram focuses on application and data modernization, DevSecOps, and digital transformation. In his spare time, Sriram enjoys playing different musical instruments like Piano, Organ, and Guitar.

Adarsha Nagappasetty is a Senior Solutions Architect at Amazon Web Services. Adarsha works with enterprise customers in Central US and helps them with their cloud journey. In his spare time, Adarsha enjoys spending time outdoors with his family!

Explore your data lake using Amazon Athena for Apache Spark

Post Syndicated from Pathik Shah original https://aws.amazon.com/blogs/big-data/explore-your-data-lake-using-amazon-athena-for-apache-spark/

Amazon Athena now enables data analysts and data engineers to enjoy the easy-to-use, interactive, serverless experience of Athena with Apache Spark in addition to SQL. You can now use the expressive power of Python and build interactive Apache Spark applications using a simplified notebook experience on the Athena console or through Athena APIs. For interactive Spark applications, you can spend less time waiting and be more productive because Athena instantly starts running applications in less than a second. And because Athena is serverless and fully managed, analysts can run their workloads without worrying about the underlying infrastructure.

Data lakes are a common mechanism to store and analyze data because they allow companies to manage multiple data types from a wide variety of sources, and store this data, structured and unstructured, in a centralized repository. Apache Spark is a popular open-source, distributed processing system optimized for fast analytics workloads against data of any size. It’s often used to explore data lakes to derive insights. For performing interactive data explorations on the data lake, you can now use the instant-on, interactive, and fully managed Apache Spark engine in Athena. It enables you to be more productive and get started quickly, spending almost no time setting up infrastructure and Spark configurations.

In this post, we show how you can use Athena for Apache Spark to explore and derive insights from your data lake hosted on Amazon Simple Storage Service (Amazon S3).

Solution overview

We showcase reading and exploring CSV and Parquet datasets to perform interactive analysis using Athena for Apache Spark and the expressive power of Python. We also perform visual analysis using the pre-installed Python libraries. For running this analysis, we use the built-in notebook editor in Athena.

For the purpose of this post, we use the NOAA Global Surface Summary of Day public dataset from the Registry of Open Data on AWS, which consists of daily weather summaries from various NOAA weather stations. The dataset is primarily in plain text CSV format. We transformed the entire CSV dataset, as well as subsets of it, into Parquet format for our demo.

Before running the demo, we want to introduce the following concepts related to Athena for Spark:

  • Sessions – When you open a notebook in Athena, a new session is started for it automatically. Sessions keep track of the variables and state of notebooks.
  • Calculations – Running a cell in a notebook means running a calculation in the current session. As long as a session is running, calculations use and modify the state that is maintained for the notebook.

For more details, refer to Session and Calculations.
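
The same concepts are exposed through the Athena API if you want to drive sessions and calculations programmatically instead of through the notebook editor. The following Boto3 sketch is illustrative; the workgroup name comes from this post, while the DPU setting and code block are arbitrary examples.

import boto3

athena = boto3.client("athena")

# Start a session in the Spark-enabled workgroup.
session = athena.start_session(
    WorkGroup="DemoAthenaSparkWorkgroup",
    EngineConfiguration={"MaxConcurrentDpus": 20},
)

# In practice, wait for the session to reach the IDLE state before submitting work.
calculation = athena.start_calculation_execution(
    SessionId=session["SessionId"],
    CodeBlock="print(spark.version)",
)

# Check the calculation's state (for example, QUEUED, RUNNING, or COMPLETED).
status = athena.get_calculation_execution(
    CalculationExecutionId=calculation["CalculationExecutionId"]
)
print(status["Status"]["State"])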

Prerequisites

For this demo, you need the following prerequisites:

  • An AWS account with access to the AWS Management Console
  • Athena permissions on the workgroup DemoAthenaSparkWorkgroup, which you create as part of this demo
  • AWS Identity and Access Management (IAM) permissions to create, read, and update the IAM role and policies created as part of the demo
  • Amazon S3 permissions to create an S3 bucket and read the bucket location

The following policy grants these permissions. Attach it to the IAM role or user you use to sign in to the console. Make sure to provide your AWS account ID and the Region in which you’re running the demo.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "athena:*",
            "Resource": "arn:aws:athena:<REGION>:<ACCOUNT_ID>:workgroup/DemoAthenaSparkWorkgroup"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:GetRole",
                "iam:ListAttachedRolePolicies",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:PutRolePolicy",
                "iam:ListRolePolicies",
                "iam:GetRolePolicy",
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<ACCOUNT_ID>:role/service-role/AWSAthenaSparkExecutionRole-*",
                "arn:aws:iam::<ACCOUNT_ID>:policy/service-role/AWSAthenaSparkRolePolicy-*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::<ACCOUNT_ID>-<REGION>-athena-results-bucket-*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:ListPolicies",
                "iam:ListRoles",
                "athena:ListWorkGroups",
                "athena:ListEngineVersions"
            ],
            "Resource": "*"
        }
    ]
}

Create your Athena workgroup

We create a new Athena workgroup with Spark as the engine. Complete the following steps:

  1. On the Athena console, choose Workgroups in the navigation pane.
  2. Choose Create workgroup.
  3. For Workgroup name, enter DemoAthenaSparkWorkgroup.
    Make sure to enter the exact name because the preceding IAM permissions are scoped down for the workgroup with this name.
  4. For Analytics engine, choose Apache Spark.
  5. For Additional configurations, select Use defaults.
    The defaults include the creation of an IAM role with the required permissions to run Spark calculations on Athena and an S3 bucket to store calculation results. It also sets the notebook (which we create later) encryption key management to an AWS Key Management Service (AWS KMS) key owned by Athena.
  6. Optionally, add tags to your workgroup.
  7. Choose Create workgroup.

Modify the IAM role

Creating the workgroup creates a new IAM role. Choose the newly created workgroup, then the value under Role ARN to be redirected to the IAM console.

Add the following permission as an inline policy to the IAM role created earlier. This allows the role to read the S3 datasets. For instructions, refer to the section To embed an inline policy for a user or role (console) in Adding IAM identity permissions (console).

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/*",
                "arn:aws:s3:::noaa-gsod-pds/2022/*",
                "arn:aws:s3:::noaa-gsod-pds",
                "arn:aws:s3:::athena-examples-us-east-1"
            ]
        }
    ]
}

Set up your notebook

To run the analysis on Spark on Athena, we need a notebook. Complete the following steps to create one:

  1. On the Athena console, choose Notebook Editor.
  2. Choose the newly created workgroup DemoAthenaSparkWorkgroup on the drop-down menu.
  3. Choose Create Notebook.
  4. Provide a notebook name, for example AthenaSparkBlog.
  5. Keep the default session parameters.
  6. Choose Create.

Your notebook should now be loaded, which means you can start running Spark code. You should see the following screenshot.

Explore the dataset

Now that we have the workgroup and notebook created, let’s start exploring the NOAA Global Surface Summary of Day dataset. The datasets used in this post are stored in the following locations:

  • CSV data for year 2022 – s3://noaa-gsod-pds/2022/
  • Parquet data for year 2021 – s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2021/
  • Parquet data for year 2020 – s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/
  • Entire dataset in Parquet format (until October 2022) – s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/

In the rest of this post, we show PySpark code snippets. Copy the code and enter it in the notebook’s cell. Press Shift+Enter to run the code as a calculation. Alternatively, you can choose Run. Add more cells to run subsequent code snippets.

We start by reading the CSV dataset for the year 2022 and print its schema to understand the columns contained in the dataset. Run the following code in the notebook cell:

year_22_csv = spark.read.option("header","true").csv(f"s3://noaa-gsod-pds/2022/")
year_22_csv.printSchema()

We get the following output.

We were able to submit the preceding code as a calculation instantly using the notebook.

Let’s continue exploring the dataset. Looking at the columns in the schema, we’re interested in previewing the data for the following attributes in 2022:

  • TEMP – Mean temperature
  • WDSP – Mean wind speed
  • GUST – Maximum wind gust
  • MAX – Maximum temperature
  • MIN – Minimum temperature
  • Name – Station name

Run the following code:

year_22_csv.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show()

We get the following output.

Now we have an idea of what the dataset looks like. Next, let’s perform some analysis to find the maximum recorded temperature for the Seattle-Tacoma Airport in 2022. Run the following code:

from pyspark.sql.functions import max

year_22_csv.filter("NAME == 'SEATTLE TACOMA AIRPORT, WA US'").agg(max("MAX").alias("max_temp_yr_2022")).show()

We get the following output.

Next, we want to find the maximum recorded temperature for each month of 2022. For this, we use the Spark SQL feature of Athena. First, we need to create a temporary view on the year_22_csv data frame. Run the following code:

year_22_csv.createOrReplaceTempView("y22view")

To run our Spark SQL query, we use %%sql magic:

%%sql
select month(to_date(date,'yyyy-MM-dd')) as month_yr_22,max(MAX) as max_temp 
from y22view where NAME == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1

We get the following output.

The output of the preceding query produces the month in numeric form. To make it more readable, let’s convert the month numbers into month names. For this, we use a user-defined function (UDF) and register it to use in the Spark SQL queries for the rest of the notebook session. Run the following code to create and register the UDF:

import calendar

# UDF to convert month number to month name
spark.udf.register("month_name_udf",lambda x: calendar.month_name[int(x)])

We rerun the query to find the maximum recorded temperature for each month of 2022 but with the month_name_udf UDF we just created. Also, this time we sort the results based on the maximum temperature value. See the following code:

%%sql
select month_name_udf(month(to_date(date,'yyyy-MM-dd'))) as month_yr_22,
max(MAX) as max_temp
from y22view where NAME == 'SEATTLE TACOMA AIRPORT, WA US' 
group by 1 order by 2 desc

The following output shows the month names.

Until now, we have run interactive explorations for the year 2022 of the NOAA Global Surface Summary of Day dataset. Let’s say we want to compare the temperature values with the previous 2 years. We compare the maximum temperature across 2020, 2021, and 2022. As a reminder, the dataset for 2022 is in CSV format and for 2020 and 2021, the datasets are in Parquet format.

To continue with the analysis, we read the 2020 and 2021 Parquet datasets into the data frame and create temporary views on the respective data frames. Run the following code:

#Read the dataset
year_20_pq = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2020/")
year_21_pq = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/2021/")

#Create temporary views
year_20_pq.createOrReplaceTempView("y20view")
year_21_pq.createOrReplaceTempView("y21view")

#Preview the datasets
print('Preview for year 2020:')
year_20_pq.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show(1)
print('Preview for year 2021:')
year_21_pq.select('NAME','DATE','TEMP','WDSP','GUST','MAX','MIN').show(1)

We get the following output.

To compare the recorded maximum temperature for each month in 2020, 2021, and 2022, we perform a join operation on the three views created so far from their respective data frames. Also, we reuse the month_name_udf UDF to convert month number to month name. Run the following code:

%%sql
select month_name_udf(month(to_date(y21.DATE,'yyyy-MM-dd'))) as month, max(y20.max) as max_temp_2020, 
max(y21.max) as max_temp_2021, max(y22.max) as max_temp_2022
from y20view y20 inner join y21view y21 inner join y22view y22
on month(to_date(y20.DATE,'yyyy-MM-dd')) = month(to_date(y21.DATE,'yyyy-MM-dd'))
and month(to_date(y21.DATE,'yyyy-MM-dd')) = month(to_date(y22.DATE,'yyyy-MM-dd'))
where y20.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y21.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y22.NAME == 'SEATTLE TACOMA AIRPORT, WA US'
group by 1

We get the following output.

So far, we’ve read CSV and Parquet datasets, run analysis on the individual datasets, and performed join and aggregation operations on them to derive insights instantly in an interactive mode. Next, we show how you can use the pre-installed libraries like Seaborn, Matplotlib, and Pandas for Spark on Athena to generate a visual analysis. For the full list of preinstalled Python libraries, refer to List of preinstalled Python libraries.

We plot a visual analysis to compare the recorded maximum temperature values for each month in 2020, 2021, and 2022. Run the following code, which creates a Spark data frame from the SQL query, converts it into a Pandas data frame, and uses Seaborn and Matplotlib for plotting:

import seaborn as sns
import matplotlib.pyplot as plt

y20_21_22=spark.sql("select month(to_date(y21.DATE,'yyyy-MM-dd')) as month, max(y20.max) as max_temp_yr_2020, \
max(y21.max) as max_temp_yr_2021, max(y22.max) as max_temp_yr_2022 \
from y20view y20 inner join y21view y21 inner join y22view y22 \
on month(to_date(y20.DATE,'yyyy-MM-dd')) = month(to_date(y21.DATE,'yyyy-MM-dd')) \
and month(to_date(y21.DATE,'yyyy-MM-dd')) = month(to_date(y22.DATE,'yyyy-MM-dd')) \
where y20.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y21.NAME == 'SEATTLE TACOMA AIRPORT, WA US' and y22.NAME == 'SEATTLE TACOMA AIRPORT, WA US' \
group by 1 order by 1")

#convert to pandas dataframe
y20_21_22=y20_21_22.toPandas()

#change datatypes to float for plotting
y20_21_22['max_temp_yr_2020']= y20_21_22['max_temp_yr_2020'].astype(float)
y20_21_22['max_temp_yr_2021']= y20_21_22['max_temp_yr_2021'].astype(float)
y20_21_22['max_temp_yr_2022']= y20_21_22['max_temp_yr_2022'].astype(float)

# Unpivot dataframe from wide to long format for plotting
y20_21_22=y20_21_22.melt('month',var_name='max_temperature', \
             value_name='temperature')

plt.clf()

sns.catplot(data=y20_21_22,x='month',y='temperature', hue='max_temperature', \
            sort=False, kind='point', height=4, aspect=1.5)
%matplot plt

The following graph shows our output.

Next, we plot a heatmap showing the maximum temperature trend for each month across all the years in the dataset. For this, we have converted the entire CSV dataset (until October 2022) into Parquet format and stored it in s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/.

Run the following code to plot the heatmap:

noaa = spark.read.parquet(f"s3://athena-examples-us-east-1/athenasparkblog/noaa-gsod-pds/parquet/historical/")
noaa.createOrReplaceTempView("noaaview")

#query to find maximum temperature for each month from year 1973 to 2022
year_hist=spark.sql("select month(to_date(date,'yyyy-MM-dd')) as month, \
year(to_date(date,'yyyy-MM-dd')) as year,  cast(max(temp) as float) as temp \
from noaaview where NAME == 'SEATTLE TACOMA AIRPORT, WA US' group by 1,2") 

# convert spark dataframe to pandas
year_hist=year_hist.toPandas()
year_hist=year_hist.pivot("month","year","temp")

plt.clf()
grid_kws = {"height_ratios": (0.9, .05), "hspace": .5}
f, (ax, cbar_ax) = plt.subplots(2, gridspec_kw=grid_kws)

sns.heatmap(year_hist, ax=ax, cbar_ax=cbar_ax, cmap="RdYlBu_r", \
            cbar_kws={"orientation": "horizontal"})
%matplot plt

We get the following output.

From the plot, we can see the trend is broadly consistent across the years: the temperature rises during the summer months and falls as winter approaches in the Seattle-Tacoma Airport area. You can continue exploring the datasets further, running more analyses and plotting more visuals to get a feel for the interactive and instant-on experience Athena for Apache Spark offers.

Clean up resources

When you’re done with the demo, make sure to delete the S3 bucket you created to store the workgroup calculations to avoid storage costs. Also, you can delete the workgroup, which deletes the notebook as well.

Conclusion

In this post, we saw how you can use the interactive and serverless experience of Athena for Spark as the engine to run calculations instantly. You just need to create a workgroup and notebook to start running the Spark code. We explored datasets stored in different formats in an S3 data lake and ran interactive analyses to derive various insights. Also, we ran visual analyses by plotting charts using the preinstalled libraries. To learn more about Spark on Athena, refer to Using Apache Spark in Amazon Athena.


About the Authors

Pathik Shah is a Sr. Big Data Architect on Amazon Athena. He joined AWS in 2015 and has been focusing on the big data analytics space since then, helping customers build scalable and robust solutions using AWS analytics services.

Raj Devnath is a Sr. Product Manager at AWS working on Amazon Athena. He is passionate about building products customers love and helping customers extract value from their data. His background is in delivering solutions for multiple end markets, such as finance, retail, smart buildings, home automation, and data communication systems.

Centrally manage access and permissions for Amazon Redshift data sharing with AWS Lake Formation

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/centrally-manage-access-and-permissions-for-amazon-redshift-data-sharing-with-aws-lake-formation/

Today’s global, data-driven organizations treat data as an asset and use it across different lines of business (LOBs) to drive timely insights and better business decisions. Amazon Redshift data sharing allows you to securely share live, transactionally consistent data in one Amazon Redshift data warehouse with another Amazon Redshift data warehouse within the same AWS account, across accounts, and across Regions, without needing to copy or move data from one cluster to another.

Some customers share their data with 50–100 data warehouses in different accounts and do a lot of cross-sharing, making it difficult to track who is accessing what data. They have to navigate to an individual account’s Amazon Redshift console to retrieve the access information. Also, many customers have their data lake on Amazon Simple Storage Service (Amazon S3), which is shared within and across various business units. As the organization grows and democratizes the data, administrators want the ability to manage the datashare centrally for governance and auditing, and to enforce fine-grained access control.

Working backward from these customer asks, we are announcing the preview of a new feature: Amazon Redshift data sharing integration with AWS Lake Formation, which enables Amazon Redshift customers to centrally manage access to their Amazon Redshift datashares using Lake Formation.

Lake Formation has been a popular choice for centrally governing data lakes backed by Amazon S3. Now, with Lake Formation support for Amazon Redshift data sharing, it opens up new design patterns, and broadens governance and security posture across data warehouses. With this integration, you can use Lake Formation to define fine-grained access control on tables and views being shared with Amazon Redshift data sharing for federated AWS Identity and Access Management (IAM) users and IAM roles.

Customers are using the data mesh approach, which provides a mechanism to share data across business units. Customers are also using a modern data architecture to share data from data lake stores and Amazon Redshift purpose-built data stores across business units. Lake Formation provides the ability to enforce data governance within and across business units, which enables secure data access and sharing, easy data discovery, and centralized audit for data access.

United Airlines is in the business of connecting people and uniting the world.

“As a data-driven enterprise, United is trying to create a unified data and analytics experience for our analytics community that will innovate and build modern data-driven applications. We believe we can achieve this by building a purpose-built data mesh architecture using a variety of AWS services like Athena, Aurora, Amazon Redshift, and Lake Formation to simplify management and governance around granular data access and collaboration.”

-Ashok Srinivas, Director of ML Engineering, and Sarang Bapat, Director of Data Engineering.

In this post, we show how to centrally manage access and permissions for Amazon Redshift data sharing with Lake Formation.

Solution overview

In this solution, we demonstrate how integration of Amazon Redshift data sharing with Lake Formation for data governance can help you build data domains, and how you can use the data mesh approach to bring data domains together to enable data sharing and federation across business units. The following diagram illustrates our solution architecture.

The data mesh is a decentralized, domain-oriented architecture that emphasizes separating data producers from data consumers via a centralized, federated Data Catalog. Typically, the producers and consumers run within their own account. The details of these data mesh characteristics are as follows:

  • Data producers – Data producers own their data products and are responsible for building their data, maintaining its accuracy, and keeping their data product up to date. They determine what datasets can be published for consumption and share their datasets by registering them with the centralized data catalog in a central governance account. You might have a producer steward or administrator persona for managing the data products with the central governance steward or administrators team.
  • Central governance account – Lake Formation enables fine-grained access management on the shared dataset. The centralized Data Catalog offers consumers the ability to quickly find shared datasets, allows administrators to centrally manage access permissions on shared datasets, and provides security teams the ability to audit and track data product usage across business units.
  • Data consumers – The data consumer obtains access to shared resources from the central governance account. These resources are available inside the consumer’s AWS Glue Data Catalog, allowing fine-grained access on the database and table that can be managed by the consumer’s data stewards and administrators.

The following steps provide an overview of how Amazon Redshift data sharing can be governed and managed by Lake Formation in the central governance pattern of a data mesh architecture:

  1. In the producer account, data objects are created and maintained in the Amazon Redshift producer cluster. A data warehouse admin creates the Amazon Redshift datashare and adds datasets (tables, views) to the share.
  2. The data warehouse admin grants and authorizes access on the datashare to the central governance account’s Data Catalog.
  3. In the central governance account, the data lake admin accepts the datashare and creates the AWS Glue database that points to the Amazon Redshift datashare so that Lake Formation can manage it.
  4. The data lake admin shares the AWS Glue database and tables to the consumer account using Lake Formation cross-account sharing.
  5. In the consumer account, the data lake admin accepts the resource share invitation via AWS Resource Access Manager (AWS RAM) and can view the database listed in the account.
  6. The data lake admin defines the fine-grained access control and grants permissions on databases and tables to IAM users (for this post, consumer1 and consumer2) in the account.
  7. In the Amazon Redshift cluster, the data warehouse admin creates an Amazon Redshift database that points to the Glue database and authorizes usage on the created Amazon Redshift database to the IAM users.
  8. The data analyst as an IAM user can now use their preferred tools like the Amazon Redshift query editor to access the dataset based on the Lake Formation fine-grained permissions.

We use the following account setup for our example in this post:

  • Producer account – 123456789012
  • Central account – 112233445566
  • Consumer account – 665544332211
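
As a preview of step 6, the following Boto3 sketch shows how the data lake admin in the consumer account could grant column-level SELECT on a shared table to the consumer1 data analyst. The database, table, and column names are hypothetical; they depend on how the datashare is registered later in this walkthrough.

import boto3

lakeformation = boto3.client("lakeformation")

# Grant column-level SELECT on a shared table to an IAM user in the consumer account.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::665544332211:user/consumer1"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "demotahoedb",      # hypothetical shared database name
            "Name": "public_customer",          # hypothetical shared table name
            "ColumnNames": ["c_custkey", "c_name"],
        }
    },
    Permissions=["SELECT"],
)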

Prerequisites

Create the Amazon Redshift data share and add datasets

In the data producer account, create an Amazon Redshift cluster using the RA3 node type with encryption enabled. Complete the following steps:

  1. On the Amazon Redshift console, create a cluster subnet group.

For more information, refer to Managing cluster subnet groups using the console.

  2. Choose Create cluster.
  3. For Cluster identifier, provide the cluster name of your choice.
  4. For Preview track, choose preview_2022.

  5. For Node type, choose one of the RA3 node types.

This feature is only supported on the RA3 node type.

  6. For Number of nodes, enter the number of nodes that you need for your cluster.
  7. Under Database configurations, choose the admin user name and admin user password.
  8. Under Cluster permissions, you can select the IAM role and set it as the default.

For more information about the default IAM role, refer to Creating an IAM role as default for Amazon Redshift.

  9. Turn off the Use defaults option next to Additional configurations to modify the default settings.
  10. Under Network and security, specify the following:
    1. For Virtual private cloud (VPC), choose the VPC you would like to deploy the cluster in.
    2. For VPC security groups, either leave as default or add the security groups of your choice.
    3. For Cluster subnet group, choose the cluster subnet group you created.

  11. Under Database configuration, in the Encryption section, select Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

Encryption is disabled by default.

  12. For Choose an AWS KMS key, you can either choose an existing AWS Key Management Service (AWS KMS) key, or choose Create an AWS KMS key to create a new key.

For more information, refer to Creating keys.

  13. Choose Create cluster.
  14. For this post, create tables and load data into the producer Amazon Redshift cluster using the following script.

Authorize the datashare

Install or update to the latest AWS Command Line Interface (AWS CLI) version so you can run the command that authorizes the datashare. For instructions, refer to Installing or updating the latest version of the AWS CLI.

Set up Lake Formation permissions

To use the AWS Glue Data Catalog in Lake Formation, complete the following steps in the central governance account to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
  2. In the navigation pane, under Data catalog, choose Settings.
  3. Deselect Use only IAM access control for new databases.
  4. Deselect Use only IAM access control for new tables in new databases.
  5. Choose Version 2 for Cross account version settings.
  6. Choose Save.
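
If you prefer to apply the same settings programmatically, the following Boto3 sketch is one way to do it. Treat it as an assumption-laden sketch: it assumes the CROSS_ACCOUNT_VERSION parameter key controls the cross-account version setting, so verify the field names against the current Lake Formation API reference before using it.

import boto3

lakeformation = boto3.client("lakeformation")

# Read the current settings, clear the IAM access control defaults, and set the
# cross-account version (the CROSS_ACCOUNT_VERSION key is an assumption; verify it).
settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
settings["CreateDatabaseDefaultPermissions"] = []
settings["CreateTableDefaultPermissions"] = []
settings.setdefault("Parameters", {})["CROSS_ACCOUNT_VERSION"] = "2"

lakeformation.put_data_lake_settings(DataLakeSettings=settings)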

Set up an IAM user as a data lake administrator

If you’re using an existing data lake administrator user or role, add the following managed policies if they’re not already attached, and skip the setup steps below:

  • AWSGlueServiceRole
  • AmazonRedshiftFullAccess

Otherwise, to set up an IAM user as a data lake administrator, complete the following steps:

  1. On the IAM console, choose Users in the navigation pane.
  2. Select the IAM user who you want to designate as the data lake administrator.
  3. Choose Add an inline policy on the Permissions tab.
  4. Replace <AccountID> with your own account ID and add the following policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": "lakeformation.amazonaws.com"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": ["iam:PutRolePolicy"],
            "Resource": "arn:aws:iam::<AccountID>:role/aws-service-role/lakeformation.amazonaws.com/AWSServiceRoleForLakeFormationDataAccess"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ram:AcceptResourceShareInvitation",
                "ram:RejectResourceShareInvitation",
                "ec2:DescribeAvailabilityZones",
                "ram:EnableSharingWithAwsOrganization"
            ],
            "Resource": "*"
        }
    ]
}
  5. Provide a policy name.
  6. Review and save your settings.
  7. Choose Add permissions, and choose Attach existing policies directly.
  8. Add the following policies:
    1. AWSLakeFormationCrossAccountManager
    2. AWSGlueConsoleFullAccess
    3. AWSGlueServiceRole
    4. AWSLakeFormationDataAdmin
    5. AWSCloudShellFullAccess
    6. AmazonRedshiftFullAccess
  9. Choose Next: Review and add permissions.

Data consumer account setup

In the consumer account, follow the steps mentioned previously in the central governance account to set up Lake Formation and a data lake administrator.

  1. In the data consumer account, create an Amazon Redshift cluster using the RA3 node type with encryption (refer to the steps demonstrated to create an Amazon Redshift cluster in the producer account).
  2. Choose Launch stack to deploy an AWS CloudFormation template to create two IAM users with policies.

The stack creates the following users under the data analyst persona:

  • consumer1
  • consumer2
  3. After the CloudFormation stack is created, navigate to the Outputs tab of the stack.
  4. Capture the ConsoleIAMLoginURL and LFUsersCredentials values.

  5. Choose the LFUsersCredentials value to navigate to the AWS Secrets Manager console.
  6. In the Secret value section, choose Retrieve secret value.

  7. Capture the secret value for the password.

Both consumer1 and consumer2 need to use this same password to log in to the AWS Management Console.

Configure an Amazon Redshift datashare using Lake Formation

Producer account

Create a datashare using the console

Complete the following steps to create an Amazon Redshift datashare in the data producer account and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, choose the cluster to create the datashare.
  2. On the cluster details page, navigate to the Datashares tab.
  3. Under Datashares created in my namespace, choose Connect to database.

  4. Choose Create datashare.

  5. For Datashare type, choose Datashare.
  6. For Datashare name, enter the name (for this post, demotahoeds).
  7. For Database name, choose the database from where to add datashare objects (for this post, dev).
  8. For Publicly accessible, choose Turn off (or choose Turn on to share the datashare with clusters that are publicly accessible).

  9. Under DataShare objects, choose Add to add the schema to the datashare (in this post, the public schema).
  10. Under Tables and views, choose Add to add the tables and views to the datashare (for this post, we add the table customer and view customer_view).

  11. Under Data consumers, choose Publish to AWS Data Catalog.
  12. For Publish to the following accounts, choose Other AWS accounts.
  13. Provide the AWS account ID of the consumer account. For this post, we provide the AWS account ID of the Lake Formation central governance account.
  14. To share within the same account, choose Local account.
  15. Choose Create datashare.

  16. After the datashare is created, you can verify it by going back to the Datashares tab and entering the datashare name in the search bar under Datashares created in my namespace.
  17. Choose the datashare name to view its details.
  18. Under Data consumers, you will see the consumer status of the consumer data catalog account as Pending Authorization.

data consumers

  1. Select the checkbox next to the consumer data catalog to enable the Authorize option.

authorize

  1. Choose Authorize to grant the consumer account's data catalog access to the datashare. The consumer status changes to Authorized.

authorized

Create a datashare using a SQL command

Complete the following steps to create a datashare in data producer account 1 and share it with Lake Formation in the central account:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor V2.
  2. Choose (right-click) the cluster name and choose Edit connection or Create Connection.
  3. For Authentication, choose Temporary credentials.

Refer to Connecting to an Amazon Redshift database to learn more about the various authentication methods.

  1. For Database, enter a database name (for this post, dev).
  2. For Database user, enter the user authorized to access the database (for this post, awsuser).
  3. Choose Save to connect to the database.

Connecting to an Amazon Redshift database

  1. Run the following SQL commands to create the datashare and add the data objects to be shared:
CREATE DATASHARE demotahoeds;
ALTER DATASHARE demotahoeds ADD SCHEMA PUBLIC;
ALTER DATASHARE demotahoeds ADD TABLE customer;
ALTER DATASHARE demotahoeds ADD TABLE customer_view;
  1. Run the following SQL command to share the producer datashare to the central governance account:
GRANT USAGE ON DATASHARE demotahoeds TO ACCOUNT '<central-aws-account-id>' VIA DATA CATALOG;

Run the following SQL command

  1. You can verify the datashare created and objects shared by running the following SQL command:
DESC DATASHARE demotahoeds

DESC DATASHARE demotahoeds

  1. Run the following command using the AWS CLI to authorize the datashare to the central data catalog so that Lake Formation can manage them:
aws redshift authorize-data-share \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-identifier DataCatalog/<central-aws-account-id>

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:XXXXXXXXXX:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [{
        "ConsumerIdentifier": "DataCatalog/XXXXXXXXXXXX",
        "Status": "AUTHORIZED",
        "CreatedDate": "2022-11-09T21:10:30.507000+00:00",
        "StatusChangeDate": "2022-11-09T21:10:50.932000+00:00"
    }]
}

You can verify the datashare status on the console by following the steps outlined in the previous section.

Central catalog account

The data lake admin accepts and registers the datashare with Lake Formation in the central governance account and creates a database for it. Complete the following steps:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Under Data catalog in the navigation pane, choose Data sharing and view the Amazon Redshift datashare invitations on the Configuration tab.
  4. Select the datashare and choose Review Invitation.

AWS Lake Formation data sharing

A window pops up with the details of the invitation.

  1. Choose Accept to register the Amazon Redshift datashare to the AWS Glue Data Catalog.

accept reject invitation

  1. Provide a name for the AWS Glue database and choose Skip to Review and create.

Skip to Review and create

  1. Review the content and choose Create database.

create database

After the AWS Glue database is created on the Amazon Redshift datashare, you can view it under Shared Databases.

Shared Databases.

You can also use the AWS CLI to register the datashare and create the database. Use the following commands:

  1. Describe the Amazon Redshift datashare that is shared with the central account:
aws redshift describe-data-shares
  1. Accept and associate the Amazon Redshift datashare to Data Catalog:
aws redshift associate-data-share-consumer \
--data-share-arn 'arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds' \
--consumer-arn arn:aws:glue:us-east-1:<central-aws-account-id>:catalog

The following is an example output:

 {
    "DataShareArn": "arn:aws:redshift:us-east-1:123456789012:datashare:cd8d91b5-0c17-4567-a52a-59f1bdda71cd/demotahoeds",
    "ProducerArn": "arn:aws:redshift:us-east-1:123456789012:namespace:cd8d91b5-0c17-4567-a52a-59f1bdda71cd",
    "AllowPubliclyAccessibleConsumers": false,
    "DataShareAssociations": [
        {
            "ConsumerIdentifier": "arn:aws:glue:us-east-1:112233445566:catalog",
            "Status": "ACTIVE",
            "ConsumerRegion": "us-east-1",
            "CreatedDate": "2022-11-09T23:25:22.378000+00:00",
            "StatusChangeDate": "2022-11-09T23:25:22.378000+00:00"
        }
    ]
}
  1. Register the Amazon Redshift datashare in Lake Formation:
aws lakeformation register-resource \
--resource-arn arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds
  1. Create the AWS Glue database that points to the accepted Amazon Redshift datashare:
aws glue create-database --region <central-catalog-region> --cli-input-json '{
    "CatalogId": "<central-aws-account-id>",
    "DatabaseInput": {
        "Name": "demotahoedb",
        "FederatedDatabase": {
            "Identifier": "arn:aws:redshift:<producer-region>:<producer-aws-account-id>:datashare:<producer-cluster-namespace>/demotahoeds",
            "ConnectionName": "aws:redshift"
        }
    }
}'

Now the data lake administrator of the central governance account can view and share access on both the database and tables to the data consumer account using the Lake Formation cross-account sharing feature.

Grant datashare access to the data consumer

To grant the data consumer account permissions on the shared AWS Glue database, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the data consumer account ID (for this post, 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. Select Describe for both Database permissions and Grantable permissions.
  8. Choose Grant to apply the permissions.

grant data permissions
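
If you prefer to script these grants, the AWS CLI offers an equivalent. The following is a minimal sketch run from the central governance account; the consumer account ID and database name mirror the console steps above, and the catalog ID placeholder is an assumption you should adjust to your environment:

aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=665544332211 \
--resource '{"Database": {"CatalogId": "<central-aws-account-id>", "Name": "demotahoedb"}}' \
--permissions "DESCRIBE" \
--permissions-with-grant-option "DESCRIBE"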

To grant the data consumer account permissions on tables, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  2. Choose Grant.
  3. Under Principals, select External accounts.
  4. Provide the consumer account ID (for this post, we use 665544332211).
  5. Under LF-Tags or catalog resources, select Named data catalog resources.
  6. For Databases, choose the database demotahoedb.
  7. For Tables, choose All tables.
  8. Select Describe and Select for both Table permissions and Grantable permissions.
  9. Choose Grant to apply the changes.

grant the data consumer account permissions on tables

Consumer account

The consumer admin will receive the shared resources from the central governance account and delegate access to other users in the consumer account as shown in the following table.

IAM User  | Object Access        | Object Type | Access Level
consumer1 | public.customer      | Table       | All
consumer2 | public.customer_view | View        | Specific columns: c_customer_id, c_birth_country, cd_gender, cd_marital_status, cd_education_status

In the data consumer account, follow these steps to accept the resources shared with the account:

  1. Sign in to the console as the data lake administrator IAM user or role.
  2. If this is your first time logging in to the Lake Formation console, select Add myself and choose Get started.
  3. Sign in to the AWS RAM console.
  4. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations. You will receive two invitations.

Resource shares to view the pending invitations

  1. Choose the pending invitations and accept the resource share.

Choose the pending invitation and accept the resource share

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases to view the cross-account shared database.

choose Databases to view the cross-account shared database

Grant access to the data analyst and IAM users using Lake Formation

Now the data lake admin in the data consumer account can delegate permissions on the shared database and tables to users in the consumer account.

Grant database permissions to consumer1 and consumer2

To grant the IAM users consumer1 and consumer2 database permissions, follow these steps:

  1. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.

choose Grant database

  1. Under Principals, select IAM users and roles.
  2. Choose the IAM users consumer1 and consumer2.
  3. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  4. Select Describe for Database permissions.
  5. Choose Grant to apply the permissions.

Choose Grant to apply the permissions

Grant table permissions to consumer1

To grant the IAM user consumer1 permissions on table public.customer, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose IAM user consumer1.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer.
  7. Select Describe and Select for Table permissions.
  8. Choose Grant to apply the permissions.

Grant table permissions to consumer1

Grant column permissions to consumer2

To grant the IAM user consumer2 permissions on non-sensitive columns in public.customer_view, follow these steps:

  1. Under Data catalog in the navigation pane, choose Databases.
  2. Select the database demotahoedb and on the Actions menu, choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user consumer2.
  5. Under LF-Tags or catalog resources, demotahoedb is already selected for Databases.
  6. For Tables, choose public.customer_view.

Grant column permissions to consumer2

  1. Select Select for Table permissions.
  2. Under Data Permissions, select Column-based access.
  3. Select Include columns and choose the non-sensitive columns (c_customer_id, c_birth_country, cd_gender, cd_marital_status, and cd_education_status).
  4. Choose Grant to apply the permissions.

table permissions
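
The same column-level grant can be scripted with the AWS CLI. The following sketch is an assumption-heavy equivalent of the console steps above: the principal ARN, catalog ID, and the exact table name under the shared database should be adjusted to match what you see on the Lake Formation console.

aws lakeformation grant-permissions \
--principal DataLakePrincipalIdentifier=arn:aws:iam::665544332211:user/consumer2 \
--resource '{"TableWithColumns": {"CatalogId": "<central-aws-account-id>", "DatabaseName": "demotahoedb", "Name": "public.customer_view", "ColumnNames": ["c_customer_id", "c_birth_country", "cd_gender", "cd_marital_status", "cd_education_status"]}}' \
--permissions "SELECT"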

Consume the datashare from the data consumer account in the Amazon Redshift cluster

In the Amazon Redshift consumer data warehouse, log in as the admin user using Query Editor V2 and complete the following steps:

  1. Create the Amazon Redshift database from the shared catalog database using the following SQL command:
CREATE DATABASE demotahoedb FROM ARN 'arn:aws:glue:<producer-region>:<producer-aws-account-id>:database/demotahoedb' WITH DATA CATALOG SCHEMA demotahoedb ;
  1. Run the following SQL commands to create and grant usage on the Amazon Redshift database to the IAM users consumer1 and consumer2:
CREATE USER IAM:consumer1 password disable;
CREATE USER IAM:consumer2 password disable;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer1;
GRANT USAGE ON DATABASE demotahoedb TO IAM:consumer2;

To use a federated identity to enforce Lake Formation permissions, follow these steps to configure Query Editor v2.

  1. Choose the settings icon in the bottom left corner of the Query Editor v2, then choose Account settings.

identity to enforce Lake Formation permissions

  1. Under Connection settings, select Authenticate with IAM credentials.
  2. Choose Save.

Authenticate with IAM credentials

Query the shared datasets as a consumer user

To validate that the IAM user consumer1 has datashare access from Amazon Redshift, perform the following steps:

  1. Sign in to the console as IAM user consumer1.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane.
  3. To connect to the consumer cluster, choose the consumer cluster in the tree-view pane.
  4. When prompted, for Authentication, select Temporary credentials using your IAM identity.
  5. For Database, enter the database name (for this post, dev).
  6. The user name will be mapped to your current IAM identity (for this post, consumer1).
  7. Choose Save.

edit connection for redshift

  1. Once you’re connected to the database, you can validate the currently logged-in user with the following SQL command:
select current_user;

  1. To find the federated databases created on the consumer account, run the following SQL command:
SHOW DATABASES FROM DATA CATALOG [ACCOUNT '<id1>', '<id2>'] [LIKE 'expression'];

federated databases created on the consumer account

  1. To validate permissions for consumer1, run the following SQL command:
select * from demotahoedb.public.customer limit 10;

As shown in the following screenshot, consumer1 is able to successfully access the datashare customer object.

Now let’s validate that consumer2 doesn’t have access to the datashare table public.customer on the same consumer cluster.

  1. Log out of the console and sign in as IAM user consumer2.
  2. Follow the same steps to connect to the database using the query editor.
  3. Once connected, run the same query:
select * from demotahoedb.public.customer limit 10;

The user consumer2 should get a permission denied error, as in the following screenshot.

should get a permission denied error

Let’s validate the column-level access permissions of consumer2 on the public.customer_view view.

  1. Connect to Query Editor v2 as consumer2 and run the following SQL command:
select c_customer_id,c_birth_country,cd_gender,cd_marital_status from demotahoedb.public.customer_view limit 10;

In the following screenshot, you can see consumer2 is only able to access columns as granted by Lake Formation.

access columns as granted by Lake Formation
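
You can also confirm the restriction in the other direction: selecting a column that was not included in the grant should fail. The column c_email_address below is a hypothetical example of an ungranted column; substitute any column of the view that you left out of the grant.

select c_customer_id, c_email_address from demotahoedb.public.customer_view limit 10;

This query should return a permission denied error for consumer2.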

Conclusion

A data mesh approach provides a method by which organizations can share data across business units. Each domain is responsible for the ingestion, processing, and serving of their data. They are data owners and domain experts, and are responsible for data quality and accuracy. Using Amazon Redshift data sharing with Lake Formation for data governance helps build the data mesh architecture, enabling data sharing and federation across business units with fine-grained access control.

Special thanks to everyone who contributed to the launch of Amazon Redshift data sharing with AWS Lake Formation:

Debu Panda, Michael Chess, Vlad Ponomarenko, Ting Yan, Erol Murtezaoglu, Sharda Khubchandani, Rui Bi


About the Authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Harshida Patel is an Analytics Specialist Principal Solutions Architect with AWS.

Ranjan Burman is an Analytics Specialist Solutions Architect with AWS.

Vikram Sahadevan is a Senior Resident Architect on the AWS Data Lab team. He enjoys efforts that focus on providing prescriptive architectural guidance, sharing best practices, and removing technical roadblocks through joint engineering engagements between customers and AWS technical resources that accelerate data, analytics, artificial intelligence, and machine learning initiatives.

Steve Mitchell is a Senior Solution Architect with a passion for analytics and data mesh. He enjoys working closely with customers as they transition to a modern data architecture.

Data: The genesis for modern invention

Post Syndicated from Swami Sivasubramanian original https://aws.amazon.com/blogs/big-data/data-the-genesis-for-modern-invention/

It only takes one groundbreaking invention—one iconic idea that solves a widespread pain point for customers—to create or transform an industry forever. From the invention of the telegraph, to the discovery of GPS, to the earliest cloud computing services, history is filled with examples of these “eureka” moments that continue to have long-lasting impacts on the way we do business today.

Cognitive scientists John Kounios and Mark Beeman demonstrated that great inventors don’t simply stumble upon their epiphanies; in reality, an idea is preceded by a collection of life experiences, educational knowledge, or even past failures the human brain processes and assimilates over time. Their ideas are preceded by a collection of data points.

When we apply this concept to organizations, and the vast amount of data being produced on a daily basis, we realize there’s an incredible opportunity to ingest, store, process, analyze, and visualize data to create the next big thing.

Today—more than ever before—data is the genesis for modern invention. But to produce new ideas with our data, we need to build dynamic, end-to-end data strategies that lead to new customer experiences as the final output. Some of the biggest brands in the world like Formula 1, Toyota, and Georgia-Pacific are already leveraging AWS to do just that.

This week at AWS re:Invent 2022, I shared several key learnings we’ve collected after working with these brands and more than 1.5 million customers who are using AWS to build their data strategies.

I also revealed several new services and innovations for our customers. Here are a few highlights.

You need a comprehensive set of services to get the job done

Creating a data lake to perform analytics and machine learning (ML) is not an end-to-end data strategy. Your needs will inevitably grow and change over time, which is why we believe every customer should have access to a wide variety of tools based on data types, personas, and their specific use cases.

And our data supports this, with 94% of the top 1,000 AWS customers using more than 10 of our databases and analytics services. A one-size-fits-all approach just doesn’t work in the long run.

You need a comprehensive set of services that enable you to store and query data in your databases, data lakes, and data warehouses; services that help you act on your data with analytics, business intelligence, and machine learning; and services that help you catalog and govern your data across your organization.

You should also have access to services that support a variety of data types for your future use cases, whether you’re working with financial data, clinical data, or retail data. Many of our customers are also using their data to create machine learning models, but some data types are still too cumbersome to work with and prepare for ML.

For example, geospatial data, which supports use cases like self-driving cars, urban planning, or even crop yield in agricultural farms, can be incredibly difficult to access, prepare, and visualize for ML. That’s why this week we announced new capabilities for Amazon SageMaker that make it easier for data scientists to work with geospatial data.

Performance and security are paramount

Performance and security continue to be critical components of our customers’ data strategies.

You’ll need to perform at scale across your data warehouses, databases, and data lakes, or when you want to quickly analyze and visualize your data. We’ve built our business on high-performing services like Amazon Aurora, Amazon DynamoDB, and Amazon Redshift, and this week, we announced several new capabilities to continue building on our performance innovations to date.

For our serverless, interactive query service, Amazon Athena, we announced a new integration with Apache Spark that enables you to spin up Spark workloads up to 75 times faster than other serverless Spark offerings. We also introduced a new feature called Elastic Clusters within our fully managed document database, Amazon DocumentDB, that enables customers to easily scale out or shard their data across multiple database instances.

To help our customers protect their data from potential compromises, we announced Amazon GuardDuty RDS Protection to intelligently detect potential threats for their data stored in Aurora, as well as a new open-source project that allows developers to safely use PostgreSQL extensions in their core databases without worrying about unintended security impacts.

Connecting data is critical for deeper insights

To get the most of your data, you need to combine data silos for deeper insights. However, connecting data across siloes typically requires complex extract, transform, and load (ETL) pipelines, which means creating a manual integration every time you want to ask a different question of your data or build a different ML model. This isn’t fast enough to keep up with the speed that businesses need to move today.

Zero ETL is the future. And we’ve been making strides in this zero-ETL future for several years by deepening integrations between our services. But this week, we’re getting closer to a zero-ETL future by announcing Aurora now supports zero-ETL integration with Amazon Redshift to bring transactional data in Aurora and the analytical capabilities in Amazon Redshift together.

We also announced a new auto-copy feature from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift that removes the need for you to build and manage ETL pipelines whenever you want to use your data for analytics. And we’re not stopping here. With AWS, you can now connect to hundreds of data sources, from software as a service (SaaS) applications to on-premises data stores.

We’ll continue to build zero-ETL capabilities into our services to help our customers easily analyze all their data, no matter where it resides.

Data governance unleashes innovation

Governance was historically used as a defensive measure, which meant locking down data in silos. But in reality, the right governance strategy helps you move and innovate faster with guardrails that give the right people access to your data, when and where they need it.

In addition to fine-grained access controls within AWS Lake Formation, this week we’re making it easier for customers to govern access and privileges within more of our data services with new capabilities announced in Amazon Redshift and Amazon SageMaker.

Our customers also told us they want an end-to-end strategy that enables them to govern their data across the entire data journey. That’s why this week we announced Amazon DataZone, a new data management service that helps you catalog, discover, analyze, share, and govern data across the organization.

When you properly manage secure access to your data, it can flow to the right places and connect the dots across siloed teams and departments.

Build with AWS

With the introduction of these new services and features this week, as well as our comprehensive set of data services, it’s important to remember that support is available as you build your end-to-end data strategy. In fact, we have an entire team at AWS, as well as an extensive network of partners to help our customers build data foundations that will meet their needs now—and well into the future.

For more information about re:Invent 2022, please visit our event page.


About the Author

Swami Sivasubramanian is the Vice President of AWS Data and Machine Learning.

Log analytics the easy way with Amazon OpenSearch Serverless

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/log-analytics-the-easy-way-with-amazon-opensearch-serverless/

We recently announced the preview release of Amazon OpenSearch Serverless, a new serverless option for Amazon OpenSearch Service, which makes it easy for you to run large-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. It automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

OpenSearch Serverless supports two primary use cases:

  • Log analytics that focuses on analyzing large volumes of semi-structured, machine-generated time series data for operational, security, and user behavior insights
  • Full-text search that powers customer applications in their internal networks (content management systems, legal documents) and internet-facing applications such as ecommerce website catalog search and content search

This post focuses on building a simple log analytics pipeline with OpenSearch Serverless.

Solution overview

In the following sections, we walk through the steps to create and access a collection in OpenSearch Serverless, and demonstrate how to configure two different data ingestion pipelines to index data into the collection.

Create a collection

To get started with OpenSearch Serverless, you first create a collection. A collection in OpenSearch Serverless is a logical grouping of one or more indexes that represent an analytics workload.

The following graphic gives a quick navigation for creating a collection. Alternatively, refer to this blog post to learn more about how to create and configure a collection in OpenSearch Serverless.
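
If you prefer to work programmatically, you can also create a collection with the AWS CLI. The following is a minimal sketch; the collection name is an example, and the collection additionally needs the encryption, network, and data access policies described in the blog post referenced above:

aws opensearchserverless create-collection \
--name fleet-logs \
--type TIMESERIES \
--description "Collection for log analytics"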

Access the collection

You can use AWS Identity and Access Management (IAM) credentials (an access key ID and secret access key) for your IAM users and roles to access your collection programmatically. Alternatively, you can set up SAML authentication for accessing OpenSearch Dashboards. Note that SAML authentication is only available for accessing OpenSearch Dashboards; you need IAM credentials to perform any operations using the AWS Command Line Interface (AWS CLI), APIs, and OpenSearch clients for indexing and searching data. In this post, we use IAM credentials to access the collections.

Create a data ingestion pipeline

OpenSearch Serverless supports the same ingestion pipelines as the open-source OpenSearch and managed clusters. These clients include applications like Logstash and Amazon Kinesis Data Firehose, and language clients like JavaScript, Python, Go, Java, and more. For more details on all the ingestion pipelines and supported clients, refer to ingesting data into OpenSearch Serverless collections.

Using Logstash

The open-source version of Logstash (Logstash OSS) provides a convenient way to use the bulk API to upload data into your collections. OpenSearch Serverless supports the logstash-output-opensearch output plugin, which supports IAM credentials for data access control. In this post, we show how to use the file input plugin to send data from your command line console to an OpenSearch Serverless collection. Complete the following steps:

  1. Download the logstash-oss-with-opensearch-output-plugin file (this example uses the distro for macos-x64; for other distros, refer to the artifacts):
    wget https://artifacts.opensearch.org/logstash/logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz

  2. Extract the downloaded tarball:
    tar -zxvf logstash-oss-with-opensearch-output-plugin-8.4.0-macos-x64.tar.gz
    cd logstash-8.4.0/

  3. Update the logstash-output-opensearch plugin to the latest version:
    ./bin/logstash-plugin update logstash-output-opensearch

    The OpenSearch output plugin for OpenSearch Serverless uses IAM credentials to authenticate. In this example, we show how to use the file input plugin to read data from a file and ingest it into an OpenSearch Serverless collection.

  4. Create a log file with the following sample data and name it sample.log:
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:25.855Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.155Z","traffic":"heavy","weather_category":"Cloudy","weather":"Heavy Fog"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.255Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.556Z","traffic":"heavy","weather_category":"Cloudy","weather":"Heavy Fog"}
    {"deviceId":2823605996,"fleetRegNo":"IRV82MBYQ1","oilLevel":0.92,"milesTravelled":1.105,"totalFuelUsed":0.01,"carrier":"AOS Van Lines","temperature":14,"tripId":6741375582,"originODC":"ODC Las Vegas","originCountry":"United States","originCity":"Las Vegas","originState":"Nevada","originGeo":"36.16,-115.13","destinationODC":"ODC San Jose","destinationCountry":"United States","destinationCity":"San Jose","destinationState":"California","destinationGeo":"37.33,-121.89","speedInMiles":18,"distanceMiles":382.81,"milesToDestination":381.705,"@timestamp":"2022-11-17T17:11:26.756Z","traffic":"heavy","weather_category":"Cloudy","weather":"Cloudy"}

  5. Create a new file and add the following content, and save the file as logstash-output-opensearch.conf after providing the information about your file path, host, Region, access key, and secret access key:
    input {
       file {
         path => "<path/to/your/sample.log>"
         start_position => "beginning"
       }
    }
    output {
        opensearch {
            ecs_compatibility => disabled
            index => "logstash-sample"
            hosts => "<HOST>:443"
            auth_type => {
                type => 'aws_iam'
                aws_access_key_id => '<AWS_ACCESS_KEY_ID>'
                aws_secret_access_key => '<AWS_SECRET_ACCESS_KEY>'
                region => '<REGION>'
                service_name => 'aoss'
                }
            legacy_template => false
            default_server_major_version => 2
        }
    }

  6. Use the following command to start Logstash with the config file created in the previous step. This creates an index called logstash-sample and ingests the documents added to the sample.log file:
    ./bin/logstash -f <path/to/your/config/file>

  7. Search using OpenSearch Dashboards by running the following query:
    GET logstash-sample/_search
    {
      "query": {
        "match_all": {}
      },
      "track_total_hits" : true
    }

In this step, you used a file input plugin from Logstash to send data to OpenSearch Serverless. You can replace the input plugin with any other plugin supported by Logstash, such as Amazon Simple Storage Service (Amazon S3), stdin, tcp, or others, to send data to the OpenSearch Serverless collection.
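
For example, a hypothetical s3 input that reads log files from a bucket might look like the following sketch; the bucket, prefix, and Region values are placeholders, and the output section remains the same as in the earlier configuration:

input {
   s3 {
     bucket => "<your-log-bucket>"
     prefix => "app-logs/"
     region => "<REGION>"
   }
}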

Using a Python client

OpenSearch provides high-level clients for several popular programming languages, which you can use to integrate with your application. With OpenSearch Serverless, you can continue to use your existing OpenSearch client to load and query your data in collections.

In this section, we show how to use the opensearch-py client for Python to establish a secure connection with your OpenSearch Serverless collection, create an index, send sample logs, and analyze that log data using OpenSearch Dashboards. In this example, we use a sample event generated from fleets carrying goods and packages. This data contains pertinent fields such as source, destination, weather, speed, and traffic. The following is a sample record:

"_source" : {
    "deviceId" : 2823605996,
    "fleetRegNo" : "IRV82MBYQ1",
    "carrier" : "AOS Van Lines",
    "temperature" : 14,
    "tripId" : 6741375582,
    "originODC" : "ODC Las Vegas",
    "originCountry" : "United States",
    "originCity" : "Las Vegas",
    "destinationCity" : "San Jose",
    "@timestamp" : "2022-11-17T17:11:25.855Z",
    "traffic" : "heavy",
    "weather" : "Cloudy"
    ...
    ...
}

To set up the Python client for OpenSearch, you must have the following prerequisites:

  • Python3 installed on your local machine or the server from where you are running this code
  • Package Installer for Python (PIP) installed
  • The AWS CLI configured; we use it to store the secret key and access key for credentials

Complete the following steps to set up the Python client:

  1. Add the OpenSearch Python client to your project and use Python’s virtual environment to set up the required packages:
    mkdir python-sample
    cd python-sample
    python3 -m venv .env
    source .env/bin/activate
    .env/bin/python3 -m pip install opensearch-py
    .env/bin/python3 -m pip install requests_aws4auth
    .env/bin/python3 -m pip install boto3
    .env/bin/python3 -m pip install geopy

  2. Save your frequently used configuration settings and credentials in files that are maintained by the AWS CLI (see Quick configuration with aws configure) by using the following commands and providing your access key, secret key, and Region:
    aws configure

  3. The following sample code uses the opensearch-py client for Python to establish a secure connection to the specified OpenSearch Serverless collection and index a sample time series document. You must provide values for region and host. Note that you must use aoss as the service name for OpenSearch Serverless. Copy the code and save it in a file named sample_python.py:
    from opensearchpy import OpenSearch, RequestsHttpConnection
    from requests_aws4auth import AWS4Auth
    import boto3
    
    host = '<host>' # OpenSearch Serverless collection endpoint
    region = '<region>' # e.g. us-west-2
    
    service = 'aoss'
    credentials = boto3.Session().get_credentials()
    awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service,
    session_token=credentials.token)
    
    # Create an OpenSearch client
    client = OpenSearch(
        hosts = [{'host': host, 'port': 443}],
        http_auth = awsauth,
        use_ssl = True,
        verify_certs = True,
        connection_class = RequestsHttpConnection
    )
    # Specify index name
    index_name = 'octank-iot-logs-2022-11-19'
    
    # Prepare a document to index 
    document = {
        "deviceId" : 2823605996,
        "fleetRegNo" : "IRV82MBYQ1",
        "carrier" : "AOS Van Lines",
        "temperature" : 14,
        "tripId" : 6741375582,
        "originODC" : "ODC Las Vegas",
        "originCountry" : "United States",
        "originCity" : "Las Vegas",
        "destinationCity" : "San Jose",
        "@timestamp" : "2022-11-19T17:11:25.855Z",
        "traffic" : "heavy",
        "weather" : "Cloudy"
    }
    
    # Index Documents
    response = client.index(
        index = index_name,
        body = document
    )
    
    print('\n Document indexed with response:')
    print(response)
    
    
    # Search for the Documents
    q = 'heavy'
    query = {
        'size': 5,
            'query': {
            'multi_match': {
            'query': q,
            'fields': ['traffic']
            }
        }
    }
    
    response = client.search(
    body = query,
    index = index_name
    )
    print('\nSearch results:')
    print(response)

  4. Run the sample code:
    python3 sample_python.py

  5. On the OpenSearch Service console, select your collection.
  6. On OpenSearch Dashboards, choose Dev Tools.
  7. Run the following search query to retrieve documents:
    GET octank-iot-logs-*/_search
    {
      "query": {
        "match_all": {}
      }
    }
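
If you want to load more than one document at a time, you can batch them with the bulk helper from opensearch-py. The following sketch reuses the client and index_name variables from sample_python.py; the documents themselves are placeholders:

from opensearchpy import helpers

# Placeholder documents; `client` and `index_name` come from sample_python.py
actions = [
    {"_index": index_name, "_source": {"deviceId": 2823605996, "traffic": "light", "@timestamp": "2022-11-19T17:20:00.000Z"}},
    {"_index": index_name, "_source": {"deviceId": 2823605997, "traffic": "heavy", "@timestamp": "2022-11-19T17:21:00.000Z"}},
]

# Returns the number of successfully indexed documents and a list of any errors
success, errors = helpers.bulk(client, actions)
print(f"Indexed {success} documents, errors: {errors}")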

After you have ingested the data, you can use OpenSearch Dashboards to visualize your data. In the following example, we analyze data visually to gain insights on various dimensions such as average fuel consumed by a specific fleet, traffic conditions, distance traveled, and average mileage by the fleet.

Conclusion

In this post, you created a log analytics pipeline using OpenSearch Serverless, a new serverless option for OpenSearch Service. With OpenSearch Serverless, you can focus on building your application without having to worry about provisioning, tuning, and scaling the underlying infrastructure. OpenSearch Serverless supports the same ingestion pipelines and high-level clients as the open-source OpenSearch project. You can easily get started using the familiar OpenSearch indexing and query APIs to load and search your data and use OpenSearch Dashboards to visualize that data.

Stay tuned for a series of posts focusing on the various options available for you to build effective log analytics and search applications. Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and build a similar log analytics pipeline that was discussed in this post.


About the authors

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Pavani Baddepudi is a senior product manager working in search services at AWS. Her interests include distributed systems, networking, and security.

New for Amazon Redshift – Simplify Data Ingestion and Make Your Data Warehouse More Secure and Reliable

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-simplify-data-ingestion-and-make-your-data-warehouse-more-secure-and-reliable/

When we talk with customers, we hear that they want to be able to harness insights from data in order to make timely, impactful, and actionable business decisions. A common pattern with data-driven organizations is that they have many different data sources they need to ingest into their analytics systems. This requires them to build manual data pipelines spanning across their operational databases, data lakes, streaming data, and data within their warehouse. As a consequence of this complex setup, it can take data engineers weeks or even months to build data ingestion pipelines. These data pipelines are costly, and the delays can lead to missed business opportunities. Additionally, data warehouses are increasingly becoming mission critical systems that require high availability, reliability, and security.

Amazon Redshift is a fully managed petabyte-scale data warehouse used by tens of thousands of customers to easily, quickly, securely, and cost-effectively analyze all their data at any scale. This year at re:Invent, Amazon Redshift has announced a number of features to help you simplify data ingestion and get to insights easily and quickly, within a secure, reliable environment.

In this blog, I introduce some of these new features that fit into two main categories:

  • Simplify data ingestion
    • Amazon Redshift now supports auto-copy from Amazon S3 (available in preview). With this new capability, Amazon Redshift automatically loads the files that arrive in an Amazon Simple Storage Service (Amazon S3) location that you specify into your data warehouse. The files can use any of the formats supported by the Amazon Redshift copy command, such as CSV, JSON, Parquet, and Avro. In this way, you don’t need to manually or repeatedly run copy procedures. Amazon Redshift automates file ingestion and takes care of data-loading steps under the hood.
    • With Amazon Aurora zero-ETL integration with Amazon Redshift, you can use Amazon Redshift for near real-time analytics and machine learning on petabytes of transactional data stored on Amazon Aurora MySQL databases (available in limited preview). With this capability, you can choose the Amazon Aurora databases containing the data you want to analyze with Amazon Redshift. Data is then replicated into your data warehouse within seconds after transactional data is written into Amazon Aurora, eliminating the need to build and maintain complex data pipelines. You can replicate data from multiple Amazon Aurora databases into the same Amazon Redshift instance to run analytics across multiple applications. With near real-time access to transactional data, you can leverage Amazon Redshift’s analytics and capabilities, such as built-in machine learning (ML), materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data.
    • With the general availability of Amazon Redshift Streaming Ingestion, you can now natively ingest hundreds of megabytes of data per second from Amazon Kinesis Data Streams and Amazon MSK into an Amazon Redshift materialized view and query it in seconds (see the sketch after this list). Learn more in this post.
  • Make your data warehouse more secure and reliable
    • You can now improve the availability of your data warehouse by choosing multiple Availability Zone (AZ) deployments. Multi-AZ deployments for your Amazon Redshift clusters are available in preview and reduce recovery times to seconds through automatic recovery. In this way, you can build solutions that are more compliant with the recommendations of the Reliability Pillar of the AWS Well-Architected Framework.
    • With dynamic data masking (available in preview), you can protect sensitive information stored in your data warehouse and ensure that only the relevant data is accessible by users based on their roles. You can limit how much identifiable data is visible to users using multiple levels of policies so different users and groups can have different levels of data access without having to create multiple copies of data. Dynamic data masking complements other granular access control capabilities in Amazon Redshift including row-level and column-level security and role-based access controls. In this way, Dynamic Data Masking helps you meet requirements for GDPR, CCPA, and other privacy regulations.
    • Amazon Redshift now supports central access controls for data sharing with AWS Lake Formation (available in public preview). You can now use Lake Formation to simplify governance of data shared from Amazon Redshift and centrally manage granular access across all data-sharing consumers.
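
As an illustration of what streaming ingestion looks like in SQL, the following sketch maps a Kinesis data stream into Amazon Redshift and materializes it for querying. The IAM role ARN, stream name, and view name are placeholders, and the column handling may differ for your data:

CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-streaming-role>';

CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       partition_key,
       shard_id,
       sequence_number,
       JSON_PARSE(kinesis_data) AS payload
FROM kds."<your-stream-name>";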

There have been other interesting news for Amazon Redshift at re:Invent you might have already heard about:

  • The general availability of Amazon Redshift integration for Apache Spark makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, opening up the data warehouse for a broader set of AWS analytics and machine learning solutions.
  • AWS Backup now supports Amazon Redshift. AWS Backup allows you to define a central backup policy to manage data protection of your applications and can also protect your Amazon Redshift clusters. In this way, you have a consistent experience when managing data protection across all supported services.

Availability and Pricing
Multi-AZ deployments, central access control for data sharing with AWS Lake Formation, auto-copy from Amazon S3, and dynamic data masking are available in preview in US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm).

There is no additional cost for using auto-copy from Amazon S3 and near real-time analytics on transactional data. There is no extra charge for dynamic data masking and central access control for data sharing. For more information, see Amazon Redshift pricing.

These new capabilities take you one step further in analyzing all your data across data sources with simple data ingestion capabilities, while improving the security and reliability of your data warehouse.

Danilo

New — Amazon Athena for Apache Spark

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/new-amazon-athena-for-apache-spark/

When Jeff Barr first announced Amazon Athena in 2016, it changed my perspective on interacting with data. With Amazon Athena, I can interact with my data in just a few steps—starting from creating a table in Athena, loading data using connectors, and querying using the ANSI SQL standard.

Over time, various industries, such as financial services, healthcare, and retail, have needed to run more complex analyses for a variety of formats and sizes of data. To facilitate complex data analysis, organizations adopted Apache Spark. Apache Spark is a popular, open-source, distributed processing system designed to run fast analytics workloads for data of any size.

However, building the infrastructure to run Apache Spark for interactive applications is not easy. Customers need to provision, configure, and maintain the infrastructure on top of the applications, not to mention tune resources optimally to avoid slow application starts and idle costs.

Introducing Amazon Athena for Apache Spark
Today, I’m pleased to announce Amazon Athena for Apache Spark. With this feature, we can run Apache Spark workloads, use Jupyter Notebook as the interface to perform data processing on Athena, and programmatically interact with Spark applications using Athena APIs. We can start Apache Spark in under a second without having to manually provision the infrastructure.

Here’s a quick preview:

Quick preview of Amazon Athena for Apache Spark

How It Works
Because Amazon Athena for Apache Spark runs serverless, customers can perform interactive data exploration and gain insights without the need to provision and maintain resources to run Apache Spark. With this feature, customers can now build Apache Spark applications using the notebook experience directly from the Athena console or programmatically using APIs.

The following figure explains how this feature works:

How Amazon Athena for Apache Spark works

On the Athena console, you can now create and run Jupyter notebooks and run Spark applications with Python. In these notebooks, customers can query data from various sources and perform multiple calculations and data visualizations using Spark applications without context switching.

Amazon Athena integrates with AWS Glue Data Catalog, which helps customers work with any data source in AWS Glue Data Catalog, including data in Amazon S3. This opens up possibilities for customers to build applications that analyze and visualize data, and to explore and prepare data sets for machine learning pipelines.

As I demonstrated in the preview section, the workgroup running the Apache Spark engine initializes in under a second to provision resources for interactive workloads. To make this possible, Amazon Athena for Apache Spark uses Firecracker, a lightweight micro-virtual machine technology, which allows for instant startup times and eliminates the need to maintain warm pools of resources. This benefits customers who want to perform interactive data exploration to get insights without having to prepare resources to run Apache Spark.

Get Started with Amazon Athena for Apache Spark
Let’s see how we can use Amazon Athena for Apache Spark. In this post, I will explain step-by-step how to get started with this feature.

The first step is to create a workgroup. In the context of Athena, a workgroup helps us to separate workloads between users and applications.

To create a workgroup, from the Athena dashboard, select Create Workgroup.

Select Create Workgroup

On the next page, I give the name and description for this workgroup.

Creating a workgroup

On the same page, I can choose Apache Spark as the engine for Athena. In addition, I also need to specify a service role with appropriate permissions to be used inside a Jupyter notebook. Then, I check Turn on example notebook, which makes it easy for me to get started with Apache Spark inside Athena. I also have the option to encrypt Jupyter notebooks managed by Athena or use the key I have configured in AWS Key Management Service (AWS KMS).

After that, I need to define an Amazon Simple Storage Service (Amazon S3) bucket to store calculation results from the Jupyter notebook. Once I’m sure of all the configurations for this workgroup, I just have to select Create workgroup.

Configure Calculation Results Settings

Now, I can see the workgroup already created in Athena.

Select newly created workgroup

To see the details of this workgroup, I can select the link from the workgroup. Since I also checked the Turn on example notebook when creating this workgroup, I have a Jupyter notebook to help me get started. Amazon Athena also provides flexibility for me to import existing notebooks that I can upload from my laptop with Import file or create new notebooks from scratch by selecting Create notebook.

Example notebook is available in the workgroup

When I select the Jupyter notebook example, I can start building my Apache Spark application.

When I run a Jupyter notebook, it automatically creates a session in the workgroup. Subsequently, each time I run a calculation inside the Jupyter notebook, all results are recorded in the session. This way, Athena provides me with full information to review each calculation by selecting Calculation ID, which takes me to the Calculation details page. Here, I can review the Code and also the Results for the calculation.

Review code and results of a calculation

In the session, I can adjust the Coordinator size and Executor size, with 1 data processing unit (DPU) by default. A DPU consists of 4 vCPU and 16 GB of RAM. Changing to a larger DPU allows me to process tasks faster if I have complex calculations.

Configuring session parameters

Programmatic API Access
In addition to using the Athena console, I can also use programmatic access to interact with the Spark application inside Athena. For example, I can create a workgroup with the create-work-group command, start a notebook with create-notebook, and run a notebook session with start-session.

Using programmatic access is useful when I need to execute commands such as building reports or computing data without having to open the Jupyter notebook.

With my Jupyter notebook that I’ve created before, I can start a session by running the following command with the AWS CLI:

$> aws athena start-session \
    --work-group <WORKGROUP_NAME> \
    --engine-configuration '{"CoordinatorDpuSize": 1, "MaxConcurrentDpus":20, "DefaultExecutorDpuSize": 1, "AdditionalConfigs":{"NotebookId":"<NOTEBOOK_ID>"}}' \
    --notebook-version "Jupyter 1" \
    --description "Starting session from CLI"

{
    "SessionId":"<SESSION_ID>",
    "State":"CREATED"
}
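
The session takes a short time to become ready before it can accept calculations. One way to check on it is the get-session command; this is a sketch, and the exact state values you see (for example, IDLE once the session is ready) may vary:

$ aws athena get-session --session-id "<SESSION_ID>"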

Then, I can run a calculation using the start-calculation-execution API.

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>" \
    --description "Demo" \
    --code-block "print(5+6)"

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}

In addition to using code inline, with the --code-block flag, I can also pass input from a Python file using the following command:

$ aws athena start-calculation-execution \
    --session-id "<SESSION_ID>" \
    --description "Demo" \
    --code-block file://<PYTHON FILE>

{
    "CalculationExecutionId":"<CALCULATION_EXECUTION_ID>",
    "State":"CREATING"
}
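
Because calculations run asynchronously, I can check on their progress; the following is a minimal sketch using the get-calculation-execution command with the calculation execution ID returned in the previous response:

$ aws athena get-calculation-execution \
    --calculation-execution-id "<CALCULATION_EXECUTION_ID>"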

Pricing and Availability
Amazon Athena for Apache Spark is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland). To use this feature, you are charged based on the amount of compute usage defined by the data processing unit or DPU per hour. For more information see our pricing page here.

To get started with this feature, see Amazon Athena for Apache Spark to learn more from the documentation, understand the pricing, and follow the step-by-step walkthrough.

Happy building,

Donnie