Tag Archives: Advanced (300)

Build an end-to-end change data capture with Amazon MSK Connect and AWS Glue Schema Registry

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/build-an-end-to-end-change-data-capture-with-amazon-msk-connect-and-aws-glue-schema-registry/

The value of data is time sensitive. Real-time processing makes data-driven decisions accurate and actionable in seconds or minutes instead of hours or days. Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to a downstream system. Capturing every change from transactions in a source database and moving them to the target in real time keeps the systems synchronized, and helps with real-time analytics use cases and zero-downtime database migrations. The following are a few benefits of CDC:

  • It eliminates the need for bulk load updating and inconvenient batch windows by enabling incremental loading or real-time streaming of data changes into your target repository.
  • It ensures that data in multiple systems stays in sync. This is especially important if you’re making time-sensitive decisions in a high-velocity data environment.

Kafka Connect is an open-source component of Apache Kafka that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems. The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. Kafka Connect and Schema Registry integrate to capture schema information from connectors. Kafka Connect provides a mechanism for converting data from the internal data types used by Kafka Connect to data types represented as Avro, Protobuf, or JSON Schema. AvroConverter, ProtobufConverter, and JsonSchemaConverter automatically register schemas generated by Kafka connectors (source) that produce data to Kafka. Connectors (sink) that consume data from Kafka receive schema information in addition to the data for each message. This allows sink connectors to know the structure of the data to provide capabilities like maintaining a database table schema in a data catalog.

This post demonstrates how to build an end-to-end CDC pipeline using Amazon MSK Connect, an AWS managed service to deploy and run Kafka Connect applications, and the AWS Glue Schema Registry, which allows you to centrally discover, control, and evolve data stream schemas.

Solution overview

On the producer side, for this example we choose a MySQL-compatible Amazon Aurora database as the data source, and we have a Debezium MySQL connector to perform CDC. The Debezium connector continuously monitors the databases and pushes row-level changes to a Kafka topic. The connector fetches the schema from the database to serialize the records into a binary form. If the schema doesn’t already exist in the registry, the schema will be registered. If the schema exists but the serializer is using a new version, the schema registry checks the compatibility mode of the schema before updating the schema. In this solution, we use backward compatibility mode. The schema registry returns an error if a new version of the schema is not backward compatible, and we can configure Kafka Connect to send incompatible messages to the dead-letter queue.

On the consumer side, we use an Amazon Simple Storage Service (Amazon S3) sink connector to deserialize the record and store changes to Amazon S3. We build and deploy the Debezium connector and the Amazon S3 sink using MSK Connect.

Example schema

For this post, we use the following schema as the first version of the table:

{
    "Database Name": "sampledatabase",
    "Table Name": "movies",
    "Fields": [
        {
            "name": "movie_id",
            "type": "INTEGER"
        },
        {
            "name": "title",
            "type": "STRING"
        },
        {
            "name": "release_year",
            "type": "INTEGER"
        }
    ]
}

Prerequisites

Before configuring the MSK producer and consumer connectors, we need to first set up a data source, MSK cluster, and new schema registry. We provide an AWS CloudFormation template to generate the supporting resources needed for the solution:

  • A MySQL-compatible Aurora database as the data source. To perform CDC, we turn on binary logging in the DB cluster parameter group.
  • An MSK cluster. To simplify the network connection, we use the same VPC for the Aurora database and the MSK cluster.
  • Two schema registries to handle schemas for message key and message value.
  • One S3 bucket as the data sink.
  • MSK Connect plugins and worker configuration needed for this demo.
  • One Amazon Elastic Compute Cloud (Amazon EC2) instance to run database commands.

To set up resources in your AWS account, complete the following steps in an AWS Region that supports Amazon MSK, MSK Connect, and the AWS Glue Schema Registry:

  1. Choose Launch Stack:
  2. Choose Next.
  3. For Stack name, enter a suitable name.
  4. For Database Password, enter the password you want for the database user.
  5. Keep other values as default.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

Custom plugin for the source and destination connector

A custom plugin is a set of JAR files that contain the implementation of one or more connectors, transforms, or converters. Amazon MSK installs the plugin on the workers of the MSK Connect cluster where the connector is running. As part of this demo, for the source connector we use the open-source Debezium MySQL connector JARs, and for the destination connector we use the Confluent community licensed Amazon S3 sink connector JARs. Both plugins are also packaged with the Avro serializer and deserializer libraries of the AWS Glue Schema Registry. These custom plugins were already created as part of the CloudFormation template deployed in the previous step.

Use the AWS Glue Schema Registry with the Debezium connector on MSK Connect as the MSK producer

We first deploy the source connector using the Debezium MySQL plugin to stream data from an Amazon Aurora MySQL-Compatible Edition database to Amazon MSK. Complete the following steps:

  1. On the Amazon MSK console, in the navigation pane, under MSK Connect, choose Connectors.
  2. Choose Create connector.
  3. Choose Use existing custom plugin and then pick the custom plugin with name starting msk-blog-debezium-source-plugin.
  4. Choose Next.
  5. Enter a suitable name like debezium-mysql-connector and an optional description.
  6. For Apache Kafka cluster, choose MSK cluster and choose the cluster created by the CloudFormation template.
  7. In Connector configuration, delete the default values and use the following configuration key-value pairs with the appropriate values:
    • name – The name used for the connector.
    • database.hostname – The CloudFormation output for Database Endpoint.
    • database.user and database.password – The parameters passed in the CloudFormation template.
    • database.history.kafka.bootstrap.servers – The CloudFormation output for Kafka Bootstrap.
    • key.converter.region and value.converter.region – Your Region.
name=<Connector-name>
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=<DBHOST>
database.port=3306
database.user=<DBUSER>
database.password=<DBPASSWORD>
database.server.id=42
database.server.name=db1
table.whitelist=sampledatabase.movies
database.history.kafka.bootstrap.servers=<MSK-BOOTSTRAP>
database.history.kafka.topic=dbhistory.demo1
key.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
key.converter.region=<REGION>
value.converter.region=<REGION>
key.converter.registry.name=msk-connect-blog-keys
value.converter.registry.name=msk-connect-blog-values
key.converter.compatibility=FORWARD
value.converter.compatibility=FORWARD
key.converter.schemaAutoRegistrationEnabled=true
value.converter.schemaAutoRegistrationEnabled=true
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
transforms.unwrap.delete.handling.mode=rewrite
transforms.unwrap.add.fields=op,source.ts_ms
tasks.max=1

Some of these settings are generic and should be specified for any connector. For example:

  • connector.class is the Java class of the connector
  • tasks.max is the maximum number of tasks that should be created for this connector

Some settings (database.*, transforms.*) are specific to the Debezium MySQL connector. Refer to Debezium MySQL Source Connector Configuration Properties for more information.

Some settings (key.converter.* and value.converter.*) are specific to the Schema Registry. We use the AWSKafkaAvroConverter from the AWS Glue Schema Registry Library as the format converter. To configure AWSKafkaAvroConverter, we use the value of the string constant properties in the AWSSchemaRegistryConstants class:

  • key.converter and value.converter control the format of the data that will be written to Kafka for source connectors or read from Kafka for sink connectors. We use AWSKafkaAvroConverter for Avro format.
  • key.converter.registry.name and value.converter.registry.name define which schema registry to use.
  • key.converter.compatibility and value.converter.compatibility define the compatibility model.

Refer to Using Kafka Connect with AWS Glue Schema Registry for more information.

  8. Next, configure Connector capacity. You can choose Provisioned and leave the other properties at their defaults.
  9. For Worker configuration, choose the custom worker configuration with the name starting with msk-gsr-blog created as part of the CloudFormation template.
  10. For Access permissions, use the AWS Identity and Access Management (IAM) role generated by the CloudFormation template, MSKConnectRole.
  11. Choose Next.
  12. For Security, choose the defaults.
  13. Choose Next.
  14. For Log delivery, select Deliver to Amazon CloudWatch Logs and browse for the log group created by the CloudFormation template (msk-connector-logs).
  15. Choose Next.
  16. Review the settings and choose Create connector.

After a few minutes, the connector changes to running status.

Use the AWS Glue Schema Registry with the Confluent S3 sink connector running on MSK Connect as the MSK consumer

We deploy the sink connector using the Confluent S3 sink plugin to stream data from Amazon MSK to Amazon S3. Complete the following steps:

    1. On the Amazon MSK console, in the navigation pane, under MSK Connect, choose Connectors.
    2. Choose Create connector.
    3. Choose Use existing custom plugin and choose the custom plugin with name starting msk-blog-S3sink-plugin.
    4. Choose Next.
    5. Enter a suitable name like s3-sink-connector and an optional description.
    6. For Apache Kafka cluster, choose MSK cluster and select the cluster created by the CloudFormation template.
    7. In Connector configuration, delete the default values provided and use the following configuration key-value pairs with appropriate values:
        • name – The same name used for the connector.
        • s3.bucket.name – The CloudFormation output for Bucket Name.
        • s3.region, key.converter.region, and value.converter.region – Your Region.
name=<CONNECTOR-NAME>
connector.class=io.confluent.connect.s3.S3SinkConnector
s3.bucket.name=<BUCKET-NAME>
key.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
s3.region=<REGION>
storage.class=io.confluent.connect.s3.storage.S3Storage
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
flush.size=10
tasks.max=1
key.converter.schemaAutoRegistrationEnabled=true
value.converter.schemaAutoRegistrationEnabled=true
key.converter.region=<REGION>
value.converter.region=<REGION>
value.converter.avroRecordType=GENERIC_RECORD
key.converter.avroRecordType=GENERIC_RECORD
value.converter.compatibility=NONE
key.converter.compatibility=NONE
store.kafka.keys=false
schema.compatibility=NONE
topics=db1.sampledatabase.movies
value.converter.registry.name=msk-connect-blog-values
key.converter.registry.name=msk-connect-blog-keys
store.kafka.headers=false
  8. Next, configure Connector capacity. You can choose Provisioned and leave the other properties at their defaults.
  9. For Worker configuration, choose the custom worker configuration with the name starting with msk-gsr-blog created as part of the CloudFormation template.
  10. For Access permissions, use the IAM role generated by the CloudFormation template, MSKConnectRole.
  11. Choose Next.
  12. For Security, choose the defaults.
  13. Choose Next.
  14. For Log delivery, select Deliver to Amazon CloudWatch Logs and browse for the log group created by the CloudFormation template (msk-connector-logs).
  15. Choose Next.
  16. Review the settings and choose Create connector.

After a few minutes, the connector is running.

Test the end-to-end CDC log stream

Now that both the Debezium and S3 sink connectors are up and running, complete the following steps to test the end-to-end CDC:

  1. On the Amazon EC2 console, navigate to the Security groups page.
  2. Select the security group ClientInstanceSecurityGroup and choose Edit inbound rules.
  3. Add an inbound rule allowing SSH connection from your local network.
  4. On the Instances page, select the instance ClientInstance and choose Connect.
  5. On the EC2 Instance Connect tab, choose Connect.
  6. Ensure your current working directory is /home/ec2-user and it has the files create_table.sql, alter_table.sql, initial_insert.sql, and insert_data_with_new_column.sql.
  7. Create a table in your MySQL database by running the following command (provide the database host name from the CloudFormation template outputs):
mysql -h <DATABASE-HOST> -u master -p < create_table.sql
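The contents of create_table.sql aren't listed in this post. A minimal sketch consistent with the example schema defined earlier (movie_id, title, and release_year) might look like the following; the column types, lengths, and key definition here are assumptions, and the script provided by the CloudFormation template may differ:

-- Hypothetical sketch of create_table.sql, based on the example schema in this post
CREATE DATABASE IF NOT EXISTS sampledatabase;

CREATE TABLE sampledatabase.movies (
    movie_id     INT NOT NULL PRIMARY KEY,  -- assumed primary key for illustration
    title        VARCHAR(255),              -- assumed length
    release_year INT
);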
  8. When prompted for a password, enter the password from the CloudFormation template parameters.
  9. Insert some sample data into the table with the following command:
mysql -h <DATABASE-HOST> -u master -p < initial_insert.sql
  10. When prompted for a password, enter the password from the CloudFormation template parameters.
  11. On the AWS Glue console, choose Schema registries in the navigation pane, then choose Schemas.
  12. Navigate to db1.sampledatabase.movies version 1 to check the new schema created for the movies table:
{
  "type": "record",
  "name": "Value",
  "namespace": "db1.sampledatabase.movies",
  "fields": [
    {
      "name": "movie_id",
      "type": "int"
    },
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "release_year",
      "type": "int"
    },
    {
      "name": "__op",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "__source_ts_ms",
      "type": [
        "null",
        "long"
      ],
      "default": null
    },
    {
      "name": "__deleted",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ],
  "connect.name": "db1.sampledatabase.movies.Value"
}

A separate S3 folder is created for each partition of the Kafka topic, and data for the topic is written in that folder.

  13. On the Amazon S3 console, check for data written in Parquet format in the folder for your Kafka topic.

Schema evolution

After the initial schema is defined, applications may need to evolve it over time. When this happens, it’s critical for the downstream consumers to be able to handle data encoded with both the old and the new schema seamlessly. Compatibility modes allow you to control how schemas can or can’t evolve over time. These modes form the contract between applications producing and consuming data. For detailed information about different compatibility modes available in the AWS Glue Schema Registry, refer to AWS Glue Schema Registry. In our example, we use backward compatibility to ensure consumers can read both the current and previous schema versions. Complete the following steps:

  1. Add a new column to the table by running the following command:
mysql -h <DATABASE-HOST> -u master -p < alter_table.sql
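The contents of alter_table.sql aren't shown in the post either; based on the version 2 schema that appears later in this section, it adds a COUNTRY column to the movies table. A minimal sketch, with an assumed data type and length, might be:

-- Hypothetical sketch of alter_table.sql; the VARCHAR length is an assumption
ALTER TABLE sampledatabase.movies ADD COLUMN COUNTRY VARCHAR(100);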
  2. Insert new data into the table by running the following command:
mysql -h <DATABASE-HOST> -u master -p < insert_data_with_new_column.sql
  3. On the AWS Glue console, choose Schema registries in the navigation pane, then choose Schemas.
  4. Navigate to the schema db1.sampledatabase.movies version 2 to check the new version of the schema created for the movies table, including the COUNTRY column that you added:
{
  "type": "record",
  "name": "Value",
  "namespace": "db1.sampledatabase.movies",
  "fields": [
    {
      "name": "movie_id",
      "type": "int"
    },
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "release_year",
      "type": "int"
    },
    {
      "name": "COUNTRY",
      "type": "string"
    },
    {
      "name": "__op",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "__source_ts_ms",
      "type": [
        "null",
        "long"
      ],
      "default": null
    },
    {
      "name": "__deleted",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ],
  "connect.name": "db1.sampledatabase.movies.Value"
}
  5. On the Amazon S3 console, check for data written in Parquet format in the folder for the Kafka topic.

Clean up

To help prevent unwanted charges to your AWS account, delete the AWS resources that you used in this post:

  1. On the Amazon S3 console, navigate to the S3 bucket created by the CloudFormation template.
  2. Select all files and folders and choose Delete.
  3. Enter permanently delete as directed and choose Delete objects.
  4. On the AWS CloudFormation console, delete the stack you created.
  5. Wait for the stack status to change to DELETE_COMPLETE.

Conclusion

This post demonstrated how to use Amazon MSK, MSK Connect, and the AWS Glue Schema Registry to build a CDC log stream and evolve schemas for data streams as business needs change. You can apply this architecture pattern to other data sources with different Kafka connectors. For more information, refer to the MSK Connect examples.


About the Author

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Improve productivity by using keyboard shortcuts in Amazon Athena query editor

Post Syndicated from Naresh Gautam original https://aws.amazon.com/blogs/big-data/improve-productivity-by-using-keyboard-shortcuts-in-amazon-athena-query-editor/

Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. You can analyze data or build applications from an Amazon Simple Storage Service (Amazon S3) data lake and over 25 data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena is built on open-source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.

Different types of users rely on Athena, including business analysts, data scientists, security, and operations engineers. Athena provides a query editor to enter and run queries on data using structured query language (SQL). The query editor provides features like run, cancel, and save queries or statements. Additionally, it provides keyboard shortcuts for user-friendly operation.

This post discusses the keyboard shortcuts available and how you can use them.

Accessing the Athena console

If you’re new to Athena and don’t know how to access the Athena console and run queries and statements, refer to the following getting started tutorial. This tutorial walks you through using Athena to query data. You’ll create a table based on sample data stored in Amazon S3, query the table, and check the results of the query.

Keyboard shortcuts

The query editor provides keyboard shortcuts for different action types like running a query, formatting a query, line operations, selection, multi-cursor, go to, find/replace, and folding. Compared to reaching for the mouse or navigating a menu, a single keyboard shortcut saves a moment of your time.

With keyboard shortcuts, you can use key combinations to edit your SQL statement without using a mouse. For example, you can use multiple cursors in your editing window for selecting all instances of text you wish to edit, and edit your text, fold or unfold selected text, find and replace text, and perform line operations like remove line, move lines, and more.

You can also find these keyboard shortcuts on the query editor on the bottom right corner, as highlighted in the following screenshot.

The following table shows the keyboard shortcuts for Windows/Linux and Mac.

Action Type Action Windows/Linux Mac
Other Execute query Ctrl-Enter Cmd-Enter, Ctrl-Enter
Other Format query Ctrl-Alt-L Opt-Cmd-L
Other Previous query Ctrl-Up Ctrl-Shift-Up
Other Next query Ctrl-Down Ctrl-Shift-Down
Other Close tab Alt-X Opt-X
Other Previous tab Ctrl-, Ctrl-,
Other Next tab Ctrl-. Ctrl-.
Other Indent Tab Tab
Other Outdent Shift-Tab Shift-Tab
Other Save Ctrl-S Cmd-S
Other Undo Ctrl-Z Cmd-Z
Other Redo Ctrl-Shift-Z, Ctrl-Y Cmd-Shift-Z, Cmd-Y
Other Toggle comment Ctrl-/ Cmd-/
Other Transpose letters Ctrl-T Ctrl-T
Other Change to lower case Ctrl-Shift-U Ctrl-Shift-U
Other Change to upper case Ctrl-U Ctrl-U
Other Overwrite Insert Insert
Other Delete Delete
Line Operations Remove line Ctrl-D Cmd-D
Line Operations Copy lines down Alt-Shift-Down Cmd-Opt-Down
Line Operations Copy lines up Alt-Shift-Up Cmd-Opt-Up
Line Operations Move lines down Alt-Down Opt-Down
Line Operations Move lines up Alt-Up Opt-Up
Line Operations Remove to line end Alt-Delete Ctrl-K
Line Operations Remove to line start Alt-Backspace Cmd-Backspace
Line Operations Remove word left Ctrl-Backspace Opt-Backspace, Ctrl-Opt-Backspace
Line Operations Remove word right Ctrl-Delete Opt-Delete
Line Operations Split line Ctrl-O
Selection Select all Ctrl-A Cmd-A
Selection Select left Shift-Left Shift-Left
Selection Select right Shift-Right Shift-Right
Selection Select word left Ctrl-Shift-Left Opt-Shift-Left
Selection Select word right Ctrl-Shift-Right Opt-Shift-Right
Selection Select line start Shift-Home Shift-Home
Selection Select line end Shift-End Shift-End
Selection Select to line end Alt-Shift-Right Cmd-Shift-Right
Selection Select to line start Alt-Shift-Left Cmd-Shift-Left
Selection Select up Shift-Up Shift-Up
Selection Select down Shift-Down Shift-Down
Selection Select page up Shift-PageUp Shift-PageUp
Selection Select page down Shift-PageDown Shift-PageDown
Selection Select to start Ctrl-Shift-Home Cmd-Shift-Up
Selection Select to end Ctrl-Shift-End Cmd-Shift-Down
Selection Duplicate selection Ctrl-Shift-D Cmd-Shift-D
Selection Select to matching bracket Ctrl-Shift-P
Multicursor Add multi-cursor above Ctrl-Alt-Up Ctrl-Opt-Up
Multicursor Add multi-cursor below Ctrl-Alt-Down Ctrl-Opt-Down
Multicursor Add next occurrence to multi-selection Ctrl-Alt-Right Ctrl-Opt-Right
Multicursor Add previous occurrence to multi-selection Ctrl-Alt-Left Ctrl-Opt-Left
Multicursor Move multi-cursor from current line to the line above Ctrl-Alt-Shift-Up Ctrl-Opt-Shift-Up
Multicursor Move multi-cursor from current line to the line below Ctrl-Alt-Shift-Down Ctrl-Opt-Shift-Down
Multicursor Remove current occurrence from multi-selection and move to next Ctrl-Alt-Shift-Right Ctrl-Opt-Shift-Right
Multicursor Remove current occurrence from multi-selection and move to previous Ctrl-Alt-Shift-Left Ctrl-Opt-Shift-Left
Multicursor Select all from multi-selection Ctrl-Shift-L Ctrl-Shift-L
Go to Go to left Left Left, Ctrl-B
Go to Go to right Right Right, Ctrl-F
Go to Go to word left Ctrl-Left Opt-Left
Go to Go to word right Ctrl-Right Opt-Right
Go to Go line up Up Up, Ctrl-P
Go to Go line down Down Down, Ctrl-N
Go to Go to line start Alt-Left, Home Cmd-Left, Home, Ctrl-A
Go to Go to line end Alt-Right, End Cmd-Right, End, Ctrl-E
Go to Go to page up PageUp Opt-PageUp
Go to Go to page down PageDown Opt-PageDown, Ctrl-V
Go to Go to start Ctrl-Home Cmd-Home, Cmd-Up
Go to Go to end Ctrl-End Cmd-End, Cmd-Down
Go to Scroll line down Ctrl-Down Cmd-Down
Go to Scroll line up Ctrl-Up
Go to Go to matching bracket Ctrl-P
Go to Scroll page down Opt-PageDown
Go to Scroll page up Opt-PageUp
Find/Replace Find Ctrl-F Cmd-F
Find/Replace Replace Ctrl-H Cmd-Opt-F
Find/Replace Find next Ctrl-K Cmd-G
Find/Replace Find previous Ctrl-Shift-K Cmd-Shift-G
Folding Fold selection Alt-L, Ctrl-F1 Cmd-Opt-L, Cmd-F1
Folding Unfold Alt-Shift-L, Ctrl-Shift-F1 Cmd-Opt-Shift-L, Cmd-Shift-F1
Folding Unfold all Alt-Shift-0 Cmd-Opt-Shift-0
Other Autocomplete Ctrl-Space Ctrl-Space
Other Focus out Esc Esc

For illustration, you can perform the Format query action by using the keyboard shortcut (Ctrl-Alt-L for Windows/Linux, Opt-Cmd-L for Mac). It converts unformatted SQL into well-formatted SQL, as shown in the following screenshots.
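Because the screenshots aren't reproduced here, the following illustrative snippet (a made-up query, not taken from the screenshots) shows the kind of transformation Format query performs; the exact layout produced by the editor may differ slightly:

-- Before formatting
select id, title, release_year from movies where release_year > 2000 order by title;

-- After formatting (Ctrl-Alt-L / Opt-Cmd-L)
SELECT id,
  title,
  release_year
FROM movies
WHERE release_year > 2000
ORDER BY title;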

Similarly, you can try out the Toggle comment command (Ctrl-/ for Windows/Linux, Cmd-/ for Mac) to comment or uncomment lines of SQL in the Athena query editor. This comes in very handy when you want to quickly comment out specific lines in your query, as shown in the following screenshots.
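As a quick illustration with another hypothetical query, toggling a line turns it into a SQL line comment (and toggling again removes the comment marker):

SELECT id, title
-- , release_year
FROM movies;
-- The second line above was toggled to a comment with Ctrl-/ (Cmd-/ on Mac)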

You can do line operations like Remove line, Copy lines down, Copy lines up, and more. The following screenshots show an example of the Remove line action (Ctrl-D for Windows/Linux, Cmd-D for Mac).

You can do a line selection like Select all, Select left, Select line start, and more. The following screenshots show an example of the Select all action (Ctrl-A for Windows/Linux, Cmd-A for Mac).

You can do multi-cursor actions like Add multi-cursor above, Add multi-cursor below, Add next occurrence to multi-selection, Add previous occurrence to multi-selection, Move multi-cursor from current line to the line above, and more. The following example is of the Add multi-cursor above action (Ctrl-Alt-Up for Windows/Linux, Ctrl-Opt-Up for Mac).

You can do go to actions like Go to left, Go to right, Go to word left, and more. The following is an example of the Go to left action (Ctrl-B).

You can do find and replace actions like Find, Replace, Find next, and more. The following is an example of the Replace action (Ctrl-H for Windows/Linux, Cmd-Opt-F for Mac).

You can also do folding actions like Fold selection, Unfold, and Unfold all. The following example is of the Unfold action (Alt-Shift-L or Ctrl-Shift-F1 for Windows/Linux, Cmd-Opt-Shift-L or Cmd-Shift-F1 for Mac).

Conclusion

In this post, we saw how Athena provides an array of native options to help you improve productivity when analyzing your data. You can go to the Athena console and start running SQL statements or querying data using the built-in query editor. The query editor provides keyboard shortcuts that improve your productivity by letting you edit SQL statements with key combinations instead of using a mouse.

If you have any questions or suggestions, please leave a comment.


About the Authors

Naresh Gautam is a Data Analytics and AI/ML leader at AWS with 20 years of experience, who enjoys helping customers architect highly available, high-performance, and cost-effective data analytics and AI/ML solutions to empower customers with data-driven decision-making. In his free time, he enjoys meditation and cooking.

Srikanth Sopirala is a Principal Analytics Specialist Solutions Architect at AWS. He is a seasoned leader with over 20 years of experience, who is passionate about helping customers build scalable data and analytics solutions to gain timely insights and make critical business decisions. In his spare time, he enjoys reading, spending time with his family, and road biking.

Harsh Vardhan is an AWS Solutions Architect, specializing in analytics. He has over 5 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.

How to use policies to restrict where EC2 instance credentials can be used from

Post Syndicated from Liam Wadman original https://aws.amazon.com/blogs/security/how-to-use-policies-to-restrict-where-ec2-instance-credentials-can-be-used-from/

Today AWS launched two new global condition context keys that make it simpler for you to write policies in which Amazon Elastic Compute Cloud (Amazon EC2) instance credentials work only when used on the instance to which they are issued. These new condition keys are available today in all AWS Regions, as well as AWS GovCloud and China partitions.

Using these new condition keys, you can write service control policies (SCPs) or AWS Identity and Access Management (IAM) policies that restrict the virtual private clouds (VPCs) and private IP addresses from which your EC2 instance credentials can be used, without hard-coding VPC IDs or IP addresses in the policy. Previously, you had to list specific VPC IDs and IP addresses in the policy if you wanted to use it to restrict where EC2 credentials were used. With this new approach, you can use less policy space and reduce the time spent on updates when your list of VPCs and network ranges changes.

In this blog post, we will show you how to use these new condition keys in an SCP and a resource policy to help ensure that the IAM role credentials assigned to your EC2 instances can only be used from the instances to which they were issued.

New global condition keys

The two new condition keys are as follows:

  • aws:EC2InstanceSourceVPC — This single-valued condition key contains the VPC ID to which an EC2 instance is deployed.
  • aws:EC2InstanceSourcePrivateIPv4 — This single-valued condition key contains the primary IPv4 address of an EC2 instance.

These new conditions are available only for use with credentials issued to an EC2 instance. You don’t have to make configuration changes to activate the new condition keys.

Let’s start by reviewing some existing IAM conditions and how to combine them with the new conditions. When requests are made to an AWS service over a VPC endpoint, the value of the aws:SourceVpc condition key is the ID of the VPC into which the endpoint is deployed. The value of the aws:VpcSourceIp condition key is the IP address from which the endpoint receives the request. The aws:SourceVpc and aws:VpcSourceIp keys are null when requests are made through AWS public service endpoints. These condition keys relate to dynamic properties of the network path by which your AWS Signature Version 4-signed request reached the API endpoint. For a list of AWS services that support VPC endpoints, see AWS services that integrate with AWS PrivateLink.

The two new condition keys relate to dynamic properties of the EC2 role credential itself. By using the two new credential-relative condition keys with the existing network path-relative aws:SourceVpc and aws:VpcSourceIp condition keys, you can create SCPs to help ensure that credentials for EC2 instances are only used from the EC2 instances to which they were issued. By writing policies that compare the two sets of dynamic values, you can configure your environment such that requests signed with an EC2 instance credential are denied if they are used anywhere other than the EC2 instance to which they were issued.

Policy examples

In the following SCP example, access is denied if the value of aws:SourceVpc is not equal to the value of aws:ec2InstanceSourceVPC, or if the value of aws:VpcSourceIp is not equal to the value of aws:ec2InstanceSourcePrivateIPv4. The policy uses aws:ViaAWSService to allow certain AWS services to take action on your behalf when they use your role’s identity to call services, such as when Amazon Athena queries Amazon Simple Storage Service (Amazon S3).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourceVPC": "${aws:SourceVpc}"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourcePrivateIPv4": "${aws:VpcSourceIp}"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

Because we encase aws:SourceVpc and aws:VpcSourceIp in ${} in these policies, they are treated as policy variables and resolve to the values in the request being made. In the IAM policy language, the condition key on the left side of a comparison is implicitly evaluated against the request context, while a value on the right side must be explicitly declared as a variable with the ${} syntax. The Null operator on the ec2:SourceInstanceARN condition key is designed to ensure that this policy only applies to EC2 instance roles, and not to roles used for other purposes, such as those used in AWS Lambda functions.

The two deny statements in this example form a logical “or” statement, such that either a request from a different VPC or a request from a different IP address results in a deny. But functionally, they act in an “and” fashion. To be allowed, a request must satisfy both the VPC-based and the IP-based conditions, because failure of either denies the call. Because VPC IDs are globally unique values, it’s reasonable to use the VPC-based condition without the private IP condition. However, you should avoid evaluating only the private IP condition without also evaluating the VPC condition. Private IPs can be the same across different environments, so aws:ec2InstanceSourcePrivateIPv4 is safe to use only in conjunction with the VPC-based condition.

Note: SCPs do not grant IAM permissions; they only remove them. Thus, you must permit your EC2 instances to use AWS services by using IAM policies associated with their roles. For more information, see Determining whether a request is allowed or denied within an account.

If you have specific EC2 instance roles that you want to exclude from the statement, you can apply exception logic through tags or role names.

The following example applies to roles used as EC2 instance roles, except those with a tag of exception-to-vpc-ip whose value is equal to true, by using the aws:PrincipalTag condition key. The three condition operators (StringNotEquals, Null, and BoolIfExists) in the same condition block are evaluated with a logical AND operation, and if any one of the tests doesn’t match, the deny statement doesn’t apply. Hence, EC2 instance roles with a principal tag of exception-to-vpc-ip equal to true are not subject to this SCP.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourceVPC": "${aws:SourceVpc}",
          "aws:PrincipalTag/exception-to-vpc-ip": "true"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourcePrivateIPv4": "${aws:VpcSourceIp}",
           "aws:principaltag/exception-to-vpc-ip": "true"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

You can apply exception logic to other attributes of your IAM roles. For example, you can use the aws:PrincipalArn condition key to exempt certain roles based on their AWS account. You can also specify where you want this SCP to be applied in your AWS Organizations organization. You can apply SCPs directly to accounts, organizational units, or organizational roots. For more information about inheritance when applying SCPs in Organizations, see Understanding policy inheritance.

You can also apply exception logic to your SCP statements based on the IAM actions in the request. The following example statement restricts an EC2 instance’s credential usage to only the instance from which it was issued, except for calls to IAM, by using a NotAction element. You should use this exception logic if an AWS service doesn’t have a VPC endpoint, or if you don’t want to use VPC endpoints to access a particular service.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "NotAction": "iam:*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourceVPC": "${aws:SourceVpc}"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    },
    {
      "Effect": "Deny",
      "NotAction": "iam:*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourcePrivateIPv4": "${aws:VpcSourceIp}"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

Because these new condition keys are global condition keys, you can use the keys in all relevant AWS policy types, such as the following policy for an S3 bucket. When using this as a bucket policy, make sure to replace <DOC-EXAMPLE-BUCKET> with the name of your S3 bucket.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "s3:*",
      "Principal": {
        "AWS": "*"
      },
      "Resource": [
        "arn:aws:s3:::<DOC-EXAMPLE-BUCKET>/*",
        "arn:aws:s3::: <DOC-EXAMPLE-BUCKET>"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourceVPC": "${aws:SourceVpc}"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    },
    {
      "Effect": "Deny",
      "Action": "*",
      "Principal": {
        "AWS": "*"
      },
      "Resource": [
        "arn:aws:s3::: <DOC-EXAMPLE-BUCKET> /*",
        "arn:aws:s3::: <DOC-EXAMPLE-BUCKET>"
      ],
      "Condition": {
        "StringNotEquals": {
          "aws:ec2InstanceSourcePrivateIPv4": "${aws:VpcSourceIp}"
        },
        "Null": {
          "ec2:SourceInstanceARN": "false"
        },
        "BoolIfExists": {
          "aws:ViaAWSService": "false"
        }
      }
    }
  ]
}

This policy restricts access to your S3 bucket to EC2 instance roles that are used only from the instance to which they were vended. Like the previous policy examples, there are two deny statements in this example to form a logical “or” statement but a functional “and” statement, because a request must come from the same VPC and same IP address of the instance that it was issued to, or else it evaluates to a deny.

Conclusion

In this blog post, you learned about the newly launched aws:ec2InstanceSourceVPC and aws:ec2InstanceSourcePrivateIPv4 condition keys. You also learned how to use them with SCPs and resource policies to limit the usage of your EC2 instance roles to the instances from which they originated when requests are made over VPC endpoints. Because these new condition keys are global condition keys, you can use them in all relevant AWS policy types. These new condition keys are available today in all Regions, as well as AWS GovCloud and China partitions.

If you have questions, comments, or concerns, contact AWS Support or start a new thread at AWS Identity and Access Management or Compute re:Post.

If you have feedback about this post, submit comments in the Comments section below.

Liam Wadman

Liam is a Solutions Architect with the Identity Solutions team. When he’s not building exciting solutions on AWS or helping customers, he’s often found in the hills of British Columbia on his Mountain Bike. Liam points out that you cannot spell LIAM without IAM.

Joshua Levinson

Joshua is a Senior Product Manager at AWS on the EC2 team. He is passionate about helping customers with highly scalable features on EC2 and across AWS, and enjoys the challenge of building simplified solutions to complex problems. Outside of work, he enjoys cooking, reading with his kids, and Olympic weightlifting.

Use Apache Iceberg in a data lake to support incremental data processing

Post Syndicated from Flora Wu original https://aws.amazon.com/blogs/big-data/use-apache-iceberg-in-a-data-lake-to-support-incremental-data-processing/

Apache Iceberg is an open table format for very large analytic datasets, which captures metadata information on the state of datasets as they evolve and change over time. It adds tables to compute engines including Spark, Trino, PrestoDB, Flink, and Hive using a high-performance table format that works just like a SQL table. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback.

Apache Iceberg integration is supported by AWS analytics services including Amazon EMR, Amazon Athena, and AWS Glue. Amazon EMR can provision clusters with Spark, Hive, Trino, and Flink that can run Iceberg. Starting with Amazon EMR version 6.5.0, you can use Iceberg with your EMR cluster without requiring a bootstrap action. In early 2022, AWS announced general availability of Athena ACID transactions, powered by Apache Iceberg. The recently released Athena query engine version 3 provides better integration with the Iceberg table format. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes.

In this post, we discuss what customers want in modern data lakes and how Apache Iceberg helps address customer needs. Then we walk through a solution to build a high-performance and evolving Iceberg data lake on Amazon Simple Storage Service (Amazon S3) and process incremental data by running insert, update, and delete SQL statements. Finally, we show you how to performance tune the process to improve read and write performance.

How Apache Iceberg addresses what customers want in modern data lakes

More and more customers are building data lakes, with structured and unstructured data, to support many users, applications, and analytics tools. There is an increased need for data lakes to support database-like features such as ACID transactions, record-level updates and deletes, time travel, and rollback. Apache Iceberg is designed to support these features on cost-effective petabyte-scale data lakes on Amazon S3.

Apache Iceberg addresses customer needs by capturing rich metadata information about the dataset at the time the individual data files are created. There are three layers in the architecture of an Iceberg table: the Iceberg catalog, the metadata layer, and the data layer, as depicted in the following figure (source).

The Iceberg catalog stores the metadata pointer to the current table metadata file. When a select query is reading an Iceberg table, the query engine first goes to the Iceberg catalog, then retrieves the location of the current metadata file. Whenever there is an update to the Iceberg table, a new snapshot of the table is created, and the metadata pointer points to the current table metadata file.

The following is an example Iceberg catalog with AWS Glue implementation. You can see the database name, the location (S3 path) of the Iceberg table, and the metadata location.

The metadata layer has three types of files: the metadata file, manifest list, and manifest file in a hierarchy. At the top of the hierarchy is the metadata file, which stores information about the table’s schema, partition information, and snapshots. The snapshot points to the manifest list. The manifest list has the information about each manifest file that makes up the snapshot, such as location of the manifest file, the partitions it belongs to, and the lower and upper bounds for partition columns for the data files it tracks. The manifest file tracks data files as well as additional details about each file, such as the file format. All three files work in a hierarchy to track the snapshots, schema, partitioning, properties, and data files in an Iceberg table.

The data layer has the individual data files of the Iceberg table. Iceberg supports a wide range of file formats including Parquet, ORC, and Avro. Because the Iceberg table tracks the individual data files instead of only pointing to the partition location with data files, it isolates the writing operations from reading operations. You can write the data files at any time, but only commit the change explicitly, which creates a new version of the snapshot and metadata files.

Solution overview

In this post, we walk you through a solution to build a high-performing Apache Iceberg data lake on Amazon S3; process incremental data with insert, update, and delete SQL statements; and tune the Iceberg table to improve read and write performance. The following diagram illustrates the solution architecture.

To demonstrate this solution, we use the Amazon Customer Reviews dataset in an S3 bucket (s3://amazon-reviews-pds/parquet/). In a real use case, this would be raw data stored in your own S3 bucket. We can check the data size with the following command in the AWS Command Line Interface (AWS CLI):

//Run this AWS CLI command to check the data size
aws s3 ls --summarize --human-readable --recursive s3://amazon-reviews-pds/parquet

The total object count is 430, and total size is 47.4 GiB.

To set up and test this solution, we complete the following high-level steps:

  1. Set up an S3 bucket in the curated zone to store converted data in Iceberg table format.
  2. Launch an EMR cluster with appropriate configurations for Apache Iceberg.
  3. Create a notebook in EMR Studio.
  4. Configure the Spark session for Apache Iceberg.
  5. Convert data to Iceberg table format and move data to the curated zone.
  6. Run insert, update, and delete queries in Athena to process incremental data.
  7. Carry out performance tuning.

Prerequisites

To follow along with this walkthrough, you must have an AWS account with an AWS Identity and Access Management (IAM) role that has sufficient access to provision the required resources.

Set up the S3 bucket for Iceberg data in the curated zone in your data lake

Choose the Region in which you want to create the S3 bucket and provide a unique name:

s3://iceberg-curated-blog-data

Launch an EMR cluster to run Iceberg jobs using Spark

You can create an EMR cluster from the AWS Management Console, Amazon EMR CLI, or AWS Cloud Development Kit (AWS CDK). For this post, we walk you through how to create an EMR cluster from the console.

  1. On the Amazon EMR console, choose Create cluster.
  2. Choose Advanced options.
  3. For Software Configuration, choose the latest Amazon EMR release. As of January 2023, the latest release is 6.9.0. Iceberg requires release 6.5.0 and above.
  4. Select JupyterEnterpriseGateway and Spark as the software to install.
  5. For Edit software settings, select Enter configuration and enter [{"classification":"iceberg-defaults","properties":{"iceberg.enabled":"true"}}].
  6. Leave other settings at their default and choose Next.
  7. For Hardware, use the default setting.
  8. Choose Next.
  9. For Cluster name, enter a name. We use iceberg-blog-cluster.
  10. Leave the remaining settings unchanged and choose Next.
  11. Choose Create cluster.

Create a notebook in EMR Studio

We now walk you through how to create a notebook in EMR Studio from the console.

  1. On the IAM console, create an EMR Studio service role.
  2. On the Amazon EMR console, choose EMR Studio.
  3. Choose Get started.

The Get started page appears in a new tab.

  1. Choose Create Studio in the new tab.
  2. Enter a name. We use iceberg-studio.
  3. Choose the same VPC and subnet as those for the EMR cluster, and the default security group.
  4. Choose AWS Identity and Access Management (IAM) for authentication, and choose the EMR Studio service role you just created.
  5. Choose an S3 path for Workspaces backup.
  6. Choose Create Studio.
  7. After the Studio is created, choose the Studio access URL.
  8. On the EMR Studio dashboard, choose Create workspace.
  9. Enter a name for your Workspace. We use iceberg-workspace.
  10. Expand Advanced configuration and choose Attach Workspace to an EMR cluster.
  11. Choose the EMR cluster you created earlier.
  12. Choose Create Workspace.
  13. Choose the Workspace name to open a new tab.

In the navigation pane, there is a notebook that has the same name as the Workspace. In our case, it is iceberg-workspace.

  14. Open the notebook.
  15. When prompted to choose a kernel, choose Spark.

Configure a Spark session for Apache Iceberg

Use the following code, providing your own S3 bucket name:

%%configure -f
{
"conf": {
"spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.demo.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.demo.warehouse": "s3://iceberg-curated-blog-data",
"spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"spark.sql.catalog.demo.io-impl":"org.apache.iceberg.aws.s3.S3FileIO"
}
}

This sets the following Spark session configurations:

  • spark.sql.catalog.demo – Registers a Spark catalog named demo, which uses the Iceberg Spark catalog plugin.
  • spark.sql.catalog.demo.catalog-impl – The demo Spark catalog uses AWS Glue as the physical catalog to store Iceberg database and table information.
  • spark.sql.catalog.demo.warehouse – The demo Spark catalog stores all Iceberg metadata and data files under the root path defined by this property: s3://iceberg-curated-blog-data.
  • spark.sql.extensions – Adds support to Iceberg Spark SQL extensions, which allows you to run Iceberg Spark procedures and some Iceberg-only SQL commands (you use this in a later step).
  • spark.sql.catalog.demo.io-impl – Iceberg allows users to write data to Amazon S3 through S3FileIO. The AWS Glue Data Catalog by default uses this FileIO, and other catalogs can load this FileIO using the io-impl catalog property.

Convert data to Iceberg table format

You can use either Spark on Amazon EMR or Athena to load the Iceberg table. In the EMR Studio Workspace notebook Spark session, run the following commands to load the data:

// create a database in AWS Glue named reviews if not exist
spark.sql("CREATE DATABASE IF NOT EXISTS demo.reviews")

// load reviews - this load all the parquet files
val reviews_all_location = "s3://amazon-reviews-pds/parquet/"
val reviews_all = spark.read.parquet(reviews_all_location)

// write reviews data to an Iceberg v2 table
reviews_all.writeTo("demo.reviews.all_reviews").tableProperty("format-version", "2").createOrReplace()

After you run the code, you should find two prefixes created in your data warehouse S3 path (s3://iceberg-curated-blog-data/reviews.db/all_reviews): data and metadata.

Process incremental data using insert, update, and delete SQL statements in Athena

Athena is a serverless query engine that you can use to perform read, write, update, and optimization tasks against Iceberg tables. To demonstrate how the Apache Iceberg data lake format supports incremental data ingestion, we run insert, update, and delete SQL statements on the data lake.

Navigate to the Athena console and choose Query editor. If this is your first time using the Athena query editor, you need to configure the query result location to be the S3 bucket you created earlier. You should be able to see that the table reviews.all_reviews is available for querying. Run the following query to verify that you have loaded the Iceberg table successfully:

select * from reviews.all_reviews limit 5;

Process incremental data by running insert, update, and delete SQL statements:

//Example update statement
update reviews.all_reviews set star_rating=5 where product_category = 'Watches' and star_rating=4

//Example delete statement
delete from reviews.all_reviews where product_category = 'Watches' and star_rating=1
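The post doesn’t show an insert example, but inserts into the Iceberg table also use standard SQL in Athena. The following is a sketch with made-up values; the column list is a subset of the table’s columns (taken from the queries in this post), and any columns not listed are left null:

//Example insert statement (illustrative values)
insert into reviews.all_reviews (marketplace, review_id, product_id, product_title, product_category, star_rating, review_date)
values ('US', 'R1EXAMPLE', 'B00EXAMPLE', 'Example Watch', 'Watches', 5, date('2023-01-01'))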

Performance tuning

In this section, we walk through different ways to improve Apache Iceberg read and write performance.

Configure Apache Iceberg table properties

Apache Iceberg is a table format, and it supports table properties to configure table behavior such as read, write, and catalog. You can improve the read and write performance on Iceberg tables by adjusting the table properties.

For example, if you notice that you write too many small files for an Iceberg table, you can configure the target file size to write fewer but larger files, which helps improve query performance.

Property Default Description
write.target-file-size-bytes 536870912 (512 MB) Controls the size of files generated to target about this many bytes

Use the following code to alter the table property:

//Example code to alter table properties in EMR Studio Workspace notebook
spark.sql("ALTER TABLE demo.reviews.all_reviews SET TBLPROPERTIES ('write.target-file-size-bytes'='536870912')")

Partitioning and sorting

To make a query run fast, the less data read, the better. Iceberg takes advantage of the rich metadata it captures at write time and facilitates techniques such as scan planning, partitioning, pruning, and column-level stats such as min/max values to skip data files that don’t have matching records. We walk you through how query scan planning and partitioning work in Iceberg and how we use them to improve query performance.

Query scan planning

For a given query, the first step in a query engine is scan planning, which is the process to find the files in a table needed for a query. Planning in an Iceberg table is very efficient, because Iceberg’s rich metadata can be used to prune metadata files that aren’t needed, in addition to filtering data files that don’t contain matching data. In our tests, we observed Athena scanned 50% or less data for a given query on an Iceberg table compared to original data before conversion to Iceberg format.

There are two types of filtering:

  • Metadata filtering – Iceberg uses two levels of metadata to track the files in a snapshot: the manifest list and manifest files. It first uses the manifest list, which acts as an index of the manifest files. During planning, Iceberg filters manifests using the partition value range in the manifest list without reading all the manifest files. Then it uses selected manifest files to get data files.
  • Data filtering – After selecting the list of manifest files, Iceberg uses the partition data and column-level stats for each data file stored in manifest files to filter data files. During planning, query predicates are converted to predicates on the partition data and applied first to filter data files. Then, the column stats like column-level value counts, null counts, lower bounds, and upper bounds are used to filter out data files that can’t match the query predicate. By using upper and lower bounds to filter data files at planning time, Iceberg greatly improves query performance.
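If you want to see the metadata Iceberg uses for this filtering, you can query the table’s metadata tables directly. The following is a sketch run from the Athena query editor against the table created earlier in this post; Athena exposes Iceberg metadata through $-suffixed table names such as $files, $manifests, $partitions, and $history:

-- Data files tracked by the current snapshot, including record counts and column value bounds
SELECT * FROM "reviews"."all_reviews$files" LIMIT 10;

-- Snapshot history of the table
SELECT * FROM "reviews"."all_reviews$history";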

Partitioning and sorting

Partitioning is a way to group records with the same key column values together when writing. The benefit of partitioning is faster queries that access only part of the data, as explained earlier in query scan planning: data filtering. Iceberg makes partitioning simple by supporting hidden partitioning: Iceberg produces partition values by taking a column value and optionally transforming it.

In our use case, we first run the following query on the Iceberg table not partitioned. Then we partition the Iceberg table by the category of the reviews, which will be used in the query WHERE condition to filter out records. With partitioning, the query could scan much less data. See the following code:

//Example code in EMR Studio Workspace notebook to create an Iceberg table all_reviews_partitioned partitioned by product_category
reviews_all.writeTo("demo.reviews.all_reviews_partitioned").tableProperty("format-version", "2").partitionedBy($"product_category").createOrReplace()

Run the following select statement on the non-partitioned all_reviews table vs. the partitioned table to see the performance difference:

//Run this query on all_reviews table and the partitioned table for performance testing
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

//Run the same select query on partitioned dataset
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews_partitioned where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The following table shows the effect of data partitioning: about 50% faster runtime and roughly 70% less data scanned.

Dataset Name Non-Partitioned Dataset Partitioned Dataset
Runtime (seconds) 8.20 4.25
Data Scanned (MB) 131.55 33.79

Note that the runtime is the average runtime with multiple runs in our test.

We saw good performance improvement after partitioning. However, this can be further improved by using column-level stats from Iceberg manifest files. In order to use the column-level stats effectively, you want to further sort your records based on the query patterns. Sorting the whole dataset using the columns that are often used in queries will reorder the data in such a way that each data file ends up with a unique range of values for the specific columns. If these columns are used in the query condition, it allows query engines to further skip data files, thereby enabling even faster queries.
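
As a sketch of how you could apply such a sort in our use case, assuming the Iceberg Spark SQL extensions are enabled on the EMR cluster, you can define a write sort order on review_date so that subsequent writes produce files with narrow review_date ranges:

//Sketch: set a sort order so that future writes cluster records by review_date
ALTER TABLE demo.reviews.all_reviews_partitioned WRITE ORDERED BY review_date

Note that this only affects data written after the sort order is set; files written earlier keep their existing layout until they are rewritten, for example by a compaction job.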

Copy-on-write vs. merge-on-read

When implementing update and delete on Iceberg tables in the data lake, there are two approaches defined by the Iceberg table properties:

  • Copy-on-write – With this approach, when there are changes to the Iceberg table, either updates or deletes, the data files associated with the impacted records are duplicated and updated. The records are either updated or deleted in the duplicated data files, and a new snapshot of the Iceberg table is created pointing to the newer version of data files. This makes the overall writes slower. There might also be situations where concurrent writes conflict and have to be retried, which increases the write time even more. On the other hand, no extra processing is needed when reading the data: the query retrieves data from the latest version of the data files.
  • Merge-on-read – With this approach, when there are updates or deletes on the Iceberg table, the existing data files are not rewritten; instead, new delete files are created to track the changes. For deletes, a new delete file is created with the deleted records. When reading the Iceberg table, the delete file is applied to the retrieved data to filter out the deleted records. For updates, a new delete file is created to mark the updated records as deleted, and a new data file is created for those records with the updated values. When reading the Iceberg table, both the delete and new files are applied to the retrieved data to reflect the latest changes and produce correct results. So, for any subsequent query, an extra step to merge the data files with the delete and new files happens, which usually increases query time. On the other hand, writes might be faster because there is no need to rewrite the existing data files.

To test the impact of the two approaches, you can run the following code to set the Iceberg table properties:

//Run code to alter the Iceberg table properties to set copy-on-write in the EMR Studio Workspace notebook
spark.sql("ALTER TABLE demo.reviews.all_reviews SET TBLPROPERTIES ('write.delete.mode'='copy-on-write','write.update.mode'='copy-on-write')")
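
To test the merge-on-read path, switch the same properties and rerun the statements. The following sketch shows the equivalent statement (run it via spark.sql in the notebook in the same way):

//Sketch: switch the table to merge-on-read for deletes and updates
ALTER TABLE demo.reviews.all_reviews
SET TBLPROPERTIES ('write.delete.mode'='merge-on-read','write.update.mode'='merge-on-read')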

Run the update, delete, and select SQL statements in Athena to show the runtime difference for copy-on-write vs. merge-on-read:

//Example update statement
update reviews.all_reviews set star_rating=5 where product_category = 'Watches' and star_rating=4

//Example delete statement
delete from reviews.all_reviews where product_category = 'Watches' and star_rating=1

//Example select statement
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The following table summarizes the query runtimes.

Query Copy-on-Write (UPDATE / DELETE / SELECT) Merge-on-Read (UPDATE / DELETE / SELECT)
Runtime (seconds) 66.251 / 116.174 / 97.75 10.788 / 54.941 / 113.44
Data scanned (MB) 494.06 / 3.07 / 137.16 494.06 / 3.07 / 137.16

Note that the runtime is the average runtime with multiple runs in our test.

As our test results show, there are always trade-offs in the two approaches. Which approach to use depends on your use cases. In summary, the considerations come down to latency on the read vs. write. You can reference the following table and make the right choice.

. Copy-on-Write Merge-on-Read
Pros Faster reads Faster writes
Cons Expensive writes Higher latency on reads
When to use Good for frequent reads, infrequent updates and deletes or large batch updates Good for tables with frequent updates and deletes

Data compaction

If your data files are small, you might end up with thousands or millions of files in an Iceberg table. This dramatically increases the number of I/O operations and slows down queries. Furthermore, Iceberg tracks each data file in a dataset, so more data files lead to more metadata, which in turn increases the overhead and I/O of reading metadata files. In order to improve query performance, it’s recommended to compact small data files into larger data files.

When updating and deleting records in an Iceberg table with the merge-on-read approach, you might end up with many small delete files or new data files. Running compaction combines all these files and creates a newer version of the data files, which eliminates the need to reconcile them during reads. It’s recommended to run regular compaction jobs so that reads are impacted as little as possible while still maintaining fast write speeds.

Run the following data compaction command, then run the select query from Athena:

//Data compaction 
optimize reviews.all_reviews REWRITE DATA USING BIN_PACK

//Run this query before and after data compaction
select marketplace,customer_id, review_id,product_id,product_title,star_rating from reviews.all_reviews where product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The following table compares the runtime before vs. after data compaction. The select query ran roughly three times faster after compaction (about a 65% reduction in runtime).

Query Before Data Compaction After Data Compaction
Runtime (seconds) 97.75 32.676
Data scanned (MB) 137.16 189.19

Note that the select queries ran on the all_reviews table after update and delete operations, before and after data compaction. The runtime is the average runtime with multiple runs in our test.

Clean up

After you follow the solution walkthrough to perform the use cases, complete the following steps to clean up your resources and avoid further costs:

  1. Drop the AWS Glue tables and database from Athena or run the following code in your notebook:
// DROP the table 
spark.sql("DROP TABLE demo.reviews.all_reviews") 
spark.sql("DROP TABLE demo.reviews.all_reviews_partitioned") 

// DROP the database 
spark.sql("DROP DATABASE demo.reviews")
  2. On the EMR Studio console, choose Workspaces in the navigation pane.
  3. Select the Workspace you created and choose Delete.
  4. On the EMR console, navigate to the Studios page.
  5. Select the Studio you created and choose Delete.
  6. On the EMR console, choose Clusters in the navigation pane.
  7. Select the cluster and choose Terminate.
  8. Delete the S3 bucket and any other resources that you created as part of the prerequisites for this post.

Conclusion

In this post, we introduced the Apache Iceberg framework and how it helps resolve some of the challenges we have in a modern data lake. Then we walked you through a solution to process incremental data in a data lake using Apache Iceberg. Finally, we had a deep dive into performance tuning to improve read and write performance for our use cases.

We hope this post provides some useful information for you to decide whether you want to adopt Apache Iceberg in your data lake solution.


About the Authors

Flora Wu is a Sr. Resident Architect at AWS Data Lab. She helps enterprise customers create data analytics strategies and build solutions to accelerate their businesses outcomes. In her spare time, she enjoys playing tennis, dancing salsa, and traveling.

Daniel Li is a Sr. Solutions Architect at Amazon Web Services. He focuses on helping customers develop, adopt, and implement cloud services and strategy. When not working, he likes spending time outdoors with his family.

Proactive Insights with Amazon DevOps Guru for RDS

Post Syndicated from Kishore Dhamodaran original https://aws.amazon.com/blogs/devops/proactive-insights-with-amazon-devops-guru-for-rds/

Today, we are pleased to announce a new Amazon DevOps Guru for RDS capability: Proactive Insights. DevOps Guru for RDS is a fully managed service powered by machine learning (ML) that uses the data collected by RDS Performance Insights to detect and alert customers of anomalous behaviors within Amazon Aurora databases. Since its release, DevOps Guru for RDS has empowered customers with information to quickly react to performance problems and take corrective actions. Now, Proactive Insights adds recommendations for operational issues, helping you prevent potential problems in the future.

Proactive Insights requires no additional setup for customers already using DevOps Guru for RDS, for both Amazon Aurora MySQL-Compatible Edition and Amazon Aurora PostgreSQL-Compatible Edition.

The following are example use cases of operational issues available for Proactive Insights today, with more insights coming over time:

  • Long InnoDB History for Aurora MySQL-Compatible engines – Triggered when the InnoDB history list length becomes very large.
  • Temporary tables created on disk for Aurora MySQL-Compatible engines – Triggered when the ratio of temporary tables created versus all temporary tables breaches a threshold.
  • Idle In Transaction for Aurora PostgreSQL-Compatible engines – Triggered when sessions connected to the database are not performing active work, but can keep database resources blocked.

To get started, navigate to the Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights. In the following screen capture, the number three indicates that there are three ongoing proactive insights. Click on that number to see the listing of the corresponding Proactive Insights, which may include RDS or other Proactive Insights supported by Amazon DevOps Guru.

Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights

Figure 1. Amazon DevOps Guru Dashboard where you can see a summary of your system’s overall health, including ongoing proactive insights.

Ongoing problems (including reactive and proactive insights) are also highlighted against your database instance on the Database list page in the Amazon RDS console.

Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console

Figure 2. Proactive and Reactive Insights are highlighted against your database instance on the Database list page in the Amazon RDS console.

In the following sections, we will dive deep on these use cases of DevOps Guru for RDS Proactive Insights.

Long InnoDB History for Aurora MySQL-Compatible engines

The InnoDB history list is a global list of the undo logs for committed transactions. MySQL uses the history list to purge records and log pages when transactions no longer require the history.  If the InnoDB history list length grows too large, indicating a large number of old row versions, queries and even the database shutdown process can become slower.

DevOps Guru for RDS now detects when the history list length exceeds 1 million records and alerts users to close (either by commit or by rollback) any unnecessary long-running transactions before triggering database changes that involve a shutdown (this includes reboots and database version upgrades).
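
If you want to check the current history list length yourself from a MySQL client connected to the cluster, a query like the following sketch works on Aurora MySQL and other MySQL 5.7-compatible or later engines (this is our own example, not part of the DevOps Guru setup):

-- Sketch: check the current InnoDB history list length
SELECT name, `count` AS history_list_length
FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';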

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose “RDS InnoDB History List Length Anomalous” Proactive Insight with an ongoing status. You will notice that Proactive Insights provides an “Insight overview”, “Metrics” and “Recommendations”.

The Insight overview provides basic information on this insight. In our case, the history list for row changes increased significantly, which affects query and shutdown performance.

Long InnoDB History for Aurora MySQL-Compatible engines Insight overview

Figure 3. Long InnoDB History for Aurora MySQL-Compatible engines Insight overview.

The Metrics panel gives you a graphical representation of the history list length and the timeline, allowing you to correlate it with any anomalous application activity that may have occurred during this window.

Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel

Figure 4. Long InnoDB History for Aurora MySQL-Compatible engines Metrics panel.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem. You will also notice the rationale behind the recommendation under the “Why is DevOps Guru recommending this?” column.

The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem

Figure 5. The Recommendations section suggests actions that you can take to mitigate this issue before it leads to a bigger problem.

Temporary tables created on disk for Aurora MySQL-Compatible engines

Sometimes it is necessary for the MySQL database to create an internal temporary table while processing a query. An internal temporary table can be held in memory and processed by the TempTable or MEMORY storage engine, or stored on disk by the InnoDB storage engine. An increase of temporary tables created on disk instead of in memory can impact the database performance.

DevOps Guru for RDS now monitors the rate at which the database creates temporary tables and the percentage of those temporary tables that use disk. When these values cross recommended levels over a given period of time, DevOps Guru for RDS creates an insight exposing this situation before it becomes critical.
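
You can also inspect this ratio yourself from a MySQL client using the server status counters; the following sketch (our own example) lists both the on-disk and total internal temporary table counters:

-- Sketch: compare on-disk temporary tables vs. all internal temporary tables created
SHOW GLOBAL STATUS LIKE 'Created_tmp%tables';

A Created_tmp_disk_tables value that climbs quickly relative to Created_tmp_tables points to the same condition this insight detects.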

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Temporary Tables On Disk Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

The Insight overview provides basic information on this insight. In our case, more than 58% of the total temporary tables created per second were using disk, with a sustained rate of two temporary tables on disk created every second, which indicates that query performance is degrading.

Temporary tables created on disk insight overview

Figure 6. Temporary tables created on disk insight overview.

The Metrics panel shows you a graphical representation of the information specific for this insight. You will be presented with the evolution of the amount of temporary tables created on disk per second, the percentage of temporary tables on disk (out of the total number of database-created temporary tables), and of the overall rate at which the temporary tables are created (per second).

Temporary tables created on disk evolution of the amount of temporary tables created on disk per second

Figure 7. Temporary tables created on disk – evolution of the amount of temporary tables created on disk per second.

Temporary tables created on disk the percentage of temporary tables on disk (out of the total number of database-created temporary tables)

Figure 8. Temporary tables created on disk – the percentage of temporary tables on disk (out of the total number of database-created temporary tables).

Temporary tables created on disk overall rate at which the temporary tables are created (per second)

Figure 9. Temporary tables created on disk – overall rate at which the temporary tables are created (per second).

The Recommendations section suggests actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning the tmp_table_size and max_heap_table_size database parameters, data set reduction, column indexing, and more.

Temporary tables created on disk actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more

Figure 10. Temporary tables created on disk – actions to avoid this situation when possible, such as not using BLOB and TEXT data types, tuning tmp_table_size and max_heap_table_size database parameters, data set reduction, columns indexing and more.

Additional explanations on this use case can be found by clicking on the “View troubleshooting doc” link.

Idle In Transaction for Aurora PostgreSQL-Compatible engines

A connection that has been idle in transaction for too long can impact performance by holding locks, blocking other queries, or preventing VACUUM (including autovacuum) from cleaning up dead rows.
A PostgreSQL database requires periodic maintenance, known as vacuuming. Autovacuum in PostgreSQL automates the execution of the VACUUM and ANALYZE commands. This process gathers table statistics and deletes dead rows. When vacuuming does not occur, database performance degrades: table and index bloat increases (disk space that was used by a table or index and is available for reuse by the database but has not been reclaimed), statistics become stale, and the database can even hit transaction ID wraparound (when the number of unique transaction IDs reaches its maximum of about two billion).

DevOps Guru for RDS monitors the time that sessions in an Aurora PostgreSQL database spend in the idle in transaction state and initially raises a warning notification, followed by an alarm notification if the idle in transaction state continues (the current thresholds are 1800 seconds for the warning and 3600 seconds for the alarm).
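
To see which sessions are currently idle in transaction and for how long, you can run a query against pg_stat_activity from a PostgreSQL client; the following is a sketch (our own example):

-- Sketch: list sessions idle in transaction, longest idle first
SELECT pid, usename, now() - state_change AS idle_duration, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY idle_duration DESC;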

From the DevOps Guru console, navigate to Insights, choose Proactive, then choose the “RDS Idle In Transaction Max Time Anomalous” Proactive Insight with an ongoing status. You will notice this Proactive Insight provides an “Insight overview”, “Metrics” and “Recommendations”.

In our case, a connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance

Figure 11. A connection has been in “idle in transaction” state for more than 1800 seconds, which could impact the database performance.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started

Figure 12. The Metrics panel shows you a graphical representation of when the long-running “idle in transaction” connections started.

As with the other insights, recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case

Figure 13. Recommended actions are listed and a troubleshooting doc is linked for even more details on this use case.

Conclusion

With Proactive Insights, DevOps Guru for RDS enhances its ability to help you monitor your databases by notifying you about potential operational issues before they become bigger problems down the road. To get started, ensure that you have enabled Performance Insights on the database instance(s) you want monitored, and confirm that DevOps Guru is enabled to monitor those instances (for example, by enabling it at the account level, by monitoring specific CloudFormation stacks, or by using AWS tags for specific Aurora resources). Proactive Insights is available in all regions where DevOps Guru for RDS is supported. To learn more about Proactive Insights, join us for a free hands-on Immersion Day (available in three time zones) on March 15th or April 12th.

About the authors:

Kishore Dhamodaran

Kishore Dhamodaran is a Senior Solutions Architect at AWS.

Raluca Constantin

Raluca Constantin is a Senior Database Engineer with the Relational Database Services (RDS) team at Amazon Web Services. She has 16 years of experience in the databases world. She enjoys travels, hikes, arts and is a proud mother of a 12y old daughter and a 7y old son.

Jonathan Vogel

Jonathan is a Developer Advocate at AWS. He was a DevOps Specialist Solutions Architect at AWS for two years prior to taking on the Developer Advocate role. Prior to AWS, he practiced professional software development for over a decade. Jonathan enjoys music, birding and climbing rocks.

Reduce Amazon EMR cluster costs by up to 19% with new enhancements in Amazon EMR Managed Scaling

Post Syndicated from Sushant Majithia original https://aws.amazon.com/blogs/big-data/reduce-amazon-emr-cluster-costs-by-up-to-19-with-new-enhancements-in-amazon-emr-managed-scaling/

In June 2020, AWS announced the general availability of Amazon EMR Managed Scaling. With EMR Managed Scaling, you specify the minimum and maximum compute limits for your clusters, and Amazon EMR automatically resizes your cluster for optimal performance and resource utilization. EMR Managed Scaling constantly monitors key workload-related metrics and uses an algorithm that optimizes the cluster size for best resource utilization. Given that the feature is completely managed, improvements to the algorithm are immediately realized without needing a version upgrade. Amazon EMR can scale the cluster up during peaks and scale it down gracefully during idle periods, reducing your costs and optimizing cluster capacity for the best performance.

Throughout 2022, we made multiple enhancements to the EMR Managed Scaling algorithm. With these improvements, we observed that for clusters enabled with EMR Managed Scaling, utilization improved by up to 15 percent, and total costs were reduced further by up to 19 percent. Starting mid-December 2022, EMR Managed Scaling enhancements were enabled by default for clusters using Amazon EMR versions 5.34.0 and later and Amazon EMR versions 6.4.0 and later for both new and existing clusters. Further, given that the feature is completely managed, you will get the new optimized Managed Scaling algorithm by default, and no action is needed on your end.

Listed below are some of the key enhancements we enabled for EMR Managed Scaling:

  • Improved cluster utilization with targeted scale-down of your EMR cluster
  • Reduced costs by preventing scale-down of instances that store intermediate shuffle data using Spark Shuffle data awareness
  • Improved cluster utilization and reduced costs with gradual scale-up of your EMR cluster

Customer success stories

How the enhanced EMR Managed Scaling algorithm helped a technology enterprise reduce costs:

To illustrate the cost savings with an example, we looked at an EMR cluster for a technology enterprise, which heavily uses Amazon EMR to process real-time billing data between Kafka and Amazon S3 using Spark. They run a persistent EMR cluster with EMR version 5.35 and have EMR Managed Scaling turned on. The following Amazon CloudWatch dashboard shows how, starting December 21, the enhanced Managed Scaling algorithm provisioned (total nodes requested) only 70 nodes vs. the previous Managed Scaling algorithm, which provisioned 179 nodes for a similar job profile. The lower the number of resources provisioned to run your jobs, the lower the total cost of your EMR cluster.

How the enhanced EMR Managed Scaling algorithm helped an advertising enterprise reduce costs:

We also looked at an EMR cluster for an advertising enterprise, which uses Amazon EMR for its data analytics strategy and runs its batch ETL jobs using Spark. They run their clusters on EMR version 6.5 and have EMR Managed Scaling turned on. The following Amazon CloudWatch dashboard shows how, starting December 15, the enhanced Managed Scaling algorithm provisioned (total units requested) only 41 nodes vs. the previous Managed Scaling algorithm, which provisioned 86 nodes for a similar job profile.

Estimating the cost savings and utilization improvements for your EMR clusters:

Cluster cost savings:

To view estimated cost savings for your EMR cluster with the EMR Managed Scaling enhancements, please follow the steps below:

  • Open the CloudWatch metrics console and, under EMR, search by your ClusterId.
  • From the list of metrics available for EMR, select the following two metrics:
    • Running capacity – Based on the unit type you specified in your Managed Scaling policy, this will be available as either “TotalUnitsRunning” or “TotalNodesRunning” or “TotalVCPURunning
    • Capacity requested by Managed Scaling – Based on the unit type you specified in your Managed Scaling policy, this will be available as either “TotalUnitsRequested” or “TotalNodesRequested” or “TotalVCPURequested
  •  Plot both of the metrics to your CloudWatch dashboard.
  • Select the time frame as the 3 months between November 2022 and January 2023 to view the improvements with the enhanced Managed Scaling algorithm when compared to the previous Managed Scaling algorithm.

Cluster utilization improvements:

To estimate the improvements in your EMR cluster utilization with the EMR Managed Scaling enhancements, please follow the steps below:

  • Open the CloudWatch metrics console and, under EMR, search by your ClusterId.
  • From the list of metrics available for EMR, select the “YARNMemoryAvailablePercentage” metric.
  • To derive memory utilized by YARN, add a math expression such as “Add Math → Start with empty expression”
    • For the new math expression, set Label=Yarn Utilization and set Details=100-YARNMemoryAvailablePercentage.
  • Plot the cluster utilization metric to your CloudWatch dashboard.
  • Select the time frame as the 3 months between November 2022 and January 2023 to view the improvements with the enhanced Managed Scaling algorithm when compared to the previous Managed Scaling algorithm.

What’s next

We will continue to tune the Managed Scaling algorithm with every new EMR release and thereby improve the customer experience when scaling clusters with EMR Managed Scaling.

Conclusion

In this post, we provided an overview of the key enhancements we launched in EMR Managed Scaling. With these enhancements, we observed that the cluster utilization improved by up to 15 percent, and cluster cost was reduced by up to 19 percent. Starting mid-December 2022, these enhancements were enabled by default for EMR clusters using Amazon EMR versions 5.34.0 and later, and Amazon EMR versions 6.4.0 and later. Given that EMR Managed Scaling is a completely managed feature, you will get the new, optimized EMR Managed Scaling algorithm by default, and no action is needed from your end.

To learn more and get started with EMR Managed Scaling, visit the EMR Managed Scaling documentation page.


About the Authors

Sushant Majithia is a Principal Product Manager for EMR at Amazon Web Services.

Vishal Vyas is a Senior Software Engineer for EMR at Amazon Web Services.

Matthew Liem is a Senior Solution Architecture Manager at AWS.

Simplify Online Analytical Processing (OLAP) queries in Amazon Redshift using new SQL constructs such as ROLLUP, CUBE, and GROUPING SETS

Post Syndicated from Satesh Sonti original https://aws.amazon.com/blogs/big-data/simplify-online-analytical-processing-olap-queries-in-amazon-redshift-using-new-sql-constructs-such-as-rollup-cube-and-grouping-sets/

Amazon Redshift is a fully managed, petabyte-scale, massively parallel data warehouse that makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools.

We are continuously investing to make analytics easy with Amazon Redshift by simplifying SQL constructs and adding new operators. Now we are adding ROLLUP, CUBE, and GROUPING SETS SQL aggregation extensions to perform multiple aggregate operations in a single statement and easily include subtotals, totals, and collections of subtotals in a query.

In this post, we discuss how to use these extensions to simplify your queries in Amazon Redshift.

Solution overview

Online Analytical Processing (OLAP) is an effective tool for today’s data and business analysts. It helps you see your mission-critical metrics at different aggregation levels in a single pane of glass. An analyst can use OLAP aggregations to analyze buying patterns by grouping customers by demographic, geographic, and psychographic data, and then summarizing the data to look for trends. This could include analyzing the frequency of purchases, the time frames between purchases, and the types of items being purchased. Such analysis can provide insight into customer preferences and behavior, which can be used to inform marketing strategies and product development. For example, a data analyst can query the data to display a spreadsheet showing a company’s certain type of products sold in the US in the month of July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in the US at the same time period.

Traditionally, business analysts and data analysts use a set of SQL UNION queries to achieve the desired level of detail and rollups. However, it can be very time consuming and cumbersome to write and maintain. Furthermore, the level of detail and rollups that can be achieved with this approach is limited, because it requires the user to write multiple queries for each different level of detail and rollup.

Many customers are considering migrating to Amazon Redshift from other data warehouse systems that support OLAP GROUP BY clauses. To make this migration process as seamless as possible, Amazon Redshift now offers support for ROLLUP, CUBE, and GROUPING SETS. This will allow for a smoother migration of OLAP workloads, with minimal rewrites. Ultimately, this will result in a faster and streamlined transition to Amazon Redshift. Business and data analysts can now write a single SQL to do the job of multiple UNION queries.

In the next sections, we use sample supplier balance data from the TPC-H dataset as a running example to demonstrate the use of the ROLLUP, CUBE, and GROUPING SETS extensions. This dataset consists of supplier account balances across different regions and countries. We demonstrate how to find account balance subtotals and grand totals at each nation level, region level, and a combination of both. All these analytical questions can be answered by a business user by running simple single-line SQL statements. Along with aggregations, this post also demonstrates how the results can be traced back to the attributes that participated in generating the subtotals.

Data preparation

To set up the use case, complete the following steps:

  1. On the Amazon Redshift console, in the navigation pane, choose Editor, then Query editor v2.

The query editor v2 opens in a new browser tab.

  2. Create a supplier sample table and insert sample data:
create table supp_sample (supp_id integer, region_nm char(25), nation_nm char(25), acct_balance numeric(12,2));

INSERT INTO public.supp_sample (supp_id,region_nm,nation_nm,acct_balance)
VALUES
(90470,'AFRICA                   ','KENYA                    ',1745.57),
(99910,'AFRICA                   ','ALGERIA                  ',3659.98),
(26398,'AMERICA                  ','UNITED STATES            ',2575.77),
(43908,'AMERICA                  ','CANADA                   ',1428.27),
(3882,'AMERICA                  ','UNITED STATES            ',7932.67),
(42168,'ASIA                     ','JAPAN                    ',343.34),
(68461,'ASIA                     ','CHINA                    ',2216.11),
(89676,'ASIA                     ','INDIA                    ',4160.75),
(52670,'EUROPE                   ','RUSSIA                   ',2469.40),
(32190,'EUROPE                   ','RUSSIA                   ',1119.55),
(19587,'EUROPE                   ','GERMANY                  ',9904.98),
(1134,'MIDDLE EAST              ','EGYPT                    ',7977.48),
(35213,'MIDDLE EAST              ','EGYPT                    ',737.28),
(36132,'MIDDLE EAST              ','JORDAN                   ',5052.87);

We took a sample from the result of the following query run on the TPC-H dataset. You can use the following query and take sample records to try the SQL statements described in this post:

select s_suppkey supp_id, r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where
s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey

Let’s review the sample data before running the SQLs using GROUPING SETS, ROLLUP, and CUBE extensions.

The supp_sample table consists of supplier account balances from various nations and regions across the world. The following are the attribute definitions:

  • supp_id – The unique identifier for each supplier
  • region_nm – The region in which the supplier operates
  • nation_nm – The nation in which the supplier operates
  • acct_balance – The supplier’s outstanding account balance

GROUPING SETS

GROUPING SETS is a SQL aggregation extension to group the query results by one or more columns in a single statement. You can use GROUPING SETS instead of performing multiple SELECT queries with different GROUP BY keys and merge (UNION) their results.

In this section, we show how to find the following:

  • Account balances aggregated for each region
  • Account balances aggregated for each nation
  • Merged results of both aggregations

Run the following SQL statement using GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS (region_nm, nation_nm);

As shown in the following screenshot, the result set includes aggregated account balances by region_nm, followed by nation_nm, and then both results combined in a single output.
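
For comparison, producing the same combined output without GROUPING SETS would typically require a UNION ALL of two separate GROUP BY queries, along the lines of the following sketch (the NULL casts keep the column lists aligned):

SELECT region_nm, CAST(NULL AS char(25)) AS nation_nm, sum(acct_balance) AS total_balance
FROM supp_sample
GROUP BY region_nm
UNION ALL
SELECT CAST(NULL AS char(25)) AS region_nm, nation_nm, sum(acct_balance) AS total_balance
FROM supp_sample
GROUP BY nation_nm;

The single GROUPING SETS statement replaces this pattern and scales better as you add more grouping combinations.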

ROLLUP

The ROLLUP function generates aggregated results at multiple levels of grouping, starting from the most detailed level and then aggregating up to the next level. It groups data by particular columns and adds extra rows that represent the subtotals, and it assumes a hierarchy among the GROUP BY columns.

In this section, we show how to find the following:

  • Account balances for each combination of region_nm and nation_nm
  • Rolled-up account balances for each region_nm
  • Rolled-up account balances for all regions

Run the following SQL statement using ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm,nation_nm;

The following result shows rolled-up values starting from each combination of region_nm and nation_nm and rolling up the hierarchy from nation_nm to region_nm. The rows with a value for region_nm and a NULL value for nation_nm represent the subtotals for the region (marked in green). The row with NULL values for both region_nm and nation_nm contains the grand total: the rolled-up account balances for all regions (marked in red).


ROLLUP is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), ())
ORDER BY region_nm,nation_nm;

You can rewrite the preceding ROLLUP query using GROUPING SETS. However, using ROLLUP is a much simpler and readable construct for this use case.

CUBE

CUBE groups data by the provided columns, returning extra subtotal rows representing the totals throughout all levels of grouping columns, in addition to the grouped rows. CUBE returns the same rows as ROLLUP, while adding additional subtotal rows for every combination of grouping column not covered by ROLLUP.

In this section, we show how to find the following:

  • Account balance subtotals for each nation_nm
  • Account balance subtotals for each region_nm
  • Account balance subtotals for each group of region_nm and nation_nm combination
  • Overall total account balance for all regions

Run the following SQL statement using CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA')
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm, nation_nm;

In the preceding query, we added a filter to limit results for easy explanation. You can remove this filter in your test to view data for all regions.

In the following result sets, you can see the subtotals at region level (marked in green). These subtotal records are the same records generated by ROLLUP. Additionally, CUBE generated subtotals for each nation_nm (marked in yellow). Finally, you can also see the grand total for all three regions mentioned in the query (marked in red).

CUBE is structurally equivalent to the following GROUPING SETS query:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM supp_sample
WHERE region_nm in ('AFRICA','AMERICA','ASIA') -- added the filter to limit results.  You can remove this filter in your test to view data for all regions
GROUP BY GROUPING SETS((region_nm, nation_nm), (region_nm), (nation_nm), ())
ORDER BY region_nm;

You can rewrite the preceding CUBE query using GROUPING SETS. However, using CUBE is a much simpler and more readable construct for this use case.

NULL values

NULL is a valid value in a column that participates in GROUPING SETS, ROLLUP, and CUBE, and it’s not aggregated together with the NULL values that these extensions add explicitly to the result set to satisfy the schema of the returned tuples.

Let’s create an example table orders containing details about items ordered, item descriptions, and quantity of the items:

-- Create example orders table and insert sample records
CREATE TABLE orders(item_no int,description varchar,quantity int);
INSERT INTO orders(item_no,description,quantity)
VALUES
(101,'apples',10),
(102,null,15),
(103,'banana',20);

--View the data
SELECT * FROM orders;

We use the following ROLLUP query to aggregate quantities by item_no and description:

SELECT item_no, description, sum(quantity)
FROM orders
GROUP BY ROLLUP(item_no, description)
ORDER BY 1,2;

In the following result, there are two output rows for item_no 102. The row marked in green is the actual data record in the input, and the row marked in red is the subtotal record added by the ROLLUP function.

This demonstrates that NULL values in input are separate from the NULL values added by SQL aggregate extensions.

Grouping and Grouping_ID functions

GROUPING indicates whether a column in the GROUP BY list is aggregated or not. GROUPING(expr) returns 0 if a tuple is grouped on expr; otherwise it returns 1. GROUPING_ID(expr1, expr2, …, exprN) returns an integer representation of the bitmap that consists of GROUPING(expr1), GROUPING(expr2), …, GROUPING(exprN).

This feature helps us clearly understand the aggregation grain, slice and dice the data, and apply filters when business users are performing analysis. It also provides auditability for the generated aggregations.

For example, let’s use the preceding supp_sample table. The following ROLLUP query uses the GROUPING and GROUPING_ID functions:

SELECT region_nm,
nation_nm,
sum(acct_balance) as total_balance,
GROUPING(region_nm) as gr,
GROUPING(nation_nm) as gn,
GROUPING_ID(region_nm, nation_nm) as grn
FROM supp_sample
GROUP BY ROLLUP(region_nm, nation_nm)
ORDER BY region_nm;

In the following result set, the rows rolled up at the nation_nm level have the value 1 for gn. This indicates that total_balance is the aggregated value for all the nation_nm values in the region. The last row has a gr value of 1, indicating that total_balance is an aggregated value at the region level including all nations. The grn value is an integer representation of the bitmap (binary 11 translates to integer 3).
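
As a sketch of how these functions support filtering (our own example, not from the original analysis), the following query wraps the ROLLUP in a subquery and keeps only the region-level subtotal rows:

SELECT region_nm, total_balance
FROM (
    SELECT region_nm, nation_nm, sum(acct_balance) AS total_balance,
           GROUPING(region_nm) AS gr, GROUPING(nation_nm) AS gn
    FROM supp_sample
    GROUP BY ROLLUP(region_nm, nation_nm)
) t
WHERE gn = 1 AND gr = 0
ORDER BY region_nm;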

Performance assessment

Performance is often a key factor, and we wanted to make sure we’re offering the most performant SQL features in Amazon Redshift. We performed benchmarking with the 3 TB TPC-H public dataset on Amazon Redshift clusters of different sizes (5-node Ra3-4XL, 2-node Ra3-4XL, 2-node Ra3-XLPLUS). Additionally, we disabled query caching so that query results aren’t cached. This allows us to measure the performance of the database as opposed to its ability to serve results from cache. The results were consistent across multiple runs.
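
If you want to reproduce a similar test, you can turn off result caching for your session before running the queries; the following statement is a sketch based on the enable_result_cache_for_session session parameter:

-- Sketch: disable the Redshift result cache for the current session
SET enable_result_cache_for_session TO off;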

We loaded the supplier, region, and nation files from the 3 TB public dataset and created a view on top of those three tables, as shown in the following code. This query joins the three tables to create a unified record. The joined dataset is used for performance assessment.

create view v_supplier_balances as
select r.r_name region_nm,n.n_name nation_nm, s.s_acctbal acct_balance
from supplier s, nation n, region r
where
s.s_nationkey = n.n_nationkey
and n.n_regionkey = r.r_regionkey;

We ran the following example SELECT queries using GROUPING SETS, CUBE, and ROLLUP, and captured performance metrics in the following tables.
ROLLUP:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY ROLLUP (region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 120 118 117
2-node-Ra3-4XL 405 389 391
2-node-Ra3-XLPLUS 490 460 461

CUBE:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY CUBE(region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 224 215 214
2-node-Ra3-4XL 412 392 392
2-node-Ra3-XLPLUS 872 798 793

GROUPING SETS:

SELECT region_nm, nation_nm, sum(acct_balance) as total_balance
FROM v_supplier_balances
GROUP BY GROUPING SETS(region_nm, nation_nm)
ORDER BY region_nm;
Cluster Run 1 in ms Run 2 in ms Run 3 in ms
5-node-Ra3-4XL 210 198 198
2-node-Ra3-4XL 345 328 328
2-node-Ra3-XLPLUS 675 674 674

When we compared the same set of ROLLUP and CUBE queries against equivalent queries written with UNION ALL, we saw better performance with the ROLLUP and CUBE functionality.

Cluster CUBE (run in ms) ROLLUP (run in ms) UNION ALL (run in ms)
5-node-Ra3-4XL 214 117 321
2-node-Ra3-4XL 392 391 543
2-node-Ra3-XLPLUS 793 461 932

Clean up

To clean up your resources, drop the tables and views you created while following along with the example in this post.

Conclusion

In this post, we talked about the new aggregation extensions ROLLUP, CUBE, and GROUPING SETS added to Amazon Redshift. We also discussed general use cases, implementation examples, and performance results. You can simplify your existing aggregation queries using these new SQL aggregation extensions and use them in future development for building more simplified, readable queries. If you have any feedback or questions, please leave them in the comments section.


About the Authors

Satesh Sonti is a Sr. Analytics Specialist Solutions Architect based out of Atlanta, specialized in building enterprise data platforms, data warehousing, and analytics solutions. He has over 16 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.

Yanzhu Ji is a Product Manager on the Amazon Redshift team. She worked on the Amazon Redshift team as a Software Engineer before becoming a Product Manager. She has a rich experience of how the customer-facing Amazon Redshift features are built from planning to launching, and always treats customers’ requirements as first priority. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Dinesh Kumar is a Database Engineer with more than a decade of experience working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.

Securely validate business application resilience with AWS FIS and IAM

Post Syndicated from Dr. Rudolf Potucek original https://aws.amazon.com/blogs/devops/securely-validate-business-application-resilience-with-aws-fis-and-iam/

To avoid high costs of downtime, mission critical applications in the cloud need to achieve resilience against degradation of cloud provider APIs and services.

In 2021, AWS launched AWS Fault Injection Simulator (FIS), a fully managed service to perform fault injection experiments on workloads in AWS to improve their reliability and resilience. At the time of writing, FIS allows you to simulate degradation of Amazon Elastic Compute Cloud (EC2) APIs using API fault injection actions and thus explore the resilience of workflows where EC2 APIs act as a fault boundary.

In this post we show you how to explore additional fault boundaries in your applications by selectively denying access to any AWS API. This technique is particularly useful for fully managed, “black box” services like Amazon Simple Storage Service (S3) or Amazon Simple Queue Service (SQS) where a failure of read or write operations is sufficient to simulate problems in the service. This technique is also useful for injecting failures in serverless applications without needing to modify code. While similar results could be achieved with network disruption or modifying code with feature flags, this approach provides a fine granular degradation of an AWS API without the need to re-deploy and re-validate code.

Overview

We will explore a common application pattern: user uploads a file, S3 triggers an AWS Lambda function, Lambda transforms the file to a new location and deletes the original:

S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 1. S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

We will simulate the user upload with an Amazon EventBridge rate expression triggering an AWS Lambda function which creates a file in S3:

S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 2. S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Using this architecture we can explore the effect of S3 API degradation during file creation and deletion. As shown, the API call to delete a file from S3 is an application fault boundary. The failure could occur, with identical effect, because of S3 degradation or because the AWS IAM role of the Lambda function denies access to the API.

To inject failures we use AWS Systems Manager (AWS SSM) automation documents to attach and detach IAM policies at the API fault boundary and FIS to orchestrate the workflow.

Each Lambda function has an IAM execution role that allows S3 write and delete access, respectively. If the processor Lambda fails, the S3 file will remain in the bucket, indicating a failure. Similarly, if the IAM execution role for the processor function is denied the ability to delete a file after processing, that file will remain in the S3 bucket.

Prerequisites

Following this blog post will incur some costs for AWS services. To explore this test application you will need an AWS account. We will also assume that you are using AWS CloudShell or have the AWS CLI installed and have configured a profile with administrator permissions. With that in place you can create the demo application in your AWS account by downloading this template and deploying an AWS CloudFormation stack:

git clone https://github.com/aws-samples/fis-api-failure-injection-using-iam.git
cd fis-api-failure-injection-using-iam
aws cloudformation deploy --stack-name test-fis-api-faults --template-file template.yaml --capabilities CAPABILITY_NAMED_IAM

Fault injection using IAM

Once the stack has been created, navigate to the Amazon CloudWatch Logs console and filter for /aws/lambda/test-fis-api-faults. Under the EventBridgeTimerHandler log group you should find log events once a minute writing a timestamped file to an S3 bucket named fis-api-failure-ACCOUNT_ID. Under the S3TriggerHandler log group you should find matching deletion events for those files.

Once you have confirmed object creation/deletion, let’s take away the permission of the S3 trigger handler Lambda function to delete files. To do this you will attach the FISAPI-DenyS3DeleteObject policy that was created with the template:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam attach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

With the deny policy in place you should now see object deletion fail and objects should start showing up in the S3 bucket. Navigate to the S3 console and find the bucket starting with fis-api-failure. You should see a new object appearing in this bucket once a minute:

S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

Figure 3. S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

If you would like to graph the results you can navigate to AWS CloudWatch, select “Logs Insights“, select the log group starting with /aws/lambda/test-fis-api-faults-S3CountObjectsHandler, and run this query:

fields @timestamp, @message
| filter NumObjects >= 0
| sort @timestamp desc
| stats max(NumObjects) by bin(1m)
| limit 20

This will show the number of files in the S3 bucket over time:

AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

Figure 4. AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

You can now detach the policy:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam detach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

We see that newly written files will once again be deleted, but the unprocessed files will remain in the S3 bucket. From the fault injection we learned that our system does not tolerate request failures when deleting files from S3. To address this, we should add a dead letter queue or some other retry mechanism.

Note: if the Lambda function does not return a success state on invocation, EventBridge will retry. In our Lambda functions we are cost conscious and explicitly capture the failure states to avoid excessive retries.

Fault injection using SSM

To use this approach from FIS and to always remove the policy at the end of the experiment, we first create an SSM document to automate adding a policy to a role. To inspect this document, open the SSM console, navigate to the “Documents” section, find the FISAPI-IamAttachDetach document under “Owned by me”, and examine the “Content” tab (make sure to select the correct region). This document takes the name of the Role you want to impact and the Policy you want to attach as parameters. It also requires an IAM execution role that grants it the power to list, attach, and detach specific policies to specific roles.

Let’s run the SSM automation document from the console by selecting “Execute Automation”. Determine the ARN of the FISAPI-DenyS3DeleteObject policy from CloudFormation or by running:

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

Use FISAPI-SSM-Automation-Role, a duration of 2 minutes expressed in ISO8601 format as PT2M, the ARN of the deny policy, and the name of the target role FISAPI-TARGET-S3TriggerHandlerRole:

Image of parameter input field reflecting the instructions in blog text.

Figure 5. Image of parameter input field reflecting the instructions in blog text.

Alternatively execute this from a shell:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws ssm start-automation-execution \
  --document-name FISAPI-IamAttachDetach \
  --parameters "{
      \"AutomationAssumeRole\": [ \"${ASSUME_ROLE_ARN}\" ],
      \"Duration\": [ \"PT2M\" ],
      \"TargetResourceDenyPolicyArn\": [\"${POLICY_ARN}\" ],
      \"TargetApplicationRoleName\": [ \"${ROLE_NAME}\" ]
    }"

Wait two minutes and then examine the content of the S3 bucket starting with fis-api-failure again. You should now see two additional files in the bucket, showing that the policy was attached for 2 minutes during which files could not be deleted, and confirming that our application is not resilient to S3 API degradation.

Permissions for injecting failures with SSM

Fault injection with SSM is controlled by IAM, which is why you had to specify the FISAPI-SSM-Automation-Role:

Visual representation of IAM permission used for fault injections with SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the SSM user needing to have a pass-role permission to grant the SSM execution role to the SSM service.

Figure 6. Visual representation of IAM permission used for fault injections with SSM.

This role needs to contain an assume role policy statement for SSM to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
             - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "ssm.amazonaws.com"

The role also needs to contain permissions to describe roles and their attached policies with an optional constraint on which roles and policies are visible:

          - Sid: GetRoleAndPolicyDetails
            Effect: Allow
            Action:
              - 'iam:GetRole'
              - 'iam:GetPolicy'
              - 'iam:ListAttachedRolePolicies'
            Resource:
              # Roles
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn
              # Policies
              - !Ref AwsFisApiPolicyDenyS3DeleteObject

Finally, the SSM role needs to allow attaching and detaching a policy document. This requires:

  1. an ALLOW statement
  2. a constraint on the policies that can be attached
  3. a constraint on the roles that can be attached to

In the role we collapse the first two requirements into an ALLOW statement with a condition constraint for the Policy ARN. We then express the third requirement in a DENY statement that will limit the '*' resource to only the explicit role ARNs we want to modify:

          - Sid: AllowOnlyTargetResourcePolicies
            Effect: Allow
            Action:  
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            Resource: '*'
            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Policies that can be attached
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject
          - Sid: DenyAttachDetachAllRolesExceptApplicationRole
            Effect: Deny
            Action: 
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            NotResource: 
              # Roles that can be attached to
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

We will discuss security considerations in more detail at the end of this post.

Fault injection using FIS

With the SSM document in place you can now create an FIS template that calls the SSM document. Navigate to the FIS console and filter for FISAPI-DENY-S3PutObject. You should see that the experiment template passes the same parameters that you previously used with SSM:

Figure 7. Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

You can now run the FIS experiment and, after a couple of minutes, once again see new files in the S3 bucket.
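
If you prefer to start the experiment from the CLI, the following sketch shows one way to do it. It assumes the experiment template carries a Name tag with the value FISAPI-DENY-S3PutObject; adjust the tag value to match your template:

# Look up the experiment template by its Name tag (the tag value is an assumption)
TEMPLATE_ID=$( aws fis list-experiment-templates \
  --query "experimentTemplates[?tags.Name=='FISAPI-DENY-S3PutObject'].id" --output text )
echo Experiment template ID: $TEMPLATE_ID

# Start the experiment; when using the restricted roles described later in this post,
# also add a Name tag, for example: --tags Name=FISAPI-secure-1
aws fis start-experiment --experiment-template-id ${TEMPLATE_ID}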

Permissions for injecting failures with FIS and SSM

Fault injection with FIS is controlled by IAM, which is why you had to specify the FISAPI-FIS-Injection-ExperimentRole:

Figure 8. Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

This role needs to contain an assume role policy statement for FIS to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "fis.amazonaws.com"

The role also needs permissions to list and execute SSM documents:

            - Sid: RequiredReadActionsforAWSFIS
              Effect: Allow
              Action:
                - 'cloudwatch:DescribeAlarms'
                - 'ssm:GetAutomationExecution'
                - 'ssm:ListCommands'
                - 'iam:ListRoles'
              Resource: '*'
            - Sid: RequiredSSMStopActionforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:CancelCommand'
              Resource: '*'
            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

Finally, remember that the SSM document needs to use a role of its own to execute the fault injection actions. Because that role is different from the role under which we started the FIS experiment, we need to explicitly allow the experiment role to pass the SSM automation role to the SSM service with an iam:PassRole statement; the ${SsmAutomationRole} reference below expands to FISAPI-SSM-Automation-Role:

            - Sid: RequiredIAMPassRoleforSSMADocuments
              Effect: Allow
              Action: 'iam:PassRole'
              Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/${SsmAutomationRole}'

Secure and flexible permissions

So far, we have used explicit ARNs for our guardrails. To expand flexibility, we can use wildcards in our resource matching. For example, we might change the Policy matching from:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject

or the equivalent:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${FullPolicyName}'

to a wildcard notation like this:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Wildcard policies - secure and flexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${PolicyNamePrefix}*'

If we set PolicyNamePrefix to FISAPI-DenyS3, this would now allow attaching FISAPI-DenyS3PutObject and FISAPI-DenyS3DeleteObject but would not allow attaching a policy named FISAPI-DenyEc2DescribeInstances.

Similarly, we could change the Resource matching from:

            NotResource: 
              # Explicitly listed roles - secure but inflexible
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

to a wildcard equivalent like this:

            NotResource: 
              # Wildcard policies - secure and flexible
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixEventBridge}*'
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixS3}*'

and setting RoleNamePrefixEventBridge to FISAPI-TARGET-EventBridge and RoleNamePrefixS3 to FISAPI-TARGET-S3.

Finally, we would also change the FIS experiment role to allow SSM documents based on a name prefix by changing the constraint on automation execution from:

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Explicitly listed resource - secure but inflexible
                # Note: the $DEFAULT at the end could also be an explicit version number
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

to

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Wildcard resources - secure and flexible
                # 
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationDocumentPrefix}*'

and setting SsmAutomationDocumentPrefix to FISAPI-. Test this by updating the CloudFormation stack with a modified template:

aws cloudformation deploy --stack-name test-fis-api-faults --template-file template2.yaml --capabilities CAPABILITY_NAMED_IAM

Permissions governing users

In production you should not use administrator access to run FIS experiments. Instead, we create two roles, FISAPI-AssumableRoleWithCreation and FISAPI-AssumableRoleWithoutCreation, for you (see this template). These roles require all FIS and SSM resources to have a Name tag that starts with FISAPI-. Try assuming the role without creation privileges and running an experiment. You will notice that you can only start an experiment if you add a Name tag, e.g. FISAPI-secure-1, and you will only be able to get details of experiments and templates that have proper Name tags.

If you are working with AWS Organizations, you can add further guardrails by defining SCPs that control the use of the FISAPI-* tags, similar to the approach in this blog post.

Caveats

For this solution we are choosing to attach policies instead of permission boundaries. The benefit of this is that you can attach multiple independent policies and thus simulate multi-step service degradation. However, this means that it is possible to increase the permission level of a role. While there are situations where this might be of interest, e.g. to simulate security breaches, please implement a thorough security review of any fault injection IAM policies you create. Note that modifying IAM Roles may trigger events in your security monitoring tools.

The AttachRolePolicy and DetachRolePolicy calls in AWS IAM are eventually consistent, meaning that permission propagation when starting and stopping fault injection may in some cases take up to 5 minutes each.

Cleanup

To avoid additional cost, delete the content of the S3 bucket and delete the CloudFormation stack:

# Clean up policy attachments just in case
CLEANUP_ROLES=$(aws iam list-roles --query "Roles[?starts_with(RoleName,'FISAPI-')].RoleName" --output text)
for role in $CLEANUP_ROLES; do
  CLEANUP_POLICIES=$(aws iam list-attached-role-policies --role-name $role --query "AttachedPolicies[?starts_with(PolicyName,'FISAPI-')].PolicyArn" --output text)
  for policy in $CLEANUP_POLICIES; do
    echo Detaching policy $policy from role $role
    aws iam detach-role-policy --role-name $role --policy-arn $policy
  done
done
# Delete S3 bucket content
ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
S3_BUCKET_NAME=fis-api-failure-${ACCOUNT_ID}
aws s3 rm --recursive s3://${S3_BUCKET_NAME}
aws s3 rb s3://${S3_BUCKET_NAME}
# Delete cloudformation stack
aws cloudformation delete-stack --stack-name test-fis-api-faults
aws cloudformation wait stack-delete-complete --stack-name test-fis-api-faults

Conclusion 

AWS Fault Injection Simulator provides the ability to simulate various external impacts to your application to validate and improve resilience. We’ve shown how combining FIS with IAM to selectively deny access to AWS APIs provides a generic path to explore fault boundaries across all AWS services. We’ve shown how this can be used to identify and improve a resilience problem in a common S3 upload workflow. To learn about more ways to use FIS, see this workshop.

About the authors:

Dr. Rudolf Potucek

Dr. Rudolf Potucek is Startup Solutions Architect at Amazon Web Services. Over the past 30 years he gained a PhD and worked in different roles including leading teams in academia and industry, as well as consulting. He brings experience from working with academia, startups, and large enterprises to his current role of guiding startup customers to succeed in the cloud.

Rudolph Wagner

Rudolph Wagner is a Premium Support Engineer at Amazon Web Services who holds the CISSP and OSCP security certifications, in addition to being a certified AWS Solutions Architect Professional. He assists internal and external Customers with multiple AWS services by using his diverse background in SAP, IT, and construction.

Build a real-time GDPR-aligned Apache Iceberg data lake

Post Syndicated from Dhiraj Thakur original https://aws.amazon.com/blogs/big-data/build-a-real-time-gdpr-aligned-apache-iceberg-data-lake/

Data lakes are a popular choice for today’s organizations to store the data generated by their business activities. As a best practice of data lake design, data should be immutable once stored. But regulations such as the General Data Protection Regulation (GDPR) have created obligations for data operators who must be able to erase or update personal data in their data lake when requested.

A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment. When a customer asks to erase or update private data, the data lake operator needs to find the required objects in Amazon S3 that contain the required data and take steps to erase or update that data. This activity can be a complex process for the following reasons:

  • Data lakes may contain many S3 objects (each of which may contain multiple rows), and it’s often difficult to find the objects that contain the exact data to be erased or the personally identifiable information (PII) to be updated as per the request
  • By nature, S3 objects are immutable and therefore applying direct row-based transactions like DELETE or UPDATE isn’t possible

To handle these situations, a transactional feature on S3 objects is required, and frameworks such as Apache Hudi or Apache Iceberg provide transactional capabilities for upserts in Amazon S3.

AWS contributed the Apache Iceberg integration with the AWS Glue Data Catalog, which enables you to use open-source data computation engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support for Iceberg, enabling transactional queries on S3 objects.

In this post, we show you how to stream real-time data to an Iceberg table in Amazon S3 using AWS Glue streaming and perform transactions using Amazon Athena for deletes and updates. We use a serverless mechanism for this implementation, which requires minimum operational overhead to manage and fine-tune various configuration parameters, and enables you to extend your use case to ACID operations beyond the GDPR.

Solution overview

We used the Amazon Kinesis Data Generator (KDG) to produce synthetic streaming data in Amazon Kinesis Data Streams and then processed the streaming input data using AWS Glue streaming to store the data in Amazon S3 in Iceberg table format. As part of the customer’s request, we ran delete and update statements using Athena with Iceberg support.

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

  1. Streaming data is generated in JSON format using the KDG template and inserted into Kinesis Data Streams.
  2. An AWS Glue streaming job is connected to Kinesis Data Streams to process the data using the Iceberg connector.
  3. The streaming job output is stored in Amazon S3 in Iceberg table format.
  4. Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in Iceberg format.
  5. Athena interacts with the Data Catalog tables in Iceberg format for transactional queries required for GDPR.

The codebase required for this post is available in the GitHub repository.

Prerequisites

Before starting the implementation, make sure the following prerequisites are met:

Deploy resources using AWS CloudFormation

Complete the following steps to deploy your solution resources:

  1. After you sign in to your AWS account, launch the CloudFormation template by choosing Launch Stack:
  2. For Stack name, enter a name.
  3. For Username, enter the user name for the KDG.
  4. For Password, enter the password for the KDG (this must be at least six alphanumeric characters, and contain at least one number).
  5. For IAMGlueStreamingJobRoleName, enter a name for the IAM role used for the AWS Glue streaming job.
  6. Choose Next and create your stack.

This CloudFormation template configures the following resources in your account:

  • An S3 bucket named streamingicebergdemo-XX (note that the XX part is a random unique number to make the S3 bucket name unique)
  • An IAM policy and role
  • The KDG URL used for creating synthetic data
  1. After you complete the setup, go to the Outputs tab of the CloudFormation stack to get the S3 bucket name, AWS Glue job execution role (as per your input), and KDG URL.
  2. Before proceeding with the demo, create a folder named custdata under the created S3 bucket.

Create a Kinesis data stream

We use Kinesis Data Streams to create a serverless streaming data service that is built to handle millions of events with low latency. The following steps guide you on how to create the data stream in the us-east-1 Region:

  1. Log in to the AWS Management Console.
  2. Navigate to Kinesis console (make sure the Region is us-east-1).
  3. Select Kinesis Data Streams and choose Create data stream.
  4. For Data stream name, enter demo-data-stream.
  5. For this post, we select On-demand as the Kinesis data stream capacity mode.

On-demand mode works to eliminate the need for provisioning and managing the capacity for streaming data. However, you can implement this solution with Kinesis Data Streams in provisioned mode as well.

  1. Choose Create data stream.
  2. Wait for successful creation of demo-data-stream and for it to be in Active status.
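
Alternatively, you can create the same on-demand stream from the AWS CLI; this is a sketch using the stream name and Region from the console steps above:

# Create an on-demand Kinesis data stream in us-east-1
aws kinesis create-stream \
  --stream-name demo-data-stream \
  --stream-mode-details StreamMode=ON_DEMAND \
  --region us-east-1

# Wait until the stream reaches Active status
aws kinesis wait stream-exists --stream-name demo-data-stream --region us-east-1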

Set up the Kinesis Data Generator

To create a sample streaming dataset, we use the KDG URL generated on the CloudFormation stack Outputs tab and log in with the credentials used in the parameters for the CloudFormation template. For this post, we use the following template to generate sample data in the demo-data-stream Kinesis data stream.

  1. Log in to the KDG URL with the user name and password you supplied during stack creation.
  2. Change the Region to us-east-1.
  3. Select the Kinesis data stream demo-data-stream.
  4. For Records per second, choose Constant and enter 100 (it can be another number, depending on the rate of record creation).
  5. On the Template 1 tab, enter the KDG data generation template:
{
  "year": "{{random.number({"min":2000,"max":2022})}}",
  "month": "{{random.number({"min":1,"max":12})}}",
  "day": "{{random.number({"min":1,"max":30})}}",
  "hour": "{{random.number({"min":0,"max":24})}}",
  "minute": "{{random.number({"min":0,"max":60})}}",
  "customerid": {{random.number({"min":5023,"max":59874})}},
  "firstname": "{{name.firstName}}",
  "lastname": "{{name.lastName}}",
  "dateofbirth": "{{date.past(70)}}",
  "city": "{{address.city}}",
  "buildingnumber": {{random.number({"min":63,"max":947})}},
  "streetaddress": "{{address.streetAddress}}",
  "state": "{{address.state}}",
  "zipcode": "{{address.zipCode}}",
  "country": "{{address.country}}",
  "countrycode": "{{address.countryCode}}",
  "phonenumber": "{{phone.phoneNumber}}",
  "productname": "{{commerce.productName}}",
  "transactionamount": {{random.number({"min":10,"max":150})}}
}
  1. Choose Test template to test the sample records.
  2. When the testing is correct, choose Send data.

This will start sending 100 records per second in the Kinesis data stream. (To stop sending data, choose Stop Sending Data to Kinesis.)
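
To confirm that records are arriving in the stream, you can read a few of them back with the AWS CLI. The following is a sketch for a quick check against the first shard; the returned record data is Base64-encoded JSON:

SHARD_ID=$( aws kinesis list-shards --stream-name demo-data-stream \
  --query 'Shards[0].ShardId' --output text )
SHARD_ITERATOR=$( aws kinesis get-shard-iterator --stream-name demo-data-stream \
  --shard-id ${SHARD_ID} --shard-iterator-type TRIM_HORIZON \
  --query 'ShardIterator' --output text )
# Fetch up to five records from the shard
aws kinesis get-records --shard-iterator ${SHARD_ITERATOR} --limit 5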

Integrate Iceberg with AWS Glue

To add the Apache Iceberg Connector for AWS Glue, complete the following steps. The connector is free to use and supports AWS Glue 1.0, 2.0, and 3.0.

  1. On the AWS Glue console, choose AWS Glue Studio in the navigation pane.
  2. In the navigation pane, navigate to AWS Marketplace.
  3. Search for and choose Apache Iceberg Connector for AWS Glue.
  4. Choose Accept Terms and Continue to Subscribe.
  5. Choose Continue to Configuration.
  6. For Fulfillment option, choose your AWS Glue version.
  7. For Software version, choose the latest software version.
  8. Choose Continue to Launch.
  9. Under Usage Instructions, choose the link to activate the connector.
  10. Enter a name for the connection, then choose Create connection and activate the connector.
  11. Verify the new connector on the AWS Glue Studio Connectors page.

Create the AWS Glue Data Catalog database

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location and schema of your data. You use the information in the Data Catalog to create and monitor your ETL jobs.

For this post, we create a Data Catalog database named icebergdemodb containing the metadata information of a table named customer, which will be queried through Athena.

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose Add database.
  3. For Database name, enter icebergdemodb.

This creates an AWS Glue database for metadata storage.

Create a Data Catalog table in Iceberg format

In this step, we create a Data Catalog table in Iceberg table format.

  1. On the Athena console, create an Athena workgroup named demoworkgroup for SQL queries.
  2. Choose Athena engine version 3 for Query engine version.

For more information about Athena versions, refer to Changing Athena engine versions.

  1. Enter the S3 bucket location for Query result configuration under Additional configurations.
  2. Open the Athena query editor and choose demoworkgroup.
  3. Choose the database icebergdemodb.
  4. Enter and run the following DDL to create a table in the Data Catalog database icebergdemodb. Note that the TBLPROPERTIES section specifies ICEBERG as the table type and LOCATION points to the S3 folder (custdata) URI created in earlier steps. This DDL command is available on the GitHub repo.
CREATE TABLE icebergdemodb.customer(
year string,
month string,
day string,
hour string,
minute string,
customerid string,
firstname string,
lastname string,
dateofbirth string,
city string,
buildingnumber string,
streetaddress string,
state string,
zipcode string,
country string,
countrycode string,
phonenumber string,
productname string,
transactionamount int)
LOCATION '<S3 Location URI>'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet',
'write_target_data_file_size_bytes'='536870912',
'optimize_rewrite_delete_file_threshold'='10'
);

After you run the command successfully, you can see the table customer in the Data Catalog.

Create an AWS Glue streaming job

In this section, we create the AWS Glue streaming job, which fetches the record from the Kinesis data stream using the Spark script editor.

  1. On the AWS Glue console, choose Jobs (new) in the navigation pane.
  2. For Create job, select Spark script editor.
  3. For Options, select Create a new script with boilerplate code.
  4. Choose Create.
  5. Enter the code available in the GitHub repo in the editor.

The sample code keeps appending data in the target location by fetching records from the Kinesis data stream.

  1. Choose the Job details tab in the query editor.
  2. For Name, enter Demo_Job.
  3. For IAM role, choose demojobrole.
  4. For Type, choose Spark Streaming.
  5. For Glue Version, choose Glue 3.0.
  6. For Language, choose Python 3.
  7. For Worker type, choose G 0.25X.
  8. Select Automatically scale the number of workers.
  9. For Maximum number of workers, enter 5.
  10. Under Advanced properties, select Use Glue Data Catalog as the Hive metastore.
  11. For Connections, choose the connector you created.
  12. For Job parameters, enter the following key pairs (provide your S3 bucket and account ID):
Key                                Value
--iceberg_job_catalog_warehouse    s3://streamingicebergdemo-XX/custdata/
--output_path                      s3://streamingicebergdemo-XX
--kinesis_arn                      arn:aws:kinesis:us-east-1:<AWS Account ID>:stream/demo-data-stream
--user-jars-first                  True

  1. Choose Run to start the AWS Glue streaming job.
  2. To monitor the job, choose Monitoring in the navigation pane.
  3. Select Demo_Job and choose View run details to check the job run details and Amazon CloudWatch logs.
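
You can also check recent runs from the AWS CLI; this sketch assumes the job name Demo_Job used above:

# Show the run states of recent runs of Demo_Job
aws glue get-job-runs --job-name Demo_Job --query 'JobRuns[].JobRunState'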

Run GDPR use cases on Athena

In this section, we demonstrate a few use cases that are relevant to GDPR alignment with the user data that’s stored in Iceberg format in the Amazon S3-based data lake as implemented in the previous steps. For this, let’s consider that the following requests are being initiated in the workflow to comply with the regulations:

  • Delete the records for the input customerid (for example, 59289)
  • Update phonenumber for the customerid (for example, 51936)

The IDs used in this example are samples only; they were generated by the KDG template used earlier. You can search for IDs in your implementation by querying through the Athena query editor. The steps remain the same.

Delete data by customer ID

Complete the following steps to fulfill the first use case:

  1. On the Athena console, make sure icebergdemodb is chosen as the database.
  2. Open the query editor.
  3. Enter the following query using a customer ID and choose Run:
SELECT count(*)
FROM icebergdemodb.customer
WHERE customerid = '59289';

This query gives the count of records for the input customerid before delete.

  1. Enter the following query with the same customer ID and choose Run:
MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '59289') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN DELETE;

This query deletes the data for the input customerid as per the workflow generated.

  1. Test if there is data with the customer ID using a count query.

The count should be 0.

Update data by customer ID

Complete the following steps to test the second use case:

  1. On the Athena console, make sure icebergdemodb is chosen as the database.
  2. Open the query editor.
  3. Enter the following query with a customer ID and choose Run.
SELECT customerid, phonenumber
FROM icebergdemodb.customer
WHERE customerid = '51936';

This query gives the value for phonenumber before update.

  1. Run the following query to update the required columns:
MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '51936') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN UPDATE SET phonenumber = '000';

This query updates the data to a dummy value.

  1. Run the SELECT query to check the update.

You can see the data is updated correctly.

Vacuum table

A good practice is to run the VACUUM command periodically on the table because operations like INSERT, UPDATE, DELETE, and MERGE will take place on the Iceberg table. See the following code:

VACUUM icebergdemodb.customer;
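
You can also submit the same statement through the Athena API, for example from a scheduled maintenance script. The following sketch reuses the demoworkgroup and icebergdemodb names from earlier steps:

# Submit the VACUUM statement to Athena and capture the query execution ID
QUERY_ID=$( aws athena start-query-execution \
  --work-group demoworkgroup \
  --query-execution-context Database=icebergdemodb \
  --query-string "VACUUM icebergdemodb.customer" \
  --query 'QueryExecutionId' --output text )

# Check the status of the query
aws athena get-query-execution --query-execution-id ${QUERY_ID} \
  --query 'QueryExecution.Status.State'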

Considerations

The following are a few considerations to keep in mind for this implementation:

Clean up

Complete the following steps to clean up the resources you created for this post:

    1. Delete the custdata folder in the S3 bucket.
    2. Delete the CloudFormation stack.
    3. Delete the Kinesis data stream.
    4. Delete the S3 bucket storing the data.
    5. Delete the AWS Glue job and Iceberg connector.
    6. Delete the AWS Glue Data Catalog database and table.
    7. Delete the Athena workgroup.
    8. Delete the IAM roles and policies.

Conclusion

This post explained how you can use the Iceberg table format on Athena to implement GDPR use cases like data deletion and data upserts as required, when streaming data is being generated and ingested through AWS Glue streaming jobs in Amazon S3.

The operations for the Iceberg table that we demonstrated in this post aren’t all of the data operations that Iceberg supports. Refer to the Apache Iceberg documentation for details on various operations.


About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Rajdip Chaudhuri is Solutions Architect with Amazon Web Services specializing in data and analytics. He enjoys working with AWS customers and partners on data and analytics requirements. In his spare time, he enjoys soccer.

Introducing AWS Glue crawlers using AWS Lake Formation permission management

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/introducing-aws-glue-crawlers-using-aws-lake-formation-permission-management/

Data lakes provide a centralized repository that consolidates your data at scale and makes it available for different kinds of analytics. AWS Glue crawlers are a popular way to scan data in a data lake, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. AWS Lake Formation enables you to centrally govern, secure, and share your data, and lets you scale permissions easily.

We are pleased to announce the integration of AWS Glue crawlers with Lake Formation. You can now use Lake Formation permissions for the crawler’s access to your Lake Formation managed data lakes, whether those are in your account or in other accounts. Before this release, you had to set up the AWS Glue crawler IAM role with Amazon Simple Storage Service (Amazon S3) permissions to crawl a data source on Amazon S3, and also establish S3 bucket policies on the source bucket so the crawler role could access the data. Now you can use Lake Formation permissions defined on the data lake for crawling the data, and you no longer need to configure dedicated S3 permissions for crawlers. Lake Formation manages the crawler IAM role’s access to the various S3 buckets and prefixes through data location permissions, which simplifies security management. Furthermore, you can apply the same security model to crawlers, AWS Glue jobs, and Amazon Athena for centralized governance.

When you configure an AWS Glue crawler to use Lake Formation, by default, the crawler uses Lake Formation in the same account to obtain data access credentials. However, you can also configure the crawler to use Lake Formation in a different account by providing an account ID during creation. The cross-account capability allows you to perform permissions management from a central governance account. Customers prefer the central governance experience over writing bucket policies separately in each bucket-owning account. To build a data mesh architecture, you can author permissions in a single Lake Formation governance account to manage access to data locations and crawlers spanning multiple accounts in your data lake. You can refer to How to configure a crawler to use Lake Formation credentials for more information.

In this post, we walk through a single in-account architecture that shows how to enable Lake Formation permissions on the data lake, configure an AWS Glue crawler with Lake Formation permission to scan and populate schema from an S3 data lake into the AWS Glue Data Catalog, and then use an analytical engine like Amazon Athena to query the data.

Solution overview

The AWS Glue crawler and Lake Formation integration supports in-account crawling as well as cross-account crawling. You can configure a crawler to use Lake Formation permissions to access an S3 data store or a Data Catalog table with an underlying S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler’s target if the crawler and the Data Catalog table reside in the same account. The following figure shows the in-account crawling architecture.

Prerequisites

Complete the following prerequisite steps:

  1. Sign in to the Lake Formation console as admin.
  2. If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
  3. In the navigation pane, under Data catalog, choose Settings.
  4. Deselect Use only IAM access control for new databases.
  5. Deselect Use only IAM access control for new tables in new databases.
  6. Keep Version 3 as the current cross-account version.
  7. Choose Save.

Set up your solution resources

We set up the solution resources using AWS CloudFormation. Complete the following steps:

  1. Log in to the AWS Management Console as IAM administrator.
  2. Choose Launch Stack to deploy a CloudFormation template:
  3. For LFBusinessAnalystUserName, keep as the default LFBusinessAnalyst.
  4. Create your stack.
  5. When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
  6. Note down the values of Databasename, DataLakeBucket, and GlueCrawlerName.
  7. Choose the LFBusinessAnalystUserCredentials value to navigate to the AWS Secrets Manager console.
  8. In the Secret value section, choose Retrieve secret value.
  9. Note down the secret value for the password for IAM user LFBusinessAnalyst.
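
You can also retrieve the same secret from the AWS CLI; the secret ID below is a placeholder for the LFBusinessAnalystUserCredentials secret created by the stack:

aws secretsmanager get-secret-value \
  --secret-id <LFBusinessAnalystUserCredentials-secret-name> \
  --query SecretString --output text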

Validate resources

In your account, validate the following resources created by template:

  • AWS Glue database – The Databasename value noted from the CloudFormation template.
  • S3 bucket for the data lake with sample data – The DataLakeBucket value noted from the CloudFormation template.
  • AWS Glue crawler and IAM role with required permission – The GlueCrawlerName value noted from the CloudFormation template.

The template registers the S3 bucket with Lake Formation as the data location. To verify this, on the Lake Formation console, in the navigation pane, choose Data lake locations under Register and ingest.

The template also grants data location permission on the S3 bucket to the crawler role. On the Lake Formation console, in the navigation pane, choose Data locations under Permissions.

Lastly, the template grants database permission to the crawler role. On the Lake Formation console, in the navigation pane, choose Data lake permissions under Permissions.

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Locate the crawler lfcrawler-<your-account-id> and edit it.
  3. Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
  4. Choose Next.
  5. Review and update the crawler settings.

Note that the crawler IAM role uses Lake Formation permission to access the data and doesn’t have any S3 policies.

  1. Run the crawler and verify that the crawler run is complete.
  2. In the AWS Glue database lfcrawlerdb<your-account-id>, verify that the table is created and that the schema matches what you have in the S3 bucket.

The crawler was able to crawl the S3 data source and successfully populate the schema using Lake Formation permissions.
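
If you prefer the CLI, the following sketch starts the crawler and checks its state; replace <your-account-id> to match the crawler name created by the stack:

# Start the crawler and poll its state until it returns READY
aws glue start-crawler --name lfcrawler-<your-account-id>
aws glue get-crawler --name lfcrawler-<your-account-id> --query 'Crawler.State'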

Grant access to the data analyst using Lake Formation

Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalyst user via the Lake Formation console.

Grant the LFBusinessAnalyst IAM user access to the database with Describe permissions.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user LFBusinessAnalyst.
  5. Under LF-Tags or catalog resources, choose lfcrawlerdb<your-accountid> for Databases.
  6. Select Describe for Database permissions.
  7. Choose Grant to apply the permissions.

Grant the LFBusinessAnalyst IAM user Select and Describe access to the table.

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
  2. Choose Grant.
  3. Under Principals, select IAM users and roles.
  4. Choose the IAM user LFBusinessAnalyst.
  5. Under LF-Tags or catalog resources, choose lfcrawlerdb<your-accountid> for Databases and lf_datalake_<your-accountid>_<region> for Tables.
  6. Choose Select, Describe for Table permissions.
  7. Choose Grant to apply the permissions.
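
The same grants can be issued from the AWS CLI. The following sketch assumes the database and table names created earlier; replace <account-id> and <region> with your values:

# Grant Describe on the database to the LFBusinessAnalyst user
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<account-id>:user/LFBusinessAnalyst \
  --permissions DESCRIBE \
  --resource '{"Database": {"Name": "lfcrawlerdb<account-id>"}}'

# Grant Select and Describe on the table to the LFBusinessAnalyst user
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<account-id>:user/LFBusinessAnalyst \
  --permissions SELECT DESCRIBE \
  --resource '{"Table": {"DatabaseName": "lfcrawlerdb<account-id>", "Name": "lf_datalake_<account-id>_<region>"}}'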

Verify the tables using Athena

To verify the tables using Athena, complete the following steps:

  1. Log in as LFBusinessAnalyst using the password noted earlier through the CloudFormation stack.
  2. On the Athena console, choose lfconsumer-primary-workgroup as the Athena workgroup.
  3. Run the query to validate access as shown in the following screenshot.

We have successfully crawled the Amazon S3 data store using the crawler with Lake Formation permissions and populated the metadata in the AWS Glue Data Catalog. We then granted Lake Formation permissions on the database and table to the consumer user and validated the user’s access to the data using Athena.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the stack you created.

Summary

In this post, we showed how to use the new AWS Glue crawler integration with Lake Formation. Data lake admins can now share crawled tables with data analysts using Lake Formation, allowing analysts to use analytical services such as Athena. You can centrally manage all permissions in Lake Formation, making it easier to administer and protect data lakes.

Special thanks to everyone who contributed to this crawler feature launch: Anshuman Sharma, Jessica Cheng, Aditya K, Sandya Krishnanand

If you have questions or suggestions, submit them in the comments section.


About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Configure ADFS Identity Federation with Amazon QuickSight

Post Syndicated from Adeleke Coker original https://aws.amazon.com/blogs/big-data/configure-adfs-identity-federation-with-amazon-quicksight/

Amazon QuickSight Enterprise edition can integrate with your existing Microsoft Active Directory (AD), providing federated access using Security Assertion Markup Language (SAML) to dashboards. Using existing identities from Active Directory eliminates the need to create and manage separate user identities in AWS Identity and Access Management (IAM). Federated users assume an IAM role when access is requested through an identity provider (IdP) such as Active Directory Federation Services (AD FS), based on AD group membership. Although you can connect AD to QuickSight using AWS Directory Service, this blog focuses on federated logon to QuickSight dashboards.

With identity federation, your users get one-click access to Amazon QuickSight applications using their existing identity credentials. You also have the security benefit of identity authentication by your IdP. You can control which users have access to QuickSight using your existing IdP. Refer to Using identity federation and single sign-on (SSO) with Amazon QuickSight for more information.

In this post, we demonstrate how you can use a corporate email address as an authentication option for signing in to QuickSight. This post assumes you have an existing Microsoft Active Directory Federation Services (ADFS) configured in your environment.

Solution overview

While connecting to QuickSight from an IdP, your users initiate the sign-in process from the IdP portal. After the users are authenticated, they are automatically signed in to QuickSight. After QuickSight checks that they are authorized, your users can access QuickSight.

The following diagram shows an authentication flow between QuickSight and a third-party IdP. In this example, the administrator has set up a sign-in page to access QuickSight. When a user signs in, the sign-in page posts a request to a federation service that complies with SAML 2.0. The end-user initiates authentication from the sign-in page of the IdP. For more information about the authentication flow, see Initiating sign-on from the identity provider (IdP).

QuickSight IdP flow

The solution consists of the following high-level steps:

  1. Create an identity provider.
  2. Create IAM policies.
  3. Create IAM roles.
  4. Configure AD groups and users.
  5. Create a relying party trust.
  6. Configure claim rules.
  7. Configure QuickSight single sign-on (SSO).
  8. Configure the relay state URL for QuickSight.

Prerequisites

The following are the prerequisites to build the solution explained in this post:

  • An existing or newly deployed AD FS environment.
  • An AD user with permissions to manage AD FS and AD group membership.
  • An IAM user with permissions to create IAM policies and roles, and administer QuickSight.
  • The metadata document from your IdP. To download it, refer to Federation Metadata Explorer.

Create an identity provider

To add your IdP, complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Add provider.
  3. For Provider type, select SAML.
  4. For Provider name, enter a name (for example, QuickSight_Federation).
  5. For Metadata document, upload the metadata document you downloaded as a prerequisite.
  6. Choose Add provider.
  7. Copy the ARN of this provider to use in a later step.

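
You can also create the same identity provider from the AWS CLI; the metadata file name below is an assumption for the document you downloaded as a prerequisite:

aws iam create-saml-provider \
  --name QuickSight_Federation \
  --saml-metadata-document file://federation-metadata.xml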

Create IAM policies

In this step, you create IAM policies that allow users to access QuickSight only after federating their identities. To provide access to QuickSight and also the ability to create QuickSight admins, authors (standard users), and readers, use the following policy examples.

The following code is the author policy:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateUser"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

The following code is the reader policy:

{ 
"Version": "2012-10-17", 
"Statement": [ 
{ 
"Effect": "Allow",
"Action": "quicksight:CreateReader", 
"Resource": "*" 
} 
] 
}

The following code is the admin policy:

{
    "Statement": [
        {
            "Action": [
                "quicksight:CreateAdmin"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
        }
    ],
    "Version": "2012-10-17"
}

Create IAM roles

You can configure email addresses for your users to use when provisioning through your IdP to QuickSight. To do this, add the sts:TagSession action to the trust relationship for the IAM role that you use with AssumeRoleWithSAML. Make sure the IAM role names start with ADFS-.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create new role.
  3. For Trusted entity type, select SAML 2.0 federation.
  4. Choose the SAML IdP you created earlier.
  5. Select Allow programmatic and AWS Management Console access.
  6. Choose Next.
  7. Choose the admin policy you created, then choose Next.
  8. For Name, enter ADFS-ACCOUNTID-QSAdmin.
  9. Choose Create.
  10. On the Trust relationships tab, edit the trust relationships as follows so you can pass principal tags when users assume the role (provide your account ID and IdP):
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::ACCOUNTID:saml-provider/Identity_Provider"
            },
            "Action":[ "sts:AssumeRoleWithSAML",
	 "sts:TagSession"],
            "Condition": {
                "StringEquals": {
                    "SAML:aud": "https://signin.aws.amazon.com/saml"
                },
                "StringLike": {
                    "aws:RequestTag/Email": "*"
                }
            }
            }
    ]
}
  1. Repeat this process for the role ADFS-ACCOUNTID-QSAuthor and attach the author IAM policy.
  2. Repeat this process for the role ADFS-ACCOUNTID-QSReader and attach the reader IAM policy.

Configure AD groups and users

Now you need to create AD groups that determine the permissions to sign in to AWS. Create an AD security group for each of the three roles you created earlier. Note that the group names should follow the same format as your IAM role names.

One approach for creating the AD groups that uniquely identify the IAM role mapping is to select a common group naming convention. For example, your AD groups could start with an identifier such as AWS-, which distinguishes your AWS groups from others within the organization. Next, include the 12-digit AWS account number. Finally, add the matching role name within the AWS account. You should do this for each role and corresponding AWS account you wish to support with federated access. The following screenshot shows an example of the naming convention we use in this post.

AD Groups

Later in this post, we create a claim rule that picks up AD groups starting with AWS- and strips the AWS-ACCOUNTID- prefix from the group name so that it matches the respective IAM role, which is why we use this naming convention here.

Users in Active Directory can subsequently be added to the groups, giving them the ability to assume the corresponding roles in AWS. You can add AD users to the respective groups based on your business permissions model. Note that each user must have an email address configured in Active Directory.

Create a relying party trust

To add a relying party trust, complete the following steps:

  1. Open the AD FS Management Console.
  2. Choose (right-click) Relying Party Trusts, then choose Add Relying Party Trust.
  3. Choose Claims aware, then choose Start.
  4. Select Import data about the relying party published online or on a local network.
  5. For Federation metadata address, enter https://signin.aws.amazon.com/static/saml-metadata.xml.
  6. Choose Next.
  7. Enter a descriptive display name, for example Amazon QuickSight Federation, then choose Next.
  8. Choose your access control policy (for this post, Permit everyone), then choose Next.
  9. In the Ready to Add Trust section, choose Next.
  10. Leave the defaults, then choose Close.

Configure claim rules

In this section, you create claim rules that identify accounts, set LDAP attributes, get the AD groups, and match them to the roles created earlier. Complete the following steps to create the claim rules for NameId, RoleSessionName, Get AD Groups, Roles, and (optionally) Session Duration:

  1. Select the relying party trust you just created, then choose Edit Claim Issuance Policy.
  2. Add a rule called NameId with the following parameters:
    1. For Claim rule template, choose Transform an Incoming Claim.
    2. For Claim rule name, enter NameId
    3. For Incoming claim type, choose Windows account name.
    4. For Outgoing claim type, choose Name ID.
    5. For Outgoing name ID format, choose Persistent Identifier.
    6. Select Pass through all claim values.
    7. Choose Finish.
  3. Add a rule called RoleSessionName with the following parameters:
    1. For Claim rule template, choose Send LDAP Attributes as Claims.
    2. For Claim rule name, enter RoleSessionName.
    3. For Attribute store, choose Active Directory.
    4. For LDAP Attribute, choose E-Mail-Addresses.
    5. For Outgoing claim type, enter https://aws.amazon.com/SAML/Attributes/RoleSessionName.
    6. Add another E-Mail-Addresses LDAP attribute and for Outgoing claim type, enter https://aws.amazon.com/SAML/Attributes/PrincipalTag:Email.
    7. Choose OK.
  4. Add a rule called Get AD Groups with the following parameters:
    1. For Claim rule template, choose Send Claims Using a Custom Rule.
    2. For Claim rule name, enter Get AD Groups
    3. For Custom Rule, enter the following code:
      c:[Type == "http://schemas.microsoft.com/ws/2008/06/identity/claims/windowsaccountname", Issuer == "AD AUTHORITY"] => add(store = "Active Directory", types = ("http://temp/variable"), query = ";tokenGroups;{0}", param = c.Value);

    4. Choose OK.
  1. Add a rule called Roles with the following parameters:
    1. For Claim rule template, choose Send Claims Using a Custom Rule.
    2. For Claim rule name, enter Roles
    3. For Custom Rule, enter the following code (provide your account ID and IdP):
      c:[Type == "http://temp/variable", Value =~ "(?i)^AWS-ACCOUNTID"]=&gt; issue(Type = "https://aws.amazon.com/SAML/Attributes/Role", Value = RegExReplace(c.Value, "AWS-ACCOUNTID-", "arn:aws:iam:: ACCOUNTID:saml-provider/your-identity-provider-name,arn:aws:iam:: ACCOUNTID:role/ADFS-ACCOUNTID-"));

    4. Choose Finish.

Optionally, you can create a rule called Session Duration. This configuration determines how long a session is open and active before users are required to reauthenticate. The value is in seconds. For this post, we configure the rule for 8 hours.

  1. Add a rule called Session Duration with the following parameters:
    1. For Claim rule template, choose Send Claims Using a Custom Rule.
    2. For Claim rule name, enter Session Duration.
    3. For Custom Rule, enter the following code:
      => issue(Type = "https://aws.amazon.com/SAML/Attributes/SessionDuration", Value = "28800");

    4. Choose Finish.

You should be able to see these five claim rules, as shown in the following screenshot.
All Claims Rules

  1. Choose OK.
  2. Run the following commands in PowerShell on your AD FS server:
Set-AdfsProperties -EnableIdPInitiatedSignonPage $true

Set-AdfsProperties -EnableRelayStateForIdpInitiatedSignOn $true
  1. Stop and start the AD FS service from PowerShell:
net stop adfssrv

net start adfssrv

Configure E-mail Syncing

With QuickSight Enterprise edition integrated with an IdP, you can restrict new users from using personal email addresses. This means users can only log in to QuickSight with their on-premises configured email addresses. This approach allows users to bypass manually entering an email address. It also ensures that users can’t use an email address that might differ from the email address configured in Active Directory.

QuickSight uses the preconfigured email addresses passed through the IdP when provisioning new users to your account. For example, you can make it so that only corporate-assigned email addresses are used when users are provisioned to your QuickSight account through your IdP. When you configure email syncing for federated users in QuickSight, users who log in to your QuickSight account for the first time have preassigned email addresses. These are used to register their accounts.

To configure E-mail syncing for federated users in QuickSight, complete the following steps:

  1. Log in to your QuickSight dashboard with a QuickSight administrator account.
  2. Choose the profile icon.
  3. On the drop-down menu, choose Manage QuickSight.
  4. In the navigation pane, choose Single sign-on (SSO).
  5. For Email Syncing for Federated Users, select ON, then choose Enable in the pop-up window.
  6. Choose Save.

Configure the relay state URL for QuickSight

To configure the relay state URL, complete the following steps (revise the input information as needed to match your environment’s configuration):

  1. Use the ADFS RelayState Generator to generate your URL.
  2. For IDP URL String, enter https://ADFSServerEndpoint/adfs/ls/idpinitiatedsignon.aspx.
  3. For Relying Party Identifier, enter urn:amazon:webservices or https://signin.aws.amazon.com/saml.
  4. For Relay State/Target App, enter the URL you want your authenticated users to access. In this case, it’s https://quicksight.aws.amazon.com.
  5. Choose Generate URL.
  6. Copy the URL and load it in your browser.

You should be presented with a login to your IdP landing page.


Make sure the user logging in has an email address attribute configured in Active Directory. A successful login should redirect you to the QuickSight dashboard after authentication. If you’re not redirected to the QuickSight dashboard page, make sure you ran the commands listed earlier after you configured your claim rules.

Summary

In this post, we demonstrated how to configure federated identities to a QuickSight dashboard and ensure that users can only sign in with preconfigured email address in your existing Active Directory.

We’d love to hear from you. Let us know what you think in the comments section.


About the Author

Adeleke Coker is a Global Solutions Architect with AWS. He helps customers globally accelerate workload deployments and migrations at scale to AWS. In his spare time, he enjoys learning, reading, gaming and watching sport events.

Maintaining Code Quality with Amazon CodeCatalyst Reports

Post Syndicated from Imtranur Rahman original https://aws.amazon.com/blogs/devops/maintaining-code-quality-with-amazon-codecatalyst-reports/

Amazon CodeCatalyst reports contain details about tests that occur during a workflow run. You can create tests such as unit tests, integration tests, configuration tests, and functional tests. You can use a test report to help troubleshoot a problem during a workflow.

Introduction

In prior posts in this series, I discussed reading The Unicorn Project, by Gene Kim, and how the main character, Maxine, struggles with a complicated Software Development Lifecycle (SDLC) after joining a new team. One of the challenges she encounters is the difficulties in shipping secure, functioning code without an automated testing mechanism. To quote Gene Kim, “Without automated testing, the more code we write, the more money it takes for us to test.”

Software Developers know that shipping vulnerable or non-functioning code to a production environment is to be avoided at all costs; the monetary impact is high and the toll it takes on team morale can be even greater. During the SDLC, developers need a way to easily identify and troubleshoot errors in their code.

In this post, I will focus on how developers can seamlessly run tests as a part of workflow actions as well as configure unit test and code coverage reports with Amazon CodeCatalyst. I will also outline how developers can access these reports to gain insights into their code quality.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

As with the previous posts in the CodeCatalyst series, I am going to use the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help you get started easily across different combinations of programming languages and architectures. To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the Three-tier blueprint.

Once the project is deployed, CodeCatalyst opens the project overview. This view shows the content of the README file from the project’s source repository, workflow runs, pull requests, etc. The source repository and workflow are created for me by the project blueprint. To view the source code, I select Code → Source Repositories from the left-hand navigation bar. Then, I select the repository name link from the list of source repositories.

Figure 1. List of source repositories including Mythical Mysfits source code.

From here I can view details such as the number of branches, workflows, commits, pull requests and source code of this repo. In this walkthrough, I’m focused on the testing capabilities of CodeCatalyst. The project already includes unit tests that were created by the blueprint so I will start there.

From the Files list, navigate to web → src → components → __tests__ → TheGrid.spec.js. This file contains the front-end unit tests, which simply check whether the strings “Good”, “Neutral”, “Evil” and “Lawful”, “Neutral”, “Chaotic” have rendered on the web page. Take a moment to examine the code. I will use these tests throughout the walkthrough.

Figure 2. Unit test for the front-end that checks strings have been rendered properly.

Next, I navigate to the  workflow that executes the unit tests. From the left-hand navigation bar, select CI/CD → Workflows. Then, find ApplicationDeploymentPipeline, expand Recent runs and select  Run-xxxxx . The Visual tab shows a graphical representation of the underlying YAML file that makes up this workflow. It also provides details on what started the workflow run, when it started,  how long it took to complete, the source repository and whether it succeeded.

Figure 3. The Deployment workflow open in the visual designer.

Workflows are composed of a source and one or more actions. I examined test reports for the back-end in a prior post, so I will focus on the front-end tests here. Select the build_and_test_frontend action to view logs on what the action ran, its configuration details, and the reports it generated. I’m specifically interested in the Unit Test and Code Coverage reports under the Reports tab:

Figure 4. Reports tab showing line and branch coverage.

Select the report unitTests.xml (you may need to scroll). Here, you can see an overview of this specific report with metrics like pass rate, duration, test suites, and the test cases for those suites:

Figure 5. Detailed report for the front-end tests.

This report has passed all checks. To make this report more interesting, I’ll intentionally edit the unit test to make it fail. First, navigate back to the source repository and open web → src → components → __tests__ → TheGrid.spec.js. This test case is looking for the string “Good”, so change it to say “Best” instead and commit the changes.

Figure 6. Front-End Unit Test Code Change.

This will automatically start a new workflow run. Navigating back to CI/CD → Workflows, you can see that a new workflow run is in progress (it takes roughly 7 minutes to complete).

Once complete, you can see that the build_and_test_frontend action failed. Opening the unitTests.xml report again, you can see that the report status is in a Failed state. Notice that the minimum pass rate for this test is 100%, meaning that if any test case in this unit test ever fails, the build fails completely.

There are ways to configure these minimums which will be explored when looking at Code Coverage reports. To see more details on the error message in this report, select the failed test case.

Figure 7. Failed Test Case Error Message.

As expected, this indicates that the test was looking for the string “Good” but instead, it found the string “Best”. Before continuing, I return to the TheGrid.spec.js file and change the string back to “Good”.

CodeCatalyst also allows me to specify code and branch coverage criteria. Coverage is a metric that can help you understand how much of your source was tested. This ensures source code is properly tested before shipping to a production environment. Coverage is not configured for the front-end, so I will examine the coverage of the back-end.

I select Reports on the left-hand navigation bar, and open the report called backend-coverage.xml. You can see details such as line coverage, number of lines covered, specific files that were scanned, etc.

Figure 8. Code Coverage Report Succeeded.

The Line coverage minimum is set to 70% but the current coverage is 80%, so it succeeds. I want to push the team to continue improving, so I will edit the workflow to raise the minimum threshold to 90%. Navigating back to CI/CD → Workflows → ApplicationDeploymentPipeline, select the Edit button. On the Visual tab, select build_backend. On the Outputs tab, scroll down to Success Criteria and change Line Coverage to 90%.

Figure 9. Configuring Code Coverage Success Criteria.

On the top-right, select Commit. This will push the changes to the repository and start a new workflow run. Once the run has finished, navigate back to the Code Coverage report. This time, you can see it reporting a failure to meet the minimum threshold for Line coverage.

Figure 10. Code Coverage Report Failed.

There are other success criteria options available to experiment with. To learn more about success criteria, see Configuring success criteria for tests.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that CDK deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Summary

In this post, I demonstrated how Amazon CodeCatalyst can help developers quickly configure test cases, run unit/code coverage tests, and generate reports using CodeCatalyst’s workflow actions. You can use these reports to adhere to your code testing strategy as a software development team. I also outlined how you can use success criteria to influence the outcome of a build in your workflow. In the next post, I will demonstrate how to configure CodeCatalyst workflows and integrate Software Composition Analysis (SCA) reports. Stay tuned!

About the authors:

Imtranur Rahman

Imtranur Rahman is an experienced Sr. Solutions Architect in the WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform. Imtranur specializes in containers, DevSecOps, GitOps, microservices-based applications, hybrid application solutions, and application modernization, and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Wasay Mabood

Wasay is a Partner Solutions Architect based out of New York. He works primarily with AWS Partners on migration, training, and compliance efforts but also dabbles in web development. When he’s not working with customers, he enjoys window-shopping, lounging around at home, and experimenting with new ideas.

A hybrid approach in healthcare data warehousing with Amazon Redshift

Post Syndicated from Bindhu Chinnadurai original https://aws.amazon.com/blogs/big-data/a-hybrid-approach-in-healthcare-data-warehousing-with-amazon-redshift/

Data warehouses play a vital role in healthcare decision-making and serve as a repository of historical data. A healthcare data warehouse can be a single source of truth for clinical quality control systems. Data warehouses are mostly built using the dimensional model approach, which has consistently met business needs.

Loading complex multi-point datasets into a dimensional model, identifying issues, and validating data integrity of the aggregated and merged data points are the biggest challenges that clinical quality management systems face. Additionally, scalability of the dimensional model is complex and poses a high risk of data integrity issues.

The data vault approach solves most of the problems associated with dimensional models, but it brings other challenges in clinical quality control applications and regulatory reports. Because data is closer to the source and stored in raw format, it has to be transformed before it can be used for reporting and other application purposes. This is one of the biggest hurdles with the data vault approach.

In this post, we discuss some of the main challenges enterprise data warehouses face when working with dimensional models and data vaults. We dive deep into a hybrid approach that aims to circumvent the issues posed by these two and also provide recommendations to take advantage of this approach for healthcare data warehouses using Amazon Redshift.

What is a dimensional data model?

Dimensional modeling is a strategy for storing data in a data warehouse using dimensions and facts. It optimizes the database for faster data retrieval. Dimensional models have a distinct structure and organize data to provide reports that increase performance.

In a dimensional model, a transaction record is divided into either facts (often numerical, additive transactional data) or dimensions (referential information that gives context to the facts). This categorization of data into facts and dimensions, as well as the entity-relationship framework of the dimensional model, presents complex business processes in a way that is easy for analysts to understand.

A dimensional model in data warehousing is designed for reading, summarizing, and analyzing numerical information such as patient vital stats, lab reading values, counts, and so on. Regardless of the division or use case it is related to, dimensional data models can be used to store data obtained from tracking various processes like patient encounters, provider practice metrics, aftercare surveys, and more.

The majority of healthcare clinical quality data warehouses are built on top of dimensional modeling techniques. The benefit of using dimensional data modeling is that, when data is stored in a data warehouse, it’s easier to persist and extract it.

Although it’s a competent data structure technique, there are challenges in scalability, source tracking, and troubleshooting with the dimensional modeling approach. Tracking and validating the source of aggregated and computed data points is important in clinical quality regulatory reporting systems. Any mistake in regulatory reports may result in a large penalty from regulatory and compliance agencies. These challenges exist because the data points are labeled using meaningless numeric surrogate keys, where any minor error can impair prediction accuracy and consequently affect the quality of judgments. These challenges can be countered by refactoring and bridging the dimensions, but that adds data noise over time and reduces accuracy.

Let’s look at an example of a typical dimensional data warehouse architecture in healthcare, as shown in the following logical model.

The following diagram illustrates a sample dimensional model entity-relationship diagram.

This data model contains dimensions and fact tables. You can use the following query to retrieve basic provider and patient relationship data from the dimensional model:

SELECT * FROM Fac_PatientEncounter FP
JOIN Dim_PatientEncounter DP ON FP.EncounterKey = DP.EncounterKey
JOIN Dim_Provider PR ON PR.ProviderKey = FP.ProviderKey

Challenges of dimensional modeling

Dimensional modeling requires data preprocessing before generating a star schema, which can involve a large amount of processing. Any change to the dimension definition results in a lengthy and time-consuming reprocessing of the dimension data, which often results in data redundancy.

Another issue is that, when relying merely on dimensional modeling, analysts can’t assure the consistency and accuracy of data sources. This is especially problematic in healthcare, where lineage, compliance, history, and traceability are of prime importance because of the regulations in place.

A data vault seeks to provide an enterprise data warehouse while solving the shortcomings of dimensional modeling approaches. It is a data modeling methodology designed for large-scale data warehouse platforms.

What is a data vault?

The data vault approach is a method and architectural framework for providing a business with data analytics services to support business intelligence, data warehousing, analytics, and data science needs. The data vault is built around business keys (hubs) defined by the business; these keys are not the same as the surrogate keys obtained from the source systems.

Amazon Redshift RA3 instances and Amazon Redshift Serverless are perfect choices for a data vault. And when combined with Amazon Redshift Spectrum, a data vault can deliver more value.

There are three layers to the data vault:

  • Staging
  • Data vault
  • Business vault

Staging involves the creation of a replica of the original data, which is primarily used to aid the process of transporting data from various sources to the data warehouse. There are no restrictions on this layer, and it is typically not persistent. It is 1:1 with the source systems, generally in the same format as that of the sources.

The data vault is based on business keys (hubs), which are defined by the business. All in-scope data is loaded, and auditability is maintained. At the heart of all data warehousing is integration, and this layer contains integrated data from multiple sources built around the enterprise-wide business keys. Although a data vault resembles a data lake, it provides more of the features of a data warehouse; in effect, it combines the functionality of both.

The business vault stores the outcome of business rules, including deduplication, conforming results, and even computations. When results are calculated for two or more data marts, this helps eliminate redundant computation and associated inconsistencies.

Because business vaults still don’t satisfy reporting needs, enterprises create a data mart after the business vault to satisfy dashboarding needs.

Data marts are ephemeral views that can be implemented directly on top of the business and raw vaults. This makes it easy to adapt over time and eliminates the danger of inconsistent results. If views don’t give the required level of performance, the results can be stored in a table. This is the presentation layer and is designed to be requirements-driven and scope-specific subsets of the warehouse data. Although dimensional modeling is commonly used to deliver this layer, marts can also be flat files, .xml files, or in other forms.

The following diagram shows the typical data vault model used in clinical quality repositories.

When the dimensional model as shown earlier is converted into a data vault using the same structure, it can be represented as follows.

Advantages of a data vault

Although any data warehouse should be built within the context of an overarching company strategy, data vaults permit incremental delivery. You can start small and gradually add more sources over time, just like Kimball’s dimensional design technique.

With a data vault, you don’t have to redesign the structure when adding new sources, unlike dimensional modeling. Business rules can be easily changed because raw and business-generated data is kept independent of each other in a data vault.

A data vault isolates technical data reorganization from business rules, thereby facilitating the separation of these potentially tricky processes. Similarly, data cleaning can be maintained separately from data import.

A data vault accommodates changes over time. Unlike a pure dimensional design, a data vault separates raw and business-generated data and accepts changes from both sources.

Data vaults make it easy to maintain data lineage because they include metadata identifying the source systems. In contrast to dimensional design, where data is cleansed before loading, data vault updates are always gradual, and results are never lost, providing an automatic audit trail.

When raw data is stored in a data vault, historical attributes that weren’t initially available can be added to the presentation area. Data marts can be implemented as views by adding a new column to an existing view.

In data vault 2.0, hash keys eliminate data load dependencies, which allows near-real-time data loading, as well as concurrent data loads of terabytes to petabytes. The process of mastering both entity-relationship modeling and dimensional design takes time and practice, but the process of automating a data vault is easier.
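
To make the hash key idea concrete, the following is a minimal Python sketch of deriving a deterministic data vault 2.0 style hub hash key from a business key; the example values, the delimiter, and the choice of MD5 are illustrative assumptions rather than part of the model shown earlier.

import hashlib

def hub_hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic data vault 2.0 style hash key from one or more business key parts."""
    # Normalize parts (trim and uppercase) and join with a delimiter so that
    # ("a", "bc") and ("ab", "c") produce different keys.
    normalized = "||".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always yields the same hub key, so hub, link, and satellite
# loads from different sources can run independently and in parallel.
print(hub_hash_key("NPI-1234567890"))
print(hub_hash_key(" npi-1234567890 "))  # normalizes to the same key

Because the key is computed from the business key itself rather than looked up from a sequence, loads don’t depend on each other’s order, which is what removes the load dependencies mentioned above.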

Challenges of a data vault

A data vault is not a one-size-fits-all solution for data warehouses, and it does have a few limitations.

To begin with, when directly feeding the data vault model into a report on one subject area, you need to combine multiple types of data. Because reporting technologies are generally not capable of performing such data processing, this integration can reduce report performance and increase the risk of errors. However, data vault models can improve report performance by incorporating dimensional models or adding additional reporting layers, and for data models that can be directly reported, a dimensional model can be developed.

Additionally, if the data is static or comes from a single source, the efficacy of a data vault is reduced. Such scenarios often negate many of the benefits of data vaults and require more business logic that could otherwise be avoided.

The storage requirement for a data vault is also significantly higher. Keeping three separate tables for the same subject area effectively triples the number of tables, and because data vault loads are insert only, storage grows steadily over time. If the data is basic, you can achieve the benefits listed here with a simpler dimensional model rather than deploying a data vault.

The following sample query retrieves provider and patient data from a data vault using the sample model we discussed in this section:

SELECT * FROM Lnk_PatientEncounter LP
JOIN Hub_Provider HP ON LP.ProviderKey = HP.ProviderKey
JOIN Dim_Sat_Provider DSP ON HP.ProviderKey = DSP.ProviderKey AND _Current=1
JOIN Hub_Patient Pt ON Pt.PatientEncounterKey = LP.PatientEncounterKey
JOIN Dim_Sat_PatientEncounter DPt ON DPt.PatientEncounterKey = Pt.PatientEncounterKey AND _Current=1

The query involves many joins, which increases the depth and time for the query run, as illustrated in the following chart.

The following table shows that SQL depth and runtime are proportional, where depth is the number of joins. As the number of joins increases, the runtime increases, and therefore so does the cost.

SQL Depth | Runtime in Seconds | Cost per Query in Seconds
14 | 80 | 40,000
12 | 60 | 30,000
5 | 30 | 15,000
3 | 25 | 12,500

The hybrid model addresses major issues raised by the data vault and dimensional model approaches that we’ve discussed in this post, while also allowing improvements in data collection, including IoT data streaming.

What is a hybrid model?

The hybrid model combines the data vault and a portion of the star schema to provide the advantages of both the data vault and dimensional model, and is mainly intended for logical enterprise data warehouses.

The hybrid approach is designed from the bottom up to be gradual and modular, and it can be used for big data, structured, and unstructured datasets. The primary data contains the business rules and enterprise-level data standards, as well as additional metadata needed to transform, validate, and enrich data for dimensional approaches. In this model, data processes from left to right provide data vault advantages, and data processes from right to left provide dimensional model advantages. Here, the data vault satellite tables serve as both satellite tables and dimensional tables.

After combining the dimensional and the data vault models, the hybrid model can be viewed as follows.

The following is an example entity-relation diagram of the hybrid model, which consists of a fact table from the dimensional model and all other entities from the data vault. The satellite entity from the data vault plays a dual role: when it’s connected to a data vault, it acts as a sat table, and when connected to a fact table, it acts as a dimension table. To serve this dual purpose, sat tables have two keys: a foreign key to connect with the data vault, and a primary key to connect with the fact table.

The following diagram illustrates the physical hybrid data model.

The following diagram illustrates a typical hybrid data warehouse architecture.

The following query retrieves provider and patient data from the hybrid model:

SELECT * FROM Fac_PatientEncounter FP
JOIN Dim_Sat_Provider DSP ON FP.DimProviderID = DSP.DimProviderID
JOIN Dim_Sat_PatientEncounter DPt ON DPt.DimPatientEncounterID = FP.DimPatientEncounterID

By using the hybrid model, the number of tables joined is reduced from five to three.

Advantages of using the hybrid model

With this model, structural information is segregated from descriptive information to promote flexibility and avoid re-engineering in the event of a change. It maintains data integrity, allowing organizations to avoid hefty fines when data integrity is compromised.

The hybrid paradigm enables non-data professionals to interact with raw data by allowing users to update or create metadata and data enrichment rules. The hybrid approach simplifies the process of gathering and evaluating datasets for business applications. It enables concurrent data loading and eliminates the need for a corporate vault.

The hybrid model also benefits from the fact that there are no dependencies between objects in the data storage, which significantly improves the scalability of hybrid data warehousing.

You can build the hybrid model on AWS and take advantage of the benefits of Amazon Redshift, which is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, simple, and secure analytics at scale. Amazon Redshift continuously adds features that make it faster, more elastic, and easier to use:

  • Amazon Redshift data sharing enhances the hybrid model by eliminating the need for copying data across departments. It also simplifies the work of keeping the single source of truth, saving memory and limiting redundancy. It enables instant, granular, and fast data access across Amazon Redshift clusters without the need to copy or move it. Data sharing provides live access to data so that users always see the most up-to-date and consistent information as it’s updated in the data warehouse.
  • Redshift Spectrum enables you to query open format data directly in the Amazon Simple Storage Service (Amazon S3) data lake without having to load the data or duplicate your infrastructure, and it integrates well with the data lake.
  • With Amazon Redshift concurrency scaling, you can get consistently fast performance for thousands of concurrent queries and users. It instantly adds capacity to support additional users and removes it when the load subsides, with nothing to manage at your end.
  • To realize the benefits of using a hybrid model on AWS, you can get started today without needing to provision and manage data warehouse clusters using Redshift Serverless. All the related services that Amazon Redshift integrates with (such as Amazon Kinesis, AWS Lambda, Amazon QuickSight, Amazon SageMaker, Amazon EMR, AWS Lake Formation, and AWS Glue) are available to work with Redshift Serverless.

Conclusion

With the hybrid model, data can be transformed and loaded into a target data model efficiently and transparently. With this approach, data partners can research data networks more efficiently and promote comparative effectiveness. And with the several newly introduced features of Amazon Redshift, a lot of heavy lifting is done by AWS to handle your workload demands, and you only pay for what you use.

You can get started with the following steps:

  1. Create an Amazon Redshift RA3 instance for your primary clinical data repository and data marts.
  2. Build a data vault schema for the raw vault and create materialized views for the business vault.
  3. Enable Amazon Redshift data shares to share data between the producer cluster and consumer cluster.
  4. Load the structured and unstructured data into the producer cluster data vault for business use.

About the Authors

Bindhu Chinnadurai is a Senior Partner Solutions Architect at AWS based out of London, United Kingdom. She has spent 18+ years working with large-scale enterprise environments. Currently, she engages with AWS partners to help customers migrate their workloads to AWS, with a focus on scalability, resiliency, performance, and sustainability. Her expertise is DevSecOps.

Sarathi Balakrishnan was the Global Partner Solutions Architect, specializing in Data, Analytics and AI/ML at AWS. He worked closely with AWS partners globally to build solutions and platforms on AWS to accelerate customers’ business outcomes with state-of-the-art cloud technologies and achieve more in their cloud explorations. He helped with solution architecture, technical guidance, and best practices to build cloud-native solutions. He joined AWS with over 20 years of large enterprise experience in the agriculture, insurance, healthcare and life sciences, and marketing and advertising industries to develop and implement data and AI strategies.

Automate deployment of an Amazon QuickSight analysis connecting to an Amazon Redshift data warehouse with an AWS CloudFormation template

Post Syndicated from Sandeep Bajwa original https://aws.amazon.com/blogs/big-data/automate-deployment-of-an-amazon-quicksight-analysis-connecting-to-an-amazon-redshift-data-warehouse-with-an-aws-cloudformation-template/

Amazon Redshift is the most widely used data warehouse in the cloud, best suited for analyzing exabytes of data and running complex analytical queries. Amazon QuickSight is a fast business analytics service to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. QuickSight provides easy integration with Amazon Redshift, providing native access to all your data and enabling organizations to scale their business analytics capabilities to hundreds of thousands of users. QuickSight delivers fast and responsive query performance by using a robust in-memory engine (SPICE).

As a QuickSight administrator, you can use AWS CloudFormation templates to migrate assets between distinct environments from development, to test, to production. AWS CloudFormation helps you model and set up your AWS resources so you can spend less time managing those resources and more time focusing on your applications that run in AWS. You no longer need to create data sources or analyses manually. You create a template that describes all the AWS resources that you want, and AWS CloudFormation takes care of provisioning and configuring those resources for you. In addition, with versioning, you have your previous assets, which provides the flexibility to roll back deployments if the need arises. For more details, refer to Amazon QuickSight resource type reference.

In this post, we show how to automate the deployment of a QuickSight analysis connecting to an Amazon Redshift data warehouse with a CloudFormation template.

Solution overview

Our solution consists of the following steps:

  1. Create a QuickSight analysis using an Amazon Redshift data source.
  2. Create a QuickSight template for your analysis.
  3. Create a CloudFormation template for your analysis using the AWS Command Line Interface (AWS CLI).
  4. Use the generated CloudFormation template to deploy a QuickSight analysis to a target environment.

The following diagram shows the architecture of how you can have multiple AWS accounts, each with its own QuickSight environment connected to its own Amazon Redshift data source. In this post, we outline the steps involved in migrating QuickSight assets in the dev account to the prod account. For this post, we use Amazon Redshift as the data source and create a QuickSight visualization using the Amazon Redshift sample TICKIT database.

The following diagram illustrates flow of the high-level steps.

Prerequisites

Before setting up the CloudFormation stacks, you must have an AWS account and an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and the services listed in the architecture.

The migration requires the following prerequisites:

Create a QuickSight analysis in your dev environment

In this section, we walk through the steps to set up your QuickSight analysis using an Amazon Redshift data source.

Create an Amazon Redshift data source

To connect to your Amazon Redshift data warehouse, you need to create a data source in QuickSight. As shown in the following screenshot, you have two options:

  • Auto-discovered
  • Manual connect

QuickSight auto-discovers Amazon Redshift clusters that are associated with your AWS account. These resources must be located in the same Region as your QuickSight account.

For more details, refer to Authorizing connections from Amazon QuickSight to Amazon Redshift clusters.

You can also manually connect and create a data source.

Create an Amazon Redshift dataset

The next step is to create a QuickSight dataset, which identifies the specific data in a data source you want to use.

For this post, we use the TICKIT database created in an Amazon Redshift data warehouse, which consists of seven tables: two fact tables and five dimensions, as shown in the following figure.

This sample database application helps analysts track sales activity for the fictional TICKIT website, where users buy and sell tickets online for sporting events, shows, and concerts.

  1. On the Datasets page, choose New dataset.
  2. Choose the data source you created in the previous step.
  3. Choose Use custom SQL.
  4. Enter the custom SQL as shown in the following screenshot.

The following screenshot shows our completed data source.

Create a QuickSight analysis

The next step is to create an analysis that utilizes this dataset. In QuickSight, you analyze and visualize your data in analyses. When you’re finished, you can publish your analysis as a dashboard to share with others in your organization.

  1. On the All analyses tab of the QuickSight start page, choose New analysis.

The Datasets page opens.

  2. Choose a dataset, then choose Use in analysis.

  3. Create a visual. For more information about creating visuals, see Adding visuals to Amazon QuickSight analyses.

Create a QuickSight template from your analysis

A QuickSight template is a named object in your AWS account that contains the definition of your analysis and references to the datasets used. You can create a template using the QuickSight API by providing the details of the source analysis via a parameter file. You can use templates to easily create a new analysis.

You can use AWS Cloud9 from the console to run AWS CLI commands.

The following AWS CLI command demonstrates how to create a QuickSight template based on the sales analysis you created (provide your AWS account ID for your dev account):

aws quicksight create-template --aws-account-id <DEVACCOUNT> --template-id QS-RS-SalesAnalysis-Temp --cli-input-json file://parameters.json

The parameters.json file contains the following details (provide your source QuickSight user ARN, analysis ARN, and dataset ARN):

{
    "Name": "QS-RS-SalesAnalysis-Temp",
    "Permissions": [
        {"Principal": "<QS-USER-ARN>", 
          "Actions": [ "quicksight:CreateTemplate",
                       "quicksight:DescribeTemplate",                   
                       "quicksight:DescribeTemplatePermissions",
                       "quicksight:UpdateTemplate"         
            ] } ] ,
     "SourceEntity": {
       "SourceAnalysis": {
         "Arn": "<QS-ANALYSIS-ARN>",
         "DataSetReferences": [
           {
             "DataSetPlaceholder": "sales",
             "DataSetArn": "<QS-DATASET-ARN>"
           }
         ]
       }
     },
     "VersionDescription": "1"
    }

You can use the AWS CLI describe-user, describe-analysis, and describe-data-set commands to get the required ARNs.
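
If you prefer the AWS SDK to the AWS CLI, the following boto3 sketch gathers the same ARNs; the account ID, Region, user name, analysis ID, and dataset ID are placeholders you would replace with your own values.

import boto3

ACCOUNT_ID = "111122223333"  # placeholder dev account ID
quicksight = boto3.client("quicksight", region_name="us-east-1")

# QuickSight user ARN (namespace is "default" unless you created custom namespaces)
user_arn = quicksight.describe_user(
    AwsAccountId=ACCOUNT_ID, Namespace="default", UserName="<QS-USER-NAME>"
)["User"]["Arn"]

# Analysis and dataset ARNs for the SourceEntity section of parameters.json
analysis_arn = quicksight.describe_analysis(
    AwsAccountId=ACCOUNT_ID, AnalysisId="<YOUR-ANALYSIS-ID>"
)["Analysis"]["Arn"]
dataset_arn = quicksight.describe_data_set(
    AwsAccountId=ACCOUNT_ID, DataSetId="<YOUR-DATASET-ID>"
)["DataSet"]["Arn"]

print(user_arn, analysis_arn, dataset_arn, sep="\n")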

To upload the updated parameters.json file to AWS Cloud9, choose File from the toolbar and choose Upload local file.

The QuickSight template is created in the background. QuickSight templates aren’t visible within the QuickSight UI; they’re a developer-managed or admin-managed asset that is only accessible via the AWS CLI or APIs.

To check the status of the template, run the describe-template command:

aws quicksight describe-template --aws-account-id <DEVACCOUNT> --template-id "QS-RS-SalesAnalysis-Temp"

The following code shows command output:

Copy the template ARN; we need it later to create a template in the production account.

The QuickSight template permissions in the dev account need to be updated to give access to the prod account. Run the following command to update the QuickSight template. This provides the describe privilege to the target account to extract details of the template from the source account:

aws quicksight update-template-permissions --aws-account-id <DEVACCOUNT> --template-id "QS-RS-SalesAnalysis-Temp" --grant-permissions file://TemplatePermission.json

The file TemplatePermission.json contains the following details (provide your target AWS account ID):

[
  {
    "Principal": "arn:aws:iam::<TARGET ACCOUNT>",
    "Actions": [
      "quicksight:UpdateTemplatePermissions",
      "quicksight:DescribeTemplate"
    ]
  }
]

To upload the updated TemplatePermission.json file to AWS Cloud9, choose the File menu from the tool bar and choose Upload local file.

Create a CloudFormation template

In this section, we create a CloudFormation template containing our QuickSight assets. In this example, we use a YAML formatted template saved on our local machine. We update the following different sections of the template:

  • AWS::QuickSight::DataSource
  • AWS::QuickSight::DataSet
  • AWS::QuickSight::Template
  • AWS::QuickSight::Analysis

Some of the information required to complete the CloudFormation template can be gathered from the source QuickSight account via the describe AWS CLI commands, and some information needs to be updated for the target account.

Create an Amazon Redshift data source in AWS CloudFormation

In this step, we add the AWS::QuickSight::DataSource section of the CloudFormation template.

Gather the following information on the Amazon Redshift cluster in the target AWS account (production environment):

  • VPC connection ARN
  • Host
  • Port
  • Database
  • User
  • Password
  • Cluster ID

You have the option to create a custom DataSourceID. This ID is unique per Region for each AWS account.

Add the following information to the template:

Resources:
  RedshiftBuildQSDataSource:
    Type: 'AWS::QuickSight::DataSource'
    Properties:  
      DataSourceId: "RS-Sales-DW"      
      AwsAccountId: !Sub '${AWS::AccountId}'
      VpcConnectionProperties:
        VpcConnectionArn: <VPC-CONNECTION-ARN>      
      Type: REDSHIFT   
      DataSourceParameters:
        RedshiftParameters:     
          Host: "<HOST>"
          Port: <PORT>
          ClusterId: "<CLUSTER ID>"
          Database: "<DATABASE>"    
      Name: "RS-Sales-DW"
      Credentials:
        CredentialPair:
          Username: <USER>
          Password: <PASSWORD>
      Permissions:

Create an Amazon Redshift dataset in AWS CloudFormation

In this step, we add the AWS::QuickSight::DataSet section in the CloudFormation template to match the dataset definition from the source account.

Gather the dataset details and run the list-data-sets command to get all datasets from the source account (provide your source dev account ID):

aws quicksight list-data-sets --aws-account-id <DEVACCOUNT>

The following code is the output:

Run the describe-data-set command, specifying the dataset ID from the previous command’s response:

aws quicksight describe-data-set --aws-account-id <DEVACCOUNT> --data-set-id "<YOUR-DATASET-ID>"

The following code shows partial output:

Based on the dataset description, add the AWS::Quicksight::DataSet resource in the CloudFormation template, as shown in the following code. Note that you can also create a custom DataSetID. This ID is unique per Region for each AWS account.

QSRSBuildQSDataSet:
    Type: 'AWS::QuickSight::DataSet'
    Properties:
      DataSetId: "RS-Sales-DW" 
      Name: "sales" 
      AwsAccountId: !Sub '${AWS::AccountId}'
      PhysicalTableMap:
        PhysicalTable1:          
          CustomSql:
            SqlQuery: "select sellerid, username, (firstname ||' '|| lastname) as name,city, sum(qtysold) as sales
              from sales, date, users
              where sales.sellerid = users.userid and sales.dateid = date.dateid and year = 2008
              group by sellerid, username, name, city
              order by 5 desc
              limit 10"
            DataSourceArn: !GetAtt RedshiftBuildQSDataSource.Arn
            Name"RS-Sales-DW"
            Columns:
            - Type: INTEGER
              Name: sellerid
            - Type: STRING
              Name: username
            - Type: STRING
              Name: name
            - Type: STRING
              Name: city
            - Type: DECIMAL
              Name: sales                                     
      LogicalTableMap:
        LogicalTable1:
          Alias: sales
          Source:
            PhysicalTableId: PhysicalTable1
          DataTransforms:
          - CastColumnTypeOperation:
              ColumnName: sales
              NewColumnType: DECIMAL
      Permissions:
        - Principal: !Join 
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:UpdateDataSetPermissions'
            - 'quicksight:DescribeDataSet'
            - 'quicksight:DescribeDataSetPermissions'
            - 'quicksight:PassDataSet'
            - 'quicksight:DescribeIngestion'
            - 'quicksight:ListIngestions'
            - 'quicksight:UpdateDataSet'
            - 'quicksight:DeleteDataSet'
            - 'quicksight:CreateIngestion'
            - 'quicksight:CancelIngestion'
      ImportMode: DIRECT_QUERY

You can specify ImportMode to choose between DIRECT_QUERY and SPICE.

Create a QuickSight template in AWS CloudFormation

In this step, we add the AWS::QuickSight::Template section in the CloudFormation template, representing the analysis template.

Use the source template ARN you created earlier and add the AWS::Quicksight::Template resource in the CloudFormation template:

QSTCFBuildQSTemplate:
    Type: 'AWS::QuickSight::Template'
    Properties:
      TemplateId: "QS-RS-SalesAnalysis-Temp"
      Name: "QS-RS-SalesAnalysis-Temp"
      AwsAccountId: !Sub '${AWS::AccountId}'
      SourceEntity:
        SourceTemplate:
          Arn: '<SOURCE-TEMPLATE-ARN>'          
      Permissions:
        - Principal: !Join 
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:DescribeTemplate'
      VersionDescription: Initial version - Copied over from AWS account.

Create a QuickSight analysis

In this last step, we add the AWS::QuickSight::Analysis section in the CloudFormation template. The analysis is linked to the template created in the target account.

Add the AWS::Quicksight::Analysis resource in the CloudFormation template as shown in the following code:

QSRSBuildQSAnalysis:
    Type: 'AWS::QuickSight::Analysis'
    Properties:
      AnalysisId: 'Sales-Analysis'
      Name: 'Sales-Analysis'
      AwsAccountId: !Sub '${AWS::AccountId}'
      SourceEntity:
        SourceTemplate:
          Arn: !GetAtt  QSTCFBuildQSTemplate.Arn
          DataSetReferences:
            - DataSetPlaceholder: 'sales'
              DataSetArn: !GetAtt QSRSBuildQSDataSet.Arn
      Permissions:
        - Principal: !Join 
            - ''
            - - 'arn:aws:quicksight:'
              - !Ref QuickSightIdentityRegion
              - ':'
              - !Ref 'AWS::AccountId'
              - ':user/default/'
              - !Ref QuickSightUser
          Actions:
            - 'quicksight:RestoreAnalysis'
            - 'quicksight:UpdateAnalysisPermissions'
            - 'quicksight:DeleteAnalysis'
            - 'quicksight:DescribeAnalysisPermissions'
            - 'quicksight:QueryAnalysis'
            - 'quicksight:DescribeAnalysis'
            - 'quicksight:UpdateAnalysis'      

Deploy the CloudFormation template in the production account

To create a new CloudFormation stack that uses the preceding template via the AWS CloudFormation console, complete the following steps:

  1. On the AWS CloudFormation console, choose Create Stack.
  2. On the drop-down menu, choose With new resources (standard).
  3. For Prepare template, select Template is ready.
  4. For Specify template, choose Upload a template file.
  5. Save the provided CloudFormation template in a .yaml file and upload it.
  6. Choose Next.
  7. Enter a name for the stack. For this post, we use QS-RS-CF-Stack.
  8. Choose Next.
  9. Choose Next again.
  10. Choose Create Stack.

The status of the stack changes to CREATE_IN_PROGRESS, then to CREATE_COMPLETE.
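
If you prefer to script this step instead of using the console, a boto3 sketch along the following lines creates the same stack; the template file name and the parameter keys (assumed from the !Ref QuickSightUser and !Ref QuickSightIdentityRegion references in the template above) are placeholders you would adjust to match your template.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("qs-rs-analysis.yaml") as f:  # the .yaml file saved earlier (placeholder name)
    template_body = f.read()

cfn.create_stack(
    StackName="QS-RS-CF-Stack",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "QuickSightUser", "ParameterValue": "<QS-USER-NAME>"},
        {"ParameterKey": "QuickSightIdentityRegion", "ParameterValue": "us-east-1"},
    ],
)

# Block until the stack reaches CREATE_COMPLETE (raises if creation fails)
cfn.get_waiter("stack_create_complete").wait(StackName="QS-RS-CF-Stack")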

Verify the QuickSight objects in the following table have been created in the production environment.

QuickSight Object Type | Object Name (Dev) | Object Name (Prod)
Data Source | RS-Sales-DW | RS-Sales-DW
Dataset | Sales | Sales
Template | QS-RS-SalesAnalysis-Temp | QS-RS-SalesAnalysis-Temp
Analysis | Sales Analysis | Sales-Analysis

The following example shows that Sales Analysis was created in the target account.

Conclusion

This post demonstrated an approach to migrate a QuickSight analysis with an Amazon Redshift data source from one QuickSight account to another with a CloudFormation template.

For more information about automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation, check out the videos Virtual Admin Workshop: Working with Amazon QuickSight APIs and Admin Level-Up Virtual Workshop, V2 on YouTube.


About the author

Sandeep Bajwa is a Sr. Analytics Specialist based out of Northern Virginia, specializing in the design and implementation of analytics and data lake solutions.

Amazon EMR Serverless supports larger worker sizes to run more compute and memory-intensive workloads

Post Syndicated from Veena Vasudevan original https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-supports-larger-worker-sizes-to-run-more-compute-and-memory-intensive-workloads/

Amazon EMR Serverless allows you to run open-source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. With EMR Serverless, you can run analytics workloads at any scale with automatic scaling that resizes resources in seconds to meet changing data volumes and processing requirements. EMR Serverless automatically scales resources up and down to provide just the right amount of capacity for your application.

We are excited to announce that EMR Serverless now offers worker configurations of 8 vCPUs with up to 60 GB memory and 16 vCPUs with up to 120 GB memory, allowing you to run more compute and memory-intensive workloads on EMR Serverless. An EMR Serverless application internally uses workers to execute workloads, and you can configure different worker configurations based on your workload requirements. Previously, the largest worker configuration available on EMR Serverless was 4 vCPUs with up to 30 GB memory. This capability is especially beneficial for the following common scenarios:

  • Shuffle-heavy workloads
  • Memory-intensive workloads

Let’s look at each of these use cases and the benefits of having larger worker sizes.

Benefits of using large workers for shuffle-intensive workloads

In Spark and Hive, shuffle occurs when data needs to be redistributed across the cluster during a computation. When your application performs wide transformations or reduce operations such as join, groupBy, sortBy, or repartition, Spark and Hive trigger a shuffle. Also, every Spark stage and Tez vertex is bounded by a shuffle operation. Taking Spark as an example, by default there are 200 partitions for every Spark job, defined by spark.sql.shuffle.partitions. However, Spark computes the number of tasks on the fly based on the data size and the operation being performed. When a wide transformation is performed on top of a large dataset, there could be GBs or even TBs of data that need to be fetched by all the tasks.
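
As a small illustration, the following PySpark sketch performs a wide transformation that triggers a shuffle and overrides the default number of shuffle partitions; the S3 paths and column names are placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("shuffle-example")
         .config("spark.sql.shuffle.partitions", "400")  # override the default of 200
         .getOrCreate())

sales = spark.read.parquet("s3://my-bucket/raw/sales/")  # placeholder input

# groupBy is a wide transformation: all rows with the same seller_id must be brought
# together, so Spark redistributes (shuffles) the data across executors here.
top_sellers = (sales.groupBy("seller_id")
               .agg(F.sum("qty_sold").alias("total_sold"))
               .orderBy(F.desc("total_sold")))

top_sellers.write.mode("overwrite").parquet("s3://my-bucket/stage/top_sellers/")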

Shuffles are typically expensive in terms of both time and resources, and can lead to performance bottlenecks. Therefore, optimizing shuffles can have a significant impact on the performance and cost of a Spark job. With large workers, more data can be allocated to each executor’s memory, which minimizes the data shuffled across executors. This in turn leads to increased shuffle read performance because more data will be fetched locally from the same worker and less data will be fetched remotely from other workers.

Experiments

To demonstrate the benefits of using large workers for shuffle-intensive queries, let’s use q78 from TPC-DS, which is a shuffle-heavy Spark query that shuffles 167 GB of data over 12 Spark stages. Let’s perform two iterations of the same query with different configurations.

The configurations for Test 1 are as follows:

  • Size of executor requested while creating EMR Serverless application = 4 vCPUs, 8 GB memory, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 4
    • spark.executor.memory = 8
    • spark.executor.instances = 48
    • Parallelism = 192 (spark.executor.instances * spark.executor.cores)

The configurations for Test 2 are as follows:

  • Size of executor requested while creating EMR Serverless application = 8 vCPUs, 16 GB memory, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 8
    • spark.executor.memory = 16
    • spark.executor.instances = 24
    • Parallelism = 192 (spark.executor.instances * spark.executor.cores)

Let’s also disable dynamic allocation by setting spark.dynamicAllocation.enabled to false for both tests to avoid any potential noise due to variable executor launch times and keep the resource utilization consistent for both tests. We use Spark Measure, which is an open-source tool that simplifies the collection and analysis of Spark performance metrics. Because we’re using a fixed number of executors, the total number of vCPUs and memory requested are the same for both the tests. The following table summarizes the observations from the metrics collected with Spark Measure.

Test | Total Time Taken for Query (ms) | shuffleLocalBlocksFetched | shuffleRemoteBlocksFetched | shuffleLocalBytesRead | shuffleRemoteBytesRead | shuffleFetchWaitTime | shuffleWriteTime
Test 1 | 153244 | 114175 | 5291825 | 3.5 GB | 163.1 GB | 1.9 hr | 4.7 min
Test 2 | 108136 | 225448 | 5185552 | 6.9 GB | 159.7 GB | 3.2 min | 5.2 min

As seen from the table, there is a significant difference in performance due to shuffle improvements. Test 2, with half the number of executors that are twice as large as Test 1, ran 29.44% faster, with 1.97 times more shuffle data fetched locally compared to Test 1 for the same query, same parallelism, and same aggregate vCPU and memory resources. Therefore, you can benefit from improved performance without compromising on cost or job parallelism with the help of large executors. We have observed similar performance benefits for other shuffle-intensive TPC-DS queries such as q23a and q23b.
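
For reference, a run with the Test 2 settings could be submitted to an EMR Serverless application (already created with the 8 vCPU, 16 GB worker size) using a boto3 sketch like the following; the application ID, role ARN, and script location are placeholders.

import boto3

emr_serverless = boto3.client("emr-serverless", region_name="us-east-1")

emr_serverless.start_job_run(
    applicationId="<APPLICATION-ID>",  # application created with 8 vCPU / 16 GB workers
    executionRoleArn="arn:aws:iam::111122223333:role/<EMR-SERVERLESS-JOB-ROLE>",
    name="tpcds-q78-large-workers",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/tpcds_q78.py",  # placeholder script
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=8 "
                "--conf spark.executor.memory=16g "
                "--conf spark.executor.instances=24 "
                "--conf spark.dynamicAllocation.enabled=false"
            ),
        }
    },
)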

Recommendations

To determine if the large workers will benefit your shuffle-intensive Spark applications, consider the following:

  • Check the Stages tab from the Spark History Server UI of your EMR Serverless application. For example, from the following screenshot of Spark History Server, we can determine that this Spark job wrote and read 167 GB of shuffle data aggregated across 12 stages, looking at the Shuffle Read and Shuffle Write columns. If your jobs shuffle over 50 GB of data, you may potentially benefit from using larger workers with 8 or 16 vCPUs or spark.executor.cores.

  • Check the SQL / DataFrame tab from the Spark History Server UI of your EMR Serverless application (only for Dataframe and Dataset APIs). When you choose the Spark action performed, such as collect, take, showString, or save, you will see an aggregated DAG for all stages separated by the exchanges. Every exchange in the DAG corresponds to a shuffle operation, and it will contain the local and remote bytes and blocks shuffled, as seen in the following screenshot. If the local shuffle blocks or bytes fetched is much less compared to the remote blocks or bytes fetched, you can rerun your application with larger workers (with 8 or 16 vCPUs or spark.executor.cores) and review these exchange metrics in a DAG to see if there is any improvement.

  • Use the Spark Measure tool with your Spark query to obtain the shuffle metrics in the Spark driver’s stdout logs, as shown in the following log for a Spark job. Review the time taken for shuffle reads (shuffleFetchWaitTime) and shuffle writes (shuffleWriteTime), and the ratio of the local bytes fetched to the remote bytes fetched. If the shuffle operation takes more than 2 minutes, rerun your application with larger workers (with 8 or 16 vCPUs or spark.executor.cores) with Spark Measure to track the improvement in shuffle performance and the overall job runtime.
Time taken: 177647 ms

Scheduling mode = FIFO
Spark Context default degree of parallelism = 192

Aggregated Spark stage metrics:
numStages => 22
numTasks => 10156
elapsedTime => 159894 (2.7 min)
stageDuration => 456893 (7.6 min)
executorRunTime => 28418517 (7.9 h)
executorCpuTime => 20276736 (5.6 h)
executorDeserializeTime => 326486 (5.4 min)
executorDeserializeCpuTime => 124323 (2.1 min)
resultSerializationTime => 534 (0.5 s)
jvmGCTime => 648809 (11 min)
shuffleFetchWaitTime => 340880 (5.7 min)
shuffleWriteTime => 245918 (4.1 min)
resultSize => 23199434 (22.1 MB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 1794288453176
recordsRead => 18696929278
bytesRead => 77354154397 (72.0 GB)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 14124240761
shuffleTotalBlocksFetched => 5571316
shuffleLocalBlocksFetched => 117321
shuffleRemoteBlocksFetched => 5453995
shuffleTotalBytesRead => 158582120627 (147.7 GB)
shuffleLocalBytesRead => 3337930126 (3.1 GB)
shuffleRemoteBytesRead => 155244190501 (144.6 GB)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 156913371886 (146.1 GB)
shuffleRecordsWritten => 13867102620
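
A log like the preceding one can be produced with a short PySpark sketch such as the following, assuming the spark-measure library and its Python wrapper (sparkmeasure) are attached to your EMR Serverless application; the query itself is just a placeholder for your shuffle-heavy workload.

from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

spark = SparkSession.builder.appName("query-with-sparkmeasure").getOrCreate()
stage_metrics = StageMetrics(spark)

stage_metrics.begin()
# Placeholder for the shuffle-heavy query under test (for example, TPC-DS q78)
spark.sql("""
    SELECT ss_customer_sk, SUM(ss_net_paid) AS total_paid
    FROM store_sales
    GROUP BY ss_customer_sk
""").collect()
stage_metrics.end()

# Prints the aggregated stage metrics (shuffleFetchWaitTime, shuffleLocalBytesRead,
# shuffleRemoteBytesRead, and so on) to the driver's stdout logs
stage_metrics.print_report()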

Benefits of using large workers for memory-intensive workloads

Certain types of workloads are memory-intensive and may benefit from more memory configured per worker. In this section, we discuss common scenarios where large workers could be beneficial for running memory-intensive workloads.

Data skew

Data skews commonly occur in several types of datasets. Some common examples are fraud detection, population analysis, and income distribution. For example, when you want to detect anomalies in your data, it’s expected that less than 1% of the data is abnormal. If you want to perform some aggregation on top of normal vs. abnormal records, 99% of the data will be processed by a single worker, which may lead to that worker running out of memory. Data skews may be observed for memory-intensive transformations like groupBy, orderBy, join, window functions, collect_list, collect_set, and so on. Join types such as BroadcastNestedLoopJoin and Cartesian product are also inherently memory-intensive and susceptible to data skews. Similarly, if your input data is Gzip compressed, a single Gzip file can’t be read by more than one task because the Gzip compression type is unsplittable. When there are a few very large Gzip files in the input, your job may run out of memory because a single task may have to read a huge Gzip file that doesn’t fit in the executor memory.

Failures due to data skew can be mitigated by applying strategies such as salting. However, this often requires extensive changes to the code, which may not be feasible for a production workload that failed due to an unprecedented data skew caused by a sudden surge in incoming data volume. For a simpler workaround, you may just want to increase the worker memory. Using larger workers with more spark.executor.memory allows you to handle data skew without making any changes to your application code.
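
If you do decide to address the skew in code, a typical salting approach looks like the following minimal PySpark sketch; the events DataFrame, key, and value column names are placeholders.

from pyspark.sql import functions as F

NUM_SALTS = 16  # roughly how many tasks the hot key should be spread across

# events: an existing DataFrame with a heavily skewed customer_id column (placeholder)
salted = events.withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))

# First aggregate per (key, salt) so the hot key's rows are split across many tasks ...
partial = (salted.groupBy("customer_id", "salt")
           .agg(F.count("*").alias("event_count")))

# ... then roll the partial results up to the final per-key totals.
result = (partial.groupBy("customer_id")
          .agg(F.sum("event_count").alias("event_count")))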

Caching

In order to improve performance, Spark allows you to cache the data frames, datasets, and RDDs in memory. This enables you to reuse a data frame multiple times in your application without having to recompute it. By default, up to 50% of your executor’s JVM is used to cache the data frames based on the property spark.memory.storageFraction. For example, if your spark.executor.memory is set to 30 GB, then 15 GB is used for cache storage that is immune to eviction.

The default storage level of the cache operation is MEMORY_AND_DISK. If the size of the data frame you are trying to cache doesn’t fit in the executor’s memory, a portion of the cache spills to disk. If there isn’t enough space to write the cached data on disk, the blocks are evicted and you don’t get the benefits of caching. Using larger workers allows you to cache more data in memory, boosting job performance by retrieving cached blocks from memory rather than the underlying storage.
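
The following sketch shows the default cache behavior alongside an explicit memory-only storage level; the table name is a placeholder and spark is an existing SparkSession.

from pyspark import StorageLevel

df = spark.table("clickstream_enriched")  # placeholder for a large, reused DataFrame

# cache() uses the default storage level (MEMORY_AND_DISK for DataFrames): blocks that
# don't fit in the executors' storage memory spill to local disk.
df.cache()
df.count()      # materialize the cache so later actions reuse it

df.unpersist()  # release the cached blocks before switching storage levels

# With larger workers (more spark.executor.memory), MEMORY_ONLY becomes viable: blocks
# that still don't fit are recomputed instead of being written to disk.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()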

Experiments

For example, the following PySpark job leads to a skew, with one executor processing 99.95% of the data with memory-intensive aggregates like collect_list. The job also caches a very large data frame (2.2 TB). Let’s run two iterations of the same job on EMR Serverless with the following vCPU and memory configurations.

Let’s run Test 3 with the previously largest possible worker configurations:

  • Size of executor set while creating EMR Serverless application = 4 vCPUs, 30 GB memory, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 4
    • spark.executor.memory = 27 G

Let’s run Test 4 with the newly released large worker configurations:

  • Size of executor set in while creating EMR Serverless application = 8 vCPUs, 60 GB memory, 200 GB disk
  • Spark job config:
    • spark.executor.cores = 8
    • spark.executor.memory = 54 G

Test 3 failed with FetchFailedException, which resulted due to the executor memory not being sufficient for the job.

Also, from the Spark UI of Test 3, we see that the reserved storage memory of the executors was fully utilized for caching the data frames.

The remaining blocks to cache were spilled to disk, as seen in the executor’s stderr logs:

23/02/06 16:06:58 INFO MemoryStore: Will not store rdd_4_1810
23/02/06 16:06:58 WARN MemoryStore: Not enough space to cache rdd_4_1810 in memory! (computed 134.1 MiB so far)
23/02/06 16:06:58 INFO MemoryStore: Memory use = 14.8 GiB (blocks) + 507.5 MiB (scratch space shared across 4 tasks(s)) = 15.3 GiB. Storage limit = 15.3 GiB.
23/02/06 16:06:58 WARN BlockManager: Persisting block rdd_4_1810 to disk instead.

Around 33% of the persisted data frame was cached on disk, as seen on the Storage tab of the Spark UI.

Test 4 with larger executors and vCores ran successfully without throwing any memory-related errors. Also, only about 2.2% of the data frame was cached to disk. Therefore, cached blocks of a data frame will be retrieved from memory rather than from disk, offering better performance.

Recommendations

To determine if the large workers will benefit your memory-intensive Spark applications, consider the following:

  • Determine if your Spark application has any data skews by looking at the Spark UI. The following screenshot of the Spark UI shows an example data skew scenario where one task processes most of the data (145.2 GB), looking at the Shuffle Read size. If one or fewer tasks process significantly more data than other tasks, rerun your application with larger workers with 60–120 G of memory (spark.executor.memory set anywhere from 54–109 GB factoring in 10% of spark.executor.memoryOverhead).

  • Check the Storage tab of the Spark History Server to review the ratio of data cached in memory to disk from the Size in memory and Size in disk columns. If more than 10% of your data is cached to disk, rerun your application with larger workers to increase the amount of data cached in memory.
  • Another way to preemptively determine if your job needs more memory is by monitoring Peak JVM Memory on the Spark UI Executors tab. If the peak JVM memory used is close to the executor or driver memory, you can create an application with a larger worker and configure a higher value for spark.executor.memory or spark.driver.memory. For example, in the following screenshot, the maximum value of peak JVM memory usage is 26 GB and spark.executor.memory is set to 27 G. In this case, it may be beneficial to use larger workers with 60 GB memory and spark.executor.memory set to 54 G. The sizing arithmetic is sketched after this list.
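
As referenced above, the following sketch shows the rough sizing arithmetic, assuming the approximately 10% allowance for spark.executor.memoryOverhead mentioned in the first recommendation; it is a rule of thumb, not an exact formula.

def executor_memory_gb(worker_memory_gb: float, overhead_fraction: float = 0.10) -> int:
    """Approximate spark.executor.memory for a worker size, leaving room for memoryOverhead."""
    return int(worker_memory_gb / (1 + overhead_fraction))

print(executor_memory_gb(30))   # ~27 GB, matching Test 3
print(executor_memory_gb(60))   # ~54 GB, matching Test 4
print(executor_memory_gb(120))  # ~109 GB, the upper bound mentioned above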

Considerations

Although larger workers with more vCPUs help increase the locality of the shuffle blocks, there are other factors involved such as disk throughput, disk IOPS (input/output operations per second), and network bandwidth. In some cases, a larger number of smaller workers, with more disks in total, could offer higher disk IOPS, throughput, and network bandwidth overall compared to fewer large workers. We encourage you to benchmark your workloads against suitable vCPU configurations to choose the best configuration for your workload.

For shuffle-heavy jobs, it’s recommended to use large disks. You can attach up to 200 GB of disk to each worker when you create your application. Using more vCPUs (spark.executor.cores) per executor may increase the disk utilization on each worker. If your application fails with “No space left on device” because the shuffle data doesn’t fit on the disk, use a larger number of smaller workers, each with a 200 GB disk.

Conclusion

In this post, you learned about the benefits of using large executors for your EMR Serverless jobs. For more information about different worker configurations, refer to Worker configurations. Large worker configurations are available in all Regions where EMR Serverless is available.


About the Author

Veena Vasudevan is a Senior Partner Solutions Architect and an Amazon EMR specialist at AWS focusing on big data and analytics. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their big data workloads to AWS.

Automate replication of relational sources into a transactional data lake with Apache Iceberg and AWS Glue

Post Syndicated from Luis Gerardo Baeza original https://aws.amazon.com/blogs/big-data/automate-replication-of-relational-sources-into-a-transactional-data-lake-with-apache-iceberg-and-aws-glue/

Organizations have chosen to build data lakes on top of Amazon Simple Storage Service (Amazon S3) for many years. A data lake is the most popular choice for organizations to store all the data generated by their different teams, across business domains, in all different formats, and even over its full history. According to a study, the average company sees the volume of its data growing at a rate that exceeds 50% per year, and typically manages an average of 33 unique data sources for analysis.

Teams often replicate thousands of jobs from relational databases with the same extract, transform, and load (ETL) pattern, and maintaining the job states and scheduling these individual jobs takes a lot of effort. The approach described in this post helps teams add tables with few changes and also maintains the job status with minimum effort, which can lead to a huge improvement in the development timeline and makes the jobs easier to track.

In this post, we show you how to easily replicate all your relational data stores into a transactional data lake in an automated fashion with a single ETL job using Apache Iceberg and AWS Glue.

Solution architecture

Data lakes are usually organized using separate S3 buckets for three layers of data: the raw layer containing data in its original form, the stage layer containing intermediate processed data optimized for consumption, and the analytics layer containing aggregated data for specific use cases. In the raw layer, tables usually are organized based on their data sources, whereas tables in the stage layer are organized based on the business domains they belong to.

This post provides an AWS CloudFormation template that deploys an AWS Glue job that reads an Amazon S3 path for one data source of the data lake raw layer, and ingests the data into Apache Iceberg tables on the stage layer using AWS Glue support for data lake frameworks. The job expects tables in the raw layer to be structured in the way AWS Database Migration Service (AWS DMS) ingests them: schema, then table, then data files.

This solution uses AWS Systems Manager Parameter Store for table configuration. You should modify this parameter to specify the tables you want to process and how to process them, including information such as the primary key, partitions, and the associated business domain. The job uses this information to automatically create a database (if it doesn’t already exist) for every business domain, create the Iceberg tables, and perform the data loading.

Finally, we can use Amazon Athena to query the data in the Iceberg tables.

The following diagram illustrates this architecture.

Solution architecture

This implementation has the following considerations:

  • All tables from the data source must have a primary key to be replicated using this solution. The primary key can be a single column or a composite key with more than one column.
  • If the data lake contains tables that don’t need upserts or don’t have a primary key, you can exclude them from the parameter configuration and implement traditional ETL processes to ingest them into the data lake. That’s outside of the scope of this post.
  • If there are additional data sources that need to be ingested, you can deploy multiple CloudFormation stacks, one to handle each data source.
  • The AWS Glue job is designed to process data in two phases: the initial load that runs after AWS DMS finishes the full load task, and the incremental load that runs on a schedule that applies change data capture (CDC) files captured by AWS DMS. Incremental processing is performed using an AWS Glue job bookmark.

There are nine steps to complete this tutorial:

  1. Set up a source endpoint for AWS DMS.
  2. Deploy the solution using AWS CloudFormation.
  3. Review the AWS DMS replication task.
  4. Optionally, add permissions for encryption and decryption or AWS Lake Formation.
  5. Review the table configuration on Parameter Store.
  6. Perform initial data loading.
  7. Perform incremental data loading.
  8. Monitor table ingestion.
  9. Schedule incremental batch data loading.

Prerequisites

Before starting this tutorial, you should already be familiar with Iceberg. If you’re not, you can get started by replicating a single table following the instructions in Implement a CDC-based UPSERT in a data lake using Apache Iceberg and AWS Glue. Additionally, set up the following:

Set up a source endpoint for AWS DMS

Before we create our AWS DMS task, we need to set up a source endpoint to connect to the source database:

  1. On the AWS DMS console, choose Endpoints in the navigation pane.
  2. Choose Create endpoint.
  3. If your database is running on Amazon RDS, choose Select RDS DB instance, then choose the instance from the list. Otherwise, choose the source engine and provide the connection information either through AWS Secrets Manager or manually.
  4. For Endpoint identifier, enter a name for the endpoint; for example, source-postgresql.
  5. Choose Create endpoint.

Deploy the solution using AWS CloudFormation

Create a CloudFormation stack using the provided template. Complete the following steps:

  1. Choose Launch Stack:
  2. Choose Next.
  3. Provide a stack name, such as transactionaldl-postgresql.
  4. Enter the required parameters:
    1. DMSS3EndpointIAMRoleARN – The IAM role ARN for AWS DMS to write data into Amazon S3.
    2. ReplicationInstanceArn – The AWS DMS replication instance ARN.
    3. S3BucketStage – The name of the existing bucket used for the stage layer of the data lake.
    4. S3BucketGlue – The name of the existing S3 bucket for storing AWS Glue scripts.
    5. S3BucketRaw – The name of the existing bucket used for the raw layer of the data lake.
    6. SourceEndpointArn – The AWS DMS endpoint ARN that you created earlier.
    7. SourceName – The arbitrary identifier of the data source to replicate (for example, postgres). This is used to define the S3 path of the data lake (raw layer) where data will be stored.
  5. Do not modify the following parameters:
    1. SourceS3BucketBlog – The bucket name where the provided AWS Glue script is stored.
    2. SourceS3BucketPrefix – The bucket prefix name where the provided AWS Glue script is stored.
  6. Choose Next twice.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  8. Choose Create stack.

After approximately 5 minutes, the CloudFormation stack is deployed.

Review the AWS DMS replication task

The AWS CloudFormation deployment created an AWS DMS target endpoint for you. Because of two specific endpoint settings, the data will be ingested as we need it on Amazon S3.

  1. On the AWS DMS console, choose Endpoints in the navigation pane.
  2. Search for and choose the endpoint that begins with dmsIcebergs3endpoint.
  3. Review the endpoint settings:
    1. DataFormat is specified as parquet.
    2. TimestampColumnName will add the column last_update_time with the date of creation of the records on Amazon S3.

AWS DMS endpoint settings

The deployment also creates an AWS DMS replication task that begins with dmsicebergtask.

  1. Choose Replication tasks in the navigation pane and search for the task.

You will see that the Task Type is marked as Full load, ongoing replication. AWS DMS will perform an initial full load of existing data, and then create incremental files with changes performed to the source database.

On the Mapping Rules tab, there are two types of rules:

  • A selection rule with the name of the source schema and tables that will be ingested from the source database. By default, it uses the sample database provided in the prerequisites, dms_sample, and all tables with the keyword %.
  • Two transformation rules that include in the target files on Amazon S3 the schema name and table name as columns. This is used by our AWS Glue job to know to which tables the files in the data lake correspond.

To learn more about how to customize this for your own data sources, refer to Selection rules and actions.

AWS mapping rules

Let’s change some configurations to finish our task preparation.

  1. On the Actions menu, choose Modify.
  2. In the Task Settings section, under Stop task after full load completes, choose Stop after applying cached changes.

This way, we can control the initial load and incremental file generation as two different steps. We use this two-step approach to run the AWS Glue job once per each step.

  1. Under Task logs, choose Turn on CloudWatch logs.
  2. Choose Save.
  3. Wait about 1 minute for the database migration task status to show as Ready.

Add permissions for encryption and decryption or Lake Formation

Optionally, you can add permissions for encryption and decryption or Lake Formation.

Add encryption and decryption permissions

If your S3 buckets used for the raw and stage layers are encrypted using AWS Key Management Service (AWS KMS) customer managed keys, you need to add permissions to allow the AWS Glue job to access the data:

Add Lake Formation permissions

If you’re managing permissions using Lake Formation, you need to allow your AWS Glue job to create your domain’s databases and tables through the IAM role GlueJobRole.

  1. Grant permissions to create databases (for instructions, refer to Creating a Database).
  2. Grant SUPER permissions to the default database.
  3. Grant data location permissions.
  4. If you create databases manually, grant permissions on all databases to create tables. Refer to Granting table permissions using the Lake Formation console and the named resource method or Granting Data Catalog permissions using the LF-TBAC method according to your use case.

After you complete the later step of performing the initial data load, make sure to also add permissions for consumers to query the tables. The job role will become the owner of all the tables created, and the data lake admin can then perform grants to additional users.
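For example, after the initial load you might grant a consumer role SELECT access to one of the created tables. The following AWS CLI sketch assumes Lake Formation is managing permissions; the account ID, role name, database, and table are illustrative:

aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::<account-id>:role/AnalystRole \
  --permissions "SELECT" \
  --resource '{"Table": {"DatabaseName": "sports", "Name": "mlb_data"}}'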

Review table configuration in Parameter Store

The AWS Glue job that performs the data ingestion into Iceberg tables uses the table specification provided in Parameter Store. Complete the following steps to review the parameter that was configured automatically for you, and modify it according to your own needs if necessary.

  1. On the Parameter Store console, choose My parameters in the navigation pane.

The CloudFormation stack created two parameters:

  • iceberg-config for job configurations
  • iceberg-tables for table configuration
  1. Choose the parameter iceberg-tables.

The JSON structure contains information that AWS Glue uses to read data and write the Iceberg tables on the target domain:

  • One object per table – The name of the object is created using the schema name, a period, and the table name; for example, schema.table.
  • primaryKey – This should be specified for every source table. You can provide a single column or a comma-separated list of columns (without spaces).
  • partitionCols – This optionally specifies the partition columns for target tables. If you don’t want to create partitioned tables, provide an empty string. Otherwise, provide a single column or a comma-separated list of columns to be used (without spaces).
  1. If you want to use your own data source, use the following JSON code and replace the text in CAPS from the template provided. If you’re using the sample data source provided, keep the default settings:
{
    "SCHEMA_NAME.TABLE_NAME_1": {
        "primaryKey": "ONLY_PRIMARY_KEY",
        "domain": "TARGET_DOMAIN",
        "partitionCols": ""
    },
    "SCHEMA_NAME.TABLE_NAME_2": {
        "primaryKey": "FIRST_PRIMARY_KEY,SECOND_PRIMARY_KEY",
        "domain": "TARGET_DOMAIN",
        "partitionCols": "PARTITION_COLUMN_ONE,PARTITION_COLUMN_TWO"
    }
}
  1. Choose Save changes.

Perform initial data loading

Now that the required configuration is finished, we ingest the initial data. This step includes three parts: ingesting the data from the source relational database into the raw layer of the data lake, creating the Iceberg tables on the stage layer of the data lake, and verifying results using Athena.

Ingest data into the raw layer of the data lake

To ingest data from the relational data source (PostgreSQL if you are using the sample provided) to our transactional data lake using Iceberg, complete the following steps:

  1. On the AWS DMS console, choose Database migration tasks in the navigation pane.
  2. Select the replication task you created and on the Actions menu, choose Restart/Resume.
  3. Wait about 5 minutes for the replication task to complete. You can monitor the tables ingested on the Statistics tab of the replication task.

AWS DMS full load statistics

After some minutes, the task finishes with the message Full load complete.

  1. On the Amazon S3 console, choose the bucket you defined as the raw layer.

Under the S3 prefix defined on AWS DMS (for example, postgres), you should see a hierarchy of folders with the following structure:

  • Schema
    • Table name
      • LOAD00000001.parquet
      • LOAD0000000N.parquet

AWS DMS full load objects created on S3

If your S3 bucket is empty, review Troubleshooting migration tasks in AWS Database Migration Service before running the AWS Glue job.

Create and ingest data into Iceberg tables

Before running the job, let’s review the script of the AWS Glue job provided as part of the CloudFormation stack to understand its behavior.

  1. On the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Search for the job that starts with IcebergJob- and a suffix of your CloudFormation stack name (for example, IcebergJob-transactionaldl-postgresql).
  3. Choose the job.

AWS Glue ETL job review

The job script gets the configuration it needs from Parameter Store. The function getConfigFromSSM() returns job-related configurations such as the source and target buckets from which the data needs to be read and to which it is written. The variable ssm_param_table_values contains table-related information like the data domain, table name, partition columns, and primary key of the tables that need to be ingested. See the following Python code:

import sys
import json
import boto3
from awsglue.utils import getResolvedOptions

# SSM client used to read the job and table configuration parameters
ssmClient = boto3.client('ssm')

# Main application
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'stackName'])
SSM_PARAMETER_NAME = f"{args['stackName']}-iceberg-config"
SSM_TABLE_PARAMETER_NAME = f"{args['stackName']}-iceberg-tables"

# Parameters for the job
rawS3BucketName, rawBucketPrefix, stageS3BucketName, warehouse_path = getConfigFromSSM(SSM_PARAMETER_NAME)
ssm_param_table_values = json.loads(ssmClient.get_parameter(Name = SSM_TABLE_PARAMETER_NAME)['Parameter']['Value'])
dropColumnList = ['db','table_name', 'schema_name','Op', 'last_update_time', 'max_op_date']

The script uses an arbitrary catalog name for Iceberg that is defined as my_catalog. This is implemented on the AWS Glue Data Catalog using Spark configurations, so a SQL operation pointing to my_catalog will be applied on the Data Catalog. See the following code:

catalog_name = 'my_catalog'
errored_table_list = []

# Iceberg configuration
spark = SparkSession.builder \
    .config('spark.sql.warehouse.dir', warehouse_path) \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.warehouse', warehouse_path) \
    .config(f'spark.sql.catalog.{catalog_name}.catalog-impl', 'org.apache.iceberg.aws.glue.GlueCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()

The script iterates over the tables defined in Parameter Store and performs the logic for detecting if the table exists and if the incoming data is an initial load or an upsert:

# Iteration over tables stored on Parameter Store
for key in ssm_param_table_values:
    # Get table data
    isTableExists = False
    schemaName, tableName = key.split('.')
    logger.info(f'Processing table : {tableName}')

The initialLoadRecordsSparkSQL() function loads initial data when no operation column is present in the S3 files. AWS DMS adds this column only to Parquet data files produced by the continuous replication (CDC). The data loading is performed using the INSERT INTO command with SparkSQL. See the following code:

from string import Template

# Build the INSERT statement from the table metadata and run it with Spark SQL
sqltemp = Template("""
    INSERT INTO $catalog_name.$dbName.$tableName  ($insertTableColumnList)
    SELECT $insertTableColumnList FROM insertTable $partitionStrSQL
""")
SQLQUERY = sqltemp.substitute(
    catalog_name = catalog_name, 
    dbName = dbName, 
    tableName = tableName,
    insertTableColumnList = insertTableColumnList[ : -1],
    partitionStrSQL = partitionStrSQL)

logger.info(f'****SQL QUERY IS : {SQLQUERY}')
spark.sql(SQLQUERY)

Now we run the AWS Glue job to ingest the initial data into the Iceberg tables. The CloudFormation stack adds the --datalake-formats parameter, adding the required Iceberg libraries to the job.

  1. Choose Run job.
  2. Choose Job Runs to monitor the status. Wait until the status is Run Succeeded.

Verify the data loaded

To confirm that the job processed the data as expected, complete the following steps:

  1. On the Athena console, choose Query Editor in the navigation pane.
  2. Verify AwsDataCatalog is selected as the data source.
  3. Under Database, choose the data domain that you want to explore, based on the configuration you defined in the parameter store. If using the sample database provided, use sports.

Under Tables and views, we can see the list of tables that were created by the AWS Glue job.

  1. Choose the options menu (three dots) next to the first table name, then choose Preview Data.

You can see the data loaded into Iceberg tables.

Amazon Athena review initial data loaded

Perform incremental data loading

Now we start capturing changes from our relational database and applying them to the transactional data lake. This step is also divided in three parts: capturing the changes, applying them to the Iceberg tables, and verifying the results.

Capture changes from the relational database

Due to the configuration we specified, the replication task stopped after running the full load phase. Now we restart the task to add incremental files with changes into the raw layer of the data lake.

  1. On the AWS DMS console, select the task we created and ran before.
  2. On the Actions menu, choose Resume.
  3. Choose Start task to start capturing changes.
  4. To trigger new file creation on the data lake, perform inserts, updates, or deletes on the tables of your source database using your preferred database administration tool. If using the sample database provided, you could run the following SQL commands:
UPDATE dms_sample.nfl_stadium_data_upd
SET seatin_capacity=93703
WHERE team = 'Los Angeles Rams' and sport_location_id = '31';

update  dms_sample.mlb_data 
set bats = 'R'
where mlb_id=506560 and bats='L';

update dms_sample.sporting_event 
set start_date  = current_date 
where id=11 and sold_out=0;
  1. On the AWS DMS task details page, choose the Table statistics tab to see the changes captured.
    AWS DMS CDC statistics
  2. Open the raw layer of the data lake to find a new file holding the incremental changes inside every table’s prefix, for example under the sporting_event prefix.

The record with changes for the sporting_event table looks like the following screenshot.

AWS DMS objects migrated into S3 with CDC

Notice the Op column at the beginning, which identifies the record as an update (U). Also, the second date/time value is the control column added by AWS DMS with the time the change was captured.

CDC file schema on Amazon S3

Apply changes on the Iceberg tables using AWS Glue

Now we run the AWS Glue job again, and it will automatically process only the new incremental files since the job bookmark is enabled. Let’s review how it works.

The dedupCDCRecords() function performs deduplication of data because multiple changes to a single record ID could be captured within the same data file on Amazon S3. Deduplication is performed based on the last_update_time column added by AWS DMS that indicates the timestamp of when the change was captured. See the following Python code:

import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import max  # Spark's max, used as a window aggregate

def dedupCDCRecords(inputDf, keylist):
    # Keep only the latest change per key, based on the AWS DMS timestamp column
    IDWindowDF = Window.partitionBy(*keylist).orderBy(inputDf.last_update_time).rangeBetween(-sys.maxsize, sys.maxsize)
    inputDFWithTS = inputDf.withColumn('max_op_date', max(inputDf.last_update_time).over(IDWindowDF))

    NewInsertsDF = inputDFWithTS.filter('last_update_time=max_op_date').filter("op='I'")
    UpdateDeleteDf = inputDFWithTS.filter('last_update_time=max_op_date').filter("op IN ('U','D')")
    finalInputDF = NewInsertsDF.unionAll(UpdateDeleteDf)

    return finalInputDF

On line 99, the upsertRecordsSparkSQL() function performs the upsert in a similar fashion to the initial load, but this time with a SQL MERGE command.
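The generated statement isn’t shown in full here; conceptually, the MERGE issued for each table resembles the following sketch (the catalog, database, table, key column, and staged view names are illustrative):

MERGE INTO my_catalog.sports.sporting_event AS t
USING sporting_event_changes AS s
  ON t.id = s.id
WHEN MATCHED AND s.Op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

The actual script builds the column and key lists dynamically for each table and, as the dropColumnList variable suggests, keeps the AWS DMS control columns out of the target tables.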

Review the applied changes

Open the Athena console and run a query that selects the records you changed in the source database. If using the provided sample database, use the following SQL query:

SELECT * FROM "sports"."nfl_stadium_data_upd"
WHERE team = 'Los Angeles Rams' and sport_location_id = 31
LIMIT 1;

Amazon Athena review cdc data loaded

Monitor table ingestion

The AWS Glue job script is coded with simple Python exception handling to catch errors while processing a specific table. The job bookmark is saved after each table finishes processing successfully, so that tables that already succeeded aren’t reprocessed if the job run is retried for the tables with errors.

The AWS Command Line Interface (AWS CLI) provides a get-job-bookmark command for AWS Glue that provides insight into the status of the bookmark for each table processed.

  1. On the AWS Glue Studio console, choose the ETL job.
  2. Choose the Job Runs tab and copy the job run ID.
  3. Run the following command on a terminal authenticated for the AWS CLI, replacing <GLUE_JOB_RUN_ID> on line 1 with the value you copied. If your CloudFormation stack is not named transactionaldl-postgresql, provide the name of your job on line 2 of the script:
jobrun=<GLUE_JOB_RUN_ID>
jobname=IcebergJob-transactionaldl-postgresql
aws glue get-job-bookmark --job-name $jobname --run-id $jobrun

In this solution, when processing a table raises an exception, the AWS Glue job doesn’t fail immediately. Instead, the table is added to an array that is printed after the job is complete, and the job is marked as failed only after it has tried to process the rest of the tables detected in the raw data source. This way, tables without errors don’t have to wait until the user identifies and solves the problem with the conflicting tables. The user can quickly detect job runs that had issues using the AWS Glue job run status, and identify which specific tables are causing the problem using the CloudWatch logs for the job run.

  1. The job script implements this feature with the following Python code:
# Performed for every table
for key in ssm_param_table_values:
    try:
        # Table processing logic
        ...
    except Exception as e:
        logger.info(f'There is an issue with table: {tableName}')
        logger.info(f'The exception is : {e}')
        errored_table_list.append(tableName)
        continue
    job.commit()

if len(errored_table_list):
    logger.info(f'Total number of errored tables: {len(errored_table_list)}')
    logger.info(f"Tables that failed during processing: {', '.join(errored_table_list)}")
    raise Exception('***** Some tables failed to process.')

The following screenshot shows how the CloudWatch logs look for tables that cause errors on processing.

AWS Glue job monitoring with logs

Aligned with the AWS Well-Architected Framework Data Analytics Lens practices, you can adopt more sophisticated control mechanisms to identify and notify stakeholders when errors appear in the data pipelines. For example, you can use an Amazon DynamoDB control table to store all tables and job runs with errors, or use Amazon Simple Notification Service (Amazon SNS) to send alerts to operators when certain criteria are met.
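The following is a minimal sketch of what such a notification step could look like inside the job’s exception handler; the DynamoDB table name and SNS topic ARN are illustrative and are not created by the provided CloudFormation template:

import boto3

dynamodb = boto3.resource('dynamodb')
sns = boto3.client('sns')

def notify_table_failure(job_run_id, table_name, error_message):
    # Record the failure in a DynamoDB control table (illustrative name)
    dynamodb.Table('iceberg-ingestion-errors').put_item(Item={
        'job_run_id': job_run_id,
        'table_name': table_name,
        'error_message': error_message
    })
    # Alert operators through an SNS topic (illustrative ARN)
    sns.publish(
        TopicArn='arn:aws:sns:us-east-1:111122223333:data-pipeline-alerts',
        Subject=f'Iceberg ingestion failed for table {table_name}',
        Message=error_message
    )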

Schedule incremental batch data loading

The CloudFormation stack deploys an Amazon EventBridge rule (disabled by default) that can trigger the AWS Glue job to run on a schedule. To provide your own schedule and enable the rule, complete the following steps:

  1. On the EventBridge console, choose Rules in the navigation pane.
  2. Search for the rule prefixed with the name of your CloudFormation stack followed by JobTrigger (for example, transactionaldl-postgresql-JobTrigger-randomvalue).
  3. Choose the rule.
  4. Under Event Schedule, choose Edit.

The default schedule is configured to trigger every hour.

  1. Provide the schedule you want to run the job.
  2. Additionally, you can use an EventBridge cron expression by selecting A fine-grained schedule.
    Amazon EventBridge schedule ETL job
  3. When you finish setting up the cron expression, choose Next three times, and finally choose Update Rule to save changes.

The rule is created disabled by default to allow you to run the initial data load first.

  1. Activate the rule by choosing Enable.

You can view rule invocations on the Monitoring tab, or directly on the AWS Glue job run details page.

Conclusion

After deploying this solution, you have automated the ingestion of your tables on a single relational data source. Organizations using a data lake as their central data platform usually need to handle multiple, sometimes even tens of data sources. Also, more and more use cases require organizations to implement transactional capabilities to the data lake. You can use this solution to accelerate the adoption of such capabilities across all your relational data sources to enable new business use cases, automating the implementation process to derive more value from your data.


About the Authors

Luis Gerardo Baeza is a Big Data Architect in the Amazon Web Services (AWS) Data Lab. He has 12 years of experience helping organizations in the healthcare, financial and education sectors to adopt enterprise architecture programs, cloud computing, and data analytics capabilities. Luis currently helps organizations across Latin America to accelerate strategic data initiatives.

SaiKiran Reddy Aenugu is a Data Architect in the Amazon Web Services (AWS) Data Lab. He has 10 years of experience implementing data loading, transformation, and visualization processes. SaiKiran currently helps organizations in North America to adopt modern data architectures such as data lakes and data mesh. He has experience in the retail, airline, and finance sectors.

Narendra Merla is a Data Architect in the Amazon Web Services (AWS) Data Lab. He has 12 years of experience in designing and productionalizing both real-time and batch-oriented data pipelines and building data lakes on both cloud and on-premises environments. Narendra currently helps organizations in North America to build and design robust data architectures, and has experience in the telecom and finance sectors.

How to visualize IAM Access Analyzer policy validation findings with QuickSight

Post Syndicated from Mostefa Brougui original https://aws.amazon.com/blogs/security/how-to-visualize-iam-access-analyzer-policy-validation-findings-with-quicksight/

In this blog post, we show you how to create an Amazon QuickSight dashboard to visualize the policy validation findings from AWS Identity and Access Management (IAM) Access Analyzer. You can use this dashboard to better understand your policies and how to achieve least privilege by periodically validating your IAM roles against IAM best practices. This blog post walks you through the deployment for a multi-account environment using AWS Organizations.

Achieving least privilege is a continuous cycle to grant only the permissions that your users and systems require. To achieve least privilege, you start by setting fine-grained permissions. Then, you verify that the existing access meets your intent. Finally, you refine permissions by removing unused access. To learn more, see IAM Access Analyzer makes it easier to implement least privilege permissions by generating IAM policies based on access activity.

Policy validation is a feature of IAM Access Analyzer that guides you to author and validate secure and functional policies with more than 100 policy checks. You can use these checks when creating new policies or to validate existing policies. To learn how to use IAM Access Analyzer policy validation APIs when creating new policies, see Validate IAM policies in CloudFormation templates using IAM Access Analyzer. In this post, we focus on how to validate existing IAM policies.

Approach to visualize IAM Access Analyzer findings

As shown in Figure 1, there are four high-level steps to build the visualization.

Figure 1: Steps to visualize IAM Access Analyzer findings

  1. Collect IAM policies

    To validate your IAM policies with IAM Access Analyzer in your organization, start by periodically sending the content of your IAM policies (inline and customer-managed) to a central account, such as your Security Tooling account.

  2. Validate IAM policies

    After you collect the IAM policies in a central account, run an IAM Access Analyzer ValidatePolicy API call on each policy. The API calls return a list of findings. The findings can help you identify issues, provide actionable recommendations to resolve the issues, and enable you to author functional policies that can meet security best practices. The findings are stored in an Amazon Simple Storage Service (Amazon S3) bucket. To learn about different findings, see Access Analyzer policy check reference.

  3. Visualize findings

    IAM Access Analyzer policy validation findings are stored centrally in an S3 bucket. The S3 bucket is owned by the central (hub) account of your choosing. You can use Amazon Athena to query the findings from the S3 bucket, and then create a QuickSight analysis to visualize the findings.

  4. Publish dashboards

    Finally, you can publish a shareable QuickSight dashboard. Figure 2 shows an example of the dashboard.

    Figure 2: Dashboard overview

Design overview

This implementation is a serverless job initiated by Amazon EventBridge rules. It collects IAM policies into a hub account (such as your Security Tooling account), validates the policies, stores the validation results in an S3 bucket, and uses Athena to query the findings and QuickSight to visualize them. Figure 3 gives a design overview of our implementation.

Figure 3: Design overview of the implementation

As shown in Figure 3, the implementation includes the following steps:

  1. A time-based rule is set to run daily. The rule triggers an AWS Lambda function that lists the IAM policies of the AWS account it is running in.
  2. For each IAM policy, the function sends a message to an Amazon Simple Queue Service (Amazon SQS) queue. The message contains the IAM policy Amazon Resource Name (ARN), and the policy document.
  3. When new messages are received, the Amazon SQS queue initiates the second Lambda function. For each message, the Lambda function extracts the policy document and validates it by using the IAM Access Analyzer ValidatePolicy API call (a minimal sketch of this call follows the list).
  4. The Lambda function stores validation results in an S3 bucket.
  5. An AWS Glue table contains the schema for the IAM Access Analyzer findings. Athena natively uses the AWS Glue Data Catalog.
  6. Athena queries the findings stored in the S3 bucket.
  7. QuickSight uses Athena as a data source to visualize IAM Access Analyzer findings.
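The following is a minimal sketch of the validation call from step 3, assuming a Python Lambda function that has already read the policy document from the SQS message; the function and variable names are illustrative, and results with many findings can be paginated through nextToken:

import boto3

access_analyzer = boto3.client('accessanalyzer')

def validate_policy_document(policy_document: str) -> list:
    # Run IAM Access Analyzer policy validation checks on a single policy
    response = access_analyzer.validate_policy(
        policyDocument=policy_document,
        policyType='IDENTITY_POLICY'
    )
    # Each finding includes a findingType, issueCode, findingDetails, and locations
    return response['findings']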

Benefits of the implementation

By implementing this solution, you can achieve the following benefits:

  • Store your IAM Access Analyzer policy validation results in a scalable and cost-effective manner with Amazon S3.
  • Add scalability and fault tolerance to your validation workflow with Amazon SQS.
  • Partition your evaluation results in Athena and restrict the amount of data scanned by each query, helping to improve performance and reduce cost.
  • Gain insights from IAM Access Analyzer policy validation findings with QuickSight dashboards. You can use the dashboard to identify IAM policies that don’t comply with AWS best practices and then take action to correct them.

Prerequisites

Before you implement the solution, make sure you’ve completed the following steps:

  1. Install a Git client, such as GitHub Desktop.
  2. Install the AWS Command Line Interface (AWS CLI). For instructions, see Installing or updating the latest version of the AWS CLI.
  3. If you plan to deploy the implementation in a multi-account environment using Organizations, enable all features and enable trusted access with Organizations to operate a service-managed stack set.
  4. Get a QuickSight subscription to the Enterprise edition. When you first subscribe to the Enterprise edition, you get a free trial for four users for 30 days. Trial authors are automatically converted to month-to-month subscription upon trial expiry. For more details, see Signing up for an Amazon QuickSight subscription, Amazon QuickSight Enterprise edition and the Amazon QuickSight Pricing Calculator.

Note: This implementation works in accounts that don’t have AWS Lake Formation enabled. If Lake Formation is enabled in your account, you might need to grant Lake Formation permissions in addition to the implementation IAM permissions. For details, see Lake Formation access control overview.

Walkthrough

In this section, we will show you how to deploy an AWS CloudFormation template to your central account (such as your Security Tooling account), which is the hub for IAM Access Analyzer findings. The central account collects, validates, and visualizes your findings.

To deploy the implementation to your multi-account environment

  1. Deploy the CloudFormation stack to your central account.

    Important: Do not deploy the template to the organization’s management account; see design principles for organizing your AWS accounts. You can choose the Security Tooling account as a hub account.

    In your central account, run the following commands in a terminal. These commands clone the GitHub repository and deploy the CloudFormation stack to your central account.

    # A) Clone the repository
    git clone https://github.com/aws-samples/visualize-iam-access-analyzer-policy-validation-findings.git

    # B) Switch to the repository's directory
    cd visualize-iam-access-analyzer-policy-validation-findings

    # C) Deploy the CloudFormation stack to your central security account (hub). For <AWSRegion> enter your AWS Region without quotes.
    make deploy-hub aws-region=<AWSRegion>

    If you want to send IAM policies from other member accounts to your central account, you will need to make note of the CloudFormation stack outputs for SQSQueueUrl and KMSKeyArn when the deployment is complete.

    make describe-hub-outputs aws-region=<AWSRegion>

  2. Switch to your organization’s management account and deploy the stack sets to the member accounts. For <SQSQueueUrl> and <KMSKeyArn>, use the values from the previous step.
    # Create a CloudFormation stack set to deploy the resources to the member accounts.
      make deploy-members SQSQueueUrl=<SQSQueueUrl> KMSKeyArn=<KMSKeyArn> aws-region=<AWSRegion>

To deploy the QuickSight dashboard to your central account

  1. Make sure that QuickSight is using the IAM role aws-quicksight-service-role.
    1. In QuickSight, in the navigation bar at the top right, choose your account (indicated by a person icon) and then choose Manage QuickSight.
    2. On the Manage QuickSight page, in the menu at the left, choose Security & Permissions.
    3. On the Security & Permissions page, under QuickSight access to AWS services, choose Manage.
    4. For IAM role, choose Use an existing role, and then do one of the following:
      • If you see a list of existing IAM roles, choose the role

        arn:aws:iam::<account-id>:role/service-role/aws-quicksight-service-role.

      • If you don’t see a list of existing IAM roles, enter the IAM ARN for the role in the following format:

        arn:aws:iam::<account-id>:role/service-role/aws-quicksight-service-role.

    5. Choose Save.
  2. Retrieve the QuickSight users.
    # <aws-region> your Quicksight main Region, for example eu-west-1
    # <account-id> The ID of your account, for example 123456789012
    # <namespace-name> Quicksight namespace, for example default.
    # You can list the namespaces by using aws quicksight list-namespaces --aws-account-id <account-id>
      aws quicksight list-users --region <aws-region> --aws-account-id <account-id> --namespace <namespace-name>

  3. Make a note of the user’s ARN that you want to grant permissions to list, describe, or update the QuickSight dashboard. This information is found in the arn element. For example, arn:aws:quicksight:us-east-1:111122223333:user/default/User1
  4. To launch the deployment stack for the QuickSight dashboard, run the following command. Replace <quicksight-user-arn> with the user’s ARN from the previous step.
    make deploy-dashboard-hub aws-region=<AWSRegion> quicksight-user-arn=<quicksight-user-arn>

Publish and share the QuickSight dashboard with the policy validation findings

You can publish your QuickSight dashboard and then share it with other QuickSight users for reporting purposes. The dashboard preserves the configuration of the analysis at the time that it’s published and reflects the current data in the datasets used by the analysis.

To publish the QuickSight dashboard

  1. In the QuickSight console, choose Analyses and then choose access-analyzer-validation-findings.
  2. (Optional) Modify the visuals of the analysis. For more information, see Tutorial: Modify Amazon QuickSight visuals.
  3. Share the QuickSight dashboard.
    1. In your analysis, in the application bar at the upper right, choose Share, and then choose Publish dashboard.
    2. On the Publish dashboard page, choose Publish new dashboard as and enter IAM Access Analyzer Policy Validation.
    3. Choose Publish dashboard. The dashboard is now published.
  4. On the QuickSight start page, choose Dashboards.
  5. Select the IAM Access Analyzer Policy Validation dashboard. IAM Access Analyzer policy validation findings will appear within the next 24 hours.

    Note: If you don’t want to wait until the Lambda function is initiated automatically, you can invoke the function that lists customer-managed policies and inline policies by using the aws lambda invoke AWS CLI command on the hub account and wait one to two minutes to see the policy validation findings:

    aws lambda invoke --function-name access-analyzer-list-iam-policy --invocation-type Event --cli-binary-format raw-in-base64-out --payload '{}' response.json

  6. (Optional) To export your dashboard as a PDF, see Exporting Amazon QuickSight analyses or dashboards as PDFs.

To share the QuickSight dashboard

  1. In the QuickSight console, choose Dashboards and then choose IAM Access Analyzer Policy Validation.
  2. In your dashboard, in the application bar at the upper right, choose Share, and then choose Share dashboard.
  3. On the Share dashboard page that opens, do the following:
    1. For Invite users and groups to dashboard on the left pane, enter a user email or group name in the search box. Users or groups that match your query appear in a list below the search box. Only active users and groups appear in the list.
    2. For the user or group that you want to grant access to the dashboard, choose Add. Then choose the level of permissions that you want them to have.
  4. After you grant users access to a dashboard, you can copy a link to it and send it to them.

For more details, see Sharing dashboards or Sharing your view of a dashboard.

Your teams can use this dashboard to better understand their IAM policies and how to move toward least-privilege permissions, as outlined in the section Validate your IAM roles of the blog post Top 10 security items to improve in your AWS account.

Clean up

To avoid incurring additional charges in your accounts, remove the resources that you created in this walkthrough.

Before deleting the CloudFormation stacks and stack sets in your accounts, make sure that the S3 buckets that you created are empty. To delete everything (including old versioned objects) in a versioned bucket, we recommend emptying the bucket through the console. Before deleting the CloudFormation stack from the central account, delete the Athena workgroup.

To delete remaining resources from your AWS accounts

  1. Delete the CloudFormation stack from your central account by running the following command. Make sure to replace <AWSRegion> with your own Region.
    make delete-hub aws-region=<AWSRegion>

  2. Delete the CloudFormation stack set instances and stack sets by running the following command using your organization’s management account credentials. Make sure to replace <AWSRegion> with your own Region.
    make delete-stackset-instances aws-region=<AWSRegion>
      # Wait for the operation to finish. You can check its progress on the CloudFormation console.
      make delete-stackset aws-region=<AWSRegion>

  3. Delete the QuickSight dashboard by running the following command using the central account credentials. Make sure to replace <AWSRegion> with your own Region.
    make delete-dashboard aws-region=<AWSRegion>

  4. To cancel your QuickSight subscription and close the account, see Canceling your Amazon QuickSight subscription and closing the account.

Conclusion

In this post, you learned how to validate your existing IAM policies by using the IAM Access Analyzer ValidatePolicy API and visualizing the results with AWS analytics tools. By using the implementation, you can better understand your IAM policies and work to reach least privilege in a scalable, fault-tolerant, and cost-effective way. This will help you identify opportunities to tighten your permissions and to grant the right fine-grained permissions to help enhance your overall security posture.

To learn more about IAM Access Analyzer, see previous blog posts on IAM Access Analyzer.

To download the CloudFormation templates, see the visualize-iam-access-analyzer-policy-validation-findings GitHub repository. For information about pricing, see Amazon SQS pricing, AWS Lambda pricing, Amazon Athena pricing and Amazon QuickSight pricing.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post.

Want more AWS Security news? Follow us on Twitter.

Mostefa Brougui

Mostefa is a Sr. Security Consultant in Professional Services at Amazon Web Services. He works with AWS enterprise customers to design, build, and optimize their security architecture to drive business outcomes.

Tobias Nickl

Tobias works in Professional Services at Amazon Web Services as a Security Engineer. In addition to building custom AWS solutions, he advises AWS enterprise customers on how to reach their business objectives and accelerate their cloud transformation.

Monitor Apache HBase on Amazon EMR using Amazon Managed Service for Prometheus and Amazon Managed Grafana

Post Syndicated from Anubhav Awasthi original https://aws.amazon.com/blogs/big-data/monitor-apache-hbase-on-amazon-emr-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/

Amazon EMR provides a managed Apache Hadoop framework that makes it straightforward, fast, and cost-effective to run Apache HBase. Apache HBase is a massively scalable, distributed big data store in the Apache Hadoop ecosystem. It is an open-source, non-relational, versioned database that runs on top of the Apache Hadoop Distributed File System (HDFS). It’s built for random, strictly consistent, real-time access for tables with billions of rows and millions of columns. Monitoring HBase clusters is critical in order to identify stability and performance bottlenecks and proactively preempt them. In this post, we discuss how you can use Amazon Managed Service for Prometheus and Amazon Managed Grafana to monitor, alert, and visualize HBase metrics.

HBase has built-in support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia or via JMX. You can either use AWS Distro for OpenTelemetry or Prometheus JMX exporters to collect metrics exposed by HBase. In this post, we show how to use Prometheus exporters. These exporters behave like small webservers that convert internal application metrics to Prometheus format and serve it at /metrics path. A Prometheus server running on an Amazon Elastic Compute Cloud (Amazon EC2) instance collects these metrics and remote writes to an Amazon Managed Service for Prometheus workspace. We then use Amazon Managed Grafana to create dashboards and view these metrics using an Amazon Managed Service for Prometheus workspace as its data source.

This solution can be extended to other big data platforms such as Apache Spark and Apache Presto that also use JMX to expose their metrics.

Solution overview

The following diagram illustrates our solution architecture.

Solution Architecture

This post uses an AWS CloudFormation template to perform the following actions:

  1. Install an open-source Prometheus server on an EC2 instance.
  2. Create appropriate AWS Identity and Access Management (IAM) roles and security group for the EC2 instance running the Prometheus server.
  3. Create an EMR cluster with an HBase on Amazon S3 configuration.
  4. Install JMX exporters on all EMR nodes.
  5. Create additional security groups for the EMR master and worker nodes to connect with the Prometheus server running on the EC2 instance.
  6. Create a workspace in Amazon Managed Service for Prometheus.

Prerequisites

To implement this solution, make sure you have the following prerequisites:

aws emr create-default-roles

Deploy the CloudFormation template

Deploy the CloudFormation template in the us-east-1 Region:

Launch Stack

It will take 15–20 minutes for the template to complete. The template requires the following fields:

  • Stack Name – Enter a name for the stack
  • VPC – Choose an existing VPC
  • Subnet – Choose an existing subnet
  • EMRClusterName – Use EMRHBase
  • HBaseRootDir – Provide a new HBase root directory (for example, s3://hbase-root-dir/).
  • MasterInstanceType – Use m5.xlarge
  • CoreInstanceType – Use m5.xlarge
  • CoreInstanceCount – Enter 2
  • SSHIPRange – Use <your ip address>/32 (you can go to https://checkip.amazonaws.com/ to check your IP address)
  • EMRKeyName – Choose a key pair for the EMR cluster
  • EMRReleaseLabel – Use emr-6.9.0
  • InstanceType – Use the EC2 instance type for installing the Prometheus server

cloud formation parameters

Enable remote writes on the Prometheus server

The Prometheus server is running on an EC2 instance. You can find the instance hostname in the CloudFormation stack’s Outputs tab for key PrometheusServerPublicDNSName.

  1. SSH into the EC2 instance using the key pair:
    ssh -i <sshKey.pem> ec2-user@<Public IPv4 DNS of EC2 instance running Prometheus server>

  2. Copy the value for Endpoint – remote write URL from the Amazon Managed Service for Prometheus workspace console.

  1. Edit remote_write url in /etc/prometheus/conf/prometheus.yml:
sudo vi /etc/prometheus/conf/prometheus.yml

It should look like the following code:
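A minimal sketch of the remote_write section is shown below, assuming the workspace is in us-east-1 and the Prometheus server’s instance role has permission to write to it; the workspace ID and queue tuning values are illustrative, and the URL is the Endpoint value you copied:

remote_write:
  - url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
    sigv4:
      region: us-east-1
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500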

  1. Now we need to restart the Prometheus server to pick up the changes:
sudo systemctl restart prometheus

Enable Amazon Managed Grafana to read from an Amazon Managed Service for Prometheus workspace

We need to add the Amazon Managed Prometheus workspace as a data source in Amazon Managed Grafana. You can skip directly to step 3 if you already have an existing Amazon Managed Grafana workspace and want to use it for HBase metrics.

  1. First, let’s create a workspace on Amazon Managed Grafana. You can follow the appendix to create a workspace using the Amazon Managed Grafana console or run the following API from your terminal (provide your role ARN):
aws grafana create-workspace \
--account-access-type CURRENT_ACCOUNT \
--authentication-providers AWS_SSO \
--permission-type CUSTOMER_MANAGED \
--workspace-data-sources PROMETHEUS \
--workspace-name emr-metrics \
--workspace-role-arn <role-ARN> \
--workspace-notification-destinations SNS
  1. On the Amazon Managed Grafana console, choose Configure users and select a user you want to allow to log in to Grafana dashboards.

Make sure your IAM Identity Center user type is admin. We need this to create dashboards. You can assign the viewer role to all the other users.

  1. Log in to the Amazon Managed Grafana workspace URL using your admin credentials.
  2. Choose AWS Data Sources in the navigation pane.

  1. For Service, choose Amazon Managed Service for Prometheus.

  1. For Regions, choose US East (N. Virginia).

Create an HBase dashboard

Grafana Labs has an open-source dashboard that you can use. For example, you can follow the guidance from the following HBase dashboard. Start creating your dashboard and choose the import option. Provide the URL of the dashboard or enter 12722 and choose Load. Make sure your Prometheus workspace is selected on the next page. You should see HBase metrics showing up on the dashboard.

Key HBase metrics to monitor

HBase has a wide range of metrics for HMaster and RegionServer. The following are a few important metrics to keep in mind.

HMaster

  • hadoop_HBase_numregionservers – Number of live region servers
  • hadoop_HBase_numdeadregionservers – Number of dead region servers
  • hadoop_HBase_ritcount – Number of regions in transition
  • hadoop_HBase_ritcountoverthreshold – Number of regions that have been in transition longer than a threshold time (default: 60 seconds)
  • hadoop_HBase_ritduration_99th_percentile – Maximum time taken by 99% of the regions to remain in transition state

RegionServer

  • hadoop_HBase_regioncount – Number of regions hosted by the region server
  • hadoop_HBase_storefilecount – Number of store files currently managed by the region server
  • hadoop_HBase_storefilesize – Aggregate size of the store files
  • hadoop_HBase_hlogfilecount – Number of write-ahead logs not yet archived
  • hadoop_HBase_hlogfilesize – Size of all write-ahead log files
  • hadoop_HBase_totalrequestcount – Total number of requests received
  • hadoop_HBase_readrequestcount – Number of read requests received
  • hadoop_HBase_writerequestcount – Number of write requests received
  • hadoop_HBase_numopenconnections – Number of open connections at the RPC layer
  • hadoop_HBase_numactivehandler – Number of RPC handlers actively servicing requests

Memstore

  • hadoop_HBase_memstoresize – Total memstore memory size of the region server
  • hadoop_HBase_flushqueuelength – Current depth of the memstore flush queue (if increasing, we are falling behind with clearing memstores out to Amazon S3)
  • hadoop_HBase_flushtime_99th_percentile – 99th percentile latency for flush operations
  • hadoop_HBase_updatesblockedtime – Number of milliseconds updates have been blocked so the memstore can be flushed

Block Cache

  • hadoop_HBase_blockcachesize – Block cache size
  • hadoop_HBase_blockcachefreesize – Block cache free size
  • hadoop_HBase_blockcachehitcount – Number of block cache hits
  • hadoop_HBase_blockcachemisscount – Number of block cache misses
  • hadoop_HBase_blockcacheexpresshitpercent – Percentage of the time that requests with the cache turned on hit the cache
  • hadoop_HBase_blockcachecounthitpercent – Percentage of block cache hits
  • hadoop_HBase_blockcacheevictioncount – Number of block cache evictions in the region server
  • hadoop_HBase_l2cachehitratio – Local disk-based bucket cache hit ratio
  • hadoop_HBase_l2cachemissratio – Bucket cache miss ratio

Compaction

  • hadoop_HBase_majorcompactiontime_99th_percentile – Time in milliseconds taken for major compaction
  • hadoop_HBase_compactiontime_99th_percentile – Time in milliseconds taken for minor compaction
  • hadoop_HBase_compactionqueuelength – Current depth of the compaction request queue (if increasing, we are falling behind with storefile compaction)
  • flush queue length – Number of flush operations waiting to be processed in the region server (a higher number indicates flush operations are slow)

IPC Queues

  • hadoop_HBase_queuesize – Total data size of all RPC calls in the RPC queues in the region server
  • hadoop_HBase_numcallsingeneralqueue – Number of RPC calls in the general processing queue in the region server
  • hadoop_HBase_processcalltime_99th_percentile – 99th percentile latency for RPC calls to be processed in the region server
  • hadoop_HBase_queuecalltime_99th_percentile – 99th percentile latency for RPC calls to stay in the RPC queue in the region server

JVM and GC

  • hadoop_HBase_memheapusedm – Heap used
  • hadoop_HBase_memheapmaxm – Total heap
  • hadoop_HBase_pausetimewithgc_99th_percentile – Pause time in milliseconds
  • hadoop_HBase_gccount – Garbage collection count
  • hadoop_HBase_gctimemillis – Time spent in garbage collection, in milliseconds

Latencies

  • HBase.regionserver.<op>_<measure> – Operation latencies, where <op> is Append, Delete, Mutate, Get, Replay, or Increment, and <measure> is min, max, mean, median, 75th_percentile, 95th_percentile, or 99th_percentile
  • HBase.regionserver.slow<op>Count – Number of operations we thought were slow, where <op> is one of the preceding operations

Bulk Load

  • hadoop_HBase_bulkload_99th_percentile – 99th percentile bulk load latency

I/O

  • hadoop_HBase_fswritetime_99th_percentile – 99th percentile file system write latency
  • hadoop_HBase_fsreadtime_99th_percentile – 99th percentile file system read latency

Exceptions

  • exceptions.RegionTooBusyException
  • exceptions.callQueueTooBig
  • exceptions.NotServingRegionException
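These metrics can also drive alerts in Amazon Managed Service for Prometheus (see the considerations below). The following is a minimal sketch of a Prometheus rules file using two of the metrics above; the group name, thresholds, and durations are illustrative:

groups:
  - name: hbase-alerts
    rules:
      - alert: HBaseDeadRegionServers
        expr: hadoop_HBase_numdeadregionservers > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: One or more HBase region servers are reported as dead
      - alert: HBaseRegionsInTransitionTooLong
        expr: hadoop_HBase_ritcountoverthreshold > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Regions have been in transition longer than the threshold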

Considerations and limitations

Note the following when using this solution:

  • You can set up alerts on Amazon Managed Service for Prometheus and visualize them in Amazon Managed Grafana.
  • This architecture can be easily extended to include other open-source frameworks such as Apache Spark, Apache Presto, and Apache Hive.
  • Refer to the pricing details for Amazon Managed Service for Prometheus and Amazon Managed Grafana.
  • These scripts are for guidance purposes only and aren’t ready for production deployments. Make sure to perform thorough testing.

Clean up

To avoid ongoing charges, delete the CloudFormation stack and workspaces created in Amazon Managed Grafana and Amazon Managed Service for Prometheus.

Conclusion

In this post, you learned how to monitor EMR HBase clusters and set up dashboards to visualize key metrics. This solution can serve as a unified monitoring platform for multiple EMR clusters and other applications. For more information on EMR HBase, see Release Guide and HBase Migration whitepaper.


Appendix

Complete the following steps to create a workspace on Amazon Managed Grafana:

  1. Log in to the Amazon Managed Grafana console and choose Create workspace.

  2. For Authentication access, select AWS IAM Identity Center.

If you don’t have IAM Identity Center enabled, refer to Enable IAM Identity Center.

  3. Optionally, to view Prometheus alerts in your Grafana workspace, select Turn Grafana alerting on.

  4. On the next page, select Amazon Managed Service for Prometheus as the data source.

  5. After the workspace is created, assign users to access Amazon Managed Grafana.

  6. For a first-time setup, assign admin privileges to the user.

You can add other users with only viewer access.

Make sure you are able to log in to the Grafana workspace URL using your IAM Identity Center user credentials.


About the Author

Anubhav Awasthi is a Sr. Big Data Specialist Solutions Architect at AWS. He works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.

How OLX Group migrated to Amazon Redshift RA3 for simpler, faster, and more cost-effective analytics

Post Syndicated from Miguel Chin original https://aws.amazon.com/blogs/big-data/how-olx-group-migrated-to-amazon-redshift-ra3-for-simpler-faster-and-more-cost-effective-analytics/

This is a guest post by Miguel Chin, Data Engineering Manager at OLX Group and David Greenshtein, Specialist Solutions Architect for Analytics, AWS.

OLX Group is one of the world’s fastest-growing networks of online marketplaces, operating in over 30 countries around the world. We help people buy and sell cars, find housing, get jobs, buy and sell household goods, and much more.

We live in a data-producing world, and as companies want to become data driven, there is the need to analyze more and more data. These analyses are often done using data warehouses. However, a common data warehouse issue with ever-growing volumes of data is storage limitations and the degrading performance that comes with it. This scenario is very familiar to us in OLX Group. Our data warehouse is built using Amazon Redshift and is used by multiple internal teams to power their products and data-driven business decisions. As such, it’s crucial to maintain a cluster with high availability and performance while also being storage cost-efficient.

In this post, we share how we modernized our Amazon Redshift data warehouse by migrating to RA3 nodes and how it enabled us to achieve our business expectations. Hopefully you can learn from our experience in case you are considering doing the same.

Status quo before migration

Here at OLX Group, Amazon Redshift has been our choice for data warehouse for over 5 years. We started with a small Amazon Redshift cluster of 7 DC2.8xlarge nodes, and as its popularity and adoption increased inside the OLX Group data community, this cluster naturally grew.

Before migrating to RA3, we were using a 16-node DC2.8xlarge cluster with a highly tuned workload management (WLM) configuration, and performance wasn't an issue at all. However, we kept facing challenges with storage demand due to having more users, more data sources, and more prepared data. Almost every day we would get an alert that our disk space was close to 100%, which was about 40 TB worth of data.
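If you want to keep an eye on the same signal, the following is a minimal sketch of how overall disk utilization can be tracked from the cluster itself using the STV_PARTITIONS system table; alerting thresholds and scheduling are up to you.

-- Approximate overall disk utilization; STV_PARTITIONS reports used and capacity in 1 MB blocks.
SELECT SUM(used) / 1024.0 AS used_gb,
       SUM(capacity) / 1024.0 AS capacity_gb,
       SUM(used) * 100.0 / SUM(capacity) AS pct_used
FROM stv_partitions
WHERE part_begin = 0;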

Our usual method to solve storage problems used to be to simply increase the number of nodes. Overall, we reached a cluster size of 18 nodes. However, this solution wasn’t cost-efficient enough because we were adding compute capacity to the cluster even though computation power was underutilized. We saw this as a temporary solution, and we mainly did it to buy some time to explore other cost-effective alternatives, such as RA3 nodes.

Amazon Redshift RA3 nodes along with Redshift Managed Storage (RMS) provided separation of storage and compute, enabling us to scale storage and compute separately to better meet our business requirements.

Our data warehouse had the following configuration before the migration:

  • 18 x DC2.8xlarge nodes
  • 250 monthly active users, consistently increasing
  • 10,000 queries per hour, 30 queries in parallel
  • 40 TB of data, consistently increasing
  • 100% disk space utilization

This cluster's performance was generally good: ETL (extract, transform, and load) and interactive queries barely had any queue time, and 80% of them finished in under 5 minutes.
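As a rough illustration of how such a figure can be measured (our exact reporting isn't shown here), a query along these lines against STL_QUERY computes the share of queries finishing in under 5 minutes; the 7-day window is an arbitrary placeholder.

-- Share of user queries completing in under 5 minutes over the last 7 days.
SELECT SUM(CASE WHEN DATEDIFF(second, starttime, endtime) < 300 THEN 1 ELSE 0 END) * 100.0
       / COUNT(*) AS pct_under_5_min
FROM stl_query
WHERE userid > 1
  AND starttime > DATEADD(day, -7, GETDATE());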

Evaluating the performance of Amazon Redshift clusters with RA3 nodes

In this section, we discuss how we conducted a performance evaluation of RA3 nodes with an Amazon Redshift cluster.

Test environment

In order to be confident with the performance of the RA3 nodes, we decided to stress test them in a controlled environment before making the decision to migrate. To assess the nodes and find an optimal RA3 cluster configuration, we collaborated with AllCloud, the AWS premier consulting partner. The following figures illustrate the approach we took to evaluate the performance of RA3.

Test setup

This strategy aims to replicate a realistic workload in different RA3 cluster configurations and compare them with our DC2 configuration. To do this, we required the following:

  • A reference cluster snapshot – This ensures that we can replay any tests starting from the same state.
  • A set of queries from the production cluster – This set can be reconstructed from the Amazon Redshift logs (STL_QUERYTEXT) and enriched by metadata (STL_QUERY). It should be noted that we only took into consideration SELECT and FETCH query types (to simplify this first stage of performance tests). The following chart shows what the profile of our test set looked like.

  • A replay tool to orchestrate all the query operations – AllCloud developed a Python application for us for this purpose.

For more details about the approach we used, including using the Amazon Redshift Simple Replay utility, refer to Compare different node types for your workload using Amazon Redshift.
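As a minimal sketch of the reconstruction step described above (the actual extraction was done by AllCloud's tooling), the 200-character chunks in STL_QUERYTEXT can be stitched back together and enriched with timing metadata from STL_QUERY; the time window below is a placeholder, and the SELECT/FETCH filtering is omitted for brevity.

-- Rebuild full query text and enrich it with run metadata for a chosen replay window.
SELECT q.query,
       q.starttime,
       q.endtime,
       LISTAGG(t.text) WITHIN GROUP (ORDER BY t.sequence) AS full_query_text
FROM stl_query q
JOIN stl_querytext t ON q.query = t.query
WHERE q.userid > 1
  AND q.starttime BETWEEN '2022-01-01 03:00:00' AND '2022-01-01 04:00:00'  -- placeholder window
GROUP BY q.query, q.starttime, q.endtime
ORDER BY q.starttime;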

Next, we picked which cluster configurations we wanted to test, which RA3 type, and how many nodes. For the specifications of each node type, refer to Amazon Redshift pricing.

First, we decided to test the same DC2 cluster we had in production as a way to validate our test environment, followed by RA3 clusters using RA3.4xlarge nodes with various numbers of nodes. We used RA3.4xlarge because it gives us more flexibility to fine-tune how many nodes we need compared to the RA3.16xlarge instance (1 x RA3.16xlarge node is equivalent to 4 x RA3.4xlarge nodes in terms of CPU and memory). With this in mind, we tested the following cluster configurations and used the replay tool to take measurements of the performance of each cluster.

. 18 x DC2 (Reference) 18 x RA3 (Before Classic Resize) 18 x RA3 6 x RA3
Queries Number 1560 1560 1560 1560
Queries Timeouts . 25 66 127
Duration/s Mean 1.214 1.037 1.167 1.921
Duration/s Std. 2.268 2.026 2.525 3.488
Duration/s Min. 0.003 0.000 0.002 0.002
Duration/s Q 25% 0.005 0.004 0.004 0.004
Duration/s Q 50% 0.344 0.163 0.118 0.183
Duration/s Q 75% 1.040 0.746 1.076 2.566
Duration/s Max. 25.411 15.492 19.770 19.132

These results show how the DC2 cluster compares with the RA3 configurations. For the faster half of the queries (quantile 50%), the RA3 clusters were faster than DC2. Regarding the number of RA3 nodes, six nodes were clearly slower, which is particularly noticeable in the 75% quantile of query durations.

We used the following steps to deploy different clusters:

  1. Use 18 x DC2.8xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  2. Take measurements 18 x DC2.
  3. Use 18 x RA3.4xlarge, restored from the original snapshot (18 x DC2.8xlarge).
  4. Take measurements 18 x RA3 (before classic resize).
  5. Use 6 x RA3.4xlarge, classic resize from 18 x RA3.4xlarge.
  6. Take snapshot from 6 x RA3.4xlarge.
  7. Take measurements 6 x RA3.
  8. Use 6 x RA3.4xlarge, restored from 6 x RA3.4xlarge snapshot.
  9. Use 18 x RA3.4xlarge, elastic resize from 6 x RA3.4xlarge.
  10. Take measurements 18 x RA3.

Although these were promising results, the test environment setup had some limitations. We were concerned that we weren't stressing the clusters enough: queries ran only in sequence from a single client, and using only SELECT and FETCH query types moved us away from a realistic workload. Therefore, we proceeded to the second stage of our tests.

Concurrency stress test

To stress the clusters, we changed our replay tool to run multiple queries in parallel. Queries extracted from the log files were queued with the same frequency as they were originally run in the reference cluster. Up to 50 clients take queries from the queue and send them to Amazon Redshift. The timing of all queries is recorded for comparison with the reference cluster.

The cluster performance is evaluated by measuring the temporal course of the query concurrency. If a cluster is equally performant as the reference cluster, the concurrency will closely follow the concurrency of the reference cluster. Queries pushed to the query queue are immediately picked up by a client and sent to the cluster. If the cluster isn’t capable of handling the queries as fast as the reference cluster, the number of running concurrent queries will increase when compared to the reference cluster. We also decided to keep concurrency scaling disabled during this test because we wanted to focus on node types instead of cluster features.
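The concurrency measurements above were taken client-side by the replay tool. If you want to spot-check concurrency on the cluster itself while a replay runs, a simple query against the STV_INFLIGHT system table gives a comparable signal (this is an illustrative check, not part of our tooling).

-- Number of queries currently running on the cluster.
SELECT COUNT(DISTINCT query) AS running_queries
FROM stv_inflight;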

The following table shows the concurrent queries running on a DC2 and RA3 (both 18 nodes) with two different query test sets (3:00 AM and 1:00 PM). These were selected so we could test both our day and overnight workloads. 3:00 AM is when we have a peak of automated ETL jobs running, and 1:00 PM is when we have high user activity.

The median of running concurrent queries on the RA3 cluster is much higher than the DC2 one. This led us to conclude that a cluster of 18 RA3.4xlarge might not be enough to handle this workload reliably.

Concurrency 18 x DC2.8xlarge 18 x RA3.4xlarge
Starting 3:00 AM 1:00 PM 3:00 AM 1:00 PM
Mean 5 7 10 5
STD 11 13 7 4
25% 1 1 5 2
50% 2 2 8 4
75% 4 4 13 7
Max 50 50 50 27

RA3.16xlarge

Initially, we chose the RA3.4xlarge node type for more granular control in fine-tuning the number of nodes. However, we overlooked one important detail: the same instance type is used for worker and leader nodes. A leader node needs to manage all the parallel processing happening in the cluster, and a single RA3.4xlarge wasn’t enough to do so.

With this in mind, we tested two more cluster configurations: 6 x RA3.16xlarge and 8 x RA3.16xlarge, and once again measured concurrency. This time the results were much better; RA3.16xlarge was able to keep up with the reference concurrency, and the sweet spot seemed to be between 6–8 nodes.

Concurrency 18 x DC2.8xlarge 18 x RA3.4xlarge 6 x RA3.16xlarge 8 x RA3.16xlarge
Starting 3:00 AM 1:00 PM 3:00 AM 1:00 PM 3:00 AM 3:00 AM
Mean 5 7 10 5 3 1
STD 11 13 7 4 4 1
25% 1 1 5 2 2 0
50% 2 2 8 4 3 1
75% 4 4 13 7 4 2
Max 50 50 50 27 38 9

Things were looking better and our target configuration was now a 7 x RA3.16xlarge cluster. We were now confident enough to proceed with the migration.

The migration

Regardless of how excited we were to proceed, we still wanted to do a calculated migration. It’s best practice to have a playbook for migrations—a step-by-step guide on what needs to be done and also a contingency plan that includes a rollback plan. For simplicity reasons, we list here only the relevant steps in case you are looking for inspiration.

Migration plan

The migration plan included the following key steps:

  1. Remove the DNS from the current cluster, in our case in Amazon Route 53. No users should be able to query after this.
  2. Check whether any sessions are still running a query, and decide to wait or to stop them. Sessions that are still active at this point most likely belong to users connecting with the direct cluster URL (a combined check is sketched after this list).
    1. To check running sessions, use SELECT * FROM STV_SESSIONS.
    2. To stop a session, use SELECT PG_TERMINATE_BACKEND(xxxxx);.
  3. Create a snapshot of the DC2 cluster.
  4. Pause the DC2 cluster.
  5. Create an RA3 cluster from the snapshot with the following configuration:
    1. Node type – RA3.16xlarge
    2. Number of nodes – 7
    3. Database name – Same as the DC2
    4. Associated IAM roles – Same as the DC2
    5. VPC – Same as the DC2
    6. VPC security groups – Same as the DC2
    7. Parameter groups – Same as the DC2
  6. Wait for SELECT COUNT(1) FROM STV_UNDERREPPED_BLOCKS to return 0. This is related to the hydration process of the cluster.
  7. Point the DNS to the RA3 cluster.
  8. Users can now query the cluster again.
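For step 2, the session check and the decision to wait or terminate can be combined into a single query; the following sketch joins STV_SESSIONS with STV_RECENTS to show which sessions still have a query running (the column choices are illustrative).

-- Sessions that still have a running query, with how long the query has been running.
SELECT s.process AS pid,
       s.user_name,
       s.starttime AS session_start,
       r.duration / 1000000.0 AS query_running_seconds
FROM stv_sessions s
JOIN stv_recents r ON r.pid = s.process
WHERE r.status = 'Running';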

Contingency plan

In case the performance of hourly and daily ETL is not acceptable, the contingency plan is triggered:

  1. Add one more node to deal with the unexpected workload.
  2. Increase the limit of concurrency scaling hours.
  3. Reassess the parameter group.

Following this plan, we migrated from DC2 to RA3 nodes in roughly 3.5 hours, from stopping the old cluster to booting the new one and letting our processes fully synchronize. We then proceeded to monitor performance for a couple of hours. Storage capacity was looking great and everything was running smoothly, but we were curious to see how the overnight processes would perform.

The next morning, we woke up to what we dreaded: a slow cluster. We triggered our contingency plan and in the following few days we ended up implementing all three actions we had in the contingency plan.

Adding one extra node by itself didn't provide much help; however, users did experience good performance during the hours concurrency scaling was on. The concurrency scaling feature allows Amazon Redshift to temporarily increase cluster capacity whenever the workload requires it. We configured it to allow a maximum of 4 hours per day: 1 hour for free and 3 hours paid. We chose this particular value because, price-wise, it is equivalent to adding one more node (taking us to nine nodes), with the added advantage of only using and paying for it when the workload requires it.
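To confirm that a 4-hour daily cap is the right trade-off, it helps to track how much concurrency scaling time is actually consumed. Assuming the SVCS_CONCURRENCY_SCALING_USAGE view is available in your cluster, a sketch like the following summarizes usage per day.

-- Concurrency scaling hours consumed per day over the last two weeks.
SELECT TRUNC(start_time) AS usage_day,
       SUM(usage_in_seconds) / 3600.0 AS scaling_hours
FROM svcs_concurrency_scaling_usage
GROUP BY 1
ORDER BY 1 DESC
LIMIT 14;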

The last action we took was related to the parameter group, in particular, the WLM. As initially stated, we had a manually fine-tuned WLM, but it proved to be inefficient for this new RA3 cluster. Therefore, we decided to try auto WLM with the following configuration.

Manual WLM (before introducing auto WLM)
  Queue 1: Data Team ETL queue (daily and hourly), admin, monitoring, data quality queries
  Queue 2: Users queue (for both their ETL and ad hoc queries)
Auto WLM
  Queue 1 (priority highest): Daily Data Team ETL queue
  Queue 2 (priority high): Admin queries
  Queue 3 (priority normal): User queries and hourly Data Team ETL
  Queue 4 (priority low): Monitoring, data quality queries

Manual WLM requires you to manually allocate a percentage of resources and define a number of slots per queue. Although this gives you resource segregation, it also means resources are constantly allocated and can go to waste if they’re not used. Auto WLM dynamically sets these variables depending on each queue’s priority and workload. This means that a query in the highest priority queue will get all the resources allocated to it, while lower priority queues will need to wait for available resources. With this in mind, we split our ETL depending on its priority: daily ETL to highest, hourly ETL to normal (to give a fair chance for user queries to compete for resources), and monitoring and data quality to low.
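To verify that the priorities behave as intended, queue wait times can be compared per priority. The following is a sketch only; it assumes auto WLM exposes the query_priority column in STL_WLM_QUERY and that one day of history is enough for the comparison.

-- Average queue and execution time per auto WLM priority over the last day.
SELECT query_priority,
       COUNT(*) AS queries,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_seconds,
       AVG(total_exec_time) / 1000000.0 AS avg_exec_seconds
FROM stl_wlm_query
WHERE service_class > 5
  AND queue_start_time > DATEADD(day, -1, GETDATE())
GROUP BY query_priority
ORDER BY avg_queue_seconds DESC;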

After applying concurrency scaling and auto WLM, we achieved stable performance for a whole week, and considered the migration a success.

Status quo after migration

Almost a year has passed since we migrated to RA3 nodes, and we couldn’t be more satisfied. Thanks to Redshift Managed Storage (RMS), our disk space issues are a thing of the past, and performance has been generally great compared to our previous DC2 cluster. We are now at 300 monthly active users. Cluster costs did increase due to the new node type and concurrency scaling, but we now feel prepared for the future and don’t expect any cluster resizing anytime soon.

Looking back, we wanted to have a carefully planned and prepared migration, and we were able to learn more about RA3 with our test environment. However, our experience also shows that test environments aren’t always bulletproof, and some details may be overlooked. In the end, these are our main takeaways from the migration to RA3 nodes:

  • Pick the right node type according to your workload. An RA3.16xlarge cluster provides more powerful leader and worker nodes.
  • Use concurrency scaling to provision more resources when the workload demands it. Adding a new node is not always the most cost-efficient solution.
  • Manual WLM requires a lot of adjustments; using auto WLM allows for a better and fairer distribution of cluster resources.

Conclusion

In this post, we covered how OLX Group modernized our Amazon Redshift data warehouse by migrating to RA3 nodes. We detailed how we tested before migration, the migration itself, and the outcome. We are now starting to explore the possibilities provided by the RA3 nodes. In particular, the data sharing capabilities together with Redshift Serverless open the door for exciting architecture setups that we are looking forward to.

If you are going through the same storage issues we used to face with your Amazon Redshift cluster, we highly recommend migrating to RA3 nodes. Its RMS feature decouples the scalability of compute and storage power, providing a more cost-efficient solution.

Thanks for reading this post and hopefully you found it useful. If you’re going through the same scenario and have any questions, feel free to reach out.


About the author

Miguel Chin is a Data Engineering Manager at OLX Group, one of the world’s fastest-growing networks of trading platforms. He is responsible for managing a domain-oriented team of data engineers that helps shape the company’s data ecosystem by evangelizing cutting-edge data concepts like data mesh.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.

Synchronize your Salesforce and Snowflake data to speed up your time to insight with Amazon AppFlow

Post Syndicated from Ramesh Ranganathan original https://aws.amazon.com/blogs/big-data/synchronize-your-salesforce-and-snowflake-data-to-speed-up-your-time-to-insight-with-amazon-appflow/

This post was co-written with Amit Shah, Principal Consultant at Atos.

Customers across industries seek meaningful insights from the data captured in their Customer Relationship Management (CRM) systems. To achieve this, they combine their CRM data with a wealth of information already available in their data warehouse, enterprise systems, or other software as a service (SaaS) applications. One widely used approach is getting the CRM data into your data warehouse and keeping it up to date through frequent data synchronization.

Integrating third-party SaaS applications is often complicated and requires significant effort and development. Developers need to understand the application APIs, write implementation and test code, and maintain the code for future API changes. Amazon AppFlow, which is a low-code/no-code AWS service, addresses this challenge.

Amazon AppFlow is a fully managed integration service that enables you to securely transfer data between SaaS applications, like Salesforce, SAP, Zendesk, Slack, and ServiceNow, and AWS services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift in just a few clicks. With Amazon AppFlow, you can run data flows at enterprise scale at the frequency you choose—on a schedule, in response to a business event, or on demand.

In this post, we focus on synchronizing your data from Salesforce to Snowflake (on AWS) without writing code. This post walks you through the steps to set up a data flow to address full and incremental data load using an example use case.

Solution overview

Our use case involves the synchronization of the Account object from Salesforce into Snowflake. In this architecture, you use Amazon AppFlow to filter and transfer the data to your Snowflake data warehouse.

You can configure Amazon AppFlow to run your data ingestion in three different ways:

  • On-demand – You can manually run the flow through the AWS Management Console, API, or SDK call.
  • Event-driven – Amazon AppFlow can subscribe and listen to change data capture (CDC) events from the source SaaS application.
  • Scheduled – Amazon AppFlow can run schedule-triggered flows based on a pre-defined schedule rule. With scheduled flows, you can choose either full or incremental data transfer:
    • With full transfer, Amazon AppFlow transfers a snapshot of all records at the time of the flow run from the source to the destination.
    • With incremental transfer, Amazon AppFlow transfers only the records that have been added or changed since the last successful flow run. To determine the incremental delta of your data, Amazon AppFlow requires you to specify a source timestamp field that it uses to identify new or updated records.

We use the on-demand trigger for the initial load of data from Salesforce to Snowflake, because it helps you pull all the records, irrespective of their creation. To then synchronize data periodically with Snowflake, after we run the on-demand trigger, we configure a scheduled trigger with incremental transfer. With this approach, Amazon AppFlow pulls the records based on a chosen timestamp field from the Salesforce Account object periodically, based on the time interval specified in the flow.

The Account_Staging table is created in Snowflake to act as temporary storage that can be used to identify the data change events. Then the permanent table (Account) is updated from the staging table by running a SQL stored procedure that contains the incremental update logic. The following figure depicts the various components of the architecture and the data flow from the source to the target.

The data flow contains the following steps:

  1. First, the flow is run with on-demand and full transfer mode to load the full data into Snowflake.
  2. The Amazon AppFlow Salesforce connector pulls the data from Salesforce and stores it in the Account Data S3 bucket in CSV format.
  3. The Amazon AppFlow Snowflake connector loads the data into the Account_Staging table.
  4. A scheduled task, running at regular intervals in Snowflake, triggers a stored procedure.
  5. The stored procedure starts an atomic transaction that loads the data into the Account table and then deletes the data from the Account_Staging table.
  6. After the initial data is loaded, you update the flow to capture incremental updates from Salesforce. The flow trigger configuration is changed to scheduled, to capture data changes in Salesforce. This enables Snowflake to get all updates, deletes, and inserts in Salesforce at configured intervals.
  7. The flow uses the configured LastModifiedDate field to determine incremental changes.
  8. Steps 3, 4, and 5 are run again to load the incremental updates into the Snowflake Accounts table.

Prerequisites

To get started, you need the following prerequisites:

  • A Salesforce user account with sufficient privileges to install connected apps. Amazon AppFlow uses a connected app to communicate with Salesforce APIs. If you don’t have a Salesforce account, you can sign up for a developer account.
  • A Snowflake account with sufficient permissions to create and configure the integration, external stage, table, stored procedures, and tasks.
  • An AWS account with access to AWS Identity and Access Management (IAM), Amazon AppFlow, and Amazon S3.

Set up Snowflake configuration and Amazon S3 data

Complete the following steps to configure Snowflake and set up your data in Amazon S3:

  1. Create two S3 buckets in your AWS account: one for holding the data coming from Salesforce, and another for holding error records.

A best practice when creating your S3 bucket is to make sure you block public access to the bucket to ensure your data is not accessible by unauthorized users.

  2. Create an IAM policy named snowflake-access that allows listing the bucket contents and reading S3 objects inside the bucket.

Follow the instructions for steps 1 and 2 in Configuring a Snowflake Storage Integration to Access Amazon S3 to create an IAM policy and role. Replace the placeholders with your S3 bucket names.

  3. Log in to your Snowflake account and create a new warehouse called SALESFORCE and database called SALESTEST.
  4. Specify the format in which data will be available in Amazon S3 for Snowflake to load (for this post, CSV):
USE DATABASE SALESTEST;
CREATE or REPLACE file format my_csv_format
type = csv
field_delimiter = ','
skip_header = 1
null_if = ('NULL', 'null')
empty_field_as_null = true
compression = gzip;
  5. Amazon AppFlow uses the Snowflake COPY command to move data using an S3 bucket. To configure this integration, follow steps 3–6 in Configuring a Snowflake Storage Integration to Access Amazon S3.

These steps create a storage integration with your S3 bucket, update IAM roles with Snowflake account and user details, and create an external stage.
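For orientation, the following is a condensed sketch of what those documentation steps produce. The integration and stage names (s3_int, my_s3_stage), the role ARN, and the bucket placeholders are ours; substitute your own values and follow the official steps for the IAM trust policy updates.

CREATE OR REPLACE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<snowflake-access-role>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<account-data-bucket>/');

-- Copy STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID from this output
-- into the trust policy of the IAM role, as described in the Snowflake documentation.
DESC INTEGRATION s3_int;

CREATE OR REPLACE STAGE SALESTEST.PUBLIC.my_s3_stage
  STORAGE_INTEGRATION = s3_int
  URL = 's3://<account-data-bucket>/'
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_format');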

This completes the setup in Snowflake. In the next section, you create the required objects in Snowflake.

Create schemas and procedures in Snowflake

In your Snowflake account, complete the following steps to create the tables, stored procedures, and tasks for implementing the use case:

  1. In your Snowflake account, open a worksheet and run the following DDL scripts to create the Account and Account_staging tables:
CREATE or REPLACE TABLE ACCOUNT_STAGING (
ACCOUNT_NUMBER STRING NOT NULL,
ACCOUNT_NAME STRING,
ACCOUNT_TYPE STRING,
ANNUAL_REVENUE NUMBER,
ACTIVE BOOLEAN NOT NULL,
DELETED BOOLEAN,
LAST_MODIFIED_DATE STRING,
primary key (ACCOUNT_NUMBER)
);

CREATE or REPLACE TABLE ACCOUNT (
ACCOUNT_NUMBER STRING NOT NULL,
ACCOUNT_NAME STRING,
ACCOUNT_TYPE STRING,
ANNUAL_REVENUE NUMBER,
ACTIVE BOOLEAN NOT NULL,
LAST_MODIFIED_DATE STRING,
primary key (ACCOUNT_NUMBER)
);
  2. Create a stored procedure in Snowflake to load data from staging to the Account table:
CREATE or REPLACE procedure sp_account_load( )
returns varchar not null
language sql
as
$$
begin
Begin transaction;
merge into ACCOUNT using ACCOUNT_STAGING
on ACCOUNT.ACCOUNT_NUMBER = ACCOUNT_STAGING.ACCOUNT_NUMBER
when matched AND ACCOUNT_STAGING.DELETED=TRUE then delete
when matched then UPDATE SET
ACCOUNT.ACCOUNT_NAME = ACCOUNT_STAGING.ACCOUNT_NAME,
ACCOUNT.ACCOUNT_TYPE = ACCOUNT_STAGING.ACCOUNT_TYPE,
ACCOUNT.ANNUAL_REVENUE = ACCOUNT_STAGING.ANNUAL_REVENUE,
ACCOUNT.ACTIVE = ACCOUNT_STAGING.ACTIVE,
ACCOUNT.LAST_MODIFIED_DATE = ACCOUNT_STAGING.LAST_MODIFIED_DATE
when NOT matched then
INSERT (
ACCOUNT.ACCOUNT_NUMBER,
ACCOUNT.ACCOUNT_NAME,
ACCOUNT.ACCOUNT_TYPE,
ACCOUNT.ANNUAL_REVENUE,
ACCOUNT.ACTIVE,
ACCOUNT.LAST_MODIFIED_DATE
)
values(
ACCOUNT_STAGING.ACCOUNT_NUMBER,
ACCOUNT_STAGING.ACCOUNT_NAME,
ACCOUNT_STAGING.ACCOUNT_TYPE,
ACCOUNT_STAGING.ANNUAL_REVENUE,
ACCOUNT_STAGING.ACTIVE,
ACCOUNT_STAGING.LAST_MODIFIED_DATE
) ;

Delete from ACCOUNT_STAGING;
Commit;
end;
$$
;

This stored procedure determines whether the data contains new records that need to be inserted or existing records that need to be updated or deleted. After a successful run, the stored procedure clears any data from your staging table.
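If you want to validate the merge logic before scheduling it, you can run the procedure manually once (this is an optional check, not part of the required setup):

CALL sp_account_load();
SELECT COUNT(*) FROM ACCOUNT;          -- should now contain the merged rows
SELECT COUNT(*) FROM ACCOUNT_STAGING;  -- should be empty after a successful run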

  3. Create a task in Snowflake to trigger the stored procedure. Make sure that the time interval for this task is more than the time interval configured in Amazon AppFlow for pulling the incremental changes from Salesforce. The time interval should be sufficient for data to be processed.
CREATE OR REPLACE TASK TASK_ACCOUNT_LOAD
WAREHOUSE = SALESFORCE
SCHEDULE = 'USING CRON 5 * * * * America/Los_Angeles'
AS
call sp_account_load();
  4. Provide the required permissions to run and resume the task:
show tasks;
  • A task is created in a suspended state, so you need to resume it manually the first time:
ALTER TASK TASK_ACCOUNT_LOAD RESUME;
  • If the role assigned to you doesn’t have the privileges to resume or run tasks, grant the required task privileges to that role:
GRANT EXECUTE TASK, EXECUTE MANAGED TASK ON ACCOUNT TO ROLE SYSADMIN;

This completes the Snowflake part of configuration and setup.

Create a Salesforce connection

First, let’s create a Salesforce connection that can be used by AppFlow to authenticate and pull records from your Salesforce instance. On the AWS console, make sure you are in the same Region where your Snowflake instance is running.

  1. On the Amazon AppFlow console, choose Connections in the navigation pane.
  2. From the list of connectors, select Salesforce.
  3. Choose Create connection.
  4. For Connection name, enter a name of your choice (for example, Salesforce-blog).
  5. Leave the rest of the fields as default and choose Continue.
  6. You’re redirected to a sign-in page, where you need to log in to your Salesforce instance.
  7. After you allow Amazon AppFlow access to your Salesforce account, your connection is successfully created.

Create a Snowflake connection

Complete the following steps to create your Snowflake connection:

  1. On the Connections menu, choose Snowflake.
  2. Choose Create connection.
  3. Provide information for the Warehouse, Stage name, and Bucket details fields.
  4. Enter your credential details.

  5. For Region, choose the same Region where Snowflake is running.
  6. For Connection name, name your connection Snowflake-blog.
  7. Leave the rest of the fields as default and choose Connect.

Create a flow in Amazon AppFlow

Now you create a flow in Amazon AppFlow to load the data from Salesforce to Snowflake. Complete the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. Choose Create flow.
  3. On the Specify flow details page, enter a name for the flow (for example, AccountData-SalesforceToSnowflake).
  4. Optionally, provide a description for the flow and tags.
  5. Choose Next.

  6. On the Configure flow page, for Source name, choose Salesforce.
  7. Choose the Salesforce connection we created in the previous step (Salesforce-blog).
  8. For Choose Salesforce object, choose Account.
  9. For Destination name, choose Snowflake.
  10. Choose the newly created Snowflake connection.
  11. For Choose Snowflake object, choose the staging table you created earlier (SALESTEST.PUBLIC.ACCOUNT_STAGING).

  12. In the Error handling section, provide your error S3 bucket.
  13. For Choose how to trigger the flow, select Run on demand.
  14. Choose Next.

  15. Select Manually map fields to map the fields between your source and destination.
  16. Choose the fields Account Number, Account Name, Account Type, Annual Revenue, Active, Deleted, and Last Modified Date.

  17. Map each source field to its corresponding destination field.
  18. Under Additional settings, leave the Import deleted records unchecked (default setting).

  19. In the Validations section, add validations for the data you’re pulling from Salesforce.

Because the schema of the Account_Staging table in the Snowflake database has a NOT NULL constraint on the fields Account_Number and Active, records containing a null value for these fields should be ignored.

  20. Choose Add Validation to configure validations for these fields.
  21. Choose Next.

  22. Leave everything else as default, proceed to the final page, and choose Create Flow.
  23. After the flow is created, choose Run flow.

When the flow run completes successfully, it will bring all records into your Snowflake staging table.

Verify data in Snowflake

The data will be loaded into the Account_staging table. To verify that data is loaded in Snowflake, complete the following steps:

  1. Validate the number of records by querying the ACCOUNT_STAGING table in Snowflake (example queries follow this list).
  2. Wait for your Snowflake task to run based on the configured schedule.
  3. Verify that all the data is transferred to the ACCOUNT table and the ACCOUNT_STAGING table is truncated.
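The following example queries can be used for these checks; the TASK_HISTORY call assumes your role can query the INFORMATION_SCHEMA of the SALESTEST database.

SELECT COUNT(*) FROM ACCOUNT_STAGING;   -- populated right after the flow run
SELECT COUNT(*) FROM ACCOUNT;           -- populated after TASK_ACCOUNT_LOAD has run

-- Recent runs of the scheduled task.
SELECT name, state, scheduled_time, completed_time
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'TASK_ACCOUNT_LOAD'))
ORDER BY scheduled_time DESC
LIMIT 5;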

Configure an incremental data load from Salesforce

Now let’s configure an incremental data load from Salesforce:

  1. On the Amazon AppFlow console, select your flow, and choose Edit.
  2. Go to the Edit configuration step and change to Run flow on schedule.
  3. Set the flow to run every 5 minutes, and provide a start date of Today, with a start time in the future.
  4. Choose Incremental transfer and choose the LastModifiedDate field.
  5. Choose Next.
  6. In the Additional settings section, select Import deleted records.

This ensures that deleted records from the source are also ingested.

  7. Choose Save and then choose Activate flow.

Now your flow is configured to capture all incremental changes.

Test the solution

Log in to your Salesforce account, and edit any record in the Account object.

Within 5 minutes or less, a scheduled flow will pick up your change and write the changed record into your Snowflake staging table and trigger the synchronization process.

You can see the details of the run, including number of records transferred, on the Run History tab of your flow.

Clean up

Clean up the resources in your AWS account by completing the following steps:

  1. On the Amazon AppFlow console, choose Flows in the navigation pane.
  2. From the list of flows, select the flow AccountData-SalesforceToSnowflake and delete it.
  3. Enter delete to delete the flow.
  4. Choose Connections in the navigation pane.
  5. Choose Salesforce from the list of connectors, select Salesforce-blog, and delete it.
  6. Enter delete to delete the connector.
  7. On the Connections page, choose Snowflake from the list of connectors, select Snowflake-blog, and delete it.
  8. Enter delete to delete the connector.
  9. On the IAM console, choose Roles in the navigation pane, then select the role you created for Snowflake and delete it.
  10. Choose Policies in the navigation pane, select the policy you created for Snowflake, and delete it.
  11. On the Amazon S3 console, search for the data bucket you created, choose Empty to delete the objects, then delete the bucket.
  12. Search for the error bucket you created, choose Empty to delete the objects, then delete the bucket.
  13. Clean up resources in your Snowflake account:
  • Delete the task TASK_ACCOUNT_LOAD:
ALTER TASK TASK_ACCOUNT_LOAD SUSPEND;
DROP TASK TASK_ACCOUNT_LOAD;
  • Delete the stored procedure sp_account_load:
DROP procedure sp_account_load();
  • Delete the tables ACCOUNT_STAGING and ACCOUNT:
DROP TABLE ACCOUNT_STAGING;
DROP TABLE ACCOUNT;

Conclusion

In this post, we walked you through how to integrate and synchronize your data from Salesforce to Snowflake using Amazon AppFlow. This demonstrates how you can set up your ETL jobs without having to learn new programming languages, by using Amazon AppFlow and your familiar SQL language. This post is a proof of concept; you can extend it to handle edge cases such as Snowflake task failures, or explore how incremental transfer behaves by making multiple changes to a Salesforce record within the scheduled time interval.

For more information on Amazon AppFlow, visit Amazon AppFlow.


About the authors

Ramesh Ranganathan is a Senior Partner Solution Architect at AWS. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, application modernization and cloud native development. He is passionate about technology and enjoys experimenting with AWS Serverless services.

Kamen Sharlandjiev is an Analytics Specialist Solutions Architect and Amazon AppFlow expert. He’s on a mission to make life easier for customers who are facing complex data integration challenges. His secret weapon? Fully managed, low-code AWS services that can get the job done with minimal effort and no coding.

Amit Shah is an expert in cloud-based modern data architecture and currently leads the AWS Data Analytics practice at Atos. Based in Pune, India, he has over 20 years of experience in data strategy, architecture, design, and development. He is on a mission to help organizations become data-driven.