
Amazon EMR on EKS widens the performance gap: Run Apache Spark workloads 5.37 times faster and at 4.3 times lower cost

Post Syndicated from Melody Yang original https://aws.amazon.com/blogs/big-data/amazon-emr-on-eks-widens-the-performance-gap-run-apache-spark-workloads-5-37-times-faster-and-at-4-3-times-lower-cost/

Amazon EMR on EKS provides a deployment option for Amazon EMR that allows organizations to run open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With EMR on EKS, Spark applications run on the Amazon EMR runtime for Apache Spark. This performance-optimized runtime offered by Amazon EMR makes your Spark jobs run fast and cost-effectively. Also, you can run other types of business applications, such as web applications and machine learning (ML) TensorFlow workloads, on the same EKS cluster. EMR on EKS simplifies your infrastructure management, maximizes resource utilization, and reduces your cost.

We have been continually improving the Spark performance in each Amazon EMR release to further shorten job runtime and optimize users’ spending on their Amazon EMR big data workloads. As of the Amazon EMR 6.5 release in January 2022, the optimized Spark runtime was 3.5 times faster than OSS Spark v3.1.2 with up to 61% lower costs. Amazon EMR 6.10 is now 1.59 times faster than Amazon EMR 6.5, which has resulted in 5.37 times better performance than OSS Spark v3.3.1 with 76.8% cost savings.

In this post, we describe the benchmark setup and results on top of the EMR on EKS environment. We also share a Spark benchmark solution that suits all Amazon EMR deployment options, so you can replicate the process in your environment for your own performance test cases. The solution uses the TPC-DS dataset and unmodified data schema and table relationships, but derives queries from TPC-DS to support the SparkSQL test cases. It is not comparable to other published TPC-DS benchmark results.

Benchmark setup

To compare with the EMR on EKS 6.5 test result detailed in the post Amazon EMR on Amazon EKS provides up to 61% lower costs and up to 68% performance improvement for Spark workloads, this benchmark for the latest release (Amazon EMR 6.10) uses the same approach: a TPC-DS benchmark framework and the same size of TPC-DS input dataset from an Amazon Simple Storage Service (Amazon S3) location. For the source data, we chose the 3 TB scale factor, which contains 17.7 billion records, approximately 924 GB compressed data in Parquet file format. The setup instructions and technical details can be found in the aws-sample repository.

In summary, the entire performance test job includes 104 SparkSQL queries and was completed in approximately 23 minutes (1,397.55 seconds) with an estimated running cost of $5.08 USD. The input data and test result outputs were both stored on Amazon S3.

The job was configured with the following parameters, which match the previous Amazon EMR 6.5 test (a sketch of submitting a job with these settings via the EMR on EKS StartJobRun API follows the list):

  • EMR release – EMR 6.10.0
  • Hardware:
    • Compute – 6 X c5d.9xlarge instances, 216 vCPU, 432 GiB memory in total
    • Storage – 6 x 900 GB NVMe SSD built-in storage
    • Amazon EBS root volume – 6 x 20 GB gp2
  • Spark configuration:
    • Driver pod – 1 instance, colocated with 7 executor pods on a shared Amazon Elastic Compute Cloud (Amazon EC2) node:
      • spark.driver.cores=4
      • spark.driver.memory=5g
      • spark.kubernetes.driver.limit.cores=4.1
    • Executor pod – 47 instances distributed over 6 EC2 nodes
      • spark.executor.cores=4
      • spark.executor.memory=6g
      • spark.executor.memoryOverhead=2G
      • spark.kubernetes.executor.limit.cores=4.3
  • Metadata store – We use Spark’s in-memory data catalog to store metadata for TPC-DS databases and tables (spark.sql.catalogImplementation is set to the default value in-memory). The fact tables are partitioned by the date column, with the number of partitions per table ranging from 200 to 2,100. No statistics are pre-calculated for these tables.
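
For reference, a job with settings like these could be submitted through the EMR on EKS StartJobRun API. The following is a minimal boto3 sketch rather than the exact benchmark harness; the virtual cluster ID, execution role, entry point script, bucket, and Region are placeholders, and the release label shown follows the EMR 6.10 label format documented at the time of writing.

import boto3

emr_containers = boto3.client("emr-containers", region_name="us-east-1")  # placeholder Region

# Spark properties mirroring the benchmark configuration listed above.
spark_submit_parameters = (
    "--conf spark.driver.cores=4 "
    "--conf spark.driver.memory=5g "
    "--conf spark.kubernetes.driver.limit.cores=4.1 "
    "--conf spark.executor.cores=4 "
    "--conf spark.executor.memory=6g "
    "--conf spark.executor.memoryOverhead=2G "
    "--conf spark.kubernetes.executor.limit.cores=4.3 "
    "--conf spark.executor.instances=47"
)

response = emr_containers.start_job_run(
    name="tpcds-3tb-benchmark",                        # hypothetical job name
    virtualClusterId="<YOUR_VIRTUAL_CLUSTER_ID>",      # placeholder
    executionRoleArn="<YOUR_JOB_EXECUTION_ROLE_ARN>",  # placeholder
    releaseLabel="emr-6.10.0-latest",
    jobDriver={
        "sparkSubmitJobDriver": {
            "entryPoint": "s3://<YOUR_BUCKET>/scripts/tpcds-benchmark.py",  # placeholder
            "sparkSubmitParameters": spark_submit_parameters,
        }
    },
)
print(response["id"])  # job run ID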

Results

A single test session consists of 104 Spark SQL queries that were run sequentially. We ran each Spark runtime session (EMR runtime for Apache Spark, OSS Apache Spark) three times. The Spark benchmark job produces a CSV file to Amazon S3 that summarizes the median, minimum, and maximum runtime for each individual query.

The final benchmark results (geomean and total job runtime) are calculated as follows. For each query, we take the arithmetic mean of its median, minimum, and maximum runtimes using AVERAGE(), for example AVERAGE(F2:H2). We then take the geometric mean of the resulting average column I with GEOMEAN(I2:I105), and SUM(I2:I105) gives the total job runtime.
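
As an illustration of that arithmetic, the following is a minimal Python sketch (Python 3.8+ for statistics.geometric_mean) that reproduces the calculation from a list of per-query (median, minimum, maximum) runtimes; the sample numbers are made up.

from statistics import geometric_mean, mean

# Hypothetical per-query runtimes in seconds: (median, min, max) for each of the 104 queries.
per_query_runtimes = [
    (12.4, 11.9, 13.1),
    (3.2, 3.0, 3.5),
    # ... one tuple per query ...
]

# Column I in the spreadsheet: AVERAGE(F2:H2) per query.
per_query_avg = [mean(runs) for runs in per_query_runtimes]

# GEOMEAN(I2:I105) and SUM(I2:I105).
print(f"geomean = {geometric_mean(per_query_avg):.2f} s")
print(f"total runtime = {sum(per_query_avg):.2f} s")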

Previously, we observed that EMR on EKS 6.5 is 3.5 times faster than OSS Spark on EKS, and costs 2.6 times less. From this benchmark, we found that the gap has widened: EMR on EKS 6.10 now provides a 5.37 times performance improvement on average and up to 11.61 times improved performance for individual queries over OSS Spark 3.3.1 on Amazon EKS. From a running cost perspective, we see a significant reduction of 4.3 times.

The following graph shows the performance improvement of Amazon EMR 6.10 compared to OSS Spark 3.3.1 at the individual query level. The X-axis shows the name of each query, and the Y-axis shows the total runtime in seconds on a logarithmic scale. Eight queries (q14a, q14b, q23b, q24a, q24b, q4, q67, q72) showed the most significant gains, each running more than 10 times faster.

Job cost estimation

The cost estimate doesn’t account for Amazon S3 storage or PUT and GET requests. The Amazon EMR on EKS uplift calculation is based on the hourly billing information provided by AWS Cost Explorer; the remaining components are derived from the following parameters (a sketch of the arithmetic follows the table).

  • c5d.9xlarge hourly price – $1.728
  • Number of EC2 instances – 6
  • Amazon EBS storage per GB-month – $0.10
  • Amazon EBS gp2 root volume – 20GB
  • Job run time (hour)
    • OSS Spark 3.3.1 – 2.09
    • EMR on EKS 6.5.0 – 0.68
    • EMR on EKS 6.10.0 – 0.39
Cost component           OSS Spark 3.3.1 on EKS   EMR on EKS 6.5.0   EMR on EKS 6.10.0
Amazon EC2               $21.67                   $7.05              $4.04
EMR on EKS               –                        $1.57              $0.99
Amazon EKS               $0.21                    $0.07              $0.04
Amazon EBS root volume   $0.03                    $0.01              $0.01
Total                    $21.88                   $8.70              $5.08
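
To show how most of these line items are derived, the following is a minimal Python sketch of the arithmetic for the OSS Spark 3.3.1 on EKS column. The $0.10 per hour Amazon EKS cluster fee and the 730-hour month used to pro-rate the EBS charge are assumptions based on public pricing at the time of writing; the EMR on EKS uplift is taken from AWS Cost Explorer rather than computed, and small rounding differences against the table are expected.

EC2_HOURLY = 1.728          # c5d.9xlarge On-Demand price per hour (from the list above)
NUM_INSTANCES = 6
EBS_GB_MONTH = 0.10         # gp2 price per GB-month
EBS_GB_PER_NODE = 20
EKS_CLUSTER_HOURLY = 0.10   # assumed Amazon EKS cluster fee per hour
HOURS_PER_MONTH = 730       # assumed conversion for pro-rating the EBS charge

job_hours = 2.09            # OSS Spark 3.3.1 runtime from the list above

ec2_cost = EC2_HOURLY * NUM_INSTANCES * job_hours
eks_cost = EKS_CLUSTER_HOURLY * job_hours
ebs_cost = EBS_GB_MONTH * EBS_GB_PER_NODE * NUM_INSTANCES * job_hours / HOURS_PER_MONTH

print(f"EC2 ${ec2_cost:.2f}")   # ~ $21.67
print(f"EKS ${eks_cost:.2f}")   # ~ $0.21
print(f"EBS ${ebs_cost:.2f}")   # ~ $0.03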

Performance enhancements

Although we improve on Amazon EMR’s performance with each release, Amazon EMR 6.10 contained many performance optimizations, making it 5.37 times faster than OSS Spark v3.3.1 and 1.59 times faster than our first release of 2022, Amazon EMR 6.5. This additional performance boost was achieved through the addition of multiple optimizations, including:

  • Enhancements to join performance, such as the following:
    • Shuffle-Hash Joins (SHJ) are more CPU and I/O efficient than Shuffle-Sort-Merge Joins (SMJ) when the costs of building and probing the hash table, including the availability of memory, are less than the cost of sorting and performing the merge join. However, SHJs have drawbacks, such as risk of out of memory errors due to their inability to spill to disk, which prevents them from being aggressively used across Spark in place of SMJs by default. We have optimized our use of SHJs so that they can be applied to more queries by default than in OSS Spark.
    • For some query shapes, we have eliminated redundant joins and enabled the use of more performant join types.
  • We have reduced the amount of data shuffled before joins and the potential for data explosions after joins by selectively pushing down aggregates through joins.
  • Bloom filters can improve performance by reducing the amount of data shuffled before the join. However, there are cases where bloom filters are not beneficial and can even regress performance. For example, the bloom filter introduces a dependency between stages that reduces query parallelism, but may end up filtering out relatively little data. Our enhancements allow bloom filters to be safely applied to more query plans than OSS Spark.
  • Aggregates with high-precision decimals are computationally intensive in OSS Spark. We optimized high-precision decimal computations to increase their performance.

Summary

With version 6.10, Amazon EMR has further enhanced the EMR runtime for Apache Spark in comparison to our previous benchmark tests for Amazon EMR version 6.5. When running EMR workloads with the equivalent Apache Spark version 3.3.1, we observed 1.59 times better performance with 41.6% cheaper costs than Amazon EMR 6.5.

With our TPC-DS benchmark setup, we observed a significant performance increase of 5.37 times and a cost reduction of 4.3 times using EMR on EKS compared to OSS Spark.

To learn more and get started with EMR on EKS, try out the EMR on EKS Workshop and visit the EMR on EKS Best Practices Guide page.


About the Authors

Melody Yang is a Senior Big Data Solution Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interest are open-source frameworks and automation, data engineering, and DataOps.

Ashok Chintalapati is a software development engineer for Amazon EMR at Amazon Web Services.

Connect to Amazon MSK Serverless from your on-premises network

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/connect-to-amazon-msk-serverless-from-your-on-premises-network/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed, highly available, and secure Apache Kafka service. Amazon MSK reduces the work needed to set up, scale, and manage Apache Kafka in production. With Amazon MSK, you can create a cluster in minutes and start sending data.

With Amazon MSK Serverless, you can run Apache Kafka without having to manage the underlying infrastructure. Amazon MSK will automatically provision, scale, and manage your Apache Kafka clusters, so you can focus on your applications without worrying about the operational overhead. Additionally, MSK Serverless offers fine-grained, pay-as-you-go pricing, making it a cost-effective option for organizations with unpredictable workloads.

Connecting to MSK Serverless is easy. You can set up a serverless cluster using the API or AWS Management Console in minutes. MSK Serverless provides bootstrap information as a private DNS endpoint, allowing clients to connect to the serverless Apache Kafka cluster. A common use case of using MSK Serverless is an on-premises client that needs to process real-time data streams. However, the private DNS endpoint is only accessible from virtual private clouds (VPCs) that have been configured to connect and isn’t directly resolvable from an on-premises network. This can pose a challenge for on-premises clients to discover and connect to the MSK Serverless cluster.
In this post, we guide you through a step-by-step process to connect your on-premises client to MSK Serverless, overcoming this challenge.

Solution overview

The following diagram illustrates the solution architecture.

The flow of the solution is as follows:

  1. The DNS query for your MSK endpoint is routed to a locally configured on-premises DNS server.
  2. The on-premises DNS server, as configured, performs conditional forwarding for kafka-serverless.REPLACE-MSK-SERVERLESS-REGION.amazonaws.com to an Amazon Route 53 inbound resolver endpoint IP address.
  3. The inbound resolver endpoint performs DNS resolution by forwarding the query to the private hosted zone that was created along with the MSK Serverless cluster.
  4. The IP addresses returned by the DNS query are the private IP addresses of the interface VPC endpoint, which allow your on-premises host to establish private connectivity over AWS VPN or AWS Direct Connect.
  5. The interface endpoint is a collection of one or more elastic network interfaces with a private IP address in your account that serves as an entry point for traffic destined for the MSK Serverless service.

Note that at this time, this solution works only for MSK Serverless clusters with a single VPC.

Prerequisites

In this section, we discuss the prerequisite steps to complete in order to implement this solution.

Establish network connectivity between on premises and the AWS Cloud

To use MSK Serverless from your on-premises network, you need to establish a network connection between your on-premises environment and the VPC that you have set up for MSK Serverless. Various secure methods are available to connect your on-premises network to the AWS Cloud. Refer to Network-to-Amazon VPC connectivity options for more information.

Create a security group for allowing inbound TCP/UDP connections from your on-premises network

Create a security group with the following configurations on the same VPC that you configured for MSK Serverless:

Inbound rule:

  • Source: [On-premises CIDR range]
  • Protocol: TCP/UDP
  • Port Range: 53

Outbound rule: Leave it to default

For more information, refer to Work with security groups.
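
If you prefer to script this step, the following is a minimal boto3 sketch that creates the security group and its inbound rule; the VPC ID and on-premises CIDR range are placeholders.

import boto3

ec2 = boto3.client("ec2")

# Create the security group in the VPC used by MSK Serverless (placeholder VPC ID).
sg = ec2.create_security_group(
    GroupName="resolver-inbound-sg",
    Description="Allow DNS queries from the on-premises network",
    VpcId="vpc-0123456789abcdef0",
)

# Allow TCP and UDP on port 53 from the on-premises CIDR range (placeholder CIDR).
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {"IpProtocol": proto, "FromPort": 53, "ToPort": 53,
         "IpRanges": [{"CidrIp": "10.100.0.0/16"}]}
        for proto in ("tcp", "udp")
    ],
)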

Update the MSK security group for inbound connections from your on-premises network

To ensure that your MSK Serverless cluster can be accessed from your on-premises network, you need to adjust the cluster’s security group settings to allow incoming traffic from your network on TCP port 9098. Complete the following steps:

  1. On the Amazon MSK console, choose Clusters in the navigation pane.
  2. Navigate to your serverless MSK cluster’s properties.
  3. Choose the security group associated with your MSK cluster.

Because MSK Serverless supports configuring multiple VPCs, make sure to choose the security group associated with the VPC that you configured for connecting from your on-premises network.

  4. To enable connections from your on-premises CIDR block to MSK Serverless, add an inbound rule that allows traffic on TCP port 9098 from your on-premises CIDR.

This ensures that your on-premises network can communicate with MSK Serverless on the specified port.

Configure a Route 53 inbound resolver endpoint

MSK Serverless provides a DNS endpoint that serves as the starting point for an Apache Kafka client to connect to the cluster. However, this endpoint isn’t publicly discoverable and can only be accessed from within the configured VPC. To resolve the serverless DNS endpoint outside of your VPC, you can set up a Route 53 resolver endpoint. This allows you to access the endpoint securely by creating a hybrid cloud setup over VPN or Direct Connect.

To configure the Route 53 resolver using the console, complete the following steps (a boto3 sketch of the same endpoint creation follows the steps):

  1. On the Route 53 console, under Resolver in the navigation pane, choose Inbound endpoints.
  2. Choose Create inbound endpoint.
  3. For Endpoint name, enter the endpoint name.
  4. For VPC in the Region, choose the VPC where you configured MSK Serverless.
  5. For Security group for this endpoint, choose the security group that you created as a prerequisite for inbound TCP/UDP connections.

The security group of the inbound resolver endpoint should allow traffic from the on-premises DNS server IP address on TCP/UDP port 53.

In the next step, you add your IP addresses, ensuring that the number of IP addresses matches the number of subnets in your MSK cluster.

  6. Choose the Availability Zones and subnets that are the same as your MSK Serverless network configuration.
  7. Select Use an IP address that is selected automatically.
  8. Choose Create inbound endpoint.
  9. Copy the inbound endpoint IP addresses.
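
The same inbound endpoint can also be created with the AWS SDK. The following is a minimal boto3 sketch; it assumes the security group and two subnets from your MSK Serverless VPC are already known, and all IDs shown are placeholders.

import uuid

import boto3

resolver = boto3.client("route53resolver")

response = resolver.create_resolver_endpoint(
    CreatorRequestId=str(uuid.uuid4()),               # idempotency token
    Name="msk-serverless-inbound",
    Direction="INBOUND",
    SecurityGroupIds=["sg-0123456789abcdef0"],        # placeholder security group
    IpAddresses=[                                     # one entry per subnet/Availability Zone
        {"SubnetId": "subnet-0aaaaaaaaaaaaaaaa"},     # placeholder
        {"SubnetId": "subnet-0bbbbbbbbbbbbbbbb"},     # placeholder
    ],
)
print(response["ResolverEndpoint"]["Id"])

# The assigned IP addresses can then be retrieved with
# resolver.list_resolver_endpoint_ip_addresses(ResolverEndpointId=...).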

Configure the on-premises DNS server

In this example, we use a Microsoft DNS server. To configure a conditional forwarder, complete the following steps:

  1. Open DNS Manager.
  2. Run the following command in the Run command window:
dnsmgmt.msc
  3. Choose (right-click) Conditional Forwarders under the server of your choosing, then choose New Conditional Forwarder.


In the next step, you enter kafka-serverless.REPLACE-MSK-SERVERLESS-REGION.amazonaws.com as the conditional forwarder domain, using the IP addresses of the Route 53 inbound resolver endpoints that you created earlier. You can find the MSK endpoint information by accessing the cluster’s client information. To learn more about getting client information, refer to Getting the bootstrap brokers for an Amazon MSK cluster.

  4. For DNS Domain, enter the MSK Serverless domain for your Region, for example kafka-serverless.ap-southeast-2.amazonaws.com. Do not enter the entire bootstrap endpoint name.
  5. Choose OK.

Test the DNS resolution

DNS (Domain Name System) uses TCP/UDP port 53. To test whether you can connect to any of the Route 53 inbound resolver endpoints, run the following command from your on-premises client:

telnet Route53-INBOUND-ENDPOINT-IP 53

For example: telnet 10.1.0.133 53

The following is a sample output:

Trying 10.1.0.133...
Connected to 10.1.0.133.
Escape character is '^]'.
Connection closed by foreign host.

Run the following command to check whether you can connect with the MSK Serverless endpoint from your on-premises client. To get the MSK Serverless endpoint information, refer to Create an MSK Serverless cluster.

dig MSK-SERVERLESS-ENDPOINT-REMOVE-PORT-NUMBER +short

For example: dig boot-abcdc9.c3.kafka-serverless.ap-southeast-2.amazonaws.com +short

The following is a sample output:

vpce-0bcb06d53aab34111-vt8yzx2b.vpce-svc-05dc791a527abcd.ap-southeast-2.vpce.amazonaws.com.
10.1.1.185
10.1.0.191

If the DNS resolution fails, check your network connectivity from on premises. For more information about troubleshooting connectivity issues, refer to How do I troubleshoot VPN tunnel connectivity to an Amazon VPC or Troubleshooting AWS Direct Connect.

After you create a serverless MSK cluster, the service automatically creates an interface VPC endpoint for the cluster. You can use the dig command as shown above to retrieve the VPC endpoint ID and its associated IP addresses, which confirms that you are now able to connect to the MSK Serverless cluster from your on-premises environment.
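
If you prefer a scripted check from the on-premises client, the following is a minimal sketch that uses the dnspython library (version 2.x or later) to resolve the MSK Serverless bootstrap name against the Route 53 inbound resolver IP addresses; the endpoint name and resolver IPs shown are placeholders based on the examples above.

import dns.resolver  # pip install dnspython

resolver = dns.resolver.Resolver(configure=False)
# Route 53 inbound resolver endpoint IP addresses (placeholders).
resolver.nameservers = ["10.1.0.133", "10.1.1.4"]

# MSK Serverless bootstrap name without the port number (placeholder).
endpoint = "boot-abcdc9.c3.kafka-serverless.ap-southeast-2.amazonaws.com"

answer = resolver.resolve(endpoint, "A")
for record in answer:
    print(record.address)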

Test your Kafka client

Once you complete the configuration of the Route 53 inbound resolver endpoint and on-premises DNS server, you can test your Kafka client from an on-premises network. For instructions, refer to Create a client machine. This documentation guides you through the necessary steps to set up your client machine and verify that it can successfully connect to your MSK cluster from your on-premises network.

Conclusion

MSK Serverless makes it easy for you to manage your data. You don’t have to worry about setting up and running your own Kafka cluster, which saves time and effort. In this post, we explored the option of on-premises connectivity with MSK Serverless and how it can greatly benefit organizations. By establishing this connection, you can gain access to a wide range of real-time analytics use case possibilities and unlock the full potential of your data.

We encourage you to try on-premises connectivity with MSK Serverless.


About the Authors

Masudur Rahaman Sayem is a Streaming Data Architect at AWS. He works with AWS customers globally to design and build data streaming architectures to solve real-world business problems. He specializes in optimizing solutions that use streaming data services and NoSQL. Sayem is very passionate about distributed computing.

Akeef Khan is a Solutions Architect at Amazon Web Services. He helps SMB Greenfield customers adopt the cloud. Whilst being a generalist SA, Akeef is passionate about networking.

How Morningstar used tag-based access controls in AWS Lake Formation to manage permissions for an Amazon Redshift data warehouse

Post Syndicated from Don Drake original https://aws.amazon.com/blogs/big-data/how-morningstar-used-tag-based-access-controls-in-aws-lake-formation-to-manage-permissions-for-an-amazon-redshift-data-warehouse/

This post was co-written by Ashish Prabhu, Stephen Johnston, and Colin Ingarfield at Morningstar and Don Drake, at AWS.

With “Empowering Investor Success” as the core motto, Morningstar aims to provide our investors and advisors with the tools and information they need to make informed investment decisions.

In this post, Morningstar’s Data Lake Team Leads discuss how they utilized tag-based access control in their data lake with AWS Lake Formation and enabled similar controls in Amazon Redshift.

The business challenge

At Morningstar, we built a data lake solution that allows our consumers to easily ingest data, make it accessible via the AWS Glue Data Catalog, and grant access to consumers to query the data via Amazon Athena. In this solution, we were required to ensure that the consumers could only query the data to which they had explicit access. To enforce our access permissions, we chose Lake Formation tag-based access control (TBAC). TBAC helps us categorize the data into a simple, broad level or a complex, more granular level using tags and then grant consumers access to those tags based on what group of data they need. Tag-based entitlements allow us to have a flexible and manageable entitlements system that solves our complex entitlements scenarios.

However, our consumers pushed us for better query performance and enhanced analytical capabilities. We realized we needed a data warehouse to cater to all of these consumer requirements, so we evaluated Amazon Redshift. Amazon Redshift provides us with features that we could use to work with our consumers and enable their analytical requirements:

  • Better performance for consumers’ analytical requirements
  • Ability to tune query performance with user-specified sort keys and distribution keys
  • Ability to have different representations of the same data via views and materialized views
  • Consistent query performance regardless of concurrency

Many new Amazon Redshift features helped solve and scale our analytical query requirements, specifically Amazon Redshift Serverless and Amazon Redshift data sharing.

Because our Lake Formation-enforced data lake is a central data repository for all our data, it makes sense for us to flow the data permissions from the data lake into Amazon Redshift. We utilize AWS Identity and Access Management (IAM) authentication and want to centralize the governance of permissions based on IAM roles and groups. For each AWS Glue database and table, we have a corresponding Amazon Redshift schema and table. Our goal was to ensure customers who have access to AWS Glue tables via Lake Formation also have access to the corresponding tables in Amazon Redshift.

However, we faced a problem with user-based entitlements as we moved to Amazon Redshift.

The entitlements problem

Even though we added Amazon Redshift as part of our overall solution, the entitlement requirements and challenges that came with it remained the same for our users consuming via Lake Formation. At the same time, we had to find a way to implement entitlements in our Amazon Redshift data warehouse with the same set of tags that we had already defined in Lake Formation. Amazon Redshift supports resource-based entitlements but doesn’t support tag-based entitlements. The challenge we had to overcome was how to map our existing tag-based entitlements in Lake Formation to the resource-based entitlements in Amazon Redshift.

The data in the AWS Glue Data Catalog also needed to be loaded into native tables in the Amazon Redshift data warehouse. This was necessary so that the users get a familiar list of schemas and tables that they are accustomed to seeing in the Data Catalog when accessing via Athena. This way, our existing data lake consumers could easily transition to Amazon Redshift.

The following diagram illustrates the structure of the AWS Glue Data Catalog mapped 1:1 with the structure of our Amazon Redshift data warehouse.

Shows mapping of Glue databases and tables to Redshift schemas and tables.

We wanted to utilize the ontology of tags in Lake Formation to also be used on the datasets in Amazon Redshift so that consumers could be granted access to the same datasets in both places. This enabled us to have a single entitlement policy source API that would grant appropriate access to both our Amazon Redshift tables as well as the corresponding Lake Formation tables based on the Lake Formation tag-based policies.

Entitlement Policy Source is used by Lake Formation and Redshift

To solve this problem, we needed to build our own solution to convert the tag-based policies in Lake Formation into grants and revokes in the resource-based entitlements in Amazon Redshift.

Solution overview

To solve this mismatch, we wanted to synchronize our Lake Formation tag ontology and classifications to the Amazon Redshift permission model. To do this, we map Lake Formation tags and grants to Amazon Redshift grants with the following steps:

  1. Map all the resources (databases, schemas, tables, and more) in Lake Formation that are tagged to their equivalent Amazon Redshift tables.
  2. Translate each policy in Lake Formation on a tag expression to a set of Amazon Redshift table grants and revokes.

The net result is that when there is a tag or policy change in Lake Formation, a corresponding set of grants or revokes are made to the equivalent Amazon Redshift tables to keep our entitlements in sync.

Map all tagged resources in Lake Formation to Amazon Redshift equivalents

The tag-based access control of Lake Formation allowed us to apply multiple tags on a single resource (database and table) in the AWS Glue Data Catalog. Visualized as a mapping, multiple tags on a single table are flattened into individual entitlements on the corresponding Amazon Redshift tables.

Mapping of tags in Lake Formation to Redshift tables

Translate tags to Amazon Redshift grants and revokes

To migrate the tag-based policies enforced in Lake Formation, the permissions can be converted into simple grants and revokes applied at a per-group level.

There are two fundamental parts to a tag policy: the principal_id and the tag expression (for example, “Access Level” = “Public”). Assuming that we have an Amazon Redshift database group for each principal_id, the resources that match the tag expression can be permissioned accordingly. We plan on migrating from database groups to database roles in a future implementation.
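
As a simplified illustration of this idea (not Morningstar’s actual mapper service), the following hypothetical Python sketch takes a tag policy and a mapping of Redshift tables to their Lake Formation tags, and emits the GRANT statements for the database group that corresponds to the principal; all table names, tags, and group names are made up.

# Hypothetical inputs: which tags each Redshift table carries, and one tag policy.
table_tags = {
    "equity.daily_prices":  {"Access Level": "Public"},
    "equity.analyst_notes": {"Access Level": "Restricted"},
    "funds.holdings":       {"Access Level": "Public"},
}

policy = {
    "principal_id": "analytics_readers",            # mapped to a Redshift database group
    "tag_expression": {"Access Level": "Public"},   # simplified single-key expression
}


def matches(tags: dict, expression: dict) -> bool:
    """True if the table's tags satisfy every key/value in the tag expression."""
    return all(tags.get(key) == value for key, value in expression.items())


def to_grants(policy: dict, table_tags: dict) -> list:
    group = policy["principal_id"]
    return [
        f'GRANT SELECT ON {table} TO GROUP "{group}";'
        for table, tags in table_tags.items()
        if matches(tags, policy["tag_expression"])
    ]


for statement in to_grants(policy, table_tags):
    print(statement)
# GRANT SELECT ON equity.daily_prices TO GROUP "analytics_readers";
# GRANT SELECT ON funds.holdings TO GROUP "analytics_readers";

A real mapper would also track previously issued grants so that a tag or policy change can be translated into the corresponding revokes, which is what the configuration schema described below stores.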

mapping of tags to Redshift user group

The solution implementation

The implementation of this solution led us to develop two components:

  • The mapper service
  • The Amazon Redshift data configuration

The mapper service can be thought of as a translation service. As the name suggests, it has the core business logic to map the tag and policy information into resource-based grants and revokes in Amazon Redshift. It needs to mimic the behavior of Lake Formation when handling the tag policy translation.

To do this translation, the mapper needs to understand and store the metadata at two levels:

  • Understanding what resource in Amazon Redshift is to be tagged with what value
  • Tracking the grants and revokes already performed so they can be updated with changes in the policy

To do this, we created a config schema in our Amazon Redshift cluster, which currently stores all the configurations.

As part of our implementation, we store the mapped (translated) information in Amazon Redshift. This allows us to incrementally update table grants as Lake Formation tags or policies change. The following diagram illustrates this schema.

schema of configuration stored in Redshift

Business impact and value

The solution we put together has created key business impact and value in the current implementation, and allows us greater flexibility in the future.

It allows us to get the data to our users faster with the tag policies applied in Lake Formation and translated directly to permissions in Amazon Redshift with immediate effect. It also allows us to have consistency in permissions applied in both Lake Formation and Amazon Redshift, based on the effective permissions derived from tag policies. And all this happens via a single source that grants and revokes permissions across the board, instead of managing them separately.

If we translate this into the business impact and business value that we generate, the solution improves the time to market of our data, but at the same time provides consistent entitlements across the business-driven categories that we define as tags.

The solution also opens up opportunities to add more impact as our product scales both horizontally and vertically. There are potential solutions we could implement in terms of automation, users self-servicing their permissions, auditing, dashboards, and more. As our business scales, we expect to take advantage of these capabilities.

Conclusion

In this post, we shared how Morningstar utilized tag-based access control in our data lake with Lake Formation and enabled similar controls in Amazon Redshift. We developed two components that handle mapping of the tag-based access controls to Amazon Redshift permissions. This solution has improved the time to market for our data and provides consistent entitlements across different business-driven categories.

If you have any questions or comments, please leave them in the comments section.


About the Authors

Ashish Prabhu is a Senior Manager of Software Engineering in Morningstar, Inc. He focuses on solutioning and delivering the different aspects of Data Lake and Data Warehouse for Morningstar’s Enterprise Data and Platform Team. In his spare time he enjoys playing basketball, painting, and spending time with his family.

Stephen Johnston is a Distinguished Software Architect at Morningstar, Inc. His focus is on data lake and data warehousing technologies for Morningstar’s Enterprise Data Platform team.

Colin Ingarfield is a Lead Software Engineer at Morningstar, Inc. Based in Austin, Colin focuses on access control and data entitlements on Morningstar’s growing Data Lake platform.

Don Drake is a Senior Analytics Specialist Solutions Architect at AWS. Based in Chicago, Don helps Financial Services customers migrate workloads to AWS.

Patterns for updating Amazon OpenSearch Service index settings and mappings

Post Syndicated from Mikhail Vaynshteyn original https://aws.amazon.com/blogs/big-data/patterns-for-updating-amazon-opensearch-service-index-settings-and-mappings/

Amazon OpenSearch Service is used for a broad set of use cases like real-time application monitoring, log analytics, and website search at scale. As your domain ages and you add additional consumers, you need to reevaluate and change the domain’s configuration to handle additional storage and compute needs. You want to minimize downtime and performance impact as you make these changes.

Customers have been seeking guidance on best practices and patterns for changing index settings without an index maintenance window or affecting overall performance of the OpenSearch Service domain. This is part one of a two-part series, in which we show how to make settings changes to OpenSearch Service indexes with little to no downtime while supporting active producers and consumers of the data.

Indexes in OpenSearch Service

In OpenSearch Service, data must be indexed before it can be queried. Indexing is the method by which search engines organize data for fast retrieval. The resulting structure is called, fittingly, an index. All operations performed on an index are done via index APIs. Also, each index contains index mappings, which define field names and data types in the index. Data producers can add new fields with data types to an index. However, existing field mappings can’t be changed throughout the index lifecycle.

OpenSearch Service indexes have two types of settings that periodically need adjustments as the profile of your workload changes:

  • Dynamic – Settings that can be changed on the index at any time
  • Static – Settings that can only be defined at the index creation time and can’t be changed throughout the index lifecycle

Dynamic index settings can be changed at any time using the update settings API. While the OpenSearch Service domain is performing instructed operations on dynamic index settings, the index doesn’t require downtime. Changes to most dynamic index settings won’t trigger background tasks that affect the overall utilization of domain resources; however, some settings, such as increasing the number of replicas via index.number_of_replicas or index.auto_expand_replicas, can cause a temporary increase in resource utilization while the domain adds replicas, depending on the domain’s configuration. We recommend maintaining at least one replica for redundancy reasons, and multiple replicas for high query throughput use cases.
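
For example, a replica count change is a dynamic setting and can be applied in place with the update settings API. The following is a minimal sketch using the Python requests library; the domain endpoint, index name, credentials, and the assumption of fine-grained access control with basic authentication are placeholders you would adapt to your domain’s access policy.

import requests
from requests.auth import HTTPBasicAuth  # assumes fine-grained access control with basic auth

DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = HTTPBasicAuth("master-user", "master-password")   # placeholder credentials

# Increase the replica count on an existing index; no downtime is required.
response = requests.put(
    f"{DOMAIN}/my-index/_settings",
    json={"index": {"number_of_replicas": 2}},
    auth=AUTH,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # {"acknowledged": true}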

Static index settings such as mapping and shard count are defined at index creation time and can’t be changed throughout the index lifecycle. In this post, we focus on patterns and best practices for working with static index settings, such as changing shard count and patterns for updating index mappings.

All operations and procedures that we cover in this post are issued directly to the OpenSearch REST API or via the Dev Tools in OpenSearch Dashboards.

As with any use case, there is a spectrum of solutions and constraints to be considered. We start with a few simple foundational patterns and build on them, accounting for use cases with more operational constraints and working with large datasets.

Solution overview

OpenSearch Service has a default sharding strategy of 5:1, where each index is divided into five primary shards. Within each index, each primary shard also has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes.

It’s not possible to increase the primary shard number of an existing index, meaning an index must be recreated if you want to increase the primary shard count.

The _reindex operation is ideal for creating destination indexes with updated shards and mapping settings. The _reindex operation is resource intensive. We recommend disabling replicas in your destination index by setting number_of_replicas to 0 and re-enable replicas when the reindex process is complete. If you have your data in a second, durable store, the simplest thing to do is pause updates and reindex from the source. But that’s not always possible. In this post, we share several patterns that enable you to update even static index settings like shard count.

One of the major advantages of using the _reindex operation is that it doesn’t require placing the source index in a read-only mode (data producers may continue to write the data while reindexing is in progress). Also, the _reindex operation enables reprocessing, transformation, and reindexing a subset of documents and even selectively combining documents from multiple indexes. With the _reindex operation, you can copy all or a subset of documents that you select through a query to another index. In its most basic form, the _reindex operation requires you to specify a source and a destination index and configuration parameters.

The following are some of the use cases supported by the reindex API:

  • Reindexing all documents
  • Reindexing from a remote cluster when transferring data between clusters
  • Reindexing a subset of documents that match a search query
  • Combining one or more indexes
  • Transforming documents during reindexing

To increase the shard count, you create a new index, set number_of_shards to your desired primary shard count, set number_of_replicas to 0, update the new index mapping based on your requirement, and run the reindex API operation. After the _reindex operation is complete, we recommend updating number_of_replicas in the destination index settings to achieve your desired level of replica shards.

In the following sections, we provide a walkthrough of the reindex API operation. Note that the patterns and procedures presented in this post have been validated on Amazon OpenSearch Service version 1.3.

Prerequisites

The source of the documents must be stored in the index (the “_source” setting at the index mappings level must be set to “enabled”:true, which is the default). The _reindex operation can’t be used without source documents.

Create the destination index with your desired mapping (field or data type). For demonstration purposes, our source index has a field ratings defined as long, and we want the same field to use the float data type in the destination index:

GET /source_index_name/_mapping
{  
  "source_index_name": {
    "mappings" : {
      "properties" : {
        "ratings " : {
          "type" : "long"
        },
…
      }
    }
  }
}

PUT /destination_index_name
{
  "settings": {
    "index": {
      "number_of_shards": <DESIRED_NUMBER_OF_PRIMARY_SHARDS>,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "properties" : {
      "ratings" : {
          "type" : "float"
        },
…
    }
  }
}

Ensure that you have sufficient disk space on each hot tier data node to house the new index primary shards and, depending on your configuration, replica shards. If disk space is insufficient, perform an update operation on the OpenSearch Service domain to add the required storage capacity. Depending on storage requirements, you may need to migrate the OpenSearch Service domain to a different instance type, because nodes have constraints on the EBS volume size that can be mounted to each instance type. Issue the following operation to validate available disk space:

GET _cat/allocation?v

The following screenshot shows the output.

Check the disk.avail metric for hot storage tier nodes to validate your available disk space.

Use the reindex API operation

The _reindex operation snapshots the index at the beginning of its run and performs processing on a snapshot to minimize impact on the source index. The source index can still be used for querying and processing the data. Although the _reindex operation can run both synchronously and asynchronously, we recommend using an asynchronous run. You can monitor the progress of the _reindex operation, cancel its run, or throttle its run using the _task, _cancel, and _rethrottle operations, respectively.

Because the _reindex operation doesn’t require the source index placed in a read-only mode, query and index update operations are free to continue.

Use the reindex API with the following command:

POST _reindex?wait_for_completion=false
{
  "source": {
	"index": "source_index_name"
  },
  "dest": {
	"index": "destination_index_name",
	"op_type" : "index"
  }
}

The source indexes in the _reindex API operation can be supplemented with a query for reindexing a subset of documents and storing them in the destination index. Progress of the reindex operation can be monitored via the tasks API operation:

GET _tasks
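
Because the asynchronous _reindex call returns a task ID, you can also poll that task programmatically. The following is a minimal sketch using the Python requests library; the endpoint, credentials, and task ID are placeholders.

import time

import requests
from requests.auth import HTTPBasicAuth  # assumes fine-grained access control with basic auth

DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = HTTPBasicAuth("master-user", "master-password")   # placeholder credentials
TASK_ID = "abc123:4567"  # returned by POST _reindex?wait_for_completion=false

while True:
    task = requests.get(f"{DOMAIN}/_tasks/{TASK_ID}", auth=AUTH, timeout=30).json()
    status = task["task"]["status"]
    print(f"created={status['created']} updated={status['updated']} total={status['total']}")
    if task["completed"]:
        break
    time.sleep(30)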

Note that the _reindex operation can be throttled via a _rethrottle API or settings passed as a parameter. You can cancel the task with the _cancel operation:

POST _tasks/TASK_ID/_cancel

The following screenshot shows the output of the _reindex operation for reindexing from source_index_name to destination_index_name.

When the operation is complete, both consumers and producers of the source indexes or aliases need to re-point to the destination index and the same _reindex operation needs to run again to catch up on any create, update, or delete operations performed on the source indexes while the initial _reindex operation was running. This step is required because the _reindex operation is running on a snapshot of the index. At this time, the _reindex operation needs to run with “op_type”:”create” to realign missing and out-of-version documents. See the following API command:

POST _reindex?wait_for_completion=false
{
"conflicts":"proceed",
  "source": {
	"index": "source_index_name"
  },
  "dest": {
	"index": "destination_index_name",
	"op_type" : "create"
  }
}

After the operation is complete and data integrity in the destination index is confirmed, you can delete the source index to reclaim disk space.

Increase index shard count using the split index API

The split index API and shrink index API cover a large array of use cases and have a low resource utilization footprint on the domain. However, these APIs require closing the index for write operations and don’t address use cases that require changes to the mapping settings.

In OpenSearch Service, the number_of_shards index setting is immutable and defined at the time when the index is created. However, although this setting is immutable, there are several patterns to increase or decrease index shard count without needing to explicitly reindex the data. You can alternatively use the split index API to increase index shard count in the environments that can suspend write operations. The split index API provides a simplified way of creating a new index with a different shard setting and without reindexing your data. The split index API operation creates a new index based off of a read-only index with a desired number of primary shards.

In OpenSearch Service, an index alias is a virtual index name that can point to one or more indexes. Referencing to indexes using aliases in your applications allows you to avoid index name changes. Index aliases are used to point consumers and producers to a new index after the split index API operation is complete.

Although the majority of use cases focus on increasing a number of shards on an existing index due to data growth, there are also instances where you need to reduce the number of shards on an existing index. Such cases occasionally happen when an actual index size is less than what was anticipated when the index was created, and you want to align with a shard strategy for operational best practices for OpenSearch Service. In cases where you need to reduce a number of shards on an index, you can use the shrink index API to achieve this task.
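
To make the split sequence concrete, the following is a minimal sketch using the Python requests library: it blocks writes on the source index, splits it into an index with more primary shards, and re-points an alias; the shrink flow is analogous with the _shrink endpoint. The endpoint, credentials, index names, alias, and shard count are placeholders, and the target primary shard count must be a multiple of the source’s.

import requests
from requests.auth import HTTPBasicAuth  # assumes fine-grained access control with basic auth

DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = HTTPBasicAuth("master-user", "master-password")   # placeholder credentials

# 1. Put the source index into a write-blocked (read-only for writes) state.
requests.put(
    f"{DOMAIN}/source_index_name/_settings",
    json={"index.blocks.write": True},
    auth=AUTH, timeout=30,
).raise_for_status()

# 2. Split into a new index with more primary shards (a multiple of the source shard count).
requests.post(
    f"{DOMAIN}/source_index_name/_split/destination_index_name",
    json={"settings": {"index": {"number_of_shards": 10, "number_of_replicas": 1}}},
    auth=AUTH, timeout=300,
).raise_for_status()

# 3. Re-point the alias so producers and consumers switch to the new index.
requests.post(
    f"{DOMAIN}/_aliases",
    json={"actions": [
        {"remove": {"index": "source_index_name", "alias": "my_alias"}},
        {"add": {"index": "destination_index_name", "alias": "my_alias"}},
    ]},
    auth=AUTH, timeout=30,
).raise_for_status()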

Conclusion

In this post, we reviewed best practices when reindexing data for making changes in OpenSearch Service static index settings and mappings that require little or no index downtime. We also covered use of the split index and shrink index APIs for changing the primary index shard count for use cases where the index can be placed in a read-only state.

In our next post, we’ll explore patterns for remote indexing to alleviate load and resource utilization on the source OpenSearch Service domain.


About the Authors

Mikhail Vaynshteyn is a Solutions Architect with Amazon Web Services. Mikhail works with healthcare and life sciences customers to build solutions that help improve patients’ outcomes. Mikhail specializes in data analytics services.

Sukhomoy Basak is a Solutions Architect at Amazon Web Services, with a passion for data and analytics solutions. Sukhomoy works with enterprise customers to help them architect, build, and scale applications to achieve their business outcomes.

Generic orchestration framework for data warehousing workloads using Amazon Redshift RSQL

Post Syndicated from Akhil B original https://aws.amazon.com/blogs/big-data/generic-orchestration-framework-for-data-warehousing-workloads-using-amazon-redshift-rsql/

Tens of thousands of customers run business-critical workloads on Amazon Redshift, AWS’s fast, petabyte-scale cloud data warehouse delivering the best price-performance. With Amazon Redshift, you can query data across your data warehouse, operational data stores, and data lake using standard SQL. You can also integrate AWS services like Amazon EMR, Amazon Athena, Amazon SageMaker, AWS Glue, AWS Lake Formation, and Amazon Kinesis to take advantage of all of the analytic capabilities in the AWS Cloud.

Amazon Redshift RSQL is a native command-line client for interacting with Amazon Redshift clusters and databases. You can connect to an Amazon Redshift cluster, describe database objects, query data, and view query results in various output formats. You can use Amazon Redshift RSQL to replace existing extract, transform, load (ETL) and automation scripts, such as Teradata BTEQ scripts. You can wrap Amazon Redshift RSQL statements within a shell script to replicate existing functionality in the on-premises systems. Amazon Redshift RSQL is available for Linux, Windows, and macOS operating systems.

This post explains how you can create a generic configuration-driven orchestration framework using AWS Step Functions, Amazon Elastic Compute Cloud (Amazon EC2), AWS Lambda, Amazon DynamoDB, and AWS Systems Manager to orchestrate RSQL-based ETL workloads. If you’re migrating from legacy data warehouse workloads to Amazon Redshift, you can use this methodology to orchestrate your data warehousing workloads.

Solution overview

Customers migrating from legacy data warehouses to Amazon Redshift may have a significant investment in proprietary scripts like Basic Teradata Query (BTEQ) scripting for database automation, ETL, or other tasks. You can now use the AWS Schema Conversion Tool (AWS SCT) to automatically convert proprietary scripts like BTEQ scripts to Amazon Redshift RSQL scripts. The converted scripts run on Amazon Redshift with little to no changes. To learn about new options for database scripting, refer to Accelerate your data warehouse migration to Amazon Redshift – Part 4.

During such migrations, you may also want to modernize your current on-premises, third-party orchestration tools with a cloud-native framework to replicate and enhance your current orchestration capability. Orchestrating data warehouse workloads includes scheduling the jobs, checking if the pre-conditions have been met, running the business logic embedded within RSQL, monitoring the status of the jobs, and alerting if there are any failures.

This solution allows on-premises customers to migrate to a cloud-native orchestration framework that uses AWS serverless services such as Step Functions, Lambda, DynamoDB, and Systems Manager to run the Amazon Redshift RSQL jobs deployed on a persistent EC2 instance. You can also deploy the solution for greenfield implementations. In addition to meeting functional requirements, this solution also provides full auditing, logging, and monitoring of all ETL and ELT processes that are run.

To ensure high availability and resilience, you can use multiple EC2 instances that are a part of an auto scaling group along with Amazon Elastic File System (Amazon EFS) to deploy and run the RSQL jobs. When using auto scaling groups, you can install RSQL onto the EC2 instance as a part of the bootstrap script. You can also deploy the Amazon Redshift RSQL scripts onto the EC2 instance using AWS CodePipeline and AWS CodeDeploy. For more details, refer to Auto Scaling groups, the Amazon EFS User Guide, and Integrating CodeDeploy with Amazon EC2 Auto Scaling.

The following diagram illustrates the architecture of the orchestration framework.

Architecture Diagram

The key components of the framework are as follows:

  1. Amazon EventBridge is used as the ETL workflow scheduler, and it triggers a Lambda function at a preset schedule.
  2. The function queries a DynamoDB table for the configuration associated with the RSQL job and retrieves the status of the job, run mode, and restart information for that job.
  3. After receiving the configuration, the function triggers a Step Functions state machine by passing the configuration details.
  4. Step Functions starts running different stages (like configuration iteration, run type check, and more) of the workflow.
  5. Step Functions uses the Systems Manager SendCommand API to trigger the RSQL job and goes into a paused state with TaskToken. The RSQL scripts are persisted on an EC2 instance and are wrapped in a shell script. Systems Manager runs an AWS-RunShellScript SSM document to run the RSQL job on the EC2 instance.
  6. The RSQL job performs ETL and ELT operations on the Amazon Redshift cluster. When it’s complete, it returns a success/failure code and status message back to the calling shell script.
  7. The shell script calls a custom Python module with the success/failure code, status message, and the callback TaskToken that was received from Step Functions. The Python module logs the RSQL job status in the DynamoDB job audit table, and exports logs to the Amazon CloudWatch log group.
  8. The Python module then performs a SendTaskSuccess or SendTaskFailure API call based on the RSQL job run status. Based on the status of the RSQL job, Step Functions either resumes the flow or stops with failure.
  9. Step Functions logs the workflow status (success or failure) in the DynamoDB workflow audit table.

Prerequisites

You should have the following prerequisites:

Deploy AWS CDK stacks

Complete the following steps to deploy your resources using the AWS CDK:

  1. Clone the GitHub repo:
    git clone https://github.com/aws-samples/amazon-redshift-rsql-orchestration-framework.git

  2. Update the following environment parameters in cdk.json (this file can be found in the infra directory):
    1. ec2_instance_id – The EC2 instance ID on which RSQL jobs are deployed
    2. redshift_secret_id – The name of the Secrets Manager key that stores the Amazon Redshift database credentials
    3. rsql_script_path – The absolute directory path in the EC2 instance where the RSQL jobs are stored
    4. rsql_log_path – The absolute directory path in the EC2 instance used for storing the RSQL job logs
    5. rsql_script_wrapper – The absolute directory path of the RSQL wrapper script (rsql_trigger.sh) on the EC2 instance.

    The following is a sample cdk.json file after being populated with the parameters:

        "environment": {
          "ec2_instance_id" : "i-xxxx",
          "redshift_secret_id" : "blog-secret",
          "rsql_script_path" : "/home/ec2-user/blog_test/rsql_scripts/",
          "rsql_log_path" : "/home/ec2-user/blog_test/logs/",
          "rsql_script_wrapper" : "/home/ec2-user/blog_test/instance_code/rsql_trigger.sh"
        }
    

  3. Deploy the AWS CDK stack with the following code:
    cd amazon-redshift-rsql-orchestration-framework/lambdas/lambda-layer/
    sh zip_lambda_layer.sh
    cd ../../infra/
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
    cdk bootstrap <AWS Account ID>/<AWS Region>
    cdk deploy --all

Let’s look at the resources the AWS CDK stack deploys in more detail.

CloudWatch log group

A CloudWatch log group (/ops/rsql-logs/) is created, which is used to store, monitor, and access log files from EC2 instances and other sources.

The log group is used to store the RSQL job run logs. For each RSQL script, all the stdout and stderr logs are stored as a log stream within this log group.

DynamoDB configuration table

The DynamoDB configuration table (rsql-blog-rsql-config-table) is the basic building block of this solution. All the RSQL jobs, their restart information and run mode (sequential or parallel), and the sequence in which the jobs are to be run are stored in this configuration table.

The table has the following structure:

  • workflow_id – The identifier for the RSQL-based ETL workflow.
  • workflow_description – The description for the RSQL-based ETL workflow.
  • workflow_stages – The sequence of stages within a workflow.
  • execution_type – The type of run for RSQL jobs (sequential or parallel).
  • stage_description – The description for the stage.
  • scripts – The list of RSQL scripts to be run. The RSQL scripts must be placed in the location defined in a later step.

The following is an example of an entry in the configuration table. You can see the workflow_id is blog_test_workflow and the description is Test Workflow for Blog.

It has three stages that are triggered in the following order: Schema & Table Creation Stage, Data Insertion Stage 1, and Data Insertion Stage 2. The stage Schema & Table Creation Stage has two RSQL jobs running sequentially, and Data Insertion Stage 1 and Data Insertion Stage 2 each have two jobs running in parallel.

{
	"workflow_id": "blog_test_workflow",
	"workflow_description": "Test Workflow for Blog",
	"workflow_stages": [{
			"execution_flag": "y",
			"execution_type": "sequential",
			"scripts": [
				"rsql_blog_script_1.sh",
				"rsql_blog_script_2.sh"
			],
			"stage_description": "Schema & Table Creation Stage"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_3.sh",
				"rsql_blog_script_4.sh"
			],
			"stage_description": "Data Insertion Stage 1"
		},
		{
			"execution_flag": "y",
			"execution_type": "parallel",
			"scripts": [
				"rsql_blog_script_5.sh",
				"rsql_blog_script_6.sh"
			],
			"stage_description": "Data Insertion Stage 2"
		}
	]
}

DynamoDB audit tables

The audit tables store the run details for each RSQL job within the ETL workflow with a unique identifier for monitoring and reporting purposes. There are two audit tables because one stores the audit information at an RSQL job level and the other stores it at a workflow level.

The job audit table (rsql-blog-rsql-job-audit-table) has the following structure:

  • job_name – The name of the RSQL script
  • workflow_execution_id – The run ID for the workflow
  • execution_start_ts – The start timestamp for the RSQL job
  • execution_end_ts – The end timestamp for the RSQL job
  • execution_status – The run status of the RSQL job (Running, Completed, Failed)
  • instance_id – The EC2 instance ID on which the RSQL job is run
  • ssm_command_id – The Systems Manager command ID used to trigger the RSQL job
  • workflow_id – The workflow_id under which the RSQL job is run

The workflow audit table (rsql-blog-rsql-workflow-audit-table) has the following structure:

  • workflow_execution_id – The run ID for the workflow
  • workflow_id – The identifier for a particular workflow
  • execution_start_ts – The start timestamp for the workflow
  • execution_status – The run status of the workflow or state machine (Running, Completed, Failed)
  • rsql_jobs – The list of RSQL scripts that are a part of the workflow
  • execution_end_ts – The end timestamp for the workflow

Lambda functions

The AWS CDK creates the Lambda functions that retrieve the config data from the DynamoDB config table, update the audit details in DynamoDB, trigger the RSQL scripts on the EC2 instance, and iterate through each stage. The following is a list of the functions:

  • rsql-blog-master-iterator-lambda
  • rsql-blog-parallel-load-check-lambda
  • rsql-blog-sequential-iterator-lambda
  • rsql-blog-rsql-invoke-lambda
  • rsql-blog-update-audit-ddb-lambda

Step Functions state machines

This solution implements a Step Functions callback task integration pattern that enables Step Functions workflows to send a token to an external system via multiple AWS services.
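
The following is a minimal sketch of that callback pattern, illustrating the kind of call the custom Python module makes after the RSQL shell script finishes (the actual module in the repository also writes audit records and CloudWatch logs); the token, job name, status values, and invocation shown are placeholders.

import json
import sys

import boto3

sfn = boto3.client("stepfunctions")


def report_job_status(task_token, job_name, return_code, message):
    """Resume or fail the paused Step Functions workflow branch for this RSQL job."""
    if return_code == 0:
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"job_name": job_name, "status": "Completed", "message": message}),
        )
    else:
        sfn.send_task_failure(
            taskToken=task_token,
            error="RSQLJobFailed",
            cause=f"{job_name}: {message}",
        )


if __name__ == "__main__":
    # Hypothetical invocation from the wrapper shell script:
    # python3 report_status.py "$TASK_TOKEN" rsql_blog_script_1.sh "$RC" "$STATUS_MESSAGE"
    token, name, rc, msg = sys.argv[1], sys.argv[2], int(sys.argv[3]), sys.argv[4]
    report_job_status(token, name, rc, msg)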

The AWS CDK deploys the following state machines:

  • RSQLParallelStateMachine – The parallel state machine is triggered if the execution_type for a stage in the configuration table is set to parallel. The Lambda function with a callback token is triggered in parallel for each of the RSQL scripts using a Map state.
  • RSQLSequentialStateMachine – The sequential state machine is triggered if the execution_type for a stage in the configuration table is set to sequential. This state machine uses an iterator design pattern to run each RSQL job within the stage as per the sequence mentioned in the configuration.
  • RSQLMasterStatemachine – The primary state machine iterates through each stage and triggers different state machines based on the run mode (sequential or parallel) using a Choice state.

Move the RSQL script and instance code

Copy the instance_code and rsql_scripts directories (present in the GitHub repo) to the EC2 instance. Make sure the framework directory within instance_code is copied as well.

The following screenshots show that the instance_code and rsql_scripts directories are copied to the same parent folder on the EC2 instance.

Instance Code Scripts Image
Instance Code EC2 Copy Image
RSQL Script Image
RSQL Script EC2 Copy Image

RSQL script run workflow

To further illustrate the mechanism to run the RSQL scripts, see the following diagram.

RSQL Script Workflow Diagram

The Lambda function, which gets the configuration details from the configuration DynamoDB table, triggers the Step Functions workflow, which performs the following steps:

  1. A Lambda function defined as a workflow step receives the Step Functions TaskToken and configuration details.
  2. The TaskToken and configuration details are passed onto the EC2 instance using the Systems Manager SendCommand API call. After the Lambda function is run, the workflow branch goes into a paused state and waits for a callback token.
  3. The RSQL scripts run on the EC2 instance and perform ETL and ELT operations on Amazon Redshift. After the scripts finish, each RSQL script passes its completion status and the TaskToken to a Python script that is embedded within the RSQL script.
  4. The Python script updates the RSQL job status (success or failure) in the job audit DynamoDB table. It also exports the RSQL job logs to the CloudWatch log group.
  5. The Python script passes the RSQL job status and status message, along with the TaskToken, back to the Step Functions workflow using the SendTaskSuccess or SendTaskFailure API call (a minimal sketch of this callback follows the list).
  6. Depending on the job status received, Step Functions either resumes the workflow or stops the workflow.
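
The following is a minimal Python sketch of the callback described in steps 4 and 5. The audit table name comes from this post, but the key name, attribute values, and the way the TaskToken is passed in are illustrative assumptions rather than the exact script in the repository.

import boto3

# Illustrative values; the real wrapper script receives these from the RSQL job
task_token = "<TaskToken passed from Step Functions via SendCommand>"
job_status = "success"  # or "failure", set from the RSQL script's exit status

dynamodb = boto3.resource("dynamodb")
sfn = boto3.client("stepfunctions")

# Step 4: record the job status in the job audit table
# (assumes job_execution_id is the table's partition key)
dynamodb.Table("rsql-blog-rsql-job-audit-table").update_item(
    Key={"job_execution_id": "<job execution id>"},
    UpdateExpression="SET execution_status = :s",
    ExpressionAttributeValues={":s": job_status},
)

# Step 5: return the result and the TaskToken to the paused workflow branch
if job_status == "success":
    sfn.send_task_success(taskToken=task_token, output='{"status": "success"}')
else:
    sfn.send_task_failure(taskToken=task_token, error="RSQLJobFailed",
                          cause="RSQL script returned a non-zero exit code")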

If you use EC2 Auto Scaling groups, you can improve resilience and high availability by using the Systems Manager SendCommand call to target one or more EC2 instances that are part of the Auto Scaling group. For more information, refer to Run commands at scale.

When multiple EC2 instances are used, set the MaxConcurrency parameter of the SendCommand API call to 1, which makes sure that the RSQL job is triggered on only one EC2 instance at a time. For further details, refer to Using concurrency controls.
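
As a rough illustration, the invocation might look like the following boto3 sketch; the Auto Scaling group name and script path are assumptions for this example, not values from the solution.

import boto3

ssm = boto3.client("ssm")

response = ssm.send_command(
    # Target the instances of the Auto Scaling group by its tag (group name is assumed)
    Targets=[{"Key": "tag:aws:autoscaling:groupName", "Values": ["rsql-blog-asg"]}],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["/home/ec2-user/rsql_scripts/rsql_blog_script_1.sh"]},
    MaxConcurrency="1",  # run the RSQL job on only one instance at a time
    MaxErrors="0",
)
print(response["Command"]["CommandId"])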

Run the orchestration framework

To run the orchestration framework, complete the following steps:

  1. On the DynamoDB console, navigate to the configuration table and insert the configuration details provided earlier. For instructions on how to insert the example JSON configuration details, refer to Write data to a table using the console or AWS CLI.
    DynamoDB Config Insertion
  2. On the Lambda console, open the rsql-blog-rsql-workflow-trigger-lambda function and choose Test.
    Workflow Trigger Lambda Function
  3. Add the test event similar to the following code and choose Test:
    {
    	"workflow_id": "blog_test_workflow",
    	"workflow_execution_id": "demo_test_26"
    }

    Workflow Trigger Lambda function Payload

  4. On the Step Functions console, navigate to the rsql-master-state-machine state machine to open the details page.
    RSQL Master Step Function
  5. Choose Edit, then choose Workflow Studio New. The following screenshot shows the primary state machine.
    RSQL Master Step Function Flow
  6. Choose Cancel to leave Workflow Studio, then choose Cancel again to leave edit mode. You’re directed back to the details page.
    RSQL Master Step Function Details
  7. On the Executions tab, choose the latest run.
    RSQL Master Step Function Execution
  8. From the Graph view, you can check the status of each state by choosing it. Every state that uses an external resource has a link to it on the Details tab.
    RSQL Master Step Function Execution Graph
  9. The orchestration framework runs the ETL load, which consists of the following sample RSQL scripts:
    • rsql_blog_script_1.sh – Creates the rsql_blog schema in the database
    • rsql_blog_script_2.sh – Creates the blog_table table within the rsql_blog schema
    • rsql_blog_script_3.sh – Inserts one row into blog_table
    • rsql_blog_script_4.sh – Inserts one row into blog_table
    • rsql_blog_script_5.sh – Inserts one row into blog_table
    • rsql_blog_script_6.sh – Inserts one row into blog_table

You need to replace these RSQL scripts with the RSQL scripts developed for your workloads by inserting the relevant configuration details into the configuration DynamoDB table (rsql-blog-rsql-config-table).

Validation

After you run the framework, you’ll find a schema (called rsql_blog) with one table (called blog_table) created. This table consists of four rows.

RSQL Execution Table

You can check the logs of the RSQL job in the CloudWatch log group (/ops/rsql-logs/) and also the run status of the workflow in the workflow audit DynamoDB table (rsql-blog-rsql-workflow-audit-table).
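
You can also read the audit record programmatically. The following boto3 sketch assumes that workflow_execution_id is the partition key of the audit table (the deployed key schema may differ) and reuses the execution ID from the earlier test event.

import boto3

table = boto3.resource("dynamodb").Table("rsql-blog-rsql-workflow-audit-table")

# Assumes workflow_execution_id is the table's partition key
item = table.get_item(Key={"workflow_execution_id": "demo_test_26"}).get("Item", {})
print(item.get("execution_status"), item.get("rsql_jobs"), item.get("execution_end_ts"))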

RSQL Script CloudWatch Logs
RSQL Workflow Audit Record

Clean up

To avoid ongoing charges for the resources that you created, delete them. The AWS CDK deletes all resources except data resources such as DynamoDB tables.

  • First, delete all AWS CDK stacks
    cdk destroy --all

  • On the DynamoDB console, select the following tables and delete them:
    • rsql-blog-rsql-config-table
    • rsql-blog-rsql-job-audit-table
    • rsql-blog-rsql-workflow-audit-table

Conclusion

You can use Amazon Redshift RSQL, Systems Manager, EC2 instances, and Step Functions to create a modern and cost-effective orchestration framework for ETL workflows. There is no overhead of creating and managing separate state machines for each of your ETL workflows. In this post, we demonstrated how to use this configuration-based generic orchestration framework to trigger complex RSQL-based ETL workflows.

You can also trigger an email notification through Amazon Simple Notification Service (Amazon SNS) within the state machine to notify the operations team of the completion status of the ETL process. Further, you can achieve an event-driven ETL orchestration framework by using Amazon EventBridge to start the workflow trigger Lambda function.


About the Authors

Akhil is a Data Analytics Consultant at AWS Professional Services. He helps customers design & build scalable data analytics solutions and migrate data pipelines and data warehouses to AWS. In his spare time, he loves travelling, playing games and watching movies.


Ramesh Raghupathy is a Senior Data Architect with WWCO ProServe at AWS. He works with AWS customers to architect, deploy, and migrate to data warehouses and data lakes on the AWS Cloud. While not at work, Ramesh enjoys traveling, spending time with family, and yoga.

Raza Hafeez is a Senior Data Architect within the Shared Delivery Practice of AWS Professional Services. He has over 12 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.

Dipal Mahajan is a Lead Consultant with Amazon Web Services based out of India, where he guides global customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings extensive experience on Software Development, Architecture and Analytics from industries like finance, telecom, retail and healthcare.

Simplify web app authentication: A guide to AD FS federation with Amazon Cognito user pools

Post Syndicated from Leo Drakopoulos original https://aws.amazon.com/blogs/security/simplify-web-app-authentication-a-guide-to-ad-fs-federation-with-amazon-cognito-user-pools/

August 13, 2018: This post was first published on the Front-End Web and Mobile Blog. We have since updated the CloudFormation template, provided additional clarification on the implementation steps, and revised the post to account for the new Amazon Cognito UI.


User authentication and authorization can be challenging when you’re building web and mobile apps. The challenges include handling user data and passwords, token-based authentication, federating identities from external identity providers (IdPs), managing fine-grained permissions, scalability, and more.

In this blog post, we will show you how to federate identities from Windows Server Active Directory to authenticate users into your web app by using AWS services. The main AWS service that we’ll use for this purpose is Amazon Cognito.

With Amazon Cognito user pools, you can add user sign-up and sign-in to your mobile and web apps by using a secure and scalable user directory. In addition, you can federate users from a SAML IdP with Amazon Cognito user pools, map these users to a user directory, and get standard authentication tokens from a user pool after the user authenticates with a SAML IdP.

This post explains how to integrate Amazon Cognito user pools with Microsoft Active Directory Federation Services (AD FS) to obtain JSON web tokens (JWTs) in your web app—which in turn can be used for downstream authentication. To demonstrate the complete authentication flow, we’ve created a simple REST API that’s built on Amazon API Gateway. The REST API retrieves data from an Amazon DynamoDB table with the help of an AWS Lambda function. We’ll use the JWT tokens that are vended from user pools to authenticate to the REST API, which is hosted on API Gateway.

A benefit of using Amazon Cognito user pools to federate users from a SAML provider is that a user pool supports SAML 2.0 post-binding endpoints. This helps eliminate the need for client-side parsing of the SAML assertion response, and the user pool directly receives the SAML response from your IdP through a user agent.

As part of the SAML federation feature, the user pool acts as a service provider on behalf of your application. The user pool becomes a single point of identity management for your application, and your application doesn’t need to integrate with multiple SAML IdPs.

Solution overview

Figure 1 shows the authentication flow that we present throughout this blog post.

Figure 1: Authentication flow with Amazon Cognito user pool

As shown in the figure, the authentication flow involves the following steps:

  1. The app starts the sign-up and sign-in process by directing the user to the Cognito user pools hosted web UI. For a mobile app, you can use a web view to show the hosted web UI. For this post, you will use a web app that is hosted on Amazon Simple Storage Service (Amazon S3) fronted by Amazon CloudFront.
  2. The Amazon Cognito user pool determines the appropriate IdP based on your configuration. For AD FS, the IdP is determined by the metadata file or metadata endpoint URL from your SAML IdP. For example, if you use AD FS, the metadata URL looks like the following: https://<yourservername>/FederationMetadata/2007-06/FederationMetadata.xml
  3. The user is redirected to the IdP—in this case, Active Directory.
  4. The IdP authenticates the user if necessary. If the IdP recognizes that the user has an active session, then the IdP skips the authentication to provide a single sign-on experience.
  5. The IdP sends the SAML assertion to Amazon Cognito.
  6. The user’s profile is created in the user pool.
  7. After verifying the SAML assertion and collecting the user attributes (claims) from the assertion, Amazon Cognito returns OIDC tokens (ID, access, and refresh tokens) to the app for the user who is now signed in.
  8. The app then makes a GET request to API Gateway, passing the JWT in the Authorization header. If the token is valid, the request is forwarded to Lambda for data retrieval from DynamoDB.
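
Step 8 is an ordinary HTTPS request that carries the token in the Authorization header. The sample app does this in JavaScript; the following Python sketch shows the same call shape, with the invoke URL and token as placeholders.

import requests

id_token = "<ID token returned by the Amazon Cognito hosted UI>"
invoke_url = "https://<api-id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder

# The API Gateway Cognito authorizer validates the token before invoking Lambda
response = requests.get(invoke_url, headers={"Authorization": id_token})
print(response.status_code, response.text)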

Installation and configuration walkthrough

To build the authentication flow that we described in the previous section, complete the following steps.

  • Step 1: Install Active Directory and AD FS
  • Step 2: Create an Amazon Cognito user pool
  • Step 3: Configure Active Directory and AD FS
  • Step 4: Complete the Amazon Cognito configuration
  • Step 5: Deploy and configure the web app

Step 1: Install Active Directory and AD FS

You will need to set up Active Directory and AD FS. For instructions on how to install both with an AWS CloudFormation template, see Enabling Federation to AWS Using Windows Active Directory, ADFS, and SAML 2.0. To complete the walkthrough in this blog post, you will need to have a working Active Directory service and AD FS service, and a user created within Active Directory. For this walkthrough, we created a user named bob with an email address of [email protected].

Step 2: Create an Amazon Cognito user pool

  1. Sign in to the Amazon Cognito console and do one of the following:
    • If you have an existing user pool, in the left navigation pane, choose User pools and then choose Create user pool to create a new user pool for this walkthrough.
    • If you don’t have an existing user pool, you will see a landing page. Keep the dropdown list as default and choose Create user pool.
  2. In the Configure sign-in experience section, for Cognito user pool sign-in options, select Email, and then choose Next.
  3. In the Configure security requirements section, under Multi-factor authentication, select No MFA, leave the other fields as default, and then choose Next.
  4. In the Configure sign-up experience section, under Attribute verification and user account confirmation, deselect Allow Cognito to automatically send messages to verify and confirm, and choose Next.
  5. In the Configure message delivery section, under Email, select Send email with Cognito, leave the other fields as default, and then choose Next.
  6. In the Integrate your app section, enter a user pool name, select Use the Cognito Hosted UI, and create a domain name using a Cognito domain.
  7. In the Initial app client section as shown in Figure 2, for App client name, enter SAML-IdP; and for Allowed callback URLs, enter https://localhost. Then choose Next.
    Figure 2: Set up the initial app client to create the Cognito user pool

  8. In the Review and create section, review all settings, and then scroll to the bottom of the page and choose Create user pool.

Step 3: Configure Active Directory and AD FS

Now that you’ve created an Amazon Cognito user pool, you need to set up Amazon Cognito as a relying party in the SAML identity provider (in this case, AD FS). After you configure AD FS, you will return to Amazon Cognito to complete the final configurations for the application to work.

  1. Connect to the Windows Server instance where you installed AD FS as an administrator through the remote desktop protocol (RDP).
  2. Open the AD FS 2.0 console.
  3. To make sure that the user you created in Step 1 has an email address, in the user property window for your user, choose General. Figure 3 shows our user named bob in Active Directory with an email address of [email protected].
    Figure 3: User properties of bob in the Active Directory

  4. Determine the Uniform Resource Name (URN) for the Amazon Cognito user pool. The form of the URN is urn:amazon:cognito:sp:<user-pool-id>. You can find the user pool ID in the General settings tab.
  5. Configure AD FS as follows to work with the Amazon Cognito user pool:
    1. Go to Trust Relationships > Relying Party Trusts > Add relying party trusts. This will start a wizard.
    2. Select Enter data about the relying party manually.
    3. Enter a display name for the relying party configuration.
    4. On the next screen, do not configure a certificate.
    5. Enable support for the SAML 2.0 single sign-on service URL.
    6. Add the Amazon Cognito user pool URN as the relying party trust identifier.
    7. Configure the SAML POST binding. The SAML 2.0 post-binding endpoint (also known as the assertion consumer URL) for the Amazon Cognito user pool is https://<domain-prefix>.auth.<region>.amazoncognito.com/saml2/idpresponse. You configured this as the domain name in Step 2.6.
    8. Select Permit all users to access this relying party.
    9. Choose Finish.
  6. Navigate to Trust Relationships > Relying Party Trusts. You should see that the URN of Amazon Cognito is configured as the relying party, as shown in Figure 4:
Figure 4: Amazon Cognito trusted as the relying party

In a SAML federation, the IdP can pass various attributes about the user, the authentication method, or other points of context to the service provider (in this case, Amazon Cognito) in the form of SAML attributes. In AD FS, claim rules are used to assemble these required attributes using a combination of Active Directory lookups, simple transformations, and regular expression-based custom rules. In this example, you will configure two claim rules: Name ID and E-Mail.

  1. The Edit Claim Rules window should already be open. If it isn’t, select your relying party trust from the Trust Relationships > Relying Party Trusts screen, and then, in the Actions tab on the right side, choose Edit Claim Rules.
  2. On the Configure Claim Rule page, enter the following values for each configuration element, and then choose OK.
    • Claim rule name: Name ID
    • Incoming claim type: Windows account name
    • Outgoing claim type: Name ID
    • Outgoing name ID format: Persistent identifier
  3. Repeat the preceding steps for the E-mail claim:
    • Claim rule name: Email
    • Attribute store: Active Directory
    • LDAP Attributes: Email Addresses
    • Outgoing Claim Type: Email Address
  4. Before leaving the AD FS configuration, download the metadata file for AD FS. The metadata URL for AD FS looks like the following: https://<servername>/FederationMetadata/2007-06/FederationMetadata.xml. The metadata file describes the endpoint of your SAML IdP (the AD FS service) to the service provider (Amazon Cognito).

Step 4: Complete the Amazon Cognito configuration

  1. Sign in to the Amazon Cognito console.
  2. Select the Amazon Cognito user pool that you created earlier, navigate to Sign-in experience > Federated identity provider sign-in, and choose Add identity provider, as shown in Figure 5.
    Figure 5: Add a federated identity provider in the Amazon Cognito console

  3. Choose SAML as the identity provider.
  4. As shown in Figure 6, enter a name for your identity provider, choose Select file, and then upload the FederationMetadata.xml file that you downloaded at the end of Step 3.
    Figure 6: Set up SAML federation with the user pool

  5. Provide the SAML attribute to map attributes between your SAML provider and your user pool as follows:
    • For User pool attribute, select email.
    • For SAML attribute, enter http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress

    These mappings map the claims from the SAML assertion from AD FS to the user pool attributes. You configured an E-mail claim in AD FS, so you need to map this with the appropriate attribute in the user pool.

  6. Choose Add identity provider.

Step 5: Deploy and configure a web app

To reduce the number of steps required for this walkthrough, we have provided a CloudFormation template that you can use to complete the deployment, which deploys the architecture shown in Figure 7:

Figure 7: Web app architecture deployed by the CloudFormation template

This architecture is essentially the same as step 8 from the authentication flow diagram (Figure 1) earlier in this post. In Figure 7, we have added Amazon S3 and Amazon CloudFront to the diagram, which is where your static website is hosted. Complete the following steps for this walkthrough:

  • Step 5.1: Create the AWS CloudFormation stack
  • Step 5.2: Manually integrate Amazon Cognito user pools with API Gateway
  • Step 5.3: Update the configuration for Amazon Cognito
  • Step 5.4: Update the configuration for the client-side application and upload to Amazon S3
  • Step 5.5: Insert a row into a DynamoDB table to help you test the application

Step 5.1: Create the AWS CloudFormation stack

Let’s deploy this infrastructure:

  1. Download the code repository, which includes the CloudFormation template named prerequisites.yaml and the sample code for a web app named DataManager.
  2. Navigate to the CloudFormation console in the Region where you deployed the user pool, and choose Create Stack.
  3. To upload the template to Amazon S3, choose Browse and select prerequisites.yaml  in the folder where you downloaded it.
  4. Provide a Stack name and a unique Bucket name.

    Note: S3 bucket names should not contain uppercase characters.

  5. Choose Next, and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  6. Choose Create and then wait for the resources to be deployed.

    Note: If the deployment fails with the error message API: s3:CreateBucket Access Denied, review the IAM permissions available for the IAM user or the role used and make sure that the s3:CreateBucket permission has been granted.

Step 5.2: Manually integrate the Amazon Cognito user pool with API Gateway

  1. Open the API Gateway console. You should see that an API named DataManager has been created by CloudFormation, as shown in Figure 8:
    Figure 8: APIs in the API Gateway console

  2. Under APIs, choose DataManager, and then choose Authorizers.
  3. Choose Create new Authorizer, and then populate the relevant details:
    • For Name, enter SamlAuthorizer.
    • For Type, select Cognito.
    • For Cognito user pool, enter Samlfederation (make sure that this matches the name of the user pool that you created).
    • For Token source, enter Authorization.

    With this configuration, you use the user pool authorizer to authenticate GET requests to your REST API that's hosted on API Gateway. In the Cognito user pool dropdown, add the user pool that you created in Step 2: Create an Amazon Cognito user pool. Then choose Create.

  4. Navigate back to APIs > Resources, choose GET, and then choose Method Request.
  5. To add the authorizer that you just created, under Settings, in the Authorization dropdown, choose your authorizer. Remember to save the setting by choosing the small check mark on the right side. If you don't see the Cognito authorizer, wait a few minutes for API Gateway to update.
    Figure 9: Add the Cognito authorizer for the API GET method

Step 5.3: Update the configuration for Amazon Cognito

Now you need to update the Amazon Cognito configuration based on the CloudFront distribution that you deployed using the CloudFormation template in Step 5.1.

  1. Navigate to the CloudFormation console and locate the CloudFormation stack that was deployed. As shown in Figure 10, in the Outputs tab, copy the values for CloudfrontEndpoint and DataManagerApiInvokeUrl because you will need them later.
    Figure 10: Outputs of the CloudFormation template deployment

  2. Navigate to the Amazon Cognito console and go to your user pool. Choose the App integration tab, scroll to the bottom of the page, and for App client name, choose the App client that you added during user pool creation.
  3. On the page for your App client, in the Hosted UI section, choose Edit, and then do the following:
    • For both the Allowed callback URLs and Allowed sign-out URLs, enter the CloudFront endpoint.
    • For OAuth grant types, select Implicit grant.
    • For OpenID Connect scopes, select Email and OpenID.
    Figure 11: Configure the hosted UI for the app client

The Amazon Cognito hosted UI provides an OAuth 2.0 compliant authorization server. It includes the default implementation of end user flows, such as registration and authentication. Because the application interacts with Amazon Cognito through an OAuth 2.0 implicit flow, which requires a redirect, the website needs to use HTTPS.

Note: In a production scenario, instead of implicit flow, an authorization code grant is the preferred method in the OAuth 2.0 framework because it’s more secure.

To have an HTTPS endpoint for the Amazon S3 static website, you can use the CloudFront distribution that was deployed by the CloudFormation template in Step 5.1.

When one of your users successfully logs in to the Active Directory infrastructure, the user is automatically redirected to the callback URL. In this case, that is the CloudFront distribution URL, with the Amazon Cognito ID token and access token appended in the URL fragment (the implicit grant doesn't return a refresh token).
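
For reference, the hosted UI sign-in that the app initiates is a redirect to the user pool's /oauth2/authorize endpoint. The following sketch shows how such a URL can be assembled for the implicit grant; the domain prefix, app client ID, and CloudFront domain are placeholders.

from urllib.parse import urlencode

auth_domain = "https://<domain-prefix>.auth.<region>.amazoncognito.com"
params = {
    "response_type": "token",        # implicit grant returns tokens in the URL fragment
    "client_id": "<app client ID>",
    "redirect_uri": "https://<cloudfront-domain>",
    "scope": "email openid",
}
print(f"{auth_domain}/oauth2/authorize?{urlencode(params)}")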

Step 5.4: Update the configuration for your client-side application, and upload it to Amazon S3

Navigate to the code that you previously cloned in Step 5.1, and perform the following steps:

  1. With a file manager, navigate to the folder where the cloned content is located. Open the DataManager directory.
  2. Open the js folder. Using a text editor, open the config.js file.
  3. From the Amazon Cognito console, copy the app client ID and use it as the value of the userPoolClientId property. You can find the client ID in the App clients menu of the Amazon Cognito console.
  4. Change the value of the Region property to the Region that you are using (for example, us-east-2).
  5. While you are still in the Amazon Cognito console, open the Domain name page, and copy the custom prefix into the value for the authDomainPrefix property.
  6. Open the CloudFormation console and choose the stack that was created in Step 5.1. With the stack selected, open the Outputs tab.
    • Copy the value of the CloudfrontEndpoint output variable to the redirect_uri property.
    • Copy the value of the DataManagerApiInvokeUrl output variable to the invokeUri property.
  7. Copy the files to the S3 bucket that hosts the static website. To upload the files, use the AWS Command Line Interface (AWS CLI) or the Amazon S3 console.

Step 5.5: Insert a row into the DynamoDB table to help test your application

The CloudFormation template that you used in Step 5.1 created a DynamoDB table that you can use to test your application. Now you need to add a row to the table (as shown in the Items returned section of Figure 12), so that you can get some results when you test your application. To add a row, in the left menu, choose Tables, open the table, and then choose Actions > Create item.

The Lambda function that retrieves data from the ADFSSecretData DynamoDB table only returns rows where the email matches the one used to log in to Active Directory. To achieve this, the function reads the event.requestContext.authorizer.claims.email object, which contains the email address that you used to log in to Active Directory.
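
A minimal sketch of such a handler follows. The table name matches this post, but the key schema and response shape are assumptions, not the deployed function.

import json
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("ADFSSecretData")

def lambda_handler(event, context):
    # Email claim injected by the API Gateway Cognito authorizer
    email = event["requestContext"]["authorizer"]["claims"]["email"]
    # Assumes email is the partition key of the table
    result = table.query(KeyConditionExpression=Key("email").eq(email))
    return {"statusCode": 200, "body": json.dumps(result["Items"], default=str)}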

Figure 12: Search result of DynamoDB table

Now you’re ready to test the application.

  1. Open the CloudFront URL in your browser and press Enter. This should immediately take you to the web app landing page. From there, you're automatically redirected to the Amazon Cognito hosted UI. You should see a screen similar to the following that says Sign in with your corporate ID:
    Figure 13: Cognito hosted UI sign-in page

  2. After you choose your SAML provider, you are redirected to your AD FS infrastructure that shows a login screen similar to the following:
    Figure 14: AD FS sign-in page

    Note: If there's an error, make sure that there's a mapping in the hosts file for your AD FS server, with the appropriate hostname or public IP address of the EC2 instance where the AD FS infrastructure is hosted.

    On the login screen, for Username, enter the user’s email address (in our example, that’s Bob’s email address), and for Password, enter the password that you defined in Active Directory, as shown in Figure 14. If the login is successful, you’re redirected back to the web app with a valid ID and access tokens.

    Figure 15: Sample web app home page

  3. Choose Refresh to see the data that you stored in DynamoDB.
    Figure 16: Retrieval of the data from DynamoDB

Summary

In this walkthrough, you federated users from AD FS, and successfully authenticated those users to our REST API that’s hosted on API Gateway.

The SAML federation feature in Amazon Cognito helps you set up and integrate your apps with multiple SAML IdPs. By using the SAML federation capabilities of Amazon Cognito, your apps don't need to handle the type of SAML IdP that they are interacting with. Amazon Cognito takes care of it on behalf of your application.

This article was originally written by Adrian Hall, who was an AWS Solutions Architect when he wrote it.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Leo Drakopoulos

Leo is a Principal Solutions Architect working within the financial services industry. His focus is AWS Serverless and Container-based architectures. He enjoys helping customers adopt a culture of innovation and use cloud-native architectures.

Jun Zhang

Jun is a Solutions Architect based in Zurich. He helps Swiss customers architect cloud-based solutions to achieve their business potential. He has a passion for sustainability and strives to solve current environmental challenges with technology. He is also a huge tennis fan and enjoys playing board games a lot.

How to use Amazon GuardDuty and AWS WAF v2 to automatically block suspicious hosts

Post Syndicated from Eucke Warren original https://aws.amazon.com/blogs/security/how-to-use-amazon-guardduty-and-aws-waf-v2-to-automatically-block-suspicious-hosts/

In this post, we’ll share an automation pattern that you can use to automatically detect and block suspicious hosts that are attempting to access your Amazon Web Services (AWS) resources. The automation will rely on Amazon GuardDuty to generate findings about the suspicious hosts, and then you can respond to those findings by programmatically updating AWS WAF to block the host from accessing your workloads.

You should implement security measures across your AWS resources by using a holistic approach that incorporates controls across multiple areas. In the AWS CAF Security Perspective section of the AWS Security Incident Response Guide, we define these controls across four categories:

  • Directive controls — Establish the governance, risk, and compliance models the environment will operate within
  • Preventive controls — Protect your workloads and mitigate threats and vulnerabilities
  • Detective controls — Provide full visibility and transparency over the operation of your deployments in AWS
  • Responsive controls — Drive remediation of potential deviations from your security baselines

Security automation is a key principle outlined in the Response Guide. It helps reduce operational overhead and creates repeatable, predictable approaches to monitoring and responding to events. AWS services provide the building blocks to create powerful patterns for the automated detection and remediation of threats against your AWS environments. You can configure automated flows that use both detective and responsive controls and might also feed into preventative controls to help mitigate risks in the future. Depending on the type of source event, you can automatically invoke specific actions, such as modifying access controls, terminating instances, or revoking credentials.

The patterns highlighted in this post provide an example of how to automatically remediate detected threats. You should modify these patterns to suit your defined requirements, and test and validate them before deploying them in a production environment.

AWS services used for the example pattern

Amazon GuardDuty is a continuous security monitoring and threat detection service that incorporates threat intelligence, anomaly detection, and machine learning to help protect your AWS resources, including your AWS accounts. Amazon EventBridge delivers a near-real-time stream of system events that describe changes in AWS resources. Amazon GuardDuty sends events to Amazon EventBridge when a change in the findings takes place. In the context of GuardDuty, such changes include newly generated findings and subsequent occurrences of these findings. You can quickly set up rules to match events generated by GuardDuty findings in EventBridge and route those events to one or more target actions. The pattern in this post routes matched events to AWS Lambda, which then updates AWS WAF web access control lists (web ACLs) and Amazon Virtual Private Cloud (Amazon VPC) network access control lists (network ACLs). AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources. It supports both managed rules and a powerful rule language for custom rules. A network ACL is stateless and is an optional layer of security for your VPC that helps you restrict specific inbound and outbound traffic at the subnet level.

Pattern overview

This example pattern assumes that Amazon GuardDuty is enabled in your AWS account. If it isn’t enabled, you can learn more about the free trial and pricing, and follow the steps in the GuardDuty documentation to configure the service and start monitoring your account. The example code will only work in the us-east-1 AWS Region due to the use of Amazon CloudFront and web ACLs within the template.

Figure 1 shows how the AWS CloudFormation template creates the sample pattern.

Figure 1: How the CloudFormation template works

Here’s how the pattern works, as shown in the diagram:

  1. A GuardDuty finding is generated due to suspected malicious activity.
  2. An EventBridge event is configured to filter for GuardDuty finding types by using event patterns.
  3. A Lambda function is invoked by the EventBridge event and parses the GuardDuty finding.
  4. The Lambda function checks the Amazon DynamoDB state table for an existing entry that matches the identified host. If state data is not found in the table for the identified host, a new entry is created in the Amazon DynamoDB state table.
  5. The Lambda function creates a web ACL rule inside AWS WAF and updates a subnet network ACL.
  6. A notification email is sent through Amazon Simple Notification Service (SNS).

A second Lambda function runs on a 5-minute recurring schedule and removes entries that are past the configurable retention period from AWS WAF IPSets (an IPSet is a list that contains the blocklisted IPs or CIDRs), VPC network ACLs, and the DynamoDB table.
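
As a simplified sketch of step 5 (not the deployed Lambda code), adding an address to an AWS WAF IP set with boto3 requires reading the set's lock token first; the IP set name and ID below are placeholders.

import boto3

# Scope is CLOUDFRONT for the global web ACL; use REGIONAL for the regional web ACL
wafv2 = boto3.client("wafv2", region_name="us-east-1")

ip_set = wafv2.get_ip_set(Name="<ip-set-name>", Scope="CLOUDFRONT", Id="<ip-set-id>")
addresses = set(ip_set["IPSet"]["Addresses"])
addresses.add("198.51.100.0/32")  # suspicious host taken from the GuardDuty finding

wafv2.update_ip_set(
    Name="<ip-set-name>",
    Scope="CLOUDFRONT",
    Id="<ip-set-id>",
    Addresses=list(addresses),
    LockToken=ip_set["LockToken"],  # optimistic-locking token returned by get_ip_set
)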

GuardDuty prefix patterns and findings

The EventBridge event rule provided by the example automation uses the following seven prefix patterns, which allow coverage for 36 GuardDuty finding types. These specific finding types are of a network nature, and so we can use AWS WAF to block them. Be sure to read through the full list of finding types in the GuardDuty documentation to better understand what GuardDuty can report findings for. The covered findings are as follows:

  1. UnauthorizedAccess:EC2
    • UnauthorizedAccess:EC2/MaliciousIPCaller.Custom
    • UnauthorizedAccess:EC2/MetadataDNSRebind
    • UnauthorizedAccess:EC2/RDPBruteForce
    • UnauthorizedAccess:EC2/SSHBruteForce
    • UnauthorizedAccess:EC2/TorClient
    • UnauthorizedAccess:EC2/TorRelay
  2. Recon:EC2
    • Recon:EC2/PortProbeEMRUnprotectedPort
    • Recon:EC2/PortProbeUnprotectedPort
    • Recon:EC2/Portscan
  3. Trojan:EC2
    • Trojan:EC2/BlackholeTraffic
    • Trojan:EC2/BlackholeTraffic!DNS
    • Trojan:EC2/DGADomainRequest.B
    • Trojan:EC2/DGADomainRequest.C!DNS
    • Trojan:EC2/DNSDataExfiltration
    • Trojan:EC2/DriveBySourceTraffic!DNS
    • Trojan:EC2/DropPoint
    • Trojan:EC2/DropPoint!DNS
    • Trojan:EC2/PhishingDomainRequest!DNS
  4. Backdoor:EC2
    • Backdoor:EC2/C&CActivity.B
    • Backdoor:EC2/C&CActivity.B!DNS
    • Backdoor:EC2/DenialOfService.Dns
    • Backdoor:EC2/DenialOfService.Tcp
    • Backdoor:EC2/DenialOfService.Udp
    • Backdoor:EC2/DenialOfService.UdpOnTcpPorts
    • Backdoor:EC2/DenialOfService.UnusualProtocol
    • Backdoor:EC2/Spambot
  5. Impact:EC2
    • Impact:EC2/AbusedDomainRequest.Reputation
    • Impact:EC2/BitcoinDomainRequest.Reputation
    • Impact:EC2/MaliciousDomainRequest.Reputation
    • Impact:EC2/PortSweep
    • Impact:EC2/SuspiciousDomainRequest.Reputation
    • Impact:EC2/WinRMBruteForce
  6. CryptoCurrency:EC2
    • CryptoCurrency:EC2/BitcoinTool.B
    • CryptoCurrency:EC2/BitcoinTool.B!DNS
  7. Behavior:EC2
    • Behavior:EC2/NetworkPortUnusual
    • Behavior:EC2/TrafficVolumeUnusual

When activity occurs that generates one of these GuardDuty finding types and is then matched by the EventBridge event rule, an entry is created in the target web ACLs and subnet network ACLs to deny access from the suspicious host, and then a notification is sent to an email address by this pattern’s Lambda function. Blocking traffic from the suspicious host helps to mitigate potential threats while you perform additional investigation and remediation. For more information, see Remediating a compromised EC2 instance.
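
Prefix matching in the EventBridge event pattern is what lets seven entries cover all 36 finding types. The rule in this solution is created by the CloudFormation template; the following sketch only illustrates the shape of such a pattern, and the rule name is a placeholder.

import json
import boto3

events = boto3.client("events")

# Matches any GuardDuty finding whose type starts with one of the seven prefixes
event_pattern = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
    "detail": {"type": [{"prefix": p} for p in [
        "UnauthorizedAccess:EC2", "Recon:EC2", "Trojan:EC2", "Backdoor:EC2",
        "Impact:EC2", "CryptoCurrency:EC2", "Behavior:EC2",
    ]]},
}

events.put_rule(Name="guardduty-to-waf-demo", EventPattern=json.dumps(event_pattern))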

Solution deployment

To deploy the solution, you’ll do the following steps. Each step is described in more detail in the sections that follow.

  1. Download the required files.
  2. Create your Amazon Simple Storage Service (Amazon S3) bucket and upload the .zip files.
  3. Deploy the CloudFormation template.
  4. Create and test the Lambda function for a GuardDuty finding event.
  5. Confirm the entry for the test event in the VPC network ACL.
  6. Confirm the entry in the AWS WAF IP sets.
  7. Confirm the SNS notification email alert.
  8. Apply the AWS WAF web ACLs to resources.

Step 1: Download the required files

Download the following four files from the amazon-guardduty-waf-acl GitHub code repository:

  1. CloudFormation template – Copy and save the linked raw text, using the file name guarddutytoacl.template on your local file system.
  2. JSON event test file – Copy and save the linked raw text, using the file name gd2acl_test_event.json on your local file system.
  3. guardduty_to_acl_lambda_wafv2.zip – Choose the Download button on the GitHub page and save the .zip file to your local file system.
  4. prune_old_entries_wafv2.zip – Choose the Download button on the GitHub page and save the .zip file to your local file system.

Step 2: Create your S3 bucket and upload .zip files

For this step, create an S3 bucket with public access blocked, and then upload the Lambda .zip files to the newly created S3 bucket.

To create your S3 bucket and upload .zip files

  1. Create an S3 bucket in the us-east-1 Region.
  2. Upload the .zip files guardduty_to_acl_lambda_wafv2.zip and prune_old_entries_wafv2.zip that you saved to your local file system in Step 1 to the newly created S3 bucket.

Step 3: Deploy the CloudFormation template

For this step, deploy the CloudFormation template only to the us-east-1 Region within the AWS account where GuardDuty findings are to be monitored.

To deploy the CloudFormation template

  1. Sign in to the AWS Management Console, choose the CloudFormation service, and set N.Virginia (us-east-1) as the Region.
  2. Choose Create stack, and then choose With new resources (standard).
  3. When the Create stack landing page is presented, make sure that Template is ready is selected in the Prepare template section. In the Template source section, choose Upload a template file.
  4. Choose the Choose file button and browse to the location where the guarddutytoacl.template file was saved on your local file system. Select the file, choose Open, and then choose Next.
  5. On the Specify stack details page, provide the following input parameters. You can modify the default values to customize the pattern for your environment.

    • Notification email – The email address to receive notifications. Must be a valid email address.
    • Retention time, in minutes – How long to retain IP addresses in the blocklist (in minutes). The default is 12 hours.
    • S3 bucket for artifacts – The S3 bucket with artifact files (Lambda functions, templates, HTML files, and so on). Keep the default value for deployment into the N. Virginia Region.
    • S3 path to artifacts – The path in the S3 bucket that contains artifact files. Keep the default value for deployment into the N. Virginia Region.
    • CloudFrontWebACL – Whether to create a CloudFront web ACL. If set to true, a CloudFront IP set is created automatically.
    • RegionalWebACL – Whether to create a Regional web ACL. If set to true, a Regional IP set is created automatically.

    Figure 2 shows an example of the values entered on this page.

    Figure 2: CloudFormation parameters on the Specify stack details page

  6. Enter values for all of the input parameters, and then choose Next.
  7. On the Configure stack options page, accept the defaults, and then choose Next.
  8. On the Review page, confirm the details, check the box acknowledging that the template will require capabilities for AWS::IAM::Role, and then choose Create Stack.

    The stack normally requires no more than 3–5 minutes to complete.

  9. While the stack is being created, check the email inbox that you specified for the Notification email address parameter. Look for an email message with the subject “AWS Notification – Subscription Confirmation”. Choose the link in the email to confirm the subscription to the SNS topic. You should see a message similar to the following.
    Figure 3: Subscription confirmation

When the Status field for the CloudFormation stack changes to CREATE_COMPLETE, as shown in Figure 4, the pattern is implemented and is ready for testing.

Figure 4: The stack status is CREATE_COMPLETE

Step 4: Create and test the Lambda function for a GuardDuty finding event

After the CloudFormation stack has completed deployment, you can test the functionality by using a Lambda test event.

To create and run a Lambda GuardDuty finding test event

  1. In the AWS Management Console, choose Services > VPC > Subnets and locate a subnet that is suitable for testing the pattern.
  2. On the Details tab, copy the subnet ID to the clipboard or to a text editor.
    Figure 5: The subnet ID value on the Details tab

  3. In the AWS Management Console, choose Services > CloudFormation > GuardDutytoACL stack. On the Outputs tab for the stack, look for the GuardDutytoACLLambda entry.
    Figure 6: The GuardDutytoACLLambda entry on the Outputs tab

  4. Choose the link for the entry, and you’ll be redirected to the Lambda console, with the Lambda Code source page already open.
    Figure 7: The Lambda function open in the Lambda console

  5. In the middle of the Code source menu, in the Test dropdown list, locate and select the Configure test event option.
    Figure 8: Select Configure test event from the dropdown list

  6. To facilitate testing, we’ve provided a test event file. On the Configure test event page, do the following:
    1. For Event name, enter a name.
    2. In the body of the Event JSON field, paste the provided test event JSON, overwriting the existing contents.
    3. Update the value of the SubnetId key (line 35) to the value of the subnet ID that you chose in Step 1 of this procedure.
    4. Choose Save.
    Figure 9: Update the value of the subnetId key

  7. Choose Test to invoke the Lambda function with the test event. You should see the message “Status: succeeded” at the top of the execution results, similar to what is shown in Figure 10.
    Figure 10: The Test button and the "succeeded" message

Step 5: Confirm the entry in the VPC network ACL

In this step, you’ll confirm that the DENY entry was created in the network ACL. This pattern is configured to create up to 10 entries in an ACL, ranging between rule numbers 71 and 80. Because network ACL rules are processed in order, it’s important that the DENY rule is placed before the ALLOW rule.
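
For reference, a single DENY entry of the kind this pattern manages can be created with one boto3 call; the network ACL ID and CIDR below are placeholders, and the deployed Lambda function layers rule-number rotation on top of this.

import boto3

ec2 = boto3.client("ec2")

# Rule 71 denies all traffic from the suspicious host; lower rule numbers are
# evaluated first, so this DENY takes effect before the higher-numbered ALLOW rules
ec2.create_network_acl_entry(
    NetworkAclId="acl-0123456789abcdef0",  # placeholder
    RuleNumber=71,
    Protocol="-1",           # all protocols
    RuleAction="deny",
    Egress=False,            # inbound rule
    CidrBlock="198.51.100.0/32",
)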

To confirm the entry in the VPC network ACL

  1. In the AWS Management Console, choose Services > VPC > Subnets, and locate the subnet you provided for the test event.
  2. Choose the network ACL link and confirm that the new DENY entry was generated from the test event.
    Figure 11: Check the entry from the test event on the Network tab

    Note that VPC network ACL entries are created in the rule number range between 71 and 80. Older entries are aged out to create a “sliding window” of blocked hosts.

Step 6: Confirm the entry in the AWS WAF IP sets and blocklists

Next, verify that the entry was added to the CloudFront AWS WAF IP set and to the Application Load Balancer (ALB) AWS WAF IP set.

To confirm the entry in the AWS WAF IP set and blocklist

  1. In the AWS Management Console, choose Services > WAF & Shield > Web ACLs, and then set the selected Region to Global (CloudFront).
  2. Find and select the web ACL name that starts with CloudFrontBlockListWeb. In the Rule view, on the Rules tab, select the rule named CloudFrontBlocklistIPSetRule. Note that 198.51.100.0/32 appears as an entry in the rule.
    Figure 12: Confirm that the IP address was added

  3. In the AWS Management Console, on the left navigation menu, choose Web ACLs, and then set the selected Region to US East (N. Virginia).
  4. Find and select the web ACL name that starts with RegionalBlocklistACL. In the Rule view, on the Rules tab, select the rule named RegionalBlocklistIPSetRule. Note that 198.51.100.0/32 appears as an entry in the rule.
    Figure 13: Make sure that the IP address was added

There might be specific host addresses that you want to prevent from being added to the blocklist. You can do this within GuardDuty by using a trusted IP list. Trusted IP lists consist of IP addresses that you have allowlisted for secure communication with your AWS infrastructure and applications. GuardDuty doesn’t generate findings for IP addresses on trusted IP lists. For more information, see Working with trusted IP lists and threat lists.

Step 7: Confirm the SNS notification email

Finally, verify that the SNS notification was sent to the email address you set up.

To confirm receipt of the SNS notification email

  • Review the email inbox that you specified for the AdminEmail parameter and look for a message with the subject line “AWS GD2ACL Alert”. The contents of the message from SNS should be similar to the following.
    Figure 14: SNS message example

Step 8: Apply the AWS WAF web ACLs to resources

The final task is to associate the web ACL with the CloudFront distributions and Application Load Balancers that you want to automatically update with this pattern. To learn how to do this, see Associating or disassociating a web ACL with an AWS resource.

You can also use AWS Firewall Manager to associate the web ACLs. AWS Firewall Manager can simplify your AWS WAF administration and maintenance tasks across multiple accounts and resources. With Firewall Manager, you set up your firewall rules just once. The service automatically applies your rules across your accounts and resources, even as you add new resources.

Conclusion

In this post, you’ve learned how to use Lambda to automatically update AWS WAF and VPC network ACLs in response to GuardDuty findings. With just a few steps, you can use this sample pattern to help mitigate threats by blocking communication with suspicious hosts. You can explore additional possible patterns by using GuardDuty finding types and Amazon EventBridge target actions. This pattern’s code is available on GitHub. Feel free to play around with the code to add more GuardDuty findings to this pattern and also to build bigger and better patterns! Make sure to modify the patterns in this post to suit your defined requirements, and test and validate them before deploying them in a production environment.

If you have comments about this blog post, you can submit them in the Comments section below. If you have questions about using this pattern, start a thread in the GuardDuty, AWS WAF, or CloudWatch forums, or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Eucke Warren

Eucke is a Sr Solution Architect helping ISV customers grow and mature securely. He has been fortunate to be able to work with technology for more than 30 years and counts automation, infrastructure, and security as areas of focus. When he’s not supporting customers, he enjoys time with his wife, family, and the company of a very bossy 18-pound dog.

Geoff Sweet

Geoff has been in industry for over 20 years. He began his career in Electrical Engineering. Starting in IT during the dot-com boom, he has held a variety of diverse roles, such as systems architect, network architect, and, for the past several years, security architect. Geoff specializes in infrastructure security.

Enabling DevSecOps with Amazon CodeCatalyst

Post Syndicated from Imtranur Rahman original https://aws.amazon.com/blogs/devops/enabling-devsecops-with-amazon-codecatalyst/

DevSecOps is the practice of integrating security testing at every stage of the software development process. Amazon CodeCatalyst includes tools that encourage collaboration between developers, security specialists, and operations teams to build software that is both efficient and secure. DevSecOps brings cultural transformation that makes security a shared responsibility for everyone who is building the software.

Introduction

In a prior post in this series, Maintaining Code Quality with Amazon CodeCatalyst Reports, I discussed how developers can quickly configure test cases, run unit tests, set up code coverage, and generate reports using CodeCatalyst’s workflow actions. This was done through the lens of Maxine, the main character of Gene Kim’s The Unicorn Project. In the story, Maxine meets Purna – the QA and Release Manager and Shannon – a Security Engineer. Everyone has the same common goal to integrate security into every stage of the Software Development Lifecycle (SDLC) to ensure secure code deployments. The issue Maxine faces is that security testing is not automated and the separation of responsibilities by role leads to project stagnation.

In this post, I will focus on how DevSecOps teams can use Amazon CodeCatalyst to easily integrate and automate security using CodeCatalyst workflows. I’ll start by checking for vulnerabilities using OWASP dependency checker and Mend SCA. Then, I’ll conduct Static Analysis (SA) of source code using Pylint. I will also outline how DevSecOps teams can influence the outcome of a build by defining success criteria for Software Composition Analysis (SCA) and Static Analysis actions in the workflow. Last, I’ll show you how to gain insights from CodeCatalyst reports and surface potential issues to development teams through CodeCatalyst Issues for faster remediation.

Prerequisites

If you would like to follow along with this walkthrough, you will need to:

Walkthrough

To follow along, you can re-use a project you created previously, or you can refer to a previous post that walks through creating a project using the Modern Three-tier Web Application blueprint. Blueprints provide sample code and CI/CD workflows to help you get started easily across different combinations of programming languages and architectures. The back-end code for this project is written in Python and the front-end code is written in JavaScript.


Figure 1. Modern Three-tier Web Application architecture including a presentation, application and data layer

Once the project is deployed, CodeCatalyst opens the project overview. Select CI/CD → Workflows → ApplicationDeploymentPipeline to view the current workflow.

Six step Workflow described in the prior paragraph

Figure 2. ApplicationDeploymentPipeline

Modern applications use a wide array of open-source dependencies to speed up feature development, but sometimes these dependencies have unknown exploits within them. As a DevSecOps engineer, I can easily edit this workflow to scan for those vulnerable dependencies to ensure I’m delivering secure code.

Software Composition Analysis (SCA)

Software composition analysis (SCA) is a practice in information technology and software engineering for analyzing custom-built software applications to detect embedded open-source components and determine whether they are up to date, contain security flaws, or have licensing requirements. For this walkthrough, I'll highlight two SCA methods: OWASP Dependency-Check and Mend SCA.

Note that developers can replace either of these with a tool of their choice so long as that tool outputs an SCA report format supported by CodeCatalyst.

Software Composition Analysis using OWASP Dependency Checker

To get started, I select Edit at the top-right of the workflows tab. By default, CodeCatalyst opens the YAML tab. I change to the Visual tab to visually edit the workflow and add a CodeCatalyst Action by selecting “+Actions” (1) and then “+” (2). Next select the Configuration (3) tab and edit the Action Name (4). Make sure to select the check mark after you’re done.

New action configuration showing steps to add a build action

Figure 3. New Action Initial Configuration

Scroll down in the Configuration tab to Shell commands. Here, copy and paste the following command snippets, which run when the action is invoked.

#Set Source Repo Directory to variable
- Run: sourceRepositoryDirectory=$(pwd)
#Install Node Dependencies
- Run: cd web && npm install
#Install known vulnerable dependency (This is for Demonstrative Purposes Only)
- Run: npm install [email protected]
#Go to parent directory and download OWASP dependency-check CLI tool
- Run: cd .. && wget https://github.com/jeremylong/DependencyCheck/releases/download/v8.1.2/dependency-check-8.1.2-release.zip
#Unzip file
- Run: unzip dependency-check-8.1.2-release.zip
#Navigate to dependency-check script location
- Run: cd dependency-check/bin
#Execute dependency-check shell script. Outputs in SARIF format
- Run: ./dependency-check.sh --scan $sourceRepositoryDirectory/web -o $sourceRepositoryDirectory/web/vulnerabilities -f SARIF --disableYarnAudit

These commands will install the node dependencies, download the OWASP dependency-check tool, and run it to generate findings in a SARIF file. Note the third command, which installs a module with known vulnerabilities (This is for demonstrative purposes only).

On the Outputs (1) tab, I change the Report prefix (2) to owasp-frontend. Then I set the Success criteria (3) for Vulnerabilities to 0 – Critical (4). This configuration will stop the workflow if any critical vulnerabilities are found.

Report configuration showing SCA configuration

Figure 4: owasp-dependecy-check-frontend

It is a best practice to scan for vulnerable dependencies before deploying resources so I’ll set my owasp-dependency-check-frontend action as the first step in the workflow. Otherwise, I might accidentally deploy vulnerable code. To do this, I select the Build (1) action group and set the Depends on (2) dropdown to my owasp-dependency-check-frontend action. Now, my action will run before any resources are built and deployed to my AWS environment. To save my changes and run the workflow, I select Commit (3) and provide a commit message.

Setting OWASP as the First Action

Figure 5: Setting OWASP as the First Workflow Action

Amazon CodeCatalyst shows me the state of the workflow run in real time. After the workflow completes, I see that the action has entered a failed state. If I were a QA Manager like Purna from The Unicorn Project, I would want to see why the action failed. On the left-hand navigation bar, I select Reports, and then choose owasp-frontend-web/vulnerabilities/dependency-check-report.sarif for more details.

SCA report showing 1 critical and 7 medium findings

Figure 6: SCA Report Overview

This report view provides metadata such as the workflow name, run ID, action name, repository, and the commit ID. I can also see the report status, a bar graph of vulnerabilities grouped by severity, the number of libraries scanned, and a Findings panel. I had set the success criteria for this report to 0 – Critical so it failed because 1 Critical vulnerability was found. If I select a specific finding ID, I can learn more about that specific finding and even view it on the National Vulnerability Database website.

Dialog showing CVE details for the critical vulnerability

Figure 7: Critical Vulnerability CVE Finding

Now I can raise this issue with the development team through the Issues board on the left-hand navigation panel. See this previous post to learn more about how teams can collaborate in CodeCatalyst.

Note: Remove the [email protected] install command from the owasp-dependency-check-frontend action’s list of commands to allow the workflow to proceed and finish successfully.

Software Composition Analysis using Mend

Mend, formerly known as WhiteSource, is an application security company built to secure today’s digital world. Mend secures all aspects of software, providing automated remediation, prevention, and protection from problem to solution versus only detection and suggested fixes. Find more information about Mend here.

Mend Software Composition Analysis (SCA) can be run as an action within Amazon CodeCatalyst CI/CD workflows, making it easy for developers to perform open-source software vulnerability detection when building and deploying their software projects. This makes it easier for development teams to quickly build and deliver secure applications on AWS.

Getting started with CodeCatalyst and Mend is very easy. After logging in to my Mend Account, I need to create a new Mend Product named Amazon-CodeCatalyst and a Project named mythical-misfits.

Next, I navigate back to my existing workflow in CodeCatalyst and add a new action. However, this time I’ll select the Mend SCA action.

Adding the Mend action

Figure 8: Mend Action

All I need to do now is go to the Configuration tab and set the following values:

  • Mend Project Name: mythical-misfits
  • Mend Product Name: Amazon-CodeCatalyst
  • Mend License Key: You can get the license key from your Mend account in the CI/CD Integration section. You can find more information here.

Mend Action Configuration

Figure 9: Mend Action Configuration

Then I commit the changes and return to Mend.

Mend console showing analysis of the Mythical Mysfits app

Figure 10: Mend Console

After successful execution, Mend will automatically update and show a report similar to the screenshot above. It contains useful information about this project like vulnerabilities, licenses, policy violations, etc. To learn more about the various capabilities of Mend SCA, see the documentation here.

Static Analysis (SA)

Static analysis, also called static code analysis, is a method of debugging that is done by examining the code without executing the program. The process provides an understanding of the code structure and can help ensure that the code adheres to industry standards. Static analysis is used in software engineering by software development and quality assurance teams.

Currently, my workflow does not do static analysis. As a DevSecOps engineer, I can add this as a step to the workflow. For this walkthrough, I’ll create an action that uses Pylint to scan my Python source code for Static Analysis. Note that you can also use other static analysis tools or a GitHub Action like SuperLinter, as covered in this previous post.

Static Analysis using Pylint

After navigating back to CI/CD → Workflows → ApplicationDeploymentPipeline and selecting Edit, I create a new test action. I change the action name to pylint and set the Configuration tab to run the following shell commands:

- Run: pip install pylint 
- Run: pylint $PWD --recursive=y --output-format=json:pylint-report.json --exit-zero
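To give a sense of what ends up in the report, the following small Python snippet is a hypothetical example (not part of the Mythical Misfits code) containing issues Pylint typically flags, such as an unused import (W0611) and an undefined variable (E0602).

# sample.py - hypothetical file used only to illustrate typical Pylint findings
import os          # W0611: unused-import


def greet(name):
    # E0602: undefined-variable, because 'greeting' is never defined
    print(greeting + name)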

On the Outputs tab, I change the Report prefix to pylint. Then I set the Success criteria for Static analysis as shown in the figure below:

Report configuration tab showing static analysis configuration

Figure 11: Static Analysis Report Configuration

Because static analysis is typically run before any execution, either the pylint or the OWASP action should be the very first action in the workflow. For the sake of this blog, we will use pylint. I select the OWASP and Mend actions I created earlier, set their Depends on dropdowns to my pylint action, and commit the changes. Once the workflow finishes, I can go to Reports > pylint-pylint-report.json for more details.

Static analysis report showing 7 high findings

Figure 12: Pylint Static Analysis Report

The report status is Failed because more than one high-severity (or higher) finding was detected. On the Results tab, I can view each finding in greater detail, including the severity, the type of finding, the message from the linter, and the specific line the error originates from.

Cleanup

If you have been following along with this workflow, you should delete the resources you deployed so you do not continue to incur charges. First, delete the two stacks that AWS Cloud Development Kit (CDK) deployed using the AWS CloudFormation console in the AWS account you associated when you launched the blueprint. These stacks will have names like mysfitsXXXXXWebStack and mysfitsXXXXXAppStack. Second, delete the project from CodeCatalyst by navigating to Project settings and choosing Delete project.

Conclusion

In this post, I demonstrated how DevSecOps teams can easily integrate security into Amazon CodeCatalyst workflows to automate security testing by checking for vulnerabilities using OWASP dependency checker or Mend through Software Composition Analysis (SCA) of dependencies. I also outlined how DevSecOps teams can configure Static Analysis (SA) reports and use success criteria to influence the outcome of a workflow action.

Imtranur Rahman

Imtranur Rahman is an experienced Sr. Solutions Architect in the WWPS team with 14+ years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and broad adoption of Amazon’s cloud computing platform. Imtranur specializes in containers, Dev/SecOps, GitOps, microservices-based applications, hybrid application solutions, and application modernization, and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Wasay Mabood

Wasay is a Partner Solutions Architect based out of New York. He works primarily with AWS Partners on migration, training, and compliance efforts but also dabbles in web development. When he’s not working with customers, he enjoys window-shopping, lounging around at home, and experimenting with new ideas.

AWS Glue crawlers support cross-account crawling to support data mesh architecture

Post Syndicated from Sandeep Adwankar original https://aws.amazon.com/blogs/big-data/aws-glue-crawlers-support-cross-account-crawling-to-support-data-mesh-architecture/

Data lakes have come a long way, and there’s been tremendous innovation in this space. Today’s modern data lakes are cloud native, work with multiple data types, and make this data easily available to diverse stakeholders across the business. As time has gone by, data lakes have grown significantly and have evolved to data meshes as a way to scale. Thoughtworks defines a data mesh as “a shift in a modern distributed architecture that applies platform thinking to create self-serve data infrastructure, treating data as the product.”

Data mesh advocates for decentralized ownership and delivery of enterprise data management systems that benefit several personas. Data producers can use the data mesh platform to create datasets and share them across business teams to ensure data availability, reliability, and interoperability across functions and data subject areas. Data consumers now have better data sharing with data mesh and federation across business units without compromising data security. The data governance team can support distributed data, where all data is accessible to those with the proper authority to access it. With data mesh, data doesn’t have to be consolidated into a single data lake or account and can remain within different databases and data lakes. An essential capability needed in such a data lake architecture is the ability to continuously understand changes in the data lakes in various other domains and make those available to data consumers. Without such a capability, manual work is needed to understand producers’ updates and make them available to consumers and governance.

AWS customers use a modern data architecture to facilitate governance and data sharing across logical or physical governance boundaries to create data domains aligned to lines of business. Each line of business creates and manages their dataset on Amazon Simple Storage Service (Amazon S3) and uses AWS Glue crawlers to discover new datasets and register them to the AWS Glue Data Catalog, add new tables and partitions, and detect schema changes. These datasets are shared with data consumers that access the data using services like Amazon Athena, Amazon Redshift, Amazon EMR, and more.

In the post Introducing AWS Glue crawlers using AWS Lake Formation permission management, we introduced a new set of capabilities in AWS Glue crawlers and AWS Lake Formation that simplifies crawler setup and supports centralized permissions for in-account and cross-account crawling of S3 data lakes. In this post, we demonstrate the same capability for a data mesh architecture in which we establish a central governance layer to catalog the data owned by the data producer and share it with the data consumer for ease of discovery. The AWS Glue crawler cross-account capability allows you to crawl data sources in different producer accounts while still having those changes cataloged in a centralized governance account. Customers prefer this central governance experience over writing bucket policies separately in each bucket-owning producer account of the data mesh. To build a data mesh architecture, you can now author permissions in a single Lake Formation governance account to manage access to data locations and crawlers spanning multiple accounts in the data mesh.

According to the Allstate Corporation:

“By leveraging the power of AWS Lake Formation in our modern data architecture, we will be able to further unlock the potential of our data and empower our analytics community to drive innovation and build data-driven applications. The granular data access and collaboration provided by this architecture will enable us to build a truly unified data and analytics experience, bringing us one step closer to realizing our vision of becoming a fully data-driven enterprise.”

– Prashant Mehrotra, Director – Machine Learning and R&D, Allstate

In this post, we walk through the creation of a simplified data mesh architecture that shows how to use an AWS Glue crawler with Lake Formation to automate bringing changes from data producer domains to data consumers while maintaining centralized governance.

Solution overview

In a data mesh architecture, you have several producer accounts that own S3 buckets, several consumer accounts that want to access shared datasets, and a central governance account to manage data shares between producers and consumers. This central governance account doesn’t own any S3 buckets or actual tables.

The following figure shows a simplified data mesh architecture with a single producer account, a centralized governance account, and a single consumer account. The data mesh producer account hosts the encrypted S3 bucket, which is shared with the central governance account. The central governance account registers the S3 bucket with Lake Formation using an AWS Identity and Access Management (IAM) role, which has permissions to the S3 bucket and AWS Key Management Service (AWS KMS). The central account creates the database for storing the dataset schema and shares it with the producer account. The producer account, as the S3 bucket owner, runs a crawler to crawl the buckets registered with the central account using Lake Formation permissions and populates the database. Now the shared database with new datasets is available to share with consumers in the data mesh. The central governance account can now share the database with a consumer admin, who can delegate access to other personas (such as data analysts) in the consumer account for data access.

Figure: A simplified data mesh architecture with a single producer account, a centralized governance account, and a single consumer account

In the following sections, we provide AWS CloudFormation templates to set up the resources in each account. Then we provide the steps to configure the crawler, manage permissions and sharing, and validate the solution by running queries with Athena.

Prerequisites

Complete the following steps in each account (producer, central governance, and consumer) to update the Data Catalog settings to use Lake Formation permissions to control catalog resources instead of IAM-based access control:

  1. Sign in to the Lake Formation console as admin.
  2. If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
  3. In the navigation pane, under Data catalog, choose Settings.
  4. Uncheck Use only IAM access control for new databases.
  5. Uncheck Use only IAM access control for new tables in new databases.
  6. Keep Version 3 as the current cross-account version.
  7. Choose Save.
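If you prefer to script this prerequisite instead of repeating the console steps in all three accounts, the following boto3 sketch shows one possible approach. It is not part of the provided CloudFormation templates; the CROSS_ACCOUNT_VERSION parameter name and the empty default-permissions lists are assumptions you should verify against the Lake Formation PutDataLakeSettings API before relying on this.

import boto3

def use_lake_formation_permissions(region: str = "us-east-1") -> None:
    # Sketch: switch new databases and tables from IAM-only access control to
    # Lake Formation permissions, and keep cross-account version 3.
    lakeformation = boto3.client("lakeformation", region_name=region)

    # Read the current settings so existing data lake admins are preserved
    settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]

    # Empty lists correspond to unchecking "Use only IAM access control"
    settings["CreateDatabaseDefaultPermissions"] = []
    settings["CreateTableDefaultPermissions"] = []

    # Assumed parameter name for the cross-account version setting
    parameters = settings.get("Parameters", {})
    parameters["CROSS_ACCOUNT_VERSION"] = "3"
    settings["Parameters"] = parameters

    lakeformation.put_data_lake_settings(DataLakeSettings=settings)

Run it once per account (producer, central governance, and consumer) with data lake administrator credentials.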

Set up resources in the central governance account

The CloudFormation template for the central account creates a CentralDataMeshOwner user assigned as Lake Formation admin. The CentralDataMeshOwner user in the central governance account performs the necessary steps to share the central catalogs with the producer and consumer accounts. The CentralDataMeshOwner user also sets up a custom Lake Formation service role to register the S3 data lake location. Complete the following steps:

  1. Log in to the central governance account console as IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. For DataMeshOwnerUserName, keep the default (CentralDataMeshOwner).
  4. For ProducerAWSAccount, enter the producer account ID.
  5. Create the stack.
  6. After the stack launches, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
  7. Note down the value of RegisterLocationServiceRole.
  8. Choose the LFUsersPassword value to navigate to the AWS Secrets Manager console.
  9. In the Secret value section, choose Retrieve secret value.
  10. Note down the secret value for the password for IAM user CentralDataMeshOwner.

Set up resources in the producer account

The CloudFormation template for the producer account creates the following resources:

  • IAM user LOBProducerSteward
  • S3 bucket retail-datalake-<producer account id>-<producer region>
  • KMS key used for bucket encryption
  • Required S3 bucket policies to provide access to the central governance account
  • AWS Glue crawler and crawler IAM role with necessary permissions

Complete the following steps:

  1. Log in to the producer account console as IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. For CentralAccountID, enter the central account ID.
  4. For CentralAccountLFServiceRole, enter the value of RegisterLocationServiceRole from CloudFormation noted earlier.
  5. Create the stack.
  6. When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
  7. Note down the AWSGlueServiceRole value.
  8. Choose the ProducerStewardUserCredentials value to navigate to the Secrets Manager console.
  9. In the Secret value section, choose Retrieve secret value.
  10. Note down the secret value for the password for IAM user LOBProducerSteward.
  11. On the Amazon S3 console, check the bucket policies for retail-datalake-<producer account id>-<producer region> and make sure the bucket is shared with the central governance account IAM role.

This is required for registering the bucket with Lake Formation in the central account so that the account can manage the data sharing.

  1. On the AWS KMS console, check that the bucket is encrypted with the customer managed key and the key is shared with the central governance account.

Set up resources in the consumer account

The CloudFormation template for the consumer account creates the following resources:

  • IAM user ConsumerAdminUser assigned to the data lake admin
  • IAM user LFBusinessAnalyst1
  • S3 bucket for Athena output
  • Athena workgroup

Complete the following steps:

  1. Log in to the consumer account console as IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. Create the stack.
  4. When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
  5. Choose the AllConsumerUsersCredentials value to navigate to the Secrets Manager console.
  6. In the Secret value section, choose Retrieve secret value.
  7. Note down the secret value for the password for the IAM user ConsumerAdminUser.

Now that all the accounts have been set up, we set up cross-account sharing on AWS with a central governance account to manage sharing of permissions across producers and consumers.

Configure the central governance account to manage sharing with the producer account

Sign in to the central governance account as CentralDataMeshOwner using the password noted earlier through the central governance account CloudFormation stack. Then complete the following steps:

  1. On the Lake Formation console, choose Data lake locations under Register and ingest in the navigation pane.
  2. For Amazon S3 path, provide the path retail-datalake-<producer account id>-<region>.
  3. For IAM role, choose the IAM role created using the CloudFormation stack.

This role has permissions to access the encrypted S3 bucket and its key. Do not choose the role AWSServiceRoleForLakeFormationDataAccess.

  1. Choose Register location.
  2. In the navigation pane, choose Databases.
  3. Choose Create database.
  4. For Database name, enter datameshtestdatabase.
  5. Choose Create database.
  6. In the navigation pane, choose Data locations and choose Grant.
  7. Select External account and provide the producer account for AWS account ID, AWS organization ID, or IAM principal ARN.
  8. For Storage location, provide the data lake bucket path.
  9. Select Grantable, then choose Grant.
  10. Choose Data lake permissions, then choose Grant.
  11. Select External accounts and provide the producer account number.
  12. For Databases, choose datameshtestdatabase.
  13. For Database permissions and Grantable permissions, select Create table, Alter, and Describe.
  14. Choose Grant.

Configure the crawler in the producer account to populate the schema

Sign in to the producer account as LOBProducerSteward with the password noted earlier through the producer account CloudFormation stack, then complete the following steps:

  1. On the AWS RAM console, accept the pending resource share from the central account.
  2. On the Lake Formation console, choose Databases under Data catalog in the navigation pane.
  3. Choose datameshtestdatabase, and on the Action menu, choose Create resource link.
  4. For Resource link name, enter datameshtestdatabaselink.
  5. Choose Create.
  6. On the AWS Glue console, choose Crawlers in the navigation pane.
  7. Choose the crawler CrossAccountCrawler-<accountid>.
  8. Choose Edit, then choose Configure security settings.
  9. Select Use Lake Formation credentials for crawling S3 data source.
  10. Select In a different account and provide the account ID of the central governance account.
  11. Choose Next.
  12. Choose datameshtestdatabaselink as the database and choose Update.
  13. In the navigation pane, choose Data locations and choose Grant.
  14. Select My account, and choose the crawler IAM role for IAM users and roles.
  15. For Storage locations, choose the bucket retail-datalake-<accountid>-<region>.
  16. For Registered account location, enter the central account ID.
  17. Choose Grant.
    Alternatively, you can use the AWS CLI to grant data location permissions on the bucket registered in the central account to the crawler role by using the following command:

    aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier="<Crawler Role ARN>" \
    --permissions "DATA_LOCATION_ACCESS" \
    --resource '{"DataLocation": {"ResourceArn": "<S3 bucket arn>", "CatalogId": "<Central Account id>"}}'

    To use the AWS CLI, refer to Installing or updating the latest version of the AWS CLI. A boto3 sketch of the database and resource link grants in steps 18 through 26 follows this list.

  18. In the navigation pane, choose Data lake permissions.
  19. Choose the crawler IAM role for the principal account.
  20. Choose datameshtestdatabase for the database.
  21. For Database permissions, select Create, Describe, and Alter.
  22. Choose Grant.
  23. Choose the crawler IAM role for the principal account.
  24. Choose datameshtestdatabaselink for the database.
  25. For Resource link permissions, select Describe.
  26. Choose Grant.
  27. Run the crawler.
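As with the data location grant, steps 18 through 26 can also be scripted. The following boto3 sketch issues the same database and resource link grants to the crawler role; the role ARN and account ID are placeholders to replace, and you should confirm the permission names against the Lake Formation GrantPermissions API.

import boto3

lakeformation = boto3.client("lakeformation")

CRAWLER_ROLE_ARN = "<Crawler Role ARN>"      # placeholder
CENTRAL_ACCOUNT_ID = "<Central Account id>"  # placeholder

# Steps 18-22: Create table, Describe, and Alter on the shared database
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CRAWLER_ROLE_ARN},
    Resource={"Database": {"CatalogId": CENTRAL_ACCOUNT_ID, "Name": "datameshtestdatabase"}},
    Permissions=["CREATE_TABLE", "DESCRIBE", "ALTER"],
)

# Steps 23-26: Describe on the resource link in the producer account
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": CRAWLER_ROLE_ARN},
    Resource={"Database": {"Name": "datameshtestdatabaselink"}},
    Permissions=["DESCRIBE"],
)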

The following screenshot shows the details after a successful run.

When the crawler is complete, you can validate the table created under the database datameshtestdatabaselink.

This table is owned by the producer account and available in the central governance account under the shared database datameshtestdatabase. Now the data lake admin in the central governance account can share the database and populated table with the consumer account.

Configure the central governance account to manage sharing of read-only access with the consumer account

Sign in to the central governance account as CentralDataMeshOwner with the password noted earlier through the central governance account CloudFormation stack, then complete the following steps:

  1. Grant database permissions to the consumer account.
  2. For Principals, choose External account and provide <consumer accountID>.
  3. For Databases, select datameshtestdatabase.
  4. For Database permissions, select Describe.
  5. For Grantable permissions, select Describe.
  6. Choose Grant.

  7. Grant table permissions to the consumer account.
  8. For Principals, choose External account and provide <consumer accountID>.
  9. For Databases, select datameshtestdatabase.
  10. For Tables, select retail_datalake_<accountID>_<region>.
  11. For Table permissions, select Select and Describe.
  12. For Grantable permissions, select Select and Describe.
  13. Choose Grant.

Configure the consumer account as the consumer account data lake admin

Sign in to the consumer account as ConsumerAdminUser with the password noted earlier through the consumer account CloudFormation stack. (Note that in the consumer account Lake Formation configuration, both ConsumerAdminUser and LFBusinessAnalyst1 have the same password.)

  1. On the AWS RAM console, accept the resource share from the central account.
  2. On the Lake Formation console, validate that the shared database datameshtestdatabase is available and create the resource link datameshtestdatabaselink using the shared database.

The following screenshot shows the details after the resource link is created.

  1. On the Lake Formation console, choose Grant.
  2. Choose LFBusinessAnalyst1 for IAM users and roles.
  3. Choose datameshtestdatabase for the database under Named data catalog resources.
  4. Select Describe for Database permissions.
  5. On the Lake Formation console, choose Grant.
  6. Choose LFBusinessAnalyst1 for IAM users and roles.
  7. Choose datameshtestdatabaselink for the database under Named data catalog resources.
  8. Select Describe for Resource link permissions.
  9. On the Lake Formation console, choose Grant.
  10. Choose LFBusinessAnalyst1 for IAM users and roles.
  11. Choose retail_datalake_<accountid>_<region> for the table under Named data catalog resources.
  12. Select Select and Describe for Table permissions.

Run queries in the consumer account

Sign in to the consumer account console as LFBusinessAnalyst1 with the password noted earlier through the consumer account CloudFormation stack, then complete the following steps:

  1. On the Athena console, choose lfconsumer-workgroup as the Athena workgroup.
  2. Run the following query to validate access:
select * from datameshtestdatabaselink.retail_datalake_<accountid>_<region>
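If you would rather validate access programmatically than in the Athena console, the following boto3 sketch runs the same query in the lfconsumer-workgroup created by the consumer CloudFormation stack; the table name placeholder must be replaced, and the simple polling loop is an illustrative assumption.

import time
import boto3

athena = boto3.client("athena")

QUERY = "SELECT * FROM datameshtestdatabaselink.retail_datalake_<accountid>_<region> LIMIT 10"

# Start the query in the workgroup created by the consumer CloudFormation stack
query_id = athena.start_query_execution(
    QueryString=QUERY,
    WorkGroup="lfconsumer-workgroup",
)["QueryExecutionId"]

# Wait for the query to finish, then print a few rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][:5]:
        print(row)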

We have successfully registered the dataset and created a Data Catalog in the central governance account. We crawled the data lake that was registered with the central governance account using Lake Formation permissions from the producer account and populated the schema. We granted Lake Formation permission on the database and table from the central account to the consumer user and validated consumer user access to the data using Athena.

Clean up

To avoid unwanted charges to your AWS account, delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack in all three accounts.
  2. Delete the stacks you created.

Conclusion

In this post, we showed how to set up cross-account crawling using a central governance account with the new AWS Glue crawler integration with Lake Formation. This capability allows data producers to set up crawling capabilities in their own domain so that changes are seamlessly available to data governance and data consumers. Implementing a data mesh with AWS Glue crawlers, Lake Formation, Athena, and other analytical services provides a well-understood, performant, scalable, and cost-effective solution to integrate, prepare, and serve data.

If you have questions or suggestions, submit them in the comments section.



About the authors

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Piyali Kamra is a seasoned enterprise architect and a hands-on technologist who believes that building large-scale enterprise systems is not an exact science but more like an art, in which tools and technologies must be carefully selected based on the team’s culture, strengths, weaknesses, and risks, in tandem with having a futuristic vision as to how you want to shape your product a few years down the road.

Automate the deployment of an NGINX web service using Amazon ECS with TLS offload in CloudHSM

Post Syndicated from Nikolas Nikravesh original https://aws.amazon.com/blogs/security/automate-the-deployment-of-an-nginx-web-service-using-amazon-ecs-with-tls-offload-in-cloudhsm/

Customers who require private keys for their TLS certificates to be stored in FIPS 140-2 Level 3 certified hardware security modules (HSMs) can use AWS CloudHSM to store their keys for websites hosted in the cloud. In this blog post, we will show you how to automate the deployment of a web application using NGINX in AWS Fargate, with full integration with CloudHSM. You will also use AWS CodeDeploy to manage the deployment of changes to your Amazon Elastic Container Service (Amazon ECS) service.

CloudHSM offers FIPS 140-2 Level 3 HSMs that you can integrate with NGINX or Apache HTTP Server through the OpenSSL Dynamic Engine. The CloudHSM Client SDK 5 includes the OpenSSL Dynamic Engine to allow your web server to use a private key stored in the HSM with TLS versions 1.2 and 1.3 to support applications that are required to use FIPS 140-2 Level 3 validated HSMs.

CloudHSM uses the private key in the HSM as part of the server verification step of the TLS handshake that occurs every time a new HTTPS connection is established between the client and server. OpenSSL software performs the rest of the key exchange and handles bulk encryption with the negotiated symmetric key. For more information about this process and how CloudHSM fits in, see How SSL/TLS offload with AWS CloudHSM works.

Solution overview

This blog post uses the AWS Cloud Development Kit (AWS CDK) to deploy the solution infrastructure. The AWS CDK allows you to define your cloud application resources using familiar programming languages.

Figure 1 shows an overview of the overall architecture deployed in this post. This solution contains three CDK stacks: the TlsOffloadContainerBuildStack stack deploys the AWS CodeCommit, AWS CodeBuild, and Amazon ECR resources; the TlsOffloadEcsServiceStack stack deploys the ECS Fargate service along with the required VPC resources; and the TlsOffloadPipelineStack stack deploys the AWS CodePipeline resources to automate deployments of changes to the service configuration.

Figure 1: Overall architecture

At a high level, here’s how the solution in Figure 1 works:

  1. Clients make an HTTPS request to the public IP address exposed by Network Load Balancer to connect to the web server and establish a secure connection that uses TLS.
  2. Network Load Balancer routes the request to one of the ECS hosts running in private virtual private cloud (VPC) subnets, which are connected to the CloudHSM cluster.
  3. The NGINX web server that is running on ECS containers performs a TLS handshake by using the private key stored in the HSM to establish a secure connection with the requestor.

Note: Although we don’t focus on perimeter protection in this post, AWS has a number of services that help provide layered perimeter protection for your internet-facing applications, such as AWS Shield and AWS WAF.

Figure 2 shows an overview of the automation infrastructure that is deployed by the TlsOffloadContainerBuildStack and TlsOffloadPipelineStack CDK stacks.

Figure 2: Deployment pipeline

At a high level, here’s how the solution in Figure 2 works:

  1. A developer makes changes to the service configuration and commits the changes to the AWS CodeCommit repository.
  2. AWS CodePipeline detects the changes and invokes AWS CodeBuild to build a new version of the Docker image that is used in Amazon ECS.
  3. CodeBuild builds a new Docker image and publishes it to the Amazon Elastic Container Registry (Amazon ECR) repository.
  4. AWS CodeDeploy creates a new revision of the ECS task definition for the Amazon ECS service and initiates a deployment of the new service.

Required services

To build this architecture in your account, you need to use a role within your account that can configure the following services and features:

Prerequisites

To follow this walkthrough, you need to have the following components in place:

Step 1: Store secrets in Secrets Manager

As with other container projects, you need to decide what to build statically into the container (for example, libraries, code, or packages) and what to set as runtime parameters, to be pulled from a parameter store. In this walkthrough, we use Secrets Manager to store sensitive parameters and use the integration of Amazon ECS with Secrets Manager to securely retrieve them when the container is launched.

Important: You need to store the following information in Secrets Manager as plaintext, not as key/value pairs.

To create a new secret

  1. Open the Secrets Manager console and choose Store a new secret.
  2. On the Choose secret type page, do the following:
    1. For Secret type, choose Other type of secret.
    2. In Key/value pairs, choose Plaintext and enter your secret just as you would need it in your application.

The following is a list of the required secrets for this solution and how they look in the Secrets Manager console.

  • Your cluster-issuing certificate – this is the certificate that corresponds to the private key that you used to sign the cluster’s certificate signing request. In this example, the name of the secret for the certificate is tls/clustercert.
    Figure 3: Store the cluster certificate

  • The web server certificate – In this example, the name of the secret for the web server certificate is tls/servercert. It will look similar to the following:
    Figure 4: Store the web server certificate

  • The fake PEM file for the private key stored in the HSM that you generated in the Prerequisites section. In this example, the name of the secret for the fake PEM file is tls/fakepem.
    Figure 5: Store the fake PEM

  • The HSM pin used to authenticate with the HSMs in your cluster. In this example, the name of the secret for the HSM pin is tls/pin.
    Figure 6: Store the HSM pin
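If you prefer to create these secrets with a script instead of the console, the following boto3 sketch stores each value as a plaintext secret under the names used in this post. The local file names are illustrative assumptions; substitute the files that hold your certificates, fake PEM, and HSM pin.

import boto3

secretsmanager = boto3.client("secretsmanager")

# Secret name -> local file holding the plaintext value (file names are placeholders)
secret_files = {
    "tls/clustercert": "customerCA.crt",
    "tls/servercert": "servercert.pem",
    "tls/fakepem": "fake_private_key.pem",
    "tls/pin": "hsm_pin.txt",
}

for secret_name, file_name in secret_files.items():
    with open(file_name, "r", encoding="utf-8") as file_handle:
        # Store the raw contents as a plaintext secret (not key/value pairs)
        secretsmanager.create_secret(
            Name=secret_name,
            SecretString=file_handle.read().strip(),
        )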

After you’ve stored your secrets, you should see output similar to the following:

Figure 7: List of required secrets

Step 2: Download and configure the CDK app

This post uses the AWS CDK to deploy the solution infrastructure. In this section, you will download the CDK app and configure it.

To download and configure the CDK app

  1. In your CDK environment that you created in the Prerequisites section, check out the source code from the aws-cloudhsm-tls-offload-blog GitHub repository.
  2. Edit the app_config.json file and update the <placeholder values> with your target configuration:
    {
        "applicationAccount": "<AWS_ACCOUNT_ID>",
        "applicationRegion": "<REGION>",
        "networkConfig": {
            "vpcId": "<VPC_ID>",
            "publicSubnets": ["<PUBLIC_SUBNET_1>", "<PUBLIC_SUBNET_2>", ...],
            "privateSubnets": ["<PRIVATE_SUBNET_1>", "<PRIVATE_SUBNET_2>", ...]
        },
        "secrets": {
            "cloudHsmPin": "arn:aws:secretsmanager:<REGION>:<AWS_ACCOUNT_ID>:secret:<SECRET_ID>",
            "fakePem": "arn:aws:secretsmanager:<REGION>:<AWS_ACCOUNT_ID>:secret:<SECRET_ID>",
            "serverCert": "arn:aws:secretsmanager:<REGION>:<AWS_ACCOUNT_ID>:secret:<SECRET_ID>",
            "clusterCert": "arn:aws:secretsmanager:<REGION>:<AWS_ACCOUNT_ID>:secret:<SECRET_ID>"
        },
        "cloudhsm": {
            "clusterId": "<CLUSTER_ID>",
            "clusterSecurityGroup": "<CLUSTER_SECURITY_GROUP>"
        }
    }

  3. Run the following command to build the CDK stacks from the root of the project directory.
    npm run build

  4. To view the stacks that are available to deploy, run the following command from the root of the project directory.
    cdk ls

    You should see the following stacks available to deploy:

    • TlsOffloadContainerBuildStack — Deploys the CodeCommit, CodeBuild, and ECR repository that builds the ECS container image.
    • TlsOffloadEcsServiceStack — Deploys the ECS Fargate service along with the required VPC resources.
    • TlsOffloadPipelineStack — Deploys the CodePipeline that automates the deployment of updates to the service.

Step 3: Deploy the container build stack

In this step, you will deploy the container build stack, and then create a build and verify that the image was built successfully.

To deploy the container build stack

Deploy the TlsOffloadContainerBuildStack stack that we described in Figure 2 to your AWS account. In your CDK environment, run the following command:

cdk deploy TlsOffloadContainerBuildStack

The command line interface (CLI) will prompt you to approve the changes. After you approve them, you will see the following resources deployed to your newly created CodeCommit repository.

  • Dockerfile — This file provides a containerized environment for each of the Fargate containers to run. It downloads and installs necessary dependencies to run the NGINX web server with CloudHSM.
  • nginx.conf — This file provides NGINX with the configuration settings to run an HTTPS web server with CloudHSM configured as the SSL engine that performs the TLS handshake. The following nginx.conf values have already been configured in the file; if you want to make changes, update the file before deployment:
    • ssl_engine is set to cloudhsm
    • the environment variable is env CLOUDHSM_PIN
    • error_log is set to stderr so that the Fargate container can capture the logs in CloudWatch
    • the server section is set up to listen on port 443
    • ssl_ciphers are configured for a server with an RSA private key
  • run.sh — This script configures the CloudHSM OpenSSL Dynamic Engine on the Fargate task before the NGINX server is started.
  • nginx.service — This file specifies the configuration settings that systemd uses to run the NGINX service. Included in this file is a reference to the file that contains the environment variables for the NGINX service. This provides the HSM pin to the OpenSSL Engine.
  • index.html — This file is a sample HTML file that is displayed when you navigate to the HTTPS endpoint of the load balancer in your browser.
  • dhparam.pem — This file provides sample Diffie-Hellman parameters for demonstration purposes, but AWS recommends that you generate your own. You can generate your own Diffie-Hellman parameters by running the following command with the OpenSSL CLI. These parameters are not required for TLS but are recommended to provide perfect forward secrecy in your encrypted messages.
    openssl dhparam -out ./dhparam.pem 2048

Your repository should look like the following:

Figure 8: CodeCommit repository

Before you deploy the Amazon ECS service, you need to build your first Docker image to populate the ECR repository. To successfully deploy the service, you need to have at least one image already present in the repository.

To create a build and verify the image was built successfully

  1. Open the AWS CodeBuild console.
  2. Find the CodeBuild project that was created by the CDK deployment and select it.
  3. Choose Start Build to initiate a new build.
  4. Wait for the build to complete successfully, and then open the Amazon ECR console.
  5. Select the repository that the CDK deployment created.

You should now see an image in your repository, similar to the following:

Figure 9: ECR repository

Step 4: Deploy the Amazon ECS service

Now that you have successfully built an ECR image, you can deploy the Amazon ECS service. This step deploys the following resources to your account:

  • VPC endpoints for the required AWS services that your ECS task needs to communicate with, including the following:
    • Amazon ECR
    • Secrets Manager
    • CloudWatch
    • CloudHSM
  • Network Load Balancer, which load balances HTTPS traffic to your ECS tasks.
  • A CloudWatch Logs log group to host the logs for the ECS tasks.
  • An ECS cluster with ECS tasks using your previously built Docker image that hosts the NGINX service.

To deploy the Amazon ECS service with the CDK

  • In your CDK environment, run the following command:
    cdk deploy TlsOffloadEcsServiceStack

The CLI will prompt you to approve the changes. After you approve them, you will see these resources deploy to your account.

Checkpoint

At this point, you should have a working service. To confirm that you do, in your browser, navigate using HTTPS to the public address associated with the Network Load Balancer. While not covered in this blog, you can additionally configure DNS routing using Amazon Route 53 to set up a custom domain name for your web service. You should see a screen similar to the following.

Figure 10: The sample website

Step 5: Use CodePipeline to automate the deployment of changes to the web server

Now that you have deployed a preliminary version of the application, you can take a few steps to automate further releases of the web server. As you maintain this application in production, you might need to update one or more of the following items:

  • Your website HTML source and other required libraries (for example, CSS or JavaScript)
  • Your Docker environment, such as the OpenSSL libraries, operating system and CloudHSM packages, and NGINX version.
  • Your web server private key and certificate in Secrets Manager, which requires redeploying the service after rotation

Next, you will set up a CodePipeline project that orchestrates the end-to-end deployment of a change to the application—from an update to the code in our CodeCommit repo to the deployment of updated container images and the redirection of user traffic by the load balancer to the updated application.

This step deploys to your account a deployment pipeline that connects your CodeCommit, CodeBuild, and Amazon ECS services.

Deploy the CodePipeline stack with CDK

In your CDK environment, run the following command:

cdk deploy TlsOffloadPipelineStack

The CLI will prompt you to approve the changes. After you approve them, you will see the resources deploy to your account.

Start a deployment

To verify that your automation is working correctly, start a new deployment in your CodePipeline by making a change to your source repository. If everything works, the CodeBuild project will build the latest version of the Dockerfile located in your CodeCommit repository and push it to Amazon ECR. Then, the CodeDeploy application will create a new version of the ECS task definition and deploy new tasks while spinning down the existing tasks.

View your website

Now that the deployment is complete, you should again be able to view your website in your browser by navigating to the website for your application. If you made changes to the source code, such as changes to your index.html file, you should see these changes now.

Verify that the web server is properly configured by checking that the website’s certificate matches the one that you created in the Prerequisites section. Figure 11 shows an example of a certificate.

Figure 11: Certificate for the application

To verify that your NGINX service is using your CloudHSM cluster to offload the TLS handshake, you can view the CloudHSM client logs for this application in CloudWatch in the log group that you specified when you configured the ECS task definition.

To view your CloudHSM client logs in CloudWatch

  1. Open the CloudWatch console.
  2. In the navigation pane, select Log Groups.
  3. Select the log group that was created for you by the CDK deployment.
  4. Select a log stream entry. Each log stream corresponds to an ECS instance that is running the NGINX web server.
  5. You should see the client logs for this instance, which will look similar to the following:
    Figure 12: Fargate task logs

You can also verify your HSM connectivity by viewing your HSM audit logs.

To view your HSM audit logs

  1. Open the CloudWatch console.
  2. In the navigation pane, select Log Groups.
  3. Select the log group corresponding to your CloudHSM cluster. The log group has the following format: /aws/cloudhsm/<cluster-id>.
  4. You can see entries similar to the following, which indicates that the NGINX application is connecting and logging in to the HSM to perform cryptographic operations.
    Time: 02/04/23 17:45:40.333033, usecs:1675532740333033
    Version No : 1.0
    Sequence No : 0x2
    Reboot counter : 0x8
    Opcode : CN_LOGIN (0xd)
    Command Type(hex) : CN_MGMT_CMD (0x0)
    User id : 3
    Session Handle : 0x15010002
    Response : 0x0:HSM Return: SUCCESS
    Log type : USER_AUTH_LOG (2)
    User Name : crypto_user
    User Type : CN_CRYPTO_USER (1) 

Conclusion

In this post, you learned how to set up an NGINX web server on Fargate in a secure, private subnet that offloads the TLS termination to a FIPS 140-2 Level 3 HSM environment that uses the CloudHSM OpenSSL Dynamic Engine. You also learned how to set up a deployment pipeline to automate the Fargate deployments when updates are made.

You can expand this solution to fit your individual use case. For example, you can use the NGINX web server as a reverse proxy for additional servers in your internal network, and set up mutual TLS between these internal servers.

Further reading

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS CloudHSM re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Alket Memushaj

Alket Memushaj is a Principal Solutions Architect in the Market Development team for Capital Markets at AWS. In his role, Alket helps customers transform their business with the power of the AWS Cloud. His main focus is on helping customers deploy data and analytics, risk management, and electronic trading platforms in AWS. Alket previously led engineering teams at Morgan Stanley and consulted for global financial services at VMware.

Nikolas Nikravesh

Nikolas is a Software Development Engineer at AWS CloudHSM. He works with the SDK team to develop standards compliant SDKs and integrations to enable AWS customers to develop secure applications with CloudHSM.

Brad Woodward

Brad is a Senior Customer Delivery Architect with AWS Professional Services. Brad has presented at RSA and DefCon Skytalks, been an instructor at BlackHat and BlackHat Europe, presented tools at BlackHat Arsenal, and is the maintainer of several open source tools and platforms.

Unit Testing AWS Lambda with Python and Mock AWS Services

Post Syndicated from Kevin Hakanson original https://aws.amazon.com/blogs/devops/unit-testing-aws-lambda-with-python-and-mock-aws-services/

When building serverless event-driven applications using AWS Lambda, it is best practice to validate individual components.  Unit testing can quickly identify and isolate issues in AWS Lambda function code.  This blog demonstrates unit testing techniques for Python-based AWS Lambda functions and their interactions with AWS services.

The full code for this blog is available in the GitHub project as a demonstrative example.

Example use case

Let’s consider unit testing a serverless application which provides an API endpoint to generate a document.  When the API endpoint is called with a customer identifier and document type, the Lambda function retrieves the customer’s name from DynamoDB, then retrieves the document text from DynamoDB for the given document type, and finally generates and writes the resulting document to S3.

Figure 1. Example application architecture

  1. Amazon API Gateway provides an endpoint to request the generation of a document for a given customer.  A document type and customer identifier are provided in this API call.
  2. The endpoint invokes an AWS Lambda function that generates a document using the customer identifier and the document type provided.
  3. An Amazon DynamoDB table stores the contents of the documents and the customer’s name, which are retrieved by the Lambda function.
  4. The resulting text document is stored to Amazon S3.

Our testing goal is to determine if an isolated “unit” of code works as intended. In this blog, we will be writing tests to provide confidence that the logic written in the above AWS Lambda function behaves as we expect. We will mock the service integrations to Amazon DynamoDB and S3 to isolate and focus our tests on the Lambda function code, and not on the behavior of the AWS Services.

Define the AWS Service resources in the Lambda function

Before writing our first unit test, let’s look at the Lambda function that contains the behavior we wish to test.  The full code for the Lambda function is available in the GitHub repository as src/sample_lambda/app.py.

As part of our Best practices for working with AWS Lambda functions, we recommend initializing AWS service resource connections outside of the handler function and in the global scope.  Additionally, we can retrieve any relevant environment variables in the global scope so that subsequent invocations of the Lambda function do not repeatedly need to retrieve them.  For organization, we can put the resource and variables in a dictionary:

_LAMBDA_DYNAMODB_RESOURCE = { "resource" : resource('dynamodb'), 
                              "table_name" : environ.get("DYNAMODB_TABLE_NAME","NONE") }

However, globally scoped code and global variables are challenging to test in Python, as global statements are executed on import, and outside of the controlled test flow.  To facilitate testing, we define classes for supporting AWS resource connections that we can override (patch) during testing.  These classes will accept a dictionary containing the boto3 resource and relevant environment variables.

For example, we create a DynamoDB resource class with a parameter “boto3_dynamodb_resource” that accepts a boto3 resource connected to DynamoDB:

class LambdaDynamoDBClass:
    def __init__(self, lambda_dynamodb_resource):
        self.resource = lambda_dynamodb_resource["resource"]
        self.table_name = lambda_dynamodb_resource["table_name"]
        self.table = self.resource.Table(self.table_name)
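The project defines a similar class for the S3 resource, which the handler and tests reference as LambdaS3Class. The full version lives in src/sample_lambda/app.py; the sketch below is an assumed shape based on how the class is used later in this post.

class LambdaS3Class:
    def __init__(self, lambda_s3_resource):
        # Sketch only; see the GitHub repository for the actual implementation
        self.resource = lambda_s3_resource["resource"]
        self.bucket_name = lambda_s3_resource["bucket_name"]
        self.bucket = self.resource.Bucket(self.bucket_name)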

Build the Lambda Handler

The Lambda function handler is the method in the AWS Lambda function code that processes events. When the function is invoked, Lambda runs the handler method. When the handler exits or returns a response, it becomes available to process another event.

To facilitate unit testing of the handler function, move as much of the logic as possible to other functions that are then called by the Lambda handler entry point.  Also, pass the AWS resource global variables to these subsequent function calls.  This approach enables us to mock and intercept all resources and calls during testing.

In our example, the handler references the global variables, and instantiates the resource classes to setup the connections to specific AWS resources.  (We will be able to override and mock these connections during unit test.)

Then the handler calls the create_letter_in_s3 function to perform the steps of creating the document, passing the resource classes.  This downstream function avoids directly referencing the global context or any AWS resource connections directly.

def lambda_handler(event: APIGatewayProxyEvent, context: LambdaContext) -> Dict[str, Any]:

    global _LAMBDA_DYNAMODB_RESOURCE
    global _LAMBDA_S3_RESOURCE

    dynamodb_resource_class = LambdaDynamoDBClass(_LAMBDA_DYNAMODB_RESOURCE)
    s3_resource_class = LambdaS3Class(_LAMBDA_S3_RESOURCE)

    return create_letter_in_s3(
            dynamo_db = dynamodb_resource_class,
            s3 = s3_resource_class,
            doc_type = event["pathParameters"]["docType"],
            cust_id = event["pathParameters"]["customerId"])

Unit testing with mock AWS services

Our Lambda function code has now been written and is ready to be tested, let’s take a look at the unit test code!   The full code for the unit test is available in the GitHub repository as tests/unit/src/test_sample_lambda.py.

In production, our Lambda function code will directly access the AWS resources we defined in our function handler; however, in our unit tests we want to isolate our code and replace the AWS resources with simulations.  This isolation facilitates running unit tests in an isolated environment to prevent accidental access to actual cloud resources.

Moto is a Python library for mocking AWS services that we will use to simulate AWS resources in our tests.  Moto supports many AWS resources, and it allows you to test your code with little or no modification by emulating the functionality of these services.

Moto uses decorators to intercept and simulate responses to and from AWS resources.  By adding a decorator for a given AWS service, subsequent calls from the module to that service will be re-directed to the mock.

@moto.mock_dynamodb
@moto.mock_s3
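A common arrangement, and a reasonable assumption for this project, is to apply these decorators to the test class so that every boto3 call made inside the test methods is served by moto’s in-memory emulation instead of real AWS:

import moto
from unittest import TestCase

@moto.mock_dynamodb
@moto.mock_s3
class TestSampleLambda(TestCase):
    # Test methods and setUp()/tearDown() defined below run against the mocks
    ...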

Configure Test Setup and Tear-down

The mocked AWS resources will be used during the unit test suite.  Using the setUp() method allows you to define and configure the mocked global AWS Resources before the tests are run.

We define the test class and a setUp() method and initialize the mock AWS resource.  This includes configuring the resource to prepare it for testing, such as defining a mock DynamoDB table or creating a mock S3 Bucket.

class TestSampleLambda(TestCase):
    def setUp(self) -> None:
        dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
        dynamodb.create_table(
            TableName = self.test_ddb_table_name,
            KeySchema = [{"AttributeName": "PK", "KeyType": "HASH"}],
            AttributeDefinitions = [{"AttributeName": "PK", 
                                     "AttributeType": "S"}],
            BillingMode = 'PAY_PER_REQUEST')

        s3_client = boto3.client('s3', region_name="us-east-1")
        s3_client.create_bucket(Bucket = self.test_s3_bucket_name)

After creating the mocked resources, the setup function creates resource class object referencing those mocked resources, which will be used during testing.

        mocked_dynamodb_resource = resource("dynamodb")
        mocked_s3_resource = resource("s3")
        mocked_dynamodb_resource = { "resource" : resource('dynamodb'),
                                     "table_name" : self.test_ddb_table_name  }
        mocked_s3_resource = { "resource" : resource('s3'),
                               "bucket_name" : self.test_s3_bucket_name }
        self.mocked_dynamodb_class = LambdaDynamoDBClass(mocked_dynamodb_resource)
        self.mocked_s3_class = LambdaS3Class(mocked_s3_resource)

Test #1: Verify the code writes the document to S3

Our first test will validate our Lambda function writes the customer letter to an S3 bucket in the correct manner.  We will follow the standard test format of arrange, act, assert when writing this unit test.

Arrange the data we need in the DynamoDB table:

def test_create_letter_in_s3(self) -> None:
    
    self.mocked_dynamodb_class.table.put_item(Item={"PK":"D#UnitTestDoc",
                                                        "data":"Unit Test Doc Corpi"})
    self.mocked_dynamodb_class.table.put_item(Item={"PK":"C#UnitTestCust",
                                                        "data":"Unit Test Customer"})

Act by calling the create_letter_in_s3 function.  During these act calls, the test passes the AWS resources as created in the setUp().

    test_return_value = create_letter_in_s3(
                        dynamo_db = self.mocked_dynamodb_class,
                        s3=self.mocked_s3_class,
                        doc_type = "UnitTestDoc",
                        cust_id = "UnitTestCust"
                        )

Assert by reading the data written to the mock S3 bucket, and testing conformity to what we are expecting:

bucket_key = "UnitTestCust/UnitTestDoc.txt"
    body = self.mocked_s3_class.bucket.Object(bucket_key).get()['Body'].read()

    self.assertEqual(test_return_value["statusCode"], 200)
    self.assertIn("UnitTestCust/UnitTestDoc.txt", test_return_value["body"])
    self.assertEqual(body.decode('ascii'),"Dear Unit Test Customer;\nUnit Test Doc Corpi")

Tests #2 and #3: Data not found error conditions

We can also test error conditions and handling, such as keys not found in the database.  For example, if a customer identifier is submitted, but does not exist in the database lookup, does the logic handle this and return a “Not Found” code of 404?

To test this in test #2, we add data to the mocked DynamoDB table, but then submit a customer identifier that is not in the database.

This test, and a similar test #3 for “Document Types not found”, are implemented in the example test code on GitHub.

Test #4: Validate the handler interface

As the application logic resides in independently tested functions, the Lambda handler function provides only interface validation and function call orchestration.  Therefore, the test for the handler validates that the event is parsed correctly, any functions are invoked as expected, and the return value is passed back.

To emulate the global resource variables and other functions, patch both the global resource classes and logic functions.

    @patch("src.sample_lambda.app.LambdaDynamoDBClass")
    @patch("src.sample_lambda.app.LambdaS3Class")
    @patch("src.sample_lambda.app.create_letter_in_s3")
    def test_lambda_handler_valid_event_returns_200(self,
                            patch_create_letter_in_s3 : MagicMock,
                            patch_lambda_s3_class : MagicMock,
                            patch_lambda_dynamodb_class : MagicMock
                            ):

Arrange for the test by setting return values for the patched objects.

        patch_lambda_dynamodb_class.return_value = self.mocked_dynamodb_class
        patch_lambda_s3_class.return_value = self.mocked_s3_class

        return_value_200 = {"statusCode" : 200, "body":"OK"}
        patch_create_letter_in_s3.return_value = return_value_200

We need to provide event data when invoking the Lambda handler.  A good practice is to save test events as separate JSON files, rather than placing them inline as code. In the example project, test events are located in the folder “tests/events/”. During test execution, the event object is created from the JSON file using the utility function named load_sample_event_from_file.
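A minimal sketch of such a helper is shown below; it assumes the tests/events/ folder layout described above and that json is imported in the test module.

def load_sample_event_from_file(self, test_event_file_name: str) -> dict:
    """Load a sample event from the tests/events/ folder as a dictionary."""
    event_file_name = f"tests/events/{test_event_file_name}.json"
    with open(event_file_name, "r", encoding="UTF-8") as file_handle:
        event = json.load(file_handle)
        return event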

test_event = self.load_sample_event_from_file("sampleEvent1")

Act by calling the lambda_handler function.

test_return_value = lambda_handler(event=test_event, context=None)

Assert by ensuring the create_letter_in_s3 function is called with the expected parameters based on the event, and a create_letter_in_s3 function return value is passed back to the caller.  In our example, this value is simply passed with no alterations.

        patch_create_letter_in_s3.assert_called_once_with(
                                        dynamo_db=self.mocked_dynamodb_class,
                                        s3=self.mocked_s3_class,
                                        doc_type=test_event["pathParameters"]["docType"],
                                        cust_id=test_event["pathParameters"]["customerId"])

        self.assertEqual(test_return_value, return_value_200)

Tear Down

The tearDown() method is called immediately after the test method has been run and the result is recorded.  In our example tearDown() method, we clean up any data or state created so the next test won’t be impacted.
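A minimal tearDown() sketch is shown below. It assumes setUp() stored the mocked table name in self.test_ddb_table_name, mirroring self.test_s3_bucket_name; adjust the attribute names to whatever your setUp() created.

def tearDown(self) -> None:
    # Empty and delete the mocked S3 bucket, then delete the mocked DynamoDB table,
    # so the next test starts from a clean slate
    s3_resource = resource("s3")
    dynamodb_resource = resource("dynamodb")

    s3_bucket = s3_resource.Bucket(self.test_s3_bucket_name)
    for s3_object in s3_bucket.objects.all():
        s3_object.delete()
    s3_bucket.delete()

    dynamodb_resource.Table(self.test_ddb_table_name).delete()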

Running the unit tests

The unittest unit testing framework can be run using the Python pytest utility.  To ensure network isolation and verify that the unit tests are not accidentally connecting to AWS resources, the pytest-socket project provides the ability to disable network communication during a test.

pytest -v --disable-socket -s tests/unit/src/

The pytest command results in a PASSED or FAILED status for each test.  A PASSED status verifies that your unit tests, as written, did not encounter errors or issues.

Conclusion

Unit testing is a software development process in which different parts of an application, called units, are individually and independently tested. Tests validate the quality of the code and confirm that it functions as expected. Other developers can gain familiarity with your code base by consulting the tests. Unit tests reduce future refactoring time, help engineers get up to speed on your code base more quickly, and provide confidence in the expected behaviour.

We’ve seen in this blog how to unit test AWS Lambda functions and mock AWS Services to isolate and test individual logic within our code.

AWS Lambda Powertools for Python has been used in the project to validate handler events.  Powertools provides a suite of utilities for AWS Lambda functions to ease adopting best practices such as tracing, structured logging, custom metrics, idempotency, batching, and more.

Learn more about AWS Lambda testing in our prescriptive test guidance, and find additional test examples on GitHub.  For more serverless learning resources, visit Serverless Land.

About the authors:

Tom Romano

Tom Romano is a Solutions Architect for AWS World Wide Public Sector from Tampa, FL, and assists GovTech and EdTech customers as they create new solutions that are cloud-native, event driven, and serverless. He is an enthusiastic Python programmer for both application development and data analytics. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.

Kevin Hakanson

Kevin Hakanson is a Sr. Solutions Architect for AWS World Wide Public Sector based in Minnesota. He works with EdTech and GovTech customers to ideate, design, validate, and launch products using cloud-native technologies and modern development practices. When not staring at a computer screen, he is probably staring at another screen, either watching TV or playing video games with his family.

Manage users and group memberships on Amazon QuickSight using SCIM events generated in IAM Identity Center with Azure AD

Post Syndicated from Wakana Vilquin-Sakashita original https://aws.amazon.com/blogs/big-data/manage-users-and-group-memberships-on-amazon-quicksight-using-scim-events-generated-in-iam-identity-center-with-azure-ad/

Amazon QuickSight is a cloud-native, scalable business intelligence (BI) service that supports identity federation. AWS Identity and Access Management (IAM) allows organizations to use the identities managed in their enterprise identity provider (IdP) and federate single sign-on (SSO) to QuickSight. As more organizations are building centralized user identity stores with all their applications, including on-premises apps, third-party apps, and applications on AWS, they need a solution to automate user provisioning into these applications and keep their attributes in sync with their centralized user identity store.

When architecting a user repository, some organizations decide to organize their users in groups, use attributes (such as department name), or use a combination of both. If your organization uses Microsoft Azure Active Directory (Azure AD) for centralized authentication and uses its user attributes to organize users, you can enable federation across all QuickSight accounts and manage users and their group membership in QuickSight using events generated in the AWS platform. This allows system administrators to centrally manage user permissions from Azure AD. With this solution, provisioning, updating, and de-provisioning users and groups in QuickSight no longer requires management in two places. This ensures that users and groups in QuickSight stay consistent with the information in Azure AD through automatic synchronization.

In this post, we walk you through the steps required to configure federated SSO between QuickSight and Azure AD via AWS IAM Identity Center (Successor to AWS Single Sign-On) where automatic provisioning is enabled for Azure AD. We also demonstrate automatic user and group membership update using a System for Cross-domain Identity Management (SCIM) event.

Solution overview

The following diagram illustrates the solution architecture and user flow.


In this post, IAM Identity Center provides a central place to bring together administration of users and their access to AWS accounts and cloud applications. Azure AD is the user repository and configured as the external IdP in IAM Identity Center. In this solution, we demonstrate the use of two user attributes (department, jobTitle) specifically in Azure AD. IAM Identity Center supports automatic provisioning (synchronization) of user and group information from Azure AD into IAM Identity Center using the SCIM v2.0 protocol. With this protocol, the attributes from Azure AD are passed along to IAM Identity Center, which inherits the defined attribute for the user’s profile in IAM Identity Center. IAM Identity Center also supports identity federation with SAML (Security Assertion Markup Language) 2.0. This allows IAM Identity Center to authenticate identities using Azure AD. Users can then SSO into applications that support SAML, including QuickSight. The first half of this post focuses on how to configure this end to end (see Sign-In Flow in the diagram).

Next, user information starts to get synchronized between Azure AD and IAM Identity Center via the SCIM protocol. You can automate creating a user in QuickSight using an AWS Lambda function triggered by the CreateUser SCIM event originated from IAM Identity Center, which was captured in Amazon EventBridge. In the same Lambda function, you can subsequently update the user’s membership by adding the user to the specified group (whose name is composed of two user attributes: department-jobTitle), creating the group first if it doesn’t exist yet.

In this post, this automation part is omitted because it would be redundant with the content discussed in the following sections.

This post explores and demonstrates an UpdateUser SCIM event triggered by the user profile update on Azure AD. The event is captured in EventBridge, which invokes a Lambda function to update the group membership in QuickSight (see Update Flow in the diagram). Because a given user is supposed to belong to only one group at a time in this example, the function will replace the user’s current group membership with the new one.

In Part I, you set up SSO to QuickSight from Azure AD via IAM Identity Center (the sign-in flow):

  1. Configure Azure AD as the external IdP in IAM Identity Center.
  2. Add and configure an IAM Identity Center application in Azure AD.
  3. Complete configuration of IAM Identity Center.
  4. Set up SCIM automatic provisioning on both Azure AD and IAM Identity Center, and confirm in IAM Identity Center.
  5. Add and configure a QuickSight application in IAM Identity Center.
  6. Configure a SAML IdP and SAML 2.0 federation IAM role.
  7. Configure attributes in the QuickSight application.
  8. Create a user, group, and group membership manually via the AWS Command Line Interface (AWS CLI) or API.
  9. Verify the configuration by logging in to QuickSight from the IAM Identity Center portal.

In Part II, you set up automation to change group membership upon an SCIM event (the update flow):

  1. Understand SCIM events and event patterns for EventBridge.
  2. Create attribute mapping for the group name.
  3. Create a Lambda function.
  4. Add an EventBridge rule to trigger the event.
  5. Verify the configuration by changing the user attribute value at Azure AD.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • IAM Identity Center. For instructions, refer to Steps 1–2 in the AWS IAM Identity Center Getting Started guide.
  • A QuickSight account subscription.
  • Basic understanding of IAM and privileges required to create an IAM IdP, roles, and policies.
  • An Azure AD subscription. You need at least one user with the following attributes to be registered in Azure AD:
    • userPrincipalName – Mandatory field for Azure AD user.
    • displayName – Mandatory field for Azure AD user.
    • Mail – Mandatory field for IAM Identity Center to work with QuickSight.
    • jobTitle – Used to allocate user to group
    • department – Used to allocate user to group.
    • givenName – Optional field.
    • surname – Optional field.

Part I: Set up SSO to QuickSight from Azure AD via IAM Identity Center

This section presents the steps to set up the sign-in flow.

Configure an external IdP as Azure AD in IAM Identity Center

To configure your external IdP, complete the following steps:

  1. On the IAM Identity Center console, choose Settings.
  2. Choose Actions on the Identity source tab, then choose Change identity source.
  3. Choose External identity provider, then choose Next.

The IdP metadata is displayed. Keep this browser tab open.

Add and configure an IAM Identity Center application in Azure AD

To set up your IAM Identity Center application, complete the following steps:

  1. Open a new browser tab.
  2. Log in to the Azure AD portal using your Azure administrator credentials.
  3. Under Azure services, choose Azure Active Directory.
  4. In the navigation pane, under Manage, choose Enterprise applications, then choose New application.
  5. In the Browse Azure AD Galley section, search for IAM Identity Center, then choose AWS IAM Identity Center (successor to AWS Single Sign-On).
  6. Enter a name for the application (in this post, we use IIC-QuickSight) and choose Create.
  7. In the Manage section, choose Single sign-on, then choose SAML.
  8. In the Assign users and groups section, choose Assign users and groups.
  9. Choose Add user/group and add at least one user.
  10. Select User as its role.
  11. In the Set up single sign on section, choose Get started.
  12. In the Basic SAML Configuration section, choose Edit, and fill out following parameters and values:
  13. Identifier – The value in the IAM Identity Center issuer URL field.
  14. Reply URL – The value in the IAM Identity Center Assertion Consumer Service (ACS) URL field.
  15. Sign on URL – Leave blank.
  16. Relay State – Leave blank.
  17. Logout URL – Leave blank.
  18. Choose Save.

The configuration should look like the following screenshot.


  1. In the SAML Certificates section, download the Federation Metadata XML file and the Certificate (Raw) file.

You’re all set with Azure AD SSO configuration at this moment. Later on, you’ll return to this page to configure automated provisioning, so keep this browser tab open.

Complete configuration of IAM Identity Center

Complete your IAM Identity Center configuration with the following steps:

  1. Go back to the browser tab for the IAM Identity Center console, which you kept open in the previous step.
  2. For IdP SAML metadata under the Identity provider metadata section, choose Choose file.
  3. Choose the previously downloaded metadata file (IIC-QuickSight.xml).
  4. For IdP certificate under the Identity provider metadata section, choose Choose file.
  5. Choose the previously downloaded certificate file (IIC-QuickSight.cer).
  6. Choose Next.
  7. Enter ACCEPT, then choose Change Identity provider source.

Set up SCIM automatic provisioning on both Azure AD and IAM Identity Center

Your provisioning method is still set as Manual (non-SCIM). In this step, we enable automatic provisioning so that IAM Identity Center becomes aware of the users, which allows identity federation to QuickSight.

  1. In the Automatic provisioning section, choose Enable.
  2. Choose Access token to show your token.
  3. Go back to the browser tab (Azure AD), which you kept open in Step 1.
  4. In the Manage section, choose Enterprise applications.
  5. Choose IIC-QuickSight, then choose Provisioning.
  6. Choose Automatic in Provisioning Mode and enter the following values:
  7. Tenant URL – The value in the SCIM endpoint field.
  8. Secret Token – The value in the Access token field.
  9. Choose Test Connection.
  10. After the test connection is successfully complete, set Provisioning Status to On.
  11. Choose Save.
  12. Choose Start provisioning to start automatic provisioning using the SCIM protocol.

When provisioning is complete, one or more users are propagated from Azure AD to IAM Identity Center. The following screenshot shows the users that were provisioned in IAM Identity Center.


Note that upon this SCIM provisioning, the users in QuickSight should be created using the Lambda function triggered by the event originating from IAM Identity Center. In this post, we create a user and group membership via the AWS CLI (Step 8).

Add and configure a QuickSight application in IAM Identity Center

In this step, we create a QuickSight application in IAM Identity Center. You also configure an IAM SAML provider, role, and policy for the application to work. Complete the following steps:

  1. On the IAM Identity Center console, on the Applications page, choose Add Application.
  2. For Pre-integrated application under Select an application, enter quicksight.
  3. Select Amazon QuickSight, then choose Next.
  4. Enter a name for Display name, such as Amazon QuickSight.
  5. Choose Download under IAM Identity Center SAML metadata file and save it in your computer.
  6. Leave all other fields as they are, and save the configuration.
  7. Open the application you’ve just created, then choose Assign Users.

The users provisioned via SCIM earlier will be listed.

  1. Choose all of the users to assign to the application.

Configure a SAML IdP and a SAML 2.0 federation IAM role

To set up your IAM SAML IdP for IAM Identity Center and IAM role, complete the following steps:

  1. On the IAM console, in the navigation pane, choose Identity providers, then choose Add provider.
  2. Choose SAML as Provider type, and enter Azure-IIC-QS as Provider name.
  3. Under Metadata document, choose Choose file and upload the metadata file you downloaded earlier.
  4. Choose Add provider to save the configuration.
  5. In the navigation pane, choose Roles, then choose Create role.
  6. For Trusted entity type, select SAML 2.0 federation.
  7. For Choose a SAML 2.0 provider, select the SAML provider that you created, then choose Allow programmatic and AWS Management Console access.
  8. Choose Next.
  9. On the Add Permission page, choose Next.

In this post, we create QuickSight users via an AWS CLI command, therefore we’re not creating any permission policy. However, if the self-provisioning feature in QuickSight is required, the permission policy for the CreateReader, CreateUser, and CreateAdmin actions (depending on the role of the QuickSight users) is required.

  1. On the Name, review, and create page, under Role details, enter qs-reader-azure for the role.
  2. Choose Create role.
  3. Note the ARN of the role.

You use the ARN to configure attributes in your IAM Identity Center application.

Configure attributes in the QuickSight application

To associate the IAM SAML IdP and IAM role to the QuickSight application in IAM Identity Center, complete the following steps:

  1. On the IAM Identity Center console, in the navigation pane, choose Applications.
  2. Select the Amazon QuickSight application, and on the Actions menu, choose Edit attribute mappings.
  3. Choose Add new attribute mapping.
  4. Configure the mappings in the following table.
User attribute in the application Maps to this string value or user attribute in IAM Identity Center
Subject ${user:email}
https://aws.amazon.com/SAML/Attributes/RoleSessionName ${user:email}
https://aws.amazon.com/SAML/Attributes/Role arn:aws:iam::<ACCOUNTID>:role/qs-reader-azure,arn:aws:iam::<ACCOUNTID>:saml-provider/Azure-IIC-QS
https://aws.amazon.com/SAML/Attributes/PrincipalTag:Email ${user:email}

Note the following values:

  • Replace <ACCOUNTID> with your AWS account ID.
  • PrincipalTag:Email is for the email syncing feature for self-provisioning users that need to be enabled on the QuickSight admin page. In this post, don’t enable this feature because we register the user with an AWS CLI command.
  1. Choose Save changes.

Create a user, group, and group membership with the AWS CLI

As described earlier, users and groups in QuickSight are being created manually in this solution. We create them via the following AWS CLI commands.

The first step is to create a user in QuickSight, specifying the IAM role created earlier and the email address registered in Azure AD. The second step is to create a group whose name combines the attribute values from Azure AD for the user created in the first step. The third step is to add the user to the group created earlier; member-name indicates the user name created in QuickSight, which is composed of <IAM Role name>/<session name>. See the following code:

aws quicksight register-user \
--aws-account-id <ACCOUNTID> --namespace default \
--identity-type IAM --email <email registered in Azure AD> \
--user-role READER --iam-arn arn:aws:iam::<ACCOUNTID>:role/qs-reader-azure \
--session-name <email registered in Azure AD>

aws quicksight create-group \
--aws-account-id <ACCOUNTID> --namespace default \
--group-name Marketing-Specialist

aws quicksight create-group-membership \
--aws-account-id <ACCOUNTID> --namespace default \
--member-name qs-reader-azure/<email registered in Azure AD> \
--group-name Marketing-Specialist
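If you prefer to script these steps in Python, the following boto3 sketch performs the same three calls; the account ID, email, and group name placeholders are the same as in the CLI commands above.

import boto3

quicksight = boto3.client("quicksight")
account_id = "<ACCOUNTID>"
email = "<email registered in Azure AD>"

# Step 1: register the user against the SAML 2.0 federation IAM role
quicksight.register_user(
    AwsAccountId=account_id, Namespace="default",
    IdentityType="IAM", Email=email, UserRole="READER",
    IamArn=f"arn:aws:iam::{account_id}:role/qs-reader-azure",
    SessionName=email,
)

# Step 2: create the group named after the combined Azure AD attributes
quicksight.create_group(
    AwsAccountId=account_id, Namespace="default",
    GroupName="Marketing-Specialist",
)

# Step 3: add the user (<IAM Role name>/<session name>) to the group
quicksight.create_group_membership(
    AwsAccountId=account_id, Namespace="default",
    MemberName=f"qs-reader-azure/{email}",
    GroupName="Marketing-Specialist",
)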

At this point, the end-to-end configuration of Azure AD, IAM Identity Center, IAM, and QuickSight is complete.

Verify the configuration by logging in to QuickSight from the IAM Identity Center portal

Now you’re ready to log in to QuickSight using the IdP-initiated SSO flow:

  1. Open a new private window in your browser.
  2. Log in to the IAM Identity Center portal (https://d-xxxxxxxxxx.awsapps.com/start).

You’re redirected to the Azure AD login prompt.

  1. Enter your Azure AD credentials.

You’re redirected back to the IAM Identity Center portal.

  1. In the IAM Identity Center portal, choose Amazon QuickSight.


You’re automatically redirected to your QuickSight home.

Part II: Automate group membership change upon SCIM events

In this section, we configure the update flow.

Understand the SCIM event and event pattern for EventBridge

When an Azure AD administrator makes any changes to the attributes of a particular user profile, the change is synced with the user profile in IAM Identity Center via the SCIM protocol, and the activity is recorded as an AWS CloudTrail event called UpdateUser, with sso-directory.amazonaws.com (IAM Identity Center) as the event source. Similarly, the CreateUser event is recorded when a user is created in Azure AD, and the DisableUser event when a user is disabled.

The following screenshot of the Event history page shows two CreateUser events: one recorded by IAM Identity Center, and the other by QuickSight. In this post, we use the one from IAM Identity Center.

CloudTrail console

In order for EventBridge to handle the flow properly, the event pattern must specify the fields of the event that you want to match. The following event pattern is an example of the UpdateUser event generated in IAM Identity Center upon SCIM synchronization:

{
  "source": ["aws.sso-directory"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["sso-directory.amazonaws.com"],
    "eventName": ["UpdateUser"]
  }
}
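You can confirm that this pattern matches the events you expect by using the EventBridge TestEventPattern API before creating any rules. The following sketch uses a trimmed-down sample event; the detail payload and IDs are illustrative only.

import json
import boto3

events = boto3.client("events")

event_pattern = {
    "source": ["aws.sso-directory"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["sso-directory.amazonaws.com"],
        "eventName": ["UpdateUser"]
    }
}

# A minimal event; TestEventPattern expects the standard top-level fields
sample_event = {
    "id": "11111111-2222-3333-4444-555555555555",
    "detail-type": "AWS API Call via CloudTrail",
    "source": "aws.sso-directory",
    "account": "123456789012",
    "time": "2023-01-01T00:00:00Z",
    "region": "us-east-1",
    "resources": [],
    "detail": {
        "eventSource": "sso-directory.amazonaws.com",
        "eventName": "UpdateUser"
    }
}

response = events.test_event_pattern(
    EventPattern=json.dumps(event_pattern),
    Event=json.dumps(sample_event)
)
print(response["Result"])  # True when the sample event matches the pattern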

In this post, we demonstrate an automatic update of group membership in QuickSight that is triggered by the UpdateUser SCIM event.

Create attribute mapping for the group name

In order for the Lambda function to manage group membership in QuickSight, it must obtain the two user attributes (department and jobTitle). To make the process simpler, we’re combining two attributes in Azure AD (department, jobTitle) into one attribute in IAM Identity Center (title), using the attribute mappings feature in Azure AD. IAM Identity Center then uses the title attribute as a designated group name for this user.

  1. Log in to the Azure AD console, navigate to Enterprise Applications, IIC-QuickSight, and Provisioning.
  2. Choose Edit attribute mappings.
  3. Under Mappings, choose Provision Azure Active Directory Users.
  4. Choose jobTitle from the list of Azure Active Directory Attributes.
  5. Change the following settings:
    1. Mapping Type – Expression
    2. Expression – Join("-", [department], [jobTitle])
    3. Target attribute – title
  6. Choose Save.
  7. You can leave the provisioning page.

The attribute is automatically updated in IAM Identity Center. The updated user profile looks like the following screenshots (Azure AD on the left, IAM Identity Center on the right).


Create a Lambda function

Now we create a Lambda function to update QuickSight group membership upon the SCIM event. The core part of the function is to obtain the user’s title attribute value in IAM Identity Center based on the triggered event information, and then to ensure that the user exists in QuickSight. If the group name doesn’t exist yet, it creates the group in QuickSight and then adds the user into the group. Complete the following steps:

  1. On the Lambda console, choose Create function.
  2. For Name, enter UpdateQuickSightUserUponSCIMEvent.
  3. For Runtime, choose Python 3.9.
  4. For Timeout, set it to 15 seconds.
  5. For Permissions, create and attach an IAM role that includes the following permissions (the trusted entity (principal) should be lambda.amazonaws.com):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "MinimalPrivForScimQsBlog",
                "Effect": "Allow",
                "Action": [
                    "identitystore:DescribeUser",
                    "quicksight:RegisterUser",
                    "quicksight:DescribeUser",
                    "quicksight:CreateGroup",
                    "quicksight:DeleteGroup",
                    "quicksight:DescribeGroup",
                    "quicksight:ListUserGroups",
                    "quicksight:CreateGroupMembership",
                    "quicksight:DeleteGroupMembership",
                    "quicksight:DescribeGroupMembership",
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": "*"
            }
        ]
    }

  6. Write Python code using the Boto3 SDK for IdentityStore and QuickSight. The following is the entire sample Python code:
import sys
import boto3
import json
import logging
from time import strftime
from datetime import datetime

# Set logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
  '''
  Modify QuickSight group membership upon SCIM event from IAM Identity Center originated from Azure AD.
  It works in this way:
    Azure AD -> SCIM -> Identity Center -> CloudTrail -> EventBridge -> Lambda -> QuickSight
  Note that this is a straightforward sample to show how to update QuickSight group membership upon certain SCIM event.
  For example, it assumes a 1:1 user-to-group assignment, only one (combined) SAML attribute, and so on.
  For production, take customer requirements into account and develop your own code.
  '''

  # Setting variables (hard-coded. get dynamically for production code)
  qs_namespace_name = 'default'
  qs_iam_role = 'qs-reader-azure'

  # Obtain account ID and region
  account_id = boto3.client('sts').get_caller_identity()['Account']
  region = boto3.session.Session().region_name

  # Setup clients
  qs = boto3.client('quicksight')
  iic = boto3.client('identitystore')

  # Check boto3 version
  logger.debug(f"## Your boto3 version: {boto3.__version__}")

  # Get user info from event data
  event_json = json.dumps(event)
  logger.debug(f"## Event: {event_json}")
  iic_store_id = event['detail']['requestParameters']['identityStoreId']
  iic_user_id = event['detail']['requestParameters']['userId']  # For UpdateUser event, userId is provided through requestParameters
  logger.info("## Getting user info from Identity Store.")
  try:
    res_iic_describe_user = iic.describe_user(
      IdentityStoreId = iic_store_id,
      UserId = iic_user_id
    )
  except Exception as e:
    logger.error("## Operation failed due to unknown error. Exiting.")
    logger.error(e)
    sys.exit()
  else:
    logger.info(f"## User info retrieval succeeded.")
    azure_user_attribute_title = res_iic_describe_user['Title']
    azure_user_attribute_userprincipalname = res_iic_describe_user['UserName']
    qs_user_name = qs_iam_role + "/" + azure_user_attribute_userprincipalname
    logger.info(f"#### Identity Center user name: {azure_user_attribute_userprincipalname}")
    logger.info(f"#### QuickSight group name desired: {azure_user_attribute_title}")
    logger.debug(f"#### res_iic_describe_user: {json.dumps(res_iic_describe_user)}, which is {type(res_iic_describe_user)}")

  # Exit if user is not present since this function is supposed to be called by UpdateUser event
  try:
    # Get QuickSight user name
    res_qs_describe_user = qs.describe_user(
      UserName = qs_user_name,
      AwsAccountId = account_id,
      Namespace = qs_namespace_name
    )
  except qs.exceptions.ResourceNotFoundException as e:
    logger.error(f"## User {qs_user_name} is not found in QuickSight.")
    logger.error(f"## Make sure the QuickSight user has been created in advance. Exiting.")
    logger.error(e)
    sys.exit()
  except Exception as e:
    logger.error("## Operation failed due to unknown error. Exiting.")
    logger.error(e)
    sys.exit()
  else:
    logger.info(f"## User {qs_user_name} is found in QuickSight.")

  # Remove current membership unless it's the desired one
  qs_new_group = azure_user_attribute_title  # Set "Title" SAML attribute as the desired QuickSight group name
  in_desired_group = False  # Set this flag True when the user is already a member of the desired group
  logger.info(f"## Starting group membership removal.")
  try:
    res_qs_list_user_groups = qs.list_user_groups(
      UserName = qs_user_name,
      AwsAccountId = account_id,
      Namespace = qs_namespace_name
    )
  except Exception as e:
    logger.error("## Operation failed due to unknown error. Exiting.")
    logger.error(e)
    sys.exit()
  else:
    # Skip if the array is empty (user is not member of any groups)
    if not res_qs_list_user_groups['GroupList']:
      logger.info(f"## User {qs_user_name} is not a member of any QuickSight group. Skipping removal.")
    else:
      for grp in res_qs_list_user_groups['GroupList']:
        qs_current_group = grp['GroupName']
        # Retain membership if the new and existing group names match
        if qs_current_group == qs_new_group:
          logger.info(f"## The user {qs_user_name} already belongs to the desired group. Skipping removal.")
          in_desired_group = True
        else:
          # Remove all unnecessary memberships
          logger.info(f"## Removing user {qs_user_name} from existing group {qs_current_group}.")
          try:
            res_qs_delete_group_membership = qs.delete_group_membership(
              MemberName = qs_user_name,
              GroupName = qs_current_group,
              AwsAccountId = account_id,
              Namespace = qs_namespace_name
            )
          except Exception as e:
            logger.error(f"## Operation failed due to unknown error. Exiting.")
            logger.error(e)
            sys.exit()
          else:
            logger.info(f"## The user {qs_user_name} has been removed from {qs_current_group}.")

  # Create group membership based on IIC attribute "Title"
  logger.info(f"## Starting group membership assignment.")
  if in_desired_group is True:
      logger.info(f"## The user already belongs to the desired one. Skipping assignment.")
  else:
    try:
      logger.info(f"## Checking if the desired group exists.")
      res_qs_describe_group = qs.describe_group(
        GroupName = qs_new_group,
        AwsAccountId = account_id,
        Namespace = qs_namespace_name
      )
    except qs.exceptions.ResourceNotFoundException as e:
      # Create a QuickSight group if not present
      logger.info(f"## Group {qs_new_group} is not present. Creating.")
      today = datetime.now()
      res_qs_create_group = qs.create_group(
        GroupName = qs_new_group,
        Description = 'Automatically created at ' + today.strftime('%Y.%m.%d %H:%M:%S'),
        AwsAccountId = account_id,
        Namespace = qs_namespace_name
      )
    except Exception as e:
      logger.error(f"## Operation failed due to unknown error. Exiting.")
      logger.error(e)
      sys.exit()
    else:
      logger.info(f"## Group {qs_new_group} is found in QuickSight.")

    # Add the user to the desired group
    logger.info("## Modifying group membership based on its latest attributes.")
    logger.info(f"#### QuickSight user name: {qs_user_name}")
    logger.info(f"#### QuickSight group name: {qs_new_group}")
    try: 
      res_qs_create_group_membership = qs.create_group_membership(
        MemberName = qs_user_name,
        GroupName = qs_new_group,
        AwsAccountId = account_id,
        Namespace = qs_namespace_name
    )
    except Exception as e:
      logger.error("## Operation failed due to unknown error. Exiting.")
      logger.error(e)
    else:
      logger.info("## Group membership modification succeeded.")
      qs_group_member_name = res_qs_create_group_membership['GroupMember']['MemberName']
      qs_group_member_arn = res_qs_create_group_membership['GroupMember']['Arn']
      logger.debug("## QuickSight group info:")
      logger.debug(f"#### qs_user_name: {qs_user_name}")
      logger.debug(f"#### qs_group_name: {qs_new_group}")
      logger.debug(f"#### qs_group_member_name: {qs_group_member_name}")
      logger.debug(f"#### qs_group_member_arn: {qs_group_member_arn}")
      logger.debug("## IIC info:")
      logger.debug(f"#### IIC user name: {azure_user_attribute_userprincipalname}")
      logger.debug(f"#### IIC user id: {iic_user_id}")
      logger.debug(f"#### Title: {azure_user_attribute_title}")
      logger.info(f"## User {qs_user_name} has been successfully added to the group {qs_new_group} in {qs_namespace_name} namespace.")
  
  # return response
  return {
    "namespaceName": qs_namespace_name,
    "userName": qs_user_name,
    "groupName": qs_new_group
  }

Note that this Lambda function requires Boto3 1.24.64 or later. If the Boto3 version included in the Lambda runtime is older than this, use a Lambda layer to bring in a newer version of Boto3. For more details, refer to How do I resolve “unknown service”, “parameter validation failed”, and “object has no attribute” errors from a Python (Boto 3) Lambda function.

Add an EventBridge rule to trigger the event

To create an EventBridge rule to invoke the previously created Lambda function, complete the following steps:

  1. On the EventBridge console, create a new rule.
  2. For Name, enter updateQuickSightUponSCIMEvent.
  3. For Event pattern, enter the following code:
    {
      "source": ["aws.sso-directory"],
      "detail-type": ["AWS API Call via CloudTrail"],
      "detail": {
        "eventSource": ["sso-directory.amazonaws.com"],
        "eventName": ["UpdateUser"]
      }
    }

  4. For Targets, choose the Lambda function you created (UpdateQuickSightUserUponSCIMEvent).
  5. Enable the rule.
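If you prefer to create the rule programmatically, the following boto3 sketch shows the equivalent calls; the Region and account ID are placeholders, and the function also needs a resource-based permission so EventBridge can invoke it.

import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

rule_name = "updateQuickSightUponSCIMEvent"
function_name = "UpdateQuickSightUserUponSCIMEvent"
function_arn = f"arn:aws:lambda:<REGION>:<ACCOUNTID>:function:{function_name}"

# Create (or update) the rule with the SCIM UpdateUser event pattern
rule_arn = events.put_rule(
    Name=rule_name,
    EventPattern=json.dumps({
        "source": ["aws.sso-directory"],
        "detail-type": ["AWS API Call via CloudTrail"],
        "detail": {
            "eventSource": ["sso-directory.amazonaws.com"],
            "eventName": ["UpdateUser"]
        }
    }),
    State="ENABLED"
)["RuleArn"]

# Point the rule at the Lambda function
events.put_targets(Rule=rule_name, Targets=[{"Id": "lambda-target", "Arn": function_arn}])

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn
)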

Verify the configuration by changing a user attribute value at Azure AD

Let’s modify a user’s attribute in Azure AD, and then check that the new group is created and the user is added to it.

  1. Go back to the Azure AD console.
  2. From Manage, choose Users.
  3. Choose one of the users you previously used to log in to QuickSight from the IAM Identity Center portal.
  4. Choose Edit properties, then edit the values for Job title and Department.
  5. Save the configuration.
  6. From Manage, choose Enterprise applications, your application name, and Provisioning.
  7. Choose Stop provisioning and then Start provisioning in sequence.

In Azure AD, the SCIM provisioning interval is fixed at 40 minutes. To get immediate results, we manually stop and start the provisioning.


  1. Navigate to the QuickSight console.
  2. On the drop-down user name menu, choose Manage QuickSight.
  3. Choose Manage groups.

Now you should find that the new group is created and the user is assigned to this group.


Clean up

When you’re finished with the solution, clean up your environment to minimize cost impact. You may want to delete the following resources:

  • Lambda function
  • Lambda layer
  • IAM role for the Lambda function
  • CloudWatch log group for the Lambda function
  • EventBridge rule
  • QuickSight account
    • Note: There can only be one QuickSight account per AWS account, so your QuickSight account might already be used by other users in your organization. Delete the QuickSight account only if you explicitly set it up to follow this blog and are absolutely sure that it is not being used by any other users.
  • IAM Identity Center instance
  • IAM ID Provider configuration for Azure AD
  • Azure AD instance

Summary

This post provided step-by-step instructions to configure IAM Identity Center SCIM provisioning and SAML 2.0 federation from Azure AD for centralized management of QuickSight users. We also demonstrated automated group membership updates in QuickSight based on user attributes in Azure AD, by using SCIM events generated in IAM Identity Center and setting up automation with EventBridge and Lambda.

With this event-driven approach to provisioning users and groups in QuickSight, system administrators have the flexibility to support the different ways user management may be handled across an organization. It also ensures that users and groups stay consistent between QuickSight and Azure AD whenever a user accesses QuickSight.

We look forward to hearing your questions and feedback.


About the authors

Takeshi Nakatani is a Principal Big Data Consultant on the Professional Services team in Tokyo. He has 25 years of experience in the IT industry, with expertise in architecting data infrastructure. On his days off, he can be a rock drummer or a motorcyclist.

Wakana Vilquin-Sakashita is a Specialist Solutions Architect for Amazon QuickSight. She works closely with customers to help them make sense of their data through visualization. Previously, Wakana worked for S&P Global, assisting customers in accessing data, insights, and research relevant to their business.

Simplify management of Network Firewall rule groups with VPC managed prefix lists

Post Syndicated from Mojgan Toth original https://aws.amazon.com/blogs/security/simplify-management-of-network-firewall-rule-groups-with-vpc-managed-prefix-lists/

In this blog post, we will show you how to use managed prefix lists to simplify management of your AWS Network Firewall rules and policies across your Amazon Virtual Private Cloud (Amazon VPC) in the same AWS Region.

AWS Network Firewall is a stateful, managed, network firewall and intrusion detection and prevention service for your Amazon VPC. With Network Firewall, you can filter inbound and outbound traffic to or from internet gateways; AWS Direct Connect gateways; AWS PrivateLink, AWS Site-to-Site VPN, and AWS Client VPN gateways; NAT gateways; and even between other attached VPCs and subnets.

You can use Network Firewall to help prevent your VPC from accessing unauthorized domains, to block IP addresses, and to perform deep packet inspection or protocol filtering. However, it can be time-consuming to update your firewall’s rule groups to add, remove, or modify the list of IP addresses across multiple Network Firewall instances, which can be deployed in distributed, centralized, or combined deployment models.

With prefix lists, you can group one or more CIDR blocks into a single object. Therefore, you can group IP addresses that you frequently use in a prefix list, and reference this list in Network Firewall rule groups. With this approach, you don’t need to update individual firewall rules when scaling the network to add new IP addresses, and the Network Firewall rule groups that reference the prefix list are automatically updated.

In this post, we will show you how to build an example configuration in your test environment that uses customer-managed prefix lists in a Network Firewall rule group.

Note: This configuration will incur costs as described at AWS Network Firewall pricing.

Prerequisites

For this walkthrough, make sure that you have the following prerequisites in place:

Solution overview

In this post, we will show you how to create a simple architecture in a VPC with three different VPC prefix lists for the private and public subnets, and provide protection by restricting traffic flow to the firewall subnet. Then you will create a stateful Network Firewall rule group that includes IP set references mapped to the VPC prefix lists. Figure 1 illustrates the architecture of a protected VPC.

Figure 1: Simple architecture of a protected VPC

In this example, the following three subnets are in the protected VPC:

  1. Firewall subnet: 10.1.0.0/28
    This subnet is dedicated for use by Network Firewall. The Network Firewall endpoint is deployed into a dedicated subnet of the VPC.
  2. Public subnet (protected subnet): 10.1.2.0/28
    The resources are designed to be internet-facing, so this subnet needs to communicate with the internet gateway. The NAT gateway and load balancer are also hosted on this subnet.
  3. Private subnet (protected workload subnet): 10.1.3.0/28
    This is the subnet where you host your private workload that doesn’t accept incoming traffic from the internet (in our example, this is the webservers). The private workload can send requests to the internet through the NAT gateway.

Deploy the CloudFormation template

The following AWS CloudFormation template deploys a network firewall and related resources in a distributed architecture across two Availability Zones. In production, AWS recommends that you use multiple Availability Zones to help ensure high availability and improve fault tolerance. To simplify the instructions, we will focus on a single Availability Zone for this blog post.

To deploy the CloudFormation template

  • Choose the following Launch Stack button.

    Launch Stack

    Launch the CloudFormation template in the Region of your choice. Make sure that the Region that you choose supports Network Firewall. Select the Availability Zone or Zones to be used for this deployment, and leave the rest of the options as default.

Create the VPC prefix lists

In this section, we will show you how to define your requirements and implement them within Network Firewall to only enable Secure Shell (SSH) traffic from a trusted IP range (an authorized public subnet on the protected VPC) to the private subnet. We will also show you how to block Internet Control Message Protocol (ICMP) traffic from another IP range (with CIDR 10.0.1.0/24).

You will create the following VPC prefix lists:

  • Public-ip-list — includes the protected subnet: 10.1.2.0/28
  • Private-deny-list — includes a CIDR block from the other VPC: 10.0.1.0/24
  • Private-allow-list — includes the protected workload subnet: 10.1.3.0/28

To create the VPC prefix lists

  1. Open the Amazon VPC console and choose Managed prefix lists.
  2. Choose Create prefix list, and then do the following, as shown in Figure 2:
    • For Prefix list name, enter a name for the prefix list. In our example, the name is Public-ip-list.
    • For Max entries, enter the maximum number of entries for the prefix list. In our example, this number is 10.
    • For Address family, select the prefix list that supports IPv4 entries.

      Note: Network Firewall currently supports only references to IPv4 prefix lists.

    • For Prefix list entries, choose Add new entry, and then enter the CIDR block and a description for the entry. In our example, the CIDR block is 10.1.2.0/28.
    • Choose Create prefix list.
      Figure 2: Example of managed prefix lists

  3. Repeat the preceding steps for the two remaining prefix lists: Private-deny-list and Private-allow-list.

When you’ve finished creating the prefix lists, you can view them under Managed prefix lists, as shown in Figure 3.

Figure 3: Example of VPC prefix lists
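If you prefer to script this step, the following boto3 sketch creates the same three prefix lists with the names and CIDR blocks used above.

import boto3

ec2 = boto3.client("ec2")

prefix_lists = {
    "Public-ip-list": "10.1.2.0/28",     # protected public subnet
    "Private-deny-list": "10.0.1.0/24",  # CIDR block from the other VPC
    "Private-allow-list": "10.1.3.0/28", # protected workload subnet
}

for name, cidr in prefix_lists.items():
    response = ec2.create_managed_prefix_list(
        PrefixListName=name,
        MaxEntries=10,
        AddressFamily="IPv4",
        Entries=[{"Cidr": cidr, "Description": name}],
    )
    print(name, response["PrefixList"]["PrefixListId"])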

Create a Network Firewall rule group

The next step is to create a Network Firewall rule group. A Network Firewall rule group is a reusable set of criteria for inspecting and handling network traffic. As part of this configuration, we will take advantage of customer-managed VPC prefix lists as a variable to simplify the management of the rules.

To create a Network Firewall rule group

  1. In the Amazon VPC console, in the left navigation pane, choose Network Firewall rule groups.
  2. From the Rule groups tab, select Create Network Firewall rule group, and then do the following, as shown in Figure 4:
    • For Rule group type, select Stateful rule group.
    • For Name, enter your network firewall rule group.
    • For Capacity, enter 25 or another appropriate value.
    • For Stateful rule group options, select 5-tuple.
    • Under Stateful rule order, select Default.
    Figure 4: Network Firewall rule group

  3. In the IP set references section, do the following, as shown in Figure 5:
    1. For IP set reference variable name, enter a new variable name for each of your VPC prefix lists.
    2. From the IP set resource ID dropdown, select an IP set.

    In this example, you are creating three IP set references that are mapped to the VPC prefix lists that you configured in the previous sections, as shown in the following table.

    IP set reference variable name Mapped VPC prefix list CIDR block
    IP_list_Allow_ssh_subnets Public-ip-list 10.1.2.0/28
    IP_list_Private_Deny Private-deny-list 10.0.1.0/24
    IP_list_private_subnets Private-allow-list 10.1.3.0/28
    Figure 5: Example of IP set references

  4. In the Add rule section, do the following, as shown in Figure 6:
    1. Select the protocol.
    2. For Source, select Custom and then enter the IP set reference variable name for the source IP address with the following format: <@Your_ip_set_reference_name>. In our example, the name is @IP_list_Allow_ssh_subnets.
    3. For Source port, select Custom and enter the appropriate port number.
    4. For Destination, choose Custom and then enter the IP set reference variable name for the destination IP address with the following format: <@Your_ip_set_reference_name>. In our example, the name is @IP_list_private_subnets.
    5. For Destination port, choose Custom and enter the appropriate port number.
    6. For Traffic direction, select Any.
    7. For Action, select Pass.
    8. Choose Add rule.
    Figure 6: Example of a Network Firewall rule group with custom IP set references

  5. For the next set of rules, repeat the preceding steps and choose the appropriate protocol, source, destination, traffic direction, and action, as shown in the following table.

    Protocol Source Destination Source port Destination port Direction Action
    SSH @IP_list_Allow_ssh_subnets @IP_list_private_subnets 22 22 Forward Pass
    SSH Any @IP_list_private_subnets Any 22 Forward Drop
    ICMP @IP_list_Private_Deny Any Any Any Forward Drop

    After completion, you will have a set of stateful rules, as shown in Figure 7.

    Figure 7: Example list of Network Firewall rules

Congratulations! You have configured Network Firewall rule groups by using VPC prefix lists for a simplified management to allow SSH traffic only from authorized subnets and to deny ICMP traffic from unauthorized subnets.
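The same rule group can also be created with the AWS SDK. The following boto3 sketch shows the overall shape: the prefix list ARNs are placeholders, and only the first Pass rule from the table is spelled out; the two Drop rules follow the same structure with their own sid values.

import boto3

network_firewall = boto3.client("network-firewall")

# Replace the prefix list IDs with the ones created earlier
ip_set_references = {
    "IP_list_Allow_ssh_subnets": {"ReferenceArn": "arn:aws:ec2:<REGION>:<ACCOUNTID>:prefix-list/pl-aaaaaaaa"},
    "IP_list_Private_Deny": {"ReferenceArn": "arn:aws:ec2:<REGION>:<ACCOUNTID>:prefix-list/pl-bbbbbbbb"},
    "IP_list_private_subnets": {"ReferenceArn": "arn:aws:ec2:<REGION>:<ACCOUNTID>:prefix-list/pl-cccccccc"},
}

network_firewall.create_rule_group(
    RuleGroupName="prefix-list-demo-rule-group",
    Type="STATEFUL",
    Capacity=25,
    RuleGroup={
        "ReferenceSets": {"IPSetReferences": ip_set_references},
        "RulesSource": {
            "StatefulRules": [
                {
                    # Allow SSH from the authorized public subnet to the private subnet
                    "Action": "PASS",
                    "Header": {
                        "Protocol": "SSH",
                        "Source": "@IP_list_Allow_ssh_subnets",
                        "SourcePort": "22",
                        "Direction": "FORWARD",
                        "Destination": "@IP_list_private_subnets",
                        "DestinationPort": "22",
                    },
                    "RuleOptions": [{"Keyword": "sid", "Settings": ["1"]}],
                },
                # Add the two Drop rules from the preceding table here (sid 2 and 3)
            ]
        },
    },
)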

For the next steps, you can test your configuration by trying to use protocols such as SSH or ICMP from unauthorized subnets to your private subnets and reviewing the behavior. You can also test your configuration by doing the same from authorized subnets and comparing the results. Furthermore, you can create logging and monitoring solutions in Network Firewall to review the dropped or allowed packets from your Network Firewall log groups in CloudWatch Logs or use contributor insights to analyze Network Firewall logs.

Clean up the resources

To clean up the resources that you created for this walkthrough, do the following:

  1. Remove all subnet associations from the route tables.
  2. Delete Network Firewall policies, rule groups, and IP set references.
  3. Delete the network firewall.
  4. Delete VPC prefix lists.
  5. Delete your subnets.
  6. Delete the route tables.
  7. Delete the VPC.
  8. Delete the CloudFormation stack (if you created your environment through CloudFormation).

Conclusion

In this post, you learned how to use Amazon VPC managed prefix lists to simplify management of IP addresses within Network Firewall rule groups. IP set references that are mapped to your VPC prefix lists are a great tool to help simplify your firewall rules and reduce operational overhead and administration as you scale your network.

For information about pricing, see AWS Network Firewall pricing. For more information about managed prefix lists, see Work with customer-managed prefix lists. For more examples and use cases, see previous Network Firewall posts on the AWS Security Blog.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Mojgan Toth

Mojgan is a Sr. Technical Account Manager. She loves putting together solutions around well-architected design and resiliency. When it comes to her personal life, she loves cooking, painting, and spending time with her family, especially her two little sons. They love outdoor activities such as bike rides and hikes.

How SafetyCulture scales unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift

Post Syndicated from Anish Moorjani original https://aws.amazon.com/blogs/big-data/how-safetyculture-scales-unpredictable-dbt-cloud-workloads-in-a-cost-effective-manner-with-amazon-redshift/

This post is co-written by Anish Moorjani, Data Engineer at SafetyCulture.

SafetyCulture is a global technology company that puts the power of continuous improvement into everyone’s hands. Its operations platform unlocks the power of observation at scale, giving leaders visibility and workers a voice in driving quality, efficiency, and safety improvements.

Amazon Redshift is a fully managed data warehouse service that tens of thousands of customers use to manage analytics at scale. Together with price-performance, Amazon Redshift enables you to use your data to acquire new insights for your business and customers while keeping costs low.

In this post, we share the solution SafetyCulture used to scale unpredictable dbt Cloud workloads in a cost-effective manner with Amazon Redshift.

Use case

SafetyCulture runs an Amazon Redshift provisioned cluster to support unpredictable and predictable workloads. A source of unpredictable workloads is dbt Cloud, which SafetyCulture uses to manage data transformations in the form of models. Whenever models are created or modified, a dbt Cloud CI job is triggered to test the models by materializing the models in Amazon Redshift. To balance the needs of unpredictable and predictable workloads, SafetyCulture used Amazon Redshift workload management (WLM) to flexibly manage workload priorities.

With plans for further growth in dbt Cloud workloads, SafetyCulture needed a solution that does the following:

  • Caters for unpredictable workloads in a cost-effective manner
  • Separates unpredictable workloads from predictable workloads to scale compute resources independently
  • Continues to allow models to be created and modified based on production data

Solution overview

The solution SafetyCulture used is comprised of Amazon Redshift Serverless and Amazon Redshift Data Sharing, along with the existing Amazon Redshift provisioned cluster.

Amazon Redshift Serverless caters to unpredictable workloads in a cost-effective manner because compute cost is not incurred when there is no workload. You pay only for what you use. In addition, moving unpredictable workloads into a separate Amazon Redshift data warehouse allows each Amazon Redshift data warehouse to scale resources independently.

Amazon Redshift Data Sharing enables data access across Amazon Redshift data warehouses without having to copy or move data. Therefore, when a workload is moved from one Amazon Redshift data warehouse to another, the workload can continue to access data in the initial Amazon Redshift data warehouse.

The following figure shows the solution and workflow steps:

  1. We create a serverless instance to cater for unpredictable workloads. Refer to Managing Amazon Redshift Serverless using the console for setup steps.
  2. We create a datashare called prod_datashare to allow the serverless instance access to data in the provisioned cluster. Refer to Getting started data sharing using the console for setup steps. Database names are identical to allow queries with full path notation database_name.schema_name.object_name to run seamlessly in both data warehouses.
  3. dbt Cloud connects to the serverless instance, and models that are created or modified are tested by being materialized in the default database dev, in either each user’s personal schema or a pull request-related schema. Instead of dev, you can use a different database designated for testing. Refer to Connect dbt Cloud to Redshift for setup steps.
  4. You can query materialized models in the serverless instance alongside materialized models in the provisioned cluster to validate changes. After you validate the changes, you can implement the models from the serverless instance in the provisioned cluster.
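As a rough sketch of step 1, the serverless instance can also be created with the AWS SDK; the namespace and workgroup names below are placeholders, and networking, IAM roles, and admin credentials are omitted for brevity.

import boto3

redshift_serverless = boto3.client("redshift-serverless")

# Create the namespace (storage) and a workgroup (compute) for dbt Cloud workloads
redshift_serverless.create_namespace(namespaceName="dbt-cloud-namespace")

redshift_serverless.create_workgroup(
    workgroupName="dbt-cloud-workgroup",
    namespaceName="dbt-cloud-namespace",
    baseCapacity=128,  # RPUs; size this to match your workload
)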

Outcome

SafetyCulture carried out the steps to create the serverless instance and datashare, with integration to dbt Cloud, with ease. SafetyCulture also successfully ran its dbt project with all seeds, models, and snapshots materialized into the serverless instance via run commands from the dbt Cloud IDE and dbt Cloud CI jobs.

Regarding performance, SafetyCulture observed dbt Cloud workloads completing on average 60% faster in the serverless instance. Better performance could be attributed to two areas:

  • Amazon Redshift Serverless measures compute capacity using Redshift Processing Units (RPUs). Because it costs the same to run 64 RPUs in 10 minutes and 128 RPUs in 5 minutes, having a higher number of RPUs to complete a workload sooner was preferred.
  • With dbt Cloud workloads isolated on the serverless instance, dbt Cloud was configured with more threads to allow materialization of more models at once.

To determine cost, you can perform an estimation. 128 RPUs provides approximately the same amount of memory that an ra3.4xlarge 21-node provisioned cluster provides. In US East (N. Virginia), the cost of running a serverless instance with 128 RPUs is $48 hourly ($0.375 per RPU hour * 128 RPUs). In the same Region, the cost of running an ra3.4xlarge 21-node provisioned cluster on demand is $68.46 hourly ($3.26 per node hour * 21 nodes). Therefore, an accumulated hour of unpredictable workloads on a serverless instance is 29% more cost-effective than an on-demand provisioned cluster. Calculations in this example should be recalculated when performing future cost estimations because prices may change over time.
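The following snippet reproduces this back-of-the-envelope comparison; the prices are the US East (N. Virginia) prices quoted above, so re-check current pricing before relying on it.

# Hourly cost comparison from the estimation above
serverless_hourly = 0.375 * 128   # $ per RPU-hour * RPUs = $48.00
provisioned_hourly = 3.26 * 21    # $ per node-hour * nodes = $68.46

savings = 1 - serverless_hourly / provisioned_hourly
print(f"Serverless is {savings:.1%} more cost-effective per hour")  # roughly the 29% quoted above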

Learnings

SafetyCulture had two key learnings to better integrate dbt with Amazon Redshift, which can be helpful for similar implementations.

First, when integrating dbt with an Amazon Redshift datashare, configure INCLUDENEW=True to ease management of database objects in a schema:

ALTER DATASHARE datashare_name SET INCLUDENEW = TRUE FOR SCHEMA schema;

For example, assume the model customers.sql is materialized by dbt as the view customers. Next, customers is added to a datashare. When customers.sql is modified and rematerialized by dbt, dbt creates a new view with a temporary name, drops customers, and renames the new view to customers. Although the new view carries the same name, it’s a new database object that wasn’t added to the datashare. Therefore, customers is no longer found in the datashare.

Configuring INCLUDENEW=True allows new database objects to be automatically added to the datashare. An alternative to configuring INCLUDENEW=True and providing more granular control is the use of dbt post-hook.

Second, when integrating dbt with more than one Amazon Redshift data warehouse, define sources with database to aid dbt in evaluating the right database.

For example, assume a dbt project is used across two dbt Cloud environments to isolate production and test workloads. The dbt Cloud environment for production workloads is configured with the default database prod_db and connects to a provisioned cluster. The dbt Cloud environment for test workloads is configured with the default database dev and connects to a serverless instance. In addition, the provisioned cluster contains the table prod_db.raw_data.sales, which is made available to the serverless instance via a datashare as prod_db.raw_data.sales.

When dbt compiles a model containing the source {{ source('raw_data', 'sales') }}, the source is evaluated as database.raw_data.sales. If database is not defined for sources, dbt sets the database to the configured environment’s default database. Therefore, the dbt Cloud environment connecting to the provisioned cluster evaluates the source as prod_db.raw_data.sales, while the dbt Cloud environment connecting to the serverless instance evaluates the source as dev.raw_data.sales, which is incorrect.

Defining database for sources allows dbt to consistently evaluate the right database across different dbt Cloud environments, because it removes ambiguity.
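
A minimal sketch of such a source definition follows; the database, schema, and table names mirror the example above and are placeholders. In the environment that reads from the datashare, you would point database at the database created from that datashare instead.

# models/sources.yml (sketch; names are placeholders)
version: 2

sources:
  - name: raw_data
    database: prod_db   # explicit database, so dbt does not fall back to the environment's default
    schema: raw_data
    tables:
      - name: sales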

Conclusion

After testing Amazon Redshift Serverless and Data Sharing, SafetyCulture is satisfied with the result and has started productionalizing the solution.

“The PoC showed the vast potential of Redshift Serverless in our infrastructure,” says Thiago Baldim, Data Engineer Team Lead at SafetyCulture. “We could migrate our pipelines to support Redshift Serverless with simple changes to the standards we were using in our dbt. The outcome provided a clear picture of the potential implementations we could do, decoupling the workload entirely by teams and users and providing the right level of computation power that is fast and reliable.”

Although this post specifically targets unpredictable workloads from dbt Cloud, the solution is also relevant for other unpredictable workloads, including ad hoc queries from dashboards. Start exploring Amazon Redshift Serverless for your unpredictable workloads today.


About the authors

Anish Moorjani is a Data Engineer in the Data and Analytics team at SafetyCulture. He helps SafetyCulture’s analytics infrastructure scale with the exponential increase in the volume and variety of data.

Randy Chng is an Analytics Solutions Architect at Amazon Web Services. He works with customers to accelerate the solution of their key business problems.

Role-based access control in Amazon OpenSearch Service via SAML integration with AWS IAM Identity Center

Post Syndicated from Scott Chang original https://aws.amazon.com/blogs/big-data/role-based-access-control-in-amazon-opensearch-service-via-saml-integration-with-aws-iam-identity-center/

Amazon OpenSearch Service is a managed service that makes it simple to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. AWS IAM Identity Center (successor to AWS Single Sign-On) helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. To build a strong least-privilege security posture, customers also want fine-grained access control to manage dashboard permissions by user role. In this post, we demonstrate a step-by-step procedure to integrate IAM Identity Center with OpenSearch Service via native SAML integration, and configure role-based access control in OpenSearch Dashboards by using group attributes in IAM Identity Center. You can follow the steps in this post to achieve both authentication and authorization for OpenSearch Service based on the groups configured in IAM Identity Center.

Solution overview

Let’s review how to map users and groups in IAM Identity Center to OpenSearch Service security roles. Backend roles in OpenSearch Service are used to map external identities or attributes of workgroups to pre-defined OpenSearch Service security roles.

The following diagram shows the solution architecture. Create two groups, assign a user to each group and edit attribute mappings in IAM Identity Center. If you have integrated IAM Identity Center with your Identity Provider (IdP), you can use existing users and groups mapped to your IdP for this test. The solution uses two roles: all_access for administrators, and alerting_full_access for developers who are only allowed to manage OpenSearch Service alerts. You can set up backend role mapping in OpenSearch Dashboards by group ID. Based on the following diagram, you can map the role all_access to the group Admin, and alerting_full_access to Developer. User janedoe is in the group Admin, and user johnstiles is in the group Developer.

Then you will log in as each user to verify the access control by looking at the different dashboard views.

Let’s get started!

Prerequisites

Complete the following prerequisite steps:

  1. Have an AWS account.
  2. Have an Amazon OpenSearch Service domain.
  3. Enable IAM Identity Center in the same Region as the OpenSearch Service domain.
  4. Have test users in IAM Identity Center (to create users, refer to Add users).

Enable SAML in Amazon OpenSearch Service and copy SAML parameters

To configure SAML in OpenSearch Service, complete the following steps:

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Choose your domain.
  3. On the Security configuration tab, confirm that Fine-grained access control is enabled.
  4. On the Actions menu, choose Edit security configuration.
  5. Select Enable SAML authentication.

You can also configure SAML during domain creation if you are creating a new OpenSearch domain. For more information, refer to SAML authentication for OpenSearch Dashboards.

  1. Copy the values for Service provider entity ID and IdP-Initiated SSO URL.

Create a SAML application in IAM Identity Center

To create a SAML application in IAM Identity Center, complete the following steps:

  1. On the IAM Identity Center console, choose Applications in the navigation pane.
  2. Choose Add application.
  3. Select Add custom SAML 2.0 application, then choose Next.
  4. Enter your application name for Display name.
  5. Under IAM Identity Center metadata, choose Download to download the SAML metadata file.
  6. Under Application metadata, select Manually type your metadata values.
  7. For Application ACS URL, enter the IdP-initiated URL you copied earlier.
  8. For Application SAML audience, enter the service provider entity ID you copied earlier.
  9. Choose Submit.
  10. On the Actions menu, choose Edit attribute mappings.
  11. Create attributes and map the following values:
    1. Map Subject to ${user:email}; the format is emailAddress.
    2. Map Role to ${user:groups}; the format is unspecified.
  12. Choose Save changes.
  13. On the IAM Identity Center console, choose Groups in the navigation pane.
  14. Create two groups: Developer and Admin.
  15. Assign user janedoe to the group Admin.
  16. Assign user johnstiles to the group Developer.
  17. Open the Admin group and copy the group ID.

Finish SAML configuration and map the SAML primary backend role

To complete your SAML configuration and map the SAML primary backend role, complete the following steps:

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Open your domain and choose Edit security configuration.
  3. Under SAML authentication for OpenSearch Dashboards/Kibana, for Import IdP metadata, choose Import from XML file.
  4. Upload the IdP metadata downloaded from the IAM Identity Center metadata file.

The IdP entity ID will be auto populated.

  1. Under SAML master backend role, enter the group ID of the Admin group you copied earlier.
  2. For Roles key, enter Role for the SAML assertion.

This is because we defined and mapped Role to ${user:groups} as a SAML attribute in IAM Identity Center.

  1. Choose Save changes.

Configure backend role mapping for the Developer group

You have completely integrated IAM Identity Center with OpenSearch Service and mapped the Admin group as the primary role (all_access) in OpenSearch Service. Now you will log in to OpenSearch Dashboards as Admin and configure mapping for the Developer group.

There are two ways to log in to OpenSearch Dashboards:

  • OpenSearch Dashboards URL – On the OpenSearch Service console, navigate to your domain and choose the Dashboards URL under General Information. (For example, https://opensearch-domain-name-random-keys.us-west-2.es.amazonaws.com/_dashboards)
  • AWS access portal URL – On the IAM Identity Center console, choose Dashboard in the navigation pane and choose the access portal URL under Settings summary. (For example, https://d-1234567abc.awsapps.com/start)

Complete the following steps:

  1. Log in as the user in the Admin group (janedoe).
  2. Choose the tile for your OpenSearch Service application to be redirected to OpenSearch Dashboards.
  3. Choose the menu icon, then choose Security, Roles.
  4. Choose the alerting_full_access role and on the Mapped users tab, choose Manage mapping.
  5. For Backend roles, enter the group ID of Developer.
  6. Choose Map to apply the change.

Now you have successfully mapped the Developer group to the alerting_full_access role in OpenSearch Service.
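
If you prefer to script this mapping instead of using the UI, the fine-grained access control REST API can apply the same backend role mapping. The following sketch can be run from Dev Tools in OpenSearch Dashboards; the group ID is a placeholder, and note that PUT replaces any existing mapping for the role.

PUT _plugins/_security/api/rolesmapping/alerting_full_access
{
  "backend_roles": ["<Developer-group-ID>"]
}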

Verify permissions

To verify permissions, complete the following steps:

  1. Log out of the Admin account in OpenSearch Dashboards and log in as a Developer user.
  2. Choose the OpenSearch Service application tile to be redirected to OpenSearch Dashboards.

You can see there are only alerting related features available on the drop-down menu. This Developer user can’t see all of the Admin features, such as Security.

Clean up

After you test the solution, remember to delete all of the resources you created to avoid incurring future charges:

  1. Delete your Amazon OpenSearch Service domain.
  2. Delete the SAML application, users, and groups in IAM Identity Center.

Conclusion

In the post, we walked through a solution of how to map roles in Amazon OpenSearch Service to groups in IAM Identity Center by using SAML attributes to achieve role-based access control for accessing OpenSearch Dashboards. We connected IAM Identity Center users to OpenSearch Dashboards, and also mapped predefined OpenSearch Service security roles to IAM Identity Center groups based on group attributes. This makes it easier to manage permissions without updating the mapping when new users belonging to the same workgroup want to log in to OpenSearch Dashboards. You can follow the same procedure to provide fine-grained access to workgroups based on team functions or compliance requirements.


About the Authors

Scott Chang is a Solutions Architect at AWS based in San Francisco. He has over 14 years of hands-on experience in networking and is also familiar with security and site reliability engineering. He works with one of the major strategic customers in the west region to design highly scalable, innovative, and secure cloud solutions.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

How to choose the right Amazon MSK cluster type for you

Post Syndicated from Ali Alemi original https://aws.amazon.com/blogs/big-data/how-to-choose-the-right-amazon-msk-cluster-type-for-you/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an AWS streaming data service that manages Apache Kafka infrastructure and operations, making it easy for developers and DevOps managers to run Apache Kafka applications and Kafka Connect connectors on AWS, without the need to become experts in operating Apache Kafka. Amazon MSK operates, maintains, and scales Apache Kafka clusters, provides enterprise-grade security features out of the box, and has built-in AWS integrations that accelerate development of streaming data applications. You can easily get started by creating an MSK cluster using the AWS Management Console with a few clicks.

When creating a cluster, you must choose a cluster type from two options: provisioned or serverless. Choosing the best cluster type for each workload depends on the type of workload and your DevOps preferences. Amazon MSK provisioned clusters offer more flexibility in how you scale, configure, and optimize your cluster. Amazon MSK Serverless, on the other hand, makes scaling, load management, and operation of the cluster easier for you. With MSK Serverless, you can run your applications without having to configure, manage the infrastructure, or optimize clusters, and you pay for the data volume you stream and retain. MSK Serverless fully manages partitions, including monitoring as well as ensuring an even balance of partition distribution across brokers in the cluster (auto-balancing).

In this post, I examine a use case with the fictitious company AnyCompany, who plans to use Amazon MSK for two applications. They must decide between provisioned or serverless cluster types. I describe a process by which they work backward from the applications’ requirements to find the best MSK cluster type for their workloads, including how the organizational structure and application requirements are relevant in finding the best offering. Lastly, I examine the requirements and their relationship to Amazon MSK features.

Use case

AnyCompany is an enterprise organization that is ready to move two of their Kafka applications to Amazon MSK.

The first is a large ecommerce platform, which is a legacy application that currently uses a self-managed Apache Kafka cluster run in their data centers. AnyCompany wants to migrate this application to the AWS Cloud and use Amazon MSK to reduce maintenance and operations overhead. AnyCompany has a DataOps team that has been operating self-managed Kafka clusters in their data centers for years. AnyCompany wants to continue using the DataOps team to manage the MSK cluster on behalf of the development team. There is very little flexibility for code changes. For example, a few modules of the application require plaintext communication and access to the Apache ZooKeeper cluster that comes with an MSK cluster. The ingress throughput for this application doesn’t fluctuate often. The ecommerce platform only experiences a surge in user activity during special sales events. The DataOps team has a good understanding of this application’s traffic pattern, and are confident that they can optimize an MSK cluster by setting some custom broker-level configurations.

The second application is a new cloud-native gaming application currently in development. AnyCompany hopes to launch this gaming application soon followed by a marketing campaign. Throughput needs for this application are unknown. The application is expected to receive high traffic initially, then user activity should decline gradually. Because the application is going to launch first in the US, traffic during the day is expected to be higher than at night. This application offers a lot of flexibility in terms of Kafka client version, encryption in transit, and authentication. Because this is a cloud-native application, AnyCompany hopes they can delegate full ownership of its infrastructure to the development team.

Solution overview

Let’s examine a process that helps AnyCompany decide between the two Amazon MSK offerings. The following diagram shows this process at a high level.

In the following sections, I explain each step in detail and the relevant information that AnyCompany needs to collect before they make a decision.

Competency in Apache Kafka

AWS recommends a list of best practices to follow when using the Amazon MSK provisioned offering. Amazon MSK provisioned offers more flexibility, so you can make scaling decisions based on what's best for your workloads. For example, you can save on cost by consolidating a group of workloads into a single cluster. You can decide which metrics are important to monitor and optimize your cluster through applying custom configurations to your brokers. You can choose your Apache Kafka version, among different supported versions, and decide when to upgrade to a new version. Amazon MSK takes care of applying your configuration and upgrading each broker in a rolling fashion.

With more flexibility, you have more responsibilities. You need to make sure your cluster is right-sized at any time. You can achieve this by monitoring a set of cluster-level, broker-level, and topic-level metrics to ensure you have enough resources for your throughput. You also need to make sure the number of partitions assigned to each broker doesn't exceed the numbers suggested by Amazon MSK. If partitions are unbalanced, you need to redistribute them evenly across all brokers. If you have more partitions than recommended, you need to either upgrade brokers to a larger size or increase the number of brokers in your cluster. There are also best practices for the number of TCP connections when using AWS Identity and Access Management (IAM) authentication.

An MSK Serverless cluster takes away the complexity of right-sizing clusters and balancing partitions across brokers. This makes it easy for developers to focus on writing application code.

AnyCompany has an experienced DataOps team who are familiar with scaling operations and best practices for the MSK provisioned cluster type. AnyCompany can use their DataOps team’s Kafka expertise for building automations and easy-to-follow standard procedures on behalf of the ecommerce application team. The gaming development team is an exception, because they are expected to take the full ownership of the infrastructure.

In the following sections, I discuss other steps in the process before deciding which cluster type is right for each application.

Custom configuration

In certain use cases, you need to configure your MSK cluster differently from its default settings. This could be due to your application requirements. For example, AnyCompany's ecommerce platform requires setting up brokers such that the default retention period for all topics is set to 72 hours. Also, topics should be auto-created when they are requested and don't exist.

The Amazon MSK provisioned offering provides a default configuration for brokers, topics, and Apache ZooKeeper nodes. It also allows you to create custom configurations and use them to create new MSK clusters or update existing clusters. An MSK cluster configuration consists of a set of properties and their corresponding values.
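
For example, a custom configuration that implements the ecommerce requirements above (a 72-hour default retention period and automatic topic creation) might look like the following sketch; confirm property names and allowed values against the Amazon MSK custom configuration documentation before applying them.

# Sketch of an MSK custom configuration (server.properties format)
auto.create.topics.enable=true
log.retention.hours=72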

MSK Serverless doesn’t allow applying broker-level configuration. This is because AWS takes care of configuring and managing the backend nodes. It takes away the heavy lifting of configuring the broker nodes. You only need to manage your applications’ topics. To learn more, refer to the list of topic-level configurations that MSK Serverless allows you to change.

Unlike the ecommerce platform, AnyCompany's gaming application doesn't need broker-level custom configuration. The developers only want to set retention.ms and max.message.bytes per topic.

Application requirements

Apache Kafka applications differ in terms of their security; the way they connect, write, or read data; data retention period; and scaling patterns. For example, some applications can only scale vertically, whereas other applications can scale only horizontally. Although a flexible application can work with encryption in transit, a legacy application may only be able to communicate in plaintext format.

Cluster-level quotas

Amazon MSK enforces some quotas to ensure the performance, reliability, and availability of the service for all customers. These quotas are subject to change at any time. To access the latest values for each dimension, refer to Amazon MSK quota. Note that some of the quotas are soft limits and can be increased using a support ticket.

When choosing a cluster type in Amazon MSK, it's important to understand your application requirements and compare them against the quotas of each offering. This helps you choose the cluster type that best meets your goals and your application's needs. Let's examine how to calculate the throughput you need and the other important dimensions to compare with the Amazon MSK quotas (a short calculation sketch follows the list):

  • Number of clusters per account – Amazon MSK may have quotas for how many clusters you can create in a single AWS account. If this is limiting your ability to create more clusters, you can consider creating those in multiple AWS accounts and using secure connectivity patterns to provide access to your applications.
  • Message size – You need to make sure the maximum message size that your producer writes for a single message is lower than the configured size in the MSK cluster. MSK provisioned clusters allow you to change the default value in a custom configuration. If you choose MSK Serverless, check this value in Amazon MSK quota. The average message size is helpful when calculating the total ingress or egress throughput of the cluster, which I demonstrate later in this post.
  • Message rate per second – This directly influences total ingress and egress throughput of the cluster. Total ingress throughput equals the message rate per second multiplied by message size. You need to make sure your producer is configured for optimal throughput by adjusting batch.size and linger.ms properties. If you’re choosing MSK Serverless, you need to make sure you configure your producer to optimal batches with the rate that is lower than its request rate quota.
  • Number of consumer groups – This directly influences the total egress throughput of the cluster. Total egress throughput equals the ingress throughput multiplied by the number of consumer groups. If you’re choosing MSK Serverless, you need to make sure your application can work with these quotas.
  • Maximum number of partitions – Amazon MSK provisioned recommends not exceeding certain limits per broker (depending on the broker size). If the number of partitions per broker exceeds the recommended maximum for its size, you can't perform certain upgrade or update operations. MSK Serverless also has a quota of maximum number of partitions per cluster. You can request to increase the quota by creating a support case.
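
The following sketch shows the throughput arithmetic from the list above with placeholder numbers; plug in your own message rate, average message size, and consumer group count, and compare the results against the published quotas.

# Placeholder workload numbers; replace with your own measurements.
message_rate_per_second = 150_000      # messages per second
average_message_size_kb = 30           # KB
consumer_groups = 2

ingress_gbps = message_rate_per_second * average_message_size_kb / 1_000_000   # GBps
egress_gbps = ingress_gbps * consumer_groups

print(f"Ingress: {ingress_gbps:.1f} GBps, egress: {egress_gbps:.1f} GBps")
# Ingress: 4.5 GBps, egress: 9.0 GBps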

Partition-level quotas

Apache Kafka organizes data in structures called topics. Each topic consists of one or many partitions. Partitions are the unit of parallelism in Apache Kafka, and data is distributed across brokers using partitioning. Let's examine a few important partition-level considerations and how they help you determine which cluster type works better for your application (a short sizing sketch follows the list):

  • Maximum throughput per partition – MSK Serverless automatically balances the partitions of your topic between the backend nodes. It instantly scales when your ingress throughput increases. However, each partition has a quota of how much data it accepts. This is to ensure the data is distributed evenly across all partitions and backend nodes. In an MSK Serverless cluster, you need to create your topic with enough partitions such that the aggregated throughput is equal to the maximum throughput your application requires. You also need to make sure your consumers read data with a rate that is below the maximum egress throughput per partition quota. If you’re using Amazon MSK provisioned, there is no partition-level quota for write and read operations. However, AWS recommends you monitor and detect hot partitions and control how partitions should balance among the broker nodes.
  • Data storage – The amount of time each message is kept in a particular topic directly influences the total amount of storage needed for your cluster. Amazon MSK allows you to manage the retention period at the topic level. MSK provisioned clusters allow broker-level configuration to set the default data retention period. MSK Serverless clusters allow unlimited data retention, but there is a separate quota for the maximum data that can be stored in each partition.
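
The following sketch estimates a minimum partition count for an MSK Serverless topic; the per-partition ingress quota below is a placeholder, so check the Amazon MSK quota page for the current value.

import math

# Placeholder values; replace with your own throughput and the current per-partition quota.
required_ingress_mbps = 15.0            # total MBps your producers write to the topic
max_ingress_per_partition_mbps = 5.0    # assumed per-partition write quota (placeholder)

partitions = math.ceil(required_ingress_mbps / max_ingress_per_partition_mbps)
print(f"Create the topic with at least {partitions} partitions")   # 3 with these numbers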

Security

Amazon MSK recommends that you secure your data in the following ways. Availability of the security features varies depending on the cluster type. Before making a decision about your cluster type, check if your preferred security options are supported by your choice of cluster type.

  • Encryption at rest – Amazon MSK integrates with AWS Key Management Service (AWS KMS) to offer transparent server-side encryption. Amazon MSK always encrypts your data at rest. When you create an MSK cluster, you can specify the KMS key that you want Amazon MSK to use to encrypt your data at rest.
  • Encryption in transit – Amazon MSK uses TLS 1.2. By default, it encrypts data in transit between the brokers of your MSK cluster. You can override this default when you create the cluster. For communication between clients and brokers, you must specify one of the following settings:
    • Only allow TLS encrypted data. This is the default setting.
    • Allow both plaintext and TLS encrypted data.
    • Only allow plaintext data.
  • Authentication and authorization – Use IAM to authenticate clients and allow or deny Apache Kafka actions. Alternatively, you can use TLS or SASL/SCRAM to authenticate clients, and Apache Kafka ACLs to allow or deny actions.

Cost of ownership

Amazon MSK helps you avoid spending countless hours and significant resources just managing your Apache Kafka cluster, adding little or no value to your business. With a few clicks on the Amazon MSK console, you can create highly available Apache Kafka clusters with settings and configuration based on Apache Kafka’s deployment best practices. Amazon MSK automatically provisions and runs Apache Kafka clusters. Amazon MSK continuously monitors cluster health and automatically replaces unhealthy nodes with no application downtime. In addition, Amazon MSK secures Apache Kafka clusters by encrypting data at rest and in transit. These capabilities can significantly reduce your Total Cost of Ownership (TCO).

With MSK provisioned clusters, you can specify and then scale cluster capacity to meet your needs. With MSK Serverless clusters, you don’t need to specify or scale cluster capacity. MSK Serverless automatically scales the cluster capacity based on the throughput, and you only pay per GB of data that your producers write to and your consumers read from the topics in your cluster. Additionally, you pay an hourly rate for your serverless clusters and an hourly rate for each partition that you create. The MSK Serverless cluster type generally offers a lower cost of ownership by taking away the cost of engineering resources needed for monitoring, capacity planning, and scaling MSK clusters. However, if your organization has a DataOps team with Kafka competency, you can use this competency to operate optimized MSK provisioned clusters. This allows you to save on Amazon MSK costs by consolidating several Kafka applications into a single cluster. There are a few critical considerations to decide when and how to split your workloads between multiple MSK clusters.

Apache ZooKeeper

Apache ZooKeeper is a service included in Amazon MSK when you create a cluster. It manages the Apache Kafka metadata and acts as a quorum controller for leader elections. Although interacting with ZooKeeper is not a recommended pattern, some Kafka applications have a dependency to connect directly to ZooKeeper. During the migration to Amazon MSK, you may find a few of these applications in your organization. This could be because they use an older version of the Kafka client library or other reasons. For example, applications that help with Apache Kafka admin operations or visibility such as Cruise Control usually need this kind of access.

Before you choose your cluster type, you first need to check which offering provides direct access to the ZooKeeper cluster. As of writing this post, only Amazon MSK provisioned provides direct access to ZooKeeper.

How AnyCompany chooses their cluster types

AnyCompany first needs to collect some important requirements about each of their applications. The following table shows these requirements. The rows marked with an asterisk (*) are calculated based on the values in previous rows.

Dimension | Ecommerce Platform | Gaming Application
Message rate per second | 150,000 | 1,000
Maximum message size | 15 MB | 1 MB
Average message size | 30 KB | 15 KB
* Ingress throughput (average message size * message rate per second) | 4.5 GBps | 15 MBps
Number of consumer groups | 2 | 1
* Egress throughput (ingress throughput * number of consumer groups) | 9 GBps | 15 MBps
Number of topics | 100 | 10
Average partitions per topic | 100 | 5
* Total number of partitions (number of topics * average partitions per topic) | 10,000 | 50
* Ingress per partition (ingress throughput / total number of partitions) | 450 KBps | 300 KBps
* Egress per partition (egress throughput / total number of partitions) | 900 KBps | 300 KBps
Data retention | 72 hours | 168 hours
* Total storage needed (ingress throughput * retention period in seconds) | 1,139.06 TB | 1.3 TB
Authentication | Plaintext and SASL/SCRAM | IAM
Need ZooKeeper access | Yes | No

For the gaming application, AnyCompany doesn’t want to use their in-house Kafka competency to support an MSK provisioned cluster. Also, the gaming application doesn’t need custom configuration, and its throughput needs are below the quotas set by the MSK Serverless cluster type. In this scenario, an MSK Serverless cluster makes more sense.

For the ecommerce platform, AnyCompany wants to use their Kafka competency. Moreover, their throughput needs exceed the MSK Serverless quotas, and the application requires some broker-level custom configuration. The ecommerce platform also can't be split across multiple clusters. For these reasons, AnyCompany chooses the MSK provisioned cluster type in this scenario. Additionally, AnyCompany can save more on cost with the Amazon MSK provisioned pricing model: their throughput is consistent most of the time, and AnyCompany wants to use their DataOps team to optimize a provisioned MSK cluster and make scaling decisions based on their own expertise.

Conclusion

Choosing the best cluster type for your applications may seem complicated at first. In this post, I showed a process that helps you work backward from your application’s requirement and the resources available to you. MSK provisioned clusters offer more flexibility in how you scale, configure, and optimize your cluster. MSK Serverless, on the other hand, is a cluster type that makes it easier for you to run Apache Kafka clusters without having to manage compute and storage capacity. I generally recommend you begin with MSK Serverless if your application doesn’t require broker-level custom configurations, and your application throughput needs don’t exceed the quotas for the MSK Serverless cluster type. Sometimes it’s best to split your workloads between multiple MSK Serverless clusters, but if that isn’t possible, you may need to consider an MSK provisioned cluster. To operate an optimized MSK provisioned cluster, you need to have Kafka competency within your organization.

For further reading on Amazon MSK, visit the official product page.


About the author

Ali Alemi is a Streaming Specialist Solutions Architect at AWS. Ali advises AWS customers with architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. He works backward from customers’ use cases and designs data solutions to solve their business problems. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the cloud.

Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena

Post Syndicated from Houssem Chihoub original https://aws.amazon.com/blogs/big-data/build-a-serverless-transactional-data-lake-with-apache-iceberg-amazon-emr-serverless-and-amazon-athena/

Since the deluge of big data over a decade ago, many organizations have learned to build applications to process and analyze petabytes of data. Data lakes have served as a central repository to store structured and unstructured data at any scale and in various formats. However, as data processing solutions grow in scale, organizations need to build more and more features on top of their data lakes. One important feature is the ability to run different workloads, such as business intelligence (BI), machine learning (ML), data science and data exploration, and change data capture (CDC) of transactional data, without having to maintain multiple copies of data. Additionally, the task of maintaining and managing files in the data lake can be tedious and sometimes complex.

Table formats like Apache Iceberg provide solutions to these issues. They enable transactions on top of data lakes and can simplify data storage, management, ingestion, and processing. These transactional data lakes combine features from both the data lake and the data warehouse. You can simplify your data strategy by running multiple workloads and applications on the same data in the same location. However, using these formats requires building, maintaining, and scaling infrastructure and integration connectors that can be time-consuming, challenging, and costly.

In this post, we show how you can build a serverless transactional data lake with Apache Iceberg on Amazon Simple Storage Service (Amazon S3) using Amazon EMR Serverless and Amazon Athena. We provide an example for data ingestion and querying using an ecommerce sales data lake.

Apache Iceberg overview

Iceberg is an open-source table format that brings the power of SQL tables to big data files. It enables ACID transactions on tables, allowing for concurrent data ingestion, updates, and queries, all while using familiar SQL. Iceberg employs internal metadata management that keeps track of data and empowers a set of rich features at scale. It allows you to time travel and roll back to old versions of committed data transactions, control the table’s schema evolution, easily compact data, and employ hidden partitioning for fast queries.
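
As a brief illustration of how this looks in practice (the walkthrough later in this post uses a PySpark script for the same purpose), creating an Iceberg table from Spark SQL follows the pattern below; the dev catalog, datalake database, and columns are placeholders based on the TPC-DS schema.

-- Sketch only: catalog, database, and column names are placeholders
CREATE TABLE dev.datalake.web_sales (
    ws_order_number BIGINT,
    ws_item_sk      BIGINT,
    ws_sold_date_sk BIGINT,
    ws_net_profit   DOUBLE
)
USING iceberg
PARTITIONED BY (ws_sold_date_sk);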

Iceberg manages files on behalf of the user and unlocks use cases such as:

  • Concurrent data ingestion and querying, including streaming and CDC
  • BI and reporting with expressive simple SQL
  • Empowering ML feature stores and training sets
  • Compliance and regulations workloads, such as GDPR find and forget
  • Reinstating late-arriving data, which is dimension data arriving later than the fact data. For example, the reason for a flight delay may arrive well after the fact that the flight is delayed.
  • Tracking data changes and rollback

Build your transactional data lake on AWS

You can build your modern data architecture with a scalable data lake that integrates seamlessly with an Amazon Redshift powered cloud warehouse. Moreover, many customers are looking for an architecture where they can combine the benefits of a data lake and a data warehouse in the same storage location. In the following figure, we show a comprehensive architecture that uses the modern data architecture strategy on AWS to build a fully featured transactional data lake. AWS provides flexibility and a wide breadth of features to ingest data, build AI and ML applications, and run analytics workloads without having to focus on the undifferentiated heavy lifting.

Data can be organized into three different zones, as shown in the following figure. The first zone is the raw zone, where data can be captured from the source as is. The transformed zone is an enterprise-wide zone to host cleaned and transformed data in order to serve multiple teams and use cases. Iceberg provides a table format on top of Amazon S3 in this zone to provide ACID transactions, but also to allow seamless file management and provide time travel and rollback capabilities. The business zone stores data specific to business cases and applications aggregated and computed from data in the transformed zone.

One important aspect to a successful data strategy for any organization is data governance. On AWS, you can implement a thorough governance strategy with fine-grained access control to the data lake with AWS Lake Formation.

Serverless architecture overview

In this section, we show you how to ingest and query data in your transactional data lake in a few steps. EMR Serverless is a serverless option that makes it easy for data analysts and engineers to run Spark-based analytics without configuring, managing, and scaling clusters or servers. You can run your Spark applications without having to plan capacity or provision infrastructure, while paying only for your usage. EMR Serverless supports Iceberg natively to create tables and query, merge, and insert data with Spark. In the following architecture diagram, Spark transformation jobs can load data from the raw zone or source, apply the cleaning and transformation logic, and ingest data in the transformed zone on Iceberg tables. Spark code can run instantaneously on an EMR Serverless application, which we demonstrate later in this post.

The Iceberg table is synced with the AWS Glue Data Catalog. The Data Catalog provides a central location to govern and keep track of the schema and metadata. With Iceberg, ingestion, update, and querying processes can benefit from atomicity, snapshot isolation, and managing concurrency to keep a consistent view of data.

Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. To serve BI and reporting analysis, it allows you to build and run queries on Iceberg tables natively and integrates with a variety of BI tools.

Sales data model

Star schema and its variants are very popular for modeling data in data warehouses. They implement one or more fact tables and dimension tables. The fact table stores the main transactional data from the business logic with foreign keys to dimensional tables. Dimension tables hold additional complementary data to enrich the fact table.

In this post, we take the example of sales data from the TPC-DS benchmark. We zoom in on a subset of the schema with the web_sales fact table, as shown in the following figure. It stores numeric values about sales cost, ship cost, tax, and net profit. Additionally, it has foreign keys to dimensional tables like date_dim, time_dim, customer, and item. These dimensional tables store records that give more details. For instance, you can show when a sale took place by which customer for which item.

Dimension-based models have been used extensively to build data warehouses. In the following sections, we show how to implement such a model on top of Iceberg, providing data warehousing features on top of your data lake, and run different workloads in the same location. We provide a complete example of building a serverless architecture with data ingestion using EMR Serverless and Athena using TPC-DS queries.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Basic knowledge about data management and SQL

Deploy solution resources with AWS CloudFormation

We provide an AWS CloudFormation template to deploy the data lake stack with the following resources:

  • Two S3 buckets: one for scripts and query results, and one for the data lake storage
  • An Athena workgroup
  • An EMR Serverless application
  • An AWS Glue database and tables on external public S3 buckets of TPC-DS data
  • An AWS Glue database for the data lake
  • An AWS Identity and Access Management (IAM) role and policies

Complete the following steps to create your resources:

  1. Launch the CloudFormation stack:

Launch Button

This automatically launches AWS CloudFormation in your AWS account with the CloudFormation template. It prompts you to sign in as needed.

  1. Keep the template settings as is.
  2. Check the I acknowledge that AWS CloudFormation might create IAM resources box.
  3. Choose Submit.

When the stack creation is complete, check the Outputs tab of the stack to verify the resources created.

Upload Spark scripts to Amazon S3

Complete the following steps to upload your Spark scripts:

  1. Download the following scripts: ingest-iceberg.py and update-item.py.
  2. On the Amazon S3 console, go to the datalake-resources-<AccountID>-us-east-1 bucket you created earlier.
  3. Create a new folder named scripts.
  4. Upload the two PySpark scripts: ingest-iceberg.py and update-item.py.

Create Iceberg tables and ingest TPC-DS data

To create your Iceberg tables and ingest the data, complete the following steps:

  1. On the Amazon EMR console, choose EMR Serverless in the navigation pane.
  2. Choose Manage applications.
  3. Choose the application datalake-app.

  1. Choose Start application.

Once started, it provisions the pre-initialized capacity as configured at creation (one Spark driver and two Spark executors). Pre-initialized capacity consists of resources that are provisioned when you start your application, so they can be used instantly when you submit jobs. However, they incur charges even if they're not used while the application is in a started state. By default, the application is set to stop when idle for 15 minutes.

Now that the EMR application has started, we can submit the Spark ingest job ingest-iceberg.py. The job creates the Iceberg tables and then loads data from the previously created AWS Glue Data Catalog tables on TPC-DS data in an external bucket.

  1. Navigate to the datalake-app.
  2. On the Job runs tab, choose Submit job.

  1. For Name, enter ingest-data.
  2. For Runtime role, choose the IAM role created by the CloudFormation stack.
  3. For Script location, enter the S3 path of the script in your resources bucket (s3://datalake-resources-<AccountID>-us-east-1/scripts/ingest-iceberg.py).

  1. Under Spark properties, choose Edit in text.
  2. Enter the following properties, replacing <BUCKET_NAME> with your data lake bucket name datalake-<####>-us-east-1 (not datalake-resources):
--conf spark.executor.cores=2 --conf spark.executor.memory=4g --conf spark.driver.cores=2 --conf spark.driver.memory=8g --conf spark.executor.instances=2 --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev.warehouse=s3://<BUCKET_NAME>/warehouse --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=myIcebergLockTab --conf spark.dynamicAllocation.maxExecutors=8 --conf spark.driver.maxResultSize=1G --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

  1. Submit the job.

You can monitor the job progress.

Query Iceberg tables

In this section, we provide examples of data warehouse queries from TPC-DS on the Iceberg tables.

  1. On the Athena console, open the query editor.
  2. For Workgroup, switch to DatalakeWorkgroup.

  1. Choose Acknowledge.

The queries in DatalakeWorkgroup will run on Athena engine version 3.

  1. On the Saved queries tab, choose a query to run on your Iceberg tables.

The following queries are listed:

  • Query3 – Report the total extended sales price per item brand of a specific manufacturer for all sales in a specific month of the year.
  • Query45 – Report the total web sales for customers in specific zip codes, cities, counties, or states, or specific items for a given year and quarter.
  • Query52 – Report the total of extended sales price for all items of a specific brand in a specific year and month.
  • Query6 – List all the states with at least 10 customers who during a given month bought items with the price tag at least 20% higher than the average price of items in the same category.
  • Query75 – For 2 consecutive years, track the sales of items by brand, class, and category.
  • Query86a – Roll up the web sales for a given year by category and class, and rank the sales among peers within the parent. For each group, compute the sum of sales and location with the hierarchy and rank within the group.

These queries are examples of queries used in decision-making and reporting in an organization. You can run them in the order you want. For this post, we start with Query3.

  1. Before you run the query, confirm that Database is set to datalake.

  1. Now you can run the query.

  1. Repeat these steps to run the other queries.

Update the item table

After running the queries, we prepare a batch of updates and inserts of records into the item table.

  1. First, run the following query to count the number of records in the item Iceberg table:
SELECT count(*) FROM "datalake"."item_iceberg";

This should return 102,000 records.

  1. Select item records with a price higher than $90:
SELECT count(*) FROM "datalake"."item_iceberg" WHERE i_current_price > 90.0;

This will return 1,112 records.

The update-item.py job takes these 1,112 records, modifies 11 records to change the name of the brand to Unknown, and changes the remaining 1,101 records’ i_item_id key to flag them as new records. As a result, a batch of 11 updates and 1,101 inserts are merged into the item_iceberg table.

The 11 records to be updated are those with price higher than $90, and the brand name starts with corpnameless.
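
This kind of upsert maps naturally to Iceberg's MERGE INTO support in Spark SQL. The following is a sketch of the type of statement update-item.py could issue; the staged source name and the matched columns are assumptions based on the TPC-DS item schema.

-- Sketch: item_updates is a placeholder for the staged batch of changed records
MERGE INTO dev.datalake.item_iceberg AS t
USING item_updates AS s
ON t.i_item_sk = s.i_item_sk
WHEN MATCHED THEN UPDATE SET t.i_brand = s.i_brand
WHEN NOT MATCHED THEN INSERT *;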

  1. Run the following query:
SELECT count(*) FROM "datalake"."item_iceberg" WHERE i_current_price > 90.0 AND i_brand LIKE 'corpnameless%';

The result is 11 records. The update-item.py job replaces the brand name with Unknown and merges the batch into the Iceberg table.

Now you can return to the EMR Serverless console and run the job on the EMR Serverless application.

  1. On the application details page, choose Submit job.
  2. For Name, enter update-item-job.
  3. For Runtime role, use the same role that you used previously.
  4. For S3 URI, enter the update-item.py script location.

  1. Under Spark properties, choose Edit in text.
  2. Enter the following properties, replacing the <BUCKET-NAME> with your own datalake-<####>-us-east-1:
--conf spark.executor.cores=2 --conf spark.executor.memory=8g --conf spark.driver.cores=4 --conf spark.driver.memory=8g --conf spark.executor.instances=2 --conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.dev=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.dev.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=myIcebergLockTab --conf spark.dynamicAllocation.maxExecutors=4 --conf spark.driver.maxResultSize=1G --conf spark.sql.catalog.dev.warehouse=s3://<BUCKET-NAME>/warehouse --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

  1. Then submit the job.

  1. After the job finishes successfully, return to the Athena console and run the following query:
SELECT count(*) FROM "datalake"."item_iceberg";

The returned result is 103,101 = 102,000 + (1,112 – 11). The batch was merged successfully.

Time travel

To run a time travel query, complete the following steps:

  1. Get the timestamp of the job run via the application details page on the EMR Serverless console, or the Spark UI on the History Server, as shown in the following screenshot.

This time could be just minutes before you ran the update Spark job.

  1. Convert the timestamp from the format YYYY/MM/DD hh:mm:ss to YYYY-MM-DDThh:mm:ss.sTZD with time zone. For example, from 2023/02/20 14:40:41 to 2023-02-20 14:40:41.000 UTC.
  2. On the Athena console, run the following query to count the item table records at a time before the update job, replacing <TRAVEL_TIME> with your time:
SELECT count(*) FROM "datalake"."item_iceberg" FOR TIMESTAMP AS OF TIMESTAMP <TRAVEL_TIME>;

The query will give 102,000 as a result, the expected table size before running the update job.

  1. Now you can run a query with a timestamp after the successful run of the update job (for example, 2023-02-20 15:06:00.000 UTC):
SELECT count(*) FROM "datalake"."item_iceberg" FOR TIMESTAMP AS OF TIMESTAMP <TRAVEL_TIME>;

The query will now give 103,101 as the size of the table at that time, after the update job successfully finished.

Additionally, you can query in Athena based on the version ID of a snapshot in Iceberg. However, for more advanced use cases, such as to roll back to a given version or to find version IDs, you can use Iceberg’s SDK or Spark on Amazon EMR.
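
For reference, a version-based query in Athena follows the pattern below; replace the placeholder with a snapshot ID retrieved from the table's Iceberg metadata.

SELECT count(*) FROM "datalake"."item_iceberg" FOR VERSION AS OF <SNAPSHOT_ID>;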

Clean up

Complete the following steps to clean up your resources:

  1. On the Amazon S3 console, empty your buckets.
  2. On the Athena console, delete the workgroup DatalakeWorkgroup.
  3. On the EMR Studio console, stop the application datalake-app.
  4. On the AWS CloudFormation console, delete the CloudFormation stack.

Conclusion

In this post, we created a serverless transactional data lake with Iceberg tables, EMR Serverless, and Athena. We used TPC-DS sales data with 10 GB data and more than 7 million records in the fact table. We demonstrated how straightforward it is to rely on SQL and Spark to run serverless jobs for data ingestion and upserts. Moreover, we showed how to run complex BI queries directly on Iceberg tables from Athena for reporting.

You can start building your serverless transactional data lake on AWS today, and dive deep into the features and optimizations Iceberg provides to build analytics applications more easily. Iceberg can also help you in the future to improve performance and reduce costs.


About the Author

Houssem is a Specialist Solutions Architect at AWS with a focus on analytics. He is passionate about data and emerging technologies in analytics. He holds a PhD on data management in the cloud. Prior to joining AWS, he worked on several big data projects and published several research papers in international conferences and venues.

Enhance your analytics embedding experience with the new Amazon QuickSight JavaScript SDK

Post Syndicated from Raj Jayaraman original https://aws.amazon.com/blogs/big-data/enhance-your-analytics-embedding-experience-with-the-new-amazon-quicksight-javascript-sdk/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards and reports, and share these with tens of thousands of users, either within QuickSight or embedded in your application or website.

QuickSight recently launched a new major version of its Embedding SDK (v2.0) to improve developer experience when embedding QuickSight in your application or website. The QuickSight SDK v2.0 adds several customization improvements such as an optional preloader and new external hooks for managing undo, redo, print options, and parameters. Additionally, there are major rewrites to deliver developer-focused improvements, including static type checking, enhanced runtime validation, strong consistency in call patterns, and optimized event chaining.

The new SDK supports improved code completion when integrated with IDEs through its adoption of TypeScript and the newly introduced frameOptions and contentOptions, which segment embedding options into parameters unified for all embedding experiences and parameters unique for each embedding experience, respectively. Additionally, SDK v2.0 offers increased visibility by providing new experience-specific information and warnings within the SDK. This increases transparency, and developers can monitor and handle new content states.

The QuickSight SDK v2.0 is modernized by using promises for all actions, so developers can use async and await functions for better event management. Actions are further standardized to return a response for both data requesting and non-data requesting actions, so developers have full visibility to the end-to-end application handshake.

In addition to the new SDK, we are also introducing state persistence for user-based dashboard and console embedding. The GenerateEmbedUrlForRegisteredUser API is updated to support this feature and improves end-user experience and interactivity on embedded content.

SDK Feature overview

The QuickSight SDK v2.0 offers new functionalities along with elevating developers’ experience. The following functionalities have been added in this version:

  • Dashboard undo, redo, and reset actions can now be invoked from the application
  • A loading animation can be added to the dashboard container while the contents of the dashboard are loaded
  • Frame creation, mounting, and failure are communicated as change events that can be used by the application
  • Actions getParameter() values and setParameter() values are unified, eliminating additional data transformations

Using the new SDK

The embed URL obtained using the GenerateEmbedUrlForRegisteredUser or GenerateEmbedUrlForAnonymousUser APIs can be consumed in the application using the embedDashboard experience in SDK v2.0. This method takes two parameters:

  • frameOptions – This is a required parameter, and its properties determine the container options to embed a dashboard:
    • url – The embed URL generated using GenerateEmbedUrlForRegisteredUser or GenerateEmbedUrlForAnonymousUser APIs
    • container – The parent HTMLElement to embed the dashboard
  • contentOptions – This is an optional parameter that controls the dashboard locale and captures events from the SDK.

The following sample code uses the preceding parameters to embed a dashboard:

<html>
    <head>
        <!-- ... -->
        <script src="https://unpkg.com/amazon-quicksight-embedding-sdk@<VERSION>/dist/quicksight-embedding-js-sdk.min.js"></script>
        <!-- ... -->
        <script>
            (async () => {
                const {
                    createEmbeddingContext,
                } = window.QuickSightEmbedding;
                
                const embeddingContext = await createEmbeddingContext();
                
                const frameOptions = {
                    url: '<YOUR_EMBED_URL>',
                    container: '#your-embed-container'
                };
                
                const contentOptions = {
                    toolbarOptions: {
                        reset: true,
                        undoRedo: true,
                    }
                };
                
                const embeddedDashboard = await embeddingContext.embedDashboard(frameOptions, contentOptions);
            })();
        </script>
    </head>
    <body>
        <div id="your-embed-container"></div>
    </body>
</html>

Render a loading animation while the dashboard loads

SDK v2.0 provides an option to render a loading animation in the iFrame container while the dashboard loads. This improves the user experience by signaling that resource loading is in progress and where the content will appear, and it reduces perceived latency.

You can enable a loading animation by using the withIframePlaceholder option in the frameOptions parameter:

const frameOptions = {
    url: '<YOUR_EMBED_URL>',
    container: '#your-embed-container',
    withIframePlaceholder: true
};

This option is supported by all embedding experiences.

Monitor changes in SDK code status

SDK v2.0 supports a new callback onChange, which returns eventNames along with corresponding eventCodes to indicate errors, warnings, or information from the SDK.

You can use the events returned by the callback to monitor frame creation status and code status returned by the SDK. For example, if the SDK returns an error when an invalid embed URL is used, you can use a placeholder text or image in place of the embedded experience to notify the user.

The following eventNames and eventCodes are returned as part of the onChange callback when there is a change in the SDK code status.

  • ERROR
    • FRAME_NOT_CREATED: Invoked when the creation of the iframe element failed
    • NO_BODY: Invoked when there is no body element in the hosting HTML
    • NO_CONTAINER: Invoked when the experience container is not found
    • NO_URL: Invoked when no URL is provided in the frameOptions
    • INVALID_URL: Invoked when the URL provided is not a valid URL for the experience
  • INFO
    • FRAME_STARTED: Invoked just before the iframe is created
    • FRAME_MOUNTED: Invoked after the iframe is appended into the experience container
    • FRAME_LOADED: Invoked after the iframe element emitted the load event
  • WARN
    • UNRECOGNIZED_CONTENT_OPTIONS: Invoked when the content options for the experience contain unrecognized properties
    • UNRECOGNIZED_EVENT_TARGET: Invoked when a message with an unrecognized event target is received

See the following code:

const frameOptions = {
    url: '<YOUR_EMBED_URL>',
    container: '#your-embed-container',
    withIframePlaceholder: true,
    onChange: (changeEvent, metadata) => {
        switch (changeEvent.eventName) {
            case 'ERROR': {
                // Show a fallback message if the embedded dashboard cannot be loaded
                document.getElementById("your-embed-container").append('Unable to load Dashboard at this time.');
                break;
            }
        }
    }
};

Monitor interactions in embedded dashboards

Another callback supported by SDK v2.0 is onMessage, which returns information about specific events within an embedded experience. The eventName returned depends on the type of embedding experience used and allows application developers to invoke custom code for specific events.

For example, you can monitor if an embedded dashboard is fully loaded or invoke a custom function that logs the parameter values end-users set or change within the dashboard. Your application can now work seamlessly with SDK v2.0 to track and react to interactions within an embedded experience.

The eventNames returned are specific to the embedding experience used. The following eventNames are for the dashboard embedding experience. For additional eventNames, visit the GitHub repo.

  • CONTENT_LOADED
  • ERROR_OCCURRED
  • PARAMETERS_CHANGED
  • SELECTED_SHEET_CHANGED
  • SIZE_CHANGED
  • MODAL_OPENED

See the following code:

const contentOptions = {
    onMessage: async (messageEvent, experienceMetadata) => {
        switch (messageEvent.eventName) {
            case 'PARAMETERS_CHANGED': {
                // Custom code
                break;
            }
            // ...
        }
    }
};

Initiate dashboard print from the application

The new SDK version supports initiating undo, redo, reset, and print from the parent application, without having to add the native embedded QuickSight navbar. This gives developers the flexibility to add custom buttons or application logic to control and invoke these options.

For example, you can add a standalone button in your application that allows end-users to print an embedded dashboard, without showing a print icon or navbar within the embedded frame. This can be done using the initiatePrint action:

embeddedDashboard.initiatePrint();

The following code sample shows a loading animation, SDK code status, and dashboard interaction monitoring, along with initiating dashboard print from the application:

<!DOCTYPE html>
<html lang="en">
  <head>
    <script src=" https://unpkg.com/[email protected]/dist/quicksight-embedding-js-sdk.min.js "></script>
    <title>Embedding demo</title>

    <script>
      document.addEventListener("DOMContentLoaded", function() {

        var embeddedDashboard;

        document.getElementById("print_button").onclick = function printDashboard() {
            embeddedDashboard.initiatePrint();
        }

        function embedDashboard(embedUrl) {
          const {
            createEmbeddingContext
          } = window.QuickSightEmbedding;
          (async () => {
            const embeddingContext = await createEmbeddingContext();
            const messageHandler = (messageEvent) => {
              switch (messageEvent.eventName) {
                case 'CONTENT_LOADED': {
                  document.getElementById("print_button").style.display="block";
                  break;
                }
                case 'ERROR_OCCURRED': {
                  console.log('Error occurred', messageEvent.message);
                  break;
                }
                case 'PARAMETERS_CHANGED': {
                  // Custom code..
                  break;
                }
              }
            }
            const frameOptions = {
              url: '<YOUR_EMBED_URL>',
              container: document.getElementById("dashboardContainer"),
              width: "100%",
              height: "AutoFit",
              loadingHeight: "200px",
              withIframePlaceholder: true,
              onChange: (changeEvent, metadata) => {
                switch (changeEvent.eventName) {
                  case 'ERROR': {
                    document.getElementById("dashboardContainer").append('Unable to load Dashboard at this time.');
                    break;
                  }
                }
              }
            }
            const contentOptions = {
              locale: "en-US",
              onMessage: messageHandler
            }
            embeddedDashboard = await embeddingContext.embedDashboard(frameOptions, contentOptions);
          })();
        }
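        // For example, call embedDashboard with the embed URL generated by your server:
        // embedDashboard('<YOUR_EMBED_URL>');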
      });
    </script>
  </head>
  <body>
    <div>
       <button type="button" id="print_button" style="display:none;">Print</button> 
    </div>
    <div id="dashboardContainer"></div>
  </body>
</html>

State persistence

In addition to the new SDK, QuickSight now supports state persistence for dashboard and console embedding. State persistence means that when readers slice and dice embedded dashboards with filters, QuickSight persists their filter selections until they return to the dashboard. Readers can pick up where they left off and don't have to re-select filters.

State persistence is currently supported only for the user-based (not anonymous) dashboard and console embedding experience.

You can enable state persistence using the FeatureConfigurations parameter in the GenerateEmbedUrlForRegisteredUser API. FeatureConfigurations contains the StatePersistence structure, which can be customized by setting Enabled to true or false.

The API structure is below:

generate-embed-url-for-registered-user
	aws-account-id <value>
	[session-lifetime-in-minutes <value>]
	user-arn <value>
	[cli-input-json | cli-input-yaml]
	[allowed-domains <value>]
	[generate-cli-skeleton <value>]
	experience-configuration <value>
		Dashboard
			InitialDashboardId <value>
			[FeatureConfigurations]
				[StatePersistence]
					Enabled <value>
		QuickSightConsole
			InitialPath <value>
			[FeatureConfigurations]
				[StatePersistence]
					Enabled <value>

The following code disables state persistence for QuickSight console embedding:

aws quicksight generate-embed-url-for-registered-user \
--aws-account-id <AWS_Account_ID> \
--user-arn arn:aws:quicksight:us-east-1:<AWS_Account_ID>:user/<Namespace>/<QuickSight_User_Name> \
--experience-configuration '{"QuickSightConsole": {
"InitialPath": "/start/analyses",
"FeatureConfigurations": {"StatePersistence": {"Enabled": false}}}}' \
--region <Region>

The following code enables state persistence for QuickSight dashboard embedding:

aws quicksight generate-embed-url-for-registered-user \
--aws-account-id <AWS_Account_ID> \
--user-arn arn:aws:quicksight:us-east-1:<AWS_Account_ID>:user/<Namespace>/<QuickSight_User_Name> \
--experience-configuration '{"Dashboard": {
"InitialDashboardId": "<Dashboard_ID>",
"FeatureConfigurations": {"StatePersistence": {"Enabled": true}}}}' \
--region <Region>

Considerations

Note the following when using these features:

  • For dashboard embedding, state persistence is disabled by default. To enable this feature, set the Enabled parameter in StatePersistence to true.
  • For console embedding, state persistence is enabled by default. To disable this feature, set the Enabled parameter in StatePersistence to false.

Conclusion

With the latest iteration of the QuickSight Embedding SDK, you can indicate when an embedded experience is loading, monitor and respond to errors from the SDK, observe changes and interactivity, and invoke undo, redo, reset, and print actions from application code.

Additionally, you can enable state persistence to persist filter selection for readers and allow them to pick up where they left off when revisiting an embedded dashboard.

For more detailed information about the SDK and experience-specific options, visit the GitHub repo.


About the authors

Raj Jayaraman is a Senior Specialist Solutions Architect for Amazon QuickSight. Raj focuses on helping customers develop sample dashboards, embed analytics and adopt BI design patterns and best practices.

Mayank Agarwal is a product manager for Amazon QuickSight, AWS' cloud-native, fully managed BI service. He focuses on account administration, governance, and developer experience. He started his career as an embedded software engineer developing handheld devices. Prior to QuickSight, he led engineering teams at Credence ID, developing custom mobile embedded device and web solutions using AWS services that make biometric enrollment and identification fast, intuitive, and cost-effective for government, healthcare, and transaction security applications.

Rohit Pujari is the Head of Product for Embedded Analytics at QuickSight. He is passionate about shaping the future of infusing data-rich experiences into products and applications we use every day. Rohit brings a wealth of experience in analytics and machine learning from having worked with leading data companies, and their customers. During his free time, you can find him lining up at the local ice cream shop for his second scoop.

Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift

Post Syndicated from Vaidy Kalpathy original https://aws.amazon.com/blogs/big-data/simplify-data-loading-into-type-2-slowly-changing-dimensions-in-amazon-redshift/

Thousands of customers rely on Amazon Redshift to build data warehouses that accelerate time to insights with fast, simple, and secure analytics at scale, analyzing terabytes to petabytes of data with complex analytical queries. Organizations create data marts, which are subsets of the data warehouse and are usually oriented toward gaining analytical insights specific to a business unit or team. The star schema is a popular data model for building data marts.

In this post, we show how to simplify data loading into a Type 2 slowly changing dimension in Amazon Redshift.

Star schema and slowly changing dimension overview

A star schema is the simplest type of dimensional model, in which the center of the star can have one fact table and a number of associated dimension tables. A dimension is a structure that captures reference data along with associated hierarchies, while a fact table captures different values and metrics that can be aggregated by dimensions. Dimensions provide answers to exploratory business questions by allowing end-users to slice and dice data in a variety of ways using familiar SQL commands.
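For example, slicing and dicing in a star schema typically means joining the fact table to one or more dimensions and aggregating a metric by dimension attributes. The following query is a minimal, hypothetical sketch: the fact_sales table and its sales_amount column are assumed for illustration only, and dim_customer is the customer dimension table built later in this post:

-- Hypothetical example: aggregate a fact metric by customer dimension attributes
select
  d.country,
  d.city,
  sum(f.sales_amount) as total_sales
from
  fact_sales f
  join dim_customer d on f.customer_sk = d.customer_sk
group by
  d.country,
  d.city
order by
  total_sales desc;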

Whereas operational source systems contain only the latest version of master data, the star schema enables time travel queries to reproduce dimension attribute values on past dates when the fact transaction or event actually happened. The star schema data model allows analytical users to query historical data tying metrics to corresponding dimensional attribute values over time. Time travel is possible because dimension tables contain the exact version of the associated attributes at different time ranges. Relative to the metrics data that keeps changing on a daily or even hourly basis, the dimension attributes change less frequently. Therefore, dimensions in a star schema that keep track of changes over time are referred to as slowly changing dimensions (SCDs).

Data loading is one of the key aspects of maintaining a data warehouse. In a star schema data model, the central fact table is dependent on the surrounding dimension tables. This is captured in the form of primary key-foreign key relationships, where the dimension table primary keys are referenced by foreign keys in the fact table. In the case of Amazon Redshift, uniqueness, primary key, and foreign key constraints are not enforced. However, declaring them will help the optimizer arrive at optimal query plans, provided that the data loading processes enforce their integrity. As part of data loading, the dimension tables, including SCD tables, get loaded first, followed by the fact tables.
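As a hedged illustration (not part of the loading steps that follow), informational constraints could be declared as shown below. The dim_customer table is created later in this post without these constraints, and fact_sales is a hypothetical fact table; Amazon Redshift does not enforce the constraints, but the optimizer can use them when planning queries:

-- Hypothetical example: informational (unenforced) constraints in Amazon Redshift
alter table rs_dim_blog.dim_customer
  add constraint dim_customer_pk primary key (customer_sk);

alter table rs_dim_blog.fact_sales
  add constraint fact_sales_customer_fk foreign key (customer_sk)
  references rs_dim_blog.dim_customer (customer_sk);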

SCD population challenge

Populating an SCD dimension table involves merging data from multiple source tables, which are usually normalized. SCD tables contain a pair of date columns (effective and expiry dates) that represent the record's validity date range. Changes are inserted as new active records effective from the date of data loading, while simultaneously expiring the current active record on a previous day. During each data load, incoming change records are matched against existing active records, comparing each attribute value to determine whether existing records have changed, were deleted, or are entirely new.

In this post, we demonstrate how to simplify data loading into a dimension table with the following methods:

  • Using Amazon Simple Storage Service (Amazon S3) to host the initial and incremental data files from source system tables
  • Accessing S3 objects using Amazon Redshift Spectrum to carry out data processing to load native tables within Amazon Redshift
  • Creating views with window functions to replicate the source system version of each table within Amazon Redshift
  • Joining source table views to project attributes matching with dimension table schema
  • Applying incremental data to the dimension table, bringing it up to date with source-side changes

Solution overview

In a real-world scenario, records from source system tables are ingested on a periodic basis to an Amazon S3 location before being loaded into star schema tables in Amazon Redshift.

For this demonstration, data from two source tables, customer_master and customer_address, are combined to populate the target dimension table dim_customer, which is the customer dimension table.

The source tables customer_master and customer_address share the same primary key, customer_id, and are joined on that key to fetch one record per customer_id along with attributes from both tables. row_audit_ts contains the latest timestamp at which the particular source record was inserted or last updated. This column helps identify the change records since the last data extraction.

rec_source_status is an optional column that indicates if the corresponding source record was inserted, updated, or deleted. This is applicable in cases where the source system itself provides the changes and populates rec_source_status appropriately.

The following figure provides the schema of the source and target tables.

Let’s look closer at the schema of the target table, dim_customer. It contains different categories of columns:

  • Keys – It contains two types of keys:
    • customer_sk is the primary key of this table. It is also called the surrogate key and has a unique value that is monotonically increasing.
    • customer_id is the source primary key and provides a reference back to the source system record.
  • SCD2 metadata – rec_eff_dt and rec_exp_dt indicate the state of the record. These two columns together define the validity of the record. The value in rec_exp_dt will be set as ‘9999-12-31’ for presently active records.
  • Attributes – Includes first_name, last_name, employer_name, email_id, city, and country.

Data loading into a SCD table involves a first-time bulk data loading, referred to as the initial data load. This is followed by continuous or regular data loading, referred to as an incremental data load, to keep the records up to date with changes in the source tables.

To demonstrate the solution, we walk through the following steps for initial data load (1–7) and incremental data load (8–12):

  1. Land the source data files in an Amazon S3 location, using one subfolder per source table.
  2. Use an AWS Glue crawler to parse the data files and register tables in the AWS Glue Data Catalog.
  3. Create an external schema in Amazon Redshift to point to the AWS Glue database containing these tables.
  4. In Amazon Redshift, create one view per source table to fetch the latest version of the record for each primary key (customer_id) value.
  5. Create the dim_customer table in Amazon Redshift, which contains attributes from all relevant source tables.
  6. Create a view in Amazon Redshift joining the source table views from Step 4 to project the attributes modeled in the dimension table.
  7. Populate the initial data from the view created in Step 6 into the dim_customer table, generating customer_sk.
  8. Land the incremental data files for each source table in their respective Amazon S3 location.
  9. In Amazon Redshift, create a temporary table to accommodate the change-only records.
  10. Join the view from Step 6 with dim_customer and identify change records by comparing the combined hash value of attributes. Populate the change records into the temporary table with an I, U, or D indicator.
  11. Update rec_exp_dt in dim_customer for all U and D records from the temporary table.
  12. Insert records into dim_customer, querying all I and U records from the temporary table.

Prerequisites

Before you get started, make sure you meet the following prerequisites:

  • An AWS account with access to Amazon Redshift, Amazon S3, AWS Glue, and AWS Identity and Access Management (IAM)
  • An Amazon Redshift cluster or endpoint where you can run the SQL statements in this post

Land data from source tables

Create separate subfolders for each source table in an S3 bucket and place the initial data files within the respective subfolder. In the following image, the initial data files for customer_master and customer_address are made available within two different subfolders. To try out the solution, you can use customer_master_with_ts.csv and customer_address_with_ts.csv as initial data files.

It’s important to include an audit timestamp (row_audit_ts) column that indicates when each record was inserted or last updated. As part of incremental data loading, rows with the same primary key value (customer_id) can arrive more than once. The row_audit_ts column helps identify the latest version of such records for a given customer_id to be used for further processing.

Register source tables in the AWS Glue Data Catalog

We use an AWS Glue crawler to infer metadata from delimited data files like the CSV files used in this post. For instructions on getting started with an AWS Glue crawler, refer to Tutorial: Adding an AWS Glue crawler.

Create an AWS Glue crawler and point it to the Amazon S3 location that contains the source table subfolders, within which the associated data files are placed. When you’re creating the AWS Glue crawler, create a new database named rs-dimension-blog. The following screenshots show the AWS Glue crawler configuration chosen for our data files.

Note that for the Set output and scheduling section, the advanced options are left unchanged.

Running this crawler should create the following tables within the rs-dimension-blog database:

  • customer_address
  • customer_master

Create schemas in Amazon Redshift

First, create an AWS Identity and Access Management (IAM) role named rs-dim-blog-spectrum-role. For instructions, refer to Create an IAM role for Amazon Redshift.

The IAM role has Amazon Redshift as the trusted entity, and the permissions policy includes AmazonS3ReadOnlyAccess and AWSGlueConsoleFullAccess, because we’re using the AWS Glue Data Catalog. Then associate the IAM role with the Amazon Redshift cluster or endpoint.

Alternatively, you can set the IAM role as the default for your Amazon Redshift cluster or endpoint. If you do so, in the following create external schema command, pass the iam_role parameter as iam_role default.

Now, open Amazon Redshift Query Editor V2 and create an external schema passing the newly created IAM role and specifying the database as rs-dimension-blog. The database name rs-dimension-blog is the one created in the Data Catalog as part of configuring the crawler in the preceding section. See the following code:

create external schema spectrum_dim_blog 
from data catalog 
database 'rs-dimension-blog' 
iam_role 'arn:aws:iam::<accountid>:role/rs-dim-blog-spectrum-role';

Check if the tables registered in the Data Catalog in the preceding section are visible from within Amazon Redshift:

select * 
from spectrum_dim_blog.customer_master 
limit 10;

select * 
from spectrum_dim_blog.customer_address 
limit 10;

Each of these queries will return 10 rows from the respective Data Catalog tables.

Create another schema in Amazon Redshift to host the table, dim_customer:

create schema rs_dim_blog;

Create views to fetch the latest records from each source table

Create a view for the customer_master table, naming it vw_cust_mstr_latest:

create view rs_dim_blog.vw_cust_mstr_latest as with rows_numbered as (
  select 
    customer_id, 
    first_name, 
    last_name, 
    employer_name, 
    row_audit_ts, 
    row_number() over(
      partition by customer_id 
      order by 
        row_audit_ts desc
    ) as rnum 
  from 
    spectrum_dim_blog.customer_master
) 
select 
  customer_id, 
  first_name, 
  last_name, 
  employer_name, 
  row_audit_ts, 
  rnum 
from 
  rows_numbered 
where 
  rnum = 1 with no schema binding;

The preceding query uses row_number, which is a window function provided by Amazon Redshift. Using window functions enables you to create analytic business queries more efficiently. Window functions operate on a partition of a result set, and return a value for every row in that window. The row_number window function determines the ordinal number of the current row within a group of rows, counting from 1, based on the ORDER BY expression in the OVER clause. By including the PARTITION BY clause as customer_id, groups are created for each value of customer_id and ordinal numbers are reset for each group.

Create a view for the customer_address table, naming it vw_cust_addr_latest:

create view rs_dim_blog.vw_cust_addr_latest as with rows_numbered as (
  select 
    customer_id, 
    email_id, 
    city, 
    country, 
    row_audit_ts, 
    row_number() over(
      partition by customer_id 
      order by 
        row_audit_ts desc
    ) as rnum 
  from 
    spectrum_dim_blog.customer_address
) 
select 
  customer_id, 
  email_id, 
  city, 
  country, 
  row_audit_ts, 
  rnum 
from 
  rows_numbered 
where 
  rnum = 1 with no schema binding;

Both view definitions use the row_number window function of Amazon Redshift, ordering the records by descending order of the row_audit_ts column (the audit timestamp column). The condition rnum=1 fetches the latest record for each customer_id value.

Create the dim_customer table in Amazon Redshift

Create dim_customer as an internal table in Amazon Redshift within the rs_dim_blog schema. The dimension table includes the column customer_sk, which acts as the surrogate key and enables us to capture a time-sensitive version of each customer record. The validity period for each record is defined by the columns rec_eff_dt and rec_exp_dt, representing the record effective date and record expiry date, respectively. See the following code:

create table rs_dim_blog.dim_customer (
  customer_sk bigint, 
  customer_id bigint, 
  first_name varchar(100), 
  last_name varchar(100), 
  employer_name varchar(100), 
  email_id varchar(100), 
  city varchar(100), 
  country varchar(100), 
  rec_eff_dt date, 
  rec_exp_dt date
) diststyle auto;
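To illustrate why these validity dates matter, the following hedged example (assuming the table has been populated, and using a made-up customer_id and date) reproduces the version of a customer record that was in effect on a given day; presently active records can be fetched by filtering on rec_exp_dt = '9999-12-31':

-- Hypothetical example: fetch the customer record version effective on 2022-08-15
select
  customer_sk,
  customer_id,
  first_name,
  last_name,
  city,
  country,
  rec_eff_dt,
  rec_exp_dt
from
  rs_dim_blog.dim_customer
where
  customer_id = 101
  and cast('2022-08-15' as date) between rec_eff_dt and rec_exp_dt;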

Create a view to consolidate the latest version of source records

Create the view vw_dim_customer_src, which consolidates the latest records from both source tables using a left outer join, keeping them ready to be populated into the Amazon Redshift dimension table. This view fetches data from the latest-record views defined in the section “Create views to fetch the latest records from each source table”:

create view rs_dim_blog.vw_dim_customer_src as 
select 
  m.customer_id, 
  m.first_name, 
  m.last_name, 
  m.employer_name, 
  a.email_id, 
  a.city, 
  a.country 
from 
  rs_dim_blog.vw_cust_mstr_latest as m 
  left join rs_dim_blog.vw_cust_addr_latest as a on m.customer_id = a.customer_id 
order by 
  m.customer_id with no schema binding;

At this point, this view fetches the initial data for loading into the dim_customer table that we are about to populate. In your use case, use a similar approach to create and join the required source table views to populate your target dimension table.

Populate initial data into dim_customer

Populate the initial data into the dim_customer table by querying the view vw_dim_customer_src. Because this is the initial data load, running row numbers generated by the row_number window function will suffice to populate a unique value in the customer_sk column starting from 1:

insert into rs_dim_blog.dim_customer 
select 
  row_number() over() as customer_sk, 
  customer_id, 
  first_name, 
  last_name, 
  employer_name, 
  email_id, 
  city, 
  country, 
  cast('2022-07-01' as date) rec_eff_dt, 
  cast('9999-12-31' as date) rec_exp_dt 
from 
  rs_dim_blog.vw_dim_customer_src;

In this query, we have specified '2022-07-01' as the value in rec_eff_dt for all initial data records. For your use case, you can modify this date value as appropriate to your situation.

The preceding steps complete the initial data loading into the dim_customer table. In the next steps, we proceed with populating incremental data.

Land ongoing change data files in Amazon S3

After the initial load, the source systems provide data files on an ongoing basis, either containing only new and change records or a full extract containing all records for a particular table.

You can use the sample files customer_master_with_ts_incr.csv and customer_address_with_ts_incr.csv, which contain changed as well as new records. Place these incremental files in the same Amazon S3 locations where the initial data files were placed (see the section “Land data from source tables”). The corresponding Redshift Spectrum tables then automatically read the additional rows.

If you used the sample file for customer_master, after adding the incremental files, the following query shows the initial as well as incremental records:

select 
  customer_id, 
  first_name, 
  last_name, 
  employer_name, 
  row_audit_ts 
from 
  spectrum_dim_blog.customer_master 
order by 
  customer_id;

In the case of full extracts, we can identify deletes occurring in the source system tables by comparing the previous and current versions and looking for missing records. In the case of change-only extracts where the rec_source_status column is present, its value helps us identify deleted records. In either case, land the ongoing change data files in the respective Amazon S3 locations.
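For the full-extract case, the following is a minimal sketch (an illustration only; the incremental logic later in this post derives deletes from a full outer join) that flags active dimension records no longer present in the latest source extract as delete candidates:

-- Sketch: active customers missing from the latest full extract are delete candidates
select
  d.customer_id
from
  rs_dim_blog.dim_customer d
  left join rs_dim_blog.vw_dim_customer_src s on d.customer_id = s.customer_id
where
  d.rec_exp_dt = '9999-12-31'
  and s.customer_id is null;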

For this example, we have uploaded the incremental data for the customer_master and customer_address source tables with a few customer_id records receiving updates and a few new records being added.

Create a temporary table to capture change records

Create the temporary table temp_dim_customer to store all changes that need to be applied to the target dim_customer table:

create temp table temp_dim_customer (
  customer_sk bigint, 
  customer_id bigint, 
  first_name varchar(100), 
  last_name varchar(100), 
  employer_name varchar(100), 
  email_id varchar(100), 
  city varchar(100), 
  country varchar(100), 
  rec_eff_dt date, 
  rec_exp_dt date, 
  iud_operation character(1)
);

Populate the temporary table with new and changed records

This is a multi-step process that can be combined into a single complex SQL statement. Complete the following steps:

  1. Fetch the latest version of all customer attributes by querying the view vw_dim_customer_src:
select 
  customer_id, 
  sha2(
    coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
  ) as hash_value, 
  first_name, 
  last_name, 
  employer_name, 
  email_id, 
  city, 
  country, 
  current_date rec_eff_dt, 
  cast('9999-12-31' as date) rec_exp_dt 
from 
  rs_dim_blog.vw_dim_customer_src;

Amazon Redshift offers hashing functions such as sha2, which converts a variable length string input into a fixed length character output. The output string is a text representation of the hexadecimal value of the checksum with the specified number of bits. In this case, we pass a concatenated set of customer attributes whose change we want to track, specifying the number of bits as 512. We’ll use the output of the hash function to determine if any of the attributes have undergone a change. This dataset will be called newver (new version).
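As a quick illustration of the idea (with made-up attribute values, not from the sample dataset), changing any single attribute changes the resulting checksum:

-- Hypothetical values: the second hash differs because only email_id changed
select
  sha2('John' || 'Doe' || 'AnyCompany' || 'john@example.com' || 'Seattle' || 'USA', 512) as hash_before,
  sha2('John' || 'Doe' || 'AnyCompany' || 'john@example.org' || 'Seattle' || 'USA', 512) as hash_after;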

Because we landed the ongoing change data in the same location as the initial data files, the records retrieved from the preceding query (in newver) include all records, even the unchanged ones. But because of the definition of the view vw_dim_customer_src, we get only one record per customer_id, which is its latest version based on row_audit_ts.

  2. In a similar manner, retrieve the latest version of all customer records from dim_customer, which are identified by rec_exp_dt='9999-12-31'. While doing so, also retrieve the sha2 value of all customer attributes available in dim_customer:
select 
  customer_id, 
  sha2(
    coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
  ) as hash_value, 
  first_name, 
  last_name, 
  employer_name, 
  email_id, 
  city, 
  country 
from 
  rs_dim_blog.dim_customer 
where 
  rec_exp_dt = '9999-12-31';

This dataset will be called oldver (old or existing version).

  3. Identify the current maximum surrogate key value from the dim_customer table:
select 
  max(customer_sk) as maxval 
from 
  rs_dim_blog.dim_customer;

This value (maxval) will be added to the row_number before being used as the customer_sk value for the change records that need to be inserted.

  4. Perform a full outer join of the old version of records (oldver) and the new version (newver) of records on the customer_id column. Then compare the old and new hash values generated by the sha2 function to determine if the change record is an insert, update, or delete:
case when oldver.customer_id is null then 'I'
when newver.customer_id is null then 'D'
when oldver.hash_value != newver.hash_value then 'U'
else 'N' end as iud_op

We tag the records as follows:

  • If the customer_id is non-existent in the oldver dataset (oldver.customer_id is null), it's tagged as an insert ('I').
  • Otherwise, if the customer_id is non-existent in the newver dataset (newver.customer_id is null), it's tagged as a delete ('D').
  • Otherwise, if the old hash_value and new hash_value are different, these records represent an update ('U').
  • Otherwise, it indicates that the record has not undergone any change and therefore can be ignored or marked as not-to-be-processed ('N').

Make sure to modify the preceding logic if the source extract contains rec_source_status to identify deleted records.

Although sha2 maps a possibly infinite set of input strings to a finite set of output strings, the chance of a hash collision between the original row values and the changed row values is very low. Instead of individually comparing each column value before and after, we compare the hash values generated by sha2 to conclude if there has been a change in any of the attributes of the customer record. For your use case, we recommend that you choose a hash function that works for your data conditions after adequate testing. Alternatively, you can compare individual column values if none of the hash functions satisfactorily meet your expectations.
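The following is a minimal sketch of that alternative for the oldver and newver datasets defined earlier, comparing attributes individually with null-safe coalesce wrappers instead of a hash:

-- Sketch: detect inserts, deletes, and updates without hashing
case
  when oldver.customer_id is null then 'I'
  when newver.customer_id is null then 'D'
  when coalesce(oldver.first_name, '') != coalesce(newver.first_name, '')
    or coalesce(oldver.last_name, '') != coalesce(newver.last_name, '')
    or coalesce(oldver.employer_name, '') != coalesce(newver.employer_name, '')
    or coalesce(oldver.email_id, '') != coalesce(newver.email_id, '')
    or coalesce(oldver.city, '') != coalesce(newver.city, '')
    or coalesce(oldver.country, '') != coalesce(newver.country, '') then 'U'
  else 'N'
end as iud_op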

  5. Combining the outputs from the preceding steps, let's create the INSERT statement that captures only change records to populate the temporary table:
insert into temp_dim_customer (
  customer_sk, customer_id, first_name, 
  last_name, employer_name, email_id, 
  city, country, rec_eff_dt, rec_exp_dt, 
  iud_operation
) with newver as (
  select 
    customer_id, 
    sha2(
      coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
    ) as hash_value, 
    first_name, 
    last_name, 
    employer_name, 
    email_id, 
    city, 
    country, 
    current_date rec_eff_dt, 
    cast('9999-12-31' as date) rec_exp_dt 
  from 
    rs_dim_blog.vw_dim_customer_src
), 
oldver as (
  select 
    customer_id, 
    sha2(
      coalesce(first_name, '') || coalesce(last_name, '') || coalesce(employer_name, '') || coalesce(email_id, '') || coalesce(city, '') || coalesce(country, ''), 512
    ) as hash_value, 
    first_name, 
    last_name, 
    employer_name, 
    email_id, 
    city, 
    country 
  from 
    rs_dim_blog.dim_customer 
  where 
    rec_exp_dt = '9999-12-31'
), 
maxsk as (
  select 
    max(customer_sk) as maxval 
  from 
    rs_dim_blog.dim_customer
), 
allrecs as (
  select 
    coalesce(oldver.customer_id, newver.customer_id) as customer_id, 
    case when oldver.customer_id is null then 'I' when newver.customer_id is null then 'D' when oldver.hash_value != newver.hash_value then 'U' else 'N' end as iud_op, 
    newver.first_name, 
    newver.last_name, 
    newver.employer_name, 
    newver.email_id, 
    newver.city, 
    newver.country, 
    newver.rec_eff_dt, 
    newver.rec_exp_dt 
  from 
    oldver full 
    outer join newver on oldver.customer_id = newver.customer_id
) 
select 
  (maxval + (row_number() over())) as customer_sk, 
  customer_id, 
  first_name, 
  last_name, 
  employer_name, 
  email_id, 
  city, 
  country, 
  rec_eff_dt, 
  rec_exp_dt, 
  iud_op 
from 
  allrecs, 
  maxsk 
where 
  iud_op != 'N';

Expire updated customer records

With the temp_dim_customer table now containing only the change records (either ‘I’, ‘U’, or ‘D’), these changes can now be applied to the target dim_customer table.

Let's first fetch all records with values 'U' or 'D' in the iud_op column. These are records that have either been deleted or updated in the source system. Because dim_customer is a slowly changing dimension, it needs to reflect the validity period of each customer record. In this case, we expire the presently active records that have been updated or deleted. We expire these records as of yesterday (by setting rec_exp_dt = current_date - 1), matching on the customer_id column:

update 
  rs_dim_blog.dim_customer 
set 
  rec_exp_dt = current_date - 1 
where 
  customer_id in (
    select 
      customer_id 
    from 
      temp_dim_customer as t 
    where 
      iud_operation in ('U', 'D')
  ) 
  and rec_exp_dt = '9999-12-31';

Insert new and changed records

As the last step, we need to insert the newer version of updated records along with all first-time inserts. These are indicated by ‘U’ and ‘I’, respectively, in the iud_op column in the temp_dim_customer table:

insert into rs_dim_blog.dim_customer (
  customer_sk, customer_id, first_name, 
  last_name, employer_name, email_id, 
  city, country, rec_eff_dt, rec_exp_dt
) 
select 
  customer_sk, 
  customer_id, 
  first_name, 
  last_name, 
  employer_name, 
  email_id, 
  city, 
  country, 
  rec_eff_dt, 
  rec_exp_dt 
from 
  temp_dim_customer 
where 
  iud_operation in ('I', 'U');

Depending on your SQL client settings, you might need to run a commit transaction; command to ensure that the preceding changes are persisted successfully in Amazon Redshift.

Check the final output

You can run the following query and see that the dim_customer table now contains both the initial data records plus the incremental data records, capturing multiple versions for those customer_id values that got changed as part of incremental data loading. The output also indicates that each record has been populated with appropriate values in rec_eff_dt and rec_exp_dt corresponding to the record validity period.

select 
  * 
from 
  rs_dim_blog.dim_customer 
order by 
  customer_id, 
  customer_sk;

For the sample data files provided in this article, the preceding query returns the following records. If you’re using the sample data files provided in this post, note that the values in customer_sk may not match with what is shown in the following table.

In this post, we only show the important SQL statements; the complete SQL code is available in load_scd2_sample_dim_customer.sql.

Clean up

If you no longer need the resources you created, you can delete them to prevent incurring additional charges.

Conclusion

In this post, you learned how to simplify data loading into Type-2 SCD tables in Amazon Redshift, covering both initial data loading and incremental data loading. The approach deals with multiple source tables populating a target dimension table, capturing the latest version of source records as of each run.

Refer to Amazon Redshift data loading best practices for further materials and additional best practices, and see Updating and inserting new data for instructions to implement updates and inserts.


About the Author

Vaidy Kalpathy is a Senior Data Lab Solution Architect at AWS, where he helps customers modernize their data platform and define an end-to-end data strategy, including data ingestion, transformation, security, and visualization. He is passionate about working backwards from business use cases, creating scalable and custom-fit architectures to help customers innovate using data analytics services on AWS.

Build an end-to-end change data capture with Amazon MSK Connect and AWS Glue Schema Registry

Post Syndicated from Kalyan Janaki original https://aws.amazon.com/blogs/big-data/build-an-end-to-end-change-data-capture-with-amazon-msk-connect-and-aws-glue-schema-registry/

The value of data is time sensitive. Real-time processing makes data-driven decisions accurate and actionable in seconds or minutes instead of hours or days. Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to a downstream system. Capturing every change from transactions in a source database and moving them to the target in real time keeps the systems synchronized, and helps with real-time analytics use cases and zero-downtime database migrations. The following are a few benefits of CDC:

  • It eliminates the need for bulk load updating and inconvenient batch windows by enabling incremental loading or real-time streaming of data changes into your target repository.
  • It ensures that data in multiple systems stays in sync. This is especially important if you’re making time-sensitive decisions in a high-velocity data environment.

Kafka Connect is an open-source component of Apache Kafka that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems. The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. Kafka Connect and Schema Registry integrate to capture schema information from connectors. Kafka Connect provides a mechanism for converting data from the internal data types used by Kafka Connect to data types represented as Avro, Protobuf, or JSON Schema. AvroConverter, ProtobufConverter, and JsonSchemaConverter automatically register schemas generated by Kafka connectors (source) that produce data to Kafka. Connectors (sink) that consume data from Kafka receive schema information in addition to the data for each message. This allows sink connectors to know the structure of the data to provide capabilities like maintaining a database table schema in a data catalog.

This post demonstrates how to build an end-to-end CDC pipeline using Amazon MSK Connect, an AWS managed service to deploy and run Kafka Connect applications, and the AWS Glue Schema Registry, which allows you to centrally discover, control, and evolve data stream schemas.

Solution overview

On the producer side, for this example we choose a MySQL-compatible Amazon Aurora database as the data source, and we have a Debezium MySQL connector to perform CDC. The Debezium connector continuously monitors the databases and pushes row-level changes to a Kafka topic. The connector fetches the schema from the database to serialize the records into a binary form. If the schema doesn’t already exist in the registry, the schema will be registered. If the schema exists but the serializer is using a new version, the schema registry checks the compatibility mode of the schema before updating the schema. In this solution, we use backward compatibility mode. The schema registry returns an error if a new version of the schema is not backward compatible, and we can configure Kafka Connect to send incompatible messages to the dead-letter queue.

On the consumer side, we use an Amazon Simple Storage Service (Amazon S3) sink connector to deserialize the record and store changes to Amazon S3. We build and deploy the Debezium connector and the Amazon S3 sink using MSK Connect.

Example schema

For this post, we use the following schema as the first version of the table:

{
    "Database Name": "sampledatabase",
    "Table Name": "movies",
    "Fields": [
        {
            "name": "movie_id",
            "type": "INTEGER"
        },
        {
            "name": "title",
            "type": "STRING"
        },
        {
            "name": "release_year",
            "type": "INTEGER"
        }
    ]
}
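For reference, a minimal MySQL DDL matching this schema might look like the following. This is a hedged sketch only; the walkthrough later runs a create_table.sql script that is already present on the client EC2 instance:

-- Hypothetical sketch of create_table.sql for the example schema
create database if not exists sampledatabase;
use sampledatabase;
create table movies (
  movie_id int primary key,
  title varchar(255),
  release_year int
);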

Prerequisites

Before configuring the MSK producer and consumer connectors, we need to first set up a data source, MSK cluster, and new schema registry. We provide an AWS CloudFormation template to generate the supporting resources needed for the solution:

  • A MySQL-compatible Aurora database as the data source. To perform CDC, we turn on binary logging in the DB cluster parameter group.
  • An MSK cluster. To simplify the network connection, we use the same VPC for the Aurora database and the MSK cluster.
  • Two schema registries to handle schemas for message key and message value.
  • One S3 bucket as the data sink.
  • MSK Connect plugins and worker configuration needed for this demo.
  • One Amazon Elastic Compute Cloud (Amazon EC2) instance to run database commands.

To set up resources in your AWS account, complete the following steps in an AWS Region that supports Amazon MSK, MSK Connect, and the AWS Glue Schema Registry:

  1. Choose Launch Stack:
  2. Choose Next.
  3. For Stack name, enter a suitable name.
  4. For Database Password, enter the password you want for the database user.
  5. Keep other values as default.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Create stack.

Custom plugin for the source and destination connector

A custom plugin is a set of JAR files that contain the implementation of one or more connectors, transforms, or converters. Amazon MSK will install the plugin on the workers of the MSK Connect cluster where the connector is running. As part of this demo, for the source connector we use the open-source Debezium MySQL connector JARs, and for the destination connector we use the Confluent community licensed Amazon S3 sink connector JARs. Both plugins also include the Avro serializer and deserializer libraries of the AWS Glue Schema Registry. These custom plugins are already created as part of the CloudFormation template deployed in the previous step.

Use the AWS Glue Schema Registry with the Debezium connector on MSK Connect as the MSK producer

We first deploy the source connector using the Debezium MySQL plugin to stream data from an Amazon Aurora MySQL-Compatible Edition database to Amazon MSK. Complete the following steps:

  1. On the Amazon MSK console, in the navigation pane, under MSK Connect, choose Connectors.
  2. Choose Create connector.
  3. Choose Use existing custom plugin and then pick the custom plugin with name starting msk-blog-debezium-source-plugin.
  4. Choose Next.
  5. Enter a suitable name like debezium-mysql-connector and an optional description.
  6. For Apache Kafka cluster, choose MSK cluster and choose the cluster created by the CloudFormation template.
  7. In Connector configuration, delete the default values and use the following configuration key-value pairs with the appropriate values:
    • name – The name used for the connector.
    • database.hostname – The CloudFormation output for Database Endpoint.
    • database.user and database.password – The parameters passed in the CloudFormation template.
    • database.history.kafka.bootstrap.servers – The CloudFormation output for Kafka Bootstrap.
    • key.converter.region and value.converter.region – Your Region.
name=<Connector-name>
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=<DBHOST>
database.port=3306
database.user=<DBUSER>
database.password=<DBPASSWORD>
database.server.id=42
database.server.name=db1
table.whitelist=sampledatabase.movies
database.history.kafka.bootstrap.servers=<MSK-BOOTSTRAP>
database.history.kafka.topic=dbhistory.demo1
key.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
key.converter.region=<REGION>
value.converter.region=<REGION>
key.converter.registry.name=msk-connect-blog-keys
value.converter.registry.name=msk-connect-blog-values
key.converter.compatibility=FORWARD
value.converter.compatibility=FORWARD
key.converter.schemaAutoRegistrationEnabled=true
value.converter.schemaAutoRegistrationEnabled=true
transforms=unwrap
transforms.unwrap.type=io.debezium.transforms.ExtractNewRecordState
transforms.unwrap.drop.tombstones=false
transforms.unwrap.delete.handling.mode=rewrite
transforms.unwrap.add.fields=op,source.ts_ms
tasks.max=1

Some of these settings are generic and should be specified for any connector. For example:

  • connector.class is the Java class of the connector
  • tasks.max is the maximum number of tasks that should be created for this connector

Some settings (database.*, transforms.*) are specific to the Debezium MySQL connector. Refer to Debezium MySQL Source Connector Configuration Properties for more information.

Some settings (key.converter.* and value.converter.*) are specific to the Schema Registry. We use the AWSKafkaAvroConverter from the AWS Glue Schema Registry Library as the format converter. To configure AWSKafkaAvroConverter, we use the value of the string constant properties in the AWSSchemaRegistryConstants class:

  • key.converter and value.converter control the format of the data that will be written to Kafka for source connectors or read from Kafka for sink connectors. We use AWSKafkaAvroConverter for Avro format.
  • key.converter.registry.name and value.converter.registry.name define which schema registry to use.
  • key.converter.compatibility and value.converter.compatibility define the compatibility model.

Refer to Using Kafka Connect with AWS Glue Schema Registry for more information.

  8. Next, we configure Connector capacity. We can choose Provisioned and leave other properties as their defaults.
  9. For Worker configuration, choose the custom worker configuration with name starting msk-gsr-blog created as part of the CloudFormation template.
  10. For Access permissions, use the AWS Identity and Access Management (IAM) role generated by the CloudFormation template MSKConnectRole.
  11. Choose Next.
  12. For Security, choose the defaults.
  13. Choose Next.
  14. For Log delivery, select Deliver to Amazon CloudWatch Logs and browse for the log group created by the CloudFormation template (msk-connector-logs).
  15. Choose Next.
  16. Review the settings and choose Create connector.

After a few minutes, the connector changes to running status.

Use the AWS Glue Schema Registry with the Confluent S3 sink connector running on MSK Connect as the MSK consumer

We deploy the sink connector using the Confluent S3 sink plugin to stream data from Amazon MSK to Amazon S3. Complete the following steps:

    1. On the Amazon MSK console, in the navigation pane, under MSK Connect, choose Connectors.
    2. Choose Create connector.
    3. Choose Use existing custom plugin and choose the custom plugin with name starting msk-blog-S3sink-plugin.
    4. Choose Next.
    5. Enter a suitable name like s3-sink-connector and an optional description.
    6. For Apache Kafka cluster, choose MSK cluster and select the cluster created by the CloudFormation template.
    7. In Connector configuration, delete the default values provided and use the following configuration key-value pairs with appropriate values:
        • name – The same name used for the connector.
        • s3.bucket.name – The CloudFormation output for Bucket Name.
        • s3.region, key.converter.region, and value.converter.region – Your Region.
name=<CONNECTOR-NAME>
connector.class=io.confluent.connect.s3.S3SinkConnector
s3.bucket.name=<BUCKET-NAME>
key.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
s3.region=<REGION>
storage.class=io.confluent.connect.s3.storage.S3Storage
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
flush.size=10
tasks.max=1
key.converter.schemaAutoRegistrationEnabled=true
value.converter.schemaAutoRegistrationEnabled=true
key.converter.region=<REGION>
value.converter.region=<REGION>
value.converter.avroRecordType=GENERIC_RECORD
key.converter.avroRecordType=GENERIC_RECORD
value.converter.compatibility=NONE
key.converter.compatibility=NONE
store.kafka.keys=false
schema.compatibility=NONE
topics=db1.sampledatabase.movies
value.converter.registry.name=msk-connect-blog-values
key.converter.registry.name=msk-connect-blog-keys
store.kafka.headers=false

  8. Next, we configure Connector capacity. We can choose Provisioned and leave other properties as their defaults.
  9. For Worker configuration, choose the custom worker configuration with name starting msk-gsr-blog created as part of the CloudFormation template.
  10. For Access permissions, use the IAM role generated by the CloudFormation template MSKConnectRole.
  11. Choose Next.
  12. For Security, choose the defaults.
  13. Choose Next.
  14. For Log delivery, select Deliver to Amazon CloudWatch Logs and browse for the log group created by the CloudFormation template msk-connector-logs.
  15. Choose Next.
  16. Review the settings and choose Create connector.

After a few minutes, the connector is running.

Test the end-to-end CDC log stream

Now that both the Debezium and S3 sink connectors are up and running, complete the following steps to test the end-to-end CDC:

  1. On the Amazon EC2 console, navigate to the Security groups page.
  2. Select the security group ClientInstanceSecurityGroup and choose Edit inbound rules.
  3. Add an inbound rule allowing SSH connection from your local network.
  4. On the Instances page, select the instance ClientInstance and choose Connect.
  5. On the EC2 Instance Connect tab, choose Connect.
  6. Ensure your current working directory is /home/ec2-user and that it contains the files create_table.sql, alter_table.sql, initial_insert.sql, and insert_data_with_new_column.sql.
  7. Create a table in your MySQL database by running the following command (provide the database host name from the CloudFormation template outputs):
mysql -h <DATABASE-HOST> -u master -p < create_table.sql
  8. When prompted for a password, enter the password from the CloudFormation template parameters.
  9. Insert some sample data into the table with the following command:
mysql -h <DATABASE-HOST> -u master -p < initial_insert.sql
  10. When prompted for a password, enter the password from the CloudFormation template parameters.
  11. On the AWS Glue console, choose Schema registries in the navigation pane, then choose Schemas.
  12. Navigate to db1.sampledatabase.movies version 1 to check the new schema created for the movies table:
{
  "type": "record",
  "name": "Value",
  "namespace": "db1.sampledatabase.movies",
  "fields": [
    {
      "name": "movie_id",
      "type": "int"
    },
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "release_year",
      "type": "int"
    },
    {
      "name": "__op",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "__source_ts_ms",
      "type": [
        "null",
        "long"
      ],
      "default": null
    },
    {
      "name": "__deleted",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ],
  "connect.name": "db1.sampledatabase.movies.Value"
}

A separate S3 folder is created for each partition of the Kafka topic, and data for the topic is written in that folder.

  13. On the Amazon S3 console, check for data written in Parquet format in the folder for your Kafka topic.

Schema evolution

After the initial schema is defined, applications may need to evolve it over time. When this happens, it's critical for the downstream consumers to be able to handle data encoded with both the old and the new schema seamlessly. Compatibility modes allow you to control how schemas can or can't evolve over time. These modes form the contract between applications producing and consuming data. For detailed information about the different compatibility modes available in the AWS Glue Schema Registry, refer to AWS Glue Schema Registry. In our example, we use backward compatibility to ensure consumers can read both the current and previous schema versions. Complete the following steps:

  1. Add a new column to the table by running the following command:
mysql -h <DATABASE-HOST> -u master -p < alter_table.sql
  2. Insert new data into the table by running the following command:
mysql -h <DATABASE-HOST> -u master -p < insert_data_with_new_column.sql
  3. On the AWS Glue console, choose Schema registries in the navigation pane, then choose Schemas.
  4. Navigate to the schema db1.sampledatabase.movies version 2 to check the new version of the schema created for the movies table, including the COUNTRY column that you added:
{
  "type": "record",
  "name": "Value",
  "namespace": "db1.sampledatabase.movies",
  "fields": [
    {
      "name": "movie_id",
      "type": "int"
    },
    {
      "name": "title",
      "type": "string"
    },
    {
      "name": "release_year",
      "type": "int"
    },
    {
      "name": "COUNTRY",
      "type": "string"
    },
    {
      "name": "__op",
      "type": [
        "null",
        "string"
      ],
      "default": null
    },
    {
      "name": "__source_ts_ms",
      "type": [
        "null",
        "long"
      ],
      "default": null
    },
    {
      "name": "__deleted",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ],
  "connect.name": "db1.sampledatabase.movies.Value"
}
  5. On the Amazon S3 console, check for data written in Parquet format in the folder for the Kafka topic.

Clean up

To help prevent unwanted charges to your AWS account, delete the AWS resources that you used in this post:

  1. On the Amazon S3 console, navigate to the S3 bucket created by the CloudFormation template.
  2. Select all files and folders and choose Delete.
  3. Enter permanently delete as directed and choose Delete objects.
  4. On the AWS CloudFormation console, delete the stack you created.
  5. Wait for the stack status to change to DELETE_COMPLETE.

Conclusion

This post demonstrated how to use Amazon MSK, MSK Connect, and the AWS Glue Schema Registry to build a CDC log stream and evolve schemas for data streams as business needs change. You can apply this architecture pattern to other data sources with different Kafka connectors. For more information, refer to the MSK Connect examples.


About the Author

Kalyan Janaki is Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.