Tag Archives: Best practices

Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes

2023-05-24 Avijit Goswami

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/improve-operational-efficiencies-of-apache-iceberg-tables-built-on-amazon-s3-data-lakes/

Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. When you build your transactional data lake using Apache Iceberg to solve your functional use cases, you need to focus on operational use cases for your S3 data lake to optimize the production environment. Some of the important non-functional use cases for an S3 data lake that organizations are focusing on include storage cost optimizations, capabilities for disaster recovery and business continuity, cross-account and multi-Region access to the data lake, and handling increased Amazon S3 request rates.

In this post, we show you how to improve operational efficiencies of your Apache Iceberg tables built on Amazon S3 data lake and Amazon EMR big data platform.

Optimize data lake storage

One of the major advantages of building modern data lakes on Amazon S3 is it offers lower cost without compromising on performance. You can use Amazon S3 Lifecycle configurations and Amazon S3 object tagging with Apache Iceberg tables to optimize the cost of your overall data lake storage. An Amazon S3 Lifecycle configuration is a set of rules that define actions that Amazon S3 applies to a group of objects. There are two types of actions:

Transition actions – These actions define when objects transition to another storage class; for example, Amazon S3 Standard to Amazon S3 Glacier.
Expiration actions – These actions define when objects expire. Amazon S3 deletes expired objects on your behalf.

Amazon S3 uses object tagging to categorize storage where each tag is a key-value pair. From an Apache Iceberg perspective, it supports custom Amazon S3 object tags that can be added to S3 objects while writing and deleting into the table. Iceberg also let you configure a tag-based object lifecycle policy at the bucket level to transition objects to different Amazon S3 tiers. With the s3.delete.tags config property in Iceberg, objects are tagged with the configured key-value pairs before deletion. When the catalog property s3.delete-enabled is set to false, the objects are not hard-deleted from Amazon S3. This is expected to be used in combination with Amazon S3 delete tagging, so objects are tagged and removed using an Amazon S3 lifecycle policy. This property is set to true by default.

The example notebook in this post shows an example implementation of S3 object tagging and lifecycle rules for Apache Iceberg tables to optimize storage cost.

Implement business continuity

Amazon S3 gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites. Amazon S3 is designed for 99.999999999% (11 9’s) of durability, S3 Standard is designed for 99.99% availability, and Standard – IA is designed for 99.9% availability. Still, to make your data lake workloads highly available in an unlikely outage situation, you can replicate your S3 data to another AWS Region as a backup. With S3 data residing in multiple Regions, you can use an S3 multi-Region access point as a solution to access the data from the backup Region. With Amazon S3 multi-Region access point failover controls, you can route all S3 data request traffic through a single global endpoint and directly control the shift of S3 data request traffic between Regions at any time. During a planned or unplanned regional traffic disruption, failover controls let you control failover between buckets in different Regions and accounts within minutes. Apache Iceberg supports access points to perform S3 operations by specifying a mapping of bucket to access points. We include an example implementation of an S3 access point with Apache Iceberg later in this post.

Increase Amazon S3 performance and throughput

Amazon S3 supports a request rate of 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. The resources for this request rate aren’t automatically assigned when a prefix is created. Instead, as the request rate for a prefix increases gradually, Amazon S3 automatically scales to handle the increased request rate. For certain workloads that need a sudden increase in the request rate for objects in a prefix, Amazon S3 might return 503 Slow Down errors, also known as S3 throttling. It does this while it scales in the background to handle the increased request rate. Also, if supported request rates are exceeded, it’s a best practice to distribute objects and requests across multiple prefixes. Implementing this solution to distribute objects and requests across multiple prefixes involves changes to your data ingress or data egress applications. Using Apache Iceberg file format for your S3 data lake can significantly reduce the engineering effort through enabling the ObjectStoreLocationProvider feature, which adds an S3 hash [0*7FFFFF] prefix in your specified S3 object path.

Iceberg by default uses the Hive storage layout, but you can switch it to use the ObjectStoreLocationProvider. This option is not enabled by default to provide flexibility to choose the location where you want to add the hash prefix. With ObjectStoreLocationProvider, a deterministic hash is generated for each stored file and a subfolder is appended right after the S3 folder specified using the parameter write.data.path (write.object-storage-path for Iceberg version 0.12 and below). This ensures that files written to Amazon S3 are equally distributed across multiple prefixes in your S3 bucket, thereby minimizing the throttling errors. In the following example, we set the write.data.path value as s3://my-table-data-bucket, and Iceberg-generated S3 hash prefixes will be appended after this location:

CREATE TABLE my_catalog.my_ns.my_table
( id bigint,
data string,
category string)
USING iceberg OPTIONS
( 'write.object-storage.enabled'=true,
'write.data.path'='s3://my-table-data-bucket')
PARTITIONED BY (category);

Your S3 files will be arranged under MURMUR3 S3 hash prefixes like the following:

2021-11-01 05:39:24 809.4 KiB 7ffbc860/my_ns/my_table/00328-1642-5ce681a7-dfe3-4751-ab10-37d7e58de08a-00015.parquet
2021-11-01 06:00:10 6.1 MiB 7ffc1730/my_ns/my_table/00460-2631-983d19bf-6c1b-452c-8195-47e450dfad9d-00001.parquet
2021-11-01 04:33:24 6.1 MiB 7ffeeb4e/my_ns/my_table/00156-781-9dbe3f08-0a1d-4733-bd90-9839a7ceda00-00002.parquet

Using Iceberg ObjectStoreLocationProvider is not a foolproof mechanism to avoid S3 503 errors. You still need to set appropriate EMRFS retries to provide additional resiliency. You can adjust your retry strategy by increasing the maximum retry limit for the default exponential backoff retry strategy or enabling and configuring the additive-increase/multiplicative-decrease (AIMD) retry strategy. AIMD is supported for Amazon EMR releases 6.4.0 and later. For more information, refer to Retry Amazon S3 requests with EMRFS.

In the following sections, we provide examples for these use cases.

Storage cost optimizations

In this example, we use Iceberg’s S3 tags feature with the write tag as write-tag-name=created and delete tag as delete-tag-name=deleted. This example is demonstrated on an EMR version emr-6.10.0 cluster with installed applications Hadoop 3.3.3, Jupyter Enterprise Gateway 2.6.0, and Spark 3.3.1. The examples are run on a Jupyter Notebook environment attached to the EMR cluster. To learn more about how to create an EMR cluster with Iceberg and use Amazon EMR Studio, refer to Use an Iceberg cluster with Spark and the Amazon EMR Studio Management Guide, respectively.

The following examples are also available in the sample notebook in the aws-samples GitHub repo for quick experimentation.

Configure Iceberg on a Spark session

Configure your Spark session using the %%configure magic command. You can use either the AWS Glue Data Catalog (recommended) or a Hive catalog for Iceberg tables. In this example, we use a Hive catalog, but we can change to the Data Catalog with the following configuration:

spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog

Before you run this step, create a S3 bucket and an iceberg folder in your AWS account with the naming convention <your-iceberg-storage-blog>/iceberg/.

Update your-iceberg-storage-blog in the following configuration with the bucket that you created to test this example. Note the configuration parameters s3.write.tags.write-tag-name and s3.delete.tags.delete-tag-name, which will tag the new S3 objects and deleted objects with corresponding tag values. We use these tags in later steps to implement S3 lifecycle policies to transition the objects to a lower-cost storage tier or expire them based on the use case.

%%configure -f { "conf":{ "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions", "spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog", "spark.sql.catalog.dev.catalog-impl":"org.apache.iceberg.hive.HiveCatalog", "spark.sql.catalog.dev.io-impl":"org.apache.iceberg.aws.s3.S3FileIO", "spark.sql.catalog.dev.warehouse":"s3://&amp;amp;lt;your-iceberg-storage-blog&amp;amp;gt;/iceberg/", "spark.sql.catalog.dev.s3.write.tags.write-tag-name":"created", "spark.sql.catalog.dev.s3.delete.tags.delete-tag-name":"deleted", "spark.sql.catalog.dev.s3.delete-enabled":"false" } }

Create an Apache Iceberg table using Spark-SQL

Now we create an Iceberg table for the Amazon Product Reviews Dataset:

spark.sql(""" DROP TABLE if exists dev.db.amazon_reviews_iceberg""")
spark.sql(""" CREATE TABLE dev.db.amazon_reviews_iceberg (
marketplace string,
customer_id string,
review_id string,
product_id string,
product_parent string,
product_title string,
star_rating int,
helpful_votes int,
total_votes int,
vine string,
verified_purchase string,
review_headline string,
review_body string,
review_date date,
year int)
USING iceberg
location 's3://<your-iceberg-storage-blog>/iceberg/db/amazon_reviews_iceberg'
PARTITIONED BY (years(review_date))""")

In the next step, we load the table with the dataset using Spark actions.

Load data into the Iceberg table

While inserting the data, we partition the data by review_date as per the table definition. Run the following Spark commands in your PySpark notebook:

df = spark.read.parquet("s3://amazon-reviews-pds/parquet/product_category=Electronics/*.parquet")

df.sortWithinPartitions("review_date").writeTo("dev.db.amazon_reviews_iceberg").append()

Insert a single record into the same Iceberg table so that it creates a partition with the current review_date:

spark.sql("""insert into dev.db.amazon_reviews_iceberg values ("US", "99999999","R2RX7KLOQQ5VBG","B00000JBAT","738692522","Diamond Rio Digital",3,0,0,"N","N","Why just 30 minutes?","RIO is really great",date("2023-04-06"),2023)""")

You can check the new snapshot is created after this append operation by querying the Iceberg snapshot:

spark.sql("""SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots""").show()

You will see an output similar to the following showing the operations performed on the table.

Check the S3 tag population

You can use the AWS Command Line Interface (AWS CLI) or the AWS Management Console to check the tags populated for the new writes. Let’s check the tag corresponding to the object created by a single row insert.

On the Amazon S3 console, check the S3 folder s3://your-iceberg-storage-blog/iceberg/db/amazon_reviews_iceberg/data/ and point to the partition review_date_year=2023/. Then check the Parquet file under this folder to check the tags associated with the data file in Parquet format.

From the AWS CLI, run the following command to see that the tag is created based on the Spark configuration spark.sql.catalog.dev.s3.write.tags.write-tag-name":"created":

xxxx@3c22fb1238d8 ~ % aws s3api get-object-tagging --bucket your-iceberg-storage-blog --key iceberg/db/amazon_reviews_iceberg/data/review_date_year=2023/00000-43-2fb892e3-0a3f-4821-a356-83204a69fa74-00001.parquet

You will see an output, similar to the below, showing the associated tags for the file

{ "VersionId": "null", "TagSet": [{ "Key": "write-tag-name", "Value": "created" } ] }

Delete a record and expire a snapshot

In this step, we delete a record from the Iceberg table and expire the snapshot corresponding to the deleted record. We delete the new single record that we inserted with the current review_date:

spark.sql("""delete from dev.db.amazon_reviews_iceberg where review_date = '2023-04-06'""")

We can now check that a new snapshot was created with the operation flagged as delete:

spark.sql("""SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots""").show()

This is useful if we want to time travel and check the deleted row in the future. In that case, we have to query the table with the snapshot-id corresponding to the deleted row. However, we don’t discuss time travel as part of this post.

We expire the old snapshots from the table and keep only the last two. You can modify the query based on your specific requirements to retain the snapshots:

spark.sql ("""CALL dev.system.expire_snapshots(table => 'dev.db.amazon_reviews_iceberg', older_than => DATE '2024-01-01', retain_last => 2)""")

If we run the same query on the snapshots, we can see that we have only two snapshots available:

spark.sql("""SELECT * FROM dev.db.amazon_reviews_iceberg.snapshots""").show()

From the AWS CLI, you can run the following command to see that the tag is created based on the Spark configuration spark.sql.catalog.dev.s3. delete.tags.delete-tag-name":"deleted":

xxxxxx@3c22fb1238d8 ~ % aws s3api get-object-tagging --bucket avijit-iceberg-storage-blog --key iceberg/db/amazon_reviews_iceberg/data/review_date_year=2023/00000-43-2fb892e3-0a3f-4821-a356-83204a69fa74-00001.parquet

You will see output similar to below showing the associated tags for the file

{ "VersionId": "null", "TagSet": [ { "Key": "delete-tag-name", "Value": "deleted" }, { "Key": "write-tag-name", "Value": "created" } ] }

You can view the existing metadata files from the metadata log entries metatable after the expiration of snapshots:

spark.sql("""SELECT * FROM dev.db.amazon_reviews_iceberg.metadata_log_entries""").show()

The snapshots that have expired show the latest snapshot ID as null.

Create S3 lifecycle rules to transition the buckets to a different storage tier

Create a lifecycle configuration for the bucket to transition objects with the delete-tag-name=deleted S3 tag to the Glacier Instant Retrieval class. Amazon S3 runs lifecycle rules one time every day at midnight Universal Coordinated Time (UTC), and new lifecycle rules can take up to 48 hours to complete the first run. Amazon S3 Glacier is well suited to archive data that needs immediate access (with milliseconds retrieval). With S3 Glacier Instant Retrieval, you can save up to 68% on storage costs compared to using the S3 Standard-Infrequent Access (S3 Standard-IA) storage class, when the data is accessed once per quarter.

When you want to access the data back, you can bulk restore the archived objects. After you restore the objects back in S3 Standard class, you can register the metadata and data as an archival table for query purposes. The metadata file location can be fetched from the metadata log entries metatable as illustrated earlier. As mentioned before, the latest snapshot ID with Null values indicates expired snapshots. We can take one of the expired snapshots and do the bulk restore:

spark.sql("""CALL dev.system.register_table(table => 'db.amazon_reviews_iceberg_archive', metadata_file => 's3://avijit-iceberg-storage-blog/iceberg/db/amazon_reviews_iceberg/metadata/00000-a010f15c-7ac8-4cd1-b1bc-bba99fa7acfc.metadata.json')""").show()

Capabilities for disaster recovery and business continuity, cross-account and multi-Region access to the data lake

Because Iceberg doesn’t support relative paths, you can use access points to perform Amazon S3 operations by specifying a mapping of buckets to access points. This is useful for multi-Region access, cross-Region access, disaster recovery, and more.

For cross-Region access points, we need to additionally set the use-arn-region-enabled catalog property to true to enable S3FileIO to make cross-Region calls. If an Amazon S3 resource ARN is passed in as the target of an Amazon S3 operation that has a different Region than the one the client was configured with, this flag must be set to ‘true‘ to permit the client to make a cross-Region call to the Region specified in the ARN, otherwise an exception will be thrown. However, for the same or multi-Region access points, the use-arn-region-enabled flag should be set to ‘false’.

For example, to use an S3 access point with multi-Region access in Spark 3.3, you can start the Spark SQL shell with the following code:

spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket2/my/key/prefix \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.s3.use-arn-region-enabled=false \
--conf spark.sql.catalog.test.s3.access-points.my-bucket1=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap \
--conf spark.sql.catalog.test.s3.access-points.my-bucket2=arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap

In this example, the objects in Amazon S3 on my-bucket1 and my-bucket2 buckets use the arn:aws:s3::123456789012:accesspoint:mfzwi23gnjvgw.mrap access point for all Amazon S3 operations.

For more details on using access points, refer to Using access points with compatible Amazon S3 operations.

Let’s say your table path is under mybucket1, so both mybucket1 in Region 1 and mybucket2 in Region have paths of mybucket1 inside the metadata files. At the time of the S3 (GET/PUT) call, we replace the mybucket1 reference with a multi-Region access point.

Handling increased S3 request rates

When using ObjectStoreLocationProvider (for more details, see Object Store File Layout), a deterministic hash is generated for each stored file, with the hash appended directly after the write.data.path. The problem with this is that the default hashing algorithm generates hash values up to Integer MAX_VALUE, which in Java is (2^31)-1. When this is converted to hex, it produces 0x7FFFFFFF, so the first character variance is restricted to only [0-8]. As per Amazon S3 recommendations, we should have the maximum variance here to mitigate this.

Starting from Amazon EMR 6.10, Amazon EMR added an optimized location provider that makes sure the generated prefix hash has uniform distribution in the first two characters using the character set from [0-9][A-Z][a-z].

This location provider has been recently open sourced by Amazon EMR via Core: Improve bit density in object storage layout and should be available starting from Iceberg 1.3.0.

To use, make sure the iceberg.enabled classification is set to true, and write.location-provider.impl is set to org.apache.iceberg.emr.OptimizedS3LocationProvider.

The following is a sample Spark shell command:

spark-shell --conf spark.driver.memory=4g \
--conf spark.executor.cores=4 \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/iceberg-V516168123 \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.table-override.write.location-provider.impl=org.apache.iceberg.emr.OptimizedS3LocationProvider

The following example shows that when you enable the object storage in your Iceberg table, it adds the hash prefix in your S3 path directly after the location you provide in your DDL.

Define the table write.object-storage.enabled parameter and provide the S3 path, after which you want to add the hash prefix using write.data.path (for Iceberg Version 0.13 and above) or write.object-storage.path (for Iceberg Version 0.12 and below) parameters.

Insert data into the table you created.

The hash prefix is added right after the /current/ prefix in the S3 path as defined in the DDL.

Clean up

After you complete the test, clean up your resources to avoid any recurring costs:

Delete the S3 buckets that you created for this test.
Delete the EMR cluster.
Stop and delete the EMR notebook instance.

Conclusion

As companies continue to build newer transactional data lake use cases using Apache Iceberg open table format on very large datasets on S3 data lakes, there will be an increased focus on optimizing those petabyte-scale production environments to reduce cost, improve efficiency, and implement high availability. This post demonstrated mechanisms to implement the operational efficiencies for Apache Iceberg open table formats running on AWS.

To learn more about Apache Iceberg and implement this open table format for your transactional data lake use cases, refer to the following resources:

About the Authors

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike in the San Francisco Bay Area trails, watch sports, and listen to music.

Rajarshi Sarkar is a Software Development Engineer at Amazon EMR/Athena. He works on cutting-edge features of Amazon EMR/Athena and is also involved in open-source projects such as Apache Iceberg and Trino. In his spare time, he likes to travel, watch movies, and hang out with friends.

Prashant Singh is a Software Development Engineer at AWS. He is interested in Databases and Data Warehouse engines and has worked on Optimizing Apache Spark performance on EMR. He is an active contributor in open source projects like Apache Spark and Apache Iceberg. During his free time, he enjoys exploring new places, food and hiking.

Reserving EC2 Capacity across Availability Zones by utilizing On Demand Capacity Reservations (ODCRs)

2023-05-23 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/reserving-ec2-capacity-across-availability-zones-by-utilizing-on-demand-capacity-reservations-odcrs/

This post is written by Johan Hedlund, Senior Solutions Architect, Enterprise PUMA.

Many customers have successfully migrated business critical legacy workloads to AWS, utilizing services such as Amazon Elastic Compute Cloud (Amazon EC2), Auto Scaling Groups (ASGs), as well as the use of Multiple Availability Zones (AZs), Regions for Business Continuity, and High Availability.

These critical applications require increased levels of availability to meet strict business Service Level Agreements (SLAs), even in extreme scenarios such as when EC2 functionality is impaired (see Advanced Multi-AZ Resilience Patterns for examples). Following AWS best practices such as architecting for flexibility will help here, but for some more rigid designs there can still be challenges around EC2 instance availability.

In this post, I detail an approach for Reserving Capacity for this type of scenario to mitigate the risk of the instance type(s) that your application needs being unavailable, including code for building it and ways of testing it.

Baseline: Multi-AZ application with restrictive instance needs

To focus on the problem of Capacity Reservation, our reference architecture is a simple horizontally scalable monolith. This consists of a single executable running across multiple instances as a cluster in an Auto Scaling group across three AZs for High Availability.

The application in this example is both business critical and memory intensive. It needs six r6i.4xlarge instances to meet the required specifications. R6i has been chosen to meet the required memory to vCPU requirements.

The third-party application we need to run, has a significant license cost, so we want to optimize our workload to make sure we run only the minimally required number of instances for the shortest amount of time.

The application should be resilient to issues in a single AZ. In the case of multi-AZ impact, it should failover to Disaster Recovery (DR) in an alternate Region, where service level objectives are instituted to return operations to defined parameters. But this is outside the scope for this post.

The problem: capacity during AZ failover

In this solution, the Auto Scaling Group automatically balances its instances across the selected AZs, providing a layer of resilience in the event of a disruption in a single AZ. However, this hinges on those instances being available for use in the Amazon EC2 capacity pools. The criticality of our application comes with SLAs which dictate that even the very low likelihood of instance types being unavailable in AWS must be mitigated.

The solution: Reserving Capacity

There are 2 main ways of Reserving Capacity for this scenario: (a) Running extra capacity 24/7, (b) On Demand Capacity Reservations (ODCRs).

In the past, another recommendation would have been to utilize Zonal Reserved Instances (Non Zonal will not Reserve Capacity). But although Zonal Reserved Instances do provide similar functionality as On Demand Capacity Reservations combined with Savings Plans, they do so in a less flexible way. Therefore, the recommendation from AWS is now to instead use On Demand Capacity Reservations in combination with Savings Plans for scenarios where Capacity Reservation is required.

The TCO impact of the licensing situation rules out the first of the two valid options. Merely keeping the spare capacity up and running all the time also doesn’t cover the scenario in which an instance needs to be stopped and started, for example for maintenance or patching. Without Capacity Reservation, there is a theoretical possibility that that instance type would not be available to start up again.

This leads us to the second option: On Demand Capacity Reservations.

How much capacity to reserve?

Our failure scenario is when functionality in one AZ is impaired and the Auto Scaling Group must shift its instances to the remaining AZs while maintaining the total number of instances. With a minimum requirement of six instances, this means that we need 6/2 = 3 instances worth of Reserved Capacity in each AZ (as we can’t know in advance which one will be affected).

Spinning up the solution

If you want to get hands-on experience with On Demand Capacity Reservations, refer to this CloudFormation template and its accompanying README file for details on how to spin up the solution that we’re using. The README also contains more information about the Stack architecture. Upon successful creation, you have the following architecture running in your account.

Note that the default instance type for the AWS CloudFormation stack has been downgraded to t2.micro to keep our experiment within the AWS Free Tier.

Testing the solution

Now we have a fully functioning solution with Reserved Capacity dedicated to this specific Auto Scaling Group. However, we haven’t tested it yet.

The tests utilize the AWS Command Line Interface (AWS CLI), which we execute using AWS CloudShell.

To interact with the resources created by CloudFormation, we need some names and IDs that have been collected in the “Outputs” section of the stack. These can be accessed from the console in a tab under the Stack that you have created.

We set these as variables for easy access later (replace the values with the values from your stack):

export AUTOSCALING_GROUP_NAME=ASGWithODCRs-CapacityBackedASG-13IZJWXF9QV8E
export SUBNET_FOR_MANUALLY_ADDED_INSTANCE=subnet-03045a72a6328ef72
export SUBNETS_TO_KEEP=subnet-03045a72a6328ef72,subnet-0fd00353b8a42f251

How does the solution react to scaling out the Auto Scaling Group beyond the Capacity Reservation?

First, let’s look at what happens if the Auto Scaling Group wants to Scale Out. Our requirements state that we should have a minimum of six instances running at any one time. But the solution should still adapt to increased load. Before knowing anything about how this works in AWS, imagine two scenarios:

The Auto Scaling Group can scale out to a total of nine instances, as that’s how many On Demand Capacity Reservations we have. But it can’t go beyond that even if there is On Demand capacity available.
The Auto Scaling Group can scale just as much as it could when On Demand Capacity Reservations weren’t used, and it continues to launch unreserved instances when the On Demand Capacity Reservations run out (assuming that capacity is in fact available, which is why we have the On Demand Capacity Reservations in the first place).

The instances section of the Amazon EC2 Management Console can be used to show our existing Capacity Reservations, as created by the CloudFormation stack.

As expected, this shows that we are currently using six out of our nine On Demand Capacity Reservations, with two in each AZ.

Now let’s scale out our Auto Scaling Group to 12, thus using up all On Demand Capacity Reservations in each AZ, as well as requesting one extra Instance per AZ.

aws autoscaling set-desired-capacity \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--desired-capacity 12

The Auto Scaling Group now has the desired Capacity of 12:

And in the Capacity Reservation screen we can see that all our On Demand Capacity Reservations have been used up:

In the Auto Scaling Group we see that – as expected – we weren’t restricted to nine instances. Instead, the Auto Scaling Group fell back on launching unreserved instances when our On Demand Capacity Reservations ran out:

How does the solution react to adding a matching instance outside the Auto Scaling Group?

But what if someone else/another process in the account starts an EC2 instance of the same type for which we have the On Demand Capacity Reservations? Won’t they get that Reservation, and our Auto Scaling Group will be left short of its three instances per AZ, which would mean that we won’t have enough reservations for our minimum of six instances in case there are issues with an AZ?

This all comes down to the type of On Demand Capacity Reservation that we have created, or the “Eligibility”. Looking at our Capacity Reservations, we can see that they are all of the “targeted” type. This means that they are only used if explicitly referenced, like we’re doing in our Target Group for the Auto Scaling Group.

It’s time to prove that. First, we scale in our Auto Scaling Group so that only six instances are used, resulting in there being one unused capacity reservation in each AZ. Then, we try to add an EC2 instance manually, outside the target group.

First, scale in the Auto Scaling Group:

aws autoscaling set-desired-capacity \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--desired-capacity 6

Then, spin up the new instance, and save its ID for later when we clean up:

export MANUALLY_CREATED_INSTANCE_ID=$(aws ec2 run-instances \
--image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \
--instance-type t2.micro \
--subnet-id $SUBNET_FOR_MANUALLY_ADDED_INSTANCE \
--query 'Instances[0].InstanceId' --output text)

We still have the three unutilized On Demand Capacity Reservations, as expected, proving that the On Demand Capacity Reservations with the “targeted” eligibility only get used when explicitly referenced:

How does the solution react to an AZ being removed?

Now we’re comfortable that the Auto Scaling Group can grow beyond the On Demand Capacity Reservations if needed, as long as there is capacity, and that other EC2 instances in our account won’t use the On Demand Capacity Reservations specifically purchased for the Auto Scaling Group. It’s time for the big test. How does it all behave when an AZ becomes unavailable?

For our purposes, we can simulate this scenario by changing the Auto Scaling Group to be across two AZs instead of the original three.

First, we scale out to seven instances so that we can see the impact of overflow outside the On Demand Capacity Reservations when we subsequently remove one AZ:

aws autoscaling set-desired-capacity \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--desired-capacity 7

Then, we change the Auto Scaling Group to only cover two AZs:

aws autoscaling update-auto-scaling-group \
--auto-scaling-group-name $AUTOSCALING_GROUP_NAME \
--vpc-zone-identifier $SUBNETS_TO_KEEP

Give it some time, and we see that the Auto Scaling Group is now spread across two AZs, On Demand Capacity Reservations cover the minimum six instances as per our requirements, and the rest is handled by instances without Capacity Reservation:

Cleanup

It’s time to clean up, as those Instances and On Demand Capacity Reservations come at a cost!

First, remove the EC2 instance that we made:

aws ec2 terminate-instances --instance-ids $MANUALLY_CREATED_INSTANCE_ID

Then, delete the CloudFormation stack.

Conclusion

Using a combination of Auto Scaling Groups, Resource Groups, and On Demand Capacity Reservations (ODCRs), we have built a solution that provides High Availability backed by reserved capacity, for those types of workloads where the requirements for availability in the case of an AZ becoming temporarily unavailable outweigh the increased cost of reserving capacity, and where the best practices for architecting for flexibility cannot be followed due to limitations on applicable architectures.

We have tested the solution and confirmed that the Auto Scaling Group falls back on using unreserved capacity when the On Demand Capacity Reservations are exhausted. Moreover, we confirmed that targeted On Demand Capacity Reservations won’t risk getting accidentally used by other solutions in our account.

Now it’s time for you to try it yourself! Download the IaC template and give it a try! And if you are planning on using On Demand Capacity Reservations, then don’t forget to look into Savings Plans, as they significantly reduce the cost of that Reserved Capacity..

Simplify AWS Glue job orchestration and monitoring with Amazon MWAA

2023-05-19 Rushabh Lokhande

Post Syndicated from Rushabh Lokhande original https://aws.amazon.com/blogs/big-data/simplify-aws-glue-job-orchestration-and-monitoring-with-amazon-mwaa/

Organizations across all industries have complex data processing requirements for their analytical use cases across different analytics systems, such as data lakes on AWS, data warehouses (Amazon Redshift), search (Amazon OpenSearch Service), NoSQL (Amazon DynamoDB), machine learning (Amazon SageMaker), and more. Analytics professionals are tasked with deriving value from data stored in these distributed systems to create better, secure, and cost-optimized experiences for their customers. For example, digital media companies seek to combine and process datasets in internal and external databases to build unified views of their customer profiles, spur ideas for innovative features, and increase platform engagement.

In these scenarios, customers looking for a serverless data integration offering use AWS Glue as a core component for processing and cataloging data. AWS Glue is well integrated with AWS services and partner products, and provides low-code/no-code extract, transform, and load (ETL) options to enable analytics, machine learning (ML), or application development workflows. AWS Glue ETL jobs may be one component in a more complex pipeline. Orchestrating the run of and managing dependencies between these components is a key capability in a data strategy. Amazon Managed Workflows for Apache Airflows (Amazon MWAA) orchestrates data pipelines using distributed technologies including on-premises resources, AWS services, and third-party components.

In this post, we show how to simplify monitoring an AWS Glue job orchestrated by Airflow using the latest features of Amazon MWAA.

Overview of solution

This post discusses the following:

How to upgrade an Amazon MWAA environment to version 2.4.3.
How to orchestrate an AWS Glue job from an Airflow Directed Acyclic Graph (DAG).
The Airflow Amazon provider package’s observability enhancements in Amazon MWAA. You can now consolidate run logs of AWS Glue jobs on the Airflow console to simplify troubleshooting data pipelines. The Amazon MWAA console becomes a single reference to monitor and analyze AWS Glue job runs. Previously, support teams needed to access the AWS Management Console and take manual steps for this visibility. This feature is available by default from Amazon MWAA version 2.4.3.

The following diagram illustrates our solution architecture.

Prerequisites

You need the following prerequisites:

Access to the console via a user with permissions to create AWS Glue jobs, create AWS Identity and Access Management (IAM) roles and policies, and manage or launch an Amazon MWAA environment
Amazon Athena configured with a workgroup
AWS CloudTrail set up with a trail logging into an Amazon Simple Storage Service (Amazon S3) bucket

Set up the Amazon MWAA environment

For instructions on creating your environment, refer to Create an Amazon MWAA environment. For existing users, we recommend upgrading to version 2.4.3 to take advantage of the observability enhancements featured in this post.

The steps to upgrade Amazon MWAA to version 2.4.3 differ depending on whether the current version is 1.10.12 or 2.2.2. We discuss both options in this post.

Prerequisites for setting up an Amazon MWAA environment

You must meet the following prerequisites:

The Amazon MWAA environment should be version 2.2.2 or greater. In this post, we demonstrate upgrading to 2.4.3, but you can use 2.2.2 by changing the apache-airflow-providers-amazon package constraints in requirements.txt.
The apache-airflow-providers-amazon package version should be 6.0.0 or greater.
The Amazon MWAA execution role should have glue:StartJobRun and glue:GetJobRun permissions.
The requirements.txt used by Amazon MWAA should include any additional Python modules your DAGs need.

Upgrade from version 1.10.12 to 2.4.3

If you’re using Amazon MWAA version 1.10.12, refer to Migrating to a new Amazon MWAA environment to upgrade to 2.4.3.

Upgrade from version 2.0.2 or 2.2.2 to 2.4.3

If you’re using Amazon MWAA environment version 2.2.2 or lower, complete the following steps:

Create a requirements.txt for any custom dependencies with specific versions required for your DAGs.
Upload the file to Amazon S3 in the appropriate location where the Amazon MWAA environment points to the requirements.txt for installing dependencies.
Follow the steps in Migrating to a new Amazon MWAA environment and select version 2.4.3.

Update your DAGs

Customers who upgraded from an older Amazon MWAA environment may need to make updates to existing DAGs. In Airflow version 2.4.3, the Airflow environment will use the Amazon provider package version 6.0.0 by default. This package may include some potentially breaking changes, such as changes to operator names. For example, the AWSGlueJobOperator has been deprecated and replaced with the GlueJobOperator. To maintain compatibility, update your Airflow DAGs by replacing any deprecated or unsupported operators from previous versions with the new ones. Complete the following steps:

Navigate to Amazon AWS Operators.
Select the appropriate version installed in your Amazon MWAA instance (6.0.0. by default) to find a list of supported Airflow operators.
Make the necessary changes in the existing DAG code and upload the modified files to the DAG location in Amazon S3.

Orchestrate the AWS Glue job from Airflow

This section covers the details of orchestrating an AWS Glue job within Airflow DAGs. Airflow eases the development of data pipelines with dependencies between heterogeneous systems such as on-premises processes, external dependencies, other AWS services, and more.

Orchestrate CloudTrail log aggregation with AWS Glue and Amazon MWAA

In this example, we go through a use case of using Amazon MWAA to orchestrate an AWS Glue Python Shell job that persists aggregated metrics based on CloudTrail logs.

CloudTrail enables visibility into AWS API calls that are being made in your AWS account. A common use case with this data would be to gather usage metrics on principals acting on your account’s resources for auditing and regulatory needs.

As CloudTrail events are being logged, they are delivered as JSON files in Amazon S3, which aren’t ideal for analytical queries. We want to aggregate this data and persist it as Parquet files to allow for optimal query performance. As an initial step, we can use Athena to do the initial querying of the data before doing additional aggregations in our AWS Glue job. For more information about creating an AWS Glue Data Catalog table, refer to Creating the table for CloudTrail logs in Athena using partition projection data. After we’ve explored the data via Athena and decided what metrics we want to retain in aggregate tables, we can create an AWS Glue job.

Create an CloudTrail table in Athena

First, we need to create a table in our Data Catalog that allows CloudTrail data to be queried via Athena. The following sample query creates a table with two partitions on the Region and date (called snapshot_date). Be sure to replace the placeholders for your CloudTrail bucket, AWS account ID, and CloudTrail table name:

create external table if not exists `<<<CLOUDTRAIL_TABLE_NAME>>>`(
  `eventversion` string comment 'from deserializer', 
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> comment 'from deserializer', 
  `eventtime` string comment 'from deserializer', 
  `eventsource` string comment 'from deserializer', 
  `eventname` string comment 'from deserializer', 
  `awsregion` string comment 'from deserializer', 
  `sourceipaddress` string comment 'from deserializer', 
  `useragent` string comment 'from deserializer', 
  `errorcode` string comment 'from deserializer', 
  `errormessage` string comment 'from deserializer', 
  `requestparameters` string comment 'from deserializer', 
  `responseelements` string comment 'from deserializer', 
  `additionaleventdata` string comment 'from deserializer', 
  `requestid` string comment 'from deserializer', 
  `eventid` string comment 'from deserializer', 
  `resources` array<struct<arn:string,accountid:string,type:string>> comment 'from deserializer', 
  `eventtype` string comment 'from deserializer', 
  `apiversion` string comment 'from deserializer', 
  `readonly` string comment 'from deserializer', 
  `recipientaccountid` string comment 'from deserializer', 
  `serviceeventdetails` string comment 'from deserializer', 
  `sharedeventid` string comment 'from deserializer', 
  `vpcendpointid` string comment 'from deserializer')
PARTITIONED BY ( 
  `region` string,
  `snapshot_date` string)
ROW FORMAT SERDE 
  'com.amazon.emr.hive.serde.CloudTrailSerde' 
STORED AS INPUTFORMAT 
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<<<CLOUDTRAIL_BUCKET>>>/AWSLogs/<<<ACCOUNT_ID>>>/CloudTrail/'
TBLPROPERTIES (
  'projection.enabled'='true', 
  'projection.region.type'='enum',
  'projection.region.values'='us-east-2,us-east-1,us-west-1,us-west-2,af-south-1,ap-east-1,ap-south-1,ap-northeast-3,ap-northeast-2,ap-southeast-1,ap-southeast-2,ap-northeast-1,ca-central-1,eu-central-1,eu-west-1,eu-west-2,eu-south-1,eu-west-3,eu-north-1,me-south-1,sa-east-1',
  'projection.snapshot_date.format'='yyyy/mm/dd', 
  'projection.snapshot_date.interval'='1', 
  'projection.snapshot_date.interval.unit'='days', 
  'projection.snapshot_date.range'='2020/10/01,now', 
  'projection.snapshot_date.type'='date',
  'storage.location.template'='s3://<<<CLOUDTRAIL_BUCKET>>>/AWSLogs/<<<ACCOUNT_ID>>>/CloudTrail/${region}/${snapshot_date}')

Run the preceding query on the Athena console, and note the table name and AWS Glue Data Catalog database where it was created. We use these values later in the Airflow DAG code.

Sample AWS Glue job code

The following code is a sample AWS Glue Python Shell job that does the following:

Takes arguments (which we pass from our Amazon MWAA DAG) on what day’s data to process
Uses the AWS SDK for Pandas to run an Athena query to do the initial filtering of the CloudTrail JSON data outside AWS Glue
Uses Pandas to do simple aggregations on the filtered data
Outputs the aggregated data to the AWS Glue Data Catalog in a table
Uses logging during processing, which will be visible in Amazon MWAA

import awswrangler as wr
import pandas as pd
import sys
import logging
from awsglue.utils import getResolvedOptions
from datetime import datetime, timedelta

# Logging setup, redirects all logs to stdout
LOGGER = logging.getLogger()
formatter = logging.Formatter('%(asctime)s.%(msecs)03d %(levelname)s %(module)s - %(funcName)s: %(message)s')
streamHandler = logging.StreamHandler(sys.stdout)
streamHandler.setFormatter(formatter)
LOGGER.addHandler(streamHandler)
LOGGER.setLevel(logging.INFO)

LOGGER.info(f"Passed Args :: {sys.argv}")

sql_query_template = """
select
region,
useridentity.arn,
eventsource,
eventname,
useragent

from "{cloudtrail_glue_db}"."{cloudtrail_table}"
where snapshot_date='{process_date}'
and region in ('us-east-1','us-east-2')
"""

required_args = ['CLOUDTRAIL_GLUE_DB',
                'CLOUDTRAIL_TABLE',
                'TARGET_BUCKET',
                'TARGET_DB',
                'TARGET_TABLE',
                'ACCOUNT_ID']
arg_keys = [*required_args, 'PROCESS_DATE'] if '--PROCESS_DATE' in sys.argv else required_args
JOB_ARGS = getResolvedOptions ( sys.argv, arg_keys)

LOGGER.info(f"Parsed Args :: {JOB_ARGS}")

# if process date was not passed as an argument, process yesterday's data
process_date = (
    JOB_ARGS['PROCESS_DATE']
    if JOB_ARGS.get('PROCESS_DATE','NONE') != "NONE" 
    else (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d") 
)

LOGGER.info(f"Taking snapshot for :: {process_date}")

RAW_CLOUDTRAIL_DB = JOB_ARGS['CLOUDTRAIL_GLUE_DB']
RAW_CLOUDTRAIL_TABLE = JOB_ARGS['CLOUDTRAIL_TABLE']
TARGET_BUCKET = JOB_ARGS['TARGET_BUCKET']
TARGET_DB = JOB_ARGS['TARGET_DB']
TARGET_TABLE = JOB_ARGS['TARGET_TABLE']
ACCOUNT_ID = JOB_ARGS['ACCOUNT_ID']

final_query = sql_query_template.format(
    process_date=process_date.replace("-","/"),
    cloudtrail_glue_db=RAW_CLOUDTRAIL_DB,
    cloudtrail_table=RAW_CLOUDTRAIL_TABLE
)

LOGGER.info(f"Running Query :: {final_query}")

raw_cloudtrail_df = wr.athena.read_sql_query(
    sql=final_query,
    database=RAW_CLOUDTRAIL_DB,
    ctas_approach=False,
    s3_output=f"s3://{TARGET_BUCKET}/athena-results",
)

raw_cloudtrail_df['ct']=1

agg_df = raw_cloudtrail_df.groupby(['arn','region','eventsource','eventname','useragent'],as_index=False).agg({'ct':'sum'})
agg_df['snapshot_date']=process_date

LOGGER.info(agg_df.info(verbose=True))

upload_path = f"s3://{TARGET_BUCKET}/{TARGET_DB}/{TARGET_TABLE}"

if not agg_df.empty:
    LOGGER.info(f"Upload to {upload_path}")
    try:
        response = wr.s3.to_parquet(
            df=agg_df,
            path=upload_path,
            dataset=True,
            database=TARGET_DB,
            table=TARGET_TABLE,
            mode="overwrite_partitions",
            schema_evolution=True,
            partition_cols=["snapshot_date"],
            compression="snappy",
            index=False
        )
        LOGGER.info(response)
    except Exception as exc:
        LOGGER.error("Uploading to S3 failed")
        LOGGER.exception(exc)
        raise exc
else:
    LOGGER.info(f"Dataframe was empty, nothing to upload to {upload_path}")

The following are some key advantages in this AWS Glue job:

We use an Athena query to ensure initial filtering is done outside of our AWS Glue job. As such, a Python Shell job with minimal compute is still sufficient for aggregating a large CloudTrail dataset.
We ensure the analytics library-set option is turned on when creating our AWS Glue job to use the AWS SDK for Pandas library.

Create an AWS Glue job

Complete the following steps to create your AWS Glue job:

Copy the script in the preceding section and save it in a local file. For this post, the file is called script.py.
On the AWS Glue console, choose ETL jobs in the navigation pane.
Create a new job and select Python Shell script editor.
Select Upload and edit an existing script and upload the file you saved locally.
Choose Create.

On the Job details tab, enter a name for your AWS Glue job.
For IAM role, choose an existing role or create a new role that has the required permissions for Amazon S3, AWS Glue, and Athena. The role needs to query the CloudTrail table you created earlier and write to an output location.

You can use the following sample policy code. Replace the placeholders with your CloudTrail logs bucket, output table name, output AWS Glue database, output S3 bucket, CloudTrail table name, AWS Glue database containing the CloudTrail table, and your AWS account ID.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:List*",
                "s3:Get*"
            ],
            "Resource": [
                "arn:aws:s3:::<<<CLOUDTRAIL_LOGS_BUCKET>>>/*",
                "arn:aws:s3:::<<<CLOUDTRAIL_LOGS_BUCKET>>>*"
            ],
            "Effect": "Allow",
            "Sid": "GetS3CloudtrailData"
        },
        {
            "Action": [
                "glue:Get*",
                "glue:BatchGet*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:catalog",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:database/<<<GLUE_DB_WITH_CLOUDTRAIL_TABLE>>>",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:table/<<<GLUE_DB_WITH_CLOUDTRAIL_TABLE>>>/<<<CLOUDTRAIL_TABLE>>>*"
            ],
            "Effect": "Allow",
            "Sid": "GetGlueCatalogCloudtrailData"
        },
        {
            "Action": [
                "s3:PutObject*",
                "s3:Abort*",
                "s3:DeleteObject*",
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*",
                "s3:Head*"
            ],
            "Resource": [
                "arn:aws:s3:::<<<OUTPUT_S3_BUCKET>>>",
                "arn:aws:s3:::<<<OUTPUT_S3_BUCKET>>>/<<<OUTPUT_GLUE_DB>>>/<<<OUTPUT_TABLE_NAME>>>/*"
            ],
            "Effect": "Allow",
            "Sid": "WriteOutputToS3"
        },
        {
            "Action": [
                "glue:CreateTable",
                "glue:CreatePartition",
                "glue:UpdatePartition",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:DeletePartition",
                "glue:BatchCreatePartition",
                "glue:BatchDeletePartition",
                "glue:Get*",
                "glue:BatchGet*"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:catalog",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:database/<<<OUTPUT_GLUE_DB>>>",
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:table/<<<OUTPUT_GLUE_DB>>>/<<<OUTPUT_TABLE_NAME>>>*"
            ],
            "Effect": "Allow",
            "Sid": "AllowOutputToGlue"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:/aws-glue/*",
            "Effect": "Allow",
            "Sid": "LogsAccess"
        },
        {
            "Action": [
                "s3:GetObject*",
                "s3:GetBucket*",
                "s3:List*",
                "s3:DeleteObject*",
                "s3:PutObject",
                "s3:PutObjectLegalHold",
                "s3:PutObjectRetention",
                "s3:PutObjectTagging",
                "s3:PutObjectVersionTagging",
                "s3:Abort*"
            ],
            "Resource": [
                "arn:aws:s3:::<<<ATHENA_RESULTS_BUCKET>>>",
                "arn:aws:s3:::<<<ATHENA_RESULTS_BUCKET>>>/*"
            ],
            "Effect": "Allow",
            "Sid": "AccessToAthenaResults"
        },
        {
            "Action": [
                "athena:StartQueryExecution",
                "athena:StopQueryExecution",
                "athena:GetDataCatalog",
                "athena:GetQueryResults",
                "athena:GetQueryExecution"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:catalog",
                "arn:aws:athena:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:datacatalog/AwsDataCatalog",
                "arn:aws:athena:us-east-1:<<<YOUR_AWS_ACCT_ID>>>:workgroup/primary"
            ],
            "Effect": "Allow",
            "Sid": "AllowAthenaQuerying"
        }
    ]
}

For Python version, choose Python 3.9.

Select Load common analytics libraries.
For Data processing units, choose 1 DPU.
Leave the other options as default or adjust as needed.

Choose Save to save your job configuration.

Configure an Amazon MWAA DAG to orchestrate the AWS Glue job

The following code is for a DAG that can orchestrate the AWS Glue job that we created. We take advantage of the following key features in this DAG:

We use the DAG run configuration to pass or override parameters. In our example, we can pass an optional process_date variable, which allows us to do backfills via the AWS Glue job. If it is not passed, the previous day’s data is processed.
We pass script_args to the GlueJobOperator. These are passed as arguments to the AWS Glue job run where getResolvedOptions is used to retrieve them.
The AWS Glue job logs are relayed to Amazon MWAA by passing the verbose=True parameter.

"""Sample DAG"""
import airflow.utils
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
from airflow import DAG
from datetime import timedelta
import airflow.utils

# allow backfills via DAG run parameters
process_date = '{{ dag_run.conf.get("process_date") if dag_run.conf.get("process_date") else "NONE" }}'

dag = DAG(
    dag_id = "CLOUDTRAIL_LOGS_PROCESSING",
    default_args = {
        'depends_on_past':False, 
        'start_date':airflow.utils.dates.days_ago(0),
        'retries':1,
        'retry_delay':timedelta(minutes=5),
        'catchup': False
    },
    schedule_interval = None, # None for unscheduled or a cron expression - E.G. "00 12 * * 2" - at 12noon Tuesday
    dagrun_timeout = timedelta(minutes=30),
    max_active_runs = 1,
    max_active_tasks = 1 # since there is only one task in our DAG
)

## Log ingest. Assumes Glue Job is already created
glue_ingestion_job = GlueJobOperator(
    task_id="<<<some-task-id>>>",
    job_name="<<<GLUE_JOB_NAME>>>",
    script_args={
        "--ACCOUNT_ID":"<<<YOUR_AWS_ACCT_ID>>>",
        "--CLOUDTRAIL_GLUE_DB":"<<<GLUE_DB_WITH_CLOUDTRAIL_TABLE>>>",
        "--CLOUDTRAIL_TABLE":"<<<CLOUDTRAIL_TABLE>>>",
        "--TARGET_BUCKET": "<<<OUTPUT_S3_BUCKET>>>",
        "--TARGET_DB": "<<<OUTPUT_GLUE_DB>>>", # should already exist
        "--TARGET_TABLE": "<<<OUTPUT_TABLE_NAME>>>",
        "--PROCESS_DATE": process_date
    },
    region_name="us-east-1",
    dag=dag,
    verbose=True
)

glue_ingestion_job

Increase observability of AWS Glue jobs in Amazon MWAA

The AWS Glue jobs write logs to Amazon CloudWatch. With the recent observability enhancements to Airflow’s Amazon provider package, these logs are now integrated with Airflow task logs. This consolidation provides Airflow users with end-to-end visibility directly in the Airflow UI, eliminating the need to search in CloudWatch or the AWS Glue console.

To use this feature, ensure the IAM role attached to the Amazon MWAA environment has the following permissions to retrieve and write the necessary logs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:GetLogEvents",
        "logs:GetLogRecord",
        "logs:DescribeLogStreams",
        "logs:FilterLogEvents",
        "logs:GetLogGroupFields",
        "logs:GetQueryResults",
        
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:airflow-243-<<<Your environment name>>>-*"--Your Amazon MWAA Log Stream Name
      ]
    }
  ]
}

If verbose=true, the AWS Glue job run logs show in the Airflow task logs. The default is false. For more information, refer to Parameters.

When enabled, the DAGs read from the AWS Glue job’s CloudWatch log stream and relay them to the Airflow DAG AWS Glue job step logs. This provides detailed insights into an AWS Glue job’s run in real time via the DAG logs. Note that AWS Glue jobs generate an output and error CloudWatch log group based on the job’s STDOUT and STDERR, respectively. All logs in the output log group and exception or error logs from the error log group are relayed into Amazon MWAA.

AWS admins can now limit a support team’s access to only Airflow, making Amazon MWAA the single pane of glass on job orchestration and job health management. Previously, users needed to check AWS Glue job run status in the Airflow DAG steps and retrieve the job run identifier. They then needed to access the AWS Glue console to find the job run history, search for the job of interest using the identifier, and finally navigate to the job’s CloudWatch logs to troubleshoot.

Create the DAG

To create the DAG, complete the following steps:

Save the preceding DAG code to a local .py file, replacing the indicated placeholders.

The values for your AWS account ID, AWS Glue job name, AWS Glue database with CloudTrail table, and CloudTrail table name should already be known. You can adjust the output S3 bucket, output AWS Glue database, and output table name as needed, but make sure the AWS Glue job’s IAM role that you used earlier is configured accordingly.

On the Amazon MWAA console, navigate to your environment to see where the DAG code is stored.

The DAGs folder is the prefix within the S3 bucket where your DAG file should be placed.

Upload your edited file there.

Open the Amazon MWAA console to confirm that the DAG appears in the table.

Run the DAG

To run the DAG, complete the following steps:

Choose from the following options:
- Trigger DAG – This causes yesterday’s data to be used as the data to process
- Trigger DAG w/ config – With this option, you can pass in a different date, potentially for backfills, which is retrieved using dag_run.conf in the DAG code and then passed into the AWS Glue job as a parameter

The following screenshot shows the additional configuration options if you choose Trigger DAG w/ config.

Monitor the DAG as it runs.
When the DAG is complete, open the run’s details.

On the right pane, you can view the logs, or choose Task Instance Details for a full view.

View the AWS Glue job output logs in Amazon MWAA without using the AWS Glue console thanks to the GlueJobOperator verbose flag.

The AWS Glue job will have written results to the output table you specified.

Query this table via Athena to confirm it was successful.

Summary

Amazon MWAA now provides a single place to track AWS Glue job status and enables you to use the Airflow console as the single pane of glass for job orchestration and health management. In this post, we walked through the steps to orchestrate AWS Glue jobs via Airflow using GlueJobOperator. With the new observability enhancements, you can seamlessly troubleshoot AWS Glue jobs in a unified experience. We also demonstrated how to upgrade your Amazon MWAA environment to a compatible version, update dependencies, and change the IAM role policy accordingly.

For more information about common troubleshooting steps, refer to Troubleshooting: Creating and updating an Amazon MWAA environment. For in-depth details of migrating to an Amazon MWAA environment, refer to Upgrading from 1.10 to 2. To learn about the open-source code changes for increased observability of AWS Glue jobs in the Airflow Amazon provider package, refer to the relay logs from AWS Glue jobs.

Finally, we recommend visiting the AWS Big Data Blog for other material on analytics, ML, and data governance on AWS.

About the Authors

Rushabh Lokhande is a Data & ML Engineer with the AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and golf.

Ryan Gomes is a Data & ML Engineer with the AWS Professional Services Analytics Practice. He is passionate about helping customers achieve better outcomes through analytics and machine learning solutions in the cloud. Outside of work, he enjoys fitness, cooking, and spending quality time with friends and family.

Vishwa Gupta is a Senior Data Architect with the AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

How Zoom implemented streaming log ingestion and efficient GDPR deletes using Apache Hudi on Amazon EMR

2023-05-16 Sekar Srinivasan

Post Syndicated from Sekar Srinivasan original https://aws.amazon.com/blogs/big-data/how-zoom-implemented-streaming-log-ingestion-and-efficient-gdpr-deletes-using-apache-hudi-on-amazon-emr/

In today’s digital age, logging is a critical aspect of application development and management, but efficiently managing logs while complying with data protection regulations can be a significant challenge. Zoom, in collaboration with the AWS Data Lab team, developed an innovative architecture to overcome these challenges and streamline their logging and record deletion processes. In this post, we explore the architecture and the benefits it provides for Zoom and its users.

Application log challenges: Data management and compliance

Application logs are an essential component of any application; they provide valuable information about the usage and performance of the system. These logs are used for a variety of purposes, such as debugging, auditing, performance monitoring, business intelligence, system maintenance, and security. However, although these application logs are necessary for maintaining and improving the application, they also pose an interesting challenge. These application logs may contain personally identifiable data, such as user names, email addresses, IP addresses, and browsing history, which creates a data privacy concern.

Laws such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) require organizations to retain application logs for a specific period of time. The exact length of time required for data storage varies depending on the specific regulation and the type of data being stored. The reason for these data retention periods is to ensure that companies aren’t keeping personal data longer than necessary, which could increase the risk of data breaches and other security incidents. This also helps ensure that companies aren’t using personal data for purposes other than those for which it was collected, which could be a violation of privacy laws. These laws also give individuals the right to request the deletion of their personal data, also known as the “right to be forgotten.” Individuals have the right to have their personal data erased, without undue delay.

So, on one hand, organizations need to collect application log data to ensure the proper functioning of their services, and keep the data for a specific period of time. But on the other hand, they may receive requests from individuals to delete their personal data from the logs. This creates a balancing act for organizations because they must comply with both data retention and data deletion requirements.

This issue becomes increasingly challenging for larger organizations that operate in multiple countries and states, because each country and state may have their own rules and regulations regarding data retention and deletion. For example, the Personal Information Protection and Electronic Documents Act (PIPEDA) in Canada and the Australian Privacy Act in Australia are similar laws to GDPR, but they may have different retention periods or different exceptions. Therefore, organizations big or small must navigate this complex landscape of data retention and deletion requirements, while also ensuring that they are in compliance with all applicable laws and regulations.

Zoom’s initial architecture

During the COVID-19 pandemic, the use of Zoom skyrocketed as more and more people were asked to work and attend classes from home. The company had to rapidly scale its services to accommodate the surge and worked with AWS to deploy capacity across most Regions globally. With a sudden increase in the large number of application endpoints, they had to rapidly evolve their log analytics architecture and worked with the AWS Data Lab team to quickly prototype and deploy an architecture for their compliance use case.

At Zoom, the data ingestion throughput and performance needs are very stringent. Data had to be ingested from several thousand application endpoints that produced over 30 million messages every minute, resulting in over 100 TB of log data per day. The existing ingestion pipeline consisted of writing the data to Apache Hadoop HDFS storage through Apache Kafka first and then running daily jobs to move the data to persistent storage. This took several hours while also slowing the ingestion and creating the potential for data loss. Scaling the architecture was also an issue because HDFS data would have to be moved around whenever nodes were added or removed. Furthermore, transactional semantics on billions of records were necessary to help meet compliance-related data delete requests, and the existing architecture of daily batch jobs was operationally inefficient.

It was at this time, through conversations with the AWS account team, that the AWS Data Lab team got involved to assist in building a solution for Zoom’s hyper-scale.

Solution overview

The AWS Data Lab offers accelerated, joint engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data, analytics, artificial intelligence (AI), machine learning (ML), serverless, and container modernization initiatives. The Data Lab has three offerings: the Build Lab, the Design Lab, and Resident Architect. During the Build and Design Labs, AWS Data Lab Solutions Architects and AWS experts supported Zoom specifically by providing prescriptive architectural guidance, sharing best practices, building a working prototype, and removing technical roadblocks to help meet their production needs.

Zoom and the AWS team (collectively referred to as “the team” going forward) identified two major workflows for data ingestion and deletion.

Data ingestion workflow

The following diagram illustrates the data ingestion workflow.

Data Ingestion Workflow

The team needed to quickly populate millions of Kafka messages in the dev/test environment to achieve this. To expedite the process, we (the team) opted to use Amazon Managed Streaming for Apache Kafka (Amazon MSK), which makes it simple to ingest and process streaming data in real time, and we were up and running in under a day.

To generate test data that resembled production data, the AWS Data Lab team created a custom Python script that evenly populated over 1.2 billion messages across several Kafka partitions. To match the production setup in the development account, we had to increase the cloud quota limit via a support ticket.

We used Amazon MSK and the Spark Structured Streaming capability in Amazon EMR to ingest and process the incoming Kafka messages with high throughput and low latency. Specifically, we inserted the data from the source into EMR clusters at a maximum incoming rate of 150 million Kafka messages every 5 minutes, with each Kafka message holding 7–25 log data records.

To store the data, we chose to use Apache Hudi as the table format. We opted for Hudi because it’s an open-source data management framework that provides record-level insert, update, and delete capabilities on top of an immutable storage layer like Amazon Simple Storage Service (Amazon S3). Additionally, Hudi is optimized for handling large datasets and works well with Spark Structured Streaming, which was already being used at Zoom.

After 150 million messages were buffered, we processed the messages using Spark Structured Streaming on Amazon EMR and wrote the data into Amazon S3 in Apache Hudi-compatible format every 5 minutes. We first flattened the message array, creating a single record from the nested array of messages. Then we added a unique key, known as the Hudi record key, to each message. This key allows Hudi to perform record-level insert, update, and delete operations on the data. We also extracted the field values, including the Hudi partition keys, from incoming messages.

This architecture allowed end-users to query the data stored in Amazon S3 using Amazon Athena with the AWS Glue Data Catalog or using Apache Hive and Presto.

Data deletion workflow

The following diagram illustrates the data deletion workflow.

Data Deletion Workflow

Our architecture allowed for efficient data deletions. To help comply with the customer-initiated data retention policy for GDPR deletes, scheduled jobs ran daily to identify the data to be deleted in batch mode.

We then spun up a transient EMR cluster to run the GDPR upsert job to delete the records. The data was stored in Amazon S3 in Hudi format, and Hudi’s built-in index allowed us to efficiently delete records using bloom filters and file ranges. Because only those files that contained the record keys needed to be read and rewritten, it only took about 1–2 minutes to delete 1,000 records out of the 1 billion records, which had previously taken hours to complete as entire partitions were read.

Overall, our solution enabled efficient deletion of data, which provided an additional layer of data security that was critical for Zoom, in light of its GDPR requirements.

Architecting to optimize scale, performance, and cost

In this section, we share the following strategies Zoom took to optimize scale, performance, and cost:

Optimizing ingestion
Optimizing throughput and Amazon EMR utilization
Decoupling ingestion and GDPR deletion using EMRFS
Efficient deletes with Apache Hudi
Optimizing for low-latency reads with Apache Hudi
Monitoring

Optimizing ingestion

To keep the storage in Kafka lean and optimal, as well as to get a real-time view of data, we created a Spark job to read incoming Kafka messages in batches of 150 million messages and wrote to Amazon S3 in Hudi-compatible format every 5 minutes. Even during the initial stages of the iteration, when we hadn’t started scaling and tuning yet, we were able to successfully load all Kafka messages consistently under 2.5 minutes using the Amazon EMR runtime for Apache Spark.

Optimizing throughput and Amazon EMR utilization

We launched a cost-optimized EMR cluster and switched from uniform instance groups to using EMR instance fleets. We chose instance fleets because we needed the flexibility to use Spot Instances for task nodes and wanted to diversify the risk of running out of capacity for a specific instance type in our Availability Zone.

We started experimenting with test runs by first changing the number of Kafka partitions from 400 to 1,000, and then changing the number of task nodes and instance types. Based on the results of the run, the AWS team came up with the recommendation to use Amazon EMR with three core nodes (r5.16xlarge (64 vCPUs each)) and 18 task nodes using Spot fleet instances (a combination of r5.16xlarge (64 vCPUs), r5.12xlarge (48 vCPUs), r5.8xlarge (32 vCPUs)). These recommendations helped Zoom to reduce their Amazon EMR costs by more than 80% while meeting their desired performance goals of ingesting 150 million Kafka messages under 5 minutes.

Decoupling ingestion and GDPR deletion using EMRFS

A well-known benefit of separation of storage and compute is that you can scale the two independently. But a not-so-obvious advantage is that you can decouple continuous workloads from sporadic workloads. Previously data was stored in HDFS. Resource-intensive GDPR delete jobs and data movement jobs would compete for resources with the stream ingestion, causing a backlog of more than 5 hours in upstream Kafka clusters, which was close to filling up the Kafka storage (which only had 6 hours of data retention) and potentially causing data loss. Offloading data from HDFS to Amazon S3 allowed us the freedom to launch independent transient EMR clusters on demand to perform data deletion, helping to ensure that the ongoing data ingestion from Kafka into Amazon EMR is not starved for resources. This enabled the system to ingest data every 5 minutes and complete each Spark Streaming read in 2–3 minutes. Another side effect of using EMRFS is a cost-optimized cluster, because we removed reliance on Amazon Elastic Block Store (Amazon EBS) volumes for over 300 TB storage that was used for three copies (including two replicas) of HDFS data. We now pay for only one copy of the data in Amazon S3, which provides 11 9s of durability and is relatively inexpensive storage.

Efficient deletes with Apache Hudi

What about the conflict between ingest writes and GDPR deletes when running concurrently? This is where the power of Apache Hudi stands out.

Apache Hudi provides a table format for data lakes with transactional semantics that enables the separation of ingestion workloads and updates when run concurrently. The system was able to consistently delete 1,000 records in less than a minute. There were some limitations in concurrent writes in Apache Hudi 0.7.0, but the Amazon EMR team quickly addressed this by back-porting Apache Hudi 0.8.0, which supports optimistic concurrency control, to the current (at the time of the AWS Data Lab collaboration) Amazon EMR 6.4 release. This saved time in testing and allowed for a quick transition to the new version with minimal testing. This enabled us to query the data directly using Athena quickly without having to spin up a cluster to run ad hoc queries, as well as to query the data using Presto, Trino, and Hive. The decoupling of the storage and compute layers provided the flexibility to not only query data across different EMR clusters, but also delete data using a completely independent transient cluster.

Optimizing for low-latency reads with Apache Hudi

To optimize for low-latency reads with Apache Hudi, we needed to address the issue of too many small files being created within Amazon S3 due to the continuous streaming of data into the data lake.

We utilized Apache Hudi’s features to tune file sizes for optimal querying. Specifically, we reduced the degree of parallelism in Hudi from the default value of 1,500 to a lower number. Parallelism refers to the number of threads used to write data to Hudi; by reducing it, we were able to create larger files that were more optimal for querying.

Because we needed to optimize for high-volume streaming ingestion, we chose to implement the merge on read table type (instead of copy on write) for our workload. This table type allowed us to quickly ingest the incoming data into delta files in row format (Avro) and asynchronously compact the delta files into columnar Parquet files for fast reads. To do this, we ran the Hudi compaction job in the background. Compaction is the process of merging row-based delta files to produce new versions of columnar files. Because the compaction job would use additional compute resources, we adjusted the degree of parallelism for insertion to a lower value of 1,000 to account for the additional resource usage. This adjustment allowed us to create larger files without sacrificing performance throughput.

Overall, our approach to optimizing for low-latency reads with Apache Hudi allowed us to better manage file sizes and improve the overall performance of our data lake.

Monitoring

The team monitored MSK clusters with Prometheus (an open-source monitoring tool). Additionally, we showcased how to monitor Spark streaming jobs using Amazon CloudWatch metrics. For more information, refer to Monitor Spark streaming applications on Amazon EMR.

Outcomes

The collaboration between Zoom and the AWS Data Lab demonstrated significant improvements in data ingestion, processing, storage, and deletion using an architecture with Amazon EMR and Apache Hudi. One key benefit of the architecture was a reduction in infrastructure costs, which was achieved through the use of cloud-native technologies and the efficient management of data storage. Another benefit was an improvement in data management capabilities.

We showed that the costs of EMR clusters can be reduced by about 82% while bringing the storage costs down by about 90% compared to the prior HDFS-based architecture. All of this while making the data available in the data lake within 5 minutes of ingestion from the source. We also demonstrated that data deletions from a data lake containing multiple petabytes of data can be performed much more efficiently. With our optimized approach, we were able to delete approximately 1,000 records in just 1–2 minutes, as compared to the previously required 3 hours or more.

Conclusion

In conclusion, the log analytics process, which involves collecting, processing, storing, analyzing, and deleting log data from various sources such as servers, applications, and devices, is critical to aid organizations in working to meet their service resiliency, security, performance monitoring, troubleshooting, and compliance needs, such as GDPR.

This post shared what Zoom and the AWS Data Lab team have accomplished together to solve critical data pipeline challenges, and Zoom has extended the solution further to optimize extract, transform, and load (ETL) jobs and resource efficiency. However, you can also use the architecture patterns presented here to quickly build cost-effective and scalable solutions for other use cases. Please reach out to your AWS team for more information or contact Sales.

About the Authors

Sekar Srinivasan is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions modernizing their architecture and generating insights from their data. In his spare time he likes to work on non-profit projects focused on underprivileged Children’s education.

Chandra Dhandapani is a Senior Solutions Architect at AWS, where he specializes in creating solutions for customers in Analytics, AI/ML, and Databases. He has a lot of experience in building and scaling applications across different industries including Healthcare and Fintech. Outside of work, he is an avid traveler and enjoys sports, reading, and entertainment.

Amit Kumar Agrawal is a Senior Solutions Architect at AWS, based out of San Francisco Bay Area. He works with large strategic ISV customers to architect cloud solutions that address their business challenges. During his free time he enjoys exploring the outdoors with his family.

Viral Shah is a Analytics Sales Specialist working with AWS for 5 years helping customers to be successful in their data journey. He has over 20+ years of experience working with enterprise customers and startups, primarily in the data and database space. He loves to travel and spend quality time with his family.

Best practices to optimize your Amazon EC2 Spot Instances usage

2023-05-15 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/best-practices-to-optimize-your-amazon-ec2-spot-instances-usage/

This blog post is written by Pranaya Anshu, EC2 PMM, and Sid Ambatipudi, EC2 Compute GTM Specialist.

Amazon EC2 Spot Instances are a powerful tool that thousands of customers use to optimize their compute costs. The National Football League (NFL) is an example of customer using Spot Instances, leveraging 4000 EC2 Spot Instances across more than 20 instance types to build its season schedule. By using Spot Instances, it saves 2 million dollars every season! Virtually any organization – small or big – can benefit from using Spot Instances by following best practices.

Overview of Spot Instances

Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand prices. Through Spot Instances, you can take advantage of the massive operating scale of AWS and run hyperscale workloads at a significant cost saving. In exchange for these discounts, AWS has the option to reclaim Spot Instances when EC2 requires the capacity. AWS provides a two-minute notification before reclaiming Spot Instances, allowing workloads running on those instances to be gracefully shut down.

In this blog post, we explore four best practices that can help you optimize your Spot Instances usage and minimize the impact of Spot Instances interruptions: diversifying your instances, considering attribute-based instance type selection, leveraging Spot placement scores, and using the price-capacity-optimized allocation strategy. By applying these best practices, you’ll be able to leverage Spot Instances for appropriate workloads and ultimately reduce your compute costs. Note for the purposes of this blog, we will focus on the integration of Spot Instances with Amazon EC2 Auto Scaling groups.

Pre-requisites

Spot Instances can be used for various stateless, fault-tolerant, or flexible applications such as big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and AI/ML workloads. However, as previously mentioned, AWS can interrupt Spot Instances with a two-minute notification, so it is best not to use Spot Instances for workloads that cannot handle individual instance interruption — that is, workloads that are inflexible, stateful, fault-intolerant, or tightly coupled.

Best practices

Diversify your instances

The fundamental best practice when using Spot Instances is to be flexible. A Spot capacity pool is a set of unused EC2 instances of the same instance type (for example, m6i.large) within the same AWS Region and Availability Zone (for example, us-east-1a). When you request Spot Instances, you are requesting instances from a specific Spot capacity pool. Since Spot Instances are spare EC2 capacity, you want to base your selection (request) on as many spare pools of capacity as possible in order to increase your likelihood of getting Spot Instances. You should diversify across instance sizes, generations, instance types, and Availability Zones to maximize your savings with Spot Instances. For example, if you are currently using c5a.large in us-east-1a, consider including c6a instances (newer generation of instances), c5a.xl (larger size), or us-east-1b (different Availability Zone) to increase your overall flexibility. Instance diversification is beneficial not only for selecting Spot Instances, but also for scaling, resilience, and cost optimization.

To get hands-on experience with Spot Instances and to practice instance diversification, check out Amazon EC2 Spot Instances workshops. And once you’ve diversified your instances, you can leverage AWS Fault Injection Simulator (AWS FIS) to test your applications’ resilience to Spot Instance interruptions to ensure that they can maintain target capacity while still benefiting from the cost savings offered by Spot Instances. To learn more about stress testing your applications, check out the Back to Basics: Chaos Engineering with AWS Fault Injection Simulator video and AWS FIS documentation.

Consider attribute-based instance type selection

We have established that flexibility is key when it comes to getting the most out of Spot Instances. Similarly, we have said that in order to access your desired Spot Instances capacity, you should select multiple instance types. While building and maintaining instance type configurations in a flexible way may seem daunting or time-consuming, it doesn’t have to be if you use attribute-based instance type selection. With attribute-based instance type selection, you can specify instance attributes — for example, CPU, memory, and storage — and EC2 Auto Scaling will automatically identify and launch instances that meet your defined attributes. This removes the manual-lift of configuring and updating instance types. Moreover, this selection method enables you to automatically use newly released instance types as they become available so that you can continuously have access to an increasingly broad range of Spot Instance capacity. Attribute-based instance type selection is ideal for workloads and frameworks that are instance agnostic, such as HPC and big data workloads, and can help to reduce the work involved with selecting specific instance types to meet specific requirements.

For more information on how to configure attribute-based instance selection for your EC2 Auto Scaling group, refer to Create an Auto Scaling Group Using Attribute-Based Instance Type Selection documentation. To learn more about attribute-based instance type selection, read the Attribute-Based Instance Type Selection for EC2 Auto Scaling and EC2 Fleet news blog or check out the Using Attribute-Based Instance Type Selection and Mixed Instance Groups section of the Launching Spot Instances workshop.

Leverage Spot placement scores

Now that we’ve stressed the importance of flexibility when it comes to Spot Instances and covered the best way to select instances, let’s dive into how to find preferred times and locations to launch Spot Instances. Because Spot Instances are unused EC2 capacity, Spot Instances capacity fluctuates. Correspondingly, it is possible that you won’t always get the exact capacity at a specific time that you need through Spot Instances. Spot placement scores are a feature of Spot Instances that indicates how likely it is that you will be able to get the Spot capacity that you require in a specific Region or Availability Zone. Your Spot placement score can help you reduce Spot Instance interruptions, acquire greater capacity, and identify optimal configurations to run workloads on Spot Instances. However, it is important to note that Spot placement scores serve only as point-in-time recommendations (scores can vary depending on current capacity) and do not provide any guarantees in terms of available capacity or risk of interruption. To learn more about how Spot placement scores work and to get started with them, see the Identifying Optimal Locations for Flexible Workloads With Spot Placement Score blog and Spot placement scores documentation.

As a near real-time tool, Spot placement scores are often integrated into deployment automation. However, because of its logging and graphic capabilities, you may find it to be a valuable resource even before you launch a workload in the cloud. If you are looking to understand historical Spot placement scores for your workload, you should check out the Spot placement score tracker, a tool that automates the capture of Spot placement scores and stores Spot placement score metrics in Amazon CloudWatch. The tracker is available through AWS Labs, a GitHub repository hosting tools. Learn more about the tracker through the Optimizing Amazon EC2 Spot Instances with Spot Placement Scores blog.

When considering ideal times to launch Spot Instances and exploring different options via Spot placement scores, be sure to consider running Spot Instances at off-peak hours – or hours when there is less demand for EC2 Instances. As you may assume, there is less unused capacity – Spot Instances – available during typical business hours than after business hours. So, in order to leverage as much Spot capacity as you can, explore the possibility of running your workload at hours when there is reduced demand for EC2 instances and thus greater availability of Spot Instances. Similarly, consider running your Spot Instances in “off-peak Regions” – or Regions that are not experiencing business hours at that certain time.

On a related note, to maximize your usage of Spot Instances, you should consider using previous generation of instances if they meet your workload needs. This is because, as with off-peak vs peak hours, there is typically greater capacity available for previous generation instances than current generation instances, as most people tend to use current generation instances for their compute needs.

Use the price-capacity-optimized allocation strategy

Once you’ve selected a diversified and flexible set of instances, you should select your allocation strategy. When launching instances, your Auto Scaling group uses the allocation strategy that you specify to pick the specific Spot pools from all your possible pools. Spot offers four allocation strategies: price-capacity-optimized, capacity-optimized, capacity-optimized-prioritized, and lowest-price. Each of these allocation strategies select Spot Instances in pools based on price, capacity, a prioritized list of instances, or a combination of these factors.

The price-capacity-optimized strategy launched in November 2022. This strategy makes Spot Instance allocation decisions based on the most capacity at the lowest price. It essentially enables Auto Scaling groups to identify the Spot pools with the highest capacity availability for the number of instances that are launching. In other words, if you select this allocation strategy, we will find the Spot capacity pools that we believe have the lowest chance of interruption in the near term. Your Auto Scaling groups then request Spot Instances from the lowest priced of these pools.

We recommend you leverage the price-capacity-optimized allocation strategy for the majority of your workloads that run on Spot Instances. To see how the price-capacity-optimized allocation strategy selects Spot Instances in comparison with lowest-price and capacity-optimized allocation strategies, read the Introducing the Price-Capacity-Optimized Allocation Strategy for EC2 Spot Instances blog post.

Clean-up

If you’ve explored the different Spot Instances workshops we recommended throughout this blog post and spun up resources, please remember to delete resources that you are no longer using to avoid incurring future costs.

Conclusion

Spot Instances can be leveraged to reduce costs across a wide-variety of use cases, including containers, big data, machine learning, HPC, and CI/CD workloads. In this blog, we discussed four Spot Instances best practices that can help you optimize your Spot Instance usage to maximize savings: diversifying your instances, considering attribute-based instance type selection, leveraging Spot placement scores, and using the price-capacity-optimized allocation strategy.

To learn more about Spot Instances, check out Spot Instances getting started resources. Or to learn of other ways of reducing costs and improving performance, including leveraging other flexible purchase models such as AWS Savings Plans, read the Increase Your Application Performance at Lower Costs eBook or watch the Seven Steps to Lower Costs While Improving Application Performance webinar.

A walk through AWS Verified Access policies

2023-05-11 Riggs Goodman III

Post Syndicated from Riggs Goodman III original https://aws.amazon.com/blogs/security/a-walk-through-aws-verified-access-policies/

AWS Verified Access helps improve your organization’s security posture by using security trust providers to grant access to applications. This service grants access to applications only when the user’s identity and the user’s device meet configured security requirements. In this blog post, we will provide an overview of trust providers and policies, then walk through a Verified Access policy for securing your corporate applications.

Understanding trust data and policies

Verified Access policies enable you to use trust data from trust providers and help protect access to corporate applications that are hosted on Amazon Web Services (AWS). When you create a Verified Access group or a Verified Access endpoint, you create a Verified Access policy, which is applied to the group or both the group and endpoint. Policies are written in Cedar, an AWS policy language. With Verified Access, you can express policies that use the trust data from the trust providers that you configure, such as corporate identity providers and device security state providers.

Verified Access receives trust data or claims from different trust providers. Currently, Verified Access supports two types of trust providers. The first type is an identity trust provider. Identity trust providers manage the identities of digital users, including the user’s email address, groups, and profile information. The second type of trust provider is a device trust provider. Device trust providers manage the device posture for users, including the OS version of the device, risk scores, and other metrics that reflect device posture. When a user makes a request to Verified Access, the request includes claims from the configured trust providers. Verified Access customers permit or forbid access to applications by evaluating the claims in Cedar policies. We will walk through the types of claims that are included from trust providers and the options for custom trust data.

End-to-end Cedar policy use cases

Let’s look at how to use policies with your applications. In general, you use Verified Access to control access to an application for purposes of authentication and initial authorization. This means that you use Verified Access to authenticate the user when they log in and to confirm that the device posture of the end device meets minimum criteria. For authorization logic to control access to actions and resources inside the application, you pass the identity claims to the application. The application uses the information to authorize users within the application after authentication. In other words, not every identity claim needs to be passed or checked in Verified Access to allow traffic to pass to the application. You can and should put additional logic in place to make decisions for users when they gain access to the backend application after initial authentication and authorization by Verified Access. From an identity perspective, this additional criteria might be an email address, a group, and possibly some additional claims. From a device perspective, Verified Access does not at this time pass device trust data to the end application. This means that you should use Verified Access to perform checks involving device posture.

We will explore the evolution of a policy by walking you through four use cases for Cedar policy. You can test the claim data and policies in the Verified Access Cedar Playground. For more information about Verified Access, see Verified Access policies and types of trust providers.

Use case 1: Basic policy

For many applications, you only need a simple policy to provide access to your users. This can include the identity information only. For example, let’s say that you want to write a policy that uses the user’s email address and matches a certain group that the user is part of. Within the Verified Access trust provider configuration, you can include “openid email groups” as the scope, and your OpenID Connect (OIDC) provider will include each claim associated with the scopes that you have configured with the OIDC provider. When the user John in this example uses case logs in to the OIDC provider, he receives the following claims from the OIDC provider. For this provider, the Verified Access Trust Provider is configured for “identity” to be the policy reference name.

{
  "identity": {
    "email": "[email protected]",
    "groups": [
      "finance",
      "employees"
    ]
  }
}

With these claims, you can write a policy that matches the email domain and the group, to allow access to the application, as follows.

permit(principal, action, resource)
when {
    // Returns true if the email ends in "@example.com"
    context.identity.email like "*@example.com" &&
    // Returns true if the user is part of the "finance" group
    context.identity.groups.contains("finance")
};

Use case 2: Custom claims within a policy

Many times, you are also interested in company-specific or custom claims from the identity provider. The claims that exist with the user endpoint are dependent on how you configure the identity provider. For OIDC providers, this is determined by the scopes that you define when you set up the identity provider. Verified Access uses OIDC scopes to authorize access to details of the user. This includes attributes such as the name, email address, email verification, and custom attributes. Each scope that you configure for the identity provider returns a set of user attributes, which we call claims. Depending on which claims you want to match on in your policy, you configure the scopes and claims in the OIDC provider, which the OIDC provider adds to the user endpoint. For a list of standard claims, including profile, email, name, and others, see the Standard Claims OIDC specification.

In this example use case, as your policy evolves from the basic policy, you decide to add additional company-specific claims to Verified Access. This includes both the business unit and the level of each employee. Within the Verified Access trust provider configuration, you can include “openid email groups profile” as the scope, and your OIDC provider will include each claim associated with the scopes that you have configured with the OIDC provider. Now, when the user John logs in to the OIDC provider, he receives the following claims from the OIDC provider, with both the business unit and role as claims from the “profile” scope in OIDC.

{
  "identity": {
    "email": "[email protected]",
    "groups": [
      "finance",
      "employees"
    ],
    "business_unit": "corp",
    "level": 8
  }
}

With these claims, the company can write a policy that matches the claims to allow access to the application, as follows.

permit(principal, action, resource)
when {
    // Returns true if the email ends in "@example.com"
    context.identity.email like "*@example.com" &&
    // Returns true if the user is part of the "finance" group
    context.identity.groups.contains("finance") &&
    // Returns true if the business unit is "corp"
    context.identity.business_unit == "corp" &&
    // Returns true if the level is greater than 6
    context.identity.level >= 6
};

Use case 3: Add a device trust provider to a policy

The other type of trust provider is a device trust provider. Verified Access supports two device trust providers today: CrowdStrike and Jamf. As detailed in the AWS Verified Access Request Verification Flow, for HTTP/HTTPS traffic, the extension in the web browser receives device posture information from the device agent on the user’s device. Each device trust provider determines what risk information and device information to include in the claims and how that information is formatted. Depending on the device trust provider, the claims are static or configurable.

In our example use case, with the evolution of the policy, you now add device trust provider checks to the policy. After you install the Verified Access browser extension on John’s computer, Verified Access receives the following claims from both the identity trust provider and the device trust provider, which uses the policy reference name “crwd”.

{
  "identity": {
    "email": "[email protected]",
    "groups": [
      "finance",
      "employees"
    ],
    "business_unit": "corp",
    "level": 8
  },
  "crwd": {
    "assessment": {
      "overall": 90,
      "os": 100,
      "sensor_config": 80,
      "version": "3.4.0"
    }
  }
}

With these claims, you can write a policy that matches the claims to allow access to the application, as follows.

permit(principal, action, resource)
when {
    // Returns true if the email ends in "@example.com"
    context.identity.email like "*@example.com" &&
    // Returns true if the user is part of the "finance" group
    context.identity.groups.contains("finance") &&
    // Returns true if the business unit is "corp"
    context.identity.business_unit == "corp" &&
    // Returns true if the level is greater than 6
    context.identity.level >= 6 &&
    // If the CrowdStrike agent is present
    ( context has "crwd" &&
      // The overall device score is greater or equal to 80 
      context.crwd.assessment.overall >= 80 )
};

For more information about these scores, see Third-party trust providers.

Use case 4: Multiple device trust providers

The final update to your policy comes in the form of multiple device trust providers. Verified Access provides the ability to match on multiple device trust providers in the same policy. This provides flexibility for your company, which in this example use case has different device trust providers installed on different types of users’ devices. For information about many of the claims that each device trust provider provides to AWS, see Third-party trust providers. However, for this updated policy, John’s claims do not change, but the new policy can match on either CrowdStrike’s or Jamf’s trust data. For Jamf, the policy reference name is “jamf”.

permit(principal, action, resource)
when {
    // Returns true if the email ends in "@example.com"
    context.identity.email like "*@example.com" &&
    // Returns true if the user is part of the "finance" group
    context.identity.groups.contains("finance") &&
    // Returns true if the business unit is "corp"
    context.identity.business_unit == "corp" &&
    // Returns true if the level is greater than 6
    context.identity.level >= 6 &&
    // If the CrowdStrike agent is present
    (( context has "crwd" &&
      // The overall device score is greater or equal to 80 
      context.crwd.assessment.overall >= 80 ) ||
    // If the Jamf agent is present
    ( context has "jamf" &&
      // The risk level is either LOW or SECURE
      ["LOW","SECURE"].contains(context.jamf.risk) ))
};

For more information about using Jamf with Verified Access, see Integrating AWS Verified Access with Jamf Device Identity.

Conclusion

In this blog post, we covered an overview of Cedar policy for AWS Verified Access, discussed the types of trust providers available for Verified Access, and walked through different use cases as you evolve your Cedar policy in Verified Access.

If you want to test your own policies and claims, see the Cedar Playground. If you want more information about Verified Access, see the AWS Verified Access documentation.

Want more AWS Security news? Follow us on Twitter.

Detect threats to your data stored in RDS databases by using GuardDuty

2023-05-10 Marshall Jones

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/detect-threats-to-your-data-stored-in-rds-databases-by-using-guardduty/

With Amazon Relational Database Service (Amazon RDS), you can set up, operate, and scale a relational database in the AWS Cloud. Amazon RDS provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks.

If you use Amazon RDS for your workloads, you can now use Amazon GuardDuty RDS Protection to help detect threats to your data stored in Amazon Aurora databases. GuardDuty is a continuous security monitoring service that can help you identify and prioritize potential threats in your AWS environment. By analyzing and profiling RDS login activity to your Aurora databases, GuardDuty can detect threats, such as high severity brute force events, suspicious logins, access from Tor, and access by known threat actors.

In this post, we will provide an overview of how to get started with RDS Protection, dive into its finding types, and walk you through examples of how to investigate and remediate findings.

Overview of RDS Protection

RDS Protection in GuardDuty analyzes and profiles Amazon RDS login activity to identify potential threats to your data stored in Aurora databases by using a combination of threat intelligence and machine learning. At launch, RDS Protection supports Aurora MySQL versions 2.10.2 and 3.2.1 or later and Aurora PostgreSQL versions 10.17, 11.12, 12.7, 13.3, and 14.3 or later. An updated list of the supported engines and versions is available in the GuardDuty documentation. RDS Protection doesn’t require additional infrastructure, and you don’t need to configure, collect, or store RDS logs in your own account. RDS Protection is also designed to have no impact on the performance of your database instances so that you don’t have to worry about compromising performance to better secure your data stored in Amazon RDS.

When RDS Protection detects a suspicious or anomalous login attempt that indicates a potential threat to your database instance, GuardDuty generates a finding with details to help you quickly identify relevant information to assist in remediation. RDS Protection findings include details on both anomalous and normal login activity in addition to information such as database instance details, database user details, action information, and actor information. These findings are available to you in the GuardDuty console, AWS Command Line Interface (AWS CLI), and API, and all GuardDuty findings are sent to Amazon EventBridge and AWS Security Hub, giving you options to respond by sending alerts to chat or ticketing systems, or by using AWS Lambda and AWS Systems Manager for automatic remediation.

Enable RDS Protection

Getting started with RDS Protection is simple, and you can do it with just a few steps in the console. Both new and existing GuardDuty customers can take advantage of the GuardDuty RDS Protection 30-day free trial. You can turn RDS Protection on or off for each of your accounts in supported AWS Regions. If you already use GuardDuty, you will need to enable RDS Protection either in the console or CLI, or through the API. You will have the option to enable it in the account that you are currently in, or if you are using a GuardDuty delegated administrator account (as shown in Figure 1), you can enable it for all accounts in your AWS Organizations organization. You’ll also have the ability to auto-enable. The auto-enable feature helps ensure that RDS Protection is enabled for each new account added to your organization, without the need for you to configure anything in each member account. If you are turning on GuardDuty for the first time, RDS Protection is enabled by default.

Figure 1: GuardDuty RDS Protection enablement page

Investigate RDS Protection findings

After GuardDuty generates a finding, you will need to analyze the finding so that you understand the potential impact to your environment. We recommend that you familiarize yourself with the GuardDuty finding types. Understanding GuardDuty finding types can help you understand the types of activity that GuardDuty is looking for and help you prepare for how to respond if they occur in your environment.

As adversaries become more sophisticated, it becomes even more important for you to align to a common framework to understand the tactics, techniques, and procedures (TTPs) behind an individual event. GuardDuty aligns findings using the MITRE ATT&CK framework, which is a globally-accessible knowledge base of adversary tactics and techniques based on real-world observations. GuardDuty findings have a specific finding format that helps you understand the details of each finding. You can examine the Threat Purpose section of the GuardDuty finding types to see finding types associated with various MITRE ATT&CK tactics, including CredentialAccess and Discovery. This can help you identify and understand the type of activity associated with a finding.

For example, consider two finding types that seem similar: CredentialAccess:RDS/MaliciousIPCaller.SuccessfulLogin and Discovery:RDS/MaliciousIPCaller. The difference between them is the ThreatPurpose aspect, located at the beginning of the finding type. GuardDuty has determined that both are involved with MaliciousIPCaller, and the difference is the intent of the activity associated with each finding. CredentialAccess SuccessfulLogin indicates that there was a successful login to your RDS database from a known malicious IP address. Discovery indicates that a threat actor opened a connection to the database, but didn’t attempt to authenticate. This indicates scanning behavior, but it might not be targeted at RDS instances. For more information, see GuardDuty RDS Protection finding types.

GuardDuty uses threat intelligence and machine learning to continually monitor and identify potential threats in your environment. To understand how to investigate RDS Protection finding types, you need to understand the details of a finding type that are derived from machine learning. As shown in Figure 2, RDS Protection finding types have two sections: one that shows the unusual behavior and one that shows the normal, historical behavior. To determine this, GuardDuty uses machine learning models to evaluate API requests to your account and identify anomalous events that are associated with tactics used by adversaries. The machine learning model tracks various factors of the API request, such as the user that made the request, the location the request was made from, and the specific API that was requested. It also looks at information such as successfulLoginCount, failedLoginCount, and incompleteConnectionCount for anomalies based on login activity. For more information about anomalous activity in GuardDuty findings, see Anomalous behavior.

Figure 2: GuardDuty finding details showing unusual and historical behavior sections

With RDS Protection, you now have an additional mechanism to gain insight into your Amazon RDS databases across your accounts to continuously monitor for suspicious activity. RDS Protection can alert you to suspicious activity in Amazon RDS, such as a potentially suspicious or anomalous login attempt, unusual pattern in a series of successful, failed, or incomplete login attempts, and unauthorized access to your database instance from a previously unseen internal or external actor. With this new feature, GuardDuty also extends support for finding types that you might already be familiar with that also apply to RDS databases. These finding types include calls to an RDS database API from a Tor node, or calls to an RDS database from a known malicious IP address, which can indicate that there are interactions with your RDS database from sources that are associated with known malicious activity.

Remediate RDS Protection findings

In this section, we describe two RDS Protection findings and how you can investigate and remediate them. Understanding how to remediate these findings can help you maintain the integrity of your database. We share recommendations that focus specifically on security groups, network access control lists (network ACLs), and firewall rules.

CredentialAccess:RDS/AnomalousBehavior.SuccessfulLogin

The CredentialAccess:RDS/AnomalousBehavior.SuccessfulLogin finding informs you that an anomalous successful login was observed on an RDS database in your AWS environment. It might indicate that a previous unseen user logged in to an RDS database for the first time. A common scenario involves an internal user logging in to a database that is accessed programmatically by applications and not by individual users. A potential malicious actor might have compromised and accessed the role on your RDS database. The default Severity for this finding varies, depending on the anomalous behavior associated with the finding.

Figure 3 shows an example of this finding.

Figure 3: Finding of an anomalous behavior successful login

How to remediate

If the activity is unexpected for the associated database, AWS recommends that you change the password of the associated database user, and review available audit logs for activity that the user performed. Medium and high severity findings might indicate an overly permissive access policy to the database, and user credentials might have been exposed or compromised. We recommend that you place the database in a private virtual private cloud (VPC), and limit the security group rules to allow traffic only from necessary sources. For more information, see Remediating potentially compromised database with successful login events.

We recommend that you take the following steps to remediate this finding:

Remediation step 1: Identify the affected database and user

Identify the affected database and user and confirm whether the behavior is expected or unexpected by looking through the GuardDuty finding details, which provide the name of the affected database instance and the corresponding user details. Use the findings to confirm if the behavior is expected or not—for example, the findings might help you identify a user who logs in to their database instance after a long time has passed; a user who logs in to their database instance only occasionally, such as a financial analyst who logs in each quarter; or a suspicious actor who is involved in a successful login attempt that isn’t authorized and potentially compromises the database instance. Review the IP address of the finding. Public IP addresses might signify overly permissive access if it’s not a known network associated with your account.

Figure 4: Finding with details showing Amazon RDS database instance and user details

If the behavior is unexpected, complete the following steps:

Remediation step 2: Restrict database instance credential access

Restrict database instance access for the suspected accounts and the source of the login activity. For more information, see Remediating potentially compromised credentials and Restrict network access. You can identify the user in the RDS DB user details section within the finding panel in the console, or within the resource.rdsDbUserDetails of the findings JSON. These user details include user name, application used, database accessed, SSL version, and authentication method.

To revoke access or rotate passwords for specific users that are involved in the finding, see Security with Amazon Aurora MySQL or Security with Amazon Aurora PostgreSQL. To securely store and automatically rotate the secrets for RDS databases, use AWS Secrets Manager. For more information, see AWS Secrets Manager tutorials. To manage database users’ access without the need for passwords, use IAM database authentication. For more information, see Security best practices for Amazon RDS.

The following CLI command is an example of how to revoke access to a user in a MySQL database. If the behavior is unexpected, you can revoke the privileges while you assess if the user is malicious.

REVOKE CONNECTION_ADMIN ON *.* FROM 'fakeadmin'@'%';

You can revoke privileges from the user, but when taking this action, you should make sure that the user isn’t vital to your system and that revoking permissions won’t break your production or development application. The following CLI command is an example of how to revoke privileges from a user:

REVOKE ALL PRIVILEGES ON *.* FROM 'fakeadmin'@'%';

If you know that the user isn’t necessary for your database or application to function, then you can remove the user from the system. To make sure that your security team can run forensics, check your company’s incident response policy. If you need help getting started with incident response, see AWS sample incident response playbooks. The following CLI command is an example of how to remove a user:

DROP USER 'fakeadmin'@'%';

Let’s say that you find the behavior unexpected, but the user turns out to be the application user, and making a change to the database credential will break your application. You can use AWS Systems Manager to help in this scenario, in which the affected RDS user is the account that is tied to your application. In many cases, a password rotation can break your application, depending on how you connect. If you rotate the password without notifying your application, the application might require additional cascading changes. You could lose connectivity to your application because the credentials that your application is using to connect to your database didn’t change, and now you are experiencing an outage that will remain until you update the credentials. Systems Manager can tie into your application code to automatically update the rotated database credentials in your application. For more information, see Rotate Amazon RDS database credentials automatically with AWS Secrets Manager.

The following figure shows a CLI command to get a secret from Secrets Manager — for this example, we assume the secret is compromised.

Figure 5: Example compromised credentials

The following figures shows that we have a new set of credentials that replace our old credentials, as indicated by “CreatedDate”.

Figure 6: Example remediated credentials

Remediation step 3: Assess the impact and determine what information was accessed

If available, review the audit logs to identify which information might have been accessed. For more information, see Monitoring events, logs, and streams in an Amazon Aurora DB cluster. Determine if sensitive or protected information was accessed or modified.

Remediation step 4: Restrict database instance network access

To learn how to restrict IP access on a security group, see Control traffic to resources using security groups. You can identify the user in the RDS DB user details section within the finding panel in the console, or within the resource.rdsDbUserDetails of the findings JSON. These user details include user name, application used, database accessed, SSL version, and authentication method.

Remediation step 5: Perform root-cause analysis and determine the steps that potentially led to this activity

Implementing a lessons-learned framework and methodology can help improve your incident response capabilities and also help prevent the incident from recurring. By learning from each incident, you can help avoid repeating the same mistakes, exposures, or misconfigurations, which can both improve your security posture and reduce the time lost to preventable situations. To learn more about post-incident activity, see AWS Security Incident Response Guide.

You can set up an alert to be notified when an activity modifies a networking policy and creates an insecure state by using AWS Config and Amazon Simple Notification Service (Amazon SNS). You can use an EventBridge rule with a custom event pattern and an input transformer to match an AWS Config evaluation rule output as NON_COMPLIANT. Then, you can route the response to an Amazon SNS topic. For more information, see How can I be notified when an AWS resource is non-compliant using AWS Config? or Firewall policies in AWS Network Firewall.

CredentialAccess:RDS/AnomalousBehavior.SuccessfulBruteForce

The CredentialAccess:RDS/AnomalousBehavior.successfulBruteForce finding informs you that an anomalous login occurred that is indicative of a successful brute force event, as observed on an RDS database in your AWS environment. Before the anomalous successful login, a consistent pattern of unusual failed login attempts was observed. This indicates that the user and password associated with the RDS database in your account might have been compromised, and a potentially malicious actor might have accessed the RDS database. The Severity of this finding is high. Figure 7 shows an example of this finding.

Figure 7: Example of an anomalous successful brute force finding

How to remediate

This activity indicates that database credentials might have been exposed or compromised. We recommend that you change the password of the associated database user, and review available audit logs for activity performed by the potentially compromised user. A consistent pattern of unusual failed login attempts indicates an overly permissive access policy to the database, or that the database might also have been publicly exposed. AWS recommends that you place the database in a private VPC, and limit the security group rules to allow traffic only from necessary sources. For more information, see Remediating potentially compromised database with successful login events.

We recommend that you take the following steps to remediate this finding

Remediation step 1: Identify the affected database and user

The generated GuardDuty finding provides the name of the affected database instance and the corresponding user details. For more information, see Finding details.

Figure 8: Finding details showing Amazon RDS database instance and user details

Remediation step 2: Identify the source of the failed login attempts

In the generated GuardDuty finding, you can find the IP address, and if it was a public connection, the ASN organization in the Actor section of the finding panel. An autonomous system is a group of one or more IP prefixes (lists of IP addresses accessible on a network) run by one or more network operators that maintain a single, clearly-defined routing policy. Network operators need autonomous system numbers (ASNs) to control routing within their networks and to exchange routing information with other internet service providers.

Figure 9: Action and actor details related to GuardDuty brute force finding

Remediation step 3: Confirm that the behavior is unexpected

Examine if this activity represents an attempt to gain additional unauthorized access to the database instance as follows:

If the source is internal to your network, examine if an application is misconfigured and attempting a connection repeatedly.
If this is an external actor, examine whether the corresponding database instance is public facing or is misconfigured and thus allowing potential malicious actors to attempt to log in with common user names.

If the behavior is unexpected, complete the following steps:

Remediation step 4: Restrict database instance access

As discussed previously for the CredentialAccess:RDS/AnomalousBehavior.SuccessfulLogin finding, you can restrict access to the database through credentials or network access:

For information about how to restrict a specific user’s database access, see Restrict database instance credential access.
For information about how to restrict network access to the database, see Restrict database instance network access.

Remediation step 5: Perform root-cause analysis and determine the steps that potentially led to this activity

By learning from each incident, you can help avoid repeating the same mistakes, exposures, or misconfigurations, which can both improve your security posture and reduce time lost to preventable situations.

Conclusion

In this post, you learned about the new GuardDuty RDS Protection feature and how to understand, operationalize, and respond to the new findings. You can enable this feature through the GuardDuty console, CLI, or APIs to start monitoring your Amazon RDS workloads today.

If you’ve created EventBridge rules to send findings from GuardDuty to a target, make sure that you’ve configured your rules to deliver the newly added findings. After you enable GuardDuty findings, consider creating IR playbooks, doing tabletops and AWS gamedays, and mapping out what you want to automate. For more information, see the AWS Security Incident Response Guide and AWS Incident Response Playbook resources. To gain hands-on experience with different AWS Security services, see AWS Activation Days. The Activation Days workshops begin with hands-on work with different services in sandbox accounts, and then take you through the steps to deploy them across your organization.

To make it more efficient for you to operate securely on AWS, we are committed to continually improving GuardDuty, and we value your feedback. If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Use SAML Identities for programmatic access to Amazon OpenSearch Service

2023-05-09 Muthu Pitchaimani

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/use-saml-identities-for-programmatic-access-to-amazon-opensearch-service/

Customers of Amazon OpenSearch Service can already use Security Assertion Markup Language (SAML) to access OpenSearch Dashboards.

This post outlines two methods by which programmatic users can now access OpenSearch using SAML identities. This applies to all identity providers (IdPs) that support SAML 2.0, including prevalent ones like Active Directory Federation Service (ADFS), Okta, AWS IAM Identity Center (Successor to AWS Single Sign-On), KeyCloak, and others. Although we outline the methods as they pertain to OpenSearch Service and AWS Identity and Access Management (IAM), programmatic access to each of these individual providers is outside the scope of this post. Most of these providers do provide such a facility.

Single sign-on methods

When you use single sign-on (SSO), there are two different authentication methods:

Identity provider initiated – This is when a user or a user-agent first authenticates with an IdP and gets a SAML assertion that establishes the identity of the user. This assertion is then passed to a service provider (SP) that provides access to a protected resource.
Service provider initiated – Although the IdP-initiated exchange is straightforward, a more typical sign-on experience is when the protected resource is accessed directly. The SP then redirects the user to the IdP for authentication along with a SAML authentication request. The IdP responds with an authentication assertion inside a SAML response. After that, the SSO experience is the same as that of an IdP-initiated flow.

For programmatic access to OpenSearch Service, an external IdP is the IdP, and OpenSearch Service and IAM both serve as SPs. To configure your IdP of choice as the SAML IdP for IAM, refer to Creating IAM SAML identity providers. To configure OpenSearch Service, refer to SAML authentication for OpenSearch Dashboards.

In the following sections, we outline two methods to access OpenSearch Service API:

Using AWS Security Token Service (AWS STS)
Using the OpenSearch Dashboards’ console proxy

Method 1: Use AWS STS

The following figure shows the sequence of calls to access OpenSearch Service API using AWS STS.

Let’s explore each step in more detail.

Steps 1 and 2

Steps 1 and 2 vary depending upon your chosen IdP. In general, they typically provide an authentication API or session API or another similar API to authenticate and retrieve the SAML authentication assertion response. We use this SAML assertion in the next step.

Steps 3 and 4

Call the AssumeRoleWithSAML AWS STS API to exchange the SAML assertion for temporary credentials associated with your SAML identity. See the following code:

curl --location 'https://sts.amazonaws.com?
Version=2011-06-15&
Action=AssumeRoleWithSAML&
RoleArn=<ARN of the role being assumed>&
PrincipalArn=<ARN of the IdP integrated with IAM>&
SAMLAssertion=<Base-64 encoded SAML assertion>'

The response contains the temporary AWS STS credentials with AccessKeyId, SecretAccessKey, and a SessionToken.

Step 5

Use the temporary credentials from the last step to sign all API requests to OpenSearch Service. Also ensure the role that you assumed with the AssumeRoleWithSAML call has sufficient permission to access the requisite data in OpenSearch Service. Refer to Mapping roles to users for more information about mapping this role as a backend role. As an additional step to ensure consistency, this AWS STS role and any SAML group the user is part of can be mapped to the same role in OpenSearch Service. The following code shows a model to make this call:

curl --location ‘<OpenSearch Service domain URL>/_search' \
--header 'X-Amz-Security-Token: Fwo...==(truncated)' \
--header 'X-Amz-Date: 20230327T134710Z' \
--header 'Authorization: AWS4-HMAC-SHA256 Credential=ASI..(truncated)/20230327/us-east-1/es/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token, Signature=95eb…(truncated)'

Method 2: Use OpenSearch Dashboards’ console proxy

OpenSearch Dashboards has a component called a console proxy that can proxy requests to OpenSearch. This allows OpenSearch clients to make the same API calls in Domain Specific Language (DSL) to this console proxy instead of directly calling OpenSearch. The console proxy forwards these calls to OpenSearch and responds back to the clients in the same format as OpenSearch.

The following figure shows the sequence of calls you can make to the console proxy to gain programmatic access to OpenSearch Service.

Steps 1 and 2

The first two steps are similar to method 1, and they will vary depending on what IdP is chosen. Essentially, you need to obtain a SAML authentication assertion response from the IdP.

Steps 3 and 4

Use the SAML assertion from the previous steps and POST it to the Assertion Consumer Service (ACS) URL, _opendistro/_security/saml/acs/idpinitiated, to exchange the assertion for the security_authentication token. The following code shows the command line for these steps:

curl --location ‘<dashboards URL>/_opendistro/_security/saml/acs/idpinitiated' \
--header 'content-type: application/x-www-form-urlencoded' \
--data-urlencode ‘SAMLResponse=Base-64 encoded SAML assertion' \
--data-urlencode 'RelayState=’

If you’re using the OpenSearch engine, the dashboard URL is <domain URL>/_dashboards. If you’re using the Elasticsearch engine, the dashboard URL is <domain URL>/_plugin/kibana. OpenSearch Dashboards processes this and responds with a redirect response with code 302 and an empty body. The response headers now also contain a cookie named security_authentication, which is the token you must use in all subsequent calls.

Steps 5–8

Use the security_authentication cookie in the API calls to the console proxy to perform programmatic API calls. The following code shows a command line for these steps:

curl --location ‘<dashboardsURL>/api/console/proxy?path=_search&method=GET' \
--header 'content-type: application/json' \
--header 'cookie: security_authentication=Fe26.2**1...(truncated)' \
--header 'osd-xsrf: true' \
--data '{
  "query": {
    "match_all": {}
  }
}’

Make sure to include a header called osd-xsrf : true for programmatic access to dashboards. The console proxy path is /api/console/proxy for Elasticsearch engines version 6.x and 7.x and OpenSearch engine version 1.x and 2.x.

Similar to method 1, make sure to map roles and groups associated with a particular SAML identity as the correct backend role with requisite permissions.

Comparing these methods

You can use method 1 in any domain regardless of the engine as long as fine-grained access control is enabled. Method 2 only works for domains with Elasticsearch engine versions greater than 6.7 and all OpenSearch engine versions.

The OpenSearch Dashboards process is generally meant for human interactions, which has a lower API call rate and volume than those of programmatic calls. OpenSearch can handle considerably higher API call rates and volume, so take care not to send high-volume API calls using method 2. As a best practice for programmatic access with SAML identities, we recommend method 1 wherever possible to avoid performance bottlenecks.

Conclusion

Both of the methods outlined in this post provide a similar flow to access OpenSearch Service programmatically using SAML identities (exchanging a SAML assertion for an authentication token). AssumeRoleWithSAML is a key and fairly straightforward-to-use API that enables this access and is our recommended method. Try one of OpenSearch Service labs and launch an OpenSearch Service domain to experiment with these methods. Good luck!

About the author

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

AWS Nitro System gets independent affirmation of its confidential compute capabilities

2023-05-09 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/aws-nitro-system-gets-independent-affirmation-of-its-confidential-compute-capabilities/

This blog post was written By Anthony Liguori, VP/Distinguished Engineer, EC2 AWS.

Customers around the world trust AWS to keep their data safe, and keeping their workloads secure and confidential is foundational to how we operate. Since the inception of AWS, we have relentlessly innovated on security, privacy tools, and practices to meet, and even exceed, our customers’ expectations.

The AWS Nitro System is the underlying platform for all modern AWS compute instances which has allowed us to deliver the data isolation, performance, cost, and pace of innovation that our customers require. It’s a pioneering design of specialized hardware and software that protects customer code and data from unauthorized access during processing.

When we launched the Nitro System in 2017, we delivered a unique architecture that restricts any operator access to customer data. This means no person or even service from AWS, can access data when it is being used in an Amazon EC2 instance. We knew that designing the system this way would present several architectural and operational challenges for us. However, we also knew that protecting customers’ data in this way was the best way to support our customer’s needs.

When AWS made its Digital Sovereignty Pledge last year, we committed to providing greater transparency and assurances to customers about how AWS services are designed and operated, especially when it comes to handling customer data. As part of that increased transparency, we engaged NCC Group, a leading cybersecurity consulting firm based in the United Kingdom, to conduct an independent architecture review of the Nitro System and the security assurances we make to our customers. NCC has now issued its rand affirmed our claims.

The report states, “As a matter of design, NCC Group found no gaps in the Nitro System that would compromise [AWS] security claims.” Specifically, the report validates the following statements about our Nitro System production hosts:

There is no mechanism for a cloud service provider employee to log in to the underlying host.
No administrative API can access customer content on the underlying host.
There is no mechanism for a cloud service provider employee to access customer content stored on instance storage and encrypted EBS volumes.
There is no mechanism for a cloud service provider employee to access encrypted data transmitted over the network.
Access to administrative APIs always requires authentication and authorization.
Access to administrative APIs is always logged.
Hosts can only run tested and signed software that is deployed by an authenticated and authorized deployment service. No cloud service provider employee can deploy code directly onto hosts.

The report details NCC’s analysis for each of these claims. You can also find additional details about the scope, methodology, and steps that NCC used to evaluate the claims.

How Nitro System protects customer data

At AWS, we know that our customers, especially those who have sensitive or confidential data, may have worries about putting that data in the cloud. That’s why we’ve architected the Nitro System to ensure that your confidential information is as secure as possible. We do this in several ways:

There is no mechanism for any system or person to log in to Amazon EC2 servers, read the memory of EC2 instances, or access any data on encrypted Amazon Elastic Block Store (EBS) volumes.

If any AWS operator, including those with the highest privileges, needs to perform maintenance work on the EC2 server, they can do so only by using a strictly limited set of authenticated, authorized, and audited administrative APIs. Critically, none of these APIs have the ability to access customer data on the EC2 server. These restrictions are built into the Nitro System itself, and no AWS operator can circumvent these controls and protections.

The Nitro System also protects customers from AWS system software through the innovative design of our lightweight Nitro Hypervisor, which manages memory and CPU allocation. Typical commercial hypervisors provide administrators with full access to the system, but with the Nitro System, the only interface operators can use is a restricted API. This means that customers and operators cannot interact with the system in unapproved ways and there is no equivalent of a “root” user. This approach enhances security and allows AWS to update systems in the background, fix system bugs, monitor performance, and even perform upgrades without impacting customer operations or customer data. Customers are unaffected during system upgrades, and their data remains protected.

Finally, the Nitro System can also provide customers an extra layer of data isolation from their own operators and software. AWS created , which allow for isolated compute environments, which is ideal for organizations that need to process personally identifiable information, as well as healthcare, financial, and intellectual property data within their compute instances. These enclaves do not share memory or CPU cores with the customer instance. Further, Nitro Enclaves have cryptographic attestation capabilities that let customers verify that all of the software deployed has been validated and not compromised.

All of these prongs of the Nitro System’s security and confidential compute capabilities required AWS to invest time and resources into building the system’s architecture. We did so because we wanted to ensure that our customers felt confident entrusting us with their most sensitive and confidential data, and we have worked to continue earning that trust. We are not done and this is just one step AWS is taking to increase the transparency about how our services are designed and operated. We will continue to innovate on and deliver unique features that further enhance our customers’ security without compromising on performance.

Learn more:

Watch Anthony speak about AWS Nitro System Security here.

Read the NCC report.
Read this whitepaper on the security design of the AWS Nitro System.
Please visit this webpage to learn more about our ongoing commitment to Digital Sovereignty Digital Sovereignty.
Read Matt Garman’s blog about the NCC Validation Report.

Get details on security finding changes with the new Finding History feature in Security Hub

2023-05-05 Nicholas Jaeger

Post Syndicated from Nicholas Jaeger original https://aws.amazon.com/blogs/security/get-details-on-security-finding-changes-with-the-new-finding-history-feature-in-security-hub/

In today’s evolving security threat landscape, security teams increasingly require tools to detect and track security findings to protect their organizations’ assets. One objective of cloud security posture management is to identify and address security findings in a timely and effective manner. AWS Security Hub aggregates, organizes, and prioritizes security alerts and findings from various AWS services and supported security solutions from the AWS Partner Network.

As the volume of findings increases, tracking the changes and actions that have been taken on each finding becomes more difficult, as well as more important to perform timely and effective investigations. In this post, we will show you how to use the new Finding History feature in Security Hub to track and understand the history of a security finding.

Updates to findings occur when finding providers update certain fields, such as resource details, by using the BatchImportFindings API. You, as a user, can update certain fields, such as workflow status, in the AWS Management Console or through the BatchUpdateFindings API. Ticketing, incident management, security information and event management (SIEM), and automatic remediation solutions can also use the BatchUpdateFindings API to update findings. This new capability highlights these various changes and when they occurred so that you don’t need to investigate this yourself.

Finding History

The new Finding History feature in Security Hub helps you understand the state of a finding by providing an immutable history of changes within the finding details. By using this feature, you can track the history of each finding, including the before and after values of the fields that were changed, who or what made the changes, and when the changes were made. This simplifies how you operate on a finding by giving you visibility into the changes made to a finding over time, alongside the rest of the finding details, which removes the need for separate tooling or additional processes. This feature is available at no additional cost in AWS Regions where Security Hub is available, and appears by default for new or updated findings. Finding History is also available through the Security Hub APIs.

To try out this new feature, open the Security Hub console, select a finding, and choose the History tab. There you will see a chronological list of changes that have been made to the finding. The transparency of the finding history helps you quickly assess the status of the finding, understand actions already taken, and take the necessary actions to mitigate risk. For example, upon resolving a finding, you can add a note to the finding to indicate why you resolved it. Both the resolved status and note will appear in the history.

In the following example, the finding was updated and then resolved with an explanatory note left by the person that reviewed the finding. With Finding History, you can see the previous updates and events in the finding’s History tab.

Figure 1: Finding History shows recent updates to the finding

In addition, you can still view the current state of the finding in its Details tab.

Figure 2: Finding Details shows the record of a security check or security-related detection

Conclusion

With the new Finding History feature in Security Hub, you have greater visibility into the activity and updates on each finding, allowing for more efficient investigation and response to potential security risks. Next time that you start work to investigate and respond to a security finding in Security Hub, begin by checking the finding history.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

The history and future roadmap of the AWS CloudFormation Registry

2023-05-04 Eric Z. Beard

Post Syndicated from Eric Z. Beard original https://aws.amazon.com/blogs/devops/cloudformation-coverage/

AWS CloudFormation is an Infrastructure as Code (IaC) service that allows you to model your cloud resources in template files that can be authored or generated in a variety of languages. You can manage stacks that deploy those resources via the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the API. CloudFormation helps customers to quickly and consistently deploy and manage cloud resources, but like all IaC tools, it faced challenges keeping up with the rapid pace of innovation of AWS services. In this post, we will review the history of the CloudFormation registry, which is the result of a strategy we developed to address scaling and standardization, as well as integration with other leading IaC tools and partner products. We will also give an update on the current state of CloudFormation resource coverage and review the future state, which has a goal of keeping CloudFormation and other IaC tools up to date with the latest AWS services and features.

History

The CloudFormation service was first announced in February of 2011, with sample templates that showed how to deploy common applications like blogs and wikis. At launch, CloudFormation supported 13 out of 15 available AWS services with 48 total resource types. At first, resource coverage was tightly coupled to the core CloudFormation engine, and all development on those resources was done by the CloudFormation team itself. Over the past decade, AWS has grown at a rapid pace, and there are currently 200+ services in total. A challenge over the years has been the coverage gap between what was possible for a customer to achieve using AWS services, and what was possible to define in a CloudFormation template.

It became obvious that we needed a change in strategy to scale resource development in a way that could keep up with the rapid pace of innovation set by hundreds of service teams delivering new features on a daily basis. Over the last decade, our pace of innovation has increased nearly 40-fold, with 80 significant new features launched in 2011 versus more than 3,000 in 2021. Since CloudFormation was a key adoption driver (or blocker) for new AWS services, those teams needed a way to create and manage their own resources. The goal was to enable day one support of new services at the time of launch with complete CloudFormation resource coverage.

In 2016, we launched an internal self-service platform that allowed service teams to control their own resources. This began to solve the scaling problems inherent in the prior model where the core CloudFormation team had to do all the work themselves. The benefits went beyond simply distributing developer effort, as the service teams have deep domain knowledge on their products, which allowed them to create more effective IaC components. However, as we developed resources on this model, we realized that additional design features were needed, such as standardization that could enable automatic support for features like drift detection and resource imports.

We embarked on a new project to address these concerns, with the goal of improving the internal developer experience as well as providing a public registry where customers could use the same programming model to define their own resource types. We realized that it wasn’t enough to simply make the new model available—we had to evangelize it with a training campaign, conduct engineering boot-camps, build better tooling like dashboards and deployment pipeline templates, and produce comprehensive on-boarding documentation. Most importantly, we made CloudFormation support a required item on the feature launch checklist for new services, a requirement that goes beyond documentation and is built into internal release tooling (exceptions to this requirement are rare as training and awareness around the registry have improved over time). This was a prime example of one of the maxims we repeat often at Amazon: good mechanisms are better than good intentions.

In 2019, we made this new functionality available to customers when we announced the CloudFormation registry, a capability that allowed developers to create and manage private resource types. We followed up in 2021 with the public registry where third parties, such as partners in the AWS Partner Network (APN), can publish extensions. The open source resource model that customers and partners use to publish third-party registry extensions is the same model used by AWS service teams to provide CloudFormation support for their features.

Once a service team on-boards their resources to the new resource model and builds the expected Create, Read, Update, Delete, and List (CRUDL) handlers, managed experiences like drift detection and resource import are all supported with no additional development effort. One recent example of day-1 CloudFormation support for a popular new feature was Lambda Function URLs, which offered a built-in HTTPS endpoint for single-function micro-services. We also migrated the Amazon Relational Database Service (Amazon RDS) Database Instance resource (AWS::RDS::DBInstance) to the new resource model in September 2022, and within a month, Amazon RDS delivered support for Amazon Aurora Serverless v2 in CloudFormation. This accelerated delivery is possible because teams can now publish independently by taking advantage of the de-centralized Registry ownership model.

Current State

We are building out future innovations for the CloudFormation service on top of this new standardized resource model so that customers can benefit from a consistent implementation of event handlers. We built AWS Cloud Control API on top of this new resource model. Cloud Control API takes the Create-Read-Update-Delete-List (CRUDL) handlers written for the new resource model and makes them available as a consistent API for provisioning resources. APN partner products such as HashiCorp Terraform, Pulumi, and Red Hat Ansible use Cloud Control API to stay in sync with AWS service launches without recurring development effort.

Figure 1. Cloud Control API Resource Handler Diagram

Besides 3rd party application support, the public registry can also be used by the developer community to create useful extensions on top of AWS services. A common solution to extending the capabilities of CloudFormation resources is to write a custom resource, which generally involves inline AWS Lambda function code that runs in response to CREATE, UPDATE, and DELETE signals during stack operations. Some of those use cases can now be solved by writing a registry extension resource type instead. For more information on custom resources and resource types, and the differences between the two, see Managing resources using AWS CloudFormation Resource Types.

CloudFormation Registry modules, which are building blocks authored in JSON or YAML, give customers a way to replace fragile copy-paste template reuse with template snippets that are published in the registry and consumed as if they were resource types. Best practices can be encapsulated and shared across an organization, which allows infrastructure developers to easily adhere to those best practices using modular components that abstract away the intricate details of resource configuration.

CloudFormation Registry hooks give security and compliance teams a vital tool to validate stack deployments before any resources are created, modified, or deleted. An infrastructure team can activate hooks in an account to ensure that stack deployments cannot avoid or suppress preventative controls implemented in hook handlers. Provisioning tools that are strictly client-side do not have this level of enforcement.

A useful by-product of publishing a resource type to the public registry is that you get automatic support for the AWS Cloud Development Kit (CDK) via an experimental open source repository on GitHub called cdk-cloudformation. In large organizations it is typical to see a mix of CloudFormation deployments using declarative templates and deployments that make use of the CDK in languages like TypeScript and Python. By publishing re-usable resource types to the registry, all of your developers can benefit from higher level abstractions, regardless of the tool they choose to create and deploy their applications. (Note that this project is still considered a developer preview and is subject to change)

If you want to see if a given CloudFormation resource is on the new registry model or not, check if the provisioning type is either Fully Mutable or Immutable by invoking the DescribeType API and inspecting the ProvisioningType response element.

Here is a sample CLI command that gets a description for the AWS::Lambda::Function resource, which is on the new registry model.

$ aws cloudformation describe-type --type RESOURCE \
    --type-name AWS::Lambda::Function | grep ProvisioningType

   "ProvisioningType": "FULLY_MUTABLE",

The difference between FULLY_MUTABLE and IMMUTABLE is the presence of the Update handler. FULLY_MUTABLE types includes an update handler to process updates to the type during stack update operations. Whereas, IMMUTABLE types do not include an update handler, so the type can’t be updated and must instead be replaced during stack update operations. Legacy resource types will be NON_PROVISIONABLE.

Opportunities for improvement

As we continue to strive towards our ultimate goal of achieving full feature coverage and a complete migration away from the legacy resource model, we are constantly identifying opportunities for improvement. We are currently addressing feature gaps in supported resources, such as tagging support for EC2 VPC Endpoints and boosting coverage for resource types to support drift detection, resource import, and Cloud Control API. We have fully migrated more than 130 resources, and acknowledge that there are many left to go, and the migration has taken longer than we initially anticipated. Our top priority is to maintain the stability of existing stacks—we simply cannot break backwards compatibility in the interest of meeting a deadline, so we are being careful and deliberate. One of the big benefits of a server-side provisioning engine like CloudFormation is operational stability—no matter how long ago you deployed a stack, any future modifications to it will work without needing to worry about upgrading client libraries. We remain committed to streamlining the migration process for service teams and making it as easy and efficient as possible.

The developer experience for creating registry extensions has some rough edges, particularly for languages other than Java, which is the language of choice on AWS service teams for their resource types. It needs to be easier to author schemas, write handler functions, and test the code to make sure it performs as expected. We are devoting more resources to the maintenance of the CLI and plugins for Python, Typescript, and Go. Our response times to issues and pull requests in these and other repositories in the aws-cloudformation GitHub organization have not been as fast as they should be, and we are making improvements. One example is the cloudformation-cli repository, where we have merged more than 30 pull requests since October of 2022.

To keep up with progress on resource coverage, check out the CloudFormation Coverage Roadmap, a GitHub project where we catalog all of the open issues to be resolved. You can submit bug reports and feature requests related to resource coverage in this repository and keep tabs on the status of open requests. One of the steps we took recently to improve responses to feature requests and bugs reported on GitHub is to create a system that converts GitHub issues into tickets in our internal issue tracker. These tickets go directly to the responsible service teams—an example is the Amazon RDS resource provider, which has hundreds of merged pull requests.

We have recently announced a new GitHub repository called community-registry-extensions where we are managing a namespace for public registry extensions. You can submit and discuss new ideas for extensions and contribute to any of the related projects. We handle the testing, validation, and deployment of all resources under the AwsCommunity:: namespace, which can be activated in any AWS account for use in your own templates.

To get started with the CloudFormation registry, visit the user guide, and then dive in to the detailed developer guide for information on how to use the CloudFormation Command Line Interface (CFN-CLI) to write your own resource types, modules, and hooks.

We recently created a new Discord server dedicated to CloudFormation. Please join us to ask questions, discuss best practices, provide feedback, or just hang out! We look forward to seeing you there.

Conclusion

In this post, we hope you gained some insights into the history of the CloudFormation registry, and the design decisions that were made during our evolution towards a standardized, scalable model for resource development that can be shared by AWS service teams, customers, and APN partners. Some of the lessons that we learned along the way might be applicable to complex design initiatives at your own company. We hope to see you on Discord and GitHub as we build out a rich set of registry resources together!

About the authors:

Build an analytics pipeline for a multi-account support case dashboard

2023-05-01 Sindhura Palakodety

Post Syndicated from Sindhura Palakodety original https://aws.amazon.com/blogs/big-data/build-an-analytics-pipeline-for-a-multi-account-support-case-dashboard/

As organizations mature in their cloud journey, they have many accounts (even hundreds) that they need to manage. Imagine having to manage support cases for these accounts without a unified dashboard. Administrators have to access each account either by switching roles or with single sign-on (SSO) in order to view and manage support cases.

This post demonstrates how you can build an analytics pipeline to push support cases created in individual member AWS accounts into a central account. We also show you how to build an analytics dashboard to gain visibility and insights on all support cases created in various accounts within your organization.

Overview of solution

In this post, we go through the process to create a pipeline to ingest, store, process, analyze, and visualize AWS support cases. We use the following AWS services as key components:

Amazon EventBridge and AWS Lambda to ingest support case data in real time from various AWS accounts
Amazon Simple Storage Service (Amazon S3) to store the support case data
AWS Glue DataBrew to clean and transform the data
AWS Glue crawlers to catalog the data
AWS Step Functions to orchestrate the flow
Amazon Athena to query the processed data
An Amazon QuickSight dashboard that provides insights into various support case metrics

The following diagram illustrates the architecture.

The central account is the AWS account that you use to centrally manage the support case data.

Member accounts are the AWS accounts where, whenever the support cases are created, the data flows into an S3 bucket in the central account that can be visualized using the QuickSight dashboard in the central account.

To implement this solution, you complete the following high-level steps:

Determine the AWS accounts to use for the central account and member accounts.
Set up permissions for AWS CloudFormation StackSets on the central account and member accounts.
Create resources on the central account using AWS CloudFormation.
Create resources on the member accounts using CloudFormation StackSets.
Open up support cases on the member accounts.
Visualize the data in a QuickSight dashboard in the central account.

Prerequisites

Complete the following prerequisite steps:

Create AWS accounts if you haven’t done so already.
Before you get started, make sure that you have a Business or Enterprise support plan for your member accounts.
Sign up for QuickSight if you have never used QuickSight in this account before. To use the forecast capability in QuickSight, sign up for the Enterprise Edition.

Preparation for CloudFormation StackSets

In this section, we go through the steps to set up permissions for StackSets in both the central and member accounts.

Set up permissions for StackSets on the central account

To set up permissions on the central account, complete the following steps:

Sign in to the AWS Management Console of the central account.
Download the administrator role CloudFormation template.
On the AWS CloudFormation console, choose Create stack and With new resources.
Leave the Prepare template setting as default.
For Template source, select Upload a template file.
Choose Choose file and supply the CloudFormation template you downloaded: AWSCloudFormationStackSetAdministrationRole.yml.
Choose Next.
For Stack name, enter StackSetAdministratorRole.
Choose Next.
For Configure stack options, we recommend configuring tags, which are key-value pairs that can help you identify your stacks and the resources they create. For example, enter Owner as the key, and your email address as the value.
We don’t use additional permissions or advanced options, so accept the default values and choose Next.
Review your configuration and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

The stack takes about 30 seconds to complete.

Set up permissions for StackSets on member accounts

Now that we’ve created a StackSet administrator role on the central account, we need to create the StackSet execution role on the member accounts. Perform the following steps on all member accounts:

Sign in to the console on the member account.
Download the execution role CloudFormation template.
On the AWS CloudFormation console, choose Create stack and With new resources.
Leave the Prepare template setting as default.
For Template source, select Upload a template file.
Choose Choose file and supply the CloudFormation template you downloaded: AWSCloudFormationStackSetExecutionRole.yml.
Choose Next.
For Stack name, use StackSetExecutionRole.
For Parameters, enter the 12-digit account ID for the central account.
Choose Next.
For Configure stack options, we recommend configuring tags. For example, enter Owner as the key and your email address as the value.
We don’t use additional permissions or advanced options, so choose Next.

For more information, see Setting AWS CloudFormation stack options.

Review your configuration and select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

The stack takes about 30 seconds to complete.

Set up the infrastructure for the central account and member accounts

In this section, we go through the steps to create your resources for both accounts and launch the StackSets.

Create resources on the central account with AWS CloudFormation

To launch the provided CloudFormation template, complete the following steps:

Sign in to the console on the central account.
Choose Launch Stack:
Choose Next.
For Stack name, enter a name. For example, support-case-central-account.
For AWSMemberAccountIDs, enter the member account IDs separated by commas from where support case data is gathered.
For Support Case Raw Data Bucket, enter the S3 bucket in the central account that holds the support case raw data from all member accounts. Note the name of this bucket to use in future steps.
For Support Case Transformed Data Bucket, enter the S3 bucket in central account that holds the support case transformed data. Note the name of this bucket to use in future steps.
Choose Next.
Enter any tags you want to assign to the stack and choose Next.
Select the acknowledgement check boxes and choose Create stack.

The stack takes approximately 5 minutes to complete. Wait until the stack is complete before proceeding to the next steps.

Launch CloudFormation StackSets from the central account

To launch StackSets, complete the following steps:

Sign in to the console on the central account.
On the AWS CloudFormation console, choose StackSets in the navigation pane.
Choose Create StackSet.
Leave the IAM execution role name as AWSCloudFormationStackSetExecutionRole.
If AWS Organizations is enabled, under permissions, select Service-managed permissions.
Leave the Prepare template setting as default.
For Template source, select Amazon S3 URL.
Enter the following Amazon S3 URL under Specify Template: https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/BDB-2583/AWS_MemberAccount_SupportCaseDashboard_CF.yaml
Choose Next.
For StackSet name, enter a name. For example, support-case-member-account.
For CentralSupportCaseRawBucket, enter the name of the Support Case Raw Data Bucket created in the central account, which you noted previously.
For CentralAccountID, enter the account ID of the central account.
For Configure StackSet options, we recommend configuring tags.
Leave the rest as default and choose Next.
If AWS Organizations is enabled, in the Set deployment options step, for Deployment targets, you can either choose Deploy to organization or Deploy to organizational units (OU).
- If you deploy to OUs, you will need to specify the AWS OU ID.
If AWS Organizations is not enabled, on the Set Deployment Options page, under Accounts, select Deploy stacks in accounts.
- Under Account numbers, enter the 12-digit account IDs for the member accounts as a comma-separated list. For example: 111111111111,222222222222.
Under Specify regions, choose US East (N. Virginia).

Due to the limitation of EventBridge with the AWS Support API, this StackSet has to be deployed only in the US East (N. Virginia) Region.

Optionally, you can change the maximum concurrent accounts to match the number of member accounts, adjust the failure tolerance to at least 1, and choose Region Concurrency to be Parallel to set up resources in parallel on the member accounts.
Review your selections, select the acknowledgement check boxes, and choose Submit.

The operation takes about 2–3 minutes to complete.

Visualize your support cases in QuickSight in the central account

In this section, we go through the steps to visualize your support cases in QuickSight.

Grant QuickSight permissions

To grant QuickSight permissions, complete the following steps:

Sign in to the console on the central account.
On the QuickSight console, on the Admin drop-down menu in top right-hand corner, choose Manage QuickSight.
In the navigation pane, choose Security & permissions.
Under QuickSight access to AWS services, choose Manage.
Select Amazon Athena.
Select Amazon S3 to edit QuickSight access to your S3 buckets.
Select the bucket you specified during stack creation.
Choose Finish.
Choose Save.

Prepare the datasets

To prepare your datasets, complete the following steps:

On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
Choose Athena.
For Data source name, enter support-case-data-source.
Choose Validate connection.
After your connection is validated, choose Create data source.
For Database, choose support-case-transformed-data.
For Tables, select the table under the database (there should only be one table that matches the name of the S3 bucket you set as the destination for the transformed data).
Choose Edit/Preview data.
Leave Query mode set as Direct Query.
Choose the options menu (three dots) next to the field case_creation_year and set Change data type to Date.
Enter the date format as yyyy, then choose Validate and Update.
Similarly, right-click on the field case_creation_month and set Change data type to Date.
Enter the date format as MM, then choose Validate and Update.
Right-click on the field case_creation_day and set Change data type to Date.
Enter the date format as dd, then choose Validate and Update.
Right-click on the field case_creation_time and set Change data type to Date.
Enter the date format as yyyy-MM-dd’T’HH:mm:ss.SSSZ, then choose Validate and Update.
Change the name of the QuickSight dataset to support-cases-dataset.
Choose Save & publish.
Note the dataset ID from the URL (alpha-numeric string between datasets and view, excluding slashes) to use later for QuickSight dashboard creation.

Choose Cancel to exit this page.

Set up the QuickSight dashboard from a template

To set up your QuickSight dashboard, complete the following steps:

Navigate to the following link, then right-click and choose Save As to download the QuickSight dashboard JSON template from the browser.
On the console, choose the user profile drop-down menu.
Choose the copy icon next to the Account ID: field (of the central account).

Open the JSON file with a text editor and replace xxxxx with the account ID. This will be replaced in two places.
Replace yyyyy with the dataset ID that you previously noted.
Replace rrrrr with the Region where you deployed resources in the central account.

To determine the principal (user) to be used for the dashboard creation, you can use AWS CloudShell.

Navigate to CloudShell on the console. Ensure it’s the same Region where your resources are deployed.

Wait until the environment gets created and you see the CloudShell prompt.

Run the following command, providing your account ID (central account) and Region:

aws quicksight list-users –region <region> --aws-account-id <account-id> --namespace default

From the output, select the value of the ARN field. Replace the value of zzzzz with the ARN.
Optionally, you can change the name of the dashboard by changing the value of the fields in the JSON file:
- For DashboardId, enter SupportCaseCentralDashboard.
- For Name, enter SupportCaseCentralDashboard.
Save the changes to the JSON file.

Now we use CloudShell to upload the JSON file provided in the previous step.

On the Actions menu, choose Upload file.

To create the QuickSight dashboard from the JSON template, use the following AWS Command Line Interface (AWS CLI) command and pass the updated JSON file as an argument, providing your Region:
```
aws quicksight create-dashboard –region <region> --cli-input-json file://support-case-dashboard-template.json
```

The output of the command looks similar to the following screenshot.

In case of any issues or if you want to see more details about the dashboard, you can use the following command:

aws quicksight describe-dashboard --region <region> --aws-account-id <central-account-id> --dashboard-id <DashboardId in screenshot above>

On the QuickSight console, choose Dashboards in the navigation pane.
Choose Support Cases Dashboard.

You should see a dashboard similar to the screenshot shown at the beginning of this post, but there should only be one case.

Add additional member accounts

If you want to add additional member accounts, you need to update the CloudFormation stack that you created earlier on the central account. If you followed our name recommendation, the stack is called support-case-central-account-stack. Add the additional account number in the Member Account IDs parameter.

Next, go to the StackSet in the central account. If you followed our naming recommendation, the StackSet is called support-case-member-account. Select the StackSet and on the Actions menu, choose Add stacks to StackSet. Then follow the same instructions that you followed previously when you created the StackSet.

Monitor support cases created in the central account

So far, our setup will monitor all support cases created in the member accounts that you specified. However, it doesn’t include support cases that you create in the central account. To set up monitoring for the central account, complete the following steps:

Update the CloudFormation stack that you created earlier on the central account. If you followed our name recommendation, the stack is called support-case-central-account-stack. Add the central account ID in the Member Account IDs parameter.
Sign in to the CloudFormation console in the central account.
Choose Launch Stack:
Choose Next.
For Stack name, enter a name. For example, support-case-central-as-member-account.
For CentralAccountIDs, enter the central account ID.
For CentralSupportCaseRawBucket, enter the S3 bucket in the central account that holds the support case raw data from all member accounts.
Choose Next.
Enter any tags you want to assign to the stack and choose Next.
Select the acknowledgement check boxes and choose Create stack.

Clean up

To avoid incurring future charges, delete the resources you created as part of this solution.

Troubleshooting

Note the following troubleshooting tips:

Make sure that you create the CloudFormation stacks and StackSet in the correct accounts: central and member.
If you get a permission denied error from Athena on the S3 path (see the following screenshot), review the steps to grant QuickSight permissions.

When creating the QuickSight dashboard using the template, if you get an error similar to the following, make sure that you use the ARN value from the output generated by the aws quicksight list-users --region <region> --aws-account-id <account-id> --namespace default command.

An error occurred (InvalidParameterValueException) when calling the CreateDashboard operation: Principal ARN xxxx is not part of the same account yyyy

When deleting the stack, if you encounter the DELETE_FAILED error, it means that your S3 bucket is not empty. To fix it, empty the contents of the bucket and try to delete the Stack again.

Conclusion

Congratulations! You have successfully built an analytics pipeline to push support cases created in individual member accounts into a central account. You have also built an analytics dashboard to gain visibility and insights on all support cases created in various accounts. As you start creating support cases in your member accounts, you will be able to view them in a single pane of glass.

With the steps and resources described in this post, you can build your own analytics dashboard to gain visibility and insights on all support cases created in various accounts within your organization.

About the authors

Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS platform and specializes in the data analytics domain.

Shu Sia Lukito is a Partner Solutions Architect at AWS. She is on a mission to help AWS partners build successful AWS practices and help their customers accelerate their journey to the cloud. In her spare time, she enjoys spending time with her family and making spicy food.

Monitor and optimize cost on AWS Glue for Apache Spark

2023-04-28 Leonardo Gomez

Post Syndicated from Leonardo Gomez original https://aws.amazon.com/blogs/big-data/monitor-optimize-cost-glue-spark/

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning (ML), and application development. You can use AWS Glue to create, run, and monitor data integration and ETL (extract, transform, and load) pipelines and catalog your assets across multiple data stores.

One of the most common questions we get from customers is how to effectively monitor and optimize costs on AWS Glue for Spark. The diversity of features and pricing options for AWS Glue offers the flexibility to effectively manage the cost of your data workloads and still keep the performance and capacity as per your business needs. Although the fundamental process of cost optimization for AWS Glue workloads remains the same, you can monitor job runs and analyze the costs and usage to find savings and take action to implement improvements to the code or configurations.

In this post, we demonstrate a tactical approach to help you manage and reduce cost through monitoring and optimization techniques on top of your AWS Glue workloads.

Monitor overall costs on AWS Glue for Apache Spark

AWS Glue for Apache Spark charges an hourly rate in 1-second increments with a minimum of 1 minute based on the number of data processing units (DPUs). Learn more in AWS Glue Pricing. This section describes a way to monitor overall costs on AWS Glue for Apache Spark.

AWS Cost Explorer

In AWS Cost Explorer, you can see overall trends of DPU hours. Complete the following steps:

On the Cost Explorer console, create a new cost and usage report.
For Service, choose Glue.
For Usage type, choose the following options:
1. Choose <Region>-ETL-DPU-Hour (DPU-Hour) for standard jobs.
2. Choose <Region>-ETL-Flex-DPU-Hour (DPU-Hour) for Flex jobs.
3. Choose <Region>-GlueInteractiveSession-DPU-Hour (DPU-Hour) for interactive sessions.
Choose Apply.

Learn more in Analyzing your costs with AWS Cost Explorer.

Monitor individual job run costs

This section describes a way to monitor individual job run costs on AWS Glue for Apache Spark. There are two options to achieve this.

AWS Glue Studio Monitoring page

On the Monitoring page in AWS Glue Studio, you can monitor the DPU hours you spent on a specific job run. The following screenshot shows three job runs that processed the same dataset; the first job run spent 0.66 DPU hours, and the second spent 0.44 DPU hours. The third one with Flex spent only 0.33 DPU hours.

GetJobRun and GetJobRuns APIs

The DPU hour values per job run can be retrieved through AWS APIs.

For auto scaling jobs and Flex jobs, the field DPUSeconds is available in GetJobRun and GetJobRuns API responses:

$ aws glue get-job-run --job-name ghcn --run-id jr_ccf6c31cc32184cea60b63b15c72035e31e62296846bad11cd1894d785f671f4
{
    "JobRun": {
        "Id": "jr_ccf6c31cc32184cea60b63b15c72035e31e62296846bad11cd1894d785f671f4",
        "Attempt": 0,
        "JobName": "ghcn",
        "StartedOn": "2023-02-08T19:14:53.821000+09:00",
        "LastModifiedOn": "2023-02-08T19:19:35.995000+09:00",
        "CompletedOn": "2023-02-08T19:19:35.995000+09:00",
        "JobRunState": "SUCCEEDED",
        "PredecessorRuns": [],
        "AllocatedCapacity": 10,
        "ExecutionTime": 274,
        "Timeout": 2880,
        "MaxCapacity": 10.0,
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
        "LogGroupName": "/aws-glue/jobs",
        "GlueVersion": "3.0",
        "ExecutionClass": "FLEX",
        "DPUSeconds": 1137.0
    }
}

The field DPUSeconds returns 1137.0. This means 0.32 DPU hours which can be calculated in 1137.0/(60*60)=0.32.

For the other standard jobs without auto scaling, the field DPUSeconds is not available:

$ aws glue get-job-run --job-name ghcn --run-id jr_10dfa93fcbfdd997dd9492187584b07d305275531ff87b10b47f92c0c3bd6264
{
    "JobRun": {
        "Id": "jr_10dfa93fcbfdd997dd9492187584b07d305275531ff87b10b47f92c0c3bd6264",
        "Attempt": 0,
        "JobName": "ghcn",
        "StartedOn": "2023-02-07T16:38:05.155000+09:00",
        "LastModifiedOn": "2023-02-07T16:40:48.575000+09:00",
        "CompletedOn": "2023-02-07T16:40:48.575000+09:00",
        "JobRunState": "SUCCEEDED",
        "PredecessorRuns": [],
        "AllocatedCapacity": 10,
        "ExecutionTime": 157,
        "Timeout": 2880,
        "MaxCapacity": 10.0,
        "WorkerType": "G.1X",
        "NumberOfWorkers": 10,
        "LogGroupName": "/aws-glue/jobs",
        "GlueVersion": "3.0",
        "ExecutionClass": "STANDARD"
    }
}

For these jobs, you can calculate DPU hours by ExecutionTime*MaxCapacity/(60*60). Then you get 0.44 DPU hour by 157*10/(60*60)=0.44. Note that AWS Glue versions 2.0 and later have a 1-minute minimum billing.

AWS CloudFormation template

Because DPU hours can be retrieved through the GetJobRun and GetJobRuns APIs, you can integrate this with other services like Amazon CloudWatch to monitor trends of consumed DPU hours over time. For example, you can configure an Amazon EventBridge rule to invoke an AWS Lambda function to publish CloudWatch metrics every time AWS Glue jobs finish.

To help you configure that quickly, we provide an AWS CloudFormation template. You can review and customize it to suit your needs. Some of the resources this stack deploys incur costs when in use.

The CloudFormation template generates the following resources:

AWS Identity and Access Management (IAM) role
Lambda function
EventBridge rule

To create your resources, complete the following steps:

Sign in to the AWS CloudFormation console.
Choose Launch Stack:
Choose Next.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.

Stack creation can take up to 3 minutes.

After you complete the stack creation, when AWS Glue jobs finish, the following DPUHours metrics are published under the Glue namespace in CloudWatch:

Aggregated metrics – Dimension=[JobType, GlueVersion, ExecutionClass]
Per-job metrics – Dimension=[JobName, JobRunId=ALL]
Per-job run metrics – Dimension=[JobName, JobRunId]

Aggregated metrics and per-job metrics are shown as in the following screenshot.

Each datapoint represents DPUHours per individual job run, so valid statistics for the CloudWatch metrics is SUM. With the CloudWatch metrics, you can have a granular view on DPU hours.

Options to optimize cost

This section describes key options to optimize costs on AWS Glue for Apache Spark:

Upgrade to the latest version
Auto scaling
Flex
Set the job’s timeout period appropriately
Interactive sessions
Smaller worker type for streaming jobs

We dive deep to the individual options.

Upgrade to the latest version

Having AWS Glue jobs running on the latest version enables you to take advantage of the latest functionalities and improvements offered by AWS Glue and the upgraded version of the supported engines such as Apache Spark. For example, AWS Glue 4.0 includes the new optimized Apache Spark 3.3.0 runtime and adds support for built-in pandas APIs as well as native support for Apache Hudi, Apache Iceberg, and Delta Lake formats, giving you more options for analyzing and storing your data. It also includes a new highly performant Amazon Redshift connector that is 10 times faster on TPC-DS benchmarking.

Auto scaling

One of the most common challenges to reduce cost is to identify the right amount of resources to run jobs. Users tend to overprovision workers in order to avoid resource-related problems, but part of those DPUs are not used, which increases costs unnecessarily. Starting with AWS Glue version 3.0, AWS Glue auto scaling helps you dynamically scale resources up and down based on the workload, for both batch and streaming jobs. Auto scaling reduces the need to optimize the number of workers to avoid over-provisioning resources for jobs, or paying for idle workers.

To enable auto scaling on AWS Glue Studio, go to the Job Details tab of your AWS Glue job and select Automatically scale number of workers.

You can learn more in Introducing AWS Glue Auto Scaling: Automatically resize serverless computing resources for lower cost with optimized Apache Spark.

Flex

For non-urgent data integration workloads that don’t require fast job start times or can afford to rerun the jobs in case of a failure, Flex could be a good option. The start times and runtimes of jobs using Flex vary because spare compute resources aren’t always available instantly and may be reclaimed during the run of a job. Flex-based jobs offer the same capabilities, including access to custom connectors, a visual job authoring experience, and a job scheduling system. With the Flex option, you can optimize the costs of your data integration workloads by up to 34%.

To enable Flex on AWS Glue Studio, go to the Job Details tab of your job and select Flex execution.

You can learn more in Introducing AWS Glue Flex jobs: Cost savings on ETL workloads.

Interactive sessions

One common practice among developers that create AWS Glue jobs is to run the same job several times every time a modification is made to the code. However, this may not be cost-effective depending of the number of workers assigned to the job and the number of times that it’s run. Also, this approach may slow down the development time because you have to wait until every job run is complete. To address this issue, in 2022 we released AWS Glue interactive sessions. This feature let developers process data interactively using a Jupyter-based notebook or IDE of their choice. Sessions start in seconds and have built-in cost management. As with AWS Glue jobs, you pay for only the resources you use. Interactive sessions allow developers to test their code line by line without needing to run the entire job to test any changes made to the code.

Set the job’s timeout period appropriately

Due to configuration issues, script coding errors, or data anomalies, sometimes AWS Glue jobs can take an exceptionally long time or struggle to process the data, and it can cause unexpected charges. AWS Glue gives you the ability to set a timeout value on any jobs. By default, an AWS Glue job is configured with 48 hours as the timeout value, but you can specify any timeout. We recommend identifying the average runtime of your job, and based on that, set an appropriate timeout period. This way, you can control cost per job run, prevent unexpected charges, and detect any problems related to the job earlier.

To change the timeout value on AWS Glue Studio, go to the Job Details tab of your job and enter a value for Job timeout.

Interactive sessions also have the same ability to set an idle timeout value on sessions. The default idle timeout value for Spark ETL sessions is 2880 minutes (48 hours). To change the timeout value, you can use %idle_timeout magic.

Smaller worker type for streaming jobs

Processing data in real time is a common use case for customers, but sometimes these streams have sporadic and low data volumes. G.1X and G.2X worker types could be too big for these workloads, especially if we consider streaming jobs may need to run 24/7. To help you reduce costs, in 2022 we released G.025X, a new quarter DPU worker type for streaming ETL jobs. With this new worker type, you can process low data volume streams at one-fourth of the cost.

To select the G.025X worker type on AWS Glue Studio, go to the Job Details tab of your job. For Type, choose Spark Streaming, then choose G 0.25X for Worker type.

You can learn more in Best practices to optimize cost and performance for AWS Glue streaming ETL jobs.

Performance tuning to optimize cost

Performance tuning plays an important role in reducing cost. The first action for performance tuning is to identify the bottlenecks. Without measuring the performance and identifying bottlenecks, it’s not realistic to optimize cost-effectively. CloudWatch metrics provide a simple view for quick analysis, and the Spark UI provides deeper view for performance tuning. It’s highly recommended to enable Spark UI for your jobs and then view the UI to identify the bottleneck.

The following are high-level strategies to optimize costs:

Scale cluster capacity
Reduce the amount of data scanned
Parallelize tasks
Optimize shuffles
Overcome data skew
Accelerate query planning

For this post, we discuss the techniques for reducing the amount of data scanned and parallelizing tasks.

Reduce the amount of data scanned: Enable job bookmarks

AWS Glue job bookmarks are a capability to process data incrementally when running a job multiple times on a scheduled interval. If your use case is an incremental data load, you can enable job bookmarks to avoid a full scan for all job runs and process only the delta from the last job run. This reduces the amount of data scanned and accelerates individual job runs.

Reduce the amount of data scanned: Partition pruning

If your input data is partitioned in advance, you can reduce the amount of data scan by pruning partitions.

For AWS Glue DynamicFrame, set push_down_predicate (and catalogPartitionPredicate), as shown in the following code. Learn more in Managing partitions for ETL output in AWS Glue.

# DynamicFrame
dyf = Glue_context.create_dynamic_frame.from_catalog(
    database=src_database_name,
    table_name=src_table_name,
    push_down_predicate = "year='2023' and month ='03'",
)

For Spark DataFrame (or Spark SQL), set a where or filter clause to prune partitions:

# DataFrame
df = spark.read.format("json").load("s3://<YourBucket>/year=2023/month=03/*/*.gz")
 
# SparkSQL 
df = spark.sql("SELECT * FROM <Table> WHERE year= '2023' and month = '03'")

Parallelize tasks: Parallelize JDBC reads

The number of concurrent reads from the JDBC source is determined by configuration. Note that by default, a single JDBC connection will read all the data from the source through a SELECT query.

Both AWS Glue DynamicFrame and Spark DataFrame support parallelize data scans across multiple tasks by splitting the dataset.

For AWS Glue DynamicFrame, set hashfield or hashexpression and hashpartition. Learn more in Reading from JDBC tables in parallel.

For Spark DataFrame, set numPartitions, partitionColumn, lowerBound, and upperBound. Learn more in JDBC To Other Databases.

Conclusion

In this post, we discussed methodologies for monitoring and optimizing cost on AWS Glue for Apache Spark. With these techniques, you can effectively monitor and optimize costs on AWS Glue for Spark.

If you have comments or feedback, please leave them in the comments.

About the Authors

Leonardo Gómez is a Principal Analytics Specialist Solutions Architect at AWS. He has over a decade of experience in data management, helping customers around the globe address their business and technical needs. Connect with him on LinkedIn

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Best Practices for managing data residency in AWS Local Zones using landing zone controls

2023-04-27 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/best-practices-for-managing-data-residency-in-aws-local-zones-using-landing-zone-controls/

This blog post is written by Abeer Naffa’, Sr. Solutions Architect, Solutions Builder AWS, David Filiatrault, Principal Security Consultant, and Jared Thompson Hybrid Edge SA Specialist.

In this post, we discuss how you can leverage AWS Control Tower landing zone and AWS Organizations custom policies – guardrails – at the root level, known as Service Control Policies (SCPS) to enable certain data residency requirements on AWS Local Zones. Using the suggested controls lets you limit the ability to store data, process data outside a geographic location, and keep your data within specific Local Zone(s).

Data residency is a critical consideration for organizations that collect and store sensitive information, such as Personal Identifiable Information (PII), financial data, and healthcare data. With the rise of cloud computing and the global nature of the Internet, it can be challenging for organizations to make sure that their data is being stored and processed in compliance with local laws and regulations.

One potential solution for addressing data residency challenges with AWS is utilizing Local Zones, which places AWS infrastructure in large metro areas. This enables organizations to store and process data in specific geographic locations. By using Local Zones, organizations can architect their environment to meet data residency requirements when an AWS Region is unavailable within the same legal jurisdiction. Local Zones can be configured to utilize landing zone to further adhere to data residency requirements.

A landing zone is a well-architected, multi-account AWS environment that is scalable and secure. This is a starting point from which your organization can quickly launch and deploy workloads and applications with confidence in your security and infrastructure environment.

When leveraging a Local Zone to meet data residency requirements, you must have control over the in-scope data movement from the Local Zones to the AWS Region. This can be accomplished by implementing landing zone best practices and the suggested guardrails. The main focus of this post is the custom policies that restrict data snapshots, prohibit data creation within the Region, and limit data transfer to the Region. Furthermore, this post covers which prerequisites organizations should consider before implementing these guardrails.

Prerequisites

Landing zones best practices and custom guardrails can help when data must remain in a specific locality where the Local Zone is also located. This can be completed by defining and enforcing policies for data storage and usage within the landing zone organization that you set up. The following prerequisites should be considered before implementing the suggested guardrails:

AWS Local Zones

Local Zones are enabled through the Amazon Elastic Compute Cloud (Amazon EC2) console or the AWS Command Line Interface (AWS CLI).

You can start using available AWS services on the designated Local Zone directly from the console. Moreover, you can manage the Local Zone using the same tools and interfaces that you use in AWS Region.

2. AWS Control Tower landing zone

AWS Control Tower is a managed service that provides a pre-packaged set of best-practice blueprints for setting up and governing multi-account AWS environments. You must have Control Tower fully implemented in your environment before you can deploy custom guardrails.

Control Tower launches a key resource associated with your account, called a landing zone, which serves as a home for your organizations and their accounts.

Note that Control Tower relies on Organizations to create and manage multi-account structures.

Set up the data residency guardrails

Using Organizations, you must make sure that the Local Zone is enabled within a workload account in the landing zones.

Figure 1: Landing Zones Accelerator – Local Zones workload on AWS high level Architecture

Utilizing Local Zones for regulated components

The availability of Local Zones provides an excellent opportunity to meet data residency requirements and comply with local regulations that restrict the use of the Region outside of your required geo-political boundary. By leveraging Local Zones, organizations can maintain compliance while utilizing AWS services to support their business needs. AWS owns and manages the infrastructure, including hardware, software, and networking for Local Zones. However, as part of the shared responsibility model, customers are responsible for the security of their applications and data, including access control, data encryption, etc.

You must also comply with all applicable regulations and security standards, such as HIPAA, PCI DSS, and GDPR. You should conduct regular security assessments and implement appropriate security controls to protect their applications and data.

In this post, we consider a scenario where there is a single Local Zones location in a metro. However, you must analyze the specific requirements of your workloads and the relevant regulations that apply to determine the most appropriate high availability configurations for your case.

Local Zones workload data residency guardrails

Organizations provides central governance and management for multiple accounts. Central security administrators use SCPs with Organizations to establish controls to which all AWS Identity and Access Management (IAM) principals (users and roles) adhere.

Now you can use SCPs to set permission guardrails. The suggested preventative controls that leverage the implementation of SCPs for data residency on Local Zones are shown in the next paragraph. SCPs let you set permission guardrails by defining the maximum available permissions for IAM entities in an account and all accounts within the Organization root or Organizational Unit (OU). If an SCP denies an action for an account, then none of the entities in any member account, including the member account’s root user, can take that action, even if their IAM permissions let them. The guardrails set in SCPs apply to all IAM entities in the account, which include all users, roles, and the account root user.

Upon finalizing these prerequisites, you can create the guardrails for the chosen OU.

Note that although the following guidelines serve as helpful guardrails – SCPs – for data residency, you should consult internally with legal and security teams for specific organizational requirements.

To exercise better control over the workload in Local Zones and prevent data transfer from Local Zones or data storage outside of the Local Zones, consider implementing the following guardrails:

When your data residency requirements require restricting data transfer/saving to the Region, consider the following guardrails:

a. Deny copying data from Local Zones to the Region for Amazon EC2), Amazon Relational Database Service (Amazon RDS), Amazon ElastiCache, and data sync “DenyCopyToRegion”.

As the available services in Local Zones can vary based on location, you must review the services available in the chosen Local Zone and Adjust the SCPs accordingly.

b. Deny Amazon Simple Storage Service (Amazon S3) put action to the Region “DenyPutObjectToRegionalBuckets”.

If your data residency requirements mandate restrictions on data storage in the Region, then consider implementing this guardrail to prevent the use of Amazon S3 in the Region.

c. If your data residency requirements mandate restrictions on data storage in the Region, then consider implementing “DenyDirectTransferToRegion”

Out of Scope is metadata such as tags or operational data such as AWS Key Management Service (AWS KMS) keys.

{
  "Version": "2012-10-17",
  "Statement": [
      {
      "Sid": "DenyCopyToRegion",
      "Action": [
        "ec2:ModifyImageAttribute",
        "ec2:CopyImage",  
        "ec2:CreateImage",
        "ec2:CreateInstanceExportTask",
        "ec2:ExportImage",
        "ec2:ImportImage",
        "ec2:ImportInstance",
        "ec2:ImportSnapshot",
        "ec2:ImportVolume",
        "rds:CreateDBSnapshot",
        "rds:CreateDBClusterSnapshot",
        "rds:ModifyDBSnapshotAttribute",
        "elasticache:CreateSnapshot",
        "elasticache:CopySnapshot",
        "datasync:Create*",
        "datasync:Update*"
      ],
      "Resource": "*",
      "Effect": "Deny"
    },
    {
      "Sid": "DenyDirectTransferToRegion",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:CreateTable",
        "ec2:CreateTrafficMirrorTarget",
        "ec2:CreateTrafficMirrorSession",
        "rds:CreateGlobalCluster",
        "es:Create*",
        "elasticfilesystem:C*",
        "elasticfilesystem:Put*",
        "storagegateway:Create*",
        "neptune-db:connect",
        "glue:CreateDevEndpoint",
        "glue:UpdateDevEndpoint",
        "datapipeline:CreatePipeline",
        "datapipeline:PutPipelineDefinition",
        "sagemaker:CreateAutoMLJob",
        "sagemaker:CreateData*",
        "sagemaker:CreateCode*",
        "sagemaker:CreateEndpoint",
        "sagemaker:CreateDomain",
        "sagemaker:CreateEdgePackagingJob",
        "sagemaker:CreateNotebookInstance",
        "sagemaker:CreateProcessingJob",
        "sagemaker:CreateModel*",
        "sagemaker:CreateTra*",
        "sagemaker:Update*",
        "redshift:CreateCluster*",
        "ses:Send*",
        "ses:Create*",
        "sqs:Create*",
        "sqs:Send*",
        "mq:Create*",
        "cloudfront:Create*",
        "cloudfront:Update*",
        "ecr:Put*",
        "ecr:Create*",
        "ecr:Upload*",
        "ram:AcceptResourceShareInvitation"
      ],
      "Resource": "*",
      "Effect": "Deny"
    },
    {
      "Sid": "DenyPutObjectToRegionalBuckets",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": ["arn:aws:s3:::*"],
      "Effect": "Deny"
    }
  ]
}

If your data residency requirements require limitations on data storage in the Region, then consider implementing this guardrail “DenyAllSnapshots” to restrict the use of snapshots in the Region.

Note that the following guardrail restricts the creation of snapshots on AWS Outposts as well. If you’re using Outposts in the same AWS account, then you may need to customize this guardrail to make sure that it aligns with your organization’s specific needs and requirements. For more information on data residency considerations for Outposts, please refer to Architecting for data residency with AWS Outposts rack and landing zone guardrails.

{
  "Version": "2012-10-17",
  "Statement": [

    {
      "Sid": "DenyAllSnapshots",
      "Effect":"Deny",
      "Action":[
        "ec2:CreateSnapshot",
        "ec2:CreateSnapshots",
        "ec2:CopySnapshot",
        "ec2:ModifySnapshotAttribute"
      ],
      "Resource":"arn:aws:ec2:*::snapshot/*"
    }
  ]
}

This guardrail helps prevent the launch of EC2 instances or the creation of network interfaces by subnet as opposed to Local Zones You should keep data residency workloads within the Local Zones rather than the Region to make sure of better control over regulated workloads. This approach can help your organization achieve better control over data residency workloads and improve governance over your Organization.

Make sure to update the Local Zones subnets < localzones_subnet_arns>.

{
"Version": "2012-10-17",
  "Statement":[{
    "Sid": "DenyNotLocalZonesSubnet",
    "Effect":"Deny",
    "Action": [
      "ec2:RunInstances",
      "ec2:CreateNetworkInterface"
    ],
    "Resource": [
      "arn:aws:ec2:*:*:network-interface/*"
    ],
    "Condition": {
      "ForAllValues:ArnNotEquals": {
        "ec2:Subnet": ["<localzones_subnet_arns>"]
      }
    }
  }]
}

Additional considerations

When implementing data residency guardrails on Local Zones, consider backup and disaster recovery strategies to make sure that your data is protected in the event of an outage or other unexpected events. This may include creating regular backups of your data, implementing disaster recovery plans and procedures, and using redundancy and failover systems to minimize the impact of any potential disruptions. Additionally, you should make sure that your backup and disaster recovery systems are compliant with any relevant data residency regulations and requirements. You should also test your backup and disaster recovery systems regularly to make sure that they are functioning as intended.

Additionally, the provided SCPs for Local Zones in the above example do not block the “logs:PutLogEvents”. Therefore, even if you implemented data residency guardrails on Local Zones, the application may log data to Amazon CloudWatch Logs in the Region.

Highlights

By default, application-level logs on Local Zones are not automatically sent to CloudWatch Logs in the Region. You can configure CloudWatch Logs agent on Local Zones to collect and send your application-level logs to CloudWatch Logs.

logs:PutLogEvents does transmit data to the Region, but it is not blocked by the provided SCPs, as it’s expected that most use cases still want to be able to use this logging API. However, if blocking is desired, then add the action to the first recommended guardrail. If you want specific roles to be allowed, then combine with the ArnNotLike condition example referenced in the Customization Guide.

Conclusion

The combined use of Local Zones and the suggested guardrails via Organizations policies enables you to exercise better control over the movement of the data. By creating a landing zone for your organization, you can apply SCPs to your Local Zones that will help make sure that your data remains within a specific geographic location, as required by the data residency regulations.

Note that, although custom guardrails can help you manage data residency on Local Zones, it’s critical to thoroughly review your policies, procedures, and configurations. This lets you make sure that they are compliant with all relevant data residency regulations and requirements. Regularly testing and monitoring your systems can help make sure that your data is protected and your organization stays compliant.

References

Working with percolators in Amazon OpenSearch Service

2023-04-27 Arun Lakshmanan

Post Syndicated from Arun Lakshmanan original https://aws.amazon.com/blogs/big-data/working-with-percolators-in-amazon-opensearch-service/

Amazon OpenSearch Service is a managed service that makes it easy to secure, deploy, and operate OpenSearch and legacy Elasticsearch clusters at scale in the AWS Cloud. Amazon OpenSearch Service provisions all the resources for your cluster, launches it, and automatically detects and replaces failed nodes, reducing the overhead of self-managed infrastructures. The service makes it easy for you to perform interactive log analytics, real-time application monitoring, website searches, and more by offering the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions). Amazon OpenSearch Service now offers a serverless deployment option (public preview) that makes it even easier to use OpenSearch in the AWS cloud.

A typical workflow for OpenSearch is to store documents (as JSON data) in an index, and execute searches (also JSON) to find those documents. Percolation reverses that. You store searches and query with documents. Let’s say I’m searching for a house in Chicago that costs < 500K. I could go to the website every day and run my query. A clever website would be able to store my requirements (a query) and notify me when something new (a document) comes up that matches my requirements. Percolation is an OpenSearch feature that enables the website to store these queries and run documents against them to find new matches.

In this post, We will explore how to use percolators to find matching homes from new listings.

Before getting into the details of percolators, let’s explore how search works. When you insert a document, OpenSearch maintains an internal data structure called the “inverted index” which speeds up the search.

Indexing and Searching:

Let’s take the above example of a real estate application having the simple schema of type of the house, city, and the price.

First, let’s create an index with mappings as below

PUT realestate
{
     "mappings": {
        "properties": {
           "house_type": { "type": "keyword"},
           "city": { "type": "keyword" },
           "price": { "type": "long" }
         }
    }
}

Let’s insert some documents into the index.

ID	House_type	City	Price
1	townhouse	Chicago	650000
2	house	Washington	420000
3	condo	Chicago	580000

POST realestate/_bulk 
{ "index" : { "_id": "1" } } 
{ "house_type" : "townhouse", "city" : "Chicago", "price": 650000 }
{ "index" : { "_id": "2" } }
{ "house_type" : "house", "city" : "Washington", "price": 420000 }
{ "index" : { "_id": "3"} }
{ "house_type" : "condo", "city" : "Chicago", "price": 580000 }

As we don’t have any townhouses listed in Chicago for less than 500K, the below query returns no results.

GET realestate/_search
{
  "query": {
    "bool": {
      "filter": [ 
        { "term": { "city": "Chicago" } },
        { "term": { "house_type": "townhouse" } },
        { "range": { "price": { "lte": 500000 } } }
      ]
    }
  }
}

If you’re curious to know how search works under the hood at high level, you can refer to this article.

Percolation:

If one of your customers wants to get notified when a townhouse in Chicago is available, and listed at less than $500,000, you can store their requirements as a query in the percolator index. When a new listing becomes available, you can run that listing against the percolator index with a _percolate query. The query will return all matches (each match is a single set of requirements from one user) for that new listing. You can then notify each user that a new listing is available that fits their requirements. This process is called percolation in OpenSearch.

OpenSearch has a dedicated data type called “percolator” that allows you to store queries.

Let’s create a percolator index with the same mapping, with additional fields for query and optional metadata. Make sure you include all the necessary fields that are part of a stored query. In our case, along with the actual fields and query, we capture the customer_id and priority to send notifications.

PUT realestate-percolator-queries
{
  "mappings": {
    "properties": {
      "user": {
         "properties": {
            "query": { "type": "percolator" },
            "id": { "type": "keyword" },
            "priority":{ "type": "keyword" }
         }
      },
      "house_type": {"type": "keyword"},
      "city": {"type": "keyword"},
      "price": {"type": "long"}
    }
  }
}

After creating the index, insert a query as below

POST realestate-percolator-queries/_doc/chicago-house-alert-500k
{
  "user" : {
     "id": "CUST101",
     "priority": "high",
     "query": {
        "bool": {
           "filter": [ 
                { "term": { "city": "Chicago" } },
                { "term": { "house_type": "townhouse" } },
                { "range": { "price": { "lte": 500000 } } }
            ]
        }
      }
   }
}

The percolation begins when a new document gets run against the stored queries.

{"city": "Chicago", "house_type": "townhouse", "price": 350000}
{"city": "Dallas", "house_type": "house", "price": 500000}

Run the percolation query with document(s), and it matches the stored query

GET realestate-percolator-queries/_search
{
  "query": {
     "percolate": {
        "field": "user.query",
        "documents": [ 
           {"city": "Chicago", "house_type": "townhouse", "price": 350000 },
           {"city": "Dallas", "house_type": "house", "price": 500000}
        ]
      }
   }
}

The above query returns the queries along with the metadata we stored (customer_id in our case) that matches the documents

{
    "took" : 11,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
     },
     "hits" : {
        "total" : {
           "value" : 1,
           "relation" : "eq"
         },
         "max_score" : 0.0,
         "hits" : [ 
         {
              "_index" : "realestate-percolator-queries",
              "_id" : "chicago-house-alert-500k",
              "_score" : 0.0,
              "_source" : {
                   "user" : {
                       "id" : "CUST101",
                       "priority" : "high",
                       "query" : {
                            "bool" : {
                                 "filter" : [ 
                                      { "term" : { "city" : "Chicago" } },
                                      { "term" : { "house_type" : "townhouse" } },
                                      { "range" : { "price" : { "lte" : 500000 } } }
                                 ]
                              }
                        }
                  }
            },
            "fields" : {
                "_percolator_document_slot" : [0]
            }
        }
     ]
   }
}

Percolation at scale

When you have a high volume of queries stored in the percolator index, searching queries across the index might be inefficient. You can consider segmenting your queries and use them as filters to handle the high-volume queries effectively. As we already capture priority, you can now run percolation with filters on priority that reduces the scope of matching queries.

GET realestate-percolator-queries/_search
{
    "query": {
        "bool": {
            "must": [ 
             {
                  "percolate": {
                      "field": "user.query",
                      "documents": [ 
                          { "city": "Chicago", "house_type": "townhouse", "price": 35000 },
                          { "city": "Dallas", "house_type": "house", "price": 500000 }
                       ]
                  }
              }
          ],
          "filter": [ 
                  { "term": { "user.priority": "high" } }
            ]
       }
    }
}

Best practices

Prefer the percolation index separate from the document index. Different index configurations, like number of shards on percolation index, can be tuned independently for performance.
Prefer using query filters to reduce matching queries to percolate from percolation index.
Consider using a buffer in your ingestion pipeline for reasons below,
1. You can batch the ingestion and percolation independently to suit your workload and SLA
2. You can prioritize the ingest and search traffic by running the percolation at off hours. Make sure that you have enough storage in the buffering layer.

Consider using an independent cluster for percolation for the below reasons,
1. The percolation process relies on memory and compute, your primary search will not be impacted.
2. You have the flexibility of scaling the clusters independently.

Conclusion

In this post, we walked through how percolation in OpenSearch works, and how to use effectively, at scale. Percolation works in both managed and serverless versions of OpenSearch. You can follow the best practices to analyze and arrange data in an index, as it is important for a snappy search performance.

If you have feedback about this post, submit your comments in the comments section.

About the author

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He has over 20 years of experience working with enterprise customers and startups. He loves to travel and spend quality time with his family.

Optimizing Amazon EC2 Spot Instances with Spot Placement Scores

2023-04-27 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/optimizing-amazon-ec2-spot-instances-with-spot-placement-scores/

This blog post is written by Steve Cole, Principal Specialist SA, and Robert McCone, Sr. Specialist SA.

Getting the compute resources you need, even vCPUS numbering in the millions, and completing a workload using Amazon EC2 Spot Instances is just a configuration away. In this post you will learn how to use Spot placement scores to reduce interruptions, acquire greater capacity, and identify optimal configurations, times, and locations to run workloads on Spot Instances. Amazon EC2 Spot Instances let you take advantage of unused EC2 capacity in the AWS cloud and are available at up to a 90% discount compared to On-Demand prices. Spot placement scores is a feature that many customers use to identify optimal instance types or to choose the best Availability Zone (AZ) for ephemeral work like data analytics or high-performance computing. As a real-time tool, Spot placement scores are often integrated into deployment automation. However, because of its logging and graphic capabilities, you may find it be a valuable resource even before you launch a workload into the cloud. Now available through AWS Labs, a Github repository hosting tools for customers, the Spot placement score tracker tackles the undifferentiated heavy lifting and can do this for any customer.

About Spot placement score

Spot placement scores are a feature available through AWS APIs – also implemented in the Amazon EC2 Spot requests console – that uses internal capacity and interruption data to scrutinize the size and shape of a Spot Instance request and responds with a “likelihood of success” rating of 1 to imply lower likelihood of success and 10 to imply higher likelihood of success. The score represents confidence in being able to acquire the desired capacity (size) using the instance configuration (shape) for the next few hours. The shape of the request can be a list of specific instances or can be requirements-based with attribute-based instance type selection. The size of the request can be instance count, number of vCPUs, or GB of RAM. It’s based on known capacity, allocation strategies, and the trending of capacities over time.

Before the release of Spot placement score, customers could track the trends of their existing workloads and configurations. This might have helped them to anticipate capacity constraints over time, but the ability to do something more meaningful when assessing configurations was something customers requested often. With the launch of Spot placement score, that capability was delivered and enabled customers to receive guidance on how a configuration change might affect the effectiveness of Spot Instances in a workload.

Customers immediately recognized the power of this new feature and started writing tooling around their workloads to incorporate the new functionality provided by Spot placement scores. For examples, customers leveraged Spot placement scores to find the highest scoring AZ in a region for work that requires low latency within a cluster. Customers running data analytics with services like Amazon EMR could more confidently launch clusters on Spot Instances. This reduces costs and the time necessary to process data because of fewer interruptions. Financial customers, health care and life sciences, and high tech were some of the early adopters of this strategy.

Benefits of Spot placement scores

One specific customer used tools like the Spot instance advisor and Spot pricing history tools to make decisions about what instances to run every night. If the customer’s analytics workload received too many interruptions, then it would inevitably be relaunched using On-Demand Instances, increasing costs and time-to-complete. The addition of Spot placement scores to the customer’s tooling allowed for more informed decisions about which configurations worked best and, more specifically, which AZ(s) to use. Ultimately, this led not only to higher confidence in using Spot instances, but also to significant cost savings over time.

Other customers tracked Spot placement scores over time with regular queries stored in time series databases to identify not only the best configuration or location, but also the best time-of-day or day-of-week to run their workloads. Different configurations of instance types were queried through automation and the results were logged into a time series database that could then be presented as graphs. These graphs were scrutinized, configurations were tuned, and ultimately these customers could take greater advantage of the cost optimization that Spot instances offer through fewer interruptions by running their workloads where and when scores were higher.

AWS was interested in how this solved problems for customers, and after some more research with customers and design ideation, led to the creation of an OSS tool that AWS has recently released: Spot placement score tracker. Spot placement score tracker helps customers evaluate different configurations against multiple times and locations. It’s an AWS-native solution that leverages the Spot placement score API along with AWS Lambda and Amazon CloudWatch to create a dashboard that enables any AWS customer to benefit from this model without having to write it themselves.

How to use the Spot placement score tracker

The project provides Infrastructure as Code (IaC) automation using the AWS Cloud Development Kit (AWS CDK) to deploy the infrastructure and permissions required to run Lambda. This gets executed every five minutes to collect the placement scores of as many diversified configurations as defined.

After installing the CloudWatch dashboard, and given some time to collect and record data, you will be provided valuable insights in an intuitive graph such as those in the following example.

Insights available through the Spot placement score tracker

The first thing you may notice by observing data over time is that instance diversification is the primary driver of high placement scores. This has always been a best practice for the use of Spot Instances, and it extends to On-Demand Instances as well. In short, if you can only run on one instance type, then the likelihood of experiencing interruptions is far greater than if you can run on six or twelve. Sometimes the simple inclusion of -a, -d, and -n instance types (e.g. m5.large, m5a.large, m5d.large, m5d.large), previous generations (e.g., m5.large, m4.large), different sizes in a container environment (e.g., m5.large, m5.xlarge, m5.2xlarge), and even the inclusion of AWS Graviton will have a material impact on placement scores, which equates to fewer interruptions. This ultimately leads to more efficient use of resources through less restarted processes, resulting in increased efficiency and reduced costs.

The second insight that you can realize through the use of placement scores over time is identifying the optimal AZ in which an ephemeral process can be placed. Perhaps the best use case for this type of insight is data analytics clusters that are launched to complete many calculations overnight. This is common in financial institutions for various reasons including risk analysis and compliance, but could apply to medical research examining results of experiments during the day as well as other situations where a 24/7 presence isn’t required by the workload. These customers are typically using a single AZ to allow for faster communication between nodes and to reduce data transfer costs. Therefore, the ability for Spot placement scores to provide different scores for different AZs is highly advantageous.

Third, with access to placement scores over time, it becomes possible to identify exactly how large a workload’s footprint can be. By submitting identical configurations to Spot placement scores but with different sizes, you can surface the ideal workload size. Not too small, where perhaps the job takes too long to complete, but also not so large that the interruptions are too frequent and cause restarts too often. This can benefit not only ephemeral workloads, but also persistent clusters or fleets by understanding what the lowest score would be over time and giving you solid information regarding what they can expect from Spot Instances and where. This might inform you to be ready to launch On-Demand Instances to compensate when Spot Instance availability is lower. This can also help to forecast pricing and inform decisions about the consideration of AWS Savings Plans or On-Demand Capacity Reservations.

Finally, analyzing Spot placement scores over time can provide regional scoring. Through this lens it’s possible for you to identify entire regions that they may have overlooked without the knowledge that Spot Instances outside the your primary region(s) might offer lower interruptions during daylight hours due to them being off-peak. When it’s possible to place a workload in another region, unconstrained by local data access requirements, it’s quite possible to harness the compute of a significant footprint in locations that are otherwise un(der)-utilized. Workloads that require less data transfer and more compute can benefit tremendously from access to Spot Instances in other regions. For example, things like build servers might run extraordinarily well in Europe during North American business hours and the reduction in compute cost might offset the data transfer to complete the job.

Conclusion

Spot placement scores can be used to make decisions about how, when, and where Spot Instances can be most efficiently utilized to deliver business needs, and at greatly reduced prices. We’re very excited to release this tool to enable you to tap into information which was previously unavailable and make data-driven decisions for your business. The information in this post, combined with the output of placement scores over time, is a significant evolution.

Install the Spot placement score tracker today, configure it to match an existing Spot workload, and see how you might perform at different times or different locations. Explore more robust options and discover greater capacity and lower interruptions. Or investigate how On-Demand workloads could migrate to Spot Instances.

10 ways to build applications faster with Amazon CodeWhisperer

2023-04-26 Kris Schultz

Post Syndicated from Kris Schultz original https://aws.amazon.com/blogs/devops/10-ways-to-build-applications-faster-with-amazon-codewhisperer/

Amazon CodeWhisperer is a powerful generative AI tool that gives me coding superpowers. Ever since I have incorporated CodeWhisperer into my workflow, I have become faster, smarter, and even more delighted when building applications. However, learning to use any generative AI tool effectively requires a beginner’s mindset and a willingness to embrace new ways of working.

Best practices for tapping into CodeWhisperer’s power are still emerging. But, as an early explorer, I’ve discovered several techniques that have allowed me to get the most out of this amazing tool. In this article, I’m excited to share these techniques with you, using practical examples to illustrate just how CodeWhisperer can enhance your programming workflow. I’ll explore:

Before we begin

If you would like to try these techniques for yourself, you will need to use a code editor with the AWS Toolkit extension installed. VS Code, AWS Cloud9, and most editors from JetBrains will work. Refer to the CodeWhisperer “Getting Started” resources for setup instructions.

CodeWhisperer will present suggestions automatically as you type. If you aren’t presented with a suggestion, you can always manually trigger a suggestion using the Option + C (Mac) or Alt + C (Windows) shortcut. CodeWhisperer will also sometimes present you with multiple suggestions to choose from. You can press the → and ← keys to cycle through all available suggestions.

The suggestions CodeWhisperer offers are non-deterministic, which means you may receive slightly different suggestions than the ones shown in this article. If you receive a suggestion that doesn’t suit your needs, try cycling through all available suggestions using the arrow keys or try rephrasing any code comment prompts slightly. Even punctuation changes can produce fresh suggestions.

The examples in this article are written in JavaScript and TypeScript. However, the techniques I’ll demonstrate are equally suited to many of the programming languages supported by CodeWhisperer. Now, onto the examples!

Typing less

At the most basic level, CodeWhisperer can be used as a supercharged auto-complete, significantly reducing your typing when writing individual lines or blocks of code. To use CodeWhisperer in this way, just write code as you normally would. CodeWhisperer will make automatic suggestions as you type. Accept suggestions when they’re helpful, and ignore suggestions when they’re not. In my experience, using CodeWhisperer in this way alone reduces my typing by 50% or more.

Generating functions

Automatically generating function bodies from signatures

Skilled programmers use clear, descriptive names for their functions and parameters. If you follow this practice in your own programming, CodeWhisperer will have all the context it needs to implement many functions for you.

After I enter the function signature below, CodeWhisperer generates the function body for me.

User input (JavaScript):

function standardDeviation(numbers)

Result:

function standardDeviation(numbers) {
  let mean = numbers.reduce((a, b) => a + b) / numbers.length;
  let variance =
    numbers.reduce((a, b) => a + Math.pow(b - mean, 2)) / numbers.length;
  return Math.sqrt(variance);
}

Generating functions from comments

Sometimes a good function signature alone isn’t enough to convey the desired behavior of a function. In those cases, you can use a detailed code comment as a prompt to CodeWhisperer. I recommend using your language’s doc comment delimiter when using this technique.

User Input (JavaScript):

/**
 * Given a list that contains some numbers and strings, format them into a string
 * in which the numbers are prepended with "#" and the strings are wrapped in
 * double quotes.
 */

Result:

/**
 * Given a list that contains some numbers and strings, format them into a string
 * in which the numbers are prepended with "#" and the strings are wrapped in
 * double quotes.
 */
function formatList(list) {
  return list
    .map((item) => {
      if (typeof item === "number") {
        return `#${item}`;
      } else {
        return `"${item}"`;
      }
    })
    .join(", ");
}

Generating classes

You can generate full or partial class implementations from a single code comment. The comment should describe the key properties and behaviors of the class. Simple descriptions are often all that is required.

User input (TypeScript):

/**
 * A folder has a name, children, and an optional parent. A child can't be added
 * twice.
 */

Result:

After I write the above block comment, CodeWhisperer provides multiple suggestions, including the following implementation, complete with documentation comments for each method!

/**
 * A folder has a name, children, and an optional parent. A child can't be added
 * twice.
 */
export class Folder {
  name: string;
  children: Folder[];
  parent?: Folder;

  constructor(name: string, parent?: Folder) {
    this.name = name;
    this.children = [];
    this.parent = parent;
  }

  /**
   * Add a child to the folder.
   * @param child The child to add.
   * @returns True if the child was added, false if it was already present.
   */
  addChild(child: Folder): boolean {
    if (this.children.includes(child)) {
      return false;
    }
    this.children.push(child);
    child.parent = this;
    return true;
  }

  /**
   * Remove a child from the folder.
   * @param child The child to remove.
   * @returns True if the child was removed, false if it wasn't present.
   */
  removeChild(child: Folder): boolean {
    const index = this.children.indexOf(child);
    if (index === -1) {
      return false;
    }
    this.children.splice(index, 1);
    child.parent = undefined;
    return true;
  }
}

Implementing algorithms

CodeWhisperer enables you to implement well-known algorithms with ease. Simply tell CodeWhisperer which algorithm you want to apply to a particular coding problem, and it will oblige.

User input (JavaScript):

const actors = [ ... ];

// Sort the actors by age using the merge sort algorithm.

Result:

const actors = [ ... ];

// Sort the actors by age using the merge sort algorithm.
function mergeSort(arr) {
  if (arr.length < 2) {
    return arr;
  }

  const middle = Math.floor(arr.length / 2);
  const left = arr.slice(0, middle);
  const right = arr.slice(middle);

  return merge(mergeSort(left), mergeSort(right));
}

function merge(left, right) {
  const result = [];

  while (left.length && right.length) {
    if (left[0].age < right[0].age) {
      result.push(left.shift());
    } else {
      result.push(right.shift());
    }
  }

  while (left.length) {
    result.push(left.shift());
  }

  while (right.length) {
    result.push(right.shift());
  }

  return result;
}

console.log(mergeSort(actors));

Writing unit tests

CodeWhisperer accelerates the implementation of unit tests. Not only is it able to implement the logic for most of the explicit unit tests I ask for, but it often suggests unit tests for edge cases that hadn’t even occurred to me.

When I want CodeWhisperer to implement a unit test, I start with a declaration of the condition I’m testing and the expected result as in the following example.

User input (JavaScript):

describe("Folder.addChild()", () => {
  it("when adding an existing folder should not add duplicate"

Result:

describe("Folder.addChild()", () => {
  it("when adding an existing folder should not add duplicate", () => {
    const folder = new Folder("root");
    const child = new Folder("child");
    folder.addChild(child);
    folder.addChild(child);
    expect(folder.children.length).toBe(1);
  });
});

After you’ve written a test or two, CodeWhisperer is likely to suggest more tests, even without explicit prompting.

Tip: You’ll notice that my test description includes 1) the function I’m testing, 2) the condition I’m testing, and 3) what I expect the result to be. Following this three-part pattern is a great way to ensure CodeWhisperer has the context it needs to implement high quality test logic.

Creating sample data

One of the delightful discoveries I’ve made is that CodeWhisperer can generate dummy data—or really any code that has a repeating pattern. As an illustration, for the following list I entered only the first “user” object as an example, and CodeWhisperer helpfully continued the pattern.

User input (JavaScript):

const fakeUsers = [
  { name: "User 1", id: "user1" },

Result:

const fakeUsers = [
  { name: "User 1", id: "user1" },
  { name: "User 2", id: "user2" },
  { name: "User 3", id: "user3" },  
  { name: "User 4", id: "user4" },
  { name: "User 5", id: "user5" },
  { name: "User 6", id: "user6" }
]

It even works for generating random but plausible values like the names seen below.

User input (JavaScript):

const fakeUsers = [
  { name: "Sally", id: "user1" },

Result:

const fakeUsers = [
  { name: "Sally", id: "user1" },
  { name: "Bob", id: "user2" },
  { name: "Joe", id: "user3" },
  { name: "Jane", id: "user4" },
  { name: "Jill", id: "user5" },
  { name: "Jack", id: "user6" }
]

Simplifying regular expressions

While CodeWhisperer can’t translate natural language into custom regular expressions (yet), it does know a lot of commonly used regular expressions. This saves you time by putting these regular expressions at your fingertips whenever you need them.

Let’s look at one notoriously complex but common regular expression.

User input (JavaScript):

/**
 * Function that validates whether a string is a valid email address.
 */

Result:

/**
 * Function that validates whether a string is a valid email address.
 */
function validateEmail(email: string): boolean {
  const re =
    /^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
  return re.test(String(email).toLowerCase());
}

Learning third-party code libraries faster

CodeWhisperer helps accelerate my learning when I’m using an unfamiliar library, module, or framework. In the example below, I needed to write a utility function that would take a gzip compressed value coming from an API and translate it into a value that the rest of my code could understand. I knew the “pako” library would help me, but I had no idea how to use it. I just prompted CodeWhisperer with an import statement and a comment describing my goal.

import pako from "pako";

/**
 * Accepts a base64-encoded, gzip-compressed JSON string and returns
 * a parsed JSON object.
 */

Result:

import pako from "pako";

/**
 * Accepts a base64-encoded, gzip-compressed JSON string and returns
 * a parsed JSON object.
 */
export function decodeGzipJson(base64String) {
  const compressed = atob(base64String);
  const uncompressed = pako.inflate(compressed, { to: "string" });
  return JSON.parse(uncompressed);
}

Documenting code

CodeWhisperer is capable of generating docstrings and comments for the code it generates, as well as for your existing code. For example, let’s say I want CodeWhisperer to document the matches() method of this FavoritesFilter TypeScript class I’ve implemented (I’ve omitted some implementation details for brevity).

class FavoritesFilter implements IAssetFilter {
  ...
  matches(asset: Asset): boolean {
    ...
  }
}

I can just type a doc comment delimiter (/** */) immediately above the method name and CodeWhisperer will generate the body of the doc comment for me.

Note: When using CodeWhisperer in this way you may have to manually trigger a suggestion using Option + C (Mac) or Alt + C (Windows).

class FavoritesFilter implements IAssetFilter {
  ...
  /**
   * Determines whether the asset matches the filter.
   */
  matches(asset: Asset): boolean {
    ...
  }
}

Conclusion

I hope the techniques above inspire ideas for how CodeWhisperer can make you a more productive coder. Install CodeWhisperer today to start using these time-saving techniques in your own projects. These examples only scratch the surface. As additional creative minds start applying CodeWhisperer to their daily workflows, I’m sure new techniques and best practices will continue to emerge. If you discover a novel approach that you find useful, post a comment to share what you’ve discovered. Perhaps your technique will make it into a future article and help others in the CodeWhisperer community enhance their superpowers.

Protect your Amazon Cognito user pool with AWS WAF

2023-04-21 Maitreya Ranganath

Post Syndicated from Maitreya Ranganath original https://aws.amazon.com/blogs/security/protect-your-amazon-cognito-user-pool-with-aws-waf/

Many of our customers use Amazon Cognito user pools to add authentication, authorization, and user management capabilities to their web and mobile applications. You can enable the built-in advanced security in Amazon Cognito to detect and block the use of credentials that have been compromised elsewhere, and to detect unusual sign-in activity and then prompt users for additional verification or block sign-ins. Additionally, you can associate an AWS WAF web access control list (web ACL) with your user pool to allow or block requests to Amazon Cognito user pools, based on security rules.

In this post, we’ll show how you can use AWS WAF with Amazon Cognito user pools and provide a sample set of rate-based rules and advanced AWS WAF rule groups. We’ll also show you how to test and tune the rules to help protect your user pools from common threats.

Rate-based rules for Amazon Cognito user pool endpoints

The following are endpoints exposed publicly by an Amazon Cognito user pool that you can protect with AWS WAF:

Hosted UI — These endpoints are listed in the OIDC and hosted UI API reference. Cognito creates these endpoints when you assign a domain to your user pool. Your users will interact with these endpoints when they use the Hosted UI web interface directly, or when your application calls Cognito OAuth endpoints such as Authorize or Token.
Public API operations — These generate a request to Cognito API actions that are either unauthenticated or authenticated with a session string or access token, but not with AWS credentials.

A good way to protect these endpoints is to deploy rate-based AWS WAF rules. These rules will detect and block requests with high rates that could indicate an attempt to exceed your Amazon Cognito API request rate quotas and that could subsequently impact requests from legitimate users.

When you apply rate limits, it helps to group Amazon Cognito API actions into four action categories. You can set specific rate limits per action category giving you traffic visibility for each category.

User Creation — This category includes operations that create new users in Cognito. Setting a rate limit for this category provides visibility for traffic of these operations and threats such as fake users being created in Cognito, which drives up your Monthly Active User (MAU) costs for Cognito.
Sign-in — This category includes operations to initiate a sign-in operation. Setting a rate limit for this category can provide visibility into the abuse of these operations. This could indicate high frequency, automated attempts to guess user credentials, sometimes referred to as credential stuffing.
Account Recovery — This category includes operations to recover accounts, including “forgot password” flows. Setting a rate limit for this category can provide visibility into the abuse of these operations, malicious activity can include: sending fake reset attempts, which might result in emails and SMS messages being sent to users.
Default — This is a catch-all rate limit that applies to an operation that is not in one of the prior categories. Setting a default rate limit can provide visibility and mitigation from request flooding attacks.

Table 1 below shows selected Hosted UI endpoint paths (the equivalent of individual API actions) and the recommended rate-based rule limit category for each.

Table 1: Amazon Cognito Hosted UI URL paths mapped to action categories

Hosted UI URL path	Authentication method	Action category
/signup	Unauthenticated	User Creation
/confirmUser	Confirmation code	User Creation
/resendcode	Unauthenticated	User Creation
/login	Unauthenticated	Sign-in
/oauth2/authorize	Unauthenticated	Sign-in
/forgotPassword	Unauthenticated	Account Recovery
/confirmForgotPassword	Confirmation code	Account Recovery
/logout	Unauthenticated	Default
/oauth2/revoke	Refresh token	Default
/oauth2/token	Auth code, or refresh token, or client credentials	Default
/oauth2/userInfo	Access token	Default
/oauth2/idpresponse	Authorization code	Default
/saml2/idpresponse	SAML assertion	Default

Table 2 below shows selected Cognito API actions and the recommended rate-based rule category for each.

Table 2: Selected Cognito API actions mapped to action categories

API action name	Authentication method	Action category
SignUp	Unauthenticated	User Creation
ConfirmSignUp	Confirmation code	User Creation
ResendConfirmationCode	Unauthenticated	User Creation
InitiateAuth	Unauthenticated	Sign-in
RespondToAuthChallenge	Unauthenticated	Sign-in
ForgotPassword	Unauthenticated	Account Recovery
ConfirmForgotPassword	Confirmation code	Account Recovery
AssociateSoftwareToken	Access token or session	Default
VerifySoftwareToken	Access token or session	Default

Additionally, the rate-based rules we provide in this post include the following:

Two IP sets that represent allow lists for IPv4 and IPv6. You can add IPs that represent your trusted source IP addresses to these IP sets so that other AWS WAF rules don’t apply to requests that originate from these IP addresses.
Two IP sets that represent deny lists for IPv4 and IPv6. Add IPs to these IP sets that you want to block in all cases, regardless of the result of other rules.
An AWS managed IP reputation rule group: The AWS managed IP reputation list rule group contains rules that are based on Amazon internal threat intelligence, to identify IP addresses typically associated with bots or other threats. You can limit requests that match rules in this rule group to a specific rate limit.

Deploy rate-based rules

You can deploy the rate-based rules described in the previous section by using the AWS CloudFormation template that we provide here.

To deploy rate-based rules using the template

(Optional but recommended) If you want to enable AWS WAF logging and resources to analyze request rates, create an Amazon Simple Storage Service (Amazon S3) bucket in the same AWS Region as your Amazon Cognito user pool, with a bucket name starting with the prefix aws-waf-logs-. If you previously created an S3 bucket for AWS WAF logs, you can choose to reuse it, or you can create a new bucket to store AWS WAF logs for Amazon Cognito.
Choose the following Launch Stack button to launch a CloudFormation stack in your account.

Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template and deploy it to the selected Region.

This template creates the following resources in your AWS account:
- A rule group for the rate-based rules, according to the limits shown in Tables 1 and 2.
- Four IP sets for an allow list and deny list for IPv4 and IPv6 addresses.
- A web ACL that includes the rule group that is created, IP set based rules, and the AWS managed IP reputation rule group.
- (Optional) The template enables AWS WAF logging for the web ACL to an S3 bucket that you specify.
- (Optional) The template creates resources to help you analyze AWS WAF logs in S3 to calculate peak request rates that you can use to set rate limits for the rate-based rules.

Set the template parameters as needed. The following table shows the default values for the parameters. We recommend that you deploy the template with the default values and with TestMode set to Yes so that all rules are set to Count. This allows all requests but emits Amazon CloudWatch metrics and AWS WAF log events for each rule that matches. You can then follow the guidance in the next section to analyze the logs and tune the rate limits to match the traffic patterns to your user pool. When you are satisfied with the unique rate limits for each parameter, you can update the stack and set TestMode to No to start blocking requests that exceed the rate limits.

The rate limits for AWS WAF rate-based rules are configured as the number of requests per 5-minute period per unique source IP. The value of the rate limit can be between 100 and 2,000,000,000 (2 billion).

Table 3: Default values for template parameters

Parameter name	Description	Default value	Allowed values
Request rate limits by action category
UserCreationRateLimit	Rate limit applied to User Creation actions	2000	100–2,000,000,000
SignInRateLimit	Rate limit applied to Sign-in actions	4000	100–2,000,000,000
AccountRecoveryRateLimit	Rate limit applied to Account Recovery actions	1000	100–2,000,000,000
IPReputationRateLimit	Rate limit applied to requests that match the AWS Managed IP reputation list	1000	100–2,000,000,000
DefaultRateLimit	Default rate limit applied to actions that are not in any of the prior categories	6000	100–2,000,000,000
Test mode
TestMode	Set to Yes to test rules by overriding rule actions to Count. Set to No to apply the default actions for rules after you’ve tested the impact of these rules.	Yes	Yes or No
AWS WAF logging and rate analysis
EnableWAFLogsAndRateAnalysis	Set to Yes to enable logging for the AWS WAF web ACL to an S3 bucket and create resources for request rate analysis. Set to No to disable AWS WAF logging and skip creating resources for rate analysis. If No, the rest of the parameter values in this section are ignored. If Yes, choose values for the rest of the parameters in this section.	Yes	Yes or No
WAFLogsS3Bucket	The name of an existing S3 bucket where AWS WAF logs are delivered. The bucket name must start with aws-waf-logs- and can end with any suffix. Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	None	Name of an existing S3 bucket that starts with the prefix aws-waf-logs-
DatabaseName	The name of the AWS Glue database to create, which will contain the request rate analysis tables created by this template. (Important: The name cannot contain hyphens.) Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	rate_analysis
WorkgroupName	The name of the Amazon Athena workgroup to create for rate analysis. Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	rate_analysis
WAFLogsTableName	The name of the AWS Glue table for AWS WAF logs. Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.		waf_logs
WAFLogsProjectionStartDate	The earliest date to analyze AWS WAF logs, in the format YYYY/MM/DD (example: 2023/02/28). Only used if the parameter EnableWAFLogsAndRateAnalysis is set to Yes.	None	Set this to the current date, in the format YYYY/MM/DD

Wait for the CloudFormation template to be created successfully.
Go to the AWS WAF console and choose the web ACL created by the template. It will have a name ending with CognitoWebACL.
Choose the Associated AWS resources tab, and then choose Add AWS resource.
For Resource type, choose Amazon Cognito user pool, and then select the Amazon Cognito user pools that you want to protect with this web ACL.
Choose Add.

Now that your user pool is being protected by the rate-based rules in the web ACL you created, you can proceed to tune the rate-based rule limits by analyzing AWS WAF logs.

Tune AWS WAF rate-based rule limits

As described in the previous section, the rate-based rules give you the ability to set separate rate limit values for each category of Amazon Cognito API actions.

Although the CloudFormation template has default starting values for these rate limits, it is important that you tune these values to match the traffic patterns for your user pool. To begin the tuning process, deploy the template with default values for all parameters, including Yes for TestMode. This overrides all rule actions to Count, allowing all requests but emitting CloudWatch metrics and AWS WAF log events for each rule that matches.

After you collect AWS WAF logs for a period of time (this period can vary depending on your traffic, from a couple of hours to a couple of days), you can analyze them, as shown in the next section, to get peak request rates to tune the rate limits to match observed traffic patterns for your user pool.

Query AWS WAF logs to calculate peak request rates by request type

You can calculate peak request rates by analyzing information that is present in AWS WAF logs. One way to analyze these is to send AWS WAF logs to S3 and to analyze the logs by using SQL queries in Amazon Athena. If you deploy the template in this post with default values, it creates the resources you need to analyze AWS WAF logs in S3 to calculate peak requests rates by request type.

If you are instead ingesting AWS WAF logs into your security information and event management (SIEM) system or a different analytics environment, you can create equivalent queries by using the query language for your SIEM or analytics environment to get similar results.

To access and edit the queries built by the CloudFormation template for use

Open the Athena console and switch to the Athena workgroup that was created by the template (the default name is rate_analysis).

On the Saved queries tab, choose the query named Peak request rate per 5-minute period by source IP and request category. The following SQL query will be loaded into the edit panel.

-- Gets the top 5 source IPs sending the most requests in a 5-minute period per request category
‐‐ NOTE: change the start and end timestamps to match the duration of interest
SELECT request_category, from_unixtime(time_bin*60*5) AS date_time, client_ip, request_count FROM (
  SELECT *, row_number() OVER (PARTITION BY request_category ORDER BY request_count DESC, time_bin DESC) AS row_num FROM (
    SELECT
      CASE
        WHEN ip_reputation_labels.name IN (
          'awswaf:managed:aws:amazon-ip-list:AWSManagedIPReputationList',
          'awswaf:managed:aws:amazon-ip-list:AWSManagedReconnaissanceList',
          'awswaf:managed:aws:amazon-ip-list:AWSManagedIPDDoSList'
        ) THEN 'IPReputation'
        WHEN target.value IN (
          'AWSCognitoIdentityProviderService.InitiateAuth',
          'AWSCognitoIdentityProviderService.RespondToAuthChallenge'
        ) THEN 'SignIn'
        WHEN target.value IN (
          'AWSCognitoIdentityProviderService.ResendConfirmationCode',
          'AWSCognitoIdentityProviderService.SignUp',
          'AWSCognitoIdentityProviderService.ConfirmSignUp'
        ) THEN 'UserCreation'
        WHEN target.value IN (
          'AWSCognitoIdentityProviderService.ForgotPassword',
          'AWSCognitoIdentityProviderService.ConfirmForgotPassword'
        ) THEN 'AccountRecovery'
        WHEN httprequest.uri IN (
          '/login',
          '/oauth2/authorize'
        ) THEN 'SignIn'
        WHEN httprequest.uri IN (
          '/signup',
          '/confirmUser',
          '/resendcode'
        ) THEN 'UserCreation'
        WHEN  httprequest.uri IN (
          '/forgotPassword',
          '/confirmForgotPassword'
        ) THEN 'AccountRecovery'
        ELSE 'Default'
      END AS request_category,
      httprequest.clientip AS client_ip,
      FLOOR("timestamp"/(1000*60*5)) AS time_bin,
      COUNT(*) AS request_count
    FROM waf_logs
      LEFT OUTER JOIN UNNEST(FILTER(httprequest.headers, h -> h.name = 'x-amz-target')) AS t(target) ON TRUE
      LEFT OUTER JOIN UNNEST(FILTER(labels, l -> l.name like 'awswaf:managed:aws:amazon-ip-list:%')) AS t(ip_reputation_labels) ON TRUE
    WHERE
      from_unixtime("timestamp"/1000) BETWEEN TIMESTAMP '2022-01-01 00:00:00' AND TIMESTAMP '2023-01-01 00:00:00'
    GROUP BY 1, 2, 3
    ORDER BY 1, 4 DESC
  )
) WHERE row_num <= 5 ORDER BY request_category ASC, row_num ASC

Scroll down to Line 48 in the Query Editor and edit the timestamps to match the start and end time of the time window of interest.
Run the query to calculate the top 5 peak request rates per 5-minute period by source IP and by action category.

The results show the action category, source IP, time, and count of requests. You can use the request count to tune the rate limits for each action category.

The lowest rate limit you can set for AWS WAF rate-based rules is 100 requests per 5-minute period. If your query results show that the peak request count is less than 100, set the rate limit as 100 or higher.

After you have tuned the rate limits, you can apply the changes to your web ACL by updating the CloudFormation stack.

To update the CloudFormation stack

On the CloudFormation console, choose the stack you created earlier.
Choose Update. For Prepare template, choose Use current template, and then choose Next.
Update the values of the parameters with rate limits to match the tuned values from your analysis.
You can choose to enable blocking of requests by setting TestMode to No. This will set the action to Block for the rate-based rules in the web ACL and start blocking traffic that exceeds the rate limits you have chosen.
Choose Next and then Next again to update the stack.

Now the rate-based rules are updated with your tuned limits, and requests will be blocked if you set TestMode to No.

Protect endpoints with user interaction

Now that we’ve covered the bases with rate-based rules, we’ll show you some more advanced AWS WAF rules that further help protect your user pool. We’ll explore two sample scenarios in detail, and provide AWS WAF rules for each. You can use the rules provided as a guideline to build others that can help with similar use cases.

Rules to verify human activity

The first scenario is protecting endpoints where users have interaction with the page. This will be a browser-based interaction, and a human is expected to be behind the keyboard. This scenario applies to the Hosted UI endpoints such as /login, /signup, and /forgotPassword, where a CAPTCHA can be rendered on the user’s browser for the user to solve. Let’s take the login (sign-in) endpoint as an example, and imagine you want to make sure that only actual human users are attempting to sign in and you want to block bots that might try to guess passwords.

To illustrate how to protect this endpoint with AWS WAF, we’re sharing a sample rule, shown in Figure 1. In this rule, you can take input from prior rules like the Amazon IP reputation list or the Anonymous IP list (which are configured to Count requests and add labels) and combine that with a CAPTCHA action. The logic of the rule says that if the request matches the reputation rules (and has received the corresponding labels) and is going to the /login endpoint, then the AWS WAF action should be to respond with a CAPTCHA challenge. This will present a challenge that increases the confidence that a human is performing the action, and it also adds a custom label so you can efficiently identify and have metrics on how many requests were matched by this rule. The rule is provided in the CloudFormation template and is in JSON format, because it has advanced logic that cannot be displayed by the console. Learn more about labels and CAPTCHA actions in the AWS WAF documentation.

Figure 1: Login sample rule flow

Note that the rate-based rules you created in the previous section are evaluated before the advanced rules. The rate-based rules will block requests to the /login endpoint that exceed the rate limit you have configured, while this advanced rule will match requests that are below the rate limit but match the other conditions in the rule.

Rules for specific activity

The second scenario explores activity on specific application clients within the user pool. You can spot this activity by monitoring the logs provided by AWS WAF, or other traffic logs like Application Load Balancer (ALB) logs. The application client information is provided in the call to the service.

In the Amazon Cognito user pool in this scenario, we have different application clients and they’re constrained by geography. For example, for one of the application clients, requests are expected to come from the United States at or below a certain rate. We can create a rule that combines the rate and geographical criteria to block requests that don’t meet the conditions defined.

The flow of this rule is shown in Figure 2. The logic of the rule will evaluate the application client information provided in the request and the geographic information identified by the service, and apply the selected rate limit. If blocked, the rule will provide a custom response code by using HTTP code 429 Too Many Requests, which can help the sender understand the reason for the block. For requests that you make with the Amazon Cognito API, you could also customize the response body of a request that receives a Block response. Adding a custom response helps provide the sender context and adjust the rate or information that is sent.

Figure 2: AppClientId sample rule flow

AWS WAF can detect geo location with Region accuracy and add specific labels for the location. These can then be used in other rule evaluations. This rule is also provided as a sample in the CloudFormation template.

Advanced protections

To build on the rules we’ve shared so far, you can consider using some of the other intelligent threat mitigation rules that are available as managed rules—namely, bot control for common or targeted bots. These rules offer advanced capabilities to detect bots in sensitive endpoints where automation or non-browser user agents are not expected or allowed. If you receive machine traffic to the endpoint, these rules will result in false positives that would need to be tuned. For more information, see Options for intelligent threat mitigation.

The sample rule flow in Figure 3 shows an example for our Hosted UI, which builds on the first rule we built for specific activity and adds signals coming from the Bot Control common bots managed rule, in this case the non-browser-user-agent label.

Figure 3: Login sample rule with advanced protections

Adding the bot detection label will also add accuracy to the evaluation, because AWS WAF will consider multiple different sources of information when analyzing the request. This can also block attacks that come from a small set of IPs or easily recognizable bots.

We’ve shared this rule in the CloudFormation template sample. The rule requires you to add AWS WAF Bot Control (ABC) before the custom rule evaluation. ABC has additional costs associated with it and should only be used for specific use cases. For more information on ABC and how to enable it, see this blog post.

After adding these protections, we have a complete set of rules for our Hosted UI–specific needs; consider that your traffic and needs might be different. Figure 4 shows you what the rule priority looks like. All rules except the last are included in the provided CloudFormation template. Managed rule evaluations need to have higher priority and be in Count mode; this way, a matching request can get labels that can be evaluated further down the priority list by using the custom rules that were created. For more information, see How labeling works.

Figure 4: Summary of the rules discussed in this post

Conclusion

In this post, we examined the different protections provided by the integration between AWS WAF and Amazon Cognito. This integration makes it simpler for you to view and monitor the activity in the different Amazon Cognito endpoints and APIs, while also adding rate-based rules and IP reputation evaluations. For more specific use cases and advanced protections, we provided sample custom rules that use labels, as well as an advanced rule that uses bot control for common bots. You can use these advanced rules as examples to create similar rules that apply to your use cases.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the re:Post with tag AWS WAF or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Use IAM roles to connect GitHub Actions to actions in AWS

2023-04-20 David Rowe

Post Syndicated from David Rowe original https://aws.amazon.com/blogs/security/use-iam-roles-to-connect-github-actions-to-actions-in-aws/

Have you ever wanted to initiate change in an Amazon Web Services (AWS) account after you update a GitHub repository, or deploy updates in an AWS application after you merge a commit, without the use of AWS Identity and Access Management (IAM) user access keys? If you configure an OpenID Connect (OIDC) identity provider (IdP) inside an AWS account, you can use IAM roles and short-term credentials, which removes the need for IAM user access keys.

In this blog post, we will walk you through the steps needed to configure a specific GitHub repo to assume an individual role in an AWS account to preform changes. You will learn how to create an OIDC-trusted connection that is scoped to an individual GitHub repository, and how to map the repository to an IAM role in your account. You will create the OIDC connection, IAM role, and trust relationship two ways: with the AWS Management Console and with the AWS Command Line Interface (AWS CLI).

This post focuses on creating an IAM OIDC identity provider for GitHub and demonstrates how to authorize access into an AWS account from a specific branch and repository. You can use OIDC IdPs for workflows that support the OpenID Connect standard, such as Google or Salesforce.

Prerequisites

To follow along with this blog post, you should have the following prerequisites in place:

The ability to create a new GitHub Actions workflow file in the .github/workflows/ directory under a branch of a GitHub repository
An AWS account
Access to the following IAM permissions in the account:
- Must be able to create an OpenID Connect IdP
- Must be able to create an IAM role and attach a policy

Solution overview

GitHub is an external provider that is independent from AWS. To use GitHub as an OIDC IdP, you will need to complete four steps to access AWS resources from your GitHub repository. Then, for the fifth and final step, you will use AWS CloudTrail to audit the role that you created and used in steps 1–4.

Create an OIDC provider in your AWS account. This is a trust relationship that allows GitHub to authenticate and be authorized to perform actions in your account.
Create an IAM role in your account. You will then scope the IAM role’s trust relationship to the intended parts of your GitHub organization, repository, and branch for GitHub to assume and perform specific actions.
Assign a minimum level of permissions to the role.
Create a GitHub Actions workflow file in your repository that can invoke actions in your account.
Audit the role’s use with Amazon CloudTrail logs.

Step 1: Create an OIDC provider in your account

The first step in this process is to create an OIDC provider which you will use in the trust policy for the IAM role used in this action.

To create an OIDC provider for GitHub (console):

Open the IAM console.
In the left navigation menu, choose Identity providers.
In the Identity providers pane, choose Add provider.
For Provider type, choose OpenID Connect.
For Provider URL, enter the URL of the GitHub OIDC IdP for this solution: https://token.actions.GitHubusercontent.com.
Choose Get thumbprint to verify the server certificate of your IdP. To learn more about OIDC thumbprints, see Obtaining the thumbprint for an OpenID Connect Identity Provider.
For Audience, enter sts.amazonaws.com. This will allow the AWS Security Token Service (AWS STS) API to be called by this IdP.
(Optional) For Add tags, you can add key–value pairs to help you identify and organize your IdPs. To learn more about tagging IAM OIDC IdPs, see Tagging OpenID Connect (OIDC) IdPs.
Verify the information that you entered. Your console should match the screenshot in Figure 1. After verification, choose Add provider.

Note: Each provider is a one-to-one relationship to an external IdP. If you want to add more IdPs to your account, you can repeat this process.

Figure 1: Steps to configure the identity provider
Once you are taken back to the Identity providers page, you will see your new IdP as shown in Figure 2. Select your provider to view its properties, and make note of the Amazon Resource Name (ARN). You will use the ARN later in this post. The ARN will look similar to the following:
arn:aws:iam::111122223333:oidc-provider/token.actions.GitHubusercontent.com

Figure 2: View your identity provider

To create an OIDC provider for GitHub (AWS CLI):

You can add GitHub as an IdP in your account with a single AWS CLI command. The following code will perform the previous steps outlined for the console, with the same results. For the value —thumbprint-list, you will use the GitHub OIDC thumbprint 938fd4d98bab03faadb97b34396831e3780aea1.

aws iam create-open-id-connect-provider --url 
"https://token.actions.GitHubusercontent.com" --thumbprint-list 
"6938fd4d98bab03faadb97b34396831e3780aea1" --client-id-list 
'sts.amazonaws.com'

To learn more about the GitHub thumbprint, see GitHub Actions – Update on OIDC based deployments to AWS. At the time of publication, this thumbprint is correct.

Both of the preceding methods will add an IdP in your account. You can view the provider on the Identity providers page in the IAM console.

Step 2: Create an IAM role and scope the trust policy

You can create an IAM role with either the IAM console or the AWS CLI. If you choose to create the IAM role with the AWS CLI, you will scope the Trust Relationship Policy before you create the role.

The procedure to create the IAM role and to scope the trust policy come from the AWS Identity and Access Management User Guide. For detailed instructions on how to configure a role, see How to Configure a Role for GitHub OIDC Identity Provider.

To create the IAM role (IAM console):

In the IAM console, on the Identity providers screen, choose the Assign role button for the newly created IdP.

Figure 3: Assign a role to the identity provider
In the Assign role for box, choose Create a new role, and then choose Next, as shown in the following figure.

Figure 4: Create a role from the Identity provider page
The Create role page presents you with a few options. Web identity is already selected as the trusted entity, and the Identity provider field is populated with your IdP. In the Audience list, select sts.amazonaws.com, and then choose Next.
On the Permissions page, choose Next. For this demo, you won’t add permissions to the role.
If you’d like to test other actions, like AWS CodeBuild operations, you can add permissions as outlined by these blog posts: Complete CI/CD with AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline or Techniques for writing least privilege IAM policies.
(Optional) On the Tags page, add tags to this new role, and then choose Next: Review.
On the Create role page, add a role name. For this demo, enter GitHubAction-AssumeRoleWithAction. Optionally add a description.
To create the role, choose Create role.

Next, you’ll scope the IAM role’s trust policy to a single GitHub organization, repository, and branch.

To scope the trust policy (IAM console)

In the IAM console, open the newly created role and choose Edit trust relationship.

On the Edit trust policy page, modify the trust policy to allow your unique GitHub organization, repository, and branch to assume the role. This example trusts the GitHub organization <aws-samples>, the repository named <EXAMPLEREPO>, and the branch named <ExampleBranch>. Update the Federated ARN with the GitHub IdP ARN that you copied previously.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "<arn:aws:iam::111122223333:oidc-provider/token.actions.githubusercontent.com>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "token.actions.githubusercontent.com:sub": "repo: <aws-samples/EXAMPLEREPO>:ref:refs/heads/<ExampleBranch>",
                    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
                }
            }
        }
    ]
}

To create a role (AWS CLI)

In the AWS CLI, use the example trust policy shown above for the console. This policy is designed to limit access to a defined GitHub organization, repository, and branch.

Create and save a JSON file with the example policy to your local computer with the file name trustpolicyforGitHubOIDC.json.

Run the following command to create the role.

aws iam create-role --role-name GitHubAction-AssumeRoleWithAction --assume-role-policy-document file://C:\policies\trustpolicyforGitHubOIDC.json

For more details on how to create an OIDC role with the AWS CLI, see Creating a role for federated access (AWS CLI).

Step 3: Assign a minimum level of permissions to the role

For this example, you won’t add permissions to the IAM role, but will assume the role and call STS GetCallerIdentity to demonstrate a GitHub action that assumes the AWS role.

If you’re interested in performing additional actions in your account, you can add permissions to the role you created, GitHubAction-AssumeRoleWithAction. Common actions for workflows include calling AWS Lambda functions or pushing files to an Amazon Simple Storage Service (Amazon S3) bucket. For more information about using IAM to apply permissions, see Policies and permissions in IAM.

If you’d like to do a test, you can add permissions as outlined by these blog posts: Complete CI/CD with AWS CodeCommit, AWS CodeBuild, AWS CodeDeploy, and AWS CodePipeline or Techniques for writing least privilege IAM policies.

Step 4: Create a GitHub action to invoke the AWS CLI

GitHub actions are defined as methods that you can use to automate, customize, and run your software development workflows in GitHub. The GitHub action that you create will authenticate into your account as the role that was created in Step 2: Create the IAM role and scope the trust policy.

To create a GitHub action to invoke the AWS CLI:

Create a basic workflow file, such as main.yml, in the .github/workflows directory of your repository. This sample workflow will assume the GitHubAction-AssumeRoleWithAction role, to perform the action aws sts get-caller-identity. Your repository can have multiple workflows, each performing different sets of tasks. After GitHub is authenticated to the role with the workflow, you can use AWS CLI commands in your account.

Paste the following example workflow into the file.

# This is a basic workflow to help you get started with Actions
name:Connect to an AWS role from a GitHub repository

# Controls when the action will run. Invokes the workflow on push events but only for the main branch
on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

env:
  
  AWS_REGION : <"us-east-1"> #Change to reflect your Region

# Permission can be added at job level or workflow level    
permissions:
      id-token: write   # This is required for requesting the JWT
      contents: read    # This is required for actions/checkout
jobs:
  AssumeRoleAndCallIdentity:
    runs-on: ubuntu-latest
    steps:
      - name: Git clone the repository
        uses: actions/checkout@v3
      - name: configure aws credentials
        uses: aws-actions/[email protected]
        with:
          role-to-assume: <arn:aws:iam::111122223333:role/GitHubAction-AssumeRoleWithAction> #change to reflect your IAM role’s ARN
          role-session-name: GitHub_to_AWS_via_FederatedOIDC
          aws-region: ${{ env.AWS_REGION }}
      # Hello from AWS: WhoAmI
      - name: Sts GetCallerIdentity
        run: |
          aws sts get-caller-identity

Modify the workflow to reflect your AWS account information:
- AWS_REGION: Enter the AWS Region for your AWS resources.
- role-to-assume: Replace the ARN with the ARN of the AWS GitHubAction role that you created previously.

In the example workflow, if there is a push or pull on the repository’s “main” branch, the action that you just created will be invoked.

Figure 5 shows the workflow steps in which GitHub does the following:

Authenticates to the IAM role with the OIDC IdP in the Region that was defined in the workflow file in the step configure aws credentials.
Calls aws sts get-caller-identity in the step Hello from AWS. WhoAmI… Run AWS CLI sts GetCallerIdentity.

Figure 5: Results of GitHub action

Step 5: Audit the role usage: Query CloudTrail logs

The final step is to view the AWS CloudTrail logs in your account to audit the use of this role.

To view the event logs for the GitHub action:

In the AWS Management Console, open CloudTrail and choose Event History.
In the Lookup attributes list, choose Event source.
In the search bar, enter sts.amazonaws.com.

Figure 6: Find event history in CloudTrail
You should see the GetCallerIdentity and AssumeRoleWithWebIdentity events, as shown in Figure 6. The GetCallerIdentity event is the Hello from AWS. step in the GitHub workflow file. This event shows the workflow as it calls aws sts get-caller-identity. The AssumeRoleWithWebIdentity event shows GitHub authenticating and assuming your IAM role GitHubAction-AssumeRoleWithAction.

You can also view one event at a time.

To view the AWS CLI GetCallerIdentity event:

In the Lookup attributes list, choose User name.
In the search bar, enter the role-session-name, defined in the workflow file in your repository. This is not the IAM role name, because this role-session-name is defined in line 30 of the workflow example. In the workflow example for this blog post, the role-session-name is GitHub_to_AWS_via_FederatedOIDC.
You can now see the first event in the CloudTrail history.

Figure 7: View the get caller identity in CloudTrail

To view the AssumeRoleWithWebIdentity event

In the Lookup attributes list, choose User name.
In the search bar, enter the GitHub organization, repository, and branch that is defined in the IAM role’s trust policy. In the example outlined earlier, the user name is repo:aws-samples/EXAMPLE:ref:refs/heads/main.
You can now see the individual event in the CloudTrail history.

Figure 8: View the assume role call in CloudTrail

Conclusion

When you use IAM roles with OIDC identity providers, you have a trusted way to provide access to your AWS resources. GitHub and other OIDC providers can generate temporary security credentials to update resources and infrastructure inside your accounts.

In this post, you learned how to use the federated access to assume a role inside AWS directly from a workflow action file in a GitHub repository. With this new IdP in place, you can begin to delete AWS access keys from your IAM users and use short-term credentials.

After you read this post, we recommend that you follow the AWS Well Architected Security Pillar IAM directive to use programmatic access to AWS services using temporary and limited-privilege credentials. If you deploy IAM federated roles instead of AWS user access keys, you follow this guideline and issue tokens by the AWS Security Token Service. If you have feedback on this post, leave a comment below and let us know how you would like to see OIDC workflows expanded to help your IAM needs.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Optimizing GPU utilization for AI/ML workloads on Amazon EC2

2023-04-18 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/optimizing-gpu-utilization-for-ai-ml-workloads-on-amazon-ec2/

This blog post is written by Ben Minahan, DevOps Consultant, and Amir Sotoodeh, Machine Learning Engineer.

Machine learning workloads can be costly, and artificial intelligence/machine learning (AI/ML) teams can have a difficult time tracking and maintaining efficient resource utilization. ML workloads often utilize GPUs extensively, so typical application performance metrics such as CPU, memory, and disk usage don’t paint the full picture when it comes to system performance. Additionally, data scientists conduct long-running experiments and model training activities on existing compute instances that fit their unique specifications. Forcing these experiments to be run on newly provisioned infrastructure with proper monitoring systems installed might not be a viable option.

In this post, we describe how to track GPU utilization across all of your AI/ML workloads and enable accurate capacity planning without needing teams to use a custom Amazon Machine Image (AMI) or to re-deploy their existing infrastructure. You can use Amazon CloudWatch to track GPU utilization, and leverage AWS Systems Manager Run Command to install and configure the agent across your existing fleet of GPU-enabled instances.

Overview

First, make sure that your existing Amazon Elastic Compute Cloud (Amazon EC2) instances have the Systems Manager Agent installed, and also have the appropriate level of AWS Identity and Access Management (IAM) permissions to run the Amazon CloudWatch Agent. Next, specify the configuration for the CloudWatch Agent in Systems Manager Parameter Store, and then deploy the CloudWatch Agent to our GPU-enabled EC2 instances. Finally, create a CloudWatch Dashboard to analyze GPU utilization.

Install the CloudWatch Agent on your existing GPU-enabled EC2 instances.
Your CloudWatch Agent configuration is stored in Systems Manager Parameter Store.
Systems Manager Documents are used to install and configure the CloudWatch Agent on your EC2 instances.
GPU metrics are published to CloudWatch, which you can then visualize through the CloudWatch Dashboard.

Prerequisites

This post assumes you already have GPU-enabled EC2 workloads running in your AWS account. If the EC2 instance doesn’t have any GPUs, then the custom configuration won’t be applied to the CloudWatch Agent. Instead, the default configuration is used. For those instances, leveraging the CloudWatch Agent’s default configuration is better suited for tracking resource utilization.

For the CloudWatch Agent to collect your instance’s GPU metrics, the proper NVIDIA drivers must be installed on your instance. Several AWS official AMIs including the Deep Learning AMI already have these drivers installed. To see a list of AMIs with the NVIDIA drivers pre-installed, and for full installation instructions for Linux-based instances, see Install NVIDIA drivers on Linux instances.

Additionally, deploying and managing the CloudWatch Agent requires the instances to be running. If your instances are currently stopped, then you must start them to follow the instructions outlined in this post.

Preparing your EC2 instances

You utilize Systems Manager to deploy the CloudWatch Agent, so make sure that your EC2 instances have the Systems Manager Agent installed. Many AWS-provided AMIs already have the Systems Manager Agent installed. For a full list of the AMIs which have the Systems Manager Agent pre-installed, see Amazon Machine Images (AMIs) with SSM Agent preinstalled. If your AMI doesn’t have the Systems Manager Agent installed, see Working with SSM Agent for instructions on installing based on your operating system (OS).

Once installed, the CloudWatch Agent needs certain permissions to accept commands from Systems Manager, read Systems Manager Parameter Store entries, and publish metrics to CloudWatch. These permissions are bundled into the managed IAM policies AmazonEC2RoleforSSM, AmazonSSMReadOnlyAccess, and CloudWatchAgentServerPolicy. To create a new IAM role and associated IAM instance profile with these policies attached, you can run the following AWS Command Line Interface (AWS CLI) commands, replacing <REGION_NAME> with your AWS region, and <INSTANCE_ID> with the EC2 Instance ID that you want to associate with the instance profile:

aws iam create-role --role-name CloudWatch-Agent-Role --assume-role-policy-document  '{"Statement":{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}}'
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam create-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile
aws iam add-role-to-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile --role-name CloudWatch-Agent-Role
aws ec2 associate-iam-instance-profile --region <REGION_NAME> --instance-id <INSTANCE_ID> --iam-instance-profile Name=CloudWatch-Agent-Instance-Profile

Alternatively, you can attach the IAM policies to your existing IAM role associated with an existing IAM instance profile.

aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws ec2 associate-iam-instance-profile --region <REGION_NAME> --instance-id <INSTANCE_ID> --iam-instance-profile Name=<INSTANCE_PROFILE>

Once complete, you should see that your EC2 instance is associated with the appropriate IAM role.

This role should have the AmazonEC2RoleforSSM, AmazonSSMReadOnlyAccess and CloudWatchAgentServerPolicy IAM policies attached.

Configuring and deploying the CloudWatch Agent

Before deploying the CloudWatch Agent onto our EC2 instances, make sure that those agents are properly configured to collect GPU metrics. To do this, you must create a CloudWatch Agent configuration and store it in Systems Manager Parameter Store.

Copy the following into a file cloudwatch-agent-config.json:

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "cwagent"
    },
    "metrics": {
        "aggregation_dimensions": [
            [
                "InstanceId"
            ]
        ],
        "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "ImageId": "${aws:ImageId}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
            "cpu": {
                "measurement": [
                    "cpu_usage_idle",
                    "cpu_usage_iowait",
                    "cpu_usage_user",
                    "cpu_usage_system"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ],
                "totalcpu": false
            },
            "disk": {
                "measurement": [
                    "used_percent",
                    "inodes_free"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "diskio": {
                "measurement": [
                    "io_time"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "swap": {
                "measurement": [
                    "swap_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "temperature_gpu",
                    "utilization_memory",
                    "fan_speed",
                    "memory_total",
                    "memory_used",
                    "memory_free",
                    "pcie_link_gen_current",
                    "pcie_link_width_current",
                    "encoder_stats_session_count",
                    "encoder_stats_average_fps",
                    "encoder_stats_average_latency",
                    "clocks_current_graphics",
                    "clocks_current_sm",
                    "clocks_current_memory",
                    "clocks_current_video"
                ],
                "metrics_collection_interval": 60
            }
        }
    }
}

Run the following AWS CLI command to deploy a Systems Manager Parameter CloudWatch-Agent-Config, which contains a minimal agent configuration for GPU metrics collection. Replace <REGION_NAME> with your AWS Region.

aws ssm put-parameter \
--region <REGION_NAME> \
--name CloudWatch-Agent-Config \
--type String \
--value file://cloudwatch-agent-config.json

Now you can see a CloudWatch-Agent-Config parameter in Systems Manager Parameter Store, containing your CloudWatch Agent’s JSON configuration.

Next, install the CloudWatch Agent on your EC2 instances. To do this, you can leverage Systems Manager Run Command, specifically the AWS-ConfigureAWSPackage document which automates the CloudWatch Agent installation.

Run the following AWS CLI command, replacing <REGION_NAME> with the Region into which your instances are deployed, and <INSTANCE_ID> with the EC2 Instance ID on which you want to install the CloudWatch Agent.

aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AWS-ConfigureAWSPackage \
--parameters '{"action":["Install"],"installationType":["In-place update"],"version":["latest"],"name":["AmazonCloudWatchAgent"]}'

2. To monitor the status of your command, use the get-command-invocation AWS CLI command. Replace <COMMAND_ID> with the command ID output from the previous step, <REGION_NAME> with your AWS region, and <INSTANCE_ID> with your EC2 instance ID.

aws ssm get-command-invocation --query Status --region <REGION_NAME> --command-id <COMMAND_ID> --instance-id <INSTANCE_ID>

3.Wait for the command to show the status Success before proceeding.

$ aws ssm send-command \
	 --query 'Command.CommandId' \
    --region us-east-2 \
    --instance-ids i-0123456789abcdef \
    --document-name AWS-ConfigureAWSPackage \
    --parameters '{"action":["Install"],"installationType":["Uninstall and reinstall"],"version":["latest"],"additionalArguments":["{}"],"name":["AmazonCloudWatchAgent"]}'

"5d8419db-9c48-434c-8460-0519640046cf"

$ aws ssm get-command-invocation --query Status --region us-east-2 --command-id 5d8419db-9c48-434c-8460-0519640046cf --instance-id i-0123456789abcdef

"Success"

Repeat this process for all EC2 instances on which you want to install the CloudWatch Agent.

Next, configure the CloudWatch Agent installation. For this, once again leverage Systems Manager Run Command. However, this time the AmazonCloudWatch-ManageAgent document which applies your custom agent configuration is stored in the Systems Manager Parameter Store to your deployed agents.

Run the following AWS CLI command, replacing <REGION_NAME> with the Region into which your instances are deployed, and <INSTANCE_ID> with the EC2 Instance ID on which you want to configure the CloudWatch Agent.

aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AmazonCloudWatch-ManageAgent \
--parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/CloudWatch-Agent-Config"],"optionalRestart":["yes"]}'

2. To monitor the status of your command, utilize the get-command-invocation AWS CLI command. Replace <COMMAND_ID> with the command ID output from the previous step, <REGION_NAME> with your AWS region, and <INSTANCE_ID> with your EC2 instance ID.

aws ssm get-command-invocation --query Status --region <REGION_NAME> --command-id <COMMAND_ID> --instance-id <INSTANCE_ID>

3. Wait for the command to show the status Success before proceeding.

$ aws ssm send-command \
    --query 'Command.CommandId' \
    --region us-east-2 \
    --instance-ids i-0123456789abcdef \
    --document-name AmazonCloudWatch-ManageAgent \
    --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/CloudWatch-Agent-Config"],"optionalRestart":["yes"]}'

"9a4a5c43-0795-4fd3-afed-490873eaca63"

$ aws ssm get-command-invocation --query Status --region us-east-2 --command-id 9a4a5c43-0795-4fd3-afed-490873eaca63 --instance-id i-0123456789abcdef

"Success"

Repeat this process for all EC2 instances on which you want to install the CloudWatch Agent. Once finished, the CloudWatch Agent installation and configuration is complete, and your EC2 instances now report GPU metrics to CloudWatch.

Visualize your instance’s GPU metrics in CloudWatch

Now that your GPU-enabled EC2 Instances are publishing their utilization metrics to CloudWatch, you can visualize and analyze these metrics to better understand your resource utilization patterns.

The GPU metrics collected by the CloudWatch Agent are within the CWAgent namespace. Explore your GPU metrics using the CloudWatch Metrics Explorer, or deploy our provided sample dashboard.

Copy the following into a file, cloudwatch-dashboard.json, replacing instances of <REGION_NAME> with your Region:

{
    "widgets": [
        {
            "height": 10,
            "width": 24,
            "y": 16,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{"expression": "SELECT AVG(nvidia_smi_utilization_gpu) FROM SCHEMA(\"CWAgent\", InstanceId) GROUP BY InstanceId","id": "q1"}]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "title": "GPU Core Utilization",
                "yAxis": {
                    "left": {"label": "Percent","max": 100,"min": 0,"showUnits": false}
                }
            }
        },
        {
            "height": 7,
            "width": 8,
            "y": 0,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{"expression": "SELECT AVG(nvidia_smi_utilization_gpu) FROM SCHEMA(\"CWAgent\", InstanceId)", "label": "Utilization","id": "q1"}]
                ],
                "view": "gauge",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "title": "Average GPU Core Utilization",
                "yAxis": {"left": {"max": 100, "min": 0}
                },
                "liveData": false
            }
        },
        {
            "height": 9,
            "width": 24,
            "y": 7,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{ "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_used\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m1", "visible": false }],
                    [{ "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_total\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m2", "visible": false }],
                    [{ "expression": "SEARCH(' MetricName=\"mem_used_percent\" {CWAgent, InstanceId} ', 'Average')", "id": "m3", "visible": false }],
                    [{ "expression": "100*AVG(m1)/AVG(m2)", "label": "GPU", "id": "e2", "color": "#17becf" }],
                    [{ "expression": "AVG(m3)", "label": "RAM", "id": "e3" }]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {"min": 0,"max": 100,"label": "Percent","showUnits": false}
                },
                "title": "Average Memory Utilization"
            }
        },
        {
            "height": 7,
            "width": 8,
            "y": 0,
            "x": 8,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ { "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_used\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m1", "visible": false } ],
                    [ { "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_total\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m2", "visible": false } ],
                    [ { "expression": "100*AVG(m1)/AVG(m2)", "label": "Utilization", "id": "e2" } ]
                ],
                "sparkline": true,
                "view": "gauge",
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {"min": 0,"max": 100}
                },
                "liveData": false,
                "title": "GPU Memory Utilization"
            }
        }
    ]
}

2. run the following AWS CLI command, replacing <REGION_NAME> with the name of your Region:

aws cloudwatch put-dashboard \
    --region <REGION_NAME> \
    --dashboard-name My-GPU-Usage \
    --dashboard-body file://cloudwatch-dashboard.json

View the My-GPU-Usage CloudWatch dashboard in the CloudWatch console for your AWS region..

Cleaning Up

To avoid incurring future costs for resources created by following along in this post, delete the following:

My-GPU-Usage CloudWatch Dashboard
CloudWatch-Agent-Config Systems Manager Parameter
CloudWatch-Agent-Role IAM Role

Conclusion

By following along with this post, you deployed and configured the CloudWatch Agent across your GPU-enabled EC2 instances to track GPU utilization without pausing in-progress experiments and model training. Then, you visualized the GPU utilization of your workloads with a CloudWatch Dashboard to better understand your workload’s GPU usage and make more informed scaling and cost decisions. For other ways that Amazon CloudWatch can improve your organization’s operational insights, see the Amazon CloudWatch documentation.