It always pays to know more about your customers, and AWS Data Exchange makes it straightforward to use publicly available census data to enrich your customer dataset.
The United States Census Bureau conducts the US census every 10 years and gathers household survey data. This data is anonymized, aggregated, and made available for public use. The smallest geographic areas for which the Census Bureau collects and aggregates data are census blocks, which are formed by streets, roads, railroads, streams and other bodies of water, other visible physical and cultural features, and the legal boundaries shown on Census Bureau maps.
If you know the census block in which a customer lives, you can make general inferences about their demographic characteristics. With these new attributes, you can build a segmentation model to identify distinct groups of customers to target with personalized messaging. This data is available to subscribe to on AWS Data Exchange, and with data sharing, you don't need to pay to store a copy of it in your account in order to query it.
In this post, we show how to use customer addresses to enrich a dataset with additional demographic details from the US Census Bureau dataset.
Solution overview
The solution includes the following high-level steps:
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Redshift Serverless makes it straightforward to run analytics workloads of any size without having to manage data warehouse infrastructure.
To load our address data, we first create a Redshift Serverless workgroup. Then we use Amazon Redshift Query Editor v2 to load customer data from Amazon Simple Storage Service (Amazon S3).
Create a Redshift Serverless workgroup
There are two primary components of the Redshift Serverless architecture:
Namespace – A collection of database objects and users. Namespaces group together all of the resources you use in Redshift Serverless, such as schemas, tables, users, datashares, and snapshots.
Workgroup – A collection of compute resources. Workgroups have network and security settings that you can configure using the Redshift Serverless console, the AWS Command Line Interface (AWS CLI), or the Redshift Serverless APIs.
Use Query Editor v2 to load customer data from Amazon S3
You can use Query Editor v2 to submit queries and load data to your data warehouse through a web interface. To configure Query Editor v2 for your AWS account, refer to Data load made easy and secure in Amazon Redshift using Query Editor V2. After it’s configured, complete the following steps:
Use the following SQL to create the customer_data schema within the dev database in your data warehouse:
CREATE SCHEMA customer_data;
Use the following SQL DDL to create your target table into which you’ll load your customer address data:
CREATE TABLE customer_data.customer_addresses (
address character varying(256) ENCODE lzo,
unitnumber character varying(256) ENCODE lzo,
municipality character varying(256) ENCODE lzo,
region character varying(256) ENCODE lzo,
postalcode character varying(256) ENCODE lzo,
country character varying(256) ENCODE lzo,
customer_id integer ENCODE az64
) DISTSTYLE AUTO;
The file has no column headers and is pipe delimited (|). For information on how to load data from either Amazon S3 or your local desktop, refer to Loading data into a database.
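To make the expected layout concrete, here is a quick illustration (the values are made up) of how one pipe-delimited, headerless record maps to the table's columns, parsed with Python's csv module:

```python
import csv
import io

# A sample record in the pipe-delimited, headerless layout the DDL expects.
# Columns in table order: address, unitnumber, municipality, region,
# postalcode, country, customer_id. The values are illustrative only.
sample = "1600 Pennsylvania Ave NW||Washington|DC|20500|USA|42\n"

# Parse with a pipe delimiter; an empty field (here, unitnumber) stays empty.
reader = csv.reader(io.StringIO(sample), delimiter="|")
row = next(reader)
print(row)
```

Each field lines up positionally with the DDL above, which is why the load requires no header row.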
Use Location Service to geocode and enrich address data
Location Service lets you add location data and functionality to applications, which includes capabilities such as maps, points of interest, geocoding, routing, geofences, and tracking.
Our data is in Amazon Redshift, so we need to access the Location Service APIs using SQL statements. Each row of data contains an address that we want to enrich and geotag using the Location Service APIs. Amazon Redshift allows developers to create UDFs using a SQL SELECT clause, Python, or Lambda.
Lambda is a compute service that lets you run code without provisioning or managing servers. With Lambda UDFs, you can write custom functions with complex logic and integrate with third-party components. Scalar Lambda UDFs return one result per invocation of the function—in this case, the Lambda function runs one time for each row of data it receives.
For this post, we write a Lambda function that uses the Location Service API to geotag and validate our customer addresses. Then we register this Lambda function as a UDF with our Redshift instance, allowing us to call the function from a SQL command.
For instructions to create a Location Service place index and create your Lambda function and scalar UDF, refer to Access Amazon Location Service from Amazon Redshift. For this post, we use ESRI as a provider and name the place index placeindex.redshift.
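As a rough sketch of how such a function fits Redshift's scalar Lambda UDF contract, the handler below receives a batch of rows in event["arguments"] and must return one result per row, in the same order, under the "results" key. The geocoder here is a stub with hardcoded coordinates standing in for the Location Service call; the names and values are illustrative only:

```python
import json

def geocode_address(address, municipality, region, postalcode, country):
    # Stub standing in for the Amazon Location Service place index lookup;
    # a real implementation would call the API via the place index created
    # for this post. The coordinates below are hardcoded for illustration.
    return {
        "Label": f"{address}, {municipality}, {region} {postalcode}, {country}",
        "Longitude": -77.0365,
        "Latitude": 38.8977,
    }

def lambda_handler(event, context):
    # Redshift invokes a scalar Lambda UDF with a batch of rows in
    # event["arguments"]; exactly one result must come back per input row,
    # in order, in the "results" array of the JSON response.
    results = []
    for args in event["arguments"]:
        try:
            results.append(json.dumps(geocode_address(*args)))
        except Exception:
            results.append(None)  # NULL result for rows that fail to geocode
    return json.dumps({"results": results})
```

Because Redshift batches rows into each invocation, the function runs far fewer times than the row count, but still produces one result per row.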
Test your new function with the following code, which returns the coordinates of the White House in Washington, DC:
select public.f_geocode_address('1600 Pennsylvania Ave NW','Washington','DC','20500','USA');
Subscribe to demographic data from AWS Data Exchange
AWS Data Exchange is a data marketplace with more than 3,500 products from over 300 providers delivered—through files, APIs, or Amazon Redshift queries—directly to the data lakes, applications, analytics, and machine learning models that use it.
First, we need to give our Redshift namespace permission via AWS Identity and Access Management (IAM) to access subscriptions on AWS Data Exchange. Then we can subscribe to our sample demographic data. Complete the following steps:
On the IAM console, add the AWSDataExchangeSubscriberFullAccess managed policy to the Amazon Redshift access role that you assigned when creating the namespace.
Choose Continue to subscribe, then choose Subscribe.
The subscription may take a few minutes to configure.
When your subscription is in place, navigate back to the Redshift Serverless console.
In the navigation pane, choose Datashares.
On the Subscriptions tab, choose the datashare that you just subscribed to.
On the datashare details page, choose Create database from datashare.
Choose the namespace you created earlier and provide a name for the new database that will hold the shared objects from the dataset you subscribed to.
In Query Editor v2, you should see the new database you just created and two new tables: one that holds the block group polygons and another that holds the demographic information for each block group.
Join geocoded customer data to census data with geospatial queries
There are two primary types of spatial data: raster and vector data. Raster data is represented as a grid of pixels and is beyond the scope of this post. Vector data comprises vertices, edges, and polygons. With geospatial data, vertices are represented as latitude and longitude points, and edges are the connections between pairs of vertices. Think of the road connecting two intersections on a map. A polygon is a set of vertices with a series of connecting edges that form a continuous shape. A simple rectangle is a polygon, just as the state border of Ohio can be represented as a polygon. The geography_usa_blockgroup_2019 dataset that you subscribed to has 220,134 rows, each representing a single census block group and its geographic shape.
Amazon Redshift supports the storage and querying of vector-based spatial data with the GEOMETRY and GEOGRAPHY data types. You can use Redshift SQL functions to perform queries such as a point-in-polygon operation to determine whether a given latitude/longitude point falls within the boundaries of a given polygon (such as a state or county boundary). In this dataset, you can observe that the geom column in geography_usa_blockgroup_2019 is of type GEOMETRY.
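Conceptually, a point-in-polygon test like the one ST_Contains performs can be sketched with the classic ray-casting algorithm. This is a simplified illustration that ignores SRIDs, geometry validity, and edge cases that Redshift handles for you:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count how many polygon edges a horizontal ray
    cast from (lon, lat) crosses; an odd count means the point is inside.
    polygon is a list of (lon, lat) vertices in order."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the shape
        # Does this edge straddle the ray's latitude, and does the
        # crossing point fall to the right of the query point?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > lon:
                inside = not inside
    return inside
```

Redshift evaluates this kind of containment test natively and at scale, which is what makes the spatial join in the next step practical over 220,134 block group polygons.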
Our goal is to determine which census block (polygon) each of our geotagged addresses falls within so we can enrich our customer records with details that we know about the census block. Complete the following steps:
Build a new table with the geocoding results from our UDF:
CREATE TABLE customer_data.customer_addresses_geocoded AS
select address
,unitnumber
,municipality
,region
,postalcode
,country
,customer_id
,public.f_geocode_address(address||' '||coalesce(unitnumber,''),municipality,region,postalcode,country) as geocode_result
--coalesce prevents a NULL unitnumber from nulling the whole concatenation
FROM customer_data.customer_addresses;
Use the following code to extract the different address fields and latitude/longitude coordinates from the JSON column and create a new table with the results:
CREATE TABLE customer_data.customer_addresses_points AS
SELECT customer_id
,geo_address
,address
,unitnumber
,municipality
,region
,postalcode
,country
,longitude
,latitude
,ST_SetSRID(ST_MakePoint(Longitude, Latitude),4326) as address_point
--create new geom column of type POINT, set new point SRID = 4326
FROM
(
select customer_id
,address
,unitnumber
,municipality
,region
,postalcode
,country
,cast(json_extract_path_text(geocode_result, 'Label', true) as VARCHAR) as geo_address
,cast(json_extract_path_text(geocode_result, 'Longitude', true) as float) as longitude
,cast(json_extract_path_text(geocode_result, 'Latitude', true) as float) as latitude
--use json function to extract fields from geocode_result
from customer_data.customer_addresses_geocoded) a;
This code uses the ST_MakePoint function to create a new column from the latitude/longitude coordinates called address_point of type GEOMETRY and subtype POINT. It uses the ST_SetSRID geospatial function to set the spatial reference identifier (SRID) of the new column to 4326.
The SRID defines the spatial reference system to be used when evaluating the geometry data. It’s important when joining or comparing geospatial data that they have matching SRIDs. You can check the SRID of an existing geometry column by using the ST_SRID function. For more information on SRIDs and GEOMETRY data types, refer to Querying spatial data in Amazon Redshift.
Now that your customer addresses are geocoded as latitude/longitude points in a geometry column, you can use a join to identify which census block shape your new point falls within:
CREATE TABLE customer_data.customer_addresses_with_census AS
select c.*
,shapes.geoid as census_group_shape
,demo.*
from customer_data.customer_addresses_points c
inner join "carto_census_data"."carto".geography_usa_blockgroup_2019 shapes
on ST_Contains(shapes.geom, c.address_point)
--join tables where the address point falls within the census block geometry
inner join carto_census_data.usa_acs.demographics_sociodemographics_usa_blockgroup_2019_yearly_2019 demo
on demo.geoid = shapes.geoid;
The preceding code creates a new table called customer_addresses_with_census, which joins the customer addresses to the census block in which they belong as well as the demographic data associated with that census block.
To do this, you used the ST_CONTAINS function, which accepts two geometry data types as an input and returns TRUE if the 2D projection of the first input geometry contains the second input geometry. In our case, we have census blocks represented as polygons and addresses represented as points. The join in the SQL statement succeeds when the point falls within the boundaries of the polygon.
Visualize the new demographic data with QuickSight
QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people who you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources.
First, let’s build some new calculated fields that will help us better understand the demographics of our customer base. We can do this in QuickSight, or we can use SQL to build the columns in a Redshift view. The following is the code for a Redshift view:
CREATE VIEW customer_data.customer_features AS (
SELECT customer_id
,postalcode
,region
,municipality
,geoid as census_geoid
,longitude
,latitude
,total_pop
,median_age
,white_pop::float/nullif(total_pop,0) as perc_white
,black_pop::float/nullif(total_pop,0) as perc_black
,asian_pop::float/nullif(total_pop,0) as perc_asian
,hispanic_pop::float/nullif(total_pop,0) as perc_hispanic
,amerindian_pop::float/nullif(total_pop,0) as perc_amerindian
,median_income
,income_per_capita
,median_rent
,percent_income_spent_on_rent
,unemployed_pop::float/nullif(pop_in_labor_force,0) as perc_unemployment
,(associates_degree + bachelors_degree + masters_degree + doctorate_degree)::float/nullif(total_pop,0) as perc_college_ed
,(household_language_total - household_language_english)::float/nullif(household_language_total,0) as perc_other_than_english
--nullif avoids divide-by-zero; the float cast avoids integer division
FROM "dev"."customer_data"."customer_addresses_with_census" t );
To get QuickSight to talk to our Redshift Serverless endpoint, complete the following steps:
On the QuickSight console, choose Datasets in the navigation pane.
Choose New dataset.
We want to create a dataset from a new data source and use the Redshift: Manual connect option.
Provide the connection information for your Redshift Serverless workgroup.
You will need the endpoint for your workgroup and the user name and password that you created when you set up your workgroup. You can find your workgroup's endpoint on the Redshift Serverless console by navigating to your workgroup configuration. The following screenshot is an example of the connection settings needed. Notice that the connection type is the name of the VPC connection that you previously configured in QuickSight. When you copy the endpoint from the Redshift console, be sure to remove the database and port number from the end of the URL before entering it in the field.
Save the new data source configuration.
You’ll be prompted to choose the table you want to use for your dataset.
Choose the new view that you created that has your new derived fields.
Select Directly query your data.
This will connect your visualizations directly to the data in the database rather than ingesting data into the QuickSight in-memory data store.
To create a histogram of median income level, choose the blank visual on Sheet1 and then choose the histogram visual icon under Visual types.
Choose median_income under Fields list and drag it to the Value field well.
This builds a histogram showing the distribution of median_income for our customers based on the census block group in which they live.
Conclusion
In this post, we demonstrated how companies can use open census data available on AWS Data Exchange to effortlessly gain a high-level understanding of their customer base from a demographic standpoint. This basic understanding of customers based on where they live can serve as the foundation for more targeted marketing campaigns and even influence product development and service offerings.
As always, AWS welcomes your feedback. Please leave your thoughts and questions in the comments section.
About the Author
Tony Stricker is a Principal Technologist on the Data Strategy team at AWS, where he helps senior executives adopt a data-driven mindset and align their people/process/technology in ways that foster innovation and drive towards specific, tangible business outcomes. He has a background as a data warehouse architect and data scientist and has delivered solutions into production across multiple industries including oil and gas, financial services, public sector, and manufacturing. In his spare time, Tony likes to hang out with his dog and cat, work on home improvement projects, and restore vintage Airstream campers.
Many organizations operate data lakes spanning multiple cloud data stores, whether because of business expansions, mergers, or different cloud provider preferences across business units. In these cases, you may want an integrated query layer that can seamlessly run analytical queries across these diverse cloud stores and streamline your data analytics processes. With a unified query interface, you can avoid the complexity of managing multiple query tools, consolidate your analytics workflows, and reduce the need for extensive tooling and infrastructure management. This consolidation saves time and resources and lets teams focus on deriving insights from data rather than navigating various query tools and interfaces. By breaking down silos and providing seamless access to data stored across different cloud data stores, a unified query interface also gives stakeholders a holistic view of their data assets, so they can analyze data from multiple sources in a unified manner and make more informed strategic decisions.
In this post, we delve into the ways in which you can use Amazon Athena connectors to efficiently query data files residing across Azure Data Lake Storage (ADLS) Gen2, Google Cloud Storage (GCS), and Amazon Simple Storage Service (Amazon S3). Additionally, we explore the use of Athena workgroups and cost allocation tags to effectively categorize and analyze the costs associated with running analytical queries.
Solution overview
Imagine a fictional company named Oktank, which manages its data across data lakes on Amazon S3, ADLS, and GCS. Oktank wants to be able to query any of their cloud data stores and run analytical queries like joins and aggregations across the data stores without needing to transfer data to an S3 data lake. Oktank also wants to identify and analyze the costs associated with running analytics queries. To achieve this, Oktank envisions a unified data query layer using Athena.
The following diagram illustrates the high-level solution architecture.
Users run their queries from Athena connecting to specific Athena workgroups. Athena uses connectors to federate the queries across multiple data sources. In this case, we use the Amazon Athena Azure Synapse connector to query data from ADLS Gen2 via Synapse and the Amazon Athena GCS connector for GCS. An Athena connector is an extension of the Athena query engine. When a query runs on a federated data source using a connector, Athena invokes multiple AWS Lambda functions to read from the data sources in parallel to optimize performance. Refer to Using Amazon Athena Federated Query for further details. The AWS Glue Data Catalog holds the metadata for Amazon S3 and GCS data.
In the following sections, we demonstrate how to build this architecture.
Prerequisites
Before you configure your resources on AWS, you need to set up the necessary infrastructure required for this post in both Azure and GCP. The detailed steps and guidelines for creating the resources in Azure and GCP are beyond the scope of this post. Refer to the respective documentation for details. In this section, we provide some basic steps needed to create the resources required for the post.
To set up the sample dataset for Azure, log in to the Azure portal and upload the file to ADLS Gen2. The following screenshot shows the file under the container blog-container under a specific storage account on ADLS Gen2.
Set up a Synapse workspace in Azure and create an external table in Synapse that points to the relevant location. The following commands offer a foundational guide for running the necessary actions within the Synapse workspace to create the essential resources for this post. Refer to the corresponding Synapse documentation for additional details as required.
-- Create database
CREATE DATABASE azure_athena_blog_db

-- Create file format
CREATE EXTERNAL FILE FORMAT [SynapseDelimitedTextFormat]
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
FORMAT_OPTIONS (
FIELD_TERMINATOR = ',',
USE_TYPE_DEFAULT = FALSE,
FIRST_ROW = 2
))

-- Create master key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '*******';

-- Create database-scoped credential
CREATE DATABASE SCOPED CREDENTIAL dbscopedCreds
WITH IDENTITY = 'Managed Identity';

-- Create data source
CREATE EXTERNAL DATA SOURCE athena_blog_datasource
WITH ( LOCATION = 'abfss://[email protected]/',
CREDENTIAL = dbscopedCreds
)

-- Create external table
CREATE EXTERNAL TABLE dbo.customer_feedbacks_azure (
[data_key] nvarchar(4000),
[data_load_date] nvarchar(4000),
[data_location] nvarchar(4000),
[product_id] nvarchar(4000),
[customer_email] nvarchar(4000),
[customer_name] nvarchar(4000),
[comment1] nvarchar(4000),
[comment2] nvarchar(4000)
)
WITH (
LOCATION = 'cust_feedback_v0.csv',
DATA_SOURCE = athena_blog_datasource,
FILE_FORMAT = [SynapseDelimitedTextFormat]
);

-- Create login and user
CREATE LOGIN bloguser1 WITH PASSWORD = '****';
CREATE USER bloguser1 FROM LOGIN bloguser1;

-- Grant SELECT on the schema
GRANT SELECT ON SCHEMA::dbo TO [bloguser1];
Note down the user name, password, database name, and the serverless or dedicated SQL endpoint you use—you need these in the subsequent steps.
This completes the setup on Azure for the sample dataset.
Configure the dataset for GCS
To set up the sample dataset for GCS, upload the file to the GCS bucket.
Create a GCP service account and grant access to the bucket.
In addition, create a JSON key for the service account. The content of the key is needed in subsequent steps.
This completes the setup on GCP for our sample dataset.
Deploy the AWS infrastructure
You can now run the provided AWS CloudFormation stack to create the solution resources. Identify an AWS Region in which you want to create the resources, and ensure you use the same Region throughout the setup and verifications.
Provide values for the following required parameters. You can leave other parameters at their default values or modify them according to your requirements.
AzureSynapseUserName – User name for the Synapse database you created.
AzureSynapsePwd – Password for the Synapse database user.
AzureSynapseURL – Synapse JDBC URL, in the following format: jdbc:sqlserver://<sqlendpoint>;databaseName=<databasename> (for example, jdbc:sqlserver://xxxxg-ondemand.sql.azuresynapse.net;databaseName=azure_athena_blog_db).
GCSSecretKey – Content from the secret key file from GCP.
UserAzureADLSOnlyUserPassword – AWS Management Console password for the Azure-only user. This user can only query data from ADLS.
UserGCSOnlyUserPassword – AWS Management Console password for the GCS-only user. This user can only query data from GCP GCS.
UserMultiCloudUserPassword – AWS Management Console password for the multi-cloud user. This user can query data from any of the cloud stores.
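Because the JDBC URL parameter follows a fixed format, a small helper (purely illustrative) shows how it is assembled from the Synapse SQL endpoint and database name:

```python
def synapse_jdbc_url(sql_endpoint, database_name):
    # Builds the JDBC URL in the format the AzureSynapseURL parameter
    # expects: jdbc:sqlserver://<sqlendpoint>;databaseName=<databasename>
    return f"jdbc:sqlserver://{sql_endpoint};databaseName={database_name}"

print(synapse_jdbc_url("xxxxg-ondemand.sql.azuresynapse.net",
                       "azure_athena_blog_db"))
```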
The stack provisions the VPC, subnets, S3 buckets, Athena workgroups, and AWS Glue database and tables. It creates two secrets in AWS Secrets Manager to store the GCS secret key and the Synapse user name and password. You use these secrets when creating the Athena connectors.
The stack also creates three AWS Identity and Access Management (IAM) users and grants permissions on corresponding Athena workgroups, Athena data sources, and Lambda functions: AzureADLSUser, which can run queries on ADLS and Amazon S3, GCPGCSUser, which can query GCS and Amazon S3, and MultiCloudUser, which can query Amazon S3, Azure ADLS Gen2 and GCS data sources. The stack does not create the Athena data source and Lambda functions. You create these in subsequent steps when you create the Athena connectors.
The stack also attaches cost allocation tags to the Athena workgroups, the secrets in Secrets Manager, and the S3 buckets. You use these tags for cost analysis in subsequent steps.
When the stack deployment is complete, note the values of the CloudFormation stack outputs, which you use in subsequent steps.
Upload the data file to the S3 bucket created by the CloudFormation stack. You can retrieve the bucket name from the value of the key named S3SourceBucket from the stack output. This serves as the S3 data lake data for this post.
In the Application settings section, enter the values for the corresponding key from the output of the CloudFormation stack, as listed in the following table.
SecretNamePrefix – AzureSecretName
DefaultConnectionString – AzureSynapseConnectorJDBCURL
LambdaFunctionName – AzureADLSLambdaFunctionName
SecurityGroupIds – SecurityGroupId
SpillBucket – AthenaLocationAzure
SubnetIds – PrivateSubnetId
Select the Acknowledgement check box and choose Deploy.
Wait for the application to be deployed before proceeding to the next step.
In the Application settings section, enter the values for the corresponding key from the output of the CloudFormation stack, as listed in the following table.
SpillBucket – AthenaLocationGCP
GCSSecretName – GCSSecretName
LambdaFunctionName – GCSLambdaFunctionName
Select the Acknowledgement check box and choose Deploy.
For the GCS connector, there are some post-deployment steps to create the AWS Glue database and table for the GCS data file. In this post, the CloudFormation stack you deployed earlier already created these resources, so you don't have to create them. The stack created an AWS Glue database called oktank_multicloudanalytics_gcp and a table called customer_feedbacks under the database with the required configurations.
Log in to the Lambda console to verify the Lambda functions were created.
Next, you create the Athena data sources corresponding to these connectors.
Create the Athena data sources
Complete the following steps to create an Athena data source for each connector (the values shown are for the GCS connector; repeat the steps with the corresponding Azure values for the Synapse connector):
For Data source name, enter the value for the AthenaFederatedDataSourceNameForGCS key from the CloudFormation stack output.
In the Connection details section, choose the Lambda function you created earlier for GCS.
Choose Next, then choose Create data source.
This completes the deployment. You can now run the multi-cloud queries from Athena.
Query the federated data sources
In this section, we demonstrate how to query the data sources using the ADLS user, GCS user, and multi-cloud user.
Run queries as the ADLS user
The ADLS user can run multi-cloud queries on ADLS Gen2 and Amazon S3 data. Complete the following steps:
Get the value for UserAzureADLSUser from the CloudFormation stack output.
Sign in to the Athena query editor with this user.
Switch the workgroup to athena-mc-analytics-azure-wg in the Athena query editor.
Choose Acknowledge to accept the workgroup settings.
Run the following query to join the S3 data lake table to the ADLS data lake table:
SELECT a.data_load_date as azure_load_date, b.data_key as s3_data_key, a.data_location as azure_data_location
FROM "azure_adls_ds"."dbo"."customer_feedbacks_azure" a
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b
ON cast(a.data_key as integer) = b.data_key
Run queries as the GCS user
The GCS user can run multi-cloud queries on GCS and Amazon S3 data. Complete the following steps:
Get the value for UserGCPGCSUser from the CloudFormation stack output.
Sign in to the Athena query editor with this user.
Switch the workgroup to athena-mc-analytics-gcp-wg in the Athena query editor.
Choose Acknowledge to accept the workgroup settings.
Run the following query to join the S3 data lake table to the GCS data lake table:
SELECT a.data_load_date as gcs_load_date, b.data_key as s3_data_key, a.data_location as gcs_data_location FROM "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" a
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b
ON a.data_key = b.data_key
Run queries as the multi-cloud user
The multi-cloud user can run queries that can access data from any cloud store. Complete the following steps:
Get the value for UserMultiCloudUser from the CloudFormation stack output.
Sign in to the Athena query editor with this user.
Switch the workgroup to athena-mc-analytics-multi-wg in the Athena query editor.
Choose Acknowledge to accept the workgroup settings.
Run the following query to join data across the multiple cloud stores:
SELECT a.data_load_date as adls_load_date, b.data_key as s3_data_key, c.data_location as gcs_data_location
FROM "azure_adls_ds"."dbo"."CUSTOMER_FEEDBACKS_AZURE" a
join "AwsDataCatalog"."oktank_multicloudanalytics_aws"."customer_feedbacks" b
on cast(a.data_key as integer) = b.data_key join "gcp_gcs_ds"."oktank_multicloudanalytics_gcp"."customer_feedbacks" c
on b.data_key = c.data_key
Cost analysis with cost allocation tags
When you run multi-cloud queries, you need to carefully consider the data transfer costs associated with each cloud provider. Refer to the corresponding cloud documentation for details. The cost reports highlighted in this section refer to the AWS infrastructure and service usage costs. The storage and other associated costs with ADLS, Synapse, and GCS are not included.
Let’s see how to handle cost analysis for the multiple scenarios we have discussed.
Sign in to the AWS Billing and Cost Management console and enable these cost allocation tags. It may take up to 24 hours for the cost allocation tags to be available and reflected in AWS Cost Explorer.
To track the cost of the Lambda functions deployed as part of the GCS and Synapse connectors, you can use the AWS generated cost allocation tags, as shown in the following screenshot.
You can use these tags on the Billing and Cost Management console to determine the cost per tag. We provide some sample screenshots for reference. These reports only show the cost of AWS resources used to access ADLS Gen2 or GCP GCS. The reports do not show the cost of GCP or Azure resources.
Athena costs
To view Athena costs, choose the tag athena-mc-analytics:athena:workgroup and filter the tags values azure, gcp, and multi.
To view the costs for Amazon S3 storage (Athena query results and spill storage), choose the tag athena-mc-analytics:s3:result-spill and filter the tag values azure, gcp, and multi.
Lambda costs
To view the costs for the Lambda functions, choose the tag aws:cloudformation:stack-name and filter the tag values serverlessrepo-AthenaSynapseConnector and serverlessrepo-AthenaGCSConnector.
Cost allocation tags help manage and track costs effectively when you’re running multi-cloud queries. This can help you track, control, and optimize your spending while taking advantage of the benefits of multi-cloud data analytics.
Clean up
To avoid incurring further charges, delete the CloudFormation stacks to delete the resources you provisioned as part of this post. There are two additional stacks deployed for each connector: serverlessrepo-AthenaGCSConnector and serverlessrepo-AthenaSynapseConnector. Delete all three stacks.
Conclusion
In this post, we discussed a comprehensive solution for organizations looking to implement multi-cloud data lake analytics using Athena, enabling a consolidated view of data across diverse cloud data stores and enhancing decision-making capabilities. We focused on querying data lakes across Amazon S3, Azure Data Lake Storage Gen2, and Google Cloud Storage using Athena. We demonstrated how to set up resources on Azure, GCP, and AWS, including creating databases, tables, Lambda functions, and Athena data sources. We also provided instructions for querying federated data sources from Athena, demonstrating how you can run multi-cloud queries tailored to your specific needs. Lastly, we discussed cost analysis using AWS cost allocation tags.
For further reading, refer to the following resources:
Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.
Amazon Simple Email Service (SES) is a cloud-based email sending service that helps businesses and developers send marketing and transactional emails. We introduced the SESv1 API in 2011 to provide developers with basic email sending capabilities through Amazon SES using HTTPS. In 2020, we introduced the redesigned Amazon SESv2 API, with new and updated features that make it easier and more efficient for developers to send email at scale.
This post will compare Amazon SESv1 API and Amazon SESv2 API and explain the advantages of transitioning your application code to the SESv2 API. We’ll also provide examples using the AWS Command-Line Interface (AWS CLI) that show the benefits of transitioning to the SESv2 API.
Amazon SESv1 API
The SESv1 API is a relatively simple API that provides basic functionality for sending and receiving emails. For over a decade, thousands of SES customers have used the SESv1 API to send billions of emails. Our customers’ developers routinely use the SESv1 APIs to verify email addresses, create rules, send emails, and customize bounce and complaint notifications. Our customers’ needs have become more advanced as the global email ecosystem has developed and matured. Unsurprisingly, we’ve received customer feedback requesting enhancements and new functionality within SES. To better support an expanding array of use cases and stay at the forefront of innovation, we developed the SESv2 APIs.
While the SESv1 API will continue to be supported, AWS is focused on advancing functionality through the SESv2 API. As new email sending capabilities are introduced, they will only be available through SESv2 API. Migrating to the SESv2 API provides customers with access to these, and future, optimizations and enhancements. Therefore, we encourage SES customers to consider the information in this blog, review their existing codebase, and migrate to SESv2 API in a timely manner.
Amazon SESv2 API
Released in 2020, the SESv2 API and SDK enable customers to build highly scalable and customized email applications with an expanded set of lightweight and easy to use API actions. Leveraging insights from current SES customers, the SESv2 API includes several new actions related to list and subscription management, the creation and management of dedicated IP pools, and updates to unsubscribe that address recent industry requirements.
One example of new functionality in the SESv2 API is programmatic support for the SES Virtual Deliverability Manager (VDM). Previously only addressable via the AWS console, VDM helps customers improve sending reputation and deliverability. The SESv2 API includes vdmAttributes, such as VdmEnabled and DashboardAttributes, as well as vdmOptions, such as DashboardOptions and GuardianOptions.
To improve developer efficiency and make the SESv2 API easier to use, we merged several SESv1 API actions into single commands. For example, in the SESv1 API you must make separate calls to createConfigurationSet, setReputationMetrics, setSendingEnabled, setTrackingOptions, and setDeliveryOption. In the SESv2 API, however, developers make a single call to createConfigurationSet, and they can include trackingOptions, reputationOptions, sendingOptions, and deliveryOptions. This can result in more concise code.
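As an illustration of the consolidated call, the following is a sketch of a `--cli-input-json` payload for `aws sesv2 create-configuration-set`. The configuration set name, tracking domain, and option values are placeholder assumptions, not values from this post:

```json
{
    "ConfigurationSetName": "my-configuration-set",
    "TrackingOptions": {
        "CustomRedirectDomain": "tracking.example.com"
    },
    "ReputationOptions": {
        "ReputationMetricsEnabled": true
    },
    "SendingOptions": {
        "SendingEnabled": true
    },
    "DeliveryOptions": {
        "TlsPolicy": "OPTIONAL"
    }
}
```

A single request configures tracking, reputation, sending, and delivery options that would have required five separate SESv1 calls.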
Another example of SESv2 API command consolidation is the GetEmailIdentity action, which is a composite of the SESv1 API’s GetIdentityVerificationAttributes, GetIdentityNotificationAttributes, GetIdentityMailFromDomainAttributes, GetIdentityDkimAttributes, and GetIdentityPolicies. See the SESv2 documentation for more details.
Why migrate to Amazon SESv2 API?
The SESv2 API offers an enhanced experience compared to the original SESv1 API. It provides a more modern interface and flexible options that make building scalable, high-volume email applications easier and more efficient. SESv2 enables rich email capabilities like template management, list subscription handling, and deliverability reporting. It provides developers with a more powerful and customizable set of tools with improved security measures to build and optimize inbox placement and reputation management. Taken as a whole, the SESv2 APIs provide an even stronger foundation for sending critical communications and campaign email messages effectively at scale.
Migrating your applications to SESv2 API will benefit your email marketing and communication capabilities with:
New and Enhanced Features: Amazon SESv2 API includes new actions as well as enhancements that provide better functionality and improved email management. By moving to the latest version, you’ll be able to optimize your email sending process. A few examples include:
Increase the maximum message size (including attachments) from 10 MB (SESv1) to 40 MB (SESv2) for both sending and receiving.
Access key actions for the SES Virtual Deliverability Manager (VDM), which provides insights into your sending and delivery data. VDM provides near-real-time advice on how to fix issues that are negatively affecting your delivery success rate and reputation.
Meet Google and Yahoo’s June 2024 unsubscribe requirements with the SESv2 SendEmail action. For more information, see the What’s New blog post.
Future-proof Your Application: Avoid potential compatibility issues and disruptions by keeping your application up-to-date with the latest version of the Amazon SESv2 API via the AWS SDK.
Improve Usability and Developer Experience: Amazon SESv2 API is designed to be more user-friendly and consistent with other AWS services. It is a more intuitive API with better error handling, making it easier to develop, maintain, and troubleshoot your email sending applications.
Migrating to the latest SESv2 API and SDK positions customers for success in creating reliable and scalable email services for their businesses.
What does migration to the SESv2 API entail?
While the SESv2 API builds on the v1 API, the v2 API actions don’t universally map exactly to the v1 API actions. Current SES customers who intend to migrate to the SESv2 API will need to identify the SESv1 API actions in their code and plan to refactor them for v2. When planning the migration, keep the following considerations in mind:
Customers with applications that receive email using SESv1 API’s CreateReceiptFilter, CreateReceiptRule or CreateReceiptRuleSet actions must continue using the SESv1 API client for these actions. SESv1 and SESv2 can be used in the same application, where needed.
We recommend all customers follow the security best practice of “least privilege” with their IAM policies. As such, customers may need to review and update their policies to include the new and modified API actions introduced in SESv2 before migrating. Taking the time to properly configure permissions ensures a seamless transition while maintaining a securely optimized level of access. See documentation.
Below is an example of an IAM policy with a user with limited allow privileges related to several SESv1 Identity actions only:
When calling delete-identity with SESv1, SES returns 200 (or no response), even if the identity was previously deleted or doesn’t exist:
aws ses delete-identity --identity example.com
SESv2 provides better error handling and responses when calling the delete API:
aws sesv2 delete-email-identity --email-identity example.com
An error occurred (NotFoundException) when calling the DeleteEmailIdentity operation: Email identity example.com does not exist.
Hands-on with SESv1 API vs. SESv2 API
Below are a few examples you can use to explore the differences between SESv1 API and the SESv2 API. To complete these exercises, you’ll need:
AWS Account (setup) with enough permission to interact with the SES service via the CLI
SES enabled, configured and properly sending emails
A recipient email address with which you can check inbound messages (if you’re in the SES Sandbox, this email must be verified email identity). In the following examples, replace [email protected] with the verified email identity.
Your preferred IDE with AWS credentials and necessary permissions (you can also use AWS CloudShell)
Open the AWS CLI (or AWS CloudShell) and:
Create a test directory called v1-v2-test.
Create the following (8) files in the v1-v2-test directory:
destination.json (replace [email protected] with the verified email identity):
{
  "ToAddresses": [
    "[email protected]"
  ]
}
message.json:
{
"Subject": {
"Data": "SESv1 API email sent using the AWS CLI",
"Charset": "UTF-8"
},
"Body": {
"Text": {
"Data": "This is the message body from SESv1 API in text format.",
"Charset": "UTF-8"
},
"Html": {
"Data": "This message body from SESv1 API, it contains HTML formatting. For example - you can include links: <a class=\"ulink\" href=\"http://docs.aws.amazon.com/ses/latest/DeveloperGuide\" target=\"_blank\">Amazon SES Developer Guide</a>.",
"Charset": "UTF-8"
}
}
}
ses-v1-raw-message.json (replace [email protected] with the verified email identity):
{
"Data": "From: [email protected]\nTo: [email protected]\nSubject: Test email sent using the SESv1 API and the AWS CLI \nMIME-Version: 1.0\nContent-Type: text/plain\n\nThis is the message body from the SESv1 API SendRawEmail.\n\n"
}
ses-v1-template.json (replace [email protected] with the verified email identity):
my-template.json (replace [email protected] with the verified email identity):
{
"Template": {
"TemplateName": "my-template",
"SubjectPart": "Greetings SES Developer, {{name}}!",
"HtmlPart": "<h1>Hello {{name}},</h1><p>Your favorite animal is {{favoriteanimal}}.</p>",
"TextPart": "Dear {{name}},\r\nYour favorite animal is {{favoriteanimal}}."
}
}
ses-v2-simple.json (replace [email protected] with the verified email identity):
{
"FromEmailAddress": "[email protected]",
"Destination": {
"ToAddresses": [
"[email protected]"
]
},
"Content": {
"Simple": {
"Subject": {
"Data": "SESv2 API email sent using the AWS CLI",
"Charset": "utf-8"
},
"Body": {
"Text": {
"Data": "SESv2 API email sent using the AWS CLI",
"Charset": "utf-8"
}
},
"Headers": [
{
"Name": "List-Unsubscribe",
"Value": "insert-list-unsubscribe-here"
},
{
"Name": "List-Unsubscribe-Post",
"Value": "List-Unsubscribe=One-Click"
}
]
}
}
}
ses-v2-raw.json (replace [email protected] with the verified email identity):
{
"FromEmailAddress": "[email protected]",
"Destination": {
"ToAddresses": [
"[email protected]"
]
},
"Content": {
"Raw": {
"Data": "Subject: Test email sent using SESv2 API via the AWS CLI \nMIME-Version: 1.0\nContent-Type: text/plain\n\nThis is the message body from SendEmail Raw Content SESv2.\n\n"
}
}
}
ses-v2-template.json (replace [email protected] with the verified email identity):
As mentioned above, customers who are using least privilege permissions with SESv1 API must first update their IAM policies before running the SESv2 API examples below. See documentation for more info.
As you can see from the .json files we created for SES v2 API (above), you can modify or remove sections from the .json files, based on the type of email content (simple, raw or templated) you want to send.
As you can see from the examples above, SESv2 API shares much of its syntax and actions with the SESv1 API. As a result, most customers have found they can readily evaluate, identify and migrate their application code base in a relatively short period of time. However, it’s important to note that while the process is generally straightforward, there may be some nuances and differences to consider depending on your specific use case and programming language.
Regardless of the language, you’ll need anywhere from a few hours to a few weeks to:
Update your code to use SESv2 Client and change API signature and request parameters
Update permissions / policies to reflect SESv2 API requirements
Test your migrated code to ensure that it functions correctly with the SESv2 API
Stage and test
Deploy
Summary
As we’ve described in this post, Amazon SES customers that migrate to the SESv2 API will benefit from updated capabilities, a more user-friendly and intuitive API, better error handling, and improved deliverability controls. The SESv2 API also provides compliance with the industry’s upcoming unsubscribe header requirements, more flexible subscription-list management, and support for larger attachments. Taken collectively, these improvements make it even easier for customers to develop, maintain, and troubleshoot their email sending applications with Amazon Simple Email Service. For these reasons and more, we recommend that SES customers migrate their existing applications to the SESv2 API promptly.
For more information regarding the SESv2 APIs, comment on this post, reach out to your AWS account team, or consult the AWS SESv2 API documentation:
Zip is an Amazon Pinpoint and Amazon Simple Email Service Sr. Specialist Solutions Architect at AWS. Outside of work he enjoys time with his family, cooking, mountain biking and plogging.
Vinay Ujjini
Vinay is an Amazon Pinpoint and Amazon Simple Email Service Worldwide Principal Specialist Solutions Architect at AWS. He has been solving customer’s omni-channel challenges for over 15 years. He is an avid sports enthusiast and in his spare time, enjoys playing tennis and cricket.
Dmitrijs Lobanovskis
Dmitrijs is a Software Engineer for Amazon Simple Email service. When not working, he enjoys traveling, hiking and going to the gym.
Many organizations around the world rely on the use of physical assets, such as vehicles, to deliver a service to their end-customers. By tracking these assets in real time and storing the results, asset owners can derive valuable insights on how their assets are being used to continuously deliver business improvements and plan for future changes. For example, a delivery company operating a fleet of vehicles may need to ascertain the impact from local policy changes outside of their control, such as the announced expansion of an Ultra-Low Emission Zone (ULEZ). By combining historical vehicle location data with information from other sources, the company can devise empirical approaches for better decision-making. For example, the company’s procurement team can use this information to make decisions about which vehicles to prioritize for replacement before policy changes go into effect.
Developers can use the support in Amazon Location Service for publishing device position updates to Amazon EventBridge to build a near-real-time data pipeline that stores locations of tracked assets in Amazon Simple Storage Service (Amazon S3). Additionally, you can use AWS Lambda to enrich incoming location data with data from other sources, such as an Amazon DynamoDB table containing vehicle maintenance details. Then a data analyst can use the geospatial querying capabilities of Amazon Athena to gain insights, such as the number of days their vehicles have operated in the proposed boundaries of an expanded ULEZ. Because vehicles that do not meet ULEZ emissions standards are subjected to a daily charge to operate within the zone, you can use the location data, along with maintenance data such as age of the vehicle, current mileage, and current emissions standards to estimate the amount the company would have to spend on daily fees.
This post shows how you can use Amazon Location, EventBridge, Lambda, Amazon Data Firehose, and Amazon S3 to build a location-aware data pipeline, and use this data to drive meaningful insights using AWS Glue and Athena.
Overview of solution
This is a fully serverless solution for location-based asset management. The solution consists of the following interfaces:
IoT or mobile application – A mobile application or an Internet of Things (IoT) device allows the tracking of a company vehicle while it is in use and transmits its current location securely to the data ingestion layer in AWS. The ingestion approach is not in scope of this post. Instead, a Lambda function in our solution simulates sample vehicle journeys and directly updates Amazon Location tracker objects with randomized locations.
Data analytics – Business analysts gather operational insights from multiple data sources, including the location data collected from the vehicles. Data analysts are looking for answers to questions such as, “How long did a given vehicle historically spend inside a proposed zone, and how much would the fees have cost had the policy been in place over the past 12 months?”
The following diagram illustrates the solution architecture.
The workflow consists of the following key steps:
The tracking functionality of Amazon Location is used to track the vehicle. Using the EventBridge integration, filtered positional updates are published to an EventBridge event bus. This solution uses distance-based filtering to reduce costs and jitter. Distance-based filtering ignores location updates in which devices have moved less than 30 meters (98.4 feet).
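To see what a 30-meter filter means in practice, the following sketch (an illustrative approximation, not the service’s actual implementation) computes the great-circle distance between successive updates and keeps only those in which the device moved at least 30 meters:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in meters."""
    r = 6371000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def should_publish(prev, curr, threshold_m=30.0):
    """Keep an update only if the device moved at least threshold_m meters.

    prev and curr are (longitude, latitude) tuples; prev is None for the
    first update seen from a device.
    """
    if prev is None:
        return True
    return haversine_m(prev[0], prev[1], curr[0], curr[1]) >= threshold_m

# ~0.001 degrees of latitude is roughly 111 meters, so this update passes:
print(should_publish((-0.1276, 51.5072), (-0.1276, 51.5082)))  # True
# ~1 meter of movement is filtered out:
print(should_publish((-0.1276, 51.5072), (-0.1276, 51.50721)))  # False
```

Updates that fall below the threshold never reach the event bus, which is what keeps downstream storage and processing costs down.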
Amazon Location device position events arrive on the EventBridge default bus with source: ["aws.geo"] and detail-type: ["Location Device Position Event"]. One rule is created to forward these events to two downstream targets: a Lambda function, and a Firehose delivery stream.
Two different patterns, one for each target, are described in this post to demonstrate different approaches to committing the data to an S3 bucket:
Lambda function – The first approach uses a Lambda function to demonstrate how you can use code in the data pipeline to directly transform the incoming location data. You can modify the Lambda function to fetch additional vehicle information from a separate data store (for example, a DynamoDB table or a Customer Relationship Management system) to enrich the data, before storing the results in an S3 bucket. In this model, the Lambda function is invoked for each incoming event.
Firehose delivery stream – The second approach uses a Firehose delivery stream to buffer and batch the incoming positional updates, before storing them in an S3 bucket without modification. This method uses GZIP compression to optimize storage consumption and query performance. You can also use the data transformation feature of Data Firehose to invoke a Lambda function to perform data transformation in batches.
AWS Glue crawls both S3 bucket paths, populates the AWS Glue database tables based on the inferred schemas, and makes the data available to other analytics applications through the AWS Glue Data Catalog.
Athena is used to run geospatial queries on the location data stored in the S3 buckets. The Data Catalog provides metadata that allows analytics applications using Athena to find, read, and process the location data stored in Amazon S3.
This solution includes a Lambda function that continuously updates the Amazon Location tracker with simulated location data from fictitious journeys. The Lambda function is triggered at regular intervals using a scheduled EventBridge rule.
You can test this solution yourself using the AWS Samples GitHub repository. The repository contains the AWS Serverless Application Model (AWS SAM) template and Lambda code required to try out this solution. Refer to the instructions in the README file for steps on how to provision and decommission this solution.
Visual layouts in some screenshots in this post may look different than those on your AWS Management Console.
Data generation
In this section, we discuss the steps to manually or automatically generate journey data.
Manually generate journey data
You can manually update device positions using the AWS Command Line Interface (AWS CLI) command aws location batch-update-device-position. Replace the tracker-name, device-id, Position, and SampleTime values with your own, and make sure that successive updates are more than 30 meters in distance apart to place an event on the default EventBridge event bus:
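The full CLI invocation isn’t reproduced here, but its shape can be sketched with a small helper. The following Python snippet is illustrative (the tracker and device names are placeholders, not resources from the solution); it assembles the --updates JSON and the complete command string:

```python
import json
from datetime import datetime, timezone

def build_position_update_cmd(tracker_name, device_id, lon, lat, sample_time=None):
    """Assemble an `aws location batch-update-device-position` command string."""
    if sample_time is None:
        sample_time = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    updates = [{
        "DeviceId": device_id,
        "Position": [lon, lat],  # Amazon Location expects [longitude, latitude]
        "SampleTime": sample_time,
    }]
    return (
        f"aws location batch-update-device-position "
        f"--tracker-name {tracker_name} "
        f"--updates '{json.dumps(updates)}'"
    )

cmd = build_position_update_cmd(
    "my-tracker", "vehicle1", -0.1276, 51.5072, "2024-01-01T09:00:00Z"
)
print(cmd)
```

Remember that successive Position values must be more than 30 meters apart for an event to be placed on the default EventBridge event bus.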
Automatically generate journey data using the simulator
The provided AWS CloudFormation template deploys an EventBridge scheduled rule and an accompanying Lambda function that simulates tracker updates from vehicles. This rule is enabled by default, and runs at a frequency specified by the SimulationIntervalMinutes CloudFormation parameter. The data generation Lambda function updates the Amazon Location tracker with a randomized position offset from the vehicles’ base locations.
Vehicle names and base locations are stored in the vehicles.json file. A vehicle’s starting position is reset each day, and base locations have been chosen to give them the ability to drift in and out of the ULEZ on a given day to provide a realistic journey simulation.
You can disable the rule temporarily by navigating to the scheduled rule details on the EventBridge console. Alternatively, change the parameter State: ENABLED to State: DISABLED for the scheduled rule resource GenerateDevicePositionsScheduleRule in the template.yml file. Rebuild and re-deploy the AWS SAM template for this change to take effect.
Location data pipeline approaches
The configurations outlined in this section are deployed automatically by the provided AWS SAM template. The information in this section is provided to describe the pertinent parts of the solution.
Amazon Location device position events
Amazon Location sends device position update events to EventBridge in the following format:
You can optionally specify an input transformation to modify the format and contents of the device position event data before it reaches the target.
Data enrichment using Lambda
Data enrichment in this pattern is facilitated through the invocation of a Lambda function. In this example, we call this function ProcessDevicePosition, and use a Python runtime. A custom transformation is applied in the EventBridge target definition to receive the event data in the following format:
You could apply additional transformations, such as the refactoring of Latitude and Longitude data into separate key-value pairs if this is required by the downstream business logic processing the events.
The following code demonstrates the Python application logic that is run by the ProcessDevicePosition Lambda function. Error handling has been skipped in this code snippet for brevity. The full code is available in the GitHub repo.
The preceding code creates an S3 object for each device position event received by EventBridge. The code uses the DeviceId as a prefix to write the objects to the bucket.
You can add additional logic to the preceding Lambda function code to enrich the event data using other sources. The example in the GitHub repo demonstrates enriching the event with data from a DynamoDB vehicle maintenance table.
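As a sketch of that enrichment step, the following snippet is an illustrative stand-in for the repository code, with the DynamoDB lookup replaced by an in-memory dict; the maintenance attribute names match those used elsewhere in this post:

```python
def enrich_position_event(event, maintenance_table):
    """Merge vehicle maintenance attributes into a device position event.

    maintenance_table stands in for a DynamoDB lookup keyed by DeviceId.
    """
    record = dict(event)  # copy, so the incoming event is left untouched
    details = maintenance_table.get(event["DeviceId"], {})
    for field in ("PurchaseDate", "Mileage", "MeetsEmissionStandards"):
        if field in details:
            record[field] = details[field]
    return record

maintenance = {
    "vehicle1": {
        "PurchaseDate": "2005-10-01",
        "Mileage": 10000,
        "MeetsEmissionStandards": False,
    },
}
event = {
    "DeviceId": "vehicle1",
    "Position": [-0.1276, 51.5072],
    "SampleTime": "2024-01-01T09:00:00Z",
}
print(enrich_position_event(event, maintenance)["Mileage"])  # 10000
```

Events for devices without a maintenance record pass through unchanged, which is why the resulting table schema can vary between records.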
In addition to the prerequisite AWS Identity and Access Management (IAM) permissions provided by the role AWSBasicLambdaExecutionRole, the ProcessDevicePosition function requires permissions to perform the S3 put_object action and any other actions required by the data enrichment logic. IAM permissions required by the solution are documented in the template.yml file.
Complete the following steps to create your Firehose delivery stream:
On the Amazon Data Firehose console, choose Firehose streams in the navigation pane.
Choose Create Firehose stream.
For Source, choose Direct PUT.
For Destination, choose Amazon S3.
For Firehose stream name, enter a name (for this post, ProcessDevicePositionFirehose).
Configure the destination settings with details about the S3 bucket in which the location data is stored, along with the partitioning strategy:
Use <S3_BUCKET_NAME> and <S3_BUCKET_FIREHOSE_PREFIX> to determine the bucket and object prefixes.
Use DeviceId as an additional prefix to write the objects to the bucket.
Enable Dynamic partitioning and New line delimiter to make sure partitioning is automatic based on DeviceId, and that new line delimiters are added between records in objects that are delivered to Amazon S3.
These are required by AWS Glue to later crawl the data, and for Athena to recognize individual records.
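If you prefer to script the delivery stream rather than use the console, the dynamic partitioning settings map onto the Firehose API’s extended S3 destination configuration. The following fragment is a sketch under that assumption; the prefix value is a placeholder:

```json
{
    "DynamicPartitioningConfiguration": { "Enabled": true },
    "Prefix": "<S3_BUCKET_FIREHOSE_PREFIX>/!{partitionKeyFromQuery:DeviceId}/",
    "ProcessingConfiguration": {
        "Enabled": true,
        "Processors": [
            {
                "Type": "MetadataExtraction",
                "Parameters": [
                    { "ParameterName": "MetadataExtractionQuery", "ParameterValue": "{DeviceId: .DeviceId}" },
                    { "ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6" }
                ]
            },
            {
                "Type": "AppendDelimiterToRecord",
                "Parameters": [
                    { "ParameterName": "Delimiter", "ParameterValue": "\\n" }
                ]
            }
        ]
    }
}
```

The MetadataExtraction processor pulls DeviceId out of each JSON record for partitioning, and AppendDelimiterToRecord adds the new line delimiters between records.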
Create an EventBridge rule and attach targets
The EventBridge rule ProcessDevicePosition defines two targets: the ProcessDevicePosition Lambda function, and the ProcessDevicePositionFirehose delivery stream. Complete the following steps to create the rule and attach targets:
On the EventBridge console, create a new rule.
For Name, enter a name (for this post, ProcessDevicePosition).
For Event bus, choose default.
For Rule type, select Rule with an event pattern.
For Event source, select AWS events or EventBridge partner events.
For Method, select Use pattern form.
In the Event pattern section, specify AWS services as the source, Amazon Location Service as the specific service, and Location Device Position Event as the event type.
For Target 1, attach the ProcessDevicePosition Lambda function as a target.
We use Input transformer to customize the event that is committed to the S3 bucket.
Configure Input paths map and Input template to organize the payload into the desired format.
After sufficient data has been generated, complete the following steps:
On the AWS Glue console, choose Crawlers in the navigation pane.
Select the crawlers that have been created, location-analytics-glue-crawler-lambda and location-analytics-glue-crawler-firehose.
Choose Run.
The crawlers will automatically classify the data into JSON format, group the records into tables and partitions, and commit associated metadata to the AWS Glue Data Catalog.
When the Last run statuses of both crawlers show as Succeeded, confirm that two tables (lambda and firehose) have been created on the Tables page.
The solution partitions the incoming location data based on the deviceid field. Therefore, as long as there are no new devices or schema changes, the crawlers don’t need to run again. However, if new devices are added, or a different field is used for partitioning, the crawlers need to run again.
You’re now ready to query the tables using Athena.
Query the data using Athena
Athena is a serverless, interactive analytics service built to analyze unstructured, semi-structured, and structured data where it is hosted. If this is your first time using the Athena console, follow the instructions to set up a query result location in Amazon S3. To query the data with Athena, complete the following steps:
On the Athena console, open the query editor.
For Data source, choose AwsDataCatalog.
For Database, choose location-analytics-glue-database.
On the options menu (three vertical dots), choose Preview Table to query the content of both tables.
The query displays 10 sample positional records currently stored in the table. The following screenshot is an example from previewing the firehose table. The firehose table stores raw, unmodified data from the Amazon Location tracker.
You can now experiment with geospatial queries. The GeoJSON file for the 2021 London ULEZ expansion is part of the repository, and has already been converted into a query compatible with both Athena tables.
This query uses the ST_Within geospatial function to determine if a recorded position is inside or outside the ULEZ zone defined by the polygon. A new view called ulezvehicleanalysis_firehose is created with a new column, insidezone, which captures whether the recorded position exists within the zone.
A simple Python utility is provided, which converts the polygon features found in the downloaded GeoJSON file into ST_Polygon strings based on the well-known text format that can be used directly in an Athena query.
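The core of such a conversion can be sketched as follows; this is an illustrative reimplementation rather than the repository’s utility, and it assumes polygons without interior rings:

```python
def geojson_polygon_to_wkt(feature):
    """Convert a GeoJSON Polygon feature into a well-known text POLYGON
    string suitable for use with Athena's ST_Polygon function."""
    rings = feature["geometry"]["coordinates"]
    exterior = rings[0]  # assumes no interior rings (holes)
    points = ", ".join(f"{lon} {lat}" for lon, lat in exterior)
    return f"POLYGON (({points}))"

# A made-up rectangle roughly covering central London, for illustration:
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-0.2, 51.4], [0.1, 51.4], [0.1, 51.6],
                         [-0.2, 51.6], [-0.2, 51.4]]],
    },
}
print(geojson_polygon_to_wkt(feature))
# POLYGON ((-0.2 51.4, 0.1 51.4, 0.1 51.6, -0.2 51.6, -0.2 51.4))
```

Note that WKT lists coordinates as longitude then latitude, matching the GeoJSON ordering, and repeats the first point to close the ring.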
Choose Preview View on the ulezvehicleanalysis_firehose view to explore its content.
You can now run queries against this view to gain overarching insights.
This query establishes the total number of days each vehicle has entered ULEZ, and what the expected total charges would be. The query has been parameterized using the ? placeholder character. Parameterized queries allow you to rerun the same query with different parameter values.
Enter the daily fee amount for Parameter 1, then run the query.
The results display each vehicle, the total number of days spent in the proposed ULEZ, and the total charges based on the daily fee you entered.
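The arithmetic behind that query can be reproduced in a few lines of Python. This sketch is an assumption about the query’s logic, using made-up records: it counts the distinct days each vehicle was recorded inside the zone and multiplies by the daily fee:

```python
def ulez_charges(records, daily_fee):
    """records: iterable of (deviceid, sampletime 'YYYY-MM-DDThh:mm:ssZ',
    insidezone) tuples. Returns {deviceid: (days_in_zone, total_charge)}."""
    days_in_zone = {}
    for device_id, sample_time, inside_zone in records:
        if inside_zone:
            # The date prefix of the timestamp identifies the charging day.
            days_in_zone.setdefault(device_id, set()).add(sample_time[:10])
    return {device: (len(days), len(days) * daily_fee)
            for device, days in days_in_zone.items()}

records = [
    ("vehicle1", "2024-01-01T09:00:00Z", True),
    ("vehicle1", "2024-01-01T17:00:00Z", True),   # same day, counted once
    ("vehicle1", "2024-01-02T09:00:00Z", True),
    ("vehicle2", "2024-01-01T09:00:00Z", False),  # outside the zone
]
print(ulez_charges(records, 12.50))  # {'vehicle1': (2, 25.0)}
```

Multiple position updates on the same day incur only one daily charge, which is why the day count uses a set of dates rather than a count of rows.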
You can repeat this exercise using the lambda table. Data in the lambda table is augmented with additional vehicle details present in the vehicle maintenance DynamoDB table at the time it is processed by the Lambda function. The solution supports the following fields:
MeetsEmissionStandards (Boolean)
Mileage (Number)
PurchaseDate (String, in YYYY-MM-DD format)
You can also enrich the new data as it arrives.
On the DynamoDB console, find the vehicle maintenance table under Tables. The table name is provided as output VehicleMaintenanceDynamoTable in the deployed CloudFormation stack.
Choose Explore table items to view the content of the table.
Choose Create item to create a new record for a vehicle.
Enter DeviceId (such as vehicle1 as a String), PurchaseDate (such as 2005-10-01 as a String), Mileage (such as 10000 as a Number), and MeetsEmissionStandards (with a value such as False as Boolean).
Choose Create item to create the record.
Duplicate the newly created record with additional entries for other vehicles (such as for vehicle2 or vehicle3), modifying the values of the attributes slightly each time.
Rerun the location-analytics-glue-crawler-lambda AWS Glue crawler after new data has been generated to confirm that the update to the schema with new fields is registered.
Preview the ulezvehicleanalysis_lambda view to confirm that the new columns have been created.
If errors such as Column 'mileage' cannot be resolved are displayed, the data enrichment is not taking place, or the AWS Glue crawler has not yet detected updates to the schema.
If the Preview table option is only returning results from before you created records in the DynamoDB table, return the query results in descending order using sampletime (for example, order by sampletime desc limit 100;).
Now we focus on the vehicles that don’t currently meet emissions standards, and order the vehicles in descending order based on the mileage per year (calculated using the latest mileage / age of vehicle in years).
In this example, we can see that out of our fleet of vehicles, five have been reported as not meeting emission standards. We can also see the vehicles that have accumulated high mileage per year, and the number of days spent in the proposed ULEZ. The fleet operator may now decide to prioritize these vehicles for replacement. Because location data is enriched with the most up-to-date vehicle maintenance data at the time it is ingested, you can further evolve these queries to run over a defined time window. For example, you could factor in mileage changes within the past year.
Due to the dynamic nature of the data enrichment, any new data being committed to Amazon S3, along with the query results, will be altered as and when records are updated in the DynamoDB vehicle maintenance table.
Clean up
Refer to the instructions in the README file to clean up the resources provisioned for this solution.
Conclusion
This post demonstrated how you can use Amazon Location, EventBridge, Lambda, Amazon Data Firehose, and Amazon S3 to build a location-aware data pipeline, and use the collected device position data to drive analytical insights using AWS Glue and Athena. By tracking these assets in real time and storing the results, companies can derive valuable insights on how effectively their fleets are being utilized and better react to changes in the future. You can now explore extending this sample code with your own device tracking data and analytics requirements.
About the Authors
Alan Peaty is a Senior Partner Solutions Architect at AWS. Alan helps Global Systems Integrators (GSIs) and Global Independent Software Vendors (GISVs) solve complex customer challenges using AWS services. Prior to joining AWS, Alan worked as an architect at systems integrators to translate business requirements into technical solutions. Outside of work, Alan is an IoT enthusiast and a keen runner who loves to hit the muddy trails of the English countryside.
Parag Srivastava is a Solutions Architect at AWS, helping enterprise customers with successful cloud adoption and migration. During his professional career, he has been extensively involved in complex digital transformation projects. He is also passionate about building innovative solutions around geospatial aspects of addresses.
In recent years, data lakes have become a mainstream architecture, and data quality validation is a critical factor to improve the reusability and consistency of the data. AWS Glue Data Quality reduces the effort required to validate data from days to hours, and provides computing recommendations, statistics, and insights about the resources required to run data validation.
AWS Glue Data Quality is built on Deequ, an open source tool developed and used at Amazon to calculate data quality metrics and verify data quality constraints and changes in the data distribution, so you can focus on describing how data should look instead of implementing algorithms.
In this post, we provide benchmark results of running increasingly complex data quality rulesets over a predefined test dataset. As part of the results, we show how AWS Glue Data Quality provides information about the runtime of extract, transform, and load (ETL) jobs, the resources measured in terms of data processing units (DPUs), and how you can track the cost of running AWS Glue Data Quality for ETL pipelines by defining custom cost reporting in AWS Cost Explorer.
This post is Part 6 of a six-part series of posts to explain how AWS Glue Data Quality works.
Part 6: Measure performance of AWS Glue Data Quality for ETL pipelines
Solution overview
We start by defining our test dataset in order to explore how AWS Glue Data Quality automatically scales depending on input datasets.
Dataset details
The test dataset contains 104 columns and 1 million rows stored in Parquet format. You can download the dataset or recreate it locally using the Python script provided in the repository. If you opt to run the generator script, you need to install the Pandas and Mimesis packages in your Python environment:
pip install pandas mimesis
The dataset schema is a combination of numerical, categorical, and string variables in order to have enough attributes to use a combination of built-in AWS Glue Data Quality rule types. The schema replicates some of the most common attributes found in financial market data such as instrument ticker, traded volumes, and pricing forecasts.
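As an illustration of the kind of data the generator script produces, the following minimal sketch builds rows mixing numerical, categorical, and string attributes. It uses only the Python standard library rather than Pandas and Mimesis, the column names are hypothetical, and it runs at a much smaller scale than the real 104-column, 1-million-row Parquet dataset:

```python
import random
import string

def generate_rows(num_rows, seed=42):
    """Generate synthetic market-data-like rows mixing numerical,
    categorical, and string attributes (hypothetical column names)."""
    rng = random.Random(seed)
    exchanges = ["NYSE", "NASDAQ", "LSE"]  # a categorical attribute
    rows = []
    for _ in range(num_rows):
        rows.append({
            "instrument_ticker": "".join(rng.choices(string.ascii_uppercase, k=4)),
            "exchange": rng.choice(exchanges),
            "traded_volume": rng.randint(0, 1_000_000),
            "price_forecast": round(rng.uniform(1.0, 500.0), 2),
        })
    return rows

rows = generate_rows(1000)
print(len(rows), sorted(rows[0].keys()))
```

The real script would write the resulting frame to Parquet (for example with pandas.DataFrame.to_parquet) before uploading it to Amazon S3.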
Data quality rulesets
We categorize some of the built-in AWS Glue Data Quality rule types to define the benchmark structure. The categories consider whether the rules perform column checks that don’t require row-level inspection (simple rules), row-by-row analysis (medium rules), or data type checks and comparisons of row values against other data sources (complex rules). The following table summarizes these rules.
| Simple Rules | Medium Rules | Complex Rules |
| --- | --- | --- |
| ColumnCount | DistinctValuesCount | ColumnValues |
| ColumnDataType | IsComplete | Completeness |
| ColumnExist | Sum | ReferentialIntegrity |
| ColumnNamesMatchPattern | StandardDeviation | ColumnCorrelation |
| RowCount | Mean | RowCountMatch |
| ColumnLength | | |
We define eight different AWS Glue ETL jobs where we run the data quality rulesets. Each job has a different number of data quality rules associated with it. Each job also has an associated user-defined cost allocation tag that we use to create a data quality cost report in AWS Cost Explorer later on.
We provide the plain text definition for each ruleset in the following table.
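As an illustration, a small ruleset that combines the three rule categories might look like the following in Data Quality Definition Language (DQDL). The column names are hypothetical:

```
Rules = [
    ColumnCount = 104,
    RowCount > 900000,
    IsComplete "instrument_ticker",
    Completeness "traded_volume" > 0.95
]
```

Larger rulesets in the benchmark repeat rules of this kind over more columns.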
Create the AWS Glue ETL jobs containing the data quality rulesets
We upload the test dataset to Amazon Simple Storage Service (Amazon S3), along with two additional CSV files (isocodes.csv and exchanges.csv) that we use to evaluate referential integrity rules in AWS Glue Data Quality after they have been added to the AWS Glue Data Catalog. Complete the following steps:
On the Amazon S3 console, create a new S3 bucket in your account and upload the test dataset.
Create a folder in the S3 bucket called isocodes and upload the isocodes.csv file.
Create another folder in the S3 bucket called exchange and upload the exchanges.csv file.
On the AWS Glue console, run two AWS Glue crawlers, one for each folder, to register the CSV content in the AWS Glue Data Catalog (data_quality_catalog). For instructions, refer to Adding an AWS Glue Crawler.
The AWS Glue crawlers generate two tables (exchanges and isocodes) as part of the AWS Glue Data Catalog.
On the IAM console, create a new IAM role called AWSGlueDataQualityPerformanceRole.
For Trusted entity type, select AWS service.
For Service or use case, choose Glue.
Choose Next.
For Permission policies, enter AWSGlueServiceRole.
Choose Next.
Create and attach a new inline policy (AWSGlueDataQualityBucketPolicy) with the following content. Replace the placeholder with the S3 bucket name you created earlier:
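A minimal sketch of such an inline policy follows, assuming read-only access (s3:GetObject and s3:ListBucket) is sufficient and using your-bucket-name as a placeholder for the bucket you created:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::your-bucket-name",
                "arn:aws:s3:::your-bucket-name/*"
            ]
        }
    ]
}
```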
Scroll to the end and under Performance Configuration, enable Cache Data.
Under Job details, for IAM Role, choose AWSGlueDataQualityPerformanceRole.
In the Tags section, add a dqjob tag with the value rs5.
This tag will be different for each of the data quality ETL jobs; we use them in AWS Cost Explorer to review the cost of the ETL jobs.
Choose Save.
Repeat these steps with the rest of the rulesets to define all the ETL jobs.
Run the AWS Glue ETL jobs
Complete the following steps to run the ETL jobs:
On the AWS Glue console, choose Visual ETL under ETL jobs in the navigation pane.
Select the ETL job and choose Run job.
Repeat for all the ETL jobs.
When the ETL jobs are complete, the Job run monitoring page will display the job details. As shown in the following screenshot, a DPU hours column is provided for each ETL job.
Review performance
The following table summarizes the duration, DPU hours, and estimated costs from running the eight different data quality rulesets over the same test dataset. Note that all rulesets have been run with the entire test dataset described earlier (104 columns, 1 million rows).
| ETL Job Name | Number of Rules | Tag | Duration (sec) | # of DPU Hours | # of DPUs | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- |
| ruleset-400 | 400 | dqjob:rs400 | 445.7 | 1.24 | 10 | $0.54 |
| ruleset-200 | 200 | dqjob:rs200 | 235.7 | 0.65 | 10 | $0.29 |
| ruleset-100 | 100 | dqjob:rs100 | 186.5 | 0.52 | 10 | $0.23 |
| ruleset-50 | 50 | dqjob:rs50 | 155.2 | 0.43 | 10 | $0.19 |
| ruleset-10 | 10 | dqjob:rs10 | 152.2 | 0.42 | 10 | $0.18 |
| ruleset-5 | 5 | dqjob:rs5 | 150.3 | 0.42 | 10 | $0.18 |
| ruleset-1 | 1 | dqjob:rs1 | 150.1 | 0.42 | 10 | $0.18 |
| ruleset-0 | 0 | dqjob:rs0 | 53.2 | 0.15 | 10 | $0.06 |
The cost of evaluating an empty ruleset is close to zero, but we have included it because an empty ruleset can serve as a quick test to validate the IAM role associated with the AWS Glue Data Quality jobs and the read permissions on the test dataset in Amazon S3. The cost of data quality jobs only starts to increase after evaluating rulesets with more than 100 rules, remaining roughly constant below that number.
We can observe that the cost of running data quality for the largest ruleset in the benchmark (400 rules) is still slightly above $0.50.
After you create and apply user-defined tags to your resources, it can take up to 24 hours for the tag keys to appear on your cost allocation tags page for activation. After you activate them, it can take up to another 24 hours for the tags to take effect.
On the AWS Cost Explorer console, choose Cost Explorer Saved Reports in the navigation pane.
Choose Create new report.
Select Cost and usage as the report type.
Choose Create Report.
For Date Range, enter a date range.
For Granularity, choose Daily.
For Dimension, choose Tag, then choose the dqjob tag.
Under Applied filters, choose the dqjob tag and the eight tags used in the data quality rulesets (rs0, rs1, rs5, rs10, rs50, rs100, rs200, and rs400).
Choose Apply.
The Cost and usage graph in AWS Cost Explorer will refresh and show the total monthly cost of the latest data quality ETL job runs, aggregated by ETL job, with the data quality ruleset tags as categories on the X-axis.
Clean up
To clean up the infrastructure and avoid additional charges, complete the following steps:
Empty the S3 bucket initially created to store the test dataset.
Delete the ETL jobs you created in AWS Glue.
Delete the AWSGlueDataQualityPerformanceRole IAM role.
Delete the custom report created in AWS Cost Explorer.
Conclusion
AWS Glue Data Quality provides an efficient way to incorporate data quality validation as part of ETL pipelines and scales automatically to accommodate increasing volumes of data. The built-in data quality rule types offer a wide range of options to customize the data quality checks and focus on how your data should look instead of implementing undifferentiated logic.
In this benchmark analysis, we showed how typically sized AWS Glue Data Quality rulesets add little or no overhead, whereas for complex rulesets the cost increases linearly. We also reviewed how you can tag AWS Glue Data Quality jobs to make cost information available in AWS Cost Explorer for quick reporting.
Ruben Afonso is a Global Financial Services Solutions Architect with AWS. He enjoys working on analytics and AI/ML challenges, with a passion for automation and optimization. When not at work, he enjoys finding hidden spots off the beaten path around Barcelona.
Kalyan Kumar Neelampudi (KK) is a Specialist Partner Solutions Architect (Data Analytics & Generative AI) at AWS. He acts as a technical advisor and collaborates with various AWS partners to design, implement, and build practices around data analytics and AI/ML workloads. Outside of work, he’s a badminton enthusiast and culinary adventurer, exploring local cuisines and traveling with his partner to discover new tastes and experiences.
Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.
When running Apache Flink applications on Amazon Managed Service for Apache Flink, you have the unique benefit of taking advantage of its serverless nature. This means that cost-optimization exercises can happen at any time—they no longer need to happen in the planning phase. With Managed Service for Apache Flink, you can add and remove compute with the click of a button.
Apache Flink is an open source stream processing framework used by hundreds of companies in critical business applications, and by thousands of developers who have stream-processing needs for their workloads. It is highly available and scalable, offering high throughput and low latency for the most demanding stream-processing applications. These scalable properties of Apache Flink can be key to optimizing your cost in the cloud.
Managed Service for Apache Flink is a fully managed service that reduces the complexity of building and managing Apache Flink applications. Managed Service for Apache Flink manages the underlying infrastructure and Apache Flink components that provide durable application state, metrics, logs, and more.
In this post, you can learn about the Managed Service for Apache Flink cost model, areas to save on cost in your Apache Flink applications, and overall gain a better understanding of your data processing pipelines. We dive deep into understanding your costs, understanding whether your application is overprovisioned, how to think about scaling automatically, and ways to optimize your Apache Flink applications to save on cost. Lastly, we ask important questions about your workload to determine if Apache Flink is the right technology for your use case.
How costs are calculated on Managed Service for Apache Flink
To optimize for costs with regards to your Managed Service for Apache Flink application, it can help to have a good idea of what goes into the pricing for the managed service.
Managed Service for Apache Flink applications are provisioned in Kinesis Processing Units (KPUs), which are compute instances composed of 1 virtual CPU and 4 GB of memory. The total number of KPUs assigned to the application is determined by multiplying two parameters that you control directly:
Parallelism – The level of parallel processing in the Apache Flink application
Parallelism per KPU – The number of parallel processes that run on each KPU
The number of KPUs is determined by the simple formula: KPU = Parallelism / ParallelismPerKPU, rounded up to the next integer.
An additional KPU per application is also charged for orchestration and not directly used for data processing.
The total number of KPUs determines the CPU, memory, and application storage allocated to the application. For each KPU, the application receives 1 vCPU and 4 GB of memory, of which 3 GB are allocated by default to the running application and the remaining 1 GB is used for application state store management. Each KPU also comes with 50 GB of storage attached to the application. Apache Flink retains application state in memory up to a configurable limit, and spills over to the attached storage.
The third cost component is durable application backups, or snapshots. This is entirely optional and its impact on the overall cost is small, unless you retain a very large number of snapshots.
At the time of writing, each KPU in the US East (Ohio) AWS Region costs $0.11 per hour, and attached application storage costs $0.10 per GB per month. The cost of durable application backup (snapshots) is $0.023 per GB per month. Refer to Amazon Managed Service for Apache Flink Pricing for up-to-date pricing and different Regions.
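Putting the KPU formula and the pricing above together, a quick back-of-the-envelope estimate can be scripted. This is a sketch using the US East (Ohio) prices quoted in this post; it ignores durable backups and assumes attached storage is billed only for the data-processing KPUs, not the orchestration KPU:

```python
import math

KPU_HOURLY_USD = 0.11          # per KPU-hour, US East (Ohio), at the time of writing
STORAGE_GB_MONTH_USD = 0.10    # per GB-month of attached application storage
STORAGE_GB_PER_KPU = 50        # each KPU comes with 50 GB of attached storage

def billed_kpus(parallelism, parallelism_per_kpu):
    """KPU = ceil(Parallelism / ParallelismPerKPU), plus 1 for orchestration."""
    return math.ceil(parallelism / parallelism_per_kpu) + 1

def monthly_cost(parallelism, parallelism_per_kpu, hours_per_month=730):
    """Rough monthly estimate: compute plus attached storage (backups excluded)."""
    kpus = billed_kpus(parallelism, parallelism_per_kpu)
    compute = kpus * KPU_HOURLY_USD * hours_per_month
    storage = (kpus - 1) * STORAGE_GB_PER_KPU * STORAGE_GB_MONTH_USD
    return round(compute + storage, 2)

# Parallelism 4 with 1 process per KPU: 4 data-processing KPUs + 1 orchestration
print(billed_kpus(4, 1))   # 5
# Doubling parallelism per KPU halves the data-processing KPUs: 2 + 1 = 3
print(billed_kpus(4, 2))   # 3
print(monthly_cost(4, 1))
```

The second call also illustrates the density effect of parallelism per KPU discussed later in this post: the same parallel runs on fewer billed KPUs.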
The following diagram illustrates the relative proportions of cost components for a running application on Managed Service for Apache Flink. You control the number of KPUs via the parallelism and parallelism per KPU parameters. Durable application backup storage is not represented.
In the following sections, we examine how to monitor your costs, optimize the usage of application resources, and find the required number of KPUs to handle your throughput profile.
AWS Cost Explorer and understanding your bill
To see what your current Managed Service for Apache Flink spend is, you can use AWS Cost Explorer.
On the Cost Explorer console, you can filter by date range, usage type, and service to isolate your spend for Managed Service for Apache Flink applications. The following screenshot shows the past 12 months of cost broken down into the price categories described in the previous section. The majority of spend in many of these months was from interactive KPUs from Amazon Managed Service for Apache Flink Studio.
Using Cost Explorer can not only help you understand your bill, but also help you further optimize particular applications that may have scaled beyond expectations automatically or due to throughput requirements. With proper application tagging, you could also break this spend down by application to see which applications account for the cost.
Signs of overprovisioning or inefficient use of resources
To minimize costs associated with Managed Service for Apache Flink applications, a straightforward approach involves reducing the number of KPUs your applications use. However, it’s crucial to recognize that this reduction could adversely affect performance if not thoroughly assessed and tested. To quickly gauge whether your applications might be overprovisioned, examine key indicators such as CPU and memory usage, application functionality, and data distribution. However, although these indicators can suggest potential overprovisioning, it’s essential to conduct performance testing and validate your scaling patterns before making any adjustments to the number of KPUs.
Metrics
Analyzing metrics for your application on Amazon CloudWatch can reveal clear signals of overprovisioning. If the containerCPUUtilization and containerMemoryUtilization metrics consistently remain below 20% over a statistically significant period for your application’s traffic patterns, it might be viable to scale down and allocate more data to fewer machines. Generally, we consider applications appropriately sized when containerCPUUtilization hovers between 50–75%. Although containerMemoryUtilization can fluctuate throughout the day and be influenced by code optimization, a consistently low value for a substantial duration could indicate potential overprovisioning.
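As a rough illustration, the thresholds above can be encoded as a simple heuristic. The 20% and 50–75% boundaries come from the guidance in this post and are not hard rules; always validate any change with performance testing:

```python
def assess_utilization(avg_cpu_pct, avg_memory_pct):
    """Rough heuristic over sustained averages of containerCPUUtilization
    and containerMemoryUtilization from CloudWatch."""
    if avg_cpu_pct < 20 and avg_memory_pct < 20:
        return "possibly overprovisioned: consider scaling down after testing"
    if 50 <= avg_cpu_pct <= 75:
        return "appropriately sized"
    return "inconclusive: review traffic patterns and test before changing KPUs"

print(assess_utilization(12, 15))
print(assess_utilization(60, 40))
```

In practice you would feed this with averages computed over a statistically significant window for your application's traffic patterns.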
Parallelism per KPU underutilized
Another subtle sign that your application is overprovisioned is if your application is purely I/O bound, or only does simple call-outs to databases and non-CPU intensive operations. If this is the case, you can use the parallelism per KPU parameter within Managed Service for Apache Flink to load more tasks onto a single processing unit.
You can view the parallelism per KPU parameter as a measure of density of workload per unit of compute and memory resources (the KPU). Increasing parallelism per KPU above the default value of 1 makes the processing more dense, allocating more parallel processes on a single KPU.
The following diagram illustrates how, by keeping the application parallelism constant (for example, 4) and increasing parallelism per KPU (for example, from 1 to 2), your application uses fewer resources with the same level of parallel runs.
The decision to increase parallelism per KPU, like all recommendations in this post, should be taken with great care. Increasing the parallelism per KPU value puts more load on a single KPU, and your workload must be able to tolerate that load. I/O-bound operations will not increase CPU or memory utilization in any meaningful way, but a process function that calculates many complex operations against the data would not be an ideal operation to collate onto a single KPU, because it could overwhelm the resources. Performance test and evaluate whether this is a good option for your applications.
How to approach sizing
Before you stand up a Managed Service for Apache Flink application, it can be difficult to estimate the number of KPUs you should allocate for your application. In general, you should have a good sense of your traffic patterns before estimating. Understanding your traffic patterns on a megabyte-per-second ingestion rate basis can help you approximate a starting point.
As a general rule, you can start with one KPU per 1 MB/s that your application will process. For example, if your application processes 10 MB/s (on average), you would allocate 10 KPUs as a starting point for your application. Keep in mind that this is a very high-level approximation that we have seen effective for a general estimate. However, you also need to performance test and evaluate whether or not this is an appropriate sizing in the long term based on metrics (CPU, memory, latency, overall job performance) over a long period of time.
To find the appropriate sizing for your application, you need to scale up and down the Apache Flink application. As mentioned, in Managed Service for Apache Flink you have two separate controls: parallelism and parallelism per KPU. Together, these parameters determine the level of parallel processing within the application and the overall compute, memory, and storage resources available.
The recommended testing methodology is to change parallelism or parallelism per KPU separately while experimenting to find the right sizing. In general, only change parallelism per KPU to increase the number of parallel I/O-bound operations without increasing the overall resources. For all other cases, only change parallelism (the number of KPUs changes accordingly) to find the right sizing for your workload.
You can also set parallelism at the operator level to restrict sources, sinks, or any other operator that needs to be restricted and independent of scaling mechanisms. For example, for an Apache Flink application that reads from an Apache Kafka topic with 10 partitions, you could use the setParallelism() method to restrict the KafkaSource to 10, but scale the Managed Service for Apache Flink application to a parallelism higher than 10 without creating idle tasks for the Kafka source. For other data processing cases, it is recommended not to set operator parallelism to a static value, but rather to define it as a function of the application parallelism so that it scales with the overall application.
Scaling and auto scaling
In Managed Service for Apache Flink, modifying parallelism or parallelism per KPU is an update of the application configuration. It causes the application to automatically take a snapshot (unless disabled), stop the application, and restart it with the new sizing, restoring the state from the snapshot. Scaling operations don’t cause data loss or inconsistencies, but they do pause data processing for a short period of time while infrastructure is added or removed. This is something you need to consider when rescaling in a production environment.
During the testing and optimization process, we recommend disabling automatic scaling and modifying parallelism and parallelism per KPU to find the optimal values. As mentioned, manual scaling is just an update of the application configuration, and can be run via the AWS Management Console or API with the UpdateApplication action.
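For example, with the AWS SDK for Python (Boto3), a manual rescale is a single UpdateApplication call against the kinesisanalyticsv2 client. The helper below only builds the request payload so it can be inspected before being sent; the application name and version ID are placeholders:

```python
def build_rescale_request(app_name, current_version_id, parallelism, parallelism_per_kpu):
    """Build an UpdateApplication payload that changes the application sizing.
    Sending it triggers a snapshot, stop, and restart with the new settings."""
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            "FlinkApplicationConfigurationUpdate": {
                "ParallelismConfigurationUpdate": {
                    "ConfigurationTypeUpdate": "CUSTOM",
                    "ParallelismUpdate": parallelism,
                    "ParallelismPerKPUUpdate": parallelism_per_kpu,
                    # Disable automatic scaling while testing manual sizing
                    "AutoScalingEnabledUpdate": False,
                }
            }
        },
    }

payload = build_rescale_request("my-flink-app", 3, parallelism=8, parallelism_per_kpu=2)
print(payload["ApplicationConfigurationUpdate"]
      ["FlinkApplicationConfigurationUpdate"]
      ["ParallelismConfigurationUpdate"]["ParallelismUpdate"])
```

The payload would then be applied with boto3.client("kinesisanalyticsv2").update_application(**payload).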
When you have found the optimal sizing, if you expect your ingested throughput to vary considerably, you may decide to enable auto scaling.
In Managed Service for Apache Flink, you can use multiple types of automatic scaling:
Out-of-the-box automatic scaling – You can enable this to adjust the application parallelism automatically based on the containerCPUUtilization metric. Automatic scaling is enabled by default on new applications. For details about the automatic scaling algorithm, refer to Automatic Scaling.
Fine-grained, metric-based automatic scaling – This is straightforward to implement. The automation can be based on virtually any metrics, including custom metrics your application exposes.
Scheduled scaling – This may be useful if you expect peaks of workload at given times of the day or days of the week.
Another way to approach cost savings for your Managed Service for Apache Flink applications is through code optimization. Unoptimized code requires more machines to perform the same computations. Optimizing the code could allow for lower overall resource utilization, which in turn could allow for scaling down and cost savings accordingly.
The first step to understanding your code performance is through the built-in utility within Apache Flink called Flame Graphs.
Flame Graphs, which are accessible via the Apache Flink dashboard, give you a visual representation of your stack trace. Each time a method is called, the bar that represents that method call in the stack trace gets larger proportional to the total sample count. This means that if you have an inefficient piece of code with a very long bar in the flame graph, this could be cause for investigation as to how to make this code more efficient. Additionally, you can use Amazon CodeGuru Profiler to monitor and optimize your Apache Flink applications running on Managed Service for Apache Flink.
When designing your applications, it is recommended to use the highest-level API that is required for a particular operation at a given time. Apache Flink offers four levels of API support: Flink SQL, Table API, DataStream API, and ProcessFunction API, with increasing levels of complexity and responsibility. If your application can be written entirely in Flink SQL or the Table API, using this can help take advantage of the Apache Flink framework rather than managing state and computations manually.
Data skew
On the Apache Flink dashboard, you can gather other useful information about your Managed Service for Apache Flink jobs.
On the dashboard, you can inspect individual tasks within your job application graph. Each blue box represents a task, and each task is composed of subtasks, or distributed units of work for that task. You can identify data skew among subtasks this way.
Data skew is an indicator that more data is being sent to one subtask than another, and that a subtask receiving more data is doing more work than the other. If you have such symptoms of data skew, you can work to eliminate it by identifying the source. For example, a GroupBy or KeyedStream could have a skew in the key. This would mean that data is not evenly spread among keys, resulting in an uneven distribution of work across Apache Flink compute instances. Imagine a scenario where you are grouping by userId, but your application receives data from one user significantly more than the rest. This can result in data skew. To eliminate this, you can choose a different grouping key to evenly distribute the data across subtasks. Keep in mind that this will require code modification to choose a different key.
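A quick way to reason about key skew offline is to compare per-key record counts, for example over a sample of your input stream. The sketch below flags a distribution as skewed when the busiest key carries a disproportionate share of records; the example keys and the interpretation threshold are illustrative only:

```python
from collections import Counter

def skew_ratio(keys):
    """Ratio of the busiest key's count to the mean count per key.
    A ratio near 1 means work is evenly distributed across subtasks."""
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# One userId dominates the stream: grouping by userId would skew subtasks
skewed = ["user1"] * 900 + ["user2"] * 50 + ["user3"] * 50
balanced = ["user1", "user2", "user3"] * 100

print(round(skew_ratio(skewed), 2))    # 2.7
print(round(skew_ratio(balanced), 2))  # 1.0
```

A high ratio suggests choosing a different, higher-cardinality grouping key so records spread more evenly across subtasks.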
When the data skew is eliminated, you can return to the containerCPUUtilization and containerMemoryUtilization metrics to reduce the number of KPUs.
Other areas for code optimization include making sure that you’re accessing external systems via the Async I/O API or via a data stream join, because a synchronous query out to a data store can create slowdowns and issues in checkpointing. Additionally, refer to Troubleshooting Performance for issues you might experience with slow checkpoints or logging, which can cause application backpressure.
How to determine if Apache Flink is the right technology
If your application doesn’t use any of the powerful capabilities behind the Apache Flink framework and Managed Service for Apache Flink, you could potentially save on cost by using something simpler.
Apache Flink’s tagline is “Stateful Computations over Data Streams.” Stateful, in this context, means that you are using the Apache Flink state construct. State, in Apache Flink, allows you to remember messages you have seen in the past for longer periods of time, making things like streaming joins, deduplication, exactly-once processing, windowing, and late-data handling possible. It does so by using an in-memory state store. On Managed Service for Apache Flink, it uses RocksDB to maintain its state.
If your application doesn’t involve stateful operations, you may consider alternatives such as AWS Lambda, containerized applications, or an Amazon Elastic Compute Cloud (Amazon EC2) instance running your application. The complexity of Apache Flink may not be necessary in such cases. Stateful computations, including cached data or enrichment procedures requiring independent stream position memory, may warrant Apache Flink’s stateful capabilities. If there’s a potential for your application to become stateful in the future, whether through prolonged data retention or other stateful requirements, continuing to use Apache Flink could be more straightforward. Organizations emphasizing Apache Flink for stream processing capabilities may prefer to stick with Apache Flink for stateful and stateless applications so all their applications process data in the same way. You should also factor in its orchestration features like exactly-once processing, fan-out capabilities, and distributed computation before transitioning from Apache Flink to alternatives.
Another consideration is your latency requirements. Because Apache Flink excels at real-time data processing, using it for an application with a 6-hour or 1-day latency requirement does not make sense. The cost savings from switching to a periodic batch process that reads from Amazon Simple Storage Service (Amazon S3), for example, would be significant.
Conclusion
In this post, we covered some aspects to consider when attempting cost-savings measures for Managed Service for Apache Flink. We discussed how to identify your overall spend on the managed service, some useful metrics to monitor when scaling down your KPUs, how to optimize your code for scaling down, and how to determine if Apache Flink is right for your use case.
Implementing these cost-saving strategies not only enhances your cost efficiency but also provides a streamlined and well-optimized Apache Flink deployment. By staying mindful of your overall spend, using key metrics, and making informed decisions about scaling down resources, you can achieve a cost-effective operation without compromising performance. As you navigate the landscape of Apache Flink, constantly evaluating whether it aligns with your specific use case becomes pivotal, so you can achieve a tailored and efficient solution for your data processing needs.
If any of the recommendations discussed in this post resonate with your workloads, we encourage you to try them out. With the metrics specified, and the tips on how to understand your workloads better, you should now have what you need to efficiently optimize your Apache Flink workloads on Managed Service for Apache Flink. The following are some helpful resources you can use to supplement this post:
Jeremy Ber has been working in the telemetry data space for the past 10 years as a Software Engineer, Machine Learning Engineer, and most recently a Data Engineer. At AWS, he is a Streaming Specialist Solutions Architect, supporting both Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.
Lorenzo Nicora works as Senior Streaming Solution Architect at AWS, helping customers across EMEA. He has been building cloud-native, data-intensive systems for over 25 years, working in the finance industry both through consultancies and for FinTech product companies. He has leveraged open-source technologies extensively and contributed to several projects, including Apache Flink.
Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance.
With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging. With a modern data architecture on AWS, you can rapidly build scalable data lakes; use a broad and deep collection of purpose-built data services; ensure compliance via unified data access, security, and governance; scale your systems at a low cost without compromising performance; and share data across organizational boundaries with ease, allowing you to make decisions with speed and agility at scale.
You can take all your data from various silos, aggregate that data in your data lake, and perform analytics and machine learning (ML) directly on top of that data. You can also store other data in purpose-built data stores to analyze and get fast insights from both structured and unstructured data. This data movement can be inside-out, outside-in, around the perimeter, or shared across organizational boundaries.
For example, application logs and traces from web applications can be collected directly in a data lake, and a portion of that data can be moved out to a log analytics store like Amazon OpenSearch Service for daily analysis. We think of this concept as inside-out data movement. The analyzed and aggregated data stored in Amazon OpenSearch Service can again be moved to the data lake to run ML algorithms for downstream consumption from applications. We refer to this concept as outside-in data movement.
Let’s look at an example use case. Example Corp. is a leading Fortune 500 company that specializes in social content. They have hundreds of applications generating data and traces at approximately 500 TB per day and have the following criteria:
Have logs available for fast analytics for 2 days
Beyond 2 days, have data available in a storage tier that can be made available for analytics with a reasonable SLA
Retain the data beyond 1 week in cold storage for 30 days (for purposes of compliance, auditing, and others)
In the following sections, we discuss three possible solutions to address similar use cases:
Tiered storage in Amazon OpenSearch Service and data lifecycle management
On-demand ingestion of logs data through Amazon OpenSearch Ingestion
Amazon OpenSearch Service direct queries with Amazon Simple Storage Service (Amazon S3)
Solution 1: Tiered storage in OpenSearch Service and data lifecycle management
OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budgeting requirements, you can choose the best strategy to balance cost and performance. You can also migrate data between different storage tiers.
Hot storage is used for indexing and updating, and provides the fastest access to data. Hot storage takes the form of an instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node.
UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and doesn’t need the same performance as hot storage. UltraWarm nodes use Amazon S3 with related caching solutions to improve performance.
Cold storage is optimized to store infrequently accessed or historical data. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data.
The workflow for this solution consists of the following steps:
Incoming data generated by the applications is streamed to an S3 data lake.
Data is ingested into OpenSearch Service using S3-SQS near-real-time ingestion through notifications set up on the S3 buckets.
After 2 days, hot data is migrated to UltraWarm storage to support read queries.
After 5 days in UltraWarm, the data is migrated to cold storage for 21 days and detached from any compute. The data can be reattached to UltraWarm when needed. Data is deleted from cold storage after 21 days.
Daily indexes are maintained for easy rollover. An Index State Management (ISM) policy automates the rollover or deletion of indexes that are older than 2 days.
The following is a sample ISM policy that rolls over data into the UltraWarm tier after 2 days, moves it to cold storage after 5 more days, and deletes it from cold storage after a further 21 days:
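A policy along these lines could look like the following sketch. The index pattern and timestamp field are illustrative, and the ages follow the timeline above: warm at 2 days, cold at 7 days of index age (2 days hot plus 5 days UltraWarm), and deletion at 28 days (21 days in cold). Adjust these to your own naming and retention requirements:

```json
{
  "policy": {
    "description": "Roll logs from hot to UltraWarm to cold, then delete",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "2d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [ { "warm_migration": {} } ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [ { "cold_migration": { "timestamp_field": "@timestamp" } } ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "28d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "cold_delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": { "index_patterns": ["app-logs-*"] }
  }
}
```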
UltraWarm uses sophisticated caching techniques to enable querying for infrequently accessed data. Although the data access is infrequent, the compute for UltraWarm nodes needs to be running all the time to make this access possible.
When operating at PB scale, we recommend decomposing the implementation into multiple OpenSearch Service domains when using tiered storage, to limit the blast radius of any errors.
The next two patterns remove the need to have long-running compute and describe on-demand techniques where the data is either brought when needed or queried directly where it resides.
Solution 2: On-demand ingestion of logs data through OpenSearch Ingestion
OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.
With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. You configure your data producers to send data to OpenSearch Ingestion. It automatically delivers the data to the domain or collection that you specify. You can also configure OpenSearch Ingestion to transform your data before delivering it. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.
There are two ways that you can use Amazon S3 as a source to process data with OpenSearch Ingestion. The first option is S3-SQS processing. You can use S3-SQS processing when you require near-real-time scanning of files after they are written to S3. It requires an Amazon Simple Queue Service (Amazon SQS) queue that receives S3 Event Notifications. You can configure S3 buckets to raise an event any time an object is stored or modified within the bucket, so that the object can be processed.
Alternatively, you can use a one-time or recurring scheduled scan to batch process data in an S3 bucket. To set up a scheduled scan, configure your pipeline with a schedule at the scan level that applies to all your S3 buckets, or at the bucket level. You can configure scheduled scans with either a one-time scan or a recurring scan for batch processing.
For a comprehensive overview of OpenSearch Ingestion, see Amazon OpenSearch Ingestion. For more information about the Data Prepper open source project, visit Data Prepper.
Solution overview
We present an architecture pattern with the following key components:
Application logs are streamed into the data lake, which helps feed hot data into OpenSearch Service in near-real time using OpenSearch Ingestion S3-SQS processing.
ISM policies within OpenSearch Service handle index rollovers or deletions. ISM policies let you automate these periodic, administrative operations by triggering them based on changes in the index age, index size, or number of documents. For example, you can define a policy that moves your index into a read-only state after 2 days and then deletes it after a set period of 3 days.
Cold data is available in the S3 data lake to be consumed on demand into OpenSearch Service using OpenSearch Ingestion scheduled scans.
The following diagram illustrates the solution architecture.
The workflow includes the following steps:
Incoming data generated by the applications is streamed to the S3 data lake.
For the current day, data is ingested into OpenSearch Service using S3-SQS near-real-time ingestion through notifications set up in the S3 buckets.
Daily indexes are maintained for easy rollover. An ISM policy automates the rollover or deletion of indexes that are older than 2 days.
If a request is made to analyze data older than 2 days, which is no longer held in OpenSearch Service, the data is ingested using the one-time scan feature of Amazon S3 for the specific time window.
For example, if the present day is January 10, 2024, and you need data from January 6, 2024 at a specific interval for analysis, you can create an OpenSearch Ingestion pipeline with an Amazon S3 scan in your YAML configuration, with the start_time and end_time to specify when you want the objects in the bucket to be scanned:
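A pipeline configuration along these lines illustrates the idea. The bucket name, role ARN, domain endpoint, Region, and index name are placeholders to adapt to your environment:

```yaml
version: "2"
historical-logs-pipeline:
  source:
    s3:
      codec:
        newline:
      compression: "gzip"
      scan:
        start_time: "2024-01-06T00:00:00"
        end_time: "2024-01-06T12:00:00"
        buckets:
          - bucket:
              name: "application-logs-bucket"
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
  sink:
    - opensearch:
        hosts: ["https://search-mydomain.us-east-1.es.amazonaws.com"]
        index: "ondemand-logs"
        aws:
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
```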
Data in Amazon S3 can be compressed, which reduces your overall data footprint and results in significant cost savings. For example, if you are generating 15 PB of raw JSON application logs per month, you can use a compression mechanism like GZIP, which can reduce the size to approximately 1 PB or less, resulting in significant cost savings.
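As a quick illustration of why repetitive log data compresses so well, the following Python sketch gzips a batch of structured JSON log lines (the record shape is made up for the example):

```python
import gzip
import json

# 10,000 near-identical JSON log lines, typical of application logs
records = [
    {"level": "INFO", "service": "web", "msg": "request handled", "latency_ms": i % 50}
    for i in range(10_000)
]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes")
```

Real-world ratios depend on how repetitive your log structure is, but text-heavy JSON commonly compresses by an order of magnitude.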
Stop the pipeline when possible
OpenSearch Ingestion scales automatically between the minimum and maximum OCUs set for the pipeline. After the pipeline has completed the Amazon S3 scan for the duration specified in the pipeline configuration, it continues to run at the minimum OCUs for continuous monitoring.
For on-demand ingestion for past time durations where you don’t expect new objects to be created, consider using supported pipeline metrics such as recordsOut.count to create Amazon CloudWatch alarms that can stop the pipeline. For a list of supported metrics, refer to Monitoring pipeline metrics.
CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want to monitor recordsOut.count to be 0 for longer than 5 minutes to initiate a request to stop the pipeline through the AWS Command Line Interface (AWS CLI) or API.
Solution 3: OpenSearch Service direct queries with Amazon S3
OpenSearch Service direct queries with Amazon S3 (preview) is a new way to query operational logs in Amazon S3 and S3 data lakes without needing to switch between services. You can now analyze infrequently queried data in cloud object stores and simultaneously use the operational analytics and visualization capabilities of OpenSearch Service.
OpenSearch Service direct queries with Amazon S3 provides zero-ETL integration to reduce the operational complexity of duplicating data or managing multiple analytics tools by enabling you to directly query your operational data, reducing costs and time to action. This zero-ETL integration is configurable within OpenSearch Service, where you can take advantage of various log type templates, including predefined dashboards, and configure data accelerations tailored to that log type. Templates include VPC Flow Logs, Elastic Load Balancing logs, and NGINX logs, and accelerations include skipping indexes, materialized views, and covered indexes.
With OpenSearch Service direct queries with Amazon S3, you can perform complex queries that are critical to security forensics and threat analysis and correlate data across multiple data sources, which aids teams in investigating service downtime and security events. After you create an integration, you can start querying your data directly from OpenSearch Dashboards or the OpenSearch API. You can audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way.
Direct queries from OpenSearch Service to Amazon S3 use Spark tables within the AWS Glue Data Catalog. After the table is cataloged in your AWS Glue metadata catalog, you can run queries directly on your data in your S3 data lake through OpenSearch Dashboards.
Solution overview
The following diagram illustrates the solution architecture.
This solution consists of the following key components:
The hot data for the current day is stream processed into OpenSearch Service domains through the event-driven architecture pattern using the OpenSearch Ingestion S3-SQS processing feature
The hot data lifecycle is managed through ISM policies attached to daily indexes
The cold data resides in your Amazon S3 bucket, and is partitioned and cataloged
The following screenshot shows a sample http_logs table that is cataloged in the AWS Glue metadata catalog. For detailed steps, refer to Data Catalog and crawlers in AWS Glue.
Before you create a data source, you should have an OpenSearch Service domain with version 2.11 or later and a target S3 table in the AWS Glue Data Catalog with the appropriate AWS Identity and Access Management (IAM) permissions. The IAM role needs access to the desired S3 buckets and read and write access to the AWS Glue Data Catalog. The following is a sample role and trust policy with appropriate permissions to access the AWS Glue Data Catalog through OpenSearch Service:
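The following sketch shows the general shape of such a role. The two policies are shown together for readability; attach the first as the role's trust relationship and the second as its permissions policy. The service principal, account ID, and bucket name are assumptions to verify against the current OpenSearch Service documentation for your Region:

```json
{
  "TrustPolicy": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": { "Service": "directquery.opensearchservice.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }
    ]
  },
  "PermissionsPolicy": {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "glue:GetDatabase",
          "glue:GetDatabases",
          "glue:GetTable",
          "glue:GetTables",
          "glue:GetPartition",
          "glue:GetPartitions"
        ],
        "Resource": "*"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
          "arn:aws:s3:::application-logs-bucket",
          "arn:aws:s3:::application-logs-bucket/*"
        ]
      }
    ]
  }
}
```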
To create a new data source on the OpenSearch Service console, provide the name of your new data source, specify the data source type as Amazon S3 with the AWS Glue Data Catalog, and choose the IAM role for your data source.
After you create a data source, you can go to OpenSearch Dashboards for the domain, which you use to configure access control, define tables, set up log type-based dashboards for popular log types, and query your data.
After you set up your tables, you can query your data in your S3 data lake through OpenSearch Dashboards. You can run a sample SQL query for the http_logs table you created in the AWS Glue Data Catalog tables, as shown in the following screenshot.
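A query of that shape could look like the following, where mys3 is a placeholder for the data source name chosen when the integration was created, and default is the AWS Glue database holding the table (all identifiers here are illustrative):

```sql
SELECT clientip, status, size, request
FROM mys3.default.http_logs
WHERE status >= 500
LIMIT 10;
```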
Best practices
Ingest only the data you need
Work backward from your business needs and establish the right datasets you’ll need. Evaluate if you can avoid ingesting noisy data and ingest only curated, sampled, or aggregated data. Using these cleaned and curated datasets will help you optimize the compute and storage resources needed to ingest this data.
Reduce the size of data before ingestion
When you design your data ingestion pipelines, use strategies such as compression, filtering, and aggregation to reduce the size of the ingested data. This will permit smaller data sizes to be transferred over the network and stored in your data layer.
Conclusion
In this post, we discussed solutions that enable petabyte-scale log analytics using OpenSearch Service in a modern data architecture. You learned how to create a serverless ingestion pipeline to deliver logs to an OpenSearch Service domain, manage indexes through ISM policies, configure IAM permissions to start using OpenSearch Ingestion, and create the pipeline configuration for data in your data lake. You also learned how to set up and use the OpenSearch Service direct queries with Amazon S3 feature (preview) to query data from your data lake.
To choose the right architecture pattern for your workloads when using OpenSearch Service at scale, consider performance, latency, cost, and data volume growth over time:
Use the tiered storage architecture with Index State Management policies when you need fast access to your hot data and want to balance cost and performance with UltraWarm nodes for read-only data.
Use on-demand ingestion of your data into OpenSearch Service when you can tolerate ingestion latencies to query data not retained in your hot nodes. You can achieve significant cost savings by using compressed data in Amazon S3 and ingesting data on demand into OpenSearch Service.
Use the direct query with Amazon S3 feature when you want to directly analyze your operational logs in Amazon S3 with the rich analytics and visualization features of OpenSearch Service.
As a next step, refer to the Amazon OpenSearch Developer Guide to explore logs and metric pipelines that you can use to build a scalable observability solution for your enterprise applications.
About the Authors
Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.
Muthu Pitchaimani is a Senior Specialist Solutions Architect with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.
Sam Selvan is a Principal Specialist Solution Architect with Amazon OpenSearch Service.
Part 1 of this two-part series described how to build a pseudonymization service that converts plain text data attributes into a pseudonym or vice versa. A centralized pseudonymization service provides a unique and universally recognized architecture for generating pseudonyms. Consequently, an organization can achieve a standard process to handle sensitive data across all platforms. Additionally, this removes the complexity and expertise otherwise needed to understand and implement various compliance requirements from development teams and analytical users, allowing them to focus on their business outcomes.
Following a decoupled service-based approach means that, as an organization, you are unbiased towards the use of any specific technologies to solve your business problems. No matter which technology is preferred by individual teams, they are able to call the pseudonymization service to pseudonymize sensitive data.
In this post, we focus on common extract, transform, and load (ETL) consumption patterns that can use the pseudonymization service. We discuss how to use the pseudonymization service in your ETL jobs on Amazon EMR (using Amazon EMR on EC2) for streaming and batch use cases. Additionally, you can find an Amazon Athena and AWS Glue based consumption pattern in the GitHub repo of the solution.
Solution overview
The following diagram describes the solution architecture.
The account on the right hosts the pseudonymization service, which you can deploy using the instructions provided in Part 1 of this series.
The account on the left is the one that you set up as part of this post, representing the ETL platform based on Amazon EMR using the pseudonymization service.
You can deploy the pseudonymization service and the ETL platform on the same account.
Amazon EMR empowers you to create, operate, and scale big data frameworks such as Apache Spark quickly and cost-effectively.
Both applications use a common utility function that makes HTTP POST calls against the API Gateway that is linked to the pseudonymization AWS Lambda function. The REST API calls are made per Spark partition using the Spark RDD mapPartitions function. The POST request body contains the list of unique values for a given input column. The POST request response contains the corresponding pseudonymized values. The code swaps the sensitive values with the pseudonymized ones for a given dataset. The result is saved to Amazon S3 and the AWS Glue Data Catalog, using Apache Iceberg table format.
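The per-partition call pattern can be sketched in plain Python as follows. The request and response shape ("values" in, "pseudonyms" out) is an assumption for illustration only, not the actual contract of the service from Part 1:

```python
import json
from urllib import request

def fetch_pseudonyms(endpoint, values):
    # One POST per Spark partition: send the partition's unique sensitive
    # values, get back the corresponding pseudonyms in the same order.
    body = json.dumps({"values": values}).encode("utf-8")
    req = request.Request(endpoint, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return dict(zip(values, json.loads(resp.read())["pseudonyms"]))

def pseudonymize_partition(rows, column, mapping):
    # Swap the sensitive column for its pseudonym in every row of the partition.
    for row in rows:
        yield {**row, column: mapping[row[column]]}
```

In the Spark job, a function like pseudonymize_partition would be applied through rdd.mapPartitions, with fetch_pseudonyms called once per partition on that partition's unique values.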
Iceberg is an open table format that supports ACID transactions, schema evolution, and time travel queries. You can use these features to implement the right to be forgotten (or data erasure) solutions using SQL statements or programming interfaces. Iceberg is supported by Amazon EMR starting with version 6.5.0, AWS Glue, and Athena. Batch and streaming patterns use Iceberg as their target format. For an overview of how to build an ACID compliant data lake using Iceberg, refer to Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR.
A bash terminal to run bash scripts that deploy CloudFormation stacks.
An additional S3 bucket containing the input dataset in Parquet files (only for batch applications). Copy the sample dataset to the S3 bucket.
A copy of the latest code repository in the local machine using git clone or the download option.
Open a new bash terminal and navigate to the root folder of the cloned repository.
The source code for the proposed patterns can be found in the cloned repository. It uses the following parameters:
ARTEFACT_S3_BUCKET – The S3 bucket where the infrastructure code will be stored. The bucket must be created in the same account and Region where the solution lives.
AWS_REGION – The Region where the solution will be deployed.
AWS_PROFILE – The named profile that will be applied to the AWS CLI command. This should contain credentials for an IAM principal with privileges to deploy the CloudFormation stack of related resources.
SUBNET_ID – The subnet ID where the EMR cluster will be spun up. The subnet is pre-existing and for demonstration purposes, we use the default subnet ID of the default VPC.
EP_URL – The endpoint URL of the pseudonymization service. Retrieve this from the solution deployed as Part 1 of this series.
S3_INPUT_PATH – The S3 URI pointing to the folder containing the input dataset as Parquet files.
KINESIS_DATA_STREAM_NAME – The Kinesis data stream name deployed with the CloudFormation stack.
BATCH_SIZE – The number of records to be pushed to the data stream per batch.
THREADS_NUM – The number of parallel threads used in the local machine to upload data to the data stream. More threads correspond to a higher message volume.
EMR_CLUSTER_ID – The EMR cluster ID where the code will be run (the EMR cluster was created by the CloudFormation stack).
STACK_NAME – The name of the CloudFormation stack, which is assigned in the deployment script.
Batch deployment steps
As described in the prerequisites, before you deploy the solution, upload the Parquet files of the test dataset to Amazon S3. Then provide the S3 path of the folder containing the files as the parameter <S3_INPUT_PATH>.
We create the solution resources via AWS CloudFormation. You can deploy the solution by running the deploy_1.sh script, which is inside the deployment_scripts folder.
After the deployment prerequisites have been satisfied, enter the following command to deploy the solution:
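Based on the parameter list above and the streaming deployment shown later, the invocation is expected to look similar to the following; the exact flags (notably -i for the input path and -x for the API secret) are assumptions to check against the script's usage message:

```bash
sh deployment_scripts/deploy_1.sh \
-a <ARTEFACT_S3_BUCKET> \
-r <AWS_REGION> \
-p <AWS_PROFILE> \
-s <SUBNET_ID> \
-e <EP_URL> \
-x <API_SECRET> \
-i <S3_INPUT_PATH>
```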
The output should look like the following screenshot.
The required parameters for the cleanup command are printed out at the end of the run of the deploy_1.sh script. Make sure to note down these values.
Test the batch solution
In the CloudFormation template deployed using the deploy_1.sh script, the EMR step containing the Spark batch application is added at the end of the EMR cluster setup.
To verify the results, check the S3 bucket identified in the CloudFormation stack outputs with the variable SparkOutputLocation.
You can also use Athena to query the table pseudo_table in the database blog_batch_db.
Clean up batch resources
To destroy the resources created as part of this exercise, navigate to the root folder of the cloned repository in a bash terminal and enter the cleanup command shown in the output of the previously run deploy_1.sh script:
sh ./deployment_scripts/cleanup_1.sh \
-a <ARTEFACT_S3_BUCKET> \
-s <STACK_NAME> \
-r <AWS_REGION> \
-e <EMR_CLUSTER_ID>
The output should look like the following screenshot.
Streaming deployment steps
We create the solution resources via AWS CloudFormation. You can deploy the solution by running the deploy_2.sh script, which is inside the deployment_scripts folder. The CloudFormation stack template for this pattern is available in the GitHub repo.
After the deployment prerequisites have been satisfied, enter the following command to deploy the solution:
sh deployment_scripts/deploy_2.sh \
-a <ARTEFACT_S3_BUCKET> \
-r <AWS_REGION> \
-p <AWS_PROFILE> \
-s <SUBNET_ID> \
-e <EP_URL> \
-x <API_SECRET>
The output should look like the following screenshot.
The required parameters for the cleanup command are printed out at the end of the output of the deploy_2.sh script. Make sure to save these values to use later.
Test the streaming solution
In the CloudFormation template deployed using the deploy_2.sh script, the EMR step containing the Spark streaming application is added at the end of the EMR cluster setup. To test the end-to-end pipeline, you need to push records to the deployed Kinesis data stream. With the following commands in a bash terminal, you can activate a Kinesis producer that will continuously put records in the stream, until the process is manually stopped. You can control the producer’s message volume by modifying the BATCH_SIZE and the THREADS_NUM variables.
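The producer invocation is expected to look similar to the following; the script corresponds to the producer.py used in the performance tests later in this post, and the flag names are assumptions to verify against the script itself:

```bash
python3 producer.py \
--stream-name <KINESIS_DATA_STREAM_NAME> \
--batch-size <BATCH_SIZE> \
--threads-num <THREADS_NUM>
```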
To verify the results, check the S3 bucket identified in the CloudFormation stack outputs with the variable SparkOutputLocation.
In the Athena query editor, check the results by querying the table pseudo_table in the database blog_stream_db.
Clean up streaming resources
To destroy the resources created as part of this exercise, complete the following steps:
Stop the Python Kinesis producer that was launched in a bash terminal in the previous section.
Enter the following command:
sh ./deployment_scripts/cleanup_2.sh \
-a <ARTEFACT_S3_BUCKET> \
-s <STACK_NAME> \
-r <AWS_REGION> \
-e <EMR_CLUSTER_ID>
The output should look like the following screenshot.
Performance details
Use cases might differ in requirements with respect to data size, compute capacity, and cost. We have provided some benchmarking and factors that may influence performance; however, we strongly advise you to validate the solution in lower environments to see if it meets your particular requirements.
You can influence the performance of the proposed solution (which pseudonymizes a dataset using Amazon EMR) through the maximum number of parallel calls to the pseudonymization service and the payload size of each call. In terms of parallel calls, factors to consider are the GetSecretValue calls limit from Secrets Manager (10,000 per second, hard limit) and the Lambda default concurrency (1,000 by default; can be increased by quota request). You can control the maximum parallelism by adjusting the number of executors, the number of partitions composing the dataset, and the cluster configuration (number and type of nodes). In terms of payload size of each call, factors to consider are the API Gateway maximum payload size (6 MB) and the Lambda function maximum runtime (15 minutes). You can control the payload size and the Lambda function runtime by adjusting the batch size value, a parameter of the PySpark script that determines the number of items to be pseudonymized per API call. To capture the influence of all these factors and assess the performance of the consumption patterns using Amazon EMR, we designed and monitored the following scenarios.
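As a back-of-envelope check of the payload limit, assuming the 5-byte VINs and the batch size of 900 used in the benchmarks that follow, each call stays far below the 6 MB API Gateway cap:

```python
API_GW_MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # API Gateway payload limit
VIN_SIZE_BYTES = 5                          # maximum VIN size in the tests
BATCH_SIZE = 900                            # items pseudonymized per API call

# Rough size of the values in one request body, ignoring JSON framing overhead
payload_estimate = BATCH_SIZE * VIN_SIZE_BYTES
print(payload_estimate)  # 4500 bytes per call
```

In other words, the batch size in these tests is bounded by Lambda runtime and throughput considerations long before the payload limit becomes relevant.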
Batch consumption pattern performance
To assess the performance for the batch consumption pattern, we ran the pseudonymization application with three input datasets composed of 1, 10, and 100 Parquet files of 97.7 MB each. We generated the input files using the dataset_generator.py script.
The cluster capacity nodes were 1 primary (m5.4xlarge) and 15 core (m5d.8xlarge). This cluster configuration remained the same for all three scenarios, and it allowed the Spark application to use up to 100 executors. The batch_size, which was also the same for the three scenarios, was set to 900 VINs per API call, and the maximum VIN size was 5 bytes.
The following table captures the information of the three scenarios.
| Execution ID | Repartition | Dataset Size | Number of Executors | Cores per Executor | Executor Memory | Runtime |
|---|---|---|---|---|---|---|
| A | 800 | 9.53 GB | 100 | 4 | 4 GiB | 11 minutes, 10 seconds |
| B | 80 | 0.95 GB | 10 | 4 | 4 GiB | 8 minutes, 36 seconds |
| C | 8 | 0.09 GB | 1 | 4 | 4 GiB | 7 minutes, 56 seconds |
As we can see, properly parallelizing the calls to our pseudonymization service enables us to control the overall runtime.
In the following examples, we analyze three important Lambda metrics for the pseudonymization service: Invocations, ConcurrentExecutions, and Duration.
The following graph depicts the Invocations metric, with the statistic SUM in orange and RUNNING SUM in blue.
By calculating the difference between the starting and ending point of the cumulative invocations, we can extract how many invocations were made during each run.
| Run ID | Dataset Size | Total Invocations |
|---|---|---|
| A | 9.53 GB | 1,467,000 − 0 = 1,467,000 |
| B | 0.95 GB | 1,616,500 − 1,467,000 = 149,500 |
| C | 0.09 GB | 1,631,000 − 1,616,500 = 14,500 |
As expected, the number of invocations scales proportionally with the dataset size, increasing by a factor of 10 between scenarios.
The following graph depicts the total ConcurrentExecutions metric, with the statistic MAX in blue.
The application is designed such that the maximum number of concurrent Lambda function runs is given by the amount of Spark tasks (Spark dataset partitions) that can be processed in parallel. This number can be calculated as MIN(executors × executor_cores, Spark dataset partitions).
In the test, run A processed 800 partitions, using 100 executors with four cores each. This makes 400 tasks processed in parallel so the Lambda function concurrent runs can’t be above 400. The same logic was applied for runs B and C. We can see this reflected in the preceding graph, where the amount of concurrent runs never surpasses the 400, 40, and 4 values.
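The cap on concurrent Lambda runs described above can be computed directly:

```python
def max_concurrent_runs(executors: int, cores_per_executor: int, partitions: int) -> int:
    # Parallel Spark tasks are limited both by the available task slots
    # (executors x cores) and by the number of dataset partitions.
    return min(executors * cores_per_executor, partitions)

# Runs A, B, and C from the batch benchmark
print(max_concurrent_runs(100, 4, 800))  # 400
print(max_concurrent_runs(10, 4, 80))    # 40
print(max_concurrent_runs(1, 4, 8))      # 4
```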
To avoid throttling, make sure that the amount of Spark tasks that can be processed in parallel is not above the Lambda function concurrency limit. If that is the case, you should either increase the Lambda function concurrency limit (if you want to keep up the performance) or reduce either the amount of partitions or the number of available executors (impacting the application performance).
The following graph depicts the Lambda Duration metric, with the statistic AVG in orange and MAX in green.
As expected, the size of the dataset doesn't affect the duration of the pseudonymization function run, which, apart from some initial invocations facing cold starts, remains constant at an average of 3 milliseconds throughout the three scenarios. This is because the maximum number of records included in each pseudonymization call is constant (the batch_size value).
Lambda is billed based on the number of invocations and the time it takes for your code to run (duration). You can use the average duration and invocations metrics to estimate the cost of the pseudonymization service.
Streaming consumption pattern performance
To assess the performance for the streaming consumption pattern, we ran the producer.py script, which defines a Kinesis data producer that pushes records in batches to the Kinesis data stream.
The streaming application was left running for 15 minutes and it was configured with a batch_interval of 1 minute, which is the time interval at which streaming data will be divided into batches. The following table summarizes the relevant factors.
| Repartition | Cluster Capacity Nodes | Number of Executors | Executor Memory | Batch Window | Batch Size | VIN Size |
|---|---|---|---|---|---|---|
| 17 | 1 primary (m5.xlarge), 3 core (m5.2xlarge) | 6 | 9 GiB | 60 seconds | 900 VINs/API call | 5 bytes/VIN |
The following graphs depict the Kinesis Data Streams metrics PutRecords (in blue) and GetRecords (in orange), aggregated over a 1-minute period using the statistic SUM. The first graph shows the metric in bytes, peaking at 6.8 MB per minute. The second graph shows the metric in record count, peaking at 85,000 records per minute.
We can see that the metrics GetRecords and PutRecords have overlapping values for almost the entire application’s run. This means that the streaming application was able to keep up with the load of the stream.
Next, we analyze the relevant Lambda metrics for the pseudonymization service: Invocations, ConcurrentExecutions, and Duration.
The following graph depicts the Invocations metric, with the statistic SUM in orange and RUNNING SUM in blue.
By calculating the difference between the starting and ending point of the cumulative invocations, we can extract how many invocations were made during the run. Specifically, in 15 minutes, the streaming application invoked the pseudonymization API 977 times, which is around 65 calls per minute.
The following graph depicts the total ConcurrentExecutions metric, with the statistic MAX in blue.
The repartition and the cluster configuration allow the application to process all Spark RDD partitions in parallel. As a result, the concurrent runs of the Lambda function are always equal to or below the repartition number, which is 17.
To avoid throttling, make sure that the amount of Spark tasks that can be processed in parallel is not above the Lambda function concurrency limit. For this aspect, the same suggestions as for the batch use case are valid.
The following graph depicts the Lambda Duration metric, with the statistic AVG in blue and MAX in orange.
As expected, aside from the Lambda function's cold start, the average duration of the pseudonymization function was more or less constant throughout the run. This is because the batch_size value, which defines the number of VINs to pseudonymize per call, was set to 900 and remained constant.
The ingestion rate of the Kinesis data stream and the consumption rate of our streaming application are factors that influence the number of API calls made against the pseudonymization service and therefore the related cost.
The following graph depicts the Lambda Invocations metric, with the statistic SUM in orange, and the Kinesis Data Streams GetRecords.Records metric, with the statistic SUM in blue. We can see that there is correlation between the amount of records retrieved from the stream per minute and the amount of Lambda function invocations, thereby impacting the cost of the streaming run.
Conclusion
Navigating through the rules and regulations of data privacy laws can be difficult. Pseudonymization of PII attributes is one of many points to consider while handling sensitive data.
In this two-part series, we explored how you can build and consume a pseudonymization service using various AWS services with features to assist you in building a robust data platform. In Part 1, we built the foundation by showing how to build a pseudonymization service. In this post, we showcased the various patterns to consume the pseudonymization service in a cost-efficient and performant manner. Check out the GitHub repository for additional consumption patterns.
About the Authors
Edvin Hallvaxhiu is a Senior Global Security Architect with AWS Professional Services and is passionate about cybersecurity and automation. He helps customers build secure and compliant solutions in the cloud. Outside work, he likes traveling and sports.
Rahul Shaurya is a Principal Big Data Architect with AWS Professional Services. He helps and works closely with customers building data platforms and analytical applications on AWS. Outside of work, Rahul loves taking long walks with his dog Barney.
Andrea Montanari is a Senior Big Data Architect with AWS Professional Services. He actively supports customers and partners in building analytics solutions at scale on AWS.
María Guerra is a Big Data Architect with AWS Professional Services. Maria has a background in data analytics and mechanical engineering. She helps customers architecting and developing data related workloads in the cloud.
Pushpraj Singh is a Senior Data Architect with AWS Professional Services. He is passionate about Data and DevOps engineering. He helps customers build data driven applications at scale.
Use of long-term access keys for authentication between cloud resources increases the risk of key exposure and unauthorized reuse of secrets. Amazon Web Services (AWS) has developed a solution that enables you to securely authenticate Azure resources with AWS resources using short-lived tokens, reducing these risks.
In this solution, we show you how to obtain temporary credentials in IAM. The solution uses AWS Security Token Service (AWS STS) in conjunction with Azure managed identities and Azure App Registration. This method provides a more secure and efficient way to bridge Azure and AWS clouds, providing seamless integration without compromising secure authentication and authorization standards.
Create and attach an Azure managed identity to an Azure virtual machine (VM).
Azure VM gets an Azure access token from the managed identity and sends it to AWS STS to retrieve temporary security credentials.
An IAM role created with a valid Azure tenant audience and subject validates that the claim is sourced from a trusted entity and sends temporary security credentials to the requesting Azure VM.
Azure VM accesses AWS resources using the AWS STS provided temporary security credentials.
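The token exchange in steps 2 through 4 can be sketched with the AWS SDK for Python (Boto3). This is a minimal illustration, not part of the original solution: the function names are hypothetical, and the Azure access token would come from the VM's instance metadata endpoint shown later in this post.

```python
def exchange_azure_token(azure_access_token, role_arn,
                         session_name="AzureFederation"):
    """Exchange an Azure AD access token for temporary AWS credentials
    via AssumeRoleWithWebIdentity (an unsigned AWS STS call)."""
    import boto3  # imported lazily; only needed for the real STS call
    sts = boto3.client("sts")
    response = sts.assume_role_with_web_identity(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        WebIdentityToken=azure_access_token,
    )
    return response["Credentials"]


def to_credential_process_output(credentials):
    """Shape the credentials into the JSON document expected by the
    AWS CLI/SDK credential_process contract (Version must be 1)."""
    return {
        "Version": 1,
        "AccessKeyId": credentials["AccessKeyId"],
        "SecretAccessKey": credentials["SecretAccessKey"],
        "SessionToken": credentials["SessionToken"],
        "Expiration": credentials["Expiration"],
    }
```

AssumeRoleWithWebIdentity is called without AWS credentials; the IAM role's trust policy decides whether the presented Azure token is accepted.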
To prepare the authentication process with Microsoft Entra ID, an enterprise application must be created in Microsoft Entra ID. This serves as a sign-in endpoint and provides the necessary user identity information through OIDC access tokens to the identity provider (IdP) of the target AWS account.
Note: You can get short-term credentials by providing access tokens from managed identities or enterprise applications. This post covers the enterprise application use case.
Register a new application in Azure
In the Azure portal, select Microsoft Entra ID.
Select App registrations.
Select New registration.
Enter a name for your application and then select an option in Supported account types (in this example, we chose Accounts in this Organization directory only). Leave the other options as is. Then choose Register.
Figure 2: Register an application in the Azure portal
Configure the application ID URI
In the Azure portal, select Microsoft Entra ID.
Select App registrations.
On the App registrations page, select All applications and choose the newly registered application.
On the newly registered application’s overview page, choose Application ID URI and then select Add.
On the Edit application ID URI page, enter the value of the URI, which looks like urn://<name of the application> or api://<name of the application>.
The application ID URI will be used later as the audience in the identity provider (IdP) section of AWS.
Figure 3: Configure the application ID URI
Open the newly registered application’s overview page.
In the navigation pane, under Manage, choose App roles.
Select Create app role and then enter a Display name and for Allowed member types, select Both (Users/Groups + Applications).
For Description, enter a description.
Select Do you want to enable this app role? and then choose Apply.
Figure 4: Create and enable an application role
Assign a managed identity—as created in Step 4 of the prerequisites—to the new application role. This operation can only be done by either using the Azure Cloud Shell or running scripts locally by installing the latest version of the Microsoft Graph PowerShell SDK. (For more information about assigning managed identities to application roles using PowerShell, see Azure documentation.)
You must have the following information:
ObjectID: To find the managed identity’s Object (Principal) ID, go to the Managed Identities page, select the identity name, and then select Overview.
Figure 5: Find the ObjectID of the managed identity
ID: To find the ID of the application role, go to App registrations, select the application name, and then select App roles.
Figure 6: Find the ID of the application role
PrincipalID: Same as ObjectID, which is the managed identity’s Object (Principal) ID.
ResourceID: The ObjectID of the resource service principal, which you can find by going to the Enterprise applications page and selecting the application. Select Overview and then Properties to find the ObjectID.
Figure 7: Find the ResourceID
With the resource IDs, you can now use Azure Cloud Shell and run the following script in PowerShell terminal with New-AzureADServiceAppRoleAssignment. Replace the variables with the resource IDs.
In the AWS Management Console for IAM, create an IAM Identity Provider.
In the left navigation pane, select Identity providers and then choose Add an identity provider.
For Provider type, choose OpenID Connect.
For Provider URL, enter https://sts.windows.net/<Microsoft Entra Tenant ID>. Replace <Microsoft Entra Tenant ID> with your Tenant ID from Azure. This allows only identities from your Azure tenant to access your AWS resources.
For Audience, enter the application ID URI that you configured in the Configure the application ID URI procedure. (For the managed identity use case, you would instead use the client_id of the Azure managed identity.) If you have additional client IDs (also known as audiences) for this IdP, you can add them on the provider detail page later.
You can also use different audiences in the role trust policy in the next step to limit the roles that specific audiences can assume. To do so, you must provide a StringEquals condition in the trust policy of the IAM role.
Figure 8: Adding an audience (client ID)
Using an OIDC principal without a condition can be overly permissive. To make sure that only the intended identities assume the role, provide an audience (aud) and subject (sub) as conditions in the role trust policy for this IAM role.
sts.windows.net/<Microsoft Entra Tenant ID>/:sub represents the identity of your Azure workload that limits access to the specific Azure identity that can assume this role from the Azure tenant. See the following example for conditions.
Replace <Microsoft Entra Tenant ID> with your tenant ID from Azure.
Replace <Application ID URI> with your audience value configured in the previous step.
Replace <Managed Identity’s object (Principal) ID> with the managed identity’s ObjectID that you captured earlier (see Figure 5).
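As an illustration of the assembled conditions (the account ID is a placeholder and the angle-bracket values are the substitutions described above), the role trust policy could look like the following:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::111122223333:oidc-provider/sts.windows.net/<Microsoft Entra Tenant ID>/"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "sts.windows.net/<Microsoft Entra Tenant ID>/:aud": "<Application ID URI>",
          "sts.windows.net/<Microsoft Entra Tenant ID>/:sub": "<Managed Identity's object (Principal) ID>"
        }
      }
    }
  ]
}
```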
To test the access, you’ll assign a user assigned managed identity to an existing VM.
Sign in to the Azure portal.
Navigate to the desired VM and select Identity, User assigned, and then choose Add.
Figure 9: Assigning a User assigned Identity
Select the managed identity created as part of the prerequisites and then choose Add.
Figure 10: Add a user assigned managed identity
In AWS, we used credential_process in a separate AWS Config profile to dynamically and programmatically retrieve AWS temporary credentials. The credential process calls a bash script that retrieves an access token from Azure and uses the token to obtain temporary credentials from AWS STS. For the syntax and operating system requirements, see Source credentials with an external process. For this post, we created a custom profile called DevTeam-S3ReadOnlyAccess, as shown in the config file:
[profile DevTeam-S3ReadOnlyAccess]
credential_process = /opt/bin/credentials.sh
region = ap-southeast-2
For this example, credential_process invokes the script /opt/bin/credentials.sh. Replace <111122223333> with your own account ID.
/opt/bin/credentials.sh
#!/bin/bash
# Application ID URI from Azure
AUDIENCE="urn://dev-aws-account-team-a"
# Role ARN from AWS to assume
ROLE_ARN="arn:aws:iam::<111122223333>:role/Azure-AWSAssumeRole"
# Retrieve Access Token using Audience
access_token=$(curl "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=${AUDIENCE}" -H "Metadata:true" -s | jq -r '.access_token')
# Create credentials following JSON format required by AWS CLI
credentials=$(aws sts assume-role-with-web-identity --role-arn ${ROLE_ARN} --web-identity-token $access_token --role-session-name AWSAssumeRole | jq '.Credentials' | jq '.Version=1')
# Write credentials to STDOUT for AWS CLI to pick up
echo $credentials
After you configure the AWS CLI config file for the credential_process script, verify the setup by accessing AWS resources from the Azure VM.
Use the AWS SDK for Python (Boto3) to run S3AccessFromAzure.py. You should see a list of S3 buckets from your account. This example also demonstrates specifying a profile to use for credentials.
S3AccessFromAzure.py
import boto3
# Assume Role with Web Identity Provider profile
session = boto3.Session(profile_name='DevTeam-S3ReadOnlyAccess')
# Retrieve the list of existing buckets
s3 = session.client('s3')
response = s3.list_buckets()
# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')
Note: The AWS CLI doesn’t cache external process credentials; instead, the AWS CLI calls the credential_process for every CLI request, which creates a new role session. If you use AWS SDKs, the credentials are cached and reused until they expire.
We used an Azure VM as an example to access AWS resources, but a similar approach can be used for any compute resource in Azure that is capable of issuing Azure credentials.
Clean up
If you don’t need the resources that you created for this walkthrough, delete them to avoid future charges for the deployed resources:
Delete the VM instance, managed identity, and enterprise applications created in Azure.
Delete the resources that you provisioned on AWS to test the solution.
Conclusion
In this post, we showed you how to securely access AWS resources from Azure workloads using an IAM role assumed with one-time, short-term credentials. With this solution, your Azure workloads request temporary security credentials, removing the need for long-term AWS credentials or other less secure, secrets-based methods of authentication.
Use the following resources to help you get started with AWS IAM federation:
This blog post provides architectural guidance on AWS CloudHSM crypto user credential rotation and is intended for those using or considering using CloudHSM. CloudHSM is a popular solution for secure cryptographic material management. By using this service, organizations can benefit from a robust mechanism to manage their own dedicated FIPS 140-2 level 3 hardware security module (HSM) cluster in the cloud and a client SDK that enables crypto users to perform cryptographic operations on deployed HSMs.
Credential rotation is an AWS Well-Architected best practice as it helps reduce the risks associated with the use of long-term credentials. Additionally, organizations are often required to rotate crypto user credentials for their HSM clusters to meet compliance, regulatory, or industry requirements. Unlike most AWS services that use AWS Identity and Access Management (IAM) users or IAM policies to access resources within your cluster, HSM users are directly created and maintained on the HSM cluster. As a result, how the credential rotation operation is performed might impact the workload’s availability. Thus, it’s important to understand the available options to perform crypto user credential rotation and the impact each option has in terms of ease of implementation and downtime.
In this post, we dive deep into the different options, steps to implement them, and their related pros and cons. We finish with a matrix of the relative downtime, complexity, and cost of each option so you can choose which best fits your use case.
Solution overview
In this post, we consider three approaches:
Approach 1 — For a workload with a defined maintenance window. You can shut down all client connections to CloudHSM, change the crypto user’s password, and subsequently re-establish connections to CloudHSM. This option is the most straightforward, but requires some application downtime.
Approach 2 — You create an additional crypto user (with access to all cryptographic materials) with a new password and from which new client instances are deployed. When the new user and instances are in place, traffic is rerouted to the new instances through a load balancer. This option involves no downtime but requires additional infrastructure (client instances) and a process to share cryptographic material between the crypto users.
Approach 3 — You run two separate and identical environments, directing traffic to a live (blue) environment while making and testing the changes on a secondary (green) environment before redirecting traffic to the green environment. This option involves no downtime, but requires additional infrastructure (client instances and an additional CloudHSM cluster) to support the blue/green deployment strategy.
The first approach uses an application’s planned maintenance window to enact necessary crypto user password changes. It’s the most straightforward of the recommended options, with the least amount of complexity because no additional infrastructure is needed to support the password rotation activity. However, it requires downtime (preferably planned) to rotate the password and update the client application instances; depending on how you deploy a client application, you can shorten the downtime by automating the application deployment process. The main steps for this approach are shown in Figure 1:
Figure 1: Approach 1 to update crypto user password
To implement approach 1:
Terminate all client connections to a CloudHSM cluster. This is necessary because you cannot change a password while a crypto user’s session is active.
You can query an Amazon CloudWatch log group for your CloudHSM cluster to find out whether any user session is active. Additionally, you can audit Amazon Virtual Private Cloud (Amazon VPC) Flow Logs by enabling them for the elastic network interfaces (ENIs) related to the CloudHSM cluster, to see where traffic is coming from and link it to the applications.
Use the login command and log in as the user with the password you want to change. aws-cloudhsm > login --username <USERNAME> --role <ROLE>
Enter the user’s password.
Enter the user change-password command. aws-cloudhsm > user change-password --username <USERNAME> --role <ROLE>
Enter the new password.
Re-enter the new password.
Update the client connecting to CloudHSM to use the new credentials. Follow the SDK documentation for detailed steps if you are using PKCS #11, OpenSSL Dynamic Engine, JCE provider, or KSP and CNG providers.
Resume all client connections to the CloudHSM cluster.
Approach 2
The second approach employs two crypto users and a blue/green deployment strategy: you create two separate but identical client environments. One environment (blue) runs the current application version with crypto user 1 (CU1) and handles live traffic, while the other environment (green) runs a new application version with the updated crypto user 2 (CU2) password. After testing is complete on the green environment, traffic is directed to the green environment and the blue environment is deprecated.

In this approach, both crypto users have access to the required cryptographic material. When rotating the crypto user password, you spin up new client instances and swap connection credentials to use the second crypto user. Because the client application only uses one crypto user at a time, the second user can remain dormant and be reused in the future. Compared to the first approach, this approach adds complexity to your architecture so that you can redirect live application traffic to the new environment by deploying additional client instances without having to restart.

You also need to be aware that a shared user can only perform sign, encrypt, decrypt, verify, and HMAC operations with the shared key. Currently, export, wrap, modify, delete, and derive operations aren’t allowed with a shared user. This approach has the advantages of a classic blue/green deployment (no downtime and low risk), in addition to adding redundancy at the user management level by having multiple crypto users with access to the required cryptographic material. Figure 2 depicts a possible architecture:
Figure 2: Approach 2 to update crypto user password
Use the key share command to share the key with the other user so that both users have access to all the keys.
Start by running the key list command with a filter to return a specific key.
View the shared-users output to identify whom the key is currently shared with.
To share this key with a crypto user, enter the following command: aws-cloudhsm > key share --filter attr.label="rsa_key_to_share" attr.class=private-key --username <USERNAME> --role crypto-user
If CU1 is used to make client (that is, blue environment) connections to a CloudHSM cluster then change the password for CU2.
Follow the instructions in To change HSM user passwords or step 2 of Approach 1 to change the password assigned to CU2.
Spin up new client instances and use CU2 to configure the connection credentials (that is, green environment).
Add the new client instances to a new target group for the existing Application Load Balancer (ALB).
Next use the weighted target groups routing feature of ALB to route traffic to the newly configured environment.
You can use forward actions of the ALB listener rules setting to route requests to one or more target groups.
If you specify multiple target groups for a forward action, you must specify a weight for each target group. Each target group weight is a value from 0 to 999. Requests that match a listener rule with weighted target groups are distributed to these target groups based on their weights. For example, if you specify one with a weight of 10 and the other with a weight of 20, the target group with a weight of 20 receives twice as many requests as the other target group.
You can make these changes to the ALB setting using the AWS Command Line Interface (AWS CLI), AWS Management Console, or supported infrastructure as code (IaC) tools.
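As a sketch of the weighted routing change with Boto3 (the listener and target group ARNs are placeholders, and the helper names are illustrative), the helper below builds the forward action document and `modify_listener` applies it:

```python
def weighted_forward_action(weights):
    """Build an ALB forward action that splits traffic across target
    groups. `weights` maps target group ARN -> weight (0-999)."""
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": arn, "Weight": weight}
                for arn, weight in weights.items()
            ]
        },
    }


def shift_traffic(listener_arn, weights):
    """Apply the weighted forward action to an ALB listener."""
    import boto3  # imported lazily; only needed for the real call
    elbv2 = boto3.client("elbv2")
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[weighted_forward_action(weights)],
    )


# Example: send twice as much traffic to the green (CU2) target group.
action = weighted_forward_action({
    "arn:aws:elasticloadbalancing:...:targetgroup/blue/abc": 10,
    "arn:aws:elasticloadbalancing:...:targetgroup/green/def": 20,
})
```

Setting the blue weight to 0 completes the cutover; reversing the weights supports a quick rollback.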
For the next password rotation iteration, you can switch back to using CU1 with updated credentials by updating your client instances and redeploying using steps 6 and 7.
Approach 3
The third approach is a variation of the previous one: you build an identical environment (blue/green deployment) and change the crypto user password in the new environment to achieve zero downtime for the workload. You create two separate but identical CloudHSM clusters, with one serving as the live (blue) environment and the other as the test (green) environment in which changes are tested prior to deployment. After testing is complete in the green environment, production traffic is directed to the green environment and the blue environment is deprecated.

Again, this approach adds complexity to your architecture so that you can redirect live application traffic to the new environment by deploying additional client instances and a CloudHSM cluster during the deployment and cutover window without having to restart. Additionally, changes made to the blue cluster after the green cluster was created won’t be available in the green cluster; you can mitigate this with a brief embargo on changes while the cutover is in progress.

A key advantage of this approach is that it increases application availability without the need for a second crypto user, while still reducing deployment risk and simplifying rollback if a deployment fails. Such a deployment pattern is typically automated using continuous integration and continuous delivery (CI/CD) tools such as AWS CodeDeploy. For detailed deployment configuration options, see deployment configurations in CodeDeploy. Figure 3 depicts a possible architecture:
Figure 3: Approach 3 to update crypto user password
To implement approach 3:
Create a cluster from backup. Make sure you restore the new cluster in the same Availability Zone as the existing CloudHSM cluster. This will be your green environment.
Spin up new application instances (green environment) and configure them to connect to the new CloudHSM cluster.
Take note of the new CloudHSM cluster security group and attach it to the new client instances.
Follow the steps in To change HSM user passwords or Approach 1 step 2 to change the crypto user password on the new cluster.
Update the client connecting to CloudHSM with the new password.
Add the new client to the existing Application Load Balancer by following Approach 2 steps 6 and 7.
After the deployment is complete, you can delete the old cluster and client instances (blue environment).
Select the old cluster and then choose Delete cluster.
Confirm that you want to delete the cluster, then choose Delete.
To delete the cluster using the AWS Command Line Interface (AWS CLI), use the following command: aws cloudhsmv2 delete-cluster --cluster-id <cluster ID>
How to choose an approach
To better understand which approach is the best fit for your use case, consider the following criteria:
Downtime: What is the acceptable amount of downtime for your workload?
Implementation complexity: Do you need to make architecture changes to your workload and how complex is the implementation effort?
Cost: Is the additional cost required for the approach acceptable to the business?
Approach     Downtime   Relative implementation complexity   Relative infrastructure cost
Approach 1   Yes        Low                                  None
Approach 2   No         Medium                               Medium
Approach 3   No         Medium                               High
Approach 1, especially when run within a scheduled maintenance window, is the most straightforward of the three approaches because there’s no additional infrastructure required, and workload downtime is the only tradeoff. This is best suited for applications where planned downtime is acceptable and you need to keep solution complexity low.
Approach 2 involves no downtime for the workload, and the second crypto user serves as a backup for future password updates (such as if credentials are lost, or in case there are personnel changes). The downsides are the initial planning required to set up the workload to handle multiple CUs and share all keys among the crypto users, and the additional cost. This is best suited for workloads that require zero downtime and an architecture that supports hot swapping of incoming traffic.
Approach 3 also supports zero downtime for the workload, with a more complex implementation and some cost to set up additional infrastructure. This is best suited for workloads that require zero downtime, have an architecture that supports hot swapping of incoming traffic, and where you don’t want to maintain a second crypto user with shared access to all required cryptographic material.
Conclusion
In this post, we covered three approaches you can take to rotate the crypto user password on your CloudHSM cluster to align with the AWS security best practices of the Well-Architected Framework and to meet your compliance, regulatory, or industry requirements. Each has considerations in terms of relative cost, complexity, and downtime. We recommend carefully mapping each approach to your workload and picking the one best suited to your business and workload needs.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS CloudHSM re:Post or contact AWS Support.
As enterprises collect increasing amounts of data from various sources, the structure and organization of that data often need to change over time to meet evolving analytical needs. However, altering schema and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. This hampers agility and time to insight.
Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data. This is critical for fast-moving enterprises to augment data structures to support new use cases. For example, an ecommerce company may add new customer demographic attributes or order status flags to enrich analytics. Apache Iceberg manages these schema changes in a backward-compatible way through its innovative metadata table evolution architecture.
Similarly, partition evolution allows seamless adding, dropping, or splitting partitions. For instance, an ecommerce marketplace may initially partition order data by day. As orders accumulate, and querying by day becomes inefficient, they may split to day and customer ID partitions. Table partitioning organizes big datasets most efficiently for query performance. Iceberg gives enterprises the flexibility to incrementally adjust partitions rather than requiring tedious rebuild procedures. New partitions can be added in a fully compatible way without downtime or having to rewrite existing data files.
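As an illustration (the table and column names are hypothetical), both kinds of evolution are single metadata operations in Iceberg-aware SQL engines such as Spark SQL; no existing data files are rewritten:

```sql
-- Schema evolution: add a column without rewriting data files
ALTER TABLE ecomorders ADD COLUMNS (customer_segment string);

-- Partition evolution: future writes are partitioned by customer ID
-- as well; existing files keep their original day-based layout
ALTER TABLE ecomorders ADD PARTITION FIELD customer_id;
```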
This post demonstrates how you can harness Iceberg, Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and AWS Identity and Access Management (IAM) to implement a transactional data lake supporting seamless evolution. By allowing for painless schema and partition adjustments as data insights evolve, you can benefit from the future-proof flexibility needed for business success.
Overview of solution
For our example use case, a fictional large ecommerce company processes thousands of orders each day. When orders are received, updated, cancelled, shipped, delivered, or returned, the changes are made in their on-premises system, and those changes need to be replicated to an S3 data lake so that data analysts can run queries through Amazon Athena. The changes can contain schema updates as well. Due to the security requirements of different organizations, they need to manage fine-grained access control for the analysts through Lake Formation.
The following diagram illustrates the solution architecture.
The solution workflow includes the following key steps:
Ingest data from on premises into a Dropzone location using a data ingestion pipeline.
Merge the data from the Dropzone location into Iceberg using AWS Glue.
Query the data using Athena.
Prerequisites
For this walkthrough, you should have the following prerequisites:
To create your infrastructure with an AWS CloudFormation template, complete the following steps:
Log in as an administrator to your AWS account.
Open the AWS CloudFormation console.
Choose Launch Stack:
For Stack name, enter a name (for this post, icebergdemo1).
Choose Next.
Provide information for the following parameters:
DatalakeUserName
DatalakeUserPassword
DatabaseName
TableName
DatabaseLFTagKey
DatabaseLFTagValue
TableLFTagKey
TableLFTagValue
Choose Next.
Choose Next again.
In the Review section, review the values you entered.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Submit.
In a few minutes, the stack status will change to CREATE_COMPLETE.
You can go to the Outputs tab of the stack to see all the resources it has provisioned. The resources are prefixed with the stack name you provided (for this post, icebergdemo1).
Create an Iceberg table using Lambda and grant access using Lake Formation
To create an Iceberg table and grant access on it, complete the following steps:
Navigate to the Resources tab of the CloudFormation stack icebergdemo1 and search for logical ID named LambdaFunctionIceberg.
Choose the hyperlink of the associated physical ID.
You’re redirected to the Lambda function icebergdemo1-Lambda-Create-Iceberg-and-Grant-access.
On the Configuration tab, choose Environment variables in the left pane.
On the Code tab, you can inspect the function code.
The function uses the AWS SDK for Python (Boto3) APIs to provision the resources. It assumes the provisioned data lake admin role to perform the following tasks:
Grant DATA_LOCATION_ACCESS access to the data lake admin role on the registered data lake location
Grant DESCRIBE and SELECT on the Iceberg table LF-Tags for the data lake IAM user
Grant ALL, DESCRIBE, SELECT, INSERT, DELETE, and ALTER access on the Iceberg table LF-Tags to the AWS Glue ETL IAM role
On the Test tab, choose Test to run the function.
When the function is complete, you will see the message “Executing function: succeeded.”
Lake Formation helps you centrally manage, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage fine-grained access control for your data lake data on Amazon S3 and its metadata in the Data Catalog.
To add an Amazon S3 location as Iceberg storage in your data lake, register the location with Lake Formation. You can then use Lake Formation permissions for fine-grained access control to the Data Catalog objects that point to this location, and to the underlying data in the location.
The CloudFormation stack registered the data lake location.
Data location permissions in Lake Formation enable principals to create and alter Data Catalog resources that point to the designated registered Amazon S3 locations. Data location permissions work in addition to Lake Formation data permissions to secure information in your data lake.
Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes. In Lake Formation, these attributes are called LF-Tags. You can attach LF-Tags to Data Catalog resources, Lake Formation principals, and table columns. You can assign and revoke permissions on Lake Formation resources using these LF-Tags. Lake Formation allows operations on those resources when the principal’s tag matches the resource tag.
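A hedged sketch of an LF-Tag-based grant with Boto3 (the account ID, tag key/value, and helper names are illustrative; the real grants in this post are performed by the Lambda function):

```python
def lf_tag_policy_resource(catalog_id, resource_type, tag_key, tag_values):
    """Build the Resource document for an LF-Tag expression grant."""
    return {
        "LFTagPolicy": {
            "CatalogId": catalog_id,
            "ResourceType": resource_type,  # "DATABASE" or "TABLE"
            "Expression": [{"TagKey": tag_key, "TagValues": tag_values}],
        }
    }


def grant_lf_tag_permissions(principal_arn, resource, permissions):
    """Grant Lake Formation permissions on an LF-Tag expression."""
    import boto3  # imported lazily; only needed for the real call
    lakeformation = boto3.client("lakeformation")
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource=resource,
        Permissions=permissions,
    )


# Example: DESCRIBE and SELECT on all tables carrying the table LF-Tag.
resource = lf_tag_policy_resource(
    "111122223333", "TABLE", "TableLFTagKey", ["TableLFTagValue"]
)
```

A principal granted through this expression automatically gains access to any future table that carries the matching LF-Tag.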
Verify the Iceberg table from the Lake Formation console
To verify the Iceberg table, complete the following steps:
On the Lake Formation console, choose Databases in the navigation pane.
Open the details page for icebergdb1.
You can see the associated database LF-Tags.
Choose Tables in the navigation pane.
Open the details page for ecomorders.
In the Table details section, you can observe the following:
Table format shows as Apache Iceberg
Table management shows as Managed by Data Catalog
Location lists the data lake location of the Iceberg table
In the LF-Tags section, you can see the associated table LF-Tags.
In the Table details section, expand Advanced table properties to view the following:
metadata_location points to the location of the Iceberg table’s metadata file
table_type shows as ICEBERG
On the Schema tab, you can view the columns defined on the Iceberg table.
Integrate Iceberg with the AWS Glue Data Catalog and Amazon S3
Iceberg tracks individual data files in a table instead of directories. When there is an explicit commit on the table, Iceberg creates data files and adds them to the table. Iceberg maintains the table state in metadata files. Any change in table state creates a new metadata file that atomically replaces the older metadata. Metadata files track the table schema, partitioning configuration, and other properties.
Iceberg requires only file system operations that are compatible with object stores like Amazon S3.
Iceberg creates snapshots for the table contents. Each snapshot is a complete set of data files in the table at a point in time. Data files in snapshots are stored in one or more manifest files that contain a row for each data file in the table, its partition data, and its metrics.
The following diagram illustrates this hierarchy.
When you create an Iceberg table, it creates the metadata folder first and a metadata file in the metadata folder. The data folder is created when you load data into the Iceberg table.
Contents of the Iceberg metadata file
The Iceberg metadata file contains a lot of information, including the following:
format-version – Version of the Iceberg table
location – Amazon S3 location of the table
schemas – Name and data type of all columns on the table
partition-specs – Partitioned columns
sort-orders – Sort order of columns
properties – Table properties
current-snapshot-id – Current snapshot
refs – Table references
snapshots – List of snapshots, each containing the following information:
sequence-number – Sequence number of snapshots in chronological order (the highest number represents the current snapshot, 1 for the first snapshot)
snapshot-id – Snapshot ID
timestamp-ms – Timestamp when the snapshot was committed
summary – Summary of changes committed
manifest-list – List of manifests; this file name starts with snap-<snapshot-id>
schema-id – Sequence number of the schema in chronological order (the highest number represents the current schema)
snapshot-log – List of snapshots in chronological order
metadata-log – List of metadata files in chronological order
The metadata file has all the historical changes to the table’s data and schema. Reviewing the contents of the metadata file directly can be a time-consuming task. Fortunately, you can query the Iceberg metadata using Athena.
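For example, Athena exposes Iceberg metadata through special suffixes on the table name, such as $files, $manifests, $history, and $partitions. A small helper (the function name and database/table values are hypothetical) can build those queries:

```python
def metadata_query(database, table, suffix="history"):
    """Build an Athena query against an Iceberg metadata table.

    `suffix` selects the metadata view, e.g. "files", "manifests",
    "history", or "partitions".
    """
    return f'SELECT * FROM "{database}"."{table}${suffix}"'


# Example: list the data files behind the ecomorders table.
query = metadata_query("icebergdb1", "ecomorders", "files")
```

The resulting string can be submitted like any other Athena query, for example with the Athena console or the StartQueryExecution API.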
Iceberg framework in AWS Glue
AWS Glue 4.0 supports Iceberg tables registered with Lake Formation. In the AWS Glue ETL jobs, you need the following code to enable the Iceberg framework:
import sys
import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.conf import SparkConf
aws_account_id = boto3.client('sts').get_caller_identity().get('Account')
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'warehouse_path'])
# Set up configuration for AWS Glue to work with Apache Iceberg
conf = SparkConf()
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.warehouse", args['warehouse_path'])
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.glue.lakeformation-enabled", "true")
conf.set("spark.sql.catalog.glue_catalog.glue.id", aws_account_id)
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
For read/write access to the underlying data, in addition to Lake Formation permissions, the AWS Glue IAM role that runs the AWS Glue ETL jobs was granted the lakeformation:GetDataAccess IAM permission. With this permission, Lake Formation grants the request for temporary credentials to access the data.
The CloudFormation stack provisioned the four AWS Glue ETL jobs for you. The name of each job starts with your stack name (icebergdemo1). Complete the following steps to view the jobs:
Log in as an administrator to your AWS account.
On the AWS Glue console, choose ETL jobs in the navigation pane.
Search for jobs with icebergdemo1 in the name.
Merge data from Dropzone into the Iceberg table
For our use case, the company ingests its ecommerce orders data daily from its on-premises location into an Amazon S3 Dropzone location. The CloudFormation stack loaded three files with sample orders for 3 days, as shown in the following figures. You can see the data in the Dropzone location s3://icebergdemo1-s3bucketdropzone-kunftrcblhsk/data.
The AWS Glue ETL job icebergdemo1-GlueETL1-merge will run daily to merge the data into the Iceberg table. It has the following logic to add or update the data on Iceberg:
If the table has a matching order, update the status and shipping_id:
stmt_merge = f"""
MERGE INTO glue_catalog.{database_name}.{table_name} AS t
USING input_table AS s
ON t.ordernum = s.ordernum
WHEN MATCHED
THEN UPDATE SET
t.status = s.status,
t.shipping_id = s.shipping_id
WHEN NOT MATCHED THEN INSERT *
"""
spark.sql(stmt_merge)
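The statement above is built from job parameters at run time. A hedged sketch of how that parameterization might look (the helper name and the temp-view registration shown in the comments are illustrative, not the job’s exact code):

```python
def build_merge_statement(database_name, table_name):
    """Build the Iceberg MERGE used to upsert the daily orders file."""
    return f"""
MERGE INTO glue_catalog.{database_name}.{table_name} AS t
USING input_table AS s
ON t.ordernum = s.ordernum
WHEN MATCHED THEN UPDATE SET
    t.status = s.status,
    t.shipping_id = s.shipping_id
WHEN NOT MATCHED THEN INSERT *
"""

stmt = build_merge_statement("icebergdb1", "ecomorders")
print(stmt)
# In the Glue job, the dropzone file would first be registered as a temp view:
#   spark.read.csv(dropzone_path, header=True).createOrReplaceTempView("input_table")
#   spark.sql(stmt)
```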
Complete the following steps to run the AWS Glue merge job:
On the AWS Glue console, choose ETL jobs in the navigation pane.
Select the ETL job icebergdemo1-GlueETL1-merge.
On the Actions dropdown menu, choose Run with parameters.
On the Run parameters page, go to Job parameters.
For the --dropzone_path parameter, provide the S3 location of the input data (icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge1).
Run the job to add all the orders: 1001, 1002, 1003, and 1004.
For the --dropzone_path parameter, change the S3 location to icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge2.
Run the job again to add orders 2001 and 2002, and update orders 1001, 1002, and 1003.
For the --dropzone_path parameter, change the S3 location to icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge3.
Run the job again to add order 3001 and update orders 1001, 1003, 2001, and 2002.
Go to the data folder of the table to see the data files written by Iceberg when you merged the data into the table using the AWS Glue ETL job icebergdemo1-GlueETL1-merge.
Query Iceberg using Athena
The CloudFormation stack created the IAM user iceberguser1, which has read access on the Iceberg table using LF-Tags. To query Iceberg using Athena via this user, complete the following steps:
On the Athena console, choose Workgroups in the navigation pane.
Locate the workgroup that CloudFormation provisioned (icebergdemo1-workgroup).
Verify that the workgroup uses Athena engine version 3.
Athena engine version 3 supports Iceberg tables, with data stored in the Parquet, ORC, or Avro file formats.
Go to the Athena query editor.
Choose the workgroup icebergdemo1-workgroup on the dropdown menu.
For Database, choose icebergdb1. You will see the table ecomorders.
Run the following query to see the data in the Iceberg table:
SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum ;
Run the following query to see the table’s current partitions:
DESCRIBE icebergdb1.ecomorders ;
Partition-spec describes how the table is partitioned. In this example, there are no partitioned fields because you didn’t define any partitions on the table.
Iceberg partition evolution
You may need to change your partition structure; for example, due to changing trends in the common query patterns of downstream analytics. A change of partition structure for traditional tables is a significant operation that requires an entire data copy.
Iceberg makes this straightforward. When you change the partition structure on Iceberg, it doesn’t require you to rewrite the data files. The old data written with earlier partitions remains unchanged. New data is written using the new specifications in a new layout. Metadata for each of the partition versions is kept separately.
Let’s add the partition field category to the Iceberg table using the AWS Glue ETL job icebergdemo1-GlueETL2-partition-evolution:
ALTER TABLE glue_catalog.icebergdb1.ecomorders
ADD PARTITION FIELD category ;
On the AWS Glue console, run the ETL job icebergdemo1-GlueETL2-partition-evolution. When the job is complete, you can query partitions using Athena.
DESCRIBE icebergdb1.ecomorders ;
SELECT * FROM "icebergdb1"."ecomorders$partitions";
You can see the partition field category, but the partition values are null. There are no new data files in the data folder, because partition evolution is a metadata operation and doesn’t rewrite data files. When you add or update data, you will see the corresponding partition values populated.
Iceberg schema evolution
Iceberg supports in-place table evolution. You can evolve a table schema just like SQL. Iceberg schema updates are metadata changes, so no data files need to be rewritten to perform the schema evolution.
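Iceberg can treat these operations as metadata-only because every column is tracked by a stable field ID, and data files reference IDs rather than names. The following toy Python model (not Iceberg code) illustrates why a rename or a type widening never touches existing data files:

```python
# Model a schema as field-id -> (name, type). Data files reference field ids,
# so renaming a column or widening its type only rewrites metadata.
schema = {1: ("ordernum", "int"), 2: ("shipping_id", "string")}

def rename_column(schema, old, new):
    """Renaming only changes the name bound to a field id."""
    return {fid: (new if name == old else name, typ)
            for fid, (name, typ) in schema.items()}

def widen_type(schema, name, new_type):
    """Type widening only changes the type bound to a field id."""
    return {fid: (n, new_type if n == name else t)
            for fid, (n, t) in schema.items()}

evolved = widen_type(
    rename_column(schema, "shipping_id", "tracking_number"),
    "ordernum", "bigint",
)
print(evolved)
```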
To explore the Iceberg schema evolution, run the ETL job icebergdemo1-GlueETL3-schema-evolution via the AWS Glue console. The job runs the following SparkSQL statements:
ALTER TABLE glue_catalog.icebergdb1.ecomorders
ADD COLUMNS (shipping_carrier string) ;
ALTER TABLE glue_catalog.icebergdb1.ecomorders
RENAME COLUMN shipping_id TO tracking_number ;
ALTER TABLE glue_catalog.icebergdb1.ecomorders
ALTER COLUMN ordernum TYPE bigint ;
In the Athena query editor, run the following query:
SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum asc ;
You can verify the schema changes to the Iceberg table:
A new column has been added called shipping_carrier
The column shipping_id has been renamed to tracking_number
The data type of the column ordernum has changed from int to bigint
DESCRIBE icebergdb1.ecomorders;
Positional update
The data in tracking_number contains the shipping carrier concatenated with the tracking number. Let’s assume that we want to split this data in order to keep the shipping carrier in the shipping_carrier field and the tracking number in the tracking_number field.
On the AWS Glue console, run the ETL job icebergdemo1-GlueETL4-update-table. The job runs the following SparkSQL statement to update the table:
UPDATE glue_catalog.icebergdb1.ecomorders
SET shipping_carrier = substring(tracking_number,1,3),
tracking_number = substring(tracking_number,4,50)
WHERE tracking_number != '' ;
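The statement assumes the carrier code always occupies the first three characters of the combined value. The same split expressed in plain Python (illustrative only; note that SQL’s substring is 1-based while Python slicing is 0-based):

```python
def split_tracking(combined):
    """First 3 characters are the carrier; the rest is the tracking number."""
    if not combined:
        return ("", combined)
    return (combined[:3], combined[3:])

# A hypothetical combined value for illustration
print(split_tracking("UPS1Z999AA10123456784"))  # ('UPS', '1Z999AA10123456784')
```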
Query the Iceberg table to verify the updated data on tracking_number and shipping_carrier.
SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum ;
Now that the data has been updated on the table, you should see the partition values populated for category:
SELECT * FROM "icebergdb1"."ecomorders$partitions"
ORDER BY partition;
Clean up
To avoid incurring future charges, clean up the resources you created:
On the Lambda console, open the details page for the function icebergdemo1-Lambda-Create-Iceberg-and-Grant-access.
In the Environment variables section, choose the key Task_To_Perform and update the value to CLEANUP.
Run the function, which drops the database, table, and their associated LF-Tags.
On the AWS CloudFormation console, delete the stack icebergdemo1.
Conclusion
In this post, you created an Iceberg table using the AWS Glue API and used Lake Formation to control access on the Iceberg table in a transactional data lake. With AWS Glue ETL jobs, you merged data into the Iceberg table, and performed schema evolution and partition evolution without rewriting or recreating the Iceberg table. With Athena, you queried the Iceberg data and metadata.
Based on the concepts and demonstrations from this post, you can now build a transactional data lake in an enterprise using Iceberg, AWS Glue, Lake Formation, and Amazon S3.
About the Author
Satya Adimula is a Senior Data Architect at AWS based in Boston. With over two decades of experience in data and analytics, Satya helps organizations derive business insights from their data at scale.
Maximizing the value from enterprise software tools requires an understanding of who uses those tools and how they interact with them. As we have worked with builders rolling out Amazon CodeWhisperer to their enterprises, identifying usage patterns has been critical.
This blog post is a result of that work, builds on the Introducing Amazon CodeWhisperer Dashboard blog post and Amazon CloudWatch metrics, and enables customers to build dashboards to support their rollouts. Note that these features are only available in the CodeWhisperer Professional plan.
Organizations have leveraged the existing Amazon CodeWhisperer Dashboard to gain insights into developer usage. This blog explores how we can supplement the existing dashboard with detailed user analytics. Identifying leading contributors has accelerated tool usage and adoption within organizations. Acknowledging and incentivizing adopters can accelerate a broader adoption.
The architecture diagram outlines a streamlined process for tracking and analyzing Amazon CodeWhisperer usage events. It begins with logging these events in CodeWhisperer and AWS CloudTrail and then forwarding them to Amazon CloudWatch Logs. Configuring AWS CloudTrail involves using Amazon S3 for storage and AWS Key Management Service (AWS KMS) for log encryption. An AWS Lambda function analyzes the logs, extracting information about user activity. This blog also introduces an AWS CloudFormation template that simplifies the setup process, including creating the trail with an S3 bucket, a KMS key, and the Lambda function. The template also configures AWS IAM permissions, ensuring the Lambda function has the access rights to interact with other AWS services.
Configuring CloudTrail for CodeWhisperer User Tracking
This section details the process for monitoring user interactions while using Amazon CodeWhisperer. The aim is to utilize AWS CloudTrail to record instances where users receive code suggestions from CodeWhisperer. This involves setting up a new CloudTrail trail tailored to log events related to these interactions. By accomplishing this, you lay a foundational framework for capturing detailed user activity data, which is crucial for the subsequent steps of analyzing and visualizing this data through a custom AWS Lambda function and an Amazon CloudWatch dashboard.
Setup CloudTrail for CodeWhisperer
1. Navigate to AWS CloudTrail Service.
2. Create Trail
3. Choose Trail Attributes
a. Click on Create Trail
b. Provide a Trail Name, for example, “cwspr-preprod-cloudtrail”
c. Choose Enable for all accounts in my organization
d. Choose Create a new Amazon S3 bucket to configure the Storage Location
e. For Trail log bucket and folder, note down the given unique trail bucket name in order to view the logs at a future point.
f. Check Enabled to encrypt log files with SSE-KMS encryption
g. Enter an AWS Key Management Service alias for log file SSE-KMS encryption, for example, “cwspr-preprod-cloudtrail”
h. Select Enabled for CloudWatch Logs
i. Select New
j. Copy the given CloudWatch Logs log group name; you will need this for testing the Lambda function in a future step.
k. Provide a Role Name, for example, “CloudTrailRole-cwspr-preprod-cloudtrail”
l. Click Next.
4. Choose Log Events
a. Check Management events and Data events
b. Under Management events, keep the default options under API activity, Read and Write
c. Under Data events, choose CodeWhisperer for Data event type
d. Keep the default Log all events under Log selector template
e. Click Next
f. Review and click Create Trail
Please note: To include logs from member accounts, the trail must be created in the organization’s management account.
Gathering Application ARN for CodeWhisperer application
Step 1: Access AWS IAM Identity Center
1. Locate and click on the Services dropdown menu at the top of the console, then choose IAM Identity Center.
Step 2: Find the Application ARN for CodeWhisperer application
1. In the IAM Identity Center dashboard, choose Application assignments -> Applications in the left-side navigation pane.
2. Locate the application with Service as CodeWhisperer and click on it
3. Copy the Application ARN and store it in a secure place. You will need this ID to configure your Lambda function’s JSON event.
User Activity Analysis in CodeWhisperer with AWS Lambda
This section focuses on creating and testing our custom AWS Lambda function, which was explicitly designed to analyze user activity within an Amazon CodeWhisperer environment. This function is critical in extracting, processing, and organizing user activity data. It starts by retrieving detailed logs from CloudWatch containing CodeWhisperer user activity, then cross-references this data with the membership details obtained from the AWS Identity Center. This allows the function to categorize users into active and inactive groups based on their engagement within a specified time frame.
The Lambda function’s capability extends to fetching and structuring detailed user information, including names, display names, and email addresses. It then sorts and compiles these details into a comprehensive HTML output. This output highlights the CodeWhisperer usage in an organization.
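At its core, that categorization is a set comparison between Identity Center members and the usernames found in the CloudWatch logs. The following is a simplified, illustrative stand-in for that part of the function (names and data are hypothetical):

```python
def categorize_users(members, active_usernames):
    """Split Identity Center members into active and inactive CodeWhisperer users,
    based on which usernames appeared in the activity logs."""
    active_set = set(active_usernames)
    active = sorted(u for u in members if u in active_set)
    inactive = sorted(u for u in members if u not in active_set)
    return active, inactive

members = ["alice", "bob", "carol"]      # from the Identity Store membership list
seen_in_logs = ["carol", "alice"]        # usernames extracted from CloudWatch Logs
print(categorize_users(members, seen_in_logs))  # (['alice', 'carol'], ['bob'])
```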
Creating and Configuring Your AWS Lambda Function
1. Navigate to the Lambda service.
2. Click on Create function.
3. Choose Author from scratch.
4. Enter a Function name, for example, “AmazonCodeWhispererUserActivity”.
5. Choose Python 3.11 as the Runtime.
6. Click on ‘Create function’ to create your new Lambda function.
7. Access the Function: After creating your Lambda function, you will be directed to the function’s dashboard. If not, navigate to the Lambda service, find your function “AmazonCodeWhispererUserActivity”, and click on it.
8. Copy and paste your Python code into the inline code editor on the function’s dashboard.
9. Click ‘Deploy’ to save and deploy your code to the Lambda function. The lambda function zip file can be found here.
10. You have now successfully created and configured an AWS Lambda function with our Python code.
Updating the Execution Role for Your AWS Lambda Function
After you’ve created your Lambda function, you need to ensure it has the appropriate permissions to interact with other AWS services like CloudWatch Logs and AWS Identity Store. Here’s how you can update the IAM role permissions:
Locate the Execution Role:
1. Open Your Lambda Function’s Dashboard in the AWS Management Console.
2. Click on the ‘Configuration’ tab located near the top of the dashboard.
3. Set the Timeout setting to 15 minutes (up from the default of 3 seconds)
4. Select the ‘Permissions’ menu on the left side of the Configuration page.
5. Find the ‘Execution role’ section on the Permissions page.
6. Click on the Role Name to open the IAM (Identity and Access Management) role associated with your Lambda function.
7. In the IAM role dashboard, click on the Policy Name under the Permissions policies.
8. Edit the existing policy: Replace the policy with the following JSON.
9. Save the changes to the policy.
{
"Version":"2012-10-17",
"Statement":[
{
"Action":[
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:StartQuery",
"logs:GetQueryResults",
"sso:ListInstances",
"identitystore:DescribeUser",
"identitystore:ListUsers",
"identitystore:ListGroupMemberships"
],
"Resource":"*",
"Effect":"Allow"
},
{
"Action":[
"cloudtrail:DescribeTrails",
"cloudtrail:GetTrailStatus"
],
"Resource":"*",
"Effect":"Allow"
}
]
}
Your AWS Lambda function now has the necessary permissions to execute and interact with CloudWatch Logs and AWS Identity Store.
Testing Lambda Function with custom input
1. On your Lambda function’s dashboard.
2. On the function’s dashboard, locate the Test button near the top right corner.
3. Click on Test. This opens a dialog for configuring a new test event.
4. In the dialog, you’ll see an option to create a new test event. If it’s your first test, you’ll be prompted automatically to create a new event.
5. For Event name, enter a descriptive name for your test, such as “TestEvent”.
6. In the event code area, replace the existing JSON with your specific input:
{
"log_group_name": "{Insert Log Group Name}",
"start_date": "{Insert Start Date}",
"end_date": "{Insert End Date}",
"codewhisperer_application_arn": "{Insert Codewhisperer Application ARN}"
}
7. This JSON structure includes:
a. log_group_name: The name of the log group in CloudWatch Logs.
b. start_date: The start date and time for the query, formatted as “YYYY-MM-DD HH:MM:SS”.
c. end_date: The end date and time for the query, formatted as “YYYY-MM-DD HH:MM:SS”.
d. codewhisperer_application_arn: The ARN of the CodeWhisperer application in the AWS Identity Store.
8. Click on Save to store this test configuration.
9. With the test event selected, click on the Test button again to execute the function with this event.
10. The function will run, and you’ll see the execution result at the top of the page. This includes execution status, logs, and output.
11. Check the Execution result section to see if the function executed successfully.
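Before running the test, you can sanity-check the event payload shape locally. The following hedged sketch validates the keys and date format described above (the helper is our own, not part of the provided Lambda code; the sample values are placeholders):

```python
from datetime import datetime

REQUIRED_KEYS = ["log_group_name", "start_date", "end_date",
                 "codewhisperer_application_arn"]

def validate_event(event):
    """Return a list of problems with a test event; an empty list means it looks OK."""
    problems = [k for k in REQUIRED_KEYS if not event.get(k)]
    for key in ("start_date", "end_date"):
        try:
            datetime.strptime(event.get(key, ""), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            problems.append(f"{key} not in YYYY-MM-DD HH:MM:SS format")
    return problems

event = {
    "log_group_name": "aws-cloudtrail-logs-example",
    "start_date": "2024-01-01 00:00:00",
    "end_date": "2024-01-31 23:59:59",
    "codewhisperer_application_arn": "arn:aws:sso::123456789012:application/example",
}
print(validate_event(event))  # []
```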
Visualizing CodeWhisperer User Activity with Amazon CloudWatch Dashboard
This section focuses on effectively visualizing the data processed by our AWS Lambda function using a CloudWatch dashboard. This part of the guide provides a step-by-step approach to creating a “CodeWhispererUserActivity” dashboard within CloudWatch. It details how to add a custom widget to display the results from the Lambda Function. The process includes configuring the widget with the Lambda function’s ARN and the necessary JSON parameters.
1. Open the AWS Management Console and navigate to the CloudWatch service.
2. Create a New Dashboard: Click on ‘Dashboards’ in the left-hand navigation pane, then click on ‘Create dashboard’. Name your dashboard “CodeWhispererUserActivity” and click ‘Create Dashboard’.
3. Select “Other Content Types”, choose “Custom Widget”, and then click ‘Next’.
4. Configure the Lambda Function Widget: Enter your Lambda function’s ARN (Amazon Resource Name) or use the dropdown menu to find and select your “CodeWhispererUserActivity” function. Then, add the JSON Parameters.
5. Click ‘Add Widget’. The dashboard will update to include your new widget and will run the Lambda function to retrieve initial data.
6. Customize Your Dashboard: Arrange the dashboard by dragging and resizing widgets for optimal organization and visibility. Adjust the time range and refresh settings as needed to suit your monitoring requirements.
7. Save the Dashboard Configuration: After setting up and customizing your dashboard, click ‘Save dashboard’ to preserve your layout and settings.
CloudFormation Deployment for the CodeWhisperer Dashboard
The blog post concludes with a detailed AWS CloudFormation template designed to automate the setup of the necessary infrastructure for the Amazon CodeWhisperer User Activity Dashboard. This template provisions AWS resources, streamlining the deployment process. It includes the configuration of AWS CloudTrail for tracking user interactions, setting up CloudWatch Logs for logging and monitoring, and creating an AWS Lambda function for analyzing user activity data. Additionally, the template defines the required IAM roles and permissions, ensuring the Lambda function has access to the needed AWS services and resources.
The blog post also provides a JSON configuration for the CloudWatch dashboard. This is because, at the time of writing, AWS CloudFormation does not natively support the creation and configuration of CloudWatch dashboards. Therefore, the JSON configuration is necessary to manually set up the dashboard in CloudWatch, allowing users to visualize the processed data from the Lambda function. Here is the CloudFormation template.
Create a CloudWatch Dashboard and import the JSON below.
{
"widgets":[
{
"height":19,
"width":7,
"y":0,
"x":0,
"type":"custom",
"properties":{
"endpoint":"{Insert ARN of Lambda Function}",
"updateOn":{
"refresh":true,
"resize":true,
"timeRange":true
},
"params":{
"log_group_name":"{Insert Log Group Name}",
"start_date":"{Insert Start Date}",
"end_date":"{Insert End Date}",
"identity_store_id":"{Insert Identity Store ID}",
"group_id":"{Insert Group ID}"
}
}
}
]
}
Conclusion
In this blog, we detail a comprehensive process for establishing a user activity dashboard for Amazon CodeWhisperer to deliver data to support an enterprise rollout. The journey begins with setting up AWS CloudTrail to log user interactions with CodeWhisperer. This foundational step ensures the capture of detailed activity events, which is vital for our subsequent analysis. We then construct a tailored AWS Lambda function to sift through the CloudTrail logs, and finally create a dashboard in Amazon CloudWatch. This dashboard serves as a central platform for displaying the user data from our Lambda function in an accessible, user-friendly format.
You can reference the existing CodeWhisperer dashboard for additional insights. The Amazon CodeWhisperer Dashboard offers a comprehensive view summarizing valuable data about how your developers use the service.
Overall, this dashboard empowers you to track, understand, and influence the adoption and effective use of Amazon CodeWhisperer in your organizations, optimizing the tool’s deployment and fostering a culture of informed data-driven usage.
This post is cowritten with Amy Tseng, Jack Lin and Regis Chow from BMO.
BMO is the 8th largest bank in North America by assets. It provides personal and commercial banking, global markets, and investment banking services to 13 million customers. As they continue to implement their Digital First strategy for speed, scale, and the elimination of complexity, they are always seeking ways to innovate, modernize, and streamline data access control in the cloud. BMO has accumulated sensitive financial data and needed to build an analytic environment that was secure and performant. One of the bank’s key challenges related to strict cybersecurity requirements is implementing field-level encryption for personally identifiable information (PII), Payment Card Industry (PCI) data, and data that is classified as high privacy risk (HPR). Data with this secured data classification is stored in encrypted form both in the data warehouse and in their data lake. Only users with the required permissions are allowed to access the data in clear text.
Amazon Redshift is a fully managed data warehouse service that tens of thousands of customers use to manage analytics at scale. Amazon Redshift supports industry-leading security with built-in identity management and federation for single sign-on (SSO) along with multi-factor authentication. The Amazon Redshift Spectrum feature enables direct query of your Amazon Simple Storage Service (Amazon S3) data lake, and many customers are using this to modernize their data platform.
AWS Lake Formation is a fully managed service that simplifies building, securing, and managing data lakes. It provides fine-grained access control, tagging (tag-based access control (TBAC)), and integration across analytical services. It enables simplifying the governance of data catalog objects and accessing secured data from services like Amazon Redshift Spectrum.
BMO had more than a petabyte (PB) of sensitive financial data classified as follows:
Personally Identifiable Information (PII)
Payment Card Industry (PCI)
High Privacy Risk (HPR)
The bank aims to store data in their Amazon Redshift data warehouse and Amazon S3 data lake. They have a large, diverse end user base across sales, marketing, credit risk, and other business lines and personas:
Business analysts
Data engineers
Data scientists
Fine-grained access control needs to be applied to the data on both Amazon Redshift and data lake data accessed using Amazon Redshift Spectrum. The bank leverages AWS services like AWS Glue and Amazon SageMaker on this analytics platform. They also use an external identity provider (IdP) to manage their preferred user base and integrate it with these analytics tools. End users access this data using third-party SQL clients and business intelligence tools.
Solution overview
In this post, we’ll use synthetic data very similar to BMO’s, classified as PII, PCI, or HPR. Users and groups exist in an external IdP. These users federate for single sign-on to Amazon Redshift using native IdP federation. We’ll define the permissions using Amazon Redshift role-based access control (RBAC) for the user roles. For users accessing the data in the data lake using Amazon Redshift Spectrum, we’ll use Lake Formation policies for access control.
Technical Solution
Implementing the customer’s needs for securing different categories of data requires defining multiple AWS IAM roles, which in turn requires knowledge of IAM policies and maintaining them when permission boundaries change.
In this post, we show how we simplified managing the data classification policies with a minimum number of Amazon Redshift AWS IAM roles aligned by data classification, instead of permutations and combinations of roles by lines of business and data classifications. Other organizations, such as financial services institutions (FSIs), can benefit from BMO’s implementation of data security and compliance.
As part of this blog, the data will be uploaded into Amazon S3. Access to the data is controlled using policies defined with Redshift RBAC for the corresponding identity provider user groups, and tag-based access control will be implemented using AWS Lake Formation for the data on Amazon S3.
Solution architecture
The following diagram illustrates the solution architecture along with the detailed steps.
IdP users with groups like lob_risk_public, Lob_risk_pci, hr_public, and hr_hpr are assigned in External IdP (Identity Provider).
Each user is mapped to the Amazon Redshift local roles that are sent from the IdP, including aad:lob_risk_pci, aad:lob_risk_public, aad:hr_public, and aad:hr_hpr in Amazon Redshift. For example, User1, who is part of lob_risk_public and hr_hpr, is granted usage of the corresponding roles.
Attach iam_redshift_hpr, iam_redshift_pcipii, and iam_redshift_public AWS IAM roles to Amazon Redshift cluster.
AWS Glue databases backed by Amazon S3 (for example, lobrisk, lobmarket, and hr, and their respective tables) are referenced in Amazon Redshift. Using Amazon Redshift Spectrum, you can query these external tables and databases (for example, external_lobrisk_pci, external_lobrisk_public, external_hr_public, and external_hr_hpr), which are created using the AWS IAM roles iam_redshift_pcipii, iam_redshift_hpr, and iam_redshift_public, as shown in the solution steps.
AWS Lake Formation is used to control access to the external schemas and tables.
Using AWS Lake Formation tags, we apply the fine-grained access control to these external tables for AWS IAM roles (e.g., iam_redshift_hpr, iam_redshift_pcipii, and iam_redshift_public).
Finally, grant usage for these external schemas to their Amazon Redshift roles.
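The group-to-role mapping described in these steps can be sketched as a small lookup. The group, Redshift role, and IAM role names come from this post; the exact pairing of the public groups to iam_redshift_public below is an illustrative assumption, not a statement of BMO’s configuration:

```python
# IdP group -> Amazon Redshift local role (sent from the IdP as aad:<group>)
GROUP_TO_REDSHIFT_ROLE = {
    "lob_risk_public": "aad:lob_risk_public",
    "lob_risk_pci": "aad:lob_risk_pci",
    "hr_public": "aad:hr_public",
    "hr_hpr": "aad:hr_hpr",
}

# Redshift role -> IAM role used for Spectrum access (illustrative pairing)
ROLE_TO_IAM = {
    "aad:lob_risk_pci": "iam_redshift_pcipii",
    "aad:lob_risk_public": "iam_redshift_public",
    "aad:hr_public": "iam_redshift_public",
    "aad:hr_hpr": "iam_redshift_hpr",
}

def roles_for_user(idp_groups):
    """Resolve a user's IdP groups to the Redshift roles they are granted."""
    return sorted(GROUP_TO_REDSHIFT_ROLE[g]
                  for g in idp_groups if g in GROUP_TO_REDSHIFT_ROLE)

# User1 belongs to lob_risk_public and hr_hpr, as in the example above
print(roles_for_user(["lob_risk_public", "hr_hpr"]))
```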
Walkthrough
The following sections walk you through implementing the solution using synthetic data.
Download the data files and place your files into buckets
Amazon S3 serves as a scalable and durable data lake on AWS. Using a data lake, you can bring any open-format data like CSV, JSON, Parquet, or ORC into Amazon S3 and perform analytics on your data.
The solution utilizes CSV data files containing information classified as PCI, PII, HPR, or public. You can download the input files using the links provided below. Upload the downloaded files into Amazon S3, creating the folders and files as shown in the following screenshot, by following the instructions here. The detail of each file is provided in the following list:
credit_card_transaction_PCI_public – Contains credit card transactions categorized as public data, visible to principals who have access to public data.
credit_card_transaction_PCI – Contains credit card numbers, accessible only to principals who have access or clearance to PCI data.
lob_risk_High_confidential – Contains confidential business reports that only selected principals (for example, the head or leader of the line of business) have access to.
customers_PII_HPR_public – Contains customer data accessible to principals with public clearance.
customers_PII_HPR – Contains personally identifiable information, accessible only to principals with HPR clearance.
Register the files into AWS Glue Data Catalog using crawlers
The following instructions demonstrate how to register the downloaded files in the AWS Glue Data Catalog using crawlers. We organize files into databases and tables using the AWS Glue Data Catalog, as per the following steps. It is recommended to review the documentation to learn how to properly set up an AWS Glue database. Crawlers can automate the process of registering our downloaded files into the catalog rather than doing it manually. You’ll create the following databases in the AWS Glue Data Catalog:
lobrisk
lobmarket
hr
Example steps to create an AWS Glue database for lobrisk data are as follows:
Choose Add database and enter the name of the database as lobrisk.
Select Create database, as shown in the following screenshot.
Repeat the steps to create the other databases, lobmarket and hr.
An AWS Glue crawler scans the above files and catalogs metadata about them into the AWS Glue Data Catalog. The Glue Data Catalog organizes this Amazon S3 data into tables and databases, assigning columns and data types so the data can be queried using SQL that Amazon Redshift Spectrum can understand. Please review the AWS Glue documentation about creating the Glue crawler. Once the AWS Glue crawler finishes running, you’ll see the following respective databases and tables:
lobrisk
lob_risk_high_confidential_public
lob_risk_high_confidential
lobmarket
credit_card_transaction_pci
credit_card_transaction_pci_public
hr
customers_pii_hpr_public
customers_pii_hpr
Example steps to create an AWS Glue Crawler for lobrisk data are as follows:
Select Crawlers under Data Catalog in AWS Glue Console.
Next, choose Create crawler. Provide the crawler name as lobrisk_crawler and choose Next.
Make sure to select the data source as Amazon S3 and browse the Amazon S3 path to the lob_risk_high_confidential_public folder and choose an Amazon S3 data source.
Crawlers can crawl multiple folders in Amazon S3. Choose Add a data source and include the path s3://<<Your Bucket>>/lob_risk_high_confidential.
After adding another Amazon S3 folder, then choose Next.
Next, create a new IAM role in the Configuration security settings.
Choose Next.
Select the Target database as lobrisk. Choose Next.
Next, under Review, choose Create crawler.
Select Run crawler. This creates two tables, lob_risk_high_confidential_public and lob_risk_high_confidential, under the database lobrisk.
Similarly, create an AWS Glue crawler for lobmarket and hr data using the above steps.
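If you prefer to script the crawler setup instead of clicking through the console, the same configuration can be sketched with the AWS SDK for Python (boto3). The bucket name and role ARN below are placeholders you would replace with your own; the boto3 calls are left commented so the sketch focuses on the request shape.

```python
# Placeholder values -- substitute your own bucket and crawler role.
BUCKET = "your-bucket"
CRAWLER_ROLE_ARN = "arn:aws:iam::123456789012:role/GlueCrawlerRole"

def crawler_request(name, database, s3_prefixes):
    """Build a glue.create_crawler request covering a set of S3 folders."""
    return {
        "Name": name,
        "Role": CRAWLER_ROLE_ARN,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [{"Path": f"s3://{BUCKET}/{p}"} for p in s3_prefixes]
        },
    }

request = crawler_request(
    "lobrisk_crawler",
    "lobrisk",
    ["lob_risk_high_confidential_public", "lob_risk_high_confidential"],
)
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name="lobrisk_crawler")
```

Repeating crawler_request with the lobmarket and hr folders covers the other two databases.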
Create AWS IAM roles
Using AWS IAM, create the following IAM roles with permissions for Amazon Redshift, Amazon S3, AWS Glue, and AWS Lake Formation.
You can create the IAM roles on the IAM console and then attach managed policies to them:
iam_redshift_pcipii (AWS IAM role attached to Amazon Redshift cluster)
AmazonRedshiftFullAccess
AmazonS3FullAccess
Add an inline policy (Lakeformation-inline) for the Lake Formation permissions.
iam_redshift_hpr (AWS IAM role attached to the Amazon Redshift cluster): Add the following managed policies:
AmazonRedshiftFullAccess
AmazonS3FullAccess
Add inline policy (Lakeformation-inline), which was created previously.
iam_redshift_public (AWS IAM role attached to the Amazon Redshift cluster): Add the following managed policies:
AmazonRedshiftFullAccess
AmazonS3FullAccess
Add inline policy (Lakeformation-inline), which was created previously.
LF_admin (Lake Formation administrator): Add the following managed policies:
AWSLakeFormationDataAdmin
AWSLakeFormationCrossAccountManager
AWSGlueConsoleFullAccess
Use Lake Formation tag-based access control (LF-TBAC) to control access to the AWS Glue Data Catalog tables.
LF-TBAC is an authorization strategy that defines permissions based on attributes. Using the LF_admin Lake Formation administrator role, create the following LF-Tags:
Key                  Value
Classification:HPR   no, yes
Classification:PCI   no, yes
Classification:PII   no, yes
Classifications      non-sensitive, sensitive
Follow these instructions to create the Lake Formation tags:
Log in to the Lake Formation console (https://console.aws.amazon.com/lakeformation/) using the LF_admin AWS IAM role.
Go to LF-Tags and permissions in the Permissions section.
Choose Add LF-Tag.
Create the remaining LF-Tags as listed in the preceding table. Once created, you'll find the LF-Tags as shown in the following screenshot.
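The same tag definitions can be created programmatically. The following sketch builds create_lf_tag requests for the Lake Formation boto3 client from the table above; the client calls are commented out so the payload shape is the focus.

```python
# The four LF-Tags from the preceding table.
LF_TAGS = {
    "Classification:HPR": ["no", "yes"],
    "Classification:PCI": ["no", "yes"],
    "Classification:PII": ["no", "yes"],
    "Classifications": ["non-sensitive", "sensitive"],
}

# One create_lf_tag request per tag key.
requests = [
    {"TagKey": key, "TagValues": values} for key, values in LF_TAGS.items()
]
# import boto3
# lf = boto3.client("lakeformation")
# for req in requests:
#     lf.create_lf_tag(**req)
```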
Assign LF-TAG to the AWS Glue catalog tables
Assigning Lake Formation tags to tables typically involves a structured approach. The Lake Formation administrator can assign tags based on various criteria, such as data source, data type, business domain, data owner, or data quality. You can assign LF-Tags to Data Catalog assets, including databases, tables, and columns, which enables you to manage resource access effectively. Access to these resources is restricted to principals who have been given the corresponding LF-Tags (or those who have been granted access through the named resource approach).
Follow the instructions in the documentation to assign LF-Tags to the AWS Glue Data Catalog tables:
Glue Catalog Table                  Key                 Value
customers_pii_hpr_public            Classifications     non-sensitive
customers_pii_hpr                   Classification:HPR  yes
credit_card_transaction_pci         Classification:PCI  yes
credit_card_transaction_pci_public  Classifications     non-sensitive
lob_risk_high_confidential_public   Classifications     non-sensitive
lob_risk_high_confidential          Classification:PII  yes
Follow these instructions to assign LF-Tags to the Glue tables from the AWS console:
In the Lake Formation console, go to the Data catalog section and choose Databases.
Select the lobrisk database and choose View Tables.
Select the lob_risk_high_confidential table and edit its LF-Tags.
Assign Classification:PII as the assigned key with the value yes. Choose Save.
Similarly, assign the key Classifications with the value non-sensitive to the lob_risk_high_confidential_public table.
Follow the preceding instructions to assign LF-Tags to the remaining tables in the lobmarket and hr databases.
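For scripted setups, the assignments in the preceding table map to add_lf_tags_to_resource requests on the Lake Formation boto3 client. A rough sketch follows; the database names match the crawler setup earlier, and the client calls are commented out.

```python
# (database, table, tag key, tag value) tuples from the preceding table.
ASSIGNMENTS = [
    ("hr", "customers_pii_hpr_public", "Classifications", "non-sensitive"),
    ("hr", "customers_pii_hpr", "Classification:HPR", "yes"),
    ("lobmarket", "credit_card_transaction_pci", "Classification:PCI", "yes"),
    ("lobmarket", "credit_card_transaction_pci_public", "Classifications", "non-sensitive"),
    ("lobrisk", "lob_risk_high_confidential_public", "Classifications", "non-sensitive"),
    ("lobrisk", "lob_risk_high_confidential", "Classification:PII", "yes"),
]

# One add_lf_tags_to_resource request per table.
requests = [
    {
        "Resource": {"Table": {"DatabaseName": db, "Name": table}},
        "LFTags": [{"TagKey": key, "TagValues": [value]}],
    }
    for db, table, key, value in ASSIGNMENTS
]
# import boto3
# lf = boto3.client("lakeformation")
# for req in requests:
#     lf.add_lf_tags_to_resource(**req)
```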
Grant permissions to resources using an LF-Tag expression grant to the Redshift IAM roles
Using the Lake Formation administrator in the Lake Formation console, grant the Select and Describe Lake Formation permissions on the LF-Tags to the Redshift IAM roles. To grant permissions, follow the documentation.
Use the following table to grant the corresponding LF-Tags to each IAM role:
IAM Role             LF-Tag Key          LF-Tag Value   Permission
iam_redshift_pcipii  Classification:PII  yes            Describe, Select
iam_redshift_pcipii  Classification:PCI  yes            Describe, Select
iam_redshift_hpr     Classification:HPR  yes            Describe, Select
iam_redshift_public  Classifications     non-sensitive  Describe, Select
Follow these instructions to grant permissions to the LF-Tags and IAM roles:
Choose Data lake permissions in the Permissions section of the AWS Lake Formation console.
Choose Grant. Under Principals, select IAM users and roles.
Under LF-Tags or catalog resources, select the key Classifications and the value non-sensitive.
Next, select the table permissions Select and Describe. Choose Grant.
Repeat these steps for the remaining LF-Tags and their IAM roles, as shown in the preceding table.
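The grants in the table above can also be expressed as grant_permissions requests with an LFTagPolicy resource. In this sketch the account ID is a placeholder and the boto3 calls are commented out.

```python
# Placeholder account ID -- substitute your own.
ACCOUNT = "123456789012"

def grant_request(role_name, tag_key, tag_value):
    """Build a grant_permissions request for one LF-Tag expression."""
    return {
        "Principal": {
            "DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT}:role/{role_name}"
        },
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [{"TagKey": tag_key, "TagValues": [tag_value]}],
            }
        },
        "Permissions": ["SELECT", "DESCRIBE"],
    }

grants = [
    grant_request("iam_redshift_pcipii", "Classification:PII", "yes"),
    grant_request("iam_redshift_pcipii", "Classification:PCI", "yes"),
    grant_request("iam_redshift_hpr", "Classification:HPR", "yes"),
    grant_request("iam_redshift_public", "Classifications", "non-sensitive"),
]
# import boto3
# lf = boto3.client("lakeformation")
# for g in grants:
#     lf.grant_permissions(**g)
```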
Map the IdP user groups to the Redshift roles
In Redshift, use native IdP federation to map the IdP user groups to Redshift roles. Use Query Editor v2 to run the following statements:
create role aad:rs_lobrisk_pci_role;
create role aad:rs_lobrisk_public_role;
create role aad:rs_hr_hpr_role;
create role aad:rs_hr_public_role;
create role aad:rs_lobmarket_pci_role;
create role aad:rs_lobmarket_public_role;
Create External schemas
In Redshift, create external schemas using the AWS IAM roles and the AWS Glue Data Catalog databases. External schemas are created per data classification, using the corresponding iam_role:
create external schema external_lobrisk_pci
from data catalog
database 'lobrisk'
iam_role 'arn:aws:iam::571750435036:role/iam_redshift_pcipii';
create external schema external_hr_hpr
from data catalog
database 'hr'
iam_role 'arn:aws:iam::571750435036:role/iam_redshift_hpr';
create external schema external_lobmarket_pci
from data catalog
database 'lobmarket'
iam_role 'arn:aws:iam::571750435036:role/iam_redshift_pcipii';
create external schema external_lobrisk_public
from data catalog
database 'lobrisk'
iam_role 'arn:aws:iam::571750435036:role/iam_redshift_public';
create external schema external_hr_public
from data catalog
database 'hr'
iam_role 'arn:aws:iam::571750435036:role/iam_redshift_public';
create external schema external_lobmarket_public
from data catalog
database 'lobmarket'
iam_role 'arn:aws:iam::571750435036:role/iam_redshift_public';
Verify list of tables
Verify the list of tables in each external schema. Each schema lists only the tables that Lake Formation has granted to the IAM role used to create that external schema. The following screenshot shows the list of tables in the Redshift Query Editor v2 output, at the top left.
Grant usage on external schemas to different Redshift local Roles
In Redshift, grant usage on the external schemas to the corresponding Redshift local roles as follows:
grant usage on schema external_lobrisk_pci to role aad:rs_lobrisk_pci_role;
grant usage on schema external_lobrisk_public to role aad:rs_lobrisk_public_role;
grant usage on schema external_lobmarket_pci to role aad:rs_lobmarket_pci_role;
grant usage on schema external_lobmarket_public to role aad:rs_lobmarket_public_role;
grant usage on schema external_hr_hpr to role aad:rs_hr_hpr_role;
grant usage on schema external_hr_public to role aad:rs_hr_public_role;
Verify access to external schema
Verify access to the external schemas using a user from the LOB Risk team. The user lobrisk_pci_user federates into the Amazon Redshift local role rs_lobrisk_pci_role, which only has access to the external schema external_lobrisk_pci.
set session_authorization to lobrisk_pci_user;
select * from external_lobrisk_pci.lob_risk_high_confidential limit 10;
When you query a table from the external_lobmarket_pci schema, you'll see that permission is denied.
set session_authorization to lobrisk_pci_user;
select * from external_lobmarket_pci.credit_card_transaction_pci;
BMO’s automated access provisioning
Working with the bank, we developed an access provisioning framework that allows the bank to maintain a central repository of users and the data they have access to. The policy file is stored in Amazon S3. When the file is updated, it is processed and messages are placed in Amazon SQS. AWS Lambda uses the Redshift Data API to apply access control to the Amazon Redshift roles. Simultaneously, AWS Lambda automates tag-based access control in AWS Lake Formation.
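To make the Lambda step concrete, the following is a minimal sketch of how a policy record could be turned into Redshift GRANT statements and submitted through the Data API. The policy-file shape, cluster name, and secret ARN are all hypothetical; only the Redshift GRANT ROLE syntax and the redshift-data execute_statement call are standard.

```python
def grants_for(record):
    """Map one hypothetical policy entry to Redshift GRANT statements."""
    return [f'grant role "{record["redshift_role"]}" to "{record["user"]}";']

# Example policy record (hypothetical shape).
record = {"user": "lobrisk_pci_user", "redshift_role": "aad:rs_lobrisk_pci_role"}
sql_statements = grants_for(record)

# import boto3
# rsd = boto3.client("redshift-data")
# for sql in sql_statements:
#     rsd.execute_statement(
#         ClusterIdentifier="my-cluster", Database="dev",
#         SecretArn="arn:aws:secretsmanager:...", Sql=sql)
```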
The benefits of adopting this model were:
A scalable automation process that dynamically applies changing policies.
Streamlined user-access onboarding that integrates with existing enterprise access management.
Each line of business is empowered to restrict access to the sensitive data it owns, protecting customer data and privacy at the enterprise level.
Simplified AWS IAM role management and maintenance, with a greatly reduced number of roles.
The recent release of the Amazon Redshift integration with AWS IAM Identity Center, which allows identity propagation across AWS services, can be used to simplify and scale this implementation.
Conclusion
In this post, we showed you how to implement robust access controls for sensitive customer data in Amazon Redshift, which were challenging when trying to define many distinct AWS IAM roles. The solution presented in this post demonstrates how organizations can meet data security and compliance needs with a consolidated approach—using a minimal set of AWS IAM roles organized by data classification rather than business lines.
By using Amazon Redshift’s native integration with External IdP and defining RBAC policies in both Redshift and AWS Lake Formation, granular access controls can be applied without creating an excessive number of distinct roles. This allows the benefits of role-based access while minimizing administrative overhead.
Other financial services institutions looking to secure customer data and meet compliance regulations can follow a similar consolidated RBAC approach. Careful policy definition, aligned to data sensitivity rather than business functions, can help reduce the proliferation of AWS IAM roles. This model balances security, compliance, and manageability for governance of sensitive data in Amazon Redshift and broader cloud data platforms.
In short, a centralized RBAC model based on data classification streamlines access management while still providing robust data security and compliance. This approach can benefit any organization managing sensitive customer information in the cloud.
About the Authors
Amy Tseng is a Managing Director of Data and Analytics (DnA) Integration at BMO and an AWS Data Hero. She has over 7 years of experience with data and analytics cloud migrations to AWS. Outside of work, Amy loves traveling and hiking.
Jack Lin is a Director of Engineering on the Data Platform at BMO. He has over 20 years of experience working in platform engineering and software engineering. Outside of work, Jack loves playing soccer, watching football games and traveling.
Regis Chow is a Director of DnA Integration at BMO. He has over 5 years of experience working in the cloud and enjoys solving problems through innovation in AWS. Outside of work, Regis loves all things outdoors, he is especially passionate about golf and lawn care.
Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web Services. He specializes in building big data applications and helping customers modernize their applications on the cloud. He thinks data is the new oil and spends most of his time deriving insights from it.
Harshida Patel is a Principal Solutions Architect, Analytics with AWS.
Raghu Kuppala is an Analytics Specialist Solutions Architect experienced working in the databases, data warehousing, and analytics space. Outside of work, he enjoys trying different cuisines and spending time with his family and friends.
In this post, I’ll show how you can export software bills of materials (SBOMs) for your containers by using an AWS native service, Amazon Inspector, and visualize the SBOMs through Amazon QuickSight, providing a single-pane-of-glass view of your organization’s software supply chain.
The concept of a bill of materials (BOM) originated in the manufacturing industry in the early 1960s. It was used to keep track of the quantities of each material used to manufacture a completed product. If parts were found to be defective, engineers could then use the BOM to identify products that contained those parts. An SBOM extends this concept to software development, allowing engineers to keep track of vulnerable software packages and quickly remediate the vulnerabilities.
Today, most software includes open source components. A Synopsys study, Walking the Line: GitOps and Shift Left Security, shows that 8 in 10 organizations reported using open source software in their applications. Consider a scenario in which you specify an open source base image in your Dockerfile but don’t know what packages it contains. Although this practice can significantly improve developer productivity and efficiency, the decreased visibility makes it more difficult for your organization to manage risk effectively.
It’s important to track the software components and their versions that you use in your applications, because a single affected component used across multiple organizations could result in a major security impact. According to a Gartner report titled Gartner Report for SBOMs: Key Takeaways You Should know, by 2025, 60 percent of organizations building or procuring critical infrastructure software will mandate and standardize SBOMs in their software engineering practice, up from less than 20 percent in 2022. This will help provide much-needed visibility into software supply chain security.
Integrating SBOM workflows into the software development life cycle is just the first step—visualizing SBOMs and being able to search through them quickly is the next step. This post describes how to process the generated SBOMs and visualize them with Amazon QuickSight. AWS also recently added SBOM export capability in Amazon Inspector, which offers the ability to export SBOMs for Amazon Inspector monitored resources, including container images.
Why is vulnerability scanning not enough?
Scanning and monitoring vulnerable components that pose cybersecurity risks is known as vulnerability scanning, and is fundamental to organizations for ensuring a strong and solid security posture. Scanners usually rely on a database of known vulnerabilities, the most common being the Common Vulnerabilities and Exposures (CVE) database.
Identifying vulnerable components with a scanner can prevent an engineer from deploying affected applications into production. You can embed scanning into your continuous integration and continuous delivery (CI/CD) pipelines so that images with known vulnerabilities don’t get pushed into your image repository. However, what if a new vulnerability is discovered but has not been added to the CVE records yet? A good example of this is the Apache Log4j vulnerability, which was first disclosed on Nov 24, 2021 and only added as a CVE on Dec 1, 2021. This means that for 7 days, scanners that relied on the CVE system weren’t able to identify affected components within their organizations. This issue is known as a zero-day vulnerability. Being able to quickly identify vulnerable software components in your applications in such situations would allow you to assess the risk and come up with a mitigation plan without waiting for a vendor or supplier to provide a patch.
In addition, it’s also good hygiene for your organization to track usage of software packages, which provides visibility into your software supply chain. This can improve collaboration between developers, operations, and security teams, because they’ll have a common view of every software component and can collaborate effectively to address security threats.
In this post, I present a solution that uses the new Amazon Inspector feature to export SBOMs from container images, process them, and visualize the data in QuickSight. This gives you the ability to search through your software inventory on a dashboard and to use natural language queries through QuickSight Q, in order to look for vulnerabilities.
Solution overview
Figure 1 shows the architecture of the solution. It is fully serverless, meaning there is no underlying infrastructure you need to manage. This post uses a newly released feature within Amazon Inspector that provides the ability to export a consolidated SBOM for Amazon Inspector monitored resources across your organization in commonly used formats, including CycloneDX and SPDX.
Another Lambda function is invoked whenever a new JSON file is deposited. The function performs the data transformation steps and uploads the new file into a new S3 bucket.
Amazon Athena is then used to perform preliminary data exploration.
A dashboard on Amazon QuickSight displays SBOM data.
Implement the solution
This section describes how to deploy the solution architecture.
In this post, you’ll perform the following tasks:
Create S3 buckets and AWS KMS keys to store the SBOMs
Create QuickSight dashboards to identify libraries and packages
Use QuickSight Q to identify libraries and packages by using natural language queries
Deploy the CloudFormation stack
The AWS CloudFormation template we’ve provided provisions the S3 buckets that are required for the storage of raw SBOMs and transformed SBOMs, the Lambda functions necessary to initiate and process the SBOMs, and EventBridge rules to run the Lambda functions based on certain events. An empty repository is provisioned as part of the stack, but you can also use your own repository.
Browse to the CloudFormation service in your AWS account and choose Create Stack.
Upload the CloudFormation template you downloaded earlier.
For the next step, Specify stack details, enter a stack name.
You can keep the default value of sbom-inspector for EnvironmentName.
Specify the Amazon Resource Name (ARN) of the user or role to be the admin for the KMS key.
Deploy the stack.
Set up Amazon Inspector
If this is the first time you’re using Amazon Inspector, you need to activate the service. In the Getting started with Amazon Inspector topic in the Amazon Inspector User Guide, follow Step 1 to activate the service. This will take some time to complete.
Figure 2: Activate Amazon Inspector
SBOM invocation and processing Lambda functions
This solution uses two Lambda functions written in Python to perform the invocation task and the transformation task.
Invocation task — This function runs whenever a new image is pushed into Amazon ECR. It takes the repository name and image tag as variables and passes them into the create_sbom_export function, exporting the SBOM in the SPDX format for just that image. This prevents duplicated SBOMs, which helps to keep the S3 data size small.
Transformation task — This function is run whenever a new file with the suffix .json is added to the raw S3 bucket. It creates two files, as follows:
It extracts information such as image ARN, account number, package, package version, operating system, and SHA from the SBOM and exports this data to the transformed S3 bucket under a folder named sbom/.
Because each package can have more than one CVE, this function also extracts the CVEs from each package and stores them in the same bucket in a directory named cve/. Both files are exported in Apache Parquet format, which is optimized for queries by Amazon Athena.
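The transformation step above can be sketched in simplified form. The input structure and field names here are illustrative stand-ins for the real SPDX export (which carries many more fields), and the Parquet write (for example, via pyarrow or awswrangler) is omitted.

```python
def flatten_sbom(doc, image_arn, account, sha):
    """Return one row per package, plus one row per (package, CVE) pair."""
    sbom_rows, cve_rows = [], []
    for pkg in doc.get("packages", []):
        base = {
            "imagearn": image_arn, "account": account, "sha": sha,
            "package": pkg["name"], "packageversion": pkg.get("versionInfo"),
        }
        sbom_rows.append(base)
        for cve in pkg.get("cves", []):  # hypothetical field, for illustration
            cve_rows.append({**base, "cve": cve})
    return sbom_rows, cve_rows

# Toy document standing in for a parsed SBOM export.
doc = {"packages": [{"name": "openssl", "versionInfo": "1.1.1k",
                     "cves": ["CVE-2021-3711"]}]}
sbom_rows, cve_rows = flatten_sbom(
    doc, "arn:aws:ecr:us-east-1:123456789012:repository/example",
    "123456789012", "sha256:abc")
```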
Populate the AWS Glue Data Catalog
To populate the AWS Glue Data Catalog, you need to generate the SBOM files by using the Lambda functions that were created earlier.
To populate the AWS Glue Data Catalog
You can use an existing image, or you can continue on to create a sample image.
# Pull the nginx image from a public repo
docker pull public.ecr.aws/nginx/nginx:1.19.10-alpine-perl
docker tag public.ecr.aws/nginx/nginx:1.19.10-alpine-perl <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl
# Authenticate to ECR, fill in your account id
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com
# Push the image into ECR
docker push <ACCOUNT-ID>.dkr.ecr.us-east-1.amazonaws.com/sbom-inspector:nginxperl
An image is pushed into the Amazon ECR repository in your account. This invokes the Lambda functions that perform the SBOM export by using Amazon Inspector and converts the SBOM file to Parquet.
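For reference, the SBOM export call that the invocation Lambda makes can be sketched as follows. The bucket name and KMS key ARN are placeholders; the filter field names follow the Inspector2 API, though you should confirm them against the boto3 documentation for your SDK version.

```python
def sbom_export_request(repo_name, image_tag):
    """Build a create_sbom_export request scoped to one ECR image."""
    return {
        "reportFormat": "SPDX_2_3",
        "resourceFilterCriteria": {
            "ecrRepositoryName": [{"comparison": "EQUALS", "value": repo_name}],
            "ecrImageTags": [{"comparison": "EQUALS", "value": image_tag}],
        },
        "s3Destination": {
            "bucketName": "sbom-inspector-raw-bucket",     # placeholder
            "keyPrefix": "sbom/",
            "kmsKeyArn": "arn:aws:kms:us-east-1:123456789012:key/example",
        },
    }

req = sbom_export_request("sbom-inspector", "nginxperl")
# import boto3
# inspector = boto3.client("inspector2")
# inspector.create_sbom_export(**req)
```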
Verify that the Parquet files are in the transformed S3 bucket:
Browse to the S3 console and choose the bucket named sbom-inspector-<ACCOUNT-ID>-transformed. You can also track the invocation of each Lambda function in the Amazon CloudWatch log console.
After the transformation step is complete, you will see two folders (cve/ and sbom/) in the transformed S3 bucket. Choose the sbom folder; you will see the transformed Parquet file in it. If there are CVEs present, a similar file will appear in the cve folder.
The next step is to run an AWS Glue crawler to determine the format, schema, and associated properties of the raw data. You will need to crawl both folders in the transformed S3 bucket and store the schema in separate tables in the AWS Glue Data Catalog.
On the AWS Glue Service console, on the left navigation menu, choose Crawlers.
On the Crawlers page, choose Create crawler. This starts a series of pages that prompt you for the crawler details.
In the Crawler name field, enter sbom-crawler, and then choose Next.
Under Data sources, select Add a data source.
Now you need to point the crawler to your data. On the Add data source page, choose the Amazon S3 data store. The solution in this post doesn't use a connection, so leave the Connection field blank if it's visible.
For the option Location of S3 data, choose In this account. Then, for S3 path, enter the path where the crawler can find the sbom and cve data, which is s3://sbom-inspector-<ACCOUNT-ID>-transformed/sbom/ and s3://sbom-inspector-<ACCOUNT-ID>-transformed/cve/. Leave the rest as default and select Add an S3 data source.
Figure 3: Data source for AWS Glue crawler
The crawler needs permissions to access the data store and create objects in the Data Catalog. To configure these permissions, choose Create an IAM role. The AWS Identity and Access Management (IAM) role name starts with AWSGlueServiceRole-, and in the field, you enter the last part of the role name. Enter sbomcrawler, and then choose Next.
Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog. To create a database, choose Add database. In the pop-up window, enter sbom-db for the database name, and then choose Create.
Verify the choices you made in the Add crawler wizard. If you see any mistakes, you can choose Back to return to previous pages and make changes. After you’ve reviewed the information, choose Finish to create the crawler.
Figure 4: Creation of the AWS Glue crawler
Select the newly created crawler and choose Run.
After the crawler runs successfully, verify that the table is created and the data schema is populated.
Figure 5: Table populated from the AWS Glue crawler
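If you prefer to script the crawler instead of using the wizard, a rough boto3 equivalent looks like the following. The account ID and role name are placeholders, and the client calls are commented out.

```python
# create_crawler request mirroring the wizard choices above.
ACCOUNT = "123456789012"  # placeholder account ID
request = {
    "Name": "sbom-crawler",
    "Role": f"arn:aws:iam::{ACCOUNT}:role/AWSGlueServiceRole-sbomcrawler",
    "DatabaseName": "sbom-db",
    "Targets": {"S3Targets": [
        {"Path": f"s3://sbom-inspector-{ACCOUNT}-transformed/sbom/"},
        {"Path": f"s3://sbom-inspector-{ACCOUNT}-transformed/cve/"},
    ]},
}
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**request)
# glue.start_crawler(Name="sbom-crawler")
```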
Set up Amazon Athena
Amazon Athena performs the initial data exploration and validation. Athena is a serverless interactive analytics service built on open source frameworks that supports open-table and file formats. Athena provides a simplified, flexible way to analyze data in sources like Amazon S3 by using standard SQL queries. If you are SQL proficient, you can query the data source directly; however, not everyone is familiar with SQL. In this section, you run a sample query and initialize the service so that it can be used in QuickSight later on.
To start using Amazon Athena
In the AWS Management Console, navigate to the Athena console.
For Database, select sbom-db (or select the database you created earlier in the crawler).
Navigate to the Settings tab located at the top right corner of the console. For Query result location, select the Athena S3 bucket created from the CloudFormation template, sbom-inspector-<ACCOUNT-ID>-athena.
Keep the defaults for the rest of the settings. You can now return to the Query Editor and start writing and running your queries on the sbom-db database.
You can use the following sample query.
select package, packageversion, cve, sha, imagearn from sbom
left join cve
using (sha, package, packageversion)
where cve is not null;
Your Athena console should look similar to the screenshot in Figure 6.
Figure 6: Sample query with Amazon Athena
This query joins the two tables and selects only the packages with CVEs identified. Alternatively, you can choose to query for specific packages or identify the most common package used in your organization.
Amazon QuickSight is a serverless business intelligence service that is designed for the cloud. In this post, it serves as a dashboard that allows business users who are unfamiliar with SQL to identify zero-day vulnerabilities. This can also reduce the operational effort and time of having to look through several JSON documents to identify a single package across your image repositories. You can then share the dashboard across teams without having to share the underlying data.
QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine) is an in-memory engine that QuickSight uses to perform advanced calculations. In a large organization where you could have millions of SBOM records stored in S3, importing your data into SPICE helps to reduce the time to process and serve the data. You can also use the feature to perform a scheduled refresh to obtain the latest data from S3.
QuickSight also has a feature called QuickSight Q. With QuickSight Q, you can use natural language to interact with your data. If this is the first time you are initializing QuickSight, subscribe to QuickSight and select Enterprise + Q. It will take roughly 20–30 minutes to initialize for the first time. Otherwise, if you are already using QuickSight, enable QuickSight Q by subscribing to it in the QuickSight console.
Finally, in QuickSight you can select different data sources, such as Amazon S3 and Athena, to create custom visualizations. In this post, we will use the two Athena tables as the data source to create a dashboard to keep track of the packages used in your organization and the resulting CVEs that come with them.
Prerequisites for setting up the QuickSight dashboard
This process will be used to create the QuickSight dashboard from a template already pre-provisioned through the command line interface (CLI). It also grants the necessary permissions for QuickSight to access the data source. You will need the following:
A QuickSight + Q subscription (only if you want to use the Q feature).
QuickSight permissions to Amazon S3 and Athena (enable these through the QuickSight security and permissions interface).
Set the default AWS Region where you want to deploy the QuickSight dashboard. This post assumes that you’re using the us-east-1 Region.
Create datasets
In QuickSight, create two datasets, one for the sbom table and another for the cve table.
In the QuickSight console, select the Dataset tab.
Choose Create dataset, and then select the Athena data source.
Name the data source sbom and choose Create data source.
Select the sbom table.
Choose Visualize to complete the dataset creation. (Delete the analyses automatically created for you because you will create your own analyses afterwards.)
Navigate back to the main QuickSight page and repeat steps 1–4 for the cve dataset.
Merge datasets
Next, merge the two datasets to create the combined dataset that you will use for the dashboard.
On the Datasets tab, edit the sbom dataset and add the cve dataset.
Set three join clauses, as follows:
Sha : Sha
Package : Package
Packageversion : Packageversion
Perform a left merge, which will append the cve ID to the package and package version in the sbom dataset.
Figure 7: Combining the sbom and cve datasets
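The semantics of this left merge can be illustrated in plain Python: every sbom row is kept, and the cve ID is appended wherever (sha, package, packageversion) matches. This simplified sketch assumes one CVE per key, whereas the real dataset can have several.

```python
def left_merge(sbom_rows, cve_rows):
    """Left-join cve rows onto sbom rows by (sha, package, packageversion)."""
    key = lambda r: (r["sha"], r["package"], r["packageversion"])
    cves = {key(r): r["cve"] for r in cve_rows}  # one CVE per key, simplified
    return [{**r, "cve": cves.get(key(r))} for r in sbom_rows]

sbom = [{"sha": "s1", "package": "openssl", "packageversion": "1.1.1k"},
        {"sha": "s1", "package": "zlib", "packageversion": "1.2.11"}]
cve = [{"sha": "s1", "package": "openssl", "packageversion": "1.1.1k",
        "cve": "CVE-2021-3711"}]
merged = left_merge(sbom, cve)
# Packages without a matching CVE keep a null cve value, as in the dashboard.
```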
Next, you will create a dashboard based on the combined sbom dataset.
Prepare configuration files
In your terminal, export the following variables. Substitute <QuickSight username> in the QS_USER_ARN variable with your own username, which can be found in the Amazon QuickSight console.
Validate that the variables are set properly. This is required for you to move on to the next step; otherwise you will run into errors.
echo ACCOUNT_ID is $ACCOUNT_ID || echo ACCOUNT_ID is not set
echo TEMPLATE_ID is $TEMPLATE_ID || echo TEMPLATE_ID is not set
echo QUICKSIGHT USER ARN is $QS_USER_ARN || echo QUICKSIGHT USER ARN is not set
echo QUICKSIGHT DATA ARN is $QS_DATA_ARN || echo QUICKSIGHT DATA ARN is not set
Next, use the following commands to create the dashboard from a predefined template and create the IAM permissions needed for the user to view the QuickSight dashboard.
Note: Run the following describe-dashboard command, and confirm that the response contains a status code of 200. The 200-status code means that the dashboard exists.
You should now be able to see the dashboard in your QuickSight console, similar to the one in Figure 8. It’s an interactive dashboard that shows you the number of vulnerable packages you have in your repositories and the specific CVEs that come with them. You can navigate to the specific image by selecting the CVE (middle right bar chart) or list images with a specific vulnerable package (bottom right bar chart).
Note: You won’t see the exact same graph as in Figure 8. It will change according to the image you pushed in.
Figure 8: QuickSight dashboard containing SBOM information
Alternatively, you can use QuickSight Q to extract the same information from your dataset through natural language. You will need to create a topic and add the dataset you added earlier. For detailed information on how to create a topic, see the Amazon QuickSight User Guide. After QuickSight Q has completed indexing the dataset, you can start to ask questions about your data.
Figure 9: Natural language query with QuickSight Q
Conclusion
This post discussed how you can use Amazon Inspector to export SBOMs to improve software supply chain transparency. Container SBOM export should be part of your supply chain mitigation strategy and monitored in an automated manner at scale.
Although it is a good practice to generate SBOMs, it would provide little value if there was no further analysis being done on them. This solution enables you to visualize your SBOM data through a dashboard and natural language, providing better visibility into your security posture. Additionally, this solution is also entirely serverless, meaning there are no agents or sidecars to set up.
You can use Amazon Security Lake to simplify log data collection and retention for Amazon Web Services (AWS) and non-AWS data sources. Getting the most out of your implementation requires proper planning.
In this post, we will show you how to plan and implement a proof of concept (POC) for Security Lake to help you determine the functionality and value of Security Lake in your environment, so that your team can confidently design and implement in production. We will walk you through the following steps:
Understand the functionality and value of Security Lake
Determine success criteria for the POC
Define your Security Lake configuration
Prepare for deployment
Enable Security Lake
Validate deployment
Understand the functionality of Security Lake
Figure 1 summarizes the main features of Security Lake and the context of how to use it:
Figure 1: Overview of Security Lake functionality
As shown in the figure, Security Lake ingests and normalizes logs from data sources such as AWS services, AWS Partner sources, and custom sources. Security Lake also manages the lifecycle, orchestration, and subscribers. Subscribers can be AWS services, such as Amazon Athena, or AWS Partner subscribers.
There are four primary functions that Security Lake provides:
Centralize visibility to your data from AWS environments, SaaS providers, on-premises, and other cloud data sources — You can collect log sources from AWS services such as AWS CloudTrail management events, Amazon Simple Storage Service (Amazon S3) data events, AWS Lambda data events, Amazon Route 53 Resolver logs, VPC Flow Logs, and AWS Security Hub findings, in addition to log sources from on-premises, other cloud services, SaaS applications, and custom sources. Security Lake automatically aggregates the security data across AWS Regions and accounts.
Normalize your security data to an open standard — Security Lake normalizes log sources in a common schema, the Open Cybersecurity Schema Framework (OCSF), and stores them in compressed Apache Parquet files.
Use your preferred analytics tools to analyze your security data — You can use AWS tools, such as Athena and Amazon OpenSearch Service, or you can utilize external security tools to analyze the data in Security Lake.
Optimize and manage your security data for more efficient storage and query — Security Lake manages the lifecycle of your data with customizable retention settings with automated storage tiering to help provide more cost-effective storage.
Determine success criteria
By establishing success criteria, you can assess whether Security Lake has helped address the challenges that you are facing. Some example success criteria include:
I need to centrally set up and store AWS logs across my organization in AWS Organizations for multiple log sources.
I need to more efficiently collect VPC Flow Logs in my organization and analyze them in my security information and event management (SIEM) solution.
I want to use OpenSearch Service to replace my on-premises SIEM.
I want to collect AWS log sources and custom sources for machine learning with Amazon SageMaker.
I need to establish a dashboard in Amazon QuickSight to visualize my Security Hub findings and custom log source data.
Review your success criteria to make sure that your goals are realistic given your timeframe and potential constraints that are specific to your organization. For example, do you have full control over the creation of AWS services that are deployed in an organization? Do you have resources that can dedicate time to implement and test? Is this time convenient for relevant stakeholders to evaluate the service?
The timeframe of your POC will depend on your answers to these questions.
Important: Security Lake has a 15-day free trial per account, starting from the time that you enable Security Lake. The trial is the best way to estimate the cost of Security Lake for each Region, which is an important consideration when you configure your POC.
Define your Security Lake configuration
After you establish your success criteria, you should define your desired Security Lake configuration. Some important decisions include the following:
Determine AWS log sources — Decide which AWS log sources to collect. For information about the available options, see Collecting data from AWS services.
Determine third-party log sources — Decide if you want to include non-AWS service logs as sources in your POC. For more information about your options, see Third-party integrations with Security Lake; the integrations listed as “Source” can send logs to Security Lake.
Note: You can add third-party integrations after the POC or in a second phase of the POC. Pre-planning will be required to make sure that you can get these set up during the 15-day free trial. Third-party integrations usually take more time to set up than AWS service logs.
Select a delegated administrator – Identify which account will serve as the delegated administrator. Make sure that you have the appropriate permissions from the organization admin account to identify and enable the account that will be your Security Lake delegated administrator. This account will be the location for the S3 buckets with your security data and where you centrally configure Security Lake. The AWS Security Reference Architecture (AWS SRA) recommends that you use the AWS logging account for this purpose. In addition, make sure to review Important considerations for delegated Security Lake administrators.
Select accounts in scope — Define which accounts to collect data from. To get the most realistic estimate of the cost of Security Lake, enable all accounts across your organization during the free trial.
Determine analytics tool — Determine if you want to use native AWS analytics tools, such as Athena and OpenSearch Service, or an existing SIEM, where the SIEM is a subscriber to Security Lake.
Define log retention and Regions — Define your log retention requirements and Regional restrictions or considerations.
Prepare for deployment
After you determine your success criteria and your Security Lake configuration, you should have an idea of your stakeholders, desired state, and timeframe. Now you need to prepare for deployment. In this step, you should complete as much as possible before you deploy Security Lake. The following are some steps to take:
Create a project plan and timeline so that everyone involved understands what success looks like and what the scope and timeline are.
Define the relevant stakeholders and consumers of the Security Lake data. Some common stakeholders include security operations center (SOC) analysts, incident responders, security engineers, cloud engineers, finance, and others.
Define who is responsible, accountable, consulted, and informed during the deployment. Make sure that team members understand their roles.
Consider other technical prerequisites that you need to accomplish. For example, if you need roles in addition to what Security Lake creates for custom extract, transform, and load (ETL) pipelines for custom sources, can you work with the team in charge of that process before the POC?
Enable Security Lake
The next step is to enable Security Lake in your environment and configure your sources and subscribers.
Deploy Security Lake across the Regions, accounts, and AWS log sources that you previously defined.
Configure custom sources that are in scope for your POC.
Configure analytics tools in scope for your POC.
Validate deployment
The final step is to confirm that you have configured Security Lake and additional components, validate that everything is working as intended, and evaluate the solution against your success criteria.
Validate log collection — Verify that you are collecting the log sources that you configured. To do this, check the S3 buckets in the delegated administrator account for the logs.
Validate analytics tool — Verify that you can analyze the log sources in your analytics tool of choice. If you don’t want to configure additional analytics tooling, you can use Athena, which is configured when you set up Security Lake. For sample Athena queries, see Amazon Security Lake Example Queries on GitHub and Security Lake queries in the documentation.
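As an illustration, a starting-point Athena query for validating CloudTrail management events might look like the following. The database and table names are hypothetical placeholders following the amazon_security_lake_* naming pattern that Security Lake generates, and the OCSF column names shown are assumptions; confirm the actual names in your AWS Glue Data Catalog before running the query.

```sql
-- Illustrative only: preview recent CloudTrail management events
-- collected by Security Lake. Replace the database and table names
-- with the ones that Security Lake created in your account and Region.
SELECT time, api.operation, actor.user.uid, src_endpoint.ip
FROM amazon_security_lake_glue_db_us_east_2.amazon_security_lake_table_us_east_2_cloud_trail_mgmt_2_0
LIMIT 10;
```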
Obtain a cost estimate — In the Security Lake console, you can review a usage page to verify that the cost of Security Lake in your environment aligns with your expectations and budgets.
Assess success criteria — Determine if you achieved the success criteria that you defined at the beginning of the project.
Next steps
Next steps will largely depend on whether you decide to move forward with Security Lake.
Determine if you have the approval and budget to use Security Lake.
Expand to other data sources that can help you provide more security outcomes for your business.
Configure S3 lifecycle policies to efficiently store logs long term based on your requirements.
Let other teams know that they can subscribe to Security Lake to use the log data for their own purposes. For example, a development team that gets access to CloudTrail through Security Lake can analyze the logs to understand the permissions needed for an application.
Conclusion
In this blog post, we showed you how to plan and implement a Security Lake POC. You learned how to do so through phases, including defining success criteria, configuring Security Lake, and validating that Security Lake meets your business needs.
This guide will help you run a successful proof of value (POV) with Security Lake by walking you through how to assess its value and the factors to consider when deciding whether to implement its current features.
The management of security services across organizations has evolved over the years, and can vary depending on the size of your organization, the type of industry, the number of services to be administered, and compliance regulations and legislation. When compliance standards require you to set up scoped administrative control of event monitoring and auditing, we find that single administrator support on management consoles can present several challenges for large enterprises. In this blog post, I’ll dive deep into these security policy management challenges and show how you can optimize your security operations at scale by using AWS Firewall Manager to support multiple administrators.
These are some of the use cases and challenges faced by large enterprise organizations when scaling their security operations:
Policy enforcement across complex organizational boundaries
Large organizations tend to be divided into multiple organizational units, each of which represents a function within the organization. Risk appetite, and therefore security policy, can vary dramatically between organizational units. For example, organizations may support two types of users: central administrators and app developers, both of whom can administer security policy but might do so at different levels of granularity. The central admin applies a baseline and relatively generic policy for all accounts, while the app developer can be made an admin for specific accounts and be allowed to create custom rules for the overall policy. A single administrator interface limits the ability for multiple administrators to enforce differing policies for the organizational unit to which they are assigned.
Lack of adequate separation across services
The benefit of centralized management is that you can enforce a centralized policy across multiple services that the management console supports. However, organizations might have different administrators for each service. For example, the team that manages the firewall could be different than the team that manages a web application firewall solution. Aggregating administrative access that is confined to a single administrator might not adequately conform to the way organizations have services mapped to administrators.
Auditing and compliance
Most security frameworks call for auditing procedures, to gain visibility into user access, types of modifications to configurations, timestamps of incremental changes, and logs for periods of downtime. An organization might want only specific administrators to have access to certain functions. For example, each administrator might have specific compliance scope boundaries based on their knowledge of a particular compliance standard, thereby distributing the responsibility for implementation of compliance measures. Single administrator access greatly reduces the ability to discern the actions of different administrators in that single account, making auditing unnecessarily complex.
Availability
Redundancy and resiliency are regarded as baseline requirements for security operations. Organizations want to ensure that if a primary administrator is locked out of a single account for any reason, other legitimate users are not affected in the same way. Single administrator access, in contrast, can lock out legitimate users from performing critical and time-sensitive actions on the management console.
Security risks
In a single administrator setting, it is not possible to enforce the principle of least privilege, because multiple operators share the same level of access to the administrator account. As a result, some administrators could be granted broader access than their function in the organization requires.
What is multi-admin support?
Multi-admin support for Firewall Manager allows customers with multiple organizational units (OUs) and accounts to create up to 10 Firewall Manager administrator accounts from AWS Organizations to manage their firewall policies. You can delegate responsibility for firewall administration at a granular scope by restricting access based on OU, account, policy type, and AWS Region, thereby enabling policy management tasks to be implemented more effectively.
Multi-admin support provides you the ability to use different administrator accounts to create administrative scopes for different parameters. Examples of these administrative scopes are included in the following table.
| Administrator | Scope |
| --- | --- |
| Default Administrator | Full Scope (Default) |
| Administrator 1 | OU = “Test 1” |
| Administrator 2 | Account IDs = “123456789, 987654321” |
| Administrator 3 | Policy-Type = “Security Group” |
| Administrator 4 | Region = “us-east-2” |
Benefits of multi-admin support
Multi-admin support helps alleviate many of the challenges just discussed by allowing administrators the flexibility to implement custom configurations based on job functions, while enforcing the principle of least privilege to help ensure that corporate policy and compliance requirements are followed. The following are some of the key benefits of multi-admin support:
Improved security
Security is enhanced, given that the principle of least privilege can be enforced in a multi-administrator access environment. This is because the different administrators using Firewall Manager will be using delegated privileges that are appropriate for the level of access they are permitted. The result is that the scope for user errors, intentional errors, and unauthorized changes can be significantly reduced. Additionally, you attain an added level of accountability for administrators.
Autonomy of job functions
Companies with organizational units that have separate administrators are afforded greater levels of autonomy within their AWS Organizations accounts. The result is an increase in flexibility, where concurrent users can perform very different security functions.
Compliance benefits
It is easier to meet auditing requirements based on compliance standards in multi-admin accounts, because there is a greater level of visibility into user access and the functions performed on the services than in a multi-eyes approval workflow in which one omnipotent admin approves all policies. This can simplify routine audits through the generation of reports that detail the chronology of security changes that are implemented by specific admins over time.
Administrator Availability
Multi-admin management support helps avoid the limitations of having a single point of access and enhances availability by providing multiple administrators with their own levels of access. This can result in fewer disruptions, especially during periods that require time-sensitive changes to be made to security configurations.
Integration with AWS Organizations
You can enable trusted access using either the Firewall Manager console or the AWS Organizations console. To do this, you sign in with your AWS Organizations management account and configure an account allocated for security tooling within the organization as the Firewall Manager administrator account. After this is done, subsequent multi-admin Firewall Manager operations can also be performed using AWS APIs. With accounts in an organization, you can quickly allocate resources, group multiple accounts, and apply governance policies to accounts or groups. This reduces operational overhead for services that require cross-account management.
Key use cases
Multi-admin support in Firewall Manager unlocks several use cases pertaining to admin role-based access. The key use cases are summarized here.
Role-based access
Multi-admin support allows for different admin roles to be defined based on the job function of the administrator, relative to the service being managed. For example, an administrator could be tasked to manage network firewalls to protect their VPCs, and a different administrator could be tasked to manage web application firewalls (AWS WAF), both using Firewall Manager.
User tracking and accountability
In a multi-admin configuration environment, each Firewall Manager administrator’s activities are logged and recorded according to corporate compliance standards. This is useful when dealing with the troubleshooting of security incidents, and for compliance with auditing standards.
Compliance with security frameworks
Regulations specific to a particular industry, such as Payment Card Industry (PCI), and industry-specific legislation, such as HIPAA, require restricted access, control, and separation of tasks for different job functions. Failure to adhere to such standards could result in penalties. With administrative scope extending to policy types, customers can assign responsibility for managing particular firewall policies according to user role guidelines, as specified in compliance frameworks.
Region-based privileges
Many state or federal frameworks, such as the California Consumer Privacy Act (CCPA), require that admins adhere to customized regional requirements, such as data sovereignty or privacy requirements. Multi-admin Firewall Manager support helps organizations to adopt these frameworks by making it easier to assign admins who are familiar with the regulations of a particular region to that region.
Figure 1: Use cases for multi-admin support on AWS Firewall Manager
How to implement multi-admin support with Firewall Manager
To configure multi-admin support on Firewall Manager, use the following steps:
In the AWS Organizations console of the organization’s managed account, expand the Root folder to view the various accounts in the organization. Select the Default Administrator account that is allocated to delegate Firewall Manager administrators. The Default Administrator account should be a dedicated security account separate from the AWS Organizations management account, such as a Security Tooling account.
Figure 2: Overview of the AWS Organizations console
Navigate to Firewall Manager and in the left navigation menu, select Settings.
Figure 3: AWS Firewall Manager settings to update policy types
In Settings, choose an account. Under Policy types, select AWS Network Firewall to allow an admin to manage a specific firewall across accounts and across Regions. Select Edit to show the Details menu in Figure 4.
Figure 4: Select AWS Network Firewall as a policy type that can be managed by this administration account
The results of your selection are shown in Figure 5. The admin has been granted privileges to set AWS Network Firewall policy across all Regions and all accounts.
Figure 5: The admin has been granted privileges to set Network Firewall policy across all Regions and all accounts
In this second use case, you will identify a number of sub-accounts that the admin should be restricted to. As shown in Figure 6, there are no sub-accounts or OUs that the admin is restricted to by default until you choose Edit and select them.
Figure 6: The administrative scope details for the admin
In order to achieve this second use case, you choose Edit, and then add multiple sub-accounts or an OU that you need the admin restricted to, as shown in Figure 7.
Figure 7: Add multiple sub-accounts or an OU that you need the admin restricted to
The third use case pertains to granting the admin privileges to a particular AWS Region. In this case, you go into the Edit administrator account page once more, but this time, for Regions, select US West (N California) in order to restrict admin privileges only to this selected Region.
Figure 8: Restricting admin privileges only to the US West (N California) Region
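The console steps in the three use cases above can also be scripted. The following sketch builds an administrative scope combining restricted accounts, a policy type, and a Region, in the shape that the Firewall Manager PutAdminAccount API expects; the field names reflect my understanding of the FMS AdminScope structure and the account ID is a placeholder, so verify both against the current API reference.

```python
# Sketch: delegate a Firewall Manager administrator with a restricted
# scope. Account ID and Region values below are placeholders.
admin_scope = {
    "AccountScope": {
        "Accounts": ["123456789012"],         # restrict to specific accounts
        "AllAccountsEnabled": False,
        "ExcludeSpecifiedAccounts": False,
    },
    "PolicyTypeScope": {
        "PolicyTypes": ["NETWORK_FIREWALL"],  # only Network Firewall policies
        "AllPolicyTypesEnabled": False,
    },
    "RegionScope": {
        "Regions": ["us-west-1"],             # only US West (N. California)
        "AllRegionsEnabled": False,
    },
}

def delegate_admin(fms_client, account_id, scope):
    """Call PutAdminAccount from the AWS Organizations management
    account (requires an authenticated boto3 FMS client)."""
    fms_client.put_admin_account(AdminAccount=account_id, AdminScope=scope)

# Example (not run here):
# delegate_admin(boto3.client("fms", region_name="us-east-1"),
#                "123456789012", admin_scope)
print(sorted(admin_scope))
```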
Conclusion
Large enterprises need strategies for operationalizing security policy management so that they can enforce policy across organizational boundaries, deal with policy changes across security services, and adhere to auditing and compliance requirements. Multi-admin support in Firewall Manager provides a framework that admins can use to organize their workflow across job roles, to help maintain appropriate levels of security while providing the autonomy that admins desire.
You can get started using the multi-admin feature with Firewall Manager using the AWS Management Console. To learn more, refer to the service documentation.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
This blog post provides recommendations that you can use to help improve resiliency in the unlikely event of disrupted availability of the global (now legacy) AWS Security Token Service (AWS STS) endpoint. Although the global (legacy) AWS STS endpoint https://sts.amazonaws.com is highly available, it’s hosted in a single AWS Region—US East (N. Virginia)—and like other endpoints, it doesn’t provide automatic failover to endpoints in other Regions. In this post I will show you how to use Regional AWS STS endpoints in your configurations to improve the performance and resiliency of your workloads.
For authentication, it’s best to use temporary credentials instead of long-term credentials to help reduce risks, such as inadvertent disclosure, sharing, or theft of credentials. With AWS STS, trusted users can request temporary, limited-privilege credentials to access AWS resources.
Temporary credentials include an access key pair and a session token. The access key pair consists of an access key ID and a secret key. AWS STS generates temporary security credentials dynamically and provides them to the user when requested, which eliminates the need for long-term storage. Temporary security credentials have a limited lifetime so you don’t have to manage or rotate them.
To get these credentials, you can use several different methods:
An AWS principal, such as a role, user, or service
A web token, such as those provided by an OAuth 2.0 or OpenID Connect (OIDC) application
Figure 1: Methods to request credentials from AWS STS
Global (legacy) and Regional AWS STS endpoints
To connect programmatically to an AWS service, you use an endpoint. An endpoint is the URL of the entry point for AWS STS.
AWS STS provides Regional endpoints in every Region. AWS initially built AWS STS with a global endpoint (now legacy) https://sts.amazonaws.com, which is hosted in the US East (N. Virginia) Region (us-east-1). Regional AWS STS endpoints are activated by default for Regions that are enabled by default in your AWS account. For example, https://sts.us-east-2.amazonaws.com is the US East (Ohio) Regional endpoint. By default, AWS services use Regional AWS STS endpoints. For example, IAM Roles Anywhere uses the Regional STS endpoint that corresponds to the trust anchor. For a complete list of AWS STS endpoints for each Region, see AWS Security Token Service endpoints and quotas. You can’t activate an AWS STS endpoint in a Region that is disabled. For more information on which AWS STS endpoints are activated by default and which endpoints you can activate or deactivate, see Regions and endpoints.
As noted previously, the global (legacy) AWS STS endpoint https://sts.amazonaws.com is hosted in a single Region — US East (N. Virginia) — and like other endpoints, it doesn’t provide automatic failover to endpoints in other Regions. If your workloads on AWS or outside of AWS are configured to use the global (legacy) AWS STS endpoint https://sts.amazonaws.com, you introduce a dependency on a single Region: US East (N. Virginia). In the unlikely event that the endpoint becomes unavailable in that Region or connectivity between your resources and that Region is lost, your workloads won’t be able to use AWS STS to retrieve temporary credentials, which poses an availability risk to your workloads.
AWS recommends that you use Regional AWS STS endpoints (https://sts.<region-name>.amazonaws.com) instead of the global (legacy) AWS STS endpoint.
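To make the naming pattern concrete, the following small helper builds the endpoint URL for a given Region name. This is an illustrative sketch for the commercial AWS partition only; other partitions (such as AWS China) use different domain suffixes.

```python
def sts_endpoint(region=None):
    """Return the AWS STS endpoint URL for a Region, following the
    https://sts.<region-name>.amazonaws.com pattern. With no Region,
    return the global (legacy) endpoint, hosted in us-east-1."""
    if region is None:
        return "https://sts.amazonaws.com"
    return f"https://sts.{region}.amazonaws.com"

print(sts_endpoint("us-east-2"))  # https://sts.us-east-2.amazonaws.com
```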
In addition to improved resiliency, Regional endpoints have other benefits:
Isolation and containment — By making requests to an AWS STS endpoint in the same Region as your workloads, you can minimize cross-Region dependencies and align the scope of your resources with the scope of your temporary security credentials to help address availability and security concerns. For example, if your workloads are running in the US East (Ohio) Region, you can target the Regional AWS STS endpoint in the US East (Ohio) Region (us-east-2) to remove dependencies on other Regions.
Performance — By making your AWS STS requests to an endpoint that is closer to your services and applications, you can access AWS STS with lower latency and shorter response times.
Figure 2: Assume an IAM role by using an API call to a Regional AWS STS endpoint
Calls to AWS STS within the same Region
You should configure your workloads within a specific Region to use only the Regional AWS STS endpoint for that Region. By using a Regional endpoint, you can use AWS STS in the same Region as your workloads, removing cross-Region dependency. For example, workloads in the US East (Ohio) Region should use only the Regional endpoint https://sts.us-east-2.amazonaws.com to call AWS STS. If a Regional AWS STS endpoint becomes unreachable, your workloads shouldn’t call AWS STS endpoints outside of the operating Region. If your workload has a multi-Region resiliency requirement, your other active or standby Region should use a Regional AWS STS endpoint for that Region and should be deployed such that the application can function despite a Regional failure. You should direct STS traffic to the STS endpoint within the same Region, isolated and independent from other Regions, and remove dependencies on the global (legacy) endpoint.
Calls to AWS STS from outside AWS
You should configure your workloads outside of AWS to call the appropriate Regional AWS STS endpoints that offer the lowest latency to your workload located outside of AWS. If your workload has a multi-Region resiliency requirement, build failover logic for AWS STS calls to other Regions in the event that Regional AWS STS endpoints become unreachable. Temporary security credentials obtained from Regional AWS STS endpoints are valid globally for the default session duration or duration that you specify.
How to configure Regional AWS STS endpoints for your tools and SDKs
By default, the AWS CLI version 2 sends AWS STS API requests to the Regional AWS STS endpoint for the currently configured Region. If you are using AWS CLI v2, you don’t need to make additional changes.
By default, the AWS CLI v1 sends AWS STS requests to the global (legacy) AWS STS endpoint. To check the version of the AWS CLI that you are using, run the following command: $ aws --version.
When you run AWS CLI commands, the AWS CLI looks for credential configuration in a specific order—first in shell environment variables and then in the local AWS configuration file (~/.aws/config).
AWS SDK
AWS SDKs are available for a variety of programming languages and environments. Since July 2022, major new versions of the AWS SDK default to Regional AWS STS endpoints and use the endpoint corresponding to the currently configured Region. If you use a major version of the AWS SDK that was released after July 2022, you don’t need to make additional changes.
An AWS SDK looks at various configuration locations until it finds credential configuration values. For example, the AWS SDK for Python (Boto3) adheres to the following lookup order when it searches through sources for configuration values:
A configuration object created and passed as the AWS configuration parameter when creating a client
Environment variables
The AWS configuration file ~/.aws/config
If you still use AWS CLI v1, or your AWS SDK version doesn’t default to a Regional AWS STS endpoint, you have the following options to set the Regional AWS STS endpoint:
Option 1 — Use a shared AWS configuration file setting
The configuration file is located at ~/.aws/config on Linux or macOS, and at C:\Users\USERNAME\.aws\config on Windows. To use the Regional endpoint, add the sts_regional_endpoints parameter.
The following example shows how you can set the value for the Regional AWS STS endpoint in the US East (Ohio) Region (us-east-2), by using the default profile in the AWS configuration file:
[default]
region = us-east-2
sts_regional_endpoints = regional
The valid values for the AWS STS endpoint parameter (sts_regional_endpoints) are:
legacy (default) — Uses the global (legacy) AWS STS endpoint, sts.amazonaws.com.
regional — Uses the AWS STS endpoint for the currently configured Region.
Note: Since July 2022, major new versions of the AWS SDK default to Regional AWS STS endpoints and use the endpoint corresponding to the currently configured Region. If you are using AWS CLI v1, you must use version 1.16.266 or later to use the AWS STS endpoint parameter.
You can use the --debug option with the AWS CLI command to receive the debug log and validate which AWS STS endpoint was used.
If you search for UseGlobalEndpoint in your debug log, you’ll find that the UseGlobalEndpoint parameter is set to False, and you’ll see the Regional endpoint provider fully qualified domain name (FQDN) when the Regional AWS STS endpoint is configured in a shared AWS configuration file or environment variables:
For a list of AWS SDKs that support shared AWS configuration file settings for Regional AWS STS endpoints, see Compatibility with AWS SDKs.
Option 2 — Use environment variables
Environment variables provide another way to specify configuration options. They are global and affect calls to AWS services. Most SDKs support environment variables. When you set the environment variable, the SDK uses that value until the end of your shell session or until you set the variable to a different value. To make the variables persist across future sessions, set them in your shell’s startup script.
The following example shows how you can set the value for the Regional AWS STS endpoint in the US East (Ohio) Region (us-east-2) by using environment variables:
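On Linux or macOS, you can set the two variables for the current shell session as follows:

```shell
# Set the Region and opt in to Regional AWS STS endpoints
# for the current shell session.
export AWS_DEFAULT_REGION=us-east-2
export AWS_STS_REGIONAL_ENDPOINTS=regional
```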
You can run the command $ (echo $AWS_DEFAULT_REGION; echo $AWS_STS_REGIONAL_ENDPOINTS) to validate the variables. The output should look similar to the following:
us-east-2
regional
Windows
C:\> set AWS_DEFAULT_REGION=us-east-2
C:\> set AWS_STS_REGIONAL_ENDPOINTS=regional
The following example shows how you can configure an STS client with the AWS SDK for Python (Boto3) to use a Regional AWS STS endpoint by setting the environment variable:
import boto3
import os

os.environ["AWS_DEFAULT_REGION"] = "us-east-2"
os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

sts_client = boto3.client("sts")  # resolves to the Regional endpoint
You can use the metadata attribute sts_client.meta.endpoint_url to inspect and validate how an STS client is configured. The output should look similar to the following:
For a list of AWS SDKs that support environment variable settings for Regional AWS STS endpoints, see Compatibility with AWS SDKs.
Option 3 — Construct an endpoint URL
You can also manually construct an endpoint URL for a specific Regional AWS STS endpoint.
The following example shows how you can configure the STS client with AWS SDK for Python (Boto3) to use a Regional AWS STS endpoint by setting a specific endpoint URL:
You can create a private connection to AWS STS from the resources that you deployed in your Amazon VPCs. AWS STS integrates with AWS PrivateLink by using interface VPC endpoints. The network traffic on AWS PrivateLink stays on the global AWS network backbone and doesn’t traverse the public internet. When you configure a VPC endpoint for AWS STS, the traffic for the Regional AWS STS endpoint traverses to that endpoint.
By default, the DNS in your VPC will update the entry for the Regional AWS STS endpoint to resolve to the private IP address of the VPC endpoint for AWS STS in your VPC. The following output from an Amazon Elastic Compute Cloud (Amazon EC2) instance shows the DNS name for the AWS STS endpoint resolving to the private IP address of the VPC endpoint for AWS STS:
After you create an interface VPC endpoint for AWS STS in your Region, set the value for the respective Regional AWS STS endpoint by using environment variables to access AWS STS in the same Region.
The output of the following log shows that an AWS STS call was made to the Regional AWS STS endpoint:
POST
/
content-type:application/x-www-form-urlencoded; charset=utf-8
host:sts.us-east-2.amazonaws.com
Log AWS STS requests
You can use AWS CloudTrail events to get information about the request and endpoint that was used for AWS STS. This information can help you identify AWS STS request patterns and validate whether you are still using the global (legacy) STS endpoint.
An event in CloudTrail is the record of an activity in an AWS account. CloudTrail events provide a history of both API and non-API account activity made through the AWS Management Console, AWS SDKs, command line tools, and other AWS services.
Log locations
Requests to Regional AWS STS endpoints sts.<region-name>.amazonaws.com are logged in CloudTrail within their respective Region.
Requests to the global (legacy) STS endpoint sts.amazonaws.com are logged within the US East (N. Virginia) Region (us-east-1).
Log fields
For requests to both Regional AWS STS endpoints and the global (legacy) endpoint, the endpoint that served the request is recorded in the tlsDetails field in CloudTrail. You can use this field to determine whether a request was made to a Regional or the global (legacy) endpoint.
For requests made through a VPC endpoint, the endpoint's ID is recorded in the vpcEndpointId field in CloudTrail.
The following example shows a CloudTrail event for an STS request to a Regional AWS STS endpoint with a VPC endpoint.
The following example shows how to run a CloudWatch Logs Insights query to look for API calls made to the global (legacy) AWS STS endpoint. Before you can query CloudTrail events, you must configure a CloudTrail trail to send events to CloudWatch Logs.
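A query along these lines surfaces calls that used the global endpoint (field names follow the CloudTrail event schema; the log group you select must be the one receiving your CloudTrail events):

```
filter eventSource = "sts.amazonaws.com"
  and tlsDetails.clientProvidedHostHeader = "sts.amazonaws.com"
| fields eventTime, recipientAccountId, eventName,
         tlsDetails.clientProvidedHostHeader, sourceIPAddress, userIdentity.arn
| sort eventTime desc
```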
The query output shows event details for an AWS STS call made to the global (legacy) AWS STS endpoint https://sts.amazonaws.com.
Figure 3: Use a CloudWatch Log Insights query to look for STS API calls
Amazon Athena
The following example shows how to query CloudTrail events with Amazon Athena and search for API calls made to the global (legacy) AWS STS endpoint.
SELECT
eventtime,
recipientaccountid,
eventname,
tlsdetails.clientProvidedHostHeader,
sourceipaddress,
eventid,
useridentity.arn
FROM "cloudtrail_logs"
WHERE
eventsource = 'sts.amazonaws.com' AND
tlsdetails.clientProvidedHostHeader = 'sts.amazonaws.com'
ORDER BY eventtime DESC
The query output shows STS calls made to the global (legacy) AWS STS endpoint https://sts.amazonaws.com.
Figure 4: Use Athena to search for STS API calls and identify STS endpoints
Conclusion
In this post, you learned how to use Regional AWS STS endpoints to help improve resiliency, reduce latency, and increase session token usage for the operating Regions in your AWS environment.
AWS recommends that you check the configuration and usage of AWS STS endpoints in your environment, validate AWS STS activity in your CloudTrail logs, and confirm that Regional AWS STS endpoints are used.
When building API-based web applications in the cloud, there are two main types of communication flow in which identity is an integral consideration:
User-to-Service communication: Authenticate and authorize users to communicate with application services and APIs
Service-to-Service communication: Authenticate and authorize application services to talk to each other
To design an authentication and authorization solution for these flows, you need to add an extra dimension to each flow:
Authentication: What identity you will use and how it’s verified
Authorization: How to determine which identity can perform which task
In each flow, a user or a service must present some kind of credential to the application service so that it can determine whether the flow should be permitted. The credentials are often accompanied with other metadata that can then be used to make further access control decisions.
In this blog post, I show you two ways that you can use Amazon VPC Lattice to implement both communication flows. I also show you how to build a simple and clean architecture for securing your web applications with scalable authentication, providing authentication metadata to make coarse-grained access control decisions.
The example solution is based around a standard API-based application with multiple API components serving HTTP data over TLS. With this solution, I show that VPC Lattice can deliver authentication and authorization features to an application without requiring application builders to create this logic themselves. In this solution, the example application doesn't implement its own authentication or authorization, so you will use VPC Lattice and some additional proxying with Envoy, an open source, high-performance, and highly configurable proxy, to provide these features with minimal application change. The solution uses Amazon Elastic Container Service (Amazon ECS) as a container environment to run the API endpoints and OAuth proxy; however, Amazon ECS and containers aren't a prerequisite for VPC Lattice integration.
If your application already has client authentication, such as a web application using OpenID Connect (OIDC), you can still use the sample code to see how secure service-to-service flows can be implemented with VPC Lattice.
VPC Lattice configuration
VPC Lattice is an application networking service that connects, monitors, and secures communications between your services, helping to improve productivity so that your developers can focus on building features that matter to your business. You can define policies for network traffic management, access, and monitoring to connect compute services in a simplified and consistent way across instances, containers, and serverless applications.
For a web application, particularly those that are API based and composed of multiple components, VPC Lattice is a great fit. With VPC Lattice, you can use native AWS identity features for credential distribution and access control, without the operational overhead that many application security solutions require.
This solution uses a single VPC Lattice service network, with each of the application components represented as individual services. VPC Lattice auth policies are AWS Identity and Access Management (IAM) policy documents that you attach to service networks or services to control whether a specified principal has access to a group of services or specific service. In this solution we use an auth policy on the service network, as well as more granular policies on the services themselves.
User-to-service communication flow
For this example, the web application is constructed from multiple API endpoints. These are typical REST APIs, which provide API connectivity to various application components.
The most common method for securing REST APIs is by using OAuth2. OAuth2 allows a client (on behalf of a user) to interact with an authorization server and retrieve an access token. The access token is intended to be presented to a REST API and contains enough information to determine that the user identified in the access token has given their consent for the REST API to operate on their data on their behalf.
Access tokens use OAuth2 scopes to indicate user consent. Defining how OAuth2 scopes work is outside the scope of this post. You can learn about scopes in Permissions, Privileges, and Scopes on the Auth0 blog.
VPC Lattice doesn’t support OAuth2 client or inspection functionality, however it can verify HTTP header contents. This means you can use header matching within a VPC Lattice service policy to grant access to a VPC Lattice service only if the correct header is included. By generating the header based on validation occurring prior to entering the service network, we can use context about the user at the service network or service to make access control decisions.
Figure 1: User-to-service flow
The solution uses Envoy to terminate the HTTP request from an OAuth 2.0 client, as shown in Figure 1.
Envoy (shown as (1) in Figure 2) can validate access tokens (presented as a JSON Web Token (JWT) in an Authorization: Bearer header). If the access token can be validated, the scopes from the token are unpacked (2) and placed into X-JWT-Scope-<scopename> headers, using a simple inline Lua script. The Envoy documentation provides examples of how to use inline Lua in Envoy. Figure 2 shows how this process works at a high level.
Figure 2: JWT Scope to HTTP headers
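As a sketch of what such an inline Lua filter could look like (this is not the solution's actual script; it assumes the jwt_authn filter is configured to write the verified payload into dynamic metadata under an assumed key jwt_payload, and that scopes arrive in an scp claim):

```lua
-- Sketch: copy verified JWT scopes into request headers.
-- Assumes jwt_authn has already validated the token and stored its
-- payload in dynamic metadata under "jwt_payload" (assumed name).
function envoy_on_request(request_handle)
  local meta = request_handle:streamInfo():dynamicMetadata():get(
      "envoy.filters.http.jwt_authn")
  if meta and meta["jwt_payload"] and meta["jwt_payload"]["scp"] then
    for _, scope in ipairs(meta["jwt_payload"]["scp"]) do
      -- Produces headers such as x-jwt-scope-test.all: true
      request_handle:headers():add("x-jwt-scope-" .. scope, "true")
    end
  end
end
```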
Following this, Envoy uses Signature Version 4 (SigV4) to sign the request (3) and pass it to the VPC Lattice service. SigV4 signing is a native Envoy capability, but it requires the underlying compute that Envoy is running on to have access to AWS credentials. When you use AWS compute, assigning a role to that compute ensures that processes running on it, in this case Envoy, can obtain credentials.
By adding an authorization policy that permits access only from Envoy (through validating the Envoy SigV4 signature) and only with the correct scopes provided in HTTP headers, you can effectively lock down a VPC Lattice service to specific verified users coming from Envoy who are presenting specific OAuth2 scopes in their bearer token.
To answer the original question of where the identity comes from, the identity is provided by the user when communicating with their identity provider (IdP). In addition to this, Envoy is presenting its own identity from its underlying compute to enter the VPC Lattice service network. From a configuration perspective this means your user-to-service communication flow doesn’t require understanding of the user, or the storage of user or machine credentials.
The sample code provided shows a full Envoy configuration for VPC Lattice, including SigV4 signing, access token validation, and extraction of JWT contents to headers. This reference architecture supports various clients including server-side web applications, thick Java clients, and even command line interface-based clients calling the APIs directly. I don’t cover OAuth clients in detail in this post, however the optional sample code allows you to use an OAuth client and flow to talk to the APIs through Envoy.
Service-to-service communication flow
In the service-to-service flow, you need a way to provide AWS credentials to your applications and configure them to use SigV4 to sign their HTTP requests to the destination VPC Lattice services. Your application components can have their own identities (IAM roles), which allows you to uniquely identify application components and make access control decisions based on the particular flow required. For example, application component 1 might need to communicate with application component 2, but not application component 3.
If you have full control of your application code and have a clean method for locating the destination services, then this might be something you can implement directly in your server code. This is the configuration implemented in the AWS Cloud Development Kit (AWS CDK) solution that accompanies this blog post: the app1, app2, and app3 web servers are capable of making SigV4 signed requests to the VPC Lattice services they need to communicate with. The sample code demonstrates how to perform VPC Lattice SigV4 requests in node.js using the aws-crt node bindings. Figure 3 depicts the use of SigV4 authentication between services and VPC Lattice.
Figure 3: Service-to-service flow
To answer the question of where the identity comes from in this flow, you use the native SigV4 signing support from VPC Lattice to validate the application identity. The credentials come from AWS STS, again through the native underlying compute environment. Providing credentials transparently to your applications is one of the biggest advantages of the VPC Lattice solution compared to other application security solutions such as service meshes. This implementation requires no provisioning of credentials and no management of identity stores, and it automatically rotates credentials as required. The result is low overhead to deploy and maintain, while benefiting from the reliability and scalability of IAM and the AWS Security Token Service (AWS STS) — a very slick way to secure service-to-service communication flows!
VPC Lattice policy configuration
VPC Lattice provides two levels of auth policy configuration — at the VPC Lattice service network and on individual VPC Lattice services. This allows your cloud operations and development teams to work independently of each other by removing the dependency on a single team to implement access controls. This model enables both agility and separation of duties. More information about VPC Lattice policy configuration can be found in Control access to services using auth policies.
Service network auth policy
This design uses a service network auth policy that permits access to the service network by specific IAM principals. This can be used as a guardrail to provide overall access control over the service network and underlying services. Removal of an individual service auth policy will still enforce the service network policy first, so you can have confidence that you can identify sources of network traffic into the service network and block traffic that doesn’t come from a previously defined AWS principal.
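One possible shape for such a service network auth policy (the account ID and role ARNs are placeholders based on the roles the solution deploys):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::111122223333:role/app1TaskRole",
          "arn:aws:iam::111122223333:role/app2TaskRole",
          "arn:aws:iam::111122223333:role/app3TaskRole",
          "arn:aws:iam::111122223333:role/EnvoyFrontendTaskRole"
        ]
      },
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "*"
    }
  ]
}
```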
The preceding auth policy example grants permissions to any authenticated request that uses one of the IAM roles app1TaskRole, app2TaskRole, app3TaskRole or EnvoyFrontendTaskRole to make requests to the services attached to the service network. You will see in the next section how service auth policies can be used in conjunction with service network auth policies.
Service auth policies
Individual VPC Lattice services can have their own policies defined and implemented independently of the service network policy. This design uses a service policy to demonstrate both user-to-service and service-to-service access control.
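A service auth policy of roughly this shape could express both flows for app1 (account ID and role names are placeholders; the header match uses the VPC Lattice RequestHeader condition key):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "UserToService",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/EnvoyFrontendTaskRole" },
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "vpc-lattice-svcs:RequestHeader/x-jwt-scope-test.all": "true"
        }
      }
    },
    {
      "Sid": "ServiceToService",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/app2TaskRole" },
      "Action": "vpc-lattice-svcs:Invoke",
      "Resource": "*"
    }
  ]
}
```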
The preceding auth policy is an example that could be attached to the app1 VPC Lattice service. The policy contains two statements:
The first (labelled “Sid”: “UserToService”) provides user-to-service authorization and requires the caller principal to be EnvoyFrontendTaskRole and the request headers to contain the header x-jwt-scope-test.all: true when calling the app1 VPC Lattice service.
The second (labelled “Sid”: “ServiceToService”) provides service-to-service authorization and requires the caller principal to be app2TaskRole when calling the app1 VPC Lattice service.
As with a standard IAM policy, there is an implicit deny, meaning no other principals will be permitted access.
The caller principals are identified by VPC Lattice through the SigV4 signing process. This means that by using the identities provisioned to the underlying compute, the network flow can be associated with a service identity, which can then be authorized by VPC Lattice service access policies.
Distributed development
This model of access control supports a distributed development and operational model. Because the service network auth policy is decoupled from the service auth policies, the service auth policies can be iterated upon by a development team without impacting the overall policy controls set by an operations team for the entire service network.
The AWS CDK solution deploys four Amazon ECS services, one for the frontend Envoy server for the client-to-service flow, and the remaining three for the backend application components. Figure 4 shows the solution when deployed with the internal domain parameter application.internal.
Backend application components are a simple node.js express server, which will print the contents of your request in JSON format and perform service-to-service calls.
A number of other infrastructure components are deployed to support the solution:
A VPC with associated subnets, NAT gateways and an internet gateway. Internet access is required for the solution to retrieve JSON Web Key Set (JWKS) details from your OAuth provider.
An Amazon Route53 hosted zone for handling traffic routing to the configured domain and VPC Lattice services.
An Amazon ECS cluster (two container hosts by default) to run the ECS tasks.
All application load balancers are internally facing.
Application component load balancers are configured to only accept traffic from the VPC Lattice managed prefix list.
The frontend Envoy load balancer is configured to accept traffic from any host.
Three VPC Lattice services and one VPC Lattice network.
The code for Envoy and the application components can be found in the lattice_soln/containers directory.
AWS CDK code for all other deployable infrastructure can be found in lattice_soln/lattice_soln_stack.py.
Prerequisites
Before you begin, you must have the following prerequisites in place:
An AWS account to deploy solution resources into. AWS credentials should be available to the AWS CDK in the environment or configuration files for the CDK deploy to function.
Python 3.9.6 or higher
Docker or Finch for building containers. If using Finch, ensure the Finch executable is in your path and instruct the CDK to use it with the command export CDK_DOCKER=finch
Enable elastic network interface (ENI) trunking in your account to allow more containers to run in VPC networking mode:
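One way to enable the account setting is with the ECS CLI (applies account-wide; run with administrator credentials):

```
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled
```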
This solution has been tested using Okta; however, any OAuth-compatible provider will work if it can issue access tokens and you can retrieve them from the command line.
The following instructions describe the configuration process for Okta using the Okta web UI. This allows you to use the device code flow to retrieve access tokens, which can then be validated by the Envoy frontend deployment.
Create a new app integration
In the Okta web UI, select Applications and then choose Create App Integration.
For Sign-in method, select OpenID Connect.
For Application type, select Native Application.
For Grant Type, select both Refresh Token and Device Authorization.
Note the client ID for use in the device code flow.
Create a new API integration
Still in the Okta web UI, select Security, and then choose API.
Choose Add authorization server.
Enter a name and audience. Note the audience for use during CDK installation, then choose Save.
Select the authorization server you just created. Choose the Metadata URI link to open the metadata contents in a new tab or browser window. Note the jwks_uri and issuer fields for use during CDK installation.
Return to the Okta web UI, select Scopes and then Add scope.
For the scope name, enter test.all. Use the scope name for the display phrase and description. Leave User consent as implicit. Choose Save.
Under Access Policies, choose Add New Access Policy.
For Assign to, select The following clients and select the client you created above.
Choose Add rule.
In Rule name, enter a rule name, such as Allow test.all access
Under If Grant Type Is, uncheck all but Device Authorization. Under And Scopes Requested, choose The following scopes. Select OIDC default scopes to add the default scopes to the scopes box, then manually add the test.all scope you created above.
During the API Integration step, you should have collected the audience, JWKS URI, and issuer. These fields are used on the command line when installing the CDK project with OAuth support.
You can then use the process described in configure the smart device to retrieve an access token using the device code flow. Make sure you modify scope to include test.all — scope=openid profile offline_access test.all — so your token matches the policy deployed by the solution.
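As an illustration of the device code flow against Okta (the URLs and client ID are placeholders; the base URL is your authorization server's issuer noted earlier):

```
# Step 1: request a device and user code, asking for the custom scope
curl -X POST https://dev-123456.okta.com/oauth2/ausa1234567/v1/device/authorize \
  -d "client_id=<client id>" \
  -d "scope=openid profile offline_access test.all"

# Step 2: after approving in a browser, exchange the device code for tokens
curl -X POST https://dev-123456.okta.com/oauth2/ausa1234567/v1/token \
  -d "client_id=<client id>" \
  -d "grant_type=urn:ietf:params:oauth:grant-type:device_code" \
  -d "device_code=<device code from step 1>"
```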
Installation
You can download the deployable solution from GitHub.
Deploy without OAuth functionality
If you only want to deploy the solution with service-to-service flows, you can deploy with a CDK command similar to the following:
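Based on the enable_oauth context flag used by the full OAuth deployment later in this section, a minimal deployment without OAuth might look like the following (the flag value is an assumption):

```
$ cdk deploy -c enable_oauth=False
```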
To deploy the solution with OAuth functionality, you must provide the following parameters:
jwt_jwks: The URL for retrieving JWKS details from your OAuth provider. This would look something like https://dev-123456.okta.com/oauth2/ausa1234567/v1/keys
jwt_issuer: The issuer for your OAuth access tokens. This would look something like https://dev-123456.okta.com/oauth2/ausa1234567
jwt_audience: The audience configured for your OAuth protected APIs. This is a text string configured in your OAuth provider.
app_domain: The domain to be configured in Route53 for all URLs provided for this application. This domain is local to the VPC created for the solution. For example application.internal.
The solution can be deployed with a CDK command as follows:
$ cdk deploy -c enable_oauth=True -c jwt_jwks=<URL for retrieving JWKS details> \
-c jwt_issuer=<URL of the issuer for your OAuth access tokens> \
-c jwt_audience=<OAuth audience string> \
-c app_domain=<application domain>
Security model
For this solution, network access to the web application is secured through two main controls:
Entry into the service network requires SigV4 authentication, enforced by the service network policy. No other mechanisms are provided to allow access to the services, either through their load balancers or directly to the containers.
Service policies restrict access to either user- or service-based communication based on the identity of the caller and OAuth subject and scopes.
The Envoy configuration strips any x- headers coming from user clients and replaces them with x-jwt-subject and x-jwt-scope headers based on successful JWT validation. You are then able to match these x-jwt-* headers in VPC Lattice policy conditions.
Solution caveats
This solution implements TLS endpoints on VPC Lattice and Application Load Balancers. To reduce cost for this example, the container instances do not implement TLS, so traffic is in cleartext between the Application Load Balancers and the container instances; TLS for this hop can be implemented separately if required.
How to use the solution
Now for the interesting part! As part of solution deployment, you’ve deployed a number of Amazon Elastic Compute Cloud (Amazon EC2) hosts to act as the container environment. You can use these hosts to test some of the flows, and you can use the AWS Systems Manager connect function from the AWS Management Console to access the command line interface on any of the container hosts.
In these examples, I’ve configured the domain during the CDK installation as application.internal, which will be used for communicating with the application as a client. If you change this, adjust your command lines to match.
[Optional] For examples 3 and 4, you need an access token from your OAuth provider. In each of the examples, I’ve embedded the access token in the AT environment variable for brevity.
Example 1: Service-to-service calls (permitted)
For these first two examples, you must sign in to the container host and run a command in your container. This is because the VPC Lattice policies allow traffic from the containers. I’ve assigned IAM task roles to each container, which are used to uniquely identify them to VPC Lattice when making service-to-service calls.
To set up service-to-service calls (permitted):
Sign in to the Amazon ECS console. You should see at least three ECS services running.
Figure 5: Cluster console
Select the app2 service LatticeSolnStack-app2service…, then select the Tasks tab. Under the Container Instances heading select the container instance that’s running the app2 service.
Figure 6: Container instances
You will see the instance ID listed at the top left of the page.
Figure 7: Single container instance
Select the instance ID (this will open a new window) and choose Connect. Select the Session Manager tab and choose Connect again. This will open a shell to your container instance.
The policy statements permit app2 to call app1. By using the path app2/call-to-app1, you can force this call to occur.
Test this with the following commands:
sh-4.2$ sudo bash
# docker ps --filter "label=webserver=app2"
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
<containerid> 111122223333.dkr.ecr.ap-southeast-2.amazonaws.com/cdk-hnb659fds-container-assets-111122223333-ap-southeast-2:5b5d138c3abd6cfc4a90aee4474a03af305e2dae6bbbea70bcc30ffd068b8403 "sh /app/launch_expr…" 9 minutes ago Up 9 minutes ecs-LatticeSolnStackapp2task4A06C2E4-22-app2-container-b68cb2ffd8e4e0819901
# docker exec -it <containerid> curl localhost:80/app2/call-to-app1
The policy statements don’t permit app2 to call app3. You can simulate this in the same way and verify that the access isn’t permitted by VPC Lattice.
To set up service-to-service calls (denied)
You can change the curl command from Example 1 to test app2 calling app3.
# docker exec -it cd8420221dcb curl localhost:80/app2/call-to-app3
{
"upstreamResponse": "AccessDeniedException: User: arn:aws:sts::111122223333:assumed-role/LatticeSolnStack-app2TaskRoleA1BE533B-3K7AJnCr8kTj/ddaf2e517afb4d818178f9e0fef8f841 is not authorized to perform: vpc-lattice-svcs:Invoke on resource: arn:aws:vpc-lattice:ap-southeast-2:111122223333:service/svc-08873e50553c375cd/ with an explicit deny in a service-based policy"
}
[Optional] Example 3: OAuth – Invalid access token
If you’ve deployed using OAuth functionality, you can test from the shell in Example 1 that you’re unable to access the frontend Envoy server (application.internal) without a valid access token, and that you’re also unable to access the backend VPC Lattice services (app1.application.internal, app2.application.internal, app3.application.internal) directly.
You can also verify that you cannot bypass the VPC Lattice service and connect to the load balancer or web server container directly.
sh-4.2$ curl -v https://application.internal
Jwt is missing
sh-4.2$ curl https://app1.application.internal
AccessDeniedException: User: anonymous is not authorized to perform: vpc-lattice-svcs:Invoke on resource: arn:aws:vpc-lattice:ap-southeast-2:111122223333:service/svc-03edffc09406f7e58/ because no network-based policy allows the vpc-lattice-svcs:Invoke action
sh-4.2$ curl https://internal-Lattic-app1s-C6ovEZzwdTqb-1882558159.ap-southeast-2.elb.amazonaws.com
^C
sh-4.2$ curl https://10.0.209.127
^C
[Optional] Example 4: Client access
If you’ve deployed using OAuth functionality, you can test from the shell in Example 1 to access the application with a valid access token. A client can reach each application component by using application.internal/<componentname>. For example, application.internal/app2. If no component name is specified, it will default to app1.
This will fail when attempting to connect to app3 through Envoy, as we've denied user-to-service calls to app3 in the VPC Lattice service policy:
sh-4.2$ curl https://application.internal/app3 -H "Authorization: Bearer $AT"
AccessDeniedException: User: arn:aws:sts::111122223333:assumed-role/LatticeSolnStack-EnvoyFrontendTaskRoleA297DB4D-OwD8arbEnYoP/827dc1716e3a49ad8da3fd1dd52af34c is not authorized to perform: vpc-lattice-svcs:Invoke on resource: arn:aws:vpc-lattice:ap-southeast-2:111122223333:service/svc-06987d9ab4a1f815f/app3 with an explicit deny in a service-based policy
Summary
You’ve seen how you can use VPC Lattice to provide authentication and authorization to both user-to-service and service-to-service flows. I’ve shown you how to implement some novel and reusable solution components:
JWT authorization and translation of scopes to headers, integrating an external IdP into your solution for user authentication.
SigV4 signing from an Envoy proxy running in a container.
Service-to-service flows using SigV4 signing in node.js and container-based credentials.
Integration of VPC Lattice with ECS containers, using the CDK.
All of this is created almost entirely with managed AWS services, meaning you can focus more on security policy creation and validation and less on managing components such as service identities, service meshes, and other self-managed infrastructure.
Some ways you can extend upon this solution include:
Implementing different service policies taking into consideration different OAuth scopes for your user and client combinations
Implementing multiple issuers on Envoy to allow different OAuth providers to use the same infrastructure
Deploying the VPC Lattice services and ECS tasks independently of the service network, to allow your builders to manage task deployment themselves
I look forward to hearing about how you use this solution and VPC Lattice to secure your own applications!
Streaming ingestion from Amazon MSK into Amazon Redshift represents a cutting-edge approach to real-time data processing and analysis. Amazon MSK serves as a highly scalable, fully managed service for Apache Kafka, allowing for seamless collection and processing of vast streams of data. Integrating streaming data into Amazon Redshift brings immense value by enabling organizations to harness the potential of real-time analytics and data-driven decision-making.
This integration enables you to achieve low latency, measured in seconds, while ingesting hundreds of megabytes of streaming data per second into Amazon Redshift. At the same time, this integration helps make sure that the most up-to-date information is readily available for analysis. Because the integration doesn’t require staging data in Amazon S3, Amazon Redshift can ingest streaming data at a lower latency and without intermediary storage cost.
You can configure Amazon Redshift streaming ingestion on a Redshift cluster using SQL statements to authenticate and connect to an MSK topic. This solution is an excellent option for data engineers who are looking to simplify data pipelines and reduce operational cost.
The following architecture diagram describes the AWS services and features you will be using.
The workflow includes the following steps:
You start by configuring an Amazon MSK Connect source connector to create an MSK topic, generate mock data, and write it to the topic. For this post, we work with mock customer data.
The next step is to connect to a Redshift cluster using the Query Editor v2.
Finally, you configure an external schema and create a materialized view in Amazon Redshift, to consume the data from the MSK topic. This solution does not rely on an MSK Connect sink connector to export the data from Amazon MSK to Amazon Redshift.
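The external schema and materialized view follow Amazon Redshift's streaming ingestion syntax; a sketch (the schema name, view name, IAM role ARN, cluster ARN, and topic name customer are placeholders) might look like:

```sql
CREATE EXTERNAL SCHEMA msk_external_schema
FROM MSK
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-role'
AUTHENTICATION iam
CLUSTER_ARN '<your MSK cluster ARN>';

CREATE MATERIALIZED VIEW customer_mv AUTO REFRESH YES AS
SELECT kafka_partition,
       kafka_offset,
       refresh_time,
       JSON_PARSE(kafka_value) AS customer_data
FROM msk_external_schema."customer";
```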
The following solution architecture diagram describes in more detail the configuration and integration of the AWS services you will be using.
The workflow includes the following steps:
You deploy an MSK Connect source connector, an MSK cluster, and a Redshift cluster within the private subnets on a VPC.
The MSK cluster uses a custom MSK cluster configuration, allowing the MSK Connect connector to create topics on the MSK cluster.
The MSK cluster logs are captured and sent to an Amazon CloudWatch log group.
The Redshift cluster uses granular permissions defined in an IAM inline policy attached to an IAM role, which allows the Redshift cluster to perform actions on the MSK cluster.
You can use the Query Editor v2 to connect to the Redshift cluster.
Prerequisites
To simplify the provisioning and configuration of the prerequisite resources, you can use the following AWS CloudFormation template:
Complete the following steps when launching the stack:
For Stack name, enter a meaningful name for the stack, for example, prerequisites.
Choose Next.
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Submit.
The CloudFormation stack creates the following resources:
An MSK cluster, with three brokers deployed across the three private subnets of custom-vpc. The msk-cluster-sg security group and the custom-msk-cluster-configuration configuration are applied to the MSK cluster. The broker logs are delivered to the msk-cluster-logs CloudWatch log group.
A Redshift cluster, with one single node deployed in a private subnet within the Redshift cluster subnet group. The redshift-sg security group and redshift-role IAM role are applied to the Redshift cluster.
Create an MSK Connect custom plugin
For this post, we use an Amazon MSK data generator deployed in MSK Connect to generate mock customer data and write it to an MSK topic.
Upload the JAR file into an S3 bucket in your AWS account.
On the Amazon MSK console, choose Custom plugins under MSK Connect in the navigation pane.
Choose Create custom plugin.
Choose Browse S3, search for the Amazon MSK data generator JAR file you uploaded to Amazon S3, then choose Choose.
For Custom plugin name, enter msk-datagen-plugin.
Choose Create custom plugin.
When the custom plugin is created, you will see that its status is Active, and you can move to the next step.
Create an MSK Connect connector
Complete the following steps to create your connector:
On the Amazon MSK console, choose Connectors under MSK Connect in the navigation pane.
Choose Create connector.
For Custom plugin type, choose Use existing plugin.
Select msk-datagen-plugin, then choose Next.
For Connector name, enter msk-datagen-connector.
For Cluster type, choose Self-managed Apache Kafka cluster.
For VPC, choose custom-vpc.
For Subnet 1, choose the private subnet within your first Availability Zone.
For the custom-vpc created by the CloudFormation template, we are using odd CIDR ranges for public subnets, and even CIDR ranges for the private subnets:
The CIDRs for the public subnets are 10.10.1.0/24, 10.10.3.0/24, and 10.10.5.0/24
The CIDRs for the private subnets are 10.10.2.0/24, 10.10.4.0/24, and 10.10.6.0/24
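Given that convention, a quick standard-library check tells you which kind of subnet a CIDR belongs to. This odd/even rule is specific to this template's custom-vpc, not a general AWS property:

```python
import ipaddress

def subnet_kind(cidr: str) -> str:
    """Classify a custom-vpc subnet by the odd/even third-octet convention."""
    network = ipaddress.ip_network(cidr)
    third_octet = int(str(network.network_address).split(".")[2])
    # Odd third octet -> public subnet, even -> private subnet
    return "public" if third_octet % 2 == 1 else "private"

for cidr in ["10.10.1.0/24", "10.10.2.0/24", "10.10.5.0/24", "10.10.6.0/24"]:
    print(cidr, "->", subnet_kind(cidr))
```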
For Subnet 2, select the private subnet within your second Availability Zone.
For Subnet 3, select the private subnet within your third Availability Zone.
For Bootstrap servers, enter the list of bootstrap servers for TLS authentication of your MSK cluster.
To retrieve the bootstrap servers for your MSK cluster, navigate to the Amazon MSK console, choose Clusters, choose msk-cluster, then choose View client information. Copy the TLS values for the bootstrap servers.
For Security groups, choose Use specific security groups with access to this cluster, and choose msk-connect-sg.
For Connector configuration, replace the default settings with the following:
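The configuration to paste isn't reproduced here. As an illustration only, an Amazon MSK Data Generator configuration that produces mock records to a customer topic typically follows this pattern (the field generators shown are hypothetical examples, not the exact settings from this post):

```properties
connector.class=com.amazonaws.mskdatagen.GeneratorSourceConnector
genkp.customer.with=#{Code.isbn10}
genv.customer.name.with=#{Name.full_name}
genv.customer.gender.with=#{Demographic.sex}
genv.customer.state.with=#{Address.state}
tasks.max=2
```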
For Worker configuration, choose Use the MSK default configuration.
For Access permissions, choose msk-connect-role.
Choose Next.
For Encryption, select TLS encrypted traffic.
Choose Next.
For Log delivery, choose Deliver to Amazon CloudWatch Logs.
Choose Browse, select msk-connect-logs, and choose Choose.
Choose Next.
Review and choose Create connector.
After the custom connector is created, you will see that its status is Running, and you can move to the next step.
Configure Amazon Redshift streaming ingestion for Amazon MSK
Complete the following steps to set up streaming ingestion:
Connect to your Redshift cluster using Query Editor v2, and authenticate with the database user name awsuser, and password Awsuser123.
Create an external schema from Amazon MSK using the following SQL statement.
In the following code, enter the values for the redshift-role IAM role ARN and the msk-cluster cluster ARN.
CREATE EXTERNAL SCHEMA msk_external_schema
FROM MSK
IAM_ROLE '<insert your redshift-role arn>'
AUTHENTICATION iam
CLUSTER_ARN '<insert your msk-cluster arn>';
Create a materialized view to consume the data from the MSK topic, using the following SQL statement:
CREATE MATERIALIZED VIEW msk_mview AUTO REFRESH YES AS
SELECT
    "kafka_partition",
    "kafka_offset",
    "kafka_timestamp_type",
    "kafka_timestamp",
    "kafka_key",
    JSON_PARSE(kafka_value) as Data,
    "kafka_headers"
FROM
    "dev"."msk_external_schema"."customer";
Choose Run to run the SQL statement.
You can now query the materialized view using the following SQL statement:
select * from msk_mview LIMIT 100;
Choose Run to run the SQL statement.
To monitor the progress of records loaded via streaming ingestion, you can take advantage of the SYS_STREAM_SCAN_STATES monitoring view using the following SQL statement:
select * from SYS_STREAM_SCAN_STATES;
Choose Run to run the SQL statement.
To monitor errors encountered on records loaded via streaming ingestion, you can take advantage of the SYS_STREAM_SCAN_ERRORS monitoring view using the following SQL statement:
select * from SYS_STREAM_SCAN_ERRORS;
Choose Run to run the SQL statement.
Clean up
After following along, if you no longer need the resources you created, delete them in the following order to prevent incurring additional charges:
Delete the MSK Connect connector msk-datagen-connector.
Delete the MSK Connect plugin msk-datagen-plugin.
Delete the Amazon MSK data generator JAR file you downloaded, and delete the S3 bucket you created.
After you delete your MSK Connect connector, you can delete the CloudFormation stack. All the resources created by the CloudFormation template will be automatically deleted from your AWS account.
Conclusion
In this post, we demonstrated how to configure Amazon Redshift streaming ingestion from Amazon MSK, with a focus on privacy and security.
Combining the ability of Amazon MSK to handle high-throughput data streams with the robust analytical capabilities of Amazon Redshift empowers businesses to derive actionable insights promptly. This real-time data integration enhances the agility and responsiveness of organizations in understanding changing data trends, customer behaviors, and operational patterns, enabling timely and informed decision-making and a competitive edge in today’s dynamic business landscape.
We hope this post was a good opportunity to learn more about AWS service integration and configuration. Let us know your feedback in the comments section.
About the authors
Sebastian Vlad is a Senior Partner Solutions Architect with Amazon Web Services, with a passion for data and analytics solutions and customer success. Sebastian works with enterprise customers to help them design and build modern, secure, and scalable solutions to achieve their business outcomes.
Sharad Pai is a Lead Technical Consultant at AWS. He specializes in streaming analytics and helps customers build scalable solutions using Amazon MSK and Amazon Kinesis. He has over 16 years of industry experience and is currently working with media customers who are hosting live streaming platforms on AWS, managing peak concurrency of over 50 million. Prior to joining AWS, Sharad’s career as a lead software developer included 9 years of coding, working with open source technologies like JavaScript, Python, and PHP.
AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
AWS Glue customers often have to meet strict security requirements, which sometimes involve locking down the network connectivity allowed to the job, or running inside a specific VPC to access another service. To run inside the VPC, the job needs to be assigned to a single subnet, but the most suitable subnet can change over time (for instance, based on the usage and availability), so you may prefer to make that decision at runtime, based on your own strategy.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is an AWS service to run managed Airflow workflows, which allow writing custom logic to coordinate how tasks such as AWS Glue jobs run.
In this post, we show how to run an AWS Glue job as part of an Airflow workflow, with dynamic configurable selection of the VPC subnet assigned to the job at runtime.
Solution overview
To run inside a VPC, an AWS Glue job needs to be assigned at least one connection that includes a network configuration. Any connection allows specifying a VPC, subnet, and security group, but for simplicity, this post uses connections of type NETWORK, which just define the network configuration and don’t involve external systems.
If the job has a fixed subnet assigned by a single connection, in case of a service outage on the Availability Zones or if the subnet isn’t available for other reasons, the job can’t run. Furthermore, each node (driver or worker) in an AWS Glue job requires an IP address assigned from the subnet. When running many large jobs concurrently, this could lead to an IP address shortage and the job running with fewer nodes than intended or not running at all.
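That constraint can be sketched as a simple capacity check (assuming each Glue node, driver or worker, consumes one IP address — the same assumption the strategy code later in this post makes):

```python
def classify_subnet(available_ips: int, num_workers: int) -> str:
    """Classify whether a subnet can host a Glue job needing num_workers nodes.
    Assumes one IP address per node (driver or worker)."""
    if available_ips >= num_workers:
        return "full capacity"      # every node gets an IP address
    if available_ips >= 2:
        return "reduced capacity"   # the bare minimum to start the job
    return "cannot run"

print(classify_subnet(available_ips=120, num_workers=10))  # full capacity
print(classify_subnet(available_ips=5, num_workers=10))    # reduced capacity
print(classify_subnet(available_ips=1, num_workers=10))    # cannot run
```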
AWS Glue extract, transform, and load (ETL) jobs allow multiple connections to be specified with multiple network configurations. However, the job will always try to use the connections’ network configuration in the order listed and pick the first one that passes the health checks and has at least two IP addresses to get the job started, which might not be the optimal option.
With this solution, you can enhance and customize that behavior by reordering the connections dynamically and defining the selection priority. If a retry is needed, the connections are reprioritized again based on the strategy, because the conditions might have changed since the last run.
As a result, it helps prevent the job from failing to run or running under capacity due to subnet IP address shortage or even an outage, while meeting the network security and connectivity requirements.
The following diagram illustrates the solution architecture.
Prerequisites
To follow the steps of the post, you need a user that can log in to the AWS Management Console and has permission to access Amazon MWAA, Amazon Virtual Private Cloud (Amazon VPC), and AWS Glue. The AWS Region where you choose to deploy the solution needs the capacity to create a VPC and two elastic IP addresses. The default Regional quota for both types of resources is five, so you might need to request an increase via the console.
First, you’ll deploy a new Airflow environment, including the creation of a new VPC with two public subnets and two private ones. This is because Amazon MWAA requires Availability Zone failure tolerance, so it needs to run on two subnets on two different Availability Zones in the Region. The public subnets are used so the NAT Gateway can provide internet access for the private subnets.
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose Create stack with the option With new resources (standard).
Choose Upload a template file and choose the local template file.
Choose Next.
Complete the setup steps, entering a name for the environment, and leave the rest of the parameters as default.
On the last step, acknowledge that resources will be created and choose Submit.
The creation can take 20–30 minutes, until the status of the stack changes to CREATE_COMPLETE.
The resource that will take most of time is the Airflow environment. While it’s being created, you can continue with the following steps, until you are required to open the Airflow UI.
On the stack’s Resources tab, note the IDs for the VPC and two private subnets (PrivateSubnet1 and PrivateSubnet2), to use in the next step.
Create AWS Glue connections
The CloudFormation template deploys two private subnets. In this step, you create an AWS Glue connection to each one so AWS Glue jobs can run in them. Amazon MWAA recently added the capacity to run the Airflow cluster on shared VPCs, which reduces cost and simplifies network management. For more information, refer to Introducing shared VPC support on Amazon MWAA.
Complete the following steps to create the connections:
On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection.
Choose Network as the data source.
Choose the VPC and private subnet (PrivateSubnet1) created by the CloudFormation stack.
Use the default security group.
Choose Next.
For the connection name, enter MWAA-Glue-Blog-Subnet1.
Review the details and complete the creation.
Repeat these steps using PrivateSubnet2 and name the connection MWAA-Glue-Blog-Subnet2.
Create the AWS Glue job
Now you create the AWS Glue job that will be triggered later by the Airflow workflow. The job uses the connections created in the previous section, but instead of assigning them directly on the job, as you would normally do, in this scenario you leave the job connections list empty and let the workflow decide which one to use at runtime.
The job script in this case is not significant; it is just intended to demonstrate that the job ran in one of the two subnets, depending on the connection used.
On the AWS Glue console, choose ETL jobs in the navigation pane, then choose Script editor.
Leave the default options (Spark engine and Start fresh) and choose Create script.
Replace the placeholder script with the following Python code:
import ipaddress
import socket

subnets = {
    "PrivateSubnet1": "10.192.20.0/24",
    "PrivateSubnet2": "10.192.21.0/24"
}
ip = socket.gethostbyname(socket.gethostname())
subnet_name = "unknown"
for subnet, cidr in subnets.items():
    if ipaddress.ip_address(ip) in ipaddress.ip_network(cidr):
        subnet_name = subnet
print(f"The driver node has been assigned the ip: {ip}"
      + f" which belongs to the subnet: {subnet_name}")
Rename the job to AirflowBlogJob.
On the Job details tab, for IAM Role, choose any AWS Glue service role (this script doesn’t access other services), and enter 2 for the number of workers (just for frugality).
Save these changes so the job is created.
Grant AWS Glue permissions to the Airflow environment role
The role created for Airflow by the CloudFormation template provides the basic permissions to run workflows but not to interact with other services such as AWS Glue. In a production project, you would define your own templates with these additional permissions, but in this post, for simplicity, you add the additional permissions as an inline policy. Complete the following steps:
On the IAM console, choose Roles in the navigation pane.
Locate the role created by the template; it will start with the name you assigned to the CloudFormation stack and then -MwaaExecutionRole-.
On the role details page, on the Add permissions menu, choose Create inline policy.
Switch from Visual to JSON mode and enter the following JSON on the textbox. It assumes that the AWS Glue role you have follows the convention of starting with AWSGlueServiceRole. For enhanced security, you can replace the wildcard resource on the ec2:DescribeSubnets permission with the ARNs of the two private subnets from the CloudFormation stack.
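The policy JSON isn’t reproduced here. Based on the API calls the DAG makes later in this post (get_connection, describe_subnets, get_job, plus the job update and run performed by GlueJobOperator), it needs permissions along these lines — a sketch under those assumptions, not the exact policy from the original post:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetConnection",
        "glue:GetJob",
        "glue:UpdateJob",
        "glue:StartJobRun",
        "glue:GetJobRun"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "ec2:DescribeSubnets",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*"
    }
  ]
}
```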
Enter GlueRelatedPermissions as the policy name and complete the creation.
In this example, we use an ETL script job; for a visual job, because it generates the script automatically on save, the Airflow role would need permission to write to the configured script path on Amazon Simple Storage Service (Amazon S3).
Create the Airflow DAG
An Airflow workflow is based on a Directed Acyclic Graph (DAG), which is defined by a Python file that programmatically specifies the different tasks involved and their interdependencies. Complete the following steps to create the DAG:
Create a local file named glue_job_dag.py using a text editor.
In each of the following steps, we provide a code snippet to enter into the file and an explanation of what it does.
The following snippet adds the required Python module imports. The modules are already installed on Airflow; if that weren’t the case, you would need to use a requirements.txt file to indicate to Airflow which modules to install. It also defines the Boto3 clients that the code will use later. By default, they use the same role and Region as Airflow, which is why you previously set up the role with the additional permissions required.
import boto3
from pendulum import datetime, duration
from random import shuffle
from airflow import DAG
from airflow.decorators import dag, task
from airflow.models import Variable
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator
glue_client = boto3.client('glue')
ec2 = boto3.client('ec2')
The following snippet adds three functions to implement the connection order strategy, which defines how to reorder the given connections to establish their priority. This is just an example; you can build your own custom code to implement your own logic, as per your needs. The code first checks the IPs available on each connection subnet and separates the connections that have enough IPs available to run the job at full capacity from those that could still be used because they have at least two IPs available, which is the minimum a job needs to start. If the strategy is set to random, it randomizes the order within each of the connection groups previously described and adds any other connections. If the strategy is capacity, it orders them from most free IPs to fewest.
def get_available_ips_from_connection(glue_connection_name):
    conn_response = glue_client.get_connection(Name=glue_connection_name)
    connection_properties = conn_response['Connection']['PhysicalConnectionRequirements']
    subnet_id = connection_properties['SubnetId']
    subnet_response = ec2.describe_subnets(SubnetIds=[subnet_id])
    return subnet_response['Subnets'][0]['AvailableIpAddressCount']

def get_connections_free_ips(glue_connection_names, num_workers):
    good_connections = []
    usable_connections = []
    for connection_name in glue_connection_names:
        try:
            available_ips = get_available_ips_from_connection(connection_name)
            # Priority to connections that can hold the full cluster and we haven't just tried
            if available_ips >= num_workers:
                good_connections.append((connection_name, available_ips))
            elif available_ips >= 2:  # The bare minimum to start a Glue job
                usable_connections.append((connection_name, available_ips))
        except Exception as e:
            print(f"[WARNING] Failed to check the free ips for: {connection_name}, will skip. Exception: {e}")
    return good_connections, usable_connections

def prioritize_connections(connection_list, num_workers, strategy):
    (good_connections, usable_connections) = get_connections_free_ips(connection_list, num_workers)
    print(f"Good connections: {good_connections}")
    print(f"Usable connections: {usable_connections}")
    all_conn = []
    if strategy == "random":
        shuffle(good_connections)
        shuffle(usable_connections)
        # Good connections have priority
        all_conn = good_connections + usable_connections
    elif strategy == "capacity":
        # We can sort both at the same time
        all_conn = good_connections + usable_connections
        all_conn.sort(key=lambda x: -x[1])
    else:
        raise ValueError(f"Unknown strategy specified: {strategy}")
    result = [c[0] for c in all_conn]  # Just need the name
    # Keep at the end any other connections that could not be checked for ips
    result += [c for c in connection_list if c not in result]
    return result
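To see the capacity strategy in isolation, here is a sketch with the AWS API calls stubbed out — the ordering logic mirrors prioritize_connections above, but operates on precomputed (connection name, free IP count) pairs instead of calling AWS Glue and Amazon EC2:

```python
def order_by_capacity(ip_counts, num_workers):
    """Order connection names by free IPs, full-capacity subnets first.
    ip_counts: dict of connection name -> available IP count (stubbed here)."""
    good = [(n, ips) for n, ips in ip_counts.items() if ips >= num_workers]
    usable = [(n, ips) for n, ips in ip_counts.items() if 2 <= ips < num_workers]
    # Subnets with too few IPs to hold the whole cluster still sort by capacity,
    # but always after the ones that can hold it in full
    all_conn = sorted(good, key=lambda x: -x[1]) + sorted(usable, key=lambda x: -x[1])
    return [name for name, _ in all_conn]

# Stubbed free-IP counts instead of calling ec2.describe_subnets
counts = {"MWAA-Glue-Blog-Subnet1": 3, "MWAA-Glue-Blog-Subnet2": 200}
print(order_by_capacity(counts, num_workers=10))
# Subnet2 can hold the full 10-node cluster, so it comes first
```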
The following code creates the DAG itself with the run job task, which updates the job with the connection order defined by the strategy, runs it, and waits for the results. The job name, connections, and strategy come from Airflow variables, so they can be easily configured and updated. It has two retries with exponential backoff configured, so if the task fails, it will repeat the full task, including the connection selection. By then, the best choice might be a different connection, or the subnet previously picked at random might be in an Availability Zone that is currently suffering an outage, so picking a different one allows the run to recover.
with DAG(
    dag_id='glue_job_dag',
    schedule_interval=None,  # Run on demand only
    start_date=datetime(2000, 1, 1),  # A start date is required
    max_active_runs=1,
    catchup=False
) as glue_dag:

    @task(
        task_id="glue_task",
        retries=2,
        retry_delay=duration(seconds=30),
        retry_exponential_backoff=True
    )
    def run_job_task(**ctx):
        glue_connections = Variable.get("glue_job_dag.glue_connections").strip().split(',')
        glue_jobname = Variable.get("glue_job_dag.glue_job_name").strip()
        strategy = Variable.get('glue_job_dag.strategy', 'random')  # random or capacity
        print(f"Connections available: {glue_connections}")
        print(f"Glue job name: {glue_jobname}")
        print(f"Strategy to use: {strategy}")
        job_props = glue_client.get_job(JobName=glue_jobname)['Job']
        num_workers = job_props['NumberOfWorkers']
        glue_connections = prioritize_connections(glue_connections, num_workers, strategy)
        print(f"Running Glue job with the connection order: {glue_connections}")
        existing_connections = job_props.get('Connections', {}).get('Connections', [])
        # Preserve other connections that we don't manage
        other_connections = [con for con in existing_connections if con not in glue_connections]
        job_props['Connections'] = {"Connections": glue_connections + other_connections}
        # Clean up properties so we can reuse the dict for the update request
        for prop_name in ['Name', 'CreatedOn', 'LastModifiedOn', 'AllocatedCapacity', 'MaxCapacity']:
            del job_props[prop_name]

        GlueJobOperator(
            task_id='submit_job',
            job_name=glue_jobname,
            iam_role_name=job_props['Role'].split('/')[-1],
            update_config=True,
            create_job_kwargs=job_props,
            wait_for_completion=True
        ).execute(ctx)

    run_job_task()
Create the Airflow workflow
Now you create a workflow that invokes the AWS Glue job you just created:
On the Amazon S3 console, locate the bucket created by the CloudFormation template, which will have a name starting with the name of the stack and then -environmentbucket- (for example, myairflowstack-environmentbucket-ap1qks3nvvr4).
Inside that bucket, create a folder called dags, and inside that folder, upload the DAG file glue_job_dag.py that you created in the previous section.
On the Amazon MWAA console, navigate to the environment you deployed with the CloudFormation stack.
If the status is not yet Available, wait until it reaches that state. It shouldn’t take longer than 30 minutes since you deployed the CloudFormation stack.
Choose the environment link on the table to see the environment details.
It’s configured to pick up DAGs from the bucket and folder you used in the previous steps. Airflow will monitor that folder for changes.
Choose Open Airflow UI to open a new tab accessing the Airflow UI, using the integrated IAM security to log you in.
If there’s any issue with the DAG file you created, it will display an error on top of the page indicating the lines affected. In that case, review the steps and upload again. After a few seconds, it will parse it and update or remove the error banner.
On the Admin menu, choose Variables.
Add three variables with the following keys and values:
Key glue_job_dag.glue_connections with value MWAA-Glue-Blog-Subnet1,MWAA-Glue-Blog-Subnet2.
Key glue_job_dag.glue_job_name with value AirflowBlogJob.
Key glue_job_dag.strategy with value capacity.
Run the job with a dynamic subnet assignment
Now you’re ready to run the workflow and see the strategy dynamically reordering the connections.
On the Airflow UI, choose DAGs, and on the row glue_job_dag, choose the play icon.
On the Browse menu, choose Task instances.
On the instances table, scroll right to display the Log Url and choose the icon on it to open the log.
The log will update as the task runs; you can locate the line starting with “Running Glue job with the connection order:” and the previous lines showing details of the connection IPs and the category assigned. If an error occurs, you’ll see the details in this log.
On the AWS Glue console, choose ETL jobs in the navigation pane, then choose the job AirflowBlogJob.
On the Runs tab, choose the run instance, then the Output logs link, which will open a new tab.
On the new tab, use the log stream link to open it.
It will display the IP that the driver was assigned and which subnet it belongs to, which should match the connection indicated by Airflow (if the log is not displayed, choose Resume so it gets updated as soon as it’s available).
On the Airflow UI, edit the Airflow variable glue_job_dag.strategy to set it to random.
Run the DAG multiple times and see how the ordering changes.
Clean up
If you no longer need the deployment, delete the resources to avoid any further charges:
Delete the Python script you uploaded, so the S3 bucket can be automatically deleted in the next step.
Delete the CloudFormation stack.
Delete the AWS Glue job.
Delete the script that the job saved in Amazon S3.
Delete the connections you created as part of this post.
Conclusion
In this post, we showed how AWS Glue and Amazon MWAA can work together to build more advanced custom workflows, while minimizing the operational and management overhead. This solution gives you more control about how your AWS Glue job runs to meet special operational, network, or security requirements.
You can deploy your own Amazon MWAA environment in multiple ways, such as with the template used in this post, on the Amazon MWAA console, or using the AWS CLI. You can also implement your own strategies to orchestrate AWS Glue jobs, based on your network architecture and requirements (for instance, to run the job closer to the data when possible).
About the authors
Michael Greenshtein is an Analytics Specialist Solutions Architect for the Public Sector.
Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.