AWS Glue interactive sessions offer a powerful way to iteratively explore datasets and fine-tune transformations using Jupyter-compatible notebooks. Interactive sessions enable you to work with a choice of popular integrated development environments (IDEs) in your local environment or with AWS Glue or Amazon SageMaker Studio notebooks on the AWS Management Console, all while seamlessly harnessing the power of a scalable, on-demand Apache Spark backend. This post is part of a series exploring the features of AWS Glue interactive sessions.
AWS Glue interactive sessions now include native support for the matplotlib visualization library (AWS Glue version 3.0 and later). In this post, we look at how we can use matplotlib and Seaborn to explore and visualize data using AWS Glue interactive sessions, facilitating rapid insights without complex infrastructure setup.
Solution overview
You can quickly provision new interactive sessions directly from your notebook without needing to interact with the AWS Command Line Interface (AWS CLI) or the console. You can use magic commands to provide configuration options for your session and install any additional Python modules that are needed.
In this post, we use the classic Iris and MNIST datasets to navigate through a few commonly used visualization techniques using matplotlib on AWS Glue interactive sessions.
Create visualizations using AWS Glue interactive sessions
We start by installing the scikit-learn and Seaborn libraries using the %additional_python_modules Jupyter magic command:
%additional_python_modules scikit-learn, seaborn
You can also upload Python wheel modules to Amazon Simple Storage Service (Amazon S3) and specify the full path as a parameter value to the additional_python_modules magic command.
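For example, if you packaged a custom module as a wheel and uploaded it to Amazon S3, the magic could reference it like the following (the bucket and wheel names here are hypothetical):
%additional_python_modules s3://my-example-bucket/wheels/my_module-1.0-py3-none-any.whl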
Now, let’s run a few visualizations on the Iris and MNIST datasets.
Create a pair plot using Seaborn to uncover patterns within sepal and petal measurements across the iris species:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = sns.load_dataset("iris")
# Create a pair plot
sns.pairplot(iris, hue="species")
%matplot plt
Create a violin plot to reveal the distribution of the sepal width measure across the three species of iris flowers:
# Create a violin plot of the Sepal Width measure
plt.figure(figsize=(10, 6))
sns.violinplot(x="species", y="sepal_width", data=iris)
plt.title("Violin Plot of Sepal Width by Species")
plt.show()
%matplot plt
Create a heat map to display correlations across the iris dataset variables:
# Calculate the correlation matrix of the numeric columns (drop the non-numeric species column)
correlation_matrix = iris.drop(columns="species").corr()
# Create a heatmap using Seaborn
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
%matplot plt
Create a scatter plot on the MNIST dataset using PCA to visualize distributions among the handwritten digits:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
# Load the MNIST dataset
mnist = fetch_openml('mnist_784', version=1)
X, y = mnist['data'], mnist['target']
# Apply PCA to reduce dimensions to 2 for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Scatter plot of the reduced data
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y.astype(int), cmap='viridis', s=5)
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA - MNIST Dataset")
plt.colorbar(label="Digit Class")
%matplot plt
Create another visualization using matplotlib and the mplot3d toolkit:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Generate mock data
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
x, y = np.meshgrid(x, y)
z = np.sin(np.sqrt(x**2 + y**2))
# Create a 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
# Plot the surface
surface = ax.plot_surface(x, y, z, cmap='viridis')
# Add color bar to map values to colors
fig.colorbar(surface, ax=ax, shrink=0.5, aspect=10)
# Set labels and title
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('3D Surface Plot Example')
%matplot plt
As illustrated by the preceding examples, you can use any compatible visualization library by installing the required modules and then using the %matplot magic command.
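For instance, pandas’ built-in plotting also renders through matplotlib, so it works with the same pattern; the following short cell, which reuses the Iris DataFrame loaded earlier, is one possible variation:
# pandas plotting renders through matplotlib, so %matplot works here too
iris.groupby("species")["petal_length"].mean().plot(kind="bar", figsize=(8, 5))
plt.title("Average Petal Length by Species")
%matplot plt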
Conclusion
In this post, we discussed how extract, transform, and load (ETL) developers and data scientists can efficiently visualize patterns in their data using familiar libraries through AWS Glue interactive sessions. With this functionality, you’re empowered to focus on extracting valuable insights from your data, while AWS Glue handles the infrastructure heavy lifting using a serverless compute model. To get started today, refer to Developing AWS Glue jobs with Notebooks and Interactive sessions.
About the authors
Annie Nelson is a Senior Solutions Architect at AWS. She is a data enthusiast who enjoys problem solving and tackling complex architectural challenges with customers.
Keerthi Chadalavada is a Senior Software Development Engineer at AWS Glue. She is passionate about designing and building end-to-end solutions to address customer data integration and analytic needs.
Zach Mitchell is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop their enterprise data architecture on AWS.
Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering and BI. She is passionate about developing a deep understanding of customer’s business needs and collaborating with engineers to design easy to use data products.
AWS Glue interactive sessions allow you to run interactive AWS Glue workloads on demand, which enables rapid development by issuing blocks of code on a cluster and getting prompt results. This technology is enabled by the use of notebook IDEs, such as the AWS Glue Studio notebook, Amazon SageMaker Studio, or your own Jupyter notebooks.
In this post, we discuss the following new management features recently added and how they can give you more control over the configurations and security of your AWS Glue interactive sessions:
Tags magic – You can use this new cell magic to tag the session for administration or billing purposes. For example, you can tag each session with the name of the billable department and later run a search to find all spending associated with this department on the AWS Billing console.
Assume role magic – Now you can create a session in an account different than the one you’re connected with by assuming an AWS Identity and Access Management (IAM) role owned by the other account. You can designate a dedicated role with permissions to create sessions and have other users assume it when they use sessions.
IAM VPC rules – You can require your users to use (or restrict them from using) certain VPCs or subnets for the sessions, to comply with your corporate policies and have control over how your data travels in the network. This feature existed for AWS Glue jobs and is now available for interactive sessions.
Solution overview
For our use case, we’re building a highly secured app and want to have users (developers, analysts, data scientists) running AWS Glue interactive sessions on specific VPCs to control how the data travels through the network.
In addition, users are not allowed to log in directly to the production account, which has the data and the connections they need; instead, users will run their own notebooks via their individual accounts and get permission to assume a specific role enabled on the production account to run their sessions. Users can run AWS Glue interactive sessions by using both AWS Glue Studio notebooks via the AWS Glue console, as well as Jupyter notebooks that run on their local machine.
Lastly, all new resources must be tagged with the name of the department for proper billing allocation and cost control.
The following architecture diagram highlights the different roles and accounts involved:
Account A – The individual user account. The user ISBlogUser has permissions to create AWS Glue notebook servers via the AWSGlueServiceRole-notebooks role and assume a role in account B (directly or indirectly).
Account B – The production account that owns the GlueSessionsCreationRole role, which users assume to create AWS Glue interactive sessions in this account.
Prerequisites
In this section, we walk through the steps to set up the prerequisite resources and security configurations.
Optionally, if you want to run a local notebook from your computer, install Python 3.7 or later and then install Jupyter and the AWS Glue interactive sessions kernels. For instructions, refer to Getting started with AWS Glue interactive sessions. You can then run Jupyter directly from the command line using jupyter notebook, or via an IDE like VSCode or PyCharm.
Get access to two AWS accounts
If you have access to two accounts, you can reproduce the use case described in this post. The instructions refer to account A as the user account that runs the notebook and account B as the account that runs the sessions (the production account in the use case). This post assumes you have enough administration permissions to create the different components and manage the account security roles.
If you have access to only one account, you can still follow this post and perform all the steps on that single account.
Create a VPC and subnet
We want to limit users to using AWS Glue interactive sessions only via a specific VPC network. First, let’s create a new VPC in account B using Amazon Virtual Private Cloud (Amazon VPC). We use this VPC connection later to enforce the network restrictions.
On the Amazon VPC console, choose Your VPCs in the navigation pane.
Choose Create VPC.
Enter 10.0.0.0/24 as the IP CIDR.
Leave the remaining parameters as default and create your VPC.
Make a note of the VPC ID (starting with vpc-) to use later.
For more information about creating VPCs, refer to Create a VPC.
In the navigation pane, choose Subnets.
Choose Create subnet.
Select the VPC you created, enter the same CIDR (10.0.0.0/24), and create your subnet.
In the navigation pane, choose Endpoints.
Choose Create endpoint.
For Service category, select AWS services.
Search for the option that ends in s3, such as com.amazonaws.{region}.s3.
In the search results, select the Gateway type option.
Choose your VPC on the drop-down menu.
For Route tables, select the subnet you created.
Complete the endpoint creation.
Create an AWS Glue network connection
You now need to create an AWS Glue connection that uses the VPC, so sessions created with it can meet the VPC requirement.
Sign in to the console with account B.
On the AWS Glue console, choose Data connections in the navigation pane.
Choose Create connection.
For Name, enter session_vpc.
For Connection type, choose Network.
In the Network options section, choose the VPC you created, a subnet, and a security group.
Choose Create connection.
Account A security setup
Account A is the development account for your users (developers, analysts, data scientists, and so on). They are provided IAM users to access this account programmatically or via the console.
Create the assume role policy
The assume role policy allows users and roles in account A to assume roles in account B (the role in account B also has to allow it). Complete the following steps to create the policy:
On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Switch to the JSON tab in the policy editor and enter the following policy (provide the account B number):
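A minimal sketch of this policy, which grants sts:AssumeRole on the session creation role in account B (the role name matches the one you create later in this post), might look like the following:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Resource": "arn:aws:iam::{account B}:role/GlueSessionCreationRole"
        }
    ]
}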
Name the policy AssumeRoleAccountBPolicy and complete the creation.
Create an IAM user
Now you create an IAM user for account A that you can use to run AWS Glue interactive sessions locally or on the console.
On the IAM console, choose Users in the navigation pane.
Choose Create user.
Name the user ISBlogUser.
Select Provide user access to the AWS Management Console.
Select I want to create an IAM user and choose a password.
Attach the policies AWSGlueConsoleFullAccess and AssumeRoleAccountBPolicy.
Review the settings and complete the user creation.
Create an AWS Glue Studio notebook role
To start an AWS Glue Studio notebook, a role is required. Usually, the same role is used both to start a notebook and run a session. In this use case, users of account A only need permissions to run a notebook, because they will create sessions via the assumed role in account B.
On the IAM console, choose Roles in the navigation pane.
Choose Create role.
Select Glue as the use case.
Attach the policies AWSGlueServiceNotebookRole and AssumeRoleAccountBPolicy.
Name the role AWSGlueServiceRole-notebooks (because the name starts with AWSGlueServiceRole, the user doesn’t need explicit PassRole permission), then complete the creation.
Optionally, you can allow Amazon CodeWhisperer to provide code suggestions on the notebook by adding the permission to the role. To do so, navigate to the role AWSGlueServiceRole-notebooks on the IAM console. On the Add permissions menu, choose Create inline policy. Use the following JSON policy and name it CodeWhispererPolicy:
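A minimal sketch of such a policy, assuming the codewhisperer:GenerateRecommendations action typically used to allow code suggestions, might look like the following:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "codewhisperer:GenerateRecommendations",
            "Resource": "*"
        }
    ]
}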
Account B security setup
Account B is considered the production account that contains the data and connections, and runs the AWS Glue data integration pipelines (using either AWS Glue sessions or jobs). Users don’t have direct access to it; they use it by assuming the role created for this purpose.
To follow this post, you need two roles in this account: one that the AWS Glue service assumes to run the sessions, and another that creates sessions while enforcing the VPC restriction.
Create an AWS Glue service role
To create an AWS Glue service role, complete the following steps:
On the IAM console, choose Roles in the navigation pane.
Choose Create role.
Choose Glue for the use case.
Attach the policy AWSGlueServiceRole.
Name the role AWSGlueServiceRole-blog and complete the creation.
Create an AWS Glue interactive session role
This role will be used to create sessions following the VPC requirements. Complete the following steps to create the role:
On the IAM console, choose Policies in the navigation pane.
Choose Create policy.
Switch to the JSON tab in the policy editor and enter the following code (provide your VPC ID). You can also replace the * in the policy with the full ARN of the role AWSGlueServiceRole-blog you just created, to force the notebook to only use that role when creating sessions.
This policy complements the AWSGlueServiceRole policy you attached before and restricts session creation based on the VPC. You could also restrict the subnet and security group in a similar way, using the glue:SubnetIds and glue:SecurityGroupIds condition keys, respectively.
In this case, session creation requires a VPC whose ID is in the allowed list. If you only need to require that some valid VPC be used, you can remove the first statement and keep the one that denies creation when the VPC is null.
Name the policy CustomCreateSessionPolicy and complete the creation.
Choose Roles in the navigation pane.
Choose Create role.
Select Custom trust policy.
Replace the trust policy template with the following code (provide your account A number):
This allows the role to be assumed directly by the user when using a local notebook and also when using an AWS Glue Studio notebook with a role.
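A minimal sketch of such a trust policy, assuming the principals are the IAM user and the notebook role created earlier in account A, might look like the following:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::{account A}:user/ISBlogUser",
                    "arn:aws:iam::{account A}:role/AWSGlueServiceRole-notebooks"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}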
Attach the policies AWSGlueServiceRole and CustomCreateSessionPolicy (which you created in the previous step, so you might need to refresh for it to be listed).
Name the role GlueSessionCreationRole and complete the role creation.
Create the AWS Glue interactive session in the VPC with the assumed role and tags
Now that you have the accounts, roles, VPC, and connection ready, you use them to meet the requirements. You start a new notebook using account A, assume the role in account B to create a session in the VPC, and tag the session with the department and billing area.
Start a new notebook
Using account A, start a new notebook. You may use either of the following options.
Option 1: Create an AWS Glue Studio notebook
The first option is to create an AWS Glue Studio notebook:
Sign in to the console with account A and the ISBlogUser user.
On the AWS Glue console, choose Notebooks in the navigation pane under ETL jobs.
Select Jupyter Notebook and choose Create.
Enter a name for your notebook.
Specify the role AWSGlueServiceRole-notebooks.
Choose Start notebook.
Option 2: Create a local notebook
Alternatively, you can create a local notebook. Before you start the process that runs Jupyter (or, if you run it indirectly, the IDE that runs it), you need to set the credentials for the user ISBlogUser, either by using aws configure on the command line or by setting the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables to the user’s access key ID and secret access key, respectively. Then create a new Jupyter notebook and select the kernel Glue PySpark.
Start a session from the notebook
After you start the notebook, select the first cell and add four new empty code cells. If you are using an AWS Glue Studio notebook, the notebook already contains some prepopulated cells as examples; we don’t use those sample cells in this post.
In the first cell, enter the following magic configuration with the session creation role ARN, using the ID of account B:
# Configure the role we assume for creating the sessions
# Tip: assume_role is a cell magic (meaning it needs its own cell)
%%assume_role
"arn:aws:iam::{account B}:role/GlueSessionCreationRole"
Run the cell to set up that configuration, either by choosing the button on the toolbar or pressing Shift + Enter.
It should confirm that the role was assumed correctly. Now when the session is launched, it’s created by this role. This allows you to use a role from a different account to run a session in that account.
In the second cell, enter sample tags like the following and run the cell in the same way:
# Set a tag to associate the session with billable department
# Tip: tags is a cell magic (meaning it needs its own cell)
%%tags
{'team':'analytics', 'billing':'Data-Platform'}
In the third cell, enter the following sample configuration (provide the role ARN with account B) and run the cell to set up the configuration:
# Set the configuration of your sessions using magics
# Tip: non-cell magics can share the same cell
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%iam_role arn:aws:iam::{account B}:role/AWSGlueServiceRole-blog
Now the session is configured but hasn’t started yet because you didn’t run any Python code.
In the fourth empty cell, enter the following code to set up the objects required to work with AWS Glue and run the cell:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
It should fail with a permission error saying that there is an explicit deny policy activated. This is the VPC condition you set before: by default, the session doesn’t use a VPC, which is why it fails.
You can solve the error by assigning the connection you created before, so the session runs inside the authorized VPC.
In the third cell, add the %connections magic with the value session_vpc.
The session needs to run in the same Region in which the connection is defined. If that’s not the same as the notebook Region, you can explicitly configure the session Region using the %region magic.
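After you add the %connections magic (and the %region magic if you need it), the configuration cell would look something like the following, using the sample values from earlier in this post:
# Set the configuration of your sessions using magics
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%iam_role arn:aws:iam::{account B}:role/AWSGlueServiceRole-blog
%connections session_vpc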
After you have added the new config settings, run the cell again so the magics take effect.
Run the fourth cell again (the one with the code).
This time, it should start the session and after a brief period confirm it has been created correctly.
Add a new cell with the following content and run it: %status
This will display the configuration and other information about the session that the notebook is using, including the tags set before.
You started a notebook in account A and used a role from account B to create a session, which uses the network connection so it runs in the required VPC. You also tagged the session to be able to easily identify it later.
In the next section, we discuss more ways to monitor sessions using tags.
Interactive session tags
Before tags were supported, if you wanted to identify the purpose of sessions running in the account, you had to use the %session_id_prefix magic to name your session with something meaningful.
Now, with the new tags magic, you can use more sophisticated ways to categorize your sessions.
In the previous section, you tagged the session with a team and billing department. Let’s imagine now you are an administrator checking the sessions that different teams run in an account and Region.
Explore tags via the AWS CLI
On the command line where you have the AWS CLI installed, run the following command to list the sessions running in the account and Region configured (use the region and max results parameters if needed):
aws glue list-sessions
You also have the option to just list sessions that have a specific tag:
aws glue list-sessions --tags team=analytics
You can also list all the tags associated with a specific session with the following command. Provide the Region, account, and session ID (you can get it from the list-sessions command):
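Assuming the standard AWS Glue session ARN format, the command looks like the following:
aws glue get-tags --resource-arn arn:aws:glue:{region}:{account}:session/{session-id}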
You can also use tags to keep track of cost and do more accurate cost assignment in your company. After you have used a tag in your session, the tag will become available for billing purposes (it can take up to 24 hours to be detected).
On the AWS Billing console, choose Cost allocation tags under Billing in the navigation pane.
Search for and select the tags you used in the session: “team” and “billing”.
Choose Activate.
This activation can take up to 24 additional hours before the tag is applied for billing purposes. You only have to do this one time when you start using a new tag on an account.
After the tags have been correctly activated and applied, choose Cost explorer under Cost Management in the navigation pane.
In the Report parameters pane, for Tag, choose one of the tags you activated.
This adds a drop-down menu for this tag, where you can choose some or all of the tag values to use.
Make your selection and choose Apply to use the filter on the report.
Clean up
Run the %stop_session magic in a cell to stop the session and avoid further charges. If you no longer need the notebook, VPC, or roles you created, you can delete them as well.
Conclusion
In this post, we showed how to use these new features in AWS Glue to have more control over your interactive sessions for management and security. You can enforce network restrictions, allow users from other accounts to use your session, and use tags to help you keep track of the session usage and cost reports. These new features are already available, so you can start using them now.
About the authors
Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.
Gal Heyne is a Technical Product Manager on the AWS Glue team.
Streaming data has become an indispensable resource for organizations worldwide because it offers real-time insights that are crucial for data analytics. The escalating velocity and magnitude of collected data has created a demand for real-time analytics. This data originates from diverse sources, including social media, sensors, logs, and clickstreams, among others. With streaming data, organizations gain a competitive edge by promptly responding to real-time events and making well-informed decisions.
In streaming applications, a prevalent approach involves ingesting data through Apache Kafka and processing it with Apache Spark Structured Streaming. However, managing, integrating, and authenticating the processing framework (Apache Spark Structured Streaming) with the ingesting framework (Kafka) poses significant challenges, necessitating a managed and serverless framework. For example, integrating and authenticating a client like Spark Structured Streaming with Kafka brokers and ZooKeeper nodes using a manual TLS method requires certificate and keystore management, which is not an easy task and requires good knowledge of TLS setup.
To address these issues effectively, we propose using Amazon Managed Streaming for Apache Kafka (Amazon MSK), a fully managed Apache Kafka service that offers a seamless way to ingest and process streaming data. In this post, we use Amazon MSK Serverless, a cluster type for Amazon MSK that makes it possible for you to run Apache Kafka without having to manage and scale cluster capacity. To further enhance security and streamline authentication and authorization processes, MSK Serverless enables you to handle both authentication and authorization using AWS Identity and Access Management (IAM) in your cluster. This integration eliminates the need for separate mechanisms for authentication and authorization, simplifying and strengthening data protection. For example, when a client tries to write to your cluster, MSK Serverless uses IAM to check whether that client is an authenticated identity and also whether it is authorized to produce to your cluster.
To process data effectively, we use AWS Glue, a serverless data integration service that uses the Spark Structured Streaming framework and enables near-real-time data processing. An AWS Glue streaming job can handle large volumes of incoming data from MSK Serverless with IAM authentication. This powerful combination ensures that data is processed securely and swiftly.
This post demonstrates how to build an end-to-end implementation that processes data from MSK Serverless using an AWS Glue streaming extract, transform, and load (ETL) job, connects to MSK Serverless from the AWS Glue job with IAM authentication, and queries the data using Amazon Athena.
Solution overview
The following diagram illustrates the architecture that you implement in this post.
The workflow consists of the following steps:
Create an MSK Serverless cluster with IAM authentication and an EC2 Kafka client as the producer to ingest sample data into a Kafka topic. For this post, we use the kafka-console-producer.sh Kafka console producer client.
Set up an AWS Glue streaming ETL job to process the incoming data. This job extracts data from the Kafka topic, loads it into Amazon Simple Storage Service (Amazon S3), and creates a table in the AWS Glue Data Catalog. By continuously consuming data from the Kafka topic, the ETL job ensures it remains synchronized with the latest streaming data. Moreover, the job incorporates the checkpointing functionality, which tracks the processed records, enabling it to resume processing seamlessly from the point of interruption in the event of a job run failure.
Following the data processing, the streaming job stores data in Amazon S3 and generates a Data Catalog table. This table acts as a metadata layer for the data. To interact with the data stored in Amazon S3, you can use Athena, a serverless and interactive query service. Athena lets you run SQL queries on the data, facilitating seamless exploration and analysis.
For this post, we create the solution resources in the us-east-1 Region using AWS CloudFormation templates. In the following sections, we show you how to configure your resources and implement the solution.
Configure resources with AWS CloudFormation
In this post, you use the following two CloudFormation templates. The advantage of using two templates is that you can decouple the creation of the ingestion and processing resources according to your use case, for example if you only need to create the processing resources:
vpc-mskserverless-client.yaml – This template sets up the data ingestion service resources, such as a VPC, MSK Serverless cluster, and S3 bucket
gluejob-setup.yaml – This template sets up the data processing resources such as the AWS Glue table, database, connection, and streaming job
Create data ingestion resources
The vpc-mskserverless-client.yaml stack creates a VPC, private and public subnets, security groups, S3 VPC Endpoint, MSK Serverless cluster, EC2 instance with Kafka client, and S3 bucket. To create the solution resources for data ingestion, complete the following steps:
Launch the stack vpc-mskserverless-client using the CloudFormation template:
Provide the following parameter values:
EnvironmentName – Environment name that is prefixed to resource names
PrivateSubnet1CIDR – IP range (CIDR notation) for the private subnet in the first Availability Zone
PrivateSubnet2CIDR – IP range (CIDR notation) for the private subnet in the second Availability Zone
PublicSubnet1CIDR – IP range (CIDR notation) for the public subnet in the first Availability Zone
PublicSubnet2CIDR – IP range (CIDR notation) for the public subnet in the second Availability Zone
On the Amazon EC2 console, select the instance, and on the Session Manager tab, choose Connect.
After you log in to the EC2 instance, you create a Kafka topic in the MSK Serverless cluster.
In the following export command, provide the MSKBootstrapServers value from the vpc-mskserverless-client stack output for your endpoint:
$ sudo su - ec2-user
$ BS=<your-msk-serverless-endpoint (e.g.) boot-xxxxxx.yy.kafka-serverless.us-east-1.a>
Run the following command on the EC2 instance to create a topic called msk-serverless-blog. The Kafka client is already installed in the ec2-user home directory (/home/ec2-user).
After you confirm the topic creation, you can push the data to the MSK Serverless cluster.
Run the following command on the EC2 instance to create a console producer that produces records to the Kafka topic. (For source data, we use nycflights.csv, downloaded to the ec2-user home directory /home/ec2-user.)
Next, you set up the data processing service resources, specifically AWS Glue components like the database, table, and streaming job to process the data.
Create data processing resources
The gluejob-setup.yaml CloudFormation template creates a database, table, AWS Glue connection, and AWS Glue streaming job. Retrieve the values for VpcId, GluePrivateSubnet, GlueconnectionSubnetAZ, SecurityGroup, S3BucketForOutput, and S3BucketForGlueScript from the vpc-mskserverless-client stack’s Outputs tab to use in this template. Complete the following steps:
Launch the stack gluejob-setup:
Provide the following parameter values:
EnvironmentName – Environment name that is prefixed to resource names (sample value: Gluejob-setup)
VpcId – ID of the VPC for the security group; use the VPC ID created with the first stack (refer to the first stack’s output)
GluePrivateSubnet – Private subnet used for creating the AWS Glue connection (refer to the first stack’s output)
SecurityGroupForGlueConnection – Security group used by the AWS Glue connection (refer to the first stack’s output)
GlueconnectionSubnetAZ – Availability Zone for the first private subnet used for the AWS Glue connection
GlueDataBaseName – Name of the AWS Glue Data Catalog database (sample value: glue_kafka_blog_db)
GlueTableName – Name of the AWS Glue Data Catalog table (sample value: blog_kafka_tbl)
S3BucketNameForScript – Bucket name for the AWS Glue ETL script; use the S3 bucket name from the previous stack, for example aws-gluescript-${AWS::AccountId}-${AWS::Region}-${EnvironmentName}
GlueWorkerType – Worker type for the AWS Glue job (sample value: G.1X)
NumberOfWorkers – Number of workers in the AWS Glue job (sample value: 3)
S3BucketNameForOutput – Bucket name for writing data from the AWS Glue job
The stack creation process can take around 1–2 minutes to complete. You can check the Outputs tab for the stack after the stack is created.
In the gluejob-setup stack, we created a Kafka type AWS Glue connection, which consists of broker information like the MSK bootstrap server, topic name, and VPC in which the MSK Serverless cluster is created. Most importantly, it specifies the IAM authentication option, which helps AWS Glue authenticate and authorize using IAM authentication while consuming the data from the MSK topic. For further clarity, you can examine the AWS Glue connection and the associated AWS Glue table generated through AWS CloudFormation.
After successfully creating the CloudFormation stack, you can now proceed with processing data using the AWS Glue streaming job with IAM authentication.
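The following simplified PySpark sketch illustrates the general shape of such a streaming job; the connection name, S3 paths, and format options here are assumptions, and the database and table names are the sample values used earlier, so they won’t match the template exactly:
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the stream through the Kafka-type AWS Glue connection; IAM authentication
# is handled by the connection, so no certificates or keystores are needed here
streaming_df = glueContext.create_data_frame.from_options(
    connection_type="kafka",
    connection_options={
        "connectionName": "msk-serverless-connection",  # hypothetical connection name
        "topicName": "msk-serverless-blog",
        "startingOffsets": "earliest",
        "classification": "csv",
        "inferSchema": "true",
    },
    transformation_ctx="kafka_source",
)

def process_batch(data_frame, batch_id):
    # Write each micro-batch to Amazon S3 and keep the Data Catalog table up to date
    if data_frame.count() > 0:
        dynamic_frame = DynamicFrame.fromDF(data_frame, glueContext, "from_kafka")
        sink = glueContext.getSink(
            connection_type="s3",
            path="s3://{output-bucket}/output/",  # hypothetical output location
            enableUpdateCatalog=True,
            updateBehavior="UPDATE_IN_DATABASE",
        )
        sink.setCatalogInfo(catalogDatabase="glue_kafka_blog_db", catalogTableName="blog_kafka_tbl")
        sink.setFormat("glueparquet")
        sink.writeFrame(dynamic_frame)

# Checkpointing lets the job resume from where it stopped after a failure
glueContext.forEachBatch(
    frame=streaming_df,
    batch_function=process_batch,
    options={
        "windowSize": "100 seconds",
        "checkpointLocation": "s3://{output-bucket}/checkpoint/",  # hypothetical checkpoint location
    },
)
job.commit()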
Run the AWS Glue streaming job
To process the data from the MSK topic using the AWS Glue streaming job that you set up in the previous section, complete the following steps:
On the CloudFormation console, choose the stack gluejob-setup.
On the Outputs tab, retrieve the name of the AWS Glue streaming job from the GlueJobName row. In the following screenshot, the name is GlueStreamingJob-glue-streaming-job.
On the AWS Glue console, choose ETL jobs in the navigation pane.
Search for the AWS Glue streaming job named GlueStreamingJob-glue-streaming-job.
Choose the job name to open its details page.
Choose Run to start the job.
On the Runs tab, confirm if the job ran without failure.
Retrieve the OutputBucketName from the gluejob-setup template outputs.
On the Amazon S3 console, navigate to the S3 bucket to verify the data.
On the AWS Glue console, choose the AWS Glue streaming job you ran, then choose Stop job run.
Because this is a streaming job, it will continue to run indefinitely until manually stopped. After you verify the data is present in the S3 output bucket, you can stop the job to save cost.
Validate the data in Athena
After the AWS Glue streaming job has successfully created the table for the processed data in the Data Catalog, follow these steps to validate the data using Athena:
On the Athena console, navigate to the query editor.
Choose the Data Catalog as the data source.
Choose the database and table that the AWS Glue streaming job created.
To validate the data, run the following query to find the flight number, origin, and destination that covered the highest distance in a year:
SELECT distinct(flight),distance,origin,dest,year from "glue_kafka_blog_db"."output" where distance= (select MAX(distance) from "glue_kafka_blog_db"."output")
The following screenshot shows the output of our example query.
Clean up
To clean up your resources, complete the following steps:
Delete the CloudFormation stack gluejob-setup.
Delete the CloudFormation stack vpc-mskserverless-client.
Conclusion
In this post, we demonstrated a use case for building a serverless ETL pipeline for streaming with IAM authentication, which allows you to focus on the outcomes of your analytics. You can also modify the AWS Glue streaming ETL code in this post with transformations and mappings to ensure that only valid data gets loaded to Amazon S3. This solution enables you to harness the prowess of AWS Glue streaming, seamlessly integrated with MSK Serverless through the IAM authentication method. It’s time to act and revolutionize your streaming processes.
Appendix
This section provides more information about how to create the AWS Glue connection on the AWS Glue console, which helps establish the connection to the MSK Serverless cluster and allow the AWS Glue streaming job to authenticate and authorize using IAM authentication while consuming the data from the MSK topic.
On the AWS Glue console, in the navigation pane, under Data catalog, choose Connections.
Choose Create connection.
For Connection name, enter a unique name for your connection.
For Connection type, choose Kafka.
For Connection access, select Amazon managed streaming for Apache Kafka (MSK).
For Kafka bootstrap server URLs, enter a comma-separated list of bootstrap server URLs. Include the port number. For example, boot-xxxxxxxx.c2.kafka-serverless.us-east-1.amazonaws.com:9098.
For Authentication, choose IAM Authentication.
Select Require SSL connection.
For VPC, choose the VPC that contains your data source.
For Subnet, choose the private subnet within your VPC.
For Security groups, choose a security group to allow access to the data store in your VPC subnet.
Security groups are associated with the elastic network interface (ENI) attached to your subnet. You must choose at least one security group with a self-referencing inbound rule for all TCP ports.
Choose Save changes.
After you create the AWS Glue connection, you can use the AWS Glue streaming job to consume data from the MSK topic using IAM authentication.
About the authors
Shubham Purwar is a Cloud Engineer (ETL) at AWS Bengaluru specialized in AWS Glue and Amazon Athena. He is passionate about helping customers solve issues related to their ETL workload and implement scalable data processing and analytics pipelines on AWS. In his free time, Shubham loves to spend time with his family and travel around the world.
Nitin Kumar is a Cloud Engineer (ETL) at AWS with a specialization in AWS Glue. He is dedicated to assisting customers in resolving issues related to their ETL workloads and creating scalable data processing and analytics pipelines on AWS.
Distributed denial of service (DDoS) events occur when a threat actor sends traffic floods from multiple sources to disrupt the availability of a targeted application. DDoS simulation testing uses a controlled DDoS event to allow the owner of an application to assess the application’s resilience and practice event response. DDoS simulation testing is permitted on Amazon Web Services (AWS), subject to Testing policy terms and conditions. In this blog post, we help you understand when it’s appropriate to perform a DDoS simulation test on an application running on AWS, and what options you have for running the test.
DDoS protection at AWS
Security is the top priority at AWS. AWS services include basic DDoS protection as a standard feature to help protect customers from the most common and frequently occurring infrastructure (layer 3 and 4) DDoS events, such as SYN/UDP floods, reflection attacks, and others. While this protection is designed to protect the availability of AWS infrastructure, your application might require more nuanced protections that consider your traffic patterns and integrate with your internal reporting and incident response processes. If you need more nuanced protection, then you should consider subscribing to AWS Shield Advanced in addition to the native resiliency offered by the AWS services you use.
AWS Shield Advanced is a managed service that helps you protect your application against external threats, like DDoS events, volumetric bots, and vulnerability exploitation attempts. When you subscribe to Shield Advanced and add protection to your resources, Shield Advanced provides expanded DDoS event protection for those resources. With advanced protections enabled on your resources, you get tailored detection based on the traffic patterns of your application, assistance with protecting against Layer 7 DDoS events, access to 24×7 specialized support from the Shield Response Team (SRT), access to centralized management of security policies through AWS Firewall Manager, and cost protections to help safeguard against scaling charges resulting from DDoS-related usage spikes. You can also configure AWS WAF (a web application firewall) to integrate with Shield Advanced to create custom layer 7 firewall rules and enable automatic application layer DDoS mitigation.
Acceptable DDoS simulation use cases on AWS
AWS is constantly learning and innovating by delivering new DDoS protection capabilities, which are explained in the DDoS Best Practices whitepaper. This whitepaper provides an overview of DDoS events and the choices that you can make when building on AWS to help you architect your application to absorb or mitigate volumetric events. If your application is architected according to our best practices, then a DDoS simulation test might not be necessary, because these architectures have been through rigorous internal AWS testing and verified as best practices for customers to use.
Using DDoS simulations to explore the limits of AWS infrastructure isn’t a good use case for these tests. Similarly, validating if AWS is effectively protecting its side of the shared responsibility model isn’t a good test motive. Further, using AWS resources as a source to simulate a DDoS attack on other AWS resources isn’t encouraged. Load tests are performed to gain reliable information on application performance under stress and these are different from DDoS tests. For more information, see the Amazon Elastic Compute Cloud (Amazon EC2) testing policy and penetration testing. Application owners, who have a security compliance requirement from a regulator or who want to test the effectiveness of their DDoS mitigation strategies, typically run DDoS simulation tests.
DDoS simulation tests at AWS
AWS offers two options for running DDoS simulation tests. They are:
A simulated DDoS attack in production traffic with an authorized pre-approved AWS Partner.
A synthetic simulated DDoS attack with the SRT, also referred to as a firedrill.
The motivation for DDoS testing varies from application to application and these engagements don’t offer the same value to all customers. Establishing clear motives for the test can help you choose the right option. If you want to test your incident response strategy, we recommend scheduling a firedrill with our SRT. If you want to test the Shield Advanced features or test application resiliency, we recommend that you work with an AWS approved partner.
DDoS simulation testing with an AWS Partner
AWS DDoS test partners are authorized to conduct DDoS simulation tests on customers’ behalf without prior approval from AWS. Customers can currently contact the following partners to set up these paid engagements:
Before contacting the partners, customers must agree to the terms and conditions for DDoS simulation tests. The application must be well-architected prior to DDoS simulation testing as described in AWS DDoS Best Practices whitepaper. AWS DDoS test partners that want to perform DDoS simulation tests that don’t comply with the technical restrictions set forth in our public DDoS testing policy, or other DDoS test vendors that aren’t approved, can request approval to perform DDoS simulation tests by submitting the DDoS Simulation Testing form at least 14 days before the proposed test date. For questions, please send an email to [email protected].
After choosing a test partner, customers go through various phases of testing. Typically, the first phase involves a discovery discussion, where the customer defines clear goals, assembles technical details, and defines the test schedule with the partner. In the next phase, partners run multiple simulations based on agreed attack vectors, duration, diversity of the attack vectors, and other factors. These tests are usually carried out by slowly ramping up traffic levels from low levels to desired high levels with an ability for an emergency stop. The final stage involves reporting, discussing observed gaps, identifying actionable tasks, and driving those tasks to completion.
These engagements are typically long-term, paid contracts that are planned over months and carried out over weeks, with results analyzed over time. These tests and reports are beneficial to customers who need to evaluate detection and mitigation capabilities on a large scale. If you’re an application owner and want to evaluate the DDoS resiliency of your application, practice event response with real traffic, or have a DDoS compliance or regulation requirement, we recommend this type of engagement. These tests aren’t recommended if you want to learn the volumetric breaking points of the AWS network or understand when AWS starts to throttle requests. AWS services are designed to scale, and when certain dynamic volume thresholds are exceeded, AWS detection systems will be invoked to block traffic. Lastly, it’s critical to distinguish between these tests and stress tests, in which meaningful packets are sent to the application to assess its behavior.
DDoS firedrill testing with the Shield Response Team
The Shield Advanced service offers additional assistance through the SRT; this team can also help with testing incident response workflows. Customers can contact the SRT and request firedrill testing. Firedrill testing is a type of synthetic test that doesn’t generate real volumetric traffic but does post a shield event to the requesting customer’s account.
These tests are available for customers who are already on-boarded to Shield Advanced and want to test their Amazon CloudWatch alarms by invoking a DDoSDetected metric, or test their proactive engagement setup or their custom incident response strategy. Because this event isn’t based on real traffic, the customer won’t see traffic generated on their account or see logs that drive helpful reports.
These tests are intended to generate associated Shield Advanced metrics and post a DDoS event for a customer resource. For example, SRT can post a 14 Gbps UDP mock attack on a protected resource for about 15 minutes and customers can test their response capability during such an event.
Note: Not all attack vectors and AWS resource types are supported for a firedrill. Shield Advanced onboarded customers can contact AWS Support teams to request assistance with running a firedrill or understand more about them.
Conclusion
DDoS simulations and incident response testing on AWS through the SRT or an AWS Partner are useful in improving application security controls, identifying Shield Advanced misconfigurations, optimizing existing detection systems, and improving incident readiness. The goal of these engagements is to help you build a DDoS resilient architecture to protect your application’s availability. However, these engagements don’t offer the same value to all customers. Most customers can obtain similar benefits by following AWS Best Practices for DDoS Resiliency. AWS recommends architecting your application according to DDoS best practices and fine tuning AWS Shield Advanced out-of-the-box offerings to your application needs to improve security posture.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Customers often need to architect solutions to support components across multiple cloud service providers, a need which may arise if they have acquired a company running on another cloud, or for functional purposes where specific services provide a differentiated capability. In this post, we will show you how to use the AWS Cloud Development Kit (AWS CDK) to create a single pane of glass for managing your multicloud resources.
AWS CDK is an open source framework that builds on the underlying functionality provided by AWS CloudFormation. It allows developers to define cloud resources using common programming languages and an abstraction model based on reusable components called constructs. There is a misconception that CloudFormation and CDK can only be used to provision resources on AWS, but this is not the case. The CloudFormation registry, with support for third party resource types, along with custom resource providers, allow for any resource that can be configured via an API to be created and managed, regardless of where it is located.
Multicloud solution design paradigm
Multicloud solutions are often designed with services grouped and separated by cloud, creating a segregation of resources and functions within the design. This approach leads to duplicated layers of the solution, most commonly duplicated resources and deployment processes for each environment. This duplication increases cost and management complexity, which in turn increases the potential break points within the solution or practice.
Along with the need to simplify resource deployments, the ever-increasing complexity of customer needs has driven demand for IaC solutions capable of deploying resources across hybrid or multicloud environments. In meeting this need, a proliferation of supported tools, frameworks, languages, and practices has created “choice overload”. At worst, this scares the non-cloud-savvy away from adopting an IaC solution that would benefit their cloud journey, and at best it confuses the very reason for adopting an IaC practice.
A single pane of glass
Systems Thinking is a holistic approach that focuses on the way a system’s constituent parts interrelate and how systems work as a whole especially over time and within the context of larger systems. Systems thinking is commonly accepted as the backbone of a successful systems engineering approach. Designing solutions taking a full systems view, based on the component’s function and interrelation within the system across environments, more closely aligns with the ability to handle the deployment of each cloud-specific resource, from a single control plane.
While AWS provides a list of services that can be used to help design, manage and operate hybrid and multicloud solutions, with AWS as the primary cloud you can go beyond just using services to support multicloud. CloudFormation registry resource types model and provision resources using custom logic, as a component of stacks in CloudFormation. Public extensions are not only provided by AWS, but third-party extensions are made available for general use by publishers other than AWS, meaning customers can create their own extensions and publish them for anyone to use.
The AWS CDK, which has a 1:1 mapping of all AWS CloudFormation resources, as well as a library of abstracted constructs, supports the ability to import custom AWS CloudFormation extensions, enabling customers and partners to create custom AWS CDK constructs for their extensions. The chosen programming language can be used to inherit and abstract the custom resource into reusable AWS CDK constructs, allowing developers to create solutions that contain native AWS extensions along with secondary hybrid or alternate cloud resources.
Providing the ability to integrate mixed resources in the same stack more closely aligns with the functional design and often diagrammatic depiction of the solution. In essence, we are creating a single IaC pane of glass over the entire solution, deployed through a single control plane. This lowers the complexity and the cost of maintaining separate modules and deployment pipelines across multiple cloud providers.
A common use case for a multicloud: disaster recovery
One of the most common use cases of the requirement for using components across different cloud providers is the need to maintain data sovereignty while designing disaster recovery (DR) into a solution.
Data sovereignty is the idea that data is subject to the laws of where it is physically located, and in some countries extends to regulations that if data is collected from citizens of a geographical area, then the data must reside in servers located in jurisdictions of that geographical area or in countries with a similar scope and rigor in their protection laws.
This requires organizations to remain in compliance with their host country’s data sovereignty regulations, and in cases such as state government agencies, with a stricter scope limited to within state boundaries. Unfortunately, not all countries, and especially not all states, have multiple AWS Regions to select from when designing where their primary and recovery data backups will reside. Therefore, the DR solution needs to take advantage of multiple cloud providers in the same geography, and such a solution must be designed to back up or replicate data across providers.
The multicloud solution
A multicloud solution to the proposed use case would be the backup of data from an AWS resource such as an Amazon S3 bucket to another cloud provider within the same geography, such as an Azure Blob Storage container, using AWS event-driven behavior to trigger the copying of data from the primary AWS resource to the secondary Azure backup resource.
Following the IaC single pane of glass approach, the Azure Blob Storage container is created as a resource type in the CloudFormation registry and imported into the AWS CDK to be used as a construct in the solution. However, before the extension resource type can be used effectively in the CDK as a reusable construct and added to your private library, you first need to go through the process of importing it into the CDK and creating constructs for it.
There are three different levels of constructs, beginning with low-level constructs, which are called CFN Resources (or L1, short for “layer 1”). These constructs directly represent all resources available in AWS CloudFormation. They are named CfnXyz, where Xyz is the name of the resource.
Layer 1 Construct
In this example, an L1 construct named CfnAzureBlobStorage represents an Azure::BlobStorage AWS CloudFormation extension. Here you also explicitly configure the ref property so that higher-level constructs can access the output value, which is the URL of the Azure blob container being provisioned.
As with every CDK Construct, the constructor arguments are scope, id and props. scope and id are propagated to the cdk.Construct base class. The props argument is of type CfnAzureBlobStorageProps which includes four properties all of type string. This is how the Azure credentials are propagated down from upstream constructs.
Layer 2 Construct
The next level of constructs, L2, also represent AWS resources, but with a higher-level, intent-based API. They provide similar functionality, but incorporate the defaults, boilerplate, and glue logic you’d be writing yourself with a CFN Resource construct. They also provide convenience methods that make it simpler to work with the resource.
In this example, an L2 construct is created to abstract the CfnAzureBlobStorage L1 construct and provides additional properties and methods.
The custom L2 construct class is declared as AzureBlobStorage, this time without the Cfn prefix to represent an L2 construct. This time the constructor arguments include the Azure credentials and client secret, and the ref from the L1 construct is output to the public variable AzureBlobContainerUrl.
As an L2 construct, the AzureBlobStorage construct could be used in CDK Apps along with AWS Resource Constructs in the same Stack, to be provisioned through AWS CloudFormation creating the IaC single pane of glass for a multicloud solution.
Layer 3 Construct
The true value of the CDK construct programming model is in the ability to extend L2 constructs, which represent a single resource, into a composition of multiple constructs that provide a solution for a common task. These are Layer 3 (L3) constructs, also known as patterns.
In this example, the L3 construct represents the solution architecture to backup objects uploaded to an Amazon S3 bucket into an Azure Blob Storage container in real-time, using AWS Lambda to process event notifications from Amazon S3.
import { RemovalPolicy, Duration, CfnOutput } from "aws-cdk-lib";
import { Bucket, BlockPublicAccess, EventType } from "aws-cdk-lib/aws-s3";
import { DockerImageFunction, DockerImageCode } from "aws-cdk-lib/aws-lambda";
import { PolicyStatement, Effect } from "aws-cdk-lib/aws-iam";
import { LambdaDestination } from "aws-cdk-lib/aws-s3-notifications";
import { IStringParameter, StringParameter } from "aws-cdk-lib/aws-ssm";
import { Secret, ISecret } from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";
import { AzureBlobStorage } from "./azure-blob-storage";
// L3 Construct
export class S3ToAzureBackupService extends Construct {
constructor(
scope: Construct,
id: string,
azureSubscriptionIdParamName: string,
azureClientIdParamName: string,
azureTenantIdParamName: string,
azureClientSecretName: string
) {
super(scope, id);
// Retrieve existing SSM Parameters
const azureSubscriptionIdParameter = this.getSSMParameter("AzureSubscriptionIdParam", azureSubscriptionIdParamName);
const azureClientIdParameter = this.getSSMParameter("AzureClientIdParam", azureClientIdParamName);
const azureTenantIdParameter = this.getSSMParameter("AzureTenantIdParam", azureTenantIdParamName);
// Retrieve existing Azure Client Secret
const azureClientSecret = this.getSecret("AzureClientSecret", azureClientSecretName);
// Create an S3 bucket
const sourceBucket = new Bucket(this, "SourceBucketForAzureBlob", {
removalPolicy: RemovalPolicy.RETAIN,
blockPublicAccess: BlockPublicAccess.BLOCK_ALL,
});
// Create a corresponding Azure Blob Storage account and a Blob Container
const azurebBlobStorage = new AzureBlobStorage(
this,
"MyCustomAzureBlobStorage",
azureSubscriptionIdParameter.stringValue,
azureClientIdParameter.stringValue,
azureTenantIdParameter.stringValue,
azureClientSecretName
);
// Create a Lambda function that receives notifications from the S3 bucket
// and copies the newly uploaded object to Azure Blob Storage
const copyObjectToAzureLambda = new DockerImageFunction(
this,
"CopyObjectsToAzureLambda",
{
timeout: Duration.seconds(60),
code: DockerImageCode.fromImageAsset("copy_s3_fn_code", {
buildArgs: {
"--platform": "linux/amd64"
}
}),
},
);
// Add an IAM policy statement to allow the Lambda function to access the
// S3 bucket
sourceBucket.grantRead(copyObjectToAzureLambda);
// Add an IAM policy statement to allow the Lambda function to get the contents
// of an S3 object
copyObjectToAzureLambda.addToRolePolicy(
new PolicyStatement({
effect: Effect.ALLOW,
actions: ["s3:GetObject"],
resources: [`arn:aws:s3:::${sourceBucket.bucketName}/*`],
})
);
// Set up an S3 bucket notification to trigger the Lambda function
// when an object is uploaded
sourceBucket.addEventNotification(
EventType.OBJECT_CREATED,
new LambdaDestination(copyObjectToAzureLambda)
);
// Grant the Lambda function read access to existing SSM Parameters
azureSubscriptionIdParameter.grantRead(copyObjectToAzureLambda);
azureClientIdParameter.grantRead(copyObjectToAzureLambda);
azureTenantIdParameter.grantRead(copyObjectToAzureLambda);
// Put the Azure Blob Container Url into SSM Parameter Store
this.createStringSSMParameter(
"AzureBlobContainerUrl",
"Azure blob container URL",
"/s3toazurebackupservice/azureblobcontainerurl",
azureBlobStorage.blobContainerUrl,
copyObjectToAzureLambda
);
// Grant the Lambda function read access to the secret
azureClientSecret.grantRead(copyObjectToAzureLambda);
// Output S3 bucket arn
new CfnOutput(this, "sourceBucketArn", {
value: sourceBucket.bucketArn,
exportName: "sourceBucketArn",
});
// Output the Blob Container URL
new CfnOutput(this, "azureBlobContainerUrl", {
value: azureBlobStorage.blobContainerUrl,
exportName: "azureBlobContainerUrl",
});
}
}
The custom L3 construct can be used in larger IaC solutions by instantiating the S3ToAzureBackupService class and providing the names of the SSM parameters that hold the Azure credentials, along with the client secret name, as constructor arguments.
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import { S3ToAzureBackupService } from "./s3-to-azure-backup-service";
export class MultiCloudBackupCdkStack extends cdk.Stack {
constructor(scope: Construct, id: string, props?: cdk.StackProps) {
super(scope, id, props);
const s3ToAzureBackupService = new S3ToAzureBackupService(
this,
"MyMultiCloudBackupService",
"/s3toazurebackupservice/azuresubscriptionid",
"/s3toazurebackupservice/azureclientid",
"/s3toazurebackupservice/azuretenantid",
"s3toazurebackupservice/azureclientsecret"
);
}
}
Solution Diagram
Diagram 1 (IaC Single Control Plane) demonstrates how the Azure Blob Storage extension is imported from the AWS CloudFormation Registry into AWS CDK as an L1 CfnResource, wrapped into an L2 construct, and used in an L3 pattern alongside AWS resources to back up objects from an Amazon S3 bucket into an Azure Blob Storage container.
Diagram 1: IaC Single Control Plane
The CDK application is then synthesized into one or more AWS CloudFormation Templates, which result in the CloudFormation service deploying AWS resource configurations to AWS and Azure resource configurations to Azure.
This solution demonstrates not only how to consolidate the management of secondary cloud resources into a unified infrastructure stack in AWS, but also how to improve productivity by eliminating the complexity and cost of operating multiple deployment mechanisms across public cloud environments.
The following video demonstrates an example in real-time of the end-state solution:
Next Steps
While this was a straightforward example, you can apply the same approach to more complex scenarios where AWS CDK serves as a single pane of glass for IaC to manage multicloud and hybrid solutions.
To get started with the solution discussed in this post, this workshop provides the step-by-step instructions you need to create the S3ToAzureBackupService.
Once you have learned how to create AWS CloudFormation extensions and develop them into AWS CDK constructs, you will see how, with just a few lines of code, you can build reusable, unified multicloud IaC solutions that deploy through a single AWS control plane.
Conclusion
By adopting AWS CloudFormation extensions and AWS CDK, deployed through a single AWS control plane, the cost and complexity of maintaining deployment pipelines across multiple cloud providers are reduced to a single, holistic, solution-focused pipeline. The techniques demonstrated in this post and the related workshop help you simplify the design of complex systems, improve the management of integrations, and more closely align your IaC and deployment management practices with the design.
Currently, MSK Serverless directly supports IAM authentication only with Java. This example shows how to use this mechanism. Additionally, it provides a pattern for creating a proxy that can easily be integrated into solutions built in languages other than Java.
The rising trend in today’s tech landscape is the use of streaming data and event-oriented structures. They are being applied in numerous ways, including monitoring website traffic, tracking industrial Internet of Things (IoT) devices, analyzing video game player behavior, and managing data for cutting-edge analytics systems.
Apache Kafka, a top-tier open-source tool, is making waves in this domain. It’s widely adopted by numerous users for building fast and efficient data pipelines, analyzing streaming data, merging data from different sources, and supporting essential applications.
Amazon’s serverless Apache Kafka offering, Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless, is attracting a lot of interest. It’s appreciated for its user-friendly approach, ability to scale automatically, and cost-saving benefits over other Kafka solutions. However, a hurdle encountered by many users is the requirement of MSK Serverless to use AWS Identity and Access Management (IAM) access control. At the time of writing, the Amazon MSK library for IAM is exclusive to Kafka libraries in Java, creating a challenge for users of other programming languages. In this post, we aim to address this issue and present how you can use Amazon API Gateway and AWS Lambda to navigate around this obstacle.
SASL/SCRAM authentication vs. IAM authentication
Compared to the traditional authentication methods like Salted Challenge Response Authentication Mechanism (SCRAM), the IAM extension into Apache Kafka through MSK Serverless provides a lot of benefits. Before we delve into those, it’s important to understand what SASL/SCRAM authentication is. Essentially, it’s a traditional method used to confirm a user’s identity before giving them access to a system. This process requires users or clients to provide a user name and password, which the system then cross-checks against stored credentials (for example, via AWS Secrets Manager) to decide whether or not access should be granted.
Compared to this approach, IAM simplifies permission management across AWS environments, enables the creation and strict enforcement of detailed permissions and policies, and uses temporary credentials rather than the typical user name and password authentication. Another benefit of using IAM is that you can use IAM for both authentication and authorization. If you use SASL/SCRAM, you have to additionally manage ACLs via a separate mechanism. In IAM, you can use the IAM policy attached to the IAM principal to define the fine-grained access control for that IAM principal. All of these improvements make the IAM integration a more efficient and secure solution for most use cases.
However, for applications not built in Java, utilizing MSK Serverless becomes tricky. The standard SASL/SCRAM authentication isn’t available, and non-Java Kafka libraries don’t have a way to use IAM access control. This calls for an alternative approach to connect to MSK Serverless clusters.
But there’s an alternative pattern. Without having to rewrite your existing application in Java, you can employ API Gateway and Lambda as a proxy in front of a cluster. They handle API requests and relay them to Kafka topics. API Gateway takes in producer requests and channels them to a Lambda function, written in Java using the Amazon MSK IAM library, which then communicates with the MSK Serverless Kafka topic using IAM access control. After the cluster receives the message, it can be further processed within the MSK Serverless setup.
You can also utilize Lambda on the consumer side of MSK Serverless topics, bypassing the Java requirement there. You can do this by setting Amazon MSK as an event source for a Lambda function. When the Lambda function is triggered, the data sent to the function includes an array of records from the Kafka topic—no need for direct contact with Amazon MSK.
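To illustrate the consumer side in a language other than Java, the following is a minimal sketch of a Python Lambda handler for an MSK event source; the handler name and logging are illustrative, and the event shape follows the documented Kafka event source mapping payload, where record values arrive base64-encoded.
import base64
import json

def handler(event, context):
    # The MSK event source mapping groups records by topic-partition.
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            # Record values are base64-encoded by the event source mapping.
            payload = base64.b64decode(record["value"]).decode("utf-8")
            print(json.dumps({
                "topicPartition": topic_partition,
                "offset": record.get("offset"),
                "message": payload,
            }))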
Solution overview
This example walks you through how to build a serverless real-time stream producer application using API Gateway and Lambda.
For testing, this post includes a sample AWS Cloud Development Kit (AWS CDK) application. This creates a demo environment that includes an MSK Serverless cluster, an API Gateway endpoint, and three Lambda functions, one of which consumes the messages from the Kafka topic.
The following diagram shows the architecture of the resulting application including its data flows.
The data flow contains the following steps:
The infrastructure is defined in an AWS CDK application. By running this application, a set of AWS CloudFormation templates is created.
AWS CloudFormation creates all infrastructure components, including a Lambda function that runs during the deployment process to create a topic in the MSK Serverless cluster and to retrieve the authentication endpoint needed for the producer Lambda function. On destruction of the CloudFormation stack, the same Lambda function gets triggered again to delete the topic from the cluster.
An external application calls an API Gateway endpoint.
API Gateway forwards the request to a Lambda function.
The Lambda function acts as a Kafka producer and pushes the message to a Kafka topic using IAM authentication.
The Lambda event source mapping mechanism triggers the Lambda consumer function and forwards the message to it.
Note that we don’t need to worry about Availability Zones. MSK Serverless automatically replicates the data across multiple Availability Zones to ensure high availability of the data.
The demo additionally shows how to use Lambda Powertools for Java to streamline logging and tracing and the IAM authenticator for the simple authentication process outlined in the introduction.
The following sections take you through the steps to deploy, test, and observe the example application.
Prerequisites
The example has the following prerequisites:
An AWS account. If you haven’t signed up, create and activate a new AWS account first.
Appropriate AWS credentials for interacting with resources in your AWS account.
Deploy the solution
Complete the following steps to deploy the solution:
Clone the project GitHub repository and change the directory to subfolder serverless-kafka-iac:
git clone https://github.com/aws-samples/apigateway-lambda-msk-serverless-integration
cd apigateway-lambda-msk-serverless-integration/serverless-kafka-iac
Run cdk synth to build the code and test the requirements (ensure the Docker daemon is running on your machine):
cdk synth
Run cdk deploy to deploy the code to your AWS account:
cdk deploy --all
Test the solution
To test the solution, we generate messages for the Kafka topics by sending calls through the API Gateway from our development machine or AWS Cloud9 environment. We then go to the CloudWatch console to observe incoming messages in the log files of the Lambda consumer function.
Open a terminal on your development machine to test the API with the Python script provided under /serverless_kafka_iac/test_api.py:
python3 test-api.py
On the Lambda console, open the Lambda function named ServerlessKafkaConsumer.
On the Monitor tab, choose View CloudWatch logs to access the logs of the Lambda function.
Choose the latest log stream to access the log files of the last run.
You can review the log entry of the received Kafka messages in the log of the Lambda function.
Trace a request
All components integrate with AWS X-Ray. With AWS X-Ray, you can trace the entire application, which is useful to identify bottlenecks when load testing. You can also trace method runs at the Java method level.
Lambda Powertools for Java allows you to shortcut this process by adding the @Tracing annotation to a method to see traces at the method level in X-Ray.
To trace a request end to end, complete the following steps:
On the CloudWatch console, choose Service map in the navigation pane.
Select a component to investigate (for example, the Lambda function where you deployed the Kafka producer).
Choose View traces.
Choose a single Lambda method invocation and investigate further at the Java method level.
Implement a Kafka producer in Lambda
Kafka natively supports Java, and the IAM authenticator is currently only available for Java, so to stay open and cloud native without third-party dependencies, the producer is written in that language. In this example, the Lambda handler receives a message from an API Gateway source and pushes this message to an MSK topic called messages.
Typically, Kafka producers are long-lived, and pushing a message to a Kafka topic is an asynchronous process. Because Lambda is ephemeral, you must make sure a submitted message is fully flushed before the Lambda function ends by calling producer.flush():
// Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: MIT-0
package software.amazon.samples.kafka.lambda;
// Imports for the AWS Lambda events, Kafka client, Log4j, Lambda Powertools, and JDK classes used below.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import software.amazon.lambda.powertools.logging.Logging;
import software.amazon.lambda.powertools.tracing.Tracing;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Future;
// This class is part of the AWS samples package and specifically deals with Kafka integration in a Lambda function.
// It serves as a simple API Gateway to Kafka Proxy, accepting requests and forwarding them to a Kafka topic.
public class SimpleApiGatewayKafkaProxy implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
// Specifies the name of the Kafka topic where the messages will be sent
public static final String TOPIC_NAME = "messages";
// Logger instance for logging events of this class
private static final Logger log = LogManager.getLogger(SimpleApiGatewayKafkaProxy.class);
// Factory to create properties for Kafka Producer
public KafkaProducerPropertiesFactory kafkaProducerProperties = new KafkaProducerPropertiesFactoryImpl();
// Instance of KafkaProducer
private KafkaProducer<String, String> producer;
// Overridden method from the RequestHandler interface to handle incoming API Gateway proxy events
@Override
@Tracing
@Logging(logEvent = true)
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent input, Context context) {
// Creating a response object to send back
APIGatewayProxyResponseEvent response = createEmptyResponse();
try {
// Extracting the message from the request body
String message = getMessageBody(input);
// Create a Kafka producer
KafkaProducer<String, String> producer = createProducer();
// Creating a record with topic name, request ID as key and message as value
ProducerRecord<String, String> record = new ProducerRecord<String, String>(TOPIC_NAME, context.getAwsRequestId(), message);
// Sending the record to Kafka topic and getting the metadata of the record
Future<RecordMetadata> send = producer.send(record);
producer.flush();
// Retrieve metadata about the sent record
RecordMetadata metadata = send.get();
// Logging the partition where the message was sent
log.info(String.format("Message was sent to partition %s", metadata.partition()));
// If the message was successfully sent, return a 200 status code
return response.withStatusCode(200).withBody("Message successfully pushed to kafka");
} catch (Exception e) {
// In case of exception, log the error message and return a 500 status code
log.error(e.getMessage(), e);
return response.withBody(e.getMessage()).withStatusCode(500);
}
}
// Creates a Kafka producer if it doesn't already exist
@Tracing
private KafkaProducer<String, String> createProducer() {
if (producer == null) {
log.info("Connecting to kafka cluster");
producer = new KafkaProducer<String, String>(kafkaProducerProperties.getProducerProperties());
}
return producer;
}
// Extracts the message from the request body. If it's base64 encoded, it's decoded first.
private String getMessageBody(APIGatewayProxyRequestEvent input) {
String body = input.getBody();
if (input.getIsBase64Encoded()) {
// Decode the base64-encoded body using the JDK Base64 decoder
body = new String(Base64.getDecoder().decode(body), StandardCharsets.UTF_8);
}
return body;
}
// Creates an empty API Gateway proxy response event with predefined headers.
private APIGatewayProxyResponseEvent createEmptyResponse() {
Map<String, String> headers = new HashMap<>();
headers.put("Content-Type", "application/json");
headers.put("X-Custom-Header", "application/json");
APIGatewayProxyResponseEvent response = new APIGatewayProxyResponseEvent().withHeaders(headers);
return response;
}
}
Connect to Amazon MSK using IAM authentication
This post uses IAM authentication to connect to the respective Kafka cluster. For information about how to configure the producer for connectivity, refer to IAM access control.
Because you configure the cluster via IAM, grant Connect and WriteData permissions to the producer so that it can push messages to Kafka:
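The following is a minimal sketch of that Kafka-level policy statement, expressed as a Python dictionary for illustration; the Region, account ID, cluster name, and topic ARN format are placeholders to adapt to your environment.
kafka_producer_policy_statement = {
    "Effect": "Allow",
    "Action": [
        "kafka-cluster:Connect",    # connect to the MSK Serverless cluster
        "kafka-cluster:WriteData",  # produce messages to the topic
    ],
    "Resource": [
        "arn:aws:kafka:<region>:<account-id>:cluster/<cluster-name>/*",
        "arn:aws:kafka:<region>:<account-id>:topic/<cluster-name>/*/messages",
    ],
}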
The preceding excerpt covers only the Kafka-level permissions that must be granted to the Kafka producer. When using IAM authentication, be aware of the current limits of IAM Kafka authentication, which affect the number of concurrent connections and IAM requests for a producer. Refer to Amazon MSK quota and follow the recommendation for authentication backoff in the producer client:
Each MSK Serverless cluster can handle 100 requests per second. To reduce IAM authentication requests from the Kafka producer, place it outside of the handler. For frequent calls, there is a chance that Lambda reuses the previously created class instance and only reruns the handler.
For bursting workloads with a high number of concurrent API Gateway requests, this can lead to dropped messages. Although this might be tolerable for some workloads, for others this might not be the case.
To reduce latency, reduce cold start times for Java by changing the tiered compilation level to 1, as described in Optimizing AWS Lambda function performance for Java. Provisioned concurrency ensures that polling Lambda functions don’t need to warm up before requests arrive.
Conclusion
In this post, we showed how to create a serverless integration Lambda function between API Gateway and MSK Serverless as a way to do IAM authentication when your producer is not written in Java. You also learned about the native integration of Lambda and Amazon MSK on the consumer side. Additionally, we showed how to deploy such an integration with the AWS CDK.
The general pattern is suitable for many use cases where you want to use IAM authentication but your producers or consumers are not written in Java, and you still want to take advantage of the benefits of MSK Serverless, such as its ability to scale up and down with unpredictable or spiky workloads and its minimal operational overhead for running Apache Kafka.
You can also use MSK Serverless to reduce operational complexity by automating provisioning and the management of capacity needs, including the need to constantly monitor brokers and storage.
For more serverless learning resources, visit Serverless Land.
For more information on MSK Serverless, check out the following:
Philipp Klose is a Global Solutions Architect at AWS based in Munich. He works with enterprise FSI customers and helps them solve business problems by architecting serverless platforms. In his free time, Philipp spends time with his family and enjoys every geek hobby possible.
Daniel Wessendorf is a Global Solutions Architect at AWS based in Munich. He works with enterprise FSI customers and is primarily specialized in machine learning and data architectures. In his free time, he enjoys swimming, hiking, skiing, and spending quality time with his family.
Marvin Gersho is a Senior Solutions Architect at AWS based in New York City. He works with a wide range of startup customers. He previously worked for many years in engineering leadership and hands-on application development, and now focuses on helping customers architect secure and scalable workloads on AWS with a minimum of operational overhead. In his free time, Marvin enjoys cycling and strategy board games.
Nathan Lichtenstein is a Senior Solutions Architect at AWS based in New York City. Primarily working with startups, he ensures his customers build smart on AWS, delivering creative solutions to their complex technical challenges. Nathan has worked in cloud and network architecture in the media, financial services, and retail spaces. Outside of work, he can often be found at a Broadway theater.
In March 2022, AWS announced support for custom certificate extensions, including name constraints, using AWS Certificate Manager (ACM) Private Certificate Authority (CA). Defining DNS name constraints with your subordinate CA can help establish guardrails to improve public key infrastructure (PKI) security and mitigate certificate misuse. For example, you can set a DNS name constraint that restricts the CA from issuing certificates to a resource that is using a specific domain name. Certificate requests from resources using an unauthorized domain name will be rejected by your CA and won’t be issued a certificate.
In this blog post, I’ll walk you step-by-step through the process of applying DNS name constraints to a subordinate CA by using the AWS Private CA service.
Prerequisites
You need to have the following prerequisite tools, services, and permissions in place before following the steps presented within this post:
All of the examples in this blog post are provided for the us-west-2 AWS Region. You will need to make sure that you have access to resources in your desired Region and specify the Region in the example commands.
Retrieve the solution code
Our GitHub repository contains the Python code that you need in order to replicate the steps presented in this post. You can clone the repository over HTTPS or SSH; select the method that you prefer.
Creating a Python virtual environment will allow you to run this solution in a fresh environment without impacting your existing Python packages. This will prevent the solution from interfering with dependencies that your other Python scripts may have. The virtual environment has its own set of Python packages installed. Read the official Python documentation on virtual environments for more information on their purpose and functionality.
To create a Python virtual environment
Create a new directory for the Python virtual environment in your home path.
Generate the API passthrough file with encoded name constraints
This step allows you to define the permitted and excluded DNS name constraints to apply to your subordinate CA. Read the documentation on name constraints in RFC 5280 for more information on their usage and functionality.
The Python encoder provided in this solution accepts two arguments for the permitted and excluded name constraints. The -p argument is used to provide the permitted subtrees, and the -e argument is used to provide the excluded subtrees. Use commas without spaces to separate multiple entries. For example: -p .dev.example.com,.test.example.com -e .prod.dev.example.com,.amazon.com.
To encode your name constraints
Run the following command, updating <~/github> with your information and providing your desired name constraints for the permitted (-p) and excluded (-e) arguments.
If the command runs successfully, you will see the message “Successfully Encoded Name Constraints” and the name of the generated API passthrough JSON file. The output of Permitted Subtrees will show the domain names you passed with the -p argument, and Excluded Subtrees will show the domain names you passed with the -e argument in the previous step.
Figure 1: Command line output example for name-constraints-encoder.py
Use the following command to display the contents of the API passthrough file generated by the Python encoder.
The contents of api_passthrough_config.json will look similar to the following screenshot. The JSON object will have an ObjectIdentifier key and value of 2.5.29.30, which represents the name constraints OID from the Global OID database. The base64-encoded Value represents the permitted and excluded name constraints you provided to the Python encoder earlier.
Figure 2: Viewing contents of api_passthrough_config.json
Generate a CSR from your subordinate CA
You must generate a certificate signing request (CSR) from the subordinate CA to which you intend to have the name constraints applied. Otherwise, you might encounter errors when you attempt to install the new certificate with name constraints.
To generate the CSR
Update and run the following command with your subordinate CA ARN and Region. The Amazon Resource Name (ARN) uniquely identifies an AWS resource; in this case, it tells the command which subordinate CA to generate the CSR from.
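If you prefer to script this step, the following boto3 sketch retrieves the CSR and writes it to a file you can inspect with openssl; the ARN, Region, and file name are placeholders.
import boto3

SUBORDINATE_CA_ARN = "arn:aws:acm-pca:us-west-2:111122223333:certificate-authority/<subordinate-ca-id>"

acm_pca = boto3.client("acm-pca", region_name="us-west-2")

# Retrieve the CSR from the subordinate CA and save it for the issuance step.
csr = acm_pca.get_certificate_authority_csr(CertificateAuthorityArn=SUBORDINATE_CA_ARN)["Csr"]
with open("ca.csr", "w") as f:
    f.write(csr)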
The following screenshot provides an example output for a CSR. Your CSR details will be different; however, you should see something similar. Look for verify OK in the output and make sure that the Subject details match your subordinate CA. The subject details will provide the country, state, and city. The details will also likely contain your organization’s name, organizational unit or department name, and a common name for the subordinate CA.
Figure 3: Reviewing CSR content using openssl
Use the root CA to issue a new certificate with the name constraints custom extension
This post uses a two-tiered certificate authority architecture for simplicity. However, you can use the steps in this post with a more complex multi-level CA architecture. The name constraints certificate will be generated by the root CA and applied to the subordinate CA.
To issue and download a certificate with name constraints
Run the following command, making sure to update the placeholder argument values with your information. Make sure that the certificate-authority-arn is that of your root CA.
Note that the provided template-arn instructs the root CA to use the api_passthrough_config.json file that you created earlier to generate the certificate with the name constraints custom extension. If you use a different template, the new certificate might not be created as you intended.
Also, note that the validity period provided in this example is 5 years or 1825 days. The validity period for your subordinate CA must be less than that of your root CA.
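As a rough boto3 equivalent of this step, the sketch below issues the certificate from the root CA using the API passthrough file and an API passthrough subordinate CA template; the template ARN, signing algorithm, and file names are assumptions to adjust for your CA configuration.
import json
import boto3

ROOT_CA_ARN = "arn:aws:acm-pca:us-west-2:111122223333:certificate-authority/<root-ca-id>"

acm_pca = boto3.client("acm-pca", region_name="us-west-2")

# Load the API passthrough configuration produced by the Python encoder and the CSR.
with open("api_passthrough_config.json") as f:
    api_passthrough = json.load(f)
with open("ca.csr", "rb") as f:
    csr = f.read()

response = acm_pca.issue_certificate(
    CertificateAuthorityArn=ROOT_CA_ARN,
    Csr=csr,
    ApiPassthrough=api_passthrough,
    # An API passthrough template is required for the custom name constraints extension.
    TemplateArn="arn:aws:acm-pca:::template/SubordinateCACertificate_PathLen0_APIPassthrough/V1",
    SigningAlgorithm="SHA256WITHRSA",
    Validity={"Value": 1825, "Type": "DAYS"},
)
print(response["CertificateArn"])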
If the issue-certificate command is successful, the output will provide the ARN of the new certificate that is issued by the root CA. Copy the certificate ARN, because it will be used in the following command.
Figure 4: Issuing a certificate with name constraints from the root CA using the AWS CLI
To download the new certificate, run the following command. Make sure to update the placeholders with your root CA’s certificate-authority-arn, the certificate-arn you obtained from the previous step, and your Region.
Separate the certificate and certificate chain into two separate files by running the following commands. The new subordinate CA certificate is saved as cert.pem and the certificate chain is saved as cert_chain.pem.
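A boto3 sketch of the download-and-split step might look like the following; the ARNs are placeholders, and the waiter simply blocks until the certificate is available.
import boto3

ROOT_CA_ARN = "arn:aws:acm-pca:us-west-2:111122223333:certificate-authority/<root-ca-id>"
CERTIFICATE_ARN = "<certificate-arn-from-the-issue-certificate-step>"

acm_pca = boto3.client("acm-pca", region_name="us-west-2")

# Wait until the certificate has been issued, then download it.
acm_pca.get_waiter("certificate_issued").wait(
    CertificateAuthorityArn=ROOT_CA_ARN, CertificateArn=CERTIFICATE_ARN
)
result = acm_pca.get_certificate(
    CertificateAuthorityArn=ROOT_CA_ARN, CertificateArn=CERTIFICATE_ARN
)

# Save the subordinate CA certificate and its chain as separate PEM files.
with open("cert.pem", "w") as f:
    f.write(result["Certificate"])
with open("cert_chain.pem", "w") as f:
    f.write(result["CertificateChain"])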
The x509v3 Name Constraints portion of cert.pem should match the permitted and excluded name constraints you provided to the Python encoder earlier.
Figure 5: Verifying the X509v3 name constraints in the newly issued certificate using openssl
Install the name constraints certificate on the subordinate CA
In this section, you will install the name constraints certificate on your subordinate CA. Note that this will replace the existing certificate installed on the subordinate CA. The name constraints will go into effect as soon as the new certificate is installed.
To install the name constraints certificate
Run the following command with your subordinate CA’s certificate-authority-arn and path to the cert.pem and cert_chain.pem files you created earlier.
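The following boto3 sketch shows what the install and verification might look like; the ARN is a placeholder, and the describe call is included only to confirm the NotBefore and NotAfter values of the newly installed certificate.
import boto3

SUBORDINATE_CA_ARN = "arn:aws:acm-pca:us-west-2:111122223333:certificate-authority/<subordinate-ca-id>"

acm_pca = boto3.client("acm-pca", region_name="us-west-2")

with open("cert.pem", "rb") as f:
    certificate = f.read()
with open("cert_chain.pem", "rb") as f:
    certificate_chain = f.read()

# Install the name constraints certificate on the subordinate CA.
acm_pca.import_certificate_authority_certificate(
    CertificateAuthorityArn=SUBORDINATE_CA_ARN,
    Certificate=certificate,
    CertificateChain=certificate_chain,
)

# Confirm the CA now reports the new certificate's validity window.
ca = acm_pca.describe_certificate_authority(
    CertificateAuthorityArn=SUBORDINATE_CA_ARN
)["CertificateAuthority"]
print(ca["NotBefore"], ca["NotAfter"])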
The output from the previous command will be similar to the following screenshot. The CertificateAuthorityConfiguration and highlighted NotBefore and NotAfter fields in the output should match the name constraints certificate.
Figure 6: Verifying subordinate CA details using the AWS CLI
Test the name constraints
Now that your subordinate CA has the new certificate installed, you can test to see if the name constraints are being enforced based on the certificate you installed in the previous section.
To request a certificate from your subordinate CA and test the applied name constraints
To request a new certificate, update and run the following command with your subordinate CA’s certificate-authority-arn, region, and desired certificate subject in the domain-name argument.
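A boto3 sketch of this test might look like the following; the domain name matches the excluded subtree example used earlier, the ARN is a placeholder, and a short pause is added so the request has time to be processed before it is described.
import time
import boto3

SUBORDINATE_CA_ARN = "arn:aws:acm-pca:us-west-2:111122223333:certificate-authority/<subordinate-ca-id>"

acm = boto3.client("acm", region_name="us-west-2")

# Request a private certificate for a domain excluded by the name constraints.
cert_arn = acm.request_certificate(
    DomainName="app.prod.dev.example.com",
    CertificateAuthorityArn=SUBORDINATE_CA_ARN,
)["CertificateArn"]

time.sleep(10)

# A rejected request reports Status FAILED with FailureReason PCA_NAME_CONSTRAINTS_VALIDATION.
details = acm.describe_certificate(CertificateArn=cert_arn)["Certificate"]
print(details["Status"], details.get("FailureReason"))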
You will see output similar to the following screenshot if the requested certificate domain name was not permitted by the name constraints applied to your subordinate CA. In this example, a certificate for app.prod.dev.example.com was rejected. The Status shows “FAILED” and the FailureReason indicates “PCA_NAME_CONSTRAINTS_VALIDATION”.
Figure 7: Verifying the status of the certificate request using the AWS CLI describe-certificate command
A key part of protecting your organization’s non-public, sensitive data is to understand who can access it and from where. One of the common requirements is to restrict access to authorized users from known locations. To accomplish this, you should be familiar with the expected network access patterns and establish organization-wide guardrails to limit access to known networks. Additionally, you should verify that the credentials associated with your AWS Identity and Access Management (IAM) principals are only usable within these expected networks. On Amazon Web Services (AWS), you can use the network perimeter to apply coarse-grained network controls on your resources and principals. In this fourth blog post of the Establishing a data perimeter on AWS series, we explore the benefits and implementation considerations of defining your network perimeter.
The network perimeter is a set of coarse-grained controls that help you verify that your identities and resources can only be used from expected networks.
To achieve these security objectives, you first must define what expected networks means for your organization. Expected networks usually include approved networks your employees and applications use to access your resources, such as your corporate IP CIDR range and your VPCs. There are also scenarios where you need to permit access from networks of AWS services acting on your behalf or networks of trusted third-party partners that you integrate with. You should consider all intended data access patterns when you create the definition of expected networks. Other networks are considered unexpected and shouldn’t be allowed access.
Security risks addressed by the network perimeter
The network perimeter helps address the following security risks:
Unintended information disclosure through credential use from non-corporate networks
It’s important to consider the security implications of having developers with preconfigured access stored on their laptops. For example, let’s say that to access an application, a developer uses a command line interface (CLI) to assume a role and uses the temporary credentials to work on a new feature. The developer continues their work at a coffee shop that has great public Wi-Fi while their credentials are still valid. Accessing data through a non-corporate network means that they are potentially bypassing their company’s security controls, which might lead to the unintended disclosure of sensitive corporate data in a public space.
Unintended data access through stolen credentials
Organizations are prioritizing protection from credential theft risks, as threat actors can use stolen credentials to gain access to sensitive data. For example, a developer could mistakenly share credentials from an Amazon EC2 instance CLI access over email. After credentials are obtained, a threat actor can use them to access your resources and potentially exfiltrate your corporate data, possibly leading to reputational risk.
Figure 1 outlines an undesirable access pattern: using an employee corporate credential to access corporate resources (in this example, an Amazon Simple Storage Service (Amazon S3) bucket) from a non-corporate network.
Figure 1: Unintended access to your S3 bucket from outside the corporate network
Implementing the network perimeter
During the network perimeter implementation, you use IAM policies and global condition keys to help you control access to your resources based on which network the API request is coming from. IAM allows you to enforce the network origin of a request through both identity policies and resource policies.
The following two policies help you control both your principals and resources to verify that the request is coming from your expected network:
Service control policies (SCPs) are policies you can use to manage the maximum available permissions for your principals. SCPs help you verify that your accounts stay within your organization’s access control guidelines.
Resource-based policies are policies that are attached to resources in each AWS account. With resource-based policies, you can specify who has access to the resource and what actions they can perform on it. For a list of services that support resource-based policies, see AWS services that work with IAM.
With the help of these two policy types, you can enforce the control objectives using the following IAM global condition keys:
aws:SourceIp: You can use this condition key to create a policy that only allows request from a specific IP CIDR range. For example, this key helps you define your expected networks as your corporate IP CIDR range.
aws:SourceVpc: This condition key helps you check whether the request comes from the list of VPCs that you specified in the policy. In a policy, this condition key is used to only allow access to an S3 bucket if the VPC where the request originated matches the VPC ID listed in your policy.
aws:SourceVpce: You can use this condition key to check if the request came from one of the VPC endpoints specified in your policy. Adding this key to your policy helps you restrict access to API calls that originate from VPC endpoints that belong to your organization.
aws:ViaAWSService: You can use this key to write a policy to allow an AWS service that uses your credentials to make calls on your behalf. For example, when you upload an object to Amazon S3 with server-side encryption with AWS Key Management Service (AWS KMS) on, S3 needs to encrypt the data on your behalf. To do this, S3 makes a subsequent request to AWS KMS to generate a data key to encrypt the object. The call that S3 makes to AWS KMS is signed with your credentials and originates outside of your network.
aws:PrincipalIsAWSService: This condition key helps you write a policy to allow AWS service principals to access your resources. For example, when you create an AWS CloudTrail trail with an S3 bucket as a destination, CloudTrail uses a service principal, cloudtrail.amazonaws.com, to publish logs to your S3 bucket. The API call from CloudTrail comes from the service network.
The relationship between the control objectives and the capabilities used to implement the network perimeter can be summarized as follows:
Control objective: My resources can only be accessed from expected networks. Implemented by using: resource-based policies. Primary IAM capability: the aws:SourceIp, aws:SourceVpc, aws:SourceVpce, aws:ViaAWSService, and aws:PrincipalIsAWSService condition keys.
Control objective: My identities can access resources only from expected networks. Implemented by using: service control policies. Primary IAM capability: the aws:SourceIp, aws:SourceVpc, aws:ViaAWSService, and aws:PrincipalTag condition keys.
My resources can only be accessed from expected networks
Start by implementing the network perimeter on your resources using resource-based policies. The perimeter should be applied to all resources that support resource-based policies in each AWS account. With this type of policy, you can define which networks can be used to access the resources, helping prevent access to your company resources if valid credentials are used from non-corporate networks.
The following is an example of a resource-based policy for an S3 bucket that limits access only to expected networks using the aws:SourceIp, aws:SourceVpc, aws:PrincipalIsAWSService, and aws:ViaAWSService condition keys. Replace <my-data-bucket>, <my-corporate-cidr>, and <my-vpc> with your information.
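As a minimal sketch of such a policy, assuming a single CIDR range and VPC, the following Python snippet builds the statement as a dictionary and applies it with boto3; the statement ID and placeholder values are illustrative.
import json
import boto3

bucket = "<my-data-bucket>"

network_perimeter_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceNetworkPerimeter",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {
                # Deny only when the request is not from the corporate CIDR range...
                "NotIpAddressIfExists": {"aws:SourceIp": "<my-corporate-cidr>"},
                # ...and not from your VPC...
                "StringNotEqualsIfExists": {"aws:SourceVpc": "<my-vpc>"},
                # ...and not made by an AWS service principal or a service acting on your behalf.
                "BoolIfExists": {
                    "aws:PrincipalIsAWSService": "false",
                    "aws:ViaAWSService": "false",
                },
            },
        }
    ],
}

boto3.client("s3").put_bucket_policy(
    Bucket=bucket, Policy=json.dumps(network_perimeter_policy)
)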
The Deny statement in the preceding policy has four condition keys where all conditions must resolve to true to invoke the Deny effect. Use the IfExists condition operator to clearly state that each of these conditions will still resolve to true if the key is not present on the request context.
This policy will deny Amazon S3 actions unless requested from your corporate CIDR range (NotIpAddressIfExists with aws:SourceIp) or from your VPC (StringNotEqualsIfExists with aws:SourceVpc). Notice that aws:SourceVpc and aws:SourceVpce are only present on the request if the call was made through a VPC endpoint. So, you could also use the aws:SourceVpce condition key in the preceding policy; however, this would mean listing every VPC endpoint in your environment. Because the number of VPC endpoints is greater than the number of VPCs, this example uses the aws:SourceVpc condition key.
This policy also creates a conditional exception for Amazon S3 actions requested by a service principal (BoolIfExists with aws:PrincipalIsAWSService), such as CloudTrail writing events to your S3 bucket, or by an AWS service on your behalf (BoolIfExists with aws:ViaAWSService), such as S3 calling AWS KMS to encrypt or decrypt an object.
Extending the network perimeter on resources
There are cases where you need to extend your perimeter to include AWS services that access your resources from outside your network. For example, if you’re replicating objects using S3 bucket replication, the calls to Amazon S3 originate from the service network outside of your VPC, using a service role. Another case where you need to extend your perimeter is if you integrate with trusted third-party partners that need access to your resources. If you’re using services with the described access pattern in your AWS environment or need to provide access to trusted partners, the policy EnforceNetworkPerimeter that you applied on your S3 bucket in the previous section will deny access to the resource.
In this section, you learn how to extend your network perimeter to include networks of AWS services using service roles to access your resources and trusted third-party partners.
AWS services that use service roles and service-linked roles to access resources on your behalf
Service roles are assumed by AWS services to perform actions on your behalf. An IAM administrator can create, change, and delete a service role from within IAM; this role exists within your AWS account and has an ARN like arn:aws:iam::<AccountNumber>:role/<RoleName>. A key difference between a service-linked role (SLR) and a service role is that the SLR is linked to a specific AWS service and you can view but not edit the permissions and trust policy of the role. An example is AWS Identity and Access Management Access Analyzer using an SLR to analyze resource metadata. To account for this access pattern, you can exempt roles on the service-linked role dedicated path arn:aws:iam::<AccountNumber>:role/aws-service-role/*, and for service roles, you can tag the role with the tag network-perimeter-exception set to true.
If you are exempting service roles in your policy based on a tag value, you must also include a policy to enforce the identity perimeter on your resource, as shown in this sample policy. This helps verify that only identities from your organization can access the resource and cannot circumvent your network perimeter controls by using the network-perimeter-exception tag.
Partners accessing your resources from their own networks
There might be situations where your company needs to grant access to trusted third parties. For example, providing a trusted partner access to data stored in your S3 bucket. You can account for this type of access by using the aws:PrincipalAccount condition key set to the account ID provided by your partner.
The following is an example of a resource-based policy for an S3 bucket that incorporates the two access patterns described above. Replace <my-data-bucket>, <my-corporate-cidr>, <my-vpc>, <third-party-account-a>, <third-party-account-b>, and <my-account-id> with your information.
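The statement below sketches how the earlier EnforceNetworkPerimeter statement might be extended with these two exceptions; it is expressed as a Python dictionary with placeholder values and would take the place of the statement shown in the previous section.
extended_network_perimeter_statement = {
    "Sid": "EnforceNetworkPerimeter",
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::<my-data-bucket>",
        "arn:aws:s3:::<my-data-bucket>/*",
    ],
    "Condition": {
        "NotIpAddressIfExists": {"aws:SourceIp": "<my-corporate-cidr>"},
        "StringNotEqualsIfExists": {
            "aws:SourceVpc": "<my-vpc>",
            # Trusted third-party partner accounts are treated as expected networks.
            "aws:PrincipalAccount": [
                "<third-party-account-a>",
                "<third-party-account-b>",
            ],
        },
        "BoolIfExists": {
            "aws:PrincipalIsAWSService": "false",
            "aws:ViaAWSService": "false",
        },
        # Exempt AWS services that use service-linked roles in your account.
        "ArnNotLikeIfExists": {
            "aws:PrincipalArn": "arn:aws:iam::<my-account-id>:role/aws-service-role/*"
        },
    },
}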
There are four condition operators in the policy above, and you need all four of them to resolve to true to invoke the Deny effect. Therefore, this policy only allows access to Amazon S3 from expected networks defined as: your corporate IP CIDR range (NotIpAddressIfExists and aws:SourceIp), your VPC (StringNotEqualsIfExists and aws:SourceVpc), networks of AWS service principals (aws:PrincipalIsAWSService), or an AWS service acting on your behalf (aws:ViaAWSService). It also allows access to networks of trusted third-party accounts (StringNotEqualsIfExists and aws:PrincipalAccount:<third-party-account-a>), and AWS services using an SLR to access your resources (ArnNotLikeIfExists and aws:PrincipalArn).
My identities can access resources only from expected networks
Applying the network perimeter on identity can be more challenging because you need to consider not only calls made directly by your principals, but also calls made by AWS services acting on your behalf. As described in access pattern 3 Intermediate IAM roles for data access in this blog post, many AWS services assume an AWS service role you created to perform actions on your behalf. The complicating factor is that even if the service supports VPC-based access to your data — for example AWS Glue jobs can be deployed within your VPC to access data in your S3 buckets — the service might also use the service role to make other API calls outside of your VPC. For example, with AWS Glue jobs, the service uses the service role to deploy elastic network interfaces (ENIs) in your VPC. However, these calls to create ENIs in your VPC are made from the AWS Glue managed network and not from within your expected network. A broad network restriction in your SCP for all your identities might prevent the AWS service from performing tasks on your behalf.
Therefore, the recommended approach is to only apply the perimeter to identities that represent the highest risk of inappropriate use based on other compensating controls that might exist in your environment. These are identities whose credentials can be obtained and misused by threat actors. For example, if you allow your developers access to the Amazon Elastic Compute Cloud (Amazon EC2) CLI, a developer can obtain credentials from the Amazon EC2 instance profile and use the credentials to access your resources from their own network.
To summarize, to enforce your network perimeter based on identity, evaluate your organization’s security posture and what compensating controls are in place. Then, according to this evaluation, identify which service roles or human roles have the highest risk of inappropriate use, and enforce the network perimeter on those identities by tagging them with data-perimeter-include set to true.
The following policy shows the use of tags to enforce the network perimeter on specific identities. Replace <my-corporate-cidr> and <my-vpc> with your own information.
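A minimal sketch of such an SCP statement is shown below as a Python dictionary; the statement ID is illustrative, the action scope is an assumption you should tailor to your environment, and the placeholders need to be replaced with your values.
import json

network_perimeter_scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "EnforceNetworkPerimeterOnIdentities",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {
                # Apply the perimeter only to identities tagged for inclusion.
                "StringEquals": {"aws:PrincipalTag/data-perimeter-include": "true"},
                "NotIpAddressIfExists": {"aws:SourceIp": "<my-corporate-cidr>"},
                "StringNotEqualsIfExists": {"aws:SourceVpc": "<my-vpc>"},
                # Allow AWS services making calls on your behalf.
                "BoolIfExists": {"aws:ViaAWSService": "false"},
                # Exclude the EC2 infrastructure role used to decrypt encrypted EBS volumes.
                "ArnNotLikeIfExists": {
                    "aws:PrincipalArn": "arn:aws:iam::*:role/aws:ec2-infrastructure"
                },
            },
        }
    ],
}

print(json.dumps(network_perimeter_scp, indent=2))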
The above policy statement uses the Deny effect to limit access to expected networks for identities with the tag data-perimeter-include attached to them (StringEquals and aws:PrincipalTag/data-perimeter-include set to true). This policy will deny access to those identities unless the request is done by an AWS service on your behalf (aws:ViaAWSService), is coming from your corporate CIDR range (NotIpAddressIfExists and aws:SourceIp), or is coming from your VPCs (StringNotEqualsIfExists with the aws:SourceVpc).
Amazon EC2 also uses a special service role, also known as infrastructure role, to decrypt Amazon Elastic Block Store (Amazon EBS). When you mount an encrypted Amazon EBS volume to an EC2 instance, EC2 calls AWS KMS to decrypt the data key that was used to encrypt the volume. The call to AWS KMS is signed by an IAM role, arn:aws:iam::*:role/aws:ec2-infrastructure, which is created in your account by EC2. For this use case, as you can see on the preceding policy, you can use the aws:PrincipalArn condition key to exclude this role from the perimeter.
IAM policy samples
This GitHub repository contains policy examples that illustrate how to implement network perimeter controls. The policy samples don’t represent a complete list of valid access patterns and are for reference only. They’re intended for you to tailor and extend to suit the needs of your environment. Make sure you thoroughly test the provided example policies before implementing them in your production environment.
Conclusion
In this blog post, you learned about the elements needed to build the network perimeter, including policy examples and strategies on how to extend that perimeter. You now also know the different access patterns used by AWS services that act on your behalf, how to evaluate those access patterns, and how to take a risk-based approach to applying the perimeter to identities in your organization.
For additional learning opportunities, see Data perimeters on AWS. This resource provides additional materials such as a data perimeter workshop, blog posts, whitepapers, and webinar sessions.
Earlier this year, Amazon Web Services (AWS) released Amazon Corretto Crypto Provider (ACCP) 2, a cryptography provider built by AWS for Java virtual machine (JVM) applications. ACCP 2 delivers comprehensive performance enhancements, with some algorithms (such as elliptic curve key generation) seeing a greater than 13-fold improvement over ACCP 1. The new release also brings official support for the AWS Graviton family of processors. In this post, I’ll discuss a use case for ACCP, then review performance benchmarks to illustrate the performance gains. Finally, I’ll show you how to get started using ACCP 2 in applications today.
This release changes the backing cryptography library for ACCP from OpenSSL (used in ACCP 1) to the AWS open source cryptography library, AWS libcrypto (AWS-LC). AWS-LC has extensive formal verification, as well as traditional testing, to assure the correctness of cryptography that it provides. While AWS-LC and OpenSSL are largely compatible, there are some behavioral differences that required the ACCP major version increment to 2.
The move to AWS-LC also allows ACCP to leverage performance optimizations in AWS-LC for modern processors. I’ll illustrate the ACCP 2 performance enhancements through the use case of establishing a secure communications channel with Transport Layer Security version 1.3 (TLS 1.3). Specifically, I’ll examine cryptographic components of the connection’s initial phase, known as the handshake. TLS handshake latency particularly matters for large web service providers, but reducing the time it takes to perform various cryptographic operations is an operational win for any cryptography-intensive workload.
TLS 1.3 requires ephemeral key agreement, which means that a new key pair is generated and exchanged for every connection. During the TLS handshake, each party generates an ephemeral elliptic curve key pair, exchanges public keys using Elliptic Curve Diffie-Hellman (ECDH), and agrees on a shared secret. Finally, the client authenticates the server by verifying the Elliptic Curve Digital Signature Algorithm (ECDSA) signature in the certificate presented by the server after key exchange. All of this needs to happen before you can send data securely over the connection, so these operations directly impact handshake latency and must be fast.
Figure 1 shows benchmarks for the three elliptic curve algorithms that implement the TLS 1.3 handshake: elliptic curve key generation (up to 1,298% latency improvement in ACCP 2.0 over ACCP 1.6), ECDH key agreement (up to 858% latency improvement), and ECDSA digital signature verification (up to 260% latency improvement). These algorithms were benchmarked over three common elliptic curves with different key sizes on both ACCP 1 and ACCP 2. The choice of elliptic curve determines the size of the key used or generated by the algorithm, and key size correlates to performance. The following benchmarks were measured under the Amazon Corretto 11 JDK on a c7g.large instance running Amazon Linux with a Graviton 3 processor.
Figure 1: Percentage improvement of ACCP 2.0 over 1.6 performance benchmarks on c7g.large Amazon Linux Graviton 3
The performance improvements due to the optimization of secp384r1 in AWS-LC are particularly noteworthy.
Getting started
Whether you’re introducing ACCP to your project or upgrading from ACCP 1, start the onboarding process for ACCP 2 by updating your dependency manager configuration in your development or testing environment. The Maven and Gradle examples below assume that you’re using Linux on an ARM64 processor. If you’re using an x86 processor, substitute linux-x86_64 for linux-aarch64. After you’ve performed this update, sync your application’s dependencies and install ACCP in your JVM process. ACCP can be installed either by specifying our recommended security.properties file in your JVM invocation or programmatically at runtime. The following sections provide more details about all of these steps.
After ACCP has been installed, the Java Cryptography Architecture (JCA) will look for cryptographic implementations in ACCP first before moving on to other providers. So, as long as your application and dependencies obtain algorithms supported by ACCP from the JCA, your application should gain the benefits of ACCP 2 without further configuration or code changes.
Maven
If you’re using Maven to manage dependencies, add or update the following dependency configuration in your pom.xml file.
After updating your dependency manager, you’ll need to install ACCP. You can install ACCP using security properties as described in our GitHub repository. This installation method is a good option for users who have control over their JVM invocation.
Install programmatically
If you don’t have control over your JVM invocation, you can install ACCP programmatically. For Java applications, add the following code to your application’s initialization logic (optionally performing a health check).
Although the migration path to version 2 is straightforward for most ACCP 1 users, ACCP 2 ends support for some outdated algorithms: finite-field Diffie-Hellman key agreement, finite-field DSA signatures, and a National Institute of Standards and Technology (NIST)-specified random number generator. The removal of these algorithms is not backward compatible, so you’ll need to check your code for their usage and, if you find any, either migrate to more modern algorithms provided by ACCP 2 or obtain implementations from a different provider, such as one of the default providers that ship with the JDK.
Check your code
Search for unsupported algorithms in your application code by their JCA names:
DH: Finite-field Diffie-Hellman key agreement
DSA: Finite-field Digital Signature Algorithm
NIST800-90A/AES-CTR-256: NIST-specified random number generator
Use ACCP 2 supported algorithms
Where possible, use these supported algorithms in your application code:
ECDH for key agreement instead of DH
ECDSA or RSA for signatures instead of DSA
Default SecureRandom instead of NIST800-90A/AES-CTR-256
If your use case requires the now-unsupported algorithms, check whether any of those algorithms are explicitly requested from ACCP.
If ACCP is not explicitly named as the provider, then you should be able to transparently fall back to another provider without a code change.
If ACCP is explicitly named as the provider, then remove that provider specification and register a different provider that offers the algorithm. This will allow the JCA to obtain an implementation from another registered provider without breaking backwards compatibility in your application.
Test your code
Some behavioral differences exist between ACCP 2 and other providers, including ACCP 1 (backed by OpenSSL). After onboarding or migrating, it’s important that you test your application code thoroughly to identify potential incompatibilities between cryptography providers.
Conclusion
Integrate ACCP 2 into your application today to benefit from AWS-LC’s security assurance and performance improvements. For a full list of changes, see the ACCP CHANGELOG on GitHub. Linux builds of ACCP 2 are now available on Maven Central for aarch64 and x86-64 processor architectures. If you encounter any issues with your integration, or have any feature suggestions, please reach out to us on GitHub by filing an issue.
If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security news? Follow us on Twitter.
In 2017, AWS announced the release of Rate-based Rules for AWS WAF, a new rule type that helps protect websites and APIs from application-level threats such as distributed denial of service (DDoS) attacks, brute force log-in attempts, and bad bots. Rate-based rules track the rate of requests for each originating IP address and invoke a rule action on IPs with rates that exceed a set limit.
While rate-based rules are useful to detect and mitigate a broad variety of bad actors, threats have evolved to bypass request-rate limit rules. For example, one bypass technique is to send a high volume of requests by spreading them across thousands of unique IP addresses.
In May 2023, AWS announced AWS WAF enhancements to the existing rate-based rules feature that you can use to create more dynamic and intelligent rules by using additional HTTP request attributes for request rate limiting. For example, you can now choose from the following predefined keys to configure your rules: label namespace, header, cookie, query parameter, query string, HTTP method, URI path, and source IP address or IP address in a header. Additionally, you can combine up to five composite keys as parameters for stronger rule development. These rule definition enhancements help improve perimeter security measures against sophisticated application-layer DDoS attacks using AWS WAF. For more information about the supported request attributes, see Rate-based rule statement in the AWS WAF Developer Guide.
In this blog post, you will learn more about these new AWS WAF feature enhancements and how you can use alternative request attributes to create more robust and granular sets of rules. In addition, you’ll learn how to combine keys to create a composite aggregation key to uniquely identify a specific combination of elements to improve rate tracking.
Getting started
Configuring advanced rate-based rules is similar to configuring simple rate-based rules. The process starts with creating a new custom rule of type rate-based rule, entering the rate limit value, selecting custom keys, choosing the key from the request aggregation key dropdown menu, and adding additional composite keys by choosing Add a request aggregation key as shown in Figure 1.
Figure 1: Creating an advanced rate-based rule with two aggregation keys
For existing rules, you can update those rate-based rules to use the new functionality by editing them. For example, you can add a header to be aggregated with the source IP address, as shown in Figure 2. Note that previously created rules will not be modified.
Figure 2: Add a second key to an existing rate-based rule
You can still set the same rule action, such as block, count, captcha, or challenge. Optionally, you can continue applying a scope-down statement to limit rule action. For example, you can limit the scope to a certain application path or requests with a specified header. You can scope down the inspection criteria so that only certain requests are counted towards rate limiting, and use certain keys to aggregate those requests together. A technique would be to count only requests that have /api at the start of the URI, and aggregate them based on their SessionId cookie value.
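As a rough illustration of that technique, the following Python dictionary sketches the kind of rule entry you might pass in the Rules list of a wafv2 create_web_acl or update_web_acl call with boto3; the rule name, limit, cookie name, and visibility settings are assumptions for illustration.
advanced_rate_based_rule = {
    "Name": "RateLimitApiBySession",  # illustrative name
    "Priority": 1,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 1000,
            "AggregateKeyType": "CUSTOM_KEYS",
            # Composite aggregation key: session cookie value plus source IP address.
            "CustomKeys": [
                {"Cookie": {"Name": "SessionId",
                            "TextTransformations": [{"Priority": 0, "Type": "NONE"}]}},
                {"IP": {}},
            ],
            # Only count requests whose URI path starts with /api.
            "ScopeDownStatement": {
                "ByteMatchStatement": {
                    "FieldToMatch": {"UriPath": {}},
                    "SearchString": b"/api",
                    "PositionalConstraint": "STARTS_WITH",
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                }
            },
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "RateLimitApiBySession",
    },
}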
Target use cases
Now that you’re familiar with the foundations of advanced rate-based rules, let’s explore how they can improve your security posture using the following use cases:
Enhanced Application (Layer 7) DDoS protection
Improved API security
Enriched request throttling
Use case 1: Enhance Layer 7 DDoS mitigation
The first use case that you might find beneficial is to enhance Layer 7 DDoS mitigation. An HTTP request flood is the most common vector of DDoS attacks. This attack type aims to affect application availability by exhausting available resources to run the application.
Before the release of these enhancements, AWS WAF rules were limited to aggregating requests based on the IP address of the request origin, or could be configured to use a forwarded IP address in an HTTP header such as X-Forwarded-For. Now you can create a more robust rate-based rule to help protect your web application from DDoS attacks by tracking requests based on a different key or a combination of keys. Let's examine some examples.
To help detect pervasive bots, such as scrapers, scanners, and crawlers, or common bots that are distributed across many unique IP addresses, a rule can look at static request data, such as the User-Agent header.
To uniquely identify users behind a NAT gateway, you can use a cookie in addition to an IP address. Before the aggregation keys feature, it was difficult to distinguish users who connected from a single IP address. Now, you can use the session cookie to aggregate requests by their session identifier and IP address.
Note that for Layer 7 DDoS mitigation, tracking by session ID in cookies can be circumvented, because bots might send random values or not send any cookie at all. It’s a good idea to keep an IP-based blanket rate-limiting rule to block offending IP addresses that reach a certain high rate, regardless of their request attributes. In that case, the keys would look like:
Key 1: Session cookie
Key 2: IP address
You can reduce false positives when using AWS Managed Rules (AMR) IP reputation lists by rate limiting based on their label namespace. Labeling is a powerful feature that allows you to map the requests that match a specific pattern and apply custom rules to them. In this case, you can match the label namespace provided by the AMR IP reputation list that includes AWSManagedIPDDoSList, which is a list of IP addresses that have been identified as actively engaging in DDoS activities.
You might want to be cautious about using this group list in block mode, because there’s a chance of blocking legitimate users. To mitigate this, use the list in count mode and create an advanced rate-based rule to aggregate all requests with the label namespace awswaf:managed:aws:amazon-ip-list:, targeting captcha as the rule action. This lets you reduce false positives without compromising security. Applying captcha as an action for the rule reduces serving captcha to all users and instead only applies it when the rate of requests exceeds the defined limit. The key for this rule would be:
Labels (AMR IP reputation lists).
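A sketch of that rule follows; the name, limit, and priority are placeholders, and the rule must be evaluated after the managed rule group (running in count mode) that attaches the labels.

# Sketch: aggregate on the AMR IP reputation label namespace and serve a CAPTCHA
# only when the rate of labeled requests exceeds the limit.
label_rate_rule = {
    "Name": "rate-limit-ip-reputation-labels",   # placeholder name
    "Priority": 5,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 500,                        # placeholder limit
            "AggregateKeyType": "CUSTOM_KEYS",
            "CustomKeys": [
                {"LabelNamespace": {"Namespace": "awswaf:managed:aws:amazon-ip-list:"}}
            ],
        }
    },
    "Action": {"Captcha": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "rate-limit-ip-reputation-labels",
    },
}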
Use case 2: API security
In this second use case, you learn how to use an advanced rate-based rule to improve the security of an API. Protecting an API with rate-limiting rules helps ensure that requests aren’t being sent too frequently in a short amount of time. Reducing the risk from misusing an API helps to ensure that only legitimate requests are handled and not denied due to an overload of requests.
Now, you can create advanced rate-based rules that track API requests based on two aggregation keys. For example, you can combine the HTTP method, to differentiate between GET, POST, and other requests, with a custom header such as Authorization that carries a JSON Web Token (JWT). JWTs are not decrypted by AWS WAF, and AWS WAF only aggregates requests with the same token. This can help to ensure that a token is not being used maliciously or to bypass rate-limiting rules. An additional benefit of this configuration is that requests with no Authorization header are aggregated together towards the rate limiting threshold. The keys for this use case are:
Key 1: HTTP method
Key 2: Custom header (Authorization)
In addition, you can configure the rule to block and return a custom response when the request limit is reached. For example, you can return HTTP error code 429 (too many requests) with a Retry-After header indicating that the requester should wait 900 seconds (15 minutes) before making a new request.
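As a sketch, the block action for such a rule could carry a custom response like the following. The 900-second value matches the example above; note that AWS WAF may prefix custom response header names, so confirm the header name that clients actually receive before relying on it.

# Sketch: block action that returns HTTP 429 with a Retry-After header and an
# optional custom body defined at the web ACL level.
block_with_retry_after = {
    "Block": {
        "CustomResponse": {
            "ResponseCode": 429,
            "ResponseHeaders": [
                {"Name": "Retry-After", "Value": "900"}   # ask clients to wait 15 minutes
            ],
            # "CustomResponseBodyKey": "too-many-requests-body",  # optional body reference
        }
    }
}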
Use case 3: Enriched request throttling
There are many situations where throttling should be considered. For example, if you want to maintain the performance of a service API by providing fair usage for all users, you can apply different rate limits based on the type or purpose of the API, such as mutable or non-mutable requests. To achieve this, you can create two advanced rate-based rules using aggregation keys such as IP address combined with an HTTP request parameter that indicates whether the request is mutable or non-mutable. Each rule has its own HTTP request parameter, and you can set different maximum values for the rate limit. The keys for this use case are:
Key 1: HTTP request parameter
Key 2: IP address
Another example where throttling can be helpful is a multi-tenant application where you want to track requests made by each tenant's users. Let's say you have a free tier but also a paid subscription model for which you want to allow a higher request rate. For this use case, it's recommended to use two different URI paths so that the two tenants are kept separate. It's also advisable to use a custom header or query string parameter to differentiate between the two tenants, such as a tenant-id header or parameter that contains a unique identifier for each tenant. To implement this type of throttling using advanced rate-based rules, you can create two rules using an IP address in combination with the custom header as aggregation keys. Each rule can have its own maximum value for rate limiting, as well as a scope-down statement that matches requests for each URI path. The keys and scope-down statement for this use case are:
Key 1: Custom header (tenant-id)
Key 2: IP address
Scope down statement (URI path)
As a third example, you can rate limit a web application based on the total number of requests that it can handle. For this use case, you can use the new Count all aggregation option, which counts and rate limits the requests that match the rule's scope-down statement (required for this type of aggregation). One option is to scope down and inspect the URI path to target a specific function, such as a /history-search page. Another option, when you need to control how many requests go to a specific domain, is to scope down on a single header matched to a specific host, creating one rule for a.example.com and another rule for b.example.com.
Request Aggregation: Count all
Scope down statement (URI path | Single header)
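The following sketch shows the Count all variant for the /history-search example above; as before, the name, limit, and priority are placeholders.

# Sketch: one shared counter (Count all) for every request whose URI path
# starts with /history-search.
count_all_rule = {
    "Name": "rate-limit-history-search",          # placeholder name
    "Priority": 10,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 2000,                        # placeholder limit
            "AggregateKeyType": "CONSTANT",       # Count all aggregation
            # A scope-down statement is required with this aggregation type
            "ScopeDownStatement": {
                "ByteMatchStatement": {
                    "SearchString": "/history-search",
                    "FieldToMatch": {"UriPath": {}},
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                    "PositionalConstraint": "STARTS_WITH",
                }
            },
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "rate-limit-history-search",
    },
}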
For these examples, you can block with a custom response when the requests exceed the limit. For example, by returning the same HTTP error code and header, but adding a custom response body with a message like “You have reached the maximum number of requests allowed.”
Logging
The AWS WAF logs now include additional information about the request keys used for request-rate tracking and the values of the matched request keys. In addition to the existing IP or Forwarded_IP values, you can see the updated log fields limitKey and customValues. The limitKey field now shows either CustomKeys for custom aggregate key settings or Constant for count-all requests, and customValues shows an array of keys, names, and values.
Figure 3: Example log output for the advanced rate-based rule showing updated limitKey and customValues fields
As mentioned in the first use case, to get more detailed information about the traffic that’s analyzed by the web ACL, consider enabling logging. If you choose to enable Amazon CloudWatch Logs as the log destination, you can use CloudWatch Logs Insights and advanced queries to interactively search and analyze logs.
For example, you can use a CloudWatch Logs Insights query to get the request information that matches rate-based rules, including the updated keys and values, directly from the AWS WAF console.
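Here is a minimal sketch of running such a query programmatically with boto3. The log group name is a placeholder (AWS WAF log groups must start with aws-waf-logs-), and the filter assumes the standard AWS WAF log fields, so adjust it if your log format differs.

import time
import boto3

logs = boto3.client("logs")

LOG_GROUP = "aws-waf-logs-demo-web-acl"   # placeholder log group name

# Surface recent requests that terminated on a rate-based rule.
QUERY = """
fields @timestamp, httpRequest.clientIp, httpRequest.uri, terminatingRuleId
| filter terminatingRuleType = "RATE_BASED"
| sort @timestamp desc
| limit 20
"""

query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=int(time.time()) - 3600,    # last hour
    endTime=int(time.time()),
    queryString=QUERY,
)["queryId"]

# Poll until the query finishes, then print each result row as a dict.
while True:
    response = logs.get_query_results(queryId=query_id)
    if response["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in response.get("results", []):
    print({field["field"]: field["value"] for field in row})

Once you confirm how the limitKey and customValues fields are nested in your logs, you can add them to the fields line of the query.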
Figure 4 shows the CloudWatch Log Insights query and the logs output including custom keys, names, and values fields.
Figure 4: The CloudWatch Log Insights query and the logs output
Pricing
There is no additional cost for using advanced rate-based rules; standard AWS WAF pricing applies when you use this feature. For AWS WAF pricing information, see AWS WAF Pricing. Be aware, though, that using aggregation keys will increase AWS WAF web ACL capacity unit (WCU) usage for the rule. WCU usage is calculated based on how many keys you use for rate limiting. The current model of 2 WCUs plus any additional WCUs for a nested statement is being updated to 2 WCUs as a base, plus 30 WCUs for each custom aggregation key that you specify. For example, aggregation keys that combine an IP address with a session cookie will use 62 WCUs, and aggregation keys that combine an IP address, a session cookie, and a custom header will use 92 WCUs. For more details about the WCU-based cost structure, visit Rate-based rule statement in the AWS WAF Developer Guide.
Conclusion
In this blog post, you learned about AWS WAF enhancements to existing rate-based rules that now support request parameters in addition to IP addresses. Additionally, these enhancements allow you to create composite keys based on up to five request parameters. This new capability allows you to be either more coarse in aggregating requests (such as all the requests that have an IP reputation label associated with them) or finer (such as aggregate requests for a specific session ID, not its IP address).
For more rule examples that include JSON rule configuration, visit Rate-based rule examples in the AWS WAF Developer Guide.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Over the past few decades, digital technologies have brought tremendous benefits to our societies, governments, businesses, and everyday lives. However, the more we depend on them for critical applications, the more we must do so securely. The increasing reliance on these systems comes with a broad responsibility for society, companies, and governments.
At Amazon Web Services (AWS), every employee, regardless of their role, works to verify that security is an integral component of every facet of the business (see Security at AWS). This goes hand-in-hand with new cybersecurity-related regulations, such as the Directive on Measures for a High Common Level of Cybersecurity Across the Union (NIS 2), formally adopted by the European Parliament and the Council of the European Union (EU) in December 2022. NIS 2 will be transposed into the national laws of the EU Member States by October 2024, and aims to strengthen cybersecurity across the EU.
AWS is excited to help customers become more resilient, and we look forward to even closer cooperation with national cybersecurity authorities to raise the bar on cybersecurity across Europe. Building society’s trust in the online environment is key to harnessing the power of innovation for social and economic development. It’s also one of our core Leadership Principles: Success and scale bring broad responsibility.
Compliance with NIS 2
NIS 2 seeks to ensure that entities mitigate the risks posed by cyber threats, minimize the impact of incidents, and protect the continuity of essential and important services in the EU.
Besides increased cooperation between authorities and support for enhanced information sharing amongst covered entities, NIS 2 includes minimum requirements for cybersecurity risk management measures and reporting obligations, which are applicable to a broad range of AWS customers based on their sector. Examples of sectors that must comply with NIS 2 requirements are energy, transport, health, public administration, and digital infrastructures. For the full list of covered sectors, see Annexes I and II of NIS 2. Generally, the NIS 2 Directive applies to a wider pool of entities than those currently covered by the NIS Directive, including medium-sized enterprises, as defined in Article 2 of the Annex to Recommendation 2003/361/EC (over 50 employees or an annual turnover over €10 million).
In several countries, aspects of the AWS service offerings are already part of the national critical infrastructure. For example, in Germany, Amazon Elastic Compute Cloud (Amazon EC2) and Amazon CloudFront are in scope for the KRITIS regulation. For several years, AWS has fulfilled its obligations to secure these services, has run audits related to national critical infrastructure, and has established channels for exchanging security information with the German Federal Office for Information Security (BSI) KRITIS office. AWS is also part of the UP KRITIS initiative, a cooperative effort between industry and the German Government to set industry standards.
AWS will continue to support customers in implementing resilient solutions, in accordance with the shared responsibility model. Compliance efforts within AWS will include implementing the requirements of the act and setting out technical and methodological requirements for cloud computing service providers, to be published by the European Commission, as foreseen in Article 21 of NIS 2.
AWS cybersecurity risk management – Current status
Even before the introduction of NIS 2, AWS has been helping customers improve their resilience and incident response capacities. Our core infrastructure is designed to satisfy the security requirements of the military, global banks, and other highly sensitive organizations.
AWS provides information and communication technology services and building blocks that businesses, public authorities, universities, and individuals use to become more secure, innovative, and responsive to their own needs and the needs of their customers. Security and compliance remain a shared responsibility between AWS and the customer. We make sure that the AWS cloud infrastructure complies with applicable regulatory requirements and good practices for cloud providers, and customers remain responsible for building compliant workloads in the cloud.
In total, AWS supports or has obtained over 143 security standards compliance certifications and attestations around the globe, such as ISO 27001, ISO 22301, ISO 20000, ISO 27017, and System and Organization Controls (SOC) 2. The following are some examples of European certifications and attestations that we’ve achieved:
C5 — provides a wide-ranging control framework for establishing and evidencing the security of cloud operations in Germany.
ENS High — comprises principles for adequate protection applicable to government agencies and public organizations in Spain.
HDS — demonstrates an adequate framework for technical and governance measures to secure and protect personal health data, governed by French law.
Pinakes — provides a rating framework intended to manage and monitor the cybersecurity controls of service providers upon which Spanish financial entities depend.
These and other AWS Compliance Programs help customers understand the robust controls in place at AWS to help ensure the security and compliance of the cloud. Through dedicated teams, we’re prepared to provide assurance about the approach that AWS has taken to operational resilience and to help customers achieve assurance about the security and resiliency of their workloads. AWS Artifact provides on-demand access to these security and compliance reports and many more.
For security in the cloud, it’s crucial for our customers to make security by design and security by default central tenets of product development. To begin with, customers can use the AWS Well-Architected tool to help build secure, high-performing, resilient, and efficient infrastructure for a variety of applications and workloads. Customers that use the AWS Cloud Adoption Framework (AWS CAF) can improve cloud readiness by identifying and prioritizing transformation opportunities. These foundational resources help customers secure regulated workloads. AWS Security Hub provides customers with a comprehensive view of their security state on AWS and helps them check their environments against industry standards and good practices.
With regards to the cybersecurity risk management measures and reporting obligations that NIS 2 mandates, existing AWS service offerings can help customers fulfill their part of the shared responsibility model and comply with future national implementations of NIS 2. For example, customers can use Amazon GuardDuty to detect a set of specific threats to AWS accounts and watch out for malicious activity. Amazon CloudWatch helps customers monitor the state of their AWS resources. With AWS Config, customers can continually assess, audit, and evaluate the configurations and relationships of selected resources on AWS, on premises, and on other clouds. Furthermore, AWS Whitepapers, such as the AWS Security Incident Response Guide, help customers understand, implement, and manage fundamental security concepts in their cloud architecture.
At Amazon, we strive to be the world’s most customer-centric company. For AWS Security Assurance, that means having teams that continuously engage with authorities to understand and exceed regulatory and customer obligations on behalf of customers. This is just one way that we raise the security bar in Europe. At the same time, we recommend that national regulators carefully assess potentially conflicting, overlapping, or contradictory measures.
We also cooperate with cybersecurity agencies around the globe because we recognize the importance of their role in keeping the world safe. To that end, we have built the Global Cybersecurity Program (GCSP) to provide agencies with a direct and consistent line of communication to the AWS Security team. Two examples of GCSP members are the Dutch National Cyber Security Centrum (NCSC-NL), with whom we signed a cooperation in May 2023, and the Italian National Cybersecurity Agency (ACN). Together, we will work on cybersecurity initiatives and strengthen the cybersecurity posture across the EU. With the war in Ukraine, we have experienced how important such a collaboration can be. AWS has played an important role in helping Ukraine’s government maintain continuity and provide critical services to citizens since the onset of the war.
The way forward
At AWS, we will continue to provide key stakeholders with greater insights into how we help customers tackle their most challenging cybersecurity issues and provide opportunities to deep dive into what we’re building. We very much look forward to continuing our work with authorities, agencies and, most importantly, our customers to provide for the best solutions and raise the bar on cybersecurity and resilience across the EU and globally.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Data warehousing provides a business with several benefits, such as advanced business intelligence and data consistency. It plays a big role within an organization by helping to make the right strategic decision at the right moment, which can have a huge impact in a competitive market. One of the major and essential parts of a data warehouse is the extract, transform, and load (ETL) process, which extracts data from different sources, applies business rules and aggregations, and then makes the transformed data available to business users.
This process is always evolving to reflect new business and technical requirements, especially when working in a competitive market. Nowadays, more verification steps are applied to source data before processing it, which often adds administrative overhead. As a result, automatic notifications are increasingly required to accelerate data ingestion, facilitate monitoring, and provide accurate tracking of the process.
Amazon Redshift is a fast, fully managed, cloud data warehouse that allows you to process and run your complex SQL analytics workloads on structured and semi-structured data. It also helps you to securely access your data in operational databases, data lakes or third-party datasets with minimal movement or copying. AWS Step Functions is a fully managed service that gives you the ability to orchestrate and coordinate service components. Amazon S3 Event Notifications is an Amazon S3 feature that you can enable in order to receive notifications when specific events occur in your S3 bucket.
In this post, we discuss how to build and orchestrate, in a few steps, an ETL process for Amazon Redshift that uses Amazon S3 Event Notifications to automatically verify source data upon arrival and to send notifications in specific cases. We also show how to use AWS Step Functions to orchestrate the data pipeline. This can serve as a starting point for teams that want to build an event-driven data pipeline from data source to data warehouse, one that helps track each phase and respond to failures quickly. Alternatively, you can use Amazon Redshift auto-copy from Amazon S3 to simplify data loading from Amazon S3 into Amazon Redshift.
Solution overview
The workflow is composed of the following steps:
A Lambda function is triggered by an S3 event whenever a source file arrives at the S3 bucket. It does the necessary verifications and then classifies the file before processing by sending it to the appropriate Amazon S3 prefix (accepted or rejected).
There are two possibilities:
If the file is moved to the rejected Amazon S3 prefix, an Amazon S3 event sends a message to Amazon SNS for further notification.
If the file is moved to the accepted Amazon S3 prefix, an Amazon S3 event is triggered and sends a message with the file path to Amazon SQS.
An Amazon EventBridge scheduled event triggers the AWS Step Functions workflow.
The workflow executes a Lambda function that pulls out the messages from the Amazon SQS queue and generates a manifest file for the COPY command.
Once the manifest file is generated, the workflow starts the ETL process using a stored procedure.
The following image shows the workflow.
Prerequisites
Before configuring the solution, you can use the following AWS CloudFormation template to set up and create the infrastructure.
Give the stack a name, select a deployment VPC and define the master user for the Amazon Redshift cluster by filling in the two parameters MasterUserName and MasterUserPassword.
The template will create the following services:
An S3 bucket
An Amazon Redshift cluster composed of two ra3.xlplus nodes
An empty AWS Step Functions workflow
An Amazon SQS queue
An Amazon SNS topic
An Amazon EventBridge scheduled rule with a 5-minute rate
Two empty AWS Lambda functions
IAM roles and policies for the services to communicate with each other
The names of the created services are usually prefixed by the stack’s name or the word blogdemo. You can find the names of the created services in the stack’s resources tab.
Step 1: Configure Amazon S3 Event Notifications
Create the following four folders in the S3 bucket:
received
rejected
accepted
manifest
In this scenario, we will create the following three Amazon S3 event notifications:
Trigger an AWS Lambda function on the received folder.
Send a message to the Amazon SNS topic on the rejected folder.
Send a message to Amazon SQS on the accepted folder.
To create an Amazon S3 event notification:
Go to the bucket’s Properties tab.
In the Event Notifications section, select Create Event Notification.
Fill in the necessary properties:
Give the event a name.
Specify the appropriate prefix or folder (accepted/, rejected/ or received/).
Select All object create events as an event type.
Select the destination type (AWS Lambda, Amazon SNS, or Amazon SQS) and choose the destination. Note: for an AWS Lambda destination, choose the function that starts with ${stackname}-blogdemoVerify_%
At the end, you should have three Amazon S3 events:
An event for the received prefix with an AWS Lambda function as a destination type.
An event for the accepted prefix with an Amazon SQS queue as a destination type.
An event for the rejected prefix with an Amazon SNS topic as a destination type.
The following image shows what you should have after creating the three Amazon S3 events:
Step 2: Create objects in Amazon Redshift
Connect to the Amazon Redshift cluster and create the following objects:
Three schemas:
create schema blogdemo_staging; -- for staging tables
create schema blogdemo_core; -- for target tables
create schema blogdemo_proc; -- for stored procedures
A table in the blogdemo_staging and blogdemo_core schemas:
create table ${schemaname}.rideshare
(
id_ride bigint not null,
date_ride timestamp not null,
country varchar (20),
city varchar (20),
distance_km smallint,
price decimal (5,2),
feedback varchar (10)
) distkey(id_ride);
A stored procedure to extract and load data into the target schema:
create or replace procedure blogdemo_proc.elt_rideshare (bucketname in varchar(200),manifestfile in varchar (500))
as $$
begin
-- purge staging table
truncate blogdemo_staging.rideshare;
-- copy data from s3 bucket to staging schema
execute 'copy blogdemo_staging.rideshare from ''s3://' || bucketname || '/' || manifestfile || ''' iam_role default delimiter ''|'' manifest;';
-- apply transformation rules here
-- insert data into target table
insert into blogdemo_core.rideshare
select * from blogdemo_staging.rideshare;
end;
$$ language plpgsql;
Set the role ${stackname}-blogdemoRoleRedshift_% as a default role:
In the Amazon Redshift console, go to clusters and click on the cluster blogdemoRedshift%.
Go to the Properties tab.
In the Cluster permissions section, select the role ${stackname}-blogdemoRoleRedshift%.
Click on Set default then Make default.
Step 3: Configure Amazon SQS queue
The Amazon SQS queue can be used as is, with the default values. The only thing you need to do for this demo is go to the created queue ${stackname}-blogdemoSQS% and purge any test messages generated by the Amazon S3 event configuration. Copy its URL to a text file for later use (specifically, in one of the AWS Lambda functions).
Step 4: Setup Amazon SNS topic
In the Amazon SNS console, go to the topic ${stackname}-blogdemoSNS%
Click on the Create subscription button.
Choose the blogdemo topic ARN, email protocol, type your email and then click on Create subscription.
Confirm your subscription in your email that you received.
Step 5: Customize the AWS Lambda functions
The following code verifies the name of a file. If the file respects the naming convention, the code moves it to the accepted folder; if not, it moves it to the rejected folder. Copy it to the AWS Lambda function ${stackname}-blogdemoLambdaVerify and then deploy it:
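Here is a minimal sketch of such a verification function. The regular expression encodes a hypothetical naming convention (ridesharedata followed by two digits), so replace it with the convention your source systems actually use.

import re
import urllib.parse

import boto3

s3 = boto3.client("s3")

# Hypothetical naming convention for this demo; adjust to your own.
VALID_NAME = re.compile(r"^ridesharedata\d{2}\.csv$")


def lambda_handler(event, context):
    """Classify each incoming file as accepted or rejected based on its name."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        filename = key.rsplit("/", 1)[-1]

        target_prefix = "accepted" if VALID_NAME.match(filename) else "rejected"

        # Move the file out of the received prefix into accepted/ or rejected/
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=f"{target_prefix}/{filename}",
        )
        s3.delete_object(Bucket=bucket, Key=key)

    return {"statusCode": 200}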
The second AWS Lambda function, ${stackname}-blogdemoLambdaGenerate%, retrieves the messages from the Amazon SQS queue and generates and stores a manifest file in the S3 bucket's manifest folder. Copy the following content, replace the variable ${sqs_url} with the value retrieved in Step 3, and then choose Deploy.
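A minimal sketch of this manifest-generating function might look like the following. The queue URL and bucket name are placeholders (the ${sqs_url} value comes from Step 3), and the manifest uses the standard Amazon Redshift COPY manifest format.

import json
import os

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

# Placeholders: the queue URL you noted in Step 3 and the bucket created by the stack.
SQS_URL = os.environ.get("SQS_URL", "${sqs_url}")
BUCKET = os.environ.get("BUCKET_NAME", "my-blogdemo-bucket")


def lambda_handler(event, context):
    """Drain the SQS queue and write a COPY manifest listing the accepted files."""
    entries = []
    while True:
        messages = sqs.receive_message(
            QueueUrl=SQS_URL, MaxNumberOfMessages=10
        ).get("Messages", [])
        if not messages:
            break
        for message in messages:
            body = json.loads(message["Body"])
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                key = record["s3"]["object"]["key"]
                entries.append({"url": f"s3://{bucket}/{key}", "mandatory": True})
            sqs.delete_message(QueueUrl=SQS_URL, ReceiptHandle=message["ReceiptHandle"])

    manifest_key = "manifest/rideshare.manifest"
    s3.put_object(Bucket=BUCKET, Key=manifest_key, Body=json.dumps({"entries": entries}))
    return {"manifestfile": manifest_key, "files": len(entries)}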
Step 6: Add tasks to the AWS Step Functions workflow
Create the following workflow in the state machine ${stackname}-blogdemoStepFunctions%.
If you would like to accelerate this step, you can drag and drop the content of the following JSON file in the definition part when you click on Edit. Make sure to replace the three variables:
${GenerateManifestFileFunctionName} by the ${stackname}-blogdemoLambdaGenerate% arn.
${RedshiftClusterIdentifier} by the Amazon Redshift cluster identifier.
${MasterUserName} by the username that you defined while deploying the CloudFormation template.
Step 7: Enable Amazon EventBridge rule
Enable the rule and add the AWS Step Functions workflow as a rule target:
Go to the Amazon EventBridge console.
Select the rule created by the Amazon CloudFormation template and click on Edit.
Enable the rule and click Next.
You can change the rate if you want. Then select Next.
Add the AWS Step Functions state machine created by the CloudFormation template, blogdemoStepFunctions%, as a target, and use the existing role created by the CloudFormation template, ${stackname}-blogdemoRoleEventBridge%.
Click on Next and then Update rule.
Test the solution
In order to test the solution, the only thing you should do is upload some csv files in the received prefix of the S3 bucket. Here are some sample data; each file contains 1000 rows of rideshare data.
If you upload all of the files at once, you should receive an email, because the file ridesharedata2022.csv does not respect the naming convention. The other three files will be loaded into the target table blogdemo_core.rideshare. You can check the Step Functions workflow to verify that the process finished successfully.
Clean up
Go to the Amazon EventBridge console and delete the rule ${stackname}-blogdemoevenbridge%.
In the Amazon S3 console, select the bucket created by the CloudFormation template ${stackname}-blogdemobucket% and click on Empty.
Go to Subscriptions in the Amazon SNS console and delete the subscription created in Step 4.
In the AWS CloudFormation console, select the stack and delete it.
Conclusion
In this post, we showed how different AWS services can be easily implemented together in order to create an event-driven architecture and automate its data pipeline, which targets the cloud data warehouse Amazon Redshift for business intelligence applications and complex queries.
About the Author
Ziad WALI is an Acceleration Lab Solutions Architect at Amazon Web Services. He has over 10 years of experience in databases and data warehousing where he enjoys building reliable, scalable and efficient solutions. Outside of work, he enjoys sports and spending time in nature.
Welcome to another blog post from the AWS Customer Incident Response Team (CIRT)! For this post, we’re looking at two events that the team was involved in from the viewpoint of a regularly discussed but sometimes misunderstood subject, least privilege. Specifically, we consider the idea that the benefit of reducing permissions in real-life use cases does not always require using the absolute minimum set of privileges. Instead, you need to weigh the cost and effort of creating and maintaining privileges against the risk reduction that is achieved, to make sure that your permissions are appropriate for your needs.
To quote VP and Distinguished Engineer at Amazon Security, Eric Brandwine, “Least privilege equals maximum effort.” This is the idea that creating and maintaining the smallest possible set of privileges needed to perform a given task will require the largest amount of effort, especially as customer needs and service features change over time. However, the correlation between effort and permission reduction is not linear. So, the question you should be asking is: How do you balance the effort of privilege reduction with the benefits it provides?
Unfortunately, this is not an easy question to answer. You need to consider the likelihood of an unwanted issue happening, the impact if that issue did happen, and the cost and effort to prevent (or detect and recover from) that issue. You also need to factor considerations such as your business goals and regulatory requirements into your decision process. Of course, you won't need to consider just one potential issue, but many. Often it can be useful to start with a rough set of permissions and refine them as you develop your knowledge of what security level is required. You can also use service control policies (SCPs) to provide a set of permission guardrails if you're using AWS Organizations. In this post, we tell two real-world stories where limiting AWS Identity and Access Management (IAM) permissions worked by limiting the impact of a security event, but where the permission set did not involve maximum effort.
Story 1: On the hunt for credentials
In this AWS CIRT story, we see how a threat actor was unable to achieve their goal because the access they obtained — a database administrator’s — did not have the IAM permissions they were after.
Background and AWS CIRT engagement
A customer came to us after they discovered unauthorized activity in their on-premises systems and in some of their AWS accounts. They had incident response capability and were looking for an additional set of hands with AWS knowledge to help them with their investigation. This helped to free up the customer’s staff to focus on the on-premises analysis.
Before our engagement, the customer had already performed initial containment activities. This included rotating credentials, revoking temporary credentials, and isolating impacted systems. They also had a good idea of which federated user accounts had been accessed by the threat actor.
The key part of every AWS CIRT engagement is the customer’s ask. Everything our team does falls on the customer side of the AWS Shared Responsibility Model, so we want to make sure that we are aligned to the customer’s desired outcome. The ask was clear—review the potential unauthorized federated users’ access, and investigate the unwanted AWS actions that were taken by those users during the known timeframe. To get a better idea of what was “unwanted,” we talked to the customer to understand at a high level what a typical day would entail for these users, to get some context around what sort of actions would be expected. The users were primarily focused on working with Amazon Relational Database Service (RDS).
Analysis and findings
For this part of the story, we’ll focus on a single federated user whose apparent actions we investigated, because the other federated users had not been impersonated by the threat actor in a meaningful way. We knew the approximate start and end dates to focus on and had discovered that the threat actor had performed a number of unwanted actions.
After reviewing the actions, it was clear that the threat actor had performed a console sign-in on three separate occasions within a 2-hour window. Each time, the threat actor attempted to perform a subset of the following actions:
Note: This list includes only the actions that are displayed as readOnly = false in AWS CloudTrail, because these actions are often (but not always) the more impactful ones, with the potential to change the AWS environment.
This is the point where the benefit of permission restriction became clear. As soon as this list was compiled, we noticed that two fields were present for all of the actions listed:
"errorCode": "Client.UnauthorizedOperation",
"errorMessage": "You are not authorized to perform this operation. [rest of message]"
As this reveals, every single non-readOnly action that was attempted by the threat actor was denied because the federated user account did not have the required IAM permissions.
Customer communication and result
After we confirmed the findings, we had a call with the customer to discuss the results. As you can imagine, they were happy that the results showed no material impact to their data, and said no further investigation or actions were required at that time.
What were the IAM permissions the federated user had, which prevented the set of actions the threat actor attempted?
The answer did not actually involve the absolute minimal set of permissions required by the user to do their job. It’s simply that the federated user had a role that didn’t have an Allow statement for the IAM actions the threat actor tried — their job did not require them. Without an explicit Allow statement, the IAM actions attempted were denied because IAM policies are Deny by default. In this instance, simply not having the desired IAM permissions meant that the threat actor wasn’t able to achieve their goal, and stopped using the access. We’ll never know what their goal actually was, but trying to create access keys, passwords, and update policies means that a fair guess would be that they were attempting to create another way to access that AWS account.
Story 2: More instances for crypto mining
In this AWS CIRT story, we see how a threat actor’s inability to create additional Amazon Elastic Compute Cloud (Amazon EC2) instances resulted in the threat actor leaving without achieving their goal.
Background and AWS CIRT engagement
Because this account was new and currently only used for testing their software, the customer saw that the GuardDuty detection was related to the Amazon ECS cluster and decided to delete all the resources in the account and rebuild. Not too long after doing this, they received a similar GuardDuty alert for the new Amazon ECS cluster they had set up. The second finding resulted in the customer's security team and AWS being brought in to try to understand what was causing this. The customer's security team focused on reviewing the tasks that were being run on the cluster, while AWS CIRT reviewed the AWS account actions and provided insight about the GuardDuty finding and what could have caused it.
Analysis and findings
Working with the customer, it wasn’t long before we discovered that the third-party Amazon ECS task definition that the customer was using was unintentionally exposing a web interface to the internet. That interface allowed unauthenticated users to run commands on the system. This explained why the same alert was received again shortly after the new install had been completed.
This is where the story takes a turn for the better. The AWS CIRT analysis of the AWS CloudTrail logs found that there were a number of attempts to use the credentials of the Task IAM role that was associated with the Amazon ECS task. The majority of actions were attempting to launch multiple Amazon EC2 instances via RunInstances calls. Every one of these actions, along with the other actions attempted, failed with either a Client.UnauthorizedOperation or an AccessDenied error message.
Next, we worked with the customer to understand the permissions provided by the Task IAM role. Once again, the permissions could have been limited more tightly. However, this time the goal of the threat actor — running a number of Amazon EC2 instances (most likely for surreptitious crypto mining) — did not align with the policy given to the role:
AWS CIRT recommended creating policies to restrict the allowed actions further, providing specific resources where possible, and potentially also adding in some conditions to limit other aspects of the access (such as the two Condition keys launched recently to limit where Amazon EC2 instance credentials can be used from). However, simply having the policy limit access to Amazon Simple Storage Service (Amazon S3) meant that the threat actor decided to leave with just the one Amazon ECS task running crypto mining rather than a larger number of Amazon EC2 instances.
Customer communication and result
After reporting these findings to the customer, there were two clear next steps: First, remove the now unwanted and untrusted Amazon ECS resource from their AWS account. Second, review and re-architect the Amazon ECS task so that access to the web interface was only provided to appropriate users. As part of that re-architecting, an Amazon S3 policy similar to “Allows read and write access to objects in an S3 bucket” was recommended. This separates Amazon S3 bucket actions from Amazon S3 object actions. When applications have a need to read and write objects in Amazon S3, they don’t normally have a need to create or delete entire buckets (or versioning on those buckets).
Some tools to help
We’ve just looked at how limiting privileges helped during two different security events. Now, let’s consider what can help you decide how to reduce your IAM permissions to an appropriate level. There are a number of resources that talk about different approaches:
The first approach is to use Access Analyzer to help generate IAM policies based on access activity from log data. This can then be refined further with the addition of Condition elements as desired. We already have a couple of blog posts about that exact topic:
The third approach is a manual method of creating and refining policies to reduce the amount of work required. For this, you can begin with an appropriate AWS managed IAM policy or an AWS provided example policy as a starting point. Following this, you can add or remove Actions, Resources, and Conditions — using wildcards as desired — to balance your effort and permission reduction.
An example of balancing effort and permission reduction is in the IAM tutorial Create and attach your first customer managed policy. In it, the authors create a policy that uses iam:Get* and iam:List* in the Actions section. Although not all iam:Get* and iam:List* actions may be required, this is a good way to group similar actions together while excluding actions that allow unwanted access, for example, iam:Create* or iam:Delete*. Another example of this balancing was mentioned earlier in relation to Amazon S3: allowing access to create, delete, and read objects, but not to change the configuration of the bucket those objects are in.
In addition to limiting permissions, we also recommend that you set up appropriate detection and response capability. This will enable you to know when an issue has occurred and provide the tools to contain and recover from the issue. Further details about performing incident response in an AWS account can be found in the AWS Security Incident Response Guide.
There are also two services that were used to help in the stories we presented here — Amazon GuardDuty and AWS CloudTrail. GuardDuty is a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity. It’s a great way to monitor for unwanted activity within your AWS accounts. CloudTrail records account activity across your AWS infrastructure and provides the logs that were used for the analysis that AWS CIRT performed for both these stories. Making sure that both of these are set up correctly is a great first step towards improving your threat detection and incident response capability in AWS.
Conclusion
In this post, we looked at two examples where limiting privilege provided positive results during a security event. In the second case, the policy used should probably have restricted permissions further, but even as it stood, it was an effective preventative control in stopping the unauthorized user from achieving their assumed goal.
Hopefully these stories provide new insight into the way your organization thinks about setting permissions, while taking into account the effort of creating them. These stories are a good example of how starting a journey towards least privilege can help stop unauthorized users. Neither of the scenarios had policies that were least privilege, but the policies were restrictive enough that the unauthorized users were prevented from achieving their goals this time, resulting in minimal impact to the customers. However, in both cases, AWS CIRT recommended further reducing the scope of the IAM policies being used.
Finally, we provided a few ways to go about reducing permissions—first, by using tools to assist with policy creation, and second, by editing existing policies so they better fit your specific needs. You can get started by checking your existing policies against what Access Analyzer would recommend, by looking for and removing overly permissive wildcard characters (*) in some of your existing IAM policies, or by implementing and refining your SCPs.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
To improve a Spark application’s efficiency, it’s essential to monitor its performance and behavior. In this post, we demonstrate how to publish detailed Spark metrics from Amazon EMR to Amazon CloudWatch. This will give you the ability to identify bottlenecks while optimizing resource utilization.
CloudWatch provides a robust, scalable, and cost-effective monitoring solution for AWS resources and applications, with powerful customization options and seamless integration with other AWS services. By default, Amazon EMR sends basic metrics to CloudWatch to track the activity and health of a cluster. Spark’s configurable metrics system allows metrics to be collected in a variety of sinks, including HTTP, JMX, and CSV files, but additional configuration is required to enable Spark to publish metrics to CloudWatch.
Solution overview
This solution includes Spark configuration to send metrics to a custom sink. The custom sink collects only the metrics defined in a Metricfilter.json file and uses the CloudWatch agent to publish them to a custom CloudWatch namespace. The included bootstrap action script installs and configures the CloudWatch agent and the metric library on the Amazon Elastic Compute Cloud (Amazon EC2) EMR instances. A CloudWatch dashboard can provide instant insight into the performance of an application.
The following diagram illustrates the solution architecture and workflow.
The workflow includes the following steps:
Users start a Spark EMR job, creating a step on the EMR cluster. With Apache Spark, the workload is distributed across the different nodes of the EMR cluster.
In each node (EC2 instance) of the cluster, a Spark library captures and pushes metric data to a CloudWatch agent, which aggregates the metric data before pushing it to CloudWatch every 30 seconds.
Users can view the metrics by accessing the custom namespace on the CloudWatch console.
We provide an AWS CloudFormation template in this post as a general guide. The template demonstrates how to configure a CloudWatch agent on Amazon EMR to push Spark metrics to CloudWatch. You can review and customize it as needed to include your Amazon EMR security configurations. As a best practice, we recommend including your Amazon EMR security configurations in the template to encrypt data in transit.
You should also be aware that some of the resources deployed by this stack incur costs when they remain in use. Additionally, EMR metrics don’t incur CloudWatch costs. However, custom metrics incur charges based on CloudWatch metrics pricing. For more information, see Amazon CloudWatch Pricing.
In the next sections, we go through the following steps:
Create and upload the metrics library, installation script, and filter definition to an Amazon Simple Storage Service (Amazon S3) bucket.
Use the CloudFormation template to create the following resources:
Default IAM service roles for Amazon EMR permissions to AWS services and resources. You can create these roles with the aws emr create-default-roles command in the AWS Command Line Interface (AWS CLI).
An optional EC2 key pair, if you plan to connect to your cluster through SSH rather than Session Manager, a capability of AWS Systems Manager.
Define the required metrics
To avoid sending unnecessary data to CloudWatch, our solution implements a metric filter. Review the Spark documentation to get acquainted with the namespaces and their associated metrics. Determine which metrics are relevant to your specific application and performance goals. Different applications may require different metrics to monitor, depending on the workload, data processing requirements, and optimization objectives. The metric names you’d like to monitor should be defined in the Metricfilter.json file, along with their associated namespaces.
We have created an example Metricfilter.json definition, which includes capturing metrics related to data I/O, garbage collection, memory and CPU pressure, and Spark job, stage, and task metrics.
Note that certain metrics are not available in all Spark release versions (for example, appStatus was introduced in Spark 3.0).
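As a rough illustration, the following sketch builds a small filter definition and writes it to a local Metricfilter.json file, ready to upload in the next step. The exact schema is dictated by the custom metrics library in this solution, so treat these namespaces and metric names as examples to adapt rather than an authoritative list.

import json

# Illustrative filter: a few Spark metrics grouped by namespace.
metric_filter = {
    "Namespaces": [
        {"Namespace": "appStatus", "Metrics": ["jobs.succeededJobs", "jobs.failedJobs"]},
        {"Namespace": "ExecutorMetrics", "Metrics": ["JVMHeapMemory", "MajorGCTime"]},
    ]
}

# Write the definition to Metricfilter.json for upload to your S3 bucket.
with open("Metricfilter.json", "w") as f:
    json.dump(metric_filter, f, indent=2)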
Create and upload the required files to an S3 bucket
Choose Upload, and take note of the S3 URIs for the files.
Provision resources with the CloudFormation template
Choose Launch Stack to launch a CloudFormation stack in your account and deploy the template:
This template creates an IAM role, IAM instance profile, EMR cluster, and CloudWatch dashboard. The cluster starts a basic Spark example application. You will be billed for the AWS resources used if you create a stack from this template.
The CloudFormation wizard will ask you to modify or provide these parameters:
InstanceType – The type of instance for all instance groups. The default is m5.2xlarge.
InstanceCountCore – The number of instances in the core instance group. The default is 4.
BootstrapScriptPath – The S3 path of the installer.sh installation bootstrap script that you copied earlier.
MetricFilterPath – The S3 path of your Metricfilter.json definition that you copied earlier.
MetricsLibraryPath – The S3 path of your CloudWatch emr-custom-cw-sink-0.0.1.jar library that you copied earlier.
CloudWatchNamespace – The name of the custom CloudWatch namespace to be used.
SparkDemoApplicationPath – The S3 path of your examplejob.sh script that you copied earlier.
Subnet – The EC2 subnet where the cluster launches. You must provide this parameter.
EC2KeyPairName – An optional EC2 key pair for connecting to cluster nodes, as an alternative to Session Manager.
View the metrics
After the CloudFormation stack deploys successfully, the example job starts automatically and takes approximately 15 minutes to complete. On the CloudWatch console, choose Dashboards in the navigation pane. Then filter the list by the prefix SparkMonitoring.
The example dashboard includes information on the cluster and an overview of the Spark jobs, stages, and tasks. Metrics are also available under a custom namespace starting with EMRCustomSparkCloudWatchSink.
Memory, CPU, I/O, and additional task distribution metrics are also included.
Finally, detailed Java garbage collection metrics are available per executor.
Clean up
To avoid future charges in your account, delete the resources you created in this walkthrough. The EMR cluster will incur charges as long as the cluster is active, so stop it when you’re done. Complete the following steps:
On the CloudFormation console, in the navigation pane, choose Stacks.
Choose the stack you launched (EMR-CloudWatch-Demo), then choose Delete.
Now that you have completed the steps in this walkthrough, the CloudWatch agent is running on your cluster hosts and configured to push Spark metrics to CloudWatch. With this feature, you can effectively monitor the health and performance of your Spark jobs running on Amazon EMR, detecting critical issues in real time and identifying root causes quickly.
You can package and deploy this solution through a CloudFormation template like this example template, which creates the IAM instance profile role, CloudWatch dashboard, and EMR cluster. The source code for the library is available on GitHub for customization.
To take this further, consider using these metrics in CloudWatch alarms. You could collect them with other alarms into a composite alarm or configure alarm actions such as sending Amazon Simple Notification Service (Amazon SNS) notifications to trigger event-driven processes such as AWS Lambda functions.
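For example, a minimal boto3 sketch of an alarm on one of the published Spark metrics could look like the following. The namespace, metric name, dimension, and SNS topic ARN are placeholders, so point them at a metric you actually see under your custom namespace.

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever any Spark tasks fail within a 5-minute window (placeholder values).
cloudwatch.put_metric_alarm(
    AlarmName="spark-failed-tasks",
    Namespace="EMRCustomSparkCloudWatchSink",            # custom namespace from the stack
    MetricName="failedTasks",                            # placeholder metric name
    Dimensions=[{"Name": "ClusterId", "Value": "j-XXXXXXXXXXXXX"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:spark-alerts"],  # hypothetical SNS topic
)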
About the Author
Le Clue Lubbe is a Principal Engineer at AWS. He works with our largest enterprise customers to solve some of their most complex technical problems. He drives broad solutions through innovation to impact and improve the life of our customers.
At AWS, we often hear from customers that they want expanded security coverage for the multiple services that they use on AWS. However, alert fatigue is a common challenge that customers face as we introduce new security protections. The challenge becomes how to operationalize, identify, and prioritize alerts that represent real risk.
In this post, we highlight recent enhancements to Amazon Detective finding groups visualizations. We show you how Detective automatically consolidates multiple security findings into a single security event, called a finding group, and how finding group visualizations help reduce noise and prioritize findings that present true risk. We incorporate additional services like Amazon GuardDuty, Amazon Inspector, and AWS Security Hub to highlight how effective finding groups are at consolidating findings from different AWS security services.
Overview of solution
This post uses several different services. The purpose is twofold: to show how you can enable these services for broader protection, and to show how Detective can help you investigate findings from multiple services without spending a lot of time sifting through logs or querying multiple data sources to find the root cause of a security event. These are the services and their use cases:
GuardDuty – a threat detection service that continuously monitors your AWS accounts and workloads for malicious activity. If potential malicious activity, such as anomalous behavior, credential exfiltration, or command and control (C2) infrastructure communication is detected, GuardDuty generates detailed security findings that you can use for visibility and remediation. Recently, GuardDuty released the following threat detections for specific services that we’ll show you how to enable for this walkthrough: GuardDuty RDS Protection, EKS Runtime Monitoring, and Lambda Protection.
Amazon Inspector – an automated vulnerability management service that continually scans your AWS workloads for software vulnerabilities and unintended network exposure. Like GuardDuty, Amazon Inspector sends a finding for alerting and remediation when it detects a software vulnerability or a compute instance that’s publicly available.
Security Hub – a cloud security posture management service that performs automated, continuous security best practice checks against your AWS resources to help you identify misconfigurations, and aggregates your security findings from integrated AWS security services.
Detective – a security service that helps you investigate potential security issues. It does this by collecting log data from AWS CloudTrail, Amazon Virtual Private Cloud (Amazon VPC) flow logs, and other services. Detective then uses machine learning, statistical analysis, and graph theory to build a linked set of data called a security behavior graph that you can use to conduct faster and more efficient security investigations.
The following diagram shows how each service delivers findings along with log sources to Detective.
Figure 1: Amazon Detective log source diagram
Enable the required services
If you’ve already enabled the services needed for this post—GuardDuty, Amazon Inspector, Security Hub, and Detective—skip to the next section. For instructions on how to enable these services, see the following resources:
Each of these services offers a free 30-day trial and provides estimates on charges after your trial expires. You can also use the AWS Pricing Calculator to get an estimate.
To enable the services across multiple accounts, consider using a delegated administrator account in AWS Organizations. With a delegated administrator account, you can automatically enable services for multiple accounts and manage settings for each account in your organization. You can view other accounts in the organization and add them as member accounts, making central management simpler. For instructions on how to enable the services with AWS Organizations, see the following resources:
The next step is to enable the latest detections in GuardDuty and learn how Detective can identify multiple threats that are related to a single security event.
If you’ve already enabled the different GuardDuty protection plans, skip to the next section. If you recently enabled GuardDuty, the protection plans are enabled by default, except for EKS Runtime Monitoring, which is a two-step process.
For the next steps, we use the delegated administrator account in GuardDuty to make sure that the protection plans are enabled for each AWS account. When you use GuardDuty (or Security Hub, Detective, and Inspector) with AWS Organizations, you can designate an account to be the delegated administrator. This is helpful so that you can configure these security services for multiple accounts at the same time. For instructions on how to enable a delegated administrator account for GuardDuty, see Managing GuardDuty accounts with AWS Organizations.
To enable EKS Protection
Sign in to the GuardDuty console using the delegated administrator account, choose Protection plans, and then choose EKS Protection.
In the Delegated administrator section, choose Edit and then choose Enable for each scope of protection. For this post, select EKS Audit Log Monitoring, EKS Runtime Monitoring, and Manage agent automatically, as shown in Figure 2. For more information on each feature, see the following resources:
To enable these protections for current accounts, in the Active member accounts section, choose Edit and Enable for each scope of protection.
To enable these protections for new accounts, in the New account default configuration section, choose Edit and Enable for each scope of protection.
To enable RDS Protection
The next step is to enable RDS Protection. GuardDuty RDS Protection works by analyzing RDS login activity for potential threats to your Amazon Aurora databases (Aurora MySQL-Compatible Edition and Aurora PostgreSQL-Compatible Edition). Using this feature, you can identify potentially suspicious login behavior and then use Detective to investigate CloudTrail logs, VPC flow logs, and other useful information around those events.
Navigate to the RDS Protection menu and under Delegated administrator (this account), select Enable and Confirm.
In the Enabled for section, select Enable all if you want RDS Protection enabled on all of your accounts. If you want to select a specific account, choose Manage Accounts and then select the accounts for which you want to enable RDS Protection. With the accounts selected, choose Edit Protection Plans, RDS Login Activity, and Enable for X selected account.
(Optional) For new accounts, turn on Auto-enable RDS Login Activity Monitoring for new member accounts as they join your organization.
Figure 2: Enable EKS Runtime Monitoring
To enable Lambda Protection
The final step is to enable Lambda Protection. Lambda Protection helps detect potential security threats during the invocation of AWS Lambda functions. By monitoring network activity logs, GuardDuty can generate findings when Lambda functions are involved with malicious activity, such as communicating with command and control servers.
Navigate to the Lambda Protection menu and under Delegated administrator (this account), select Enable and Confirm.
In the Enabled for section, select Enable all if you want Lambda Protection enabled on all of your accounts. If you want to select a specific account, choose Manage Accounts and select the accounts for which you want to enable Lambda Protection. With the accounts selected, choose Edit Protection Plans, Lambda Network Activity Monitoring, and Enable for X selected account.
(Optional) For new accounts, turn on Auto-enable Lambda Network Activity Monitoring for new member accounts as they join your organization.
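If you manage many accounts and prefer to script these settings, the same protection plans can be turned on through the GuardDuty API. The following is a minimal boto3 sketch, not a complete rollout script: it assumes you run it with delegated administrator credentials, and the detector lookup and feature names reflect the current SDK.

import boto3

guardduty = boto3.client("guardduty")

# Look up the detector in the delegated administrator account
detector_id = guardduty.list_detectors()["DetectorIds"][0]

# Enable EKS audit logs, EKS runtime monitoring (with the managed agent),
# RDS login activity monitoring, and Lambda network activity monitoring
guardduty.update_detector(
    DetectorId=detector_id,
    Features=[
        {"Name": "EKS_AUDIT_LOGS", "Status": "ENABLED"},
        {
            "Name": "EKS_RUNTIME_MONITORING",
            "Status": "ENABLED",
            "AdditionalConfiguration": [
                {"Name": "EKS_ADDON_MANAGEMENT", "Status": "ENABLED"}
            ],
        },
        {"Name": "RDS_LOGIN_EVENTS", "Status": "ENABLED"},
        {"Name": "LAMBDA_NETWORK_LOGS", "Status": "ENABLED"},
    ],
)

To apply the same features to existing member accounts or to new accounts as they join the organization, you can use the UpdateMemberDetectors and UpdateOrganizationConfiguration APIs, which accept the same feature names.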
Now that you’ve enabled these new protections, GuardDuty will start monitoring EKS audit logs, EKS runtime activity, RDS login activity, and Lambda network activity. If GuardDuty detects suspicious or malicious activity for these log sources or services, it will generate a finding for the activity, which you can review in the GuardDuty console. In addition, you can automatically forward these findings to Security Hub for consolidation, and to Detective for security investigation.
Detective data sources
If you have Security Hub and other AWS security services such as GuardDuty or Amazon Inspector enabled, findings from these services are forwarded to Security Hub. With the exception of sensitive data findings from Amazon Macie, you’re automatically opted in to other AWS service integrations when you enable Security Hub. For the full list of services that forward findings to Security Hub, see Available AWS service integrations.
With each service enabled and forwarding findings to Security Hub, the next step is to enable the data source in Detective called AWS security findings, which are the findings forwarded to Security Hub. Again, we’re going to use the delegated administrator account for these steps to make sure that AWS security findings are being ingested for your accounts.
To enable AWS security findings
Sign in to the Detective console using the delegated administrator account and navigate to Settings and then General.
Choose Optional source packages, Edit, select AWS security findings, and then choose Save.
Figure 5: Enable AWS security findings
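You can also enable this optional source package programmatically. The following is a minimal boto3 sketch, assuming delegated administrator credentials and a single behavior graph; the package name reflects the current SDK.

import boto3

detective = boto3.client("detective")

# Look up the behavior graph administered by this account
graph_arn = detective.list_graphs()["GraphList"][0]["Arn"]

# Turn on the optional AWS security findings source package
detective.update_datasource_packages(
    GraphArn=graph_arn,
    DatasourcePackages=["ASFF_SECURITYHUB_FINDING"],
)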
When you enable Detective, it immediately starts creating a security behavior graph for AWS security findings to build a linked dataset between findings and entities, such as RDS login activity from Aurora databases, EKS runtime activity, and suspicious network activity for Lambda functions. For GuardDuty to detect potential threats that affect your database instances, it first needs to undertake a learning period of up to two weeks to establish a baseline of normal behavior. For more information, see How RDS Protection uses RDS login activity monitoring. For the other protections, after suspicious activity is detected, you can start to see findings in both GuardDuty and Security Hub consoles. This is where you can start using Detective to better understand which findings are connected and where to prioritize your investigations.
Detective behavior graph
As Detective ingests data from GuardDuty, Amazon Inspector, and Security Hub, as well as CloudTrail logs, VPC flow logs, and Amazon Elastic Kubernetes Service (Amazon EKS) audit logs, it builds a behavior graph database. Graph databases are purpose-built to store and navigate relationships. Relationships are first-class citizens in graph databases, which means that they’re not computed out-of-band or inferred by querying foreign keys. Because Detective stores information on relationships in your graph database, you can effectively answer questions such as “are these security findings related?”. In Detective, you can use the search menu and profile panels to view these connections, but a quicker way to see this information is by using finding groups visualizations.
Finding groups visualizations
Finding groups extract additional information out of the behavior graph to highlight findings that are highly connected. Detective does this by running several machine learning algorithms across your behavior graph to identify related findings and then statistically weights the relationships between those findings and entities. The result is a finding group that shows GuardDuty and Amazon Inspector findings that are connected, along with entities like Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS accounts, and AWS Identity and Access Management (IAM) roles and sessions that were impacted by these findings. With finding groups, you can more quickly understand the relationships between multiple findings and their causes because you don’t need to connect the dots on your own. Detective automatically does this and presents a visualization so that you can see the relationships between various entities and findings.
Enhanced visualizations
Recently, we released several enhancements to finding groups visualizations to aid your understanding of security connections and root causes. These enhancements include:
Dynamic legend – the legend now shows icons for entities that you have in the finding group instead of showing all available entities. This helps reduce noise to only those entities that are relevant to your investigation.
Aggregated evidence and finding icons – these icons provide a count of similar evidence and findings. Instead of seeing the same finding or evidence repeated multiple times, you’ll see one icon with a counter to help reduce noise.
More descriptive side panel information – when you choose a finding or entity, the side panel shows additional information, such as the service that identified the finding and the finding title, in addition to the finding type, to help you understand the action that invoked the finding.
Label titles – you can now turn on or off titles for entities and findings in the visualization so that you don’t have to choose each to get a summary of what the different icons mean.
To use the finding groups visualization
Open the Detective console, choose Summary, and then choose View all finding groups.
Choose the title of an available finding group and scroll down to Visualization.
Under the Select layout menu, choose one of the layouts available, or choose and drag each icon to rearrange the layout according to how you’d like to see connections.
For a complete list of involved entities and involved findings, scroll down below the visualization.
Figure 6 shows an example of how you can use finding groups visualization to help identify the root cause of findings quickly. In this example, an IAM role was connected to newly observed geolocations, multiple GuardDuty findings detected malicious API calls, and there were newly observed user agents from the IAM session. The visualization can give you high confidence that the IAM role is compromised. It also provides other entities that you can search against, such as the IP address, S3 bucket, or new user agents.
Figure 6: Finding groups visualization
Now that you have the new GuardDuty protections enabled along with the data source of AWS security findings, you can use finding groups to more quickly visualize which IAM sessions have had multiple findings associated with unauthorized access, or which EC2 instances are publicly exposed with a software vulnerability and active GuardDuty finding—these patterns can help you determine if there is an actual risk.
Conclusion
In this blog post, you learned how to enable new GuardDuty protections and use Detective, finding groups, and visualizations to better identify, operationalize, and prioritize AWS security findings that represent real risk. We also highlighted the new enhancements to visualizations that can help reduce noise and provide summaries of detailed information to help reduce the time it takes to triage findings. If you’d like to see an investigation scenario using Detective, watch the video Amazon Detective Security Scenario Investigation.
The post Archive and Purge Data for Amazon RDS for PostgreSQL and Amazon Aurora with PostgreSQL Compatibility using pg_partman and Amazon S3 presents data archival as a critical part of data management and shows how to efficiently use PostgreSQL’s native range partitioning with pg_partman to keep current (hot) data in partitions and archive historical (cold) data in Amazon Simple Storage Service (Amazon S3). Customers need a cloud-native, automated solution to archive historical data from their databases, and they want the business logic maintained and run outside the database to reduce the compute load on the database server. This post proposes an automated solution that uses AWS Glue to automate the PostgreSQL data archiving and restoration process, thereby streamlining the entire procedure.
AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. There is no need to pre-provision, configure, or manage infrastructure. It can also automatically scale resources to meet the requirements of your data processing job, providing a high level of abstraction and convenience. AWS Glue integrates seamlessly with AWS services like Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon Redshift, Amazon DynamoDB, Amazon Kinesis Data Streams, and Amazon DocumentDB (with MongoDB compatibility) to offer a robust, cloud-native data integration solution.
The features of AWS Glue, which include a scheduler for automating tasks, code generation for ETL (extract, transform, and load) processes, notebook integration for interactive development and debugging, as well as robust security and compliance measures, make it a convenient and cost-effective solution for archival and restoration needs.
Solution overview
The solution combines PostgreSQL’s native range partitioning feature with pg_partman, the Amazon S3 export and import functions in Amazon RDS, and AWS Glue as an automation tool.
The solution involves the following steps:
Provision the required AWS services and workflows using the provided AWS Cloud Development Kit (AWS CDK) project.
Set up your database.
Archive the older table partitions to Amazon S3 and purge them from the database with AWS Glue.
Restore the archived data from Amazon S3 to the database with AWS Glue when there is a business need to reload the older table partitions.
The solution is based on AWS Glue, which takes care of archiving and restoring databases with Availability Zone redundancy. The solution comprises the following technical components:
An S3 bucket stores Python scripts and database archives.
An S3 Gateway endpoint allows Amazon RDS and AWS Glue to communicate privately with Amazon S3.
AWS Glue uses a Secrets Manager interface endpoint to retrieve database secrets from Secrets Manager.
AWS Glue ETL jobs run in either of the private subnets. They use the S3 endpoint to retrieve Python scripts. The AWS Glue jobs read the database credentials from Secrets Manager to establish JDBC connections to the database.
You can create an AWS Cloud9 environment in one of the private subnets available in your AWS account to set up test data in Amazon RDS. The following diagram illustrates the solution architecture.
Prerequisites
For instructions to set up your environment for implementing the solution proposed in this post, refer to Deploy the application in the GitHub repo.
Provision the required AWS resources using AWS CDK
Complete the following steps to provision the necessary AWS resources:
Clone the repository to a new folder on your local desktop.
Create a virtual environment and install the project dependencies.
The CDK project includes three stacks: vpcstack, dbstack, and gluestack, implemented in the vpc_stack.py, db_stack.py, and glue_stack.py modules, respectively.
These stacks have preconfigured dependencies to simplify the process for you. app.py declares Python modules as a set of nested stacks. It passes a reference from vpcstack to dbstack, and a reference from both vpcstack and dbstack to gluestack.
gluestack reads the following attributes from the parent stacks:
The S3 bucket, VPC, and subnets from vpcstack
The secret, security group, database endpoint, and database name from dbstack
The deployment of the three stacks creates the technical components listed earlier in this post.
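For orientation, the following is a simplified sketch of how app.py might wire the nested stacks together. The stack class names and constructor parameters shown here are illustrative assumptions; refer to the GitHub repo for the actual code.

#!/usr/bin/env python3
import aws_cdk as cdk

from vpc_stack import VpcStack
from db_stack import DbStack
from glue_stack import GlueStack

app = cdk.App()

# vpcstack provisions the VPC, subnets, endpoints, and the S3 bucket
vpc_stack = VpcStack(app, "vpcstack")

# dbstack receives a reference to vpcstack so the database lands in its VPC
db_stack = DbStack(app, "dbstack", vpc_stack=vpc_stack)

# gluestack reads the bucket, VPC, and subnets from vpcstack, and the secret,
# security group, database endpoint, and database name from dbstack
glue_stack = GlueStack(app, "gluestack", vpc_stack=vpc_stack, db_stack=db_stack)

app.synth()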
Archive the historical table partition to Amazon S3 and purge it from the database with AWS Glue
The “Maintain and Archive” AWS Glue workflow created in the first step consists of two jobs: “Partman run maintenance” and “Archive Cold Tables.”
The “Partman run maintenance” job runs the Partman.run_maintenance_proc() procedure to create new partitions and detach old partitions based on the retention setup in the previous step for the configured table. The “Archive Cold Tables” job identifies the detached old partitions and exports the historical data to an Amazon S3 destination using aws_s3.query_export_to_s3. In the end, the job drops the archived partitions from the database, freeing up storage space. The following screenshot shows the results of running this workflow on demand from the AWS Glue console.
Additionally, you can set up this AWS Glue workflow to be triggered on a schedule, on demand, or with an Amazon EventBridge event. Use your business requirements to select the right trigger.
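Conceptually, the archive job runs steps along the following lines. This is a hedged sketch rather than the repository’s code: it assumes a PostgreSQL driver such as psycopg2 is packaged with the Glue job, and the secret name, bucket, and Region values are placeholders.

import json
import boto3
import psycopg2  # assumed to be packaged with the Glue job

# Read the database credentials from Secrets Manager
secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(SecretId="my-db-secret")["SecretString"]
)

conn = psycopg2.connect(
    host=secret["host"], dbname=secret["dbname"],
    user=secret["username"], password=secret["password"],
)
conn.autocommit = True  # required so the pg_partman procedure can manage its own transactions
cur = conn.cursor()

# 1. Create new partitions and detach old ones per the pg_partman retention setup
cur.execute("CALL partman.run_maintenance_proc();")

# 2. Export a detached (cold) partition to Amazon S3
cur.execute("""
    SELECT * FROM aws_s3.query_export_to_s3(
        'SELECT * FROM ticket_purchase_hist_p2020_01',
        aws_commons.create_s3_uri('my-archive-bucket', 'archive/ticket_purchase_hist_p2020_01', 'us-east-1'),
        options := 'format csv'
    );
""")

# 3. Drop the archived partition to free up storage
cur.execute("DROP TABLE IF EXISTS ticket_purchase_hist_p2020_01;")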
Restore archived data from Amazon S3 to the database
The “Restore from S3” Glue workflow created in the first step consists of one job: “Restore from S3.”
This job initiates the run of the partman.create_partition_time procedure to create a new table partition based on your specified month. It subsequently calls aws_s3.table_import_from_s3 to restore the matched data from Amazon S3 to the newly created table partition.
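A comparable sketch of the restore logic, with the same assumptions about the driver and placeholder names; the parent table name used here is illustrative.

import json
import boto3
import psycopg2  # assumed to be packaged with the Glue job

secret = json.loads(
    boto3.client("secretsmanager").get_secret_value(SecretId="my-db-secret")["SecretString"]
)
conn = psycopg2.connect(
    host=secret["host"], dbname=secret["dbname"],
    user=secret["username"], password=secret["password"],
)
cur = conn.cursor()

# 1. Recreate the partition for the month being restored (parent table name is illustrative)
cur.execute("""
    SELECT partman.create_partition_time(
        'public.ticket_purchase_hist',
        ARRAY['2020-01-01 00:00:00+00'::timestamptz]
    );
""")

# 2. Import the archived CSV data from Amazon S3 into the new partition
cur.execute("""
    SELECT aws_s3.table_import_from_s3(
        'ticket_purchase_hist_p2020_01', '', '(format csv)',
        aws_commons.create_s3_uri('my-archive-bucket', 'archive/ticket_purchase_hist_p2020_01', 'us-east-1')
    );
""")
conn.commit()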
To start the “Restore from S3” workflow, navigate to the workflow on the AWS Glue console and choose Run.
The following screenshot shows the “Restore from S3” workflow run details.
Validate the results
The solution provided in this post automated the PostgreSQL data archival and restoration process using AWS Glue.
You can use the following steps to confirm that the historical data in the database is successfully archived after running the “Maintain and Archive” AWS Glue workflow:
On the Amazon S3 console, navigate to your S3 bucket.
Confirm the archived data is stored in an S3 object as shown in the following screenshot.
From a psql command line tool, use the \dt command to list the available tables and confirm the archived table ticket_purchase_hist_p2020_01 does not exist in the database.
You can use the following steps to confirm that the archived data is restored to the database successfully after running the “Restore from S3” AWS Glue workflow.
From a psql command line tool, use the \dt command to list the available tables and confirm the archived table ticket_history_hist_p2020_01 is restored to the database.
Clean up
Use the information provided in Cleanup to clean up your test environment created for testing the solution proposed in this post.
Summary
This post showed how to use AWS Glue workflows to automate the archive and restore process for RDS for PostgreSQL database table partitions using Amazon S3 as archive storage. The automation is run on demand but can be set up to be triggered on a recurring schedule. It allows you to define the sequence and dependencies of jobs, track the progress of each workflow job, view run logs, and monitor the overall health and performance of your tasks. Although we used Amazon RDS for PostgreSQL as an example, the same solution works for Amazon Aurora PostgreSQL-Compatible Edition as well. Modernize your database cron jobs with AWS Glue by using this post and the GitHub repo. To gain a high-level understanding of AWS Glue and its components, see the following hands-on workshop.
About the Authors
Anand Komandooru is a Senior Cloud Architect at AWS. He joined AWS Professional Services organization in 2021 and helps customers build cloud-native applications on AWS cloud. He has over 20 years of experience building software and his favorite Amazon leadership principle is “Leaders are right a lot.”
Li Liu is a Senior Database Specialty Architect with the Professional Services team at Amazon Web Services. She helps customers migrate traditional on-premises databases to the AWS Cloud. She specializes in database design, architecture, and performance tuning.
Neil Potter is a Senior Cloud Application Architect at AWS. He works with AWS customers to help them migrate their workloads to the AWS Cloud. He specializes in application modernization and cloud-native design and is based in New Jersey.
Vivek Shrivastava is a Principal Data Architect, Data Lake in AWS Professional Services. He is a big data enthusiast and holds 14 AWS Certifications. He is passionate about helping customers build scalable and high-performance data analytics solutions in the cloud. In his spare time, he loves reading and finds areas for home automation.
Product security teams play a critical role to help ensure that new services, products, and features are built and shipped securely to customers. However, since security teams are in the product launch path, they can form a bottleneck if organizations struggle to scale their security teams to support their growing product development teams. In this post, we will share how Amazon Web Services (AWS) developed a mechanism to scale security processes and expertise by distributing security ownership between security teams and development teams. This mechanism has many names in the industry — Security Champions, Security Advocates, and others — and it’s often part of a shift-left approach to security. At AWS, we call this mechanism Security Guardians.
In many organizations, there are fewer security professionals than product developers. Our experience is that it takes much more time to hire a security professional than other technical job roles, and research conducted by (ISC)2 shows that the cybersecurity industry is short 3.4 million workers. When product development teams continue to grow at a faster rate than security teams, the disparity between security professionals and product developers continues to increase as well. Although most businesses understand the importance of security, frustration and tensions can arise when it becomes a bottleneck for the business and its ability to serve customers.
At AWS, we require the teams that build products to undergo an independent security review with an AWS application security engineer before launching. This is a mechanism to verify that new services, features, solutions, vendor applications, and hardware meet our high security bar. This intensive process impacts how quickly product teams can ship to customers. As shown in Figure 1, we found that as the product teams scaled, so did the problem: there were more products being built than the security teams could review and approve for launch. Because security reviews are required and non-negotiable, this could potentially lead to delays in the shipping of products and features.
Figure 1: More products are being developed than can be reviewed and shipped
How AWS builds a culture of security
Because of its size and scale, many customers look to AWS to understand how we scale our own security teams. To tell our story and provide insight, let’s take a look at the culture of security at AWS.
Security is a business priority
At AWS, security is a business priority. Business leaders prioritize building products and services that are designed to be secure, and they consider security to be an enabler of the business rather than an obstacle.
Leaders also strive to create a safe environment by encouraging employees to identify and escalate potential security issues. Escalation is the process of making sure that the right people know about the problem at the right time. Escalation encompasses “Dive Deep”, which is one of our corporate values at Amazon, because it requires owners and leaders to dive into the details of the issue. If you don’t know the details, you can’t make good decisions about what’s going on and how to run your business effectively.
This aspect of the culture goes beyond intention — it’s embedded in our organizational structure:
CISOs and IT leaders play a key role in demystifying what security and compliance represent for the business. At AWS, we made an intentional choice for the security team to report directly to the CEO. The goal was to build security into the structural fabric of how AWS makes decisions, and every week our security team spends time with AWS leadership to ensure we’re making the right choices on tactical and strategic security issues.
Because our leadership supports security, it’s understood within AWS that security is everyone’s job. Security teams and product development teams work together to help ensure that products are built and shipped securely. Despite this collaboration, the product teams own the security of their product. They are responsible for making sure that security controls are built into the product and that customers have the tools they need to use the product securely.
On the other hand, central security teams are responsible for helping developers to build securely and verifying that security requirements are met before launch. They provide guidance to help developers understand what security controls to build, provide tools to make it simpler for developers to implement and test controls, provide support in threat modeling activities, use mechanisms to help ensure that customers’ security expectations are met before launch, and so on.
This responsibility model highlights how security ownership is distributed between the security and product development teams. At AWS, we learned that without this distribution, security doesn’t scale. Regardless of the number of security experts we hire, product teams always grow faster. Although the culture around security and the need to distribute ownership is now well understood, without the right mechanisms in place, this model would have collapsed.
Mechanisms compared to good intentions
Mechanisms are the final pillar of AWS culture that has allowed us to successfully distribute security across our organization. A mechanism is a complete process, or virtuous cycle, that reinforces and improves itself as it operates. As shown in Figure 2, a mechanism takes controllable inputs and transforms them into ongoing outputs to address a recurring business challenge. At AWS, the business challenge that we’re facing is that security teams create bottlenecks for the business. The culture of security at AWS provides support to help address this challenge, but we needed a mechanism to actually do it.
Figure 2: AWS sees mechanisms as a complete process, or virtuous cycle
“Often, when we find a recurring problem, something that happens over and over again, we pull the team together, ask them to try harder, do better – essentially, we ask for good intentions. This rarely works… When you are asking for good intentions, you are not asking for a change… because people already had good intentions. But if good intentions don’t work, what does? Mechanisms work.”
At AWS, we’ve learned that we can help solve the challenge of scaling security by distributing security ownership with a mechanism we call the Security Guardians program. Like other mechanisms, it has inputs and outputs, and transforms over time.
AWS distributes security ownership with the Security Guardians program
At AWS, the Security Guardians program trains, develops, and empowers developers to be security ambassadors, or Guardians, within the product teams. At a high level, Guardians make sure that security considerations for a product are made earlier and more often, helping their peers build and ship their product faster. They also work closely with the central security team to help ensure that the security bar at AWS is rising and the Security Guardians program is improving over time. As shown in Figure 3, embedding security expertise within the product teams helps products with Guardian involvement move through security review faster.
Figure 3: Security expertise is embedded in the product teams by Guardians
Guardians are informed, security-minded product builders who volunteer to be consistent champions of security on their teams and are deeply familiar with the security processes and tools. They provide security guidance throughout the development lifecycle and are stakeholders in the security of the products being shipped, helping their teams make informed decisions that lead to more secure, on-time launches. Guardians are the security points-of-contact for their product teams.
In this distributed security ownership model, accountability for product security sits with the product development teams. However, the Guardians are responsible for performing the first evaluation of a development team’s security review submission. They confirm the quality and completeness of the new service’s resources, design documents, threat model, automated findings, and penetration test readiness. The development teams, supported by the Guardian, submit their security review to AWS Application Security (AppSec) engineers for the final pre-launch review.
In practice, as part of this development journey, Guardians help ensure that security considerations are made early, when teams are assessing customer requests and the feature or product design. This can be done by starting the threat modeling processes. Next, they work to make sure that mitigations identified during threat modeling are developed. Guardians also play an active role in software testing, including security scans such as static application security testing (SAST) and dynamic application security testing (DAST). To close out the security review, security engineers work with Guardians to make sure that findings are resolved and the product is ready to ship.
Figure 4: Expedited security review process supported by Guardians
Guardians are, after all, Amazonians. Therefore, Guardians exemplify a number of the Amazon Leadership Principles and often have the following characteristics:
They are exemplary practitioners for security ownership and empower their teams to own the security of their service.
They hold a high security bar and exercise strong security judgement, don’t accept quick or easy answers, and drive continuous improvement.
They advocate for security needs in internal discussions with the product team.
They are thoughtful yet assertive to make customer security a top priority on their team.
They maintain and showcase their security knowledge to their peers, continuously building knowledge from many different sources to gain perspective and to stay up to date on the constantly evolving threat landscape.
They aren’t afraid to have their work independently validated by the central security team.
Expected outcomes
AWS has benefited greatly from the Security Guardians program. We’ve had 22.5 percent fewer medium and high severity security findings generated during the security review process and have taken about 26.9 percent less time to review a new service or feature. This data demonstrates that with Guardians involved we’re identifying fewer issues late in the process, reducing remediation work, and as a result securely shipping services faster for our customers. To help both builders and Guardians improve over time, our security review tool captures feedback from security engineers on their inputs. This helps ensure that our security ownership mechanism reinforces and improves itself over time.
AWS and other organizations have benefited from this mechanism because it generates specialized security resources and distributes security knowledge that scales without needing to hire additional staff.
A program such as this could help your business build and ship faster, as it has for AWS, while maintaining an appropriately high security bar that rises over time. By training builders to be security practitioners and advocates within your development cycle, you can increase the chances of identifying risks and security findings earlier. These findings, earlier in the development lifecycle, can reduce the likelihood of having to patch security bugs or even start over after the product has already been built. We also believe that a consistent security experience for your product teams is an important aspect of successfully distributing your security ownership. An experience with less confusion and friction will help build trust between the product and security teams.
If you’re an AWS customer and want to learn more about how AWS built the Security Guardians program, reach out to your local AWS solutions architect or account manager for more information.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Apache Iceberg is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3) and provides fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution. Iceberg has become very popular for its support for ACID transactions in data lakes and features like schema and partition evolution, time travel, and rollback. Iceberg captures metadata information on the state of datasets as they evolve and change over time.
AWS Glue crawlers now support Iceberg tables, enabling you to use the AWS Glue Data Catalog and more easily migrate from other Iceberg catalogs. AWS Glue crawlers will extract schema information and update the location of Iceberg metadata and schema updates in the Data Catalog. You can then query the Data Catalog Iceberg tables across all analytics engines and apply AWS Lake Formation fine-grained permissions.
The Iceberg catalog helps you manage a collection of Iceberg tables and tracks the table’s current metadata. Iceberg provides several implementation options for the Iceberg catalog, including the AWS Glue Data Catalog, Hive Metastore, and JDBC catalogs. Customers prefer using or migrating to the AWS Glue Data Catalog because of its integrations with AWS analytical services such as Amazon Athena, AWS Glue, Amazon EMR, and Lake Formation.
With today’s launch, you can create and schedule an AWS Glue crawler to register existing Iceberg tables in the Data Catalog. You provide one or more S3 paths where the Iceberg tables are located, and you have the option to set the maximum depth of S3 paths that the crawler can traverse. With each run, the crawler inspects each of the S3 paths and catalogs the schema information, such as new tables, deletes, and schema updates, in the Data Catalog. Crawlers support schema merging across all snapshots and update the latest metadata file location in the Data Catalog so that AWS analytical engines can use it directly.
Additionally, AWS Glue is launching support for creating new (empty) Iceberg tables in the Data Catalog using the AWS Glue console or the AWS Glue CreateTable API. Before this launch, customers who wanted to adopt the Iceberg table format had to generate Iceberg’s metadata.json file on Amazon S3 with a separate PutObject call in addition to CreateTable, or use the create table statement on analytics engines such as Athena or AWS Glue. The new CreateTable API eliminates the need to create the metadata.json file separately and automatically generates it based on the given API input. Also, customers who manage deployments using AWS CloudFormation templates can now create Iceberg tables using the CreateTable API. For more details, refer to Creating Apache Iceberg tables.
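The following is a minimal boto3 sketch of the CreateTable API with OpenTableFormatInput. The database, table, and bucket names match the ones used later in this post, but the column list is partly illustrative and the parameter shape reflects the current SDK.

import boto3

glue = boto3.client("glue")

# Create an empty Iceberg table; Glue generates the initial metadata.json for you
glue.create_table(
    DatabaseName="icebergcrawlerblogdb",
    TableInput={
        "Name": "product_details",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "product_id", "Type": "string"},
                {"Name": "product_name", "Type": "string"},
                {"Name": "price", "Type": "int"},
            ],
            "Location": "s3://<datalakebucket>/icebergcrawlerblogdb.db/product_details/",
        },
        "TableType": "EXTERNAL_TABLE",
    },
    OpenTableFormatInput={
        "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
    },
)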
For accessing the data using Athena, you can also use Lake Formation to secure your Iceberg table using fine-grained access control permissions when you register the Amazon S3 data location with Lake Formation. For source data in Amazon S3 and metadata that is not registered with Lake Formation, access is determined by AWS Identity and Access Management (IAM) permissions policies for Amazon S3 and AWS Glue actions.
Solution overview
For our example use case, a customer uses Amazon EMR for data processing and Iceberg format for the transactional data. They store their product data in Iceberg format on Amazon S3 and host the metadata of their datasets in Hive Metastore on the EMR primary node. The customer wants to make product data accessible to analyst personas for interactive analysis using Athena. Many AWS analytics services don’t integrate natively with Hive Metastore, so we use an AWS Glue crawler to populate the metadata in the AWS Glue Data Catalog. Athena supports Lake Formation permissions on Iceberg tables, so we apply fine-grained access for data access.
We configure the crawler to onboard the Iceberg schema to the Data Catalog and use Lake Formation access control for crawling. We apply Lake Formation grants on the database and crawled table to enable analyst users to query the data and verify using Athena.
After we populate the schema of the existing Iceberg dataset in the Data Catalog, we onboard new Iceberg tables to the Data Catalog and load data into the newly created table using Athena. We apply Lake Formation grants on the database and newly created table to enable analyst users to query the data and verify using Athena.
The following diagram illustrates the solution architecture.
Set up resources with AWS CloudFormation
To set up the solution resources using AWS CloudFormation, complete the following steps:
Choose Launch Stack to deploy a CloudFormation template.
Choose Next.
On the next page, choose Next.
Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create.
The CloudFormation template generates the following resources:
VPC, subnet, and security group for the EMR cluster
Data lake bucket to store Iceberg table data and metadata
IAM roles for the crawler and Lake Formation registration
EMR cluster and steps to create an Iceberg table with Hive Metastore
Analyst role for data access
Athena bucket path for results
When the stack is complete, on the AWS CloudFormation console, navigate to the Resources tab of the stack.
Note down the values of EmrClusterId, DataLakeBucketName, LFRegisterLocationServiceRole, AWSGlueServiceRole, AthenaBucketName, and LFBusinessAnalystRole.
Navigate to the Amazon EMR console and choose the EMR cluster you created.
Navigate to the Steps tab and verify that the steps were run.
This script run creates the database icebergcrawlerblogdb using Hive and the Iceberg table product. It uses the Hive Metastore server on Amazon EMR as the metastore and stores the data on Amazon S3.
Navigate to the S3 bucket you created and verify if the data and metadata are created for the Iceberg table.
Some of the resources that this stack deploys incur costs when in use.
Now that the data is on Amazon S3, we can register the bucket with Lake Formation to implement access control and centralize the data governance.
Set up Lake Formation permissions
To use the AWS Glue Data Catalog in Lake Formation, complete the following steps to update the Data Catalog settings to use Lake Formation permissions to control Data Catalog resources instead of IAM-based access control:
Sign in to the Lake Formation console as admin.
If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
In the navigation pane, under Data catalog, choose Settings.
Deselect Use only IAM access control for new databases.
Deselect Use only IAM access control for new tables in new databases.
Choose Version 3 for Cross account version settings.
Choose Save.
Now you can set up Lake Formation permissions.
Register the data lake S3 bucket with Lake Formation
To register the data lake S3 bucket, complete the following steps:
On the Lake Formation console, in the navigation pane, choose Data lake locations.
Choose Register location.
For Amazon S3 path, enter the data lake bucket path.
For IAM role, choose the role noted from the CloudFormation template for LFRegisterLocationServiceRole.
Choose Register location.
Grant crawler role access to the data location
To grant access to the crawler, complete the following steps:
On the Lake Formation console, in the navigation pane, choose Data locations.
Choose Grant.
For IAM users and roles, choose the role for the crawler.
For Storage locations, enter the data lake bucket path.
Choose Grant.
Create database and grant access to the crawler role
Complete the following steps to create your database and grant access to the crawler role:
On the Lake Formation console, in the navigation pane, choose Databases.
Choose Create database.
Provide the name icebergcrawlerblogdb for the database.
Make sure Use only IAM access control for new tables in this database option is not selected.
Choose Create database.
On the Action menu, choose Grant.
For IAM users and roles, choose the role for the crawler.
Leave the database specified as icebergcrawlerblogdb.
Select Create table, Describe, and Alter for Database permissions.
Choose Grant.
Configure the crawler for Iceberg
To configure your crawler for Iceberg, complete the following steps:
On the AWS Glue console, in the navigation pane, choose Crawlers.
Choose Create crawler.
Enter a name for the crawler. For this post, we use icebergcrawler.
Under Data source configuration, choose Add data source.
For Data source, choose Iceberg.
For S3 path, enter s3://<datalakebucket>/icebergcrawlerblogdb.db/.
Choose Add an Iceberg data source.
Support for Iceberg tables is available through the CreateCrawler and UpdateCrawler APIs by adding the additional IcebergTarget as a target, with the following properties (a boto3 sketch follows the list):
connectionId – If your Iceberg tables are stored in buckets that require VPC authorization, you can set your connection properties here
icebergTables – This is an array of icebergPaths strings, each indicating the folder in which the metadata files for an Iceberg table reside
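The following boto3 sketch shows an equivalent crawler definition. The IcebergTargets shape (Paths, MaximumTraversalDepth) reflects the current SDK, and the role placeholder stands in for the crawler role created by the stack.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="icebergcrawler",
    Role="<crawler-role>",  # the crawler role created by the CloudFormation stack
    DatabaseName="icebergcrawlerblogdb",
    Targets={
        "IcebergTargets": [
            {
                "Paths": ["s3://<datalakebucket>/icebergcrawlerblogdb.db/"],
                "MaximumTraversalDepth": 10,
            }
        ]
    },
    # Use Lake Formation credentials for crawling the S3 data source
    LakeFormationConfiguration={"UseLakeFormationCredentials": True},
)

glue.start_crawler(Name="icebergcrawler")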
For Existing IAM role, enter the crawler role created by the stack.
Under Lake Formation configuration, select Use Lake Formation credentials for crawling S3 data source.
Choose Next.
Under Set output and scheduling, specify the target database as icebergcrawlerblogdb.
Choose Next.
Choose Create crawler.
Run the crawler.
During each crawl, for each icebergTable path provided, the crawler calls the Amazon S3 List API to find the most recent metadata file under that Iceberg table metadata folder and updates the metadata_location parameter to point to the latest metadata file.
The following screenshot shows the details after a successful run.
The crawler was able to crawl the S3 data source and successfully populate the schema for Iceberg data in the Data Catalog.
You can now start using the Data Catalog as your primary metastore and create new Iceberg tables directly in the Data Catalog or by using the CreateTable API.
Create a new Iceberg table
To create an Iceberg table in the Data Catalog using the console, complete the steps in this section. Alternatively, you can use a CloudFormation template to create an Iceberg table using the following code:
Complete the following steps to add a record to the Iceberg table:
On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
Choose Save.
Run the following query to add a record to the table:
insert into icebergcrawlerblogdb.product_details values('00001','ABC Company',10)
Configure Lake Formation permissions on the Iceberg table in the Data Catalog
Athena supports Lake Formation permission on Iceberg tables, so for this post, we show you how to set up fine-grained access on the tables and query them using Athena.
Now the data lake admin can delegate permissions on the database and table to the LFBusinessAnalystRole-IcebergBlog IAM role via the Lake Formation console.
Grant the role access to the database and describe permissions
To grant the LFBusinessAnalystRole-IcebergBlog IAM role access to the database with describe permissions, complete the following steps:
On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases.
Select Describe for Database permissions.
Choose Grant to apply the permissions.
Grant column access to the role
Next, grant column access to the LFBusinessAnalystRole-IcebergBlog IAM role:
On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product for Tables.
Choose Select for Table permissions.
Under Data permissions, select Column-based access.
Select Include columns and choose product_name and price.
Choose Grant to apply the permissions.
Grant table access to the role
Lastly, grant table access to the LFBusinessAnalystRole-IcebergBlog IAM role:
On the Lake Formation console, under Permissions in the navigation pane, choose Data lake permissions.
Choose Grant.
Under Principals, select IAM users and roles.
Choose the IAM role LFBusinessAnalystRole-IcebergBlog.
Under LF-Tags or catalog resources, choose icebergcrawlerblogdb for Databases and product_details for Tables.
Choose Select and Describe for Table permissions.
Choose Grant to apply the permissions.
Verify the tables using Athena
To verify the tables using Athena, switch to the LFBusinessAnalystRole-IcebergBlog role and complete the following steps:
On the Athena console, navigate to the query editor.
Choose Edit settings to configure the Athena query results bucket using the value noted from the CloudFormation output for AthenaBucketName.
Choose Save.
Run the queries on product and product_details to validate access.
The following screenshot shows column permissions on product.
The following screenshot shows table permissions on product_details.
We have successfully crawled the Iceberg dataset created from Hive Metastore with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. We registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. We granted Lake Formation permissions on the database and table to the analyst user and validated access to the data using Athena.
Clean up
To avoid unwanted charges to your AWS account, delete the AWS resources:
Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
Delete the CloudFormation stack you created.
Conclusion
With the support for Iceberg crawlers, you can quickly move to using the AWS Glue Data Catalog as your primary Iceberg table catalog. You can automatically register Iceberg tables in the Data Catalog by running an AWS Glue crawler, which doesn’t require any DDL or manual schema definition. You can start building your serverless transactional data lake on AWS using the AWS Glue crawler, create new tables using the Data Catalog, and use Lake Formation fine-grained access controls when querying Iceberg tables with Athena.
Refer to Working with other AWS services for Lake Formation support for Iceberg tables across various AWS analytical services.
Special thanks to everyone who contributed to this crawler and CreateTable feature launch: Theo Xu, Kyle Duong, Anshuman Sharma, Atreya Srivathsan, Eric Wu, Jack Ye, Himani Desai, Masoud Shahamiri, and Sachet Saurabh.
If you have questions or suggestions, submit them in the comments section.
About the authors
Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.
Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.
Mahesh Mishra is a Principal Product Manager with the AWS Lake Formation team. He works with many of AWS’s largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including strong support for transactional data lakes.
On June 19, 2023, AWS Verified Access introduced improved logging functionality; Verified Access now logs more extensive user context information received from the trust providers. This improved logging feature simplifies administration and troubleshooting of application access policies while adhering to zero-trust principles.
In this blog post, we will show you how to manage the Verified Access logging configuration and how to use Verified Access logs to write and troubleshoot access policies faster. We provide an example showing the user context information that was logged before and after the improved logging functionality and how you can use that information to transform a high-level policy into a fine-grained policy.
Overview of AWS Verified Access
AWS Verified Access helps enterprises to provide secure access to their corporate applications without using a virtual private network (VPN). Using Verified Access, you can configure fine-grained access policies to help limit application access only to users who meet the specified security requirements (for example, user identity and device security status). These policies are written in Cedar, a new policy language developed and open-sourced by AWS.
Verified Access validates each request based on access policies that you set. You can use user context—such as user, group, and device risk score—from your existing third-party identity and device security services to define access policies. In addition, Verified Access provides you an option to log every access attempt to help you respond quickly to security incidents and audit requests. These logs also contain user context sent from your identity and device security services and can help you to match the expected outcomes with the actual outcomes of your policies. To capture these logs, you need to enable logging from the Verified Access console.
Figure 1: Overview of AWS Verified Access architecture showing Verified Access connected to an application
After a Verified Access administrator attaches a trust provider to a Verified Access instance, they can write policies using the user context information from the trust provider. This user context information is custom to an organization, and you need to gather it from different sources when writing or troubleshooting policies that require more extensive user context.
Now, with the improved logging functionality, the Verified Access logs record more extensive user context information from the trust providers. This eliminates the need to gather information from different sources. With the detailed context available in the logs, you have more information to help validate and troubleshoot your policies.
To improve the preceding policy and make it more granular, you can include checks for various user and device details. For example, you can check if the user belongs to a particular group, has a verified email, should be logging in from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.
Modify the Verified Access instance logging configuration
Open the Verified Access console and select Verified Access instances.
Select the instance that you want to modify, and then, on the Verified Access instance logging configuration tab, select Modify Verified Access instance logging configuration.
Under Update log version, select ocsf-1.0.0-rc.2, turn on Include trust context, and select where the logs should be delivered.
Figure 3: Verified Access log version and trust context
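If you manage Verified Access programmatically, the same logging configuration can be applied through the EC2 API. The following is a hedged boto3 sketch; the instance ID and log group are placeholders, and the parameter shape reflects the current SDK.

import boto3

ec2 = boto3.client("ec2")

ec2.modify_verified_access_instance_logging_configuration(
    VerifiedAccessInstanceId="vai-0123456789abcdef0",  # placeholder
    AccessLogs={
        "LogVersion": "ocsf-1.0.0-rc.2",
        "IncludeTrustContext": True,
        "CloudWatchLogs": {
            "Enabled": True,
            "LogGroup": "verified-access-logs",  # placeholder log group name
        },
    },
)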
After you’ve completed the preceding steps, Verified Access will start logging more extensive user context information from the trust providers for every request that Verified Access receives. This context can include sensitive information. To learn more about how to protect this sensitive information, see Protect Sensitive Data with Amazon CloudWatch Logs.
The following example log shows information received from the IAM Identity Center identity provider (IdP) and the device provider CrowdStrike.
The following example log shows the user context information received from the OpenID Connect (OIDC) trust provider Okta. You can see the difference in the information provided by the two different trust providers: IAM Identity Center and Okta.
The following is a sample policy written using the information received from the trust providers.
permit(principal,action,resource)
when {
context.idcpolicy.groups has "<hr-group-id>" &&
context.idcpolicy.user.email.address like "*@example.com" &&
context.idcpolicy.user.email.verified == true &&
context has "crdstrikepolicy" &&
context.crdstrikepolicy.assessment.os > 50 &&
context.crdstrikepolicy.assessment.overall > 15
};
This policy only grants access to users who belong to a particular group, have a verified email address, and have a corporate email domain. Also, users can only access the application from a device with an OS that has an assessment score greater than 50, and has an overall device score greater than 15.
Conclusion
In this post, you learned how to manage Verified Access logging configuration from the Verified Access console and how to use improved logging information to write AWS Verified Access policies. To get started with Verified Access, see the Amazon VPC console.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it simple to set up and operate end-to-end data pipelines in the cloud. Trusted across various industries, Amazon MWAA helps organizations like Siemens, ENGIE, and Choice Hotels International enhance and scale their business workflows, while significantly improving security and reducing infrastructure management overhead.
Today, we are announcing the availability of Apache Airflow version 2.6.3 environments. If you’re currently running Apache Airflow version 2.x, you can seamlessly upgrade to v2.6.3 using in-place version upgrades, thereby retaining your workflow run history and environment configurations.
In this post, we delve into some of the new features and capabilities of Apache Airflow v2.6.3 and how you can set up or upgrade your Amazon MWAA environment to accommodate this version as you orchestrate your workflows in the cloud at scale.
New feature: Notifiers
Airflow now gives you an efficient way to create reusable and standardized notifications to handle systemic errors and failures. Notifiers introduce a new object in Airflow, designed to be an extensible layer for adding notifications to DAGs. This framework can send messages to external systems when a task instance or an individual DAG run changes its state. You can build notification logic from a new base object and call it directly from your DAG files. The BaseNotifier is an abstract class that provides a basic structure for sending notifications in Airflow using the various on_*_callback hooks. It is intended to be extended and customized by providers for their specific needs.
Using this framework, you can build custom notification logic directly within your DAG files. For instance, notifications can be sent through email, Slack, or Amazon Simple Notification Service (Amazon SNS) based on the state of a DAG (on_failure, on_success, and so on). You can also create your own custom notifier that updates an API or posts a file to your storage system of choice.
For details on how to create and use a notifier, refer to Creating a notifier.
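As a brief, hedged illustration of the pattern (the class, message, and DAG shown here are our own example, not part of the Airflow distribution), the following sketch subclasses BaseNotifier and attaches it to task failure callbacks:

from datetime import datetime

from airflow import DAG
from airflow.notifications.basenotifier import BaseNotifier
from airflow.operators.empty import EmptyOperator


class LogNotifier(BaseNotifier):
    """Minimal notifier that records which DAG and task changed state."""

    template_fields = ("message",)

    def __init__(self, message):
        self.message = message

    def notify(self, context):
        # Swap this print for a call to email, Slack, or Amazon SNS as needed
        print(f"{self.message}: dag={context['dag'].dag_id}, task={context['task_instance'].task_id}")


with DAG(
    dag_id="notifier_example",
    schedule=None,
    start_date=datetime(2023, 7, 1),
    catchup=False,
    # Apply the notifier to every task's failure callback
    default_args={"on_failure_callback": LogNotifier(message="Task failed")},
) as dag:
    EmptyOperator(task_id="placeholder_task")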
New feature: Managing tasks stuck in a queued state
Apache Airflow v2.6.3 brings a significant improvement to address the long-standing issue of tasks getting stuck in the queued state when using the CeleryExecutor. In a typical Apache Airflow workflow, tasks progress through a lifecycle, moving from the scheduled state to the queued state, and eventually to the running state. However, tasks can occasionally remain in the queued state longer than expected due to communication issues among the scheduler, the executor, and the worker. In Amazon MWAA, customers have experienced such tasks being queued for up to 12 hours due to how it utilizes the native integration of Amazon Simple Queue Service (Amazon SQS) with CeleryExecutor.
To mitigate this issue, Apache Airflow v2.6.3 introduced a mechanism that checks the Airflow database for tasks that have remained in the queued state beyond a specified timeout, defaulting to 600 seconds. This default can be modified using the environment configuration parameter scheduler.task_queued_timeout. The system then retries such tasks if retries are still available or fails them otherwise, ensuring that your data pipelines continue to function smoothly.
Notably, this update deprecates the previously used celery.stalled_task_timeout and celery.task_adoption_timeout settings, and consolidates their functionalities into a single configuration, scheduler.task_queued_timeout. This enables more effective management of tasks that remain in the queued state. Operators can also configure scheduler.task_queued_timeout_check_interval, which controls how frequently the system checks for tasks that have stayed in the queued state beyond the defined timeout.
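On Amazon MWAA, you set these Airflow configuration options on the environment rather than in airflow.cfg. The following is a minimal boto3 sketch; the environment name and timeout values are placeholders.

import boto3

mwaa = boto3.client("mwaa")

# Override the queued-task timeout and its check interval (values in seconds)
mwaa.update_environment(
    Name="my-airflow-environment",  # placeholder environment name
    AirflowConfigurationOptions={
        "scheduler.task_queued_timeout": "900",
        "scheduler.task_queued_timeout_check_interval": "120",
    },
)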
New feature: A new continuous timetable and support for continuous schedule
With prior versions of Airflow, to run a DAG continuously in a loop, you had to use the TriggerDagRunOperator to rerun the DAG after the last task finished. With Apache Airflow v2.6.3, you can now run a DAG continuously with a predefined timetable, which simplifies scheduling for continuous DAG runs. The new ContinuousTimetable construct creates one continuous DAG run, respecting start_date and end_date, with the new run starting as soon as the previous run has completed, regardless of whether the previous run succeeded or failed. Using a continuous timetable is especially useful when sensors are used to wait for highly irregular events from external data tools.
You can bound the degree of parallelism with the max_active_runs parameter, ensuring that only one DAG run is active at any given time.
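The following is a minimal sketch, assuming the @continuous schedule preset introduced with the new timetable; the DAG ID, dates, and task are illustrative.

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="continuous_example",
    schedule="@continuous",        # run back-to-back using the new continuous timetable
    start_date=datetime(2023, 7, 1),
    catchup=False,
    max_active_runs=1,             # bound parallelism to a single active run
) as dag:
    EmptyOperator(task_id="wait_for_event")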
New feature: Trigger the DAG UI extension with flexible user form concept
Prior to Apache Airflow v2.6.3, you could provide parameters in a JSON structure through the Airflow UI for custom workflow runs. You had to model, check, and understand the JSON and enter parameters manually, without the option to validate them before triggering a workflow. With Apache Airflow v2.6.3, when you choose Trigger DAG w/ config, a trigger UI form is rendered based on the predefined DAG Params. For your ad hoc, testing, or custom runs, this simplifies the DAG’s parameter entry. If the DAG has no parameters defined, a JSON entry mask is shown. The form elements can be defined with the Param class, and its attributes define how a form field is displayed.
For an example DAG, the following form is generated from the DAG Params.
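As a hedged sketch, a DAG along the following lines (the parameter names are illustrative) would render a form with a free-text field and a dropdown when you choose Trigger DAG w/ config:

from datetime import datetime

from airflow import DAG
from airflow.models.param import Param
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="params_form_example",
    schedule=None,
    start_date=datetime(2023, 7, 1),
    catchup=False,
    params={
        # Rendered as a validated text input
        "customer_name": Param("default", type="string", description="Customer to process"),
        # Rendered as a dropdown restricted to the listed values
        "environment": Param("dev", enum=["dev", "test", "prod"], description="Target environment"),
    },
) as dag:
    EmptyOperator(task_id="process")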
When you have successfully created an Apache Airflow v2.6.3 environment in Amazon MWAA, the following packages are automatically installed on the scheduler and worker nodes along with other provider packages:
In this post, we talked about some of the new features of Apache Airflow v2.6.3 and how you can get started using them in Amazon MWAA. Try out these new features like notifiers and continuous timetables, and other enhancements to improve your data orchestration pipelines.
Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Authors
Hernan Garcia is a Senior Solutions Architect at AWS, based out of Amsterdam, working in the Financial Services Industry since 2018. He specializes in application modernization and supports his customers in the adoption of cloud operating models and serverless technologies.
Parnab Basak is a Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.
Shubham Mehta is an experienced product manager with over eight years of experience and a proven track record of delivering successful products. In his current role as a Senior Product Manager at AWS, he oversees Amazon Managed Workflows for Apache Airflow (Amazon MWAA) and spearheads the Apache Airflow open-source contributions to further enhance the product’s functionality.