This post is written by John Ritsema (Principal Solutions Architect)
Continuous integration and continuous delivery (CI/CD) pipelines are effective mechanisms that allow teams to turn source code into running applications. When a developer makes a code change and pushes it to a remote repository, a pipeline with a series of steps can process the change. A pipeline integrates a change with the full code base, verifies the style and formatting, runs security checks, and runs unit tests. As the final step, it builds the code into an artifact that is deployable to an environment for consumption.
When using GitHub or many other hosted Git providers, a pull request or merge request can be submitted for a particular code change. This creates a focused place for discussion and collaboration on the change before it is approved and merged into a shared code branch.
A powerful mechanism for collaboration involves deploying a pull request (PR) to a running environment. This allows stakeholders to preview the changes live and see how they would look. Spinning up a running environment quickly allows teammates to provide almost immediate feedback, expediting the entire development process.
Deploying PRs to ephemeral environments encourages teams to make many small changes that can be previewed and tested in parallel. This avoids having to first merge into a common source branch and deploy to long-lived environments that are always on and incur costs.
Creating this mechanism has several challenges including setup complexity, environment creation time, and environment cost. This post addresses these challenges by showing how to create a CI/CD pipeline for previewing changes to web applications in ephemeral, quick-to-provision, low-cost, and scale-to-zero environments. This post walks through the steps required to set up a sample application.
Example architecture
The concepts in this post can be implemented using a number of tools and hosted Git providers that connect to CI/CD pipelines. The example code shared in this post uses GitHub Actions to trigger a workflow. The workflow uses a small Terraform module with Docker to build the application source code into a container image, push it to Amazon Elastic Container Registry (ECR), and create an AWS Lambda function with the image.
The container running on Lambda is accessible from a web browser through a Lambda function URL. This provides a dedicated HTTPS endpoint for a function.
A Lambda function URL is used instead of AWS App Runner, Amazon ECS on AWS Fargate with an Application Load Balancer (ALB), or Amazon EKS with ALB ingress because it provisions faster and costs less. Lambda function URLs are ideal for occasionally used ephemeral PR environments as they can be provisioned quickly. Lambda’s scale-to-zero compute environment leads to lower cost, as charges are only incurred for actual HTTP requests. This is useful for PRs that may only be reviewed infrequently and then sit idle until the PR is either merged or closed.
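The sample’s Terraform module provisions the function URL as part of each PR environment. As a hedged illustration of what that resource amounts to (the function name here is hypothetical), the equivalent AWS CLI calls are:

# Create a public function URL for a hypothetical PR preview function
aws lambda create-function-url-config \
  --function-name pr-42-preview \
  --auth-type NONE

# Allow unauthenticated invocations through the function URL
aws lambda add-permission \
  --function-name pr-42-preview \
  --statement-id FunctionURLAllowPublicAccess \
  --action lambda:InvokeFunctionUrl \
  --principal "*" \
  --function-url-auth-type NONE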
This is the example architecture:
Setting up the example
The sample project shows how to implement this example. It consists of a vanilla web application written in Node.js. All of the code needed to implement the architecture is contained within the .github directory. To enable ephemeral environments for a new project, copy over the .github directory; it adds this capability without cluttering your project files.
There are two main resources needed to run Terraform inside of GitHub Actions: an AWS IAM role and a place to store Terraform state. AWS credentials are required to give the pipeline permission to provision AWS resources.
Instead of using static IAM user credentials that must be rotated and secured, assume an IAM role to obtain temporary credentials. Terraform remote state is needed to dispose of the environment when the PR is merged or closed. The sample project uses an Amazon S3 bucket to store Terraform state.
You can use the Terraform module located under .github/setup to create these required resources.
Provide the name of your GitHub organization and repository in the terraform.tfvars file as input parameters. You can replace aws-samples with your GitHub user name:
cd .github/setup
terraform init && terraform apply
Store the generated terraform.tfstate file safely so that you can manage these resources in the future if needed.
Place the Region, generated IAM role, and bucket name into the configuration file located under .github/workflows/config.env. This configuration file is read and used by the GitHub Actions workflow.
export AWS_REGION="<add region from setup>"
export AWS_ROLE="<add role from setup>"
export TF_BACKEND_S3_BUCKET="<add bucket from setup>"
This IAM role has an inline policy that contains the minimum set of permissions needed to provision the AWS resources. This assumes that your application does not interact with external services like databases or caches. If your application needs this additional access, you can add the required permissions to the policy located here.
Running a web server in Lambda
The sample web (HTTP) application includes a Dockerfile that contains instructions for packaging the web app into a process-based container image. A Lambda extension called Lambda Web Adapter enables you to run this standard web server process on Lambda. The CI/CD workflow makes a copy of the Dockerfile and adds one line to it.
This line copies the Lambda Web Adapter executable binary from a public ECR image and writes it into the container in the /opt/extensions/ directory. When the container starts, Lambda starts the Lambda Web Adapter extension. This translates Lambda event payloads from HTTP-based triggers into actual HTTP requests that it proxies to the web app running inside the container. This is the architecture:
By default, Lambda Web Adapter assumes that the web app is listening on port 8080. However, you can change this in the Dockerfile by setting the PORT environment variable.
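The adapter is distributed as a public ECR image (public.ecr.aws/awsguru/aws-lambda-adapter). As a minimal sketch of that workflow step, with the copied file name and the version tag chosen purely for illustration, appending the extension to the Dockerfile copy could look like this:

# Copy the application Dockerfile and append the Lambda Web Adapter extension
cp Dockerfile Dockerfile.lambda
cat >> Dockerfile.lambda <<'EOF'
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.7.0 /lambda-adapter /opt/extensions/lambda-adapter
EOF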
The containerized web app experiences a “cold start” on its first request and after periods of inactivity. However, this is likely not much of a concern, as the app is only previewed internally by teammates.
Workflow pipeline
The GitHub Actions job defined in the up.yml workflow is triggered when a PR is opened or reopened against the repository’s main branch. The following is a summary of the steps that the job performs.
Read the configuration from .github/workflows/config.env
Assume the IAM Role, which has minimal permissions to deploy AWS resources
Install the Terraform CLI
Add the Lambda Web Adapter extension to the copy of the Dockerfile
Run terraform apply to provision the AWS resources using the S3 bucket for Terraform remote state
Obtain the HTTPS endpoint from Terraform and add it to the PR as a comment
Steps 4-6 are the key steps in the up.yml workflow.
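The workflow YAML itself lives in the sample repository. As a rough, hedged shell approximation of those steps (the state key, variable name, and output name used here are assumptions, and the gh CLI stands in for the workflow’s comment step), the same sequence amounts to:

# Step 5 (sketch): provision the PR environment, keeping Terraform state per pull request in S3
terraform init \
  -backend-config="bucket=${TF_BACKEND_S3_BUCKET}" \
  -backend-config="key=pr-${PR_NUMBER}/terraform.tfstate" \
  -backend-config="region=${AWS_REGION}"
terraform apply -auto-approve -var="pr_number=${PR_NUMBER}"

# Step 6 (sketch): read the HTTPS endpoint output and post it back to the pull request
ENDPOINT=$(terraform output -raw endpoint)
gh pr comment "${PR_NUMBER}" --body "Preview environment: ${ENDPOINT}"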
The main.tf file (in the same directory) contains the infrastructure as code (IaC) that is responsible for creating an ECR repository, building and pushing the container image to it, and spinning up a Lambda function based on the image. The Terraform configuration for this is concise.
Terraform outputs the generated HTTPS endpoint. The workflow writes it back to the PR as a comment so that teammates can click on the link to preview the changes.
The workflow takes about 60 seconds to spin up a new isolated containerized web application in an ephemeral environment that can be previewed.
Pull request collaboration
The following screenshot shows an example PR as the author collaborates with their team. After implementing this example, when a new PR arrives, the changes are deployed to a new ephemeral environment. Stakeholders can use the link to preview what the changes look like and provide feedback.
Once the changes are approved and merged into the main branch, the GitHub Actions down.yml workflow disposes of the environment. This means that the ephemeral environment is de-provisioned, including resources like the Lambda function and the ECR repository.
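Teardown mirrors provisioning. A minimal sketch of what the down.yml workflow amounts to, assuming the same backend configuration and state key as in the provisioning sketch above:

# Re-initialize against the same remote state, then destroy everything created for this PR
terraform init \
  -backend-config="bucket=${TF_BACKEND_S3_BUCKET}" \
  -backend-config="key=pr-${PR_NUMBER}/terraform.tfstate" \
  -backend-config="region=${AWS_REGION}"
terraform destroy -auto-approve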
Conclusion
This post discusses some of the benefits of using ephemeral environments in CI/CD pipelines. It shows how to implement a pipeline using GitHub Actions and Lambda Function URLs for fast, low-cost, and ephemeral environments.
With this example, you can deploy PRs quickly, and the cost is based on HTTP requests made to the environment. There are no compute costs incurred while a PR is open and no one is previewing the environment. The only charges are for Lambda invocations, while stakeholders are actively interacting with the environment. When a PR is merged or closed, the cloud infrastructure is disposed of. You can find all of the example code referenced in this post here.
For more serverless learning resources, visit Serverless Land.
This blog post is written by Katja-Maja Krödel, IoT Specialist Solutions Architect, and Benjamin Meyer, Senior Solutions Architect, Game Tech.
Customers use Amazon Elastic Compute Cloud (Amazon EC2) instances to develop, deploy, and test applications. To use those instances most effectively, customers have expressed the need to reset their instances to a previous state within minutes or even seconds. They want a quick and automated way to do this at scale.
The Amazon EC2 Root Volume Replacement feature enables customers to restore the root volume of a running EC2 instance to a specific snapshot or to its launch state. Because the instance is not stopped, customers can fix issues while retaining the instance store data, networking, and AWS Identity and Access Management (IAM) configuration, and then resume their operations with their instance store data intact. This works for all virtualized EC2 instances and bare metal EC2 Mac instances today.
In this post, we show you how to design your architecture for automated Root Volume Replacement using this Amazon EC2 feature. We start with the automated snapshot creation, continue with automatically replacing the root volume, and finish with how to keep your environment clean after your replacement job succeeds.
What is Root Volume Replacement?
Amazon EC2 enables customers to replace the root Amazon Elastic Block Store (Amazon EBS) volume for an instance without stopping the instance to which it’s attached. The root volume can be restored to its launch state or to any snapshot taken from that volume, which allows issues such as root volume corruption or guest OS networking errors to be fixed. Replacing the root volume of an instance includes the following steps:
A new EBS volume is created from a previously taken snapshot or the launch state
The instance is rebooted
While rebooting, the current root volume is detached and the new root volume is attached
The previous EBS root volume isn’t deleted and can be attached to an instance for later investigation. When restoring to a state other than the launch state, a snapshot of the current root volume is used.
An example use case is a continuous integration/continuous deployment (CI/CD) system that builds artifacts on EC2 instances. Within this system, a build could alter the installed tools on the host and cause subsequent builds on the same machine to fail. To prevent unclean builds, the architecture introduced here cleans up the machine by restoring the root volume to a previously known good state. This is especially interesting for EC2 Mac instances, because their Dedicated Host won’t undergo the scrubbing process, and the instance is restored more quickly than launching a fresh EC2 Mac instance on the same host.
Overview
The feature of replacing Root Volumes was introduced in April 2021 and has recently been extended to work for bare metal EC2 Mac instances. If you want to reset an EC2 instance to a previously known good state, you can create snapshots of your EBS volumes. To reset the root volume to its launch state, no snapshot is needed. For non-root volumes, you can use these snapshots to create new EBS volumes, and then attach them to or detach them from your instance. To automate the process of replacing your root volume not just once, but in a repeatable manner, we’re introducing an architecture that can fully automate this process.
If you use a snapshot to create a new root volume, you must take a new snapshot of that volume to be able to return to that state later. You can’t restore to a snapshot of a different volume, which is why the architecture includes automatic snapshot creation for each fresh root volume.
The architecture is built in three steps:
Automation of Snapshot Creation for new EBS volumes
Automation of replacing your Root Volume
Preparation of the environment for the next Root Volume Replacement
The following diagram illustrates the architecture of this solution.
In the next sections, we go through these concepts to design the automatic Root Volume Replacement Task.
Automation of Snapshot Creation for new EBS volumes
The figure above illustrates the architecture for automatically creating a snapshot of an existing EBS volume. In this architecture, we focus on the automation of creating a snapshot whenever a new EBS root volume is created.
Amazon EventBridge is used to invoke an AWS Lambda function whenever a createVolume event is emitted. To react to the event automatically, you add a rule to EventBridge that forwards the event to the Lambda function whenever a new EBS volume is created.
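As a hedged example (the rule name, account ID, Region, and function name are placeholders), an equivalent rule can be created with the AWS CLI; EBS emits these notifications with source aws.ec2 and detail-type EBS Volume Notification:

# Match EBS createVolume events and forward them to the snapshot Lambda function
aws events put-rule \
  --name ebs-create-volume-rule \
  --event-pattern '{
    "source": ["aws.ec2"],
    "detail-type": ["EBS Volume Notification"],
    "detail": { "event": ["createVolume"] }
  }'
aws events put-targets \
  --rule ebs-create-volume-rule \
  --targets 'Id=snapshot-lambda,Arn=arn:aws:lambda:eu-west-1:111122223333:function:create-root-volume-snapshot'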
The function’s code uses the resource ARN from the received event to request details about the EBS volume from the Amazon EC2 API. Because the event doesn’t indicate whether the volume is a root volume, the function must verify this using the Amazon EC2 API.
The following is a summary of the tasks of the Lambda function:
Extract the EBS ARN from the EventBridge Event
Verify that it’s a root volume of an EC2 Instance
Call the Amazon EC2 API create-snapshot to create a snapshot of the root volume and add a tag replace-snapshot=true
Then, the tag is used to clean up the environment and get rid of snapshots that aren’t needed.
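In CLI terms (the volume and instance IDs are placeholders), the check and the snapshot call that the function performs look roughly like this:

# Look up the device the volume is attached as, and the instance's root device name
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 \
  --query 'Volumes[0].Attachments[0].Device'
aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].RootDeviceName'

# If the two device names match, snapshot the root volume and tag it for later cleanup
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=replace-snapshot,Value=true}]'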
As an alternative, you can emit your own event to EventBridge and react to that instead of the createVolume event. This gives you a customized way to control when the snapshots that you can later restore to are created.
Automation of replacing your Root Volume
The figure above illustrates the procedure of replacing the EBS root volume. It starts with the event, which is created through the AWS Command Line Interface (AWS CLI), the console, or an API call. This leads to creating a new volume from a snapshot or using the initial launch state. The EC2 instance is rebooted, and during that time the old root volume is detached and the new volume is attached as the root volume.
To invoke a replacement, you can call the Amazon EC2 create-replace-root-volume-task API with the AWS CLI.
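A hedged example follows; the instance and snapshot IDs are placeholders, and omitting --snapshot-id restores the volume to its launch state instead.

# Replace the root volume of a running instance with a volume created from a snapshot
aws ec2 create-replace-root-volume-task \
  --instance-id i-0123456789abcdef0 \
  --snapshot-id snap-0123456789abcdef0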
After running this command, AWS creates a new EBS volume, tags the old EBS volume with replaced-volume=true, restarts your instance, and attaches the new volume to the instance as the root volume. The tag is used later to detect old root volumes and clean up the environment.
If this is combined with the automation explained earlier, the automation immediately takes a snapshot of the new EBS volume. A restore operation can only target a snapshot of the current EBS root volume. Therefore, if no snapshot is taken of the freshly restored EBS volume, the only restore operation possible is the restore to launch state.
Preparation of the Environment for the next Root Volume Replacement
After the task is completed, the old root volume isn’t removed. Additionally, snapshots of previous root volumes can’t be used to restore current root volumes. To clean up your environment, you can schedule a Lambda function that performs the following steps:
Delete detached EBS volumes with the tag delete-volume=true
Delete snapshots with the tag replace-snapshot=true, which aren’t associated with an existing EBS volume
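A hedged CLI sketch of those two cleanup steps is shown below; the scheduled Lambda function would do the equivalent through the SDK.

# Delete detached (available) volumes that were tagged for deletion
for vol in $(aws ec2 describe-volumes \
    --filters Name=tag:delete-volume,Values=true Name=status,Values=available \
    --query 'Volumes[].VolumeId' --output text); do
  aws ec2 delete-volume --volume-id "$vol"
done

# Delete replacement snapshots whose source volume no longer exists
for snap in $(aws ec2 describe-snapshots --owner-ids self \
    --filters Name=tag:replace-snapshot,Values=true \
    --query 'Snapshots[].SnapshotId' --output text); do
  vol=$(aws ec2 describe-snapshots --snapshot-ids "$snap" \
    --query 'Snapshots[0].VolumeId' --output text)
  # describe-volumes returns an error if the source volume has already been deleted
  if ! aws ec2 describe-volumes --volume-ids "$vol" > /dev/null 2>&1; then
    aws ec2 delete-snapshot --snapshot-id "$snap"
  fi
done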
Conclusion
In this post, we described an architecture to quickly restore EC2 instances through Root Volume Replacement. The feature of replacing Root Volumes of Amazon EC2 instances, now including bare metal EC2 Mac instances, enables customers to restore the root volumes of running EC2 instances to a specific snapshot or to their launch state. Customers can resume their operations with their instance store data intact. We’ve split the process of doing this in an automated and quick manner into three steps: create a snapshot, run the replacement task, and reset your environment to be prepared for the following replacement task. If you want to learn more about this feature, see the Announcement of replacing Root Volumes, as well as the documentation for this feature.
A leader in the modern application platform space, OutSystems provides a low-code platform with tools for companies to develop, deploy, and manage omnichannel enterprise applications.
Security is a top priority at OutSystems. Their Security Operations Center (SOC) deals with thousands of incidents a year, each with a set of response actions that need to be executed as quickly as possible. Providing security at such large scale is a challenge, even for the most well-prepared organizations. Manual and repetitive tasks account for the majority of the response time involved in this process, and decreasing this key metric requires orchestration and automation.
Security orchestration, automation, and response (SOAR) systems are designed to translate security analysts’ manual procedures into automated actions, making them faster and more scalable.
In this blog post, we’ll explore how OutSystems lowered their incident response time by 99 percent by designing and deploying a custom SOAR using Serverless services on AWS.
Solution architecture
Security incidents happen with unpredictable frequency, making serverless services a natural fit to boost security at OutSystems because of their increased agility and ability to scale to zero.
There are two ways to trigger SOAR actions in this architecture:
Automatically through Security Information and Event Management (SIEM) security incident findings
On demand, through a chat application
Using the first method, when a security incident is detected by the SIEM, an event is published to Amazon Simple Notification Service (Amazon SNS). This triggers an AWS Lambda function that creates a ticket in an internal ticketing system. The Playbooks Lambda function is then triggered to decide which playbook to run, depending on the incident details.
Each playbook is a set of actions that are executed in response to a trigger. Playbooks are the key component behind automated tasks. OutSystems uses AWS Step Functions to orchestrate the actions and Lambda functions to execute them.
But this solution does not exist in isolation. Depending on the playbook, Step Functions interacts with other components such as AWS Secrets Manager or external APIs.
Using the second method, the on-demand trigger for OutSystems SOAR relies on a chat application. This application calls a Lambda function URL that interacts with the playbooks we just discussed.
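As a hedged illustration (the URL and payload shape are hypothetical, and a real integration would also need authentication on the function URL), triggering a playbook on demand from a chat workflow can be as simple as an HTTP call:

# Ask the SOAR to run a playbook for a given incident (payload shape is illustrative)
curl -X POST "https://abc123.lambda-url.eu-west-1.on.aws/" \
  -H "Content-Type: application/json" \
  -d '{"playbook": "impossible-travel", "ticketId": "SOC-1234"}'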
Figure 1 represents the high-level architecture of OutSystems’ custom SOAR.
The same infrastructure as code (IaC) approach is used when new playbooks are created or existing ones are updated. Code changes committed to a source control repository trigger an AWS CodePipeline pipeline, which uses AWS CodeBuild and CloudFormation change sets to deploy the updates to the affected resources.
Use cases
The use cases that OutSystems has deployed playbooks for to date include:
SQL injection
Unauthorized access to credentials
Issuance of new certificates
Login brute-force attempts
Impossible travel
Let’s explore the Impossible travel use case. Impossible travel happens when a user logs in from one location, and then later logs in from a different location that would be impossible to travel between within the elapsed time.
When the SIEM identifies this behavior, it triggers an alert and the following actions are performed:
A ticket is created
An IP address check is performed in reputation databases, such as AbuseIPDB or VirusTotal
An IP address check is performed in the internal database, and the IP address is added if it is not found
A search is performed for past events with the same IP address
Recent logins of the user are identified in the SIEM, along with all related information
All of this information is automatically added to the ticket. Every step listed here was previously performed manually; a task that took an average of 15 minutes. Now, the process takes just 8 seconds—a 99.1% incident response time improvement.
The following remediation actions can also be automated, along with many others:
Some of these remediation actions are already in place, while others are in development.
Conclusion
At OutSystems, much like at AWS, security is considered “job zero.” It is not only important to be proactive in preventing security incidents, but when they happen, the response must be quick, effective, and as immune to human error as possible.
With the implementation of this custom SOAR, OutSystems reduced the average response time to security incidents by 99%. Tasks that previously took 76 hours of analysts’ time are now accomplished automatically within 31 minutes.
During the evaluation period, SOAR addressed hundreds of real-world incidents with some threat intel use cases being executed thousands of times.
An architecture composed of serverless services ensures that OutSystems does not pay for systems that are standing by waiting for work, while at the same time not compromising on performance.
Secrets managers are a great tool to securely store your secrets and provide access to secret material to a set of individuals, applications, or systems that you trust. Across your environments, you might have multiple secrets managers hosted on different providers, which can increase the complexity of maintaining a consistent operating model for your secrets. In these situations, centralizing your secrets in a single source of truth, and replicating subsets of secrets across your other secrets managers, can simplify your operating model.
This blog post explains how you can use your third-party secrets manager as the source of truth for your secrets, while replicating a subset of these secrets to AWS Secrets Manager. By doing this, you will be able to use secrets that originate and are managed from your third-party secrets manager in Amazon Web Services (AWS) applications or in AWS services that use Secrets Manager secrets.
I’ll demonstrate this approach in this post by setting up a sample open-source HashiCorp Vault to create and maintain secrets and create a replication mechanism that enables you to use these secrets in AWS by using AWS Secrets Manager. Although this post uses HashiCorp Vault as an example, you can also modify the replication mechanism to use secrets managers from other providers.
Important: This blog post is intended to provide guidance that you can use when planning and implementing a secrets replication mechanism. The examples in this post are not intended to be run directly in production, and you will need to take security hardening requirements into consideration before deploying this solution. As an example, HashiCorp provides tutorials on hardening production vaults.
The primary use case for this post is for customers who are running applications on AWS and are currently using a third-party secrets manager to manage their secrets, hosted on-premises, in the AWS Cloud, or with a third-party provider. These customers typically have existing secrets vending processes, deployment pipelines, and procedures and processes around the management of these secrets. Customers with such a setup might want to keep their existing third-party secrets manager and have a set of secrets that are accessible to workloads running outside of AWS, as well as workloads running within AWS, by using AWS Secrets Manager.
Another use case is for customers who are in the process of migrating workloads to the AWS Cloud and want to maintain a (temporary) hybrid form of secrets management. By replicating secrets from an existing third-party secrets manager, customers can migrate their secrets to the AWS Cloud one-by-one, test that they work, integrate the secrets with the intended applications and systems, and once the migration is complete, remove the third-party secrets manager.
Additionally, some AWS services, such as Amazon Relational Database Service (Amazon RDS) Proxy, AWS Direct Connect MACsec, and AD Connector seamless join (Linux), only support secrets from AWS Secrets Manager. Customers can use secret replication if they have a third-party secrets manager and want to be able to use third-party secrets in services that require integration with AWS Secrets Manager. That way, customers don’t have to manage secrets in two places.
Two approaches to secrets replication
In this post, I’ll discuss two main models to replicate secrets from an external third-party secrets manager to AWS Secrets Manager: a pull model and a push model.
Pull model
In a pull model, you can use AWS services such as Amazon EventBridge and AWS Lambda to periodically call your external secrets manager to fetch secrets and updates to those secrets. The main benefit of this model is that it doesn’t require any major configuration to your third-party secrets manager. The AWS resources and mechanism used for pulling secrets must have appropriate permissions and network access to those secrets. However, there could be a delay between the time a secret is created and updated and when it’s picked up for replication, depending on the time interval configured between pulls from AWS to the external secrets manager.
Push model
In this model, rather than periodically polling for updates, the external secrets manager pushes updates to AWS Secrets Manager as soon as a secret is added or changed. The main benefit of this is that there is minimal delay between secret creation, or secret updating, and when that data is available in AWS Secrets Manager. The push model also minimizes the network traffic required for replication since it’s a unidirectional flow. However, this model adds a layer of complexity to the replication, because it requires additional configuration in the third-party secrets manager. More specifically, the push model is dependent on the third-party secrets manager’s ability to run event-based push integrations with AWS resources. This will require a custom integration to be developed and managed on the third-party secrets manager’s side.
This blog post focuses on the pull model to provide an example integration that requires no additional configuration on the third-party secrets manager.
Replicate secrets to AWS Secrets Manager with the pull model
In this section, I’ll walk through an example of how to use the pull model to replicate your secrets from an external secrets manager to AWS Secrets Manager.
Solution overview
Figure 1: Secret replication architecture diagram
The architecture shown in Figure 1 consists of the following main steps, numbered in the diagram:
A Cron expression in Amazon EventBridge invokes an AWS Lambda function every 30 minutes.
To connect to the third-party secrets manager, the Lambda function, written in NodeJS, fetches a set of user-defined API keys belonging to the secrets manager from AWS Secrets Manager. These API keys have been scoped down to give read-only access to secrets that should be replicated, to adhere to the principle of least privilege. There is more information on this in Step 3: Update the Vault connection secret.
The third step has two variants depending on where your third-party secrets manager is hosted:
The Lambda function is configured to fetch secrets from a third-party secrets manager that is hosted outside AWS. This requires sufficient networking and routing to allow communication from the Lambda function.
Note: Depending on the location of your third-party secrets manager, you might have to consider different networking topologies. For example, you might need to set up hybrid connectivity between your external environment and the AWS Cloud by using AWS Site-to-Site VPN or AWS Direct Connect, or both.
Important: To simplify the deployment of this example integration, I’ll use a secrets manager hosted on a publicly available Amazon EC2 instance within the same VPC as the Lambda function (3b). This minimizes the additional networking components required to interact with the secrets manager. More specifically, the EC2 instance runs an open-source HashiCorp Vault. In the rest of this post, I’ll refer to the HashiCorp Vault’s API keys as Vault tokens.
The Lambda function compares the version of the secret that it just fetched from the third-party secrets manager against the version of the secret that it has in AWS Secrets Manager (by tag). The function will create a new secret in AWS Secrets Manager if the secret does not exist yet, and will update it if there is a new version. The Lambda function will only consider secrets from the third-party secrets manager for replication if they match a specified prefix. For example, hybrid-aws-secrets/.
In case there is an error synchronizing the secret, an email notification is sent to the email addresses which are subscribed to the Amazon Simple Notification Service (Amazon SNS) Topic deployed. This sample application uses email notifications with Amazon SNS as an example, but you could also integrate with services like ServiceNow, Jira, Slack, or PagerDuty. Learn more about how to use webhooks to publish Amazon SNS messages to external services.
Step 1: Deploy the solution by using the AWS CDK toolkit
For this blog post, I’ve created an AWS Cloud Development Kit (AWS CDK) script, which can be found in this AWS GitHub repository. Using the AWS CDK, I’ve defined the infrastructure depicted in Figure 1 as Infrastructure as Code (IaC), written in TypeScript, ready for you to deploy and try out. The AWS CDK is an open-source software development framework that allows you to write your cloud application infrastructure as code using common programming languages such as TypeScript, Python, Java, Go, and so on.
Prerequisites:
To deploy the solution, the following should be in place on your system:
AWS CDK Toolkit. Install using npm (included in Node setup) by running npm install -g aws-cdk in a local terminal.
An AWS access key ID and secret access key configured, because this setup will interact with your AWS account. See Configuration basics in the AWS Command Line Interface User Guide for more details.
Clone the CDK script for secret replication. git clone https://github.com/aws-samples/aws-secrets-manager-hybrid-secret-replication-from-hashicorp-vault.git SecretReplication
Use the cloned project as the working directory. cd SecretReplication
Install the required dependencies to deploy the application. npm install
Adjust any configuration values for your setup in the cdk.json file. For example, you can adjust the secretsPrefix value to change which prefix is used by the Lambda function to determine the subset of secrets that should be replicated from the third-party secrets manager.
Bootstrap your AWS environments with some resources that are required to deploy the solution. With correctly configured AWS credentials, run the following command. cdk bootstrap
Deploy the solution. cdk deploy
This command deploys the infrastructure shown in Figure 1 for you by using AWS CloudFormation. For a full list of resources, you can view the SecretsManagerReplicationStack in AWS CloudFormation after the deployment has completed.
Note: If your local environment does not have a terminal that allows you to run these commands, consider using AWS Cloud9 or AWS CloudShell.
After the deployment has finished, you should see an output in your terminal that looks like the one shown in Figure 2. If successful, the output provides the IP address of the sample HashiCorp Vault and its web interface.
Figure 2: AWS CDK deployment output
Step 2: Initialize the HashiCorp Vault
As part of the output of the deployment script, you will be given a URL to access the user interface of the open-source HashiCorp Vault. To simplify accessibility, the URL points to a publicly available Amazon EC2 instance running the HashiCorp Vault user interface as shown in step 3b in Figure 1.
Let’s look at the HashiCorp Vault that was just created. Go to the URL in your browser, and you should see the Raft Storage initialize page, as shown in Figure 3.
The vault requires an initial configuration to set up storage and get the initial set of root keys. You can go through the steps manually in the HashiCorp Vault’s user interface, but I recommend that you use the initialise_vault.sh script that is included as part of the SecretsManagerReplication project instead.
Using the HashiCorp Vault API, the initialization script will automatically do the following:
Initialize the Raft storage to allow the Vault to store secrets locally on the instance.
Create an initial set of unseal keys for the Vault. Importantly, for demo purposes, the script uses a single key share. For production environments, it’s recommended to use multiple key shares so that multiple shares are needed to reconstruct the root key, in case of an emergency.
Store the unseal keys in init/vault_init_output.json in your project.
Unseal the HashiCorp Vault by using the unseal keys generated earlier.
Enable two key-value secrets engines:
An engine named after the prefix that you’re using for replication, defined in the cdk.json file. In this example, this is hybrid-aws-secrets. We’re going to use the secrets in this engine for replication to AWS Secrets Manager.
An engine called super-secret-engine, which you’re going to use to show that your replication mechanism does not have access to secrets outside the engine used for replication.
Create three example secrets, two in hybrid-aws-secrets, and one in super-secret-engine.
Create a read-only policy that allows read-only access to only the secrets that should be replicated; you can see it in the init/replication-policy-payload.json file after the script has finished running.
Create a new vault token with the read-only policy attached, so that the AWS Lambda function can use it later to fetch secrets for replication.
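For orientation, a rough CLI equivalent of what the script automates is shown below. The script itself drives the HTTP API; the address, key-value engine version, and policy file name here are assumptions, while the engine names and the sample secret follow the values described in this post.

export VAULT_ADDR="http://<vault-ip>:8200"

# Initialize with a single key share (demo only) and unseal
vault operator init -key-shares=1 -key-threshold=1 -format=json > init/vault_init_output.json
vault operator unseal "<unseal-key-from-init-output>"

# Enable the two key-value secrets engines
vault secrets enable -path=hybrid-aws-secrets kv-v2
vault secrets enable -path=super-secret-engine kv-v2

# Create a sample secret, the read-only policy, and a token scoped to that policy
vault kv put hybrid-aws-secrets/first-secret-for-replication secrets=manager
vault policy write aws-replication-read-only read-only-policy.hcl
vault token create -policy=aws-replication-read-only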
To run the initialization script, go back to your terminal, and run the following command. ./initialise_vault.sh
The script will then ask you for the IP address of your HashiCorp Vault. Provide the IP address (excluding the port) and press Enter. Enter y so that the script creates a couple of sample secrets.
If everything is successful, you should see an output that includes tokens to access your HashiCorp Vault, similar to that shown in Figure 4.
The setup script has outputted two tokens: one root token that you will use for administrator tasks, and a read-only token that will be used to read secret information for replication. Make sure that you can access these tokens while you’re following the rest of the steps in this post.
Note: The root token is only used for demonstration purposes in this post. In your production environments, you should not use root tokens for regular administrator actions. Instead, you should use scoped down roles depending on your organizational needs. In this case, the root token is used to highlight that there are secrets under super-secret-engine/ which are not meant for replication. These secrets cannot be seen, or accessed, by the read-only token.
Go back to your browser and refresh your HashiCorp Vault UI. You should now see the Sign in to Vault page. Sign in using the Token method, and use the root token. If you don’t have the root token in your terminal anymore, you can find it in the init/vault_init_output.json file.
After you sign in, you should see the overview page with three secrets engines enabled for you, as shown in Figure 5.
If you explore hybrid-aws-secrets and super-secret-engine, you can see the secrets that were automatically created by the initialization script. For example, first-secret-for-replication, which contains a sample key-value secret with the key secrets and value manager.
If you navigate to Policies in the top navigation bar, you can also see the aws-replication-read-only policy, as shown in Figure 6. This policy provides read-only access to only the hybrid-aws-secrets path.
Figure 6: Read-only HashiCorp Vault token policy
The read-only policy is attached to the read-only token that we’re going to use in the secret replication Lambda function. This policy is important because it scopes down the access that the Lambda function obtains by using the token to a specific prefix meant for replication. For secret replication we only need to perform read operations. This policy ensures that we can read, but cannot add, alter, or delete any secrets in HashiCorp Vault using the token.
You can verify the read-only token permissions by signing into the HashiCorp Vault user interface using the read-only token rather than the root token. Now, you should only see hybrid-aws-secrets. You no longer have access to super-secret-engine, which you saw in Figure 5. If you try to create or update a secret, you will get a permission denied error.
Great! Your HashiCorp Vault is now ready to have its secrets replicated from hybrid-aws-secrets to AWS Secrets Manager. The next section describes a final configuration that you need to do to allow access to the secrets in HashiCorp Vault by the replication mechanism in AWS.
Step 3: Update the Vault connection secret
To allow secret replication, you must give the AWS Lambda function access to the HashiCorp Vault read-only token that was created by the initialization script. To do that, you need to update the vault-connection-secret that was initialized in AWS Secrets Manager as part of your AWS CDK deployment.
For demonstration purposes, I’ll show you how to do that by using the AWS Management Console, but you can also do it programmatically by using the AWS Command Line Interface (AWS CLI) or AWS SDK with the update-secret command.
To update the Vault connection secret (console)
In the AWS Management Console, go to AWS Secrets Manager > Secrets > hybrid-aws-secrets/vault-connection-secret.
Under Secret Value, choose Retrieve Secret Value, and then choose Edit.
Update the vaultToken value to contain the read-only token that was generated by the initialization script.
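To do the same from the AWS CLI, a hedged sketch follows; the vaultUrl key is an assumption, so keep whatever other fields the secret already contains when you write it back.

# Fetch the current connection secret so the other fields can be preserved
aws secretsmanager get-secret-value \
  --secret-id hybrid-aws-secrets/vault-connection-secret \
  --query SecretString --output text

# Write the secret back with the read-only token in the vaultToken field
aws secretsmanager update-secret \
  --secret-id hybrid-aws-secrets/vault-connection-secret \
  --secret-string '{"vaultToken":"<read-only-token>","vaultUrl":"<vault-url>"}'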
Step 4: (Optional) Set up email notifications for replication failures
As highlighted in Figure 1, the Lambda function will send an email by using Amazon SNS to a designated email address whenever one or more secrets fails to be replicated. You will need to configure the solution to use the correct email address. To do this, go to the cdk.json file at the root of the SecretReplication folder and adjust the notificationEmail parameter to an email address that you own. Once done, deploy the changes using the cdk deploy command. Within a few minutes, you’ll get an email requesting you to confirm the subscription. Going forward, you will receive an email notification if one or more secrets fails to replicate.
Test your secret replication
You can either wait up to 30 minutes for the Lambda function to be invoked automatically to replicate the secrets, or you can manually invoke the function.
To test your secret replication
Open the AWS Lambda console and find the Secret Replication function (the name starts with SecretsManagerReplication-SecretReplication).
Navigate to the Test tab.
For the test event action, select Create new event, create an event using the default parameters, and then choose the Test button on the right-hand side, as shown in Figure 8.
Figure 8: AWS Lambda – Test page to manually invoke the function
This will run the function. You should see a success message, as shown in Figure 9. If this is the first time the Lambda function has been invoked, you will see in the results that two secrets have been created.
Figure 9: AWS Lambda function output
You can find the corresponding logs for the Lambda function invocation in a Log group in AWS CloudWatch matching the name /aws/lambda/SecretsManagerReplication-SecretReplicationLambdaF-XXXX.
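If you prefer the CLI over the console, the same test can be run as follows; the function name is the placeholder from the generated stack.

# Invoke the replication function manually with an empty test event
aws lambda invoke \
  --function-name SecretsManagerReplication-SecretReplicationLambdaF-XXXX \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' response.json
cat response.json

# Tail the function's log group to watch the replication run
aws logs tail /aws/lambda/SecretsManagerReplication-SecretReplicationLambdaF-XXXX --follow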
To verify that the secrets were added, navigate to AWS Secrets Manager in the console, and in addition to the vault-connection-secret that you edited before, you should now also see the two new secrets with the same hybrid-aws-secrets prefix, as shown in Figure 10.
Figure 10: AWS Secrets Manager overview – New replicated secrets
For example, if you look at first-secret-for-replication, you can see the first version of the secret, with the secret key secrets and secret value manager, as shown in Figure 11.
Figure 11: AWS Secrets Manager – New secret overview showing values and version number
Success! You now have access to the secret values that originate from HashiCorp Vault in AWS Secrets Manager. Also, notice how there is a version tag attached to the secret. This is something that is necessary to update the secret, which you will learn more about in the next two sections.
Update a secret
It’s a recommended security practice to rotate secrets frequently. The Lambda function in this solution not only replicates secrets when they are created — it also periodically checks if existing secrets in AWS Secrets Manager should be updated when the third-party secrets manager (HashiCorp Vault in this case) has a new version of the secret. To validate that this works, you can manually update a secret in your HashiCorp Vault and observe its replication in AWS Secrets Manager in the same way as described in the previous section. You will notice that the version tag of your secret gets updated automatically when there is a new secret replication from the third-party secrets manager to AWS Secrets Manager.
Secret replication logic
This section will explain in more detail the logic behind the secret replication. Consider the following sequence diagram, which explains the overall logic implemented in the Lambda function.
Figure 12: State diagram for secret replication logic
This diagram highlights that the Lambda function will first fetch a list of secret names from the HashiCorp Vault. Then, the function will get a list of secrets from AWS Secrets Manager, matching the prefix that was configured for replication. AWS Secrets Manager will return a list of the secrets that match this prefix and will also return their metadata and tags. Note that the function has not fetched any secret material yet.
Next, the function will loop through each secret name that HashiCorp Vault gave and will check if the secret exists in AWS Secrets Manager:
If there is no secret that matches that name, the function will fetch the secret material from HashiCorp Vault, including the version number, and create a new secret in AWS Secrets Manager. It will also add a version tag to the secret to match the version.
If there is a secret matching that name in AWS Secrets Manager already, the Lambda function will first fetch the metadata from that secret in HashiCorp Vault. This is required to get the version number of the secret, because the version number was not exposed when the function got the list of secrets from HashiCorp Vault initially. If the secret version from HashiCorp Vault does not match the version value of the secret in AWS Secrets Manager (for example, the version in HashiCorp vault is 2, and the version in AWS Secrets manager is 1), an update is required to get the values synchronized again. Only now will the Lambda function fetch the actual secret material from HashiCorp Vault and update the secret in AWS Secrets Manager, including the version number in the tag.
The Lambda function fetches metadata about the secrets, rather than just fetching the secret material from HashiCorp Vault straight away. Typically, secrets don’t update very often. If this Lambda function is called every 30 minutes, then it should not have to add or update any secrets in the majority of invocations. By using metadata to determine whether you need the secret material to create or update secrets, you minimize the number of times secret material is fetched both from HashiCorp Vault and AWS Secrets Manager.
Note: The AWS Lambda function has permissions to pull certain secrets from HashiCorp Vault. It is important to thoroughly review the Lambda code and any subsequent changes to it to prevent leakage of secrets. For example, you should ensure that the Lambda function does not get updated with code that unintentionally logs secret material outside the Lambda function.
Use your secret
Now that you have created and replicated your secrets, you can use them in your AWS applications or AWS services that are integrated with Secrets Manager. For example, you can use the secrets when you set up connectivity for a proxy in Amazon RDS, as follows.
To use a secret when creating a proxy in Amazon RDS
Go to the Amazon RDS service in the console.
In the left navigation pane, choose Proxies, and then choose Create Proxy.
On the Connectivity tab, you can now select first-secret-for-replication or second-secret-for-replication, which were created by the Lambda function after replicating them from the HashiCorp Vault.
Figure 13: Amazon RDS Proxy – Example of using replicated AWS Secrets Manager secrets
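The console steps above can also be expressed with the CLI. A hedged sketch follows; the proxy name, engine family, IAM role, subnets, and secret ARN are all placeholders.

# Create an RDS Proxy that authenticates to the database with the replicated secret
aws rds create-db-proxy \
  --db-proxy-name replication-demo-proxy \
  --engine-family MYSQL \
  --auth 'AuthScheme=SECRETS,SecretArn=arn:aws:secretsmanager:eu-west-1:111122223333:secret:hybrid-aws-secrets/first-secret-for-replication-AbCdEf,IAMAuth=DISABLED' \
  --role-arn arn:aws:iam::111122223333:role/rds-proxy-secrets-role \
  --vpc-subnet-ids subnet-0123456789abcdef0 subnet-0123456789abcdef1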
Due to the sensitive nature of the secrets, it is important that you scope down the permissions to the minimum required to prevent inadvertent access to your secrets. The setup adopts a least-privilege permission strategy, where only the necessary actions are explicitly allowed on the resources that are required for replication. However, the permissions should be reviewed in accordance with your security standards.
In the architecture of this solution, there are two main places where you control access to the management of your secrets in Secrets Manager.
Lambda execution IAM role: The IAM role assumed by the Lambda function during execution contains the appropriate permissions for secret replication. There is an additional safety measure, which explicitly denies any action to a resource that is not required for the replication. For example, the Lambda function only has permission to publish to the Amazon SNS topic that is created for the failed replications, and will explicitly deny a publish action to any other topic. Even if someone accidentally adds an allow to the policy for a different topic, the explicit deny will still block this action.
AWS KMS key policy: When other services need to access the replicated secret in AWS Secrets Manager, they need permission to use the hybrid-aws-secrets-encryption-key AWS KMS key. You need to allow the principal these permissions through the AWS KMS key policy. Additionally, you can manage permissions to the AWS KMS key for the principal through an identity policy. For example, this is required when accessing AWS KMS keys across AWS accounts. See Permissions for AWS services in key policies and Specifying KMS keys in IAM policy statements in the AWS KMS Developer Guide.
Options for customizing the sample solution
The solution that was covered in this post provides an example for replication of secrets from HashiCorp Vault to AWS Secrets Manager using the pull model. This section contains additional customization options that you can consider when setting up the solution, or your own variation of it.
Depending on the solution that you’re using, you might have access to different metadata attached to the secrets, which you can use to determine if a secret should be updated. For example, if you have access to data that represents a last_updated_datetime property, you could use this to infer whether or not a secret ought to be updated.
It is a recommended practice to avoid long-lived tokens wherever possible. In this sample, I used a static vault token to give the Lambda function access to the HashiCorp Vault. Depending on the solution that you’re using, you might be able to implement better authentication and authorization mechanisms. For example, HashiCorp Vault supports the AWS IAM auth method, which you can use instead of a static token.
This post addressed the creation of secrets and updating of secrets, but for your production setup, you should also consider deletion of secrets. Depending on your requirements, you can choose to implement a strategy that works best for you to handle secrets in AWS Secrets Manager once the original secret in HashiCorp Vault has been deleted. In the pull model, you could consider removing a secret in AWS Secrets Manager if the corresponding secret in your external secrets manager is no longer present.
In the sample setup, the same AWS KMS key is used to encrypt both the environment variables of the Lambda function, and the secrets in AWS Secrets Manager. You could choose to add an additional AWS KMS key (which would incur additional cost), to have two separate keys for these tasks. This would allow you to apply more granular permissions for the two keys in the corresponding KMS key policies or IAM identity policies that use the keys.
Conclusion
In this blog post, you’ve seen how you can approach replicating your secrets from an external secrets manager to AWS Secrets Manager. This post focused on a pull model, where the solution periodically fetched secrets from an external HashiCorp Vault and automatically created or updated the corresponding secret in AWS Secrets Manager. By using this model, you can now use your external secrets in your AWS Cloud applications or services that have an integration with AWS Secrets Manager.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Secrets Manager re:Post or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
As organizations continue to grow and scale their applications, the need for teams to be able to quickly and autonomously detect anomalous operational behaviors becomes increasingly important. Amazon DevOps Guru offers a fully managed AIOps service that enables you to improve application availability and resolve operational issues quickly. DevOps Guru helps ease this process by leveraging machine learning (ML) powered recommendations to detect operational insights, identify the exhaustion of resources, and provide suggestions to remediate issues. Many organizations running business critical applications use different tools to be notified about anomalous events in real time for the remediation of critical issues. Atlassian provides modern team collaboration and productivity software that helps teams organize, discuss, and complete shared work. You can deliver these insights in near-real time to DevOps teams by integrating DevOps Guru with Atlassian Opsgenie. Opsgenie is a modern incident management platform that receives alerts from your monitoring systems and custom applications and categorizes each alert based on importance and timing.
This blog post walks you through how to integrate Amazon DevOps Guru with Atlassian Opsgenie to receive notifications for new operational insights detected by DevOps Guru with more flexibility and customization using Amazon EventBridge and AWS Lambda. The Lambda function will be used to demonstrate how to customize insights sent to Opsgenie.
Solution overview
Figure 1: Amazon EventBridge Integration with Opsgenie using AWS Lambda
Amazon DevOps Guru directly integrates with Amazon EventBridge to notify you of events relating to generated insights and updates to insights. To begin routing these notifications to Opsgenie, you can configure routing rules to determine where to send them. You can also use the following predefined DevOps Guru patterns to send notifications or trigger actions in a supported AWS resource only for events that match a pattern:
DevOps Guru New Insight Open
DevOps Guru New Anomaly Association
DevOps Guru Insight Severity Upgraded
DevOps Guru New Recommendation Created
DevOps Guru Insight Closed
By default, all of the patterns referenced above are matched, so we leave them all operational in this implementation. However, you have the flexibility to change which of these patterns are sent to Opsgenie. When EventBridge receives an event, the EventBridge rule matches it and sends it to a target, such as AWS Lambda, which processes the event and sends the insight to Opsgenie.
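In the sample, this rule is defined in the SAM template. As a hedged standalone reference (the rule name, account ID, Region, and function name are placeholders), an equivalent rule that matches all DevOps Guru events can be created with the AWS CLI:

# Forward all Amazon DevOps Guru events to the Lambda function that notifies Opsgenie
aws events put-rule \
  --name devops-guru-to-opsgenie \
  --event-pattern '{"source": ["aws.devops-guru"]}'
aws events put-targets \
  --rule devops-guru-to-opsgenie \
  --targets 'Id=opsgenie-lambda,Arn=arn:aws:lambda:us-east-1:111122223333:function:devops-guru-opsgenie'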
Prerequisites
The following prerequisites are required for this walkthrough:
In this step, you will navigate to Opsgenie to create the integration with DevOps Guru and to obtain the API key and team name within your account. These parameters will be used as inputs in a later section of this blog.
Navigate to Teams, and take note of the team name you have as shown below, as you will need this parameter in a later section.
Figure 2: Opsgenie team names
Click on the team to proceed and navigate to Integrations on the left-hand pane. Click on Add Integration and select the Amazon DevOps Guru option.
Figure 3: Integration option for DevOps Guru
Now, scroll down and take note of the API Key for this integration and copy it to your notes as it will be needed in a later section. Click Save Integration at the bottom of the page to proceed.
Figure 4: API Key for DevOps Guru Integration
Now, the Opsgenie integration has been created and we’ve obtained the API key and team name. The email of any team member will be used in the next section as well.
Review & launch the AWS SAM template to deploy the solution
In this step, you will review and launch the SAM template. The template deploys an AWS Lambda function that is triggered by an Amazon EventBridge rule when Amazon DevOps Guru generates a new event. The Lambda function retrieves the parameters provided at deployment and pushes the events to Opsgenie via an API.
Reviewing the template
Below is the SAM template that will be deployed in the next step. This template launches a few key components specified earlier in the blog. The Transform section takes a template written in the AWS Serverless Application Model (AWS SAM) syntax and transforms and expands it into a compliant CloudFormation template. Under the Resources section, this solution deploys an AWS Lambda function using the Java runtime, as well as an Amazon EventBridge rule and pattern. Another key aspect of the template is the Parameters section. As shown below, ApiKey, Email, and TeamName are parameters for this CloudFormation template, which are then used as environment variables for our Lambda function to pass to Opsgenie.
Figure 5: Review of SAM Template
Launching the Template
Navigate to the directory of choice within a terminal and clone the GitHub repository with the following command:
Change directories with the command below to navigate to the directory of the SAM template.
cd amazon-devops-guru-connector-opsgenie/OpsGenieServerlessTemplate
From the CLI, use the AWS SAM CLI to build and process your AWS SAM template file, application code, and any applicable language-specific files and dependencies.
sam build
From the CLI, use the AWS SAM CLI to deploy the AWS resources for the pattern as specified in the template.yml file.
sam deploy --guided
You will now be prompted to enter the following information. Use the information obtained from the previous section to enter the Parameter ApiKey, Parameter Email, and Parameter TeamName fields.
Stack Name
AWS Region
Parameter ApiKey
Parameter Email
Parameter TeamName
Allow SAM CLI IAM Role Creation
Test the solution
Follow this blog to enable DevOps Guru and generate an operational insight.
When DevOps Guru detects a new insight, it generates an event in EventBridge. EventBridge then triggers the Lambda function, which sends the event to Opsgenie, as shown below.
Figure 6: Event published to Opsgenie with details such as the source, alert type, insight type, and a URL to the insight in the AWS console.
Cleaning up
To avoid incurring future charges, delete the resources.
From the command line, use AWS SAM to delete the serverless application along with its dependencies.
sam delete
Customizing Insights published using Amazon EventBridge & AWS Lambda
The DevOps Guru and Opsgenie integration is built on Amazon EventBridge and AWS Lambda, which gives you the flexibility to implement several customizations. One example is generating an Opsgenie alert only when a DevOps Guru insight severity is high. Another is forwarding notifications to the AIOps team when there is a serverless-related resource issue, or forwarding a database-related resource issue to your DBA team. This section walks you through how these customizations can be done.
EventBridge customization
EventBridge rules can be used to select specific events by using event patterns. As detailed below, you can trigger the Lambda function only if a new insight is opened and the severity is high. The advantage of this kind of customization is that the Lambda function is only invoked when needed.
Open the file template.yaml reviewed in the previous section and implement the changes as highlighted below under the Events section within Resources (original file on the left, changes on the right-hand side).
Figure 7: CloudFormation template file changed so that the EventBridge rule is only triggered when the alert type is “DevOps Guru New Insight Open” and insightSeverity is “high”.
Save the changes, and then use the following command to apply them:
sam deploy --template-file template.yaml
Accept the changeset deployment
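For reference, the event pattern highlighted in Figure 7 corresponds to the structure below. This sketch uses boto3 to create a standalone rule with the same pattern so you can experiment with it outside the SAM template; the rule name is hypothetical.
import json
import boto3

events = boto3.client("events")

# Match only newly opened DevOps Guru insights whose severity is high
pattern = {
    "source": ["aws.devops-guru"],
    "detail-type": ["DevOps Guru New Insight Open"],
    "detail": {"insightSeverity": ["high"]},
}

events.put_rule(
    Name="devops-guru-high-severity-insights",  # hypothetical rule name
    EventPattern=json.dumps(pattern),
)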
Determining the Ops team based on the resource type
Another customization is to change the Lambda code to route and control how alerts are managed. Let’s say you want to get your DBA team involved whenever DevOps Guru raises an insight related to an Amazon RDS resource. To begin this customization, the following changes need to be made within the AlertType.java file:
At the beginning of the file, the standard java.util.List and java.util.ArrayList classes are imported
Line 60: created a list of CloudWatch metrics namespaces
Line 74: Assigned the dataIdentifiers JsonNode to the variable dataIdentifiersNode
Line 75: Assigned the namespace JsonNode to a variable namespaceNode
Line 77: Added the namespace to the list for each DevOps Guru insight, which is always raised as an EventBridge event with the structure detail.anomalies[0].sourceDetails[0].dataIdentifiers.namespace
Line 88: Assigned the default responder team to the variable defaultResponderTeam
Line 89: Created the list of responders and assigned it to the variable respondersTeam
Line 92: Checked whether there is at least one AWS/RDS namespace
Line 93: Assigned the DBAOps_Team to the variable dbaopsTeam
Line 93: Included DBAOps_Team as part of the responders list
Line 97: Set the OpsGenie request teams to be the responders list
Figure 8: java.util.List and java.util.ArrayList packages were imported
Figure 9: AlertType Java class customized to include DBAOps_Team for RDS-related DevOps Guru insights.
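To summarize the Java changes above, the following illustrative Python sketch captures the same routing logic (the team names come from this walkthrough; the actual implementation lives in AlertType.java): collect the CloudWatch metric namespaces from the insight's anomalies, then add the DBA team as a responder whenever an AWS/RDS namespace is present.
def build_responders(event, default_responder_team="AIOps_Team"):
    # Gather every CloudWatch metrics namespace reported by the insight's anomalies
    namespaces = []
    for anomaly in event["detail"]["anomalies"]:
        for source_detail in anomaly["sourceDetails"]:
            namespaces.append(source_detail["dataIdentifiers"]["namespace"])

    # Always notify the default team; add the DBA team for RDS-related insights
    responders = [{"type": "team", "name": default_responder_team}]
    if "AWS/RDS" in namespaces:
        responders.append({"type": "team", "name": "DBAOps_Team"})
    return responders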
You then need to generate the jar file by using the mvn clean package command.
As a result, the DBAOps_Team will be assigned to the Opsgenie alert in the case a DevOps Guru insight is related to RDS.
Figure 10: Opsgenie alert assigned to both DBAOps_Team and AIOps_Team.
Conclusion
In this post, you learned how Amazon DevOps Guru integrates with Amazon EventBridge and publishes insights to Opsgenie using AWS Lambda. By creating an Opsgenie integration with DevOps Guru, you can now leverage Opsgenie’s strengths in incident management, team communication, and collaboration when responding to an insight. All of the insight data can be viewed and addressed in Opsgenie’s Incident Command Center (ICC). By customizing the data sent to Opsgenie via Lambda, you can empower your organization even more by fine-tuning and displaying the most relevant data, decreasing the mean time to resolve (MTTR) for the responding operations team.
This blog post is written by Jonathan Tuliani, Principal Product Manager.
Today, AWS Lambda is announcing runtime management controls which provide more visibility and control over when Lambda applies runtime updates to your functions. Lambda is also changing how it rolls out automatic runtime updates to your functions. Together, these changes provide more flexibility in how you take advantage of the latest runtime features, performance improvements, and security updates.
By default, all functions will continue to receive automatic runtime updates. You do not need to change how you use Lambda to continue to benefit from the security and operational simplicity of the managed runtimes Lambda provides. The runtime controls are optional capabilities for advanced customers that require more control over their runtime changes.
This post explains what new runtime management controls are available and how you can take advantage of this new capability.
Overview
For each runtime, Lambda provides a managed execution environment. This includes the underlying Amazon Linux operating system, programming language runtime, and AWS SDKs. Lambda takes on the operational burden of applying patches and security updates to all these components. Customers tell us how much they appreciate being able to deploy a function and leave it, sometimes for years, without having to apply patches. With Lambda, patching ‘just works’, automatically.
Lambda strives to provide updates which are backward compatible with existing functions. However, as with all software patching, there are rare cases where a patch can expose an underlying issue with an existing function that depends on the previous behavior. For example, consider a bug in one of the runtime OS packages. Applying a patch to fix the bug is the right choice for the vast majority of customers and functions. However, in rare cases, a function may depend on the previous (incorrect) behavior. Customers with critical workloads running in Lambda tell us they would like a way to further mitigate even this slight risk of disruption.
With the launch of runtime management controls, Lambda now offers three new capabilities. First, Lambda provides visibility into which patch version of a runtime your function is using and when runtime updates are applied. Second, you can optionally synchronize runtime updates with function deployments. This provides you with control over when Lambda applies runtime updates and enables early detection of rare runtime update incompatibilities. Third, in the rare case where a runtime update incompatibility occurs, you can roll back your function to an earlier runtime version. This keeps your function working and minimizes disruption, providing time to remedy the incompatibility before returning to the latest runtime version.
Runtime identifiers and runtime versions
Lambda runtimes define the components of the execution environment in which your function code runs. This includes the OS, programming language runtime, environment variables, and certificates. For Python, Node.js and Ruby, the runtime also includes the AWS SDK. Each Lambda runtime has a unique runtime identifier, for example, nodejs18.x, or python3.9. Each runtime identifier represents a distinct major release of the programming language.
Runtime management controls introduce the concept of Lambda runtime versions. A runtime version is an immutable version of a particular runtime. Each Lambda runtime, such as Node.js 16, or Python 3.9, starts with an initial runtime version. Every time Lambda updates the runtime, it adds a new runtime version to that runtime. These updates cover all runtime components (OS, language runtime, etc.) and therefore use a Lambda-defined numbering scheme, independent of the version numbers used by the programming language. For each runtime version, Lambda also publishes a corresponding base image for customers who package their functions as container images.
New runtime identifiers represent a major release for the programming language, and sometimes other runtime components, such as the OS or SDK. Lambda cannot guarantee compatibility between runtime identifiers, although most times you can upgrade your functions with little or no modification. You control when you upgrade your functions to a new runtime identifier. In contrast, new runtime versions for the same runtime identifier have a very high level of backward compatibility with existing functions. By default, Lambda automatically applies runtime updates by moving functions from the previous runtime version to a newer runtime version.
Each runtime version has a version number, and an Amazon Resource Name (ARN). You can view the version in a new platform log line, INIT_START. Lambda emits this log line each time it creates a new execution environment during the cold start initialization process.
INIT_START Runtime Version: python:3.9.v14 Runtime Version ARN: arn:aws:lambda:eu-south-1::runtime:7b620fc2e66107a1046b140b9d320295811af3ad5d4c6a011fad1fa65127e9e6I
Runtime versions improve visibility into managed runtime updates. You can use the INIT_START log line to identify when the function transitions from one runtime version to another. This helps you investigate whether a runtime update might have caused any unexpected behavior of your functions. Changes in behavior caused by runtime updates are very rare. If your function isn’t behaving as expected, by far the most likely cause is an error in the function code or configuration.
Runtime management modes
With runtime management controls, you now have more control over when Lambda applies runtime updates to your functions. You can now specify a runtime management configuration for each function. You can set the runtime management configuration independently for $LATEST and each published function version.
You can specify one of three runtime update modes: auto, function update, or manual. The runtime update mode controls when Lambda updates the function version to a new runtime version. By default, all functions receive runtime updates automatically; the alternatives are for advanced users with specific requirements.
Automatic
Auto updates are the default and are the right choice for most customers, ensuring that you continue to benefit from runtime version updates. They help minimize your operational overhead by letting Lambda take care of runtime updates.
While Lambda has always provided automatic runtime updates, this release includes a change to how automatic runtime updates are rolled out. Previously, Lambda applied runtime updates to all functions in each region, following a region-by-region deployment sequence. With this release, functions configured to use the auto runtime update mode now receive runtime updates in two phases. Initially, Lambda only applies a new runtime version to newly created or updated functions. After this initial period is complete, Lambda then applies the runtime update to any remaining functions configured to use the auto runtime update mode.
This two-phase rollout synchronizes runtime updates with function updates for customers who are actively developing their functions. This makes it easier to detect and respond to any unexpected changes in behavior. For functions not in active development, auto mode continues to provide the operational benefits of fully automatic runtime updates.
Function update
In function update mode, Lambda updates your function to the latest available runtime version whenever you change your function code or configuration. This is the same as the first phase of auto mode. The difference is that, in auto mode, there is a second phase when Lambda applies runtime updates to functions which have not been changed. In function update mode, if you do not change a function, it continues to use the current runtime version indefinitely. This means that when using function update mode, you must update your functions regularly to keep their runtimes up-to-date. If you do not update a function regularly, you should use the auto runtime update mode.
Synchronizing runtime updates with function deployments gives you control over when Lambda applies runtime updates. For example, you can avoid applying updates during business-critical events, such as a product launch or holiday sales.
When used with CI/CD pipelines, function update mode enables early detection and mitigation in the rare event of a runtime update incompatibility. This is especially effective if you create a new published function version with each deployment. Each published function version captures a static copy of the function code and configuration, so that you can roll back to a previously published function version if required. Using function update mode extends the published function version to also capture the runtime version. This allows you to synchronize rollout and rollback of the entire Lambda execution environment, including function code, configuration, and runtime version.
Manual
Consider the rare event that a runtime update is incompatible with one of your functions. With runtime management controls, you can now roll back to an earlier runtime version. This keeps your function working and minimizes disruption, giving you time to remedy the incompatibility before returning to the latest runtime version.
There are two ways to implement a runtime version rollback. You can use function update mode with a published function version to synchronize the rollback of code, configuration, and runtime version. Or, for functions using the default auto runtime update mode, you can roll back your runtime version by using manual mode.
The manual runtime update mode provides you with full control over which runtime version your function uses. When enabling manual mode, you must specify the ARN of the runtime version to use, which you can find from the INIT_START log line.
Lambda does not impose a time limit on how long you can use any particular runtime version. However, AWS strongly recommends using manual mode only for short-term remediation of code incompatibilities. Revert your function back to auto mode as soon as you resolve the issue. Functions using the same runtime version for an extended period may eventually stop working due to, for example, a certificate expiry.
From the Lambda console, navigate to a specific function. You can find runtime management controls on the Code tab, in the Runtime settings panel. Expand Runtime management configuration to view the current runtime update mode and runtime version ARN.
Runtime settings
To change runtime update mode, select Edit runtime management configuration. You can choose between automatic, function update, and manual runtime update modes.
Edit runtime management configuration (Auto)
In manual mode, you must also specify the runtime version ARN.
Edit runtime management configuration (Manual)
AWS SAM
AWS SAM is an open-source framework for building serverless applications. You can specify runtime management settings using the RuntimeManagementConfig property.
You can also manage runtime management settings using the AWS CLI. You configure runtime management controls via a dedicated command aws lambda put-runtime-management-config, rather than aws lambda update-function-configuration.
The current runtime version ARN is also returned by aws lambda get-function and aws lambda get-function-configuration.
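As a minimal sketch, assuming a function named my-function, the same configuration can be applied and inspected with the AWS SDK for Python (Boto3):
import boto3

lambda_client = boto3.client("lambda")

# Synchronize runtime updates with function deployments (function update mode)
lambda_client.put_runtime_management_config(
    FunctionName="my-function",        # hypothetical function name
    UpdateRuntimeOn="FunctionUpdate",  # one of Auto, FunctionUpdate, or Manual
)

# Manual mode also requires the runtime version ARN, for example taken from the INIT_START log line:
# lambda_client.put_runtime_management_config(
#     FunctionName="my-function",
#     UpdateRuntimeOn="Manual",
#     RuntimeVersionArn="arn:aws:lambda:eu-south-1::runtime:...",  # placeholder ARN
# )

# The current runtime version details are returned with the function configuration
config = lambda_client.get_function_configuration(FunctionName="my-function")
print(config.get("RuntimeVersionConfig"))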
Conclusion
Runtime management controls provide more visibility and flexibility over when and how Lambda applies runtime updates to your functions. You can specify one of three update modes: auto, function update, or manual. These modes allow you to continue to take advantage of Lambda’s automatic patching, synchronize runtime updates with your deployments, and rollback to an earlier runtime version in the rare event that a runtime update negatively impacts your function.
For more information on runtime management controls, see our documentation page.
For more serverless learning resources, visit Serverless Land.
This post is written by Adrian Hornsby (Principal System Dev Engineer) and Marcia Villalba (Principal Developer Advocate).
AWS Lambda comprises over 80 services working together to provide the serverless compute service that it offers to customers. Under the hood, many of these services are built on top of Amazon Elastic Compute Cloud (Amazon EC2) instances, provisioned within Availability Zones. However, AWS Lambda is a Regional service. This means that customers use Lambda services from the Region level and its services are designed to be resilient to impairments that the underlying Availability Zones might have.
Availability Zones are physically isolated sections of an AWS Region, designed to operate but also fail independently. They are separated by a meaningful distance from each other, up to 100 kilometers (60 miles), to prevent correlated failures, but close enough to use synchronous replication with single-digit millisecond latency.
Customers and AWS services have been using Availability Zones for years to build highly available, fault tolerant, and scalable applications. In particular, AWS Regional services such as AWS Lambda, Amazon DynamoDB, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Storage Service (Amazon S3) have achieved their high availability promises by spreading multiple independent replicas of their services across multiple Availability Zones. Each of these services uses the principles of independence and redundancy of Availability Zones to maximize its overall availability.
Each replica is called a zonal replica. The system is designed so that any of the replicas can fail at any time. When a replica fails, it can be temporarily removed from the system until everything works as expected again. When that happens, the load is shared between the remaining zonal replicas.
Designing for failures
One lesson we learned at AWS when building services is that when there is an Availability Zone impairment, it is better not to rely on control plane operations to remediate the failure. A control plane operation can be, for example, provisioning more capacity in an Availability Zone that is not affected by the impairment.
This principle is called static stability, and it describes the capability for a system to keep its original steady-state (or behavior) even when subjected to disruptive events without having to make any changes. A statically stable service should have as few dependencies as possible for its recovery process.
For a Regional service like AWS Lambda, this means that the remaining capacity in the healthy Availability Zones can absorb the traffic from a potentially impaired Availability Zone without having to scale up. This implies over-provisioning resources in all Availability Zones. Having that extra capacity pre-provisioned helps Lambda achieve its static stability. It is a tradeoff between the cost of over-provisioning resources and service availability. Since AWS Lambda promises high availability to its customers, with a monthly uptime service commitment of 99.95%, that tradeoff falls towards service availability.
How to prepare for failures
Preparing for an Availability Zone impairment is difficult because the symptoms and size of the impact can vary widely. An Availability Zone may be partially accessible or totally unreachable, and everything in between. Causes for the impairment can range from fiber cuts, power issues, overheating, hardware malfunctions, networking problems, capacity issues, and other unexpected situations. While those happen, they happen rarely. The most common categories of failures are bad deployments and bad configurations.
While some of these failures can be difficult to infer or reproduce, common symptoms include disruption of connectivity, increased latency, increased traffic due to retry storms, increased CPU and memory usage, and slow I/O.
At AWS, we learned to expect the unexpected and plan for failure. This means injecting faults in the system to reproduce some of the common symptoms of Availability Zone impairments, then observe how the system responds, and implement improvements. In addition, injecting faults in the system helps uncover potential monitoring and alarming blind spots, and gives an opportunity for teams to practice and improve their response to events with a focus on reducing time to recovery.
How Lambda tests its response to an Availability Zone impairment
Lambda’s approach to being resilient to Availability Zone impairments is to rely on static stability and automated systems. Humans are slower than machines for detecting issues and mitigating them. Therefore, Lambda must ensure that its services can detect issues within a zonal replica and remediate automatically within minutes and with no operator intervention. This auto-remediation is done by shifting customer traffic away from the affected Availability Zone to healthy ones, and it is called Availability Zone evacuation.
To do this, Lambda built a tool that detects failures and performs the Availability Zone evacuation when needed. This tool does a statistical comparison of metrics between different Availability Zones and EC2 instances in order to identify unhealthy Availability Zones. If an Availability Zone is found to have issues, the tool starts the evacuation out of the unhealthy Availability Zone automatically. This automation cuts the time to the first action from 30 minutes to less than 3 minutes.
How AWS Lambda uses AWS FIS
To verify the automation continuously works as expected, Lambda performs a wide variety of tests, which includes Availability Zone failure testing in their pre-production environment. The main objective of these tests is to verify the services are statically stable in the presence of Availability Zone impairments, and to verify that the Availability Zone evacuation can be successfully initiated. The benefit of having an automated test is that teams can repeat it regularly and don’t need to have special skills. One click is all it takes to launch the test.
For these tests, Lambda uses AWS FIS to inject faults into their large fleet of EC2 instances. They use AWS FIS with support of the AWS Systems Manager (SSM) agent and resource filters to target their fleet of EC2 instances in a particular Availability Zone. This is a versatile approach that can inject resource faults, such as CPU and memory exhaustion, and networking faults, such as packet latency, loss, or drop.
Injecting packet loss or latency is very important, since these symptoms can have a serious impact on application and network performance. Indeed, latency and loss, even in small quantities, can create inefficiencies and prevent applications from running at their peak performance. For Lambda, being able to detect increased latency or loss before it affects customers is critical.
How to recover your applications rapidly from Availability Zones failures
You can build a similar solution to rapidly recover your applications from a zonal failure. The solution must have a mechanism to evacuate an impaired Availability Zone, a monitoring system that allows you to detect when a zonal replica is impaired, and a way to test the static stability of your system. AWS provides many tools and services that can help you build this solution to achieve Lambda’s resiliency strategy.
For performing Availability Zone evacuation, you can use the new zonal shift capability from Route 53 ARC, which at the time of writing is in preview. Zonal shift lets you evacuate an Availability Zone for applications that use Elastic Load Balancing. If you find that a zonal replica is impaired or unhealthy, you can use zonal shift to evacuate the Availability Zone for a period of time while the issue gets fixed.
For performing the zonal shift, you must detect when a zonal replica is unhealthy. Your application must provide a signal of its health per Availability Zone. There are two common ways to capture this signal. First, passively, you can check your metrics, like response times, HTTP status codes, and other metrics that can help track fatal errors in your applications. Or actively, using synthetic monitoring, which allows you to create synthetic requests against your production application to provide a more complete view of the customer experience.
Amazon CloudWatch Synthetics provides canaries, which are scripts that run on a schedule and perform synthetic requests in your application endpoints and APIs. Canaries perform the same actions as customers and continuously verify the customer experience. You can create a canary for each zonal replica of your application and monitor the results independently.
With this information, if the user experience diminishes in one of the replicas, you can start an Availability Zone evacuation using zonal shift and minimize the bad experience for the user while you find and fix the sources of the failure.
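As a minimal sketch, assuming a load balancer already managed by Route 53 ARC zonal shift, the evacuation itself can be started with Boto3 (the ARN and Availability Zone shown are placeholders):
import boto3

client = boto3.client("arc-zonal-shift")

# Shift traffic away from the impaired Availability Zone for two hours
client.start_zonal_shift(
    resourceIdentifier="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123",  # placeholder ARN
    awayFrom="use1-az1",      # Availability Zone to evacuate
    expiresIn="2h",           # the shift expires automatically
    comment="Canary latency degraded in this zone",
)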
To ensure that you can successfully recover from a failure, you must test the solution in advance. Without testing, it is just an assumption. To prove or disprove your assumptions about your system’s capability to handle disruptive events such as issues within an Availability Zone, you can use FIS.
Typical use cases for testing a workload’s resilience to Availability Zones impairment are, for example, terminating all compute resources and databases within a particular Availability Zone, injecting latency or packet loss, increasing resource consumption (CPU, memory, and I/O) in compute resources in a particular Availability Zone, or impacting network communication within or between Availability Zones.
For more information and a step-by-step example of how to recover rapidly from application failures in a single Availability Zone and testing it with AWS FIS, read this blog post.
Conclusion
This article discusses static stability, a mechanism that is used by AWS services such as Lambda to build resilient Regional services. It also discusses how AWS takes advantage of the same services and infrastructure as customers. It shows how Lambda uses multiple Availability Zones and services like AWS FIS to build highly available services and improve its recovery time from unexpected failures to only a few minutes without human intervention. Finally, it shows a solution that you can implement for your applications to achieve Lambda’s resilience strategy.
Text data is a common type of unstructured data found in analytics. It is often stored without a predefined format and can be hard to obtain and process.
For example, web pages contain text data that data analysts collect through web scraping and pre-process using lowercasing, stemming, and lemmatization. After pre-processing, the cleaned text is analyzed by data scientists and analysts to extract relevant insights.
This blog post covers how to effectively handle text data using a data lake architecture on Amazon Web Services (AWS). We explain how data teams can independently extract insights from text documents using OpenSearch as the central search and analytics service. We also discuss how to index and update text data in OpenSearch and evolve the architecture towards automation.
Architecture overview
This architecture outlines the use of AWS services to create an end-to-end text analytics solution, starting from the data collection and ingestion up to the data consumption in OpenSearch (Figure 1).
Figure 1. Data lake architecture with OpenSearch
Collect data from various sources, such as SaaS applications, edge devices, logs, streaming media, and social networks.
Validate, clean, normalize, transform, and enrich the data through a series of pre-processing steps using AWS Glue or Amazon EMR.
Place the data that is ready to be indexed in the indexing zone.
Use AWS Lambda to index the documents into OpenSearch and store them back in the data lake with a unique identifier.
Use the clean zone as the source of truth for teams to consume the data and calculate additional metrics.
Develop, train, and generate new metrics using machine learning (ML) models with Amazon SageMaker or artificial intelligence (AI) services like Amazon Comprehend.
Store the new metrics in the enriching zone along with the identifier of the OpenSearch document.
Use the identifier column from the initial indexing phase to identify the correct documents and update them in OpenSearch with the newly calculated metrics using AWS Lambda.
Use OpenSearch to search through the documents and visualize them with metrics using OpenSearch Dashboards.
Considerations
Data lake orchestration among teams
This architecture allows data teams to work independently on text documents at different stages of their lifecycles. The data engineering team manages the raw and indexing zones and also handles data ingestion and preprocessing for indexing in OpenSearch.
The cleaned data is stored in the clean zone, where data analysts and data scientists generate insights and calculate new metrics. These metrics are stored in the enrich zone and indexed as new fields in the OpenSearch documents by the data engineering team (Figure 2).
Figure 2. Data lake orchestration among teams
Let’s explore an example. Consider a company that periodically retrieves blog site comments and performs sentiment analysis using Amazon Comprehend. In this case:
The comments are ingested into the raw zone of the data lake.
The data engineering team processes the comments and stores them in the indexing zone.
A Lambda function indexes the comments into OpenSearch, enriches the comments with the OpenSearch document ID, and saves them in the clean zone.
The data science team consumes the comments and performs sentiment analysis using Amazon Comprehend.
The sentiment analysis metrics are stored in the enrich zone of the data lake. A second Lambda function updates the comments in OpenSearch with the new metrics.
If the raw data does not require any preprocessing steps, the indexing and clean zones can be combined. You can explore this specific example, along with code implementation, in the AWS samples repository.
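A minimal sketch of the sentiment analysis step, assuming English-language comments and hypothetical field names (the sample repository's implementation may differ):
import boto3

comprehend = boto3.client("comprehend")

def enrich_comment(comment):
    # comment holds the cleaned text plus the OpenSearch document ID assigned at indexing time
    result = comprehend.detect_sentiment(Text=comment["clean_text"], LanguageCode="en")
    return {
        "id": comment["id"],                          # used later to update the OpenSearch document
        "sentiment": result["Sentiment"],             # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
        "sentiment_score": result["SentimentScore"],  # confidence scores per sentiment
    }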
Schema evolution
As your data progresses through data lake stages, the schema changes and gets enriched accordingly. Continuing with our previous example, Figure 3 explains how the schema evolves.
Figure 3. Schema evolution through the data lake stages
In the raw zone, there is a raw text field received directly from the ingestion phase. It’s best practice to keep a raw version of the data as a backup, or in case the processing steps need to be repeated later.
In the indexing zone, the clean text field replaces the raw text field after being processed.
In the clean zone, we add a new ID field that is generated during indexing and identifies the OpenSearch document of the text field.
In the enrich zone, the ID field is required. Other fields with metric names are optional and represent new metrics calculated by other teams that will be added to OpenSearch.
Consumption layer with OpenSearch
In OpenSearch, data is organized into indices, which can be thought of as tables in a relational database. Each index consists of documents—similar to table rows—and multiple fields, similar to table columns. You can add documents to an index by indexing and updating them using various client APIs for popular programming languages.
Now, let’s explore how our architecture integrates with OpenSearch in the indexing and updating stage.
Indexing and updating documents using Python
The index document API operation allows you to index a document with a custom ID, or assigns one if none is provided. To speed up indexing, we can use the bulk index API to index multiple documents in one call.
We need to store the IDs back from the index operation to later identify the documents we’ll update with new metrics. Let’s explore two ways of doing this:
Use the requests library to call the REST Bulk Index API (preferred): the response returns the auto-generated IDs we need.
Use the Python Low-Level Client for OpenSearch: the IDs are not returned, so they need to be pre-assigned in order to store them later. We can use an atomic counter in Amazon DynamoDB to do so, which allows multiple Lambda functions to index documents in parallel without ID collisions.
As in Figure 4, the Lambda function:
Increases the atomic counter by the number of documents that will index into OpenSearch.
Gets the value of the counter back from the API call.
Indexes the documents using IDs in the range from (current counter value - number of documents + 1) up to the current counter value.
Figure 4. Storing the IDs back from the bulk index operation using the Python Low-Level Client for OpenSearch
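The following minimal sketch shows the preferred approach with the requests library; the endpoint, index name, and authentication are placeholders to adapt to your OpenSearch domain.
import json
import requests

OPENSEARCH_URL = "https://my-domain.example.com"  # placeholder endpoint
INDEX = "comments"                                # placeholder index name

def bulk_index(documents):
    # Build the NDJSON bulk payload: an action line followed by each document
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": INDEX}}))
        lines.append(json.dumps(doc))
    payload = "\n".join(lines) + "\n"

    response = requests.post(
        f"{OPENSEARCH_URL}/_bulk",
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
        auth=("user", "password"),  # replace with SigV4 or your domain's authentication
        timeout=30,
    )
    response.raise_for_status()

    # The response contains the auto-generated ID of every indexed document
    return [item["index"]["_id"] for item in response.json()["items"]]

doc_ids = bulk_index([{"clean_text": "great article"}, {"clean_text": "needs more detail"}])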
Data flow automation
As architectures evolve towards automation, the data flow between data lake stages becomes event-driven. Following our previous example, we can automate the processing steps of the data when moving from the raw to the indexing zone (Figure 5).
The same approach can be applied to the other data lake stages to achieve a fully automated architecture. Explore this implementation for an automated language use case.
Conclusion
In this blog post, we covered designing an architecture to effectively handle text data using a data lake on AWS. We explained how different data teams can work independently to extract insights from text documents at different lifecycle stages using OpenSearch as the search and analytics service.
In the telecommunications industry, sensitive authentication and user data are typically received through mobile voice and keypads, and companies are responsible for protecting the data obtained through these channels. The increasing use of voice-driven interactive voice response (IVR) has resulted in a need to provide solutions that can protect user data that is gathered from mobile voice inputs. In this blog post, you’ll see how to protect a caller’s sensitive voice data that was captured through Amazon Lex by using data encryption implemented through AWS Lambda functions. The solution described in this post helps you to protect customer data received through voice channels from inadvertent or unknown access. The solution also includes decryption capabilities, which give an authorized administrator or operator the ability to decrypt user data from a Lambda console.
Solution overview
To demonstrate the IVR solution described in this post, a caller speaks two sensitive pieces of data—credit card number and zip code—from an Amazon Connect contact flow. The spoken values are encrypted and returned to the contact flow to be stored in contact attributes. The encrypted ciphertext is retained as a contact attribute for decryption purposes. Amazon CloudWatch Logs is enabled in the contact flow, but only the encrypted values are logged in log streams.
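The encryption step can be sketched as follows. This is an illustration only, assuming a symmetric AWS KMS key and direct encryption of the captured value; the key alias is hypothetical and the repository's Lambda functions may implement the details differently.
import base64
import boto3

kms = boto3.client("kms")

def encrypt_value(plaintext, key_id="alias/my-ivr-key"):  # hypothetical key alias
    # Encrypt the captured digits and base64-encode the ciphertext so it can be
    # stored as an Amazon Connect contact attribute
    result = kms.encrypt(KeyId=key_id, Plaintext=plaintext.encode("utf-8"))
    return base64.b64encode(result["CiphertextBlob"]).decode("utf-8")

def decrypt_value(ciphertext_b64):
    # Reverse the operation, as the authorized decryption Lambda function would
    result = kms.decrypt(CiphertextBlob=base64.b64decode(ciphertext_b64))
    return result["Plaintext"].decode("utf-8")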
In the newly created Amazon Connect instance, under the Overview section, find the access URL with the format https://<aliasname>.awsapps.com/connect/login
Make note of the access URL, which you will use later to log in to the Amazon Connect Dashboard.
From the Routing menu on the left side, choose Contact flows to show the list of contact flows.
Choose Create Contact flow.
Choose the arrow to the right of the Save button and choose Import flow (beta). This imports the contact flow that you previously downloaded in the procedure To clone or download the solution.
The contact flow already has the Amazon Lex bot configured.
Figure 5: Select Import flow (beta)
In the upper right corner of the contact flow, choose Save, and then choose OK to save the changes.
Choose Publish to make the contact flow ready for use during the validation steps.
(Optional) Claim a phone number (if none is available), using the following steps:
In the Connect Dashboard, on the navigation menu, choose Channels, and then choose Phone numbers.
On the right side of the page, choose Claim a number.
Select the DID (Direct Inward Dialing) tab. Use the drop-down arrow to choose your country/region. When numbers are returned, choose one.
Write down the phone number. You call it later in this post.
(Optional) On the Edit Phone number page, in the Description box, you can type a note if desired.
To assign the contact flow to your claimed phone number, for Contact flow / IVR, choose the drop-down arrow, and then choose Secure_Lex_Input.
Choose Save.
Figure 6: Under Contact flow / IVR, select the imported contact flow
Dial the test phone number to go through the voice prompt flow.
When prompted, speak a 16-digit credit card number (you have a maximum of two retries), then speak a 5-digit zip code (also a maximum of two retries).
After you complete your test call, review the log streams in Amazon CloudWatch Logs to confirm that the digits that you entered are now encrypted and stored as a contact attribute. The two entered values zipcode and creditcard are stored in contact attributes. Both are encrypted.
Figure 7: Sample log showing encrypted values for zipcode and creditcard
Log in to your Amazon Connect Dashboard as a Supervisor. The URL is provided after the connect instance has been created. In the navigation menu, choose Contact search.
Figure 8: Choose Contact search to look for the call information
Locate your inbound call on the Contact search list. Note that it can take up to 60 seconds for data to appear in the Contact search list.
Select the Contact ID for your call.
Figure 9: The Contact search showing the contact details for your test call
Copy the encrypted values for creditcard and zipcode and make note of them; you will use these values in the next procedure.
Figure 10: Contact attributes stored in a contact flow are registered as part of the contact details
Use the Search bar to look for the dev-encryption-core-DecryptFn Lambda function, and then select the name link to open it.
Under the encryption-master folder, open the test folder. Under the events tab, locate the file decrypt.json.
Use the following steps to create a sample test event in the console by using the contents from decrypt.json. For more details, see Testing Lambda functions in the console.
Use the encrypted values saved in the Validate a solution procedure and replace the ones in the recently created test event.
Figure 11: Replace the creditcard or zipCode values with the ones from the Contact Search page
Choose Test. The output from the test shows the values decrypted by the Lambda function. This is shown in Figure 12 under the Execution result tab.
Figure 12: Result from the decryption operation
Note: Make sure that only the appropriate authorized administrator or operator, application, or AWS service is able to invoke the decryption Lambda function.
You have now successfully implemented the solution by encrypting and decrypting the voice input of your test call, which you collected through Amazon Lex.
Cleanup
To avoid incurring future charges, follow these steps to clean up the deployed resources that you created when implementing this solution.
To delete the Amazon Connect instance
In the Amazon Connect console, under Instance alias, select the name of the Amazon Connect instance, and choose Delete.
When prompted, type the name of the instance, and then choose Delete.
To delete the Amazon Lex bot
In the Amazon Lex console, choose the bot that you created in the To configure the Amazon Lex bot procedure.
Choose Delete, and then choose Continue.
To delete the AWS CloudFormation stack
In the AWS CloudFormation console, on the Stacks page, select the stack you created in the procedure To create AWS resources needed for encryption and decryption.
In the stack details pane, choose Delete.
Choose Delete stack when prompted. This deletes the Amazon S3 bucket, IAM roles and AWS Lambda functions you created for testing. This will also schedule a deletion date on the AWS KMS key.
Conclusion
In this post, you learned how an Amazon Connect contact flow can collect voice inputs from a caller by using Amazon Lex, and how you can encrypt these inputs by using your own AWS KMS key. This solution can help improve the security of voice input that is collected through Amazon Connect. For cost information, see the Amazon Connect pricing page.
For more information, see the blog post Creating a secure IVR solution with Amazon Connect and the topic Encrypt customer input (using OpenSSL) in the Amazon Connect Administrator Guide. As previously mentioned, the increasing use of voice-driven IVR has resulted in a need to provide solutions that can protect user data gathered from mobile voice inputs.
If you need help with setting up this solution, you can get assistance from AWS Professional Services. You can also seek assistance from Amazon Connect partners available worldwide.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
AWS re:Invent returned to Las Vegas, Nevada, November 28 to December 2, 2022. After a virtual event in 2020 and a hybrid 2021 edition, spirits were high as over 51,000 in-person attendees returned to network and learn about the latest AWS innovations.
Now in its 11th year, the conference featured 5 keynotes, 22 leadership sessions, and more than 2,200 breakout sessions and hands-on labs at 6 venues over 5 days.
With well over 100 service and feature announcements—and innumerable best practices shared by AWS executives, customers, and partners—distilling highlights is a challenge. From a security perspective, three key themes emerged.
Turn data into actionable insights
Security teams are always looking for ways to increase visibility into their security posture and uncover patterns to make more informed decisions. However, as AWS Vice President of Data and Machine Learning, Swami Sivasubramanian, pointed out during his keynote, data often exists in silos; it isn’t always easy to analyze or visualize, which can make it hard to identify correlations that spark new ideas.
“Data is the genesis for modern invention.” – Swami Sivasubramanian, AWS VP of Data and Machine Learning
At AWS re:Invent, we launched new features and services that make it simpler for security teams to store and act on data. One such service is Amazon Security Lake, which brings together security data from cloud, on-premises, and custom sources in a purpose-built data lake stored in your account. The service, which is now in preview, automates the sourcing, aggregation, normalization, enrichment, and management of security-related data across an entire organization for more efficient storage and query performance. It empowers you to use the security analytics solutions of your choice, while retaining control and ownership of your security data.
Amazon Security Lake has adopted the Open Cybersecurity Schema Framework (OCSF), which AWS cofounded with a number of organizations in the cybersecurity industry. The OCSF helps standardize and combine security data from a wide range of security products and services, so that it can be shared and ingested by analytics tools. More than 37 AWS security partners have announced integrations with Amazon Security Lake, enhancing its ability to transform security data into a powerful engine that helps drive business decisions and reduce risk. With Amazon Security Lake, analysts and engineers can gain actionable insights from a broad range of security data and improve threat detection, investigation, and incident response processes.
Strengthen security programs
According to Gartner, by 2026, at least 50% of C-Level executives will have performance requirements related to cybersecurity risk built into their employment contracts. Security is top of mind for organizations across the globe, and as AWS CISO CJ Moses emphasized during his leadership session, we are continuously building new capabilities to help our customers meet security, risk, and compliance goals.
In addition to Amazon Security Lake, several new AWS services announced during the conference are designed to make it simpler for builders and security teams to improve their security posture in multiple areas.
Identity and networking
Authorization is a key component of applications. Amazon Verified Permissions is a scalable, fine-grained permissions management and authorization service for custom applications that simplifies policy-based access for developers and centralizes access governance. The new service gives developers a simple-to-use policy and schema management system to define and manage authorization models. The policy-based authorization system that Amazon Verified Permissions offers can shorten development cycles by months, provide a consistent user experience across applications, and facilitate integrated auditing to support stringent compliance and regulatory requirements.
Additional services that make it simpler to define authorization and service communication include Amazon VPC Lattice, an application-layer service that consistently connects, monitors, and secures communications between your services, and AWS Verified Access, which provides secure access to corporate applications without a virtual private network (VPN).
Threat detection and monitoring
Monitoring for malicious activity and anomalous behavior just got simpler. Amazon GuardDuty RDS Protection expands the threat detection capabilities of GuardDuty by using tailored machine learning (ML) models to detect suspicious logins to Amazon Aurora databases. You can enable the feature with a single click in the GuardDuty console, with no agents to manually deploy, no data sources to enable, and no permissions to configure. When RDS Protection detects a potentially suspicious or anomalous login attempt that indicates a threat to your database instance, GuardDuty generates a new finding with details about the potentially compromised database instance. You can view GuardDuty findings in AWS Security Hub, Amazon Detective (if enabled), and Amazon EventBridge, allowing for integration with existing security event management or workflow systems.
To bolster vulnerability management processes, Amazon Inspector now supports AWS Lambda functions, adding automated vulnerability assessments for serverless compute workloads. With this expanded capability, Amazon Inspector automatically discovers eligible Lambda functions and identifies software vulnerabilities in application package dependencies used in the Lambda function code. Actionable security findings are aggregated in the Amazon Inspector console, and pushed to Security Hub and EventBridge to automate workflows.
Data protection and privacy
The first step to protecting data is to find it. Amazon Macie now automatically discovers sensitive data, providing continual, cost-effective, organization-wide visibility into where sensitive data resides across your Amazon Simple Storage Service (Amazon S3) estate. With this new capability, Macie automatically and intelligently samples and analyzes objects across your S3 buckets, inspecting them for sensitive data such as personally identifiable information (PII), financial data, and AWS credentials. Macie then builds and maintains an interactive data map of your sensitive data in S3 across your accounts and Regions, and provides a sensitivity score for each bucket. This helps you identify and remediate data security risks without manual configuration and reduce monitoring and remediation costs.
Encryption is a critical tool for protecting data and building customer trust. The launch of the end-to-end encrypted enterprise communication service AWS Wickr offers advanced security and administrative controls that can help you protect sensitive messages and files from unauthorized access, while working to meet data retention requirements.
Management and governance
Maintaining compliance with regulatory, security, and operational best practices as you provision cloud resources is key. AWS Config rules, which evaluate the configuration of your resources, have now been extended to support proactive mode, so that they can be incorporated into infrastructure-as-code continuous integration and continuous delivery (CI/CD) pipelines to help identify noncompliant resources prior to provisioning. This can significantly reduce time spent on remediation.
Managing the controls needed to meet your security objectives and comply with frameworks and standards can be challenging. To make it simpler, we launched comprehensive controls management with AWS Control Tower. You can use it to apply managed preventative, detective, and proactive controls to accounts and organizational units (OUs) by service, control objective, or compliance framework. You can also use AWS Control Tower to turn on Security Hub detective controls across accounts in an OU. This new set of features reduces the time that it takes to define and manage the controls required to meet specific objectives, such as supporting the principle of least privilege, restricting network access, and enforcing data encryption.
Do more with less
As we work through macroeconomic conditions, security leaders are facing increased budgetary pressures. In his opening keynote, AWS CEO Adam Selipsky emphasized the effects of the pandemic, inflation, supply chain disruption, energy prices, and geopolitical events that continue to impact organizations.
Now more than ever, it is important to maintain your security posture despite resource constraints. Citing specific customer examples, Selipsky underscored how the AWS Cloud can help organizations move faster and more securely. By moving to the cloud, agricultural machinery manufacturer Agco reduced costs by 78% while increasing data retrieval speed, and multinational HVAC provider Carrier Global experienced a 40% reduction in the cost of running mission-critical ERP systems.
“If you’re looking to tighten your belt, the cloud is the place to do it.” – Adam Selipsky, AWS CEO
Security teams can do more with less by maximizing the value of existing controls, and bolstering security monitoring and analytics capabilities. Services and features announced during AWS re:Invent—including Amazon Security Lake, sensitive data discovery with Amazon Macie, support for Lambda functions in Amazon Inspector, Amazon GuardDuty RDS Protection, and more—can help you get more out of the cloud and address evolving challenges, no matter the economic climate.
Security is our top priority
AWS re:Invent featured many more highlights on a variety of topics, such as Amazon EventBridge Pipes and the pre-announcement of GuardDuty EKS Runtime protection, as well as Amazon CTO Dr. Werner Vogels’ keynote, and the security partnerships showcased on the Expo floor. It was a whirlwind week, but one thing is clear: AWS is working harder than ever to make our services better and to collaborate on solutions that ease the path to proactive security, so that you can focus on what matters most—your business.
This blog post is written by Solutions Architects John Lee and Jeetendra Vaidya.
AWS Lambda now provides a way to control the maximum number of concurrent functions invoked by Amazon SQS as an event source. You can use this feature to control the concurrency of Lambda functions processing messages in individual SQS queues.
This post describes how to set the maximum concurrency of SQS triggers when using SQS as an event source with Lambda. It also provides an overview of the scaling behavior of Lambda using this architectural pattern, challenges this feature helps address, and a demo of the maximum concurrency feature.
Overview
Lambda uses an event source mapping to process items from a stream or queue. The event source mapping reads from an event source, such as an SQS queue, optionally filters the messages, batches them, and invokes the mapped Lambda function.
The scaling behavior for Lambda integration with SQS FIFO queues is simple. A single Lambda function processes batches of messages within a single message group to ensure that messages are processed in order.
For SQS standard queues, the event source mapping polls the queue to consume incoming messages, starting at five concurrent batches with five functions at a time. As messages are added to the SQS queue, Lambda continues to scale out to meet demand, adding up to 60 functions per minute, up to 1,000 functions, to consume those messages. To learn more about Lambda scaling behavior, read ”Understanding how AWS Lambda scales with Amazon SQS standard queues.”
Lambda processing standard SQS queues
Challenges
When a large number of messages are in the SQS queue, Lambda scales out, adding additional functions to process the messages. The scale out can consume the concurrency quota in the account. To prevent this from happening, you can set reserved concurrency for individual Lambda functions. This ensures that the specified Lambda function can always scale to that much concurrency, but it also cannot exceed this number.
When the Lambda function concurrency reaches the reserved concurrency limit, the queue configuration specifies the subsequent behavior. The message is returned to the queue and retried based on the redrive policy, expired based on its retention policy, or sent to another SQS dead-letter queue (DLQ). While sending unprocessed messages to a DLQ is a good option to preserve messages, it requires a separate mechanism to inspect and process messages from the DLQ.
The following example shows a Lambda function reaching its reserved concurrency quota of 10.
Lambda reaching reserved concurrency of 10.
Maximum Lambda concurrency with SQS as an event source
The launch of maximum concurrency for SQS as an event source allows you to control Lambda function concurrency per source. You set the maximum concurrency on the event source mapping, not on the Lambda function.
This event source mapping setting does not change the scaling or batching behavior of Lambda with SQS. You can continue to batch messages with a customized batch size and window. Rather, it sets a limit on the maximum number of concurrent function invocations per SQS event source. Once Lambda scales and reaches the maximum concurrency configured on the event source, Lambda stops reading more messages from the queue. This feature also provides you with the flexibility to define the maximum concurrency for individual event sources when the Lambda function has multiple event sources.
Maximum concurrency is set to 10 for the SQS queue.
This feature can help prevent a Lambda function from consuming all available Lambda concurrency of the account and avoids messages returning to the queue unnecessarily because of Lambda functions being throttled. It provides an easier way to control and consume messages at a desired pace, controlled by the maximum number of concurrent Lambda functions.
The maximum concurrency setting does not replace the existing reserved concurrency feature. Both serve distinct purposes and the two features can be used together. Maximum concurrency can help prevent overwhelming downstream systems and unnecessary throttled invocations. Reserved concurrency guarantees a maximum number of concurrent instances for the function.
When used together, the Lambda function can have its own allocated capacity (reserved concurrency), while being able to control the throughput for each event source (maximum concurrency). When using the two features together, you must set the function reserved concurrency higher than the maximum concurrency on the SQS event source mapping to prevent throttling.
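As a minimal sketch, assuming an existing function and SQS event source mapping (the function name and mapping UUID are placeholders), the two settings can be combined with Boto3, with reserved concurrency set higher than the event source's maximum concurrency:
import boto3

lambda_client = boto3.client("lambda")

# Reserve capacity for the function itself
lambda_client.put_function_concurrency(
    FunctionName="my-function",          # placeholder function name
    ReservedConcurrentExecutions=10,
)

# Cap concurrent invocations driven by this SQS event source
lambda_client.update_event_source_mapping(
    UUID="11111111-2222-3333-4444-555555555555",  # placeholder event source mapping UUID
    ScalingConfig={"MaximumConcurrency": 5},
)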
Setting maximum concurrency for SQS as an event source
The following demo compares how Lambda receives and processes messages when using maximum concurrency versus reserved concurrency.
This GitHub repository contains an AWS SAM template that deploys the following resources:
ReservedConcurrencyQueue (SQS queue)
ReservedConcurrencyDeadLetterQueue (SQS queue)
ReservedConcurrencyFunction (Lambda function)
MaxConcurrencyQueue (SQS queue)
MaxConcurrencyDeadLetterQueue (SQS queue)
MaxConcurrencyFunction (Lambda function)
CloudWatchDashboard (CloudWatch dashboard)
The AWS SAM template provisions two sets of identical architectures and an Amazon CloudWatch dashboard to monitor the resources. Each architecture comprises a Lambda function receiving messages from an SQS queue, and a DLQ for the SQS queue.
The maxReceiveCount is set as 1 for the SQS queues, which sends any returned messages directly to the DLQ. The ReservedConcurrencyFunction has its reserved concurrency set to 5, and the MaxConcurrencyFunction has the maximum concurrency for the SQS event source set to 5.
Pre-requisites
Running this demo requires the AWS CLI and the AWS SAM CLI. After installing both CLIs, clone this GitHub repository and navigate to the root of the directory:
git clone https://github.com/aws-samples/aws-lambda-amazon-sqs-max-concurrency
cd aws-lambda-amazon-sqs-max-concurrency
Deploying the AWS SAM template
Build the AWS SAM template with the build command to prepare for deployment to your AWS environment.
sam build
Use the guided deploy command to deploy the resources in your account.
sam deploy --guided
Give the stack a name and accept the remaining default values. Once deployed, you can track the progress through the CLI or by navigating to the AWS CloudFormation page in the AWS Management Console.
Note the queue URLs from the Outputs tab in the AWS SAM CLI or the CloudFormation console, or navigate to the SQS console to find them.
The Outputs tab of the launched AWS SAM template provides URLs to CloudWatch dashboard and SQS queues.
Running the demo
The deployed Lambda function code simulates processing by sleeping for 10 seconds before returning a 200 response. This allows the function to reach a high function concurrency number with only a small number of messages.
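For reference, a handler that behaves this way can be as simple as the following Python sketch; it mirrors the described behavior rather than reproducing the repository's exact code.
import time

def lambda_handler(event, context):
    # Simulate 10 seconds of downstream processing so concurrency builds up quickly
    time.sleep(10)
    return {"statusCode": 200}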
To add 25 messages to the Reserved Concurrency queue, run the following commands. Replace <ReservedConcurrencyQueueURL> with your queue URL from the AWS SAM Outputs.
for i in {1..25}; do aws sqs send-message --queue-url <ReservedConcurrencyQueueURL> --message-body testing; done
To add 25 messages to the Maximum Concurrency queue, run the following commands. Replace <MaxConcurrencyQueueURL> with your queue URL from the AWS SAM Outputs.
for i in {1..25}; do aws sqs send-message --queue-url <MaxConcurrencyQueueURL> --message-body testing; done
After sending messages to both queues, navigate to the dashboard URL available in the Outputs tab to view the CloudWatch dashboard.
Validating results
Both Lambda functions have the same number of invocations, and their concurrent invocations are fixed at 5. The CloudWatch dashboard shows that the ReservedConcurrencyFunction experienced throttling and that 9 messages were sent to the corresponding DLQ, as seen in the top-right metric. The MaxConcurrencyFunction did not experience any throttling, and no messages were delivered to the DLQ.
CloudWatch dashboard showing throttling and DLQs.
Clean up
To remove all the resources created in this demo, use the delete command and follow the prompts:
sam delete
Conclusion
You can now control the maximum number of concurrent functions invoked by SQS as a Lambda event source. This post explains the scaling behavior of Lambda with SQS, describes the challenges this feature helps address, and shows a demo of maximum concurrency in action.
There are no additional charges to use this feature besides the standard SQS and Lambda charges. You can start using maximum concurrency on new or existing SQS event source mappings. This feature is available in all Regions where Lambda and SQS are available.
For more serverless learning resources, visit Serverless Land.
This post is written by Julian Bonilla, Senior Solutions Architect, and Matthew de Anda, Startup Solutions Architect.
React is a popular JavaScript library used to create single-page applications (SPAs). React focuses on helping to build UIs, but leaves it up to developers to decide how to accomplish other aspects involved with developing a SPA.
This post demonstrates how to build a Next.js application with Serverless services on AWS and explains Next.js Server-Side rendering. To deploy this solution and to provision the AWS resources, you can use either AWS Serverless Application Model (AWS SAM) or AWS Cloud Development Kit (CDK). Both are open-source frameworks to automate AWS deployment. AWS SAM is a declarative framework for building serverless applications and CDK is an imperative framework to define cloud application resources using familiar programming languages.
Overview
To render a Next.js application, you use Amazon S3, Amazon CloudFront, Amazon API Gateway, and AWS Lambda. Static resources are hosted in a private S3 bucket behind a CloudFront distribution. Since static resources are generated at build time, CloudFront lets browsers load these files from a cache at the network edge instead of from the server.
The Next.js application components that use server-side rendering are rendered by a Lambda function using the AWS Lambda Web Adapter. The CloudFront distribution is configured to forward requests to the API Gateway endpoint, which then calls the Lambda function.
Static files (for example, CSS, JavaScript, and HTML) are mapped to /_next/static/* and /public/*.
Server-side rendering is mapped with default behavior (*).
Next.js is a React framework that creates a more opinionated approach to building web applications while providing additional structure and features such as Server-Side Rendering and Static Site Generation.
These additional rendering options provide more flexibility over the typical ways a React application is built, which is to render in the client’s browser with JavaScript. This can help in scenarios where customers have JavaScript disabled and can improve search engine optimization (SEO). While you can implement SSR in React applications, Next.js makes it simpler for developers.
Next.js rendering strategies
These are the different rendering strategies offered by Next.js:
Static Site Generation generates static resources at build time and is a good rendering strategy for static content that rarely changes and SEO.
Server-Side Rendering generates each page on-demand at request time and is good for pages that are dynamic. Since from the browser perspective it’s still pre-rendered, like Static Site Generation, it’s also good for SEO.
Incremental Static Regeneration is a new rendering strategy that is good for apps with many pages where build times are high. With Incremental Static Regeneration, you can build page per-page without needing to rebuild the entire app.
Client-Side Rendering is the typical rendering strategy where the application is rendered in the browser with JavaScript.
Next.js lets you choose the appropriate rendering method page-by-page. When a Next.js application is built, Next.js transforms the application to production-optimized files. You have HTML for statically generated pages, JavaScript for rendering on the server, JavaScript for rendering on the client, and CSS files.
Next.js also supports static HTML export, which has no server side component. Features that require a server are not supported with this approach. These apps can be hosted from S3 and CloudFront.
The remainder of this post focuses on Static Site Generation and Server-Side Rendering.
Next.js application project structure
Understanding how Next.js structures projects can give insight into how you deploy the application. A page is a React component exported from a file in the "pages" directory. These files are also used for routing, where pages/index.js is routed to the / route.
By default, these pages are pre-rendered. Static assets, such as images, are stored under the "public" directory and can be referenced from /. Since these files are best stored in persistent storage and backed by a content delivery network (CDN), you can add a prefix in the implementation to distinguish these static files.
To create dynamic routes, add brackets to a page file – for example, pages/user/[id].js. This creates a statically generated page with the path /user/<id> where <id> can be dynamic.
API routes provide a way to create an API endpoint and are located in the pages/api directory. When building, Next.js generates an optimized version of your application under the .next directory. Static files not stored in the public directory are in .next/static. These static files are expected to be uploaded as _next/static to a CDN.
No other code in the .next directory should be uploaded to a CDN because that would expose server code and other configuration.
Implementation
Next.js pre-renders HTML for every page using Static Site Generation or Server-side Rendering. For Static Site Generation, pages are rendered at build time and can be cached in CloudFront. Server-side rendered pages are rendered at request time, and typically fetch data from downstream resources on each request.
Clients connect to a CloudFront distribution, which is configured to forward requests for static resources to S3 and all other requests to API Gateway. API Gateway forwards requests to the Next.js application running on Lambda, which performs the server-side rendering.
At build time, Next.js Output File Tracing determines the minimal set of files needed for deploying to Lambda. The files are automatically copied to a standalone directory; this feature must be enabled in next.config.js.
Since the Next.js application is essentially a webserver, this example uses the AWS Lambda Web Adapter as a Lambda layer to convert incoming events from API Gateway to HTTP requests that Next.js can process.
Once processed, the AWS Lambda Web Adapter converts the HTTP response back to a Lambda event response. The Lambda handler is configured to run the minimal server.js file created by the standalone build step.
The CloudFront distribution has two origins: one for the S3 bucket and another for the API Gateway. Two behaviors are created to specify path patterns to route static content produced by Next.js and static resources stored under the public/static directory. Next.js uses the public directory under root to serve static assets such as images.
These assets are then served under / so if you add public/me.png, it would be served at /me.png. This makes it harder to create a CloudFront behavior for these assets. One workaround is to create a static directory under the public directory and then map it to the CloudFront behavior. The default(*) path pattern behavior has the origin set to API Gateway with caching disabled.
Prerequisites and deployment
Refer to the project in its GitHub repository for instructions to deploy the solution using AWS SAM or AWS CDK. Multiple resources are provisioned for you as part of the deployment, and it takes several minutes to complete. The example Next.js application deployed is created using Create Next App.
Understanding the Next.js Application
To create a new page, you create a file under the pages directory and that creates a route based on the name (e.g. pages/hello.js creates route /hello). To create dynamic routes, create a file following the project’s example of pages/posts/[id].js to produce routes for posts/1, posts/2, and so forth.
For API routes, any file added to the directory pages/api is mapped to /api/* and becomes an API endpoint. These are server-side only bundles hosted by API Gateway and Lambda.
Conclusion
This blog shows how to run Next.js applications using S3, CloudFront, API Gateway, and Lambda. This architecture supports building Next.js applications that can use static-site generation, server-side rendering, and client-side rendering. The blog also covers how you can use open-source frameworks, AWS SAM and CDK, to build and deploy your Next.js applications.
Genomics workflows analyze data at petabyte scale. After processing is complete, data is often archived in cold storage classes. In some cases, like studies on the association of DNA variants against larger datasets, archived data is needed for further processing. This means manually initiating the restoration of each archived object and monitoring the progress. Scientists require a reliable process for on-demand archival data restoration so their workflows do not fail.
In Part 4 of this series, we look into genomics workloads processing data that is archived with Amazon Simple Storage Service (Amazon S3). We design a reliable data restoration process that informs the workflow when data is available so it can proceed. We build on top of the design pattern laid out in Parts 1-3 of this series. We use event-driven and serverless principles to provide the most cost-effective solution.
The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes provide further cost savings, with retrieval times ranging from minutes to hours. We focus on the latter in order to provide the most cost-effective solution.
You must first restore the objects before accessing them. Our genomics workflow will pause until the data restore completes. The requirements for this workflow are:
Reliable launch of the restore so our workflow doesn’t fail (due to S3 Glacier service quotas, or because not all objects were restored)
Event-driven design to mirror the event-driven nature of genomics workflows and perform the retrieval upon request
Cost-effective and easy-to-manage by using serverless services
Upfront detection of archived data when formulating the genomics workflow task, avoiding idle computational tasks that incur cost
Scalable and elastic to meet the restore needs of large, archived datasets
Solution overview
Genomics workflows take multiple input parameters to prepare the initiation, such as launch ID, data path, workflow endpoint, and workflow steps. We store this data, including workflow configurations, in an S3 bucket. An AWS Fargate task reads from the S3 bucket and prepares the workflow. It detects if the input parameters include S3 Glacier URLs.
We use Amazon Simple Queue Service (Amazon SQS) to decouple S3 Glacier index creation from object restore actions (Figure 1). This increases the reliability of our process.
Figure 1. Solution architecture for S3 Glacier object restore
An AWS Lambda function creates the index of all objects in the specified S3 bucket URLs and submits them as an SQS message.
Another Lambda function polls the SQS queue and submits the request(s) to restore the S3 Glacier objects to S3 Standard storage class.
The function writes the job ID of each S3 Glacier restore request to Amazon DynamoDB. After the restore is complete, Lambda sets the status of the workflow to READY. Only then can any computing jobs start, such as with AWS Batch.
Implementation considerations
We consider the use case of Snakemake with Tibanna, which we detailed in Part 2 of this series. This allows us to dive deeper into launch details.
Snakemake is an open-source utility for whole-genome-sequence mapping in directed acyclic graph format. Snakemake uses Snakefiles to declare workflow steps and commands. Tibanna is an open-source, AWS-native software that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).
We recommend using Amazon Genomics CLI if Tibanna is not needed for your use case, or Amazon Omics if your workflow definitions are compliant with the supported WDL and Nextflow specifications.
Formulate the restore request
The Snakemake Fargate launch container detects if the S3 objects under the requested S3 bucket URLs are stored in S3 Glacier. The Fargate launch container generates and puts a JSON binary base call (BCL) configuration file into an S3 bucket and exits successfully. This file includes the launch ID of the workflow, corresponding with the DynamoDB item key, plus the S3 URLs to restore.
Query the S3 URLs
Once the JSON BCL configuration file lands in this S3 bucket, an S3 Event Notification for the PutObject event invokes a Lambda function. This function parses the configuration file and recursively queries for all S3 object URLs to restore.
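As an illustration of that query step, the following hedged boto3 sketch lists every object under an S3 prefix so that each URL can later be queued for restore; the bucket and prefix are assumptions.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

urls = []
for page in paginator.paginate(Bucket="my-genomics-archive", Prefix="cohort-123/"):  # assumptions
    for obj in page.get("Contents", []):
        urls.append(f"s3://my-genomics-archive/{obj['Key']}")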
Initiate the restore
The main Lambda function then submits messages to the SQS queue that contains the full list of S3 URLs that need to be restored. SQS messages also include the launch ID of the workflow. This is to ensure we can bind specific restoration jobs to specific workflow launches. If all S3 Glacier objects belong to Flexible Retrieval storage class, the Lambda function puts the URLs in a single SQS message, enabling restoration with Bulk Glacier Job Tier. The Lambda function also sets the status of the workflow to WAITING in the corresponding DynamoDB item. The WAITING state is used to notify the end user that the job is waiting on the data-restoration process and will continue once the data restoration is complete.
A secondary Lambda function polls for new messages landing in the SQS queue. This Lambda function submits the restoration request(s)—for example, as a free-of-charge Bulk retrieval—using the RestoreObject API. The function subsequently writes the S3 Glacier Job ID of each request in our DynamoDB table. This allows the main Lambda function to check if all Job IDs associated with a workflow launch ID are complete.
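The restore call itself can be expressed with the S3 RestoreObject API. The following is a minimal boto3 sketch under assumed bucket, key, and retention values, not the function's actual code.
import boto3

s3 = boto3.client("s3")

s3.restore_object(
    Bucket="my-genomics-archive",   # assumption
    Key="cohort-123/sample.bam",    # assumption
    RestoreRequest={
        "Days": 7,                                 # how long the restored copy remains available
        "GlacierJobParameters": {"Tier": "Bulk"},  # bulk retrieval tier
    },
)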
Update status
The status of our workflow launch will remain WAITING as long as the Glacier object restore is incomplete. The AWS CloudTrail logs of completed S3 Glacier Job IDs invoke our main Lambda function (via an Amazon EventBridge rule) to update the status of the restoration job in our DynamoDB table. With each invocation, the function checks if all Job IDs associated with a workflow launch ID are complete.
After all objects have been restored, the function updates the workflow launch status to READY. This launches the workflow with the same launch ID as before the restore.
Conclusion
In this blog post, we demonstrated how life-science research teams can make use of their archival data for genomic studies. We designed an event-driven S3 Glacier restore process, which retrieves data upon request. We discussed how to reliably launch the restore so our workflow doesn’t fail. Also, we determined upfront if an S3 Glacier restore is needed and used the WAITING state to prevent our workflow from failing.
With this solution, life-science research teams can save money using Amazon S3 Glacier without worrying about their day-to-day work or manually administering S3 Glacier object restores.
Welcome to the 20th edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed! In case you missed our last ICYMI, check out what happened last quarter here.
AWS Lambda
For developers using Java, AWS Lambda has introduced Lambda SnapStart. SnapStart is a new capability that can improve the start-up performance of functions using the Corretto (Java 11) runtime by up to 10 times, at no extra cost.
To use this capability, you must enable it in your function and then publish a new version. This triggers the optimization process. This process initializes the function, takes an immutable, encrypted snapshot of the memory and disk state, and caches it for reuse. When the function is invoked, the state is retrieved from the cache in chunks, on an as-needed basis, and it is used to populate the execution environment.
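If you prefer to script this, the following hedged boto3 sketch enables SnapStart and publishes a version; the function name is an assumption.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="my-java-function",             # assumption: a Corretto (Java 11) function
    SnapStart={"ApplyOn": "PublishedVersions"},  # snapshot when a new version is published
)

# Wait for the configuration update to finish, then publish a version to trigger the snapshot
lambda_client.get_waiter("function_updated_v2").wait(FunctionName="my-java-function")
lambda_client.publish_version(FunctionName="my-java-function")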
Also, Amazon Inspector now supports Lambda functions. You can enable Amazon Inspector to scan your functions continually for known vulnerabilities. The log4j vulnerability shows how important it is to scan your code for vulnerabilities continuously, not only after deployment. Vulnerabilities can be discovered at any time, and with Amazon Inspector, your functions and layers are rescanned whenever a new vulnerability is published.
During AWS re:Invent this year, we announced Step Functions Distributed Map. If you need to process many files, or items inside CSV or JSON files, this new flow can help you. The new distributed map flow orchestrates large-scale parallel workloads.
Amazon EventBridge launched two major features: Scheduler and Pipes. Amazon EventBridge Scheduler allows you to create, run, and manage scheduled tasks at scale. You can schedule one-time or recurring tasks across 270 services and over 6,000 APIs.
AWS Application Composer is a new visual designer that you can use to build serverless applications using multiple AWS services. This is ideal if you want to build a prototype, review architectures with others, generate diagrams for your projects, or onboard new team members to a project.
Within a simple user interface, you can drag and drop the different AWS resources and configure them visually. You can use AWS Application Composer together with AWS SAM Accelerate to build and test your applications in the AWS Cloud.
AWS Serverless digital learning badges
The new AWS Serverless digital learning badges let you show your AWS Serverless knowledge and skills. This is a verifiable digital badge that is aligned with the AWS Serverless Learning Plan.
This badge proves your knowledge and skills for Lambda, Amazon API Gateway, and designing serverless applications. To earn this badge, you must score at least 80 percent on the assessment associated with the Learning Plan. Visit this link if you are ready to get started learning or just jump directly to the assessment.
AWS re:Invent was held in Las Vegas from November 28 to December 2, 2022. Werner Vogels, Amazon’s CTO, highlighted event-driven applications during his keynote. He stated that the world is asynchronous and showed how strange a synchronous world would be. During the keynote, he showcased Serverlesspresso as an example of an event-driven application. The Serverless DA team presented many breakouts, workshops, and chalk talks. Rewatch all our breakout content:
In addition, we brought Serverlesspresso back to Vegas. Serverlesspresso is a contactless, serverless order management system for a physical coffee bar. The architecture comprises several serverless apps that support an ordering process from a customer’s smartphone to a real espresso bar. The customer can check the virtual line, place an order, and receive a notification when their drink is ready for pickup.
Weekly live virtual office hours: In each session, we talk about a specific topic or technology related to serverless and open it up to helping with your real serverless challenges and issues. Ask us anything about serverless technologies and applications.
The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. If you want to learn more about event-driven architectures, read our new guide that will help you get started.
This post describes how to use the AWS CDK Pipelines module to follow a Gitflow development model using AWS Cloud Development Kit (AWS CDK). Software development teams often follow a strict branching strategy during a solutions development lifecycle. Newly-created branches commonly need their own isolated copy of infrastructure resources to develop new features.
CDK Pipelines is a construct library module for continuous delivery of AWS CDK applications. CDK Pipelines are self-updating: if you add application stages or stacks, then the pipeline automatically reconfigures itself to deploy those new stages and/or stacks.
The following solution creates a new AWS CDK Pipeline within a development account for every new branch created in the source repository (AWS CodeCommit). When a branch is deleted, the pipeline and all related resources are also destroyed from the account. This GitFlow model for infrastructure provisioning allows developers to work independently from each other, concurrently, even in the same stack of the application.
Solution overview
The following diagram provides an overview of the solution. There is one default pipeline responsible for deploying resources to the different application environments (e.g., Development, Pre-Prod, and Prod). The code is stored in CodeCommit. When new changes are pushed to the default CodeCommit repository branch, AWS CodePipeline runs the default pipeline. When the default pipeline is deployed, it creates two AWS Lambda functions.
These two Lambda functions are invoked by CodeCommit CloudWatch events when a new branch in the repository is created or deleted. The Create Lambda function uses the boto3 CodeBuild module to create an AWS CodeBuild project that builds the pipeline for the feature branch. This feature pipeline consists of a build stage and an optional update pipeline stage for itself. The Destroy Lambda function creates another CodeBuild project which cleans all of the feature branch’s resources and the feature pipeline.
Figure 1. Architecture diagram.
Prerequisites
Before beginning this walkthrough, you should have the following prerequisites:
# Command to clone the repository
git clone https://github.com/aws-samples/multi-branch-cdk-pipelines.git
cd multi-branch-cdk-pipelines
Create a new CodeCommit repository in the AWS Account and region where you want to deploy the pipeline and upload the source code from above to this repository. In the config.ini file, change the repository_name and region variables accordingly.
Make sure that you set up a fresh Python environment. Install the dependencies:
pip install -r requirements.txt
Run the initial-deploy.sh script to bootstrap the development and production environments and to deploy the default pipeline. You’ll be asked to provide the following parameters: (1) Development account ID, (2) Development account AWS profile name, (3) Production account ID, and (4) Production account AWS profile name.
sh ./initial-deploy.sh --dev_account_id <YOUR DEV ACCOUNT ID> \
  --dev_profile_name <YOUR DEV PROFILE NAME> \
  --prod_account_id <YOUR PRODUCTION ACCOUNT ID> \
  --prod_profile_name <YOUR PRODUCTION PROFILE NAME>
Default pipeline
In the CI/CD pipeline, we set up an if condition to deploy the default branch resources only if the current branch is the default one. The default branch is retrieved programmatically from the CodeCommit repository. We deploy an Amazon Simple Storage Service (Amazon S3) Bucket and two Lambda functions. The bucket is responsible for storing the feature branches’ CodeBuild artifacts. The first Lambda function is triggered when a new branch is created in CodeCommit. The second one is triggered when a branch is deleted.
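For illustration, the default branch can be retrieved programmatically with boto3, as in the hedged sketch below; the repository name and the BRANCH environment variable are assumptions, not the project's actual code.
import os

import boto3

codecommit = boto3.client("codecommit")

# Assumption: the current branch name is passed to the pipeline, for example via an env var
current_branch = os.environ.get("BRANCH", "main")

default_branch = codecommit.get_repository(
    repositoryName="multi-branch-cdk-pipelines"  # assumption: value from config.ini
)["repositoryMetadata"]["defaultBranch"]

if current_branch == default_branch:
    # Deploy the shared resources: the artifact S3 bucket and the two branch-lifecycle Lambda functions
    print("Deploying default branch resources")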
Then, the CodeCommit repository is configured to trigger these Lambda functions based on two events:
(1) Reference created
# Configure AWS CodeCommit to trigger the Lambda function when a new branch is created
repo.on_reference_created(
    'BranchCreateTrigger',
    description="AWS CodeCommit reference created event.",
    target=aws_events_targets.LambdaFunction(create_branch_func))
(2) Reference deleted
# Configure AWS CodeCommit to trigger the Lambda function when a branch is deleted
repo.on_reference_deleted(
    'BranchDeleteTrigger',
    description="AWS CodeCommit reference deleted event.",
    target=aws_events_targets.LambdaFunction(destroy_branch_func))
Lambda functions
The two Lambda functions build and destroy application environments mapped to each feature branch. An Amazon CloudWatch event triggers the LambdaTriggerCreateBranch function whenever a new branch is created. The CodeBuild client from boto3 creates the build phase and deploys the feature pipeline.
Create function
The create function deploys a feature pipeline which consists of a build stage and an optional update pipeline stage for itself. The pipeline downloads the feature branch code from the CodeCommit repository, initiates the Build and Test action using CodeBuild, and securely saves the built artifact on the S3 bucket.
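As a rough sketch of what such a function might do with the boto3 CodeBuild client, the following creates and starts a build project for a feature branch; the project name, repository URL, buildspec, image, and service role are all assumptions rather than the repository's actual values.
import boto3

codebuild = boto3.client("codebuild")

branch = "user-feature-123"  # assumption: in practice taken from the CodeCommit event

codebuild.create_project(
    name=f"feature-pipeline-{branch}",
    source={
        "type": "CODECOMMIT",
        "location": "https://git-codecommit.us-east-1.amazonaws.com/v1/repos/multi-branch-cdk-pipelines",
        "buildspec": "feature-pipeline-buildspec.yml",  # assumption
    },
    sourceVersion=f"refs/heads/{branch}",
    artifacts={"type": "NO_ARTIFACTS"},
    environment={
        "type": "LINUX_CONTAINER",
        "image": "aws/codebuild/standard:6.0",
        "computeType": "BUILD_GENERAL1_SMALL",
    },
    serviceRole="arn:aws:iam::123456789012:role/feature-codebuild-role",  # assumption
)

# Start a build that synthesizes and deploys the feature pipeline
codebuild.start_build(projectName=f"feature-pipeline-{branch}")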
The second Lambda function is responsible for the destruction of a feature branch’s resources. Upon the deletion of a feature branch, an Amazon CloudWatch event triggers this Lambda function. The function creates a CodeBuild Project which destroys the feature pipeline and all of the associated resources created by that pipeline. The source property of the CodeBuild Project is the feature branch’s source code saved as an artifact in Amazon S3.
On your machine’s local copy of the repository, create a new feature branch using the following git commands. Replace user-feature-123 with a unique name for your feature branch. Note that this feature branch name must comply with the CodePipeline naming restrictions, as it will be used to name a unique pipeline later in this walkthrough.
The first Lambda function will deploy the CodeBuild project, which then deploys the feature pipeline. This can take a few minutes. You can log in to the AWS Console and see the CodeBuild project running under CodeBuild.
Figure 2. AWS Console – CodeBuild projects.
After the build is successfully finished, you can see the deployed feature pipeline under CodePipelines.
Figure 3. AWS Console – CodePipeline pipelines.
The Lambda S3 trigger project from AWS CDK Samples is used as the infrastructure resources to demonstrate this solution. The content is placed inside the src directory and is deployed by the pipeline. When visiting the Lambda console page, you can see two functions: one by the default pipeline and one by our feature pipeline.
Figure 4. AWS Console – Lambda functions.
Destroy a feature branch
There are two common ways for removing feature branches. The first one is related to a pull request, also known as a “PR”. This occurs when merging a feature branch back into the default branch. Once it’s merged, the feature branch will be automatically closed. The second way is to delete the feature branch explicitly by running the following git commands:
The CodeBuild project responsible for destroying the feature resources is now triggered. You can see the project’s logs while the resources are being destroyed in CodeBuild, under Build history.
Figure 5. AWS Console – CodeBuild projects.
Cleaning up
To avoid incurring future charges, log in to the AWS console of the different accounts you used, go to the AWS CloudFormation console of the Region(s) where you chose to deploy, and delete the main and branch stacks.
Conclusion
This post showed how you can work with an event-driven strategy and AWS CDK to implement a multi-branch pipeline flow using AWS CDK Pipelines. The described solutions leverage Lambda and CodeBuild to provide a dynamic orchestration of resources for multiple branches and pipelines. For more information on CDK Pipelines and all the ways it can be used, see the CDK Pipelines reference documentation.
This blog post is written by Jack Chen, Telco Solutions Architect, and Robert Belson, Developer Advocate.
AWS Wavelength embeds AWS compute and storage services within 5G networks, providing mobile edge computing infrastructure for developing, deploying, and scaling ultra-low-latency applications. AWS recently introduced support for Application Load Balancer (ALB) in AWS Wavelength zones. Although ALB addresses Layer-7 load balancing use cases, some low latency applications that get deployed in AWS Wavelength Zones rely on UDP-based protocols, such as QUIC, WebRTC, and SRT, which can’t be load-balanced by Layer-7 Load Balancers. In this post, we’ll review popular load-balancing patterns on AWS Wavelength, including a proposed architecture demonstrating how DNS-based load balancing can address customer requirements for load-balancing non-HTTP(s) traffic across multiple Amazon Elastic Compute Cloud (Amazon EC2) instances. This solution also builds a foundation for automatic scale-up and scale-down capabilities for workloads running in an AWS Wavelength Zone.
Load balancing use cases in AWS Wavelength
In the AWS Regions, customers looking to deploy highly-available edge applications often consider Amazon Elastic Load Balancing (Amazon ELB) as an approach to automatically distribute incoming application traffic across multiple targets in one or more Availability Zones (AZs). However, at the time of this publication, AWS-managed Network Load Balancer (NLB) isn’t supported in AWS Wavelength Zones and ALB is being rolled out to all AWS Wavelength Zones globally. As a result, this post will seek to document general architectural guidance for load balancing solutions on AWS Wavelength.
As one of the most prominent AWS Wavelength use cases, highly-immersive video streaming over UDP using protocols such as WebRTC at scale often requires a load balancing solution to accommodate surges in traffic, either due to live events or general customer access patterns. These use cases, relying on Layer-4 traffic, can’t be load-balanced by a Layer-7 ALB. Instead, Layer-4 load balancing is needed.
To date, two infrastructure deployments involving Layer-4 load balancers are most often seen:
Amazon EC2-based deployments: Often the environment of choice for earlier-stage enterprises and ISVs, a fleet of EC2 instances will leverage a load balancer for high-throughput use cases, such as video streaming, data analytics, or Industrial IoT (IIoT) applications
Amazon EKS deployments: Customers looking to optimize performance and cost efficiency of their infrastructure can leverage containerized deployments at the edge to manage their AWS Wavelength Zone applications. In turn, external load balancers could be configured to point to exposed services via NodePort objects. Furthermore, a more popular choice might be to leverage the AWS Load Balancer Controller to provision an ALB when you create a Kubernetes Ingress.
Regardless of deployment type, the following design constraints must be considered:
Target registration: For load balancing solutions not managed by AWS, seamless solutions to load balancer target registration must be managed by the customer. As one potential solution, visit a recent HAProxyConf presentation, Practical Advice for Load Balancing at the Network Edge.
Edge Discovery: Although DNS records can be populated into Amazon Route 53 for each carrier-facing endpoint, DNS won’t deterministically route mobile clients to the most optimal mobile endpoint. When available, edge discovery services are required to most effectively route mobile clients to the lowest latency endpoint.
Cross-zone load balancing: Given the hub-and-spoke design of AWS Wavelength, customer-managed load balancers should proxy traffic only to that AWS Wavelength Zone.
Solution overview – Amazon EC2
In this post, we present a highly-available load balancing approach in a single AWS Wavelength Zone for an Amazon EC2-based deployment. In a separate post, we’ll cover the needed configurations for the AWS Load Balancer Controller in AWS Wavelength for Amazon Elastic Kubernetes Service (Amazon EKS) clusters.
The proposed solution introduces DNS-based load balancing, a technique to abstract away the complexity of intelligent load-balancing software and allow your Domain Name System (DNS) resolvers to distribute traffic (equally, or in a weighted distribution) to your set of endpoints.
Our solution leverages the weighted routing policy in Route 53 to resolve inbound DNS queries to multiple EC2 instances running within an AWS Wavelength zone. As EC2 instances for a given workload get deployed in an AWS Wavelength zone, Carrier IP addresses can be assigned to the network interfaces at launch.
Through this solution, Carrier IP addresses attached to AWS Wavelength instances are automatically added as DNS records for the customer-provided public hosted zone.
To determine how Route 53 responds to queries, given an arbitrary number of records in a public hosted zone, Route 53 offers numerous routing policies:
Simple routing policy – In the event that you must route traffic to a single resource in an AWS Wavelength Zone, simple routing can be used. A single record can contain multiple IP addresses, but Route 53 returns the values in a random order to the client.
Weighted routing policy – To route traffic more deterministically using a set of proportions that you specify, this policy can be selected. For example, if you would like Carrier IP A to receive 50% of the traffic and Carrier IP B to receive 50% of the traffic, we’ll create two individual A records (one for each Carrier IP) with a weight of 50 and 50, respectively. Learn more about Route 53 routing policies by visiting the Route 53 Developer Guide.
The proposed solution leverages weighted routing policy in Route 53 DNS to route traffic to multiple EC2 instances running within an AWS Wavelength zone.
Reference architecture
The following diagram illustrates the load-balancing component of the solution, where EC2 instances in an AWS Wavelength zone are assigned Carrier IP addresses. A weighted DNS record for a host (e.g., www.example.com) is updated with Carrier IP addresses.
When a device makes a DNS query, it will be returned to one of the Carrier IP addresses associated with the given domain name. With a large number of devices, we expect a fair distribution of load across all EC2 instances in the resource pool. Given the highly ephemeral mobile edge environments, it’s likely that Carrier IPs could frequently be allocated to accommodate a workload and released shortly thereafter. However, this unpredictable behavior could yield stale DNS records, resulting in a “blackhole” – routes to endpoints that no longer exist.
Time-To-Live (TTL) is a DNS attribute that specifies the amount of time, in seconds, that you want DNS recursive resolvers to cache information about this record.
In our example, we should set the TTL to 30 seconds to force DNS resolvers to retrieve the latest records from the authoritative nameservers and minimize stale DNS responses. However, a lower TTL has a direct impact on cost, as a result of the increased number of calls from recursive resolvers to Route 53 to constantly retrieve the latest records.
The core components of the solution are as follows:
AWS Auto Scaling group – a collection of EC2 instances logically grouped together for the purpose of automatic scaling.
Alongside the services above in the AWS Wavelength Zone, the following services are also leveraged in the AWS Region:
AWS Lambda – a serverless event-driven function that makes API calls to the Route 53 service to update DNS records.
Amazon EventBridge – a serverless event bus that reacts to EC2 instance lifecycle events and invokes the Lambda function to make DNS updates.
Route 53 – cloud DNS service with a domain record pointing to AWS Wavelength-hosted resources.
In this post, we intentionally leave the specific load balancing software solution up to the customer. Customers can leverage various popular load balancers available on the AWS Marketplace, such as HAProxy and NGINX. To focus our solution on the auto-registration of DNS records to create functional load balancing, this solution is designed to support stateless workloads only. To support stateful workloads, sticky sessions – a process in which requests are routed to the same target in a target group – must be configured by the underlying load balancer solution and are outside of the scope of what DNS can provide natively.
Automation overview
Using the aforementioned components, we can implement the following workflow automation:
An Amazon CloudWatch alarm can trigger an Auto Scaling group scale-out or scale-in event by adding or removing EC2 instances. EventBridge detects the EC2 instance state change event and invokes the Lambda function. This function updates the DNS record in Route 53 by either adding (scale out) or deleting (scale in) a weighted A record associated with the EC2 instance changing state.
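The DNS update itself boils down to an UPSERT (or DELETE) of a weighted A record. The following boto3 sketch shows the shape of that call; the hosted zone ID, record name, set identifier, and Carrier IP are placeholder assumptions.
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",  # assumption: your public hosted zone ID
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",  # use "DELETE" on scale-in events
                "ResourceRecordSet": {
                    "Name": "www.example.com",
                    "Type": "A",
                    "SetIdentifier": "i-0abcd1234efgh5678",  # assumption: the instance ID
                    "Weight": 50,
                    "TTL": 30,  # low TTL to limit stale responses
                    "ResourceRecords": [{"Value": "203.0.113.10"}],  # assumption: Carrier IP
                },
            }
        ]
    },
)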
Configuration of the auto scaling policy is out of the scope of this post. There are many auto scaling triggers that you can consider using, based on predefined and custom metrics such as memory utilization. For demo purposes, we use manual scaling.
Amazon Virtual Private Cloud (Amazon VPC), subnets, Carrier Gateway, and Route Tables are foundational building blocks for AWS-based networking infrastructure. In our deployment, we are creating a new VPC, one subnet in an AWS Wavelength zone of your choice, a Carrier Gateway, and updating the route table for this subnet to point the default route to the Carrier Gateway.
Deployment prerequisites
The following are prerequisites to deploy the described solution in your account:
Access to an AWS Wavelength zone. If your account is not allow-listed to use AWS Wavelength zones, then opt-in to AWS Wavelength zones here.
Public DNS Hosted Zone hosted in Route 53. You must have access to a registered public domain to deploy this solution. The zone for this domain should be hosted in the same account where you plan to deploy AWS Wavelength workloads. If you don’t have a public domain, then you can register a new one. Note that there will be a service charge for the domain registration.
Amazon S3 bucket. For the Lambda function that updates DNS records in Route 53, store the source code as a .zip file in an Amazon S3 bucket.
Amazon EC2 Key pair. You can use an existing Key pair for the deployment. If you don’t have a KeyPair in the region where you plan to deploy this solution, then create one by following these instructions.
4G or 5G-connected device. Although the infrastructure can be deployed independent of the underlying connected devices, testing the connectivity will require a mobile device on one of the Wavelength partner’s networks. View the complete list of Telecommunications providers and Wavelength Zone locations to learn more.
Conclusion
In this post, we demonstrated how to implement DNS-based load balancing for workloads running in an AWS Wavelength zone. We deployed a solution that uses an EventBridge rule and a Lambda function to update DNS records hosted in Route 53. If you want to learn more about AWS Wavelength, subscribe to the AWS Compute Blog channel here.
Genomics workflows are high-performance computing workloads. Life-science research teams make use of various genomics workflows. With each invocation, they specify custom sets of data and processing steps, and translate them into commands. Furthermore, team members stay to monitor progress and troubleshoot errors, which can be cumbersome, non-differentiated, administrative work.
In Part 3 of this series, we describe the architecture of a workflow manager that simplifies the administration of bioinformatics data pipelines. The workflow manager dynamically generates the launch commands based on user input and keeps track of the workflow status. This workflow manager can be adapted to many scientific workloads—effectively becoming a bring-your-own-workflow-manager for each project.
In this blog post, we extend this idea to a new frontend layer in our design pattern. This layer automates command generation and monitors the invocations of a variety of workflows—becoming a workflow manager. Life-science research teams use multiple workflows for different datasets and use cases, each with different syntax and commands. The workflow manager we create removes the administrative burden of formulating workflow-specific commands and tracking their launches.
Solution overview
We allow scientists to upload their requested workflow configuration as objects in Amazon S3. We use S3 Event Notifications on PUT requests to invoke an AWS Lambda function. The function parses the uploaded S3 object and registers the new launch request as a DynamoDB item using the PutItem operation. Each item corresponds with a distinct launch request, stored as key-value pair. Item values store the:
S3 data path containing genomic datasets
Workflow endpoint
Preferred compute service (optional)
Another Lambda function monitors the DynamoDB stream for change data capture events (Figure 1). With each PutItem operation, the Lambda function prepares a workflow invocation, which includes translating the user input into the syntax and launch commands of the respective workflow.
In the case of Snakemake (discussed in Part 2), the function creates a Snakefile that declares processing steps and commands. The function spins up an AWS Fargate task that builds the computational tasks, distributes them with AWS Batch, and monitors for completion. An AWS Step Functions state machine orchestrates job processing, for example, initiated by Tibanna.
Amazon CloudWatch provides a consolidated overview of performance metrics, like time elapsed, failed jobs, and error types. We store log data, including status updates and errors, in Amazon CloudWatch Logs. A third Lambda function parses those logs and updates the status of each workflow launch request in the corresponding DynamoDB item (Figure 1).
Figure 1. Workflow manager for genomics workflows
Implementation considerations
In this section, we describe some of our past implementation considerations.
Register new workflow requests
DynamoDB items are key-value pairs. We use launch IDs as the key, and the value includes the workflow type, compute engine, S3 data path, the S3 object path to the user-defined configuration file, and the workflow status. Our Lambda function parses the configuration file and generates all commands plus ancillary artifacts, such as Snakefiles.
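A registration could look like the following hedged boto3 sketch; the table name, attribute names, and values are illustrative assumptions, not the actual schema.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("workflow-launches")  # assumption

table.put_item(
    Item={
        "launch_id": "launch-0001",                                 # key
        "workflow_type": "snakemake",
        "compute_engine": "tibanna",
        "s3_data_path": "s3://genomics-data/cohort-123/",           # assumption
        "config_s3_path": "s3://genomics-config/launch-0001.json",  # assumption
        "status": "REGISTERED",
    }
)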
Launch workflows
Launch requests are picked up by a Lambda function from the DynamoDB stream. The function has the following required parameters:
Launch ID: unique identifier of each workflow launch request
Configuration file: the Amazon S3 path to the configuration sheet with launch details (in s3://bucket/object format)
This assumes that the configuration sheet is already uploaded to an accessible location in an S3 bucket. The function then issues a new Snakemake Fargate launch task. If either of the parameters is not provided or access fails, the workflow manager returns MissingRequiredParametersError.
Log workflow launches
Logs are written to CloudWatch Logs automatically. We write the location of the CloudWatch log group and log stream into the DynamoDB table. To send logs to Amazon CloudWatch, specify the awslogs driver in the Fargate task definition settings in your provisioning template.
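If you register the task definition through the SDK instead of a provisioning template, the log configuration takes the following shape; this boto3 sketch uses assumed names, ARNs, image, and sizing.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="snakemake-launcher",  # assumption
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",
    memory="2048",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # assumption
    containerDefinitions=[
        {
            "name": "launcher",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/snakemake:latest",  # assumption
            "essential": True,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/snakemake-launcher",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "launch",
                },
            },
        }
    ],
)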
Our Lambda function writes Fargate task launch logs from CloudWatch Logs to our DynamoDB table. For example, OutOfMemoryError can occur if the process utilizes more memory than the container is allocated.
AWS Batch job state logs are written to the following log group in CloudWatch Logs: /aws/batch/job. Our Lambda function writes status updates to the DynamoDB table. AWS Batch jobs may encounter errors, such as being stuck in RUNNABLE state.
Manage state transitions
We manage the status of each job in DynamoDB. Whenever a Fargate task changes state, it is picked up by a CloudWatch rule that references the Fargate compute cluster. This CloudWatch rule invokes a notifier Lambda function that updates the workflow status in DynamoDB.
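The status update performed by the notifier Lambda function can be as small as a single UpdateItem call, as in this hedged sketch that reuses the assumed table and attribute names from the registration example above.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("workflow-launches")  # assumption

table.update_item(
    Key={"launch_id": "launch-0001"},
    UpdateExpression="SET #s = :status",
    ExpressionAttributeNames={"#s": "status"},         # "status" is a DynamoDB reserved word
    ExpressionAttributeValues={":status": "RUNNING"},  # e.g. WAITING, RUNNING, READY
)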
Conclusion
In this blog post, we demonstrated how life-science research teams can simplify genomic analysis across an array of workflows. These workflows usually have their own command syntax and workflow management system, such as Snakemake. The presented workflow manager removes the administrative burden of preparing and formulating workflow launches, increasing reliability.
The pattern is broadly reusable with any scientific workflow and related high-performance computing systems. The workflow manager provides persistence to enable historical analysis and comparison, which enables us to automatically benchmark workflow launches for cost and performance.
How do you prevent the most common merge conflicts when your team is working on a Serverless application? How do you make sure that your team stays productive and avoids large merge issues while trying to update the same crucial files simultaneously? The answer to both questions is code organization! You can use cfn-include and swagger-cli to organize, collaborate, and maintain a large serverless application as well as support a large or decentralized development team.
Real life inspiration
WRAP Technologies Inc. (WRAP) creates advanced technologies for the protection and security of public safety. Their WRAP Reality product allows law enforcement agencies to train their officers using virtual reality-based scenarios.
Too many cooks in the kitchen
When multiple developers collaborate on a serverless architecture built with AWS CloudFormation and its extensions, such as the AWS Serverless Application Model (SAM), the nature of specifying resources in both the template.yaml and the optional OpenAPI.yaml specification for Amazon API Gateway leads to merge conflicts, such as the one demonstrated in the following figure, where two developers add different API endpoints at the same time. These conflicts detract from the developer’s time and agility. Furthermore, navigating and maintaining the long template files required for a larger serverless architecture slows development as the developer scans large files to find a particular resource definition.
Figure 1. The frustrating merge conflicts.
By refactoring and organizing the CloudFormation and OpenAPI files, your development team can realize several benefits:
Improve developer efficiency by decomposing large, hard-to-manage files into a series of well-organized and single-purpose files.
Enhance developer productivity by allowing each developer to have ownership of their own code, thereby reducing the need to coordinate merges with teammates.
Eliminate potential merge issues for files that generate the most conflicts during the development of a typical Serverless API application.
Rapid development
WRAP partnered with AWS to develop and host the backend for their new officer training management platform. This entirely new platform was developed, completed, and available for use in a matter of months. Moreover, it’s a collaboration of developers spread across multiple teams worldwide, all contributing to the same code base. By instituting the norms and techniques of this post, WRAP created a large and maintainable serverless application with minimal developer code collisions.
Development of the WRAP Reality training management system was accomplished using CloudFormation for defining Infrastructure as Code (IaC), and an Amazon API Gateway OpenAPI specification for defining API contracts. The development team for the WRAP Reality training management service leveraged agile development for expediency, including the GitHub Flow branching strategy. However, since project contributors were not co-located, several considerations were put in place to ensure consistency and speed of code development:
The API specifications and contracts were defined in OpenAPI (Swagger) specifications early in the development process, clearly defining the project structure up front, and allowing developers to independently build infrastructure components.
The two code assets central to the entire project – the CloudFormation template and the OpenAPI Specification – were decomposed into small, easily manageable components. This enabled components to be organized in a way that enhanced development productivity and practically eliminated the inevitable merge conflicts that come with large source code files that are being modified on a daily basis.
The development process was accelerated by utilizing OpenAPI integrations with AWS services, as well as techniques for managing the OpenAPI specification and CloudFormation template files.
Sample project
To demonstrate these techniques, we’ll explore the following sample project comprising API endpoints for “widget” management, available on GitHub. This project provides the following endpoints:
/widget PUT: Creation of a new widget
/widget GET: Retrieval of a widget
/reports/color GET: Retrieval of a set of widgets based on the widget color
/reports/filterpage GET: Retrieval of widgets based on specified filters
The overall architecture of the application is shown in the following diagram:
Figure 2. Architecture Diagram
The application comprises:
Amazon API Gateway is a fully-managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. In this example, API Gateway serves as the web service for the API endpoints. The mapping of data to and from the API endpoints to the Lambda functions is formally defined by an OpenAPI specification file.
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. In this example, four Lambda functions are used to service each of the four API calls.
Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. DynamoDB is used as a persistent data store for widgets and associated properties.
OpenAPI and AWS service integration
When using API Gateway, developers have the option of using proxy Lambda integrations, or formally defining the API interface in an OpenAPI yaml file. The OpenAPI specification can be leveraged to document the API prior to development, and the example/mock features of the OpenAPI specification facilitate concurrent development by quickly establishing a working infrastructure to build upon. Furthermore, API documentation can be automatically generated from the OpenAPI specification.
As the number of endpoints increases, the OpenAPI specification file can grow in size, reaching thousands of lines of code that must be updated and maintained regularly by multiple developers. To aid in management and usability, the OpenAPI file can be decomposed into separate files for endpoints, responses, fields, and schemas.
Start with a “skeleton” file as an entry point for the OpenAPI definition, and then add a separate file for the definition of each endpoint or construct. For example, the sample project entry point is api/apiSkeleton.yaml, which contains the global definitions and effectively defines a simple list of endpoints and the reference ($ref) file path to each endpoint’s definition.
Diving into a file referenced by an endpoint, we see that it contains all of the specification details for that endpoint. Looking at the reportsColor.yaml file reveals the full endpoint specification for /reports/color:
get:
  description: Get widgets by color
  parameters:
    - in: path
      $ref: '../../requestParameters/color.yaml'
  responses:
    200:
      description: Get All the Widgets of a color
      content:
        application/json:
          schema:
            $ref: '../../schemas/widgetList.yaml'
. . .
In turn, this endpoint specification can include further references to yaml files defining common parameters, schemas, and even full gateway responses. For example, color.yaml defines the color path variable:
type: string
description: "The widget's color"
example: "Red"
To paraphrase a common catch phrase, “With a great many files, comes a great responsibility for organization.” To this end, we offer the following organizational structure as a start. Place all of the related API specifications in an “api” subfolder of your project. Have child subfolders for field, metadata, and gateway response definition files. Then, create child subfolder trees for each branch of your endpoints that mirror the endpoint paths. This will result in a highly-organized directory structure, as seen in the sample project:
We still need a consolidated single OpenAPI file to provide to CloudFormation during deployment to AWS. Therefore, the multiple files are combined and validated using the swagger-cli bundle command, resulting in a single file for deployment. The bundle command must be executed before a CloudFormation build. This command can also be included as a shortcut in the Makefile as the “buildOpenApi” command:
Once compiled, api/api.yaml is then used normally for API Gateway integrations and as a Postman API Collection import. As api/api.yaml is dynamically compiled, it’s included in .gitignore and not checked in to AWS CodeCommit.
cfn-include and nested stacks
The CloudFormation template that defines the infrastructure for even a simple service can grow to considerable length, perhaps thousands of lines. This presents challenges from a support and continued development perspective, as specific code locations become difficult to find and merge conflicts become commonplace.
CloudFormation Nested Stacks are a method of breaking a large CloudFormation template into separate templates. When there are clear delineations between groups of resources in a stack, breaking it into separate nested stacks makes sense. There is also a 500-resource limit in a single CloudFormation stack, and nested or separate stacks are necessary to go above that. Depending on the complexity of the architecture and frequency of updates, however, the nested stacks can also become large. Furthermore, in a serverless architecture, the logical separation of architecture layers into separate stacks may not be direct – for example, when a Lambda function is triggered by an event sent to an EventBridge event bus, and that Lambda function sends a different event back to the same event bus.
In these cases, CloudFormation templates can be decomposed to further leverage cfn-include. With this technique, the top-level CloudFormation template becomes a skeleton file which contains the stack parameters, global specifications, a list of resource names without properties, and the outputs. The properties of each resource are contained in separate files, referenced by an ‘include’ directive.
CloudFormation template organization
To organize your CloudFormation template, deconstruct the template into one-file-per-resource, with one main “skeleton” file as the main entry point. This skeleton file contains the full parameters, global section, conditions, and output specification. The resources are specified by resource name in this skeleton file, and then an ‘include’ directive points to the file that contains the body of the resource declaration. See the following example of the main skeleton file with two resources:
This approach has the benefit of breaking the CloudFormation template into multiple small files, while still maintaining a top-level holistic view. The resource definitions, which normally comprise the majority of the content and can cause merge conflicts, are moved out of the main template.
For organization, you can create a directory in your project to contain the CloudFormation scripts. This directory also contains the entry-point skeleton file. Create sub-folders for resources, and then further folders by resource type and architecture. We found that placing the applicable AWS Identity and Access Management (IAM) role resource definitions in the same folder as the resources they apply to made navigation easier. For example:
The files must be reconstituted into a single template.yaml for the CloudFormation build and deployment. This is accomplished with the cfn-include command. A convenience command can optionally be included in the Makefile.
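As a sketch, assuming a hypothetical cloudformation/skeleton.yaml entry point and cfn-include’s --yaml output option, the Makefile target could look like this:

# buildTemplate: resolve all include directives into a single template.yaml
# (recipe line below must be indented with a tab character)
buildTemplate:
	cfn-include cloudformation/skeleton.yaml --yaml > template.yaml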
As the final template.yaml file is dynamically compiled, it’s included in .gitignore and not checked in to CodeCommit.
Conclusion
This post demonstrates techniques used by WRAP and AWS to rapidly develop and maintain key files in a serverless architecture. The techniques discussed in this post allowed the WRAP and AWS team to do the following:
Improve developer efficiency by decomposing large, hard-to-manage files into a series of well-organized, single-purpose files.
Enhance developer productivity by allowing each developer to have ownership of their own piece of the code without having to coordinate with teammates.
Eliminate potential merge issues on the files that typically generate the most conflicts during development of a serverless API application.
Applying these techniques was one of the key factors in the rapid development of the WRAP Reality training framework.
Many organizations require durable, automated code delivery for their applications. They leverage multi-account continuous integration/continuous deployment (CI/CD) pipelines to deploy code and run automated tests in multiple environments before deploying to production. In cases where the testing strategy is release specific, you must update the pipeline before every release. Traditional pipeline stages are predefined and static in nature, and once they are defined it’s hard to update them. In this post, we present a configuration-driven dynamic CI/CD solution per repository. The pipeline state is maintained and governed by configurations stored in Amazon DynamoDB. This gives you the advantage of automatically customizing the pipeline for every release based on the testing requirements.
The following diagram illustrates the solution architecture:
Figure 1: Architecture Diagram
Users insert, update, or delete an entry in the DynamoDB table.
The Step Function Trigger Lambda is invoked on all modifications.
The Step Function Trigger Lambda evaluates the incoming event and does the following:
On insert and update, triggers the Step Function.
On delete, finds the appropriate CloudFormation stack and deletes it.
Steps in the Step Function are as follows:
Collect Information (Pass State) – Filters the relevant information from the event, such as repositoryName and referenceName.
Get Mapping Information (Backed by the CodeCommit event filter Lambda) – Retrieves the mapping information from the pipeline configuration stored in DynamoDB.
Deployment Configuration Exist? (Choice State) – If StatusCode == 200, the DynamoDB entry was found and the Initiate CloudFormation Stack step is invoked; otherwise, the Step Function exits successfully.
Initiate CloudFormation Stack (Backed by the stack create Lambda) – Constructs the CloudFormation parameters and creates or updates the dynamic pipeline via CloudFormation, based on the configuration stored in DynamoDB.
Code deliverables
The code deliverables include the following:
AWS CDK app – The AWS CDK app contains the code for all the Lambdas, Step Functions, and CloudFormation templates.
sample-application-repo – This directory contains the sample application repository used for deployment.
automated-tests-repo – This directory contains the sample automated tests repository for testing the sample repo.
Follow the README to deploy the solution to your main CI/CD account. Upon successful deployment, the following resources should be created in the CI/CD account:
A DynamoDB table
A Step Function
Lambda functions
Navigate to the Amazon Simple Storage Service (Amazon S3) console in your main CI/CD account and search for a bucket with the name: cloudformation-template-bucket-<AWS_ACCOUNT_ID>. You should see two CloudFormation templates (templates/codepipeline.yaml and templates/childaccount.yaml) uploaded to this bucket.
Deploy childaccount.yaml in every target CI/CD account (Alpha, Beta, Gamma, and Prod) using the CloudFormation console. Provide the main CI/CD account number as the “CentralAwsAccountId” parameter, and create the stack.
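If you prefer the AWS CLI to the console, the equivalent command looks roughly like the following sketch; the stack name is a hypothetical choice, the template is assumed to have been downloaded locally from the S3 bucket, and the IAM capability is required because the stack creates IAM roles:

# Deploy the child-account roles stack in a target account
aws cloudformation create-stack \
  --stack-name dynamic-pipeline-child-account \
  --template-body file://childaccount.yaml \
  --parameters ParameterKey=CentralAwsAccountId,ParameterValue=<MAIN_CI_CD_ACCOUNT_ID> \
  --capabilities CAPABILITY_NAMED_IAM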
Upon successful creation of the stack, two roles are created in the child accounts:
ChildAccountFormationRole
ChildAccountDeployerRole
Pipeline configuration
Make an entry into devops-pipeline-table-info for the Repository name and branch combination. A sample entry can be found in sample-entry.json.
The pipeline is highly configurable, and everything can be configured through the DynamoDB entry.
The following are the top-level keys:
RepoName: Name of the repository for which AWS CodePipeline is configured.
RepoTag: Name of the branch used in CodePipeline.
BuildImage: Build image used for the application AWS CodeBuild project.
BuildSpecFile: Buildspec file used in the application CodeBuild project.
DeploymentConfigurations: This key holds the deployment configurations for the pipeline. Under this key are the environment-specific configurations. In our case, we’ve named our environments Alpha, Beta, Gamma, and Prod. You can use any names you like, but make sure that the entries in the JSON match those in the codepipeline.yaml CloudFormation template, because there is a 1:1 mapping between them. Sub-level keys under DeploymentConfigurations are as follows:
EnvironmentName: This is the top-level key for environment-specific configuration. In our case, it’s Alpha, Beta, Gamma, and Prod. Sub-level keys under this are:
<Env>AwsAccountId: AWS account ID of the target environment.
Deploy<Env>: A key specifying whether or not the artifact should be deployed to this environment. Based on its value, the CodePipeline will have a deployment stage to this environment.
ManualApproval<Env>: Key representing whether or not manual approval is required before deployment. Enter your email or set to false.
Tests: This is another top-level key with sub-level keys. It holds the test-related information for tests to be run in specific environments. Each test that is configured to run adds an additional stage to the CodePipeline. The test information is also configurable, with the ability to specify the test repository, branch name, buildspec file, and build image for the test CodeBuild project.
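The sample-entry.json file in the repository is the authoritative reference. Purely as an illustrative sketch of how these keys fit together (the nesting and the sub-key names under Tests are assumptions, and the account IDs are placeholders), an entry might look like the following:

{
  "RepoName": "sample-application-repo",
  "RepoTag": "main",
  "BuildImage": "aws/codebuild/standard:7.0",
  "BuildSpecFile": "buildspec.yaml",
  "DeploymentConfigurations": {
    "Alpha": {
      "AlphaAwsAccountId": "111111111111",
      "DeployAlpha": "true",
      "ManualApprovalAlpha": "false"
    },
    "Prod": {
      "ProdAwsAccountId": "444444444444",
      "DeployProd": "true",
      "ManualApprovalProd": "approver@example.com"
    }
  },
  "Tests": {
    "IntegrationTests": {
      "TestRepoName": "automated-tests-repo",
      "TestRepoTag": "main",
      "TestBuildSpecFile": "buildspec.yaml",
      "TestBuildImage": "aws/codebuild/standard:7.0"
    }
  }
}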
Execute
Make an entry into the devops-pipeline-table-info DynamoDB table in the main CI/CD account. A sample entry can be found in sample-entry.json. Make sure to replace the configuration values with appropriate values for your environment. An explanation of the values can be found in the Pipeline Configuration section above.
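One way to create the entry from the command line is sketched below; it assumes that sample-entry.json is formatted as a DynamoDB attribute-value document (otherwise, create the item through the DynamoDB console):

# Create the pipeline configuration item in the main CI/CD account
aws dynamodb put-item \
  --table-name devops-pipeline-table-info \
  --item file://sample-entry.json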
After the entry is made in the DynamoDB table, you should see a CloudFormation stack being created. This CloudFormation stack will deploy the CodePipeline in the main CI/CD account by reading and using the entry in the DynamoDB table.
Customize the solution for different combinations, such as deploying to some environments while skipping others, by updating the pipeline configurations stored in the devops-pipeline-table-info DynamoDB table. The following is the pipeline configured for the sample-application repository’s main branch.
Figure 2: Dynamic Multi-Account CI/CD Pipeline
Clean up your dynamic multi-account CI/CD solution and related resources
To avoid ongoing charges for the resources that you created following this post, you should delete the following:
The pipeline configuration stored in the DynamoDB table
The CloudFormation stacks deployed in the target CI/CD accounts
The AWS CDK app deployed in the main CI/CD account
Empty and delete the retained S3 buckets.
Conclusion
This configuration-driven CI/CD solution provides the ability to dynamically create and configure your pipelines through configurations stored in DynamoDB. IDEMIA, a global leader in identity technologies, adopted this approach for deploying their microservices-based application across environments. This solution, created by AWS Professional Services, allowed them to dynamically create and configure their pipelines per repository and per release. As Kunal Bajaj, Tech Lead of IDEMIA, states, “We worked with AWS pro-serve team to create a dynamic CI/CD solution using lambdas, step functions, SQS, and other native AWS services to conduct cross-account deployments to our different environments while providing us the flexibility to add tests and approvals as needed by the business.”
Genomics workflows are high-performance computing workloads. In Part 1 of this series, we demonstrated how life-science research teams can focus on scientific discovery without the associated heavy lifting. We used regenie for large genome-wide association studies. Our design pattern built on AWS Step Functions with AWS Batch and Amazon FSx for Lustre.
In Part 2, we explore genomics workloads with built-in workflow logic. Historically, running bioinformatics data pipelines was a manual and error-prone task. Over the last few years, multiple workflow management systems have emerged. One example is the Snakemake workflow management system with Tibanna orchestration. We discuss the solution design and how you can fully automate the launch with Amazon Web Services (AWS).
Use case
We focus on the use case of Snakemake, an open-source utility for whole genome sequence mapping in directed acyclic graph (DAG) format. Snakemake uses Snakefiles to declare workflow steps and commands. A Snakefile extends Python syntax to declare workflow steps such as mapping data sets to DAG structure and identifying variants. Consult the Snakemake tutorial for further information on workflow rules.
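As a minimal illustration of the syntax (the rule names, file paths, and command are hypothetical and not taken from our workflow), a two-rule Snakefile might look like this:

# Target rule: declares the final outputs that Snakemake should produce
rule all:
    input:
        "results/sample1.vcf"

# Produce a variant call file from raw reads (illustrative command only)
rule call_variants:
    input:
        "data/sample1.fastq"
    output:
        "results/sample1.vcf"
    shell:
        "variant_caller {input} > {output}"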
Snakefiles provide an exception from the general design pattern and an alternative to granular modeling of workflow logic in Amazon States Language. In our real-life use case, we used Tibanna to orchestrate Snakemake. Tibanna is an open-source, AWS-native tool that runs bioinformatics data pipelines. It supports Snakefile syntax, plus other workflow languages, including Common Workflow Language and Workflow Description Language (WDL).
We recommend using Amazon Genomics CLI if Tibanna is not needed for your use case, and Amazon Omics if your workflow definitions comply with the supported WDL and Nextflow specifications.
The launched AWS Fargate task pulls the Snakefiles at launch time for each job and prepares the Snakemake initiation commands. Once the Snakefiles are downloaded to the Fargate task, the Snakemake head initiation command is invoked to begin launching jobs using Tibanna. Tibanna invokes a Step Functions state machine that orchestrates the launch of Snakemake on Amazon Elastic Compute Cloud (Amazon EC2).
Amazon CloudWatch provides a consolidated overview of performance metrics, including elapsed time, failed jobs, and error types. You can keep logs of your failed jobs in CloudWatch Logs (Figure 1). You can set up filters to match specific error types, plus create subscriptions to deliver a real-time stream of your log events to Amazon Kinesis or Lambda for further retry.
Figure 1. Solution architecture for Snakemake with Tibanna on AWS
Implementation considerations
Here, we describe some of the implementation considerations.
Creating Snakefiles
The launching point for the initiation depends on a Snakefile. Each Snakefile may contain one or more samples to be launched. The file resides in an S3 bucket. This adds flexibility and the ability to purge any sensitive or restrictive information after the job has been processed.
Invoking Tibanna
To launch Snakemake DAGs using Tibanna, we need to set up a new Tibanna Unicorn. A Tibanna Unicorn is a Step Functions state machine and a corresponding Lambda function for provisioning EC2 instances.
The state machine runs the following sequence:
Create EC2 instance
Check EC2 status
Exit
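Creating the Unicorn is a one-time setup step. A minimal sketch using the Tibanna CLI is shown below; the bucket name is a hypothetical placeholder for the S3 bucket that holds your data and logs:

# Create a Tibanna Unicorn (state machine plus Lambda) with access to the named bucket
tibanna deploy_unicorn --buckets=<YOUR_DATA_AND_LOG_BUCKET>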
After the Tibanna Unicorn has been created, we can start a Snakemake DAG using the following sample commands inside of the Fargate task.
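The following is a sketch of such a launch command; the bucket name and prefix passed to --default-remote-prefix are hypothetical placeholders for the S3 location that holds the workflow inputs and outputs:

# Launch the Snakemake DAG and delegate job execution to Tibanna
snakemake --tibanna \
  --default-remote-prefix=<YOUR_DATA_BUCKET>/snakemake-run \
  --jobs 8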
The Snakemake command is used with the --tibanna flag to send launch requests to the Step Functions state machine in order to provision EC2 instances and run DAG tasks.
With this solution, each launch will automatically capture and retain start logs in a centralized location in Amazon CloudWatch Logs for tracing and auditing.
If there are issues during the launch of the Tibanna Step Function state machine, such as Amazon EC2 capacity limits, logs will be available in the S3 bucket that was specified during the Tibanna Unicorn creation process. There will be a file available in the format of <EXECUTION_ID>.log inside of the S3 bucket. This information is easily accessible via the command line interface. Use the following command to display specific log results or error messages.
tibanna log -j <EXECUTION_ID> -T
Retries and EC2 Spot Instances
We advise using Amazon EC2 Spot Instances, if possible, for additional cost savings. This option is available in the --tibanna-config arguments with the setting spot_instance=true.
Spot Instances are optional, and you need to create retry logic in the event a Spot Instance is reclaimed. You can include --retries=3 in your Tibanna launch command, which ensures that all rules are retried three times. You can also specify the number of retries for individual rules in the Snakemake DAG definition; for example:
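The sketch below assumes Snakemake 7.7 or later, which introduced the per-rule retries directive; the rule, file names, and command are hypothetical:

# Retry this rule up to three times, for example after a Spot interruption
rule call_variants:
    input:
        "data/sample1.fastq"
    output:
        "results/sample1.vcf"
    retries: 3
    shell:
        "variant_caller {input} > {output}"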
If the EC2 Spot Instance capacity limit is hit, you can automatically fall back to EC2 On-Demand Instances. Add the behavior_on_capacity_limit argument and set it to retry_without_spot.
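Putting these options together, a launch command might look like the following sketch; the exact --tibanna-config value syntax is an assumption based on the key=value settings named above, and the bucket name remains a hypothetical placeholder:

# Launch with Spot Instances, fall back to On-Demand on capacity limits, retry rules three times
snakemake --tibanna \
  --tibanna-config spot_instance=true behavior_on_capacity_limit=retry_without_spot \
  --default-remote-prefix=<YOUR_DATA_BUCKET>/snakemake-run \
  --retries=3 \
  --jobs 8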
To initiate jobs on ParallelCluster, install the AWS Systems Manager agent on the head node. This is the launching point into the cluster and is used for submitting jobs to the initiation queue. Systems Manager is a secure way to remotely invoke commands on an EC2 instance without the need for SSH access. You can restrict access to your EC2 instance through IAM policies.
Conclusion
In this blog post, we demonstrated how life-science research teams can simplify the launch of Snakemake using AWS. We used Snakefiles and Tibanna to orchestrate workflow steps. Snakefiles provide an exception from the general design pattern and an alternative to Amazon States Language. File uploads to Amazon S3 served as our launching point for workflow initiations.
Stay tuned for Part 3 of this series, in which we create a job manager that administrates multiple workflows.