All posts by Chris Munns

Coming soon: Expansion of AWS Lambda states to all functions

Post Syndicated from Chris Munns original

In November of 2019, we announced AWS Lambda function state attributes, a capability to track the current “state” of a function throughout its lifecycle.

Since launch, states have been used in two primary use cases. First, to move the blocking setup of VPC resources out of the path of function invocation. Second, to allow the Lambda service to optimize new or updated container images for container image-based functions, also before invocation. By moving this additional work out of the invocation path, customers see lower latency and better consistency in function performance. Soon, we will be expanding states to apply to all Lambda functions.

This post outlines the upcoming change, any impact, and actions to take during the rollout of function states to all Lambda functions. Most customers experience no impact from this change.

As functions are created or updated, or potentially fall idle due to low usage, they can transition to a state associated with that lifecycle event. Previously, any function that was zip-file based and not attached to a VPC would only show an Active state, and updates to the application code or modifications of the function configuration would always show the Successful value for the LastUpdateStatus attribute. Now all functions will follow the same function state lifecycles described in the initial announcement post and in the documentation for Monitoring the state of a function with the Lambda API.

All AWS CLIs and SDKs have supported monitoring Lambda function states transitions since the original announcement in 2019. Infrastructure as code tools such as AWS CloudFormation, AWS SAM, Serverless Framework, and Hashicorp Terraform also already support states. Customers using these tools do not need to take any action as part of this, except for one recommended service role policy change for AWS CloudFormation customers (see Updating CloudFormation’s service role below).

However, there are some customers using SDK-based automation workflows, or calling Lambda’s service APIs directly, that must update those workflows for this change. To allow time for testing this change, we are rolling it out in a phased model, much like the initial rollout for VPC attached functions. We encourage all customers to take this opportunity to move to the latest SDKs and tools available.

Change details

Nothing is changing about how functions are created, updated, or operate as part of this. However, this change may impact certain workflows that attempt to invoke or modify a function shortly after a create or an update action. Before making API calls to a function that was recently created or modified, confirm it is first in the Active state, and that the LastUpdateStatus is Successful.
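A minimal sketch of that check using the Lambda GetFunctionConfiguration API from Python (the helper name here is illustrative; recent boto3 releases also ship built-in waiters such as `function_active_v2` that do the same thing):

```python
import time

def wait_until_ready(lambda_client, function_name, timeout=60, interval=2):
    """Poll until the function is Active and its last update succeeded.

    A sketch only; production code could use boto3's built-in waiters instead.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        cfg = lambda_client.get_function_configuration(FunctionName=function_name)
        # Fail fast on terminal states rather than waiting out the timeout.
        if cfg.get("State") == "Failed" or cfg.get("LastUpdateStatus") == "Failed":
            raise RuntimeError(f"{function_name} failed: {cfg.get('StateReason')}")
        if cfg.get("State") == "Active" and cfg.get("LastUpdateStatus") == "Successful":
            return True
        time.sleep(interval)
    return False
```

In practice, `lambda_client` would be `boto3.client("lambda")`, and you would call this after `create_function` or `update_function_code` and before the first `invoke`.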

For a full explanation of both the create and update lifecycles, see Tracking the state of AWS Lambda functions.

Create function state lifecycle

Update function state lifecycle

Change timeframe

We are rolling out this change in multiple phases, starting with the Begin Testing phase today, July 12, 2021. The phases give you time to update tooling for deploying and managing Lambda functions to account for this change. By the end of the update timeline, all accounts transition to using the create/update Lambda lifecycle.

July 12, 2021 – Begin Testing: You can now begin testing and updating any deployment or management tools you have to account for the upcoming lifecycle change. You can also use this time to update your function configuration to delay the change until the End of Delayed Update.

September 6, 2021 – General Update (with optional delayed update configuration): All customers without the delayed update configuration begin seeing functions transition through the create and update lifecycles. Customers that have applied the delayed update configuration described below will not see any change.

October 1, 2021 – End of Delayed Update: The delay mechanism expires and all customers now see the Lambda states lifecycle applied during function create or update.

Opt-in and delayed update configurations

Starting today, we are providing an opt-in mechanism. This allows you to update and test your tools and developer workflow processes for this change. We are also providing a mechanism to delay this change until the End of Delayed Update date. After that date, all functions begin using the Lambda states lifecycle.

This mechanism operates on a function-by-function basis, so you can test and experiment individually without impacting your whole account. Once the General Update phase begins, all functions in an account that do not have the delayed update mechanism in place see the new lifecycle for their functions.

Both mechanisms work by adding a special string in the “Description” parameter of Lambda functions. You can add this string anywhere in this parameter. You can opt to add it to the prefix or suffix, or set the entire contents of the field. This parameter is processed at create or update in accordance with the requested action.

To opt in:


To delay the update:


NOTE: The delay configuration mechanism has no impact after the End of Delayed Update date.

Here is how this looks in the console:

I add the opt-in configuration to my function’s Description. You can find this under Configuration -> General Configuration in the Lambda console. Choose Edit to change the value.

Edit basic settings

After choosing Save, you can see the value in the console:

Opt-in flag set

Once the opt-in is set for a function, then updates on that function go through the preceding update flow.

Checking a function’s state

With this in place, you can now test your development workflow ahead of the General Update phase. Download the latest AWS CLI (version 2.2.18 or greater) or SDKs to see function state and related attribute information.

You can confirm the current state of a function using the AWS APIs or AWS CLI to perform the GetFunction or GetFunctionConfiguration API or command for a specified function:

$ aws lambda get-function --function-name MY_FUNCTION_NAME --query 'Configuration.[State, LastUpdateStatus]'

This returns the State and LastUpdateStatus in order for a function.

Updating CloudFormation’s service role

CloudFormation allows customers to create an AWS Identity and Access Management (IAM) service role that makes calls to resources in a stack on their behalf. Customers can use service roles to allow or deny the ability to create, update, or delete resources in a stack.

As part of the rollout of function states for all functions, we recommend that customers configure CloudFormation service roles with an Allow for the “lambda:GetFunction” API. This API allows CloudFormation to get the current state of a function, which is required to assist in the creation and deployment of functions.
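A statement like the following could be added to the service role's policy (a minimal sketch; the broad `Resource` value is an assumption and can be scoped down to specific function ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCloudFormationToCheckLambdaState",
      "Effect": "Allow",
      "Action": "lambda:GetFunction",
      "Resource": "*"
    }
  ]
}
```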


With function states, you have better clarity on how the resources required by your Lambda function are being created. This change does not impact the way that functions are invoked or how your code is run. While this is a minor change to when resources are created for your Lambda function, the result is even better consistency when working with the service.

For more serverless learning resources, visit Serverless Land.

Hosting Hugging Face models on AWS Lambda for serverless inference

Post Syndicated from Chris Munns original

This post was written by Eddie Pick, AWS Senior Solutions Architect – Startups, and Scott Perry, AWS Senior Specialist Solutions Architect – AI/ML

Hugging Face Transformers is a popular open-source project that provides pre-trained, natural language processing (NLP) models for a wide variety of use cases. Customers with minimal machine learning experience can use pre-trained models to enhance their applications quickly using NLP. This includes tasks such as text classification, language translation, summarization, and question answering – to name a few.

First introduced in 2017, the Transformer is a modern neural network architecture that has quickly become the most popular type of machine learning model applied to NLP tasks. It outperforms previous techniques based on convolutional neural networks (CNNs) or recurrent neural networks (RNNs). The Transformer also offers significant improvements in computational efficiency. Notably, Transformers are more conducive to parallel computation. This means that Transformer-based models can be trained more quickly, and on larger datasets than their predecessors.

The computational efficiency of Transformers provides the opportunity to experiment and improve on the original architecture. Over the past few years, the industry has seen the introduction of larger and more powerful Transformer models. For example, BERT was first published in 2018 and was able to get better benchmark scores on 11 natural language processing tasks using between 110M-340M neural network parameters. In 2019, the T5 model using 11B parameters achieved better results on benchmarks such as summarization, question answering, and text classification. More recently, the GPT-3 model was introduced in 2020 with 175B parameters, and in 2021 the Switch Transformer scaled to over 1T parameters.

One consequence of this trend toward larger and more powerful models is an increased barrier to entry. As the number of model parameters increases, so does the computational infrastructure that is necessary to train such a model. This is where the open-source Hugging Face Transformers project helps.

Hugging Face Transformers provides over 30 pretrained Transformer-based models available via a straightforward Python package. Additionally, there are over 10,000 community-developed models available for download from Hugging Face. This allows users to use modern Transformer models within their applications without requiring model training from scratch.

The Hugging Face Transformers project directly addresses challenges associated with training modern Transformer-based models. Many customers want a zero administration ML inference solution that allows Hugging Face Transformers models to be hosted in AWS easily. This post introduces a low touch, cost effective, and scalable mechanism for hosting Hugging Face models for real-time inference using AWS Lambda.


Our solution consists of an AWS Cloud Development Kit (AWS CDK) script that automatically provisions container image-based Lambda functions that perform ML inference using pre-trained Hugging Face models. This solution also includes Amazon Elastic File System (EFS) storage that is attached to the Lambda functions to cache the pre-trained models and reduce inference latency.

Solution architecture

In this architectural diagram:

  1. Serverless inference is achieved by using container image-based Lambda functions.
  2. The container image is stored in an Amazon Elastic Container Registry (ECR) repository within your account.
  3. Pre-trained models are automatically downloaded from Hugging Face the first time the function is invoked.
  4. Pre-trained models are cached within Amazon Elastic File System storage to improve inference latency.

The solution includes Python scripts for two common NLP use cases:

  • Sentiment analysis: Identifying if a sentence indicates positive or negative sentiment. It uses a fine-tuned model on sst2, which is a GLUE task.
  • Summarization: Summarizing a body of text into a shorter, representative text. It uses a Bart model that was fine-tuned on the CNN / Daily Mail dataset.

For simplicity, both of these use cases are implemented using Hugging Face pipelines.


The following is required to run this example:

Deploying the example application

  1. Clone the project to your development environment:
    git clone
  2. Install the required dependencies:
    pip install -r requirements.txt
  3. Bootstrap the CDK. This command provisions the initial resources needed by the CDK to perform deployments:
    cdk bootstrap
  4. This command deploys the CDK application to its environment. During the deployment, the toolkit outputs progress indications:
    $ cdk deploy

Testing the application

After deployment, navigate to the AWS Management Console to find and test the Lambda functions. There is one for sentiment analysis and one for summarization.

To test:

  1. Enter “Lambda” in the search bar of the AWS Management Console.
  2. Filter the functions by entering “ServerlessHuggingFace”.
  3. Select the ServerlessHuggingFaceStack-sentimentXXXXX function.
  4. In the Test event, enter the following snippet and then choose Test:

    {
      "text": "I'm so happy I could cry!"
    }

The first invocation takes approximately one minute to complete. The initial Lambda function environment must be allocated and the pre-trained model must be downloaded from Hugging Face. Subsequent invocations are faster, as the Lambda function is already prepared and the pre-trained model is cached in EFS.

Function test results

The JSON response shows the result of the sentiment analysis:

{
  "statusCode": 200,
  "body": {
    "label": "POSITIVE",
    "score": 0.9997532367706299
  }
}

Understanding the code structure

The code is organized using the following structure:

├── inference
│ ├── Dockerfile
│ ├──
│ └──
└── ...

The inference directory contains:

  • The Dockerfile used to build a custom image to be able to run PyTorch Hugging Face inference using Lambda functions
  • The Python scripts that perform the actual ML inference

The script shows how to use a Hugging Face Transformers model:

import json
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

def handler(event, context):
    response = {
        "statusCode": 200,
        "body": nlp(event['text'])[0]
    }
    return response

For each Python script in the inference directory, the CDK generates a Lambda function backed by a container image and a Python inference script.

CDK script

The CDK script is named in the solution’s repository. The beginning of the script creates a virtual private cloud (VPC).

vpc = ec2.Vpc(self, 'Vpc', max_azs=2)

Next, it creates the EFS file system and an access point in EFS for the cached models:

        fs = efs.FileSystem(self, 'FileSystem', vpc=vpc)
        access_point = fs.add_access_point('MLAccessPoint',
                                           create_acl=efs.Acl(owner_gid='1001', owner_uid='1001', permissions='750'),
                                           posix_user=efs.PosixUser(gid="1001", uid="1001"))

It iterates through the Python files in the inference directory:

docker_folder = os.path.dirname(os.path.realpath(__file__)) + "/inference"
pathlist = Path(docker_folder).rglob('*.py')
for path in pathlist:

And then creates the Lambda function that serves the inference requests:

            base = os.path.basename(path)
            filename = os.path.splitext(base)[0]
            # Lambda Function from docker image
            function = lambda_.DockerImageFunction(
                self, filename,
                code=lambda_.DockerImageCode.from_image_asset(docker_folder,
                                                              cmd=[filename + ".handler"]),
                vpc=vpc,
                filesystem=lambda_.FileSystem.from_efs_access_point(
                    access_point, '/mnt/hf_models_cache'),
                environment={"TRANSFORMERS_CACHE": "/mnt/hf_models_cache"})

Adding a translator

Optionally, you can add more models by adding Python scripts in the inference directory. For example, add the following code in a file called

import json
from transformers import pipeline

en_fr_translator = pipeline('translation_en_to_fr')

def handler(event, context):
    response = {
        "statusCode": 200,
        "body": en_fr_translator(event['text'])[0]
    }
    return response

Then run:

$ cdk synth
$ cdk deploy

This creates a new endpoint to perform English to French translation.

Cleaning up

After you are finished experimenting with this project, run “cdk destroy” to remove all of the associated infrastructure.


This post shows how to perform ML inference for pre-trained Hugging Face models by using Lambda functions. To avoid repeatedly downloading the pre-trained models, this solution uses an EFS-based approach to model caching. This helps to achieve low-latency, near real-time inference. The solution is provided as infrastructure as code using Python and the AWS CDK.

We hope this blog post allows you to prototype quickly and include modern NLP techniques in your own products.

Improved failure recovery for Amazon EventBridge

Post Syndicated from Chris Munns original

Today we’re announcing two new capabilities for Amazon EventBridge: dead letter queues and custom retry policies. Both of these give you greater flexibility in how to handle failures in the processing of events with EventBridge. You can easily enable them on a per-target basis and configure each uniquely.

Dead letter queues (DLQs) are a common capability in queuing and messaging systems that allow you to handle failures in event or message receiving systems. They provide a way for failed events or messages to be captured and sent to another system, which can store them for future processing. With DLQs, you can have greater resiliency and improved recovery from any failure that happens.

You can also now configure a custom retry policy that can be set on your event bus targets. Today, there are two attributes that can control how events are retried: maximum number of retries and maximum event age. With these two settings, you could send events to a DLQ sooner and reduce the retries attempted.

For example, this could allow you to recover more quickly if an event bus target is overwhelmed by the number of events received, causing throttling to occur. The events are placed in a DLQ and then processed later.

Failures in event processing

Currently, EventBridge can fail to deliver an event to a target in certain scenarios. Events that fail to be delivered to a target due to client-side errors are dropped immediately. Examples of this are when EventBridge does not have permission to a target AWS service or if the target no longer exists. This can happen if the target resource is misconfigured or is deleted by the resource owner.

For service-side issues, EventBridge retries delivery of events for up to 24 hours. This can happen if the target service is unavailable or the target resource is not provisioned to handle the incoming event traffic and the target service is throttling the requests.

EventBridge failures

Previously, when all attempts to deliver an event to the target were exhausted, EventBridge published a CloudWatch metric indicating a failed target invocation. However, this provided no visibility into which events failed to be delivered, and there was no way to recover a failed event.

Dead letter queues

EventBridge’s DLQs are made possible today with Amazon Simple Queue Service (SQS) standard queues. With SQS, you get all of the benefits of a fully serverless queuing service: no servers to manage, automatic scalability, pay for what you consume, and high availability and security built in. You can configure the DLQs for your EventBridge bus and pay nothing until it is used, if and when a target experiences an issue. This makes it a great practice to follow and standardize on, and provides you with a safety net that’s active only when needed.

Optionally, you could later configure an AWS Lambda function to consume from that DLQ. The function is only invoked when messages exist in the queue, allowing you to maintain a serverless stack to recover from a potential failure.

DLQ configured per target

With a DLQ configured, the queue receives the failed event in a message along with important metadata that you can use to troubleshoot the issue. This can include: Error Code, Error Message, Exhausted Retry Condition, Retry Attempts, Rule ARN, and the Target ARN.

You can use this data to more easily troubleshoot what went wrong with the original delivery attempt and take action to resolve or prevent such failures in the future. You could also use the information such as Exhausted Retry Condition and Retry Attempts to further tweak your custom retry policy.
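For example, a consumer draining the DLQ could read that metadata from the SQS message attributes. A sketch (the attribute names and record shape here are assumptions based on the fields listed above; verify them against the EventBridge documentation):

```python
def summarize_failed_event(sqs_record):
    """Extract EventBridge failure metadata from a DLQ message.

    Attribute names are assumptions based on the metadata fields above.
    """
    attrs = sqs_record.get("messageAttributes", {})

    def val(name):
        return attrs.get(name, {}).get("stringValue")

    return {
        "error_code": val("ERROR_CODE"),
        "error_message": val("ERROR_MESSAGE"),
        "rule_arn": val("RULE_ARN"),
        "target_arn": val("TARGET_ARN"),
        "retry_attempts": val("RETRY_ATTEMPTS"),
    }
```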

You can configure a DLQ when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.

In the console, select the queue to be used for your DLQ configuration from the drop-down as shown here:

DLQ configuration

When configured via API, AWS CLI, or IaC tools, you must specify the ARN of the queue:
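With boto3, for example, the queue ARN goes in the target's DeadLetterConfig (a sketch with placeholder ARNs and names):

```python
import json

# Target definition with a dead-letter queue attached (placeholder ARNs).
target = {
    "Id": "shipping-service",
    "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-queue",
    "DeadLetterConfig": {
        "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq"
    },
}

# Attaching the target to a rule would then look like:
#   boto3.client("events").put_targets(Rule="MyTestRule", Targets=[target])
print(json.dumps(target, indent=2))
```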


When you configure a DLQ, the target SQS queue requires a resource-based policy that grants EventBridge access. One is created and applied automatically via the console when you create or update an EventBridge rule with a DLQ that exists in your own account.

For any queues created in other accounts, or via the API, AWS CLI, or IaC tools, you must add a resource policy that grants the EventBridge service principal the sqs:SendMessage permission, scoped to the rule’s ARN, as shown below:

{
  "Sid": "Dead-letter queue permissions",
  "Effect": "Allow",
  "Principal": {
     "Service": "events.amazonaws.com"
  },
  "Action": "sqs:SendMessage",
  "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq",
  "Condition": {
    "ArnEquals": {
      "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/MyTestRule"
    }
  }
}

You can read more about setting permissions for DLQs in the documentation for “Granting permissions to the dead-letter queue”.

Once configured, you can monitor CloudWatch metrics for the DLQ. These show both the successful delivery of messages via the InvocationsSentToDLQ metric and any failures via the InvocationsFailedToBeSentToDLQ metric. Note that these metrics do not exist if your queue is not considered “active”.

Retry policies

By default, EventBridge retries delivery of an event to a target so long as it does not receive a client-side error as described earlier. Retries occur with a back-off, for up to 185 attempts or for up to 24 hours, after which the event is dropped or sent to a DLQ, if configured. Due to the jitter of the back-off and retry process you may reach the 24-hour limit before reaching 185 retries.

For many workloads, this provides an acceptable way to handle momentary service issues or throttling that might occur. For some however, this model of back-off and retry can cause increased and on-going traffic to an already overloaded target system.

For example, consider an Amazon API Gateway target that has a resource constrained backend service behind it.

Constrained target service

Under a consistently high load, the bus could end up generating too many API requests, tripping the API Gateway’s throttling configuration. This would cause API Gateway to respond with throttling errors back to EventBridge.

Throttled API reply

You may decide that allowing the failed events to retry for 24 hours puts too much load into this system and it may not properly recover from the load. This could lead to potential data loss unless a DLQ was configured.

Added DLQ

With a DLQ, you could choose to process these events later, once the overwhelmed target service has recovered.

DLQ drained back to API

Or the events in question may no longer have the same value as they did previously. This can occur in systems where data loss is tolerated but the timeliness of data processing matters. In these situations, the DLQ would have less value and dropping the message is acceptable.

For either of these situations, configuring the maximum number of retries or the maximum age of the event could be useful.

Now with retry policies, you can configure per target the following two attributes:

  • MaximumEventAgeInSeconds: between 60 and 86400 seconds (86400 seconds, or 24 hours, is the default)
  • MaximumRetryAttempts: between 0 and 185 (185 is the default)

When either condition is met, the event fails. It’s then either dropped, which generates an increase to the FailedInvocations CloudWatch metric, or sent to a configured DLQ.

You can configure retry policy attributes when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.
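Via the API, both attributes are set per target in the RetryPolicy structure; a boto3 sketch with placeholder values and names:

```python
# Custom retry policy for a target: fail events to the DLQ (or drop them)
# sooner than the defaults of 24 hours / 185 attempts. Values are placeholders.
retry_policy = {
    "MaximumEventAgeInSeconds": 3600,  # give up after one hour
    "MaximumRetryAttempts": 20,        # or after 20 attempts
}

# This would be passed per target when creating or updating a rule's targets:
#   boto3.client("events").put_targets(
#       Rule="MyTestRule",
#       Targets=[{"Id": "api-target", "Arn": target_arn,
#                 "RetryPolicy": retry_policy}])
print(retry_policy)
```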

Retry policy

There is no additional cost for configuring either of these new capabilities. You only pay for the usage of the SQS standard queue configured as the dead letter queue during a failure and any application that handles the failed events. SQS pricing can be found here.


With dead letter queues and custom retry policies, you have improved handling and control over failure in distributed systems built with EventBridge. With DLQs you can capture failed events and then process them later, potentially saving yourself from data loss. With custom retry policies, you gain the improved ability to control the number of retries and for how long they can be retried.

I encourage you to explore how both of these new capabilities can help make your applications more resilient to failures, and to standardize on using them both in your infrastructure.

For more serverless learning resources, visit Serverless Land.

Implementing FIFO message ordering with Amazon MQ for Apache ActiveMQ

Post Syndicated from Chris Munns original

This post is contributed by Ravi Itha, Sr. Big Data Consultant

Messaging plays an important role in building distributed enterprise applications. Amazon MQ is a key offering within the AWS messaging services solution stack focused on enabling messaging services for modern application architectures. Amazon MQ is a managed message broker service for Apache ActiveMQ that simplifies setting up and operating message brokers in the cloud. Amazon MQ uses open standard APIs and protocols such as JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. Using standards means that, in most cases, there’s no need to rewrite any messaging code when you migrate to AWS. This allows you to focus on your business logic and application architecture.

Message ordering via Message Groups

Sometimes it’s important to guarantee the order in which messages are processed. In ActiveMQ, there is no explicit distinction between a standard queue and a FIFO queue. However, a queue can be used to route messages in FIFO order. This ordering can be achieved via two different ActiveMQ features: by implementing an Exclusive Consumer or by using Message Groups. This blog focuses on Message Groups, an enhancement to Exclusive Consumers. Message groups provide:

  • Guaranteed ordering of the processing of related messages across a single queue
  • Load balancing of the processing of messages across multiple consumers
  • High availability with automatic failover to other consumers if a JVM goes down

This is achieved programmatically as follows:

Sample producer code snippet:

TextMessage tMsg = session.createTextMessage("SampleMessage");
tMsg.setStringProperty("JMSXGroupID", "Group-A");

Sample consumer code snippet:

Message consumerMessage = consumer.receive(50);
TextMessage txtMessage = (TextMessage) consumerMessage;
String msgBody = txtMessage.getText();
String msgGroup = txtMessage.getStringProperty("JMSXGroupID");

This sample code highlights:

  • A message group is set by the producer during message ingestion
  • A consumer determines the message group once a message is consumed

Additionally, if a queue has messages for multiple message groups then it’s possible a consumer receives messages for multiple message groups. This depends on various factors such as the number of consumers of a queue and consumer start time.

Scenarios: Multiple producers and consumers with Message Groups

A FIFO queue in ActiveMQ supports multiple ordered message groups. Due to this, it’s common that a queue is used to exchange messages between multiple producer and consumer applications. By running multiple consumers to process messages from a queue, the message broker is able to partition messages across consumers. This improves the scalability and performance of your application.
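The pinning behavior can be illustrated with a small simulation: each group is assigned to a consumer when its first message arrives, and every later message in that group follows it (an illustrative model, not ActiveMQ's actual assignment algorithm):

```python
from collections import defaultdict

def dispatch(messages, consumers):
    """Simulate message-group pinning: the first message of a group picks a
    consumer (round-robin here for simplicity), and every later message in
    that group goes to the same consumer, preserving per-group order."""
    group_owner = {}
    received = defaultdict(list)
    next_idx = 0
    for msg_id, group in messages:
        if group not in group_owner:
            group_owner[group] = consumers[next_idx % len(consumers)]
            next_idx += 1
        received[group_owner[group]].append(msg_id)
    return dict(received)
```

With three groups and three consumers this yields the one-group-per-consumer distribution seen in the tests below; with fewer consumers than groups, some consumers receive several groups.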

In terms of scalability, commonly asked questions center on the ideal number of consumers and how messages are distributed across all consumers. To provide more clarity in this area, we provisioned an Amazon MQ broker and ran various test scenarios.

Scenario 1: All consumers started at the same time

Test setup

  • All producers and consumers have the same start time
  • Each test uses a different combination of number of producers, message groups, and consumers
Setup for Tests 1 to 5

Test results

| Test # | Producers | Message groups | Messages sent by each producer | Consumers | Total messages | Messages received by consumers |
|---|---|---|---|---|---|---|
| 1 | 3 | 3 | 5000 | 1 | 15000 | C1 received all groups |
| 2 | 3 | 3 | 5000 | 2 | 15000 | One consumer received both Group-A and Group-B |
| 3 | 3 | 3 | 5000 | 3 | 15000 | One message group per consumer (C3 received Group-C) |
| 4 | 3 | 3 | 5000 | 4 | 15000 | Three consumers received one group each; one received none |
| 5 | 4 | 4 | 5000 | 3 | 20000 | One consumer received both Group-C and Group-D |

Test conclusions

  • Test 3 – illustrates even message distribution across consumers when a one-to-one relationship exists between message groups and the number of consumers
  • Test 4 – illustrates that one of the four consumers did not receive any messages. This highlights that running more consumers than the available number of message groups provides no additional benefit
  • Tests 1, 2, 5 – indicate that a consumer can receive messages belonging to multiple message groups. The following table provides additional granularity for messages received by consumer C2 in test #2. As you can see, these messages belong to the Group-A and Group-B message groups, and FIFO ordering is maintained at the message group level
| consumer_id | msg_id | msg_group |
|---|---|---|
| Consumer C2 | A-1 | Group-A |
| Consumer C2 | B-1 | Group-B |
| Consumer C2 | A-2 | Group-A |
| Consumer C2 | B-2 | Group-B |
| Consumer C2 | A-3 | Group-A |
| Consumer C2 | B-3 | Group-B |
| … | … | … |
| Consumer C2 | A-4999 | Group-A |
| Consumer C2 | B-4999 | Group-B |
| Consumer C2 | A-5000 | Group-A |
| Consumer C2 | B-5000 | Group-B |

Scenario 2a: All consumers not started at same time

Test setup

  • Three producers and one consumer started at the same time
  • The second and third consumers started after 30 seconds and 60 seconds respectively
  • 15,000 messages sent in total across three message groups
Setup for Test 6

Test results

| Test # | Producers | Message groups | Messages sent by each producer | Consumers | Total messages | C1 | C2 | C3 |
|---|---|---|---|---|---|---|---|---|
| 6 | 3 | 3 | 5000 | 3 | 15000 | 15000 | 0 | 0 |

Test conclusion

Consumer C1 received all messages, while consumers C2 and C3 ran idle and did not receive any messages. The key takeaway is that message distribution can be inefficient in real-world scenarios where consumers start at different times.

The last scenario (2b) illustrates this same scenario, while optimizing message distribution so that all consumers are used.

Scenario 2b: Utilization of all consumers when not started at same time

Test setup

  • Three producers and one consumer started at the same time
  • The second and third consumers started after 30 seconds and 60 seconds respectively
  • 15,000 messages sent in total across three message groups
  • After each producer sends its 2501st message, its message group is closed, after which message distribution is restarted by sending the remaining messages. Closing a message group is done as in the following code example (specifically, the -1 value set for the JMSXGroupSeq property):
TextMessage tMsg = session.createTextMessage("<foo>hey</foo>");
tMsg.setStringProperty("JMSXGroupID", "Group-A");
tMsg.setIntProperty("JMSXGroupSeq", -1);
Setup for Test 7

Test results

| Test # | Producers | Message groups | Messages sent by each producer | Consumers | Total messages | C1 | C2 | C3 |
|---|---|---|---|---|---|---|---|---|
| 7 | 3 | 3 | 5001 | 3 | 15003 | 10003 | 2500 | 2500 |

Distribution of messages received by message group

Consumer    | Group-A | Group-B | Group-C | Consumer-wise total
Consumer 1  | 2501    | 2501    | 5001    | 10003
Consumer 2  | 2500    | 0       | 0       | 2500
Consumer 3  | 0       | 2500    | 0       | 2500
Group total | 5001    | 5001    | 5001    | NA

Total messages received: 15003

Test conclusions

Message distribution is optimized with the closing and reopening of a message group when all consumers are not started at the same time. This mitigation step results in all consumers receiving messages.

  • After Group-A was closed, the broker assigned subsequent Group-A messages to consumer C2
  • After Group-B was closed, the broker assigned subsequent Group-B messages to consumer C3
  • After Group-C was closed, the broker continued to send Group-C messages to consumer C1. The assignment did not change because there was no other available consumer
Test 7 – Message distribution among consumers

Scalability techniques

Now that we understand how to use Message Groups to implement FIFO use cases within Amazon MQ, let’s look at how they scale. By default, a message queue supports a maximum of 1024 message groups. This means that if you use more than 1024 message groups per queue, message ordering is lost for the oldest message group. This is further explained in the ActiveMQ Message Groups documentation. This can be problematic for complex use cases involving stock exchanges or financial trading scenarios where thousands of ordered message groups are required. The following table lists a few techniques to address this issue.

Scalability technique | Details

Verify that the appropriate Amazon MQ broker instance type is used | Select the appropriate broker instance type according to your use case. To learn more, refer to the Amazon MQ broker instance types documentation.

Increase the number of message groups per message queue | The default maximum number of message groups can be increased via a custom configuration file at the time of launching a new broker, or by modifying an existing broker. Refer to the next section for an example (requirement #2).

Recycle message groups when they are no longer needed | A message group can be closed programmatically by a producer once it has finished sending all messages to a queue. Following is a sample code snippet:

TextMessage tMsg = session.createTextMessage("<foo>hey</foo>");
tMsg.setStringProperty("JMSXGroupID", "GroupA");
tMsg.setIntProperty("JMSXGroupSeq", -1); // -1 closes the message group

In the preceding scenario 2b, we used this technique to improve the message distribution across consumers.
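The other technique, staying below the broker's group limit, amounts to mapping an unbounded set of business keys onto a bounded set of group IDs. A minimal shell sketch of that idea, where the key name is illustrative and the 1024 bound matches the broker default:

```shell
# Map an arbitrary business key (e.g. an order ID) onto one of 1024
# group IDs so the queue never exceeds the default message group limit.
order_id="order-12345"                                  # illustrative key
hash=$(printf '%s' "$order_id" | cksum | cut -d' ' -f1) # stable checksum
bucket=$((hash % 1024))                                 # bounded bucket index
echo "Group-$bucket"
```

The same key always hashes to the same group, so per-key ordering is preserved while the total number of groups stays fixed.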

Customize message broker configuration

In the previous section, to improve scalability we suggested increasing the number of message groups per queue by updating the broker configuration. A broker configuration is essentially an XML file that contains all ActiveMQ settings for a given message broker. Let’s look at the following broker configuration settings for the purpose of achieving a specific requirement. For your reference, we’ve placed a copy of a broker configuration file with these settings, within a GitHub repository.

# | Requirement | Applicable broker configuration

1 | Change the message group implementation from the default CachedMessageGroupMap to MessageGroupHashBucket

<!-- valid values: simple, bucket, cached; the default is cached -->
<!-- simple represents SimpleMessageGroupMap -->
<!-- bucket represents MessageGroupHashBucket -->
<!-- cached represents CachedMessageGroupMap -->
<policyEntry messageGroupMapFactoryType="bucket" queue=">"/>

2 | Increase the number of message groups per queue from 1024 to 2048 and increase the cache size from 64 to 128

<!-- default value for bucketCount is 1024 and for cacheSize is 64 -->
<policyEntry queue=">">
  <messageGroupMapFactory>
    <messageGroupHashBucketFactory bucketCount="2048" cacheSize="128"/>
  </messageGroupMapFactory>
</policyEntry>

3 | Wait for three consumers or 30 seconds before the broker begins dispatching messages

<policyEntry queue=">" consumersBeforeDispatchStarts="3" timeBeforeDispatchStarts="30000"/>

When must default broker configurations be updated? This would only apply in scenarios where the default settings do not meet your requirements. Additional information on how to update your broker configuration file can be found here.

Amazon MQ starter kit

Want to get up and running with Amazon MQ quickly? Start building with Amazon MQ by cloning the starter kit available on GitHub. The starter kit includes a CloudFormation template to provision a message broker, sample broker configuration file, and source code related to the consumer scenarios in this blog.


In this blog post, you learned how Amazon MQ simplifies the setup and operation of Apache ActiveMQ in the cloud. You also learned how the Message Groups feature can be used to implement FIFO. Lastly, you walked through real scenarios demonstrating how ActiveMQ distributes messages with queues to exchange messages between multiple producers and consumers.

Scheduling AWS Lambda Provisioned Concurrency for recurring peak usage

Post Syndicated from Chris Munns original

This post is contributed by Jerome Van Der Linden, AWS Solutions Architect

Concurrency of an AWS Lambda function is the number of requests it can handle at any given time. This metric is the average number of requests per second multiplied by the average duration in seconds. For example, if a Lambda function takes an average 500 ms to run with 100 requests per second, the concurrency is 50 (100 * 0.5 seconds).

When invoking a Lambda function, an execution environment is provisioned in a process commonly called a cold start. This initialization phase can increase the total execution time of the function. Consequently, with the same concurrency, cold starts can reduce the overall throughput. For example, if the function takes 600 ms instead of 500 ms due to cold start latency, with a concurrency of 50, it handles 83 requests per second.
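The arithmetic behind both examples can be reproduced in a couple of lines of shell, using the formula concurrency = requests per second × average duration in seconds:

```shell
rps=100          # average requests per second
dur_ms=500       # average duration in milliseconds
# concurrency = requests per second * average duration in seconds
concurrency=$((rps * dur_ms / 1000))
echo "concurrency: $concurrency"

# With concurrency fixed, throughput = concurrency / duration in seconds.
cold_dur_ms=600  # duration inflated by cold starts
throughput=$((concurrency * 1000 / cold_dur_ms))
echo "throughput at ${cold_dur_ms}ms: $throughput req/s"
```

This prints a concurrency of 50 and a cold-start-degraded throughput of 83 requests per second, matching the figures above.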

As described in this blog post, Provisioned Concurrency helps to reduce the impact of cold starts. It keeps Lambda functions initialized and ready to respond to requests with double-digit millisecond latency. Functions can respond to requests with predictable latency. This is ideal for interactive services, such as web and mobile backends, latency-sensitive microservices, or synchronous APIs.

However, you may not need Provisioned Concurrency all the time. With a reasonable amount of traffic consistently throughout the day, the Lambda execution environments may be warmed already. Provisioned Concurrency incurs additional costs, so it is cost-efficient to use it only when necessary. For example, early in the morning when activity starts, or to handle recurring peak usage.

Example application

As an example, we use a serverless ecommerce platform with multiple Lambda functions. The entry point is a GraphQL API (using AWS AppSync) that lists products and accepts orders. The backend manages delivery and payments. This GitHub project provides an example of such an ecommerce platform and is based on the following architecture:

Post sample architecture

This company runs a “deal of the day” promotion every day at noon. The “deal of the day” reduces the price of a specific product for 60 minutes, starting at 12pm. The CreateOrder function handles around 20 requests per second during the day. At noon, a notification is sent to registered users, who then connect to the website. Traffic increases immediately and can exceed 400 requests per second for the CreateOrder function. This recurring pattern is shown in Amazon CloudWatch:

Recurring invocations

The peak shows an immediate increase at noon, slowly decreasing until 1pm:

Peak invocations

Examining the response times of the function in AWS X-Ray, the first graph shows normal traffic:

Normal performance distribution

While the second shows the peak:

Peak performance distribution

The average latency (p50) is higher during the peak with 535 ms versus 475 ms at normal load. The p90 and p95 values show that most of the invocations are under 800 ms in the first graph but around 900 ms in the second one. This difference is due to the cold start, when the Lambda service prepares new execution environments to absorb the load during the peak.

Using Provisioned Concurrency at noon can avoid this additional latency and provide a better experience to end users. Ideally, it should also be stopped at 1pm to avoid incurring unnecessary costs.

Application Auto Scaling

Application Auto Scaling allows you to configure automatic scaling for different resources, including Provisioned Concurrency for Lambda. You can scale resources based on a specific CloudWatch metric or at a specific date and time. There is no extra cost for Application Auto Scaling, you only pay for the resources that you use. In this example, I schedule Provisioned Concurrency for the CreateOrder Lambda function.

Scheduling Provisioned Concurrency for a Lambda function

  1. Create an alias for the Lambda function to identify the version of the function you want to scale:
    $ aws lambda create-alias --function-name CreateOrder --name prod --function-version 1.2 --description "Production alias"
  2. In this example, the average execution time is 500 ms and the workload must handle 450 requests per second. This equates to a concurrency of 250 including a 10% buffer. Register the Lambda function as a scalable target in Application Auto Scaling with the RegisterScalableTarget API:
    $ aws application-autoscaling register-scalable-target \
    --service-namespace lambda \
    --resource-id function:CreateOrder:prod \
    --min-capacity 0 --max-capacity 250 \
    --scalable-dimension lambda:function:ProvisionedConcurrency

    Next, verify that the Lambda function is registered correctly:

    $ aws application-autoscaling describe-scalable-targets --service-namespace lambda

    The output shows:

    {
        "ScalableTargets": [
            {
                "ServiceNamespace": "lambda",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "MinCapacity": 0,
                "MaxCapacity": 250,
                "RoleARN": "arn:aws:iam::012345678901:role/aws-service-role/",
                "CreationTime": 1596319171.297,
                "SuspendedState": {
                    "DynamicScalingInSuspended": false,
                    "DynamicScalingOutSuspended": false,
                    "ScheduledScalingSuspended": false
                }
            }
        ]
    }
  3. Schedule the Provisioned Concurrency using the PutScheduledAction API of Application Auto Scaling:
    $ aws application-autoscaling put-scheduled-action --service-namespace lambda \
      --scalable-dimension lambda:function:ProvisionedConcurrency \
      --resource-id function:CreateOrder:prod \
      --scheduled-action-name scale-out \
      --schedule "cron(45 11 * * ? *)" \
      --scalable-target-action MinCapacity=250

    Note: Set the schedule a few minutes ahead of the expected peak to allow time for Lambda to prepare the execution environments. The time must be specified in UTC.

  4. Verify that the scaling action is correctly scheduled:
    $ aws application-autoscaling describe-scheduled-actions --service-namespace lambda

    The output shows:

    {
        "ScheduledActions": [
            {
                "ScheduledActionName": "scale-out",
                "ScheduledActionARN": "arn:aws:autoscaling:eu-west-1:012345678901:scheduledAction:643065ac-ba5c-45c3-cb46-bec621d49657:resource/lambda/function:CreateOrder:prod:scheduledActionName/scale-out",
                "ServiceNamespace": "lambda",
                "Schedule": "cron(45 11 * * ? *)",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "ScalableTargetAction": {
                    "MinCapacity": 250
                },
                "CreationTime": 1596319455.951
            }
        ]
    }
  5. To stop the Provisioned Concurrency, schedule another action after the peak with the capacity set to 0:
    $ aws application-autoscaling put-scheduled-action --service-namespace lambda \
      --scalable-dimension lambda:function:ProvisionedConcurrency \
      --resource-id function:CreateOrder:prod \
      --scheduled-action-name scale-in \
      --schedule "cron(15 13 * * ? *)" \
      --scalable-target-action MinCapacity=0,MaxCapacity=0

With this configuration, Provisioned Concurrency starts at 11:45 and stops at 13:15. It uses a concurrency of 250 to handle the load during the peak and releases resources after. This also optimizes costs by limiting the use of this feature to 90 minutes per day.
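Because the cron expressions are evaluated in UTC, it helps to double-check the conversion from your local peak time before writing the schedule. A small sketch, assuming GNU date and an illustrative UTC+2 local time zone:

```shell
# The scale-out should fire 15 minutes before a noon local-time peak.
# Convert that local time (here expressed with an explicit +0200 offset)
# to UTC before building the cron expression.
date -u -d '2020-08-03 11:45:00 +0200' '+%H:%M UTC'
```

Here 11:45 local (+0200) is 09:45 UTC, so the matching schedule would be cron(45 9 * * ? *) rather than cron(45 11 * * ? *).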

In this architecture, the CreateOrder function synchronously calls ValidateProduct, ValidatePayment, and ValidateDelivery. Provisioning concurrency only for the CreateOrder function would be insufficient as the three other functions would impact it negatively. To be efficient, you must configure Provisioned Concurrency for all four functions, using the same process.
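Rather than repeating the CLI steps by hand for each function, the registration can be scripted. The sketch below only prints the commands (remove the echo to execute them) and assumes each function has a prod alias, as configured earlier for CreateOrder:

```shell
# Print (not run) the scalable-target registration for every function
# in the synchronous call chain.
for fn in CreateOrder ValidateProduct ValidatePayment ValidateDelivery; do
  echo aws application-autoscaling register-scalable-target \
    --service-namespace lambda \
    --resource-id "function:${fn}:prod" \
    --min-capacity 0 --max-capacity 250 \
    --scalable-dimension lambda:function:ProvisionedConcurrency
done
```

The same loop pattern applies to the two put-scheduled-action calls, so all four functions warm up and scale in on the same schedule.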

Observing performance when Provisioned Concurrency is scheduled

Run the following command at any time outside the scheduled window to confirm that nothing has been provisioned yet:

$ aws lambda get-provisioned-concurrency-config \
--function-name CreateOrder --qualifier prod
No Provisioned Concurrency Config found for this function

When run during the defined time slot, the output shows that the concurrency is allocated and ready to use:

    {
        "RequestedProvisionedConcurrentExecutions": 250,
        "AvailableProvisionedConcurrentExecutions": 250,
        "AllocatedProvisionedConcurrentExecutions": 250,
        "Status": "READY",
        "LastModified": "2020-08-02T11:45:18+0000"
    }

Verify the different scaling operations using the following command:

$ aws application-autoscaling describe-scaling-activities \
--service-namespace lambda \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--resource-id function:CreateOrder:prod

You can see the scale-out and scale-in activities at the times specified, and how Lambda fulfills the requests:

    {
        "ScalingActivities": [
            {
                "ActivityId": "6d2bf4ed-6eb5-4218-b4bb-f81fe6d7446e",
                "ServiceNamespace": "lambda",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "Description": "Setting desired concurrency to 0.",
                "Cause": "maximum capacity was set to 0",
                "StartTime": 1596374119.716,
                "EndTime": 1596374155.789,
                "StatusCode": "Successful",
                "StatusMessage": "Successfully set desired concurrency to 0. Change successfully fulfilled by lambda."
            },
            {
                "ActivityId": "2ea46a62-2dbe-4576-aa42-0675b6448f0a",
                "ServiceNamespace": "lambda",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "Description": "Setting min capacity to 0 and max capacity to 0",
                "Cause": "scheduled action name scale-in was triggered",
                "StartTime": 1596374119.415,
                "EndTime": 1596374119.431,
                "StatusCode": "Successful",
                "StatusMessage": "Successfully set min capacity to 0 and max capacity to 0"
            },
            {
                "ActivityId": "272cfd75-8362-4e5c-82cf-5c8cf9eacd24",
                "ServiceNamespace": "lambda",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "Description": "Setting desired concurrency to 250.",
                "Cause": "minimum capacity was set to 250",
                "StartTime": 1596368709.651,
                "EndTime": 1596368901.897,
                "StatusCode": "Successful",
                "StatusMessage": "Successfully set desired concurrency to 250. Change successfully fulfilled by lambda."
            },
            {
                "ActivityId": "38b968ff-951e-4033-a3db-b6a86d4d0204",
                "ServiceNamespace": "lambda",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "Description": "Setting min capacity to 250",
                "Cause": "scheduled action name scale-out was triggered",
                "StartTime": 1596368709.354,
                "EndTime": 1596368709.37,
                "StatusCode": "Successful",
                "StatusMessage": "Successfully set min capacity to 250"
            }
        ]
    }

In the following latency graph, most of the requests are now completed in less than 800 ms, in line with performance during the rest of the day. The traffic during the “deal of the day” at noon is comparable to any other time. This helps provide a consistent user experience, even under heavy load.

Performance distribution

Automate the scheduling of Lambda Provisioned Concurrency

You can use the AWS CLI to set up scheduled Provisioned Concurrency, which can be helpful for testing in a development environment. You can also define your infrastructure with code to automate deployments and avoid manual configuration that may lead to mistakes. This can be done with AWS CloudFormation, AWS SAM, AWS CDK, or third-party tools.

The following code shows how to schedule Provisioned Concurrency in an AWS SAM template:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  CreateOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/create_order/
      Handler: main.handler
      Runtime: python3.8
      MemorySize: 768
      Timeout: 30
      AutoPublishAlias: prod                              #1
      DeploymentPreference:
        Type: AllAtOnce

  CreateOrderScalableTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget     #2
    Properties:
      MaxCapacity: 250
      MinCapacity: 0
      ResourceId: !Sub function:${CreateOrderFunction}:prod        #3
      RoleARN: !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/
      ScalableDimension: lambda:function:ProvisionedConcurrency
      ServiceNamespace: lambda
      ScheduledActions:                                   #4
        - ScalableTargetAction:
            MinCapacity: 250
          Schedule: 'cron(45 11 * * ? *)'
          ScheduledActionName: scale-out
        - ScalableTargetAction:
            MinCapacity: 0
            MaxCapacity: 0
          Schedule: 'cron(15 13 * * ? *)'
          ScheduledActionName: scale-in
    DependsOn: CreateOrderFunctionAliasprod                        #5

In this template:

  1. As with the AWS CLI, you need an alias for the Lambda function. This automatically creates the alias “prod” and sets it to the latest version of the Lambda function.
  2. This creates an AWS::ApplicationAutoScaling::ScalableTarget resource to register the Lambda function as a scalable target.
  3. This references the correct version of the Lambda function by using the “prod” alias.
  4. Defines different actions to schedule as a property of the scalable target. You define the same properties as used in the CLI.
  5. You cannot define the scalable target until the alias of the function is published. The syntax is <FunctionResource>Alias<AliasName>.


Cold starts can impact the overall duration of your Lambda function, especially under heavy load. This can affect the end-user experience when the function is used as a backend for synchronous APIs.

Provisioned Concurrency helps reduce latency by creating execution environments ahead of invocations. Using the ecommerce platform example, I show how to combine this capability with Application Auto Scaling to schedule scaling-out and scaling-in. This combination helps to provide a consistent execution time even during special events that cause usage peaks. It also helps to optimize cost by limiting the time period.

To learn more about scheduled scaling, see the Application Auto Scaling documentation.

Executing Ansible playbooks in your Amazon EC2 Image Builder pipeline

Post Syndicated from Chris Munns original

This post is contributed by Andrew Pearce – Sr. Systems Dev Engineer, AWS

Since launching Amazon EC2 Image Builder, many customers say they want to re-use existing investments in configuration management technologies (such as Ansible, Chef, or Puppet) with Image Builder pipelines.

In this blog, I walk through creating a document that can execute an Ansible playbook in an Image Builder pipeline. I then use the document to create an Image Builder component that can be added to an Image Builder image recipe. In addition to this blog, you can find further samples in the amazon-ec2-image-builder-samples GitHub repository.

Ansible playbook requirements

There are likely some changes required so your existing Ansible playbooks can work in Image Builder.

When executing Ansible within Image Builder, the playbook must be configured to execute using the localhost. In a traditional Ansible environment, the control server executes playbooks against remote hosts using a protocol like SSH or WinRM. In Image Builder, the host executing the playbook is also the host that must be configured.

There are a number of ways to accomplish this in Ansible. In this example, I set the value of hosts to localhost, gather_facts to false, and connection to local. These values set execution at the localhost, prevent gathering Ansible facts about remote hosts, and force local execution.

In this walk through, I use the following sample playbook. This installs Apache, configures a default webpage, and ensures that Apache is enabled and running.

- name: Apache Hello World
  hosts: localhost
  gather_facts: false
  connection: local
  tasks:
    - name: Install Apache
      yum:
        name: httpd
        state: latest
    - name: Create a default page
      shell: echo "<h1>Hello world from EC2 Image Builder and Ansible</h1>" > /var/www/html/index.html
    - name: Enable Apache
      service: name=httpd enabled=yes state=started

Creating the Ansible document

This example uses the build phase to install Ansible, download the playbook from an S3 bucket, execute the playbook, and clean up the system. It also verifies that Apache returns the correct content in both the validate and test phases.

I use Amazon Linux 2 as the target operating system, and assume that the playbook is already uploaded to my S3 bucket.

Document design diagram

To start, I must specify the top-level properties for the document. I specify a name and description to describe the document’s purpose, and a schemaVersion, which at the time of writing is 1.0. I also include a build phase, as I want this document to be used when building the image.

name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
phases:
  - name: build

The first required step is to install Ansible. With Amazon Linux 2, I can install Ansible using Amazon Linux Extras, so I use the ExecuteBash action to perform the work.

      - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
            - sudo amazon-linux-extras install -y ansible2

Once Ansible is installed, I must download the playbook for execution. I use the S3Download action to download the playbook and store it in a temporary location. Note that the S3Download action accepts a list of inputs, so a single S3Download step could download multiple files.

      - name: DownloadPlaybook
        action: S3Download
        inputs:
          - source: 's3://mybucket/my-playbook.yml'
            destination: '/tmp/my-playbook.yml'

Now Ansible is installed and the playbook is downloaded. I use the ExecuteBinary action to invoke Ansible and perform the work described in the playbook. This step uses a feature of the component management application called chaining.

Chaining allows you to reference the input or output values of a step in subsequent steps. In the following example, {{build.DownloadPlaybook.inputs[0].destination}} passes the string of the first “destination” property value in the DownloadPlaybook step, which executes in the build phase. The “[0]” indicates the first value in the array (position zero).

For more information about how chaining works, review the Component Manager documentation. In this example, I refer to the S3Download destination path to avoid mistyping the value, which causes the build to fail.

      - name: InvokeAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{build.DownloadPlaybook.inputs[0].destination}}'

Finally, I clean up the system, so I use another ExecuteBash action combined with chaining to delete the downloaded playbook.

      - name: DeletePlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{build.DownloadPlaybook.inputs[0].destination}}'

Now, I have installed Ansible, downloaded the playbook, executed it, and cleaned up the system. The next step is to add a validate phase where I use curl to ensure that Apache responds with the desired content. Including a validate phase allows you to identify build issues more quickly. This fails the build before Image Builder creates an AMI and moves to the test phase.

In the validate phase, I use grep to confirm that Apache returned a value containing my required string. If the command fails, a non-zero exit code is returned, indicating a failure. This failure is returned to Image Builder and the image build is marked as “Failed”.

  - name: validate
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s localhost | grep "Hello world from EC2 Image Builder and Ansible"

After the validate phase executes, the EC2 instance is stopped and an Amazon Machine Image (AMI) is created. Image Builder moves to the test stage where a new EC2 instance is launched using the AMI that was created in the previous step. During this stage, the test phase of a document is executed.

To ensure that Apache is still functioning as required, I add a test phase to the document using the same curl command as the validate phase. This ensures that Apache is started and still returning the correct content.

  - name: test
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s localhost | grep "Hello world from EC2 Image Builder and Ansible"

Here is the complete document:

name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
            - sudo amazon-linux-extras install -y ansible2
      - name: DownloadPlaybook
        action: S3Download
        inputs:
          - source: 's3://mybucket/playbooks/my-playbook.yml'
            destination: '/tmp/my-playbook.yml'
      - name: InvokeAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{build.DownloadPlaybook.inputs[0].destination}}'
      - name: DeletePlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{build.DownloadPlaybook.inputs[0].destination}}'
  - name: validate
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s localhost | grep "Hello world from EC2 Image Builder and Ansible"
  - name: test
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s localhost | grep "Hello world from EC2 Image Builder and Ansible"

I now have a document that installs Ansible, downloads a playbook from an S3 bucket, and invokes Ansible to perform the work described in the playbook. I am also validating the installation was successful by ensuring Apache is returning the desired content.

Next, I use this document to create a component in Image Builder and assign the component to an image recipe. The image recipe could then be used when creating an image pipeline. The blog post Automate OS Image Build Pipelines with EC2 Image Builder provides a tutorial demonstrating how to create an image pipeline with the EC2 Image Builder console. For more information, review Getting started with EC2 Image Builder.


In this blog, I walk through creating a document that can execute Ansible playbooks for use in Image Builder. I define the required changes for Ansible playbooks and when each phase of the document would execute. I also note the amazon-ec2-image-builder-samples GitHub repository, which provides sample documents for executing Ansible playbooks and Chef recipes.

We continue to use this repository for publishing sample content. We hope Image Builder is providing value and making it easier for you to build virtual machine (VM) images. As always, keep sending us your feedback!

Migrating a SharePoint application using the AWS Server Migration Service

Post Syndicated from Chris Munns original

This post contributed by Ashwini Rudra, Solutions Architect; Rajesh Rathod, Sr. Product Manager; Vivek Chawda, Senior Software Engineer, EC2 Enterprise

Many AWS customers are migrating on-premises SharePoint workloads to AWS for greater reliability, faster performance, and lower costs. While planning the migration, customers are looking for tools and methodologies that reduce the time to migrate, application downtime, and performance disruption. They use continuous replication to optimize cost and effort required to migrate applications reliably.

To accelerate these migrations, AWS provides a comprehensive set of tools. In this blog post, we explore how to use AWS Server Migration Service (AWS SMS) for migrating a SharePoint application from on-premises to AWS.

Overview of solution

AWS Server Migration Service is an agentless service. It makes it easier and faster to migrate thousands of workloads from on-premises or Microsoft Azure to AWS. In this article, we discuss one approach and the steps to migrate a SharePoint farm using AWS SMS.

AWS SMS also supports migrating a group of servers organized as an application. This can simplify migration of applications with complex dependencies that must be migrated together. This service provides a customized replication schedule designed to simplify migration at scale. This also tracks the progress of each migration using AWS Migration Hub.

For SharePoint migrations, it facilitates quick migration cutovers. After the initial sync, the migration uses an incremental change capture approach to synchronize changes made to the on-premises SharePoint servers. This method also reduces the required network bandwidth for the migration.

SharePoint Migration Architecture

Here is how this service and solution works:


A basic SharePoint deployment is a 3-tier architecture comprising web front-end servers, application servers, and backend SQL database servers. It also includes authentication services with Active Directory domain controllers.

To migrate this application, you must deploy a Server Migration Connector, which is a preconfigured virtual machine. This connector creates a server catalog. Based on the selected servers and configuration, it takes snapshots of virtual machines and stores them in S3 buckets.

In the background, the AWS VM Import/Export service converts these snapshots into Amazon Machine Images (AMIs). Using these AMIs, you can configure launch settings, where you define the order in which the application launches. You can also select instance types and set user-defined PowerShell scripts.

At the end, you define the cloud network topology: a VPC, subnets, and security groups. With these launch settings, SMS creates an AWS CloudFormation template, which can launch the SharePoint application in the target AWS account.

To migrate a SharePoint using SMS:

  1. Install and register the SMS connector.
  2. Create an application from SMS server catalog.
  3. Configure replication settings for your SharePoint farm.
  4. Configure launch settings.
  5. Start the replication.
  6. Launch the SharePoint application in AWS.
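Several of these steps can also be driven from the AWS CLI once the connector is registered. The sketch below only prints the commands (remove the echo to execute them); the app ID is a placeholder you would take from the create-app output:

```shell
# Print the CLI calls corresponding to the catalog, replication, and
# launch steps of an AWS SMS application migration.
echo aws sms import-server-catalog
echo aws sms get-servers
echo aws sms start-app-replication --app-id app-example1234   # placeholder app ID
echo aws sms launch-app --app-id app-example1234              # placeholder app ID
```

Creating the application itself (create-app) takes the server groups and replication settings described in the console walkthrough below.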

Install and register the Server Migration Connector

The Server Migration Connector is a VM that you install in your on-premises virtualization environment. The supported platforms are VMware vSphere, Microsoft Hyper-V/SCVMM, and Microsoft Azure. Follow the links below to install in your environment:

For example, VMware is a common scenario. First, set up the Server Migration Connector and import the server catalog:

  1. Log in to the AWS Server Migration Service console, and choose Connectors.
  2. From the AWS Server Migration Setup page, download the .ova file and deploy it to your VMware environment using the vSphere client.
  3. Install the OVA file in your VMware environment. This is your AWS SMS Connector.
  4. Use the connector host IP address to access the connector page and start the five-step registration process. Refer to the AWS documentation, Install the Server Migration Connector on VMware.
  5. Complete the five-step registration. Here, you set up the password and the network configuration between the connector virtual machine and your AWS accounts.
    Setup network info

    vCenter credentials

    At the end, provide your vCenter host name and credentials.

    During setup, you provide the connector with details of both the VMware and AWS environments. After the connection is established, the connector setup looks like this in the AWS Management Console:

    SMS Connectors

Create an application from SMS server catalog

After configuring the connector and selecting Import Server Catalog, you can view the server catalog in the AWS Server Migration Service console. To migrate your SharePoint application, select the application server, SQL Server, and Active Directory server from the server catalog.

Depending on the application architecture, you can group these servers to apply server-specific configuration settings and select appropriate instance types. Here are the steps:

  1. Navigate to the application feature of SMS and create a “new application” for the SharePoint farm. Provide the application name, description, and IAM role.

    Create new application – Application settings

  2. Select servers to migrate from the catalog.
  3. Create different groups for these servers. To do this, select servers and choose Add servers to group. Grouping lets you define different instance types and run user-defined PowerShell scripts for all servers in a group during application launch. You may create separate groups for the application, web frontend, database, and Active Directory servers. In the example below, there are two groups: one for application servers, and one for database servers. This process assumes that you already have the authentication servers in place and operational in AWS. For more information on SharePoint authentication services, Active Directory, and Domain Services, refer to Active Directory Domain Services on AWS Deployment Scenario and Architecture.

    Create new application – Add servers to groups

  4. Add tags, per your organization tagging strategy or policy.
  5. Review your application, then choose Next to configure the replication settings.

Configure replication settings

  1. Define “replication job type”, “when to start replication job”, and “automatic AMI deletion” based on your requirements. Choose Next.
  2. On the Configure server-specific settings page, in the License type column, select the license type for AMIs created from the replication job. Windows Servers can use either an AWS-provided license or Bring Your Own License (BYOL). Check Microsoft Licensing to review the licensing options. You can also choose Auto to allow AWS SMS to detect the source-system operating system (OS) and apply the appropriate license to the migrated virtual machine. Choose Next.

    Server-specific settings

  3. Review your application replication setting and choose Configure Launch Settings.

Configure launch settings

An important aspect of the migration is how the application launches on Amazon EC2. This is configured on the launch settings page of SMS:

  1. On the Configure launch settings page, for the IAM CloudFormation role, provide an IAM role for launch settings. Refer to the AWS documentation on IAM Roles for AWS SMS.

    Configure launch settings

  2. Under Specify launch order, configure a launch order for your groups. For this SharePoint application, you may prefer Active Directory first, followed by the SQL database, and then the application servers.
  3. Under Configure launch settings for the application, edit the server settings individually:
    Target instance details

    • Logical ID: AWS CloudFormation resource ID. This is the logical ID of the CloudFormation template that AWS SMS generates for the application. A value is created automatically when you use the console, but you must supply it manually when using the API, CLI, or SDKs. For more information, see Resources in the AWS CloudFormation User Guide.
    • Instance type: specifies the EC2 instance type on which to launch the server.
    • Key pair: specifies the SSH key pair for access to the server.
    • Configuration script: a script to run configuration commands at the startup of EC2 instances launched as part of an application. This is important for your SharePoint migration, as you can provide registry settings and configuration settings using PowerShell script for your SharePoint servers and SQL database servers.

    For example, the PowerShell script below retrieves the IP address of the current SQL Server and replaces the old SQL Server IP in the SharePoint configuration database and connection strings. You can automate many SharePoint configuration tasks using PowerShell scripts.

      Start-Transcript -Path "C:\UserData.log" -Append
      # Old SQL Server IP address (placeholder - replace before use)
      $oldIP = <<Your old IP goes here>>
      # Passing an empty string to GetHostAddresses returns the local host's IP addresses
      $newIP = ([System.Net.Dns]::GetHostAddresses("")).IPAddressToString
      # SharePoint stores the configuration database connection string (dsn) in the registry
      $registryPath = "HKLM:\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\16.0\Secure\ConfigDB"
      $Name = "dsn"
      $regValue = (Get-ItemProperty -Path $registryPath -Name $Name).dsn
      # Swap the old SQL Server IP for the new one and write the value back
      $updatedRegValue = $regValue.Replace($oldIP, $newIP)
      New-ItemProperty -Path $registryPath -Name $Name -Value $updatedRegValue -PropertyType String -Force | Out-Null
  4. Configure the target network: VPC, subnets and security groups. Refer to the SharePoint on AWS documentation (Reference Deployment) for more guidance. The network topology varies based on platform requirements. Here is a reference architecture for SharePoint on AWS:

    Architecture topology

Start application replication

To start replication, select Start Replication under the Actions menu on the Applications page. The replication time depends on the amount of data replicated and available network bandwidth. On the application details page, you can observe the status of the replication in the Replication status field. If the replication fails, the status message field shows the reason.
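If you are scripting the migration rather than watching the console, the same status can be polled programmatically. Here is a minimal Python sketch, assuming a `get_status` callable that wraps however you retrieve the Replication status field (for example, via the AWS SMS API or CLI); the status strings shown are placeholders to adapt:

```python
import time

def wait_for_status(get_status, target="COMPLETED", failure="FAILED",
                    interval=30, timeout=3600):
    """Poll get_status() until it returns target or failure, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == target:
            return status
        if status == failure:
            raise RuntimeError("replication failed")
        time.sleep(interval)  # replication can take a while; avoid hammering the API
    raise TimeoutError("replication did not reach %s within %ss" % (target, timeout))
```

Because the status source is injected as a callable, the same helper works for replication, launch, or any other long-running SMS operation.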

Start replication

Launch SharePoint in AWS

  1. On the Application page, choose Actions, Launch application. A replication job must complete before you perform this action.
  2. In the Launch application window, choose Launch. On the application details page, you can observe the status of the launch in the Launch status field. If the launch fails, you can find the reason in the status message field. You can also generate a CloudFormation template and download it for use in different AWS accounts.

    Launch application

Test your migration

When the SharePoint application is launched, you can connect to the Amazon EC2-based servers via Remote Desktop Protocol (RDP). You can access the application through the Internet Information Services (IIS) settings on the SharePoint Web Front End (WFE) server running on Amazon EC2. It is also recommended to investigate optimizations using AWS Compute Optimizer. For this blog, we have not verified the migration steps with SharePoint 2007 and 2010.


AWS Server Migration Service simplifies SharePoint application migration. Using AWS SMS, you can easily migrate a SharePoint farm and reduce your migration timeline using the launch setting and launch order features.

To learn more, watch Application migration Using AWS Server Migration Service (SMS) or view a demo on the AWS Online Tech Talks Channel. If you have feedback, let us know in the comments section below.

Using AWS ParallelCluster with a serverless API

Post Syndicated from Chris Munns original

This post is contributed by Dario La Porta, AWS Senior Consultant – HPC

AWS ParallelCluster simplifies the creation and the deployment of HPC clusters. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. AWS Lambda automatically runs your code without requiring you to provision or manage servers.

In this post, I create a serverless API of the AWS ParallelCluster command line interface using these services. With this API, you can create, monitor, and destroy your clusters. This makes it possible to integrate AWS ParallelCluster programmatically with other applications you may have running on-premises or in the AWS Cloud.

The serverless integration of AWS ParallelCluster can enable a cleaner and more reproducible infrastructure as code paradigm to legacy HPC environments.

Taking this serverless, infrastructure as code approach enables several new types of functionality for HPC environments. For example, you can build on-demand clusters from an API when on-premises resources cannot handle the workload. AWS ParallelCluster can extend on-premises resources for running elastic and large-scale HPC on AWS’ virtually unlimited infrastructure.

You can also create an event-driven workflow in which new clusters are created when new data is stored in an S3 bucket. With event-driven workflows, you can be creative in finding new ways to build HPC infrastructure easily. It also helps optimize time for researchers.

Security is paramount in HPC environments because customers are performing scientific analyses that are central to their businesses. By using a serverless API, this solution can improve security by removing the need to run the AWS ParallelCluster CLI in a user environment. This helps keep customer environments secure and makes it easier to control the IAM roles and security groups that researchers have access to.

Additionally, the Amazon API Gateway for HPC job submission post explains how to submit a job in the cluster using the API. You can use this instead of connecting to the master node via SSH.

This diagram shows the components required to create the cluster and interact with the solution.

Cluster Architecture

Cost of the solution

You can deploy the solution in this blog post within the AWS Free Tier. Make sure that your AWS ParallelCluster configuration uses the t2.micro instance type for the cluster’s master and compute instances. This is the default instance type for AWS ParallelCluster configuration.

For real-world HPC use cases, you most likely want to use a different instance type, such as C5 or C5n. C5n in particular can work well for HPC workloads because it includes the option to use the Elastic Fabric Adapter (EFA) network interface. This makes it possible to scale tightly coupled workloads to more compute instances and reduce communications latency when using protocols such as MPI.

To stay within the AWS Free Tier allowance, be sure to destroy the created resources as described in the teardown section of this post.

VPC configuration

Choose Launch Stack to create the VPC used for this configuration in your account:

The stack creates the VPC, the public subnets, and the private subnet required for the cluster in the eu-west-1 Region.

Stack Outputs

You can also use an existing VPC that complies with the AWS ParallelCluster network requirements.

Deploy the API with AWS SAM

The AWS Serverless Application Model (AWS SAM) is an open-source framework that you can use to build serverless applications on AWS. You use AWS SAM to simplify the setup of the serverless architecture.

In this case, the framework automates the manual configuration of setting up the API Gateway and Lambda function. Instead you can focus more on how the API works with AWS ParallelCluster. It improves security and provides a simple, alternative method for cluster lifecycle management.

You can install the AWS SAM CLI by following the Installing the AWS SAM CLI documentation. You can download the code used in this example from this repo. Inside the repo:

  • the sam-app folder in the aws-sample repository contains the code required to build the AWS ParallelCluster serverless API.
  • sam-app/template.yml contains the policy required for the Lambda function for the creation of the cluster. Be sure to modify <AWS ACCOUNT ID> to match the value for your account.

The AWS Identity and Access Management Roles in AWS ParallelCluster document contains the latest version of the policy, in the ParallelClusterUserPolicy section.

To deploy the application, run the following commands:

cd sam-app
sam build
sam deploy --guided

From here, provide parameter values for the SAM deployment wizard that are appropriate for your Region and AWS account. After the deployment, take a note of the Outputs:

SAM deploying

SAM Stack Outputs

The API Gateway endpoint URL is used to interact with the API, and has the following format:

https:// <ServerlessRestApi>

AWS ParallelCluster configuration file

AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS Cloud. AWS ParallelCluster uses a configuration file to build the cluster and its syntax is explained in the documentation guide. The pcluster.conf configuration file can be created in a directory of your local file system.

The configuration file has been tested with AWS ParallelCluster v2.6.0. The master_subnet_id contains the id of the created public subnet and the compute_subnet_id contains the private one.
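For reference, a minimal pcluster.conf along these lines matches the setup described in this post (Free Tier t2.micro instances, eu-west-1, and the subnets created by the VPC stack). All IDs and the key pair name are placeholders that you must replace with the values from your account and the stack outputs:

```ini
[global]
cluster_template = default
update_check = true
sanity_check = true

[aws]
aws_region_name = eu-west-1

[cluster default]
key_name = <your-ec2-key-pair>
base_os = alinux
scheduler = slurm
master_instance_type = t2.micro
compute_instance_type = t2.micro
initial_queue_size = 0
max_queue_size = 2
vpc_settings = public-private

[vpc public-private]
vpc_id = <vpc-id-from-stack-outputs>
master_subnet_id = <public-subnet-id>
compute_subnet_id = <private-subnet-id>
use_public_ips = false
```

The scheduler and instance choices here are Free Tier-friendly defaults; adjust them for real workloads as discussed in the cost section.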

Deploy the cluster with the pcluster API

The pcluster API created in the previous steps requires some parameters:

  • command – the pcluster command to execute. A detailed list of available commands is in the AWS ParallelCluster CLI commands page.
  • cluster_name – the name of the cluster.
  • --data-binary "$(base64 /path/to/pcluster/config)" – parameter used to pass the local AWS ParallelCluster configuration file to the API.
  • -H “additional_parameters: <param1> <param2> <…>” – used to pass additional parameters to the pcluster cli.

The following command creates a cluster named “cluster1”:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>"

Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1

The cluster creation status can be queried with the following:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>"


When the cluster is in the “CREATE_COMPLETE” state, you can retrieve the master node IP address using the following API call:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>"


$ curl --request POST -H "additional_parameters: "  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>"

MasterServer: RUNNING
ClusterUser: ec2-user

When the cluster is not needed anymore, destroy it with the following API call:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>"

Deleting: cluster1

The additional_parameters: --nowait prevents waiting for stack events after executing a stack command and avoids triggering the Lambda function timeout. The Amazon API Gateway for HPC job submission post explains how you can submit a job in the cluster using the API, instead of connecting to the master node via SSH.
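The same calls can be made from code instead of curl. Here is a minimal Python sketch that builds the request pieces the curl examples use; the header name and base64-encoded body follow the examples above, and the actual POST is shown commented out because the endpoint URL is account-specific:

```python
import base64

def build_pcluster_request(config_text, additional_parameters=""):
    """Build the headers and body for a POST to the pcluster serverless API.

    Mirrors: curl --request POST -H "additional_parameters: ..." \
                  --data-binary "$(base64 pcluster.conf)" <endpoint>
    """
    body = base64.b64encode(config_text.encode("utf-8")).decode("ascii")
    headers = {"additional_parameters": additional_parameters}
    return headers, body

# Example use (endpoint placeholder, as in the curl examples):
# import requests
# headers, body = build_pcluster_request(open("/tmp/pcluster.conf").read(), "--nowait")
# requests.post("https://<ServerlessRestApi>", headers=headers, data=body)
```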

The authentication to the API can be managed by following the Controlling and Managing Access to a REST API in API Gateway Documentation.


You can destroy the resources by deleting the CloudFormation stacks created during installation. Deleting a Stack on the AWS CloudFormation Console explains the required steps.


In this post, I show how to integrate AWS ParallelCluster with Amazon API Gateway and manage the lifecycle of an HPC cluster using this API. Using Amazon API Gateway and AWS Lambda, you can run a serverless implementation of the AWS ParallelCluster CLI. This makes it possible to integrate AWS ParallelCluster programmatically with other applications you run on-premises or in the AWS Cloud.

This solution can help you improve the security of your HPC environment by simplifying the IAM roles and security groups that must be granted to individual users to successfully create HPC clusters. With this implementation, researchers no longer must run the AWS ParallelCluster CLI in their own user environment. As a result, by simplifying the security management of your HPC clusters’ lifecycle management, you can better ensure that important research is safe and secure.

To learn more, read more about how to use AWS ParallelCluster.

Coming soon: Updated Lambda states lifecycle for VPC networking

Post Syndicated from Chris Munns original

On November 27, we announced that AWS Lambda now includes additional attributes in the function information returned by several Lambda API actions to better communicate the current “state” of your function, when they are being created or updated. In our post “Tracking the state of AWS Lambda functions”, we covered the various states your Lambda function can be in, the conditions that lead to them, and how the Lambda service transitions the function through those states.

Our first feature using the function states lifecycle is a change to the recently announced improved VPC networking for AWS Lambda functions. As stated in the announcement post, Lambda creates the ENIs required for your function to connect to your VPCs, which can take 60–90 seconds to complete. We are updating this operation to explicitly place the function into a Pending state while pre-creating the required elastic network interface resources, and transitioning to an Active state after that process is completed. By doing this, we can use the lifecycle to complete the creation of these resources, and then reduce inconsistent invokes after the create/update has completed.

Most customers experience no impact from this change except for fewer long cold-starts due to network resource creation. As a reminder, any invocations or other API actions that operate on the function will fail during the time before the function is Active. To better assist you in adopting this behavior, we are rolling out this behavior for VPC configured functions in a phased manner. This post provides further details about timelines and approaches to both test the change before it is 100% live or delay it for your functions using a delay mechanism.

Changes to function create and update

On function create

During creation of new functions configured for VPC, your function remains in the Pending state until all VPC resources are created. You are not able to invoke the function or take any other Lambda API actions against it. After successful creation of these resources, your function transitions automatically to the Active state and is available for invokes and Lambda API actions. If the network resources fail to create, your function is placed in a Failed state.

On function update

During the update of functions configured for VPC, if there are any modifications to the VPC configuration, the function remains in the Active state, but shows in the InProgress status until all VPC resources are updated. During this time, any invokes go to the previous function code and configuration. After successful completion, the function LastUpdateStatus transitions automatically to Successful and all new invokes use the newly updated code and configuration. If the network resources fail to be created/updated then the LastUpdateStatus shows Failed, but the previous code and configuration remains in the Active state.

It’s important to note that creation or update of VPC resources can take between 60 and 90 seconds to complete.

Change timeframe

As a reminder, all functions today show an Active state only. We are rolling out this change to create resources during the Pending state in multiple phases, starting with the Begin Testing phase today, December 16, 2019. The phases allow you to update tooling for deploying and managing Lambda functions to account for this change. By the end of the update timeline, all accounts transition to using this new VPC resource create/update Lambda lifecycle.

Update timeline

December 16, 2019 – Begin Testing: You can now begin testing and updating any deployment or management tools you have to account for the upcoming lifecycle change. You can also use this time to update your function configuration to delay the change until the Delayed Update phase.

January 20, 2020 – General Update: All customers without the delayed update configuration begin seeing functions transition as described above under “On function create” and “On function update”.

February 17, 2020 – Delayed Update: The delay mechanism expires and customers now see the new VPC resource lifecycle applied during function create or update.

March 2, 2020 – Update End: All functions now have the new VPC resource lifecycle applied during function create or update.

Opt-in and delayed update configurations

Starting today, we are providing a mechanism for an opt-in, to allow you to update and test your tools and developer workflow processes for this change. We are also providing a mechanism to delay this change until the end of the Delayed Update phase. If you configure your functions for VPC and use the delayed update mechanism after the start of the General Update, your functions continue to experience a delayed first invocation due to VPC resource creation.

This mechanism operates on a function-by-function basis, so you can test and experiment individually without impacting your whole account. Once the General Update phase begins, all functions in an account that do not have the delayed update mechanism in place see the new lifecycle for their functions.

Both mechanisms work by adding a special string in the “Description” parameter of your Lambda functions. This string can be added as a prefix or suffix, or be the entire contents of the field.

To opt in:


To delay the update:


NOTE: Delay configuration mechanism has no impact after the Delayed Update phase ends.

Here is how this looks in the console:

  1. I add the opt-in configuration to my function’s Description.

    Opt-in in Description

  2. When I choose Save at the top, I see the update begin. During this time, I am blocked from executing tests, updating my code, and making some configuration changes against the function.

    Function updating

  3. After the update completes, I can once again run tests and other console commands.

    Function update successful

Once the opt-in is set for a function, then updates on that function go through the update flow shown above. If I don’t change my function’s VPC configuration, then updates to my function transition almost instantly to the Successful update status.

With this in place, you can now test your development workflow ahead of the General Update phase. Download the latest CLI (version 1.16.291 or greater) or SDKs in order to see function state and related attribute information.


With functions states, you can have better clarity on how the resources required by your Lambda function are being created. This change does not impact the way that functions are invoked or how your code is executed. While this is a minor change to when resources are created for your Lambda function, the result is even better consistency of performance. Combined with the original announcement of improved VPC networking for Lambda, you experience better consistency for invokes, greatly reduced cold-starts, and fewer network resources created for your functions.

Tracking the state of Lambda functions

Post Syndicated from Chris Munns original

AWS Lambda functions often require resources from other AWS services in order to execute successfully, such as AWS Identity and Access Management (IAM) roles or Amazon Virtual Private Cloud (Amazon VPC) network interfaces. When you create or update a function, Lambda provisions the required resources on your behalf that enable your function to execute. In most cases, this process is very fast, and your function is ready to be invoked or modified immediately. However, there are situations where this action can take longer.

To better communicate the current “state” of your function when resources are being created or updated, AWS Lambda now includes several additional attributes in the function information returned by several Lambda API actions. This change does not impact the way that functions are invoked or how your code is executed. In this post we cover the various states your Lambda function can be in, the conditions that lead to them, and how the Lambda service transitions the function through different states. You can also read more about this in the AWS Lambda documentation, Monitoring the State of a Function with the Lambda API.

All about function states

There are two main lifecycles that a Lambda function goes through, depending on if it is being created for the first time or existing and being updated. Before we go into the flows, let’s cover what the function states are:


Pending is the first state that all functions go through when they are created. During Pending, Lambda attempts to create or configure any external resources. A function can only be invoked in the Active state, so any invocations or other API actions that operate on a Pending function will fail. Any automation built around invoking or updating functions immediately after creation must be modified to first check that the function has transitioned from Pending to Active.


Active is reached after any configuration or provisioning of resources has completed during the initial creation of a function. Functions can only be invoked in the Active state.

During a function update, the state remains set to Active, but there is a separate attribute called LastUpdateStatus, which represents the state of an update of an active function. During an update, invocations execute the function’s previous code and configuration until it has successfully completed the update operation, after which the new code and configuration is used.


Failed is a state that represents that something has gone wrong with either the configuration or the provisioning of external resources.


A function transitions to the Inactive state when it has been idle long enough for the Lambda service to reclaim the external resources that were configured for it. When you invoke a function that is in the Inactive state, it fails and sets the function to the Pending state until those resources are recreated. If the resources fail to be recreated, the function is set back to the Inactive state.

For all states, there are two other attributes that are set: StateReason and StateReasonCode. Both exist for troubleshooting issues and provide more information about the current state.


LastUpdateStatus represents a child lifecycle of the main function states lifecycle. It only becomes relevant during an update and signifies the changes a function goes through. There are three statuses associated with this:


InProgress signifies that an update is happening on an existing function. While a function is in the InProgress status, invocations go to the function’s previous code and configuration.


Successful signifies that the update has completed. Once a function has been updated, this stays set until a further update.


Failed signifies that the function update has failed. The change is aborted and the function’s previous code and configuration remains in the Active state and available.

For all LastUpdateStatus values there are two other attributes that are set: LastUpdateStatusReason and LastUpdateStatusReasonCode. These exist to help troubleshoot any issues with an update.

Function State Lifecycles

The transition of a function from one state to another is completely dependent on the actions taken against it. There is no way for you to manually move a function between states.

Create function state lifecycle

For all functions the primary state lifecycle looks like this:

  • On create, a function enters the Pending state
  • Once any required resources are successfully created, it transitions into the Active state
    • If resource or function create fails for any reason then it transitions to the Failed state
  • If a function remains idle for several weeks, it enters the Inactive state
  • On the first invoke after going Inactive, a function re-enters the Pending state
    • Success sets the state to Active again
    • Failure sets the state back to Inactive
  • Successfully updating an Inactive function also sets the state back to Active again

For updating functions, the lifecycle looks like this:

Update function state lifecycle

NOTE: Functions can only be updated if they are in the Active or Inactive state. Update commands issued against functions that aren’t in the Active or Inactive state will fail before execution.

  • On update, a function’s LastUpdateStatus is set to InProgress
    • Success sets the LastUpdateStatus to Successful
    • Failure sets the LastUpdateStatus to Failed
  • A function that is Inactive will go back to Active after a successful update.
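The two lifecycles above can be summarized as a small transition table. This is only an illustrative sketch of the documented behavior, not an API:

```python
# (current state, event) -> next state, per the create/update lifecycles above
TRANSITIONS = {
    ("Pending", "resources_created"): "Active",
    ("Pending", "resources_failed"): "Failed",     # on initial create
    ("Active", "idle_several_weeks"): "Inactive",
    ("Inactive", "invoke"): "Pending",             # resources are recreated
    ("Inactive", "update_succeeded"): "Active",
}

def next_state(state, event):
    """Look up the next state; unknown events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Note that updates of an Active function are tracked separately through LastUpdateStatus, which is why a successful update leaves the state itself at Active.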

Accessing function state information

You can query the current state of the function using the latest AWS SDKs and AWS CLI (version 1.16.291 or greater). To make it easy to write code that checks the state of a function, we’ve added a new set of Waiters to the AWS SDKs. Waiters are a feature of certain AWS SDKs that you can use when AWS services have asynchronous operations that you would otherwise need to poll for a change in state or status. Using waiters removes the need to write complicated polling code and makes it easy to check the state of a Lambda function. You can find out more about these in the documentation for each SDK: Python, Node.js, Java, Ruby, Go, and PHP.
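In Python, for example, using the waiters looks roughly like this. The waiter name function_active is taken from the SDK documentation for this feature; the client is passed in as a parameter so the sketch can be exercised without AWS credentials:

```python
def wait_until_active(lambda_client, function_name):
    """Block until the function leaves Pending, using the SDK's waiter."""
    waiter = lambda_client.get_waiter("function_active")
    waiter.wait(FunctionName=function_name)

# Typical use with the real SDK:
# import boto3
# wait_until_active(boto3.client("lambda"), "my-function")
```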

What’s next

Starting today, all functions will show an Active state only. You will not see a function transition from Pending at all.

Our first feature leveraging the function states lifecycle, will be a change to the recently announced improved VPC networking for AWS Lambda functions. As stated in the announcement post, Lambda precreates the ENIs required for your function to connect to your VPCs, which can take 60 to 90 seconds to complete. We will be changing this process slightly, by creating the required ENI resources while the function is placed in a Pending state and transitioning to Active after that process is completed. Remember, any invocations or other API actions that operate on the function will fail during the time before the function is Active. We will roll this behavior out for VPC configured functions in a phased manner, and will provide further details about timelines, and approaches to test before it is 100% live, in an upcoming post.

Until then, we encourage you to update the tooling you use to deploy and manage Lambda-based applications to confirm that your functions are in the Active state before trying to invoke or perform other actions against them. If you use AWS tools like AWS CloudFormation or AWS SAM, or tools from Lambda’s partners to deploy and manage your functions, there is no further action required.

Update: Issue affecting HashiCorp Terraform resource deletions after the VPC Improvements to AWS Lambda

Post Syndicated from Chris Munns original

On September 3, 2019, we announced an exciting update that improves the performance, scale, and efficiency of AWS Lambda functions when working with Amazon VPC networks. You can learn more about the improvements in the original blog post. These improvements represent a significant change in how elastic network interfaces (ENIs) are configured to connect to your VPCs. With this new model, we identified an issue where VPC resources, such as subnets, security groups, and VPCs, can fail to be destroyed via HashiCorp Terraform. More information about the issue can be found here. In this post we will help you identify whether this issue affects you and the steps to resolve the issue.

How do I know if I’m affected by this issue?

This issue only affects you if you use HashiCorp Terraform to destroy environments. Versions of Terraform AWS Provider that are v2.30.0 or older are impacted by this issue. With these versions you may encounter errors when destroying environments that contain AWS Lambda functions, VPC subnets, security groups, and Amazon VPCs. Typically, terraform destroy fails with errors similar to the following:

Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)

Error deleting security group: DependencyViolation: resource sg-<id> has a dependent object
        	status code: 400, request id: <guid>

Depending on which AWS Regions the VPC improvements have been rolled out to, you may encounter these errors in some Regions and not others.

How do I resolve this issue if I am affected?

You have two options to resolve this issue. The recommended option is to upgrade your Terraform AWS Provider to v2.31.0 or later. To learn more about upgrading the Provider, visit the Terraform AWS Provider Version 2 Upgrade Guide. You can find information and source code for the latest releases of the AWS Provider on this page. The latest version of the Terraform AWS Provider contains a fix for this issue as well as changes that improve the reliability of the environment destruction process. We highly recommend that you upgrade the Provider version as the preferred option to resolve this issue.

If you are unable to upgrade the Provider version, you can mitigate the issue by making changes to your Terraform configuration. You need to make the following sets of changes to your configuration:

  1. Add an explicit dependency, using a depends_on argument, to the aws_security_group and aws_subnet resources that you use with your Lambda functions. The dependency must be added on the aws_security_group or aws_subnet and target the aws_iam_role_policy_attachment resource that attaches the VPC access policy to the IAM role configured on the Lambda function. See the example below for more details.
  2. Override the delete timeout for all aws_security_group and aws_subnet resources. The timeout should be set to 40 minutes.

The following configuration file shows an example where these changes have been made:

provider "aws" {
    region = "eu-central-1"
resource "aws_iam_role" "lambda_exec_role" {
  name = "lambda_exec_role"
  assume_role_policy = <<EOF
  "Version": "2012-10-17",
  "Statement": [
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": ""
      "Effect": "Allow",
      "Sid": ""
data "aws_iam_policy" "LambdaVPCAccess" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
resource "aws_iam_role_policy_attachment" "sto-lambda-vpc-role-policy-attach" {
  role       = "${}"
  policy_arn = "${data.aws_iam_policy.LambdaVPCAccess.arn}"
resource "aws_security_group" "allow_tls" {
  name        = "allow_tls"
  description = "Allow TLS inbound traffic"
  vpc_id      = "vpc-<id>"
  ingress {
    # TLS (change to whatever ports you need)
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    # Please restrict your ingress to only necessary IPs and ports.
    # Opening to can lead to security vulnerabilities.
    cidr_blocks = [""]
  egress {
    from_port       = 0
    to_port         = 0
    protocol        = "tcp"
    cidr_blocks     = [""]
  timeouts {
    delete = "40m"
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]  
resource "aws_subnet" "main" {
  vpc_id     = "vpc-<id>"
  cidr_block = ""

  timeouts {
    delete = "40m"
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]
resource "aws_lambda_function" "demo_lambda" {
    function_name = "demo_lambda"
    handler = "index.handler"
    runtime = "nodejs10.x"
    filename = ""
    source_code_hash = "${filebase64sha256("")}"
    role = "${aws_iam_role.lambda_exec_role.arn}"
    vpc_config {
     subnet_ids         = ["${}"]
     security_group_ids = ["${}"]

The key block to note here is the following, which can be seen in both the “allow_tls” security group and “main” subnet resources:

timeouts {
  delete = "40m"
}

depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]

These changes should be made to your Terraform configuration files before destroying your environments for the first time.

Can I delete resources remaining after a failed destroy operation?

Destroying environments without upgrading the provider or making the configuration changes outlined above may result in failures. As a result, you may have ENIs in your account that remain due to a failed destroy operation. These ENIs can be manually deleted a few minutes after the Lambda functions that use them have been deleted (typically within 40 minutes). Once the ENIs have been deleted, you can re-run terraform destroy.

Automating your lift-and-shift migration at no cost with CloudEndure Migration

Post Syndicated from Chris Munns original

This post is courtesy of Gonen Stein, Head of Product Strategy, CloudEndure 

Acquired by AWS in January 2019, CloudEndure offers a highly automated migration tool to simplify and expedite rehost (lift-and-shift) migrations. AWS recently announced that CloudEndure Migration is now available to all customers and partners at no charge.

Each free CloudEndure Migration license provides 90 days of use following agent installation. During this period, you can perform all migration steps: replicate your source machines, conduct tests, and perform a scheduled cutover to complete the migration.


In this post, I show you how to obtain free CloudEndure Migration licenses and how to use CloudEndure Migration to rehost a machine from an on-premises environment to AWS. Although I’ve chosen to focus on an on-premises-to-AWS use case, CloudEndure also supports migration from cloud-based environments. For those of you who are interested in advanced automation, I include information on how to further automate large-scale migration projects.

Understanding CloudEndure Migration

CloudEndure Migration works through an agent that you install on your source machines. No reboot is required, and there is no performance impact on your source environment. The CloudEndure Agent connects to the CloudEndure User Console, which issues an API call to your target AWS Region. The API call creates a Staging Area in your AWS account that is designated to receive replicated data.

CloudEndure Migration automated rehosting consists of three main steps:

  1. Installing the Agent: The CloudEndure Agent replicates entire machines to a Staging Area in your target AWS account.
  2. Configuration and testing: You configure your target machine settings and launch non-disruptive tests.
  3. Performing cutover: CloudEndure automatically converts machines to run natively in AWS.

The Staging Area comprises both lightweight Amazon EC2 instances that act as replication servers and staging Amazon EBS volumes. Each source disk maps to an identically sized EBS volume in the Staging Area. The replication servers receive data from the CloudEndure Agent running on the source machines and write this data onto staging EBS volumes. One replication server can handle multiple source machines replicating concurrently.

After all source disks copy to the Staging Area, the CloudEndure Agent continues to track and replicate any changes made to the source disks. Continuous replication occurs at the block level, enabling CloudEndure to replicate any application that runs on supported x86-based Windows or Linux operating systems via an installed agent.
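The idea behind block-level continuous replication is that after the initial full copy, only the disk blocks that change are shipped to the staging volume. A toy illustration of that idea (this is a conceptual model, not CloudEndure’s actual implementation):

```python
# Toy model of block-level replication: an initial full copy, then only
# changed blocks are shipped to the staging volume. Illustrative only.
BLOCK_SIZE = 4  # bytes per block, kept tiny for the example

def blocks(data):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def initial_sync(source):
    """Copy every block of the source disk to the staging volume."""
    return blocks(source)

def continuous_replication(staging, new_source):
    """Ship only the blocks that differ from what staging already has."""
    changed = 0
    for idx, block in enumerate(blocks(new_source)):
        if idx >= len(staging):
            staging.append(block)
            changed += 1
        elif staging[idx] != block:
            staging[idx] = block
            changed += 1
    return changed

source = b"ABCDEFGHIJKL"
staging = initial_sync(source)
# One 4-byte block is overwritten on the source disk...
updated = b"ABCDWXYZIJKL"
changed = continuous_replication(staging, updated)
print(changed)  # 1 block shipped, not a full re-copy
```

Because only dirty blocks travel over the wire, the staging volumes stay current without re-copying entire disks.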

When the target machines launch for testing or cutover, CloudEndure automatically converts the target machines so that they boot and run natively on AWS. This conversion includes injecting the appropriate AWS drivers, making appropriate bootloader changes, modifying network adapters, and activating operating systems using the AWS KMS. This machine conversion process generally takes less than a minute, irrespective of machine size, and runs on all launched machines in parallel.

CloudEndure Migration Architecture

CloudEndure Migration Architecture

Installing CloudEndure Agent

To install the Agent:

  1. Start your migration project by registering for a free CloudEndure Migration license. The registration process is quick: use your email address to create a username and password for the CloudEndure User Console. Use this console to create and manage your migration projects.

    After you log in to the CloudEndure User Console, you need an AWS access key ID and secret access key to connect your CloudEndure Migration project to your AWS account. To obtain these credentials, sign in to the AWS Management Console and create an IAM user. Enter your AWS credentials in the CloudEndure User Console.

  2. Configure and save your migration replication settings, including your migration project’s Migration Source, Migration Target, and Staging Area. For example, I selected the Migration Source: Other Infrastructure, because I am migrating on-premises machines. Also, I selected the Migration Target AWS Region: AWS US East (N. Virginia). The CloudEndure User Console also enables you to configure a variety of other settings after you select your source and target, such as subnet, security group, VPN or Direct Connect usage, and encryption.

    After you save your replication settings, CloudEndure prompts you to install the CloudEndure Agent on your source machines. In my example, the source machines consist of an application server, a web server, and a database server, and all three are running Debian GNU/Linux 9.

  3. Download the CloudEndure Agent Installer for Linux by running the following command:
    wget -O ./
  4. Run the Installer:
    sudo python ./ -t <INSTALLATION TOKEN> --no-prompt

    You can install the CloudEndure Agent locally on each machine. For large-scale migrations, use the unattended installation parameters with any standard deployment tool to remotely install the CloudEndure Agent on your machines.

    After the Agent installation completes, CloudEndure adds your source machines to the CloudEndure User Console. From there, your source machines undergo several initial replication steps. To obtain a detailed breakdown of these steps, in the CloudEndure User Console, choose Machines, and select a specific machine to open the Machine Details View page.


    Data replication consists of two stages: Initial Sync and Continuous Data Replication. During Initial Sync, CloudEndure copies all of the source disks’ existing content into EBS volumes in the Staging Area. After Initial Sync completes, Continuous Data Replication begins, tracking your source machines and replicating new data writes to the staging EBS volumes. Continuous Data Replication makes sure that your Staging Area always has the most up-to-date copy of your source machines.

  5. To track your source machines’ data replication progress, in the CloudEndure User Console, choose Machines, and see the Details view.
    When the Data Replication Progress status column reads Continuous Data Replication, and the Migration Lifecycle status column reads Ready for Testing, Initial Sync is complete. These statuses indicate that the machines are functioning correctly and are ready for testing and migration.

Configuration and testing

To test how your machine runs on AWS, you must configure the Target Machine Blueprint. This Blueprint is a set of configurations that define where and how the target machines are launched and provisioned, such as target subnet, security groups, instance type, volume type, and tags.

For large-scale migration projects, APIs can be used to configure the Blueprint for all of your machines within seconds.

I recommend performing a test at least two weeks before migrating your source machines, to give you enough time to identify potential problems and resolve them before you perform the actual cutover. For more information, see Migration Best Practices.

To launch machines in Test Mode:

  1. In the CloudEndure User Console, choose Machines.
  2. Select the Name box corresponding to each machine to test.
  3. Choose Launch Target Machines, Test Mode.

After the target machines launch in Test Mode, the CloudEndure User Console reports those machines as Tested and records the date and time of the test.

Performing cutover

After you have completed testing, your machines continue to be in Continuous Data Replication mode until the scheduled cutover window.

When you are ready to perform a cutover:

  1. In the CloudEndure User Console, choose Machines.
  2. Select the Name box corresponding to each machine to migrate.
  3. Choose Launch Target Machines, Cutover Mode.
    To confirm that your target machines successfully launch, see the Launch Target Machines menu. As your data replicates, verify that the target machines are running correctly, make any necessary configuration adjustments, perform user acceptance testing (UAT) on your applications and databases, and redirect your users.
  4. After the cutover completes, remove the CloudEndure Agent from the source machines and the CloudEndure User Console.
  5. At this point, you can also decommission your source machines.


In this post, I showed how to rehost an on-premises workload to AWS using CloudEndure Migration. CloudEndure automatically converts your machines from any source infrastructure to AWS infrastructure. That means they can boot and run natively in AWS, and run as expected after migration to the cloud.

If you have further questions see CloudEndure Migration, or Registering to CloudEndure Migration.

Get started now with free CloudEndure Migration licenses.

Announcing improved VPC networking for AWS Lambda functions

Post Syndicated from Chris Munns original

We’re excited to announce a major improvement to how AWS Lambda functions work with your Amazon VPC networks. With today’s launch, you will see dramatic improvements to function startup performance and more efficient usage of elastic network interfaces. These improvements are rolling out to all existing and new VPC functions at no additional cost. The rollout begins today and continues gradually over the next couple of months across all Regions.

Lambda first supported VPCs in February 2016, allowing you to access resources in your VPCs or on-premises systems using an AWS Direct Connect link. Since then, we’ve seen customers widely use VPC connectivity to access many different services:

  • Relational databases such as Amazon RDS
  • Datastores such as Amazon ElastiCache for Redis or Amazon Elasticsearch Service
  • Other services running in Amazon EC2 or Amazon ECS

Today’s launch brings enhancements to how this connectivity between Lambda and your VPCs works.

Previous Lambda integration model for VPCs

It helps to understand some basics about the way that networking with Lambda works. Today, all of the compute infrastructure for Lambda runs inside of VPCs owned by the service.

Lambda runs in a VPC controlled by the service

When you invoke a Lambda function from any invocation source and with any execution model (synchronous, asynchronous, or poll based), it occurs through the Lambda API. There is no direct network access to the execution environment where your functions run:

Invoke access only through the Lambda API

By default, when your Lambda function is not configured to connect to your own VPCs, the function can access anything available on the public internet such as other AWS services, HTTPS endpoints for APIs, or services and endpoints outside AWS. The function then has no way to connect to your private resources inside of your VPC.

When you configure your Lambda function to connect to your own VPC, it creates an elastic network interface in your VPC and then does a cross-account attachment. These network interfaces allow network access from your Lambda functions to your private resources. These Lambda functions continue to run inside of the Lambda service’s VPC and can now only access resources over the network through your VPC.

All invocations for functions continue to come from the Lambda service’s API and you still do not have direct network access to the execution environment. The attached network interface exists inside of your account and counts towards limits that exist in your account for that Region.

Lambda mapped to a customer ENI

With this model, there are a number of challenges that you might face. The time taken to create and attach new network interfaces can cause longer cold-starts, increasing the time it takes to spin up a new execution environment before your code can be invoked.

As your function’s execution environment scales to handle an increase in requests, more network interfaces are created and attached to the Lambda infrastructure. The exact number of network interfaces created and attached is a factor of your function configuration and concurrency.

Many ENIs mapped to Lambda environments

Every network interface created for your function is associated with and consumes an IP address in your VPC subnets. It counts towards your account level maximum limit of network interfaces.

As your function scales, you have to be mindful of several issues:

  • Managing the IP address space in your subnets
  • Reaching the account level network interface limit
  • The potential to hit the API rate limit on creating new network interfaces
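To see why this mattered, the older Lambda VPC guidance suggested estimating required ENI capacity as projected peak concurrent executions multiplied by (function memory in GB / 3 GB). A quick sketch of that estimate; treat both the formula's provenance and the numbers as illustrative:

```python
import math

def estimated_enis(peak_concurrency, memory_mb):
    """Rough ENI estimate from the older Lambda VPC sizing guidance:
    projected peak concurrent executions * (memory in GB / 3 GB)."""
    return math.ceil(peak_concurrency * (memory_mb / 1024) / 3)

# A 1,536 MB function running at 300 concurrent executions could need
# roughly 150 ENIs -- each one consuming an IP address in your subnets
# and counting against the account-level network interface limit.
print(estimated_enis(300, 1536))  # 150
```

Numbers like these made subnet IP-space planning a real operational concern under the previous model.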

What’s changing

Starting today, we’re changing the way that your functions connect to your VPCs. AWS Hyperplane, the Network Function Virtualization platform used for Network Load Balancer and NAT Gateway, has supported inter-VPC connectivity for offerings like AWS PrivateLink, and we are now leveraging Hyperplane to provide NAT capabilities from the Lambda VPC to customer VPCs.

The Hyperplane ENI is a managed network resource that the Lambda service controls, allowing multiple execution environments to securely access resources inside of VPCs in your account. Instead of the previous solution of mapping network interfaces in your VPC directly to Lambda execution environments, network interfaces in your VPC are mapped to the Hyperplane ENI and the functions connect using it.

Hyperplane still uses a cross account attached network interface that exists in your VPC, but in a greatly improved way:

  • The network interface creation happens when your Lambda function is created or its VPC settings are updated. When a function is invoked, the execution environment simply uses the pre-created network interface and quickly establishes a network tunnel to it. This dramatically reduces the latency that was previously associated with creating and attaching a network interface at cold start.
  • Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
  • Your function scaling is no longer directly tied to the number of network interfaces, and Hyperplane ENIs can scale to support large numbers of concurrent function executions.

Hyperplane ENIs are tied to a security group:subnet combination in your account. Functions in the same account that share the same security group:subnet pairing use the same network interfaces. This way, a single application with multiple functions but the same network and security configuration can benefit from the existing interface configuration.
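Because one shared interface serves each unique security group and subnet pairing, you can estimate how many Hyperplane ENIs an account needs by counting distinct combinations across functions. A small sketch with hypothetical function configurations:

```python
def distinct_eni_combinations(functions):
    """Count unique (security groups, subnet) pairings across functions.
    Each unique pairing requires its own shared network interface."""
    combos = set()
    for fn in functions:
        sgs = tuple(sorted(fn["security_group_ids"]))
        for subnet in fn["subnet_ids"]:
            combos.add((sgs, subnet))
    return len(combos)

# Hypothetical configurations: the first two functions share networking
# config, so they reuse the same interfaces.
functions = [
    {"security_group_ids": ["sg-1"], "subnet_ids": ["subnet-a", "subnet-b"]},
    {"security_group_ids": ["sg-1"], "subnet_ids": ["subnet-a", "subnet-b"]},
    {"security_group_ids": ["sg-2"], "subnet_ids": ["subnet-a"]},
]
print(distinct_eni_combinations(functions))  # 3
```

Three shared interfaces for three functions, regardless of how high their concurrency climbs, is the key contrast with the previous per-execution-environment model.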

Several things do not change in this new model:

  • Your Lambda functions still need the IAM permissions required to create and delete network interfaces in your VPC.
  • You still control the subnet and security group configurations of these network interfaces. You can continue to apply normal network security controls and follow best practices on VPC configuration.
  • You still have to use a NAT device (for example, a VPC NAT Gateway) to give a function internet access, or use VPC endpoints to connect to services outside of your VPC.
  • Nothing changes about the types of resources that your functions can access inside of your VPCs.

As mentioned earlier, Hyperplane now creates a shared network interface when your Lambda function is first created or when its VPC settings are updated, improving function setup performance and scalability. This one-time setup can take up to 90 seconds to complete. If your Lambda function is invoked while the shared network interface is still being created, the invocation still succeeds but may see longer cold-start times. When setup is complete, your Lambda function no longer requires network interface creation at invocation time. If Lambda functions in an account go idle for several consecutive weeks, the service reclaims the unused Hyperplane resources, so very infrequently invoked functions may still see longer cold-start times.

As I mentioned earlier, you don’t have to enable this new capability yourself, and there are no new associated costs. Starting today, we are gradually rolling this out across all AWS Regions over the next couple of months. We will update this post on a Region by Region basis after the rollout has completed in a given Region.

If you have existing VPC-configured Lambda workloads, you will soon see your applications scale and respond to new invocations more quickly. You can see this change using tools like AWS X-Ray or those from Lambda’s third party partners.

The following screenshots show a before/after example taken with AWS X-Ray:

The first shows a function that was launched in a VPC, with a full cold start that includes network interface creation at invocation.

In this second image, you see the same function executed but this time in an account that has the new VPC networking capability. You can see the massive difference that this has made in function execution duration, with it dropping to 933 ms from 14.8 seconds!


On the AWS Lambda team, we consider “Simplicity” a core tenet to how we think about building services for our customers. We’re constantly evaluating ways that we can improve and simplify the experience that you see in building serverless applications with Lambda. These changes in how we connect with your VPCs improve the performance and scale for your Lambda functions. They enable you to harness the full power of serverless architectures.

New: Using Amazon EC2 Instance Connect for SSH access to your EC2 Instances

Post Syndicated from Chris Munns original

This post is courtesy of Saloni Sonpal – Senior Product Manager – Amazon EC2

Today, AWS is introducing Amazon EC2 Instance Connect, a new way to control SSH access to your EC2 instances using AWS Identity and Access Management (IAM).

About Amazon EC2 Instance Connect

While infrastructure as code (IaC) tools such as Chef and Puppet have become customary in the industry for configuring servers, you occasionally must access your instances to fine-tune, consult system logs, or debug application issues. The most common tool to connect to Linux servers is Secure Shell (SSH). It was created in 1995 and is now installed by default on almost every Linux distribution.

When connecting to hosts via SSH, SSH key pairs are often used to individually authorize users. As a result, organizations have to store, share, manage access for, and maintain these SSH keys.

Some organizations also maintain bastion hosts, which help limit network access into hosts by the use of a single jump point. They provide logging and prevent rogue SSH access by adding an additional layer of network obfuscation. However, running bastion hosts comes with challenges. You maintain the installed user keys, handle rotation, and make sure that the bastion host is always available and, more importantly, secured.

Amazon EC2 Instance Connect simplifies many of these issues and provides the following benefits to help improve your security posture:

  • Centralized access control – You get centralized access control to your EC2 instances on a per-user and per-instance level. IAM policies and principals remove the need to share and manage SSH keys.
  • Short-lived keys – SSH keys are not persisted on the instance, but are ephemeral in nature. They are only accessible by an instance at the time that an authorized user connects, making it easier to grant or revoke access in real time. This also allows you to move away from long-lived keys. Instead, you generate one-time SSH keys each time that an authorized user connects, eliminating the need to track and maintain keys.
  • Auditability – User connections via EC2 Instance Connect are logged to AWS CloudTrail, providing the visibility needed to easily audit connection requests and maintain compliance.
  • Ubiquitous access – EC2 Instance Connect works seamlessly with your existing SSH client. You can also connect to your instances from a new browser-based SSH client in the EC2 console, providing a consistent experience without having to change your workflows or tools.

How EC2 Instance Connect works

When the EC2 Instance Connect feature is enabled on an instance, the SSH daemon (sshd) on that instance is configured with a custom AuthorizedKeysCommand script. During the SSH authentication process, this script reads SSH public keys from instance metadata and uses them to authenticate your connection to the instance.

The SSH public keys are only available for one-time use for 60 seconds in the instance metadata. To connect to the instance successfully, you must connect using SSH within this time window. Because the keys expire, there is no need to track or manage these keys directly, as you did previously.
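That one-time, 60-second validity window can be modeled simply: a key pushed to instance metadata is only accepted if the connection arrives within 60 seconds of the push. An illustrative sketch, not the actual instance-side implementation:

```python
KEY_TTL_SECONDS = 60  # EC2 Instance Connect keeps a pushed key in metadata for 60s

def key_is_valid(pushed_at, now, ttl=KEY_TTL_SECONDS):
    """Return True if an SSH public key pushed at `pushed_at` (epoch seconds)
    is still usable for a connection attempted at `now`."""
    return 0 <= now - pushed_at <= ttl

print(key_is_valid(1000, 1030))  # True: connecting 30s after the push
print(key_is_valid(1000, 1075))  # False: the 60s window has expired
```

Because expiry is automatic, there is no key inventory to rotate or revoke; a stale key simply stops working.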

Configuring an EC2 instance for EC2 Instance Connect

To get started using EC2 Instance Connect, you first configure your existing instances. Currently, EC2 Instance Connect supports Amazon Linux 2 and Ubuntu; install the RPM or Debian package, respectively, to enable the feature. New Amazon Linux 2 instances have the EC2 Instance Connect feature enabled by default, so you can connect to those newly launched instances right away using SSH without any further configuration.

First, configure an existing instance. In this case, set up an Amazon Linux 2 instance running in your account. For the steps for Ubuntu, see Set Up EC2 Instance Connect.

  1. Connect to the instance using SSH. The instance is running a relatively recent version of Amazon Linux 2:
    [[email protected] ~]$ uname -srv
    Linux 4.14.104-95.84.amzn2.x86_64 #1 SMP Fri Jun 21 12:40:53 UTC 2019
  2. Use the yum command to install the ec2-instance-connect RPM package.
    $ sudo yum install ec2-instance-connect
    Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
    Resolving Dependencies
    --> Running transaction check
    ---> Package ec2-instance-connect.noarch 0:1.1-9.amzn2 will be installed
    --> Finished Dependency Resolution
      ec2-instance-connect.noarch 0:1.1-9.amzn2                                                                                                     

    This RPM installs a few scripts locally and changes the AuthorizedKeysCommand and AuthorizedKeysCommandUser configurations in /etc/ssh/sshd_config. If you are using a configuration management tool to manage your sshd configuration, install the package and add the lines as described in the documentation.

With ec2-instance-connect installed, you are ready to set up your users and have them connect to instances.

Set up IAM users

First, allow an IAM user to be able to push their SSH keys up to EC2 Instance Connect. Create a new IAM policy so that you can add it to any other users in your organization.

  1. In the IAM console, choose Policies, Create Policy.
  2. On the Create Policy page, choose JSON and enter the following JSON document. Replace $REGION and $ACCOUNTID with your own values:
        "Version": "2012-10-17",
        "Statement": [
                "Effect": "Allow",
                "Action": [
                "Resource": [
                "Condition": {
                    "StringEquals": {
                        "ec2:osuser": "ec2-user"

    Use ec2-user as the value for ec2:osuser with Amazon Linux 2. Specify this so that the metadata is made available for the proper SSH user. For more information, see Actions, Resources, and Condition Keys for Amazon EC2 Instance Connect Service.

  3. Choose Review Policy.
  4. On the Review Policy page, name the policy, give it a description, and choose Create Policy.
  5. Attach the policy to your users by creating a new group (I named mine “HostAdmins”).
  6. On the Attach Policy page, attach the policy that you just created and choose Next Step.
  7. Choose Create Group.
  8. On the Groups page, select your newly created group and choose Group Actions, Add Users to Group.
  9. Select the user or users to add to this group, then choose Add Users.

Your users can now use EC2 Instance Connect.

Connecting to an instance using EC2 Instance Connect

With your instance configured and the users set with the proper policy, connect to your instance with your normal SSH client or directly, using the AWS Management Console.

To offer a seamless SSH experience, EC2 Instance Connect wraps up these steps in a command line tool. It also offers a browser-based interface in the console, which takes care of the SSH key generation and distribution for you.

To connect with your SSH client

  1. Generate a new key pair. The private and public keys are written to mynew_key and mynew_key.pub, respectively:
    $ ssh-keygen -t rsa -f mynew_key
  2. Use the following AWS CLI command to authorize the user and push the public key to the instance using the send-ssh-public-key command. To support this, you need the latest version of the AWS CLI.
    $ aws ec2-instance-connect send-ssh-public-key --region us-east-1 --instance-id i-0989ec3292613a4f9 --availability-zone us-east-1f --instance-os-user ec2-user --ssh-public-key file://
        "RequestId": "505f8675-710a-11e9-9263-4d440e7745c6", 
        "Success": true
  3. After authentication, the public key is made available to the instance through the instance metadata for 60 seconds. During this time, connect to the instance using the associated private key:
    $ ssh -i mynew_key [email protected]

If for some reason you don’t connect within that 60-second window, you see the following error:

$ ssh -i mynew_key [email protected]
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

If this happens, run the send-ssh-public-key command again and connect using SSH within the 60-second window.
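The three manual steps (generate a key, push the public key, connect) are easy to wrap in a small helper. The sketch below only builds the command lines rather than running them; the instance ID, zone, and key name are placeholders, and the final ssh target would normally be the instance's hostname or IP:

```python
def build_connect_commands(instance_id, az, region,
                           os_user="ec2-user", key="mynew_key"):
    """Build the shell commands for the keygen -> push -> ssh flow.
    Returns argument lists suitable for subprocess.run()."""
    return [
        # 1. Generate a fresh, short-lived key pair (empty passphrase).
        ["ssh-keygen", "-t", "rsa", "-f", key, "-N", ""],
        # 2. Push the public key to the instance for the 60-second window.
        ["aws", "ec2-instance-connect", "send-ssh-public-key",
         "--region", region,
         "--instance-id", instance_id,
         "--availability-zone", az,
         "--instance-os-user", os_user,
         "--ssh-public-key", f"file://{key}.pub"],
        # 3. Connect with the matching private key before the key expires.
        #    (Replace instance_id with the instance's hostname or IP.)
        ["ssh", "-i", key, f"{os_user}@{instance_id}"],
    ]

cmds = build_connect_commands("i-0989ec3292613a4f9", "us-east-1f", "us-east-1")
print(len(cmds))  # 3
```

This is essentially what a wrapper tool would automate: a fresh key per connection, pushed and used inside the validity window.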

Now, connect to your instance from the console.

To connect from the Amazon EC2 console

  1. Open the Amazon EC2 console.
  2. In the left navigation pane, choose Instances and select the instance to which to connect.
  3. Choose Connect.
  4. On the Connect To Your Instance page, choose EC2 Instance Connect (browser-based SSH connection), Connect.

The following terminal window opens and you are now connected through SSH to your instance.

Auditing with CloudTrail

For every connection attempt, you can also view the event details. The SendSSHPublicKey API calls recorded in CloudTrail include the destination instance ID, OS user name, and public key used to make the SSH connection.

In the CloudTrail console, search for SendSSHPublicKey.

If EC2 Instance Connect has been used recently, you should see records of your users having called this API operation to send their SSH key to the target host. Viewing the event’s details shows you the instance and other valuable information that you might want to use for auditing.

In the following example, see the JSON from a CloudTrail event that shows the SendSSHPublicKey command in use:

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "User",
        "principalId": "1234567890",
        "arn": "arn:aws:iam::1234567890:$USER",
        "accountId": "1234567890",
        "accessKeyId": "ABCDEFGHIJK3RFIONTQQ",
        "userName": "$ACCOUNT_NAME",
        "sessionContext": {
            "attributes": {
                "mfaAuthenticated": "true",
                "creationDate": "2019-05-07T18:35:18Z"
            }
        }
    },
    "eventTime": "2019-06-21T13:58:32Z",
    "eventSource": "",
    "eventName": "SendSSHPublicKey",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "",
    "userAgent": "aws-cli/1.15.61 Python/2.7.16 Linux/4.14.77-70.59.amzn1.x86_64 botocore/1.10.60",
    "requestParameters": {
        "instanceId": "i-0989ec3292613a4f9",
        "osUser": "ec2-user",
        "SSHKey": {
            "publicKey": "ssh-rsa <removed>\\n"
        }
    },
    "responseElements": null,
    "requestID": "df1a5fa5-710a-11e9-9a13-cba345085725",
    "eventID": "070c0ca9-5878-4af9-8eca-6be4158359c4",
    "eventType": "AwsApiCall",
    "recipientAccountId": "1234567890"
}

If you’ve configured your AWS account to collect CloudTrail events in an S3 bucket, you can download and audit the information programmatically. For more information, see Getting and Viewing Your CloudTrail Log Files.
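As a minimal sketch of such programmatic auditing (assuming you have already downloaded and decompressed the CloudTrail log files; the record structure follows the example event above, and the helper name is ours):

```python
import json

def find_ssh_key_pushes(log_file_contents):
    """Filter CloudTrail records for SendSSHPublicKey calls and return
    (instance ID, OS user, caller ARN) tuples for auditing."""
    records = json.loads(log_file_contents).get("Records", [])
    return [
        (r["requestParameters"]["instanceId"],
         r["requestParameters"]["osUser"],
         r["userIdentity"]["arn"])
        for r in records
        if r.get("eventName") == "SendSSHPublicKey"
    ]

# A minimal record shaped like the CloudTrail event shown above
sample = json.dumps({"Records": [{
    "eventName": "SendSSHPublicKey",
    "userIdentity": {"arn": "arn:aws:iam::1234567890:user/alice"},
    "requestParameters": {"instanceId": "i-0989ec3292613a4f9",
                          "osUser": "ec2-user"},
}]})
print(find_ssh_key_pushes(sample))
```

You could run a function like this over each log file in the S3 bucket to build a report of who pushed keys to which hosts.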


EC2 Instance Connect offers an alternative to complicated SSH key management strategies and includes the benefits of using built-in auditability with CloudTrail. By integrating with IAM and the EC2 instance metadata available on all EC2 instances, you get a secure way to distribute short-lived keys and control access by IAM policy.

There are some additional features in the works for EC2 Instance Connect. In the future, AWS hopes to launch tag-based authorization, which allows you to use resource tags in the condition of a policy to control access. AWS also plans to enable EC2 Instance Connect by default in popular Linux distros in addition to Amazon Linux 2.

EC2 Instance Connect is available now at no extra charge in the US East (Ohio and N. Virginia), US West (N. California and Oregon), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, and Tokyo), Canada (Central), EU (Frankfurt, Ireland, London, and Paris), and South America (São Paulo) AWS Regions.

Updated timeframe for the upcoming AWS Lambda and AWS Lambda@Edge execution environment update

Post Syndicated from Chris Munns original

On May 14th we announced an upcoming update to the AWS Lambda and AWS Lambda@Edge execution environments. In that announcement we shared that we are updating the execution environment to a more recent version of Amazon Linux. This newer execution environment brings updates that offer improvements in capabilities, performance, security, and updated packages that your application code might interface with. The previous post explained approaches to proactively testing against the new update, and methods to update your code to be compatible in the rare case you were impacted.

So far, we’ve heard from many customers who enabled the new execution environment through the opt-in mechanism that their functions have not been impacted. Those that were impacted have been able to follow the guidance on rebuilding any dependencies against the new execution environment and have retested their functions successfully.

We also received feedback that customers wanted a longer validation time frame and more control over it. Based on this feedback, we’ve decided to modify the timeframe in two ways.

The first phase, Begin Testing, will be extended by three weeks: instead of ending May 21, it now ends June 10. This gives you more time to test your functions with the Opt-in layer before any further changes to the platform take effect.

We are then taking the second phase, originally called Update/Create, and breaking it into two independent periods of time. The first, now referred to as the New Function Create phase, will be two weeks long; during this time, all newly created functions will use the new execution environment unless a delayed-update layer is configured. The second new phase, Existing Function Update, will be three weeks long; during this time, both newly created functions and existing functions that you update will use the new execution environment unless a delayed-update layer is configured.

The end result is that you now have 5 more weeks in total to test and potentially update your functions for this change before the General Update begins. As a reminder, starting at that time, all functions without a delayed-update layer configured will begin migrating to the new execution environment.

New update timeline

The following is the new timeline for the update, which is now broken up over five phases:

May 14, 2019 – Begin Testing: You can begin testing your functions for the new execution environment locally with AWS SAM CLI or using an Amazon EC2 instance running on Amazon Linux 2018.03. You can also proactively enable the new environment in AWS Lambda using the opt-in mechanism described in the original announcement post.
June 11, 2019 – New Function Create: All newly created functions will result in those functions running on the new execution environment, unless they have a delayed-update layer configured.
June 25, 2019 – Existing Function Update: All newly created functions or existing functions that you update will result in those functions running on the new execution environment, unless they have a delayed-update layer configured.
July 16, 2019 – General Update: Existing functions begin using the new execution environment on invoke, unless they have a delayed-update layer configured.
July 23, 2019 – Delayed Update End: All functions with a delayed-update layer configured start being migrated automatically.
July 29, 2019 – Migration End: All functions have been migrated over to the new execution environment.
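To make the phase boundaries above easy to check, here is a small sketch (our own illustration; the dates are hard-coded from the updated timeline) that maps a given date to the phase in effect:

```python
from datetime import date

# (phase start date, phase name), in chronological order,
# taken from the updated timeline above
PHASES = [
    (date(2019, 5, 14), "Begin Testing"),
    (date(2019, 6, 11), "New Function Create"),
    (date(2019, 6, 25), "Existing Function Update"),
    (date(2019, 7, 16), "General Update"),
    (date(2019, 7, 23), "Delayed Update End"),
    (date(2019, 7, 29), "Migration End"),
]

def phase_on(day):
    """Return the update phase in effect on a given date,
    or None if the date is before the rollout begins."""
    current = None
    for start, name in PHASES:
        if day >= start:
            current = name
    return current

print(phase_on(date(2019, 6, 12)))
```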

Note that we have updated the original announcement post with this new timeline as well.


We also wanted to take the chance to provide additional information on follow up questions customers have had about the update.

Q. How does this relate to the recent Node.js v10 runtime launch?
A. The Node.js v10 launch is unrelated and is not impacted by this change. The Node.js v10 runtime is based on Amazon Linux 2 as its execution environment. Please see the AWS Lambda Runtimes section in the documentation for more information.

Q. Does this update change the execution environment for other runtimes to run on Amazon Linux 2?
A. No, this update brings the execution environment to the latest Amazon Linux 1 distribution release. In the future, new runtimes will launch on Amazon Linux 2, but all previous existing runtimes will continue to run on Amazon Linux 1.

Q. Was this update related to the recent Intel Quarterly Security Release (QSR) 2019.1?
A. No, this update to the execution environment for Lambda and Lambda@Edge is unrelated to the Intel QSR. There is no action for Lambda or Lambda@Edge customers to take in relation to the QSR.

Next Steps

Your feedback greatly matters to us and we will continue to listen and learn from you. Please continue to contact us through AWS Support, the AWS forums, or AWS account teams.

Upcoming updates to the AWS Lambda and AWS Lambda@Edge execution environment

Post Syndicated from Chris Munns original

AWS Lambda was first announced at AWS re:Invent 2014. Amazon CTO Werner Vogels highlighted that you need to run no servers, no instances, nothing; you just write your code. In 2016, we announced the launch of Lambda@Edge, which lets you run Lambda functions to customize content that CloudFront delivers, executing the functions in AWS locations closer to the viewer.

At AWS, we talk often about “shared responsibility” models. Essentially, those are the places where there is a handoff between what we as a technology provider offer you and what you as the customer are responsible for. In the case of Lambda and Lambda@Edge, one of the key things that we manage is the “execution environment.” The execution environment is what your code runs inside of. It is composed of the underlying operating system, system packages, the runtime for your language (if a managed one), and common capabilities like environment variables. From the customer standpoint, your primary responsibility is for your application code and configuration.

In this post, we outline an upcoming change to the execution environment for Lambda and Lambda@Edge functions for all runtimes with the exception of Node.js v10. As with any update, some functionality could be affected. We encourage you to read through this post to understand the changes and any actions that you might need to take.

Update overview

AWS Lambda and AWS Lambda@Edge run on top of the Amazon Linux operating system distribution and maintain updates to both the core OS and managed language runtimes. We are updating the Lambda execution environment AMI to version 2018.03 of Amazon Linux. This newer AMI brings updates that offer improvements in capabilities, performance, security, and updated packages that your application code might interface with.

This does not apply to the recently announced Node.js v10 runtime which today runs on Amazon Linux 2.

The majority of functions will benefit seamlessly from the enhancements in this update without any action from you. However, in rare cases, package updates may introduce compatibility issues. Potential impacted functions may be those that contain libraries or application code compiled against specific underlying OS packages or other system libraries. If you are primarily using an AWS SDK or the AWS X-Ray SDK with no other dependencies, you will see no impact.

You have the following options in terms of next steps:

  • Take no action before the automatic update of the execution environment starting May 21 for all newly created/updated functions and June 11 for all existing functions.
  • Proactively test your functions against the new environment starting today.
  • Configure your functions to delay the execution environment update until June 18 to allow for a longer testing window.

In addition to the overall timeline for this change, this post also provides instructions on the following:

  • How to test your functions for this new execution environment locally and on Lambda/Lambda@Edge.
  • How to proactively update your functions.
  • How to extend the testing window by one week.

Update timeline

The following is the timeline for the update, which is broken up over four phases over the next several weeks:

May 14, 2019—Begin Testing: You can begin testing your functions for the new execution environment locally with AWS SAM CLI or using an Amazon EC2 instance running on Amazon Linux 2018.03. You can also proactively enable the new environment in AWS Lambda using the opt-in mechanism described later in this post.
May 21, 2019—Update/Create: All new function creates or function updates result in your functions running on the new execution environment.
June 11, 2019—General Update: Existing functions begin using the new execution environment on invoke unless they have a delayed-update layer configured.
June 18, 2019—Delayed Update End: All functions with a delayed-update layer configured start being migrated automatically.
June 24, 2019—Migration End: All functions have been migrated over to the new execution environment.

Recommended Approach

decision tree

You only have to act if your application uses dependencies that are compiled to work on the previous execution environment. Otherwise, you can continue to deploy new and updated Lambda functions without needing to perform any other testing steps. For those who aren’t sure if their functions use such dependencies, we encourage you to do a new deployment of your functions and to test their functionality.

There are two options for when you can start testing your functions on the new execution environment:

  • You can begin testing today using the opt-in mechanism described later.
  • Starting May 21, a new deploy or update of your functions uses the new execution environment.

If you confirm that your functions would be affected by the new execution environment, you can begin re-compiling or building your dependencies using the new reference AMI for the execution environment today and then repeat the testing. The final step is to redeploy your applications any time after May 21 to use the new execution environment.

Building your dependencies and application for the new execution environment

Because we are basing the environment off of an existing Amazon Linux AMI, you can start with building and testing your code against that AMI on EC2. With an updated EC2 instance running this AMI, you can compile and build your packages using your normal processes. For the list of AMI IDs in all public Regions, check the release notes. To start an EC2 instance running this AMI, follow the steps in the Launching an Instance Using the Launch Instance Wizard topic in the Amazon EC2 User Guide.

Opt-in/Delayed-Update with Lambda layers

Some of you may want to begin testing as soon as you’ve read this announcement, while others may know that they should postpone updating until later in the timeline.

To give you some control over testing, we’re releasing two special Lambda layers. Lambda layers can be used to provide shared resources, code, or data between Lambda functions and can simplify the deployment and update process. These layers don’t actually contain any data or code. Instead, they act as a special flag to Lambda to run your function executions either specifically on the new or old execution environment.

The Opt-In layer allows you to start testing today. You can use the Delayed-Update layer when you know that you must make updates to your function or its configuration after May 21, but aren’t ready to deploy to the new execution environment. The Delayed-Update layer extends the initial period available to you to deploy your functions by one week until the end of June 17, without changing the execution environment.

Neither layer brings any performance or runtime changes beyond this. After June 24, the layers will have no functionality. In a future deployment, you should remove them from any function configurations.

The ARNs for the two scenarios:

  • To OPT-IN to the update to the new execution environment, add the following layer:


  • To DELAY THE UPDATE to the new execution environment until June 18, add the following layer:


The action for adding a layer to your existing functions requires an update to the Lambda function’s configuration. You can do this with the AWS CLI, AWS CloudFormation or AWS SAM, popular third-party frameworks, the AWS Management Console, or an AWS SDK.
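One detail worth noting when scripting this: Lambda’s UpdateFunctionConfiguration call replaces the function’s entire layers list rather than appending to it, so you must carry over any layers the function already uses. A minimal sketch of that merge logic (the ARNs below are placeholders, not the real layer ARNs; with boto3 you would pass the resulting list as the Layers parameter):

```python
def with_layer(existing_layers, layer_arn):
    """Return a layers list that includes layer_arn exactly once.
    UpdateFunctionConfiguration replaces the whole Layers list,
    so existing layers must be carried over explicitly."""
    if layer_arn in existing_layers:
        return list(existing_layers)
    return list(existing_layers) + [layer_arn]

# hypothetical ARNs for illustration only
current = ["arn:aws:lambda:us-east-1:123456789012:layer:my-deps:4"]
opt_in = "arn:aws:lambda:::awslayer:OptInLayerPlaceholder"
print(with_layer(current, opt_in))
```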

Validating your functions

There are several ways for you to test your function code and ensure that it works after the execution environment has been updated.

Local testing

We’re providing an update to the AWS SAM CLI to enable you to test your functions locally against this new execution environment. The AWS SAM CLI uses a Docker image that mirrors the live Lambda environment locally wherever you do development. To test against this new update, make sure that you have the most recent update to AWS SAM CLI version 0.16.0. You also should have an AWS SAM template configured for your function.

  1. Install or update the AWS SAM CLI. If it is already installed, upgrade it:
    $ pip install --upgrade aws-sam-cli

    Otherwise, install it:
    $ pip install aws-sam-cli
  2. Confirm that you have a valid AWS SAM template:
    $ sam validate -t <template file name>

    If you don’t have a valid AWS SAM template, you can begin with a basic template to test your functions. The following example represents the basic needs for running your function. The Runtime value must be listed in the AWS Lambda Runtimes topic.

    AWSTemplateFormatVersion: 2010-09-09
    Transform: 'AWS::Serverless-2016-10-31'
    Resources:
      myFunction:
        Type: 'AWS::Serverless::Function'
        Properties:
          CodeUri: ./
          Handler: YOUR_HANDLER
          Runtime: YOUR_RUNTIME
  3. With a valid template, you can begin testing your function with mock event payloads. To generate a mock event payload, you can use the AWS SAM CLI local generate-event command. Here is an example of that command being run to generate an Amazon S3 notification type of event:
    sam local generate-event s3 put --bucket munns-test --key somephoto.jpeg
    {
      "Records": [
        {
          "eventVersion": "2.0",
          "eventTime": "1970-01-01T00:00:00.000Z",
          "requestParameters": {
            "sourceIPAddress": ""
          },
          "s3": {
            "configurationId": "testConfigRule",
            "object": {
              "eTag": "0123456789abcdef0123456789abcdef",
              "sequencer": "0A1B2C3D4E5F678901",
              "key": "somephoto.jpeg",
              "size": 1024
            },
            "bucket": {
              "arn": "arn:aws:s3:::munns-test",
              "name": "munns-test",
              "ownerIdentity": {
                "principalId": "EXAMPLE"
              }
            },
            "s3SchemaVersion": "1.0"
          },
          "responseElements": {
            "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH",
            "x-amz-request-id": "EXAMPLE123456789"
          },
          "awsRegion": "us-east-1",
          "eventName": "ObjectCreated:Put",
          "userIdentity": {
            "principalId": "EXAMPLE"
          },
          "eventSource": "aws:s3"
        }
      ]
    }

    You can then use the AWS SAM CLI local invoke command and pipe in the output from the previous command. Or, you can save the output from the previous command to a file and then pass in a reference to the file’s name and location with the -e flag. Here is an example of the pipe event method:

    sam local generate-event s3 put --bucket munns-test --key somephoto.jpeg | sam local invoke myFunction
    2019-02-19 18:45:53 Reading invoke payload from stdin (you can also pass it from file with --event)
    2019-02-19 18:45:53 Found credentials in shared credentials file: ~/.aws/credentials
    2019-02-19 18:45:53 Invoking index.handler (python2.7)
    Fetching lambci/lambda:python2.7 Docker container image......
    2019-02-19 18:45:53 Mounting /home/ec2-user/environment/forblog as /var/task:ro inside runtime container
    START RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34 Version: $LATEST
    {"Records": [{"eventVersion": "2.0", "eventTime": "1970-01-01T00:00:00.000Z", "requestParameters": {"sourceIPAddress": ""}, "s3": {"configurationId": "testConfigRule", "object": {"eTag": "0123456789abcdef0123456789abcdef", "key": "somephoto.jpeg", "sequencer": "0A1B2C3D4E5F678901", "size": 1024}, "bucket": {"ownerIdentity": {"principalId": "EXAMPLE"}, "name": "munns-test", "arn": "arn:aws:s3:::munns-test"}, "s3SchemaVersion": "1.0"}, "responseElements": {"x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH", "x-amz-request-id": "EXAMPLE123456789"}, "awsRegion": "us-east-1", "eventName": "ObjectCreated:Put", "userIdentity": {"principalId": "EXAMPLE"}, "eventSource": "aws:s3"}]}
    END RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34
    REPORT RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34 Duration: 1 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 14 MB
    "Success! Parsed Events"

    You can see the full output of your function in the logs that follow the invoke command. In this example, the Python function prints out the event payload and then exits.

With the AWS SAM CLI, you can pass in valid test payloads that interface with data in other AWS services. You can also have your Lambda function talk to other AWS resources that exist in your account, for example Amazon DynamoDB tables, Amazon S3 buckets, and so on. You could also test an API interface using the local start-api command, provided that you have configured your AWS SAM template with events of the API type. Follow the full instructions for setting up and configuring the AWS SAM CLI in Installing the AWS SAM CLI. Find the full syntax guide for AWS SAM templates in the AWS Serverless Application Model documentation.

Testing in the Lambda console

After you have deployed your functions, either after the start of the Update/Create phase or with the Opt-In layer added, test your functions in the Lambda console.

  1. In the Lambda console, select the function to test.
  2. Select a test event and choose Test.
  3. If no test event exists, choose Configure test events.
    1. Choose Event template and select the relevant invocation service from which to test.
    2. Name the test event.
    3. Modify the event payload for your specific function.
    4. Choose Create and then return to step 2.

The results from the test are displayed.


With Lambda and Lambda@Edge, AWS has allowed developers to focus on just application code without needing to think about the work involved in managing the actual servers that run the code. We believe that the mechanisms provided and processes described in this post allow you to easily test and update your functions for this new execution environment.

Some of you may have questions about this process, and we are ready to help you. Please contact us through AWS Support, the AWS forums, or AWS account teams.

Creating an AWS Batch environment for mixed CPU and GPU genomics workflows

Post Syndicated from Chris Munns original

This post is courtesy of Lee Pang – AWS Technical Business Development 

I recently worked with a customer who needed to process a bunch of raw sequence files (FastQs) into Hi-C format (*.hic), which is used for the structural analysis of DNA/chromatin loops and sequence accessibility. The tooling they were interested in using was the Juicer suite and they needed a minimal workflow:

  • Align the sample to the reference using the juicer CLI utility.
  • Annotate loops using the HiCCUPS algorithm from the juicer-tools library.

Because they had many files to process, they wanted to do this as scalably as possible. The juicer step of the workflow was CPU and memory-intensive, while the HiCCUPS step needed GPU acceleration. So, they were interested in using AWS Step Functions and AWS Batch.

Since its launch, AWS Batch has enabled you to create scalable compute environments for processing a mixture of CPU- and memory-intensive jobs. This covers the needs of the majority of genomics workflows. So, how do you create a genomics workflow environment using AWS Batch that also includes using GPUs? Thankfully, AWS Batch recently announced support for GPU resources!

In this post, I show you how to use these new features to execute mixed CPU and GPU genomics workflows. By the end of this post, you will be able to build the architecture shown below.

Configuring AWS Batch for running CPU and GPU jobs

To handle a mixture of CPU and GPU jobs, the recommended strategy is to create multiple compute environments and job queues:

  • GPU-only resources
    • Compute environments (using the ECS GPU-optimized AMI):
      • Spot Instances of the P2 and P3 instance families
      • On-Demand Instances of the P2 and P3 instance families
    • A job queue for the GPU compute environments
  • CPU-only resources
    • Compute environments (using the default ECS-optimized AMI):
      • Spot Instances of the “optimal” instance family
      • On-Demand Instances of the “optimal” instance family
    • A job queue for the CPU compute environments

With the above in place, you then point CPU jobs to the “CPU” queue and GPU jobs to the “GPU” queue.

Notice that CPU and GPU resources are kept separate with queues and compute environments. I don’t recommend creating a compute environment or job queue that mixes CPU and GPU optimized instance types. In a mixed compute environment or queue, there is a chance that CPU jobs could be placed on GPU instances when no GPU jobs are scheduled. This could result in few (or no) GPU instances available when GPU jobs must be run.
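The routing rule that falls out of this two-queue design is simple and can be sketched as follows (our own illustration; `job` here is a dict shaped like a Batch job definition's resourceRequirements, which is an assumption for this sketch):

```python
def queue_for(job):
    """Route a job to the 'gpu' queue only if it actually requests
    GPU resources; everything else goes to the 'cpu' queue."""
    reqs = job.get("resourceRequirements", [])
    needs_gpu = any(r.get("type") == "GPU" and int(r.get("value", 0)) > 0
                    for r in reqs)
    return "gpu" if needs_gpu else "cpu"

print(queue_for({"resourceRequirements": [{"type": "GPU", "value": "1"}]}))
print(queue_for({}))
```

Keeping this decision explicit in your submission code prevents CPU-only work from ever occupying a GPU instance.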

Create a GPU compute environment

Amazon EC2 has a wide variety of instance types, including the P3 family of instances that enable GPU-accelerated computing. Previously, the best option for using GPUs in AWS Batch was to create a compute environment based on the publicly available Deep Learning AMI used by AI/ML services such as Amazon SageMaker, which has support for running containers and comes with NVIDIA/CUDA drivers pre-installed.

Earlier this year, the Amazon ECS team announced the availability of the ECS GPU-optimized AMI. It’s essentially the same as the existing Amazon ECS-optimized Amazon Linux 2 AMI but with pre-installed capabilities to provide Docker containers with access to GPU acceleration. For more information about what’s included, see Amazon ECS-optimized AMIs.

The key point is that this is a much more lightweight solution, which makes creating GPU-specific AWS Batch compute environments much easier.

Creating an AWS Batch compute environment specifically for GPU jobs is similar to creating one for CPU jobs. The key difference is that you select only GPU instance families for the instance types.

For compute environments that use the P2, P3, and P3dn instance families, AWS Batch automatically associates the ECS GPU-optimized AMI. You don’t have to create a custom AMI for GPU jobs to run on these instances.

The G2 and G3 families use a different type of GPU and have different drivers. Compute environments that use G2 and G3 families need a custom AMI to take advantage of acceleration. Otherwise, they default to the ECS-optimized AMI.
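The AMI-selection behavior described in the last two paragraphs can be summarized in a small lookup sketch (our own illustration of the rules above, not an AWS API):

```python
def default_ami_for(instance_family):
    """Which AMI AWS Batch associates by default for a compute
    environment, per the rules described above."""
    if instance_family in ("p2", "p3", "p3dn"):
        # Batch automatically associates the ECS GPU-optimized AMI
        return "ECS GPU-optimized AMI"
    if instance_family in ("g2", "g3"):
        # Defaults to the ECS-optimized AMI, which lacks the drivers
        # these GPUs need; a custom AMI is required for acceleration
        return "ECS-optimized AMI (custom AMI required for acceleration)"
    return "ECS-optimized AMI"

print(default_ami_for("p3"))
```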

To create a GPU-enabled compute environment with the AWS CLI, create a file called gpu-ce.json with the following contents:

{
    "computeEnvironmentName": "gpu",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole",
    "computeResources": {
        "type": "EC2",
        "subnets": [
            "<subnet-id>"
        ],
        "tags": {
            "Name": "batch-gpu-worker"
        },
        "desiredvCpus": 0,
        "minvCpus": 0,
        "instanceTypes": [
            "p2",
            "p3"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "maxvCpus": 256,
        "securityGroupIds": [
            "<security-group-id>"
        ],
        "ec2KeyPair": "<keypair-name>"
    }
}

From the command line, run the following:

aws batch create-compute-environment --cli-input-json file://gpu-ce.json

Create a GPU job queue

When you have a GPU compute environment, you can associate it with a dedicated GPU job queue.

From the command line, run the following:

aws batch create-job-queue \
    --job-queue-name gpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=gpu

Specifying GPU resources in AWS Batch jobs

Creating job definitions in AWS Batch has not changed much. However, now there’s an additional field under Resource requirements that lets you specify how many GPUs the job should use.

The JSON for registering a job definition with GPU requirements looks like the following:

{
    "jobDefinitionName": "hiccups",
    "type": "container",
    "parameters": {
        "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003",
        "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
    },
    "containerProperties": {
        "mountPoints": [],
        "image": "<docker-image-repository>/juicer-tools:latest",
        "environment": [],
        "vcpus": 8,
        "command": [],

        /* BEGIN NEW STUFF (delete comment before use) */
        "resourceRequirements": [
            {
                "type": "GPU",
                "value": "1"
            }
        ],
        /* END NEW STUFF (delete comment before use) */

        "volumes": [],
        "memory": 60000,
        "ulimits": []
    }
}

Containerization considerations

To ensure that your containerized task can use GPU acceleration, use nvidia/cuda base images when building the container image for your job. For example, for a CentOS-based image with CUDA 9.2, your Dockerfile should have the following:

FROM nvidia/cuda:9.2-devel-centos7

This could be at the top if you are building the container entirely from scratch, or later if you are using a multi-stage build.

In this case, I already had a CentOS-based image for the juicer utility that I could recycle for juicer-tools. So my Dockerfile looked like the following:

FROM juicer AS base
FROM nvidia/cuda:9.2-devel-centos7

COPY --from=base /opt/juicer /opt/juicer

RUN yum install -y awscli
RUN yum install -y java-1.8.0-openjdk

WORKDIR /opt/juicer/scripts

WORKDIR /opt/juicer/work
ENTRYPOINT ["/opt/juicer/scripts/"]

Running a mixed CPU / GPU workflow

To run a workflow that contains a mixture of CPU– and GPU-based jobs, you can use AWS Step Functions. This blog channel previously covered how Step Functions and AWS Batch can be combined to run genomics workflows on AWS. Also, Step Functions and AWS Batch are now more tightly integrated, making scalable workflow solutions much easier to build.

With all the AWS Batch resources in place, it is a matter of pointing each task at the job queue with the right compute resources.

The solution I put together for the customer with the Juicer suite resulted in the following state machine:

{
    "Comment": "State machine for FASTQ to HIC with annotation using juicer and hiccups",
    "StartAt": "JuicerTask",
    "States": {
        "JuicerTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.juicer.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
                "JobName": "juicer",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
                "Parameters.$": "$.juicer.parameters"
            },
            "Next": "HiccupsTask"
        },
        "HiccupsTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.hiccups.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer-tools:2",
                "JobName": "hiccups",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/gpu",
                "Parameters.$": "$.hiccups.parameters"
            },
            "End": true
        }
    }
}

JuicerTask is submitted to the cpu job queue, while HiccupsTask is submitted to the gpu job queue.
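Because both task states share the same shape, generating them programmatically is straightforward. A sketch (our own helper, with placeholder ARNs mirroring the state machine above):

```python
def batch_task_state(job_name, job_queue, job_definition, next_state=None):
    """Build a Step Functions task state that submits an AWS Batch job
    and waits for it to finish (the submitJob.sync integration pattern)."""
    state = {
        "Type": "Task",
        "InputPath": "$",
        "ResultPath": "$.{}.status".format(job_name),
        "Resource": "arn:aws:states:::batch:submitJob.sync",
        "Parameters": {
            "JobDefinition": job_definition,
            "JobName": job_name,
            "JobQueue": job_queue,
            "Parameters.$": "$.{}.parameters".format(job_name),
        },
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state

# hypothetical ARNs for illustration
juicer = batch_task_state(
    "juicer",
    "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
    "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
    next_state="HiccupsTask",
)
print(juicer["ResultPath"])
```

A helper like this keeps the CPU/GPU queue choice down to a single argument per task.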


With the ECS GPU-optimized AMI and AWS Batch support for GPU resources, you can easily build a scalable solution for running genomics workflows with both CPU and GPU resources.

Build on!

From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets

Post Syndicated from Chris Munns original

This post is courtesy of Adam Westrich – AWS Principal Solutions Architect and Ronan Prenty – Cloud Support Engineer

Want to deploy a web application and give a large number of users controlled access to data analytics? Or maybe you have a retail site that is fulfilling purchase orders, or an app that enables users to connect data from potentially long-running third-party data sources. Similar use cases exist across every imaginable industry and entail sending long-running requests to perform a subsequent action. For example, a healthcare company might build a custom web portal for physicians to connect and retrieve key metrics from patient visits.

This post is aimed at optimizing the delivery of information without needing to poll an endpoint. First, we outline the current challenges with the consumer polling pattern and alternative approaches to solve these information delivery challenges. We then show you how to build and deploy a solution in your own AWS environment.

Here is a glimpse of the sample app that you can deploy in your own AWS environment:

What’s the problem with polling?

Many customers need to implement the delivery of long-running activities (such as a query to a data warehouse or data lake, or retail order fulfillment). They may have developed a polling solution that looks similar to the following:

  1. POST sends a request.
  2. GET returns an empty response.
  3. Another… GET returns an empty response.
  4. Yet another… GET returns an empty response.
  5. Finally, GET returns the data for which you were looking.
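To see why the empty polls add up, here is a minimal, hypothetical simulation of that client-side polling loop. The backend stub, poll interval, and readiness time are all illustrative, not part of any real service:

```python
import time

def poll_until_ready(fetch, interval_s=1.0, max_attempts=10):
    """Repeatedly call fetch() until it returns data or attempts run out.

    Returns (result, empty_polls) so you can see how many wasted
    round trips the pattern generates before data arrives.
    """
    empty_polls = 0
    for _ in range(max_attempts):
        result = fetch()
        if result is not None:
            return result, empty_polls
        empty_polls += 1
        time.sleep(interval_s)
    raise TimeoutError("result was not ready after %d polls" % max_attempts)

# Simulate a backend whose result is ready only on the 5th request.
calls = {"n": 0}
def fake_fetch():
    calls["n"] += 1
    return "report.csv" if calls["n"] >= 5 else None

result, wasted = poll_until_ready(fake_fetch, interval_s=0.0)
print(result, wasted)  # → report.csv 4
```

Four of the five requests do nothing but consume compute, bandwidth, and battery; the push approaches described next eliminate exactly those wasted round trips.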

The challenges of traditional polling methods

  • Unnecessary chattiness and cost due to polling for result sets—Every time your frontend polls an API, it’s adding costs by leveraging infrastructure to compute the result. Empty polls are just wasteful!
  • Hampered mobile battery life—One of the top contributors to battery drain is excessive polling by apps. Make sure that your app doesn’t land on your users’ Top App Battery Usage list of shame, which could result in deletion.
  • Delayed data arrival due to polling schedule—Some approaches to polling include an incremental backoff to limit the number of empty polls. This sometimes results in a delay between data readiness and data arrival.

But what about long polling?

  • User request deadlocks can hamper application performance—Long synchronous user responses can lead to unexpected user wait times or UI deadlocks, which can affect mobile devices especially.
  • Memory leaks and consumption could bring your app down—Keeping long-running queries open may overburden your backend and create failure scenarios, which may bring down your app.
  • HTTP default timeouts across browsers may result in inconsistent client experience—These timeouts vary across browsers, and can lead to an inconsistent experience across your end users. Depending on the size and complexity of the requests, processing can last longer than many of these timeouts and take minutes to return results.

Instead, create an event-driven architecture and move your APIs from poll to push.

Asynchronous push model

To create optimal user experiences, frontend developers often strive to build progressive and reactive interfaces. Users can interact with frontend objects (for example, push buttons) with little lag when sending requests and receiving data. But frontend developers also want users to receive timely data, without requiring additional user actions or performing unnecessary processing.

The birth of microservices and the cloud over the past several years has enabled frontend developers and backend service designers to think about these problems in an asynchronous manner. This enables the virtually unlimited resources of the cloud to choreograph data processing. It also enables clients to benefit from progressive and reactive user experiences.

This is a fresh alternative to the synchronous design pattern, which often relies on client consumers to act as the conductor for all user requests. The following diagram compares the flow of communication between the two patterns.

Asynchronous orchestration can be easily choreographed using the workflow definition, monitoring, and tracking console with AWS Step Functions state machines. Break up services into functions with AWS Lambda and track executions within a state diagram, like the following:

With this approach, consumers send a request, which initiates downstream distributed processing. The subsequent steps are invoked according to the state machine workflow and each execution can be monitored.

Okay, but how does the consumer get the result of what they’re looking for?

Multiple approaches to this problem

There are different approaches for consumers to retrieve their resulting data. In order to create the optimal solution, there are several questions that a service owner may need to ask.

Is there known trust with the client (such as another microservice)?

If the answer is Yes, one approach is to use Amazon SNS. That way, consumers can subscribe to topics and have data delivered using email, SMS, or even HTTP/HTTPS as an event subscriber (that is, webhooks).

With webhooks, the consumer creates an endpoint where the service provider can issue a callback using a small amount of consumer-side resources. The consumer is waiting for an incoming request to facilitate downstream processing.

  1. Send a request.
  2. Subscribe the consumer to the topic.
  3. Open the endpoint.
  4. SNS sends the POST request to the endpoint.
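On the consumer side, the webhook endpoint must handle two kinds of SNS messages: the initial subscription confirmation (which carries a SubscribeURL to visit) and subsequent notifications. The following is a minimal sketch of that dispatch logic; the handler callables are illustrative stand-ins, and a production endpoint would also verify the SNS message signature:

```python
import json

def handle_sns_post(body, confirm, process):
    """Dispatch an incoming SNS webhook POST body.

    confirm(url) should visit the SubscribeURL to confirm the
    subscription; process(message) handles a delivered notification.
    """
    msg = json.loads(body)
    if msg.get("Type") == "SubscriptionConfirmation":
        confirm(msg["SubscribeURL"])
        return "confirmed"
    if msg.get("Type") == "Notification":
        process(msg["Message"])
        return "processed"
    return "ignored"

# Illustrative usage with stubbed handlers:
seen = []
handle_sns_post(
    json.dumps({"Type": "Notification", "Message": "job 42 complete"}),
    confirm=seen.append, process=seen.append)
print(seen)  # → ['job 42 complete']
```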

If the trust answer is No, then clients may not be able to open a lightweight HTTP webhook endpoint after sending the initial request, so you cannot rely on a callback reaching them. In that case, consider an alternative framework.

Are you restricted to only using a REST protocol?

Some applications can only run HTTP requests. These scenarios include the age of the technology or browser for end users, and potential security requirements that may block other protocols.

In these scenarios, customers may use a GET method as a best practice, but we still advise avoiding polling. Some design questions to ask: Does data readiness happen at a predefined time or duration interval? Can the user experience tolerate some delay between data readiness and data arrival?

If the answer to both of these is Yes, then consider sending a single, delayed GET call to your RESTful API. For example, if a job averages 10 minutes, make your GET call 10 minutes after the submission. Sounds simple, right? Much simpler than polling.

Should I use GraphQL, a WebSocket API, or another framework?

Each framework has tradeoffs.

If you want a more flexible query schema, you may gravitate to GraphQL, which follows a “data-driven UI” approach. If data drives the UI, then GraphQL may be the best solution for your use case.

AWS AppSync is a serverless GraphQL engine that supports the heavy lifting of these constructs. It offers functionality such as AWS service integration, offline data synchronization, and conflict resolution. With GraphQL, there’s a construct called Subscriptions where clients can subscribe to event-based subscriptions upon a data change.

Amazon API Gateway makes it easy for developers to deploy secure APIs at scale. With the recent introduction of API Gateway WebSocket APIs, web, mobile clients, and backend services can communicate over an established WebSocket connection. This also allows clients to be more reactive to data updates and only do work one time after an update has been received over the WebSocket connection.

The typical frontend design approach is to create a UI component that is updated when the results of the given procedure are complete. This avoids a complete webpage refresh and improves the customer’s user experience.

Because many companies have elected to use the REST framework for creating API-driven tightly bound service contracts, a RESTful interface can be used to validate and receive the request. It can provide a status endpoint used to deliver status. Also, it provides additional flexibility in delivering the result to the variety of clients, along with the WebSocket API.

Poll-to-push solution with API Gateway

Imagine a scenario where you want to be updated as soon as the data is created. Instead of the traditional polling methods described earlier, use an API Gateway WebSocket API. That pushes new data to the client as it’s created, so that it can be rendered on the client UI.

Alternatively, a WebSocket server can be deployed on Amazon EC2. With this approach, your server is always running to accept and maintain new connections. In addition, you manage the scaling of the instance manually at times of high demand.

By using an API Gateway WebSocket API in front of Lambda, you don’t need a machine to stay always on, eating away your project budget. API Gateway handles connections and invokes Lambda whenever there’s a new event. Scaling is handled on the service side. To update our connected clients from the backend, we can use the API Gateway callback URL. The AWS SDKs make communication from the backend easy. As an example, see the Boto3 sample of post_to_connection:

import boto3
# Use a Lambda layer or deployment package to
# include a boto3 version that supports the
# apigatewaymanagementapi client.
apiManagement = boto3.client('apigatewaymanagementapi',
                             region_name={{api_region}},
                             endpoint_url={{api_url}})
response = apiManagement.post_to_connection(Data={{message}},
                                            ConnectionId={{connectionId}})

Solution example

To create this more optimized solution, create a simple web application that enables a user to make a request against a large dataset. It returns the results in a flat file over WebSocket. In this case, we’re querying, via Amazon Athena, a data lake on S3 populated with AWS Twitter sentiment (Twitter: @awscloud or #awsreinvent). However, this solution could apply to any data store, data mart, or data warehouse environment, or the long-running return of data for a response.

For the frontend architecture, create this web application using a JavaScript framework (such as jQuery). Use API Gateway to accept a REST API for the data and then open a WebSocket connection on the client to facilitate the return of results, as shown in the following architecture diagram:

  1. The client sends a REST request to API Gateway. This invokes a Lambda function that starts the Step Functions state machine execution. The function also returns a task token for the open connection activity worker. The function then returns the execution ARN and task token to the client.
  2. Using the data returned by the REST request, the client connects to the WebSocket API and sends the task token to the WebSocket connection.
  3. The WebSocket notifies the Step Functions state machine that the client is connected. The Lambda function completes the OpenConnection task through validating the task token and sending a success message.
  4. After RunAthenaQuery and OpenConnection are successful, the state machine updates the connected client over the WebSocket API that their long-running job is complete. It uses the REST API call post_to_connection.
  5. The client receives the update over their WebSocket connection using the IssueCallback Lambda function, with the callback URL from the API Gateway WebSocket API.

In this example application, the data response is an S3 presigned URL composed of results from Athena. You can then configure your frontend to use that S3 link to download to the client.
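The OpenConnection step described above relies on the Step Functions callback (task token) pattern. A simplified, illustrative Amazon States Language definition for such a step might look like the following; the function name, timeout, and state names are placeholders rather than the sample app’s exact configuration:

```json
{
  "OpenConnection": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "OpenConnectionFunction",
      "Payload": {
        "taskToken.$": "$$.Task.Token"
      }
    },
    "TimeoutSeconds": 300,
    "Next": "IssueCallback"
  }
}
```

The `.waitForTaskToken` suffix pauses the execution at this state until a worker calls SendTaskSuccess (or SendTaskFailure) with the token injected via `$$.Task.Token`, which is how the connected WebSocket client signals that it is ready.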

Why not just open the WebSocket API for the request?

While this approach can work, we advise against it for this use case. Break the required interfaces for this use case into three processes:

  1. Submit a request.
  2. Get the status of the request (to detect failure).
  3. Deliver the results.

For the submit request interface, RESTful APIs have strong controls to ensure that user requests for data are validated and given guardrails for the intended query purpose. This helps prevent rogue requests and unforeseen impacts to the analytics environment, especially when exposed to a large number of users.

With this example solution, you’re restricting the data requests to specific countries in the frontend JavaScript. Using the RESTful POST method for API requests enables you to validate data as query string parameters, such as the following:


API Gateway models and mapping templates can also be used to validate or transform request payloads at the API layer before they hit the backend. Use models to ensure that the structure or contents of the client payload are the same as you expect. Use mapping templates to transform the payload sent by clients to another format, to be processed by the backend.
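For example, a request model for the hypothetical country-restricted query above might look like the following JSON schema (API Gateway models use JSON Schema draft-04; the field name and allowed values here are illustrative):

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "QueryRequest",
  "type": "object",
  "properties": {
    "country": {
      "type": "string",
      "enum": ["US", "IE", "DE"]
    }
  },
  "required": ["country"]
}
```

With request validation enabled, API Gateway rejects payloads that fail this schema before any backend compute is invoked.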

This REST validation framework can also be used to detect header information on WebSocket browser compatibility (external site). While using WebSocket has many advantages, not all browsers support it, especially older browsers. Therefore, a REST API request layer can pass this browser metadata and determine whether a WebSocket API can be opened.

Because a REST interface is already created for submitting the request, you can easily add another GET method if the client must query the status of the Step Functions state machine. That might be the case if a health check in the request is taking longer than expected. You can also add another GET method as an alternative access method for REST-only compatible clients.

If low-latency request and retrieval are an important characteristic of your API and there aren’t any browser-compatibility risks, use a WebSocket API with JSON model selection expressions to protect your backend with a schema.

In the spirit of picking the best tool for the job, use a REST API for the request layer and a WebSocket API to listen for the result.

This solution, although secure, is an example for how a non-polling solution can work on AWS. At scale, it may require refactoring due to the cross-talk at high concurrency that may result in client resubmissions.

To discover, deploy, and extend this solution into your own AWS environment, follow the PollToPush instructions in the AWS Serverless Application Repository.


When application consumers poll for long-running tasks, it can be a wasteful, detrimental, and costly use of resources. This post outlined multiple ways to refactor the polling method. Use API Gateway to host a RESTful interface, Step Functions to orchestrate your workflow, Lambda to perform backend processing, and an API Gateway WebSocket API to push results to your clients.

Deploying a personalized API Gateway serverless developer portal

Post Syndicated from Chris Munns original

This post is courtesy of Drew Dresser, Application Architect – AWS Professional Services

Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. Customers of these APIs often want a website to learn and discover APIs that are available to them. These customers might include front-end developers, third-party customers, or internal system engineers. To produce such a website, we have created the API Gateway serverless developer portal.

The API Gateway serverless developer portal (developer portal or portal, for short) is an application that you use to make your API Gateway APIs available to your customers by enabling self-service discovery of those APIs. Your customers can use the developer portal to browse API documentation, register for, and immediately receive their own API key that they can use to build applications, test published APIs, and monitor their own API usage.

Over the past few months, the team has been hard at work contributing to the open source project, available on GitHub. The developer portal was relaunched on October 29, 2018, and the team will continue to push features and take customer feedback from the open source community. Today, we’re happy to highlight some key functionality of the new portal.

Benefits for API publishers

API publishers use the developer portal to expose the APIs that they manage. As an API publisher, you need to set up, maintain, and enable the developer portal. The new portal has the following benefits:

Benefits for API consumers

API Consumers use the developer portal as a traditional application user. An API consumer needs to understand the APIs being published. API consumers might be front-end developers, distributed system engineers, or third-party customers. The new developer portal comes with the following benefits for API consumers:

  • Explore – API consumers can quickly page through lists of APIs. When they find one they’re interested in, they can immediately see documentation on that API.
  • Learn – API consumers might need to drill down deeper into an API to learn its details. They want to learn how to form requests and what they can expect as a response.
  • Test – Through the developer portal, API consumers can get an API key and invoke the APIs directly. This enables developers to develop faster and with more confidence.


The developer portal is a completely serverless application. It leverages Amazon API Gateway, Amazon Cognito User Pools, AWS Lambda, Amazon DynamoDB, and Amazon S3. Serverless architectures enable you to build and run applications without needing to provision, scale, and manage any servers. The developer portal is broken down into multiple microservices, each with a distinct responsibility, as shown in the following image.

Identity management for the developer portal is performed by Amazon Cognito and a Lambda function in the Login & Registration microservice. An Amazon Cognito User Pool is configured out of the box to enable users to register and login. Additionally, you can deploy the developer portal to use a UI hosted by Amazon Cognito, which you can customize to match your style and branding.

Requests are routed to static content served from Amazon S3 and built using React. The React app communicates with the Lambda backend via API Gateway. The Lambda function is built using the aws-serverless-express library and contains the business logic behind the APIs. The business logic of the web application queries and adds data to the API Key Creation and Catalog Update microservices.

To maintain the API catalog, the Catalog Update microservice uses an S3 bucket and a Lambda function. When an API’s Swagger file is added or removed from the bucket, the Lambda function triggers and maintains the API catalog by updating the catalog.json file in the root of the S3 bucket.
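The core of that catalog-maintenance Lambda function can be reduced to a pure transformation from the bucket’s object listing to a catalog document. The following is a simplified, hypothetical sketch of that transformation; the real portal’s catalog format is richer, and the key layout shown assumes the `catalog/apiId_stageName.json` convention described later in this post:

```python
import json

def build_catalog(object_keys):
    """Derive a catalog entry for every Swagger file under catalog/.

    Keys are expected to look like 'catalog/apiId_stageName.json';
    anything else in the bucket is ignored.
    """
    apis = []
    for key in object_keys:
        if not (key.startswith("catalog/") and key.endswith(".json")):
            continue
        name = key[len("catalog/"):-len(".json")]
        api_id, _, stage = name.partition("_")
        apis.append({"id": api_id, "stage": stage, "swaggerKey": key})
    return json.dumps({"apis": apis})

print(build_catalog(["catalog/abc123_prod.json", "images/logo.png"]))
```

In the actual microservice, the S3 event notification triggers the Lambda function, which lists the bucket, runs a transformation like this, and writes the result back to `catalog.json`.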

To manage the mapping between API keys and customers, the application uses the API Key Creation microservice. The service updates API Gateway with API key creations or deletions and then stores the results in a DynamoDB table that maps customers to API keys.

Deploying the developer portal

You can deploy the developer portal using AWS SAM, the AWS SAM CLI, or the AWS Serverless Application Repository. To deploy with AWS SAM, you can simply clone the repository and then deploy the application using two commands from your CLI. For detailed instructions for getting started with the portal, see Use the Serverless Developer Portal to Catalog Your API Gateway APIs in the Amazon API Gateway Developer Guide.

Alternatively, you can deploy using the AWS Serverless Application Repository as follows:

  1. Navigate to the api-gateway-dev-portal application and choose Deploy in the top right.
  2. On the Review page, for ArtifactsS3BucketName and DevPortalSiteS3BucketName, enter globally unique names. Both buckets are created for you.
  3. To deploy the application with these settings, choose Deploy.
  4. After the stack is complete, get the developer portal URL by choosing View CloudFormation Stack. Under Outputs, choose the URL in Value.

The URL opens in your browser.

You now have your own serverless developer portal application that is deployed and ready to use.

Publishing a new API

With the developer portal application deployed, you can publish your own API to the portal.

To get started:

  1. Create the PetStore API, which is available as a sample API in Amazon API Gateway. The API must be created and deployed and include a stage.
  2. Create a Usage Plan, which is required so that API consumers can create API keys in the developer portal. The API key is used to test actual API calls.
  3. On the API Gateway console, navigate to the Stages section of your API.
  4. Choose Export.
  5. For Export as Swagger + API Gateway Extensions, choose JSON. Save the file with the following format: apiId_stageName.json.
  6. Upload the file to the S3 bucket dedicated for artifacts in the catalog path. In this post, the bucket is named apigw-dev-portal-artifacts. To perform the upload, run the following command.
    aws s3 cp apiId_stageName.json s3://yourBucketName/catalog/apiId_stageName.json

Uploading the file to the artifacts bucket with a catalog/ key prefix automatically makes it appear in the developer portal.

This might be familiar. It’s your PetStore API documentation displayed in the OpenAPI format.

With an API deployed, you’re ready to customize the portal’s look and feel.

Customizing the developer portal

Adding a customer’s own look and feel to the developer portal is easy, and it creates a great user experience. You can customize the domain name, text, logo, and styling. For a more thorough walkthrough of customizable components, see Customization in the GitHub project.

Let’s walk through a few customizations to make your developer portal more familiar to your API consumers.

Customizing the logo and images

To customize logos, images, or content, you need to modify the contents of the your-prefix-portal-static-assets S3 bucket. You can edit files using the CLI or the AWS Management Console.

Start customizing the portal by using the console to upload a new logo in the navigation bar.

  1. Upload the new logo to your bucket with a key named custom-content/nav-logo.png.
    aws s3 cp {myLogo}.png s3://yourPrefix-portal-static-assets/custom-content/nav-logo.png
  2. Modify object permissions so that the file is readable by everyone because it’s a publicly available image. The new navigation bar looks something like this:

Another neat customization that you can make is to a particular API and stage image. Maybe you want your PetStore API to have a dog picture to represent the friendliness of the API. To add an image:

  1. Use the command line to copy the image directly to the S3 bucket location.
    aws s3 cp my-image.png s3://yourPrefix-portal-static-assets/custom-content/api-logos/apiId-stageName.png
  2. Modify object permissions so that the file is readable by everyone.

Customizing the text

Next, make sure that the text of the developer portal welcomes your pet-friendly customer base. The YAML files in the static assets bucket under /custom-content/content-fragments/ determine the portal’s text content.

To edit the text:

  1. On the AWS Management Console, navigate to the website content S3 bucket and then navigate to /custom-content/content-fragments/.
  2. Three files control the portal’s text: one holds the content displayed on the home page, one controls the tab text on the navigation bar, and one contains the content of the Getting Started tab. All three files are written in markdown. Download the file that you want to change to your local machine so that you can edit its contents. The following image shows a file edited to contain custom text:
  3. After editing and saving the file, upload it back to S3, which results in a customized home page. The following image reflects the configuration changes from the previous step:

Customizing the domain name

Finally, many customers want to give the portal a domain name that they own and control.

To customize the domain name:

  1. Use AWS Certificate Manager to request and verify a managed certificate for your custom domain name. For more information, see Request a Public Certificate in the AWS Certificate Manager User Guide.
  2. Copy the Amazon Resource Name (ARN) so that you can pass it to the developer portal deployment process. That process now includes the certificate ARN and a property named UseRoute53Nameservers. If the property is set to true, the template creates a hosted zone and record set in Amazon Route 53 for you. If the property is set to false, the template expects you to use your own name server hosting.
  3. If you deployed using the AWS Serverless Application Repository, navigate to the Application page and deploy the application along with the certificate ARN.

After the developer portal is deployed and your CNAME record has been added, the website is accessible from the custom domain name as well as the new Amazon CloudFront URL.

Customizing the logo, text content, and domain name are great tools to make the developer portal feel like an internally developed application. In this walkthrough, you completely changed the portal’s appearance to enable developers and API consumers to discover and browse APIs.


The developer portal is available to use right away. Support and feature enhancements are tracked in the public GitHub repository. You can contribute to the project by following the Code of Conduct and Contributing guides. The project is open-sourced under the Amazon Open Source Code of Conduct. We plan to continue to add functionality and listen to customer feedback. We can’t wait to see what customers build with API Gateway and the API Gateway serverless developer portal.