All posts by Chris Munns

Improved failure recovery for Amazon EventBridge

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/improved-failure-recovery-for-amazon-eventbridge/

Today we’re announcing two new capabilities for Amazon EventBridge: dead letter queues and custom retry policies. Both of these give you greater flexibility in how to handle any failures in the processing of events with EventBridge. You can easily enable them on a per-target basis and configure them uniquely for each.

Dead letter queues (DLQs) are a common capability in queuing and messaging systems that allow you to handle failures in event or message receiving systems. They provide a way for failed events or messages to be captured and sent to another system, which can store them for future processing. With DLQs, you can have greater resiliency and improved recovery from any failure that happens.

You can also now configure a custom retry policy that can be set on your event bus targets. Today, there are two attributes that can control how events are retried: maximum number of retries and maximum event age. With these two settings, you could send events to a DLQ sooner and reduce the retries attempted.

For example, this could allow you to recover more quickly if an event bus target is overwhelmed by the number of events received, causing throttling to occur. The events are placed in a DLQ and then processed later.

Failures in event processing

Currently, EventBridge can fail to deliver an event to a target in certain scenarios. Events that fail to be delivered to a target due to client-side errors are dropped immediately. Examples of this are when EventBridge does not have permission to a target AWS service or if the target no longer exists. This can happen if the target resource is misconfigured or is deleted by the resource owner.

For service-side issues, EventBridge retries delivery of events for up to 24 hours. This can happen if the target service is unavailable or the target resource is not provisioned to handle the incoming event traffic and the target service is throttling the requests.

EventBridge failures

Previously, when all attempts to deliver an event to the target were exhausted, EventBridge published a CloudWatch metric indicating a failed target invocation. However, this provides no visibility into which events failed to be delivered and there was no way to recover the event that failed.

Dead letter queues

EventBridge’s DLQs are made possible today with Amazon Simple Queue Service (SQS) standard queues. With SQS, you get all of the benefits of a fully serverless queuing service: no servers to manage, automatic scalability, pay for what you consume, and high availability and security built in. You can configure the DLQs for your EventBridge bus and pay nothing until it is used, if and when a target experiences an issue. This makes it a great practice to follow and standardize on, and provides you with a safety net that’s active only when needed.

Optionally, you could later configure an AWS Lambda function to consume from that DLQ. The function is only invoked when messages exist in the queue, allowing you to maintain a serverless stack to recover from a potential failure.
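
As a rough sketch, you could connect such a function to the DLQ with an SQS event source mapping using the AWS CLI (the function name below is hypothetical; the queue ARN reuses the example shown later in this post):

aws lambda create-event-source-mapping \
  --function-name orders-dlq-redrive \
  --event-source-arn arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq \
  --batch-size 10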

DLQ configured per target

With a DLQ configured, the queue receives the failed event as a message, along with important metadata that you can use to troubleshoot the issue. This can include: Error Code, Error Message, Exhausted Retry Condition, Retry Attempts, Rule ARN, and the Target ARN.

You can use this data to more easily troubleshoot what went wrong with the original delivery attempt and take action to resolve or prevent such failures in the future. You could also use the information such as Exhausted Retry Condition and Retry Attempts to further tweak your custom retry policy.

You can configure a DLQ when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.

In the console, select the queue to be used for your DLQ configuration from the drop-down as shown here:

DLQ configuration

When configured via API, AWS CLI, or IaC tools, you must specify the ARN of the queue:

arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq

When you configure a DLQ, the target SQS queue requires a resource-based policy that grants EventBridge access. One is created and applied automatically via the console when you create or update an EventBridge rule with a DLQ that exists in your own account.

For any queues created in other accounts, or via API, AWS CLI, or IaC tools, you must add a resource-based policy that grants the EventBridge service the sqs:SendMessage permission, scoped to the EventBridge rule ARN, as shown below:

{
  "Sid": "Dead-letter queue permissions",
  "Effect": "Allow",
  "Principal": {
     "Service": "events.amazonaws.com"
  },
  "Action": "sqs:SendMessage",
  "Resource": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq",
  "Condition": {
    "ArnEquals": {
      "aws:SourceArn": "arn:aws:events:us-east-1:123456789012:rule/MyTestRule"
    }
  }
}

You can read more about setting permissions for DLQs in the documentation for “Granting permissions to the dead-letter queue”.

Once configured, you can monitor CloudWatch metrics for the DLQ. These show the successful delivery of messages via the InvocationsSentToDLQ metric, as well as any failures via the InvocationsFailedToBeSentToDLQ metric. Note that these metrics do not exist if your queue is not considered “active”.
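
For example, assuming these metrics are published in the AWS/Events namespace with a RuleName dimension, a quick check from the AWS CLI might look like the following (the rule name and time range are placeholders):

aws cloudwatch get-metric-statistics \
  --namespace AWS/Events \
  --metric-name InvocationsSentToDLQ \
  --dimensions Name=RuleName,Value=MyTestRule \
  --start-time 2020-10-01T00:00:00Z \
  --end-time 2020-10-02T00:00:00Z \
  --period 3600 \
  --statistics Sum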

Retry policies

By default, EventBridge retries delivery of an event to a target so long as it does not receive a client-side error as described earlier. Retries occur with a back-off, for up to 185 attempts or for up to 24 hours, after which the event is dropped or sent to a DLQ, if configured. Due to the jitter of the back-off and retry process, you may reach the 24-hour limit before reaching 185 retries.

For many workloads, this provides an acceptable way to handle momentary service issues or throttling that might occur. For some, however, this model of back-off and retry can cause increased and ongoing traffic to an already overloaded target system.

For example, consider an Amazon API Gateway target that has a resource constrained backend service behind it.

Constrained target service

Under a consistently high load, the bus could end up generating too many API requests, tripping the API Gateway’s throttling configuration. This would cause API Gateway to respond with throttling errors back to EventBridge.

Throttled API reply

You may decide that allowing the failed events to retry for 24 hours puts too much load on this system, preventing it from recovering properly. This could lead to potential data loss unless a DLQ is configured.

Added DLQ

With a DLQ, you could choose to process these events later, once the overwhelmed target service has recovered.

DLQ drained back to API

Or the events in question may no longer have the same value as they did previously. This can occur in systems where data loss is tolerated but the timeliness of data processing matters. In these situations, the DLQ would have less value and dropping the message is acceptable.

For either of these situations, configuring the maximum number of retries or the maximum age of the event could be useful.

Now with retry policies, you can configure per target the following two attributes:

  • MaximumEventAgeInSeconds: between 60 and 86400 seconds (86400 seconds, or 24 hours, is the default)
  • MaximumRetryAttempts: between 0 and 185 (185 is the default)

When either condition is met, the event fails. It’s then either dropped, which generates an increase to the FailedInvocations CloudWatch metric, or sent to a configured DLQ.

You can configure retry policy attributes when creating or updating rules via the AWS Management Console and AWS Command Line Interface (AWS CLI). You can also use infrastructure as code (IaC) tools such as AWS CloudFormation.
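
As a minimal sketch with the AWS CLI, both the retry policy and the DLQ are set per target in a put-targets call (the rule name and queue ARN reuse the earlier examples; the Lambda target ARN is a placeholder):

aws events put-targets \
  --rule MyTestRule \
  --targets '[{
    "Id": "shipping-service",
    "Arn": "arn:aws:lambda:us-east-1:123456789012:function:shipping-handler",
    "RetryPolicy": {
      "MaximumRetryAttempts": 20,
      "MaximumEventAgeInSeconds": 3600
    },
    "DeadLetterConfig": {
      "Arn": "arn:aws:sqs:us-east-1:123456789012:orders-bus-shipping-service-dlq"
    }
  }]'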

Retry policy

There is no additional cost for configuring either of these new capabilities. You only pay for the usage of the SQS standard queue configured as the dead letter queue during a failure and any application that handles the failed events. SQS pricing can be found here.

Conclusion

With dead letter queues and custom retry policies, you have improved handling and control over failure in distributed systems built with EventBridge. With DLQs you can capture failed events and then process them later, potentially saving yourself from data loss. With custom retry policies, you gain the improved ability to control the number of retries and for how long they can be retried.

I encourage you to explore how both of these new capabilities can help make your applications more resilient to failures, and to standardize on using them both in your infrastructure.

For more serverless learning resources, visit https://serverlessland.com.

Implementing FIFO message ordering with Amazon MQ for Apache ActiveMQ

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/implementing-fifo-message-ordering-with-amazon-mq-for-apache-activemq/

This post is contributed by Ravi Itha, Sr. Big Data Consultant

Messaging plays an important role in building distributed enterprise applications. Amazon MQ is a key offering within the AWS messaging services solution stack focused on enabling messaging services for modern application architectures. Amazon MQ is a managed message broker service for Apache ActiveMQ that simplifies setting up and operating message brokers in the cloud. Amazon MQ uses open standard APIs and protocols such as JMS, NMS, AMQP, STOMP, MQTT, and WebSocket. Using standards means that, in most cases, there’s no need to rewrite any messaging code when you migrate to AWS. This allows you to focus on your business logic and application architecture.

Message ordering via Message Groups

Sometimes it’s important to guarantee the order in which messages are processed. In ActiveMQ, there is no explicit distinction between a standard queue and a FIFO queue. However, a queue can be used to route messages in FIFO order. This ordering can be achieved via two different ActiveMQ features: either by implementing an Exclusive Consumer or by using Message Groups. This blog focuses on Message Groups, an enhancement to Exclusive Consumers. Message groups provide:

  • Guaranteed ordering of the processing of related messages across a single queue
  • Load balancing of the processing of messages across multiple consumers
  • High availability with automatic failover to other consumers if a JVM goes down

This is achieved programmatically as follows:

Sample producer code snippet:

TextMessage tMsg = session.createTextMessage("SampleMessage");
tMsg.setStringProperty("JMSXGroupID", "Group-A");
producer.send(tMsg);

Sample consumer code snippet:

Message consumerMessage = consumer.receive(50);
TextMessage txtMessage = (TextMessage) consumerMessage;
String msgBody = txtMessage.getText();
String msgGroup = txtMessage.getStringProperty("JMSXGroupID");

This sample code highlights:

  • A message group is set by the producer during message ingestion
  • A consumer determines the message group once a message is consumed

Additionally, if a queue has messages for multiple message groups then it’s possible a consumer receives messages for multiple message groups. This depends on various factors such as the number of consumers of a queue and consumer start time.

Scenarios: Multiple producers and consumers with Message Groups

A FIFO queue in ActiveMQ supports multiple ordered message groups. Due to this, it’s common that a queue is used to exchange messages between multiple producer and consumer applications. By running multiple consumers to process messages from a queue, the message broker is able to partition messages across consumers. This improves the scalability and performance of your application.

In terms of scalability, commonly asked questions center on the ideal number of consumers and how messages are distributed across all consumers. To provide more clarity in this area, we provisioned an Amazon MQ broker and ran various test scenarios.

Scenario 1: All consumers started at the same time

Test setup

  • All producers and consumers have the same start time
  • Each test uses a different combination of number of producers, message groups, and consumers
Setup for Tests 1 to 5

Test results

Test # | # producers | # message groups | # messages sent by each producer | # consumers | Total messages | Received by C1 | Received by C2 | Received by C3 | Received by C4
1 | 3 | 3 | 5000 | 1 | 15000 | 15000 (All Groups) | NA | NA | NA
2 | 3 | 3 | 5000 | 2 | 15000 | 5000 (Group-C) | 10000 (Group-A and Group-B) | NA | NA
3 | 3 | 3 | 5000 | 3 | 15000 | 5000 (Group-A) | 5000 (Group-B) | 5000 (Group-C) | NA
4 | 3 | 3 | 5000 | 4 | 15000 | 5000 (Group-C) | 5000 (Group-B) | 5000 (Group-A) | 0
5 | 4 | 4 | 5000 | 3 | 20000 | 5000 (Group-A) | 5000 (Group-B) | 10000 (Group-C and Group-D) | NA

Test conclusions

  • Test 3 – illustrates even message distribution across consumers when a one-to-one relationship exists between message groups and the number of consumers
  • Test 4 – illustrates that one of the four consumers did not receive any messages. This highlights that running more consumers than the available number of message groups does not provide additional benefits
  • Tests 1, 2, 5 – indicate that a consumer can receive messages belonging to multiple message groups. The following table provides additional granularity for messages received by consumer C2 in test #2. As you can see, these messages belong to the Group-A and Group-B message groups, and FIFO ordering is maintained at a message group level

consumer_id | msg_id | msg_group
Consumer C2 | A-1 | Group-A
Consumer C2 | B-1 | Group-B
Consumer C2 | A-2 | Group-A
Consumer C2 | B-2 | Group-B
Consumer C2 | A-3 | Group-A
Consumer C2 | B-3 | Group-B
Consumer C2 | A-4999 | Group-A
Consumer C2 | B-4999 | Group-B
Consumer C2 | A-5000 | Group-A
Consumer C2 | B-5000 | Group-B

Scenario 2a: All consumers not started at same time

Test setup

  • Three producers and one consumer started at the same time
  • The second and third consumers started after 30 seconds and 60 seconds respectively
  • 15,000 messages sent in total across three message groups
Setup for Test 6

Test results

Test # | # producers | # message groups | # messages sent by each producer | # consumers | Total messages | Received by C1 | Received by C2 | Received by C3
6 | 3 | 3 | 5000 | 3 | 15000 | 15000 | 0 | 0

Test conclusion

Consumer C1 received all messages, while consumers C2 and C3 both ran idle and did not receive any messages. The key takeaway is that message distribution can be inefficient in real-world scenarios where consumers start at different times.

The next scenario (2b) addresses the same situation while optimizing message distribution so that all consumers are used.

Scenario 2b: Utilization of all consumers when not started at same time

Test setup

  • Three producers and one consumer started at the same time
  • The second and third consumers started after 30 seconds and 60 seconds respectively
  • 15,000 messages sent in total across three message groups
  • After each producer sends the 2501st message of its message group, it closes the group, after which message distribution restarts as the remaining messages are sent. Closing a message group can be done as in the following code example (specifically, the -1 value set for the JMSXGroupSeq property):
TextMessage tMsg = session.createTextMessage("<foo>hey</foo>");
tMsg.setStringProperty("JMSXGroupID", "Group-A");
tMsg.setIntProperty("JMSXGroupSeq", -1);
producer.send(tMsg);
Setup for Test 7

Test results

Test # | # producers | # message groups | # messages sent by each producer | # consumers | Total messages | Received by C1 | Received by C2 | Received by C3
7 | 3 | 3 | 5001 | 3 | 15003 | 10003 | 2500 | 2500

Distribution of messages received by message group

Consumer | Group-A | Group-B | Group-C | Consumer-wise total
Consumer 1 | 2501 | 2501 | 5001 | 10003
Consumer 2 | 2500 | 0 | 0 | 2500
Consumer 3 | 0 | 2500 | 0 | 2500
Group total | 5001 | 5001 | 5001 | NA

Total messages received: 15003

Test conclusions

Message distribution is optimized with the closing and reopening of a message group when all consumers are not started at the same time. This mitigation step results in all consumers receiving messages.

  • After Group-A was closed, the broker assigned subsequent Group-A messages to consumer C2
  • After Group-B was closed, the broker assigned subsequent Group-B messages to consumer C3
  • After Group-C was closed, the broker continued to send Group-C messages to consumer C1. The assignment did not change because there was no other available consumer
Test 7 – Message distribution among consumers

Scalability techniques

Now that we understand how to use Message Groups to implement FIFO use cases within Amazon MQ, let’s look at how they scale. By default, a message queue supports a maximum of 1024 message groups. This means that if you use more than 1024 message groups per queue, message ordering is lost for the oldest message group. This is further explained in the ActiveMQ Message Groups documentation. This can be problematic for complex use cases involving stock exchanges or financial trading scenarios where thousands of ordered message groups are required. The following techniques address this issue.

  • Verify that the appropriate Amazon MQ broker instance type is used, and increase the number of message groups per message queue. Select the appropriate broker instance type according to your use case; to learn more, refer to the Amazon MQ broker instance types documentation. The default maximum number of message groups can be increased via a custom configuration file at the time of launching a new broker, or by modifying an existing broker. Refer to the next section for an example (requirement #2).
  • Recycle message groups when they are no longer needed. A message group can be closed programmatically by a producer once it's finished sending all messages to a queue. Following is a sample code snippet:

TextMessage tMsg = session.createTextMessage("<foo>hey</foo>");
tMsg.setStringProperty("JMSXGroupID", "GroupA");
tMsg.setIntProperty("JMSXGroupSeq", -1);
producer.send(tMsg);

In the preceding scenario 2b, we used this technique to improve the message distribution across consumers.

Customize message broker configuration

In the previous section, to improve scalability we suggested increasing the number of message groups per queue by updating the broker configuration. A broker configuration is essentially an XML file that contains all ActiveMQ settings for a given message broker. Let’s look at the following broker configuration settings for the purpose of achieving a specific requirement. For your reference, we’ve placed a copy of a broker configuration file with these settings, within a GitHub repository.

Requirement 1: Change the message group implementation from the default CachedMessageGroupMap to MessageGroupHashBucket.

<!-- valid values: simple, bucket, cached. default is cached -->
<!-- keyword simple represents SimpleMessageGroupMap -->
<!-- keyword bucket represents MessageGroupHashBucket -->
<!-- keyword cached represents CachedMessageGroupMap -->
<policyEntry messageGroupMapFactoryType="bucket" queue=">"/>

Requirement 2: Increase the number of message groups per queue from 1024 to 2048 and increase the cache size from 64 to 128.

<!-- default value for bucketCount is 1024 and for cacheSize is 64 -->
<policyEntry queue=">">
  <messageGroupMapFactory>
    <messageGroupHashBucketFactory bucketCount="2048" cacheSize="128"/>
  </messageGroupMapFactory>
</policyEntry>

Requirement 3: Wait for three consumers or 30 seconds before the broker begins sending messages.

<policyEntry queue=">" consumersBeforeDispatchStarts="3" timeBeforeDispatchStarts="30000"/>

When must default broker configurations be updated? This would only apply in scenarios where the default settings do not meet your requirements. Additional information on how to update your broker configuration file can be found here.

Amazon MQ starter kit

Want to get up and running with Amazon MQ quickly? Start building with Amazon MQ by cloning the starter kit available on GitHub. The starter kit includes a CloudFormation template to provision a message broker, sample broker configuration file, and source code related to the consumer scenarios in this blog.

Conclusion

In this blog post, you learned how Amazon MQ simplifies the setup and operation of Apache ActiveMQ in the cloud. You also learned how the Message Groups feature can be used to implement FIFO. Lastly, you walked through real scenarios demonstrating how ActiveMQ distributes messages with queues to exchange messages between multiple producers and consumers.

Scheduling AWS Lambda Provisioned Concurrency for recurring peak usage

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/scheduling-aws-lambda-provisioned-concurrency-for-recurring-peak-usage/

This post is contributed by Jerome Van Der Linden, AWS Solutions Architect

Concurrency of an AWS Lambda function is the number of requests it can handle at any given time. This metric is the average number of requests per second multiplied by the average duration in seconds. For example, if a Lambda function takes an average 500 ms to run with 100 requests per second, the concurrency is 50 (100 * 0.5 seconds).

When a Lambda function is invoked, an execution environment is provisioned in a process commonly called a cold start. This initialization phase can increase the total execution time of the function. Consequently, with the same concurrency, cold starts can reduce the overall throughput. For example, if the function takes 600 ms instead of 500 ms due to cold start latency, with a concurrency of 50, it handles 83 requests per second.

As described in this blog post, Provisioned Concurrency helps to reduce the impact of cold starts. It keeps Lambda functions initialized and ready to respond to requests with double-digit millisecond latency. Functions can respond to requests with predictable latency. This is ideal for interactive services, such as web and mobile backends, latency-sensitive microservices, or synchronous APIs.

However, you may not need Provisioned Concurrency all the time. With a reasonable amount of traffic consistently throughout the day, the Lambda execution environments may be warmed already. Provisioned Concurrency incurs additional costs, so it is cost-efficient to use it only when necessary. For example, early in the morning when activity starts, or to handle recurring peak usage.

Example application

As an example, we use a serverless ecommerce platform with multiple Lambda functions. The entry point is a GraphQL API (using AWS AppSync) that lists products and accepts orders. The backend manages delivery and payments. This GitHub project provides an example of such an ecommerce platform and is based on the following architecture:

Post sample architecture

This company runs a “deal of the day” promotion every day at noon. The “deal of the day” reduces the price of a specific product for 60 minutes, starting at 12pm. The CreateOrder function handles around 20 requests per second during the day. At noon, a notification is sent to registered users who then connect to the website. Traffic increases immediately and can exceed 400 requests per second for the CreateOrder function. This recurring pattern is shown in Amazon CloudWatch:

Recurring invocations

The peak shows an immediate increase at noon, slowly decreasing until 1pm:

Peak invocations

Examining the response times of the function in AWS X-Ray, the first graph shows normal traffic:

Normal performance distribution

While the second shows the peak:

Peak performance distribution

The average latency (p50) is higher during the peak with 535 ms versus 475 ms at normal load. The p90 and p95 values show that most of the invocations are under 800 ms in the first graph but around 900 ms in the second one. This difference is due to the cold start, when the Lambda service prepares new execution environments to absorb the load during the peak.

Using Provisioned Concurrency at noon can avoid this additional latency and provide a better experience to end users. Ideally, it should also be stopped at 1pm to avoid incurring unnecessary costs.

Application Auto Scaling

Application Auto Scaling allows you to configure automatic scaling for different resources, including Provisioned Concurrency for Lambda. You can scale resources based on a specific CloudWatch metric or at a specific date and time. There is no extra cost for Application Auto Scaling, you only pay for the resources that you use. In this example, I schedule Provisioned Concurrency for the CreateOrder Lambda function.

Scheduling Provisioned Concurrency for a Lambda function

  1. Create an alias for the Lambda function to identify the version of the function you want to scale:
    $ aws lambda create-alias --function-name CreateOrder --name prod --function-version 1.2 --description "Production alias"
  2. In this example, the average execution time is 500 ms and the workload must handle 450 requests per second. This equates to a concurrency of 250 including a 10% buffer. Register the Lambda function as a scalable target in Application Auto Scaling with the RegisterScalableTarget API:
    $ aws application-autoscaling register-scalable-target \
    --service-namespace lambda \
    --resource-id function:CreateOrder:prod \
    --min-capacity 0 --max-capacity 250 \
    --scalable-dimension lambda:function:ProvisionedConcurrency

    Next, verify that the Lambda function is registered correctly:

    $ aws application-autoscaling describe-scalable-targets --service-namespace lambda

    The output shows:

    {
        "ScalableTargets": [
            {
                "ServiceNamespace": "lambda",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "MinCapacity": 0,
                "MaxCapacity": 250,
                "RoleARN": "arn:aws:iam::012345678901:role/aws-service-role/lambda.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_LambdaConcurrency",
                "CreationTime": 1596319171.297,
                "SuspendedState": {
                    "DynamicScalingInSuspended": false,
                    "DynamicScalingOutSuspended": false,
                    "ScheduledScalingSuspended": false
                }
            }
        ]
    }
  3. Schedule the Provisioned Concurrency using the PutScheduledAction API of Application Auto Scaling:
    $ aws application-autoscaling put-scheduled-action --service-namespace lambda \
      --scalable-dimension lambda:function:ProvisionedConcurrency \
      --resource-id function:CreateOrder:prod \
      --scheduled-action-name scale-out \
      --schedule "cron(45 11 * * ? *)" \
      --scalable-target-action MinCapacity=250

    Note: Set the schedule a few minutes ahead of the expected peak to allow time for Lambda to prepare the execution environments. The time must be specified in UTC.

  4. Verify that the scaling action is correctly scheduled:
    $ aws application-autoscaling describe-scheduled-actions --service-namespace lambda

    The output shows:

    {
        "ScheduledActions": [
            {
                "ScheduledActionName": "scale-out",
                "ScheduledActionARN": "arn:aws:autoscaling:eu-west-1:012345678901:scheduledAction:643065ac-ba5c-45c3-cb46-bec621d49657:resource/lambda/function:CreateOrder:prod:scheduledActionName/scale-out",
                "ServiceNamespace": "lambda",
                "Schedule": "cron(45 11 * * ? *)",
                "ResourceId": "function:CreateOrder:prod",
                "ScalableDimension": "lambda:function:ProvisionedConcurrency",
                "ScalableTargetAction": {
                    "MinCapacity": 250
                },
                "CreationTime": 1596319455.951
            }
        ]
    }
  5. To stop the Provisioned Concurrency, schedule another action after the peak with the capacity set to 0:
    $ aws application-autoscaling put-scheduled-action --service-namespace lambda \
      --scalable-dimension lambda:function:ProvisionedConcurrency \
      --resource-id function:CreateOrder:prod \
      --scheduled-action-name scale-in \
      --schedule "cron(15 13 * * ? *)" \
      --scalable-target-action MinCapacity=0,MaxCapacity=0

With this configuration, Provisioned Concurrency starts at 11:45 and stops at 13:15. It uses a concurrency of 250 to handle the load during the peak and releases resources after. This also optimizes costs by limiting the use of this feature to 90 minutes per day.

In this architecture, the CreateOrder function synchronously calls ValidateProduct, ValidatePayment, and ValidateDelivery. Provisioning concurrency only for the CreateOrder function would be insufficient as the three other functions would impact it negatively. To be efficient, you must configure Provisioned Concurrency for all four functions, using the same process.
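
As a quick sketch, the registration step could be repeated for the four functions from a shell loop (in practice, the aliases, capacities, and scheduled actions would be tuned per function):

$ for fn in CreateOrder ValidateProduct ValidatePayment ValidateDelivery; do
    aws application-autoscaling register-scalable-target \
      --service-namespace lambda \
      --resource-id function:${fn}:prod \
      --min-capacity 0 --max-capacity 250 \
      --scalable-dimension lambda:function:ProvisionedConcurrency
  done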

Observing performance when Provisioned Concurrency is scheduled

Run the following command at any time outside the scheduled window to confirm that nothing has been provisioned yet:

$ aws lambda get-provisioned-concurrency-config \
--function-name CreateOrder --qualifier prod
No Provisioned Concurrency Config found for this function

When run during the defined time slot, the output shows that the concurrency is allocated and ready to use:

{
    "RequestedProvisionedConcurrentExecutions": 250,
    "AvailableProvisionedConcurrentExecutions": 250,
    "AllocatedProvisionedConcurrentExecutions": 250,
    "Status": "READY",
    "LastModified": "2020-08-02T11:45:18+0000"
}

Verify the different scaling operations using the following command:

$ aws application-autoscaling describe-scaling-activities \
--service-namespace lambda \
--scalable-dimension lambda:function:ProvisionedConcurrency \
--resource-id function:CreateOrder:prod

You can see the scale-out and scale-in activities at the times specified, and how Lambda fulfills the requests:

{
    "ScalingActivities": [
        {
            "ActivityId": "6d2bf4ed-6eb5-4218-b4bb-f81fe6d7446e",
            "ServiceNamespace": "lambda",
            "ResourceId": "function:CreateOrder:prod",
            "ScalableDimension": "lambda:function:ProvisionedConcurrency",
            "Description": "Setting desired concurrency to 0.",
            "Cause": "maximum capacity was set to 0",
            "StartTime": 1596374119.716,
            "EndTime": 1596374155.789,
            "StatusCode": "Successful",
            "StatusMessage": "Successfully set desired concurrency to 0. Change successfully fulfilled by lambda."
        },
        {
            "ActivityId": "2ea46a62-2dbe-4576-aa42-0675b6448f0a",
            "ServiceNamespace": "lambda",
            "ResourceId": "function:CreateOrder:prod",
            "ScalableDimension": "lambda:function:ProvisionedConcurrency",
            "Description": "Setting min capacity to 0 and max capacity to 0",
            "Cause": "scheduled action name scale-in was triggered",
            "StartTime": 1596374119.415,
            "EndTime": 1596374119.431,
            "StatusCode": "Successful",
            "StatusMessage": "Successfully set min capacity to 0 and max capacity to 0"
        },
        {
            "ActivityId": "272cfd75-8362-4e5c-82cf-5c8cf9eacd24",
            "ServiceNamespace": "lambda",
            "ResourceId": "function:CreateOrder:prod",
            "ScalableDimension": "lambda:function:ProvisionedConcurrency",
            "Description": "Setting desired concurrency to 250.",
            "Cause": "minimum capacity was set to 250",
            "StartTime": 1596368709.651,
            "EndTime": 1596368901.897,
            "StatusCode": "Successful",
            "StatusMessage": "Successfully set desired concurrency to 250. Change successfully fulfilled by lambda."
        },
        {
            "ActivityId": "38b968ff-951e-4033-a3db-b6a86d4d0204",
            "ServiceNamespace": "lambda",
            "ResourceId": "function:CreateOrder:prod",
            "ScalableDimension": "lambda:function:ProvisionedConcurrency",
            "Description": "Setting min capacity to 250",
            "Cause": "scheduled action name scale-out was triggered",
            "StartTime": 1596368709.354,
            "EndTime": 1596368709.37,
            "StatusCode": "Successful",
            "StatusMessage": "Successfully set min capacity to 250"
        }
    ] 
}

In the following latency graph, most of the requests are now completed in less than 800 ms, in line with performance during the rest of the day. The traffic during the “deal of the day” at noon is comparable to any other time. This helps provide a consistent user experience, even under heavy load.

Performance distribution

Automate the scheduling of Lambda Provisioned Concurrency

You can use the AWS CLI to set up scheduled Provisioned Concurrency, which can be helpful for testing in a development environment. You can also define your infrastructure with code to automate deployments and avoid manual configuration that may lead to mistakes. This can be done with AWS CloudFormation, AWS SAM, AWS CDK, or third-party tools.

The following code shows how to schedule Provisioned Concurrency in an AWS SAM template:

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Resources:
  CreateOrderFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/create_order/
      Handler: main.handler
      Runtime: python3.8
      MemorySize: 768
      Timeout: 30
      AutoPublishAlias: prod                              #1
      DeploymentPreference:
        Type: AllAtOnce

  CreateOrderConcurrency:
    Type: AWS::ApplicationAutoScaling::ScalableTarget     #2
    Properties:
      MaxCapacity: 250
      MinCapacity: 0
      ResourceId: !Sub function:${CreateOrderFunction}:prod        #3
      RoleARN: !Sub arn:aws:iam::${AWS::AccountId}:role/aws-service-role/lambda.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_LambdaConcurrency
      ScalableDimension: lambda:function:ProvisionedConcurrency
      ServiceNamespace: lambda
      ScheduledActions:                                   #4
        - ScalableTargetAction:
            MinCapacity: 250
          Schedule: 'cron(45 11 * * ? *)'
          ScheduledActionName: scale-out
        - ScalableTargetAction:
            MinCapacity: 0
            MaxCapacity: 0
          Schedule: 'cron(15 13 * * ? *)'
          ScheduledActionName: scale-in 
    DependsOn: CreateOrderFunctionAliasprod                        #5

In this template:

  1. As with the AWS CLI, you need an alias for the Lambda function. This automatically creates the alias “prod” and sets it to the latest version of the Lambda function.
  2. This creates an AWS::ApplicationAutoScaling::ScalableTarget resource to register the Lambda function as a scalable target.
  3. This references the correct version of the Lambda function by using the “prod” alias.
  4. Defines different actions to schedule as a property of the scalable target. You define the same properties as used in the CLI.
  5. You cannot define the scalable target until the alias of the function is published. The syntax is <FunctionResource>Alias<AliasName>.

Conclusion

Cold starts can impact the overall duration of your Lambda function, especially under heavy load. This can affect the end-user experience when the function is used as a backend for synchronous APIs.

Provisioned Concurrency helps reduce latency by creating execution environments ahead of invocations. Using the ecommerce platform example, I show how to combine this capability with Application Auto Scaling to schedule scaling-out and scaling-in. This combination helps to provide a consistent execution time even during special events that cause usage peaks. It also helps to optimize cost by limiting the time period.

To learn more about scheduled scaling, see the Application Auto Scaling documentation.

Executing Ansible playbooks in your Amazon EC2 Image Builder pipeline

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/executing-ansible-playbooks-in-your-amazon-ec2-image-builder-pipeline/

This post is contributed by Andrew Pearce – Sr. Systems Dev Engineer, AWS

Since launching Amazon EC2 Image Builder, many customers say they want to re-use existing investments in configuration management technologies (such as Ansible, Chef, or Puppet) with Image Builder pipelines.

In this blog, I walk through creating a document that can execute an Ansible playbook in an Image Builder pipeline. I then use the document to create an Image Builder component that can be added to an Image Builder image recipe. In addition to this blog, you can find further samples in the amazon-ec2-image-builder-samples GitHub repository.

Ansible playbook requirements

There are likely some changes required so your existing Ansible playbooks can work in Image Builder.

When executing Ansible within Image Builder, the playbook must be configured to execute using the localhost. In a traditional Ansible environment, the control server executes playbooks against remote hosts using a protocol like SSH or WinRM. In Image Builder, the host executing the playbook is also the host that must be configured.

There are a number of ways to accomplish this in Ansible. In this example, I set the value of hosts to 127.0.0.1, gather_facts to false, and connection to local. These values set execution at the localhost, prevent gathering Ansible facts about remote hosts, and force local execution.

In this walk through, I use the following sample playbook. This installs Apache, configures a default webpage, and ensures that Apache is enabled and running.

---
- name: Apache Hello World
  hosts: 127.0.0.1
  gather_facts: false
  connection: local
  tasks:
    - name: Install Apache
      yum:
        name: httpd
        state: latest
    - name: Create a default page
      shell: echo "<h1>Hello world from EC2 Image Builder and Ansible</h1>" > /var/www/html/index.html
    - name: Enable Apache
      service: name=httpd enabled=yes state=started

Creating the Ansible document

This example uses the build phase to install Ansible, download the playbook from an S3 bucket, execute the playbook, and clean up the system. It also ensures that Apache is returning the correct content in both the validate and test phases.

I use Amazon Linux 2 as the target operating system, and assume that the playbook is already uploaded to my S3 bucket.
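
If the playbook is not in the bucket yet, uploading it is a single AWS CLI call (the bucket name and key below are placeholders):

aws s3 cp my-playbook.yml s3://mybucket/my-playbook.yml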

Document design diagram

To start, I must specify the top-level properties for the document. I specify a name and description to describe the document’s purpose, and a schemaVersion, which at the time of writing is 1.0. I also include a build phase, as I want this document to be used when building the image.

name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
phases:
  - name: build
    steps:

The first required step is to install Ansible. With Amazon Linux 2, I can install Ansible using Amazon Linux Extras, so I use the ExecuteBash action to perform the work.

      - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
           - sudo amazon-linux-extras install -y ansible2

Once Ansible is installed, I must download the playbook for execution. I use the S3Download action to download the playbook and store it in a temporary location. Note that the S3Download action accepts a list of inputs, so a single S3Download step could download multiple files.

      - name: DownloadPlaybook
        action: S3Download
        inputs:
          - source: 's3://mybucket/my-playbook.yml'
            destination: '/tmp/my-playbook.yml'

Now Ansible is installed and the playbook is downloaded. I use the ExecuteBinary action to invoke Ansible and perform the work described in the playbook. This step uses a feature of the component management application called chaining.

Chaining allows you to reference the input or output values of a step in subsequent steps. In the following example, {{build.DownloadPlaybook.inputs[0].destination}} passes the string of the first “destination” property value in the DownloadPlaybook step, which executes in the build phase. The “[0]” indicates the first value in the array (position zero).

For more information about how chaining works, review the Component Manager documentation. In this example, I refer to the S3Download destination path to avoid mistyping the value, which causes the build to fail.

      - name: InvokeAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{build.DownloadPlaybook.inputs[0].destination}}'

Finally, I clean up the system, so I use another ExecuteBash action combined with chaining to delete the downloaded playbook.

      - name: DeletePlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{build.DownloadPlaybook.inputs[0].destination}}'

Now, I have installed Ansible, downloaded the playbook, executed it, and cleaned up the system. The next step is to add a validate phase where I use curl to ensure that Apache responds with the desired content. Including a validate phase allows you to identify build issues more quickly. This fails the build before Image Builder creates an AMI and moves to the test phase.

In the validate phase, I use grep to confirm that Apache returned a value containing my required string. If the command fails, a non-zero exit code is returned, indicating a failure. This failure is returned to Image Builder and the image build is marked as “Failed”.

  - name: validate
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"

After the validate phase executes, the EC2 instance is stopped and an Amazon Machine Image (AMI) is created. Image Builder moves to the test stage where a new EC2 instance is launched using the AMI that was created in the previous step. During this stage, the test phase of a document is executed.

To ensure that Apache is still functioning as required, I add a test phase to the document using the same curl command as the validate phase. This ensures that Apache is started and still returning the correct content.

  - name: test
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"

Here is the complete document:

name: 'Ansible Playbook Execution on Amazon Linux 2'
description: 'This is a sample component that demonstrates how to download and execute an Ansible playbook against Amazon Linux 2.'
schemaVersion: 1.0
phases:
  - name: build
    steps:
      - name: InstallAnsible
        action: ExecuteBash
        inputs:
          commands:
           - sudo amazon-linux-extras install -y ansible2
      - name: DownloadPlaybook
        action: S3Download
        inputs:
          - source: 's3://mybucket/playbooks/my-playbook.yml'
            destination: '/tmp/my-playbook.yml'
      - name: InvokeAnsible
        action: ExecuteBinary
        inputs:
          path: ansible-playbook
          arguments:
            - '{{build.DownloadPlaybook.inputs[0].destination}}'
      - name: DeletePlaybook
        action: ExecuteBash
        inputs:
          commands:
            - rm '{{build.DownloadPlaybook.inputs[0].destination}}'
  - name: validate
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"
  - name: test
    steps:
      - name: ValidateResponse
        action: ExecuteBash
        inputs:
          commands:
            - curl -s http://127.0.0.1 | grep "Hello world from EC2 Image Builder and Ansible"

I now have a document that installs Ansible, downloads a playbook from an S3 bucket, and invokes Ansible to perform the work described in the playbook. I am also validating the installation was successful by ensuring Apache is returning the desired content.

Next, I use this document to create a component in Image Builder and assign the component to an image recipe. The image recipe could then be used when creating an image pipeline. The blog post Automate OS Image Build Pipelines with EC2 Image Builder provides a tutorial demonstrating how to create an image pipeline with the EC2 Image Builder console. For more information, review Getting started with EC2 Image Builder.
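
As a minimal sketch of that first step with the AWS CLI, assuming the document is saved locally as ansible-playbook-execution.yml (the component name and version below are illustrative):

aws imagebuilder create-component \
  --name ansible-playbook-execution \
  --semantic-version 1.0.0 \
  --platform Linux \
  --data file://ansible-playbook-execution.yml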

Conclusion

In this blog, I walk through creating a document that can execute Ansible playbooks for use in Image Builder. I define the required changes for Ansible playbooks and when each phase of the document would execute. I also note the amazon-ec2-image-builder-samples GitHub repository, which provides sample documents for executing Ansible playbooks and Chef recipes.

We continue to use this repository for publishing sample content. We hope Image Builder is providing value and making it easier for you to build virtual machine (VM) images. As always, keep sending us your feedback!

Migrating a SharePoint application using the AWS Server Migration Service

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/migrating-a-sharepoint-application-using-the-aws-server-migration-service/

This post contributed by Ashwini Rudra, Solutions Architect; Rajesh Rathod, Sr. Product Manager; Vivek Chawda, Senior Software Engineer, EC2 Enterprise

Many AWS customers are migrating on-premises SharePoint workloads to AWS for greater reliability, faster performance, and lower costs. While planning the migration, customers are looking for tools and methodologies that reduce the time to migrate, application downtime, and performance disruption. They use continuous replication to optimize cost and effort required to migrate applications reliably.

To accelerate these migrations, AWS provides a comprehensive set of tools. In this blog post, we explore how to use AWS Server Migration Service (AWS SMS) for migrating a SharePoint application from on-premises to AWS.

Overview of solution

AWS Server Migration Service is an agentless service. It makes it easier and faster to migrate thousands of workloads from on-premises or Microsoft Azure to AWS. In this article, we discuss one approach and the steps to migrate a SharePoint farm using AWS SMS.

AWS SMS also supports migrating a group of servers organized as an application. This can simplify migration of applications with complex dependencies that must be migrated together. This service provides a customized replication schedule designed to simplify migration at scale. This also tracks the progress of each migration using AWS Migration Hub.

For SharePoint migrations, it facilitates migration failovers quickly. After the initial sync, the migration uses an incremental change capture approach to synchronize changes made to the on-premises SharePoint servers. This method also reduces the required network bandwidth for the migration.

SharePoint Migration Architecture

Here is how this service and solution work:

Walkthrough

A basic SharePoint deployment is a 3-tier architecture comprising web frontend servers, application servers, and backend SQL database servers. It also includes authentication services servers with Active Directory domain controllers.

To migrate this application, you must deploy a Server Migration Connector, which is a preconfigured virtual machine. This connector creates a server catalog. Based on the selected servers and configuration, it takes snapshots of virtual machines and stores them in S3 buckets.

In the background, the AWS VM Import/Export service converts these snapshots into Amazon Machine Images (AMIs). Using these AMIs, you can configure launch settings where you define the order of application launch. You can also select instance types and set user-defined PowerShell scripts.

At the end, you define the cloud network topology: a VPC, subnets, and security groups. With these launch settings, SMS creates an AWS CloudFormation template, which can launch the SharePoint application in the target AWS account.

To migrate a SharePoint using SMS:

  1. Install and register the SMS connector.
  2. Create an application from SMS server catalog.
  3. Configure replication settings for your SharePoint farm.
  4. Configure launch settings.
  5. Start the replication.
  6. Launch the SharePoint application in AWS.

Install and register the Server Migration Connector

The Server Migration Connector is a VM that you install in your on-premises virtualization environment. The supported platforms are VMware vSphere, Microsoft Hyper-V/SCVMM, and Microsoft Azure. Refer to the AWS SMS documentation to install the connector in your environment.

For example, VMware is a common scenario. First, you must set up the Server Migration Connector and import the server catalog:

  1. Log in to the AWS Server Migration Service console, and choose Connectors.
  2. Download the .ova file from the AWS Server Migration Service setup page and deploy it to your VMware environment using the vSphere client.
  3. Install the OVA file in your VMware environment. This is your AWS SMS connector.
  4. Use the connector host IP address to access the connector page and start the five-step registration process. Refer to the AWS documentation, Install the Server Migration Connector on VMware.
  5. Follow the five-step registration process. Here, you set up the password and network configuration between the connector virtual machine and AWS accounts.
    Setup network info

    vCenter credentials

    At the end, provide your vCenter host name and credentials.

    During setup, provide the details of your VMware and AWS environments to the connector. After establishing the connection, the connector setup looks like this in the AWS Management Console:

    SMS Connectors

Create an application from SMS server catalog

After configuring the connector and selecting “Import Server Catalog”, you can view the server catalog in the AWS Server Migration Service console. To migrate your SharePoint application, select the application server, SQL Server, and Active Directory server from the server catalog.

Depending on the application architecture, you can group these servers to apply server-specific configuration settings and select appropriate instance types. Here are the steps:

  1. Navigate to the application feature of SMS and create a “new application” for the SharePoint farm. Provide the application name, description, and IAM role.

    Create new application - Application settings

  2. Select servers to migrate from the catalog.
  3. Create different groups for these servers in your console. For this, select servers and choose Add servers to group. This helps in defining different instance types and running user-defined PowerShell scripts for all servers in a group during application launch. You may create different groups for the application, web frontend, database, and Active Directory servers. In the following example, there are two groups: one for application servers, and the other for database servers. This process assumes that you have the authentication services servers already in place and operational in AWS. For more information on SharePoint authentication services, Active Directory, and Domain Services, refer to Active Directory Domain Services on AWS Deployment Scenario and Architecture.

    Create new application - Add servers to groups

  4. Add tags, per your organization tagging strategy or policy.
  5. Review your application and click “Next” when it is ready for replication settings configurations.

Configure replication settings

  1. Define “replication job type”, “when to start replication job”, and “automatic AMI deletion” based on your requirements. Choose Next.
  2. On the Configure server-specific settings page, in the License type column, select the license type for AMIs created from the replication job. Windows Servers can use either an AWS-provided license or Bring Your Own License (BYOL). Check Microsoft Licensing to review the licensing options. You can also choose Auto to allow AWS SMS to detect the source-system operating system (OS) and apply the appropriate license to the migrated virtual machine. Choose Next.

    Server-specific settings

  3. Review your application replication setting and choose Configure Launch Settings.

Configure launch settings

An important aspect of migration is how this application should launch on EC2. This is configured on this page of SMS:

  1. On the Configure launch settings page, for the IAM CloudFormation role, provide an IAM role for launch settings. Refer to the AWS documentation on IAM roles for AWS SMS.

    Configure launch settings

  2. Under Specify launch order, configure a launch order for your groups. For this SharePoint application, you may prefer Active Directory first, followed by the SQL database, and then the application servers.
  3. Under Configure launch settings for the application, edit the server settings individually:
    Target instance details

    • Logical ID: AWS CloudFormation resource ID. This is the logical ID of the CloudFormation template that AWS SMS generates for the application. A value is created automatically when you use the console, but you must supply it manually when using the API, CLI, or SDKs. For more information, see Resources in the AWS CloudFormation User Guide.
    • Instance type: specifies the EC2 instance type on which to launch the server.
    • Key pair: specifies the SSH key pair for access to the server.
    • Configuration script: a script to run configuration commands at the startup of EC2 instances launched as part of an application. This is important for your SharePoint migration, as you can provide registry settings and configuration settings using PowerShell script for your SharePoint servers and SQL database servers.

    For example, the PowerShell script below retrieves the IP address of the current SQL Server and replaces the old SQL Server IP in the SharePoint configuration database and connection strings. You can automate many SharePoint configuration tasks using PowerShell scripts.

      Start-Transcript -Path "C:\UserData.log" -Append
      $oldIP = <<Your old IP goes here>>
      $newIP = ([System.Net.Dns]::GetHostAddresses("sp-ip-sql-server.aws.local")).IPAddressToString
      $registryPath = "HKLM:\SOFTWARE\Microsoft\Shared Tools\Web Server Extensions\16.0\Secure\ConfigDB"
      $Name = "dsn"
      $regValue = (Get-ItemProperty -Path $registryPath -Name $Name).dsn
      $updatedRegValue = $regValue.Replace($oldIP, $newIP)
      New-ItemProperty -Path $registryPath -Name $Name -Value $updatedRegValue -PropertyType String -Force | Out-Null
  4. Configure the target network: VPC, subnets and security groups. Refer to the SharePoint on AWS documentation (Reference Deployment) for more guidance. The network topology varies based on platform requirements. Here is a reference architecture for SharePoint on AWS:

    Architecture topology

Start application replication

To start replication, select Start Replication under the Actions menu on the Applications page. The replication time depends on the amount of data replicated and available network bandwidth. On the application details page, you can observe the status of the replication in the Replication status field. If the replication fails, the status message field shows the reason.

Start replication

Launch SharePoint in AWS

  1. On the Application page, choose Actions, Launch application. A replication job must complete before you perform this action.
  2. In the Launch application window, choose Launch. On the application details page, you can observe the status of the launch in the Launch status field. If the launch fails, you are able to find the reason in the status message field. You can also generate a CloudFormation template and download this template to use in different AWS accounts.

    Launch application

    Launch application

Test your migration

When the SharePoint application is launched, you can connect to the Amazon EC2 instance-based servers via Remote Desktop Protocol (RDP). You can access the application through the Internet Information Services (IIS) server running on the SharePoint Web Front End (WFE) application server (on Amazon EC2). It is also recommended to investigate optimizations using AWS Compute Optimizer. For this blog, we have not verified the migration steps with SharePoint 2007 and 2010.

Conclusion

AWS Server Migration Service simplifies SharePoint application migration. Using AWS SMS, you can easily migrate a SharePoint farm and reduce your migration timeline using the launch setting and launch order features.

To learn more, watch Application migration Using AWS Server Migration Service (SMS) or view a demo on the AWS Online Tech Talks Channel. If you have feedback, let us know in the comments section below.

Using AWS ParallelCluster with a serverless API

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/using-aws-parallelcluster-with-a-serverless-api/

This post is contributed by Dario La Porta, AWS Senior Consultant – HPC

AWS ParallelCluster simplifies the creation and the deployment of HPC clusters. Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. AWS Lambda automatically runs your code without requiring you to provision or manage servers.

In this post, I create a serverless API for the AWS ParallelCluster command line interface using these services. With this API, you can create, monitor, and destroy your clusters. This makes it possible to integrate AWS ParallelCluster programmatically with other applications you may have running on-premises or in the AWS Cloud.

The serverless integration of AWS ParallelCluster can enable a cleaner and more reproducible infrastructure as code paradigm to legacy HPC environments.

Taking this serverless, infrastructure as code approach enables several new types of functionality for HPC environments. For example, you can build on-demand clusters from an API when on-premises resources cannot handle the workload. AWS ParallelCluster can extend on-premises resources for running elastic and large-scale HPC on AWS’ virtually unlimited infrastructure.

You can also create an event-driven workflow in which new clusters are created when new data is stored in an S3 bucket. With event-driven workflows, you can be creative in finding new ways to build HPC infrastructure easily. It also helps optimize time for researchers.

Security is paramount in HPC environments because customers are performing scientific analyses that are central to their businesses. By using a serverless API, this solution can improve security by removing the need to run the AWS ParallelCluster CLI in a user environment. This helps keep customer environments secure and makes it easier to control the IAM roles and security groups that researchers have access to.

Additionally, the Amazon API Gateway for HPC job submission post explains how to submit a job in the cluster using the API. You can use this instead of connecting to the master node via SSH.

This diagram shows the components required to create the cluster and interact with the solution.

Cluster Architecture

Cluster Architecture

Cost of the solution

You can deploy the solution in this blog post within the AWS Free Tier. Make sure that your AWS ParallelCluster configuration uses the t2.micro instance type for the cluster’s master and compute instances. This is the default instance type for AWS ParallelCluster configuration.

For real-world HPC use cases, you most likely want to use a different instance type, such as C5 or C5n. C5n in particular can work well for HPC workloads because it includes the option to use the Elastic Fabric Adapter (EFA) network interface. This makes it possible to scale tightly coupled workloads to more compute instances and reduce communications latency when using protocols such as MPI.

To stay within the AWS Free Tier allowance, be sure to destroy the created resources as described in the teardown section of this post.

VPC configuration

Choose Launch Stack to create the VPC used for this configuration in your account:

The stack creates the VPC, the public subnets, and the private subnet required for the cluster in the eu-west-1 Region.

Stack Outputs

Stack Outputs

You can also use an existing VPC that complies with the AWS ParallelCluster network requirements.

Deploy the API with AWS SAM

The AWS Serverless Application Model (AWS SAM) is an open-source framework that you can use to build serverless applications on AWS. You use AWS SAM to simplify the setup of the serverless architecture.

In this case, the framework automates the manual configuration of setting up the API Gateway and Lambda function. Instead you can focus more on how the API works with AWS ParallelCluster. It improves security and provides a simple, alternative method for cluster lifecycle management.

You can install the AWS SAM CLI by following the Installing the AWS SAM CLI documentation. You can download the code used in this example from this repo. Inside the repo:

  • the sam-app folder in the aws-sample repository contains the code required to build the AWS ParallelCluster serverless API.
  • sam-app/template.yml contains the policy required for the Lambda function for the creation of the cluster. Be sure to modify <AWS ACCOUNT ID> to match the value for your account.

The AWS Identity and Access Management Roles in AWS ParallelCluster document contains the latest version of the policy, in the ParallelClusterUserPolicy section.

To deploy the application, run the following commands:

cd sam-app
sam build
sam deploy --guided

From here, provide parameter values for the SAM deployment wizard that are appropriate for your Region and AWS account. After the deployment, take a note of the Outputs:

SAM deploying

SAM deploying

SAM Stack Outputs

SAM Stack Outputs

The API Gateway endpoint URL is used to interact with the API, and has the following format:

https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster

AWS ParallelCluster configuration file

AWS ParallelCluster is an open source cluster management tool to deploy and manage HPC clusters in the AWS Cloud. AWS ParallelCluster uses a configuration file to build the cluster and its syntax is explained in the documentation guide. The pcluster.conf configuration file can be created in a directory of your local file system.

The configuration file has been tested with AWS ParallelCluster v2.6.0. The master_subnet_id parameter contains the ID of the public subnet created by the VPC stack, and compute_subnet_id contains the ID of the private subnet.
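
For reference, the following is a minimal, illustrative sketch of what such a configuration file might look like for this setup. The bracketed values are placeholders, and the exact options you need may differ, so check the AWS ParallelCluster configuration documentation for the full syntax:

[aws]
aws_region_name = eu-west-1

[global]
cluster_template = default
update_check = true
sanity_check = true

[cluster default]
key_name = <your EC2 key pair name>
base_os = alinux
master_instance_type = t2.micro
compute_instance_type = t2.micro
vpc_settings = public-private

[vpc public-private]
vpc_id = <VPC ID from the VPC stack>
master_subnet_id = <public subnet ID from the VPC stack>
compute_subnet_id = <private subnet ID from the VPC stack>
use_public_ips = false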

Deploy the cluster with the pcluster API

The pcluster API created in the previous steps requires some parameters:

  • command – the pcluster command to execute. A detailed list of available commands is in the AWS ParallelCluster CLI commands page.
  • cluster_name – the name of the cluster.
  • –data-binary “$(base64 /path/to/pcluster/config)” – parameter used to pass the local AWS ParallelCluster configuration file to the API.
  • -H “additional_parameters: <param1> <param2> <…>” – used to pass additional parameters to the pcluster cli.

The following command creates a cluster named “cluster1”:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=create&cluster_name=cluster1"

Beginning cluster creation for cluster: cluster1
Creating stack named: parallelcluster-cluster1
Status: CREATE_IN_PROGRESS

The cluster creation status can be queried with the following:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=status&cluster_name=cluster1"

Status: CREATE_IN_PROGRESS

When the cluster is in the “CREATE_COMPLETE” state, you can retrieve the master node IP address using the following API call:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=status&cluster_name=cluster1"

Status: CREATE_COMPLETE

$ curl --request POST -H "additional_parameters: "  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=status&cluster_name=cluster1"

Status: CREATE_COMPLETE
MasterServer: RUNNING
MasterPublicIP: 34.253.102.227
ClusterUser: ec2-user
MasterPrivateIP: 10.0.0.134

When the cluster is not needed anymore, destroy it with the following API call:

$ curl --request POST -H "additional_parameters: --nowait"  --data-binary "$(base64 /tmp/pcluster.conf)" "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster?command=delete&cluster_name=cluster1"

Deleting: cluster1

The additional_parameters: --nowait header prevents waiting for stack events after executing a stack command and avoids triggering the Lambda function timeout. The Amazon API Gateway for HPC job submission post explains how you can submit a job in the cluster using the API, instead of connecting to the master node via SSH.
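
If you want to integrate these calls into another application rather than using curl, the following Python sketch shows one possible wrapper. It assumes the requests library is installed, and the API_URL value is a placeholder for the endpoint URL from your SAM stack outputs:

import base64
import requests

# Placeholder: replace with the endpoint URL from your SAM stack outputs
API_URL = "https://<ServerlessRestApi>.execute-api.eu-west-1.amazonaws.com/Prod/pcluster"

def pcluster(command, cluster_name, config_path="/tmp/pcluster.conf", additional_parameters="--nowait"):
    # Base64-encode the local AWS ParallelCluster configuration file, mirroring
    # the --data-binary "$(base64 ...)" usage in the curl examples above
    with open(config_path, "rb") as config_file:
        body = base64.b64encode(config_file.read())
    response = requests.post(
        API_URL,
        params={"command": command, "cluster_name": cluster_name},
        headers={"additional_parameters": additional_parameters},
        data=body,
    )
    response.raise_for_status()
    return response.text

print(pcluster("create", "cluster1"))  # start cluster creation
print(pcluster("status", "cluster1"))  # check creation status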

The authentication to the API can be managed by following the Controlling and Managing Access to a REST API in API Gateway Documentation.

Teardown

You can destroy the resources by deleting the CloudFormation stacks created during installation. Deleting a Stack on the AWS CloudFormation Console explains the required steps.

Conclusion

In this post, I show how to integrate AWS ParallelCluster with Amazon API Gateway and manage the lifecycle of an HPC cluster using this API. Using Amazon API Gateway and AWS Lambda, you can run a serverless implementation of the AWS ParallelCluster CLI. This makes it possible to integrate AWS ParallelCluster programmatically with other applications you run on-premises or in the AWS Cloud.

This solution can help you improve the security of your HPC environment by simplifying the IAM roles and security groups that must be granted to individual users to successfully create HPC clusters. With this implementation, researchers no longer must run the AWS ParallelCluster CLI in their own user environment. As a result, by simplifying the security management of your HPC clusters’ lifecycle management, you can better ensure that important research is safe and secure.

To learn more, read more about how to use AWS ParallelCluster.

Coming soon: Updated Lambda states lifecycle for VPC networking

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/coming-soon-updated-lambda-states-lifecycle-for-vpc-networking/

On November 27, we announced that AWS Lambda now includes additional attributes in the function information returned by several Lambda API actions to better communicate the current “state” of your function, when they are being created or updated. In our post “Tracking the state of AWS Lambda functions”, we covered the various states your Lambda function can be in, the conditions that lead to them, and how the Lambda service transitions the function through those states.

Our first feature using the function states lifecycle is a change to the recently announced improved VPC networking for AWS Lambda functions. As stated in the announcement post, Lambda creates the ENIs required for your function to connect to your VPCs, which can take 60–90 seconds to complete. We are updating this operation to explicitly place the function into a Pending state while pre-creating the required elastic network interface resources, and transitioning to an Active state after that process is completed. By doing this, we can use the lifecycle to complete the creation of these resources, and then reduce inconsistent invokes after the create/update has completed.

Most customers experience no impact from this change except for fewer long cold-starts due to network resource creation. As a reminder, any invocations or other API actions that operate on the function will fail during the time before the function is Active. To help you adopt this behavior, we are rolling it out for VPC-configured functions in a phased manner. This post provides further details about the timelines, how to test the change before it is 100% live, and how to delay it for your functions using a delay mechanism.

Changes to function create and update

On function create

During creation of new functions configured for VPC, your function remains in the Pending state until all VPC resources are created. You are not able to invoke the function or take any other Lambda API actions against it. After successful completion of the creation of these resources, your function transitions automatically to the Active state and is available for invokes and Lambda API actions. If the network resources fail to create then your function is placed in a Failed state.

On function update

During the update of functions configured for VPC, if there are any modifications to the VPC configuration, the function remains in the Active state, but shows in the InProgress status until all VPC resources are updated. During this time, any invokes go to the previous function code and configuration. After successful completion, the function LastUpdateStatus transitions automatically to Successful and all new invokes use the newly updated code and configuration. If the network resources fail to be created/updated then the LastUpdateStatus shows Failed, but the previous code and configuration remains in the Active state.

It’s important to note that creation or update of VPC resources can take between 60 and 90 seconds to complete.

Change timeframe

As a reminder, all functions today show an Active state only. We are rolling out this change to create resources during the Pending state over a multi-phase period, starting with the Begin Testing phase today, December 16, 2019. The phases allow you to update tooling for deploying and managing Lambda functions to account for this change. By the end of the update timeline, all accounts transition to using this new VPC resource create/update Lambda lifecycle.

Update timeline

Update timeline

December 16, 2019 – Begin Testing: You can now begin testing and updating any deployment or management tools you have to account for the upcoming lifecycle change. You can also use this time to update your function configuration to delay the change until the Delayed Update phase.

January 20, 2020 – General Update: All customers without the delayed update configuration begin seeing functions transition as described above under “On function create” and “On function update”.

February 17, 2020 – Delayed Update: The delay mechanism expires and customers now see the new VPC resource lifecycle applied during function create or update.

March 2, 2020 – Update End: All functions now have the new VPC resource lifecycle applied during function create or update.

Opt-in and delayed update configurations

Starting today, we are providing an opt-in mechanism that allows you to update and test your tools and developer workflow processes for this change. We are also providing a mechanism to delay this change until the end of the Delayed Update phase. If you configure your functions for VPC and use the delayed update mechanism after the start of the General Update, your functions continue to experience a delayed first invocation due to VPC resource creation.

This mechanism operates on a function-by-function basis, so you can test and experiment individually without impacting your whole account. Once the General Update phase begins, all functions in an account that do not have the delayed update mechanism in place see the new lifecycle for their functions.

Both mechanisms work by adding a special string in the “Description” parameter of your Lambda functions. This string can be added to the prefix or suffix, or be the entire contents of the field.

To opt in:

aws:states:opt-in

To delay the update:

aws:states:opt-out

NOTE: The delay configuration mechanism has no impact after the Delayed Update phase ends.
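
If you manage functions programmatically, one possible way to set the marker is with a small script like the following Python (boto3) sketch; the function name is a placeholder, and the script simply appends the opt-in string to the function’s existing Description:

import boto3

lambda_client = boto3.client("lambda")
function_name = "my-vpc-function"  # placeholder function name

# Read the current description and append the opt-in marker
config = lambda_client.get_function_configuration(FunctionName=function_name)
description = (config.get("Description", "") + " aws:states:opt-in").strip()

# The marker can be a prefix, suffix, or the entire Description value
lambda_client.update_function_configuration(
    FunctionName=function_name,
    Description=description,
)

Use aws:states:opt-out in the same way to delay the update for a function.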

Here is how this looks in the console:

  1. I add the opt-in configuration to my function’s Description.

    Opt-in in Description

    Opt-in in Description

  2. When I choose Save at the top, I see the update begin. During this time, I am blocked from executing tests, updating my code, and making some configuration changes against the function.

    Function updating

    Function updating

  3. After the update completes, I can once again run tests and other console commands.

    Function update successful

    Function update successful

Once the opt-in is set for a function, then updates on that function go through the update flow shown above. If I don’t change my function’s VPC configuration, then updates to my function transition almost instantly to the Successful update status.

With this in place, you can now test your development workflow ahead of the General Update phase. Download the latest CLI (version 1.16.291 or greater) or SDKs in order to see function state and related attribute information.

Conclusion

With functions states, you can have better clarity on how the resources required by your Lambda function are being created. This change does not impact the way that functions are invoked or how your code is executed. While this is a minor change to when resources are created for your Lambda function, the result is even better consistency of performance. Combined with the original announcement of improved VPC networking for Lambda, you experience better consistency for invokes, greatly reduced cold-starts, and fewer network resources created for your functions.

Tracking the state of Lambda functions

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/tracking-the-state-of-lambda-functions/

AWS Lambda functions often require resources from other AWS services in order to execute successfully, such as AWS Identity and Access Management (IAM) roles or Amazon Virtual Private Cloud (Amazon VPC) network interfaces. When you create or update a function, Lambda provisions the required resources on your behalf that enable your function to execute. In most cases, this process is very fast, and your function is ready to be invoked or modified immediately. However, there are situations where this action can take longer.

To better communicate the current “state” of your function when resources are being created or updated, AWS Lambda now includes several additional attributes in the function information returned by several Lambda API actions. This change does not impact the way that functions are invoked or how your code is
executed. In this post we cover the various states your Lambda function can be in, the conditions that lead to them, and how the Lambda service transitions the function through different states. You can also read more about this in the AWS Lambda documentation, Monitoring the State of a Function with the Lambda API.

All about function states

There are two main lifecycles that a Lambda function goes through, depending on if it is being created for the first time or existing and being updated. Before we go into the flows, let’s cover what the function states are:

Pending

Pending is the first state that all functions go through when they are created. During Pending, Lambda attempts to create or configure any required external resources. A function can only be invoked while in the Active state, so any invocations or other API actions that operate on a Pending function will fail. Any automation built around invoking or updating functions immediately after creation must be modified to first check that the function has transitioned from Pending to Active.

Active

Active is reached after any configuration or provisioning of resources has completed during the initial creation of a function. Functions can only be invoked in the Active state.

During a function update, the state remains set to Active, but there is a separate attribute called LastUpdateStatus, which represents the state of an update of an active function. During an update, invocations execute the function’s previous code and configuration until it has successfully completed the update operation, after which the new code and configuration is used.

Failed

Failed is a state that represents something has gone wrong with either the configuration or provisioning of external resources.

Inactive

A function transitions to the Inactive state when it has been idle long enough for the Lambda service to reclaim the external resources that were configured for it. When you invoke a function that is in the Inactive state, it fails and sets the function to the Pending state until those resources are recreated. If the resources fail to be recreated, the function is set back to the Inactive state.

For all states, there are two other attributes that are set: StateReason and StateReasonCode. Both exist for troubleshooting issues and provide more information about the current state.

LastUpdateStatus

LastUpdateStatus represents a child lifecycle of the main function states lifecycle. It only becomes relevant during an update and signifies the changes a function goes through. There are three statuses associated with this:

InProgress

InProgress signifies that an update is happening on an existing function. While a function is in the InProgress status, invocations go to the function’s previous code and configuration.

Successful

Successful signifies that the update has completed. Once a function has been updated, this stays set until a further update.

Failed

Failed signifies that the function update has failed. The change is aborted and the function’s previous code and configuration remains in the Active state and available.

For all LastUpdateStatus values there are two other attributes that are set: LastUpdateStatusReason and LastUpdateStatusReasonCode. These exist to help troubleshoot any issues with an update.

Function State Lifecycles

The transition of a function from one state to another is completely dependent on the actions taken against it. There is no way for you to manually move a function between states.

Create function state lifecycle

Create function state lifecycle

For all functions the primary state lifecycle looks like this:

  • On create, a function enters the Pending state
  • Once any required resources are successfully created, it transitions into the Active state
    • If resource or function create fails for any reason then it transitions to the Failed state
  • If a function remains idle for several weeks it will enter the Inactive state
  • On the first invoke after going Inactive, a function re-enters the Pending state
    • Success sets the state to Active again
    • Failure sets the state back to Inactive
  • Successfully updating an Inactive function also sets the state back to Active again

For updating functions, the lifecycle looks like this:

Update function state lifecycle

Update function state lifecycle

NOTE: Functions can only be updated if they are in the Active or Inactive state. Update commands issued against functions that aren’t in the Active or Inactive state will fail before execution.

  • On update, a function’s LastUpdateStatus is set to InProgress
    • Success sets the LastUpdateStatus to Successful
    • Failure sets the LastUpdateStatus to Failed
  • A function that is Inactive will go back to Active after a successful update.

Accessing function state information

You can query the current state of the function using the latest AWS SDKs and AWS CLI (version 1.16.291 or greater). To make it easy to write code that checks the state of a function, we’ve added a new set of Waiters to the AWS SDKs. Waiters are a feature of certain AWS SDKs that can be used when AWS services have asynchronous operations that you would otherwise need to poll for a change in state or status. Using waiters removes the need to write complicated polling code and makes it easy to check the state of a Lambda function. You can find out more about these in the documentation for each SDK: Python, Node.js, Java, Ruby, Go, and PHP.
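
As a minimal illustration, the following Python (boto3) sketch assumes a recent boto3 release that includes the Lambda function_active waiter and uses a placeholder function name:

import boto3

lambda_client = boto3.client("lambda")
function_name = "my-function"  # placeholder function name

# Block until the function leaves the Pending state and becomes Active
waiter = lambda_client.get_waiter("function_active")
waiter.wait(FunctionName=function_name)

# Inspect the state attributes returned by GetFunctionConfiguration
config = lambda_client.get_function_configuration(FunctionName=function_name)
print(config["State"], config.get("StateReason"), config.get("StateReasonCode"))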

What’s next

Starting today, all functions will show an Active state only. You will not see a function transition from Pending at all.

Our first feature leveraging the function states lifecycle will be a change to the recently announced improved VPC networking for AWS Lambda functions. As stated in the announcement post, Lambda precreates the ENIs required for your function to connect to your VPCs, which can take 60 to 90 seconds to complete. We will be changing this process slightly, by creating the required ENI resources while the function is placed in a Pending state and transitioning to Active after that process is completed. Remember, any invocations or other API actions that operate on the function will fail during the time before the function is Active. We will roll this behavior out for VPC-configured functions in a phased manner, and will provide further details about timelines and approaches to test before it is 100% live in an upcoming post.

Until then, we encourage you to update the tooling you use to deploy and manage Lambda-based applications to confirm that your function is in the Active state before trying to invoke or perform other actions against them.  If you use AWS tools like AWS CloudFormation or AWS SAM, or tools from Lambda’s partners to deploy and manage your functions, there is no further action required.

Update: Issue affecting HashiCorp Terraform resource deletions after the VPC Improvements to AWS Lambda

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/update-issue-affecting-hashicorp-terraform-resource-deletions-after-the-vpc-improvements-to-aws-lambda/

On September 3, 2019, we announced an exciting update that improves the performance, scale, and efficiency of AWS Lambda functions when working with Amazon VPC networks. You can learn more about the improvements in the original blog post. These improvements represent a significant change in how elastic network interfaces (ENIs) are configured to connect to your VPCs. With this new model, we identified an issue where VPC resources, such as subnets, security groups, and VPCs, can fail to be destroyed via HashiCorp Terraform. More information about the issue can be found here. In this post we will help you identify whether this issue affects you and the steps to resolve the issue.

How do I know if I’m affected by this issue?

This issue only affects you if you use HashiCorp Terraform to destroy environments. Versions of Terraform AWS Provider that are v2.30.0 or older are impacted by this issue. With these versions you may encounter errors when destroying environments that contain AWS Lambda functions, VPC subnets, security groups, and Amazon VPCs. Typically, terraform destroy fails with errors similar to the following:

Error deleting subnet: timeout while waiting for state to become 'destroyed' (last state: 'pending', timeout: 20m0s)

Error deleting security group: DependencyViolation: resource sg-<id> has a dependent object
        	status code: 400, request id: <guid>

Depending on which AWS Regions the VPC improvements are rolled out, you may encounter these errors in some Regions and not others.

How do I resolve this issue if I am affected?

You have two options to resolve this issue. The recommended option is to upgrade your Terraform AWS Provider to v2.31.0 or later. To learn more about upgrading the Provider, visit the Terraform AWS Provider Version 2 Upgrade Guide. You can find information and source code for the latest releases of the AWS Provider on this page. The latest version of the Terraform AWS Provider contains a fix for this issue as well as changes that improve the reliability of the environment destruction process. We highly recommend that you upgrade the Provider version as the preferred option to resolve this issue.

If you are unable to upgrade the Provider version, you can mitigate the issue by making changes to your Terraform configuration. You need to make the following sets of changes to your configuration:

  1. Add an explicit dependency, using a depends_on argument, to the aws_security_group and aws_subnet resources that you use with your Lambda functions. The dependency has to be added on the aws_security_group or aws_subnet and must target the IAM policy attachment associated with the IAM role configured on the Lambda function (the aws_iam_role_policy_attachment resource in the following example). See the example below for more details.
  2. Override the delete timeout for all aws_security_group and aws_subnet resources. The timeout should be set to 40 minutes.

The following configuration file shows an example where these changes have been made:

provider "aws" {
    region = "eu-central-1"
}
 
resource "aws_iam_role" "lambda_exec_role" {
  name = "lambda_exec_role"
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}
 
data "aws_iam_policy" "LambdaVPCAccess" {
  arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
}
 
resource "aws_iam_role_policy_attachment" "sto-lambda-vpc-role-policy-attach" {
  role       = "${aws_iam_role.lambda_exec_role.name}"
  policy_arn = "${data.aws_iam_policy.LambdaVPCAccess.arn}"
}
 
resource "aws_security_group" "allow_tls" {
  name        = "allow_tls"
  description = "Allow TLS inbound traffic"
  vpc_id      = "vpc-<id>"
 
  ingress {
    # TLS (change to whatever ports you need)
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    # Please restrict your ingress to only necessary IPs and ports.
    # Opening to 0.0.0.0/0 can lead to security vulnerabilities.
    cidr_blocks = ["0.0.0.0/0"]
  }
 
  egress {
    from_port       = 0
    to_port         = 0
    protocol        = "tcp"
    cidr_blocks     = ["0.0.0.0/0"]
  }
  
  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]  
}
 
resource "aws_subnet" "main" {
  vpc_id     = "vpc-<id>"
  cidr_block = "172.31.68.0/24"

  timeouts {
    delete = "40m"
  }
  depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]
}
 
resource "aws_lambda_function" "demo_lambda" {
    function_name = "demo_lambda"
    handler = "index.handler"
    runtime = "nodejs10.x"
    filename = "function.zip"
    source_code_hash = "${filebase64sha256("function.zip")}"
    role = "${aws_iam_role.lambda_exec_role.arn}"
    vpc_config {
     subnet_ids         = ["${aws_subnet.main.id}"]
     security_group_ids = ["${aws_security_group.allow_tls.id}"]
  }
}

The key block to note here is the following, which can be seen in both the “allow_tls” security group and “main” subnet resources:

timeouts {
  delete = "40m"
}
depends_on = ["aws_iam_role_policy_attachment.sto-lambda-vpc-role-policy-attach"]

These changes should be made to your Terraform configuration files before destroying your environments for the first time.

Can I delete resources remaining after a failed destroy operation?

Destroying environments without upgrading the provider or making the configuration changes outlined above may result in failures. As a result, you may have ENIs in your account that remain due to a failed destroy operation. These ENIs can be manually deleted a few minutes after the Lambda functions that use them have been deleted (typically within 40 minutes). Once the ENIs have been deleted, you can re-run terraform destroy.
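
If you have many leftover ENIs, a short script can help you find them. The following Python (boto3) sketch assumes a placeholder VPC ID; it lists detached (available) network interfaces in the affected VPC and deletes them, so review the output carefully before running the delete step:

import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")
vpc_id = "vpc-0123456789abcdef0"  # placeholder: the VPC used by your Lambda functions

# Find network interfaces in the VPC that are no longer attached to anything
response = ec2.describe_network_interfaces(
    Filters=[
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": "status", "Values": ["available"]},
    ]
)

for eni in response["NetworkInterfaces"]:
    print("Deleting", eni["NetworkInterfaceId"], eni.get("Description", ""))
    ec2.delete_network_interface(NetworkInterfaceId=eni["NetworkInterfaceId"])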

Automating your lift-and-shift migration at no cost with CloudEndure Migration

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/automating-your-lift-and-shift-migration-at-no-cost-with-cloudendure-migration/

This post is courtesy of Gonen Stein, Head of Product Strategy, CloudEndure 

Acquired by AWS in January 2019, CloudEndure offers a highly automated migration tool to simplify and expedite rehost (lift-and-shift) migrations. AWS recently announced that CloudEndure Migration is now available to all customers and partners at no charge.

Each free CloudEndure Migration license provides 90 days of use following agent installation. During this period, you can perform all migration steps: replicate your source machines, conduct tests, and perform a scheduled cutover to complete the migration.

Overview

In this post, I show you how to obtain free CloudEndure Migration licenses and how to use CloudEndure Migration to rehost a machine from an on-premises environment to AWS. Although I’ve chosen to focus on an on-premises-to-AWS use case, CloudEndure also supports migration from cloud-based environments. For those of you who are interested in advanced automation, I include information on how to further automate large-scale migration projects.

Understanding CloudEndure Migration

CloudEndure Migration works through an agent that you install on your source machines. No reboot is required, and there is no performance impact on your source environment. The CloudEndure Agent connects to the CloudEndure User Console, which issues an API call to your target AWS Region. The API call creates a Staging Area in your AWS account that is designated to receive replicated data.

CloudEndure Migration automated rehosting consists of three main steps:

  1. Installing the Agent: The CloudEndure Agent replicates entire machines to a Staging Area in your target AWS account.
  2. Configuration and testing: You configure your target machine settings and launch non-disruptive tests.
  3. Performing cutover: CloudEndure automatically converts machines to run natively in AWS.

The Staging Area comprises both lightweight Amazon EC2 instances that act as replication servers and staging Amazon EBS volumes. Each source disk maps to an identically sized EBS volume in the Staging Area. The replication servers receive data from the CloudEndure Agent running on the source machines and write this data onto staging EBS volumes. One replication server can handle multiple source machines replicating concurrently.

After all source disks copy to the Staging Area, the CloudEndure Agent continues to track and replicate any changes made to the source disks. Continuous replication occurs at the block level, enabling CloudEndure to replicate any application that runs on supported x86-based Windows or Linux operating systems via an installed agent.

When the target machines launch for testing or cutover, CloudEndure automatically converts the target machines so that they boot and run natively on AWS. This conversion includes injecting the appropriate AWS drivers, making appropriate bootloader changes, modifying network adapters, and activating operating systems using the AWS KMS. This machine conversion process generally takes less than a minute, irrespective of machine size, and runs on all launched machines in parallel.

CloudEndure Migration Architecture

CloudEndure Migration Architecture

Installing CloudEndure Agent

To install the Agent:

  1. Start your migration project by registering for a free CloudEndure Migration license. The registration process is quick–use your email address to create a username and password for the CloudEndure User Console. Use this console to create and manage your migration projects.

    After you log in to the CloudEndure User Console, you need an AWS access key ID and secret access key to connect your CloudEndure Migration project to your AWS account. To obtain these credentials, sign in to the AWS Management Console and create an IAM user. Then enter your AWS credentials in the CloudEndure User Console.

  2. Configure and save your migration replication settings, including your migration project’s Migration Source, Migration Target, and Staging Area. For example, I selected the Migration Source: Other Infrastructure, because I am migrating on-premises machines. Also, I selected the Migration Target AWS Region: AWS US East (N. Virginia). The CloudEndure User Console also enables you to configure a variety of other settings after you select your source and target, such as subnet, security group, VPN or Direct Connect usage, and encryption.

    After you save your replication settings, CloudEndure prompts you to install the CloudEndure Agent on your source machines. In my example, the source machines consist of an application server, a web server, and a database server, and all three are running Debian GNU / Linux 9.

  3. Download the CloudEndure Agent Installer for Linux by running the following command:
    wget -O ./installer_linux.py https://console.cloudendure.com/installer_linux.py
  4. Run the Installer:
    sudo python ./installer_linux.py -t <INSTALLATION TOKEN> --no-prompt

    You can install the CloudEndure Agent locally on each machine. For large-scale migrations, use the unattended installation parameters with any standard deployment tool to remotely install the CloudEndure Agent on your machines.

    After the Agent installation completes, CloudEndure adds your source machines to the CloudEndure User Console. From there, your source machines undergo several initial replication steps. To obtain a detailed breakdown of these steps, in the CloudEndure User Console, choose Machines, and select a specific machine to open the Machine Details View page.

    details page

    Data replication consists of two stages: Initial Sync and Continuous Data Replication. During Initial Sync, CloudEndure copies all of the source disks’ existing content into EBS volumes in the Staging Area. After Initial Sync completes, Continuous Data Replication begins, tracking your source machines and replicating new data writes to the staging EBS volumes. Continuous Data Replication makes sure that your Staging Area always has the most up-to-date copy of your source machines.

  5. To track your source machines’ data replication progress, in the CloudEndure User Console, choose Machines, and see the Details view.
    When the Data Replication Progress status column reads Continuous Data Replication, and the Migration Lifecycle status column reads Ready for Testing, Initial Sync is complete. These statuses indicate that the machines are functioning correctly and are ready for testing and migration.

Configuration and testing

To test how your machine runs on AWS, you must configure the Target Machine Blueprint. This Blueprint is a set of configurations that define where and how the target machines are launched and provisioned, such as target subnet, security groups, instance type, volume type, and tags.

For large-scale migration projects, APIs can be used to configure the Blueprint for all of your machines within seconds.

I recommend performing a test at least two weeks before migrating your source machines, to give you enough time to identify potential problems and resolve them before you perform the actual cutover. For more information, see Migration Best Practices.

To launch machines in Test Mode:

  1. In the CloudEndure User Console, choose Machines.
  2. Select the Name box corresponding to each machine to test.
  3. Choose Launch Target Machines, Test Mode.

    Launch target test machines

After the target machines launch in Test Mode, the CloudEndure User Console reports those machines as Tested and records the date and time of the test.

Performing cutover

After you have completed testing, your machines continue to be in Continuous Data Replication mode until the scheduled cutover window.

When you are ready to perform a cutover:

  1. In the CloudEndure User Console, choose Machines.
  2. Select the Name box corresponding to each machine to migrate.
  3. Choose Launch Target Machines, Cutover Mode.
    To confirm that your target machines successfully launch, see the Launch Target Machines menu. As your data replicates, verify that the target machines are running correctly, make any necessary configuration adjustments, perform user acceptance testing (UAT) on your applications and databases, and redirect your users.
  4. After the cutover completes, remove the CloudEndure Agent from the source machines and the CloudEndure User Console.
  5. At this point, you can also decommission your source machines.

Conclusion

In this post, I showed how to rehost an on-premises workload to AWS using CloudEndure Migration. CloudEndure automatically converts your machines from any source infrastructure to AWS infrastructure. That means they can boot and run natively in AWS, and run as expected after migration to the cloud.

If you have further questions see CloudEndure Migration, or Registering to CloudEndure Migration.

Get started now with free CloudEndure Migration licenses.

Announcing improved VPC networking for AWS Lambda functions

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/announcing-improved-vpc-networking-for-aws-lambda-functions/

We’re excited to announce a major improvement to how AWS Lambda functions work with your Amazon VPC networks. With today’s launch, you will see dramatic improvements to function startup performance and more efficient usage of elastic network interfaces. These improvements are rolling out to all existing and new VPC functions, at no additional cost. Roll out begins today and continues gradually over the next couple of months across all Regions.

Lambda first supported VPCs in February 2016, allowing you to access resources in your VPCs or on-premises systems using an AWS Direct Connect link. Since then, we’ve seen customers widely use VPC connectivity to access many different services:

  • Relational databases such as Amazon RDS
  • Datastores such as Amazon ElastiCache for Redis or Amazon Elasticsearch Service
  • Other services running in Amazon EC2 or Amazon ECS

Today’s launch brings enhancements to how this connectivity between Lambda and your VPCs works.

Previous Lambda integration model for VPCs

It helps to understand some basics about the way that networking with Lambda works. Today, all of the compute infrastructure for Lambda runs inside of VPCs owned by the service.

Lambda runs in a VPC controlled by the service

When you invoke a Lambda function from any invocation source and with any execution model (synchronous, asynchronous, or poll based), it occurs through the Lambda API. There is no direct network access to the execution environment where your functions run:

Invoke access only through the Lambda API

By default, when your Lambda function is not configured to connect to your own VPCs, the function can access anything available on the public internet such as other AWS services, HTTPS endpoints for APIs, or services and endpoints outside AWS. The function then has no way to connect to your private resources inside of your VPC.

When you configure your Lambda function to connect to your own VPC, it creates an elastic network interface in your VPC and then does a cross-account attachment. These network interfaces allow network access from your Lambda functions to your private resources. These Lambda functions continue to run inside of the Lambda service’s VPC and can now only access resources over the network through your VPC.

All invocations for functions continue to come from the Lambda service’s API and you still do not have direct network access to the execution environment. The attached network interface exists inside of your account and counts towards limits that exist in your account for that Region.

Lambda mapped to a customer ENI

With this model, there are a number of challenges that you might face. The time taken to create and attach new network interfaces can cause longer cold-starts, increasing the time it takes to spin up a new execution environment before your code can be invoked.

As your function’s execution environment scales to handle an increase in requests, more network interfaces are created and attached to the Lambda infrastructure. The exact number of network interfaces created and attached is a factor of your function configuration and concurrency.

Many ENIs mapped to Lambda environments

Every network interface created for your function is associated with and consumes an IP address in your VPC subnets. It counts towards your account level maximum limit of network interfaces.

As your function scales, you have to be mindful of several issues:

  • Managing the IP address space in your subnets
  • Reaching the account level network interface limit
  • The potential to hit the API rate limit on creating new network interfaces

What’s changing

Starting today, we’re changing the way that your functions connect to your VPCs. AWS Hyperplane, the Network Function Virtualization platform used for Network Load Balancer and NAT Gateway, has supported inter-VPC connectivity for offerings like AWS PrivateLink, and we are now leveraging Hyperplane to provide NAT capabilities from the Lambda VPC to customer VPCs.

The Hyperplane ENI is a managed network resource that the Lambda service controls, allowing multiple execution environments to securely access resources inside of VPCs in your account. Instead of the previous solution of mapping network interfaces in your VPC directly to Lambda execution environments, network interfaces in your VPC are mapped to the Hyperplane ENI and the functions connect using it.
VPC to VPC NAT

Hyperplane still uses a cross account attached network interface that exists in your VPC, but in a greatly improved way:

  • The network interface creation happens when your Lambda function is created or its VPC settings are updated. When a function is invoked, the execution environment simply uses the pre-created network interface and quickly establishes a network tunnel to it. This dramatically reduces the latency that was previously associated with creating and attaching a network interface at cold start.
  • Because the network interfaces are shared across execution environments, typically only a handful of network interfaces are required per function. Every unique security group:subnet combination across functions in your account requires a distinct network interface. If a combination is shared across multiple functions in your account, we reuse the same network interface across functions.
  • Your function scaling is no longer directly tied to the number of network interfaces, and Hyperplane ENIs can scale to support large numbers of concurrent function executions.

Hyperplane ENIs are tied to a security group:subnet combination in your account. Functions in the same account that share the same security group:subnet pairing use the same network interfaces. This way, a single application with multiple functions but the same network and security configuration can benefit from the existing interface configuration.

Several things do not change in this new model:

  • Your Lambda functions still need the IAM permissions required to create and delete network interfaces in your VPC.
  • You still control the subnet and security group configurations of these network interfaces. You can continue to apply normal network security controls and follow best practices on VPC configuration.
  • You still have to use a NAT device (for example, a VPC NAT Gateway) to give a function internet access or use VPC endpoints to connect to services outside of your VPC.
  • Nothing changes about the types of resources that your functions can access inside of your VPCs.

As mentioned earlier, Hyperplane now creates a shared network interface when your Lambda function is first created or when its VPC settings are updated, improving function setup performance and scalability. This one-time setup can take up to 90 seconds to complete. If your Lambda function is invoked while the shared network interface is still being created, the invocation still succeeds but may see longer cold-start times. When setup is complete, your Lambda function no longer requires network interface creation at invocation time. If Lambda functions in an account go idle for consecutive weeks, the service will reclaim the unused Hyperplane resources and so very infrequently invoked functions may still see longer cold-start times.

As I mentioned earlier, you don’t have to enable this new capability yourself, and there are no new associated costs. Starting today, we are gradually rolling this out across all AWS Regions over the next couple of months. We will update this post on a Region by Region basis after the rollout has completed in a given Region.

If you have existing VPC-configured Lambda workloads, you will soon see your applications scale and respond to new invocations more quickly. You can see this change using tools like AWS X-Ray or those from Lambda’s third party partners.

The following screenshots show a before/after example taken with AWS X-Ray:

The first shows a function that was launched in a VPC, with a full cold start with network interface creation at invocation.

In this second image, you see the same function executed but this time in an account that has the new VPC networking capability. You can see the massive difference that this has made in function execution duration, with it dropping to 933 ms from 14.8 seconds!

Conclusion

On the AWS Lambda team, we consider “Simplicity” a core tenet to how we think about building services for our customers. We’re constantly evaluating ways that we can improve and simplify the experience that you see in building serverless applications with Lambda. These changes in how we connect with your VPCs improve the performance and scale for your Lambda functions. They enable you to harness the full power of serverless architectures.

New: Using Amazon EC2 Instance Connect for SSH access to your EC2 Instances

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/new-using-amazon-ec2-instance-connect-for-ssh-access-to-your-ec2-instances/

This post is courtesy of Saloni Sonpal – Senior Product Manager – Amazon EC2

Today, AWS is introducing Amazon EC2 Instance Connect, a new way to control SSH access to your EC2 instances using AWS Identity and Access Management (IAM).

About Amazon EC2 Instance Connect

While infrastructure as code (IaC) tools such as Chef and Puppet have become customary in the industry for configuring servers, you occasionally must access your instances to fine-tune, consult system logs, or debug application issues. The most common tool to connect to Linux servers is Secure Shell (SSH). It was created in 1995 and is now installed by default on almost every Linux distribution.

When connecting to hosts via SSH, SSH key pairs are often used to individually authorize users. As a result, organizations have to store, share, manage access for, and maintain these SSH keys.

Some organizations also maintain bastion hosts, which help limit network access into hosts by the use of a single jump point. They provide logging and prevent rogue SSH access by adding an additional layer of network obfuscation. However, running bastion hosts comes with challenges. You maintain the installed user keys, handle rotation, and make sure that the bastion host is always available and, more importantly, secured.

Amazon EC2 Instance Connect simplifies many of these issues and provides the following benefits to help improve your security posture:

  • Centralized access control – You get centralized access control to your EC2 instances on a per-user and per-instance level. IAM policies and principals remove the need to share and manage SSH keys.
  • Short-lived keys – SSH keys are not persisted on the instance, but are ephemeral in nature. They are only accessible by an instance at the time that an authorized user connects, making it easier to grant or revoke access in real time. This also allows you to move away from long-lived keys. Instead, you generate one-time SSH keys each time that an authorized user connects, eliminating the need to track and maintain keys.
  • Auditability – User connections via EC2 Instance Connect are logged to AWS CloudTrail, providing the visibility needed to easily audit connection requests and maintain compliance.
  • Ubiquitous access – EC2 Instance Connect works seamlessly with your existing SSH client. You can also connect to your instances from a new browser-based SSH client in the EC2 console, providing a consistent experience without having to change your workflows or tools.

How EC2 Instance Connect works

When the EC2 Instance Connect feature is enabled on an instance, the SSH daemon (sshd) on that instance is configured to use a custom AuthorizedKeysCommand script. During the SSH authentication process, this script reads SSH public keys from instance metadata and uses them to authenticate your connection to the instance.

The SSH public keys are only available for one-time use for 60 seconds in the instance metadata. To connect to the instance successfully, you must connect using SSH within this time window. Because the keys expire, there is no need to track or manage these keys directly, as you did previously.

Configuring an EC2 instance for EC2 Instance Connect

To get started using EC2 Instance Connect, you first configure your existing instances. Currently, EC2 Instance Connect supports Amazon Linux 2 and Ubuntu. Install RPM or Debian packages respectively to enable the feature. New Amazon Linux 2 instances have the EC2 Instance Connect feature enabled by default, so you can connect to those newly launched instances right away using SSH without any further configuration.

First, configure an existing instance. In this case, set up an Amazon Linux 2 instance running in your account. For the steps for Ubuntu, see Set Up EC2 Instance Connect.

  1. Connect to the instance using SSH. The instance is running a relatively recent version of Amazon Linux 2:
    [ec2-user@<instance> ~]$ uname -srv
    Linux 4.14.104-95.84.amzn2.x86_64 #1 SMP Fri Jun 21 12:40:53 UTC 2019
  2. Use the yum command to install the ec2-instance-connect RPM package.
    $ sudo yum install ec2-instance-connect
    Loaded plugins: extras_suggestions, langpacks, priorities, update-motd
    Resolving Dependencies
    --> Running transaction check
    ---> Package ec2-instance-connect.noarch 0:1.1-9.amzn2 will be installed
    --> Finished Dependency Resolution
    
    ........
    
    Installed:
      ec2-instance-connect.noarch 0:1.1-9.amzn2                                                                                                     
    
    Complete!

    This RPM installs a few scripts locally and changes the AuthorizedKeysCommand and AuthorizedKeysCommandUser configurations in /etc/ssh/sshd_config. If you are using a configuration management tool to manage your sshd configuration, install the package and add the lines as described in the documentation.

With ec2-instance-connect installed, you are ready to set up your users and have them connect to instances.

Set up IAM users

First, allow an IAM user to be able to push their SSH keys up to EC2 Instance Connect. Create a new IAM policy so that you can add it to any other users in your organization.

  1. In the IAM console, choose Policies, Create Policy.
  2. On the Create Policy page, choose JSON and enter the following JSON document. Replace $REGION and $ACCOUNTID with your own values:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2-instance-connect:SendSSHPublicKey"
                ],
                "Resource": [
                    "arn:aws:ec2:$REGION:$ACCOUNTID:instance/*"
                ],
                "Condition": {
                    "StringEquals": {
                        "ec2:osuser": "ec2-user"
                    }
                }
            }
        ]
    }

    Use ec2-user as the value for ec2:osuser with Amazon Linux 2. Specify this so that the metadata is made available for the proper SSH user. For more information, see Actions, Resources, and Condition Keys for Amazon EC2 Instance Connect Service.

  3. Choose Review Policy.
  4. On the Review Policy page, name the policy, give it a description, and choose Create Policy.
  5. In the IAM console, create a new group to attach the policy to (I named mine “HostAdmins”).
  6. On the Attach Policy page, attach the policy that you just created and choose Next Step.
  7. Choose Create Group.
  8. On the Groups page, select your newly created group and choose Group Actions, Add Users to Group.
  9. Select the user or users to add to this group, then choose Add Users.

Your users can now use EC2 Instance Connect.

Connecting to an instance using EC2 Instance Connect

With your instance configured and the users set with the proper policy, connect to your instance with your normal SSH client or directly, using the AWS Management Console.

To offer a seamless SSH experience, EC2 Instance Connect wraps up these steps in a command line tool. It also offers a browser-based interface in the console, which takes care of the SSH key generation and distribution for you.

To connect with your SSH client

  1. Generate the new private and public keys mynew_key and mynew_key.pub, respectively:
    $ ssh-keygen -t rsa -f mynew_key
  2. Use the following AWS CLI command to authorize the user and push the public key to the instance using the send-ssh-public-key command. To support this, you need the latest version of the AWS CLI.
    $ aws ec2-instance-connect send-ssh-public-key --region us-east-1 --instance-id i-0989ec3292613a4f9 --availability-zone us-east-1f --instance-os-user ec2-user --ssh-public-key file://mynew_key.pub
    {
        "RequestId": "505f8675-710a-11e9-9263-4d440e7745c6", 
        "Success": true
    } 
  3. After authentication, the public key is made available to the instance through the instance metadata for 60 seconds. During this time, connect to the instance using the associated private key:
    $ ssh -i mynew_key ec2-user@<instance-public-dns>

If for some reason you don’t connect within that 60-second window, you see the following error:

$ ssh -i mynew_key ec2-user@<instance-public-dns>
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

If this happens, run the send-ssh-public-key command again and connect using SSH within the 60-second window.
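
If you prefer to script this flow, you can push the key with an AWS SDK instead of the AWS CLI. The following is a minimal sketch using Python and boto3, assuming the same instance ID, Availability Zone, and key pair shown earlier; replace these placeholder values with your own:

import boto3

# Push the public key to the instance metadata; it stays available for 60 seconds.
ec2_ic = boto3.client('ec2-instance-connect', region_name='us-east-1')

with open('mynew_key.pub') as f:
    public_key = f.read()

response = ec2_ic.send_ssh_public_key(
    InstanceId='i-0989ec3292613a4f9',   # placeholder instance ID from the example
    InstanceOSUser='ec2-user',          # must match the ec2:osuser condition in the IAM policy
    SSHPublicKey=public_key,
    AvailabilityZone='us-east-1f'       # placeholder Availability Zone
)
print(response['Success'])

After the call succeeds, connect with ssh -i mynew_key within the 60-second window, exactly as in the CLI example.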

Now, connect to your instance from the console.

To connect from the Amazon EC2 console

  1. Open the Amazon EC2 console.
  2. In the left navigation pane, choose Instances and select the instance to which to connect.
  3. Choose Connect.
  4. On the Connect To Your Instance page, choose EC2 Instance Connect (browser-based SSH connection), Connect.

A terminal window opens in your browser, and you are now connected to your instance through SSH.

Auditing with CloudTrail

Every connection attempt made through EC2 Instance Connect corresponds to a SendSSHPublicKey API call in CloudTrail. For each call, you can view event details such as the destination instance ID, OS user name, and the public key used to make the SSH connection.

In the CloudTrail console, search for SendSSHPublicKey.

If EC2 Instance Connect has been used recently, you should see records of your users having called this API operation to send their SSH key to the target host. Viewing the event’s details shows you the instance and other valuable information that you might want to use for auditing.

In the following example, see the JSON from a CloudTrail event that shows the SendSSHPublicKey command in use:

{
    "eventVersion": "1.05",
    "userIdentity": {
        "type": "User",
        "principalId": "1234567890",
        "arn": "arn:aws:iam:: 1234567890:$USER",
        "accountId": "1234567890",
        "accessKeyId": "ABCDEFGHIJK3RFIONTQQ",
        "userName": "$ACCOUNT_NAME",
        "sessionContext": {
            "attributes": {
                "mfaAuthenticated": "true",
                "creationDate": "2019-05-07T18:35:18Z"
            }
        }
    },
    "eventTime": "2019-06-21T13:58:32Z",
    "eventSource": "ec2-instance-connect.amazonaws.com",
    "eventName": "SendSSHPublicKey",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "34.204.194.237",
    "userAgent": "aws-cli/1.15.61 Python/2.7.16 Linux/4.14.77-70.59.amzn1.x86_64 botocore/1.10.60",
    "requestParameters": {
        "instanceId": "i-0989ec3292613a4f9",
        "osUser": "ec2-user",
        "SSHKey": {
            "publicKey": "ssh-rsa <removed>\\n"
        }
    },
    "responseElements": null,
    "requestID": "df1a5fa5-710a-11e9-9a13-cba345085725",
    "eventID": "070c0ca9-5878-4af9-8eca-6be4158359c4",
    "eventType": "AwsApiCall",
    "recipientAccountId": "1234567890"
}

If you’ve configured your AWS account to collect CloudTrail events in an S3 bucket, you can download and audit the information programmatically. For more information, see Getting and Viewing Your CloudTrail Log Files.
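
As a quicker way to pull these events without downloading log files, you can also query the CloudTrail event history directly. The following is a minimal sketch using Python and boto3 that looks up recent SendSSHPublicKey calls; the Region and time window are assumptions you would adjust:

import boto3
from datetime import datetime, timedelta

cloudtrail = boto3.client('cloudtrail', region_name='us-east-1')

# Look up SendSSHPublicKey calls from the last 24 hours.
events = cloudtrail.lookup_events(
    LookupAttributes=[
        {'AttributeKey': 'EventName', 'AttributeValue': 'SendSSHPublicKey'}
    ],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow()
)

for event in events['Events']:
    print(event['EventTime'], event['Username'])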

Conclusion

EC2 Instance Connect offers an alternative to complicated SSH key management strategies and includes the benefits of using built-in auditability with CloudTrail. By integrating with IAM and the EC2 instance metadata available on all EC2 instances, you get a secure way to distribute short-lived keys and control access by IAM policy.

There are some additional features in the works for EC2 Instance Connect. In the future, AWS hopes to launch tag-based authorization, which allows you to use resource tags in the condition of a policy to control access. AWS also plans to enable EC2 Instance Connect by default in popular Linux distros in addition to Amazon Linux 2.

EC2 Instance Connect is available now at no extra charge in the US East (Ohio and N. Virginia), US West (N. California and Oregon), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, and Tokyo), Canada (Central), EU (Frankfurt, Ireland, London, and Paris), and South America (São Paulo) AWS Regions.

Updated timeframe for the upcoming AWS Lambda and AWS Lambda@Edge execution environment update

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/updated-timeframe-for-the-upcoming-aws-lambda-and-aws-lambdaedge-execution-environment-update/

On May 14th we announced an upcoming update to the AWS Lambda and AWS Lambda@Edge execution environments. In that announcement we shared that we are updating the execution environment to a more recent version of Amazon Linux. This newer execution environment brings updates that offer improvements in capabilities, performance, security, and updated packages that your application code might interface with. The previous post explained approaches to proactively testing against the new update, and methods to update your code to be compatible in the rare case you were impacted.

So far, we’ve heard from many customers that their functions have not been impacted when testing against the new execution environment using the Opt-in mechanism. Those that have been impacted have been able to follow the guidance on rebuilding any dependencies against the new execution environment and retesting their functions successfully.

We also received feedback that customers wanted a longer time frame for validation, as well as more control over it. Based on this feedback, we’ve decided to modify the timeframe in two ways.

The first phase, Begin Testing, will be extended by three weeks; originally scheduled to end May 21, it now ends June 10. This will give you more time to test your functions with the Opt-in layer before any further changes to the platform kick in.

We are also taking the second phase, originally called Update/Create, and breaking it into two independent periods of time. The first, now referred to as the New Function Create phase, will be two weeks long; during this time, all newly created functions will use the new execution environment unless a delayed-update layer is configured. The second new phase, Existing Function Update, will be three weeks long; during this time, both newly created functions and existing functions that you update will use the new execution environment unless a delayed-update layer is configured.

The end result is that you now have 5 more weeks in total to test and potentially update your functions for this change before the General Update begins. As a reminder, starting at that time, all functions without a delayed-update layer configured will begin migrating to the new execution environment.

New update timeline

The following is the new timeline for the update, which is now broken up over five phases:

May 14, 2019 – Begin Testing: You can begin testing your functions for the new execution environment locally with AWS SAM CLI or using an Amazon EC2 instance running on Amazon Linux 2018.03. You can also proactively enable the new environment in AWS Lambda using the opt-in mechanism described in the original announcement post.
June 11, 2019 – New Function Create: All newly created functions will result in those functions running on the new execution environment, unless they have a delayed-update layer configured.
June 25, 2019 – Existing Function Update: All newly created functions or existing functions that you update will result in those functions running on the new execution environment, unless they have a delayed-update layer configured.
July 16, 2019 – General Update: Existing functions begin using the new execution environment on invoke, unless they have a delayed-update layer configured.
July 23, 2019 – Delayed Update End: All functions with a delayed-update layer configured start being migrated automatically.
July 29, 2019 – Migration End: All functions have been migrated over to the new execution environment.

Note that we have updated the original announcement post with this new timeline as well.

FAQ

We also wanted to take the chance to provide additional information on follow up questions customers have had about the update.

Q. How does this relate to the recent Node.js v10 runtime launch?
A. The Node.js v10 launch is unrelated and is not impacted by this change. The Node.js v10 runtime is based on Amazon Linux 2 as its execution environment. Please see the AWS Lambda Runtimes section in the documentation for more information.

Q. Does this update change the execution environment for other runtimes to run on Amazon Linux 2?
A. No, this update brings the execution environment to the latest Amazon Linux 1 distribution release. In the future, new runtimes will launch on Amazon Linux 2, but all previous existing runtimes will continue to run on Amazon Linux 1.

Q. Was this update related to the recent Intel Quarterly Security Release (QSR) 2019.1?
A. No, this motion to begin updating the execution environment for Lambda and Lambda@Edge is unrelated to the Intel QSR. There is no action for Lambda or Lambda@Edge customers to take in relation to the QSR.

Next Steps

Your feedback greatly matters to us and we will continue to listen and learn from you. Please continue to contact us through AWS Support, the AWS forums, or AWS account teams.

Upcoming updates to the AWS Lambda and AWS Lambda@Edge execution environment

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/upcoming-updates-to-the-aws-lambda-execution-environment/

AWS Lambda was first announced at AWS re:Invent 2014. Amazon CTO Werner Vogels highlighted that with Lambda there are no servers or instances to manage; you just write your code. In 2016, we announced the launch of Lambda@Edge, which lets you run Lambda functions to customize content that CloudFront delivers, executing the functions in AWS locations closer to the viewer.

At AWS, we talk often about “shared responsibility” models. Essentially, those are the places where there is a handoff between what we as a technology provider offer you and what you as the customer are responsible for. In the case of Lambda and Lambda@Edge, one of the key things that we manage is the “execution environment.” The execution environment is what your code runs inside of. It is composed of the underlying operating system, system packages, the runtime for your language (if a managed one), and common capabilities like environment variables. From the customer standpoint, your primary responsibility is for your application code and configuration.

In this post, we outline an upcoming change to the execution environment for Lambda and Lambda@Edge functions for all runtimes with the exception of Node.js v10. As with any update, some functionality could be affected. We encourage you to read through this post to understand the changes and any actions that you might need to take.

Update overview

AWS Lambda and AWS Lambda@Edge run on top of the Amazon Linux operating system distribution and maintain updates to both the core OS and managed language runtimes. We are updating the Lambda execution environment AMI to version 2018.03 of Amazon Linux. This newer AMI brings updates that offer improvements in capabilities, performance, security, and updated packages that your application code might interface with.

This does not apply to the recently announced Node.js v10 runtime which today runs on Amazon Linux 2.

The majority of functions will benefit seamlessly from the enhancements in this update without any action from you. However, in rare cases, package updates may introduce compatibility issues. Potentially impacted functions are those that contain libraries or application code compiled against specific underlying OS packages or other system libraries. If you are primarily using an AWS SDK or the AWS X-Ray SDK with no other dependencies, you will see no impact.

You have the following options in terms of next steps:

  • Take no action before the automatic update of the execution environment starting May 21 for all newly created/updated functions and June 11 for all existing functions.
  • Proactively test your functions against the new environment starting today.
  • Configure your functions to delay the execution environment update until June 18 to allow for a longer testing window.

In addition to the overall timeline for this change, this post also provides instructions on the following:

  • How to test your functions for this new execution environment locally and on Lambda/Lambda@Edge.
  • How to proactively update your functions.
  • How to extend the testing window by one week.

Update timeline

The following is the timeline for the update, which is broken up over four phases over the next several weeks:

May 14, 2019—Begin Testing: You can begin testing your functions for the new execution environment locally with AWS SAM CLI or using an Amazon EC2 instance running on Amazon Linux 2018.03. You can also proactively enable the new environment in AWS Lambda using the opt-in mechanism described later in this post.
May 21, 2019—Update/Create: All new function creates or function updates result in your functions running on the new execution environment.
June 11, 2019—General Update: Existing functions begin using the new execution environment on invoke unless they have a delayed-update layer configured.
June 18, 2019—Delayed Update End: All functions with a delayed-update layer configured start being migrated automatically.
June 24, 2019—Migration End: All functions have been migrated over to the new execution environment.

Recommended Approach

decision tree

You only have to act if your application uses dependencies that are compiled to work on the previous execution environment. Otherwise, you can continue to deploy new and updated Lambda functions without needing to perform any other testing steps. For those who aren’t sure if their functions use such dependencies, we encourage you to do a new deployment of your functions and to test their functionality.

There are two options for when you can start testing your functions on the new execution environment:

  • You can begin testing today using the opt-in mechanism described later.
  • Starting May 21, a new deploy or update of your functions uses the new execution environment.

If you confirm that your functions would be affected by the new execution environment, you can begin re-compiling or building your dependencies using the new reference AMI for the execution environment today and then repeat the testing. The final step is to redeploy your applications any time after May 21 to use the new execution environment.

Building your dependencies and application for the new execution environment

Because we are basing the environment off of an existing Amazon Linux AMI, you can start with building and testing your code against that AMI on EC2. With an updated EC2 instance running this AMI, you can compile and build your packages using your normal processes. For the list of AMI IDs in all public Regions, check the release notes. To start an EC2 instance running this AMI, follow the steps in the Launching an Instance Using the Launch Instance Wizard topic in the Amazon EC2 User Guide.

Opt-in/Delayed-Update with Lambda layers

Some of you may want to begin testing as soon as you’ve read this announcement. Others know that they should postpone until later in the timeline.

To give you some control over testing, we’re releasing two special Lambda layers. Lambda layers can be used to provide shared resources, code, or data between Lambda functions and can simplify the deployment and update process. These layers don’t actually contain any data or code. Instead, they act as a special flag to Lambda to run your function executions either specifically on the new or old execution environment.

The Opt-In layer allows you to start testing today. You can use the Delayed-Update layer when you know that you must make updates to your function or its configuration after May 21, but aren’t ready to deploy to the new execution environment. The Delayed-Update layer extends the initial period available to you to deploy your functions by one week until the end of June 17, without changing the execution environment.

Neither layer brings any performance or runtime changes beyond this. After June 24, the layers will have no functionality. In a future deployment, you should remove them from any function configurations.

The ARNs for the two scenarios:

  • To OPT-IN to the update to the new execution environment, add the following layer:

arn:aws:lambda:::awslayer:AmazonLinux1803

  • To DELAY THE UPDATE to the new execution environment until June 18, add the following layer:

arn:aws:lambda:::awslayer:AmazonLinux1703

The action for adding a layer to your existing functions requires an update to the Lambda function’s configuration. You can do this with the AWS CLI, AWS CloudFormation or AWS SAM, popular third-party frameworks, the AWS Management Console, or an AWS SDK.
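
For example, here is a minimal sketch of adding the Opt-In layer with the AWS SDK for Python (boto3); the function name is a placeholder, and if your function already uses layers you would include their ARNs in the list as well, because this call replaces the configured layer list:

import boto3

lambda_client = boto3.client('lambda')

# Append the Opt-In layer to the function's configuration.
lambda_client.update_function_configuration(
    FunctionName='my-function',   # placeholder function name
    Layers=['arn:aws:lambda:::awslayer:AmazonLinux1803']
)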

Validating your functions

There are several ways for you to test your function code and ensure that it will work after the execution environment has been updated.

Local testing

We’re providing an update to the AWS SAM CLI to enable you to test your functions locally against this new execution environment. The AWS SAM CLI uses a Docker image that mirrors the live Lambda environment locally wherever you do development. To test against this new update, make sure that you have the most recent update to AWS SAM CLI version 0.16.0. You also should have an AWS SAM template configured for your function.

  1. Install or update the AWS SAM CLI:
    $ pip install --upgrade aws-sam-cli

    -Or-

    $ pip install aws-sam-cli
  2. Confirm that you have a valid AWS SAM template:
    $ sam validate -t <template file name>

    If you don’t have a valid AWS SAM template, you can begin with a basic template to test your functions. The following example represents the basic needs for running your function against a variety of runtimes. The Runtime value must be listed in the AWS Lambda Runtimes topic.

    AWSTemplateFormatVersion: 2010-09-09
    Transform: 'AWS::Serverless-2016-10-31'
    
    Resources:
      myFunction:
        Type: 'AWS::Serverless::Function'
        Properties:
          CodeUri: ./ 
          Handler: YOUR_HANDLER
          Runtime: YOUR_RUNTIME
  3. With a valid template, you can begin testing your function with mock event payloads. To generate a mock event payload, you can use the AWS SAM CLI local generate-event command. Here is an example of that command being run to generate an Amazon S3 notification type of event:
    sam local generate-event s3 put --bucket munns-test --key somephoto.jpeg
    {
      "Records": [
        {
          "eventVersion": "2.0", 
          "eventTime": "1970-01-01T00:00:00.000Z", 
          "requestParameters": {
            "sourceIPAddress": "127.0.0.1"
          }, 
          "s3": {
            "configurationId": "testConfigRule", 
            "object": {
              "eTag": "0123456789abcdef0123456789abcdef", 
              "sequencer": "0A1B2C3D4E5F678901", 
              "key": "somephoto.jpeg", 
              "size": 1024
            }, 
            "bucket": {
              "arn": "arn:aws:s3:::munns-test", 
              "name": "munns-test", 
              "ownerIdentity": {
                "principalId": "EXAMPLE"
              }
            }, 
            "s3SchemaVersion": "1.0"
          }, 
          "responseElements": {
            "x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH", 
            "x-amz-request-id": "EXAMPLE123456789"
          }, 
          "awsRegion": "us-east-1", 
          "eventName": "ObjectCreated:Put", 
          "userIdentity": {
            "principalId": "EXAMPLE"
          }, 
          "eventSource": "aws:s3"
        }
      ]
    }

    You can then use the AWS SAM CLI local invoke command and pipe in the output from the previous command. Or, you can save the output from the previous command to a file and then pass in a reference to the file’s name and location with the -e flag. Here is an example of the pipe event method:

    sam local generate-event s3 put --bucket munns-test --key somephoto.jpeg | sam local invoke myFunction
    2019-02-19 18:45:53 Reading invoke payload from stdin (you can also pass it from file with --event)
    2019-02-19 18:45:53 Found credentials in shared credentials file: ~/.aws/credentials
    2019-02-19 18:45:53 Invoking index.handler (python2.7)
    
    Fetching lambci/lambda:python2.7 Docker container image......
    2019-02-19 18:45:53 Mounting /home/ec2-user/environment/forblog as /var/task:ro inside runtime container
    START RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34 Version: $LATEST
    {"Records": [{"eventVersion": "2.0", "eventTime": "1970-01-01T00:00:00.000Z", "requestParameters": {"sourceIPAddress": "127.0.0.1"}, "s3": {"configurationId": "testConfigRule", "object": {"eTag": "0123456789abcdef0123456789abcdef", "key": "somephoto.jpeg", "sequencer": "0A1B2C3D4E5F678901", "size": 1024}, "bucket": {"ownerIdentity": {"principalId": "EXAMPLE"}, "name": "munns-test", "arn": "arn:aws:s3:::munns-test"}, "s3SchemaVersion": "1.0"}, "responseElements": {"x-amz-id-2": "EXAMPLE123/5678abcdefghijklambdaisawesome/mnopqrstuvwxyzABCDEFGH", "x-amz-request-id": "EXAMPLE123456789"}, "awsRegion": "us-east-1", "eventName": "ObjectCreated:Put", "userIdentity": {"principalId": "EXAMPLE"}, "eventSource": "aws:s3"}]}
    END RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34
    REPORT RequestId: 7c14eea1-96e9-4b7d-ab54-ed1f50bd1a34 Duration: 1 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 14 MB
    
    "Success! Parsed Events"

    You can see the full output of your function in the logs that follow the invoke command. In this example, the Python function prints out the event payload and then exits.

With the AWS SAM CLI, you can pass in valid test payloads that interface with data in other AWS services. You can also have your Lambda function talk to other AWS resources that exist in your account, for example Amazon DynamoDB tables, Amazon S3 buckets, and so on. You could also test an API interface using the local start-api command, provided that you have configured your AWS SAM template with events of the API type. Follow the full instructions for setting up and configuring the AWS SAM CLI in Installing the AWS SAM CLI. Find the full syntax guide for AWS SAM templates in the AWS Serverless Application Model documentation.

Testing in the Lambda console

After you have deployed your functions following the start of the Update/Create phase, or with the Opt-In layer added, test your functions in the Lambda console.

  1. In the Lambda console, select the function to test.
  2. Select a test event and choose Test.
  3. If no test event exists, choose Configure test events.
    1. Choose Event template and select the relevant invocation service from which to test.
    2. Name the test event.
    3. Modify the event payload for your specific function.
    4. Choose Create and then return to step 2.

The results from the test are displayed.

Conclusion

With Lambda and Lambda@Edge, AWS has allowed developers to focus on just application code without the need to think about the work involved in managing the actual servers that run the code. We believe that the mechanisms provided and processes described in this post allow you to easily test and update your functions for this new execution environment.

Some of you may have questions about this process, and we are ready to help you. Please contact us through AWS Support, the AWS forums, or AWS account teams.

Creating an AWS Batch environment for mixed CPU and GPU genomics workflows

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/creating-an-aws-batch-environment-for-mixed-cpu-and-gpu-genomics-workflows/

This post is courtesy of Lee Pang – AWS Technical Business Development 

I recently worked with a customer who needed to process a bunch of raw sequence files (FastQs) into Hi-C format (*.hic), which is used for the structural analysis of DNA/chromatin loops and sequence accessibility. The tooling they were interested in using was the Juicer suite and they needed a minimal workflow:

  • Align the sample to the reference using the juicer CLI utility.
  • Annotate loops using the HiCCUPS algorithm from the juicer-tools library.

Because they had many files to process, they wanted to do this as scalably as possible. The juicer step of the workflow was CPU and memory-intensive, while the HiCCUPS step needed GPU acceleration. So, they were interested in using AWS Step Functions and AWS Batch.

Since its launch, AWS Batch has made it easy to create scalable compute environments for processing a mixture of CPU- and memory-intensive jobs. This covers the needs of the majority of genomics workflows. So, how do you create a genomics workflow environment using AWS Batch that also uses GPUs? Thankfully, AWS Batch recently announced support for GPU resources!

In this post, I show you how to use these new features to execute mixed CPU and GPU genomics workflows. By the end of this post, you will be able to build the architecture shown below.

Configuring AWS Batch for running CPU and GPU jobs

To handle a mixture of CPU and GPU jobs, the recommended strategy is to create multiple compute environments and job queues:

  • GPU-only resources
    • Compute environments (using the ECS GPU Optimized AMI)
    • Spot Instances of the P2 and P3 instance family
    • On-Demand Instances of the P2 and P3 instance family
    • Job queue for GPU compute environments
  • CPU-only resources
    • Compute environments (using the default ECS Optimized AMI)
    • Spot Instances of the “optimal” instance family
    • On-Demand Instances of the “optimal” instance family
    • Job queue for CPU compute environments

With the above in place, you then point CPU jobs to the “CPU” queue and GPU jobs to the “GPU” queue.

Notice that CPU and GPU resources are kept separate with queues and compute environments. I don’t recommend creating a compute environment or job queue that mixes CPU and GPU optimized instance types. In a mixed compute environment or queue, there is a chance that CPU jobs could be placed on GPU instances when no GPU jobs are scheduled. This could result in few (or no) GPU instances available when GPU jobs must be run.

Create a GPU compute environment

Amazon EC2 has a wide variety of instance types, including the P3 family of instances that enable GPU-accelerated computing. Previously, the best option for using GPUs in AWS Batch was to create a compute environment based on the publicly available Deep Learning AMI used by AI/ML services such as Amazon SageMaker, which supports running containers and comes with NVIDIA/CUDA drivers pre-installed.

Earlier this year, the Amazon ECS team announced the availability of the ECS GPU-optimized AMI. It’s essentially the same as the existing Amazon ECS-optimized Amazon Linux 2 AMI but with pre-installed capabilities to provide Docker containers with access to GPU acceleration. For more information about what’s included, see Amazon ECS-optimized AMIs.

The key point is that this is a much more lightweight solution, which makes creating GPU-specific AWS Batch compute environments much easier.

Creating an AWS Batch compute environment specifically for GPU jobs is similar to creating one for CPU jobs. The key difference is that you select only GPU instance families for the instance types.

For compute environments that use the P2, P3, and P3dn instance families, AWS Batch automatically associates the ECS GPU-optimized AMI. You don’t have to create a custom AMI for GPU jobs to run on these instances.

The G2 and G3 families use a different type of GPU and have different drivers. Compute environments that use G2 and G3 families need a custom AMI to take advantage of acceleration. Otherwise, they default to the ECS-optimized AMI.

To create a GPU-enabled compute environment with the AWS CLI, create a file called gpu-ce.json with the following contents:

{
    "computeEnvironmentName": "gpu",
    "type": "MANAGED",
    "state": "ENABLED",
    "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole",
    "computeResources": {
        "type": "EC2",
        "subnets": [
            "<subnet-id-1>",
            "<subnet-id-2>",
            "<subnet-id-3>"
        ],
        "tags": {
            "Name": "batch-gpu-worker"
        },
        "desiredvCpus": 0,
        "minvCpus": 0,
        "instanceTypes": [
            "p3",
            "p2"
        ],
        "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
        "maxvCpus": 256,
        "securityGroupIds": [
            "<security-group-1>",
            "<security-group-2>"
        ],
        "ec2KeyPair": "<keypair-name>"
    }
}

From the command line, run the following:

aws batch create-compute-environment --cli-input-json file://gpu-ce.json

Create a GPU job queue

When you have a GPU compute environment, you can associate it with a dedicated GPU job queue.

From the command line, run the following:

aws batch create-job-queue \
    --job-queue-name gpu \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order order=1,computeEnvironment=gpu

Specifying GPU resources in AWS Batch jobs

Creating job definitions in AWS Batch has not changed much. However, now there’s an additional field under Resource requirements that lets you specify how many GPUs the job should use.

The JSON for registering a job definition with GPU requirements looks like the following:

{
    "jobDefinitionName": "hiccups", 
    "type": "container", 
    "parameters": {
        "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003", 
        "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
    }, 
    "containerProperties": {
        "mountPoints": [], 
        "image": "<docker-image-repository>/juicer-tools:latest", 
        "environment": [], 
        "vcpus": 8, 
        "command": [
            "hiccups", 
            "Ref::InputHICS3Path", 
            "Ref::OutputS3Prefix"
        ], 

        /* BEGIN NEW STUFF (delete comment before use) */
        "resourceRequirements" : [
            {
                "type" : "GPU",
                "value" : "1"
            }
        ],
        /* END NEW STUFF (delete comment before use) */
        
        "volumes": [], 
        "memory": 60000, 
        "ulimits": []
    }
}
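
Once you remove the comment markers, you can register this definition with the AWS CLI (aws batch register-job-definition --cli-input-json file://hiccups.json) or with an SDK. The following is a minimal sketch using Python and boto3, assuming the JSON above is saved to a file named hiccups.json:

import json
import boto3

batch = boto3.client('batch')

# Load the job definition JSON (with the explanatory comments removed).
with open('hiccups.json') as f:
    job_definition = json.load(f)

response = batch.register_job_definition(**job_definition)
print(response['jobDefinitionArn'])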

Containerization considerations

To ensure that your containerized task can use GPU acceleration, use nvidia/cuda base images when building the container image for your job. For example, for a CentOS-based image with CUDA 9.2, your Dockerfile should have the following:

FROM nvidia/cuda:9.2-devel-centos7

This could be at the top if you are building the container entirely from scratch, or later if you are using a multi-stage build.

In this case, I already had a CentOS-based image for the juicer utility that I could recycle for juicer-tools. So my Dockerfile looked like the following:

FROM juicer AS base
FROM nvidia/cuda:9.2-devel-centos7

COPY --from=base /opt/juicer /opt/juicer

RUN yum install -y awscli
RUN yum install -y java-1.8.0-openjdk

WORKDIR /opt/juicer/scripts
COPY juicer-tools.aws.sh .

WORKDIR /opt/juicer/work
ENTRYPOINT ["/opt/juicer/scripts/juicer-tools.aws.sh"]

Running a mixed CPU / GPU workflow

To run a workflow that contains a mixture of CPU– and GPU-based jobs, you can use AWS Step Functions. This blog channel previously covered how Step Functions and AWS Batch can be combined to run genomics workflows on AWS. Also, Step Functions and AWS Batch are now more tightly integrated, making scalable workflow solutions much easier to build.

With all the AWS Batch resources in place, it is a matter of pointing each task at the job queue with the right compute resources.

The solution I put together for the customer with the Juicer suite resulted in the following state machine:

{
    "Comment": "State machine for FASTQ to HIC with annotation using juicer and hiccups",
    "StartAt": "JuicerTask",
    "States": {
        "JuicerTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.juicer.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer:2",
                "JobName": "juicer",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/cpu",
                "Parameters.$": "$.juicer.parameters"
            },
            "Next": "HiccupsTask"
        },
        "HiccupsTask": {
            "Type": "Task",
            "InputPath": "$",
            "ResultPath": "$.hiccups.status",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobDefinition": "arn:aws:batch:<region>:<account_number>:job-definition/juicer-tools:2",
                "JobName": "hiccups",
                "JobQueue": "arn:aws:batch:<region>:<account_number>:job-queue/gpu",
                "Parameters.$": "$.hiccups.parameters"
            },
            "End": true
        }
    }
}

JuicerTask is submitted to the cpu job queue, while HiccupsTask is submitted to the gpu job queue.
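
To kick off the workflow, you start an execution of the state machine with an input document that carries the parameters each task expects. The following is a minimal sketch using Python and boto3; the state machine ARN, bucket name, and parameter values are placeholders based on the job definition shown earlier, not values from the customer's deployment:

import json
import boto3

sfn = boto3.client('stepfunctions')

# The input shape matches the Parameters.$ paths referenced in the state machine.
execution_input = {
    "juicer": {
        "parameters": {
            # placeholder values for the juicer job definition parameters
        }
    },
    "hiccups": {
        "parameters": {
            "OutputS3Prefix": "s3://<bucket-name>/juicer/HIC003",
            "InputHICS3Path": "s3://<bucket-name>/juicer/HIC003/aligned/inter.hic"
        }
    }
}

response = sfn.start_execution(
    stateMachineArn='arn:aws:states:<region>:<account_number>:stateMachine:juicer-hiccups',  # placeholder
    input=json.dumps(execution_input)
)
print(response['executionArn'])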

Conclusion

With the ECS GPU-optimized AMI and AWS Batch support for GPU resources, you can easily build a scalable solution for running genomics workflows with both CPU and GPU resources.

Build on!

From Poll to Push: Transform APIs using Amazon API Gateway REST APIs and WebSockets

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/from-poll-to-push-transform-apis-using-amazon-api-gateway-rest-apis-and-websockets/

This post is courtesy of Adam Westrich – AWS Principal Solutions Architect and Ronan Prenty – Cloud Support Engineer

Want to deploy a web application and give a large number of users controlled access to data analytics? Or maybe you have a retail site that is fulfilling purchase orders, or an app that enables users to connect data from potentially long-running third-party data sources. Similar use cases exist across every imaginable industry and entail sending long-running requests to perform a subsequent action. For example, a healthcare company might build a custom web portal for physicians to connect and retrieve key metrics from patient visits.

This post is aimed at optimizing the delivery of information without needing to poll an endpoint. First, we outline the current challenges with the consumer polling pattern and alternative approaches to solve these information delivery challenges. We then show you how to build and deploy a solution in your own AWS environment.

Here is a glimpse of the sample app that you can deploy in your own AWS environment:

What’s the problem with polling?

Many customers need to implement the delivery of long-running activities (such as a query to a data warehouse or data lake, or retail order fulfillment). They may have developed a polling solution that looks similar to the following:

  1. POST sends a request.
  2. GET returns an empty response.
  3. Another… GET returns an empty response.
  4. Yet another… GET returns an empty response.
  5. Finally, GET returns the data for which you were looking.

The challenges of traditional polling methods

  • Unnecessary chattiness and cost due to polling for result sets—Every time your frontend polls an API, it’s adding costs by leveraging infrastructure to compute the result. Empty polls are just wasteful!
  • Hampered mobile battery life—One of the top contributors of apps that eat away at battery life is excessive polling. Make sure that your app isn’t on your users’ Top App Battery Usage list-of-shame that could result in deletion.
  • Delayed data arrival due to polling schedule—Some approaches to polling include an incremental backoff to limit the number of empty polls. This sometimes results in a delay between data readiness and data arrival.

But what about long polling?

  • User request deadlocks can hamper application performance—Long synchronous user responses can lead to unexpected user wait times or UI deadlocks, which can affect mobile devices especially.
  • Memory leaks and consumption could bring your app down—Keeping long-running tasks or queries open may overburden your backend and create failure scenarios, which may bring down your app.
  • HTTP default timeouts across browsers may result in inconsistent client experience—These timeouts vary across browsers, and can lead to an inconsistent experience across your end users. Depending on the size and complexity of the requests, processing can last longer than many of these timeouts and take minutes to return results.

Instead, create an event-driven architecture and move your APIs from poll to push.

Asynchronous push model

To create optimal UX experiences, frontend developers often strive to create progressive and reactive user experiences. Users can interact with frontend objects (for example, push buttons) with little lag when sending requests and receiving data. But frontend developers also want users to receive timely data, without sacrificing additional user actions or performing unnecessary processing.

The birth of microservices and the cloud over the past several years has enabled frontend developers and backend service designers to think about these problems in an asynchronous manner. This enables the virtually unlimited resources of the cloud to choreograph data processing. It also enables clients to benefit from progressive and reactive user experiences.

This is a fresh alternative to the synchronous design pattern, which often relies on client consumers to act as the conductor for all user requests. The following diagram compares the flow of communication between patterns.

Comparison of communication patterns

Asynchronous orchestration can be easily choreographed using the workflow definition, monitoring, and tracking console with AWS Step Functions state machines. Break up services into functions with AWS Lambda and track executions within a state diagram, like the following:

With this approach, consumers send a request, which initiates downstream distributed processing. The subsequent steps are invoked according to the state machine workflow and each execution can be monitored.

Okay, but how does the consumer get the result of what they’re looking for?

Multiple approaches to this problem

There are different approaches for consumers to retrieve their resulting data. In order to create the optimal solution, there are several questions that a service owner may need to ask.

Is there known trust with the client (such as another microservice)?

If the answer is Yes, one approach is to use Amazon SNS. That way, consumers can subscribe to topics and have data delivered using email, SMS, or even HTTP/HTTPS as an event subscriber (that is, webhooks).

With webhooks, the consumer creates an endpoint where the service provider can issue a callback using a small amount of consumer-side resources. The consumer is waiting for an incoming request to facilitate downstream processing.

  1. Send a request.
  2. Subscribe the consumer to the topic.
  3. Open the endpoint.
  4. SNS sends the POST request to the endpoint.
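
The SNS side of steps 2 and 4 above can be wired up with a few SDK calls. Here is a minimal sketch using Python and boto3, assuming a topic already exists and the consumer exposes an HTTPS endpoint; the ARN and URL below are placeholders:

import boto3

sns = boto3.client('sns')

# Step 2: subscribe the consumer's webhook endpoint to the topic.
# The endpoint must confirm the subscription before it receives messages.
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:long-running-results',  # placeholder topic ARN
    Protocol='https',
    Endpoint='https://consumer.example.com/webhook'                      # placeholder endpoint
)

# Step 4: when the long-running work finishes, publish the result;
# SNS POSTs it to the subscribed endpoint.
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:long-running-results',
    Message='{"status": "complete", "resultLocation": "s3://bucket/key"}'
)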

If the trust answer is No, then clients may not be able to open a lightweight HTTP webhook after sending the initial request. In that case, consider an alternative framework.

Are you restricted to only using a REST protocol?

Some applications can only run HTTP requests. This can stem from the age of the technology or browser used by end users, or from security requirements that block other protocols.

In these scenarios, customers may use a GET method as a best practice, but we still advise avoiding polling. Some design questions to ask: Does data readiness happen at a predefined time or duration interval? Or can the user experience tolerate time between data readiness and data arrival?

If the answer to both of these is Yes, then consider sending a single GET call to your RESTful API at the expected completion time. For example, if a job averages 10 minutes, make your GET call 10 minutes after the submission. Sounds simple, right? Much simpler than polling.

Should I use GraphQL, a WebSocket API, or another framework?

Each framework has tradeoffs.

If you want a more flexible query schema, you may gravitate to GraphQL, which follows a “data-driven UI” approach. If data drives the UI, then GraphQL may be the best solution for your use case.

AWS AppSync is a serverless GraphQL engine that supports the heavy lifting of these constructs. It offers functionality such as AWS service integration, offline data synchronization, and conflict resolution. GraphQL also includes a construct called Subscriptions that lets clients subscribe to event-based notifications when data changes.

Amazon API Gateway makes it easy for developers to deploy secure APIs at scale. With the recent introduction of API Gateway WebSocket APIs, web, mobile clients, and backend services can communicate over an established WebSocket connection. This also allows clients to be more reactive to data updates and only do work one time after an update has been received over the WebSocket connection.

The typical frontend design approach is to create a UI component that is updated when the results of the given procedure are complete. This is preferable to a complete webpage refresh and improves the customer's user experience.

Because many companies have elected to use the REST framework for creating API-driven, tightly bound service contracts, a RESTful interface can be used to receive and validate the request. It can also provide a status endpoint used to deliver status, and it offers additional flexibility in delivering the result to a variety of clients, alongside the WebSocket API.

Poll-to-push solution with API Gateway

Imagine a scenario where you want to be updated as soon as the data is created. Instead of the traditional polling methods described earlier, use an API Gateway WebSocket API. That pushes new data to the client as it’s created, so that it can be rendered on the client UI.

Alternatively, a WebSocket server can be deployed on Amazon EC2. With this approach, your server is always running to accept and maintain new connections. In addition, you manage the scaling of the instance manually at times of high demand.

By using an API Gateway WebSocket API in front of Lambda, you don’t need a machine to stay always on, eating away your project budget. API Gateway handles connections and invokes Lambda whenever there’s a new event. Scaling is handled on the service side. To update our connected clients from the backend, we can use the API Gateway callback URL. The AWS SDKs make communication from the backend easy. As an example, see the Boto3 sample of post_to_connection:

import boto3
# Use a Lambda layer or deployment package to include a boto3 version
# recent enough to support the apigatewaymanagementapi client.
...
# endpoint_url is the API Gateway WebSocket callback URL for your stage.
apiManagement = boto3.client('apigatewaymanagementapi', region_name={{api_region}},
                      endpoint_url={{api_url}})
...
# Push the message to the client over its open WebSocket connection.
response = apiManagement.post_to_connection(Data={{message}}, ConnectionId={{connectionId}})
...

Solution example

To create this more optimized solution, create a simple web application that enables a user to make a request against a large dataset. It returns the results in a flat file over WebSocket. In this case, we’re querying, via Amazon Athena, a data lake on S3 populated with AWS Twitter sentiment (Twitter: @awscloud or #awsreinvent). However, this solution could apply to any data store, data mart, or data warehouse environment, or the long-running return of data for a response.

For the frontend architecture, create this web application using a JavaScript framework (such as jQuery). Use API Gateway to accept a REST API request for the data and then open a WebSocket connection on the client to facilitate the return of results:

Poll-to-push solution architecture

  1. The client sends a REST request to API Gateway. This invokes a Lambda function that starts the Step Functions state machine execution. The function also returns a task token for the open connection activity worker. The function then returns the execution ARN and task token to the client.
  2. Using the data returned by the REST request, the client connects to the WebSocket API and sends the task token to the WebSocket connection.
  3. The WebSocket notifies the Step Functions state machine that the client is connected. The Lambda function completes the OpenConnection task by validating the task token and sending a success message.
  4. After RunAthenaQuery and OpenConnection are successful, the state machine updates the connected client over the WebSocket API that their long-running job is complete. It uses the REST API call post_to_connection.
  5. The client receives the update over their WebSocket connection using the IssueCallback Lambda function, with the callback URL from the API Gateway WebSocket API.

In this example application, the data response is an S3 presigned URL composed of results from Athena. You can then configure your frontend to use that S3 link to download to the client.
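
A presigned URL like this can be generated in the backend Lambda function with a single SDK call. The following is a minimal sketch using Python and boto3; the bucket, key, and expiry are assumptions for illustration rather than values from the sample application:

import boto3

s3 = boto3.client('s3')

# Generate a time-limited download link for the Athena result file.
presigned_url = s3.generate_presigned_url(
    'get_object',
    Params={
        'Bucket': 'my-athena-results-bucket',   # placeholder bucket
        'Key': 'query-results/result.csv'       # placeholder key
    },
    ExpiresIn=3600  # link is valid for one hour
)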

Why not just open the WebSocket API for the request?

While this approach can work, we advise against it for this use case. Break the required interfaces for this use case into three processes:

  1. Submit a request.
  2. Get the status of the request (to detect failure).
  3. Deliver the results.

For the submit request interface, RESTful APIs have strong controls to ensure that user requests for data are validated and given guardrails for the intended query purpose. This helps prevent rogue requests and unforeseen impacts to the analytics environment, especially when exposed to a large number of users.

With this example solution, you’re restricting the data requests to specific countries in the frontend JavaScript. Using the RESTful POST method for API requests enables you to validate data as query string parameters, such as the following:

https://<apidomain>.amazonaws.com/Demo/CreateStateMachineAndToken?Country=France

API Gateway models and mapping templates can also be used to validate or transform request payloads at the API layer before they hit the backend. Use models to ensure that the structure or contents of the client payload are the same as you expect. Use mapping templates to transform the payload sent by clients to another format, to be processed by the backend.
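
As an illustration of the model side, here is a minimal sketch that registers a simple JSON Schema model with boto3; the API ID, model name, and schema are assumptions for this example rather than part of the sample application:

import json
import boto3

apigateway = boto3.client('apigateway')

# A model that requires a Country string in the request payload.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "CountryRequest",
    "type": "object",
    "properties": {
        "Country": {"type": "string"}
    },
    "required": ["Country"]
}

apigateway.create_model(
    restApiId='a1b2c3d4e5',          # placeholder REST API ID
    name='CountryRequest',
    contentType='application/json',
    schema=json.dumps(schema)
)

To enforce the model, you would also attach it to the method request and enable a request validator, which is a separate configuration step.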

This REST validation framework can also be used to detect header information about WebSocket browser compatibility. While using WebSocket has many advantages, not all browsers support it, especially older browsers. Therefore, a REST API request layer can pass this browser metadata and determine whether a WebSocket API can be opened.

Because a REST interface is already created for submitting the request, you can easily add another GET method if the client must query the status of the Step Functions state machine. That might be the case if a health check in the request is taking longer than expected. You can also add another GET method as an alternative access method for REST-only compatible clients.

If low-latency request and retrieval are an important characteristic of your API and there aren’t any browser-compatibility risks, use a WebSocket API with JSON model selection expressions to protect your backend with a schema.

In the spirit of picking the best tool for the job, use a REST API for the request layer and a WebSocket API to listen for the result.

This solution, although secure, is an example for how a non-polling solution can work on AWS. At scale, it may require refactoring due to the cross-talk at high concurrency that may result in client resubmissions.

To discover, deploy, and extend this solution into your own AWS environment, follow the PollToPush instructions in the AWS Serverless Application Repository.

Conclusion

When application consumers poll for long-running tasks, it can be a wasteful, detrimental, and costly use of resources. This post outlined multiple ways to refactor the polling method. Use API Gateway to host a RESTful interface, Step Functions to orchestrate your workflow, Lambda to perform backend processing, and an API Gateway WebSocket API to push results to your clients.

Deploying a personalized API Gateway serverless developer portal

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/deploying-a-personalized-api-gateway-serverless-developer-portal/

This post is courtesy of Drew Dresser, Application Architect – AWS Professional Services

Amazon API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. Customers of these APIs often want a website to learn and discover APIs that are available to them. These customers might include front-end developers, third-party customers, or internal system engineers. To produce such a website, we have created the API Gateway serverless developer portal.

The API Gateway serverless developer portal (developer portal or portal, for short) is an application that you use to make your API Gateway APIs available to your customers by enabling self-service discovery of those APIs. Your customers can use the developer portal to browse API documentation, register for, and immediately receive their own API key that they can use to build applications, test published APIs, and monitor their own API usage.

Over the past few months, the team has been hard at work contributing to the open source project, available on GitHub. The developer portal was relaunched on October 29, 2018, and the team will continue to push features and take customer feedback from the open source community. Today, we’re happy to highlight some key functionality of the new portal.

Benefits for API publishers

API publishers use the developer portal to expose the APIs that they manage. As an API publisher, you need to set up, maintain, and enable the developer portal. The new portal has the following benefits:

Benefits for API consumers

API Consumers use the developer portal as a traditional application user. An API consumer needs to understand the APIs being published. API consumers might be front-end developers, distributed system engineers, or third-party customers. The new developer portal comes with the following benefits for API consumers:

  • Explore – API consumers can quickly page through lists of APIs. When they find one they’re interested in, they can immediately see documentation on that API.
  • Learn – API consumers might need to drill down deeper into an API to learn its details. They want to learn how to form requests and what they can expect as a response.
  • Test – Through the developer portal, API consumers can get an API key and invoke the APIs directly. This enables developers to develop faster and with more confidence.

Architecture

The developer portal is a completely serverless application. It leverages Amazon API Gateway, Amazon Cognito User Pools, AWS Lambda, Amazon DynamoDB, and Amazon S3. Serverless architectures enable you to build and run applications without needing to provision, scale, and manage any servers. The developer portal is broken down in to multiple microservices, each with a distinct responsibility, as shown in the following image.

Identity management for the developer portal is performed by Amazon Cognito and a Lambda function in the Login & Registration microservice. An Amazon Cognito User Pool is configured out of the box to enable users to register and login. Additionally, you can deploy the developer portal to use a UI hosted by Amazon Cognito, which you can customize to match your style and branding.

Requests are routed to static content served from Amazon S3 and built using React. The React app communicates to the Lambda backend via API Gateway. The Lambda function is built using the aws-serverless-express library and contains the business logic behind the APIs. The business logic of the web application queries and adds data to the API Key Creation and Catalog Update microservices.

To maintain the API catalog, the Catalog Update microservice uses an S3 bucket and a Lambda function. When an API’s Swagger file is added or removed from the bucket, the Lambda function triggers and maintains the API catalog by updating the catalog.json file in the root of the S3 bucket.

To manage the mapping between API keys and customers, the application uses the API Key Creation microservice. The service updates API Gateway with API key creations or deletions and then stores the results in a DynamoDB table that maps customers to API keys.

Deploying the developer portal

You can deploy the developer portal using AWS SAM, the AWS SAM CLI, or the AWS Serverless Application Repository. To deploy with AWS SAM, you can simply clone the repository and then deploy the application using two commands from your CLI. For detailed instructions for getting started with the portal, see Use the Serverless Developer Portal to Catalog Your API Gateway APIs in the Amazon API Gateway Developer Guide.

Alternatively, you can deploy using the AWS Serverless Application Repository as follows:

  1. Navigate to the api-gateway-dev-portal application and choose Deploy in the top right.
  2. On the Review page, for ArtifactsS3BucketName and DevPortalSiteS3BucketName, enter globally unique names. Both buckets are created for you.
  3. To deploy the application with these settings, choose Deploy.
  4. After the stack is complete, get the developer portal URL by choosing View CloudFormation Stack. Under Outputs, choose the URL in Value.

The URL opens in your browser.

You now have your own serverless developer portal application that is deployed and ready to use.

Publishing a new API

With the developer portal application deployed, you can publish your own API to the portal.

To get started:

  1. Create the PetStore API, which is available as a sample API in Amazon API Gateway. The API must be created and deployed and include a stage.
  2. Create a Usage Plan, which is required so that API consumers can create API keys in the developer portal. The API key is used to test actual API calls.
  3. On the API Gateway console, navigate to the Stages section of your API.
  4. Choose Export.
  5. For Export as Swagger + API Gateway Extensions, choose JSON. Save the file with the following format: apiId_stageName.json.
  6. Upload the file to the S3 bucket dedicated for artifacts in the catalog path. In this post, the bucket is named apigw-dev-portal-artifacts. To perform the upload, run the following command.
    aws s3 cp apiId_stageName.json s3://yourBucketName/catalog/apiId_stageName.json

Uploading the file to the artifacts bucket with a catalog/ key prefix automatically makes it appear in the developer portal.

This might be familiar. It’s your PetStore API documentation displayed in the OpenAPI format.

With an API deployed, you’re ready to customize the portal’s look and feel.

Customizing the developer portal

Adding a customer’s own look and feel to the developer portal is easy, and it creates a great user experience. You can customize the domain name, text, logo, and styling. For a more thorough walkthrough of customizable components, see Customization in the GitHub project.

Let’s walk through a few customizations to make your developer portal more familiar to your API consumers.

Customizing the logo and images

To customize logos, images, or content, you need to modify the contents of the your-prefix-portal-static-assets S3 bucket. You can edit files using the CLI or the AWS Management Console.

Start customizing the portal by using the console to upload a new logo in the navigation bar.

  1. Upload the new logo to your bucket with a key named custom-content/nav-logo.png.
    aws s3 cp {myLogo}.png s3://yourPrefix-portal-static-assets/custom-content/nav-logo.png
  2. Modify object permissions so that the file is readable by everyone because it’s a publicly available image. The new navigation bar looks something like this:

Another neat customization that you can make is to a particular API and stage image. Maybe you want your PetStore API to have a dog picture to represent the friendliness of the API. To add an image:

  1. Use the command line to copy the image directly to the S3 bucket location.
    aws s3 cp my-image.png s3://yourPrefix-portal-static-assets/custom-content/api-logos/apiId-stageName.png
  2. Modify object permissions so that the file is readable by everyone.

Customizing the text

Next, make sure that the text of the developer portal welcomes your pet-friendly customer base. The content files in the static assets bucket under /custom-content/content-fragments/ determine the portal’s text content.

To edit the text:

  1. On the AWS Management Console, navigate to the website content S3 bucket and then navigate to /custom-content/content-fragments/.
  2. Home.md is the content displayed on the home page, APIs.md controls the tab text on the navigation bar, and GettingStarted.md contains the content of the Getting Started tab. All three files are written in markdown. Download one of them to your local machine so that you can edit the contents. The following image shows Home.md edited to contain custom text:
  3. After editing and saving the file, upload it back to S3, which results in a customized home page. The following image reflects the configuration changes in Home.md from the previous step:

Customizing the domain name

Finally, many customers want to give the portal a domain name that they own and control.

To customize the domain name:

  1. Use AWS Certificate Manager to request and verify a managed certificate for your custom domain name. For more information, see Request a Public Certificate in the AWS Certificate Manager User Guide.
  2. Copy the Amazon Resource Name (ARN) so that you can pass it to the developer portal deployment process. That process now includes the certificate ARN and a property named UseRoute53Nameservers. If the property is set to true, the template creates a hosted zone and record set in Amazon Route 53 for you. If the property is set to false, the template expects you to use your own name server hosting.
  3. If you deployed using the AWS Serverless Application Repository, navigate to the Application page and deploy the application along with the certificate ARN.

After the developer portal is deployed and your CNAME record has been added, the website is accessible from the custom domain name as well as the new Amazon CloudFront URL.

Customizing the logo, text content, and domain name are great tools to make the developer portal feel like an internally developed application. In this walkthrough, you completely changed the portal’s appearance to enable developers and API consumers to discover and browse APIs.

Conclusion

The developer portal is available to use right away. Support and feature enhancements are tracked in the public GitHub repository. You can contribute to the project by following the Code of Conduct and Contributing guides. The project is open source and operates under the Amazon Open Source Code of Conduct. We plan to continue to add functionality and listen to customer feedback. We can’t wait to see what customers build with API Gateway and the API Gateway serverless developer portal.

ICYMI: Serverless Q4 2018

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/icymi-serverless-q4-2018/

This post is courtesy of Eric Johnson, Senior Developer Advocate – AWS Serverless

Welcome to the fourth edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all of the most recent product launches, feature enhancements, blog posts, webinars, Twitch live streams, and other interesting things that you might have missed!

This edition of ICYMI includes all announcements from AWS re:Invent 2018!

If you didn’t see them, check our Q1 ICYMI, Q2 ICYMI, and Q3 ICYMI posts for what happened then.

So, what might you have missed this past quarter? Here’s the recap.

New features

AWS Lambda introduced the Lambda runtime API and Lambda layers, which enable developers to bring their own runtime and share common code across Lambda functions. With the release of the runtime API, we can now support runtimes from AWS partners such as the PHP runtime from Stackery and the Erlang and Elixir runtimes from Alert Logic. Using layers, partners such as Datadog and Twistlock have also simplified the process of using their Lambda code libraries.

To meet the demand of larger Lambda payloads, Lambda doubled the payload size of asynchronous calls to 256 KB.

In early October, Lambda also raised the maximum execution duration, enabling Lambda functions to run for up to 15 minutes.

Lambda also rolled out native support for Ruby 2.5 and Python 3.7.

You can now process Amazon Kinesis streams up to 68% faster with AWS Lambda support for Kinesis Data Streams enhanced fan-out and HTTP/2 for faster streaming.

Lambda also released a new Application view in the console. It’s a high-level view of all of the resources in your application. It also gives you a quick view of deployment status with the ability to view service metrics and custom dashboards.

Application Load Balancers added support for targeting Lambda functions. ALBs can provide a simple HTTP/S front end to Lambda functions. ALB features such as host- and path-based routing are supported to allow flexibility in triggering Lambda functions.
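As a rough sketch, a Lambda function targeted by an ALB returns an HTTP-style response object, similar in spirit to an API Gateway proxy integration; the handler below is a hypothetical example of that response shape.

// Hypothetical handler for an ALB target group that points at a Lambda function
exports.handler = async (event) => {
  // event.path, event.httpMethod, and event.queryStringParameters carry the routed request details
  return {
    statusCode: 200,
    statusDescription: "200 OK",
    isBase64Encoded: false,
    headers: { "Content-Type": "text/html" },
    body: "<h1>Hello from Lambda behind an ALB</h1>"
  };
};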

Amazon API Gateway added support for AWS WAF. You can use AWS WAF for your Amazon API Gateway APIs to protect from attacks such as SQL injection and cross-site scripting (XSS).

API Gateway has also improved parameter support by adding support for multi-value parameters. You can now pass multiple values for the same key in the header and query string when calling the API. Returning multiple headers with the same name in the API response is also supported. For example, you can send multiple Set-Cookie headers.
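For example, a Lambda proxy integration can now return several Set-Cookie headers through the multiValueHeaders field of its response; the snippet below is a small sketch of that idea with made-up cookie values.

// Sketch: returning two Set-Cookie headers from a Lambda proxy integration
exports.handler = async (event) => {
  // Multi-value query string parameters arrive under event.multiValueQueryStringParameters
  return {
    statusCode: 200,
    multiValueHeaders: {
      "Set-Cookie": ["theme=dark; Path=/", "session=abc123; HttpOnly"]
    },
    body: JSON.stringify({ message: "cookies set" })
  };
};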

In October, API Gateway relaunched the Serverless Developer Portal. It provides a catalog of published APIs and associated documentation that enable self-service discovery and onboarding. You can customize it for branding through either custom domain names or logo/styling updates. In November, we made it easier to launch the developer portal from the Serverless Application Repository.

In the continuous effort to decrease customer costs, API Gateway introduced tiered pricing. The tiered pricing model lowers the cost of API Gateway at scale, with an API Requests price as low as $1.51 per million requests at the highest tier.

Last but definitely not least, API Gateway released support for WebSocket APIs in mid-December as a final holiday gift. With this new feature, developers can build bidirectional communication applications without having to provision and manage any servers. This has been a long-awaited and highly anticipated announcement for the serverless community.

AWS Step Functions added eight new service integrations. With this release, the steps of your workflow can exist on Amazon ECS, AWS Fargate, Amazon DynamoDB, Amazon SNS, Amazon SQS, AWS Batch, AWS Glue, and Amazon SageMaker. This is in addition to the services that Step Functions already supports: AWS Lambda and Amazon EC2.
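As one example of what these integrations look like, an Amazon States Language task state can publish to Amazon SNS directly, without an intermediary Lambda function; the topic ARN and input path below are placeholders.

"NotifyCompletion": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sns:publish",
  "Parameters": {
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:my-topic",
    "Message.$": "$.result"
  },
  "End": true
}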

Step Functions also expanded its regional availability to the EU (Paris) and South America (São Paulo) Regions.

The AWS Serverless Application Repository increased its functionality by supporting more resources in the repository. The Serverless Application Repository now supports Application Auto Scaling, Amazon Athena, AWS AppSync, AWS Certificate Manager, Amazon CloudFront, AWS CodeBuild, AWS CodePipeline, AWS Glue, AWS Identity and Access Management, Amazon SNS, Amazon SQS, AWS Systems Manager, and AWS Step Functions.

The Serverless Application Repository also released support for nested applications. Nested applications enable you to build highly sophisticated serverless architectures by reusing services that are independently authored and maintained but easily composed using AWS SAM and the Serverless Application Repository.

AWS SAM made authorization simpler by introducing SAM support for authorizers. Enabling authorization for your APIs is as simple as defining an Amazon Cognito user pool or an API Gateway Lambda authorizer as a property of your API in your SAM template.

AWS SAM CLI introduced two new commands. First, you can now build locally with the sam build command. This functionality allows you to compile deployment artifacts for Lambda functions written in Python. Second, the sam publish command allows you to publish your SAM application to the Serverless Application Repository.
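A typical pair of invocations looks roughly like the following; the template file name and Region are placeholders.

# Build deployment artifacts locally (for example, resolving a Python function's dependencies)
sam build --template template.yaml

# Publish the packaged application to the Serverless Application Repository
sam publish --template packaged.yaml --region us-east-1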

Our SAM tooling team also released the AWS Toolkit for PyCharm, which provides an integrated experience for developing serverless applications in Python.

The AWS Toolkits for Visual Studio Code (Developer Preview) and IntelliJ (Developer Preview) are still in active development and will include similar features when they become generally available.

AWS SAM and the AWS SAM CLI implemented support for Lambda layers. Using a SAM template, you can manage your layers, and using the AWS SAM CLI, you can develop and debug Lambda functions that are dependent on layers.

Amazon DynamoDB added support for transactions, allowing developers to enforce all-or-nothing operations. In addition to transaction support, Amazon DynamoDB Accelerator also added support for DynamoDB transactions.
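For instance, with the low-level DynamoDB client in the AWS SDK for JavaScript, a transaction groups several writes that either all commit or all fail; the table and attribute names below are made up for illustration.

// Sketch: an all-or-nothing write across two hypothetical tables
var AWS = require("aws-sdk");
var ddb = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

var params = {
  TransactItems: [
    {
      Put: {
        TableName: "Orders",
        Item: { orderId: { S: "1001" }, orderStatus: { S: "PLACED" } }
      }
    },
    {
      Update: {
        TableName: "Inventory",
        Key: { itemId: { S: "widget-42" } },
        UpdateExpression: "SET #count = #count - :one",
        ConditionExpression: "#count >= :one",
        ExpressionAttributeNames: { "#count": "stockCount" },
        ExpressionAttributeValues: { ":one": { N: "1" } }
      }
    }
  ]
};

// Both operations commit together, or neither does
ddb.transactWriteItems(params, function (err, data) {
  if (err) console.error("Transaction failed:", err);
  else console.log("Transaction committed");
});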

Amazon DynamoDB also announced Amazon DynamoDB on-demand, a flexible new billing option for DynamoDB capable of serving thousands of requests per second without capacity planning. DynamoDB on-demand offers simple pay-per-request pricing for read and write requests so that you only pay for what you use, making it easy to balance costs and performance.

AWS Amplify released the Amplify Console, which is a continuous deployment and hosting service for modern web applications with serverless backends. Modern web applications include single-page app frameworks such as React, Angular, and Vue and static-site generators such as Jekyll, Hugo, and Gatsby.

Amazon SQS announced support for Amazon VPC Endpoints using PrivateLink.

Serverless blogs

October

November

December

Tech talks

We hold several Serverless tech talks throughout the year, so look out for them in the Serverless section of the AWS Online Tech Talks page. Here are the three tech talks that we delivered in Q4:

Twitch

We’ve been so busy livestreaming on Twitch that you’re most certainly missing out if you aren’t following along!

For information about upcoming broadcasts and recent livestreams, keep an eye on AWS on Twitch for more Serverless videos and on the Join us on Twitch AWS page.

New Home for SAM Docs

This quarter, we moved all SAM docs to https://docs.aws.amazon.com/serverless-application-model. Everything you need to know about SAM is there. If you don’t find what you’re looking for, let us know!

In other news

The schedule is out for 2019 AWS Global Summits in cities around the world. AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Held in major cities worldwide, they attract technologists from all industries and skill levels who want to discover how AWS can help them innovate quickly and deliver flexible, reliable solutions at scale. Get notified when to register and learn more at the AWS Global Summits website.

Still looking for more?

The Serverless landing page has lots of information. The resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials. Check it out!

Announcing WebSocket APIs in Amazon API Gateway

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/announcing-websocket-apis-in-amazon-api-gateway/

This post is courtesy of Diego Magalhaes, AWS Senior Solutions Architect – World Wide Public Sector-Canada & JT Thompson, AWS Principal Software Development Engineer – Amazon API Gateway

Starting today, you can build bidirectional communication applications using WebSocket APIs in Amazon API Gateway without having to provision and manage any servers.

HTTP-based APIs use a request/response model with a client sending a request to a service and the service responding synchronously back to the client. WebSocket-based APIs are bidirectional in nature. This means that a client can send messages to a service and a service can independently send messages to its clients.

This bidirectional behavior allows for richer types of client/service interactions because services can push data to clients without a client needing to make an explicit request. WebSocket APIs are often used in real-time applications such as chat applications, collaboration platforms, multiplayer games, and financial trading platforms.

This post explains how to build a serverless, real-time chat application using WebSocket API and API Gateway.

Overview

Historically, building WebSocket APIs required setting up fleets of hosts that were responsible for managing the persistent connections that underlie the WebSocket protocol. Now, with API Gateway, this is no longer necessary. API Gateway handles the connections between the client and service. It lets you build your business logic using HTTP-based backends such as AWS Lambda, Amazon Kinesis, or any other HTTP endpoint.

Before you begin, here are a couple of the concepts of a WebSocket API in API Gateway. The first is a new resource type called a route. A route describes how API Gateway should handle a particular type of client request, and includes a routeKey parameter, a value that you provide to identify the route.

A WebSocket API is composed of one or more routes. To determine which route a particular inbound request should use, you provide a route selection expression. The expression is evaluated against an inbound request to produce a value that corresponds to one of your route’s routeKey values.

There are three special routeKey values that API Gateway allows you to use for a route:

  • $default—Used when the route selection expression produces a value that does not match any of the other route keys in your API routes. This can be used, for example, to implement a generic error handling mechanism.
  • $connect—The associated route is used when a client first connects to your WebSocket API.
  • $disconnect—The associated route is used when a client disconnects from your API. This call is made on a best-effort basis.

You are not required to provide routes for any of these special routeKey values.

Build a serverless, real-time chat application

To understand how you can use the new WebSocket API feature of API Gateway, I show you how to build a real-time chat app. To simplify things, assume that there is only a single chat room in this app. Here are the capabilities that the app includes:

  • Clients join the chat room as they connect to the WebSocket API.
  • The backend can send messages to specific users via a callback URL that is provided after the user is connected to the WebSocket API.
  • Users can send messages to the room.
  • Disconnected clients are removed from the chat room.

Here’s an overview of the real-time chat application:

A serverless real-time chat application using WebSocket API on Amazon API Gateway

The application is composed of the WebSocket API in API Gateway that handles the connectivity between the client and servers (1). Two AWS Lambda functions react when clients connect (2) or disconnect (5) from the API. The sendMessage function (3) is invoked when the clients send messages to the server. The server sends the message to all connected clients (4) using the new API Gateway Management API. To track each of the connected clients, use a DynamoDB table to persist the connection identifiers.

To help you see this in action more quickly, I’ve provided an AWS SAM application that includes the Lambda functions, DynamoDB table definition, and IAM roles. I placed it into the AWS Serverless Application Repository. For more information, see the simple-websockets-chat-app repo on GitHub.

Create a new WebSocket API

  1. In the API Gateway console, choose Create API, New API.
  2. Under Choose the protocol, choose WebSocket.
  3. For API name, enter My Chat API.
  4. For Route Selection Expression, enter $request.body.action.
  5. Enter a description if you’d like and click Create API.

The attribute described by the Route Selection Expression value should be present in every message that a client sends to the API. Here’s one example of a request message:

{
    "action": "sendmessage",
    "data": "Hello, I am using WebSocket APIs in API Gateway."
}

To route messages based on the message content, the WebSocket API expects JSON-formatted messages with a property serving as a routing key.

Manage routes

Now, configure your new API to respond to incoming requests. For this app, create three routes as described earlier. First, configure the sendmessage route with a Lambda integration.

  1. In the API Gateway console, under My Chat API, choose Routes.
  2. For New Route Key, enter sendmessage and confirm it.

Each route includes route-specific information such as its model schemas, as well as a reference to its target integration. With the sendmessage route created, use a Lambda function as an integration that is responsible for sending messages.

At this point, you’ve either deployed the AWS Serverless Application Repository backend or deployed it yourself using AWS SAM. In the Lambda console, look at the sendMessage function. This function receives the data sent by one of the clients, looks up all the currently connected clients, and sends the provided data to each of them. Here’s a snippet from the Lambda function:

DDB.scan(scanParams, function (err, data) {
  // some code omitted for brevity

  // Build a management API client whose endpoint points at the deployed WebSocket API
  var apigwManagementApi = new AWS.ApiGatewayManagementApi({
    apiVersion: "2018-11-29",
    endpoint: event.requestContext.domainName + "/" + event.requestContext.stage
  });
  var postParams = {
    Data: JSON.parse(event.body).data
  };

  // Send the message to every currently connected client
  data.Items.forEach(function (element) {
    postParams.ConnectionId = element.connectionId.S;
    apigwManagementApi.postToConnection(postParams, function (err, data) {
      // some code omitted for brevity
    });
  });
});

When one of the clients sends a message with an action of “sendmessage,” this function scans the DynamoDB table to find all of the currently connected clients. For each of them, it makes a postToConnection call with the provided data.

To send messages properly, you must know the API to which your clients are connected. This means explicitly setting the SDK endpoint to point to your API.

What’s left is to properly track the clients that are connected to the chat room. For this, take advantage of two of the special routeKey values and implement routes for $connect and $disconnect.

The onConnect function inserts the connectionId value from requestContext to the DynamoDB table.

var AWS = require("aws-sdk");
var DDB = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

exports.handler = function(event, context, callback) {
  // Persist the new client's connection identifier so later messages can reach it
  var putParams = {
    TableName: process.env.TABLE_NAME,
    Item: {
      connectionId: { S: event.requestContext.connectionId }
    }
  };

  DDB.putItem(putParams, function(err, data) {
    callback(null, {
      statusCode: err ? 500 : 200,
      body: err ? "Failed to connect: " + JSON.stringify(err) : "Connected"
    });
  });
};

The onDisconnect function removes the record corresponding with the specified connectionId value.

var AWS = require("aws-sdk");
var DDB = new AWS.DynamoDB({ apiVersion: "2012-08-10" });

exports.handler = function(event, context, callback) {
  // Remove the record for the client that just disconnected
  var deleteParams = {
    TableName: process.env.TABLE_NAME,
    Key: {
      connectionId: { S: event.requestContext.connectionId }
    }
  };

  DDB.deleteItem(deleteParams, function(err) {
    callback(null, {
      statusCode: err ? 500 : 200,
      body: err ? "Failed to disconnect: " + JSON.stringify(err) : "Disconnected."
    });
  });
};

As I mentioned earlier, the onDisconnect function may not be called. To keep from retrying connections that are never going to be available again, add some additional logic in case the postToConnection call does not succeed:

if (err.statusCode === 410) {
  // 410 GONE: the connection is stale, so remove its identifier from the table
  console.log("Found stale connection, deleting " + postParams.ConnectionId);
  DDB.deleteItem({ TableName: process.env.TABLE_NAME,
                   Key: { connectionId: { S: postParams.ConnectionId } } });
} else {
  console.log("Failed to post. Error: " + JSON.stringify(err));
}

API Gateway returns a status of 410 GONE when the connection is no longer available. If this happens, delete the identifier from the DynamoDB table.

To make calls to your connected clients, your application needs a new permission: “execute-api:ManageConnections”. This is handled for you in the AWS SAM template.
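If you manage the function's execution role yourself instead, the policy statement granting that permission looks roughly like the following; the Region, account ID, and API ID in the Resource ARN are placeholders.

{
  "Effect": "Allow",
  "Action": "execute-api:ManageConnections",
  "Resource": "arn:aws:execute-api:us-east-1:123456789012:abc123/*"
}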

Deploy the WebSocket API

The next step is to deploy the WebSocket API. Because it’s the first time that you’re deploying the API, create a stage, such as “dev,” and give a sample description.

The Stage editor screen displays all the information for managing your WebSocket API, such as Settings, Logs/Tracing, Stage Variables, and Deployment History. Also, make sure that you check your limits and read more about API Gateway throttling.

Make sure that you monitor your tests using the dashboard available for your newly created WebSocket API.

Test the chat API

To test the WebSocket API, you can use wscat, an open-source, command line tool.

  1. Install NPM.
  2. Install wscat:
    $ npm install -g wscat
  3. On the console, connect to your published API endpoint by executing the following command:
    $ wscat -c wss://{YOUR-API-ID}.execute-api.{YOUR-REGION}.amazonaws.com/{STAGE}
  4. To test the sendMessage function, send a JSON message like the following example. The Lambda function sends it back using the callback URL:
    $ wscat -c wss://{YOUR-API-ID}.execute-api.{YOUR-REGION}.amazonaws.com/dev
    connected (press CTRL+C to quit)
    > {"action":"sendmessage", "data":"hello world"}
    < hello world

If you run wscat in multiple consoles, each one receives the “hello world” message.

You’re now ready to send messages and get responses back and forth from your WebSocket API!

Conclusion

In this post, I showed you how to use WebSocket APIs in API Gateway, including a message exchange between client and server using routes and the route selection expression. You used Lambda and DynamoDB to create a serverless real-time chat application. While this post only scratched the surface of what is possible, you did all of this without needing to manage any servers at all.

You can get started today by creating your own WebSocket APIs in API Gateway. To learn more, see the Amazon API Gateway Developer Guide. You can also watch the Building Real-Time Applications using WebSocket APIs Supported by Amazon API Gateway webinar hosted by George Mao.

I’m excited to see what new applications come up with your use of the new WebSocket API. I welcome any feedback here, in the AWS forums, or on Twitter.

Happy coding!