Tag Archives: Best practices

Best practices for handling EC2 Spot Instance interruptions

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/best-practices-for-handling-ec2-spot-instance-interruptions/

This post is contributed by Scott Horsfield – Sr. Specialist Solutions Architect, EC2 Spot

Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand Instance prices. The only difference between an On-Demand Instance and a Spot Instance is that a Spot Instance can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back. Fortunately, there is a lot of spare capacity available, and handling interruptions in order to build resilient workloads with EC2 Spot is simple and straightforward. In this post, I introduce and walk through several best practices that help you design your applications to be fault tolerant and resilient to interruptions.

By using EC2 Spot Instances, customers can access additional compute capacity between 70%-90% off of On-Demand Instance pricing. This allows customers to run highly optimized and massively scalable workloads that would not otherwise be possible. These benefits make interruptions an acceptable trade-off for many workloads.

When you follow the best practices, the impact of interruptions is insignificant because interruptions are infrequent and don’t affect the availability of your application. At the time of writing this blog post, less than 5% of Spot Instances are interrupted by EC2 before being terminated intentionally by a customer. When interruptions do occur, they can be handled automatically through integrations with AWS services.

Many applications can run on EC2 Spot Instances with no modifications, especially applications that already adhere to modern software development best practices such as microservice architecture. With microservice architectures, individual components are stateless, scalable, and fault tolerant by design. Other applications may require some additional automation to properly handle being interrupted. Examples of these applications include those that must gracefully remove themselves from a cluster before termination to avoid impact to availability or performance of the workload.

In this post, I cover best practices around handling interruptions so that you too can access the massive scale and steep discounts that Spot Instances can provide.

Architect for fault tolerance and reliability

Ensuring your applications are architected to follow modern software development best practices is the first step in successfully adopting Spot Instances. Software development best practices, such as graceful degradation, externalizing state, and microservice architecture already enforce the same best practices that can allow your workloads to be resilient to Spot interruptions.

A great place to start when designing new workloads, or evaluating existing workloads for fault tolerance and reliability, is the Well-Architected Framework. The Well-Architected Framework is designed to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. Based on five pillars — Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization — the Framework provides a consistent approach for customers to evaluate architectures, and implement designs that will scale over time.

The Well-Architected Tool is a free tool available in the AWS Management Console. Using this tool, you can create self-assessments to identify and correct gaps in your current architecture that might affect your tolerance of Spot interruptions.

Use Spot integrated services

Some AWS services that you might already use, such as Amazon ECS, Amazon EKS, AWS Batch, and AWS Elastic Beanstalk, have built-in support for managing the lifecycle of EC2 Spot Instances. These services do so through integration with EC2 Auto Scaling, which allows you to build fleets of compute using On-Demand Instances and Spot Instances with a mixture of instance types and launch strategies. Through these services, tasks such as replacing interrupted instances are handled automatically for you. For many fault-tolerant workloads, simply replacing the instance upon interruption is enough to ensure the reliability of your service. For advanced use cases, EC2 Auto Scaling groups provide notifications, auto-replacement of interrupted instances, lifecycle hooks, and weights. These advanced features give you more control over the composition and lifecycle of your compute infrastructure.

Spot-integrated services automate processes for handling interruptions. This allows you to stay focused on building new features and capabilities, and avoid the additional cost that custom automation may accrue over time.

Most examples in this post use EC2 Auto Scaling groups to demonstrate best practices because of the built-in integration with other AWS services. Also, Auto Scaling brings flexibility when building scalable fault-tolerant applications with EC2 Spot Instances. However, these same best practices can also be applied to other services such as Amazon EMR, EC2 Fleet, and your own custom automation and frameworks.

Choose the right Spot Allocation Strategy

It is a best practice to use the capacity-optimized Spot Allocation Strategy when configuring your EC2 Auto Scaling group. This allows Auto Scaling groups to launch instances from Spot Instance pools with the most available capacity. Since Spot Instances can be interrupted when EC2 needs the capacity back, launching instances optimized for available capacity is a key best practice for reducing the possibility of interruptions.

You can simply select the Spot allocation strategy when creating a new Auto Scaling group, or when modifying an existing Auto Scaling group from the console.

capacity-optimized allocation

Choose instance types across multiple instance sizes, families, and Availability Zones

It is important that the Auto Scaling group has a diverse set of options to choose from so that services launch instances optimally based on capacity. You do this by configuring the Auto Scaling group to launch instances of multiple sizes and families, across multiple Availability Zones. Each instance type of a particular size, family, and Availability Zone in each Region is a separate Spot capacity pool. When you give an Auto Scaling group that uses the capacity-optimized allocation strategy a diverse set of Spot capacity pools, your instances are launched from the deepest pools available.

When you create a new Auto Scaling group, or modify an existing Auto Scaling group, you can specify a primary instance type and secondary instance types. The Auto Scaling group also provides recommended options. The following image shows what this configuration looks like in the console.

instance type selection

When using the recommended options, the Auto Scaling group is automatically configured with a diverse list of instance types across multiple instance families, generations, and sizes. We recommend leaving as many of these instance types in place as possible.
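If you manage Auto Scaling groups programmatically, the same configuration can be expressed through the API. The following is a minimal boto3 sketch that creates an Auto Scaling group with a diversified set of instance types and the capacity-optimized allocation strategy; the launch template name, subnet IDs, and instance type list are placeholders you would replace with your own.

import boto3

autoscaling = boto3.client('autoscaling')

# Launch template name, subnets, and instance types below are placeholders.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName='my-spot-asg',
    MinSize=1,
    MaxSize=10,
    DesiredCapacity=2,
    VPCZoneIdentifier='subnet-11111111,subnet-22222222,subnet-33333333',  # multiple AZs
    MixedInstancesPolicy={
        'LaunchTemplate': {
            'LaunchTemplateSpecification': {
                'LaunchTemplateName': 'my-launch-template',
                'Version': '$Latest'
            },
            # Diversify across sizes and families so the capacity-optimized
            # strategy has deep Spot capacity pools to choose from
            'Overrides': [
                {'InstanceType': 'm5.large'},
                {'InstanceType': 'm5a.large'},
                {'InstanceType': 'm4.large'},
                {'InstanceType': 'c5.large'},
                {'InstanceType': 'c5.xlarge'},
                {'InstanceType': 'r5.large'}
            ]
        },
        'InstancesDistribution': {
            'OnDemandBaseCapacity': 0,
            'OnDemandPercentageAboveBaseCapacity': 0,
            'SpotAllocationStrategy': 'capacity-optimized'
        }
    }
)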

Handle interruption notices

For many workloads, the replacement of interrupted instances from a diverse set of instance choices is enough to maintain the reliability of your application. In other cases, you can gracefully decommission an application on an instance that is being interrupted. You can do this by detecting that an instance is going to be interrupted and reacting through automation. The good news is, there are several ways you can capture an interruption warning, which is published two minutes before EC2 reclaims the instance.

Inside an instance

The Instance Metadata Service is a secure endpoint that you can query for information about an instance directly from the instance. When a Spot Instance interruption occurs, you can retrieve data about the interruption through this service. This can be useful if you must perform an action on the instance before the instance is terminated, such as gracefully stopping a process, and blocking further processing from a queue. Keep in mind that any actions you automate must be completed within two minutes.

To query this service, first retrieve an access token, and then pass this token to the Instance Metadata Service to authenticate your request. Information about interruptions can be accessed through http://169.254.169.254/latest/meta-data/spot/instance-action. This URI returns a 404 response code when the instance is not marked for interruption. The following code demonstrates how you could query the Instance Metadata Service to detect an interruption.

TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" -v http://169.254.169.254/latest/meta-data/spot/instance-action

If the instance is marked for interruption, you receive a 200 response code. You also receive a JSON-formatted response that includes the action to be taken upon interruption (terminate, stop, or hibernate) and the time when that action will be taken (that is, the end of your two-minute warning period).

{"action": "terminate", "time": "2020-05-18T08:22:00Z"}

The following sample script demonstrates this pattern.

#!/bin/bash

TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`

while sleep 5; do

    HTTP_CODE=$(curl -H "X-aws-ec2-metadata-token: $TOKEN" -s -w %{http_code} -o /dev/null http://169.254.169.254/latest/meta-data/spot/instance-action)

    if [[ "$HTTP_CODE" -eq 401 ]] ; then
        echo 'Refreshing Authentication Token'
        TOKEN=`curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30"`
    elif [[ "$HTTP_CODE" -eq 200 ]] ; then
        # Insert Your Code to Handle Interruption Here
    else
        echo 'Not Interrupted'
    fi

done

Developing automation with the Amazon EC2 Metadata Mock

When developing automation for handling Spot Interruptions within an EC2 Instance, it’s useful to mock Spot Interruptions to test your automation. The Amazon EC2 Metadata Mock (AEMM) project allows for running a mock endpoint of the EC2 Instance Metadata Service, and simulating Spot interruptions. You can use AEMM to serve a mock endpoint locally as you develop your automation, or within your test environment to support automated testing.

1. Download the latest Amazon EC2 Metadata Mock binary from the project repository.

2. Run the Amazon EC2 Metadata Mock configured to mock Spot Interruptions.

amazon-ec2-metadata-mock spotitn --instance-action terminate --imdsv2 

3. Test the mock endpoint with curl.

TOKEN=`curl -X PUT "http://localhost:1338/latest/api/token" -H "X-aws-ec2-  metadata-token-ttl-seconds: 21600"` \
&& curl -H "X-aws-ec2-metadata-token: $TOKEN" –v http://localhost:1338/latest/meta-data/spot/instance-action

With default configuration, you receive a response for a simulated Spot Interruption with a time set to two minutes after the request was received.

{"action": "terminate", "time": "2020-05-13T08:22:00Z"}

Outside an instance

When a Spot Interruption occurs, a Spot instance interruption notice event is generated. You can create rules using Amazon CloudWatch Events or Amazon EventBridge to capture these events, and trigger a response such as invoking a Lambda Function. This can be useful if you need to take action outside of the instance to respond to an interruption, such as graceful removal of an interrupted instance from a load balancer to allow in-flight requests to complete, or draining containers running on the instance.

The generated event contains useful information such as the instance that is interrupted, in addition to the action (terminate, stop, hibernate) that is taken when that instance is interrupted. The following example event demonstrates a typical EC2 Spot Instance Interruption Warning.

{
    "version": "0",
    "id": "12345678-1234-1234-1234-123456789012",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",
    "time": "yyyy-mm-ddThh:mm:ssZ",
    "region": "us-east-2",
    "resources": ["arn:aws:ec2:us-east-2:123456789012:instance/i-1234567890abcdef0"],
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "action"
    }
}

Capturing this event is as simple as creating a new event rule that matches the pattern of the event, where the source is aws.ec2 and the detail-type is EC2 Spot Instance Interruption Warning.

aws events put-rule --name "EC2SpotInterruptionWarning" \
    --event-pattern "{\"source\":[\"aws.ec2\"],\"detail-type\":[\"EC2 Spot Instance Interruption Warning\"]}" \
    --role-arn "arn:aws:iam::123456789012:role/MyRoleForThisRule"
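A common target for such a rule is a Lambda function. The following is a minimal handler sketch that removes the interrupted instance from an Application Load Balancer target group so that in-flight requests can drain; the target group ARN is a hypothetical value supplied through an environment variable.

import os
import boto3

elbv2 = boto3.client('elbv2')

# Hypothetical target group ARN supplied through an environment variable
TARGET_GROUP_ARN = os.environ['TARGET_GROUP_ARN']

def handler(event, context):
    # The interrupted instance ID is available in the event detail
    instance_id = event['detail']['instance-id']

    # Deregister the instance so the load balancer stops sending new requests
    # and in-flight requests can complete during the deregistration delay
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN,
        Targets=[{'Id': instance_id}]
    )

    return {'deregistered': instance_id}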

When responding to these events, it’s common to retrieve additional details about the instance that is being interrupted, such as Tags. When triggering additional API calls in response to Spot Interruption Warnings, keep in mind that AWS APIs implement Request Throttling. So, ensure that you are implementing exponential backoffs and retries as needed and using paginated API calls.

For example, if multiple EC2 Spot Instances were interrupted, you could aggregate the instance-ids by temporarily storing them in DynamoDB and then combine the instance-ids into a single DescribeInstances API call. This allows you to retrieve details about multiple instances rather than implementing individual DescribeInstances API calls that may exceed API limits and result in throttling.

The following script demonstrates describing multiple EC2 Instances with a reusable paginator that can accept API-specific arguments. Additional examples can be found here.

import boto3

ec2 = boto3.client('ec2')

def paginate(method, **kwargs):
    client = method.__self__
    paginator = client.get_paginator(method.__name__)
    for page in paginator.paginate(**kwargs).result_key_iters():
        for item in page:
            yield item

def describe_instances(instance_ids):
    described_instances = []

    response = paginate(ec2.describe_instances, InstanceIds=instance_ids)

    for item in response:
        described_instances.append(item)

    return described_instances

if __name__ == "__main__":

    instance_ids = ['instance1','instance2', '...']

    print(describe_instances(instance_ids))

Container services

When running containerized workloads, it’s common to want to let the container orchestrator know when a node is going to be interrupted. This allows for the node to be marked so that the orchestrator knows to stop placing new containers on an interrupted node, and drain running containers off of the interrupted node. Amazon EKS and Amazon ECS are two popular services that manage containers on AWS, and each have simple integrations that allow for handling interruptions.

Amazon EKS/Kubernetes

With Kubernetes workloads, including self-managed clusters and those running on Amazon EKS, use the AWS-maintained AWS Node Termination Handler to monitor for Spot Interruptions and make requests to the Kubernetes API to mark the node as non-schedulable. This project runs as a DaemonSet on your Kubernetes nodes. In addition to handling Spot Interruptions, it can also be configured to handle Scheduled Maintenance Events.

You can use kubectl to apply the termination handler with default options to your cluster with the following command.

kubectl apply -f https://github.com/aws/aws-node-termination-handler/releases/download/v1.5.0/all-resources.yaml

Amazon ECS

With Amazon ECS workloads, you can enable Spot Instance draining by passing a configuration parameter to the ECS container agent. Once enabled, when a container instance is marked for interruption, ECS receives the Spot Instance interruption notice and places the instance in DRAINING status. This prevents new tasks from being scheduled for placement on the container instance. If other container instances are available in the cluster, replacement service tasks are started on them to maintain your desired number of running tasks.

You can enable this setting by configuring the user data on your container instances to apply the setting at boot time.

#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=MyCluster
ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
ECS_CONTAINER_STOP_TIMEOUT=120s
EOF
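The agent-based setting above handles draining for you. If you instead respond to interruption events outside the instance (for example, from the Lambda pattern shown earlier), you can also place the container instance into DRAINING status through the ECS API. The following boto3 sketch, with a hypothetical cluster name, finds the container instance that corresponds to the interrupted EC2 instance and drains it.

import boto3

ecs = boto3.client('ecs')

def drain_instance(instance_id, cluster='MyCluster'):  # hypothetical cluster name
    # Find the container instance ARN that corresponds to the interrupted EC2 instance
    # (for large clusters, paginate list_container_instances)
    arns = ecs.list_container_instances(cluster=cluster)['containerInstanceArns']
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)

    to_drain = [
        ci['containerInstanceArn']
        for ci in described['containerInstances']
        if ci['ec2InstanceId'] == instance_id
    ]

    if to_drain:
        # DRAINING stops new task placement and starts replacement tasks elsewhere
        ecs.update_container_instances_state(
            cluster=cluster,
            containerInstances=to_drain,
            status='DRAINING'
        )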

Inside a Container Running on Amazon ECS

When an ECS container instance is interrupted and marked as DRAINING, running tasks are stopped on the instance. When these tasks are stopped, a SIGTERM signal is sent to the running task, and ECS waits up to two minutes before forcefully stopping the task, which results in a SIGKILL signal being sent to the running container. This two-minute window is configurable through a stopTimeout container timeout option, or through ECS agent configuration as shown in the prior code, giving you flexibility within your container to handle the interruption. Setting this value to greater than 120 seconds does not prevent your instance from being interrupted after the two-minute warning, so I recommend setting it to less than or equal to 120 seconds.

You can capture the SIGTERM signal within your containerized applications. This allows you to perform actions such as preventing the processing of new work, checkpointing the progress of a batch job, or gracefully exiting the application to complete tasks such as ensuring database connections are properly closed.

The following example Golang application shows how you can capture a SIGTERM signal, perform cleanup actions, and gracefully exit an application running within a container.

package main

import (
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func cleanup() {
    log.Println("Cleaning Up and Exiting")
    time.Sleep(10 * time.Second)
    os.Exit(0)
}

func do_work() {
    log.Println("Working ...")
    time.Sleep(10 * time.Second)
}

func main() {
    log.Println("Application Started")

    // Use a buffered channel so the notification is not dropped if the
    // application is busy when the signal arrives.
    signalChannel := make(chan os.Signal, 1)
    signal.Notify(signalChannel, os.Interrupt, syscall.SIGTERM)

    // Register the signal handler once, before starting the work loop.
    go func() {
        sig := <-signalChannel
        if sig == syscall.SIGTERM {
            log.Println("SIGTERM Captured - Calling Cleanup")
            cleanup()
        }
    }()

    for {
        do_work()
    }
}

EC2 Spot Instances are a great fit for containerized workloads because containers can flexibly run on instances from multiple instance families, generations, and sizes. Whether you’re using Amazon ECS, Amazon EKS, or your own self-hosted orchestrator, use the patterns and tools discussed in this section to handle interruptions. Existing projects such as the AWS Node Termination Handler are thoroughly tested and vetted by multiple customers, and remove the need for investment in your own customized automation.

Implement checkpointing

For some workloads, interruptions can be very costly. Examples include applications that must restart work on a replacement instance, especially if they must restart from the beginning. This is common in batch workloads, sustained load testing scenarios, and machine learning training.

To lower the cost of interruption, investigate patterns for implementing checkpointing within your application. Checkpointing means that as work is completed within your application, progress is persisted externally. So, if work is interrupted, it can be restarted from where it left off rather than from the beginning. This is similar to saving your progress in a video game and restarting from the last save point vs. the start of the level. When working with an interruptible service such as EC2 Spot, resuming progress from the latest checkpoint can significantly reduce the cost of being interrupted, and reduce any delays that interruptions could otherwise incur.

For some workloads, enabling checkpointing may be as simple as setting a configuration option to save progress externally (Amazon S3 is often a great place to store checkpointing data). For example, when running an object detection model training job on Amazon SageMaker with Managed Spot Training, enabling checkpointing is as simple as passing the right configuration to the training job.

The following example demonstrates how to configure a SageMaker Estimator to train an object detection model with checkpointing enabled. You can access the complete example notebook here.

od_model = sagemaker.estimator.Estimator(
    image_name=training_image,
    base_job_name='SPOT-object-detection-checkpoint',
    role=role, 
    train_instance_count=1, 
    train_instance_type='ml.p3.2xlarge',
    train_volume_size = 50,
    train_use_spot_instances=True,
    train_max_wait=3600,
    train_max_run=3600,
    input_mode= 'File',
    checkpoint_s3_uri="s3://{}/{}".format(s3_checkpoint_path, uuid.uuid4()),
    output_path=s3_output_location,
    sagemaker_session=sess)

For other workloads, enabling checkpointing may require extending your custom framework, or a public framework, to persist data externally and then reload that data when an instance is replaced and work is resumed.

For example, MXNet, a popular open source library for deep learning, has a mechanism for executing custom callbacks during the execution of deep learning jobs. Custom callbacks can be invoked at the completion of each training epoch, and can perform custom actions such as saving checkpointing data to S3. Your deep learning training container could then be configured to look for any saved checkpointing data once restarted, and load that data locally so that training can continue from the latest checkpoint.
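As an illustration only, the following sketch shows what such a callback could look like with MXNet’s Module API, saving a checkpoint at the end of each epoch and copying the files to S3. The bucket name and local checkpoint prefix are hypothetical, and the callback is wired into training with the epoch_end_callback argument.

import glob
import os

import boto3
import mxnet as mx

s3 = boto3.client('s3')

CHECKPOINT_BUCKET = 'my-checkpoint-bucket'        # hypothetical bucket
CHECKPOINT_PREFIX = '/opt/ml/checkpoints/model'   # hypothetical local file prefix

def s3_checkpoint_callback(epoch, symbol, arg_params, aux_params):
    # Persist the model locally, then copy the artifacts to S3 so that a
    # replacement instance can resume training from the latest epoch
    mx.model.save_checkpoint(CHECKPOINT_PREFIX, epoch, symbol, arg_params, aux_params)
    for local_file in glob.glob(CHECKPOINT_PREFIX + '*'):
        s3.upload_file(local_file, CHECKPOINT_BUCKET,
                       'checkpoints/' + os.path.basename(local_file))

# Wire the callback into the training loop, for example:
# module.fit(train_iter, num_epoch=10, epoch_end_callback=s3_checkpoint_callback)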

By enabling checkpointing, you ensure that your workload is resilient to any interruptions that occur. Checkpointing can help to minimize data loss and the impact of an interruption on your workload by saving state and recovering from a saved state. When using existing frameworks, or developing your own framework, look for ways to integrate checkpointing so that your workloads have the flexibility to run on EC2 Spot Instances.

Track the right metrics

The best practices discussed in this blog can help you build reliable and scalable architectures that are resilient to Spot Interruptions. Architecting for fault tolerance is the best way you can ensure that your applications are resilient to interruptions, and is where you should focus most of your energy. Monitoring your services is important because it helps ensure that best practices are being effectively implemented. It’s critical to pick metrics that reflect availability and reliability, and not get caught up tracking metrics that do not directly correlate to service availability and reliability.

Some customers choose to track Spot Interruptions, which can be useful when done for the right reasons, such as evaluating tolerance to interruptions or conducting testing. However, frequency of interruption, or the number of interruptions of a particular instance type, are examples of metrics that do not directly reflect the availability or reliability of your applications. Since Spot interruptions can fluctuate dynamically based on overall Spot Instance availability and demand for On-Demand Instances, tracking interruptions often results in misleading conclusions.

Rather than tracking interruptions, look to track metrics that reflect the true reliability and availability of your service including:

Conclusion

The best practices we’ve discussed, when followed, can help you run your stateless, scalable, and fault tolerant workloads at significant savings with EC2 Spot Instances. Architect your applications to be fault tolerant and resilient to failure by using the Well-Architected Framework. Lean on Spot Integrated Services such as EC2 Auto Scaling, AWS Batch, Amazon ECS, and Amazon EKS to handle the provisioning and replacement of interrupted instances. Be flexible with your instance selections by choosing instance types across multiple families, sizes, and Availability Zones. Use the capacity-optimized allocation strategy to launch instances from the Spot Instance pools with the most available capacity. Lastly, track the right metrics that represent the true availability of your application. These best practices help you safely optimize and scale your workloads with EC2 Spot Instances.

Managing backend requests and frontend notifications in serverless web apps

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/managing-backend-requests-and-frontend-notifications-in-serverless-web-apps/

Web and mobile applications usually interact with a backend service, often via an API. Many front-end applications pass requests for processing, wait for a result, and then display this to the user. This synchronous approach is only one way to handle messages, but modern applications have alternatives to provide a better user experience.

There are three common ways to make and manage requests from your frontend. This blog post explains the benefits and use-cases of each approach. This post references the Ask Around Me example application, which allows users to ask and answer questions in their local geographic area in real time. To learn more, refer to part 1 of the blog application series.

The synchronous model

The synchronous call is the most common API request pattern, where the caller makes a request to an API and then waits for the response:

Synchronous API example

This type of request is easy to implement and understand because it mirrors the functional call-response pattern many developers are familiar with. The requestor is blocked until the call completes, so it is well suited to simple requests with short execution times. Use cases include retrieving the contents of a shopping cart, looking up a value in a database, or submitting an email address from a web form.

However, when your API interacts with other services or external workflows, the synchronous model can have limitations. In the following ecommerce example, any slow performance in the downstream services delays the entire roundtrip performance. Additionally, any outages in one of those services may result in an apparent failure of the entire service.

Tight coupling of services

For services with lengthy workflows, you may reach API Gateway’s 29-second integration timeout. In the following ride sharing example, the service responsible for finding available drivers may have a highly variable response time. This request may time out. It also provides a poor user experience, as there is no feedback to the user for a considerable period.

Lengthy request example

Synchronous requests also have other limitations. You cannot receive more than one response per request, nor can you subscribe to future changes in data. In this example, the API request can only inform callers about drivers at the end of the lengthy request process.

The asynchronous model

Asynchronous tasks are common in serverless applications and distributed applications. They allow separate parts of an application to communicate without needing to wait for a synchronous response. Asynchronous workloads often use queues between services to help manage throughput and assist with retry logic.

With asynchronous tasks, the caller hands off the event and continues on to the next task after receiving the acknowledgment response. The caller does not wait for the entire task to complete. The downstream service works on this event while the caller continues servicing other requests. The ecommerce example, converted to an asynchronous flow, looks like this:

Asynchronous request example

In this example, a caller submits an order and receives a response from API Gateway almost immediately. With service integrations, API Gateway then stores the request directly in a durable store such as Amazon SQS or DynamoDB, before any processing has occurred. This results in a relatively consistent caller response time, regardless of downstream service processing time.

The downstream services fetch messages from the SQS queue or the DynamoDB table for processing. If there is a downstream outage, messages are persisted in the queue and may be retried later. From the user’s perspective, the request has been successfully submitted.

The Ask Around Me application handles the publishing of both questions and answers asynchronously. The API passes the user data to a Lambda function that stores the message in an SQS queue. SQS responds immediately to indicate that the message has been stored successfully, ending the API response. Another Lambda function then takes the messages from the SQS queue and processes these independently.

Ask Around Me save question example
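A minimal sketch of that pattern, assuming the queue URL is provided through an environment variable and the API uses a Lambda proxy integration, could look like the following.

import json
import os

import boto3

sqs = boto3.client('sqs')

QUEUE_URL = os.environ['QUEUE_URL']   # hypothetical environment variable

def handler(event, context):
    # Hand the payload off to SQS; a separate function processes it later
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=event['body'])

    # Acknowledge the caller immediately, without waiting for processing
    return {
        'statusCode': 200,
        'body': json.dumps({'status': 'accepted'})
    }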

Both synchronous and asynchronous requests are useful for different functions in web applications, so it can be helpful to compare their features and behaviors:

Synchronous requests | Asynchronous requests
The caller waits until the end of processing for a response. | The caller receives an acknowledgment quickly while processing continues.
Waiting may incur cost. | Minimizes the cost of waiting.
Downstream slowness or outages affects the overall request. | Queuing separates ingestion of the request from the processing of the request.
Passes payloads between steps. | More often passes transaction identifiers.
Failure affects entire request. | Failure only affects segment of request.
Easy to implement. | Moderate complexity in implementation.

Handling response values and state for asynchronous requests

With asynchronous processes, you cannot pass a return value back to the caller in the same way as you can for synchronous processes. Beyond the initial acknowledgment that the request has been received, there is no return path to provide further information. There are a couple of options available to web and mobile developers to track the state of inflight requests:

  • Polling: the initial request returns a tracking identifier. You create a second API endpoint for the frontend to check the status of the request, referencing the tracking ID. Use DynamoDB or another data store to track the state of the request.
  • WebSocket: this is a bidirectional connection between the frontend client and the backend service. It allows you to send additional information after the initial request is completed. Your backend services can continue to send data back to the client by using a WebSocket connection.

Polling is a simple mechanism to implement for many systems but can result in many empty calls. There is also a delay between data availability and the client being notified. WebSockets provide notifications that are closer to real time and reduce the number of messages between the client and backend system. However, implementing WebSockets is often more complex.
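For the polling option, the status endpoint can be a small Lambda function that reads the request state by its tracking identifier. The following sketch assumes a hypothetical DynamoDB table named RequestStatus keyed on requestId, and an API Gateway proxy integration that passes the identifier as a path parameter.

import json

import boto3

table = boto3.resource('dynamodb').Table('RequestStatus')   # hypothetical table

def handler(event, context):
    # Tracking identifier returned to the client by the initial request
    tracking_id = event['pathParameters']['id']

    item = table.get_item(Key={'requestId': tracking_id}).get('Item')
    if not item:
        return {'statusCode': 404, 'body': json.dumps({'error': 'unknown request'})}

    return {
        'statusCode': 200,
        'body': json.dumps({'requestId': tracking_id, 'status': item['status']})
    }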

Using AWS IoT Core for real-time messaging

In both the synchronous and asynchronous models, it’s assumed that the caller makes a request and is only interested in the final result of that request. This doesn’t allow for partial information, such as the percentage of a task complete, or being notified continuously as data changes.

Modern web applications commonly use the publish-subscribe pattern to receive notifications as data changes. From receiving alerts when new email arrives to providing dashboard analytics, this method allows for much richer streams of events from backend systems.

In Ask Around Me, the application uses this pattern when listening for new questions from the user’s local area. The frontend subscribes to the geohash value of the user’s location via the AWS SDK. It then waits for messages published by the backend to this topic.

AWS IoT Core between frontend and backend

The SDK automatically manages the WebSocket connection and also handles many common connectivity issues in web apps. The messages are categorized using topics, which are strings defining channels of messages.

The AWS IoT Core service manages broadcasts between backend publishers and frontend subscribers. This enables fan-out functionality, which occurs when multiple subscribers are listening to the same topic. You can broadcast messages to thousands of frontend devices using this mechanism. For web application integration, this is preferable to using Amazon SNS for implementing publish-subscribe.

The IotData class in the AWS SDK returns a client that uses the MQTT protocol. Once the frontend application establishes the connection, it returns messages, errors, and the connection status via callbacks:

        mqttClient.on('connect', function () {
          console.log('mqttClient connected')
        })

        mqttClient.on('error', function (err) {
          console.log('mqttClient error: ', err)
        })

        mqttClient.on('message', function (topic, payload) {
          const msg = JSON.parse(payload.toString())
          console.log('IoT msg: ', topic, msg)
        })

For more details on how to implement MQTT WebSocket connectivity for your application, see the Ask Around Me sample application code.

Combining multiple approaches for your frontend application

Many frontend applications can combine these models depending on the request type. The Ask Around Me application uses multiple approaches in managing the state of user questions:

Combining multiple models in one application

  1. When the application starts, it retrieves an initial set of questions from the synchronous API endpoint. This returns the available list of questions up to this point in time.
  2. Simultaneously, the frontend subscribes to the geohash topic via AWS IoT Core. Any new questions for this geohash location are sent from the backend processing service to the frontend via this topic. This allows the frontend to receive new questions without subsequent API calls.
  3. When a new question is posted, it is saved to the relevant SQS queue and acknowledged. The question is processed asynchronously by a backend process, which sends updates to the topic.

There are several benefits to combining synchronous, asynchronous, and real-time messaging approaches like this. Most importantly, the user experience remains consistent. The user receives immediate feedback to posting new questions and answers, while longer-running processes are managed asynchronously.

When new information becomes available, the frontend is notified in near-real time. This happens without needing to poll an API endpoint or have the user refresh the user interface. This also reduces the number of unnecessary API calls on the backend service, reducing the cost of running this application. Finally, this uses scalable managed services so the frontend application can support large numbers of users without impacting performance.

Conclusion

Web applications commonly use synchronous APIs when communicating with backend services. For longer-running processes, asynchronous workflows can offer an improved user experience and help manage scaling. By using durable message stores like SQS or DynamoDB, you can separate the request ingestion and response from the request processing.

In this post, I show how modern web applications use real-time messaging via WebSockets to improve the user experience. This provides a transport mechanism for pushing state updates from the backend to the frontend client. The AWS IoT Core service can fan out messages using topics, broadcasting messages to large numbers of frontend subscribers.

To see these three methods in an example frontend application, read more about the Ask Around Me example application.

Low-latency computing with AWS Local Zones – Part 1

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/low-latency-computing-with-aws-local-zones-part-1/

This post was contributed by: Pranav Chachra, Rob Chen, Alan Goodman

AWS launched a Local Zone in Los Angeles (LA), California, in late 2019 at re:Invent. Since the launch, we have seen a lot of interest from you, and have worked to bring additional features and services based on your feedback.

In this blog series, we share best practices and recommendations for hosting your applications in Local Zones based on what has worked for customers. Part one focuses on sharing more about Local Zones and how customers are already using the LA Local Zone. In this blog, we also address key questions asked by customers since the launch of the LA Local Zone. In the upcoming parts of this blog series, we do a deep dive into specific use cases for which our customers are using the LA Local Zone. We also list best practices to host your applications in Local Zones and plan to provide related architectural considerations.

AWS Local Zones introduction

local zone base picture

Today, there are a significant number of applications that can run in the AWS Cloud. Most applications work well in our Regions. However, for workloads that require low-latency or local data processing, you need us to bring AWS infrastructure closer to you. Without local AWS presence, you have to procure, operate, and maintain IT infrastructure in your own data center or colocation facility for such workloads. And, you end up building and running such application components with a different set of APIs and tools than the other parts of your applications running in the AWS Cloud. This results in a lot of extra effort and expenses.

And that’s why, based on your feedback, we launched AWS Local Zones, a new type of AWS infrastructure deployment that places compute, storage, and other select services closer to large cities. We designed Local Zones as a powerful construct that extends our existing Regions into new locations. This gives you the ability to run applications on AWS that require single-digit millisecond latencies to your end users or on-premises installations in the local area. Just like Regions, Local Zones are fully managed and supported by AWS, giving you the elasticity, scalability, and security benefits of running on AWS. The first Local Zone is generally available in LA, and is an extension of the US West (Oregon) Region (the parent Region).

Use cases

Like Regions, customers leverage the LA Local Zone for many different use cases. The top customer use cases include:

  1. Media and Entertainment (M&E) content creation: M&E customers are migrating expensive on-premises workstations to the LA Local Zone. They do this to accelerate content creation by getting rid of capacity constraints while improving security and operational efficiency. For artist workstations, latency is the key to having a jitter-free experience on a remote instance. These customers typically require less than five milliseconds of latency from their offices to virtual instances. With Direct Connect, most of these customers are able to achieve as low as 1-2 millisecond latency from their animation hubs in LA to the Local Zone, and run latency-sensitive workloads such as live production and video editing.
  2. Enterprise migration with hybrid architecture: Enterprises have workloads running in their existing on-premises data centers in the LA metro area. These customers use the Local Zone to migrate complex legacy on-premises applications to AWS without expensive revamp of their architecture. Customers have told us that it can be daunting to migrate a portfolio of interdependent applications to the cloud. Now with a Direct Connect to the LA Local Zone, customers can establish a hybrid environment that provides ultra-low latency communication between applications running in the LA Local Zone and on-premises installations. In turn, this enables customers to migrate applications incrementally, simplifying migrations drastically and enabling on-going hybrid deployments in the LA area.
  3. Real-time multiplayer gaming: This category includes gaming companies that deploy game servers for multiplayer sessions all over the world to be closer to gamers. A latency of 20 milliseconds or less is considered ideal for good gameplay experience. Until now, these customers were using on-premises installations in the LA area to supplement AWS presence. However, now, customers are deploying latency-sensitive game servers in the LA Local Zone to run real-time and interactive multiplayer game sessions, enabling them to provide end users in the Southern California area with a great experience.

In the next parts of the blog series, we further dive into these use cases, cover respective best practices, and review architectural guidance for you.

Services and features

AWS launched the LA Local Zone with support for seven Nitro-based Amazon EC2 instance types (T3, C5, M5, R5, R5d, I3en, and G4), two EBS volume types (io1 and gp2), Amazon FSx for Windows File Server, Amazon FSx for Lustre, Application Load Balancer, Amazon VPC, and AWS Direct Connect. Since launch, AWS has added support for Amazon RDS, Amazon EMR, AWS Shield, and Amazon EC2 Dedicated Hosts, and is adding more services based on your feedback.

Other parts of AWS, like AWS CloudFormation templates, CloudWatch, IAM resources, and Organizations, will continue to work as expected, providing you a consistent experience. You can also leverage the full suite of services like Amazon S3 in the parent Region, US West (Oregon), over AWS’s private network backbone with ~20–30 milliseconds latency.

Getting started and using the Local Zone

Now that we reviewed some common use cases and services of Local Zones, we want to get you started using the Local Zone. First, enable the LA Local Zone from the new “Zone settings” section of the EC2 console, as shown in the following image:

console Zone Settings

Once enabled, the Local Zone looks and behaves similarly to an Availability Zone (AZ). You can access Local Zones through the parent Region’s console and API endpoints. The following image shows that the LA Local Zone is visible as us-west-2-lax-1a along with other AZs in the EC2 console:

service health of zone

Once the LA Local Zone is enabled, you can extend your existing VPC from the parent Region to a Local Zone by creating a new VPC subnet assigned to the LA Local Zone:

creating subnet

Once a VPC subnet is established for the LA Local Zone, simply select the subnet while creating local resources. For example, you can launch an EC2 instance in the LA Local Zone by selecting the local subnet, as shown in the following image:

configuring instance details

Local resources are then ready within seconds. You can manage these resources in the LA Local Zone just like resources in AZs:

shows the local zone created
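If you prefer to script these steps rather than use the console, the same flow can be expressed with the AWS SDK. The following boto3 sketch opts in to the LA Local Zone group, creates a subnet in us-west-2-lax-1a, and launches an instance into it; the VPC ID, CIDR block, and AMI ID are placeholders.

import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')

# Opt in to the LA Local Zone group (equivalent to the console "Zone settings" step)
ec2.modify_availability_zone_group(GroupName='us-west-2-lax-1', OptInStatus='opted-in')

# Extend an existing VPC into the Local Zone with a new subnet
subnet = ec2.create_subnet(
    VpcId='vpc-0123456789abcdef0',       # placeholder VPC ID
    CidrBlock='10.0.128.0/20',           # placeholder CIDR block
    AvailabilityZone='us-west-2-lax-1a'
)['Subnet']

# Launch an instance into the Local Zone subnet
ec2.run_instances(
    ImageId='ami-0123456789abcdef0',     # placeholder AMI ID
    InstanceType='t3.medium',            # one of the instance types supported in the LA Local Zone
    MinCount=1,
    MaxCount=1,
    SubnetId=subnet['SubnetId']
)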

Just like Regions, you can set up an Internet Gateway to access your local resources in the LA Local Zone over the internet. Or you can use AWS Direct Connect to route your traffic over a private network connection from any Direct Connect location. To get the best latency performance, you should use one of the Direct Connect locations available in LA:

  • T5 at El Segundo, Los Angeles, CA (recommended for lowest latency to the LA Local Zone)
  • CoreSite LA1, Los Angeles, CA
  • Equinix LA3, El Segundo, CA

Pricing

From a pricing perspective, instances and other local AWS resources running in a Local Zone have their own prices that might differ from the parent Region. Billing reports include a prefix, “LAX1,” or location name, “US West (Los Angeles),” that is specific to a group of Local Zones in LA. EC2 instances in Local Zones are available in On-Demand and Spot form, and you can also purchase Savings Plans. For pricing information, you can visit the pricing section of the respective services and filter by choosing the Local Zone location as “US West (Los Angeles)” in the dropdown.

Data transfer charges in AWS Local Zones are the same as in the AZs in the parent Region today. For example, data transferred between Amazon EC2 instances in the LA Local Zone and Amazon S3 in the parent Region, US West (Oregon), is free. Similarly, data transferred in and out between Amazon EC2 in the Local Zone and Amazon EC2 in the parent Region is charged at $0.01/GB in each direction. You can learn more about data transfer prices “in” and “out” of Amazon EC2 here.

Thinking ahead

Later this year, AWS plans to open a second Local Zone in LA (us-west-2-lax-1b). The two Local Zones in LA will be interconnected with high-bandwidth, low-latency AWS networking allowing you to architect your low-latency applications for high availability and fault tolerance. Based on your feedback, we are also working on adding Local Zones in other locations along with the availability of the additional services, including Amazon ECS, Amazon Elastic Kubernetes Service, Amazon ElastiCache, Amazon ES, and Amazon Managed Streaming for Apache Kafka.

Conclusion

Now that we’ve delivered the AWS Local Zones you requested, we’re really looking forward to seeing what you can do with them. AWS would love to get your advice on locations, additional local services and features, or other interesting use cases, so feel free to leave us your comments!

Upgrading to Amazon EventBridge from Amazon CloudWatch Events

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/upgrading-to-amazon-eventbridge-from-amazon-cloudwatch-events/

Amazon EventBridge was announced at the AWS New York Summit in 2019. It’s a serverless event bus service that uses the Amazon CloudWatch Events API, but also includes more functionality, like the ability to ingest events from SaaS apps. Event-based architectures can make it easier to decouple your application services and make your systems more extensible.

Events are emitted from services throughout AWS, and you can create custom events from your own applications. The popularity of event-based compute is why CloudWatch Events grew to process trillions of events every month. See this video for an overview of the features of the EventBridge service.

EventBridge is designed to extend the event model beyond AWS, bringing data from software as a service (SaaS) providers into your AWS environment. This means you can consume events from popular providers such as Zendesk, PagerDuty, and Auth0. You can use these in your applications with the same ease as any AWS-generated event.

This blog post explains the differences between CloudWatch Events and EventBridge, and the benefits of upgrading. I also provide additional resources to help you gain the benefits of using the newer EventBridge service.

Integrated SaaS providers in EventBridge

Fetching data from third-party providers has typically relied on processes such as polling, or building custom webhooks where these are supported. Polling is compute-intensive and often wasteful, since many requests return no new data. Additionally, there is a lag between new data becoming available and your system receiving the information.

While webhooks offer improvements over polling and may approximate a real-time connection, these requests travel over the public internet. This means you must secure the webhook endpoint, and use a security mechanism between the two services. You must also scale up if the SaaS provider sends large numbers of messages.

EventBridge enables SaaS providers to publish data on an event bus within their AWS environment. These events are then routed to your AWS account, and appear on your partner event bus. All of this happens on the private AWS network, away from the public internet. The service manages scaling, security, and integration for you.

Configuring EventBridge in your SaaS provider account is easy. To learn how to set up the integration, see these videos for a full walkthrough:

A full list of EventBridge SaaS integrations is available on the EventBridge website. Additionally, if you want to integrate your own SaaS software with EventBridge, read more in the onboarding guide.

Custom event buses and enhanced rules

CloudWatch Events provides a default event bus that exists in every AWS account. All AWS events are routed via the default bus. You can also choose to publish your custom events to the default bus.

EventBridge introduces custom event buses you can use exclusively for your own workloads. These can be useful for controlling access to events to a limited set of AWS accounts or custom applications. Custom event buses are free to set up. Watch this video to see how to set up a custom event bus.

You can also use powerful advanced rules for routing events. With content-based filtering, you can use comparison operators to identify and filter values in events. This allows you to use EventBridge to handle more processing on behalf of your application, reducing the load on downstream services. See this blog post to learn more about using content-filtering to build advanced rules.
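The following boto3 sketch ties these two ideas together: it creates a hypothetical custom event bus, adds a rule that uses content-based filtering (numeric matching on an order total), and publishes a custom event to the bus. The bus name, source, and event fields are illustrative assumptions.

import json

import boto3

events = boto3.client('events')

# Create a custom event bus dedicated to your workload (hypothetical name)
events.create_event_bus(Name='my-app-bus')

# Content-based filtering: only match order events above a value threshold
events.put_rule(
    Name='LargeOrders',
    EventBusName='my-app-bus',
    EventPattern=json.dumps({
        'source': ['my-app.orders'],
        'detail': {
            'total': [{'numeric': ['>', 100]}]
        }
    })
)

# Publish a custom event to the bus
events.put_events(Entries=[{
    'EventBusName': 'my-app-bus',
    'Source': 'my-app.orders',
    'DetailType': 'OrderCreated',
    'Detail': json.dumps({'orderId': '1234', 'total': 250})
}])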

Schema registry

One of the traditional challenges of working with event-based architectures is managing event structures. With different applications, services, and microservices publishing events, it can be hard to standardize event formats. These formats, or schemas, may also change when developers introduce new versions of their services.

The EventBridge Schema Registry allows you to automate the discovery and creation of schemas. You can find schemas for AWS services, integrated SaaS providers, and your own custom events. EventBridge infers the schemas and stores these in a registry. You can then download custom code bindings for popular type-safe programming languages. This accelerates the development process, making it easy to construct objects based on events.

Watch this video to see how to use the EventBridge Schema Registry, and get started with your own applications.

Migration and compatibility

Comparing the EventBridge console and CloudWatch Events console, EventBridge has a new design that makes it easier to build and manage rules and event buses. If you are new to using events within AWS, it’s recommended that you start with the EventBridge console.

The EventBridge service uses the CloudWatch Events API and it is fully backward compatible. Any rules you have configured in CloudWatch Events continue to work in EventBridge. You can access the default event bus and any configurations you created in CloudWatch Events from EventBridge immediately. If you’re a current CloudWatch Events customer, you can upgrade to EventBridge by simply opening the EventBridge console.

AWS continues to build new functionality to enhance the capabilities of event-based architectures. These features are only released via EventBridge, so it’s recommended that you upgrade to ensure you can take advantage of these new capabilities.

Additional EventBridge resources for serverless developers

Event-based architectures can help serverless developers create dynamic, decoupled applications. Here are additional resources with sample code repos to help you get started:

Conclusion

EventBridge is the evolution of the CloudWatch Events service. It brings new features, including the ability to integrate data from popular SaaS providers as events within AWS.

In this post, I discuss the new ability to create custom event buses, and how you can develop advanced rules for sophisticated event routing. I also discuss the new EventBridge Schema Registry, which automates event schema discovery and lets you download code bindings directly into your IDE.

New event-based features are now released in EventBridge. By migrating from CloudWatch Events, you can take advantage of new capabilities as they are released.

To learn more about using EventBridge for your AWS workloads, visit the EventBridge Learning Path, which includes a range of learning resources.

Architecture Monthly Magazine: Media & Entertainment

Post Syndicated from Annik Stahl original https://aws.amazon.com/blogs/architecture/architecture-monthly-magazine-media-entertainment/

AWS Architecture Monthly - June 2020 - Media & Entertainment

AWS makes it fast and easy for the Media and Entertainment (M&E) industry to produce, process, and deliver broadcast and over-the-top video. These pay-as-you-go services and appliance products offer the video infrastructure you need to deliver great viewing experiences to any screen.

For June’s issue of AWS Architecture Monthly, WW Tech Leader Konstantin Wilms talks about industry and architectural pattern trends in the cloud-based M&E space, and explains why this industry is unique in that almost any business problem can be solved using many AWS services. He also offers advice on what prospective customers should be thinking about and considering when moving to AWS.

In June’s Media & Entertainment issue

  • Ask an Expert: Konstantin Wilms, WW Tech Leader, AWS M&E
  • Blog: Deploying Your Favorite Post Production Applications on AWS Virtual Desktop Infrastructure
  • Case Study: SL Benfica: Launching a Full-Featured VOD Platform in Just Four Weeks
  • Quick Start: Cloud Video Editing on AWS
  • Tutorial: Reimagine Your Studio
  • Whitepaper: Building Media & Entertainment Predictive Analytics Solutions on AWS

How to access the magazine

We hope you’re enjoying Architecture Monthly, and we’d like to hear from you. Leave us a star rating and comment on the Amazon Kindle Newsstand page or contact us anytime at [email protected].

AWS CodeArtifact and your package management flow – Best Practices for Integration

Post Syndicated from John Standish original https://aws.amazon.com/blogs/devops/integrating-aws-codeartifact-package-mgmt-flow/

You often use artifact repositories to store and share software or deployment packages. Centralized artifacts enable teams to operate independently and share versioned software artifacts across your organization. Sharing versioned artifacts across organizations increases code reuse and reduces delivery time. Having a central artifact store enables tighter artifact governance and improves security visibility. This post uses some of these patterns to show you how to integrate AWS CodeArtifact in an effective, cost-controlled, and efficient manner.

AWS CodeArtifact Diagram

AWS CodeArtifact Service Usage

AWS CodeArtifact concepts

AWS CodeArtifact uses the following elements:

  • Asset – An individual file stored in AWS CodeArtifact that is associated with a package version, such as an npm .tgz file or Maven POM and JAR files.
  • Package – A package is a bundle of software and the metadata that is required to resolve dependencies and install the software. AWS CodeArtifact supports npm, PyPI, and Maven package formats.
  • Repository – A CodeArtifact repository contains a set of package versions, each of which maps to a set of assets. Repositories are polyglot: a single repository can contain packages of any supported type. Each repository exposes endpoints for fetching and publishing packages using tools like the npm CLI, the Maven CLI (mvn), and pip.
  • Domain – Repositories are aggregated into a higher-level entity known as a domain. The domain allows organizational policy to be applied across multiple repositories. A domain deduplicates storage of the repositories’ packages.

Creating a domain based on organizational ownership

When you create a domain in CodeArtifact, it’s important to organize the domain by ownership within the organization. An example would be a company being a domain, and the products being repositories. Domains allow you to apply organizational policies across multiple repositories. Generally, we recommend creating one domain per company. In some cases, it may also be beneficial to have a sandbox domain where prototype repositories reside. In a sandbox domain, teams are at liberty to create their own repositories and experiment as needed, without affecting product deliverable assets. However, using a sandbox domain duplicates packages, isolates repositories (you cannot copy packages between domains), and increases costs, since package deduplication is handled at the domain level. Organizing packages by domain ownership increases the cache hits on a package within the domain and reduces cost for each subsequent package fetch request.

Whenever a package is fetched from a repository, the asset is cached in your CodeArtifact domain to minimize the cost of subsequent downstream requests. A given asset only needs to be stored once in a domain, even if it’s available in two—or two thousand—repositories. That means you only pay for storage once. Copying a package version with the CopyPackageVersions API is only possible between repositories within the same CodeArtifact domain.
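For example, the following boto3 sketch promotes a package version from a team repository into the shared repository within the same domain; the source repository, package name, and version are placeholders.

import boto3

codeartifact = boto3.client('codeartifact')

# Promote a validated package version from a team repository to the shared
# repository within the same domain (cross-domain copies are not supported)
codeartifact.copy_package_versions(
    domain='my-org',
    sourceRepository='my-team-repo',              # hypothetical source repository
    destinationRepository='my-shared-repo-name',
    format='npm',
    package='my-package',                         # hypothetical package name
    versions=['1.0.0']
)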

You can create a domain for your organization by calling create-domain in the AWS Command Line Interface (AWS CLI), AWS SDK, or on the CodeArtifact console. See the following code:

aws codeartifact create-domain --domain "my-org"

After creating the domain, you will see it listed in the Domains section on the CodeArtifact console.

AWS CodeArtifact domains per governing organization

Organizing packages by domain ownership

Using a shared repository

A shared repository is appropriate when a team decides that a component is useful to the rest of the organization, isn’t experimental or a personal project, and is meant for wide distribution within the organization. Examples of shared components are open source public repositories (npm, PyPI, and Maven), authentication, logging, or helper libraries. Shared libraries aren’t related to product libraries; for instance, a service contract library shouldn’t live in a shared repository. The shared repository should be marked read-only to all users except for the publishing IAM role. At Amazon, we have found that many teams want to consume common packages as part of their application build, and don’t need to publish any packages themselves. Those teams don’t need their own repository and pull packages from shared. Overall, approximately 80% of packages are downloaded from the shared repository, and 20% from team- or project-specific repositories.

You can create a shared repository by calling the create-repository command and setting a resource policy that makes the repository read-only.

Here is how you create a repository with the AWS CLI using the create-repository command. See the following code:

aws codeartifact create-repository --domain "my-org" \
--domain-owner "account-id" --repository "my-shared-repo-name" \
--description "My new repository"

Next you make the repository read-only by setting a resource policy. See the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ListPackages",
                "codeartifact:ListPackageVersions",
                "codeartifact:ListPackageVersionAssets",
                "codeartifact:ListPackageVersionDependencies",
                "codeartifact:ReadFromRepository"
            ],
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::444455556666:root"
            },
            "Resource": "*"
        }
    ]
}

To attach the resource policy to the repository, call the put-repository-permissions-policy command. See the following code:

aws codeartifact put-repository-permissions-policy --domain "my-org" \
--domain-owner "account-id" --repository "my-shared-repo-name" \
--policy-document file:///PATH/TO/policy.json

When you have created the repository, you will see it listed in the Repositories section on the CodeArtifact console.

A list of shared repositories in AWS CodeArtifact

Shared repositories in AWS CodeArtifact

External repository connections

CodeArtifact enables you to set external repository connections and replicate them within CodeArtifact. An external connection reduces the downstream dependency on the remote external repository. When you request a package from a CodeArtifact repository that’s not already present in the repository, the package can be fetched from the external connection. This makes it possible to consume the open-source dependencies used by your application. Using an external connection also reduces interruption in your development process caused by external package dependencies; for example, if a package is removed from a public repository, you still have a copy of the package stored in CodeArtifact. You should have a one-to-one mapping with external repositories, rather than having multiple CodeArtifact repositories pointing to the same public repository. Each asset that CodeArtifact imports into your repository from a public repository is billed as a single request, and each connection must reconcile and fetch the package before the response is returned. By having a one-to-one mapping, you increase cache hits, reduce the time to download an application dependency from CodeArtifact, and reduce the number of external package resolution requests. Associating an external repository connection with your repository is done using the associate-external-connection command. See the following code:

aws codeartifact associate-external-connection \
--domain "my-org" --domain-owner "account-id" \
--repository "my-external-repo" --external-connection public:npmjs

Once you have associated an external connection with your repository, you’ll see the external connection visible in the Repositories section detail. In this example we’ve connected the repository to the external npmjs repository.

External connection to npmjs with AWS CodeArtifact repositories

External connection to npmjs for an AWS CodeArtifact repository

Team and product repositories

When working in distributed teams, you often align repositories to product or service ownership. Teams working on their own repository can update it as needed. An example would be creating a private package that only your team uses internally.

See the following code:

aws codeartifact create-repository --domain my-org \
--domain-owner account-id --repository my-team-repo \
--description "My new team repository"

As teams develop against the package, they need to publish their changes to the repository. As part of your development pipeline, you would publish the package to the repository. See the following code for an example:

# Log in to CodeArtifact
aws codeartifact login --tool npm \
--domain "my-org" --domain-owner "account-id" \
--repository "my-team-repo"

# Run build commands here
...

# Set $VERSION from your build system
npm version $VERSION

# Publish to CodeArtifact
npm publish

After testing the feature, if you find that it’s usable across your organization, you can copy the package into your shared repository. See the following code:

# Promoting to a shared repo
aws codeartifact copy-package-versions --domain "my-org" \
--domain-owner "account-id" --source-repository "my-team-repo" \
--destination-repository "my-shared-repo" \
--package my-package --format npm \
--versions '["6.0.2"]'

Once you’ve created your shared repository you will see the repositories updated as shown here.

Team and product repositories in AWS CodeArtifact

Team and product repositories

Sharing repositories across accounts

Often, teams or workloads have separate accounts within an organization. This is a recommended practice because it clearly defines operational boundaries, establishes domains of ownership, and enforces security boundaries. If your organization uses a multi-account strategy, you can share repositories across accounts using a CodeArtifact resource policy. Teams can develop in their own account and publish to a CodeArtifact repository controlled in a shared account.
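As an illustration, the following Boto3 sketch attaches a resource policy that grants read-only access to a consumer account; the account IDs, domain, and repository names are placeholders, not values from a real deployment.

import json
import boto3

codeartifact = boto3.client("codeartifact")

consumer_account_id = "111122223333"  # placeholder account allowed to read from the repo

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "CrossAccountReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{consumer_account_id}:root"},
            "Action": [
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ListPackages",
                "codeartifact:ListPackageVersions",
                "codeartifact:ListPackageVersionAssets",
                "codeartifact:ListPackageVersionDependencies",
                "codeartifact:ReadFromRepository"
            ],
            "Resource": "*"
        }
    ]
}

# Note: for cross-account access, the domain itself may also need a permissions
# policy granting codeartifact:GetAuthorizationToken to the consumer account.
codeartifact.put_repository_permissions_policy(
    domain="my-org",
    domainOwner="444455556666",  # placeholder account that owns the shared repository
    repository="my-shared-repo-name",
    policyDocument=json.dumps(policy)
)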

Here you see a list of repositories, which includes both a shared and team repository.

Cross account sharing of AWS CodeArtifact repositories

Using Amazon CloudWatch Events when a package is pushed

When a package is pushed into a repository, the change can affect software dependencies, teams, or process dependencies. When an artifact is pushed to CodeArtifact, an Amazon CloudWatch Events event is emitted, which you can use to trigger additional functionality. You can react to these events by subscribing to a CodeArtifact event in Amazon EventBridge. Some examples of reactions to a change are: checking dependencies, deploying dependent services, notifying teams or services of a change, or building the dependencies.
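For example, you could create an EventBridge rule that invokes a Lambda function whenever a package version changes in a repository. The following Boto3 sketch is illustrative only; the detail-type string and detail fields are assumptions based on the CodeArtifact event format referenced below, and the rule name and function ARN are placeholders.

import json
import boto3

events = boto3.client("events")

# Create a rule matching CodeArtifact package version changes in a specific repository
rule_arn = events.put_rule(
    Name="codeartifact-package-push",
    EventPattern=json.dumps({
        "source": ["aws.codeartifact"],
        "detail-type": ["CodeArtifact Package Version State Change"],
        "detail": {"domainName": ["my-org"], "repositoryName": ["my-shared-repo-name"]}
    }),
    State="ENABLED"
)["RuleArn"]

# Send matching events to a Lambda function (ARN is a placeholder)
events.put_targets(
    Rule="codeartifact-package-push",
    Targets=[{"Id": "on-package-change",
              "Arn": "arn:aws:lambda:us-east-1:111122223333:function:on-package-change"}]
)

# The target Lambda function also needs a resource-based permission allowing
# events.amazonaws.com to invoke it (lambda add-permission).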

You can also use EventBridge to start a pipeline in AWS CodePipeline, notify an Amazon Simple Notification Service (Amazon SNS) topic, and have that call AWS Chatbot. For more information, see CodeArtifact event format and example. If you are looking to integrate AWS Chatbot into your delivery flow, see Receive AWS Developer Tools Notifications over Slack using AWS Chatbot.

Deploying code in a hybrid environment

You can enable seamless software deployment into AWS and on-premises environments by integrating CodeArtifact with software build and deployment services. You can use CodeArtifact with your existing development pipeline tooling such as NPM, Python, and Maven. With native support for these package managers, you can access CodeArtifact wherever you operate today.

First, log in to CodeArtifact, build your code, and finally publish using npm publish with the following code:

# Log in to CodeArtifact 
aws codeartifact login --tool npm \
--domain "my-org" --domain-owner "account-id" \
--repository "my-team-repo"

# Run build commands here 
... 

# Set $VERSION from your build system 
npm version $VERSION 

# Publish to CodeArtifact 
npm publish

Cleaning Up

When you’re ready to clean up the repositories and domains you’ve created, you need to remove them in a specific order. Be aware that deleting a repository is a destructive action that removes any stored packages. To delete the repository and domain created in the previous sections of this blog, use the delete-repository and delete-domain commands.

You will need to remove the domain and repository in the following order:

  1. Remove any repositories in a domain
  2. Remove the domain

To delete the repository and domain, see the following code:

# Delete the repository
aws codeartifact delete-repository --domain "my-org" --domain-owner "account-id" --repository "my-team-repo"

# Delete the domain
aws codeartifact delete-domain --domain "my-org" --domain-owner "account-id"

Conclusion

This post covered how to integrate CodeArtifact into your delivery flow and use CodeArtifact effectively. A shared repository approach aids in creating reusable components across your organization. Using team repositories and promoting to a consumable repository allows your teams to iterate independently. For more information, see Getting started with CodeArtifact.

About the Author

John Standish

John Standish is a Solutions Architect at AWS and spent over 13 years as a Microsoft .Net developer. Outside of work, he enjoys playing video games, cooking, and watching hockey.

Yogesh Chaturvedi

Yogesh Chaturvedi is a Solutions Architect at AWS and has over 20 years of software development and architecture experience.

 

Best practices for organizing larger serverless applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/best-practices-for-organizing-larger-serverless-applications/

Well-designed serverless applications are decoupled, stateless, and use minimal code. As projects grow, a goal for development managers is to maintain the simplicity of design and low-code implementation. This blog post provides recommendations for designing and managing code repositories in larger serverless projects, and best practices for deploying releases of production systems.

Organizing your code repositories

Many serverless applications begin as monolithic applications. This can occur either because a simple application has grown more complex over time, or because developers are following existing development practices. A monolithic application is represented by a single AWS Lambda function performing multiple tasks, and a mono-repo is a single repository containing the entire application logic.

Monoliths work well for the simplest serverless applications that perform single-purpose functions. These are small applications such as cron jobs, data processing tasks, and some asynchronous processes. As those applications evolve into workflows or develop new features, it becomes important to refactor the code into smaller services.

Using frameworks such as the AWS Serverless Application Model (SAM) or the Serverless Framework can make it easier to group common pieces of functionality into smaller services. Each of these can have a separate code repository. For SAM, the template.yaml file contains all the resources and function definitions needed for an application. Consequently, breaking an application into microservices with separate templates is a simple way to split repos and resource groups.

Separate templates for microservices

In the smallest unit of a serverless application, it’s also possible to create one repository per function. If these functions are independent and do not share other AWS resources, this may be appropriate. Helper functions and simple event processing code are examples of candidates for this kind of repo structure.

In most cases, it makes sense to create repos around groups of functions and resources that define a microservice. In an ecommerce example, “Payment processing” is a microservice with multiple smaller related functions that share common resources.

As with any software, the repo design depends upon the use case and the structure of the development teams. One large repo makes it harder for developer teams to work on different features, and to test and deploy independently. Having too many repos can lead to duplicate code and difficulty sharing resources across repos. Finding the balance for your project is an important step in designing your application architecture.

Using AWS services instead of code libraries

AWS services are important building blocks for your serverless applications. These can frequently provide greater scale, performance, and reliability than bundled code packages with similar functionality.

For example, many web applications that are migrated to Lambda use web frameworks like Flask (for Python) or Express (for Node.js). Both packages support routing and separate user contexts that are well suited if the application is running on a web server. Using these packages in Lambda functions results in architectures like this:

Web servers in Lambda functions

In this case, Amazon API Gateway proxies all requests to the Lambda function to handle routing. As the application develops more routes, the Lambda function grows in size and deployments of new versions replace the entire function. It becomes harder for multiple developers to work on the same project in this context.

This approach is generally unnecessary, and it’s often better to take advantage of the native routing functionality available in API Gateway. In many cases, there is no need for the web framework in the Lambda function, which only increases the size of the deployment package. API Gateway is also capable of validating parameters, reducing the need for checking parameters with custom code. It can also provide protection against unauthorized access, and a range of other features more suited to be handled at the service level. When using API Gateway this way, the new architecture looks like this:

Using API Gateway for routing

Additionally, the Lambda functions consist of less code and fewer package dependencies. This makes testing easier and reduces the need to maintain code library versions. Different developers in a team can work on separate routing functions independently, and it becomes simpler to reuse code in future projects. You can configure routes in API Gateway in the application’s SAM template:

Resources:
  GetProducts:
    Type: AWS::Serverless::Function 
    Properties:
      CodeUri: getProducts/
      Handler: app.handler
      Runtime: nodejs12.x
      Events:
        GetProductsAPI:
          Type: Api 
          Properties:
            Path: /getProducts
            Method: get

Similarly, you should usually avoid performing workflow orchestrations within Lambda functions. These are sections of code that call out to other services and functions, and perform subsequent actions based on successful execution or failure.

Lambda functions with embedded workflow orchestrations

These workflows quickly become fragile and difficult to modify for new requirements. They can cause idling in the Lambda function, meaning that the function is waiting for return values from external sources, increasing the cost of execution.

Often, a better approach is to use AWS Step Functions, which can represent complex workflows as JSON definitions in the application’s SAM template. This service reduces the amount of custom code required, and enables long-lived workflows that minimize idling in Lambda functions. It also manages in-flight executions as workflows are upgraded. The example above, rearchitected with a Step Functions workflow, looks like this:

Using Step Functions for orchestration
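As a rough illustration of what this looks like outside the function code, the following Boto3 sketch creates a small two-step state machine; the state names, Lambda ARNs, and IAM role ARN are placeholders. In a SAM project you would more commonly define the state machine in the template itself.

import json
import boto3

sfn = boto3.client("stepfunctions")

# Placeholder workflow: two Lambda tasks chained together
definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:processOrder",
            "Next": "SendConfirmation"
        },
        "SendConfirmation": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111122223333:function:sendConfirmation",
            "End": True
        }
    }
}

sfn.create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111122223333:role/order-workflow-role"  # placeholder execution role
)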

Using multiple AWS accounts for development teams

There are many ways to deploy serverless applications to production. As applications grow and become more important to your business, development managers generally want to improve the robustness of the deployment process. You have a number of options within AWS for managing the development and deployment of serverless applications.

First, it is highly recommended to use more than one AWS account. Using AWS Organizations, you can centrally manage the billing, compliance, and security of these accounts. You can attach policies to groups of accounts to avoid custom scripts and manual processes. One simple approach is to provide each developer with an AWS account, and then use separate accounts for a beta deployment stage and production:

Multiple AWS accounts in a deployment pipeline

The developer accounts can contain copies of production resources and provide the developer with admin-level permissions to these resources. Each developer has their own set of limits for the account, so their usage does not impact your production environment. Individual developers can deploy CloudFormation stacks and SAM templates into these accounts with minimal risk to production assets.

This approach allows developers to test Lambda functions locally on their development machines against live cloud resources in their individual accounts. It can help create a robust unit testing process, and developers can then push code to a repository like AWS CodeCommit when ready.

By integrating with AWS Secrets Manager, you can store different sets of secrets in each environment and eliminate any need for credentials stored in code. As code is promoted from developer account through to the beta and production accounts, the correct set of credentials is automatically used. You do not need to share environment-level credentials with individual developers.
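The following is a minimal sketch of how function code might resolve environment-specific credentials at runtime; the secret naming convention and the ENVIRONMENT variable are assumptions for illustration.

import json
import os
import boto3

secrets = boto3.client("secretsmanager")

def get_database_credentials():
    # ENVIRONMENT is assumed to be set per account/stage (e.g. dev, beta, prod)
    environment = os.environ.get("ENVIRONMENT", "dev")
    secret_id = f"myapp/{environment}/database"  # illustrative naming convention
    response = secrets.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])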

It’s also possible to implement a CI/CD process to start build pipelines when code is deployed. To deploy a sample application using a multi-account deployment flow, follow this serverless CI/CD tutorial.

Managing feature releases in serverless applications

As you implement CI/CD pipelines for your production serverless applications, it is best practice to favor safe deployments over entire application upgrades. Unlike traditional software deployments, serverless applications are a combination of custom code in Lambda functions and AWS service configurations.

A feature release may consist of a version change in a Lambda function. It may have a different endpoint in API Gateway, or use a new resource such as a DynamoDB table. Access to the deployed feature may be controlled via user configuration and feature toggles, depending upon the application. AWS SAM has AWS CodeDeploy built-in, which allows you to configure canary deployments in the YAML configuration:

Resources:
 GetProducts:
   Type: AWS::Serverless::Function
   Properties:
     CodeUri: getProducts/
     Handler: app.handler
     Runtime: nodejs12.x

     AutoPublishAlias: live

     DeploymentPreference:
       Type: Canary10Percent10Minutes 
       Alarms:
         # A list of alarms that you want to monitor
         - !Ref AliasErrorMetricGreaterThanZeroAlarm
         - !Ref LatestVersionErrorMetricGreaterThanZeroAlarm
       Hooks:
         # Validation Lambda functions run before/after traffic shifting
         PreTraffic: !Ref PreTrafficLambdaFunction
         PostTraffic: !Ref PostTrafficLambdaFunction

CodeDeploy automatically creates aliases pointing to the old and new versions of a function. The canary deployment enables you to gradually shift traffic from the old to the new alias as you become confident that the new version is working as expected, or to roll back the update if needed. You can also set PreTraffic and PostTraffic hooks to invoke Lambda functions before and after traffic shifting.

Conclusion

As any software application grows in size, it’s important for development managers to organize code repositories and manage releases. There are established patterns in serverless to help manage larger applications. Generally, it’s best to avoid monolithic functions and mono-repos, and you should scope repositories to either the microservice or function level.

Well-designed serverless applications use custom code in Lambda functions to connect with managed services. It’s important to identify libraries and packages that can be replaced with services to minimize the deployment size and simplify the code base. This is especially true in applications that have been migrated from server-based environments.

Using AWS Organizations, you manage groups of accounts to enable your developers to have their own AWS accounts for development. This enables engineers to clone production assets and test against the AWS Cloud when writing and debugging code. You can use a CI/CD pipeline to push code through a beta environment to production, while safeguarding secrets using Secrets Manager. You can also use CodeDeploy to manage canary deployments easily.

To learn more about deploying Lambda functions with SAM and CodeDeploy, follow the steps in this tutorial.

Running Web Applications on Amazon EC2 Spot Instances

Post Syndicated from Ben Peven original https://aws.amazon.com/blogs/compute/running-web-applications-on-amazon-ec2-spot-instances/

Author: Isaac Vallhonrat, Sr. EC2 Spot Specialist SA

Amazon EC2 Spot Instances allow customers to save up to 90% compared to On-Demand pricing by leveraging spare EC2 capacity. Spot Instances are a perfect fit for fault tolerant workloads that are flexible to run on multiple instance types, such as batch jobs, code builds, load tests, containerized workloads, web applications, big data clusters, and High Performance Computing clusters. In this blog post, I walk you through the how-to and best practices for running web applications on Spot Instances, so you can benefit from the scale and savings that they provide.

As Spot Instances are interruptible, your web application should be stateless, fault tolerant, loosely coupled, and rely on external data stores for persistent data, like Amazon ElastiCache, Amazon RDS, Amazon DynamoDB, etc.

Spot Instances recap

While Spot Instances have been around since 2009, recent updates and service integrations have made them easier than ever to integrate with your workloads. Before I get into the details of how to use them to host a web application, here is a brief recap of how Spot Instances work:

First, Spot Instances are a purchasing option within EC2. There is no difference in hardware when compared to On-Demand or Reserved Instances. The only difference is that these instances can be reclaimed with a two-minute notice if EC2 needs the capacity back.

A set of unused EC2 instances with the same instance type, operating system, and Availability Zone is a Spot capacity pool. Spot Instance pricing is set by Amazon EC2 and adjusts gradually based on long-term trends in supply and demand of EC2 instances in each pool. You can expect Spot pricing to be stable over time, meaning no sudden spikes or swings. You can view historical pricing data for the last three months in both the EC2 console and via the API. The following image is an example of the pricing history for m5.xlarge instances in the N. Virginia (us-east-1) Region.

spot instance pricing history

Figure 1. Spot Instance pricing history for an m5.xlarge Linux/UNIX instance in us-east-1. Price changes independently for each Availability Zone and in small increments over time based on long term supply and demand.
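You can also retrieve this pricing history programmatically. Here is a minimal Boto3 sketch; the instance type, platform, and 90-day window are just examples.

from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Fetch the last 90 days of Spot pricing for m5.xlarge Linux/UNIX instances
response = ec2.describe_spot_price_history(
    InstanceTypes=["m5.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(days=90),
    EndTime=datetime.utcnow()
)

for price in response["SpotPriceHistory"]:
    print(price["AvailabilityZone"], price["Timestamp"], price["SpotPrice"])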

As each capacity pool is independent from each other, I recommend you spread the fleet of instances that power your workload across multiple Availability Zones (AZs) and be flexible to use multiple instance types. This way, you effectively increase the amount of spare capacity you can use to launch Spot Instances and are able to replace interrupted instances with instances from other pools.

The best way to adhere to Spot Instance flexibility best practices and manage your fleet of instances is to use an Amazon EC2 Auto Scaling group. This allows you to combine multiple instance types and purchase options. Once you identify a list of suitable instance types for your workload, you can use Auto Scaling features to launch a combination of On-Demand and Spot Instances from optimal Spot Instance pools in each Availability Zone.

Stateless web applications are a natural fit for Spot Instances: they are fault tolerant, they already handle instance removal through scale-in events in an Auto Scaling group, and they commonly run across multiple Availability Zones. It’s also normally easy to implement instance type flexibility for web applications, so they can tap into Spot Instance capacity from multiple pools.

Identifying suitable instance types for your web application

Being instance type flexible is key to being successful with Spot Instances. At the time of this writing, AWS provides more than 270 instance types, so it’s likely that your web application can run on many of them. For example, if your application runs on the m5.xlarge instance type, it can also run on the m5d.xlarge instance type, as it is the same instance type but with local SSDs. Your application also performs similarly on the r5.xlarge or r5d.xlarge instance types, since they use the same processor family; the difference is that your application runs on an instance with more memory, and you still benefit from the savings Spot Instances provide.

You may be able to qualify additional similarly-sized instance types from other families or generations, like the AMD-based variants m5a.xlarge and r5a.xlarge, or the higher bandwidth variants m5n.xlarge and r5n.xlarge. These may present some performance differences compared to other types due to their hardware and/or the virtualization system. However, Application Load Balancer has mechanisms in place to spread the load across your instances according to the number of outstanding requests they are processing. This means, instances handling longer requests or that have lower processing power are not overloaded. This allows you to acquire capacity from even more Spot capacity pools regardless of hardware differences. Later in the blog post, I dive into the details on the load balancing mechanisms.

Configure your Auto Scaling group

Now that you have qualified multiple instance types to run your web application, you need to create an EC2 Auto Scaling group.

The first step is to create a Launch Template where you specify the configuration for your fleet of instances, including the AMI, security groups, tags, user data scripts, and so on. You configure a single instance type on the template: the instance type you commonly use to run your web application. Also, do not mark the Request Spot Instances checkbox on the template. You configure multiple instance types and the Spot Instance settings when configuring the Auto Scaling group. Here’s my Launch Template:

launch templates console

Figure 2. Launch templates console

Now, go to the EC2 Auto Scaling console to create an Auto Scaling group. In the wizard, select the launch template you created and then click next. Then, on the second step of the wizard, select Combine purchase options and instance types. You can see the configuration options in the following image.

combine EC2 purchase options

Figure 3. Combine purchase options and instance types configuration.

Here is a detailed look at the configuration details:

  • Optional On-Demand Base: This parameter specifies a baseline of On-Demand Instances within your Auto Scaling group. This is also handy if you have Reserved Instances or Savings Plans to cover your compute baseline, since On-Demand Instances apply towards your Savings Plans or the instances matching your reservation are billed as Reserved Instances.
  • On-Demand percentage above base: Once the size of the Auto Scaling group grows over the (optional) On-Demand base capacity, this parameter defines the percentage of instances that are On-Demand and Spot Instances. The default settings are a good starting point, set at 70% On-Demand and 30% Spot. You can adjust these percentages as suitable for your workload.
  • Spot Allocation Strategy per Availability Zone: This defines the logic that Auto Scaling uses to select which instance type to launch in each Availability Zone when launching Spot Instances. The possible Spot allocation strategies are:
    • Capacity Optimized: This strategy allocates your instances from the Spot Instance pool with optimal capacity for the number of instances that are launching. This is the default selection and the recommended one, as launching from capacity pools with optimal capacity can lower the chance of interruption. Every time EC2 Auto Scaling scales out, the strategy is evaluated, so as your Auto Scaling group scales out and scales in, your instances are recycled and you make use of Spot Instances in the optimal pools.
    • Lowest price: This strategy allocates your instances from the number (N) of Spot Instance pools that you specify and from the pools with the lowest price per unit at the time of fulfillment.

Next, you choose your instance types from the Instance Types section of the wizard. Here, there is a primary instance type inherited from your Launch Template. The console then automatically populates a list of same size instance types across other families and generations that could also fit your web application requirements. You can add or remove instance types as you see fit.

Selection of instance types

Figure 4. Selection of instance types

Note: There is an option to combine instance types of different sizes which is very helpful for queue worker nodes and similar workloads. We won’t cover this in this blog post as it is not relevant for web applications. You can read more about this feature here.

Once you select your instance types, your Auto Scaling group uses the configured Spot Instance allocation strategy to decide which Spot Instance types to launch on each Availability Zone. By specifying multiple instance types, EC2 Auto Scaling replenishes capacity from other Spot Instance pools applying your configured allocation strategy if some of your instances get interrupted. This maintains the size of your fleet and keeps your service available.

For the On-Demand part of your Auto Scaling group, the allocation strategy is prioritized. This means that EC2 Auto Scaling launches the primary instance type, and moves to the next types in the order you selected, in the unlikely event it cannot launch the capacity. If you have Reserved Instances to cover your compute baseline, set the instance type you reserved as the primary instance type in your Auto Scaling group, so the On-Demand Instances get the reservation discounts. You can learn more about how Reserved Instances are applied here.

After completing this step, complete the Auto Scaling group creation wizard, and you can move on to set up your Application Load Balancer.
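If you prefer to script this configuration rather than use the console wizard, here is a hedged Boto3 sketch of an equivalent Auto Scaling group with a mixed instances policy; the group name, launch template name, subnets, and instance type list are placeholders, not values from a real deployment.

import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-app-asg",
    MinSize=2,
    MaxSize=20,
    DesiredCapacity=4,
    # Placeholder subnets spanning multiple Availability Zones
    VPCZoneIdentifier="subnet-11111111,subnet-22222222,subnet-33333333",
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "web-app-launch-template",
                "Version": "$Latest"
            },
            # Same-sized instance types across families and generations
            "Overrides": [
                {"InstanceType": "m5.xlarge"},
                {"InstanceType": "m5d.xlarge"},
                {"InstanceType": "m5a.xlarge"},
                {"InstanceType": "r5.xlarge"},
                {"InstanceType": "r5d.xlarge"}
            ]
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            # 70% On-Demand / 30% Spot, matching the wizard default described above
            "OnDemandPercentageAboveBaseCapacity": 70,
            "SpotAllocationStrategy": "capacity-optimized"
        }
    }
)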

Load balancing with Application Load Balancer

To balance the load across the fleet of instances running your web application you use an Application Load Balancer (ALB). EC2 Auto Scaling groups are fully integrated with ALB and Target Groups that manage the fleet of instances fronted by ALB.

Target groups are set up by default with a deregistration delay of 300 seconds. This means that when EC2 Auto Scaling needs to remove an instance out of the fleet, it first puts the instance in draining state, informs ALB to stop sending new requests to it, and allows the configured time for in-flight requests to complete before the instance is finally deregistered. This way, removing an instance from the fleet is not perceptible to your end users.

As Spot Instances are interrupted with a 2 minute warning, you need to adjust this setting to a slightly lower time. Recommended values are 90 seconds or less. This allows time for in-flight requests to complete and gracefully close existing connections before the instance is interrupted.

At the time of this writing, EC2 Auto Scaling is not aware of Spot Instance interruptions and does not trigger draining automatically when a Spot Instance is going to be interrupted. However, you can put the instance in draining state by catching the interruption notice and calling the EC2 Auto Scaling API. I cover this in more detail in the next section.

Target Groups allow you to configure a routing algorithm, which by default is Round Robin. Because with Spot Instances you combine multiple instance types in your fleet, your web application may see some performance differences if the instances use different hardware and, in some cases, a different virtualization platform (Xen for the older generation instances and AWS Nitro for the newer ones). While the impact to response time may be negligible to the end user, you need to ensure your instances handle load fairly, accounting for the processing power differences of each instance type.

With the Least Outstanding Requests load balancing algorithm, instead of distributing the requests across your fleet in a round-robin fashion, as a new request comes in, the load balancer sends it to the target with the least number of outstanding requests. This way backend instances with lower processing capabilities or processing long-standing requests are not burdened with more requests and the load is spread evenly across targets. This also helps the new targets effectively take load from overloaded targets.

These settings can be configured by editing the attributes of your target group in the Target Groups section of the EC2 console.
application load balancer attributes

Figure 5. ALB Target group attribute configuration.
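The same attributes can also be set through the API. Here is a hedged Boto3 sketch; the target group ARN is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-app/0123456789abcdef",
    Attributes=[
        # Lower the deregistration delay below the two-minute interruption notice
        {"Key": "deregistration_delay.timeout_seconds", "Value": "90"},
        # Spread load by outstanding requests rather than round robin
        {"Key": "load_balancing.algorithm.type", "Value": "least_outstanding_requests"}
    ]
)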

Your architecture will look like the following image.

stateless web application architecture

Figure 6. Stateless web application hosted on a combination of On-Demand and Spot Instances

Leveraging Instance Termination Notices for graceful termination

When a Spot Instance is going to be interrupted, EC2 triggers a Spot Instance interruption notice that is presented via the EC2 metadata endpoint on the instance. It can also be caught by an Amazon EventBridge rule to, for example, trigger an AWS Lambda function. You can find the specific documentation here.

By catching the interruption notice, you can take actions to gracefully take the instance out of the fleet without impacting your end users. For example, you can stop receiving new requests from the load balancer, allow in-flight requests to finish, and launch replacement capacity.

Below is a sample Spot Instance Interruption Notice caught by Amazon EventBridge:

{
  "version": "0",
  "id": "72d6566c-e6bd-117d-5bbc-2779c679abd5",
  "detail-type": "EC2 Spot Instance Interruption Warning",
  "source": "aws.ec2",
  "account": "12345678910",
  "time": "2020-03-06T08:25:40Z",
  "region": "eu-west-1",
  "resources": [
    "arn:aws:ec2:eu-west-1a:instance/i-0cefe48f36d7c281a"
  ],
  "detail": {
    "instance-id": "i-0cefe48f36d7c281a",
    "instance-action": "terminate"
  }
}
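To route this event to an interruption-handling Lambda function, you can create an EventBridge rule that matches the source and detail-type shown above. The following is a minimal Boto3 sketch; the rule name, function name, and ARN are placeholders.

import json
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Match Spot interruption warnings from EC2
rule_arn = events.put_rule(
    Name="spot-interruption-warning",
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Spot Instance Interruption Warning"]
    }),
    State="ENABLED"
)["RuleArn"]

# Send matching events to the interruption-handler function (placeholder ARN)
events.put_targets(
    Rule="spot-interruption-warning",
    Targets=[{"Id": "interruption-handler",
              "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:spot-interruption-handler"}]
)

# Allow EventBridge to invoke the handler function
lambda_client.add_permission(
    FunctionName="spot-interruption-handler",
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn
)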

Upon intercepting the interruption event, you can leverage the DetachInstances API call of EC2 Auto Scaling. Invoking this API puts the instance in Draining state, which makes the configured Load Balancer stop sending new requests to the instance that is going to be interrupted. The instance remains in this state during the configured deregistration delay on the Target Group to allow time for in-flight requests to complete.

Target Group console

Figure 7. Target Group console showing instance draining.

Also, if you set the ShouldDecrementDesiredCapacity parameter of the API call to False, EC2 Auto Scaling attempts to launch replacement Spot Instance capacity according to your instance type selection and allocation strategy. Below is an example code snippet calling the Auto Scaling API using Python Boto3.

def detach_instance_from_asg(instance_id,as_group_name):
   try:
       # detach instance from ASG and launch replacement instance
       response = asgclient.detach_instances(
           InstanceIds=[instance_id],
           AutoScalingGroupName=as_group_name,
           ShouldDecrementDesiredCapacity=False)
       logger.info(response['Activities'][0]['Cause'])
   except ClientError as e:
       error_message = "Unable to detach instance {id} from AutoScaling Group {asg_name}. ".format(
           id=instance_id,asg_name=as_group_name)
       logger.error( error_message + e.response['Error']['Message'])
       raise e

In some cases, you may also want to take actions at the operating system level when a Spot Instance is going to be interrupted. For example, you may want to gracefully stop your application so it closes open connections to a database, deregister agents running on the instance or other clean-up activities. You can leverage AWS Systems Manager Run Command to invoke commands on the instance that is going to be interrupted to perform these actions.

You may also want to delay the execution of commands to capitalize on the two minute warning. As a first step, you can check the instance termination time on the instance metadata and wait, for example, until 30 seconds before the interruption. Then, gracefully stop your application and issue other termination commands. Note that there are no charges for using Run Command (limits apply) and the API call is asynchronous so it doesn’t hold your Lambda function running.
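As a rough sketch of that delay logic running on the instance itself, the following Python snippet polls the spot/instance-action metadata endpoint and waits until roughly 30 seconds before the published termination time. It uses IMDSv1 for brevity; with IMDSv2 you would first request a session token.

import json
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def wait_then_cleanup():
    while True:
        try:
            with urllib.request.urlopen(METADATA_URL, timeout=2) as response:
                notice = json.loads(response.read())
                break
        except urllib.error.HTTPError:
            time.sleep(5)  # the endpoint returns 404 until an interruption notice is issued

    # Notice time format matches the interruption event, e.g. 2020-03-06T08:25:40Z
    termination_time = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    wait_seconds = (termination_time - datetime.now(timezone.utc)).total_seconds() - 30
    if wait_seconds > 0:
        time.sleep(wait_seconds)

    # Gracefully stop the application, close connections, deregister agents, etc.

wait_then_cleanup()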

Note: To use Run Command you must have the AWS Systems Manager agent running on your instance and meet a set of pre-requisites, like configuring an IAM instance profile. You can find more information here.

We provide a sample solution here that handles all the steps outlined above, which you can easily deploy in your AWS account.

The sample solution also allows you to optionally execute commands upon Spot Instance interruption by creating an AWS Systems Manager parameter in Parameter Store specific to each Auto Scaling group with the commands you want to run, so you can easily customize termination commands to the needs of each application. You can find detailed configuration instructions in the README.md file. The high-level architecture of this solution is detailed below:

serverless architecture spot interruptions

Figure 8. Serverless architecture to handle Spot Instance interruptions

Tying it all together

Now, that I covered all the relevant features to run web applications on Spot Instances, let’s put them all together and summarize:

  • Using mixed instance types and purchase options in EC2 Auto Scaling groups you can configure a combination of On-Demand and Spot Instances. Leverage this feature to configure multiple instance types to run your web application on Spot Instances, so you can leverage unused capacity from multiple Spot capacity pools and replace interrupted instances.
  • Using the capacity-optimized Spot Instance allocation strategy you can launch Spot Instances from your instance type selection based on the most available Spot Instance capacity pools. This reduces the chances of interruptions. Auto Scaling then replenishes Spot Instance capacity from the optimal pools if some of your instances are interrupted as Spot Instance capacity fluctuates.
  • With Application Load Balancer you can adjust the deregistration delay of your Target Group to a slightly lower time than the Spot Instance termination notice time (e.g. 90 seconds), so when an instance is going to be interrupted, you stop sending new requests to it and leverage the notice time to let in-flight requests complete and avoid impact to your end users. You can also configure the Least Outstanding Requests load balancing algorithm in your ALB Target Group so the load is distributed evenly across your fleet accounting for differences in processing power for different instance families.
  • You can leverage Amazon EventBridge, AWS Lambda, and AWS Systems Manager Run Command to take interruption-handling actions, like putting the instance in a draining state, requesting EC2 Auto Scaling to launch a replacement instance, and running commands to gracefully shut down your application or trigger clean-up activities. You can also subscribe an SNS topic or other Lambda functions to the interruption event for alarming, logging, or other purposes.

I hope you found this blog post useful, and if you are not using Spot Instances to run your stateless web applications today, I encourage you to try them to get the cost savings and scale they provide.


Isaac Vallhonrat is a Sr. Specialist Solutions Architect for EC2 Spot at AWS. He helps customers build resilient, fault tolerant and cost-effective architectures with EC2 Spot Instances across multiple workload types like web applications, containers, big data, CI/CD, etc.

BBVA: Helping Global Remote Working with Amazon AppStream 2.0

Post Syndicated from Joe Luis Prieto original https://aws.amazon.com/blogs/architecture/bbva-helping-global-remote-working-with-amazon-appstream-2-0/

This post was co-written with Javier Jose Pecete, Cloud Security Architect at BBVA, and Javier Sanz Enjuto, Head of Platform Protection – Security Architecture at BBVA.

Introduction

Speed and elasticity are key when you are faced with unexpected scenarios such as a massive employee workforce working from home or running more workloads on the public cloud if data centers face staffing reductions. AWS customers can instantly benefit from implementing a fully managed turnkey solution to help cope with these scenarios.

Companies not only need to use technology as the foundation to maintain business continuity and adjust their business model for the future, but they also must work to help their employees adapt to new situations.

About BBVA

BBVA is a customer-centric, global financial services group present in more than 30 countries throughout the world. It has more than 126,000 employees and serves more than 78 million customers.

Founded in 1857, BBVA is a leader in the Spanish market as well as the largest financial institution in Mexico. It has leading franchises in South America and the Sun Belt region of the United States and is the leading shareholder in Turkey’s Garanti BBVA.

The challenge

BBVA implemented a global remote working plan that protects customers and employees alike, including a significant reduction of the number of employees working in its branch offices. It also ensured continued and uninterrupted operations for both consumer and business customers by strengthening digital access to its full suite of services.

Following the company’s policies and adhering to new rules announced by national authorities in the last weeks, more than 86,000 employees from across BBVA’s international network of offices and its central service functions now work remotely.

BBVA, subject to a set of highly regulated requirements, was looking for a global architecture to accommodate remote work. The solution needed to be fast to implement, adaptable to scale out gradually in the various countries in which it operates, and able to meet its operational, security, and regulatory requirements.

The architecture

BBVA selected Amazon AppStream 2.0 for particular use cases of applications that, due to their sensitivity, are not exposed to the internet (such as financial, employee, and security applications). Having had previous experience with the service, BBVA chose AppStream 2.0 to accommodate the remote work experience.

AppStream 2.0 is a fully managed application streaming service that provides users with instant access to their desktop applications from anywhere, regardless of what device they are using.

AppStream 2.0 works with IT environments, can be managed through the AWS SDK or console, automatically scales globally on demand, and is fully managed on AWS. This means there is no hardware or software to deploy, patch, or upgrade.

AppStream 2.0 can be managed through the AWS SDK (1)

  1. The streamed video and user inputs are sent over HTTPS and are SSL-encrypted between the AppStream 2.0 instance executing your applications, and your end users.
  2. Security groups are used to control network access to the customer VPC.
  3. AppStream 2.0 streaming instance access to the internet is through the customer VPC.

AppStream 2.0 fleets are created by use case to apply security restrictions depending on data sensitivity. Setting clipboard, file transfer, or print to local device options, the fleets control the data movement to and from employees’ AppStream 2.0 streaming sessions.

BBVA relies on a proprietary service called Heimdal to authenticate employees through the corporate identity provider. Heimdal calls the AppStream 2.0 CreateStreamingURL API operation to create a temporary URL to start a streaming session for the specified user, and abstracts the service from the user by setting the following parameters (see the sketch after this list):

  • FleetName to connect to the most appropriate fleet based on the user’s location (BBVA has fleets deployed in Europe and America to improve the user’s experience.)
  • ApplicationId to launch the proper application without having to use an intermediate portal
  • SessionContext for situations where, for instance, the authentication service generates a token that needs to be forwarded to a browser application and injected as a session cookie
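Below is a minimal sketch of what such a CreateStreamingURL call might look like with the AWS SDK for Python (Boto3). The stack, fleet, application, user, and session context values are illustrative placeholders, not BBVA’s actual configuration or Heimdal code.

import boto3

appstream = boto3.client("appstream", region_name="eu-west-1")

response = appstream.create_streaming_url(
    StackName="corporate-apps",                # illustrative stack name
    FleetName="emea-sensitive-apps-fleet",     # fleet chosen based on the user's location
    UserId="employee-12345",
    ApplicationId="internal-finance-app",      # launch the application directly, no portal needed
    SessionContext="auth-token-to-inject-as-session-cookie",
    Validity=60                                # seconds the URL remains valid
)

print(response["StreamingURL"])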

BBVA uses AWS Transit Gateway to build a hub-and-spoke network topology (2)

To simplify its overall network architecture, BBVA uses AWS Transit Gateway to build a hub-and-spoke network topology with full control over network routing and security.

There are situations where the application streamed in AppStream 2.0 needs to connect:

  1. On-premises, using AWS Direct Connect plus VPN providing an IPsec-encrypted private connection
  2. To the Internet through an outbound VPC proxy with domain whitelisting and content filtering to control the information and threats in the navigation of the employee

AppStream 2.0 activity is logged into a centralized repository supported by Amazon S3 for detecting unusual behavior patterns and for meeting regulatory requirements.

Conclusion

BBVA built a global solution reducing implementation time by 90% compared to on-premises projects, and is meeting its operational and security requirements. As well, the solution is helping with the company’s top concern: protecting the health and safety of its employees.

Using dynamic Amazon S3 event handling with Amazon EventBridge

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-dynamic-amazon-s3-event-handling-with-amazon-eventbridge/

A common pattern in serverless applications is to invoke a Lambda function in response to an event from Amazon S3. For example, you could use this pattern for automating document translation, transcribing audio files, or staging data imports. You can configure this integration in many places, including the AWS Management Console, the AWS CLI, or the AWS Serverless Application Model (SAM).

If you need to fan out notifications, or hold messages in a queue, you can also route S3 events to Amazon SNS or Amazon SQS. These standard notification mechanisms work well for most applications, and are simple to implement. However, for more complex notification patterns, you can use Amazon EventBridge to route events dynamically. This blog post explores advanced use cases and how to implement these in your serverless applications.

S3 to EventBridge, using CloudTrail.

To set up the example applications, visit the GitHub repo and follow the instructions in the README.md file. The code uses SAM templates, enabling you to deploy the applications in your own AWS account. This walkthrough creates resources covered in the AWS Free Tier but you may incur cost if you test with large amounts of data.

Integrating S3 events with Lambda via EventBridge

EventBridge consumes S3 events via AWS CloudTrail. A single trail can log events for one or more S3 buckets, and you can configure which data events are recorded. It’s best practice to store CloudTrail log files in a separate S3 bucket. Once this is configured, EventBridge can then receive any event logged in the trail.

The first example in the GitHub repo shows how this can be configured in a SAM template. The application comprises an S3 bucket, a Lambda EventConsumer function, and other required resources. First, the template defines the two buckets:

Resources: 
  SourceBucket: 
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "TheSourceBucket"

  LoggingBucket: 
    Type: AWS::S3::Bucket
    Properties:
      BucketName: "TheLoggingBucket"

Next, an S3 bucket policy grants permissions for CloudTrail to write files to the logging bucket:

  BucketPolicy: 
    Type: AWS::S3::BucketPolicy
    Properties: 
      Bucket: 
        Ref: LoggingBucket
      PolicyDocument: 
        Version: "2012-10-17"
        Statement: 
          - 
            Sid: "AWSCloudTrailAclCheck"
            Effect: "Allow"
            Principal: 
              Service: "cloudtrail.amazonaws.com"
            Action: "s3:GetBucketAcl"
            Resource: 
              !Sub |-
                arn:aws:s3:::${LoggingBucket}
          - 
            Sid: "AWSCloudTrailWrite"
            Effect: "Allow"
            Principal: 
              Service: "cloudtrail.amazonaws.com"
            Action: "s3:PutObject"
            Resource:
              !Sub |-
                arn:aws:s3:::${LoggingBucket}/AWSLogs/${AWS::AccountId}/*
            Condition: 
              StringEquals:
                s3:x-amz-acl: "bucket-owner-full-control"

The template configures the trail and sets the logging bucket. It defines event selectors, which identify the specific events for logging:

  myTrail: 
    Type: AWS::CloudTrail::Trail
    DependsOn: 
      - BucketPolicy
    Properties: 
      TrailName: "MyTrailName"
      S3BucketName: 
        Ref: LoggingBucket
      IsLogging: true
      IsMultiRegionTrail: false
      EventSelectors:
        - DataResources:
          - Type: AWS::S3::Object
            Values:
              - !Sub |-
                arn:aws:s3:::${SourceBucket}/
      IncludeGlobalServiceEvents: false

The SAM template configures a target Lambda function for receiving the events:

  EventConsumerFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: eventConsumer/
      Handler: app.handler
      Runtime: nodejs12.x

Finally, it defines a rule that sets the event pattern and targets. It also grants permission to EventBridge to invoke the Lambda function:

  EventRule: 
    Type: AWS::Events::Rule
    Properties: 
      Description: "EventRule"
      State: "ENABLED"
      EventPattern: 
        source: 
          - "aws.s3"
        detail: 
          eventName: 
            - "PutObject"
          requestParameters:
            bucketName: !Ref SourceBucketName

      Targets: 
        - 
          Arn: 
            Fn::GetAtt: 
              - "EventConsumerFunction"
              - "Arn"
          Id: "EventConsumerFunctionTarget"

  PermissionForEventsToInvokeLambda: 
    Type: AWS::Lambda::Permission
    Properties: 
      FunctionName: 
        Ref: "EventConsumerFunction"
      Action: "lambda:InvokeFunction"
      Principal: "events.amazonaws.com"
      SourceArn: 
        Fn::GetAtt: 
          - "EventRule"
          - "Arn"

To deploy this application, follow the instructions in the GitHub repo’s README.md file. To test, upload any file to the Source Bucket. This invokes the Lambda function via the EventBridge event, and logs out the event details. Open the CloudWatch Logs console for the deployed Lambda function to view the output.

The event pattern in this example matches on any PutObject event in the Source Bucket. You can also match on any attribute, or combination of attributes, in an S3 event. This makes it possible to identify events by source IP address, object size, time range, or principalId (the user causing the event). With access to the entire S3 event, this enables more granularity on matching events before invoking the target Lambda function.

Consuming events from existing S3 buckets

When deploying S3 and Lambda integrations in SAM templates, you cannot use existing buckets managed outside of the CloudFormation stack. Frequently, it’s useful to deploy serverless applications that integrate with existing S3 buckets. Using the S3-to-EventBridge integration, you can create new applications that receive events from existing buckets.

Consuming events from existing S3 buckets

The second example in the GitHub repo shows how to configure a new application for an existing bucket. This template takes the existing S3 bucket name as a parameter, and generates the CloudTrail trail, EventBridge rule, and required permissions.

Follow this example’s README.md file to deploy the application. To test, upload any file into the existing S3 bucket you selected. This invokes the eventConsumer logging function deployed in the template.

Invoking a single Lambda function from multiple S3 buckets

With EventBridge decoupling the producer and consumer of the events, this also makes it easier to introduce multiple producers. In the third example, the SAM template creates three buckets that invoke the same EventConsumer Lambda function:

Invoking Lambda from multiple S3 buckets

The MultiBucketName parameter is used to create the three buckets with a number appended to the name. First, the CloudTrail EventSelector includes the three buckets in the trail:

  # The CloudTrail trail 
  myTrail: 
    Type: AWS::CloudTrail::Trail
    DependsOn: 
      - BucketPolicy
    Properties: 
      TrailName: "myTrail"
      S3BucketName: 
        Ref: LoggingBucket
      IsLogging: true
      IsMultiRegionTrail: false
      EventSelectors:
        - DataResources:
          - Type: AWS::S3::Object
            Values:
              - !Sub 'arn:aws:s3:::${MultiBucketName}-1/'
              - !Sub 'arn:aws:s3:::${MultiBucketName}-2/'
              - !Sub 'arn:aws:s3:::${MultiBucketName}-3/'
      IncludeGlobalServiceEvents: false

Next, the EventRule includes the three bucket names in the event pattern, so events from any of these buckets can now trigger the rule:

  # EventBridge rule - invokes EventConsumerFunction 
  EventRule: 
    Type: AWS::Events::Rule
    Properties: 
      Description: "EventRule"
      State: "ENABLED"
      EventPattern: 
        source: 
          - "aws.s3"
        detail: 
          eventName: 
            - "PutObject"
          requestParameters:
            bucketName:
              - !Sub '${MultiBucketName}-1'
              - !Sub '${MultiBucketName}-2'
              - !Sub '${MultiBucketName}-3'

It’s also possible to use content-based filtering in event patterns to match dynamically on bucket names. For example, if you have multiple buckets with the prefix myCompanySales, you can create an event pattern to match all of these buckets:

      EventPattern: 
        source: 
          - "aws.s3"
        detail: 
          eventName: 
            - "PutObject"
          requestParameters:
            bucketName:
              - "prefix": "myCompanySales" 

This enables your application to consume events from new buckets created after the application is deployed. With content-based filtering, you can create search patterns that allow greater flexibility in matching events.

Multiple buckets with multiple Lambda functions

In the standard S3 and Lambda integration, a single Lambda function can only be invoked by distinct prefix and suffix patterns in the S3 trigger. This means that the same Lambda function cannot be set as the trigger for PutObject events for the same filetype or prefix. When you need to invoke multiple functions with the same or overlapping prefixes or suffixes, the EventBridge integration can handle this.

EventBridge allows up to five targets per rule, so you can specify up to five separate Lambda functions to receive the event. All five functions are invoked in parallel when the event pattern matches. To use this, add the targets in the rule – no change to the event pattern is required.

In the fourth example, the SAM template configures three buckets and three Lambda functions, all subscribing to the same event pattern.

Multiple buckets with multiple Lambda subscribers

This template creates the buckets and functions, and generates the CloudTrail trail, EventBridge rule, and required permissions. The key change to the template is in the EventRule, where more than one target is now defined:

      Targets: 
        - Arn: 
            Fn::GetAtt: 
              - "EventConsumerFunction1"
              - "Arn"
          Id: "EventConsumerFunctionTarget1"
        - Arn: 
            Fn::GetAtt: 
              - "EventConsumerFunction2"
              - "Arn"
          Id: "EventConsumerFunctionTarget2"
        - Arn: 
            Fn::GetAtt: 
              - "EventConsumerFunction3"
              - "Arn"
          Id: "EventConsumerFunctionTarget3"

This approach enables more complex routing of S3 events to Lambda targets. It allows events from multiple S3 buckets with overlapping prefixes and suffixes in object names. It also enables you to route those events to multiple Lambda functions simultaneously.

Conclusion

The standard S3 to Lambda integration enables developers to deploy code that responds to bucket- or object-based events. You can also use SNS or SQS as targets for fanning out or buffering messages from S3. Using Amazon EventBridge, you can employ even more sophisticated routing and filtering of events between S3 and Lambda.

In this blog post, I show how to deploy a basic integration using a SAM template with a single bucket and single Lambda function. I cover how to use existing S3 buckets in your new application deployments, and use EventBridge content filtering in rules to dynamically match bucket events.

Finally, in complex serverless applications, I show how EventBridge completely decouples the producers and consumers. This makes it easy to route events from multiple S3 buckets to multiple Lambda functions. When combined with attribute matching across the entire S3 event object, this allows much more granularity in identifying events before invoking Lambda functions.

To learn more about using decoupled, event-driven architectures in your serverless applications, visit the Amazon EventBridge Learning Path.

AWS Foundational Security Best Practices standard now available in Security Hub

Post Syndicated from Rima Tanash original https://aws.amazon.com/blogs/security/aws-foundational-security-best-practices-standard-now-available-security-hub/

AWS Security Hub offers a new security standard, AWS Foundational Security Best Practices

This week AWS Security Hub launched a new security standard called AWS Foundational Security Best Practices. This standard implements security controls that detect when your AWS accounts and deployed resources do not align with the security best practices defined by AWS security experts. By enabling this standard, you can monitor your own security posture to ensure that you are using AWS security best practices. These controls closely align to the Top 10 Security Best Practices outlined by AWS Chief Information Security Officer, Stephen Schmidt, at AWS re:Invent 2019.

In the initial release, this standard consists of 31 fully-automated security controls in supported AWS Regions, and 27 controls in AWS GovCloud (US-West) and AWS GovCloud (US-East).

This standard is enabled by default when you enable Security Hub in a new account, so no extra steps are necessary to enable it. If you are an existing Security Hub user, when you open the Security Hub console, you will see a pop-up message recommending that you enable this standard. For more information, see AWS Foundational Security Best Practices standard in the AWS Security Hub User Guide.

As an example, let’s look at one of the new security controls for Amazon Relational Database Service (Amazon RDS), [RDS.1] RDS snapshots should be private. This control checks the resource types AWS::RDS::DBSnapshot and AWS::RDS::DBClusterSnapshot. The relevant AWS Config rule is rds-snapshots-public-prohibited, which checks whether Amazon RDS snapshots are public. The control fails if Security Hub identifies that any existing or new Amazon RDS snapshots are configured to be publicly accessible. The severity label is CRITICAL when the security check fails. The severity indicates the potential impact of not enforcing this rule.
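
To make the check concrete, the following is a minimal sketch (not the Config rule's actual implementation) of the condition it evaluates, using the AWS SDK for Python with a hypothetical snapshot identifier:

import boto3

rds = boto3.client("rds")

# Hypothetical snapshot identifier; RDS.1 evaluates every snapshot in the account
response = rds.describe_db_snapshot_attributes(
    DBSnapshotIdentifier="my-db-snapshot"
)

for attribute in response["DBSnapshotAttributesResult"]["DBSnapshotAttributes"]:
    # A snapshot is public when its "restore" attribute includes the value "all"
    if attribute["AttributeName"] == "restore" and "all" in attribute.get("AttributeValues", []):
        print("Snapshot is publicly accessible; RDS.1 reports this as FAILED")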

You can find additional details about all the security controls, including the remediation instructions of the misconfigured resource, in the AWS Foundational Security Best Practices standard section of the Security Hub User Guide.

Getting started

In this post, we will cover:

  • How to enable the new AWS Foundational Security Best Practices standard.
  • An overview of the security controls.
  • An explanation of the security control details.
  • How to disable and enable specific security controls.
  • How to navigate to the remediation instructions for a failed security control.

Prerequisites

For the security standards to be functional in Security Hub, when you enable Security Hub in a particular account and AWS Region, you must also enable AWS Config in that account and Region. This is because the security checks rely on AWS Config rules, and Security Hub is a regional service.

Enable the new AWS Foundational Security Best Practices Security standard

After you enable AWS Config in your account and Region, you can enable the AWS Foundational Security Best Practices standard in Security Hub. We recommend that you enable Security Hub and this standard in all accounts and in all Regions where you have activity. For a script to enable AWS Security Hub across multiple accounts and Regions, see the AWS Security Hub multi-account scripts page on GitHub.

If you are a new user of Security Hub, when you open the Security Hub console, you are prompted to enable Security Hub. When you enable Security Hub, the AWS Foundational Security Best Practices standard is selected by default, as shown in the following screen shot. Leave the default selection and choose Enable Security Hub to enable the AWS Foundational Security Best Practices standard, as well as the other security standards you select, in your AWS account in your selected AWS Region.

Figure 1: Welcome to AWS Security Hub page

If you are an existing user of Security Hub, when you open the Security Hub console, you are presented with a pop-up to enable the new security standard. You will see the number of new controls that are available in your AWS Region and the number of AWS services and resources that are associated with those controls, as shown in the following screen shot. Choose Enable standard to enable the AWS Foundational Security Best Practices standard in your AWS account in your selected AWS Region.

Figure 2: AWS Foundational Security Best Practices confirmation page

You also have the option to enable the new AWS Foundational Security Best Practices Security standard by using the command line, which we will describe later in this post.

View the security controls

Now that you have successfully enabled the standard, on the Security standards page, you see the new AWS Foundational Security Best Practices v1.0.0 standard displayed alongside the other security standards, CIS AWS Foundations and PCI DSS.

Figure 3: Security standards page in AWS Security Hub

View security findings

Within two hours after you enable the standard, Security Hub begins to evaluate related resources in the current AWS account and Region against the available AWS controls within the AWS Foundational Security Best Practices standard. The scope of the assessment is the AWS account.

To view security findings, on the Security standards page, for AWS Foundational Security Best Practices standard, choose View results. The following image shows an example of the dashboard page you will see that displays all of the available controls in the standard, and the status of each control within the current AWS account and Region.

Figure 4: AWS Foundational Security Best Practices controls page

At a glance, each control card provides you with the following high-level information:

  • Title and unique identifier of the AWS control. This provides you with a synopsis of the purpose and functionality of the control.
  • The current status of the AWS control evaluation. The possible values are Passed, Failed, or Unknown (evaluation is still in progress and not finished).
  • Severity information associated with the AWS control. The possible values are CRITICAL, HIGH, MEDIUM, and LOW. For Passed findings associated with the controls, the severity shown is INFORMATIONAL. To learn more about how Security Hub determines the severity score, see Determining the severity of security standards findings.
  • A count of AWS resources that passed or failed the check for this particular AWS control.

You can use the Filter controls to search for specific AWS controls based on their evaluation status and severity. For example, you can search for all controls that have a check status of Failed and a severity of CRITICAL.
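
You can express a similar filter programmatically with the GetFindings API. This is a rough sketch using the AWS SDK for Python; adjust the filters to your needs:

import boto3

securityhub = boto3.client("securityhub")

# Findings that are currently failing with CRITICAL severity
response = securityhub.get_findings(
    Filters={
        "ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}],
        "SeverityLabel": [{"Value": "CRITICAL", "Comparison": "EQUALS"}],
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    },
    MaxResults=50,
)

for finding in response["Findings"]:
    print(finding["Title"], finding["Compliance"]["Status"])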

Inspect the security finding

To see detailed information about a specific security control and its findings, choose the security control card. Choosing the control displays a page that contains detailed information about the control, including a list of the findings for the security control. The page also indicates whether the resources for the security control are Passed, Failed, or if the compliance evaluation is still in progress (Unknown).

Figure 5: RDS.1 control findings view

For business reasons, you may sometimes need to suppress a particular finding against a particular resource using the workflow status. Setting the Workflow status to SUPPRESSED means that the finding will not be reviewed again and will not be acted upon. If you suppress a FAILED finding, it will stay suppressed as long as it remains failed. However, if the finding moves from FAILED to PASSED, a new passed finding will be generated and the workflow status will be NEW. You can’t un-suppress a finding. If you suppress all findings for a control, the control status will be Unknown until any new finding is generated.

To suppress a finding

  1. In the Findings list, select the control you want to suppress, for example [RDS.1] RDS snapshots should be private.
  2. For Change workflow status, choose Suppressed.

Figure 6: RDS.1 control showing change workflow status options

You will no longer see the finding that you suppressed.
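
You can also set the workflow status from automation with the BatchUpdateFindings API. In this sketch, the finding ID and product ARN are placeholders that you copy from the finding details:

import boto3

securityhub = boto3.client("securityhub")

securityhub.batch_update_findings(
    FindingIdentifiers=[
        {
            "Id": "<FINDING-ID>",           # placeholder: the finding's Id field
            "ProductArn": "<PRODUCT-ARN>",  # placeholder: the finding's ProductArn field
        }
    ],
    # Mark the finding as suppressed so it is no longer acted upon
    Workflow={"Status": "SUPPRESSED"},
)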

If you do not want to generate any findings for a specific control, you can instead choose to disable the control using the Disable feature, described in the next section.

Disable a security control

You can also disable the security check for a particular security control until you manually re-enable it. This disables the control check for all resources in the context of Security Hub in your AWS account and AWS Region. This may be helpful if a particular security control is not applicable for your environment. To disable a security control, on the AWS Foundational Security Best Practices standard dashboard page, on the specific control card, choose Disable. You can always re-enable the control when you need it in the future.

Figure 7: Control cards Disable option

When you disable a particular control, you are required to enter a reason in the Reason for disabling field, so that you or anyone looking into it in the future has a clear record of why the control is not being used.

Figure 8: Reason for disabling a control page – Disabling control ACM.1 example

On the AWS Foundational Security Best Practices controls page, disabled controls are marked with a Disabled badge, as shown in the following screenshot. The cards also display the date when the control was disabled, and the reason that was provided. To re-enable a disabled control, on the control card, choose Enable.

Figure 9: Disabled control example – ACM.1

You can enable the control any time without providing a reason. The evaluation for the control starts from the point in time when the control is re-enabled.
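
If you manage controls across many accounts, you can also disable and re-enable a control with the UpdateStandardsControl API. This is a sketch with a placeholder control ARN; a reason is only required when disabling:

import boto3

securityhub = boto3.client("securityhub")

# Placeholder ARN; copy the real one from the control's details
control_arn = "arn:aws:securityhub:<REGION-NAME>:<ACCOUNT_ID>:control/aws-foundational-security-best-practices/v/1.0.0/ACM.1"

# Disable the control and record why
securityhub.update_standards_control(
    StandardsControlArn=control_arn,
    ControlStatus="DISABLED",
    DisabledReason="Certificates are managed outside ACM in this account",
)

# Re-enable it later; no reason is required
securityhub.update_standards_control(
    StandardsControlArn=control_arn,
    ControlStatus="ENABLED",
)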

Remediate a failed security control

You can get the remediation instructions for a failed control from within the Security Hub console. On the AWS Foundational Security Best Practices standard dashboard page, choose the specific control card, then in the list of findings for a control, choose the finding you want to remediate. In the finding details, expand the Remediation section, and then choose the For directions on how to fix this issue link, as shown in the following screen shot.

Figure 10: Finding remediation link in the console

You can also get to these step-by-step remediation instructions directly from the user guide. Go to the AWS Foundational Security Best Practices controls page and scroll down to the name of the specific control that generated the finding.

Use the AWS CLI to enable or disable the standard

To use the AWS Command Line Interface (AWS CLI) to enable the AWS Foundational Security Best Practices standard in Security Hub programmatically without using the Security Hub console, use the following command. Be sure you are running AWS CLI version 2.0.7 or later, and replace REGION-NAME with your AWS Region:


aws securityhub batch-enable-standards --standards-subscription-requests '{"StandardsArn":"arn:aws:securityhub:<REGION-NAME>::standards/aws-foundational-security-best-practices/v/1.0.0"}' --region <REGION-NAME>

To check the status, run the get-enabled-standards command. Be sure to replace REGION-NAME with your AWS Region:


aws securityhub get-enabled-standards --region <REGION-NAME> 

You should see "StandardsStatus": "READY" in the output, indicating that the AWS Foundational Security Best Practices standard is enabled and ready:


{
    "StandardsSubscriptions": [
        {
            "StandardsSubscriptionArn": "arn:aws:securityhub:us-east-1:012345678912:subscription/aws-foundational-security-best-practices/v/1.0.0",
            "StandardsArn": "arn:aws:securityhub:us-east-1::standards/aws-foundational-security-best-practices/v/1.0.0",
            "StandardsInput": {},
            "StandardsStatus": "READY"
        }
    ]
}

To use the AWS CLI to disable the AWS Foundational Security Best Practices standard in Security Hub, use the following command. Be sure to replace ACCOUNT_ID with your account ID, and replace REGION-NAME with your AWS Region:


aws securityhub batch-disable-standards --standards-subscription-arns "arn:aws:securityhub:<REGION-NAME>:<ACCOUNT_ID>:subscription/aws-foundational-security-best-practices/v/1.0.0" --region <REGION-NAME>

Conclusion

In this post, you learned about how to implement the new AWS Foundational Security Best Practices standard in Security Hub, and how to interpret the findings. You also learned how to enable the standard by using the Security Hub console and AWS CLI, how to disable and enable specific controls within the standard, and how to follow remediation steps for failed findings. For more information, see the AWS Foundational Security Best Practices standard in the AWS Security Hub User Guide.

If you have comments about this post, submit them in the Comments section below. If you have questions, please start a new thread on the Security Hub forums.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Karthikeyan Vasuki Balasubramaniam

Karthikeyan is a Software Development Engineer on the Amazon Security Hub service team. At Amazon Web Services, he works on the infrastructure that supports the development and functioning of various security standards. He has a background in computer networking and operating systems. In his free time, you can find him learning Indian classical music and cooking.

Author

Rima Tanash

Rima Tanash is the Lead Security Engineer on the Amazon Security Hub service team. At Amazon Web Services, she applies automated technologies to audit various access and security configurations. She has a research background in data privacy using graph properties and machine learning.

Decoupling larger applications with Amazon EventBridge

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/decoupling-larger-applications-with-amazon-eventbridge/

Many applications start to grow in complexity as they mature, making it harder for developers to maintain code or add new features. This can lead to monolithic applications, where developers must know more about the entire architecture to make changes. Typically, this causes code to become more fragile, and the rate of development slows down.

This blog post shows how you can use an event-based architecture to decouple services and functional areas of applications. It uses the document repository solution as an example, comparing the architecture before and after shifting to an event-based approach. The new architecture offers both greater extensibility and simplicity for developers adding new functionality in the future. It can help alleviate the problems associated with monolithic applications.

The original version of this application uses Amazon S3 event notifications to invoke AWS Lambda functions to index content in the Amazon Elasticsearch Service:

Original document repository application architecture

There are some limitations with this design. First, there is a single source bucket for documents, which may not reflect production usage. Also, while it could be modified to allow new file types for indexing, adding new functionality such as translating documents would require refactoring. And despite having multiple Lambda functions, it’s packaged as a single application, which makes it harder to deploy changes.

The new design uses events to decouple each service used to process incoming S3 objects. It can also use one or more buckets as event sources, which you can change dynamically as needed. Most importantly, it can be easier to introduce changes and new functionality, since the application is no longer deployed as a mono-repo. The new architecture uses this design:

Decoupled architecture

  1. Setup and configuration of AWS resources.
  2. Parser function to filter and reformat S3 events for the application.
  3. Converter functions to operate on distinct file types.
  4. Analyzer functions for interpreting the content of the files.
  5. The Loader function imports the metadata into the Amazon Elasticsearch Service.

The code uses the AWS Serverless Application Model (SAM), enabling you to deploy the application easily in your own AWS account. This walkthrough creates resources covered in the AWS Free Tier but you may incur cost for significant data usage. Additionally, it requires an Amazon Elasticsearch Service domain, which may incur cost on your AWS bill.

The resulting solution is five separate applications, which you deploy in stages. To set up the application, visit the GitHub repo and follow the instructions in the README.md file.

Setup and configuration

The SAM template in the setup directory creates the S3 buckets, and configures AWS CloudTrail to capture put events in these buckets. This is required as EventBridge consumes S3 events via CloudTrail. Now, when any object is stored in any of these S3 buckets, EventBridge receives an event.

This template also creates a customer managed IAM policy that grants read-only access to the source S3 buckets:

  MyManagedPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ManagedPolicyName: docrepo-s3-read-policy
      PolicyDocument: 
        Version: 2012-10-17
        Statement: 
          - Effect: Allow
            Action:
              - s3:GetObject
              - s3:ListBucket
              - s3:GetBucketLocation
              - s3:GetObjectVersion
              - s3:GetLifecycleConfiguration
            Resource:
              - !Sub 'arn:aws:s3:::${Dept1Bucket}/*'
              - !Sub 'arn:aws:s3:::${Dept2Bucket}/*'
              - !Sub 'arn:aws:s3:::${Dept3Bucket}/*'

This policy can be attached to any Lambda function that must read the contents from one of the S3 buckets. If the pool of source buckets changes in the future, you only need to modify this policy. Any downstream Lambda functions using the policy automatically gain access to the added buckets.

In the second setup application, the Parser service receives those S3 events and reformats the event for downstream services. Specifically, it creates a new attribute for the file type of the S3 object. After you deploy these two templates, creating any objects in the source S3 buckets generates the following event in the default event bus:

Parsing events from Amazon S3
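
As a rough sketch of that reformatting step (the detail-type and output field names are assumptions, not the repo's actual values), a parser function could look like this:

import json
import os
import boto3

events = boto3.client("events")

def handler(event, context):
    # The CloudTrail-sourced S3 event carries the bucket and key in requestParameters
    request = event["detail"]["requestParameters"]
    bucket = request["bucketName"]
    key = request["key"]
    file_type = os.path.splitext(key)[1].lstrip(".").lower()

    # Publish a simplified event for downstream converters
    events.put_events(
        Entries=[
            {
                "Source": "docRepo.s3",
                "DetailType": "NewS3Object",  # assumed detail-type
                "Detail": json.dumps({"Bucket": bucket, "Key": key, "FileType": file_type}),
            }
        ]
    )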

Building the converter processes

This application uses converters to process incoming objects in the S3 buckets. One converter handles one file type. There are two converters required to replicate the original application’s functionality, for pdf and docx files. An EventBridge rule matches incoming events and triggers the appropriate Lambda function to convert the object. This diagram shows abridged input and output events for these functions:

  1. A matching EventBridge rule invokes the relevant converter function. The function converts the source file into raw text.
  2. The text is split into batches of 5,000 characters.
  3. The functions publish the text batches back to EventBridge, using new detail-type and source attributes.

The SAM template specifies the EventBridge rules, the permissions for EventBridge to invoke the Lambda functions, and the processing Lambda functions. The Lambda functions use the customer managed IAM policy created during the setup for read-only access to the originating S3 bucket. Each converter has its own logic for processing file types differently, and can produce different types of events if needed.
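
To illustrate the batching step (the conversion from pdf or docx to raw text is omitted here), a converter could republish text along these lines. The NewTextBatch detail-type and docRepo.converters source match the attributes the analyzers filter on; everything else is an assumption:

import json
import boto3

events = boto3.client("events")
BATCH_SIZE = 5000  # characters per batch

def publish_text_batches(text, bucket, key):
    batches = [text[i:i + BATCH_SIZE] for i in range(0, len(text), BATCH_SIZE)]
    entries = [
        {
            "Source": "docRepo.converters",
            "DetailType": "NewTextBatch",
            "Detail": json.dumps({"Bucket": bucket, "Key": key, "Text": batch}),
        }
        for batch in batches
    ]
    # PutEvents accepts up to 10 entries per request
    for i in range(0, len(entries), 10):
        events.put_events(Entries=entries[i:i + 10])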

The analyzer functions

In this workflow, any file type containing text is analyzed by Amazon Comprehend to detect entities. The AnalyzeText function is invoked by an EventBridge rule. The rule is filtering for the NewTextBatch attribute in an event from docRepo.converters.

Another EventBridge rule triggers the AnalyzeImage function. This is filtering for jpg and jpeg file types where the event source is docRepo.s3. This function uses Amazon Rekognition to identify labels in the images.

Both functions produce new events containing the entities and labels, using new detail-type and source attributes. These events are published back to the default bus on EventBridge:

Analyzers processing events

  1. A matching EventBridge rule invokes the relevant analyzer function. The function uses Amazon ML services to detect labels in images and entities in text.
  2. The functions publish the metadata back to EventBridge, using new detail-type and source attributes.
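
As a minimal sketch of these two analyzers, the Amazon Comprehend and Amazon Rekognition calls below are real APIs, but the event field names beyond those mentioned above are assumptions:

import json
import boto3

comprehend = boto3.client("comprehend")
rekognition = boto3.client("rekognition")
events = boto3.client("events")

def analyze_text(event, context):
    text = event["detail"]["Text"]  # assumed field produced by the converters
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"]
    events.put_events(Entries=[{
        "Source": "docRepo.analyzers",
        "DetailType": "NewEntities",
        "Detail": json.dumps({"Entities": entities}, default=str),
    }])

def analyze_image(event, context):
    request = event["detail"]["requestParameters"]
    labels = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": request["bucketName"], "Name": request["key"]}}
    )["Labels"]
    events.put_events(Entries=[{
        "Source": "docRepo.analyzers",
        "DetailType": "NewLabels",
        "Detail": json.dumps({"Labels": labels}, default=str),
    }])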

Loader function

The Loader function is invoked by an EventBridge rule that is filtering for events from the Analyzers functions. This final function receives those events and loads the labels and entities metadata into the Amazon Elasticsearch Service:

Loader function processing events

Choosing between AWS Step Functions and Amazon EventBridge

In this application, there is a sequence of steps to the workflow that could also be handled by AWS Step Functions. Both services can simplify workflows in distributed applications and make it easier to maintain and modify serverless applications. In many cases, it makes sense to use both services for larger enterprise applications with complex business logic.

However, EventBridge enables you to separate processes into independent applications. It also allows other consumers to build custom logic using your events without impacting your application design or performance. In enterprise applications, this makes it much easier to innovate and develop new application features.

Benefits for developers

With the original monolithic application divided into five separate applications, it’s now easier for different teams to work on this project. It’s also easier and safer to deploy changes to a single microservice without needing to deploy the entire application. Developers must only understand their own service rather than the complete architecture of the application.

For example, to add more S3 buckets to the source list, you only need to modify the SAM template in the setup part of the application. The Parser function consumes put events from any number of buckets, and downstream functions consume events via EventBridge. To add a new file type, you only need to add a new converter function. Or to change the indexing provider, you create a new loader function to route the metadata to another service. The services of this application are independent, decoupled by EventBridge, and you can add more producers and consumers as required.

Traditionally, one of the challenges with event-based applications is tracking the format of events. Event schemas are typically hard to manage because any service can produce an event. The schema may also change as developers release new versions of a service. To help solve these issues, EventBridge has a feature called schema discovery that can automate the tracking and management of events in your application.

All the microservices in this application publish with a source attribute of docRepo. If you enable schema discovery, EventBridge quickly identifies these custom event schemas:

Schema discovery in Amazon EventBridge
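
Schema discovery can also be turned on programmatically rather than in the console. This is a hedged sketch for the default event bus, with a placeholder ARN for your account and Region:

import boto3

schemas = boto3.client("schemas")

schemas.create_discoverer(
    # Placeholder ARN for the default event bus in your account and Region
    SourceArn="arn:aws:events:<REGION-NAME>:<ACCOUNT_ID>:event-bus/default",
    Description="Discover docRepo event schemas",
)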

The schemas are defined in JSON using the OpenAPI Specification. As you develop new features, you can download code bindings directly from these schemas. For type-safe languages, this allows you to use events as objects directly in your applications, helping to accelerate development. To learn more about how to use code bindings and schema discovery, watch this video:

Conclusion

Larger applications can quickly become monoliths. You can use event-based architectures to decouple services within applications, and maintain flexibility as your application grows. Amazon EventBridge is a serverless event bus that can help simplify your architecture, allowing each service to operate independently with no dependence on event consumers.

In this post, I show how to rearchitect the Serverless Document Repository example into five smaller applications orchestrated using events. I explore the benefits of developing applications using this approach, including the ability to make changes more easily. I also show how EventBridge schema discovery can help automate event schema management.

To learn more about how to use Amazon EventBridge to decouple large applications, visit the Amazon EventBridge learning path.

Using AWS CodeBuild to execute administrative tasks

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/devops/using-aws-codebuild-to-execute-administrative-tasks/

This article is a guest post from AWS Serverless Hero Gojko Adzic.

At MindMup, we started using AWS CodeBuild to quickly lift and shift support tasks to the cloud. MindMup is a collaborative mind-mapping tool, used by millions of teachers and students to collaborate on assignments, structure ideas, and organize and navigate complex information. Still, the team behind the product consists of just two people, and we’re both responsible for everything from sales and product management to programming and customer support. One of the key reasons why such a tiny team can support a large group of users is that we tend to automate all recurring tasks in order to free up our time for more productive work.

Administrative support tasks often start as ad-hoc command line scripts, with manual intervention to resolve exceptions. As the scripts stabilize, humans can be less involved, so teams look for ways of scheduling and automating job executions. For infrastructure deployed to AWS, this also means moving away from running scripts from on-premises developers or operations computers to running in the cloud. With utilization-based pricing and on-demand capacity, AWS Lambda and AWS Fargate are the two obvious choices for running such tasks in AWS. There is a third option, often overlooked: CodeBuild. Although CodeBuild is designed for a completely different purpose, it offers some compelling features that make it very easy to set up and run periodic support jobs, especially as a first easy step towards a more systematic solution.

Solution overview

CodeBuild is, as the name suggests, a managed service for executing typical software build jobs. In some ways, such as each job having associated IAM permissions, CodeBuild is similar to Lambda and Fargate. One of the fundamental differences between CodeBuild jobs and Lambda functions or Fargate tasks is the location of the executable definition of the job. The executable definition of a Lambda function is in a ZIP archive deployed to Lambda. For Fargate, the executable definition is in a Docker container image, deployed in a task with Amazon ECS or in a Kubernetes pod with Amazon EKS. Both services require an explicit deployment to update the executable definition of a task. For CodeBuild jobs, the executable definition is not deployed to an AWS service. Instead, it is in a source code control system that you can manage locally or using a service such as GitHub or AWS CodeCommit.

Sitting alongside the rest of the source code, each CodeBuild task has an entry-point configuration file, by convention called a buildspec.yml. The buildspec.yml file lists the programming language runtimes required by the job, and the steps to execute before, during and after the build job. For example, the following buildspec.yml sets up a build environment for JavaScript with Node.js 12, installs dependencies, runs tests, and then produces a deployment package using webpack.

version: 0.2

phases:
  install:
    runtime-versions:
      nodejs: 12
    commands:
      - npm install
  build:
    commands:
      - npm test
      - npm run webpack

Usually, the buildspec.yml file involves some variant of installing dependencies, compiling code and running tests, then packaging and versioning artifacts. But the steps of a buildspec.yml file are actually just shell commands, so CodeBuild doesn’t necessarily need to run tasks related to compiling or packaging. It can execute any sequence of Unix commands, scripts, or binaries. This makes CodeBuild a uniquely compelling choice for the transition from running shell scripts on an operations machine to running a shell script in the cloud.

Comparing CodeBuild and Lambda for administrative tasks

The major advantage of CodeBuild over Lambda functions for support jobs is that the scripts can be significantly more flexible. Moving from shell scripts to Lambda functions usually means rewriting the task in a language such as JavaScript or Python. You can execute a shell script from a Lambda function when using Amazon Linux 1 instances, or even use a Bash custom runtime, but when using CodeBuild, you can execute the same shell script without changes.

Lambda functions usually run only in a single language. Support tasks often perform a chain of actions, and different steps might require utilities written in different languages. Running such varied tasks with a Lambda function would require constructing a custom Lambda runtime, or splitting steps into multiple functions with different runtimes, and then somehow coordinating and passing data between them. AWS Step Functions can be used to coordinate the workflow, but most support tasks are a sequence of steps, to be executed in order if the previous one succeeds. With CodeBuild, you can configure the task to include all required runtimes.

Support tasks often need to transform the outputs of one tool and pass it into a different tool. For example, select rows from a database containing expired accounts, then filter out only the user emails, separate the data with commas, and send to an automated mailer with a template. Tools such as grep, awk, and sed become invaluable for such transformations. However, they aren’t available on new Lambda runtimes.

Lambda runtimes based on Amazon Linux 2 bundle only the absolutely minimal operating system packages. Even the basic command line Linux utilities, such as which, are not packaged with the recent Lambda runtimes. On the other hand, CodeBuild runs tasks in a full-blown Linux environment. Executing support tasks through CodeBuild means that you can pipe results into all the standard Unix tools, without having to use half-baked replacements written in a scripting language.

For applications running in the AWS ecosystem, support tasks often need to communicate with AWS services or resources. Standard CodeBuild environments also come with the aws command line tools, so you can use them without any additional setup. This becomes especially important for moving data from and to Amazon S3, where command line tools have operations for batch uploads or downloads or recursive directory synchronization. Those operations are not directly available through the programming language SDK libraries.

It is, of course, possible to install additional binaries to Lambda functions by building them for the right Linux environment. Because the standard shared system libraries are also not in the recent Lambda runtimes, compiling additional tools is akin to building a Linux distribution from scratch. With CodeBuild, most standard tools are included already, and you can add additional tools to the system by using an operating system package manager (apt-get or yum).

CodeBuild execution environments can also be more flexible in terms of execution time and performance constraints. Lambda tasks are currently limited to fifteen minutes. The only performance setting you can influence is the memory size, which proportionally impacts the CPU power. The highest setting is currently 3GB memory, which assigns two virtual cores. CodeBuild allows you to configure tasks which can run for up to 8 hours. You can also explicitly select a compute type, including using GPU processors and going all the way up to 255 GB memory or 72 virtual CPU cores. This makes CodeBuild an interesting choice for tasks that need to potentially run longer than fifteen minutes, that are very computationally intensive, or that need a lot of working memory.

On the other hand, compared to Lambda functions, CodeBuild jobs start significantly slower and running them in parallel is not as easy or convenient. For example, by default you can only run up to 60 CodeBuild tasks in parallel, but this is a soft limit that you can increase. However, support tasks are mostly batch jobs by nature, so saving a few seconds or being able to execute thousands of such tasks in parallel is not usually important.

Comparing CodeBuild or Amazon EC2/Fargate for administrative tasks

Most of the limitations of Lambda functions for admin jobs could be solved by running a virtual machine through Amazon EC2. In fact, running tasks on Amazon EC2 was the usual way of lifting support tasks from the operations computers and moving them into the cloud until Lambda became available. However, due to how Amazon EC2 instances are billed, teams often bundled all the operations tasks on a single Amazon EC2 instance. That instance needed a superset of all the security privileges required by the various tasks, opening potential security risks. That’s where Fargate can help. Fargate runs container-based tasks on demand, offering utilization-based billing and removing many restrictions of Lambda, such as the 15-minute runtime and reduced operating system environment, also allowing you to choose execution environments more flexibly.

This means that CodeBuild execution is more or less comparable to Fargate tasks in terms of what you can run and how much power you can assign to your tasks. Both can use a custom Docker container, and both run a full-blown operating system with all the standard binaries. They are also similar in terms of start-up time and parallelization. However, setting up a CodeBuild job and updating it later is much easier than with Fargate using tasks or pods.

With Fargate, you need to provide a custom Docker container with the right entry point. CodeBuild lets you use custom containers or choose standard images provided by AWS, including Ubuntu or AWS Linux instances. Likewise, configuring a Fargate task involves deploying in an Amazon VPC, and if the task needs to access other AWS services, setting up a NAT gateway. CodeBuild tasks have network access by default, and can be deployed in a VPC if required.

Updating support scripts can also be easier with CodeBuild than with Fargate. Deploying a new version of a task into Fargate involves building a new Docker container and uploading it to a container manager such as Amazon ECS or Amazon EKS. Deploying a new version of a CodeBuild job involves committing to the version control system, without the need to set up a CI/CD pipeline. This makes CodeBuild a compelling way of setting up support tasks, especially for larger organizations with strict access rules. Support people can update tasks definitions by having access to the source code control system, without the need to get access to production resources on AWS.

Fargate environments are transient, similar to Lambda functions. If you want to preserve some files between job runs (for example, compiled task binaries or installed dependencies), you would have to manage that manually with Fargate. CodeBuild supports artifact caching out of the box, so it’s significantly easier to preserve data files or installed dependencies between runs.

Potential downsides

Although running support tasks directly from the source code repository is one of the biggest advantages of CodeBuild over Fargate or Lambda, it can also be a major drawback. Ensuring that the scripts are always in a stable condition requires discipline regarding committing to the trunk. Without such discipline, untested or unstable code might be used for admin tasks by mistake. A potential workaround for teams without good trunk commit discipline would be to use a specific branch for CodeBuild tasks, and then merge code into that branch once it is ready to be released.

Using support scripts directly from a source code repository makes it more complicated to synchronize versions with other deployed software. If you need the support scripts to track the exact version of code that was deployed to other services, it’s probably safer and easier to use Lambda functions or Fargate containers with an explicit deployment step.

Executing support tasks through CodeBuild

CodeBuild jobs take a bit more setup than Lambda functions, but significantly less than Fargate tasks. Below is an example of a CodeBuild job set up through AWS CloudFormation.

Architecture diagram for the CodeBuild being used for administrative tasks

Here are a few things to note:

  • You can add the required IAM permissions for the task into the Policies section of the CodeBuildRole resource.
  • The Environment section of the CodeBuildProject resource is where you can define the container image, choose the virtual hardware or set up environment variables to configure the task.
  • Environment variables are directly available for the shell commands listed in the buildspec.yml file, so this trick allows you to easily parameterize jobs to use resources from the same AWS CloudFormation template.
  • The Location and BuildSpec properties in the Source section define the source code repository, and the path of the buildspec.yml file within the repository.
Resources:
  CodeBuildRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "codebuild.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: AllowLogs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'logs:*'
                Resource: '*'

  CodeBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: !Ref JobName
      ServiceRole: !GetAtt CodeBuildRole.Arn
      Artifacts:
        Type: NO_ARTIFACTS
      LogsConfig:
        CloudWatchLogs:
          Status: ENABLED
      Cache:
        Type: NO_CACHE
      Environment:
        Type: LINUX_CONTAINER
        ComputeType: BUILD_GENERAL1_SMALL
        Image: aws/codebuild/standard:3.0
        EnvironmentVariables:
          - Name: SYSTEM_BUCKET 
            Value: !Ref SystemBucketName
      Source:
        Type: GITHUB
        Location: !Ref GithubRepository 
        GitCloneDepth: 1
        BuildSpec: !Ref BuildSpecPath 
        ReportBuildStatus: False
        InsecureSsl: False
      TimeoutInMinutes: !Ref TimeoutInMinutes

CodeBuild jobs usually run after changes to source code files. Support tasks usually need to run on a periodic schedule. The previous snippet did not define the Triggers property for the CodeBuild job, so it will not track source code changes or run automatically. Instead, you can set up an Amazon CloudWatch Events rule (or optionally use Amazon EventBridge, which provides more sophisticated rules) that periodically triggers the CodeBuild job. Here is how to do that with AWS CloudFormation:

  RunCodeBuildJobRole:
    Condition: ScheduleRuns
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "events.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: StartTask 
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'codebuild:StartBuild'
                Resource:
                  - !GetAtt CodeBuildProject.Arn

  RunCodeBuildJobRoleRule:
    Condition: ScheduleRuns
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub '${JobName}-scheduler'
      Description: Periodically runs codebuild job to archive defunct accounts
      ScheduleExpression: !Ref ScheduleRate
      State: ENABLED
      Targets:
        - Arn: !GetAtt CodeBuildProject.Arn
          Id: CodeBuildProject
          RoleArn: !GetAtt RunCodeBuildJobRole.Arn

Note the ScheduleExpression property of the RunCodeBuildJobRoleRule resource. You can use any supported CloudWatch schedule expression there to set up when or how frequently your job runs.

Observability and audit logs

If a support job fails for any reason, people need to know. Luckily, CodeBuild already integrates nicely with CloudWatch to report job statuses, so you can set up another CloudWatch Event rule that tracks failures and alerts someone about it. To make notifications flexible, you can send them to an Amazon SNS topic. You can then subscribe for email notifications or forward those alerts somewhere else easily. The following wires up notifications with an AWS CloudFormation template.

  SnsPublishRole:
    Condition: CreateSNSNotifications
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "events.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: AllowLogs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'SNS:Publish'
                Resource:
                  - !Ref SnsTopicArn

  CodeBuildNotificationRule:
    Condition: CreateSNSNotifications
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub '${JobName}-fail-notification'
      Description: Notify about codebuild project failures
      RoleArn: !GetAtt SnsPublishRole.Arn
      EventPattern:
        source:
          - "aws.codebuild"
        detail-type:
          - "CodeBuild Build State Change"
        detail:
          build-status:
            - "FAILED"
            - "STOPPED"
          project-name:
            - !Ref CodeBuildProject
      State: ENABLED
      Targets:
        - Arn: !Ref SnsTopicArn
          Id: NotificationTopic

Another option to keep the execution of your tasks under control is to generate a report using the test report functionality introduced a few months ago, and to specify in the buildspec.yml file the location of the files that store the results you want to include in your report.

Testing administrative tasks

Note the build-status list inside the CodeBuildNotificationRule resource. This defines a list of statuses about which you want to publish alerts. In the previous snippet, the list does not include successful runs. That’s because it’s usually not necessary to take any action when a support job runs successfully. However, during initial testing you may want to add IN_PROGRESS (notify when a task starts) and SUCCEEDED (notify when the job ends without an error).

Finally, one of the biggest challenges when moving scripts from an operations machine to running in CodeBuild is to create the right IAM policies. Command-line users on operations machines usually have a wide set of privileges, and identifying the minimum required for a specific job usually involves starting small, then iterating over failed attempts and opening up required operations. Running that process directly through CodeBuild can be quite slow. Instead, I suggest setting up a separate IAM policy for the job, then assigning it both to the role for the CodeBuild task, and to a command-line role or a command-line user. You can then iterate quickly directly on the command line and identify all required IAM operations, then remove the additional command-line user when done.

Conclusion

The next time you need to move a support task to the cloud, and you need a rich execution environment, consider using CodeBuild, at least as the initial step towards a more systematic solution. It will allow you to quickly get a script up and running with all the benefits of IAM isolation, scheduled execution, and reliable notifications.

Gojko is author of the Running Serverless book and interactive course. He is currently working on Video Puppet, a tool for editing videos as easily as editing text. You can reach out to him on Twitter.

Building well-architected serverless applications: Understanding application health – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-understanding-application-health-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and explaining the example application.

Question OPS1: How do you evaluate your serverless application’s health?

This post continues part 1 of this Operational Excellence question. Previously, I covered using Amazon CloudWatch out-of-the-box standard metrics and alerts, and structured and centralized logging with the Embedded Metrics Format.

Best practice: Use application, business, and operations metrics

Identifying key performance indicators (KPIs), including business, customer, and operations outcomes, in addition to application metrics, helps to show a higher-level view of how the application and business are performing.

Business KPIs measure your application performance against business goals. For example, if fewer flight reservations are flowing through the system, the business would like to know.

Customer experience KPIs highlight the overall effectiveness of how customers use the application over time. Examples are perceived latency, time to find a booking, make a payment, etc.

Operational metrics help to see how operationally stable the application is over time. Examples are continuous integration/delivery/deployment feedback time, mean-time-between-failure/recovery, number of on-call pages and time to resolution, etc.

Custom Metrics

Embedded Metrics Format can also emit custom metrics to help understand your workload health’s impact on business.

The airline booking service uses AWS Step Functions, AWS Lambda, Amazon SNS, and Amazon DynamoDB.

In the confirm booking module function handler, I add a new namespace and dimension to associate this set of logs with this application and service.

metrics.set_namespace("ServerlessAirlineEMF")
metrics.put_dimensions({"service":"confirm_booking"})

Within the try block, I emit a metric for a successful booking:

metrics.put_metric("BookingSuccessful", 1, "Count")

Within the except block, I emit a metric for a failed booking:

metrics.put_metric("BookingFailed", 1, "Count")
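
Putting these snippets together, a condensed handler might look like the following sketch. It assumes the aws_embedded_metrics Python library, and confirm_booking stands in for the module's real booking logic:

from aws_embedded_metrics import metric_scope

@metric_scope
def lambda_handler(event, context, metrics):
    metrics.set_namespace("ServerlessAirlineEMF")
    metrics.put_dimensions({"service": "confirm_booking"})

    try:
        booking_reference = confirm_booking(event)  # placeholder for the real booking logic
        metrics.set_property("BookingReference", booking_reference)
        metrics.put_metric("BookingSuccessful", 1, "Count")
        return booking_reference
    except Exception:
        metrics.put_metric("BookingFailed", 1, "Count")
        raise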

Once I make a booking, within the CloudWatch console, I navigate to Logs | Log groups and select the Airline-ConfirmBooking-develop group. I select a log stream and find the custom metric as part of the structured log entry.

Custom metric structured log entry

I can also create a custom metric graph. Within the CloudWatch console, I navigate to Metrics. I see the ServerlessAirlineEMF Custom Namespace is available.

Custom metric namespace

I select the Namespace and the available metric.

Available metric

I select a Line graph, add a name, and see the successful booking now plotted on the graph.

Plotted CloudWatch metric

I can also visualize and analyze this metric using CloudWatch Insights.

Once a booking is made, within the CloudWatch console, I navigate to Logs | Insights. I select the /aws/lambda/Airline-ConfirmBooking-develop log group. I choose Run query which shows a list of discovered fields on the right of the console.

I can search for discovered booking related fields.

I then enter the following in the query pane to search the logs and plot the sum of BookingReference, and choose Run Query:

fields @timestamp, @message
| stats sum(@BookingReference)

CloudWatch Insights query

There are a number of other component metrics that are important to measure. Track interactions between upstream and downstream components such as message queue length, integration latency, and throttling.

Improvement plan summary:

  1. Identify user journeys and metrics from each customer transaction.
  2. Create custom metrics asynchronously instead of synchronously for improved performance, cost, and reliability outcomes.
  3. Emit business metrics from within your workload to measure application performance against business goals.
  4. Create and analyze component metrics to measure interactions with upstream and downstream components.
  5. Create and analyze operational metrics to assess the health of your continuous delivery pipeline and operational processes.

Good practice: Use distributed tracing and code is instrumented with additional context

Logging provides information on the individual point in time events the application generates. Tracing provides a wider continuous view of an application. Tracing helps to follow a user journey or transaction through the application.

AWS X-Ray is an example of a distributed tracing solution; there are a number of third-party options as well. X-Ray collects data about the requests that your application serves and also builds a service graph, which shows all the components of an application. This provides visibility to understand how upstream/downstream services may affect workload health.

For the most comprehensive view, enable X-Ray across as many services as possible and include X-Ray tracing instrumentation in code. This is the list of AWS Services integrated with X-Ray.

X-Ray receives data from services as segments. Segments for a common request are then grouped into traces. Segments can be further broken down into more granular subsegments. Custom data key-value pairs are added to segments and subsegments with annotations and metadata. Traces can also be grouped which helps with filter expressions.

AWS Lambda instruments incoming requests for all supported languages. Lambda application code can be further instrumented to emit information about its status, correlation identifiers, and business outcomes to determine transaction flows across your workload.

X-Ray tracing for a Lambda function is enabled in the Lambda Console. Select a Lambda function. I select the Airline-ReserveBooking-develop function. In the Configuration pane, I select to enable X-Ray Active tracing.

X-Ray tracing enabled

X-Ray can also be enabled via CloudFormation with the following code:

TracingConfig:
  Mode: Active

Lambda IAM permissions to write to X-Ray are added automatically when active tracing is enabled via the console. When using CloudFormation, allow the xray:PutTraceSegments and xray:PutTelemetryRecords actions in the function's execution role.
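
As an alternative to the console toggle, you can also switch on active tracing for an existing function with the AWS SDK. This sketch reuses the function name from the example above:

import boto3

lambda_client = boto3.client("lambda")

# Turn on X-Ray active tracing for the function
lambda_client.update_function_configuration(
    FunctionName="Airline-ReserveBooking-develop",
    TracingConfig={"Mode": "Active"},
)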

It is important to understand what invocations X-Ray does trace. X-Ray applies a sampling algorithm. If an upstream service, such as API Gateway with X-Ray tracing enabled, has already sampled a request, the Lambda function request is also sampled. Without an upstream request, X-Ray traces data for the first Lambda invocation each second, and then 5% of additional invocations.

For the airline application, X-Ray tracing is initiated within the shared library with the code:

from aws_xray_sdk.core import models, patch_all, xray_recorder

Segments, subsegments, annotations, and metadata are added to functions with the following example code:

segment = xray_recorder.begin_segment('segment_name')
# Start a subsegment
subsegment = xray_recorder.begin_subsegment('subsegment_name')
# Add metadata and annotations
segment.put_metadata('key', dict, 'namespace')
subsegment.put_annotation('key', 'value')
# Close the subsegment and segment
xray_recorder.end_subsegment()
xray_recorder.end_segment()

For example, within the collect payment module, an annotation is added for a successful payment with:

tracer.put_annotation("PaymentStatus", "SUCCESS")

CloudWatch ServiceLens

Once a booking is made and payment is successful, the tracing is available in the X-Ray console.

I explore how Amazon CloudWatch ServiceLens connects metrics, logs and the X-Ray tracing. Within the CloudWatch console, I navigate to ServiceLens | Service Map.

I can visualize all application resources and dependencies where X-Ray is enabled. I can trace performance or availability issues. If there was an issue connecting to SNS for example, this would be shown.

I select the Airline-CollectPayment-develop node and can view the out-of-the-box standard Lambda metrics.

I can select View Logs to jump to the CloudWatch Logs Insights console.

CloudWatch Insights Service map

I select View dashboard to see the function metrics, node map, and function details.

CloudWatch Insights Service Map dashboard

I select View traces and can filter by the custom annotation PaymentStatus. I select SUCCESS and choose Add to filter. I then select a trace.

CloudWatch Insights Filtered traces

I see the full trace details, which show the full application transaction of a payment collection.

Segments timeline

Selecting the Lambda handler subsegment – ## lambda_handler, I can view the trace Annotations and Metadata, which include the business transaction details such as Customer and PaymentStatus.

X-Ray annotations

Trace groups are another feature of X-Ray and ServiceLens. Trace groups use filter expressions, such as annotation.PaymentStatus = "FAILED", to view the traces that match a particular group. Service graphs can also be viewed, and CloudWatch alarms created, based on the group.

CloudWatch ServiceLens provides powerful capabilities to understand application performance bottlenecks and issues, helping determine how users are impacted.

Improvement plan summary:

  1. Identify business context and system data that are commonly present across multiple transactions.
  2. Instrument SDKs and requests to upstream/downstream services to understand the flow of a transaction across system components.

Recent announcements

There have been a number of recent announcements for X-Ray and CloudWatch to improve how to evaluate serverless application health.

Conclusion

Evaluating application health helps you identify which services should be optimized to improve your customer’s experience. In part 1, I cover out-of-the-box standard metrics and alerts, as well as structured and centralized logging. In this post, I explore custom metrics and distributed tracing and show how to use ServiceLens to view logs, metrics, and traces together.

In an upcoming post, I will cover the next Operational Excellence question from the Well-Architected Serverless Lens – Approaching application lifecycle management.

Serverless Stream-Based Processing for Real-Time Insights

Post Syndicated from Justin Pirtle original https://aws.amazon.com/blogs/architecture/serverless-stream-based-processing-for-real-time-insights/

Building on our previous posts regarding messaging patterns and queue-based processing, we now explore stream-based processing and how it helps you achieve low-latency, near real-time data processing in your applications. AWS offers two managed services for streaming, Amazon Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

What is streaming data?

At AWS, we define streaming data as data that is emitted at high volume in a continuous, incremental manner with the goal of low-latency processing. Whereas traditional batch-oriented business intelligence would offer insights in retrospect after months, days, or hours have passed, stream-based processing can offer actionable insights in real time. Stream-based processing is commonly used to respond to clickstream events, rapidly ingest various types of logs, and extract, transform, and load (ETL) data in real-time into data lakes and data warehouses.

Amazon Kinesis is the AWS service that makes it easy to collect, process, and analyze such real-time, streaming data with four different capabilities:

  • Kinesis Data Streams
  • Kinesis Data Firehose
  • Kinesis Data Analytics
  • Kinesis Video Streams

For this blog post, we focus on Kinesis Data Streams and Kinesis Data Firehose, since both of these services are foundational for streaming, ingestion, buffering, and processing in your streaming data pipeline.

Kinesis Data Streams

Amazon Kinesis Data Streams is a massively scalable service that can continuously capture gigabytes of data per second from hundreds of thousands of sources. Like many distributed systems, Kinesis Data Streams achieves this level of scalability by partitioning or sharding your data where records are simultaneously written to and read from different shards in parallel. All Kinesis Data Streams require allocation of at least one shard and you choose how many shards you want to allocate to a given stream.

When writing to a shard in a Kinesis Data Stream, each shard supports ingestion of up to 1 MB of data per second or 1,000 records written per second. When reading from a shard, each shard supports output of 2 MB of data per second. You choose an initial number of shards to allocate for your Kinesis Data Stream, then can update your shard allocation over time. Increasing your shard allocation enables your application to easily scale from thousands of records to millions of records written per second.
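As a rough illustration of these levers, a stream can be created and later re-sharded with boto3 (the stream name and shard counts are placeholders):

import boto3

kinesis = boto3.client('kinesis')

# Create a stream with 2 shards: up to 2 MB/s or 2,000 records/s ingest in aggregate
kinesis.create_stream(StreamName='clickstream-events', ShardCount=2)

# Later, scale to 4 shards to double the write throughput
kinesis.update_shard_count(
    StreamName='clickstream-events',
    TargetShardCount=4,
    ScalingType='UNIFORM_SCALING'
)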

Producing streaming data

Streaming data producers are processes that put records onto a Kinesis stream by calling the putRecord API to write a single record, or the putRecords API to write multiple records in a single invocation. Common approaches for producing messages include direct use of AWS tools such as the following (see the boto3 sketch after this list):

  • AWS SDK, which simplifies authentication and other semantics of invoking AWS service APIs
  • Amazon Kinesis Agent, which enables local file/log monitoring and rotation sending in real time
  • Amazon Kinesis Producer Library, which simplifies aggregating records into larger payloads to improve throughput.
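For example, a producer using the AWS SDK for Python might look like this (the stream name, partition key, and payload are illustrative):

import json
import boto3

kinesis = boto3.client('kinesis')

# Write a single record; the partition key determines the target shard
kinesis.put_record(
    StreamName='clickstream-events',
    PartitionKey='user-1234',
    Data=json.dumps({'event': 'page_view', 'page': '/home'}).encode('utf-8')
)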

Additionally, several AWS services natively integrate with Amazon Kinesis as a data producer:

There are also several third-party services that offer native integration as data producers, including:

Regardless of the producer service/tool of choice, all data producers put records onto a stream by providing a partition key, stream name, and the data itself, which altogether must not exceed 1 MB in size. The partition key provided is used to determine which shard the data should be written to on the stream. Amazon Kinesis Data Streams offers ordering guarantees and maintains message ordering within a given shard in a stream using sequence numbers to track the unique position of each message sent.

Consuming streaming data

Once records are written to a Kinesis Data Stream, they are buffered in their respective shards for consumption. Unlike queue-based processing, the records are buffered until the data retention period set on the stream elapses, enabling one or more consumers to replay all the messages in the shards of the stream. If your application must deliver your records to a data lake, data warehouse, Elasticsearch Service cluster, or Splunk, Kinesis Data Firehose can natively deliver your records to the following without needing to write any custom code:

  • Amazon S3
  • Amazon Redshift
  • Amazon Elasticsearch Service
  • Splunk

You simply indicate the desired delivery destination and configuration for how to batch and deliver the messages. Kinesis Data Firehose can also use your desired S3 object naming, Amazon Redshift table name, Amazon Elasticsearch index name, and more.

For custom processing or destinations outside of the Amazon Kinesis Data Firehose supported services above, you need to write and run custom code to consume data from the stream. Though you can use the Kinesis Client Library (KCL) to run your own custom processing application on persistent virtual machines or container instances, AWS Lambda offers serverless computing with native event source integration with Amazon Kinesis Data Streams. AWS Lambda as a stream consumer takes care of the operational overhead of reading shards, maintaining record order, checkpointing as records are processed, and parallelizing processing.

Serverless stream processing with AWS Lambda

When configured with a Kinesis Stream as its event source, AWS Lambda continuously polls every shard in your stream at no extra charge and only invokes your Lambda code if and when there are messages in the stream. It additionally scales up the number of concurrent executions to parallelize reading all shards of a stream at the same time (and can have multiple executions reading the same shard simultaneously for a higher parallelization factor, if desired). AWS Lambda automatically checkpoints which records were successfully processed and handles retries and any failures automatically according to your desired configuration.

Best of all, there is no additional cost for the Lambda service to handle all of these operational needs for you. You only pay for compute time when your function is invoked and messages are available on the stream for processing. You’re able to focus on processing your data with your business logic directly in your code, since your records are sent as an array to your Lambda code. There is no additional code to author or manage for checkpointing, shard splits/merges, or other complexities.
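A minimal Python handler for a Kinesis event source might look like the following sketch (the processing logic is a placeholder); records arrive base64-encoded in the event payload:

import base64
import json

def lambda_handler(event, context):
    # Each invocation receives a batch of records from one shard, in order
    for record in event['Records']:
        payload = base64.b64decode(record['kinesis']['data'])
        data = json.loads(payload)
        # Placeholder business logic: replace with your own processing
        print(f"Sequence {record['kinesis']['sequenceNumber']}: {data}")
    return {'batchSize': len(event['Records'])}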

Conclusion

In this blog, we defined streaming data and explored the Amazon Kinesis service and its various capabilities. We then reviewed the various options available for producing and consuming real-time streaming data with Amazon Kinesis, including using AWS Lambda for serverless streaming data processing. Please refer to the following resources for further learning on AWS streaming data processing:

Using VPC Sharing for a Cost-Effective Multi-Account Microservice Architecture

Post Syndicated from Anandprasanna Gaitonde original https://aws.amazon.com/blogs/architecture/using-vpc-sharing-for-a-cost-effective-multi-account-microservice-architecture/

Introduction

Many cloud-native organizations building modern applications have adopted a microservice architecture because of its flexibility, performance, and scalability. Even customers with legacy and monolithic application stacks are embarking on an application modernization journey and opting for this type of architecture. A microservice architecture allows applications to be composed of several loosely coupled discrete services that are independently deployable, scalable, and maintainable. These applications can comprise a large number of microservices, which often span multiple business units within an organization. These customers typically have a multi-account AWS environment with each AWS account belonging to an individual business unit. Their microservice implementations reside in the Virtual Private Clouds (VPCs) of their respective AWS accounts. You can set up a multi-account AWS environment incorporating best practices using AWS Landing Zone or AWS Control Tower.

This type of multi-account, multi-VPC architecture provides a good boundary and isolation for individual microservices and achieves a highly available, scalable, and secure architecture. However, for microservices that require a high degree of interconnectivity and are within the same trust boundaries, you can use other AWS capabilities to optimize cost and network management complexity.

This blog presents a cost-effective approach that requires less VPC management while still using separate accounts for billing and access control. This approach does not sacrifice scalability, high availability, fault tolerance, and security. To achieve a similar microservice architecture, you can share a VPC across AWS accounts using AWS Resource Access Manager (AWS RAM) and Network Load Balancer (NLB) support in a shared Amazon Virtual Private Cloud (VPC). This allows multiple microservices to coexist in the same VPC, even though they are developed by different business units.

Microservices architecture in a multi-VPC approach

In this architecture, microservices deployed across multiple VPCs use privately exposed endpoints for better security posture instead of going over the internet. This requires the customers to enable inter-VPC communication using the various networking capabilities of AWS as shown below:

microservices deployed across multiple VPCs use privately exposed endpoints

In the above reference architecture, we created a VPC in Account A, which is hosting the front end of the application across a fleet of Amazon Elastic Compute Cloud (Amazon EC2) instances using an AWS Auto Scaling group. For simplicity, we’ve illustrated a single public and private subnet for the application front end. In reality, this spans across multiple subnets across multiple Availability Zones (AZ) to support a highly available and fault-tolerant configuration.

To ensure security, the application must communicate privately to microservices mS1 and mS2 deployed in VPC of Account B and Account C respectively. For high availability, these microservices are also implemented using a fleet of Amazon EC2 instances with the Auto Scaling group spanning across multiple subnets/availability zones. For high-performance load balancing, they are fronted by a Network Load Balancer.

While this architecture shows an implementation using Amazon EC2, it can also use containerized services deployed using Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS). These microservices may have interdependencies and invoke each other's APIs to service requests from the application layer. This application-to-mS and mS-to-mS communication can be achieved using the following connectivity options:

When only a few VPC interconnections are required, Amazon VPC peering and AWS PrivateLink may be viable options. For a higher number of VPC interconnections, we recommend AWS Transit Gateway for better manageability of connections and routing through a centralized resource. However, depending on the amount of traffic, this can introduce significant costs to your architecture.

Alternative approach to microservice architecture using Network Load Balancers in a shared VPC

The above architecture pattern allows your individual microservice teams to continue to own their AWS resources that host their microservice implementation. But they can deploy them in a shared VPC owned by the central account, eliminating the need for inter-VPC network connections. You can share Amazon VPCs to use the implicit routing within a VPC for applications that require a high degree of interconnectivity and are within the same trust boundaries.

This architecture uses AWS RAM, which allows you to share the VPC Subnets from AWS Account A to participating AWS accounts within your AWS organization. When the subnets are shared, participant AWS accounts (Account B and Account C) can see the shared subnets in their own environment. They can then deploy their Amazon EC2 instances in those subnets. This is depicted in the diagram where the visibility of the shared subnets (SS1 and SS2) is extended to the participating accounts (Account B and Account C).

You can also deploy the NLB in these shared subnets. Then, each participant account owns all the AWS resources for their microservice stack, but it’s deployed in the VPC of Account A.

This allows your individual microservice teams to maintain control over load balancer configurations and Auto Scaling policies based on their specific microservices' needs. At the same time, by using AWS RAM, they are able to effectively use the existing VPC environment of Account A.

This architecture presents several benefits over the multi-VPC architecture discussed earlier:

  • You can deploy the entire application, including the individual microservices, into a single shared VPC, while still allowing individual microservice teams control over the AWS resources they deploy in that VPC.
  • Since the entire architecture now resides in a single VPC, it doesn’t require other networking connectivity features. It can rely on intra-VPC traffic for communication between the application (API) layer and microservices.
  • This reduces the cost of the architecture. The AWS RAM functionality is free of charge, and this approach also avoids the data transfer and per-connection costs incurred by other options such as VPC peering, AWS PrivateLink, and AWS Transit Gateway.
  • This maintains the isolation across the individual microservices and the application layer.  Participants can’t view, modify, or delete resources that belong to others or the VPC owner.
  • This also leads to effective utilization of your VPC CIDR block resources.
  • Since multiple subnets belonging to different Availability Zones are shared, the application and individual microservices continue to take advantage of scalability, availability, and fault tolerance.

The following illustration shows how you can configure AWS RAM to set up the VPC subnet resource shares between owner Account A and participating Account B. The example below shows the sharing of private subnet SS1 using this method:


Accounts A and B Resource Share
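The same share can also be created programmatically from the owner account; a minimal boto3 sketch (the account IDs, Region, and subnet ID are placeholders):

import boto3

ram = boto3.client('ram')

# Share subnet SS1 from owner Account A with participant Account B
ram.create_resource_share(
    name='shared-microservices-subnets',
    resourceArns=[
        'arn:aws:ec2:us-east-1:111111111111:subnet/subnet-0123456789abcdef0'
    ],
    principals=['222222222222'],      # participant AWS account ID
    allowExternalPrincipals=False     # restrict sharing to your AWS Organization
)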

Once this subnet is shared, the participating Account B can launch the Network Load Balancer for its microservice mS1 in the shared VPC subnet as shown below:

Account B can launch its Network Load Balancer of its microservice ms1 in the shared VPC subnet

While this architecture has many advantages, there are important considerations:

  • This style of architecture is suitable when you are certain that the number of microservices is small enough to coexist in a single VPC without depleting the CIDR block of the shared subnets of the VPC.
  • If the traffic between these microservices is insignificant, then the cost benefit of this architecture over other options may not be substantial. This is due to the effect of traffic flow on data transfer cost.

Conclusion

AWS Cloud provides several options to build a microservices architecture. It is important to look at the characteristics of your application to determine which architectural choices to opt for. AWS RAM and the ability to deploy AWS resources, including Network Load Balancers, in a shared VPC help you eliminate inter-VPC traffic and associated networking costs, without sacrificing high availability, scalability, fault tolerance, and security for your application.

10 things you can do today to reduce AWS costs

Post Syndicated from Roshni Pary original https://aws.amazon.com/blogs/compute/10-things-you-can-do-today-to-reduce-aws-costs/

This post is contributed by Shankar Ramachandran, SA Specialist, Cost Optimization

Introduction

AWS’s breadth of services and pricing options offer the flexibility to effectively manage your costs while keeping the performance and capacity your business requires. The fundamental process of cost optimization on AWS remains the same: monitor your AWS costs and usage, analyze the data to find savings, and take action to realize the savings. In this blog, I take a more tactical approach to reducing cost in response to changes in user demand.

Before you start

Before taking any actions to reduce cost, find out the costs of AWS services you are consuming. The AWS Free Tier provides customers the ability to explore and try out AWS services free of charge up to specified limits for each service. Use the steps in this video to check if you are exceeding the free tier limit.

Next, use AWS Cost Explorer to view and analyze your AWS costs and usage. This tool provides default reports that help you visualize cost and usage at a high level (e.g. AWS accounts, AWS service), or at the resource level (e.g. EC2 instance ID). Start by identifying the top accounts where your costs are incurred using the “Monthly costs by linked account” report. Next, identify the top services that contribute to the costs within those accounts using the “Monthly costs by service” report. Use hourly and resource-level granularity and tags to filter and identify the top resources incurring costs.

Cost Explorer resource level granularity
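The same breakdown can be pulled programmatically with the Cost Explorer API; a minimal boto3 sketch that groups a month's unblended cost by service (the dates are placeholders):

import boto3

ce = boto3.client('ce')

# Monthly unblended cost, grouped by service
response = ce.get_cost_and_usage(
    TimePeriod={'Start': '2020-04-01', 'End': '2020-05-01'},
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)

for group in response['ResultsByTime'][0]['Groups']:
    print(group['Keys'][0], group['Metrics']['UnblendedCost']['Amount'])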

Now, you should have an understanding of your AWS costs and usage. Next, let’s walk through 10 tactical things you can do today using existing AWS tools and services to reduce your AWS costs.

#1 Identify Amazon EC2 instances with low-utilization and reduce cost by stopping or rightsizing

Use AWS Cost Explorer Resource Optimization to get a report of EC2 instances that are either idle or have low utilization. You can reduce costs by either stopping or downsizing these instances. Use AWS Instance Scheduler to automatically stop instances. Use AWS Operations Conductor to automatically resize the EC2 instances (based on the recommendations report from Cost Explorer).

Cost Explorer rightsizing recommendations

Use AWS Compute Optimizer to look at instance type recommendations beyond downsizing within an instance family. It gives downsizing recommendations within or across instance families, upsizing recommendations to remove performance bottlenecks, and recommendations for EC2 instances that are part of an Auto Scaling group.

#2 Identify Amazon EBS volumes with low-utilization and reduce cost by snapshotting then deleting them

EBS volumes that have very low activity (less than 1 IOPS per day) over a period of 7 days indicate that they are probably not in use. Identify these volumes using the Trusted Advisor Underutilized Amazon EBS Volumes Check. To reduce costs, first snapshot the volume (in case you need it later), then delete these volumes. You can automate the creation of snapshots using the Amazon Data Lifecycle Manager. Follow the steps here to delete EBS volumes.
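A rough sketch of the snapshot-then-delete pattern with boto3 (the volume ID is a placeholder):

import boto3

ec2 = boto3.client('ec2')
volume_id = 'vol-0123456789abcdef0'  # placeholder: an underutilized volume

# Snapshot the volume first in case the data is needed later
snapshot = ec2.create_snapshot(
    VolumeId=volume_id,
    Description='Backup before deleting underutilized volume'
)
ec2.get_waiter('snapshot_completed').wait(SnapshotIds=[snapshot['SnapshotId']])

# Delete the volume once the snapshot is complete
ec2.delete_volume(VolumeId=volume_id)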

#3 Analyze Amazon S3 usage and reduce cost by leveraging lower cost storage tiers

Use S3 Analytics to analyze storage access patterns on an object data set for 30 days or longer. It makes recommendations on where you can use S3 Standard-Infrequent Access (S3 Standard-IA) to reduce costs. You can automate moving these objects into a lower-cost storage tier using lifecycle policies. Alternately, you can also use S3 Intelligent-Tiering, which automatically analyzes and moves your objects to the appropriate storage tier.
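A lifecycle rule that transitions objects to S3 Standard-IA after 30 days can be applied with boto3, roughly as follows (the bucket name is a placeholder):

import boto3

s3 = boto3.client('s3')

# Transition all objects to S3 Standard-IA 30 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket='my-example-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'transition-to-ia',
            'Status': 'Enabled',
            'Filter': {'Prefix': ''},
            'Transitions': [{'Days': 30, 'StorageClass': 'STANDARD_IA'}]
        }]
    }
)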

#4 Identify Amazon RDS, Amazon Redshift instances with low utilization and reduce cost by stopping (RDS) and pausing (Redshift)

Use the Trusted Advisor Amazon RDS Idle DB instances check to identify DB instances which have not had any connection over the last 7 days. To reduce costs, stop these DB instances using the automation steps described in this blog post. For Redshift, use the Trusted Advisor Underutilized Redshift clusters check to identify clusters which have had no connections for the last 7 days, and less than 5% cluster-wide average CPU utilization for 99% of the last 7 days. To reduce costs, pause these clusters using the steps in this blog.

Pause underutilized Redshift clusters
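Both actions can also be scripted; a minimal boto3 sketch (the identifiers are placeholders):

import boto3

# Stop an idle RDS DB instance
rds = boto3.client('rds')
rds.stop_db_instance(DBInstanceIdentifier='my-idle-db-instance')

# Pause an underutilized Redshift cluster
redshift = boto3.client('redshift')
redshift.pause_cluster(ClusterIdentifier='my-underutilized-cluster')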

#5 Analyze Amazon DynamoDB usage and reduce cost by leveraging Autoscaling or On-demand

Analyze your DynamoDB usage by monitoring two metrics, ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits, in CloudWatch. To automatically scale (in and out) your DynamoDB table, use the Auto Scaling feature. Using the steps here, you can enable Auto Scaling on your existing tables. Alternately, you can also use the on-demand option. This option allows you to pay per request for read and write requests so that you only pay for what you use, making it easy to balance costs and performance.

DynamoDB read/write capacity mode
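Switching an existing table to on-demand capacity is a single API call; a minimal boto3 sketch (the table name is a placeholder):

import boto3

dynamodb = boto3.client('dynamodb')

# Switch the table from provisioned capacity to on-demand (pay-per-request)
dynamodb.update_table(
    TableName='my-example-table',
    BillingMode='PAY_PER_REQUEST'
)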

#6 Review networking and reduce costs by deleting idle load balancers

Use the Trusted Advisor Idle Load Balancers check to get a report of load balancers that have a RequestCount of less than 100 over the past 7 days. Then, use the steps here to delete these load balancers and reduce costs. Additionally, using the steps provided in this blog, review your data transfer costs with Cost Explorer.

Cost Explorer filters for EC2 Data Transfer

If data transfer from EC2 to the public internet shows up as a significant cost, consider using Amazon CloudFront. Any image, video, or static web content can be cached at AWS edge locations worldwide, using the Amazon CloudFront Content Delivery Network (CDN). CloudFront eliminates the need to over-provision capacity in order to serve potential spikes in traffic.

#7 Use Amazon EC2 Spot Instances to reduce EC2 costs

If your workload is fault-tolerant, use Spot Instances to reduce costs by up to 90%. Typical workload examples include big data, containerized workloads, CI/CD, web servers, high-performance computing (HPC), and other test and development workloads. Using EC2 Auto Scaling, you can launch both On-Demand and Spot Instances to meet a target capacity. Auto Scaling automatically takes care of requesting Spot Instances and attempts to maintain the target capacity even if your Spot Instances are interrupted. You can learn more about Spot by watching this 2019 re:Invent session:

#8 Review and modify EC2 AutoScaling Groups configuration

An EC2 Auto Scaling group allows your EC2 fleet to expand or shrink based on demand. Review your scaling activity using either the describe-scaling-activities CLI command or the console, using the steps described here. Analyze the results to see if the scaling policy can be tuned to add instances less aggressively. Also review your settings to see if the minimum capacity can be reduced so as to serve end-user requests with a smaller fleet size.
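The same review can be done with the SDK; a minimal boto3 sketch (the group name is a placeholder):

import boto3

autoscaling = boto3.client('autoscaling')

# List recent scaling activities to see how often and why the group scaled
activities = autoscaling.describe_scaling_activities(
    AutoScalingGroupName='my-web-asg',
    MaxRecords=20
)

for activity in activities['Activities']:
    print(activity['StartTime'], activity['Description'], activity['Cause'])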

#9 Use Reserved Instances (RI) to reduce RDS, Redshift, ElastiCache and Elasticsearch costs

Use one-year, no-upfront RIs to get a discount of up to 42% compared to On-Demand pricing. Use the recommendations provided in AWS Cost Explorer RI purchase recommendations, which are based on your RDS, Redshift, ElastiCache, and Elasticsearch usage. Make sure you adjust the parameters to one year, no upfront. This requires a one-year commitment, but the break-even point is typically seven to nine months. I recommend doing #4 before #9.

#10 Use Compute Savings Plans to reduce EC2, Fargate and Lambda costs

Compute Savings Plans automatically apply to EC2 instance usage regardless of instance family, size, AZ, Region, OS, or tenancy, and also apply to Fargate and Lambda usage. Use one-year, no-upfront Compute Savings Plans to get a discount of up to 54% compared to On-Demand pricing. Use the recommendations provided in AWS Cost Explorer, and ensure that you have chosen the compute, one-year, no-upfront options. Once you sign up for Savings Plans, your compute usage is automatically charged at the discounted Savings Plans prices. Any usage beyond your commitment is charged at regular On-Demand rates. I recommend doing #1 before #10.

Savings Plans Recommendations

What's next

With these 10 steps, you can save costs on EC2, Fargate, Lambda, EBS, S3, ELB, RDS, Redshift, DynamoDB, ElastiCache and Elasticsearch. I recommend setting up a budget using AWS Budgets, so that you get alerted when your cost and usage changes.

Set a budget to track your AWS cost and usage

 

Setup an alert on forecasted costs

Using Budgets, you can set up an alert on forecasted costs too (apart from actual). This gives you the ability to get ahead of the problem and reduce costs proactively.
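A forecast-based alert can also be created programmatically; a rough boto3 sketch (the account ID, budget amount, threshold, and email address are placeholders):

import boto3

budgets = boto3.client('budgets')

# Alert by email when forecasted monthly cost is expected to exceed the budget
budgets.create_budget(
    AccountId='111111111111',
    Budget={
        'BudgetName': 'monthly-cost-budget',
        'BudgetLimit': {'Amount': '1000', 'Unit': 'USD'},
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST'
    },
    NotificationsWithSubscribers=[{
        'Notification': {
            'NotificationType': 'FORECASTED',
            'ComparisonOperator': 'GREATER_THAN',
            'Threshold': 100,
            'ThresholdType': 'PERCENTAGE'
        },
        'Subscribers': [{'SubscriptionType': 'EMAIL', 'Address': 'me@example.com'}]
    }]
)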

Conclusion

To learn more quick cost optimization techniques, watch the on-demand webinar – Nine Ways to Reduce Your AWS Bill. We are here to help you, please reach out to AWS Support and your AWS account team if you need further assistance in optimizing your AWS environment.

Building well-architected serverless applications: Understanding application health – part 1

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-understanding-application-health-part-1/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the nine serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the Introduction post for a table of contents and an explanation of the example application.

Question OPS1: How do you evaluate your serverless application’s health?

Evaluating your metrics, distributed tracing, and logging gives you insight into business and operational events, and helps you understand which services should be optimized to improve your customer’s experience. By understanding the health of your Serverless Application, you will know whether it is functioning as expected, and can proactively react to any signals that indicate it is becoming unhealthy.

Required practice: Understand, analyze, and alert on metrics provided out of the box

It is important to understand metrics for every AWS service used in your application so you can decide how to measure its behavior. AWS services provide a number of out-of-the-box standard metrics to help monitor the operational health of your application.

Because these metrics are generated automatically, they are a simple way to start monitoring your application, and they can also be augmented with custom metrics.

The first stage is to identify which services the application uses. The airline booking component uses AWS Step Functions, AWS Lambda, Amazon SNS, and Amazon DynamoDB.

When I make a booking, as shown in the Introduction post, AWS services emit metrics to Amazon CloudWatch. These are processed asynchronously without impacting the application’s performance.

There are two default CloudWatch dashboards to visualize key metrics quickly: per service and cross service.

Per service

To view the per service metrics dashboard, I open the CloudWatch console.

I select a service where Overview is shown, such as Lambda. Now I can view the metrics for all Lambda functions in the account.

Cross service

To see an overview of key metrics across all AWS services, open the CloudWatch console and choose View cross service dashboard.

I see a list of all services with one or two key metrics displayed. This provides a good overview of all services your application uses.

Alerting

The next stage is to identify the key metrics for comparison and set up alerts for under- and over-performing services. Here are some recommended metrics to alarm on for a number of AWS services.

Alerts can be configured manually or via infrastructure as code tools such as the AWS Serverless Application Model, AWS CloudFormation, or third-party tools.

To configure a manual alert for Lambda function errors using CloudWatch Alarms:

  1. I open the CloudWatch console, select Alarms, and then select Create alarm.
  2. I choose Select metric and, from AWS namespaces, select Lambda, then Across All Functions. I select Errors and choose Select metric.
  3. I change the Statistic to Sum and the Period to 1 minute.
  4. Under Conditions, I select a Static threshold Greater than 1 and select Next.

Alarms can also be created using anomaly detection rather than static values if there is a discernible pattern or trend. Anomaly detection looks at past metric data and uses machine learning to create a model of expected values. Alerts can then be configured if they fall outside this band of “normal” values. I use a Static threshold for this alarm.

  5. For the notification, I set the alarm to notify an existing SNS topic subscribed with my email address, then choose Next.
  6. I enter a descriptive alarm name such as serverlessairline-lambda-prod-errors > 1, select Next, and choose Create alarm.

I have now manually set up an alarm.
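The same alarm can be created with the SDK instead of the console; a minimal boto3 sketch that mirrors the steps above (the SNS topic ARN is a placeholder):

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when Lambda errors across all functions exceed 1 in a 1-minute period
cloudwatch.put_metric_alarm(
    AlarmName='serverlessairline-lambda-prod-errors > 1',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Statistic='Sum',
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:111111111111:my-alert-topic']
)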

Use CloudWatch composite alarms to combine multiple alarms to reduce noise and focus on critical issues. For example, a single alarm could trigger if there are both Lambda function errors as well as high Lambda concurrent executions.

It is simpler and more scalable to include alerting within infrastructure as code. Here is an example of alerting programmatically using CloudFormation.

I view the out-of-the-box standard metrics and, in this example, manually create an alarm for Lambda function errors.

Improvement plan summary:

  1. Understand what metrics and dimensions each managed service you use provides.
  2. Configure alerts on relevant metrics for when services are unhealthy.

Good practice: Use structured and centralized logging

Central logging provides a single place to search and analyze logs. Structured logging means selecting a consistent log format and content structure to simplify querying across multiple components.

To identify a business transaction across components, such as a particular flight booking, log operational information from upstream and downstream services. Add information such as customer_id along with business outcomes such as order=accepted or order=confirmed. Make sure you are not logging any sensitive or personal identifying data in any logs.

Use JSON as your logging output format. Log multiple fields in a single object or dictionary rather than many one line messages for simpler searching.

Here is an example of a structured logging format.
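A single JSON log entry that combines operational and business fields might look like the following sketch (all field names and values are illustrative):

import json

log_entry = {
    "level": "INFO",
    "service": "booking",
    "function_version": "$LATEST",
    "correlation_id": "b7a5c3e0-1234-5678-9abc-def012345678",
    "customer_id": "customer-1234",
    "order": "accepted",
    "message": "Booking confirmed"
}

# Emit the whole entry as one JSON object per line for simple querying
print(json.dumps(log_entry))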

The airline booking component, which is written in Python, currently uses a shared library with a separate log processing stack.

Embedded Metrics Format is a simpler mechanism to replace the shared library and use structured logging. CloudWatch Embedded Metrics adds environmental metadata such as Lambda Function version and also automatically extracts custom metrics so you can visualize and alarm on them. There are open-source client libraries available for Node.js and Python.

I then add embedded metrics to the individual confirm booking module with the following steps:

  1. I install the aws-embedded-metrics library using the instructions.
  2. In the function init code, I import the module and create a metric_scope decorator with the following code:

from aws_embedded_metrics import metric_scope
@metric_scope

  3. In the function handler, I log the generated bookingReference with the following code:

metrics.set_property("BookingReference", ret["bookingReference"])

In this example I also log the entire incoming event details.

metrics.set_property("event", event)

It is best practice to only log what is required to avoid unnecessary costs. Ensure the event does not have any sensitive or personal identifying data which is available to anyone who has access to the logs.
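Putting these pieces together, an instrumented handler using the aws-embedded-metrics decorator might look roughly like this (the handler body, namespace, metric name, and field values are illustrative):

from aws_embedded_metrics import metric_scope

@metric_scope
def lambda_handler(event, context, metrics):
    # Illustrative booking confirmation; the real module calls downstream services
    ret = {"bookingReference": "abc123"}

    metrics.set_namespace("ServerlessAirline")
    metrics.put_metric("BookingConfirmed", 1, "Count")
    metrics.set_property("BookingReference", ret["bookingReference"])
    # Only log the full event if it contains no sensitive data
    metrics.set_property("event", event)
    return ret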

To avoid duplicate logging in this example airline application, which adds cost, I remove the existing shared library logger.*() lines.

When I make a booking, the CloudWatch log message is in structured JSON format. It contains the properties I set, event and BookingReference, as well as function metadata.

I can then search for all log activity related to a specific booking across multiple functions with booking_id. I can track customer activity across multiple bookings using customer_id.

Logging is often created as a shared library resource which all functions reference. Another option is using Lambda Layers, which lets functions import additional code such as external libraries. Multiple functions can share this code.

Improvement plan summary:

  1. Log request identifiers from downstream services, component name, component runtime information, unique correlation identifiers, and information that helps identify a business transaction.
  2. Use JSON as the logging output. Prefer logging entire objects/dictionaries rather than many one line messages. Mask or remove sensitive data when logging.
  3. Keep debugging information in logs to a minimum, as it can incur costs and increase the noise-to-signal ratio.

Conclusion

Evaluating serverless application health helps understand which services should be optimized to improve your customer’s experience. I cover out of the box metrics and alerts, as well as structured and centralized logging.

This well-architected question will be continued in an upcoming post where I look at custom metrics and distributed tracing.

Building well-architected serverless applications: Introduction

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-introduction/

Customers building production applications in the cloud are looking to build and operate their applications following best practices. Best practices are useful throughout the software development lifecycle and help to answer the question “am I well-architected?”

Serverless technologies provide a solid foundation for building well-architected applications with a goal to reduce and minimize the impact of issues that can happen. What does it mean to be “well architected”?

In 2015, AWS released the AWS Well-Architected Framework.  It’s a structured way to compare applications against AWS architectural best practices with advice on how to improve. This formalized framework publicly documents the approach our solutions architects use when conducting architectural reviews. The Well-Architected Framework can be used by customers and partners to evaluate applications. It’s currently based on five pillars:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization

In 2017, AWS extended the framework’s general advice to be more application domain specific with the concept of a “lens”. A lens adds additional questions for specific technology areas, which focus on what is different from the generic advice. Today, there are three technology domain lenses:

In 2018, AWS added the AWS Well-Architected Tool within the AWS Management Console. The tool allows you to define a Workload and answer a number of questions based on each of the five pillars. Each question has context to explain what the question means and why it is important, and then provides a number of best practices.

The tool is used to assess risks, and find opportunities for improvement for workloads. It can also provide a broader view across multiple applications for a central team. Each question has a number of best practices to follow, categorized as high or medium risk. This can help you decide where to focus.

Serverless Lens

In February, we added functionality to apply lenses to the Well-Architected Tool. The first one is the Serverless Lens. This includes nine questions, each with additional best practice recommendations to follow.

Applying best practices to production applications may require additional time and effort. The Well-Architected Tool and Serverless Lens are available to help and are not intended to be only a one-time check.

We suggest at a minimum reading the whitepapers and being familiar with the questions before the design phase. It’s a good idea to check throughout the application lifecycle; halfway (or sooner), close to launch, and post-launch for each iteration. Although it is never too late to add best practices to your applications, starting early helps reduce the remediation process as well as making it part of the full application lifecycle journey.

The following posts are a multi-part series addressing each of the questions within the Serverless Lens of the Well-Architected Tool.

Series Episodes: Building well-architected serverless applications

  1. Introduction
  2. Operational Excellence: Understanding serverless application health
    1. Out of the box metrics and alerts; structured and centralized logging
    2. Custom metrics and distributed tracing (upcoming)
  3. Operational Excellence – Understanding serverless application health (upcoming)
  4. Operational Excellence – Approaching application lifecycle management (upcoming)
  5. Security – Controlling access to serverless APIs (upcoming)
  6. Security – Managing serverless security boundaries (upcoming)
  7. Security – Implementing application security (upcoming)
  8. Reliability – Regulating inbound request rates (upcoming)
  9. Reliability – Building resiliency into serverless applications (upcoming)
  10. Performance Efficiency – Optimizing your Serverless Application Performance (upcoming)
  11. Cost Optimization – Optimizing your Serverless Application Costs (upcoming)

Example application

For this series, I am looking at an example application, AWS Serverless Airline Booking. This is a complete web application for a fictional airline booking service built using serverless technologies that provide Flight Search, Payment, Booking, and Loyalty Point features.

I show how to use the Well-Architected Tool with the Serverless Lens to help apply serverless best practices to the application.

This web application is the theme of Season 2 of Build on Serverless, which is a video series on Twitch.tv, and was presented at AWS re:Invent 2019.

You can view this application directly on GitHub or deploy into an AWS account using the Getting Started instructions. This is not required to follow this Building Well-Architected Serverless Applications series but allows you to explore the code.

The application architecture consists of a frontend web application, which interacts with a number of backend microservices to provide the airline booking functionality.

There are four backend services that make up the application:

  • Catalog: Provides flight search.
  • Booking: Provides new and list bookings. Business workflow implemented in Python.
  • Payment: Provides payment authorization, collection, and refund workflows using Stripe. Business workflow implemented in Python.
  • Loyalty: Provides loyalty points for customers, including tiers. Implemented in TypeScript.

Once the application is deployed, I can create a booking by searching for flights.

Once I select an available flight, I enter payment details using a test Stripe credit card.

The application authorizes the payment, confirms, and stores the booking and then updates the loyalty points.

Next steps

Building serverless applications using best practices can give you more confidence in the architecture and operations of your workloads.

Using the Well-Architected Tool and Serverless Lens can give you more visibility into your applications. They can help pinpoint and rank areas to improve. Well-Architected works best when integrated into your application lifecycle processes.

Continue learning in the next post, which dives into the first Well-Architected Serverless Lens question: Operational Excellence – Understanding Serverless Application Health – part 1.

Top 10 security items to improve in your AWS account

Post Syndicated from Nathan Case original https://aws.amazon.com/blogs/security/top-10-security-items-to-improve-in-your-aws-account/

If you’re looking to improve your cloud security, a good place to start is to follow the top 10 most important cloud security tips that Stephen Schmidt, Chief Information Security Officer for AWS, laid out at AWS re:Invent 2019. Below are the tips, expanded to help you take action.

10 most important security tips

1) Accurate account information

When AWS needs to contact you about your AWS account, we use the contact information defined in the AWS Management Console, including the email address used to create the account and those listed under Alternate Contacts. All email addresses should be set up to go to aliases that are not dependent on a single person. You should also have a process for regularly checking that these email addresses work, and that you are responding to emails—especially security notifications you might receive from [email protected]. Learn how to set the alternate contacts to help ensure someone is receiving important messages, even when you are unavailable.

Alternative Contacts user interface

2) Use multi-factor authentication (MFA)

MFA is the best way to protect accounts from inappropriate access. Always set up MFA on your Root user and AWS Identity and Access Management (IAM) users. If you use AWS Single Sign-On (SSO) to control access to AWS or to federate your corporate identity store, you can enforce MFA there. Implementing MFA at the federated identity provider (IdP) means that you can take advantage of existing MFA processes in your organization. To get started, see Using Multi-Factor Authentication (MFA) in AWS.

3) No hard-coding secrets

When you build applications on AWS, you can use AWS IAM roles to deliver temporary, short-lived credentials for calling AWS services. However, some applications require longer-lived credentials, such as database passwords or other API keys. If this is the case, you should never hard code these secrets in the application or store them in source code.

You can use AWS Secrets Manager to control the information in your application. Secrets Manager allows you to rotate, manage, and retrieve database credentials, API keys, and other secrets through their lifecycle. Users and applications can retrieve secrets with a call to Secrets Manager APIs, eliminating the need to hard code sensitive information in plain text.

You should also learn how to use AWS IAM roles for applications running on Amazon EC2. For best results, learn how to securely provide database credentials to AWS Lambda functions by using AWS Secrets Manager.

4) Limit security groups

Security groups are a key way that you can enable network access to resources you have provisioned on AWS. Ensuring that only the required ports are open and the connection is enabled from known network ranges is a foundational approach to security. You can use services such as AWS Config or AWS Firewall Manager to programmatically ensure that the virtual private cloud (VPC) security group configuration is what you intended. The Network Reachability rules package analyzes your Amazon Virtual Private Cloud (Amazon VPC) network configuration to determine whether your Amazon EC2 instances can be reached from external networks, such as the Internet, a virtual private gateway, or AWS Direct Connect. AWS Firewall Manager can also be used to automatically apply AWS WAF rules to internet-facing resources across your AWS accounts. Learn more about detecting and responding to changes in VPC Security Groups.

5) Intentional data policies

Not all data is created equal, which means classifying data properly is crucial to its security. It’s important to accommodate the complex tradeoffs between a strict security posture and a flexible agile environment. A strict security posture, which requires lengthy access-control procedures, creates stronger guarantees about data security. However, such a security posture can work counter to agile and fast-paced development environments, where developers require self-service access to data stores. Design your approach to data classification to meet a broad range of access requirements.

How you classify data doesn’t have to be as binary as public or private. Data comes in various degrees of sensitivity and you might have data that falls in all of the different levels of sensitivity and confidentiality. Design your data security controls with an appropriate mix of preventative and detective controls to match data sensitivity appropriately. In the suggestions below, we deal mostly with the difference between public and private data. If you have no classification policy currently, public versus private is a good place to start.

To protect your data once it has been classified, or while you are classifying it:

  1. If you have Amazon Simple Storage Service (Amazon S3) buckets that are for public usage, move all of that data into a separate AWS account set aside for public access. Set up policies to allow only processes — not humans — to move data into those buckets. This lets you block the ability to make a public Amazon S3 bucket in any other AWS account.
  2. Use Amazon S3 Block Public Access in any account that should not be able to share data through Amazon S3 (see the sketch after this list).
  3. Use two different IAM roles for encryption and decryption with KMS. This lets you separate the data entry (encryption) and data review (decryption), and it allows you to do threat detection on the failed decryption attempts by analyzing that role.
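For the account-wide block in step 2, a minimal boto3 sketch using the account-level S3 Block Public Access settings (the account ID is a placeholder):

import boto3

s3control = boto3.client('s3control')

# Turn on all four Block Public Access settings for the whole account
s3control.put_public_access_block(
    AccountId='111111111111',
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    }
)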

6) Centralize CloudTrail logs

Logging and monitoring are important parts of a robust security plan. Being able to investigate unexpected changes in your environment or perform analysis to iterate on your security posture relies on having access to data. AWS recommends that you write logs, especially AWS CloudTrail, to an S3 bucket in an AWS account designated for logging (Log Archive). The permissions on the bucket should prevent deletion of the logs, and they should also be encrypted at rest. Once the logs are centralized, you can integrate with SIEM solutions or use AWS services to analyze them. Learn how to use AWS services to visualize AWS CloudTrail logs. Once you have CloudTrail logs centralized, you can also use the same Log Archive account to centralize logs from other sources, such as CloudWatch Logs and AWS load balancers.

7) Validate IAM roles

As you operate your AWS accounts to iterate and build capability, you may end up creating multiple IAM roles that you discover later you don’t need. Use AWS IAM Access Analyzer to review access to your internal AWS resources and determine where you have shared access outside your AWS accounts. Routinely reevaluating AWS IAM roles and permissions with Security Hub or open source products such as Prowler will give you the visibility needed to validate compliance with your Governance, Risk, and Compliance (GRC) policies. If you’re already past this point, and have already created multiple roles, you can search for unused IAM roles and remove them.

8) Take actions on findings (This isn’t just GuardDuty anymore!)

AWS Security Hub, Amazon GuardDuty, and AWS Identity and Access Management Access Analyzer are managed AWS services that provide you with actionable findings in your AWS accounts. They are easy to turn on and can integrate across multiple accounts. Turning them on is the first step. You also need to take action when you see findings. The action(s) to take are determined by your own incident response policy. For each finding, ensure that you have determined what your required response actions should be.

Action can be notifying a human to respond, but as you get more experienced in AWS services, you will want to automate the response to the findings generated by Security Hub or GuardDuty. Learn more about how to automate your response and remediation from Security Hub findings.

9) Rotate keys

One of the things that Security Hub provides is a view of the compliance posture of your AWS accounts using the CIS Benchmarks. One of these checks is to look for IAM users with access keys more than 90 days old. If you need to use access keys rather than roles, you should rotate them regularly. Review best practices for managing AWS access keys for more guidance. If your users access AWS via federation, then you can remove the need to issue AWS access keys for your users. Users authenticate to the IdP and assume an IAM role in the target AWS account. The result is that long-term credentials are not needed, and your user will have short-term credentials associated with an IAM role.
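For the keys that do exist, a quick way to spot ones older than the 90-day CIS threshold with boto3 (the user name is a placeholder):

from datetime import datetime, timezone
import boto3

iam = boto3.client('iam')

# Flag access keys older than 90 days for a given IAM user
for key in iam.list_access_keys(UserName='example-user')['AccessKeyMetadata']:
    age_days = (datetime.now(timezone.utc) - key['CreateDate']).days
    if age_days > 90:
        print(f"{key['AccessKeyId']} is {age_days} days old and should be rotated")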

10) Be involved in the dev cycle

All of the guidance to this point has been focused on the technology configuration that you can implement. The last piece of advice, “be involved in the dev cycle,” is about people, and can be broadly summarized as “raise the security culture of your organization.” The role of people in all parts of the organization is to help the business launch their solutions securely. As people focused on security, we can guide and educate the rest of our organization to understand what they need to do to raise the bar for security in everything they build. Security is everyone’s job — not just for those folks with it in their job title.

What the security people in every organization can do is make security easier, by shifting the process so that the easiest and most desirable action is also the most secure one. For example, each team should not build their own identity federation or logging solution. We are stronger when we work together, and this applies to securing the cloud as well. The goal is to make security more approachable so that co-workers want to talk to the security team because they know it is the place to get help. For more about creating this type of security team, read Cultivating Security Leadership.

Now that you’ve revisited the top 10 things to make your cloud more secure, make sure you have them set up in your AWS accounts — and go build securely!

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Nathan Case

Nathan is a Security Strategist, Geek. He joined AWS in 2016. You can learn more about him here.