Tag Archives: AWS Systems Manager

Optimizing GPU utilization for AI/ML workloads on Amazon EC2

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/optimizing-gpu-utilization-for-ai-ml-workloads-on-amazon-ec2/

This blog post is written by Ben Minahan, DevOps Consultant, and Amir Sotoodeh, Machine Learning Engineer.

Machine learning workloads can be costly, and artificial intelligence/machine learning (AI/ML) teams can have a difficult time tracking and maintaining efficient resource utilization. ML workloads often utilize GPUs extensively, so typical application performance metrics such as CPU, memory, and disk usage don’t paint the full picture when it comes to system performance. Additionally, data scientists conduct long-running experiments and model training activities on existing compute instances that fit their unique specifications. Forcing these experiments to be run on newly provisioned infrastructure with proper monitoring systems installed might not be a viable option.

In this post, we describe how to track GPU utilization across all of your AI/ML workloads and enable accurate capacity planning without needing teams to use a custom Amazon Machine Image (AMI) or to re-deploy their existing infrastructure. You can use Amazon CloudWatch to track GPU utilization, and leverage AWS Systems Manager Run Command to install and configure the agent across your existing fleet of GPU-enabled instances.

Overview

First, make sure that your existing Amazon Elastic Compute Cloud (Amazon EC2) instances have the Systems Manager Agent installed, and also have the appropriate level of AWS Identity and Access Management (IAM) permissions to run the Amazon CloudWatch Agent. Next, specify the configuration for the CloudWatch Agent in Systems Manager Parameter Store, and then deploy the CloudWatch Agent to your GPU-enabled EC2 instances. Finally, create a CloudWatch Dashboard to analyze GPU utilization.

Architecture Diagram depicting the integration between AWS Systems Manager with Run Command arguments stored in SSM Parameter Store, your Amazon GPU-enabled EC2 instance with the Amazon CloudWatch Agent installed, and an Amazon CloudWatch Dashboard that aggregates and displays the reported metrics.

  1. Install the CloudWatch Agent on your existing GPU-enabled EC2 instances.
  2. Your CloudWatch Agent configuration is stored in Systems Manager Parameter Store.
  3. Systems Manager Documents are used to install and configure the CloudWatch Agent on your EC2 instances.
  4. GPU metrics are published to CloudWatch, which you can then visualize through the CloudWatch Dashboard.

Prerequisites

This post assumes you already have GPU-enabled EC2 workloads running in your AWS account. If an EC2 instance doesn’t have a GPU, then the custom configuration won’t be applied to the CloudWatch Agent; the default configuration is used instead, which is better suited for tracking resource utilization on those instances.

For the CloudWatch Agent to collect your instance’s GPU metrics, the proper NVIDIA drivers must be installed on your instance. Several AWS official AMIs including the Deep Learning AMI already have these drivers installed. To see a list of AMIs with the NVIDIA drivers pre-installed, and for full installation instructions for Linux-based instances, see Install NVIDIA drivers on Linux instances.
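
If you’re unsure whether the drivers are already present, you can connect to the instance and run nvidia-smi, or, once the instance is managed by Systems Manager (covered in the next section), run the same check through Run Command. This is a minimal sketch; <REGION_NAME> and <INSTANCE_ID> are placeholders:

aws ssm send-command \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AWS-RunShellScript \
--parameters '{"commands":["nvidia-smi"]}'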

Additionally, deploying and managing the CloudWatch Agent requires the instances to be running. If your instances are currently stopped, then you must start them to follow the instructions outlined in this post.
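
If you need to confirm an instance’s state, or start a stopped instance, you can use commands like the following (placeholders as above):

aws ec2 describe-instances --region <REGION_NAME> --instance-ids <INSTANCE_ID> --query 'Reservations[].Instances[].State.Name'
aws ec2 start-instances --region <REGION_NAME> --instance-ids <INSTANCE_ID>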

Preparing your EC2 instances

You utilize Systems Manager to deploy the CloudWatch Agent, so make sure that your EC2 instances have the Systems Manager Agent installed. Many AWS-provided AMIs already have the Systems Manager Agent installed. For a full list of the AMIs which have the Systems Manager Agent pre-installed, see Amazon Machine Images (AMIs) with SSM Agent preinstalled. If your AMI doesn’t have the Systems Manager Agent installed, see Working with SSM Agent for instructions on installing based on your operating system (OS).
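
To confirm that an instance is registered with Systems Manager, you can list your managed instances and check their ping status (a quick check; replace <REGION_NAME> with your AWS Region):

aws ssm describe-instance-information --region <REGION_NAME> --query 'InstanceInformationList[].[InstanceId,PingStatus,PlatformName]' --output table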

With the Systems Manager Agent installed, your EC2 instances also need permissions to accept commands from Systems Manager, read Systems Manager Parameter Store entries, and publish metrics to CloudWatch. These permissions are bundled into the managed IAM policies AmazonEC2RoleforSSM, AmazonSSMReadOnlyAccess, and CloudWatchAgentServerPolicy. To create a new IAM role and associated IAM instance profile with these policies attached, you can run the following AWS Command Line Interface (AWS CLI) commands, replacing <REGION_NAME> with your AWS Region, and <INSTANCE_ID> with the EC2 Instance ID that you want to associate with the instance profile:

aws iam create-role --role-name CloudWatch-Agent-Role --assume-role-policy-document  '{"Statement":{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}}'
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam attach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam create-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile
aws iam add-role-to-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile --role-name CloudWatch-Agent-Role
aws ec2 associate-iam-instance-profile --region <REGION_NAME> --instance-id <INSTANCE_ID> --iam-instance-profile Name=CloudWatch-Agent-Instance-Profile

Alternatively, you can attach the IAM policies to your existing IAM role associated with an existing IAM instance profile.

aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam attach-role-policy --role-name <ROLE_NAME> --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws ec2 associate-iam-instance-profile --region <REGION_NAME> --instance-id <INSTANCE_ID> --iam-instance-profile Name=<INSTANCE_PROFILE>

Once complete, you should see that your EC2 instance is associated with the appropriate IAM role.

An Amazon EC2 Instance with the CloudWatch-Agent-Role IAM Role attached

This role should have the AmazonEC2RoleforSSM, AmazonSSMReadOnlyAccess, and CloudWatchAgentServerPolicy IAM policies attached.

The CloudWatch-Agent-Role IAM Role’s attached permission policies, Amazon EC2 Role for SSM, CloudWatch Agent Server Policy, and Amazon SSM Read Only Access
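
You can also confirm this from the CLI. The following commands assume the CloudWatch-Agent-Role and instance profile created earlier:

aws iam list-attached-role-policies --role-name CloudWatch-Agent-Role
aws ec2 describe-iam-instance-profile-associations --region <REGION_NAME> --filters Name=instance-id,Values=<INSTANCE_ID>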

Configuring and deploying the CloudWatch Agent

Before deploying the CloudWatch Agent onto your EC2 instances, make sure that the agent will be properly configured to collect GPU metrics. To do this, create a CloudWatch Agent configuration and store it in Systems Manager Parameter Store.

Copy the following into a file cloudwatch-agent-config.json:

{
    "agent": {
        "metrics_collection_interval": 60,
        "run_as_user": "cwagent"
    },
    "metrics": {
        "aggregation_dimensions": [
            [
                "InstanceId"
            ]
        ],
        "append_dimensions": {
            "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
            "ImageId": "${aws:ImageId}",
            "InstanceId": "${aws:InstanceId}",
            "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
            "cpu": {
                "measurement": [
                    "cpu_usage_idle",
                    "cpu_usage_iowait",
                    "cpu_usage_user",
                    "cpu_usage_system"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ],
                "totalcpu": false
            },
            "disk": {
                "measurement": [
                    "used_percent",
                    "inodes_free"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "diskio": {
                "measurement": [
                    "io_time"
                ],
                "metrics_collection_interval": 60,
                "resources": [
                    "*"
                ]
            },
            "mem": {
                "measurement": [
                    "mem_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "swap": {
                "measurement": [
                    "swap_used_percent"
                ],
                "metrics_collection_interval": 60
            },
            "nvidia_gpu": {
                "measurement": [
                    "utilization_gpu",
                    "temperature_gpu",
                    "utilization_memory",
                    "fan_speed",
                    "memory_total",
                    "memory_used",
                    "memory_free",
                    "pcie_link_gen_current",
                    "pcie_link_width_current",
                    "encoder_stats_session_count",
                    "encoder_stats_average_fps",
                    "encoder_stats_average_latency",
                    "clocks_current_graphics",
                    "clocks_current_sm",
                    "clocks_current_memory",
                    "clocks_current_video"
                ],
                "metrics_collection_interval": 60
            }
        }
    }
}

Run the following AWS CLI command to deploy a Systems Manager Parameter CloudWatch-Agent-Config, which contains a minimal agent configuration for GPU metrics collection. Replace <REGION_NAME> with your AWS Region.

aws ssm put-parameter \
--region <REGION_NAME> \
--name CloudWatch-Agent-Config \
--type String \
--value file://cloudwatch-agent-config.json

Now you can see a CloudWatch-Agent-Config parameter in Systems Manager Parameter Store, containing your CloudWatch Agent’s JSON configuration.

CloudWatch-Agent-Config stored in Systems Manager Parameter Store
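
If you prefer the CLI, you can confirm the stored configuration with:

aws ssm get-parameter --region <REGION_NAME> --name CloudWatch-Agent-Config --query Parameter.Value --output text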

Next, install the CloudWatch Agent on your EC2 instances. To do this, you can leverage Systems Manager Run Command, specifically the AWS-ConfigureAWSPackage document which automates the CloudWatch Agent installation.

  1. Run the following AWS CLI command, replacing <REGION_NAME> with the Region into which your instances are deployed, and <INSTANCE_ID> with the EC2 Instance ID on which you want to install the CloudWatch Agent.
aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AWS-ConfigureAWSPackage \
--parameters '{"action":["Install"],"installationType":["In-place update"],"version":["latest"],"name":["AmazonCloudWatchAgent"]}'

2. To monitor the status of your command, use the get-command-invocation AWS CLI command. Replace <COMMAND_ID> with the command ID output from the previous step, <REGION_NAME> with your AWS region, and <INSTANCE_ID> with your EC2 instance ID.

aws ssm get-command-invocation --query Status --region <REGION_NAME> --command-id <COMMAND_ID> --instance-id <INSTANCE_ID>

3. Wait for the command to show the status Success before proceeding.

$ aws ssm send-command \
    --query 'Command.CommandId' \
    --region us-east-2 \
    --instance-ids i-0123456789abcdef \
    --document-name AWS-ConfigureAWSPackage \
    --parameters '{"action":["Install"],"installationType":["Uninstall and reinstall"],"version":["latest"],"additionalArguments":["{}"],"name":["AmazonCloudWatchAgent"]}'

"5d8419db-9c48-434c-8460-0519640046cf"

$ aws ssm get-command-invocation --query Status --region us-east-2 --command-id 5d8419db-9c48-434c-8460-0519640046cf --instance-id i-0123456789abcdef

"Success"

Repeat this process for all EC2 instances on which you want to install the CloudWatch Agent.
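
Rather than repeating the command for each instance, you can optionally target a group of instances in a single call by using --targets with a tag filter instead of --instance-ids. This is a sketch; the Workload=gpu-ml tag used here is an assumption, so substitute whatever tags identify your GPU fleet:

aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--targets Key=tag:Workload,Values=gpu-ml \
--document-name AWS-ConfigureAWSPackage \
--parameters '{"action":["Install"],"installationType":["In-place update"],"version":["latest"],"name":["AmazonCloudWatchAgent"]}'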

Next, configure the CloudWatch Agent installation. For this, once again leverage Systems Manager Run Command, this time with the AmazonCloudWatch-ManageAgent document, which applies the custom agent configuration stored in Systems Manager Parameter Store to your deployed agents.

  1. Run the following AWS CLI command, replacing <REGION_NAME> with the Region into which your instances are deployed, and <INSTANCE_ID> with the EC2 Instance ID on which you want to configure the CloudWatch Agent.
aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AmazonCloudWatch-ManageAgent \
--parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/CloudWatch-Agent-Config"],"optionalRestart":["yes"]}'

2. To monitor the status of your command, utilize the get-command-invocation AWS CLI command. Replace <COMMAND_ID> with the command ID output from the previous step, <REGION_NAME> with your AWS region, and <INSTANCE_ID> with your EC2 instance ID.

aws ssm get-command-invocation --query Status --region <REGION_NAME> --command-id <COMMAND_ID> --instance-id <INSTANCE_ID>

3. Wait for the command to show the status Success before proceeding.

$ aws ssm send-command \
    --query 'Command.CommandId' \
    --region us-east-2 \
    --instance-ids i-0123456789abcdef \
    --document-name AmazonCloudWatch-ManageAgent \
    --parameters '{"action":["configure"],"mode":["ec2"],"optionalConfigurationSource":["ssm"],"optionalConfigurationLocation":["/CloudWatch-Agent-Config"],"optionalRestart":["yes"]}'

"9a4a5c43-0795-4fd3-afed-490873eaca63"

$ aws ssm get-command-invocation --query Status --region us-east-2 --command-id 9a4a5c43-0795-4fd3-afed-490873eaca63 --instance-id i-0123456789abcdef

"Success"

Repeat this process for all EC2 instances on which you want to install the CloudWatch Agent. Once finished, the CloudWatch Agent installation and configuration is complete, and your EC2 instances now report GPU metrics to CloudWatch.
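
If you want to double-check that the agent is running on an instance, one option is to invoke the same AmazonCloudWatch-ManageAgent document with the status action and then inspect the command output with get-command-invocation as before (a minimal sketch using the placeholders from earlier):

aws ssm send-command \
--query 'Command.CommandId' \
--region <REGION_NAME> \
--instance-ids <INSTANCE_ID> \
--document-name AmazonCloudWatch-ManageAgent \
--parameters '{"action":["status"],"mode":["ec2"]}'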

Visualize your instance’s GPU metrics in CloudWatch

Now that your GPU-enabled EC2 Instances are publishing their utilization metrics to CloudWatch, you can visualize and analyze these metrics to better understand your resource utilization patterns.

The GPU metrics collected by the CloudWatch Agent are within the CWAgent namespace. Explore your GPU metrics using the CloudWatch Metrics Explorer, or deploy our provided sample dashboard.
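
For example, you can confirm that the GPU metrics are arriving by listing them from the CLI:

aws cloudwatch list-metrics --region <REGION_NAME> --namespace CWAgent --metric-name nvidia_smi_utilization_gpu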

  1. Copy the following into a file, cloudwatch-dashboard.json, replacing instances of <REGION_NAME> with your Region:
{
    "widgets": [
        {
            "height": 10,
            "width": 24,
            "y": 16,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{"expression": "SELECT AVG(nvidia_smi_utilization_gpu) FROM SCHEMA(\"CWAgent\", InstanceId) GROUP BY InstanceId","id": "q1"}]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "title": "GPU Core Utilization",
                "yAxis": {
                    "left": {"label": "Percent","max": 100,"min": 0,"showUnits": false}
                }
            }
        },
        {
            "height": 7,
            "width": 8,
            "y": 0,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{"expression": "SELECT AVG(nvidia_smi_utilization_gpu) FROM SCHEMA(\"CWAgent\", InstanceId)", "label": "Utilization","id": "q1"}]
                ],
                "view": "gauge",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "title": "Average GPU Core Utilization",
                "yAxis": {"left": {"max": 100, "min": 0}
                },
                "liveData": false
            }
        },
        {
            "height": 9,
            "width": 24,
            "y": 7,
            "x": 0,
            "type": "metric",
            "properties": {
                "metrics": [
                    [{ "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_used\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m1", "visible": false }],
                    [{ "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_total\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m2", "visible": false }],
                    [{ "expression": "SEARCH(' MetricName=\"mem_used_percent\" {CWAgent, InstanceId} ', 'Average')", "id": "m3", "visible": false }],
                    [{ "expression": "100*AVG(m1)/AVG(m2)", "label": "GPU", "id": "e2", "color": "#17becf" }],
                    [{ "expression": "AVG(m3)", "label": "RAM", "id": "e3" }]
                ],
                "view": "timeSeries",
                "stacked": false,
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {"min": 0,"max": 100,"label": "Percent","showUnits": false}
                },
                "title": "Average Memory Utilization"
            }
        },
        {
            "height": 7,
            "width": 8,
            "y": 0,
            "x": 8,
            "type": "metric",
            "properties": {
                "metrics": [
                    [ { "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_used\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m1", "visible": false } ],
                    [ { "expression": "SEARCH(' MetricName=\"nvidia_smi_memory_total\" {\"CWAgent\", InstanceId} ', 'Average')", "id": "m2", "visible": false } ],
                    [ { "expression": "100*AVG(m1)/AVG(m2)", "label": "Utilization", "id": "e2" } ]
                ],
                "sparkline": true,
                "view": "gauge",
                "region": "<REGION_NAME>",
                "stat": "Average",
                "period": 300,
                "yAxis": {
                    "left": {"min": 0,"max": 100}
                },
                "liveData": false,
                "title": "GPU Memory Utilization"
            }
        }
    ]
}

2. Run the following AWS CLI command, replacing <REGION_NAME> with the name of your Region:

aws cloudwatch put-dashboard \
    --region <REGION_NAME> \
    --dashboard-name My-GPU-Usage \
    --dashboard-body file://cloudwatch-dashboard.json

View the My-GPU-Usage CloudWatch dashboard in the CloudWatch console for your AWS Region.

An example CloudWatch dashboard, My-GPU-Usage, showing the GPU usage metrics over time.

Cleaning Up

To avoid incurring future costs for the resources created by following along with this post, delete the following (a sample cleanup sketch follows the list):

  1. My-GPU-Usage CloudWatch Dashboard
  2. CloudWatch-Agent-Config Systems Manager Parameter
  3. CloudWatch-Agent-Role IAM Role
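
The following commands sketch one way to remove these resources from the CLI. They assume the CloudWatch-Agent-Role and CloudWatch-Agent-Instance-Profile created earlier in this post; adjust the names if you attached the policies to an existing role instead, and look up <ASSOCIATION_ID> with aws ec2 describe-iam-instance-profile-associations.

aws cloudwatch delete-dashboards --region <REGION_NAME> --dashboard-names My-GPU-Usage
aws ssm delete-parameter --region <REGION_NAME> --name CloudWatch-Agent-Config
aws ec2 disassociate-iam-instance-profile --region <REGION_NAME> --association-id <ASSOCIATION_ID>
aws iam remove-role-from-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile --role-name CloudWatch-Agent-Role
aws iam delete-instance-profile --instance-profile-name CloudWatch-Agent-Instance-Profile
aws iam detach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
aws iam detach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/AmazonSSMReadOnlyAccess
aws iam detach-role-policy --role-name CloudWatch-Agent-Role --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
aws iam delete-role --role-name CloudWatch-Agent-Role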

Conclusion

By following along with this post, you deployed and configured the CloudWatch Agent across your GPU-enabled EC2 instances to track GPU utilization without pausing in-progress experiments and model training. Then, you visualized the GPU utilization of your workloads with a CloudWatch Dashboard to better understand your workload’s GPU usage and make more informed scaling and cost decisions. For other ways that Amazon CloudWatch can improve your organization’s operational insights, see the Amazon CloudWatch documentation.

AWS Week in Review: Public Preview of Amazon DataZone and AWS DataSync Updates – April 3, 2023

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-public-preview-of-amazon-datazone-and-aws-datasync-updates-april-3-2023/

Last weekend, I enjoyed the spring vibes at Seoul Forest, a large park in the middle of Seoul city, where cherry blossoms are in full bloom.

Compared to last year, there were crowds of people, so I realized that it was really back to normal after the pandemic. I hope you all enjoy the season of spring or fall with your family.

Last Week’s Launches
As if it were an April Fools’ Day joke, there were 65 launches last week, far more than usual. AWS product teams are working hard with customer obsession.

So, I had a lot of trouble choosing the important ones. Other than the ones I’ve picked out, there may be important feature releases that fit your needs. Be sure to take a look at the full list of launches from the last week.

First, here is a list of the general availability of AWS services and features covered by the AWS News Blog:

Let’s take a look at some launches from the last week that I want to remind you of:

The Preview of Amazon DataZone – At AWS re:Invent 2022, we preannounced Amazon DataZone, a new data management service to catalog, discover, analyze, share, and govern data between data producers and consumers in the organization. You can now try out the public preview of Amazon DataZone.

Data producers populate the business data catalog from AWS Glue Data Catalog and Amazon Redshift tables. Data consumers search for and subscribe to data assets in the data catalog and analyze with tools such as Amazon Athena query editors in the Amazon DataZone portal. To get started with Amazon DataZone, see our Quick Start Guide to include sample datasets to implement a complete use case.

AWS DataSync Supports Azure Blob Storage in Preview – AWS DataSync supports copying your object data at scale from Azure Blob Storage to AWS storage services such as Amazon S3. AWS DataSync supports all blob types within Azure Blob Storage and can also be used with Azure Data Lake Storage (ADLS) Gen 2.

In addition to Azure Blob Storage, DataSync supports Google Cloud Storage and Azure Files storage locations as well as various general storage systems and AWS storage services. To learn more, see Migrating Azure Blob Storage to Amazon S3 using AWS DataSync in the AWS Storage Blog.

On-call schedules with AWS Systems Manager Incident Manager – You can now configure or change on-call rotation schedules with a group of contacts and have 24/7 coverage and responsiveness for critical issues in the Incident Manager console.

AWS Incident Manager helps you bring the right people and information together when a critical issue is detected, activating preconfigured response plans to engage responders using SMS, phone calls, and chat channels, as well as to run AWS Systems Manager Automation runbooks. To learn how to get started with on-call schedules in Incident Manager, see Working with on-call schedules in Incident Manager in the AWS documentation.

AWS CloudShell Console Toolbar – You can now use the AWS CloudShell Console Toolbar with the AWS Management Console in a single view. The Console Toolbar maintains its state (e.g., open, closed) and commands will continue to run in CloudShell as you navigate between services in the Console. For example, it allows you to run a command in CloudShell and view a CloudWatch alarm in the Console at the same time.

After signing into the Console, you can access CloudShell in the lower left of the Console by selecting the CloudShell icon in the Console Toolbar.

New Features of AWS Well-Architected Tool – The Consolidated Report and Enhanced Search enable customers to quickly identify risk themes across their workloads and scale improvements across their organization. This macro-level view helps executive stakeholders understand where common issues lie and prioritize team resources to drive widespread improvement. To learn more, see AWS Well-Architected Tool Dashboard in the AWS documentation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting from the last week:

Welcome to the .NET on AWS Blog – We launched a new blog channel for millions of .NET developers across the world. Blog posts will also cover built-for-the-cloud development, modernizing .NET Framework applications, and how to deploy .NET workloads on different AWS services. We will use this channel to share news on the work we’ve done with the .NET open-source community, post follow-ups from important events, and post announcements about upcoming presentations from our .NET developer advocates. To learn more, visit our .NET on AWS website and follow us on Twitter at @dotnetonAWS.

AWS Knowledge Center in AWS re:Post – You can now access trusted, authoritative articles and videos of AWS Knowledge Center on AWS re:Post to get answers to technical questions. Knowledge Center content is produced by an AWS team and covers the most frequent questions and requests from AWS customers. These articles are available in 10 localized languages: English, French, German, Italian, Japanese, Korean, Portuguese, Simplified Chinese, Spanish, and Traditional Chinese.

TF1’s FIFA World Cup Digital Broadcasting Story – Sébastien shared an awesome story about how the French broadcaster TF1 used AWS Cloud technology and expertise to bring the FIFA World Cup to millions of people. He described how TF1 redesigned its digital broadcasting architecture on AWS and tested the new platform on large-scale sporting events. In preparation for the FIFA World Cup, TF1 enhanced monitoring to detect anomalies during the event and established a backup plan in a “war room” for the worst-case scenario. Even if you’re not a fan of football, I recommend reading this behind-the-scenes look at the FIFA World Cup Finals. It’s long but really fun!

Upcoming AWS Events
Check your calendars and sign up for these AWS-led events:

AWS re:Inforce 2023 – Register now for AWS re:Inforce in Anaheim, California, June 13–14. AWS Chief Information Security Officer CJ Moses will share the latest innovations in cloud security and what AWS Security is focused on. The breakout sessions will provide real-world examples of how security is embedded into the way businesses operate. To learn more and get the limited discount code to register, see CJ’s blog post, Gain insights and knowledge at AWS re:Inforce 2023, in the AWS Security Blog.

AWS Global Summits – Check your calendars and sign up for the AWS Summit closest to your city: Paris and Sydney (April 4), Seoul (May 3-4), Berlin and Singapore (May 4), Stockholm (May 11), Hong Kong (May 23), Amsterdam (June 1), London (June 7), Madrid (June 15), and Milano (June 22).

AWS Community Day – Join community-led conferences driven by AWS user group leaders closest to your city: Peru (April 15), Helsinki (April 20), Chicago (June 15), Philippines (June 29–30), and Munich (September 14). Recently, we have been bringing together AWS user groups from around the world into Meetup Pro accounts. Find your group and its meetups in your city!

You can browse all upcoming AWS-led in-person and virtual events, and developer-focused events such as AWS DevDay.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Serverless ICYMI Q1 2023

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/serverless-icymi-q1-2023/

Welcome to the 21st edition of the AWS Serverless ICYMI (in case you missed it) quarterly recap. Every quarter, we share all the most recent product launches, feature enhancements, blog posts, webinars, live streams, and other interesting things that you might have missed!

In case you missed our last ICYMI, check out what happened last quarter here.

Artificial intelligence (AI) technologies, ChatGPT, and DALL-E are creating significant interest in the industry at the moment. Find out how to integrate serverless services with ChatGPT and DALL-E to generate unique bedtime stories for children.

Example notification of a story hosted with Next.js and App Runner

Serverless Land is a website maintained by the Serverless Developer Advocate team to help you build serverless applications and includes workshops, code examples, blogs, and videos. There is now enhanced search functionality so you can search across resources, patterns, and video content.

ServerlessLand search

AWS Lambda

AWS Lambda has improved how concurrency works with Amazon SQS. You can now control the maximum number of concurrent Lambda functions invoked.

The launch blog post explains the scaling behavior of Lambda using this architectural pattern, challenges this feature helps address, and a demo of maximum concurrency in action.

Maximum concurrency is set to 10 for the SQS queue.
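
As an illustrative sketch (not taken from the launch post; the event source mapping UUID is a placeholder), you could set a maximum concurrency of 10 on an existing SQS event source mapping from the CLI:

aws lambda update-event-source-mapping \
--uuid <EVENT_SOURCE_MAPPING_UUID> \
--scaling-config MaximumConcurrency=10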

AWS Lambda Powertools is an open-source library to help you discover and incorporate serverless best practices more easily. Lambda Powertools for .NET is now generally available and currently focused on three observability features: distributed tracing (Tracer), structured logging (Logger), and asynchronous business and application metrics (Metrics). Powertools is also available for the Python, Java, and TypeScript/Node.js programming languages.

To learn more:

Lambda announced a new feature, runtime management controls, which provide more visibility and control over when Lambda applies runtime updates to your functions. The runtime controls are optional capabilities for advanced customers that require more control over their runtime changes. You can now specify a runtime management configuration for each function with three settings: Automatic (default), Function update, or Manual.

There are three new Amazon CloudWatch metrics for asynchronous Lambda function invocations: AsyncEventsReceived, AsyncEventAge, and AsyncEventsDropped. You can track the asynchronous invocation requests sent to Lambda functions to monitor any delays in processing and take corrective actions if required. The launch blog post explains the new metrics and how to use them to troubleshoot issues.

Lambda now supports Amazon DocumentDB change streams as an event source. You can use Lambda functions to process new documents, track updates to existing documents, or log deleted documents. You can use any programming language that is supported by Lambda to write your functions.

There is a helpful blog post suggesting best practices for developing portable Lambda functions that allow you to port your code to containers if you later choose to.

AWS Step Functions

AWS Step Functions has expanded its AWS SDK integrations with support for 35 additional AWS services, including Amazon EMR Serverless, AWS Clean Rooms, AWS IoT FleetWise, AWS IoT RoboRunner, and 31 other AWS services. In addition, Step Functions also added support for 1000+ new API actions from new and existing AWS services such as Amazon DynamoDB and Amazon Athena. For the full list of added services, visit AWS SDK service integrations.

Amazon EventBridge

Amazon EventBridge has launched the AWS Controllers for Kubernetes (ACK) for EventBridge and Pipes. This allows you to manage EventBridge resources, such as event buses, rules, and pipes, using the Kubernetes API and resource model (custom resource definitions).

EventBridge event buses now also support enhanced integration with Service Quotas. Your quota increase requests for limits such as PutEvents transactions-per-second, number of rules, and invocations per second among others will be processed within one business day or faster, enabling you to respond quickly to changes in usage.

AWS SAM

The AWS Serverless Application Model (SAM) Command Line Interface (CLI) has added the sam list command. You can now show resources defined in your application, including the endpoints, methods, and stack outputs required to test your deployed application.
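
For example (a sketch with a placeholder stack name; option names may vary by SAM CLI version), after deploying an application you might run:

sam list endpoints --stack-name <STACK_NAME> --output json
sam list stack-outputs --stack-name <STACK_NAME>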

AWS SAM has a preview of sam build support for building and packaging serverless applications developed in Rust. You can use cargo-lambda in the AWS SAM CLI build workflow and AWS SAM Accelerate to iterate on your code changes rapidly in the cloud.

You can now use AWS SAM connectors as a source resource parameter. Previously, you could only define AWS SAM connectors as an AWS::Serverless::Connector resource. Now you can add the resource attribute on a connector’s source resource, which makes templates more readable and easier to update over time.

AWS SAM connectors now also support multiple destinations to simplify your permissions. You can now use a single connector between a single source resource and multiple destination resources.

In October 2022, AWS released OpenID Connect (OIDC) support for AWS SAM Pipelines. This improves your security posture by creating integrations that use short-lived credentials from your CI/CD provider. There is a new blog post on how to implement it.

Find out how best to build serverless Java applications with the AWS SAM CLI.

AWS App Runner

AWS App Runner now supports retrieving secrets and configuration data stored in AWS Secrets Manager and AWS Systems Manager (SSM) Parameter Store in an App Runner service as runtime environment variables.

AppRunner also now supports incoming requests based on HTTP 1.0 protocol, and has added service level concurrency, CPU and Memory utilization metrics.

Amazon S3

Amazon S3 now automatically applies default encryption to all new objects added to S3, at no additional cost and with no impact on performance.

You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to tailor or customize data to end users. For example, you can resize an image depending on the device that an end user is visiting from.

S3 has introduced Mountpoint for S3, a high performance open source file client that translates local file system API calls to S3 object API calls like GET and LIST.

S3 Multi-Region Access Points now support datasets that are replicated across multiple AWS accounts. They provide a single global endpoint for your multi-region applications, and dynamically route S3 requests based on policies that you define. This helps you to more easily implement multi-Region resilience, latency-based routing, and active-passive failover, even when data is stored in multiple accounts.

Amazon Kinesis

Amazon Kinesis Data Firehose now supports streaming data delivery to Elastic. This is an easier way to ingest streaming data to Elastic and consume the Elastic Stack (ELK Stack) solutions for enterprise search, observability, and security without having to manage applications or write code.

Amazon DynamoDB

Amazon DynamoDB now supports table deletion protection to protect your tables from accidental deletion when performing regular table management operations. You can set the deletion protection property for each table, which is set to disabled by default.
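
For example (a sketch with a placeholder table name), you can turn deletion protection on for an existing table from the CLI:

aws dynamodb update-table --table-name <TABLE_NAME> --deletion-protection-enabled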

Amazon SNS

Amazon SNS now supports AWS X-Ray active tracing to visualize, analyze, and debug application performance. You can now view traces that flow through Amazon SNS topics to destination services, such as Amazon Simple Queue Service, Lambda, and Kinesis Data Firehose, in addition to traversing the application topology in Amazon CloudWatch ServiceLens.

SNS also now supports setting content-type request headers for HTTPS notifications so applications can receive their notifications in a more predictable format. Topic subscribers can create a DeliveryPolicy that specifies the content-type value that SNS assigns to their HTTPS notifications, such as application/json, application/xml, or text/plain.

EDA Visuals collection added to Serverless Land

The Serverless Developer Advocate team has extended Serverless Land and introduced EDA visuals. These are small bite sized visuals to help you understand concept and patterns about event-driven architectures. Find out about batch processing vs. event streaming, commands vs. events, message queues vs. event brokers, and point-to-point messaging. Discover bounded contexts, migrations, idempotency, claims, enrichment and more!

EDA Visuals

To learn more:

Serverless Repos Collection on Serverless Land

There is also a new section on Serverless Land containing helpful code repositories. You can search for code repos to use for examples, learning or building serverless applications. You can also filter by use-case, runtime, and level.

Serverless Repos Collection

Serverless Blog Posts

January

Jan 12 – Introducing maximum concurrency of AWS Lambda functions when using Amazon SQS as an event source

Jan 20 – Processing geospatial IoT data with AWS IoT Core and the Amazon Location Service

Jan 23 – AWS Lambda: Resilience under-the-hood

Jan 24 – Introducing AWS Lambda runtime management controls

Jan 24 – Best practices for working with the Apache Velocity Template Language in Amazon API Gateway

February

Feb 6 – Previewing environments using containerized AWS Lambda functions

Feb 7 – Building ad-hoc consumers for event-driven architectures

Feb 9 – Implementing architectural patterns with Amazon EventBridge Pipes

Feb 9 – Securing CI/CD pipelines with AWS SAM Pipelines and OIDC

Feb 9 – Introducing new asynchronous invocation metrics for AWS Lambda

Feb 14 – Migrating to token-based authentication for iOS applications with Amazon SNS

Feb 15 – Implementing reactive progress tracking for AWS Step Functions

Feb 23 – Developing portable AWS Lambda functions

Feb 23 – Uploading large objects to Amazon S3 using multipart upload and transfer acceleration

Feb 28 – Introducing AWS Lambda Powertools for .NET

March

Mar 9 – Server-side rendering micro-frontends – UI composer and service discovery

Mar 9 – Building serverless Java applications with the AWS SAM CLI

Mar 10 – Managing sessions of anonymous users in WebSocket API-based applications

Mar 14 – Implementing an event-driven serverless story generation application with ChatGPT and DALL-E

Videos

Serverless Office Hours – Tues 10AM PT

Weekly office hours live stream. In each session we talk about a specific topic or technology related to serverless and open it up to helping you with your real serverless challenges and issues. Ask us anything you want about serverless technologies and applications.

January

Jan 10 – Building .NET 7 high performance Lambda functions

Jan 17 – Amazon Managed Workflows for Apache Airflow at Scale

Jan 24 – Using Terraform with AWS SAM

Jan 31 – Preparing your serverless architectures for the big day

February

Feb 07 – Visually design and build serverless applications

Feb 14 – Multi-tenant serverless SaaS

Feb 21 – Refactoring to Serverless

Feb 28 – EDA visually explained

March

Mar 07 – Lambda cookbook with Python

Mar 14 – Succeeding with serverless

Mar 21 – Lambda Powertools .NET

Mar 28 – Server-side rendering micro-frontends

FooBar Serverless YouTube channel

Marcia Villalba frequently publishes new videos on her popular serverless YouTube channel. You can view all of Marcia’s videos at https://www.youtube.com/c/FooBar_codes.

January

Jan 12 – Serverless Badge – A new certification to validate your Serverless Knowledge

Jan 19 – Step functions Distributed map – Run 10k parallel serverless executions!

Jan 26 – Step Functions Intrinsic Functions – Do simple data processing directly from the state machines!

February

Feb 02 – Unlock the Power of EventBridge Pipes: Integrate Across Platforms with Ease!

Feb 09 – Amazon EventBridge Pipes: Enrichment and filter of events Demo with AWS SAM

Feb 16 – AWS App Runner – Deploy your apps from GitHub to Cloud in Record Time

Feb 23 – AWS App Runner – Demo hosting a Node.js app in the cloud directly from GitHub (AWS CDK)

March

Mar 02 – What is Amazon DynamoDB? What are the most important concepts? What are the indexes?

Mar 09 – Choreography vs Orchestration: Which is Best for Your Distributed Application?

Mar 16 – DynamoDB Single Table Design: Simplify Your Code and Boost Performance with Table Design Strategies

Mar 23 – 8 Reasons You Should Choose DynamoDB for Your Next Project and How to Get Started

Sessions with SAM & Friends

AWS SAM & Friends

Eric Johnson is exploring how developers are building serverless applications. We spend time talking about AWS SAM as well as others like AWS CDK, Terraform, Wing, and AMPT.

Feb 16 – What’s new with AWS SAM

Feb 23 – AWS SAM with AWS CDK

Mar 02 – AWS SAM and Terraform

Mar 10 – Live from ServerlessDays ANZ

Mar 16 – All about AMPT

Mar 23 – All about Wing

Mar 30 – SAM Accelerate deep dive

Still looking for more?

The Serverless landing page has more information. The Lambda resources page contains case studies, webinars, whitepapers, customer stories, reference architectures, and even more Getting Started tutorials.

You can also follow the Serverless Developer Advocacy team on Twitter to see the latest news, follow conversations, and interact with the team.

Securely validate business application resilience with AWS FIS and IAM

Post Syndicated from Dr. Rudolf Potucek original https://aws.amazon.com/blogs/devops/securely-validate-business-application-resilience-with-aws-fis-and-iam/

To avoid high costs of downtime, mission critical applications in the cloud need to achieve resilience against degradation of cloud provider APIs and services.

In 2021, AWS launched AWS Fault Injection Simulator (FIS), a fully managed service to perform fault injection experiments on workloads in AWS to improve their reliability and resilience. At the time of writing, FIS allows you to simulate degradation of Amazon Elastic Compute Cloud (EC2) APIs using API fault injection actions, and thus to explore the resilience of workflows where EC2 APIs act as a fault boundary.

In this post we show you how to explore additional fault boundaries in your applications by selectively denying access to any AWS API. This technique is particularly useful for fully managed, “black box” services like Amazon Simple Storage Service (S3) or Amazon Simple Queue Service (SQS) where a failure of read or write operations is sufficient to simulate problems in the service. This technique is also useful for injecting failures in serverless applications without needing to modify code. While similar results could be achieved with network disruption or modifying code with feature flags, this approach provides a fine granular degradation of an AWS API without the need to re-deploy and re-validate code.

Overview

We will explore a common application pattern: user uploads a file, S3 triggers an AWS Lambda function, Lambda transforms the file to a new location and deletes the original:

S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 1. S3 upload and transform logical workflow: User uploads file to S3, upload triggers AWS Lambda execution, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

We will simulate the user upload with an Amazon EventBridge rate expression triggering an AWS Lambda function which creates a file in S3:

S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Figure 2. S3 upload and transform implemented demo workflow: Amazon EventBridge triggers a creator Lambda function, Lambda function creates a file in S3, file creation triggers AWS Lambda execution on transformer function, Lambda writes transformed file to a new bucket and deletes original. Workflow can be disrupted at file deletion.

Using this architecture we can explore the effect of S3 API degradation during file creation and deletion. As shown, the API call to delete a file from S3 is an application fault boundary. The failure could occur, with identical effect, because of S3 degradation or because the AWS IAM role of the Lambda function denies access to the API.

To inject failures we use AWS Systems Manager (AWS SSM) automation documents to attach and detach IAM policies at the API fault boundary and FIS to orchestrate the workflow.

Each Lambda function has an IAM execution role that allows S3 write and delete access, respectively. If the transformer Lambda function fails, the S3 file remains in the bucket, indicating a failure. Similarly, if the IAM execution role for the transformer function is denied the ability to delete a file after processing, that file remains in the S3 bucket.

Prerequisites

Following this blog post will incur some costs for AWS services. To explore this test application you will need an AWS account. We will also assume that you are using AWS CloudShell or have the AWS CLI installed and have configured a profile with administrator permissions. With that in place you can create the demo application in your AWS account by downloading this template and deploying an AWS CloudFormation stack:

git clone https://github.com/aws-samples/fis-api-failure-injection-using-iam.git
cd fis-api-failure-injection-using-iam
aws cloudformation deploy --stack-name test-fis-api-faults --template-file template.yaml --capabilities CAPABILITY_NAMED_IAM

Fault injection using IAM

Once the stack has been created, navigate to the Amazon CloudWatch Logs console and filter for /aws/lambda/test-fis-api-faults. Under the EventBridgeTimerHandler log group you should find log events once a minute writing a timestamped file to an S3 bucket named fis-api-failure-ACCOUNT_ID. Under the S3TriggerHandler log group you should find matching deletion events for those files.
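
If you prefer the CLI over the console, you can list the stack’s log groups and tail one of them (aws logs tail requires AWS CLI v2; <LOG_GROUP_NAME> is a placeholder for one of the returned names):

aws logs describe-log-groups --log-group-name-prefix /aws/lambda/test-fis-api-faults --query 'logGroups[].logGroupName'
aws logs tail <LOG_GROUP_NAME> --follow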

Once you have confirmed object creation and deletion, let’s take away the S3 trigger handler Lambda function’s permission to delete files. To do this, you will attach the FISAPI-DenyS3DeleteObject policy that was created with the template:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam attach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

With the deny policy in place you should now see object deletion fail and objects should start showing up in the S3 bucket. Navigate to the S3 console and find the bucket starting with fis-api-failure. You should see a new object appearing in this bucket once a minute:

S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.

Figure 3. S3 bucket listing showing files not being deleted because IAM permissions DENY file deletion during FIS experiment.
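
You can also watch the bucket from the command line. This sketch assumes your current credentials resolve to the same account ID that appears in the bucket name:

ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
aws s3 ls s3://fis-api-failure-${ACCOUNT_ID}/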

If you would like to graph the results you can navigate to AWS CloudWatch, select “Logs Insights”, select the log group starting with /aws/lambda/test-fis-api-faults-S3CountObjectsHandler, and run this query:

fields @timestamp, @message
| filter NumObjects >= 0
| sort @timestamp desc
| stats max(NumObjects) by bin(1m)
| limit 20

This will show the number of files in the S3 bucket over time:

AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

Figure 4. AWS CloudWatch Logs Insights graph showing the increase in the number of retained files in S3 bucket over time, demonstrating the effect of the introduced failure.

You can now detach the policy:

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws iam detach-role-policy \
  --role-name ${ROLE_NAME}\
  --policy-arn ${POLICY_ARN}

We see that newly written files are once again deleted, but the unprocessed files remain in the S3 bucket. From the fault injection we learned that our system does not tolerate request failures when deleting files from S3. To address this, we should add a dead letter queue or some other retry mechanism.

Note: if the Lambda function does not return a success state on invocation, EventBridge will retry. In our Lambda functions we are cost conscious and explicitly capture the failure states to avoid excessive retries.

Fault injection using SSM

To use this approach from FIS and to always remove the policy at the end of the experiment, we first create an SSM document to automate adding a policy to a role. To inspect this document, open the SSM console, navigate to the “Documents” section, find the FISAPI-IamAttachDetach document under “Owned by me”, and examine the “Content” tab (make sure to select the correct region). This document takes the name of the Role you want to impact and the Policy you want to attach as parameters. It also requires an IAM execution role that grants it the power to list, attach, and detach specific policies to specific roles.
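
If you prefer the CLI, you can also inspect the document’s parameters with:

aws ssm describe-document \
  --name FISAPI-IamAttachDetach \
  --query "Document.Parameters[].[Name,Type,Description]" \
  --output table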

Let’s run the SSM automation document from the console by selecting “Execute Automation”. Determine the name of the FISAPI-SSM-Automation-Role from CloudFormation, and determine the ARN of the FISAPI-DenyS3DeleteObject policy from CloudFormation or by running:

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

Use FISAPI-SSM-Automation-Role, a duration of 2 minutes expressed in ISO8601 format as PT2M, the ARN of the deny policy, and the name of the target role FISAPI-TARGET-S3TriggerHandlerRole:

Image of parameter input field reflecting the instructions in blog text.

Figure 5. Image of parameter input field reflecting the instructions in blog text.

Alternatively execute this from a shell:

ASSUME_ROLE_NAME=FISAPI-SSM-Automation-Role
ASSUME_ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ASSUME_ROLE_NAME}'].Arn" --output text )
echo Assume Role ARN: $ASSUME_ROLE_ARN

ROLE_NAME=FISAPI-TARGET-S3TriggerHandlerRole
ROLE_ARN=$( aws iam list-roles --query "Roles[?RoleName=='${ROLE_NAME}'].Arn" --output text )
echo Target Role ARN: $ROLE_ARN

POLICY_NAME=FISAPI-DenyS3DeleteObject
POLICY_ARN=$( aws iam list-policies --query "Policies[?PolicyName=='${POLICY_NAME}'].Arn" --output text )
echo Impact Policy ARN: $POLICY_ARN

aws ssm start-automation-execution \
  --document-name FISAPI-IamAttachDetach \
  --parameters "{
      \"AutomationAssumeRole\": [ \"${ASSUME_ROLE_ARN}\" ],
      \"Duration\": [ \"PT2M\" ],
      \"TargetResourceDenyPolicyArn\": [\"${POLICY_ARN}\" ],
      \"TargetApplicationRoleName\": [ \"${ROLE_NAME}\" ]
    }"

Wait two minutes and then examine the content of the S3 bucket starting with fis-api-failure again. You should now see two additional files in the bucket, showing that the policy was attached for 2 minutes during which files could not be deleted, and confirming that our application is not resilient to S3 API degradation.
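
If you started the automation from the shell, you can also watch its progress with the execution ID returned by start-automation-execution (shown here as a placeholder):

aws ssm get-automation-execution \
  --automation-execution-id <EXECUTION_ID> \
  --query AutomationExecution.AutomationExecutionStatus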

Permissions for injecting failures with SSM

Fault injection with SSM is controlled by IAM, which is why you had to specify the FISAPI-SSM-Automation-Role:

Visual representation of IAM permission used for fault injections with SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the SSM user needing to have a pass-role permission to grant the SSM execution role to the SSM service.

Figure 6. Visual representation of IAM permission used for fault injections with SSM.

This role needs to contain an assume role policy statement for SSM to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
             - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "ssm.amazonaws.com"

The role also needs to contain permissions to describe roles and their attached policies with an optional constraint on which roles and policies are visible:

          - Sid: GetRoleAndPolicyDetails
            Effect: Allow
            Action:
              - 'iam:GetRole'
              - 'iam:GetPolicy'
              - 'iam:ListAttachedRolePolicies'
            Resource:
              # Roles
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn
              # Policies
              - !Ref AwsFisApiPolicyDenyS3DeleteObject

Finally, the SSM role needs to allow attaching and detaching policies. This requires:

  1. an ALLOW statement
  2. a constraint on the policies that can be attached
  3. a constraint on the roles that can be attached to

In the role we collapse the first two requirements into an ALLOW statement with a condition constraint for the Policy ARN. We then express the third requirement in a DENY statement that will limit the '*' resource to only the explicit role ARNs we want to modify:

          - Sid: AllowOnlyTargetResourcePolicies
            Effect: Allow
            Action:  
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            Resource: '*'
            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Policies that can be attached
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject
          - Sid: DenyAttachDetachAllRolesExceptApplicationRole
            Effect: Deny
            Action: 
              - 'iam:DetachRolePolicy'
              - 'iam:AttachRolePolicy'
            NotResource: 
              # Roles that can be attached to
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

We will discuss security considerations in more detail at the end of this post.

Fault injection using FIS

With the SSM document in place you can now create an FIS template that calls the SSM document. Navigate to the FIS console and filter for FISAPI-DENY-S3PutObject. You should see that the experiment template passes the same parameters that you previously used with SSM:

Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

Figure 7. Image of FIS experiment template action summary. This shows the SSM document ARN to be used for fault injection and the JSON parameters passed to the SSM document specifying the IAM Role to modify and the IAM Policy to use.

You can now run the FIS experiment and after a couple minutes once again see new files in the S3 bucket.
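
You can also start the experiment from the CLI instead of the console. This is a sketch: list the experiment templates to find the ID of the template described above, then start it and poll its state (the template and experiment IDs are placeholders):

aws fis list-experiment-templates \
  --query 'experimentTemplates[].[id,description]' \
  --output table

aws fis start-experiment \
  --experiment-template-id <EXPERIMENT_TEMPLATE_ID> \
  --query 'experiment.id'

aws fis get-experiment \
  --id <EXPERIMENT_ID> \
  --query experiment.state.status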

Permissions for injecting failures with FIS and SSM

Fault injection with FIS is controlled by IAM, which is why you had to specify the FISAPI-FIS-Injection-ExperimentRole:

Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

Figure 8. Visual representation of IAM permission used for fault injections with FIS and SSM. It shows the SSM execution role permitting access to use SSM automation documents as well as modify IAM roles and policies via the SSM document. It also shows the FIS execution role permitting access to use FIS templates, as well as the pass-role permission to grant the SSM execution role to the SSM service. Finally it shows the FIS user needing to have a pass-role permission to grant the FIS execution role to the FIS service.

This role needs to contain an assume role policy statement for FIS to allow assuming the role:

      AssumeRolePolicyDocument:
        Statement:
          - Action:
              - 'sts:AssumeRole'
            Effect: Allow
            Principal:
              Service:
                - "fis.amazonaws.com"

The role also needs permissions to list and execute SSM documents:

            - Sid: RequiredReadActionsforAWSFIS
              Effect: Allow
              Action:
                - 'cloudwatch:DescribeAlarms'
                - 'ssm:GetAutomationExecution'
                - 'ssm:ListCommands'
                - 'iam:ListRoles'
              Resource: '*'
            - Sid: RequiredSSMStopActionforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:CancelCommand'
              Resource: '*'
            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

Finally, remember that the SSM document needs a Role of its own to execute the fault injection actions. Because that Role is different from the Role under which we started the FIS experiment, we need to explicitly allow SSM to assume that role with a PassRole statement, whose resource will expand to the FISAPI-SSM-Automation-Role ARN:

            - Sid: RequiredIAMPassRoleforSSMADocuments
              Effect: Allow
              Action: 'iam:PassRole'
              Resource: !Sub 'arn:aws:iam::${AWS::AccountId}:role/${SsmAutomationRole}'

Secure and flexible permissions

So far, we have used explicit ARNs for our guardrails. To expand flexibility, we can use wildcards in our resource matching. For example, we might change the Policy matching from:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Ref AwsFisApiPolicyDenyS3DeleteObject

or the equivalent:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Explicitly listed policies - secure but inflexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${FullPolicyName}'

to a wildcard notation like this:

            Condition:
              ArnEquals:
                'iam:PolicyARN':
                  # Wildcard policies - secure and flexible
                  - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:policy/${PolicyNamePrefix}*'

If we set PolicyNamePrefix to FISAPI-DenyS3 this would now allow invoking FISAPI-DenyS3PutObject and FISAPI-DenyS3DeleteObject but would not allow using a policy named FISAPI-DenyEc2DescribeInstances.

Similarly, we could change the Resource matching from:

            NotResource: 
              # Explicitly listed roles - secure but inflexible
              - !GetAtt EventBridgeTimerHandlerRole.Arn
              - !GetAtt S3TriggerHandlerRole.Arn

to a wildcard equivalent like this:

            NotResource: 
              # Wildcard roles - secure and flexible
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixEventBridge}*'
              - !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:role/${RoleNamePrefixS3}*'

and setting RoleNamePrefixEventBridge to FISAPI-TARGET-EventBridge and RoleNamePrefixS3 to FISAPI-TARGET-S3.

Finally, we would also change the FIS experiment role to allow SSM documents based on a name prefix by changing the constraint on automation execution from:

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Explicitly listed resource - secure but inflexible
                # Note: the $DEFAULT at the end could also be an explicit version number
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationIamAttachDetachDocument}:$DEFAULT'

to

            - Sid: RequiredSSMWriteActionsforAWSFIS
              Effect: Allow
              Action:
                - 'ssm:StartAutomationExecution'
                - 'ssm:StopAutomationExecution'
              Resource: 
                # Wildcard resources - secure and flexible
                # 
                # Note: the 'automation-definition' is automatically created from 'document' on invocation
                - !Sub 'arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:automation-definition/${SsmAutomationDocumentPrefix}*'

and setting SsmAutomationDocumentPrefix to FISAPI-. Test this by updating the CloudFormation stack with a modified template:

aws cloudformation deploy --stack-name test-fis-api-faults --template-file template2.yaml --capabilities CAPABILITY_NAMED_IAM

Permissions governing users

In production you should not use administrator access to run FIS experiments. Instead, we create two roles, FISAPI-AssumableRoleWithCreation and FISAPI-AssumableRoleWithoutCreation, for you (see this template). These roles require all FIS and SSM resources to have a Name tag that starts with FISAPI-. Try assuming the role without creation privileges and running an experiment. You will notice that you can only start an experiment if you add a Name tag, e.g. FISAPI-secure-1, and you will only be able to get details of experiments and templates that have proper Name tags.
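
For example, a minimal sketch of starting an experiment with the required Name tag using the AWS SDK for Python might look like this; the experiment template ID is a placeholder for your own template:

import boto3

fis = boto3.client('fis')

# Start an experiment with a Name tag that satisfies the FISAPI- guardrail.
# The experiment template ID is a placeholder for your own template.
response = fis.start_experiment(
    experimentTemplateId='EXT123456789012345',
    tags={'Name': 'FISAPI-secure-1'}
)
print(response['experiment']['id'])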

If you are working with AWS Organizations, you can add further guardrails by defining SCPs that control the use of the FISAPI-* tags, similar to this blog post.

Caveats

For this solution we are choosing to attach policies instead of permission boundaries. The benefit of this is that you can attach multiple independent policies and thus simulate multi-step service degradation. However, this means that it is possible to increase the permission level of a role. While there are situations where this might be of interest, e.g. to simulate security breaches, please implement a thorough security review of any fault injection IAM policies you create. Note that modifying IAM Roles may trigger events in your security monitoring tools.

The AttachRolePolicy and DetachRolePolicy calls from AWS IAM are eventually consistent, meaning that in some cases it can take up to 5 minutes for permission changes to propagate when starting or stopping fault injection.
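
If your experiment automation needs to verify that a deny policy is actually in effect before making assertions, one option is to probe the blocked API until it is denied. A minimal sketch, assuming the probe runs under credentials that use the modified role and that you supply your own bucket name:

import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')

def wait_until_put_denied(bucket, timeout=300, interval=15):
    # Retry a small PutObject until IAM denies it, confirming the injected fault is active.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            s3.put_object(Bucket=bucket, Key='fisapi-propagation-probe', Body=b'probe')
        except ClientError as err:
            if err.response['Error']['Code'] == 'AccessDenied':
                return True
            raise
        time.sleep(interval)
    return False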

Cleanup

To avoid additional cost, delete the content of the S3 bucket and delete the CloudFormation stack:

# Clean up policy attachments just in case
CLEANUP_ROLES=$(aws iam list-roles --query "Roles[?starts_with(RoleName,'FISAPI-')].RoleName" --output text)
for role in $CLEANUP_ROLES; do
  CLEANUP_POLICIES=$(aws iam list-attached-role-policies --role-name $role --query "AttachedPolicies[?starts_with(PolicyName,'FISAPI-')].PolicyArn" --output text)
  for policy in $CLEANUP_POLICIES; do
    echo Detaching policy $policy from role $role
    aws iam detach-role-policy --role-name $role --policy-arn $policy
  done
done
# Delete S3 bucket content
ACCOUNT_ID=$( aws sts get-caller-identity --query Account --output text )
S3_BUCKET_NAME=fis-api-failure-${ACCOUNT_ID}
aws s3 rm --recursive s3://${S3_BUCKET_NAME}
aws s3 rb s3://${S3_BUCKET_NAME}
# Delete cloudformation stack
aws cloudformation delete-stack --stack-name test-fis-api-faults
aws cloudformation wait stack-delete-complete --stack-name test-fis-api-faults

Conclusion 

AWS Fault Injection Simulator provides the ability to simulate various external impacts on your application to validate and improve resilience. We’ve shown how combining FIS with IAM to selectively deny access to AWS APIs provides a generic path to explore fault boundaries across all AWS services. We’ve also shown how this approach can be used to identify and address a resilience gap in a common S3 upload workflow. To learn about more ways to use FIS, see this workshop.

About the authors:

Dr. Rudolf Potucek

Dr. Rudolf Potucek is Startup Solutions Architect at Amazon Web Services. Over the past 30 years he gained a PhD and worked in different roles including leading teams in academia and industry, as well as consulting. He brings experience from working with academia, startups, and large enterprises to his current role of guiding startup customers to succeed in the cloud.

Rudolph Wagner

Rudolph Wagner is a Premium Support Engineer at Amazon Web Services who holds the CISSP and OSCP security certifications, in addition to being a certified AWS Solutions Architect Professional. He assists internal and external Customers with multiple AWS services by using his diverse background in SAP, IT, and construction.

AWS Week in Review – February 20, 2023

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-february-20-2023/

Since the devastating earthquake in Türkiye and Syria, Amazon has activated disaster relief services to quickly provide relief items to impacted areas. The company and Amazon customers have donated nearly 100,000 relief items so far, and donations continue to come in.

The AWS Disaster Preparedness and Response team is providing trained technical volunteers and solutions to Help.NGO, a United Nations standby partner assisting in the region.

We continue to support field requests for winter survival equipment, clothing, hygiene products, and other items. If you wish to donate, check out our blog post to find your local donation site and to learn more about how we’ve supported relief efforts so far. Thank you for your support!

Last Week’s Launches
As usual, let’s take a look at some launches from the last week that I want to remind you of:

New Amazon EC2 M7g and R7g instances – Following the launch of C7g instances in May 2022, the general-purpose (M7g) and memory-optimized (R7g) instances are now generally available. Both types are powered by the latest-generation AWS Graviton3 processors and are designed to deliver up to 25 percent better performance than the equivalent sixth-generation (M6g and R6g) instances, making them some of the best performers in Amazon EC2.

Here is my infographic to highlight the principal performance and capacity improvements that we have made available with the new instances:

Enable AWS Systems Manager across all Amazon EC2 instances – With a single action, the new Default Host Management Configuration (DHMC) makes all EC2 instances in your account managed instances, without changing existing instance profile roles. DHMC is ideal for all EC2 users and offers a simple, scalable way to standardize the availability of Systems Manager tools for users who manage many instances. To learn more, see Default Host Management Configuration in the AWS documentation.
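
If you prefer to enable DHMC programmatically, the console workflow maps to a Systems Manager service setting. Here is a minimal sketch using the AWS SDK for Python; the setting ID and role name are assumptions based on the documented defaults, so verify both against the DHMC documentation before relying on them:

import boto3

ssm = boto3.client('ssm')

# Assumed setting ID and role name for Default Host Management Configuration;
# verify both in the Systems Manager documentation for your account.
ssm.update_service_setting(
    SettingId='/ssm/managed-instance/default-ec2-instance-management-role',
    SettingValue='AWSSystemsManagerDefaultEC2InstanceManagementRole'
)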

Programmatically manage opt-in AWS Regions – You can now view and manage enabled and disabled opt-in AWS Regions on your AWS accounts using AWS APIs. You can enable, disable, read, and list Region opt status. For example, the following AWS CLI commands enable the Africa (Cape Town) Region and check its status:

$ aws account enable-region --region-name af-south-1
$ aws account get-region-opt-status --region-name af-south-1 
{ 
   "RegionName": "af-south-1", 
   "RegionOptStatus": "ENABLING" 
}

It will save you the time and effort of doing it through the AWS Management Console. To learn more, see Specifying which AWS Regions your account can use in the AWS documentation.

Pictured: A 3D rendering of the AWS Modular Data Center (MDC) unit.

AWS Modular Data Center (AWS MDC) – AWS MDC is available as a self-contained modular data center unit: an environmentally controlled physical enclosure that can host racks of AWS Outposts or AWS Snow Family devices. AWS MDC lets defense customers run low-latency applications in infrastructure-limited environments for scenarios like large-scale military operations, crisis response, and security cooperation.

AWS MDC is now available in the AWS GovCloud (US) Regions, and the service can only be purchased by the U.S. Department of Defense under the Joint Warfighting Cloud Capability (JWCC) contract. To learn more, read the AWS Public Sector Blog post.

Amazon EKS Anywhere on Snow – This new deployment option helps you create and operate Kubernetes clusters on AWS Snowball Edge devices, with familiar provisioning and operational visibility tooling for container applications deployed at the edge.

Amazon EKS Anywhere on Snow is ideal for customers who run their operations using secure and durable AWS Snow Family devices in unconditioned or mobile environments such as construction sites, ships, and rapidly deployed military forces. To learn more, read the AWS Container Blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting in the last week:

Upcoming AWS Events
Check your calendars and sign up for these AWS-led events:

AWS at MWC 2023 – Join AWS at MWC23 in Barcelona, Spain, February 27 – March 2, to see demonstrations of our latest services, be inspired at one of our many sessions, or request a more personal meeting with us onsite.

AWS Innovate Data and AI/ML edition – AWS Innovate is a free online event to learn the latest from AWS experts and get step-by-step guidance on using AI/ML to drive fast, efficient, and measurable results. Register now for Asia Pacific & Japan (February 22, 2023), EMEA (March 9), and the Americas (March 14).

AWS Summits – AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. We kick off with Paris and Sydney on April 4, and most other Summits are scheduled from April to June. Please stay tuned and watch for the dates and locations to be announced.

You can browse all upcoming AWS-led in-person and virtual events, as well as developer-focused events such as Community Days.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Week in Review – February 6, 2023

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-week-in-review-february-6-2023/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

If you are looking for a new year challenge, the Serverless Developer Advocate team launched the 30 days of Serverless. You can follow the hashtag #30DaysServerless on LinkedIn, Twitter, or Instagram or visit the challenge page and learn a new Serverless concept every day.

Last Week’s Launches
Here are some launches that got my attention during the previous week.

AWS SAM CLI v1.72 – Added the capability to list important information from your deployments.

  • List the endpoint URLs of Amazon API Gateway APIs and AWS Lambda function URLs.
    $ sam list endpoints
  • List the outputs of the deployed stack.
    $ sam list outputs
  • List the resources in the local stack. If a stack name is provided, it also shows the corresponding deployed resources and their IDs.
    $ sam list resources

Amazon RDS – Now supports increasing the allocated storage size when creating read replicas or restoring a database from a snapshot. This is very useful when your primary instance is near its maximum allocated storage capacity.

Amazon QuickSight – Now allows you to create radar charts. Radar charts are a way to visualize multivariate data by plotting one or more groups of values over multiple common variables.

AWS Systems Manager Automation – Now integrates with Systems Manager Change Calendar. You can reduce the risks associated with changes in your production environment by allowing Automation runbooks to run only during an allowed time window configured in the Change Calendar.

AWS AppConfig – Now integrates with AWS Secrets Manager and AWS Key Management Service (AWS KMS). All sensitive data retrieved from Secrets Manager via AWS AppConfig can be encrypted at deployment time using an AWS KMS customer managed key (CMK).

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

AWS Cloud Clubs – Cloud Clubs are peer-to-peer user groups for students and young people aged 18–28. In these clubs, you can network, attend career-building events, earn benefits like AWS credits, and more. Learn more about the clubs in your region in the AWS student portal.

Get AWS Certified: Professional challenge – You can register now for the certification challenge. Prepare for your AWS Professional Certification exam and get a 50 percent discount on the certification exam. Learn more about the challenge on the official page.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish, and every other week, there is a new episode. The podcast is for builders, and it shares stories about how customers implemented and learned AWS services, how to architect applications, and how to use new services. You can listen to all the episodes directly from your favorite podcast app or at AWS Podcasts en Español.

AWS Open-Source News and Updates – This is a newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent recaps – We had a lot of announcements during re:Invent. If you want to learn them all in your language and in your area, check the re:Invent recaps. All the upcoming ones are posted on this site, so check it regularly to find an event nearby.

AWS Innovate Data and AI/ML edition – AWS Innovate is a free online event to learn the latest from AWS experts and get step-by-step guidance on using AI/ML to drive fast, efficient, and measurable results.

  • AWS Innovate Data and AI/ML edition for Asia Pacific and Japan is taking place on February 22, 2023. Register here.
  • Registrations for AWS Innovate EMEA (March 9, 2023) and the Americas (March 14, 2023) will open soon. Check the AWS Innovate page for updates.

You can find details on all upcoming events, in-person or virtual, here.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

The most visited AWS DevOps blogs in 2022

Post Syndicated from Brian Beach original https://aws.amazon.com/blogs/devops/the-most-visited-aws-devops-blogs-in-2022/

As we kick off 2023, I wanted to take a moment to highlight the top posts from 2022. Without further ado, here are the top 10 AWS DevOps Blog posts of 2022.

#1: Integrating with GitHub Actions – CI/CD pipeline to deploy a Web App to Amazon EC2

Coming in at #1, Mahesh Biradar, Solutions Architect and Suresh Moolya, Cloud Application Architect use GitHub Actions and AWS CodeDeploy to deploy a sample application to Amazon Elastic Compute Cloud (Amazon EC2).

Architecture diagram from the original post.

#2: Deploy and Manage GitLab Runners on Amazon EC2

Sylvia Qi, Senior DevOps Architect, and Sebastian Carreras, Senior Cloud Application Architect, guide us through utilizing infrastructure as code (IaC) to automate GitLab Runner deployment on Amazon EC2.

Architecture diagram from the original post.

#3 Multi-Region Terraform Deployments with AWS CodePipeline using Terraform Built CI/CD

Lerna Ekmekcioglu, Senior Solutions Architect, and Jack Iu, Global Solutions Architect, demonstrate best practices for multi-Region deployments using HashiCorp Terraform, AWS CodeBuild, and AWS CodePipeline.

Architecture diagram from the original post.

#4 Use the AWS Toolkit for Azure DevOps to automate your deployments to AWS

Mahmoud Abid, Senior Customer Delivery Architect, leverages the AWS Toolkit for Azure DevOps to deploy AWS CloudFormation stacks.

Architecture diagram from the original post.

#5 Deploy and manage OpenAPI/Swagger RESTful APIs with the AWS Cloud Development Kit

Luke Popplewell, Solutions Architect, demonstrates using AWS Cloud Development Kit (AWS CDK) to build and deploy Amazon API Gateway resources using the OpenAPI specification.

Architecture diagram from the original post.

#6: How to unit test and deploy AWS Glue jobs using AWS CodePipeline

Praveen Kumar Jeyarajan, Senior DevOps Consultant, and Vaidyanathan Ganesa Sankaran, Sr Modernization Architect, discuss unit testing Python-based AWS Glue Jobs in AWS CodePipeline.

Architecture diagram from the original post.

#7: Jenkins high availability and disaster recovery on AWS

James Bland, APN Global Tech Lead for DevOps, and Welly Siauw, Sr. Partner solutions architect, discuss the challenges of architecting Jenkins for scale and high availability (HA).

Architecture diagram from the original post.

#8: Monitor AWS resources created by Terraform in Amazon DevOps Guru using tfdevops

Harish Vaswani, Senior Cloud Application Architect, and Rafael Ramos, Solutions Architect, explain how you can configure and use tfdevops to easily enable Amazon DevOps Guru for your existing AWS resources created by Terraform.

Architecture diagram from the original post.

#9: Manage application security and compliance with the AWS Cloud Development Kit and cdk-nag

Arun Donti, Senior Software Engineer with Twitch, demonstrates how to integrate cdk-nag into an AWS Cloud Development Kit (AWS CDK) application to provide continual feedback and help align your applications with best practices.

Featured image from the original post.

#10: Smithy Server and Client Generator for TypeScript (Developer Preview)

Adam Thomas, Senior Software Development Engineer, demonstrate how you can use Smithy to define services and SDKs and deploy them to AWS Lambda using a generated client.

Architecture diagram from the original post.

A big thank you to all our readers! Your feedback and collaboration are appreciated and help us produce better content.

 

 

About the author:

Brian Beach

Brian Beach has over 20 years of experience as a Developer and Architect. He is currently a Principal Solutions Architect at Amazon Web Services. He holds a Computer Engineering degree from NYU Poly and an MBA from Rutgers Business School. He is the author of “Pro PowerShell for Amazon Web Services” from Apress. He is a regular author and has spoken at numerous events. Brian lives in North Carolina with his wife and three kids.

How Wego secured developer connectivity to Amazon Relational Database Service instances

Post Syndicated from Adriaan de Jonge original https://aws.amazon.com/blogs/architecture/how-wego-secured-developer-connectivity-to-amazon-relational-database-service-instances/

How do you securely access Amazon Relational Database Service (Amazon RDS) instances from a developer’s laptop? Online travel marketplace Wego shares its journey from bastion hosts in a public subnet to lightweight VPN tunnels on top of Session Manager, a capability of AWS Systems Manager, using temporary access keys.

In this post, we explore how developers get access to allow-listed resources in their virtual private cloud (VPC) directly from their workstation, by tunnelling VPN over secure shell (SSH), which, in turn, is tunneled over Session Manager.

Note: This blog post is not intended as a step-by-step, how-to guide. Commands stated here are for illustrative purposes and may need customization.

Wego’s architecture before starting this journey

In 2021, Wego’s developer connectivity architecture was based on jump hosts in a public subnet, as illustrated in Figure 1.

Figure 1. Original Wego architecture

Figure 1 demonstrates a network architecture with both public and private subnets. The public subnet contains an Amazon Elastic Compute Cloud (Amazon EC2) instance that serves as a jump host. The diagram illustrates a VPN tunnel between the developer’s desktop and the VPC.

In Wego’s previous architecture, the jump host was connected to the internet for terminal access through the secure shell (SSH) protocol, which accepts traffic at Port 22. Despite restrictions to the allowed source IP addresses, exposing Port 22 to the internet can increase the likelihood of a security breach; it is possible to spoof (mimic) an allowed IP address and attempt a denial of service attack.

Moving the jump host to a private subnet with Session Manager

Session Manager helps minimize the likelihood of a security breach. Figure 2 demonstrates how Wego moved the jump host from a public subnet to a private subnet. In this architecture, Session Manager serves as the main entry point for incoming network traffic.

Figure 2. Wego’s new architecture using Session Manager

We will explore how developers connect to Amazon RDS directly from their workstation in this architecture.

Tunnel TCP traffic through Session Manager

Session Manager is best known for its terminal access capability, but it can also tunnel TCP connections. This is helpful if you want to access EC2 instances from your local workstation (Figure 3).

Figure 3. Tunneling TCP traffic over Session Manager

Here’s an example command to forward traffic from local host Port 8888 to an EC2 instance:

$ aws ssm start-session --target <instance-id> \
  --document-name AWS-StartPortForwardingSession \
  --parameters '{"portNumber":["8888"], "localPortNumber":["8888"]}'

This assumes the target EC2 instance is configured with AWS Systems Manager connectivity.

Tunnel SSH traffic over Session Manager

SSH is a protocol built on top of TCP; therefore, you can tunnel SSH traffic similarly (Figure 4).

Figure 4. Tunneling SSH traffic over Session Manager

To allow a short-hand notation for SSH over SSM, add the following configuration to the ~/.ssh/config configuration file:

host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h \
        --document-name AWS-StartSSHSession \
        --parameters 'portNumber=%p'"

You can now connect to the EC2 instance over SSH with the following command:

ssh -i <key-file> <username>@<ec2-instance-id>

For example:

ssh -i my_key ec2-user@i-1234567890abcdef0

Ideally, your key-file is a short-lived credential, as recommended by the AWS Well-Architected Framework, as it narrows the window of opportunity for a security breach. However, it can be tedious to manage short-lived credentials. This is where EC2 Instance Connect comes to the rescue!

Replace SSH keys with EC2 Instance Connect

EC2 Instance Connect is available both on the AWS console and the command line. It makes it easier to work with short-lived keys. On the command line, it allows us to install our own temporary access credentials into a private EC2 instance for the duration of 60 seconds (Figure 5).

Figure 5. Connecting to SSH with temporary keys

Ensure the EC2 instance connect plugin is installed on your workstation:

pip3 install ec2instanceconnectcli

This blog post assumes you are using Amazon Linux on the EC2 instance with all prerequisites installed. Make sure your IAM role or user has the required permissions.

To generate a temporary SSH key pair, insert:

$ ssh-keygen -t rsa -f my_key
$ ssh-add my_key

To install the public key into the EC2 instance, insert:

$ aws ec2-instance-connect send-ssh-public-key \
  --instance-id <instance-id> \
  --instance-os-user <username> \
  --ssh-public-key <location ssh key public key> \
  --availability-zone <availabilityzone> \
  --region <region>

For example:

$ aws ec2-instance-connect send-ssh-public-key \
  --instance-id i-1234567890abcdef0 \
  --instance-os-user ec2-user \
  --ssh-public-key file://my_key.pub \
  --availability-zone ap-southeast-1b \
  --region ap-southeast-1

Connect to the EC2 instance within 60 seconds and delete the key after use.

Tunneling VPN over SSH, then over Session Manager

In this section, we adopt a third-party, open-source tool that is not supported by AWS, called sshuttle. sshuttle is a transparent proxy server that works as a VPN over SSH. It is based on Python and released under the LGPL 2.1 license. It runs across a wide range of Linux distributions and on macOS (Figure 6).

Figure 6. Tunneling VPN over SSH over Session Manager

Why do we need to tunnel VPN over SSH, rather than using the earlier TCP over Session Manager? Keep in mind that the developer’s goal is to connect to Amazon RDS, not Amazon EC2. The SSM tunnel only works for connections to EC2 instances, not Amazon RDS.

A lightweight VPN solution, like sshuttle, bridges this gap by allowing you to forward traffic from Amazon EC2 to Amazon RDS. From the developer’s perspective, this works transparently, as if it is regular network traffic.

To install sshuttle, use one of the documented commands:

$ pip3 install sshuttle

To start sshuttle, use the following command pattern:

$ sshuttle -r <username>@<instance-id> <private CIDR range>

For example:

$ sshuttle -r ec2-user@i-1234567890abcdef0 10.0.0.0/16

Make sure the security group for the RDS DB instance allows network access from the jump host. You can now connect directly from the developer’s workstation to the RDS DB instance based on its IP address.
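
With the tunnel up, database clients need no special configuration. Here is a minimal sketch, assuming a PostgreSQL RDS instance inside the tunneled 10.0.0.0/16 range and psycopg2 installed on the workstation; the endpoint, database name, and credentials are placeholders:

import psycopg2

# The connection is routed through sshuttle transparently. Use the instance's
# private IP address if the RDS endpoint does not resolve from your workstation.
conn = psycopg2.connect(
    host='mydb.abc123xyz.ap-southeast-1.rds.amazonaws.com',  # placeholder endpoint
    port=5432,
    dbname='appdb',          # placeholder
    user='developer',        # placeholder
    password='<password>'    # placeholder; prefer short-lived credentials
)
with conn.cursor() as cur:
    cur.execute('SELECT version()')
    print(cur.fetchone())
conn.close()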

Advantages of this architecture

In this blog post, we layered a VPN over SSH that, in turn, is layered over Session Manager, plus we used temporary SSH keys.

Wego designed this architecture, and it was practical and stable for day-to-day use. They found that this solution runs at lower cost than AWS Client VPN and is sufficient for the use case of developers accessing online development environments.

Wego’s new architecture has a number of advantages, including:

  • More easily connecting to workloads in private and isolated subnets
  • Inbound security group rules are not required for the jump host, as Session Manager is an outbound connection
  • Access attempts are logged in AWS CloudTrail
  • Access control uses standard IAM policies, including tag-based resource access
  • Security groups and network access control lists still apply to “allow” or “deny” traffic to specific destinations
  • SSH keys are installed only temporarily for 60 seconds through EC2 Instance Connect

Conclusion

In this blog post, we explored Wego’s access patterns that can help you reduce your exposure to potential security attacks. Whether you adopt Wego’s full architecture or only adopt intermediary steps (like SSH over Session Manager and EC2 Instance Connect), reducing exposure to the public subnet and shortening the lifetime of access credentials can improve your security posture!

Further reading

Securely retrieving secrets with AWS Lambda

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/securely-retrieving-secrets-with-aws-lambda/

AWS Lambda functions often need to access secrets, such as certificates, API keys, or database passwords. Storing secrets outside the function code in an external secrets manager helps to avoid exposing secrets in application source code. Using a secrets manager also allows you to audit and control access, and can help with secret rotation. Do not store secrets in Lambda environment variables, as these are visible to anyone who has access to view function configuration.

This post highlights some solutions to store secrets securely and retrieve them from within your Lambda functions.

AWS Partner Network (APN) member Hashicorp provides Vault to secure secrets and application data. Vault allows you to control access to your secrets centrally, across applications, systems, and infrastructure. You can store secrets in Vault and access them from a Lambda function to access a database, for example. The Vault Agent for AWS helps you authenticate with Vault, retrieve the database credentials, and then perform the queries. You can also use the Vault AWS Lambda extension to manage connectivity to Vault.

AWS Systems Manager Parameter Store enables you to store configuration data securely, including secrets, as parameter values. For information on Parameter Store pricing, see the documentation.

AWS Secrets Manager allows you to replace hardcoded credentials in your code with an API call to Secrets Manager to retrieve the secret programmatically. You can generate, protect, rotate, manage, and retrieve secrets throughout their lifecycle. By default, Secrets Manager does not write or cache the secret to persistent storage. Secrets Manager supports cross-account access to secrets. For information on Secrets Manager pricing, see the documentation.

Parameter Store integrates directly with Secrets Manager as a pass-through service for references to Secrets Manager secrets. Use this integration if you prefer using Parameter Store as a consistent solution for calling and referencing secrets across your applications. For more information, see “Referencing AWS Secrets Manager secrets from Parameter Store parameters.”

For an example application to show Secrets Manager functionality, deploy the example detailed in “How to securely provide database credentials to Lambda functions by using AWS Secrets Manager”.

When to retrieve secrets

When Lambda first invokes your function, it creates a runtime environment. It runs the function’s initialization (init) code, which is the code outside the main handler. Lambda then runs the function handler code as the invocation. This receives the event payload and processes your business logic. Subsequent invocations can use the same runtime environment.

You can retrieve secrets during each function invocation from within your handler code. This ensures that the secret value is always up to date, but can lead to increased function duration and cost, as the function calls the secret manager during each invocation. There may also be additional retrieval costs from Secrets Manager.

Retrieving secret during each invocation

You can reduce costs and improve performance by retrieving the secret during the function init process. During subsequent invocations using the same runtime environment, your handler code can use the same secret.

Retrieving secret during function initialization.

The Serverless Land pattern example shows how to retrieve a secret during the init phase using Node.js and top-level await.

If a secret may change between subsequent invocations, ensure that your handler can check for the secret validity and, if necessary, retrieve the secret again.

Retrieve changed secret during subsequent invocation.
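
Here is a minimal Python sketch of this pattern, assuming a Secrets Manager secret named my-secret that stores a JSON string; the secret is fetched once during init, cached in module scope, and fetched again if the cached value stops working (for example, after a rotation). The do_work function and AuthenticationError exception are placeholders for your own business logic and client library:

import json
import boto3

secrets_client = boto3.client('secretsmanager')
SECRET_ID = 'my-secret'  # placeholder name

# Retrieved once per execution environment during the init phase.
cached_secret = json.loads(
    secrets_client.get_secret_value(SecretId=SECRET_ID)['SecretString']
)

def refresh_secret():
    global cached_secret
    cached_secret = json.loads(
        secrets_client.get_secret_value(SecretId=SECRET_ID)['SecretString']
    )
    return cached_secret

def handler(event, context):
    secret = cached_secret
    try:
        do_work(secret)            # placeholder for your business logic
    except AuthenticationError:    # placeholder for your client's auth failure
        # The cached value may be stale; refresh and retry once.
        secret = refresh_secret()
        do_work(secret)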

You can also use Lambda extensions to retrieve secrets from Secrets Manager, cache them, and automatically refresh the cache based on a time value. The extension retrieves the secret from Secrets Manager before the init process and makes it available via a local HTTP endpoint. The function then retrieves the secret from the local HTTP endpoint, rather than directly from Secrets Manager, increasing performance. You can also share the extension with multiple functions, which can reduce function code. The extension handles refreshing the cache based on a configurable timeout value. This ensures that the function has the updated value, without handling the refresh in your function code, which increases reliability.

Using Lambda extensions to cache and refresh secret.

You can deploy the solution using the steps in Cache secrets using AWS Lambda extensions.

Lambda Powertools

Lambda Powertools provides a suite of utilities for Lambda functions to simplify the adoption of serverless best practices. AWS Lambda Powertools for Python and AWS Lambda Powertools for Java both provide a parameters utility that integrates with Secrets Manager.

Python:

from aws_lambda_powertools.utilities import parameters

def handler(event, context):
    # Retrieve a single secret
    value = parameters.get_secret("my-secret")

Java:

import software.amazon.lambda.powertools.parameters.SecretsProvider;
import software.amazon.lambda.powertools.parameters.ParamManager;

public class AppWithSecrets implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
    // Get an instance of the Secrets Provider
    SecretsProvider secretsProvider = ParamManager.getSecretsProvider();

    // Retrieve a single secret
    String value = secretsProvider.get("/my/secret");
}

Rotating secrets

You should rotate secrets to prevent the misuse of your secrets. This helps you to replace long-term secrets with short-term ones, which reduces the risk of compromise.

Secrets Manager has built-in functionality to rotate secrets on demand or according to a schedule. Secrets Manager has native integrations with Amazon RDS, Amazon DocumentDB, and Amazon Redshift, using a Lambda function to manage the rotation process for you. It deploys an AWS CloudFormation stack and populates the function with the Amazon Resource Name (ARN) of the secret. You specify the permissions to rotate the credentials, and how often you want to rotate the secret. You can view and edit Secrets Manager rotation settings in the Secrets Manager console.

Secrets Manager rotation settings

You can also create your own rotation Lambda function for other services.
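
As a sketch, you can also enable scheduled rotation with your own rotation function through the API; the secret and function ARNs below are placeholders:

import boto3

secretsmanager = boto3.client('secretsmanager')

# Placeholders: your secret and the Lambda function that implements the rotation steps.
secretsmanager.rotate_secret(
    SecretId='arn:aws:secretsmanager:eu-west-1:111122223333:secret:my-secret-AbCdEf',
    RotationLambdaARN='arn:aws:lambda:eu-west-1:111122223333:function:my-rotation-function',
    RotationRules={'AutomaticallyAfterDays': 30}
)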

Auditing secrets access

You should continually review how applications are using your secrets to ensure that the usage is as you expect. You should also log any changes to them so you can investigate any potential issues, and roll back changes if necessary.

When using Hashicorp Vault, use Audit devices to log all requests and responses to Vault. Audit devices can append logs to a file, write to syslog, or write to a socket.

Secrets Manager supports logging API calls using AWS CloudTrail. CloudTrail monitors and records all API calls for Secrets Manager as events. This includes calls from code calling the Secrets Manager APIs and access via the Secrets Manager console. CloudTrail data is considered sensitive, so you should use AWS KMS encryption to protect it.

The CloudTrail event history shows the requests to secretsmanager.amazonaws.com.

Viewing CloudTrail access to Secrets Manager

You can use Amazon EventBridge to respond to alerts based on specific operations recorded in CloudTrail. These include secret rotation or deleted secrets. You can also generate an alert if someone tries to use a version of a secret while it is pending deletion. This may help identify and alert you when an outdated certificate is used.
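
Here is a minimal sketch of such a rule, matching selected Secrets Manager API calls recorded by CloudTrail; the rule name and event names are examples to adapt, and you would still need to add a target such as an SNS topic:

import json
import boto3

events = boto3.client('events')

# Match selected Secrets Manager API calls recorded by CloudTrail.
pattern = {
    'source': ['aws.secretsmanager'],
    'detail-type': ['AWS API Call via CloudTrail'],
    'detail': {'eventName': ['DeleteSecret', 'RotateSecret', 'PutSecretValue']}
}

events.put_rule(
    Name='secrets-manager-change-alerts',  # example rule name
    EventPattern=json.dumps(pattern)
)
# Add a target with events.put_targets(...), for example an SNS topic for alerts.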

Securing secrets

You must tightly control access to secrets because of their sensitive nature. Create AWS Identity and Access Management (IAM) policies and resource policies to enable minimal access to secrets. You can use role-based, as well as attribute-based, access control. This can prevent credentials from being accidentally used or compromised. For more information, see “Authentication and access control for AWS Secrets Manager”.

Secrets Manager supports encryption at rest using AWS Key Management Service (AWS KMS) using keys that you manage. Secrets are encrypted in transit using TLS by default, which requires request signing.

You can access secrets from inside an Amazon Virtual Private Cloud (Amazon VPC) without requiring internet access. Use AWS PrivateLink and configure a Secrets Manager specific VPC endpoint.

Do not store plaintext secrets in Lambda environment variables. Ensure that you do not embed secrets directly in function code, commit these secrets to code repositories, or log the secret to CloudWatch.

Conclusion

Using a secrets manager to store secrets such as certificates, API keys or database passwords helps to avoid exposing secrets in application source code. This post highlights some AWS and third-party solutions, such as Hashicorp Vault, to store secrets securely and retrieve them from within your Lambda functions.

Secrets Manager is the preferred AWS solution for storing and managing secrets. I explain when to retrieve secrets, including using Lambda extensions to cache secrets, which can reduce cost and improve performance.

You can use the Lambda Powertools parameters utility, which integrates with Secrets Manager. Rotating secrets reduces the risk of compromise and you can audit secrets using CloudTrail and respond to alerts using EventBridge. I also cover security considerations for controlling access to your secrets.

For more serverless learning resources, visit Serverless Land.

AWS Week In Review – July 18, 2022

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-july-18-2022/

Last week, AWS Summit New York was held in person at the Javits Center with thousands of attendees and over 100 sponsors and partners. During the keynote, Martin Beeby, AWS Principal Developer Advocate, talked about how innovations in cloud infrastructure enable customers to adapt to challenges and seize new opportunities. It included Liz Fong-Jones‘s great migration story of AWS Graviton in Honeycomb and Elliott Cordo‘s story of improving pharmacy experiences using AWS analytics and machine learning services in Capsule.

Watch the full keynote video!

A Recap of AWS Summit NY Announcements
During the keynote, we announced the general availability of some new services:

Amazon Redshift Serverless – This serverless option lets you analyze data at any scale without having to manage data warehouse infrastructure. You can now create multiple serverless endpoints per AWS account and Region using namespaces and workgroups, and enjoy reduced serverless compute costs compared to the preview. To learn more, check out Danilo’s blog post, this demo video, and the latest episode of The Official AWS Podcast. We also introduced new features of row-level security (RLS), which implements fine-grained access to the rows in tables, and automated materialized views to lower query latency for repeatable workloads.

AWS Cloud WAN – This new network service makes it easy to build and operate wide area networks (WAN) that connect your data centers and branch offices, as well as multiple VPCs in multiple AWS Regions. To learn more, read Seb’s blog post.

Amazon DevOps Guru’s Log Anomaly Detection and Recommendations – This new feature identifies anomalies such as increased latency, error rates, and resource constraints within your app and then sends alerts with a description and actionable recommendations for remediation. To learn more, see Donnie’s blog post as a new News Blog writer.

Last Week’s Launches
Here are some other launches that caught my attention last week:

AWS AppConfig, a feature of AWS Systems Manager, makes it easy for customers to quickly and safely configure, validate, and deploy feature flags and application configuration. Now, we have announced AWS AppConfig Extensions, a new capability that allows customers to enhance and extend the capabilities of feature flags and dynamic runtime configuration data.

Available extensions at launch include AppConfig Notification extensions that push messages about configuration updates to Amazon EventBridge, Amazon SNS, Amazon SQS, or a Jira extension to track Feature Flag changes in AppConfig as Atlassian’s Jira issues. To get started, read Announcing AWS AppConfig Extensions and AppConfig Extensions.

Amazon VPC Flow Logs for Transit Gateway is a new capability that allows customers to gain deeper visibility and insights into network traffic on AWS Transit Gateway. With this feature, Transit Gateway can export detailed information, such as source/destination IPs, ports, protocols, traffic counters, timestamps, and various metadata for all of the network flow traversing through the Transit Gateway. To learn more, read Introducing VPC Flow Logs for AWS Transit Gateway and Logging network traffic using Transit Gateway Flow Logs.

AWS Lambda Powertools for TypeScript is an open-source developer library that can help you incorporate Well-Architected Serverless best practices focusing on three observability features: distributed tracing (Tracer), structured logging (Logger), and asynchronous business and application metrics (Metrics). Powertools is also available in the Python and Java programming languages. To learn more, see the blog post Simplifying serverless best practices with AWS Lambda Powertools for TypeScript. You can submit feedback, ideas, and issues directly on our GitHub project.

AWS re:Post is a vibrant Q&A community that helps you become even more successful on AWS. You can now add a profile picture or avatar to your account and add inline images such as diagrams or screenshots to support your questions or answers. Add your profile picture and start using inline images today!

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some news, blog posts, and video series for you to know:

In July 2021, we notified users about the end of support for Internet Explorer 11, which is now approaching on July 31, 2022. The browser will no longer be supported in the AWS Management Console, web-based services such as Amazon QuickSight, Amazon Chime, Amazon Honeycode, and some other AWS websites. After that date, we can no longer guarantee that the features and webpages will function properly on IE 11. For more information, please visit AWS Supported Browsers.

In fall 2021, we began offering a free multi-factor authentication (MFA) security key to AWS account owners in the United States. Now eligible customers can order the free MFA security key through the ordering portal in the AWS Management Console. At this time, only U.S.-based AWS account root users who have spent more than $100 each month over the past 3 months are eligible to place an order. For more information, see our Free MFA Security Key page.

Amazon’s Machine Learning University expands with MLU Explains, a public website containing visual essays that incorporate fun animations and scrollytelling to explain machine learning concepts in an accessible manner. The following animation teaches the concepts of data splitting in machine learning using an example model that attempts to determine whether animals are cats or dogs. To learn more, read the Amazon Science blog post.

This is My Architecture is a video series that showcases innovative architectural solutions on the AWS Cloud by customers and partners. In June and July, over 15 episodes were updated, including GoDaddy, Riot Games, and Hudl. Each episode examines the most interesting and technically creative elements of each cloud architecture.

Upcoming AWS Events in August
Check your calendars and sign up for these AWS events:

AWS Summits – Registration is open for upcoming in-person AWS Summits that might be close to you in August: Sao Paulo (August 3–4), Anaheim (August 18), Taiwan (August 10–11), Chicago (August 28), and Canberra (August 31).

AWS Innovate – Data Edition – On August 23, learn how a modern data strategy can support your present and future use cases, including steps to build an end-to-end data solution to store and access, analyze and visualize, and even predict.

AWS Innovate – For Every Application Edition – On August 25, learn about a wide selection of AWS solutions across compute, storage, networking, hybrid, and edge infrastructure to help you scale application resources seamlessly and optimally.

Although these two Innovate events will be held in Asia Pacific and Japan time zones, you can view on-demand videos for two months following your registration.

If you’re interested in learning modern development practices live in New York City, I recommend joining AWS Solutions Day on August 10. I love that the advanced topics focus on building new web apps with Java, JavaScript, TypeScript, and GraphQL.

If you’re interested in learning AWS fundamentals and preparing for AWS Certifications, there are several virtual events in August, such as AWS Cloud Practitioner Essentials Day, AWS Technical Essentials Day, and Exam Readiness for AWS Certificates.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Choosing the right solution for AWS Lambda external parameters

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/choosing-the-right-solution-for-aws-lambda-external-parameters/

This post is written by Thomas Moore, Solutions Architect, Serverless.

When using AWS Lambda to build serverless applications, customers often need to retrieve parameters from an external source at runtime. This allows you to share parameter values across multiple functions or microservices, providing a single source of truth for updates. A common example is retrieving database connection details from an external source and then using the retrieved hostname, user name, and password to connect to the database:

Lambda function retrieving database credentials from an external source

AWS provides a number of options to store parameter data, including AWS Systems Manager Parameter Store, AWS AppConfig, Amazon S3, and Lambda environment variables. This blog explores the different parameter data that you may need to store. I cover considerations for choosing the right parameter solution and how to retrieve and cache parameter data efficiently within the Lambda function execution environment.

Common use cases

Common parameter examples include:

  • Securely storing secret data, such as credentials or API keys.
  • Database connection details such as hostname, port, and credentials.
  • Schema data (for example, a structured JSON response).
  • TLS certificate for mTLS or JWT validation.
  • Email template.
  • Tenant configuration in a multitenant system.
  • Details of external AWS resources to communicate with such as an Amazon SQS queue URL, Amazon EventBridge event bus name, or AWS Step Functions ARN.

Key considerations

There are a number of key considerations when choosing the right solution for external parameter data.

  1. Cost – how much does it cost to store the data and retrieve it via an API call?
  2. Security – what encryption and fine-grained access control is required?
  3. Performance – what are the retrieval latency requirements?
  4. Data size – how much data is there to store and retrieve?
  5. Update frequency – how often does the parameter change and how does the function handle stale parameters?
  6. Access scope – do multiple functions or services access the parameter?

These considerations help to determine where to store the parameter data and how often to retrieve it.

For example, a 4 KB parameter that updates hourly and is used by hundreds of functions needs to be optimized for low retrieval costs and high performance. Choosing a solution that supports low-cost GET requests at a high transactions-per-second (TPS) rate would be better than one optimized for storing large objects.

AWS service options

There are a number of AWS services available to store external parameter data.

Amazon S3

S3 is an object storage service offering 99.999999999% (11 9s) of data durability and virtually unlimited scalability at low cost. Objects can be up to 5 TB in size in any format, making S3 a good solution to store larger parameter data.

Amazon DynamoDB

Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database designed for single-digit millisecond performance at any scale. Due to the high performance of this service, it’s a great place to store parameters when low retrieval latency is important.

AWS Secrets Manager

AWS Secrets Manager makes it easier to rotate, manage, and retrieve secret data. This makes it the ideal place to store sensitive parameters such as passwords and API keys.

AWS Systems Manager Parameter Store

Parameter Store provides a centralized store to manage configuration data. This data can be plaintext or encrypted using AWS Key Management Service (KMS). Parameters can be tagged and organized into hierarchies for simpler management. Parameter Store is a good default choice for general-purpose parameters in AWS. The standard version (no additional charge) can store parameters up to 4 KB in size and the advanced version (additional charges apply) up to 8 KB.

For a code example using Parameter Store for Lambda parameters, see the Serverless Land pattern.

AWS AppConfig

AppConfig is a capability of AWS Systems Manager to create, manage, and quickly deploy application configurations. AppConfig allows you to validate changes during roll-outs and automatically roll back, if there is an error. AppConfig deployment strategies help to manage configuration changes safely.

AppConfig also provides a Lambda extension to retrieve and locally cache configuration data. This results in fewer API calls and reduced function duration, reducing costs.

AWS Lambda environment variables

You can store parameter data as Lambda environment variables as part of the function’s version-specific configuration. Lambda environment variables are stored during function creation or updates. You can access these variables directly from your code without needing to contact an external source. Environment variables are ideal for parameter values that don’t need updating regularly and help make function code reusable across different environments. However, unlike the other options, values cannot be accessed centrally by multiple functions or services.
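
For example, reading a parameter stored as an environment variable in Python needs no SDK call at all; the variable name here is only an example:

import os

def lambda_handler(event, context):
    # DATABASE_HOST is defined in the function's configuration, not in code.
    database_host = os.environ['DATABASE_HOST']
    # My function code...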

Lambda execution lifecycle

It is worth understanding the Lambda execution lifecycle, which has a number of stages. This helps to decide when to handle parameter retrieval within your Lambda code, including cache management.

Lambda execution lifecycle

When a Lambda function is invoked for the first time, or when Lambda is scaling to handle additional requests, an execution environment is created. The first phase in the execution environment’s lifecycle is initialization (Init), during which the code outside the main handler function runs. This is known as a cold start.

The execution environment can then be re-used for subsequent invocations. This means that the Init phase does not need to run again and only the main handler function code runs. This is known as a warm start.

An execution environment can only run a single invocation at a time. Concurrent invocations require additional execution environments. When a new execution environment is required, this starts a new Init phase, which runs the cold start process.

Caching and updates

Retrieving the parameter during Init

As Lambda execution environments are re-used, you can improve the performance and reduce the cost of retrieving an external parameter by caching the value. Writing the value to memory or the Lambda /tmp file system allows it to be available during subsequent invokes in the same execution environment.

This approach reduces API calls, as they are not made during every invocation. However, this can cause an out-of-date parameter and potentially different values across concurrent execution environments.

The following Python example shows how to retrieve a Parameter Store value outside the Lambda handler function during the Init phase.

import boto3
ssm = boto3.client('ssm', region_name='eu-west-1')
parameter = ssm.get_parameter(Name='/my/parameter')
def lambda_handler(event, context):
    # My function code...
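
To reduce the risk of stale values while keeping most of the savings, a variation is to cache the value with a maximum age and re-fetch it once that age is exceeded. Here is a minimal sketch of this approach:

import time
import boto3

ssm = boto3.client('ssm', region_name='eu-west-1')

cache = {'value': None, 'expires': 0.0}
MAX_AGE_SECONDS = 300  # example refresh interval

def get_parameter():
    # Re-fetch the parameter only when the cached copy is older than MAX_AGE_SECONDS.
    now = time.monotonic()
    if cache['value'] is None or now >= cache['expires']:
        cache['value'] = ssm.get_parameter(Name='/my/parameter')
        cache['expires'] = now + MAX_AGE_SECONDS
    return cache['value']

def lambda_handler(event, context):
    parameter = get_parameter()
    # My function code...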

Retrieving the parameter on every invocation

Another option is to retrieve the parameter during every invocation by making the API call inside the handler code. This keeps the value up to date, but can lead to higher retrieval costs and longer function durations due to the added API call during every invocation.

The following Python example shows this approach:

import boto3
ssm = boto3.client('ssm', region_name='eu-west-1')
def lambda_handler(event, context):
    parameter = ssm.get_parameter(Name='/my/parameter')
    # My function code...

Using AWS AppConfig Lambda extension

AppConfig allows you to retrieve and cache values from the service using a Lambda extension. The extension retrieves the values and makes them available via a local HTTP server. The Lambda function then queries the local HTTP server for the value. The AppConfig extension refreshes the values at a configurable poll interval, which defaults to 45 seconds. This improves performance and reduces costs, as the function only needs to make a local HTTP call.

The following Python code example shows how to access the cached parameters.

import urllib.request

def lambda_handler(event, context):
    # Replace application_name, environment_name, and configuration_name with
    # your AppConfig resource names. The extension listens on port 2772 by default.
    url = 'http://localhost:2772/applications/application_name/environments/environment_name/configurations/configuration_name'
    config = urllib.request.urlopen(url).read()
    # My function code...

For caching secret values using a Lambda extension local HTTP cache and AWS Secrets Manager, see the AWS Prescriptive Guidance documentation.

Using Lambda Powertools for Python or Java

Lambda Powertools for Python or Lambda Powertools for Java contains utilities to manage parameter caching. You can configure the cache interval, which defaults to 5 seconds. Supported parameter stores include Secrets Manager, AWS Systems Manager Parameter Store, AppConfig, and DynamoDB. You also have the option to bring your own provider. The following example shows the Powertools for Python parameters utility retrieving a single value from Systems Manager Parameter Store.

from aws_lambda_powertools.utilities import parameters
def handler(event, context):
    value = parameters.get_parameter("/my/parameter")
    # My function code…
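
To change how long a value is cached, the utility accepts a max_age argument (in seconds). The following sketch assumes a five-minute cache; check the Powertools documentation for the options supported in your version.

from aws_lambda_powertools.utilities import parameters

def handler(event, context):
    # Cache the retrieved value for 300 seconds instead of the default 5 seconds
    value = parameters.get_parameter("/my/parameter", max_age=300)
    # My function code...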

Security

Parameter security is a key consideration. You should evaluate encryption at rest, in-transit, private network access, and fine-grained permissions for each external parameter solution based on the use case.

All services highlighted in this post support server-side encryption at rest, and you can choose to use AWS KMS to manage your own keys. When accessing parameters using the AWS SDK and CLI tools, connections are encrypted in transit using TLS by default. You can also enforce a minimum of TLS 1.2 for most of these services.

To access parameters from inside an Amazon Virtual Private Cloud (Amazon VPC) without internet access, you can use AWS PrivateLink and create a VPC endpoint for each service. All the services mentioned in this post support AWS PrivateLink connections.
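
As an illustration, the following sketch creates an interface VPC endpoint for Systems Manager using boto3. The VPC, subnet, and security group IDs are placeholders, and the service name varies by Region.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# Placeholder IDs for illustration only
response = ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-0123456789abcdef0',
    ServiceName='com.amazonaws.eu-west-1.ssm',
    SubnetIds=['subnet-0123456789abcdef0'],
    SecurityGroupIds=['sg-0123456789abcdef0'],
    PrivateDnsEnabled=True
)
print(response['VpcEndpoint']['VpcEndpointId'])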

Use AWS Identity and Access Management (IAM) policies to manage which users or roles can access specific parameters.

General guidance

This blog post explores a number of considerations when using an external source for Lambda parameters. The right solution depends on your use case, but there are some general guidelines for selecting an AWS service:

  • For general-purpose low-cost parameters, use AWS Systems Manager Parameter Store.
  • For single function, small parameters, use Lambda environment variables.
  • For secret values that require automatic rotation, use AWS Secrets Manager.
  • When you need a managed cache, use the AWS AppConfig Lambda extension or Lambda Powertools for Python/Java.
  • For items larger than 400 KB, use Amazon S3.
  • When access frequency is high, and low latency is required, use Amazon DynamoDB.

Conclusion

External parameters provide a central source of truth across distributed systems, allowing for efficient updates and code reuse. This blog post highlights a number of considerations when using external parameters with Lambda to help you choose the most appropriate solution for your use case.

Consider how you cache and reuse parameters inside the Lambda execution environment. Doing this correctly can help you reduce costs and improve the performance of your Lambda functions.

There are a number of services to choose from to store parameter data. These include DynamoDB, S3, Parameter Store, Secrets Manager, AppConfig, and Lambda environment variables. Each comes with a number of advantages, depending on the use case. This blog guidance, along with the AWS documentation and Service Quotas, can help you select the most appropriate service for your workload.

For more serverless learning resources, visit Serverless Land.

Building Blue/Green application deployment to Micro Focus Enterprise Server

Post Syndicated from Kevin Yung original https://aws.amazon.com/blogs/devops/building-blue-green-application-deployment-to-micro-focus-enterprise-server/

Organizations running mainframe production workloads often follow the traditional approach of application deployment. To release new features of existing applications into production, the application is redeployed using the new version of software on the existing infrastructure. This poses the following challenges:

  • The cutover of the application deployment from testing to production usually takes place during a planned outage window with associated downtime.
  • Rollback is difficult, since the earlier version of the software must be redeployed from scratch on the existing infrastructure. This may result in applications being unavailable for longer durations owing to the rollback.
  • Due to differences in testing and production environments, some defects may leak into production, affecting the application code quality and thus increasing the number of production outages.

Automated, robust application deployment is recognized as a prime driver for moving from a Mainframe to AWS, as service stability, security, and quality can be better managed. In this post, you will learn how to build Blue/Green (zero-downtime) deployments for mainframe applications rehosted to Micro Focus Enterprise Server with AWS Developer Tools (AWS CodeBuild, CodePipeline, and CodeDeploy).

This is a continuation of our previous post “Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite”. In our last post, we explained how you can implement a pattern for continuous integration and testing of mainframe applications with AWS Developer tools and Micro Focus Enterprise Suite. If you haven’t already checked it out, then we strongly recommend that you read through it before proceeding to the rest of this post.

Overview of solution

In this section, we explain the three important design “ingredients” to be implemented in the overall solution:

  1. Implementation of Enterprise Server Performance and Availability Cluster (PAC)
  2. End-to-end design of CI/CD pipeline for multiple teams development
  3. Blue/green deployment process for a rehosted mainframe application

First, let’s look at the solution design for the Micro Focus Enterprise Server PAC cluster.

Overview of Micro Focus Enterprise Server Performance and Availability Cluster (PAC)

In the Blue/Green deployment solution, Micro Focus Enterprise Server is the hosting environment for mainframe applications with the software installed into Amazon EC2 instances. Application deployment in Amazon EC2 Auto Scaling is one of the critical requirements to build a Blue/Green deployment. Micro Focus Enterprise Server PAC technology is the feature that allows for the Auto Scaling of Enterprise Server instances. For details on how to build Micro Focus Enterprise PAC Cluster with Amazon EC2 Auto Scaling and Systems Manager, see our AWS Prescriptive Guidance document. An overview of the infrastructure architecture is shown in the following figure, and the list below describes the components in the architecture.

Infrastructure architecture overview for blue/green application deployment to Micro Focus Enterprise Server

  • Micro Focus Enterprise Servers: Deploy applications to the Micro Focus Enterprise Server PAC running in an Amazon EC2 Auto Scaling group.
  • Micro Focus Enterprise Server Common Web Administration (ESCWA): Manage the Micro Focus Enterprise Server PAC with the ESCWA server, for example, adding or removing an Enterprise Server to or from a PAC.
  • Relational database for user and system data files: Set up an Amazon Aurora instance in Multi-AZ to host both user and system data files shared across the Enterprise Server instances.
  • Micro Focus Enterprise Server Scale-Out Repository (SOR): Set up an Amazon ElastiCache for Redis instance with replicas in Multi-AZ to host user data.
  • Application endpoint and load balancer: Set up a Network Load Balancer to provide a hostname for end users to connect to the application, for example, through a 3270 emulator.

CI/CD Pipelines design supporting multi-streams of mainframe development

In a previous DevOps post, Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite, we introduced two levels of pipelines. The first level of pipeline is used by mainframe project teams to test project scope changes. The second level of the pipeline is used for system integration tests, where the pipeline will perform tests for all of the promoted changes from the project pipelines and perform extensive systems tests.

In this post, we extend the two-level pipeline with a production deployment pipeline. When system testing completes successfully, the tested application artifacts are promoted to the production pipeline in preparation for the live production release. The following figure depicts the three levels of the CI/CD pipeline and the purpose of each stage.

Different levels of CI/CD pipeline - Project Team Pipeline, Systems Test Pipeline and Production Deployment Pipeline

Let’s look at the promotion of artifacts to the production pipeline in greater detail. The Systems Test Pipeline promotes the tested artifacts in binary format into an Amazon S3 bucket, and the S3 event triggers the production pipeline to start. This artifact promotion process can be gated using a manual approval action in CodePipeline. For customers who want fully automated continuous deployment, the manual approval step can be removed.

The following diagram shows the AWS Stages in AWS CodePipeline of the production deployment pipeline:

Stages in production deployment pipeline using AWS CodePipeline

After the production pipeline starts, it downloads the new version of the artifact from the S3 bucket. For details on how to set up an S3 bucket as a source for CodePipeline, see the AWS CodePipeline documentation on using Amazon S3 as a source.

In the following section, we explain each of these pipeline stages in detail:

  1. Prepare and package a new version of the production configuration artifacts, for example, the Micro Focus Enterprise Server config file and blue/green deployment scripts.
  2. Use a CodeBuild project to kick off an application blue/green deployment with AWS CodeDeploy.
  3. Use a manual approval gate to wait for an operator to validate the new version of the application and approve continuing the production traffic switch.
  4. Continue the blue/green deployment by allowing traffic to the new version of the application and block the traffic to the old version.
  5. After a successful Blue/Green switch and deployment, tag the production version in the code repository.

Now that you’ve seen the pipeline design, we will dive deep into the details of the blue/green deployment with AWS CodeDeploy.

Blue/green deployment with AWS CodeDeploy

In the blue/green deployment, we used the technique of swapping the Auto Scaling group behind an Elastic Load Balancer. Refer to the AWS Blue/Green deployment whitepaper for the details of the technique. As AWS CodeDeploy is a fully managed service that automates software deployment, it is used to automate the entire blue/green process.

Firstly, the following best practices are applied to set up the Enterprise Server’s infrastructure:

  1. EC2 Image Builder is used to install the Micro Focus Enterprise Server software and the AWS CodeDeploy agent into an Amazon Machine Image (AMI). Create an EC2 launch template with the Enterprise Server AMI ID.
  2. A Network Load Balancer is used to set up a TCP connection health check to validate that Micro Focus Enterprise Server is listening on the required ports, e.g., port 9270, so that connectivity is available for 3270 emulators.
  3. A script was created to confirm application deployment validity in each EC2 instance. This is achieved by using a PowerShell script that triggers a CICS transaction from the Micro Focus Enterprise Server command line interface.

In the CodePipeline, we created a CodeBuild project to create a new deployment with CodeDeploy. We will go into the details of the CodeBuild buildspec.yaml configuration.

In the pre_build section of the CodeBuild buildspec.yaml, CodeBuild performs two steps:

  1. Create an initial Amazon EC2 Auto Scaling group using the Micro Focus Enterprise Server AMI and launch template for the first-time deployment of the application.
  2. Use the AWS CLI to store the initial Auto Scaling group name in Systems Manager Parameter Store; CodeDeploy later uses it to create a copy during the blue/green deployment.

In the build stage, the buildspec will perform the following steps:

  1. Retrieve the Auto Scaling Group name of the Enterprise Servers from the Systems Manager Parameter Store.
  2. Then, a blue/green deployment configuration is created for the deployment group of the application. In the AWS CLI command, we use the WITH_TRAFFIC_CONTROL option so that we can manually verify and approve the change before switching traffic to the new version of the application. The command snippet is shown here.
BlueGreenConf=\
        "terminateBlueInstancesOnDeploymentSuccess={action=TERMINATE}"\
        ",deploymentReadyOption={actionOnTimeout=STOP_DEPLOYMENT,waitTimeInMinutes=600}"\
        ",greenFleetProvisioningOption={action=COPY_AUTO_SCALING_GROUP}"

DeployType="BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL"

/usr/local/bin/aws deploy update-deployment-group \
      --application-name "${APPLICATION_NAME}" \
      --current-deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
      --auto-scaling-groups "${AsgName}" \
      --load-balancer-info targetGroupInfoList=[{name="${TARGET_GROUP_NAME}"}] \
      --deployment-style "deploymentType=$DeployType" \
      --blue-green-deployment-configuration "$BlueGreenConf"
  3. Next, the new version of the application binary is released from the CodeBuild source DemoBin to the production S3 bucket.
release="bankdemo-$(date '+%Y-%m-%d-%H-%M').tar.gz"
RELEASE_FILE="s3://${PRODUCTION_BUCKET}/${release}"

/usr/local/bin/aws deploy push \
    --application-name ${APPLICATION_NAME} \
    --description "version - $(date '+%Y-%m-%d %H:%M')" \
    --s3-location ${RELEASE_FILE} \
    --source ${CODEBUILD_SRC_DIR_DemoBin}/
  4. Create a new deployment for the application to initiate the blue/green switch.
/usr/local/bin/aws deploy create-deployment \
    --application-name ${APPLICATION_NAME} \
    --s3-location bucket=${PRODUCTION_BUCKET},key=${release},bundleType=zip \
    --deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
    --description "Bankdemo Production Deployment ${release}"\
    --query deploymentId \
    --output text

After setting up the deployment options, the following is a snapshot of a deployment configuration from the AWS Management Console.

Snapshot of deployment configuration from AWS Management Console

In the AWS Post “Under the Hood: AWS CodeDeploy and Auto Scaling Integration”, we explain how AWS CodeDeploy sets up Auto Scaling lifecycle hooks to listen for Auto Scaling events. In the event of an EC2 instance launch and termination, AWS CodeDeploy can instruct its agent in the instance to run the prepared scripts.

The following list shows each lifecycle hook used in the blue/green deployment and the tasks that run in it.

  • BeforeInstall: Create the application folder structures in the newly launched Amazon EC2 instance and prepare for installation.
  • AfterInstall: Enable the Windows Firewall rule for application traffic, activate the Micro Focus license using the License Server, prepare the production database connections, import the config to create the Region in Micro Focus Enterprise Server, and deploy the latest application binaries into each of the Micro Focus Enterprise Servers.
  • ApplicationStart: Use the AWS CLI to start a Systems Manager Automation “Scale-Out” runbook with the ESCWA server as the target. The runbook adds the newly launched Micro Focus Enterprise Server instance into a PAC and starts the imported Region in the newly launched server. The hook then validates that the application is listening on a service port (for example, port 9270) and uses the Micro Focus command “castran” to run an online transaction in Micro Focus Enterprise Server to validate the service status.
  • AfterBlockTraffic: Use the AWS CLI to start a Systems Manager Automation “Scale-In” runbook with the target ESCWA server. The runbook tries stopping the Region in the terminating EC2 instance and then removes the Enterprise Server instance from the PAC.

These tasks are automated using PowerShell, and the scripts are referenced in the appspec.yml config that CodeDeploy uses to orchestrate the deployment.

In the following appspec.yml, the locations of the binary files to be installed are defined, in addition to the Micro Focus Enterprise Server Region XML config file. During the AfterInstall stage, the XML config is imported into the Enterprise Server.

version: 0.0
os: windows
files:
  - source: scripts
    destination: C:\scripts\
  - source: online
    destination: C:\BANKDEMO\online\
  - source: common
    destination: C:\BANKDEMO\common\
  - source: batch
    destination: C:\BANKDEMO\batch\
  - source: scripts\BANKDEMO.xml
    destination: C:\BANKDEMO\
hooks:
  BeforeInstall: 
    - location: scripts\BeforeInstall.ps1
      timeout: 300
  AfterInstall: 
    - location: scripts\AfterInstall.ps1    
  ApplicationStart:
    - location: scripts\ApplicationStart.ps1
      timeout: 300
  ValidateService:
    - location: scripts\ValidateServer.cmd
      timeout: 300
  AfterBlockTraffic:
    - location: scripts\AfterBlockTraffic.ps1

Using the sample Micro Focus Bankdemo application and the steps outlined above, we have set up a blue/green deployment process in Micro Focus Enterprise Server.

There are four important considerations when setting up blue/green deployment:

  1. For batch applications, the blue/green deployment should be invoked only outside of the scheduled “batch window”.
  2. For online applications, AWS CodeDeploy will deregister the Auto Scaling group from the target group of the Network Load Balancer. The deregistration may take a while, as the server has to finish processing ongoing requests before it can continue deployment of the new application instance. In this case, enabling the Elastic Load Balancing connection draining feature with an appropriate timeout value can minimize the risk of closing unfinished transactions. In addition, consider deploying during low-traffic windows to improve deployment speed.
  3. For application changes that require updates to the database schema, version roll-forward and rollback can be managed via database migration tools, e.g., Flyway and Fluent Migrator.
  4. For testing in production environments, adherence to any regulatory compliance, such as full audit trail of events, must be considered.

Conclusion

In this post, we introduced the solution to use Micro Focus Enterprise Server PAC, Amazon EC2 Auto Scaling, AWS Systems Manager, and AWS CodeDeploy to automate the blue/green deployment of rehosted mainframe applications in AWS.

Through the blue/green deployment methodology, we can shift traffic between two identical clusters running different application versions in parallel. This mitigates the risks commonly associated with mainframe application deployment, namely downtime and difficult rollback, while ensuring higher code quality in production through “Shift Right” testing.

A demo of the solution is available on the AWS Partner Micro Focus website [Solution-Demo]. If you’re interested in modernizing your mainframe applications, then please contact Micro Focus and AWS mainframe business development at [email protected].

Additional Information

About the authors

Kevin Yung

Kevin is a Senior Modernization Architect in the AWS Professional Services Global Mainframe and Midrange Modernization (GM3) team. He currently focuses on leading and delivering mainframe and midrange application modernization for large enterprise customers.

Krithika Palani Selvam

Krithika is a Senior Modernization Architect in the AWS Professional Services Global Mainframe and Midrange Modernization (GM3) team. She currently works with enterprise customers to migrate and modernize mainframe and midrange applications to the cloud.

Peter Woods

Peter Woods has been with Micro Focus for over 30 years, within the Application Modernisation & Connectivity portfolio. His diverse range of roles has included Technical Support, Channel Sales, Product Management, Strategic Alliances Management, and Pre-Sales, primarily based in the UK. In 2017 Peter relocated to Melbourne, Australia, and in his current role of AM2C APJ Regional Technical Leader and ANZ Pre-Sales Manager, he is charged with driving and supporting Application Modernisation sales activity across the APJ region.

Abraham Mercado Rondon

Abraham Rondon is a Solutions Architect working on Micro Focus Enterprise Solutions for the Application Modernization team based in Melbourne. After completing a degree in Statistics and before joining Micro Focus, Abraham had a long career supporting mainframe applications in different countries, in progressive roles from Developer to Production Support, Business and Technical Analyst, and Project Team Lead. Now a vital part of the Micro Focus Application Modernization team, one of his main focuses is cloud implementations of mainframe DevOps and production workload rehosting.

Volotea MRO Modernization in AWS

Post Syndicated from Albert Capdevila original https://aws.amazon.com/blogs/architecture/volotea-mro-modernization-in-aws/

Volotea is one of the fastest growing independent airlines in Europe, and has increased its fleet, routes, and number of available seats year over year. Volotea has already transported more than 30 million passengers across Europe since 2012, and has bases in 16 European capitals.

The maintenance, repair, and overhaul (MRO) application is a critical system for every airline. It’s used to manage the maintenance, repair, service, and inspection of aircraft. The main goal of an MRO application is to ensure the safety and airworthiness of the aircraft. Traditionally, those systems have been based on monolithic, packaged applications. However, these are difficult to scale and do not offer the benefit of elasticity to adapt to changing demand. Volotea migrated to Amazon Web Services (AWS) to modernize their MRO without refactoring the code. In this blog post, we’ll show you an architecture solution that can be applied to modernize an MRO (or similarly packaged monolithic application) without refactoring, and discuss some considerations.

The challenges with an on-premises MRO solution

Volotea’s MRO software previously ran in an on-premises data center. The system was based on Windows, an outdated database engine, and a virtual desktop system based on Citrix. Costs were fixed, yet MRO usage is typically seasonal. All the interfaces with other systems were based on an outdated communications protocol. This presented security concerns, especially considering that ransomware attacks are an increasing threat.

The main challenge for Volotea was adapting the MRO system to changing business requirements. Seasonal workloads and high impact projects, like changing fleets from Boeing to Airbus, require flexibility. The company also needed to adapt to the changing protocols necessitated by the COVID-19 pandemic, as airlines are one of the most impacted industries in Europe.

Volotea needed to modernize the operating system (OS) and database, simplify the end user application access, and increase the overall platform security, including integration with other applications.

Modernizing the MRO without refactoring

Following Volotea’s cloud strategy, the MRO system was migrated in 2 months to AWS to reduce technology costs and gain higher operational performance, availability, security, and flexibility. The migration was not simply based on a lift-and-shift approach, but used an existing AWS reference architecture for the MRO system. This reference architecture incorporates AWS managed services to modernize the application without incurring refactoring costs.

Figure 1. Volotea MRO deployment in a multi-account architecture

As shown in the high-level architecture in Figure 1:

  1. Volotea migrated their servers to Amazon EC2 instances based on Linux to minimize OS costs. The database management system now uses an open source engine. These changes save more than €10K yearly in licenses.
  2. The user access technology was migrated to Amazon AppStream 2.0. This is a managed service with increased security, elasticity, and flexibility compared to traditional virtual desktop infrastructure (VDI) solutions. Volotea aligned the cost with the real usage and decreased the TCO by configuring Auto Scaling fleets, reducing the workplace costs by 50%.
  3. AWS Transfer Family was used to centralize the information exchanged with third-party applications, while increasing the security of the communication channel. This managed service enabled the migration of the SFTP, FTPS, and FTP interfaces without the need to manage servers.
  4. To modernize the access of the MRO administrators, AWS Systems Manager Session Manager was used. This provided an ideal browser-based shell access without requiring bastion hosts or opening SSH ports in the Amazon EC2 instances.
  5. The AWS services were linked to Volotea’s user directory using AWS Single Sign-On. This allowed users to authenticate with their corporate credentials, decreasing maintenance costs, and increasing the security.

The application was deployed in Volotea’s AWS Landing Zone, which included the following services:

To make the systems management homogeneous, AWS Systems Manager and AWS Backup offered a single management point for the backup policies, system inventory, and patching.

Incorporating high availability to the MRO

Once this initial modernization is finished, Volotea will use the AWS reference architecture for high availability (HA) to increase resiliency. They’ll configure Amazon EC2 Auto Scaling with application failover to another Availability Zone, along with the database’s native replication mechanisms. Elastic IP addresses will be used to remap the endpoints in a failover scenario. This architecture can be easily implemented in AWS to incorporate HA into applications that do not natively support horizontal scaling.
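
As an illustration of the Elastic IP remapping step, the following sketch reassociates an existing allocation with a standby instance using boto3. The allocation and instance IDs are placeholders, and in practice a call like this would run inside the failover automation rather than by hand.

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')

# Placeholder IDs for illustration only
ec2.associate_address(
    AllocationId='eipalloc-0123456789abcdef0',
    InstanceId='i-0123456789abcdef0',
    AllowReassociation=True  # take over the address from the failed instance
)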

Conclusion

Volotea successfully modernized its MRO software, which has given them greater flexibility, elasticity, and the increased security of AWS services. They intend to continue with their digital transformation journey. Volotea is increasing its capacity to innovate faster to deliver new digital services more efficiently and with reduced IT costs. The AWS services and strategies discussed in this blog post can be applied to other similarly packaged applications to implement a first level of modernization with little effort and low migration risk.

New – Securely manage your AWS IoT Greengrass edge devices using AWS Systems Manager

Post Syndicated from Sean M. Tracey original https://aws.amazon.com/blogs/aws/new-securely-manage-your-aws-iot-greengrass-edge-devices-using-aws-systems-manager/

In 2020, we launched AWS IoT Greengrass 2.0, an open-source edge runtime and cloud service for building, deploying, and managing device software and applications. Today, we’re very excited to announce the ability to securely manage your AWS IoT Greengrass edge devices using AWS Systems Manager (SSM).

Managing vast fleets of varying systems and applications remotely can be a challenge for administrators of edge devices. AWS IoT Greengrass was built to enable these administrators to manage their edge device application stack. While this addressed the needs of many typical edge device administrators, system software on these devices still needed to be updated and maintained through operational policies consistent with those of their broader IT organizations. To this end, administrators would typically have to build or integrate tools to create a centralized interface for managing their edge and IT device software stacks – from security updates, to remote access, and operating system patches.

Until today, IT administrators have had to build or integrate custom tools to make sure edge devices can be managed alongside EC2 and on-prem instances, through a consistent set of policies. At scale, managing device and systems software across a wide variety of edge and IT systems becomes a significant investment in time and money. This is time that could be better spent deploying, optimizing, and managing the very edge devices that they’re maintaining.

What’s New?
Today, we have integrated IoT Greengrass and Systems Manager to simplify the management and maintenance of system software for edge devices. When coupled with the AWS IoT Greengrass Client Software, edge device administrators can now remotely access and securely manage the multitude of devices that they own – from OS patching to application deployments. Additionally, regularly scheduled operations that maintain edge compute systems can be automated, all without the need to create additional custom processes. For IT administrators, this release gives a complete overview of all of their devices through a centralized interface, and a consistent set of tools and policies with AWS Systems Manager.

For customers new to the AWS IoT Greengrass platform, the integration with Systems Manager simplifies setup even further, with a new onboarding wizard that can reduce the time it takes to create operational management systems for edge devices from weeks to hours.

How is this achieved?
This new capability is enabled by the AWS Systems Manager (SSM) Agent. As of today, customers can deploy the AWS Systems Manager Agent, via the AWS IoT Greengrass console, to their existing edge devices. Once installed on each device, AWS Systems Manager will list all of the devices in the Systems Manager console, thereby giving administrators and IoT stakeholders an overview of their entire fleet. When coupled with the AWS IoT Greengrass console, administrators can manage their newly configured devices remotely: patching or updating operating systems, troubleshooting, and deploying new applications, all through a centralized, integrated user interface. Devices can be patched individually, or in groups organized by tags or resource groups.

Further information
These new features are now available in all regions where AWS Systems Manager and AWS IoT Greengrass are available. To get started, please visit the IoT Greengrass home page.

Build Your Own Game Day to Support Operational Resilience

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/build-your-own-game-day-to-support-operational-resilience/

Operational resilience is your firm’s ability to provide continuous service through people, processes, and technology that are aware of and adaptive to constant change. Downtime of your mission-critical applications can not only damage your reputation, but can also make you liable to multi-million-dollar financial fines.

One way to test operational resilience is to simulate life-like system failures. An effective way to do this is by running events in your organization known as game days. Game days test systems, processes, and team responses and help evaluate your readiness to react and recover from operational issues. The AWS Well-Architected Framework recommends game days as a key strategy to develop and operate highly resilient systems because they focus not only on technology resilience issues but identify people and process gaps.

This blog post will explain how you can apply game day concepts to your workloads to help achieve a highly resilient workload.

Why does operational resilience matter from a regulatory perspective?

In March 2021, the Bank of England, Prudential Regulation Authority, and Financial Conduct Authority published their Building operational resilience: Feedback to CP19/32 and final rules policy. In this policy, operational resilience refers to a firm’s ability to prevent, adapt, and respond to and return to a steady system state when a disruption occurs. Further, firms are expected to learn and implement process improvements from prior disruptions.

This policy will not apply to everyone. However, across the board if you don’t establish operational resilience strategies, you are likely operating at an increased risk. If you have a service disruption, you may incur lost revenue and reputational damage.

What does it mean to be operationally resilient?

The final policy provides guidance on how firms should achieve operational resilience, which includes but is not limited to the following:

  • Identify and prioritize services based on the potential of intolerable harm to end consumers or risk to market integrity.
  • Define appropriate maximum impact tolerance of an important business service. This is reviewed annually using metrics to measure impact tolerance and answers questions like, “How long (in hours) can a service be offline before causing intolerable harm to end consumers?”
  • Document a complete view of all the aspects required to deliver each important service. This includes people, processes, technology, facilities, and information (resources). Firms should also test their ability to remain within the impact tolerances and provide assurance of resilience along with areas that need to be addressed.

What is a game day?

The AWS Well-Architected Framework defines a game day as follows:

“A game day simulates a failure or event to test systems, processes, and team responses. The purpose is to actually perform the actions the team would perform as if an exceptional event happened. These should be conducted regularly so that your team builds “muscle memory” on how to respond. Your game days should cover the areas of operations, security, reliability, performance, and cost.

In AWS, your game days can be carried out with replicas of your production environment using AWS CloudFormation. This enables you to test in a safe environment that resembles your production environment closely.”

Running game days that simulate system failure helps your organization evaluate and build operational resilience.

How can game days help build operational resilience?

Running a game day alone is not sufficient to ensure operational resilience. However, by navigating the following process to set up and perform a game day, you will establish a best practice-based approach for operating resilient systems.

Stage 1 – Identify key services

As part of setting up a game day event, you will catalog and identify business-critical services.

Game days are performed to test services where operational failure could result in significant financial, customer, and/or reputational impact to the firm. Game days can also evaluate other key factors, like the impact of a failure on the wider market where your firm operates.

For example, a firm may identify its digital banking mobile application from which their customers can initiate payments as one of its important business services.

Stage 2 – Map people, process, and technology supporting the business service

Game days are holistic events. To get a full picture of how the different aspects of your workload operate together, you’ll generate a detailed map of people and processes as they interact and operate the technical and non-technical components of the system. This mapping also helps your end consumers understand how you will provide them reliable support during a failure.

Stage 3 – Define and perform failure scenarios

Systems fail, and failures often happen when a system is operating at scale because various services working together can introduce complexity. To ensure operational resilience, you must understand how systems react and adapt to failures. To do this, you’ll identify and perform failure scenarios so you can understand how your systems will react and adapt and build “muscle memory” for actual events.

AWS builds to guard against outages and incidents, and accounts for them in the design of AWS services—so when disruptions do occur, their impact on customers and the continuity of services is as minimal as possible. At AWS, we employ compartmentalization throughout our infrastructure and services. We have multiple constructs that provide different levels of independent, redundant components.

Stage 4 – Observe and document people, process, and technology reactions

In running a failure scenario, you’ll observe how technological and non-technological components react to and recover from failure. This helps you identify failures and fix them as they cascade through impacted components across your workload. This also helps identify technical and operational challenges that might not otherwise be obvious.

Stage 5 – Conduct lessons learned exercises

Game days generate information on people, processes, and technology and also capture data on customer impact, incident response and remediation timelines, contributing factors, and corrective actions. By incorporating these data points into the system design process, you can implement continuous resilience for critical systems.

How to run your own game day in AWS

You may have heard of AWS GameDay events. These are AWS-organized events for our customers. In these team-based events, AWS provides temporary AWS accounts running fictional systems. Failures are injected into these systems, and teams work together on completing challenges and improving the system architecture.

However, the methods, tooling, and principles we use to conduct AWS GameDays are agnostic and can be applied to your systems using the following services:

  • AWS Fault Injection Simulator is a fully managed service that runs fault injection experiments on AWS, which makes it easier to improve an application’s performance, observability, and resiliency.
  • Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health.
  • AWS X-Ray helps you analyze and debug production and distributed applications (such as those built using a microservices architecture). X-Ray helps you understand how your application and its underlying services are performing to identify and troubleshoot the root cause of performance issues and errors.

Please note you are not limited to the tools listed for simulating failure scenarios. For complete coverage of failure scenarios, we encourage you to explore additional tools and strategies.

Figure 1 shows a reference architecture example that demonstrates conducting a game day for an Open Banking implementation.

Figure 1. Game day reference architecture example

Game day operators use Fault Injection Simulator to catalog and perform failure scenarios to be included in your game day. For example, in our Open Banking use case in Figure 1, a failure scenario might be for the business API functions servicing Open Banking requests to abruptly stop working. You can also combine such simple failure scenarios into a more complex one with failures injected across multiple components of the architecture.
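
Once a failure scenario has been defined as a Fault Injection Simulator experiment template, a game day operator can start it programmatically. The following sketch uses boto3 with a placeholder experiment template ID and illustrative tag values.

import boto3

fis = boto3.client('fis')

# 'EXPT123456789012345' is a placeholder experiment template ID
experiment = fis.start_experiment(
    experimentTemplateId='EXPT123456789012345',
    tags={'GameDay': 'open-banking-api-failure'}
)
print(experiment['experiment']['id'])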

Game day participants use CloudWatch, X-Ray, and their own custom observability and monitoring tooling to identify failures as they cascade through systems.

As you go through the process of identifying, communicating, and fixing issues, you’ll also document impact of failures on end-users. From there, you’ll generate lessons learned to holistically improve your workload’s resilience.

Conclusion

In this blog, we discussed the significance of ensuring operational resilience. We demonstrated how to set up game days and how they can supplement your efforts to ensure operational resilience. We discussed how using AWS services such as Fault Injection Simulator, X-Ray, and CloudWatch can be used to facilitate and implement game day failure scenarios.

Ready to get started? For more information, check out our AWS Fault Injection Simulator User Guide.

Related information:

How to automate incident response to security events with AWS Systems Manager Incident Manager

Post Syndicated from Sumit Patel original https://aws.amazon.com/blogs/security/how-to-automate-incident-response-to-security-events-with-aws-systems-manager-incident-manager/

Incident response is a core security capability for organizations to develop, and a core element in the AWS Cloud Adoption Framework (AWS CAF). Responding to security incidents quickly is important to minimize their impacts. Automating incident response helps you scale your capabilities, rapidly reduce the scope of compromised resources, and reduce repetitive work by your security team.

In this post, I show you how to use Incident Manager, a capability of AWS Systems Manager, to build an effective automated incident management and response solution to security events.

You’ll walk through three common security-related events and how you can use Incident Manager to automate your response.

  • AWS account root user activity: An Amazon Web Services (AWS) account root user has full access to all your resources for all AWS services, including billing information. It’s therefore essential to follow the best practice of using the root user only to create your first IAM user, to securely lock away the root user credentials, and to use them for only a few account and service management tasks. It is also critical to be aware when root user activity occurs in your AWS account.
  • Amazon GuardDuty high severity findings: Amazon GuardDuty is a threat detection service that continuously monitors for malicious or unauthorized behavior to help protect your AWS accounts and workloads. In this blog post, you’ll learn how to initiate an incident response plan whenever a high severity finding is discovered.
  • AWS Config rule change and S3 bucket allowing public access: AWS Config enables continuous monitoring of your AWS resources, making it simple to assess, audit, and record resource configurations and changes. You will use AWS Config to monitor your Amazon Simple Storage Service (S3) bucket ACLs and policies for settings that allow public read or public write access.

Prerequisites

If this is your first time using Incident Manager, follow the initial onboarding steps in Getting prepared with Incident Manager.

Incident Manager can start managing incidents automatically using Amazon CloudWatch or Amazon EventBridge. For the solution in this blog post, you will use EventBridge to capture events and start an incident.

To complete the steps in this walkthrough, you need the following:

Create an Incident Manager response plan

A response plan ties together the contacts, escalation plan, and runbook. When an incident occurs, a response plan defines who to engage, how to engage, which runbook to initiate, and which metrics to monitor. By creating a well-defined response plan, you can save your security team time down the road.

Add contacts

Your contacts should include everyone who might be involved in the incident. Follow these steps to add a contact.

To add contacts

  1. Open the AWS Management Console, and then go to Systems Manager within the console, expand Operations Management, and then expand Incident Manager.
  2. Choose Contacts, and then choose Create contact.

    Figure 1: Adding contact details

  3. On Contact information, enter names and define contact channels for your contacts.
  4. Under Contact channel, you can select Email, SMS, or Voice. You can also add multiple contact channels.
  5. In Engagement plan, specify how fast to engage your responders. In the example illustrated below, the incident responder will be engaged through email immediately (0 minutes) when an incident is detected and then through SMS 10 minutes into an incident. Complete the fields and then choose Create.

    Figure 2: Engagement plan

Create a response plan

Once you’ve created your contacts, you can create a response plan to define how to respond to incidents. Refer to the Best Practices for Response Plans.

Note: (Optional) You can also create an escalation plan that lets you further define the escalation path for your contacts. You can learn more in Create an escalation plan.

To create a response plan

  1. Open the Incident Manager console, and choose Response plans in the left navigation pane.
  2. Choose Create response plan.
  3. Enter a unique and identifiable name for your response plan.
  4. Enter an incident title. The incident title helps to identify an incident on the incidents home page.
  5. Select an appropriate Impact based on the potential scope of the incident.

    Figure 3: Selecting your impact level

  6. (Optional) Choose a chat channel for the incident responders to interact in during an incident. For more information about chat channels, see Chat channels.
  7. (Optional) For Engagement, you can choose any number of contacts and escalation plans. For this solution, select the security team responder that you created earlier as one of your contacts.

    Figure 4: Adding engagements

  8. (Optional) You can also create a runbook that can drive the incident mitigation and response. For further information, refer to Runbooks and automation.
  9. Under Execution permissions, choose Create an IAM role using a template. Under Role name, select the IAM role you created in the prerequisites that allows Incident Manager to run SSM automation documents, and then choose Create response plan.

Monitor AWS account root activity

When you first create an AWS account, you begin with a single sign-in identity that has complete access to all AWS services and resources in the account. This identity is called the root user and is accessed by signing in with the email address and password that you used to create the account.

An AWS account root user has full access to all your resources for all AWS services, including billing information. It is critical to prevent root user access from unauthorized use and to be aware whenever root user activity occurs in your AWS account. For more information about AWS recommendations, see Security best practices in IAM.

To be certain that all root user activity is authorized and expected, it’s important to monitor root API calls to a given AWS account and to be notified when root user activity is detected.

Create an EventBridge rule

Create and validate an EventBridge rule to capture AWS account root activity.

To create an EventBridge rule

  1. Open the EventBridge console.
  2. In the navigation pane, choose Rules, and then choose Create rule.
  3. Enter a name and description for the rule.
  4. For Define pattern, choose Event pattern.
  5. Choose Custom pattern.
  6. Enter the following event pattern:
    {
      "detail-type": [
        "AWS API Call via CloudTrail",
        "AWS Console Sign In via CloudTrail"
      ],
      "detail": {
        "userIdentity": {
          "type": [
            "Root"
          ]
        }
      }
    }
    

  7. For Select targets, choose Incident Manager response plan.
  8. For Response plan, choose SecurityEventResponsePlan, which you created when you set up Incident Manager.
  9. To create an IAM role automatically, choose Create a new role for this specific resource. To use an existing IAM role, choose Use existing role.
  10. (Optional) Enter one or more tags for the rule.
  11. Choose Create.
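
If you prefer to script this step, the following sketch creates an equivalent rule and target with boto3. The account ID, response plan ARN, and role ARN are placeholders for the resources you created earlier.

import json
import boto3

events = boto3.client('events')

root_activity_pattern = {
    "detail-type": ["AWS API Call via CloudTrail",
                    "AWS Console Sign In via CloudTrail"],
    "detail": {"userIdentity": {"type": ["Root"]}}
}

events.put_rule(
    Name='root-user-activity',
    EventPattern=json.dumps(root_activity_pattern)
)

# Placeholder ARNs for illustration only
events.put_targets(
    Rule='root-user-activity',
    Targets=[{
        'Id': 'incident-manager-response-plan',
        'Arn': 'arn:aws:ssm-incidents::111122223333:response-plan/SecurityEventResponsePlan',
        'RoleArn': 'arn:aws:iam::111122223333:role/EventBridgeIncidentManagerRole'
    }]
)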

To validate the rule

  1. Sign in using root credentials.
  2. This console login activity by a root user should invoke the Incident Manager response plan and show an open incident, as illustrated below. The contact channels that you defined earlier in your engagement plan will be engaged.
Figure 5: Incident Manager open incidents

Watch for GuardDuty high severity findings

GuardDuty is a monitoring service that analyzes AWS CloudTrail management and Amazon S3 data events, Amazon Virtual Private Cloud (Amazon VPC) flow logs, and Amazon Route 53 DNS logs to generate security findings for your account. Once GuardDuty is enabled, it immediately starts monitoring your environment.

GuardDuty integrates with EventBridge, which can be used to send findings data to other applications and services for processing. With EventBridge, you can use GuardDuty findings to invoke automatic responses to your findings by connecting finding events to targets such as an Incident Manager response plan.

Create an EventBridge rule

You’ll use an EventBridge rule to capture GuardDuty high severity findings.

To create an EventBridge rule

  1. Open the EventBridge console.
  2. In the navigation pane, select Rules, and then choose Create rule.
  3. Enter a name and description for the rule.
  4. For Define pattern, choose Event pattern.
  5. Choose Custom pattern.
  6. Enter the following event pattern, which filters on GuardDuty high severity findings:
    {
      "source": ["aws.guardduty"],
      "detail-type": ["GuardDuty Finding"],
      "detail": {
        "severity": [
          7.0,
          7.1,
          7.2,
          7.3,
          7.4,
          7.5,
          7.6,
          7.7,
          7.8,
          7.9,
          8,
          8.0,
          8.1,
          8.2,
          8.3,
          8.4,
          8.5,
          8.6,
          8.7,
          8.8,
          8.9
        ]
      }
    } 
    

  7. For Select targets, choose Incident Manager response plan.
  8. For Response plan, select SecurityEventResponsePlan, which you created when you set up Incident Manager.
  9. To create an IAM role automatically, choose Create a new role for this specific resource. To use an IAM role that you created before, choose Use existing role.
  10. (Optional) Enter one or more tags for the rule.
  11. Choose Create.

To validate the rule

To test and validate whether the above rule is now functional, you can generate sample findings within the GuardDuty console.

  1. Open the GuardDuty console.
  2. In the navigation pane, choose Settings.
  3. On the Settings page, under Sample findings, choose Generate sample findings.
  4. In the navigation pane, choose Findings. The sample findings are displayed on the Current findings page with the prefix [SAMPLE].
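
You can also generate sample findings programmatically, which is convenient if you want to script this validation. The following sketch uses boto3 and assumes a detector already exists in the account; the finding type shown is illustrative, so pick one that GuardDuty reports at high severity for your test.

import boto3

guardduty = boto3.client('guardduty')

# Use the account's existing detector
detector_id = guardduty.list_detectors()['DetectorIds'][0]

# The finding type below is illustrative; choose one that produces a
# high severity sample so the EventBridge rule matches.
guardduty.create_sample_findings(
    DetectorId=detector_id,
    FindingTypes=['Backdoor:EC2/C&CActivity.B!DNS']
)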

Once you have generated sample findings, your Incident Manager response plan will be invoked almost immediately and the engagement plan with your contacts will begin.

You can select an open incident in the Incident Manager console to see additional details from the GuardDuty finding. Figure 6 shows a high severity finding.

Figure 6: Incident Manager open incident for GuardDuty high severity finding

Monitor S3 bucket settings for public access

AWS Config enables continuous monitoring of your AWS resources, making it easier to assess, audit, and record resource configurations and changes. AWS Config does this through rules that define the desired configuration state of your AWS resources. AWS Config provides a number of AWS managed rules that address a wide range of security concerns such as checking that your Amazon Elastic Block Store (Amazon EBS) volumes are encrypted, your resources are tagged appropriately, and multi-factor authentication (MFA) is enabled for root accounts.

Set up AWS Config and EventBridge

You will use AWS Config to monitor your S3 bucket ACLs and policies for violations which could allow public read or public write access. If AWS Config finds a policy violation, it will initiate an AWS EventBridge rule to invoke your Incident Manager response plan.

To create the AWS Config rule to capture S3 bucket public access

  1. Sign in to the AWS Config console.
  2. If this is your first time in the AWS Config console, refer to the Getting Started guide for more information.
  3. Select Rules from the menu and choose Add Rule.
  4. On the AWS Config rules page, enter S3 in the search box and select the s3-bucket-public-read-prohibited and s3-bucket-public-write-prohibited rules, and then choose Next.

    Figure 7: AWS Config rules

  5. Leave the Configure rules page as default and select Next.
  6. On the Review page, select Add Rule. AWS Config is now analyzing your S3 buckets, capturing their current configurations, and evaluating the configurations against the rules you selected.

To create the EventBridge rule

  1. Open the Amazon EventBridge console
  2. In the navigation pane, choose Rules, and then choose Create rule.
  3. Enter a name and description for the rule.
  4. For Define pattern, choose Event pattern.
  5. Choose Custom pattern.
  6. Enter the following event pattern, which will filter on AWS Config rule s3-bucket-public-write-prohibited being non-compliant.
    {
      "source": ["aws.config"],
      "detail-type": ["Config Rules Compliance Change"],
      "detail": {
        "messageType": ["ComplianceChangeNotification"],
        "configRuleName": ["s3-bucket-public-write-prohibited", ""],
        "newEvaluationResult": {
          "complianceType": [
            "NON_COMPLIANT"
          ]
        }
      }
    }
    

  7. For Select targets, choose Incident Manager response plan.
  8. For Response plan, choose SecurityEventResponsePlan, which you created earlier when setting up Incident Manager.
  9. To create an IAM role automatically, choose Create a new role for this specific resource. To use an existing IAM role, choose Use existing role.
  10. (Optional) Enter one or more tags for the rule.
  11. Choose Create.

To validate the rule

  1. Create a compliant test S3 bucket with no public read or write access through either an ACL or a policy.
  2. Change the ACL of the bucket to allow public listing of objects so that the bucket is non-compliant.

    Figure 8: Amazon S3 console

  3. After a few minutes, you should see the AWS Config rule evaluate the bucket as non-compliant, which invokes the EventBridge rule and therefore your Incident Manager response plan.

Summary

In this post, I showed you how to use Incident Manager to monitor for security events and invoke a response plan via Amazon CloudWatch or Amazon EventBridge, using AWS CloudTrail API activity (for a root user login), Amazon GuardDuty (for high severity findings), and AWS Config (to enforce policies like preventing public write access to an S3 bucket) as event sources. I demonstrated how you can create an incident management and response plan to ensure you have used the power of the cloud to create automations that respond to and mitigate security incidents in a timely manner. To learn more about Incident Manager, see What Is AWS Systems Manager Incident Manager in the AWS documentation.

If you have feedback about this post, submit comments in the comments section below. If you have questions about this post, start a new thread on the Systems Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Sumit Patel

As a Senior Solutions Architect at AWS, Sumit works with large enterprise customers helping them create innovative solutions to address their cloud challenges. Sumit uses his more than 15 years of enterprise experience to help customers navigate their cloud transformation journey and shape the right dynamics between technology and business.

17 additional AWS services authorized for DoD workloads in the AWS GovCloud Regions

Post Syndicated from Tyler Harding original https://aws.amazon.com/blogs/security/17-additional-aws-services-authorized-for-dod-workloads-in-the-aws-govcloud-regions/

I’m pleased to announce that the Defense Information Systems Agency (DISA) has authorized 17 additional Amazon Web Services (AWS) services and features in the AWS GovCloud (US) Regions, bringing the total to 105 services and major features that are authorized for use by the U.S. Department of Defense (DoD). AWS now offers additional services to DoD mission owners in these categories: business applications; computing; containers; cost management; developer tools; management and governance; media services; security, identity, and compliance; and storage.

Why does authorization matter?

DISA authorization of 17 new cloud services enables mission owners to build secure innovative solutions to include systems that process unclassified national security data (for example, Impact Level 5). DISA’s authorization demonstrates that AWS effectively implemented more than 421 security controls by using applicable criteria from NIST SP 800-53 Revision 4, the US General Services Administration’s FedRAMP High baseline, and the DoD Cloud Computing Security Requirements Guide.

Recently authorized AWS services at DoD Impact Levels (IL) 4 and 5 include the following:

Business Applications

Compute

Containers

Cost Management

  • AWS Budgets – Set custom budgets to track your cost and usage, from the simplest to the most complex use cases
  • AWS Cost Explorer – An interface that lets you visualize, understand, and manage your AWS costs and usage over time
  • AWS Cost & Usage Report – Itemize usage at the account or organization level by product code, usage type, and operation

Developer Tools

  • AWS CodePipeline – Automate continuous delivery pipelines for fast and reliable updates
  • AWS X-Ray – Analyze and debug production and distributed applications, such as those built using a microservices architecture

Management & Governance

Media Services

  • Amazon Textract – Extract printed text, handwriting, and data from virtually any document

Security, Identity & Compliance

  • Amazon Cognito – Secure user sign-up, sign-in, and access control
  • AWS Security Hub – Centrally view and manage security alerts and automate security checks

Storage

  • AWS Backup – Centrally manage and automate backups across AWS services

Figure 1 shows the IL 4 and IL 5 AWS services that are now authorized for DoD workloads, broken out into functional categories.
 

Figure 1: The AWS services newly authorized by DISA

To learn more about AWS solutions for the DoD, see our AWS solution offerings. Follow the AWS Security Blog for updates on our Services in Scope by Compliance Program. If you have feedback about this blog post, let us know in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Tyler Harding

Tyler is the DoD Compliance Program Manager for AWS Security Assurance. He has over 20 years of experience providing information security solutions to the federal civilian, DoD, and intelligence agencies.

Building well-architected serverless applications: Optimizing application costs

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-optimizing-application-costs/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

COST 1. How do you optimize your serverless application costs?

Design, implement, and optimize your application to maximize value. Asynchronous design patterns and performance practices ensure efficient resource use and directly impact the value per business transaction. By optimizing your serverless application performance and its code patterns, you can directly impact the value it provides, while making more efficient use of resources.

Serverless architectures are easier to manage in terms of correct resource allocation compared to traditional architectures. Due to its pay-per-value pricing model and scale based on demand, a serverless approach effectively reduces the capacity planning effort. As covered in the operational excellence and performance pillars, optimizing your serverless application has a direct impact on the value it produces and its cost. For general serverless optimization guidance, see the AWS re:Invent talks, “Optimizing your Serverless applications” Part 1 and Part 2, and “Serverless architectural patterns and best practices”.

Required practice: Minimize external calls and function code initialization

AWS Lambda functions may call other managed services and third-party APIs. Functions may also use application dependencies that may not be suitable for ephemeral environments. Understanding and controlling what your function accesses while it runs can have a direct impact on value provided per invocation.

Review code initialization

I explain the Lambda initialization process with cold and warm starts in “Optimizing application performance – part 1”. Lambda reports the time it takes to initialize application code in Amazon CloudWatch Logs. As Lambda functions are billed by request and duration, you can use this to track costs and performance. Consider reviewing your application code and its dependencies to improve the overall execution time to maximize value.

You can take advantage of Lambda execution environment reuse to make external calls to resources and use the results for subsequent invocations. Use a time-to-live (TTL) mechanism inside your function handler code so that you avoid repeated external calls that add execution time, while still refreshing the data before it becomes stale.
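
The following is a minimal sketch of this pattern, assuming a hypothetical Parameter Store parameter /booking/feature-flags as the external resource being cached; the 5-minute TTL is also an illustrative value.

import time
import boto3

ssm = boto3.client("ssm")

# Module-level cache survives across invocations in a reused execution environment
_CACHE = {"value": None, "expires_at": 0.0}
CACHE_TTL_SECONDS = 300  # refresh the cached value at most every 5 minutes


def get_feature_flags():
    # Reuse the value fetched by a previous invocation while it is still fresh
    now = time.time()
    if _CACHE["value"] is None or now >= _CACHE["expires_at"]:
        response = ssm.get_parameter(Name="/booking/feature-flags")  # hypothetical parameter
        _CACHE["value"] = response["Parameter"]["Value"]
        _CACHE["expires_at"] = now + CACHE_TTL_SECONDS
    return _CACHE["value"]


def lambda_handler(event, context):
    flags = get_feature_flags()
    # ... use flags in business logic ...
    return {"flags": flags}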

Review third-party application deployments and permissions

When using Lambda layers or applications provisioned by AWS Serverless Application Repository, be sure to understand any associated charges that these may incur. When deploying functions packaged as container images, understand the charges for storing images in Amazon Elastic Container Registry (ECR).

Ensure that your Lambda function only has access to what its application code needs. Regularly review your function’s usage patterns so that you can factor in the cost of the other services it calls, such as Amazon S3 and Amazon DynamoDB.

Required practice: Optimize logging output and its retention

Consider reviewing your application logging level. Ensure that logging output and log retention are set appropriately for your operational needs to prevent unnecessary logging and data retention. This way, you retain only the minimum amount of log data required to investigate operational and performance issues when necessary.

Emit and capture only what is necessary to understand and operate your component as intended.

With Lambda, any standard output statements are sent to CloudWatch Logs. Capture and emit business and operational events that are necessary to help you understand your function, its integration, and its interactions. Use a logging framework and environment variables to dynamically set a logging level. When applicable, sample debugging logs for a percentage of invocations.

In the serverless airline example used in this series, the booking service Lambda functions use Lambda Powertools as a logging framework with output structured as JSON.

Lambda Powertools is added to the Lambda functions as a shared Lambda layer in the AWS Serverless Application Model (AWS SAM) template. The layer ARN is stored in Systems Manager Parameter Store.

Parameters:
  SharedLibsLayer:
    Type: AWS::SSM::Parameter::Value<String>
    Description: Project shared libraries Lambda Layer ARN
Resources:
    ConfirmBooking:
        Type: AWS::Serverless::Function
        Properties:
            FunctionName: !Sub ServerlessAirline-ConfirmBooking-${Stage}
            Handler: confirm.lambda_handler
            CodeUri: src/confirm-booking
            Layers:
                - !Ref SharedLibsLayer
            Runtime: python3.7
…

The LOG_LEVEL and other Powertools settings are configured in the Globals section as Lambda environment variables for all functions.

Globals:
    Function:
        Environment:
            Variables:
                POWERTOOLS_SERVICE_NAME: booking
                POWERTOOLS_METRICS_NAMESPACE: ServerlessAirline
                LOG_LEVEL: INFO 
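
The handler code can then pick up these settings without hard-coded configuration. The following minimal sketch shows how a handler might use the Powertools Logger, which reads POWERTOOLS_SERVICE_NAME and LOG_LEVEL from the environment and emits structured JSON; it is an illustrative approximation, not a copy of the airline booking handlers, and Powertools also supports sampling debug logs for a percentage of invocations via a configurable sample rate.

from aws_lambda_powertools import Logger

logger = Logger()  # service name and log level are read from the environment variables above


@logger.inject_lambda_context
def lambda_handler(event, context):
    logger.debug("Detailed diagnostics, only emitted when LOG_LEVEL is DEBUG")
    logger.info("Confirming booking")
    # ... business logic ...
    return {"statusCode": 200}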

For Amazon API Gateway, there are two types of logging in CloudWatch: execution logging and access logging. Execution logs contain information that you can use to identify and troubleshoot API errors. API Gateway manages the CloudWatch Logs, creating the log groups and log streams. Access logs contain details about who accessed your API and how they accessed it. You can create your own log group or choose an existing log group that could be managed by API Gateway.

Enable access logs, and selectively review the output format and request fields that might be necessary. For more information, see “Setting up CloudWatch logging for a REST API in API Gateway”.

API Gateway logging

Enable AWS AppSync logging, which uses CloudWatch to monitor and debug requests. You can configure two types of logging: request-level and field-level. For more information, see “Monitoring and Logging”.

AWS AppSync logging

Define and set a log retention strategy

Define a log retention strategy to satisfy your operational and business needs. Set log expiration for each CloudWatch log group as they are kept indefinitely by default.

For example, in the booking service AWS SAM template, log groups are explicitly created for each Lambda function with a parameter specifying the retention period.

Parameters:
    LogRetentionInDays:
        Type: Number
        Default: 14
        Description: CloudWatch Logs retention period
Resources:
    ConfirmBookingLogGroup:
        Type: AWS::Logs::LogGroup
        Properties:
            LogGroupName: !Sub "/aws/lambda/${ConfirmBooking}"
            RetentionInDays: !Ref LogRetentionInDays

The Serverless Application Repository application auto-set-log-group-retention can update the retention policy for new and existing CloudWatch log groups to a specified number of days.
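
If you prefer to script this yourself, here is a minimal sketch that applies a retention policy to existing Lambda log groups that are still set to never expire; the 14-day value and the /aws/lambda/ prefix are assumptions you would adapt to your needs.

import boto3

logs = boto3.client("logs")
RETENTION_IN_DAYS = 14

paginator = logs.get_paginator("describe_log_groups")
for page in paginator.paginate(logGroupNamePrefix="/aws/lambda/"):
    for group in page["logGroups"]:
        # Only touch log groups that are still set to "Never expire"
        if "retentionInDays" not in group:
            logs.put_retention_policy(
                logGroupName=group["logGroupName"],
                retentionInDays=RETENTION_IN_DAYS,
            )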

For log archival, you can export CloudWatch Logs to Amazon S3 and store them in Amazon S3 Glacier for more cost-effective retention. You can use CloudWatch Logs subscriptions for custom processing, analysis, or loading to other systems. Lambda extensions allow you to process, filter, and route logs directly from Lambda to a destination of your choice.

Good practice: Optimize function configuration to reduce cost

Benchmark your function using a range of memory sizes

For Lambda functions, memory is the capacity unit for controlling the performance and cost of a function. You can configure the amount of memory allocated to a Lambda function, between 128 MB and 10,240 MB. The amount of memory also determines the amount of virtual CPU available to a function. Benchmark your AWS Lambda functions with differing amounts of memory allocated. Adding more memory and proportional CPU may lower the duration and reduce the cost of each invocation.

In “Optimizing application performance – part 2”, I cover using AWS Lambda Power Tuning to automate the memory testing process and balance performance and cost.
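
If you want a quick manual benchmark before adopting a tool like Lambda Power Tuning, the following sketch loops over a few memory sizes and prints the REPORT line from each invocation's log tail; the function name and payload are placeholders.

import base64
import json
import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "ServerlessAirline-ConfirmBooking-Prod"  # placeholder function name
PAYLOAD = json.dumps({"bookingId": "test"})              # placeholder test payload

for memory_mb in (128, 256, 512, 1024):
    # Apply the new memory size and wait for the configuration update to finish
    lambda_client.update_function_configuration(
        FunctionName=FUNCTION_NAME, MemorySize=memory_mb
    )
    lambda_client.get_waiter("function_updated").wait(FunctionName=FUNCTION_NAME)

    # Invoke synchronously and read the REPORT line from the base64-encoded log tail
    response = lambda_client.invoke(
        FunctionName=FUNCTION_NAME, Payload=PAYLOAD, LogType="Tail"
    )
    log_tail = base64.b64decode(response["LogResult"]).decode()
    report = next(line for line in log_tail.splitlines() if line.startswith("REPORT"))
    print(f"{memory_mb} MB -> {report}")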

Best practice: Use cost-aware usage patterns in code

Reduce the time your function runs by reducing job-polling or task coordination. This avoids overpaying for unnecessary compute time.

Decide whether your application can fit an asynchronous pattern

Avoid scenarios where your Lambda functions wait for external activities to complete. I explain the difference between synchronous and asynchronous processing in “Optimizing application performance – part 1”. You can use asynchronous processing to aggregate queues, streams, or events for more efficient processing time per invocation. This reduces wait times and latency from requesting apps and functions.

Long polling or waiting increases the costs of Lambda functions and also reduces overall account concurrency. This can impact the ability of other functions to run.

Consider using other services such as AWS Step Functions to help reduce code and coordinate asynchronous workloads. You can build workflows using state machines with long polling and failure handling. Step Functions also supports direct service integrations, such as DynamoDB, without having to use Lambda functions.

In the serverless airline example used in this series, Step Functions is used to orchestrate the Booking microservice. The ProcessBooking state machine handles all the necessary steps to create bookings, including payment.

Booking service state machine

To reduce costs and improve performance with CloudWatch, create custom metrics asynchronously. You can use the embedded metrics format to write logs, rather than calling the PutMetricData API. I cover using the embedded metrics format in “Understanding application health” – part 1 and part 2.
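
As an illustration of the format (a hand-rolled sketch rather than the airline service's actual code), a single structured log line is enough for CloudWatch to extract a custom metric; the namespace, dimension, and metric names below mirror the example but are assumptions here.

import json
import time


def emit_booking_successful_metric(value=1):
    # A log line in the embedded metrics format: CloudWatch extracts the metric
    # asynchronously, with no PutMetricData call in the invocation path.
    print(json.dumps({
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "ServerlessAirline",
                "Dimensions": [["service"]],
                "Metrics": [{"Name": "BookingSuccessful", "Unit": "Count"}],
            }],
        },
        "service": "booking",
        "BookingSuccessful": value,
    }))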

For example, once a booking is made, the logs are visible in the CloudWatch console. You can select a log stream and find the custom metric as part of the structured log entry.

Custom metric structured log entry

CloudWatch automatically creates metrics from these structured logs. You can create graphs and alarms based on them. For example, here is a graph based on a BookingSuccessful custom metric.

CloudWatch metrics custom graph

Consider asynchronous invocations and review runaway functions where applicable

Take advantage of Lambda’s event-based model. Lambda functions can be triggered based on events ingested into Amazon Simple Queue Service (SQS) queues, S3 buckets, and Amazon Kinesis Data Streams. AWS manages the polling infrastructure on your behalf with no additional cost. Avoid code that polls third-party software as a service (SaaS) providers. Instead, use Amazon EventBridge to integrate with SaaS providers where possible.

Carefully consider and review recursion, and establish timeouts to prevent runaway functions.

Conclusion

Design, implement, and optimize your application to maximize value. Asynchronous design patterns and performance practices ensure efficient resource use and directly impact the value per business transaction. By optimizing your serverless application performance and its code patterns, you can reduce costs while making more efficient use of resources.

In this post, I cover minimizing external calls and function code initialization. I show how to optimize logging output with the embedded metrics format, and log retention. I recap optimizing function configuration to reduce cost and highlight the benefits of asynchronous event-driven patterns.

This post wraps up the series Building Well-Architected Serverless Applications, where I cover the AWS Well-Architected Tool with the Serverless Lens. See the introduction post for links to all the blog posts.

For more serverless learning resources, visit Serverless Land.

 

Apply the principle of separation of duties to shell access to your EC2 instances

Post Syndicated from Vesselin Tzvetkov original https://aws.amazon.com/blogs/security/apply-the-principle-of-separation-of-duties-to-shell-access-to-your-ec2-instances/

In this blog post, we will show you how you can use AWS Systems Manager Change Manager to control access to Amazon Elastic Compute Cloud (Amazon EC2) instance interactive shell sessions, to enforce separation of duties. Separation of duties is a design principle where more than one person’s approval is required to conclude a critical task, and it is an important part of the AWS Well-Architected Framework. You will be using AWS Systems Manager Session Manager in this post to start a shell session in managed EC2 instances.

To get approval, the operator requests permissions by creating a change request for a shell session to an EC2 instance. An approver reviews and approves the change request. The approver and requestor cannot be the same Identity and Access Management (IAM) principal. Upon approval, an AWS Systems Manager Automation runbook is started. The Automation runbook adds a tag to the operator’s IAM principal that allows it to start a shell in the specified targets. By default, the operator needs to start the session within 10 minutes of approval (although the validity period is configurable). After the 10 minutes elapse, the Automation runbook removes the tag from the principal, which means that the permissions to start new sessions are revoked.

To implement the solution described in this post, you use an attribute-based access control (ABAC) policy. In order to start a Systems Manager session, the operator’s IAM principal must have the tag key SecurityAllowSessionInstance, with the tag value set to the target EC2 instance ID. All operator principals have the same managed policy attached, which allows the session to start only if the tag is present and its value is equal to the instance ID. Figure 1 shows an example in which the IAM principal tag SecurityAllowSessionInstance has the value i-1234567890abcdefg, which is the same as the instance ID.

Figure 1: Tag and managed policy pattern
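
To make the pattern concrete, here is a minimal boto3 sketch that creates a policy of this shape. It is an approximation for illustration only, not the managed policy that the CloudFormation template in this post deploys, and the policy name is illustrative.

import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "StartSessionOnlyOnTaggedInstance",
            "Effect": "Allow",
            "Action": "ssm:StartSession",
            # The principal tag value must equal the target instance ID
            "Resource": "arn:aws:ec2:*:*:instance/${aws:PrincipalTag/SecurityAllowSessionInstance}",
        },
        {
            "Sid": "AllowTerminatingSessions",
            "Effect": "Allow",
            "Action": "ssm:TerminateSession",
            # The deployed policy may scope this more tightly (for example, to sessions
            # started by the same principal); shown broadly here for illustration only.
            "Resource": "arn:aws:ssm:*:*:session/*",
        },
    ],
}

iam.create_policy(
    PolicyName="AllowStartSsmSessionBasedOnIamTags-example",  # illustrative name
    PolicyDocument=json.dumps(policy_document),
)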

In this post, we will take you through the following steps:

  1. Review the architecture of the solution. (See the Architecture section.)
  2. Set up Systems Manager and Change Manager in the console.
  3. Deploy an AWS CloudFormation template that will provision the following:
    • A change management template AllowSsmSessionStartTemplate to request permission for a Session Manager shell session on a specified EC2 instance.
    • A Systems Manager Automation runbook with three steps: add a tag to the operator’s principal, wait 10 minutes (configurable), and remove the tag. The tag key is SecurityAllowSessionInstance.
    • An IAM managed policy named AllowStartSsmSessionBasedOnIamTags, to be attached to an IAM principal, which allows starting a Systems Manager session only if the SecurityAllowSessionInstance tag is present and matches the target instance.
    • An Amazon SNS topic change-manager-ssm-approval where approvers can get notification about requests.
    • An IAM role named SsmSessionControlChangeMangerRole, to be used for the Systems Manager Automation runbook.

    Note: Before you use the change template, you will approve the change management template in the AWS Management Console (one time).

  4. Perform simple test cases to demonstrate how an operator can obtain permission and start a session in a managed instance.
  5. Perform status monitoring.

You can use this solution across your AWS Organizations to give you the benefit of centrally managing change-related tasks in one member account, which you specify to be the delegated administrator account. For more information about how to set this up, see Setting up Change Manager for an organization.

Note: The operator can have multiple sessions in different EC2 instances simultaneously, but the sessions must be approved and started one after another because of tag overwrite on approval.

For more information about change management actions, including approvals and starting the runbook, see Auditing and logging Change Manager activity in the AWS Systems Manager User Guide.

Architecture

The architecture of this solution is shown in Figure 2.

Figure 2: Solution architecture

The main steps shown in Figure 2 are the following:

  1. Request: The requestor (which can be the operator) creates a change request in Systems Manager Change Manager and selects the template AllowSsmSessionStartTemplate. You need to provide the following mandatory parameters: name of change, approvals (users, group, or roles), IAM role for the execution of change, target account, EC2 instance ID, operator’s principal type (user or role), and operator’s principal name.
  2. Send notification: The notification is sent to the Amazon SNS topic change-manager-ssm-approval for the new change request.
  3. Approve: The approver reviews and approves the request.
  4. Start automation: The Automation runbook AllowStartSsmSession is started at the time specified in the change request.
  5. Tag: The operator’s IAM principal is tagged with the key SecurityAllowSessionInstance. After 10 minutes, the runbook completes by removing the tag from the IAM principal (a conceptual sketch of this tag-and-untag flow follows this list).
  6. Start session: The operator can start a session to the instance by using Systems Manager Session Manager within 10 minutes of approval. A notification is sent to the SNS topic change-manager-ssm-approval, where the operator can also subscribe to be notified.
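
The following is a minimal sketch of what the runbook does conceptually for a role-type operator principal. In the solution this logic runs as a Systems Manager Automation runbook; the role name, instance ID, and validity period below are placeholders.

import time
import boto3

iam = boto3.client("iam")

OPERATOR_ROLE = "Operator"            # placeholder principal name
INSTANCE_ID = "i-0123456789abcdef0"   # placeholder target instance
VALIDITY_SECONDS = 600                # 10-minute default approval validity

# Step 5a: grant access by tagging the operator's principal
iam.tag_role(
    RoleName=OPERATOR_ROLE,
    Tags=[{"Key": "SecurityAllowSessionInstance", "Value": INSTANCE_ID}],
)

# Step 5b: wait out the validity period, then revoke access by removing the tag
time.sleep(VALIDITY_SECONDS)
iam.untag_role(RoleName=OPERATOR_ROLE, TagKeys=["SecurityAllowSessionInstance"])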

Roles and permissions

The provided managed policy AllowStartSsmSessionBasedOnIamTags gives permission to start a Systems Manager session when the instance ID is equal to the principal tag value, and additionally to terminate the session. The managed policy allows the operator to keep an already active session beyond the approval interval and to terminate it as preferred. Resuming a session is not supported, so the operator needs to start a new session instead.

WARNING: You should validate that the operator principal (which is an IAM user or role) does not have permissions on the actions ssm:StartSession, ssm:TerminateSession, ssm:ResumeSession outside the managed policy used in this solution.

WARNING: It is very important that the operator does not have permission to change the relevant IAM roles, users, policies, or principal tags; otherwise, the operator could bypass the approval process.

Set up Systems Manager and Change Manager

You need to initially activate Systems Manager and Systems Manager Change Manager in your account. If you have already activated them, you can skip this section.

Note: You should enable Systems Manager as described in Setting up AWS Systems Manager, according to your company needs. The minimal requirement is to set up the service-linked role AWSServiceRoleForAmazonSSM that will be used by Systems Manager.

To create the service-linked IAM role

  1. Open the IAM console. In the navigation pane, choose Roles, then choose Create role.
  2. For the AWS Service role type, choose Systems Manager.
  3. Choose the use case Systems Manager – Inventory and Maintenance Windows, then choose Next: Permissions.
  4. Keep all default values, choose Next: Tags, and then choose Next: Review.
  5. Review the role and then choose Create role.

For more information, see Creating a service-linked role for Systems Manager.

Next, you set up Systems Manager Change Manager as described in Setting up Change Manager. Your specifics will vary depending on whether you use AWS Organizations or a single account.

Define the IAM users or groups that are allowed to approve change templates

Every change template can be reviewed and approved before use (this step is optional). The approval can be done by users and groups. If you use IAM roles in your organization, you will need a temporary user, which you can set up as described in Creating IAM users (console). Alternatively, you can use the change templates without explicit approval, as described later in this section.

To add reviewers for change templates

  1. In the AWS Systems Manager console, choose Change Manager, choose Settings, then choose Template reviewers.
  2. On the Select IAM approvers page, review the Users tab and Groups tab, as shown in Figure 3, and add approvers if necessary.

 

Figure 3: Change Manager settings

If you prefer not to explicitly review and approve the change template before use, you must turn off approval as follows.

To turn off approval of change templates before use

  1. In the Systems Manager console, choose Change Manager, then choose Settings.
  2. Under Best practices, set the option Require template review and approval before use to disabled.

Deploy the solution

After you complete the setup, you will perform the following steps one time in your selected account and AWS Region. The solution manages the permissions in all Regions you select, because IAM roles and policies are global entities.

To launch the stack

  1. Choose the following Launch Stack button to open the AWS CloudFormation console pre-loaded with the template. You must sign in to your AWS account in order to launch the stack in the required Region.
    Select the Launch Stack button to launch the template
  2. On the CloudFormation launch panel, specify the parameter Approval validity in minutes to correspond to your company policy, or keep the default value of 10 minutes.

(Optional) To approve the template

  1. To request approval of the Change Manager template, in the Systems Manager console, choose Change Manager, and then choose Change templates. Select AllowSsmSessionStartTemplate and submit for approval.
  2. To approve the Change Manager template, sign in to the Systems Manager console as the required approver user or group. Choose Change Manager, and then choose Change templates. Select AllowSsmSessionStartTemplate and choose Actions, Approve template. For more information, see Reviewing and approving or rejecting change templates.
  3. (Optional) The Systems Manager session approvers should subscribe to the SNS topic change-manager-ssm-approval, to get notification on new requests.

Now you’re ready to use the solution.

Test the solution

Next, we’ll demonstrate how you can test the solution end-to-end by doing the following: creating two IAM roles (Operator and Approver), launching an EC2 instance, requesting access by Operator to the instance, approving the request by Approver, and finally starting a Systems Manager session on the EC2 instance by Operator. You will run the test in the single account where you deployed the solution. We assume that you have set up Systems Manager as described in the Set up Systems Manager and Change Manager section.

Note: If you’re not using IAM roles in your organization, you can use IAM users instead, as described in Creating IAM users (console).

To prepare to test the solution

  1. Open the IAM console and create an IAM role named Operator in your account, and attach the following managed policies: ReadOnlyAccess (AWS managed) and AllowStartSsmSessionBasedOnIamTags (which you created in this post). For more information, see Creating IAM roles.
  2. Create a second IAM role named Approver in your account, and attach the following AWS managed policies: ReadOnlyAccess and AmazonSSMFullAccess.
  3. Create an IAM role named EC2Role with a trust policy to the EC2 service (ec2.amazonaws.com) and attach the AWS managed policy AmazonEC2RoleforSSM. Alternatively, you can confirm that your existing EC2 instances have the AmazonEC2RoleforSSM policy attached to their role. For more information, see Creating a role for an AWS service (console).
  4. Open the Amazon EC2 console and start a test EC2 instance of type Amazon Linux 2 with the IAM role EC2Role that you created in step 3. You can keep the default values for all the other parameters. You don’t need to set up VPC Security Group rules to allow inbound SSH to the EC2 instance. Take note of the instance-id, because you will need it later. For more information, see Launch an Amazon EC2 Instance.
  5. Open the Amazon SNS console. Under Simple Notification Service, for Topics, subscribe to the SNS topic change-manager-ssm-approval. For more information, see Subscribing to an Amazon SNS topic.

To do a positive test of the solution

  1. Open the Systems Manager console, sign in as Operator, choose Change Manager, and create a change request.
  2. Select the template AllowSsmSessionStartTemplate.
  3. On the Specify change details page, enter a name and description, and select the IAM role Approver as approver.
  4. For Target notification topic, select the SNS topic change-manager-ssm-approval, as shown in Figure 4. Choose Next.

    Figure 4: Creating a change request

  5. On the Specify parameters page, provide the automation IAM role SsmSessionControlChangeMangerRole, the instance-id you noted earlier, the principal name Operator, and the principal type role, as shown in Figure 5.

    Figure 5: Specify parameters for the change request

  6. Next, sign in as Approver. In the Systems Manager console, choose Change Manager.
  7. On the Requests tab, as shown in Figure 6, select the request and choose Approve. (For more information, see Reviewing and approving or rejecting change requests (console).) The Automation runbook will be started.

    Figure 6: Change Manager overview

  8. Sign in as Operator. Within the approval validity time that you provided in the template (10 minutes is the default), connect to the instance by using Systems Manager, as described in Start a session. When the session has started and you see a Unix shell on the instance, the positive test is complete.

Next, you can do a negative test, to demonstrate that access isn’t possible after the approval validity period (10 minutes) has elapsed.

To do a negative test of the solution

  1. Do steps 1 through 7 of the previous procedure, if you haven’t already done so.
  2. Sign in as IAM role Operator. Wait several minutes longer than the approval validity time (10-minute default), then try to connect to the instance by using Systems Manager, as described in Start a session. You will see that the IAM role Operator doesn’t have permission to start a session.

Clean up the resources

After the tests are finished, terminate the EC2 instance to avoid incurring future costs, and remove the IAM roles if they are no longer needed.

Status monitoring

In the Systems Manager console, on the Change Manager page, on the Requests tab, you can find all change requests and their status, and a link to the runbook log, as shown in Figure 7.

Figure 7: Change Manager runbook log

In the example shown in Figure 7, you can see the status of the following steps in the Automation runbook: tagging the principal, waiting, and removing the principal tag. For more information about auditing and logging, see Auditing and logging Change Manager activity.

Conclusion

In this post, you’ve learned how you can enforce separation of duties by using an approval workflow in AWS Systems Manager Change Manager. You can also extend this pattern to use it with AWS Organizations, as described in Setting up Change Manager for an organization. For more information, see Configuring Change Manager options and best practices.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Systems Manager forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Vesselin Tzvetkov

Vesselin is a senior security architect at AWS Professional Services and is passionate about security architecture and engineering innovative solutions. Outside of technology, he likes classical music, philosophy, and sports. He holds a Ph.D. in security from TU-Darmstadt and a M.S. in electrical engineering from Bochum University in Germany.

Pedro Galvao

Pedro is a security engineer at AWS Professional Services. His favorite activity is helping customers do awesome security engineering work on AWS.

Using Cloud Fitness Functions to Drive Evolutionary Architecture

Post Syndicated from Hauke Juhls original https://aws.amazon.com/blogs/architecture/using-cloud-fitness-functions-to-drive-evolutionary-architecture/

“It is not the strongest of the species that survives, nor the most intelligent. It is the one that is most adaptable to change.” – often attributed to Charles Darwin

One common strategy for businesses that operate in dynamic market conditions (and thus need to continuously correct their course) is to aim for smaller, independent development teams. Microservices and two-pizza teams at Amazon are prominent examples of this strategy. But having smaller units is not the only success factor: to reduce organizational bottlenecks and make high-quality decisions quickly, these two-pizza teams need to be autonomous in most of their decision making.

Architects can no longer rely on static upfront design to meet the change rate required to be successful in such an environment.

This blog shows enterprise architects a mechanism to align decentralized architectural decision making with overall architecture goals.

Gathering data from your fitness functions

“Evolutionary architecture” was coined by Neal Ford and his colleagues from AWS Partner ThoughtWorks in their work on Building Evolutionary Architectures. It is defined as “supporting guided, incremental change as a first principle across multiple dimensions.”

Fitness functions help you obtain the necessary data to allow for the planned evolution of your architecture. They set measurable values to assess how close your solution is to achieving your set goals.

Fitness functions can and should be adapted as the architecture evolves to guide a desired change process. This provides architects with a tool to guide their teams while maintaining team autonomy.

Example of a regression fitness function in action

You’ve identified shorter time-to-market as a key non-functional requirement, and you want to lower the risk of regressions and rollbacks after deployments. So, you and your team write automated test cases. To ensure that you have a good set of test cases in place, you measure test coverage: the percentage of code that is exercised by automated tests. This steers the team toward writing tests to mitigate the risk of regressions, so you have fewer rollbacks and a shorter time to market.

Fitness functions like this work best when they’re as automated as possible. But how do you acquire the necessary data points to use this mechanism outside of software architecture? We’ll show you how in the following sections.

AWS Cloud services with built-in fitness functions

AWS Cloud services are highly standardized, fully automated via API operations, and are built with observability in mind. This allows you to generate measurements for fitness functions automatically for areas such as availability, responsiveness, and security.

To start building your evolutionary architecture with fitness functions, use something that can be easily measured. AWS has services that can be used as inputs to fitness functions, including:

  • Amazon CloudWatch aggregates logs and metrics to check for availability, responsiveness, and reliability fitness functions.
  • AWS Security Hub provides a comprehensive view of your security alerts and security posture across your AWS accounts. Security architects could, for example, define a fitness function that requires the number of critical and high findings to be zero. Teams would then be guided toward reducing the number of these findings, resulting in better security (a minimal query sketch follows this list).
  • AWS Cost Explorer ensures your costs stay in line with value generated.
  • AWS Well-Architected Tool evaluates teams’ architectures in a consistent and repeatable way. The number of outstanding high-risk issues acts as your fitness function, which can be queried using the API. To improve your architecture based on the results, review the Establishing Feedback Loops Based on the AWS Well-Architected Framework Review blog post.
  • Amazon SageMaker Model Monitor continuously monitors the quality of SageMaker machine learning models in production. Detecting deviations early allows you to take corrective actions like retraining models, auditing upstream systems, or fixing quality issues.
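
As a concrete example of the Security Hub fitness function mentioned above, the following minimal sketch counts active critical and high findings, with zero as the target value; the workflow-status and record-state filters are assumptions you may want to adjust.

import boto3

securityhub = boto3.client("securityhub")

# Only count findings that are still active and not yet worked on
filters = {
    "SeverityLabel": [
        {"Value": "CRITICAL", "Comparison": "EQUALS"},
        {"Value": "HIGH", "Comparison": "EQUALS"},
    ],
    "WorkflowStatus": [{"Value": "NEW", "Comparison": "EQUALS"}],
    "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
}

paginator = securityhub.get_paginator("get_findings")
finding_count = sum(len(page["Findings"]) for page in paginator.paginate(Filters=filters))
print(f"Critical/high active findings (fitness target is 0): {finding_count}")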

Using the observability that the cloud provides

Fitness functions can be derived by evaluating the AWS account activity such as configuration changes. AWS CloudTrail is useful for this. It records account activity and service events from most AWS services, which can then be analyzed with Amazon Athena.

Fitness functions provide feedback to engineers via metrics

Figure 1. Fitness functions provide feedback to engineers via metrics

Example of a cloud fitness function in action

In this example, we implement a fitness function that monitors the operability of your system.

You have had certain outages due to manual tasks in operations, and you have anecdotal evidence that engineers are spending time on manual work during application rollouts. To improve operations, you want to reduce manual interactions via the shell in favor of automation. First, you prevent direct secure shell (SSH) access by blocking SSH traffic via the managed AWS Config rule restricted-ssh. Second, you make use of AWS Systems Manager Session Manager, which provides a secure and auditable way to access Amazon Elastic Compute Cloud (Amazon EC2) instances.

By counting the logged API events in CloudTrail, you can measure the number of shell sessions, as shown in the following sample Athena query:

SELECT count(*),
       DATE(from_iso8601_timestamp(eventTime)),
       userIdentity.type,
       eventSource,
       eventName
FROM "cloudtrail_logs_partition_projection"
WHERE readonly = 'false'
  AND eventsource = 'ssm.amazonaws.com'
  AND eventname in ('StartSession',
                    'ResumeSession',
                    'TerminateSession')
GROUP BY DATE(from_iso8601_timestamp(eventTime)),
         userIdentity.type,
         eventSource,
         eventName
ORDER BY DATE(from_iso8601_timestamp(eventTime)) DESC

The number of shell sessions now acts as a fitness function to improve operational excellence through operations as code. Coincidentally, the fitness function you defined also rewards teams that move to serverless compute services such as AWS Fargate or AWS Lambda.
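
To turn the query into a trackable data point, you could run it on a schedule and publish the result as a CloudWatch custom metric, as in the following sketch; the database name, results bucket, and metric namespace are placeholders.

import time
import boto3

athena = boto3.client("athena")
cloudwatch = boto3.client("cloudwatch")

# Simplified daily variant of the query above
QUERY = """
SELECT count(*) AS session_count
FROM "cloudtrail_logs_partition_projection"
WHERE readonly = 'false'
  AND eventsource = 'ssm.amazonaws.com'
  AND eventname IN ('StartSession', 'ResumeSession', 'TerminateSession')
  AND DATE(from_iso8601_timestamp(eventTime)) = current_date
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "default"},                     # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}, # placeholder bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    session_count = int(rows[1]["Data"][0]["VarCharValue"])  # row 0 is the header row
    cloudwatch.put_metric_data(
        Namespace="FitnessFunctions",  # placeholder namespace
        MetricData=[{"MetricName": "ShellSessions", "Value": session_count, "Unit": "Count"}],
    )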

Fitness through exercising

Similar to people, your architecture’s fitness can be improved by exercising. It does not take much equipment, but you need to take the first step. To get started, we encourage you to think of the desired outcomes for your architecture that you can measure (and thus guide) through fitness functions. The following lessons learned will help you focus your goals:

  • Requirements and business goals may differ per domain. Thus, your fitness functions might differ. Work closely with your teams when defining fitness functions.
  • Start by taking something that can be easily measured and communicated as a goal.
  • Focus on a positive trendline rather than absolute values.
  • Make sure you and your teams are using the same metrics and the same way to measure them. We have seen examples where central governance departments had access to data the individual teams did not, leading to frustration on all sides.
  • Ensure that your architecture goals fit well into the current context and time horizon.
  • Continuously re-visit the fitness functions to ensure that they evolve with the changing business goals.

Conclusion

Fitness functions help architects focus on building. Once established, teams can use the data points from fitness functions to make decisions and work towards a common and measurable goal. The architects in turn can use the data points they get from fitness functions to confirm their hypothesis of the current state of the architecture. Get started building your fitness functions today by:

  • Gathering the most important system quality attributes.
  • Beginning with approximately three meaningful fitness functions relying on the API operations available.
  • Building a dashboard that shows progress over time, sharing it with your teams, and relying on this data in your daily work.