
Decoupled Serverless Scheduler To Run HPC Applications At Scale on EC2

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/decoupled-serverless-scheduler-to-run-hpc-applications-at-scale-on-ec2/

This post is written by Ludvig Nordstrom and Mark Duffield | on November 27, 2019

In this blog post, we dive into a cloud native approach for running HPC applications at scale on EC2 Spot Instances, using a decoupled serverless scheduler. This architecture is ideal for many workloads in the HPC and EDA industries, and can be used for any batch job workload.

At the end of this blog post, you will have two takeaways.

  1. A highly scalable environment that can run on hundreds of thousands of cores across EC2 Spot Instances.
  2. A fully serverless architecture for job orchestration.

We discuss deploying and running a pre-built serverless job scheduler that can run both Windows and Linux applications using any executable file format for your application. This environment provides high performance, scalability, cost efficiency, and fault tolerance. We introduce best practices and benefits of creating this environment, and cover the architecture, running jobs, and integration into existing environments.

A quick note about the term cloud native: we use the term loosely in this blog. Here, cloud native means we use AWS services (including serverless and microservices) to build out our compute environment, instead of a traditional lift-and-shift method.

Let’s get started!

 

Solution overview

This blog goes over the deployment process, which leverages AWS CloudFormation. This allows you to use infrastructure as code to automatically build out your environment. There are two parts to the solution: the Serverless Scheduler and Resource Automation. Below are quick summaries of each part of the solution.

Part 1 – The serverless scheduler

This first part of the blog builds out a serverless workflow to get jobs from SQS and run them across EC2 instances. The CloudFormation template being used for Part 1 is serverless-scheduler-app.template, and here is the Reference Architecture:

 


    Figure 1: Serverless Scheduler Reference Architecture (grayed-out area is covered in Part 2).

Read the GitHub Repo if you want to look at the Step Functions workflow shown in the preceding image. The walkthrough explains how the serverless application retrieves and runs jobs on its worker, updates the DynamoDB job monitoring table, and manages the worker for its lifetime.

 

Part 2 – Resource automation with serverless scheduler


This part of the solution relies on the serverless scheduler built in Part 1 to run jobs on EC2. Part 2 simplifies submitting and monitoring jobs, and retrieving results for users. Jobs are spread across our cost-optimized Spot Instances. Amazon EC2 Auto Scaling automatically scales up the compute resources when jobs are submitted, then terminates them when jobs are finished. Both of these save you money.

The CloudFormation template used in Part 2 is resource-automation.template. Building on Figure 1, the additional resources launched with Part 2 are noted in the following image: an S3 bucket, an EC2 Auto Scaling group, and two Lambda functions.


 

Figure 2: Resource Automation using Serverless Scheduler

                               

Introduction to decoupled serverless scheduling

HPC schedulers traditionally run in a classic master and worker node configuration. A scheduler on the master node orchestrates jobs on worker nodes. This design has been successful for decades; however, many powerful schedulers are evolving to meet the demands of HPC workloads. This scheduler design evolved from the necessity of running orchestration logic on one machine, but there are now options to decouple this logic.

What benefits could decoupling this logic bring? First, we avoid a number of shortfalls in the environment, such as the need for all worker nodes to communicate with a single master node. This single source of communication limits scalability and creates a single point of failure. When we split the scheduler into decoupled components, both of these issues disappear.

Second, in an effort to work around these pain points, traditional schedulers had to create extremely complex logic to manage all workers concurrently in a single application. This stifled the ability to customize and improve the code, restricting changes to the software provider's engineering teams.

Serverless services, such as AWS Step Functions and AWS Lambda, fix these major issues. They allow you to decouple the scheduling logic into a one-to-one mapping with each worker, with all workers sharing an Amazon Simple Queue Service (SQS) job queue. We define our scheduling workflow in AWS Step Functions. Then the workflow scales out to potentially thousands of "state machines." These state machines act as wrappers around each worker node and manage each worker node individually. Our code is less complex because we only consider one worker and its job.

We illustrate the differences between a traditional shared scheduler and decoupled serverless scheduler in Figures 3 and 4.

 


Figure 3: Traditional Scheduler Model

 


Figure 4: Decoupled Serverless Scheduler on each instance

 

Each decoupled serverless scheduler will:

  • Retrieve and pass jobs to its worker
  • Monitor its worker's health and take action if needed
  • Confirm job success by checking output logs and retry jobs if needed
  • Terminate the worker when the job queue is empty, just before terminating itself

With this new scheduler model, there are many benefits. Decoupling the scheduler into smaller schedulers increases fault tolerance because any issue only affects one worker. Additionally, each scheduler consists of independent AWS Lambda functions, which maintain state on separate hardware and build retry logic into the service. Scalability also increases because jobs are not dependent on a master node, which enables the geographic distribution of jobs. This geographic distribution allows you to optimize use of low-cost Spot Instances. Also, when you decouple the scheduler, workflow complexity decreases and you can customize the scheduler logic. You can leverage lower latency job monitoring and customize automated responses to job events as they happen.

 

Benefits

  • Fully managed – With Part 2 (Resource Automation) deployed, resources for a job are fully managed. When a job is submitted, resources launch and run the job. When the job is done, worker nodes automatically shut down. This prevents you from incurring continuous costs.

 

  • Performance – Your application runs on EC2, which means you can choose any of the high performance instance types. Input files are automatically copied from Amazon S3 into local Amazon EC2 Instance Store for high performance storage during execution. Result files are automatically moved to S3 after each job finishes.

 

  • Scalability – A worker node combined with a scheduler state machine becomes a stateless entity. You can spin up as many of these entities as you want and point them at an SQS queue. You can even distribute worker and state machine pairs across multiple AWS Regions. These two components, paired with fully managed services, optimize your architecture for scalability so it can meet your desired number of workers.

 

  • Fault Tolerance – The solution is completely decoupled, which means each worker has its own state machine that handles scheduling for that worker. Likewise, each state machine is decoupled into the Lambda functions that make it up. Additionally, the scheduler workflow includes a Lambda function that confirms each successful job or resubmits failed jobs.

 

  • Cost Efficiency – This fault tolerant environment is perfect for EC2 Spot Instances, which means you can save up to 90% on your workloads compared to On-Demand Instance pricing. The scheduler workflow ensures little to no idle time for workers by closely monitoring them and sending new jobs as jobs finish. Because the scheduler is serverless, you only incur costs for the resources required to launch and run jobs. Once the jobs are complete, all resources are terminated automatically.

 

  • Agility – You can use AWS fully managed Developer Tools to quickly release changes and customize workflows. The reduced complexity of a decoupled scheduling workflow means that you don’t have to spend time managing a scheduling environment, and can instead focus on your applications.

 

 

Part 1 – serverless scheduler as a standalone solution

 

If you use the serverless scheduler as a standalone solution, you can build clusters and leverage shared storage such as FSx for Lustre, EFS, or S3. Additionally, you can use AWS CloudFormation to deploy more complex compute architectures that suit your application. The EC2 instances that run the serverless scheduler can be launched in any number of ways; the scheduler only requires the instance ID and the SQS job queue name.

 

Submitting jobs directly to the serverless scheduler

The serverless scheduler app is a fully built AWS Step Functions workflow that pulls jobs from an SQS queue and runs them on an EC2 instance. The jobs submitted to SQS consist of an AWS Systems Manager Run Command, and work with any SSM document and commands that you choose for your jobs. Examples of SSM Run Command documents include AWS-RunShellScript and AWS-RunPowerShellScript. Feel free to read more about Running Commands Using Systems Manager Run Command.

The following code shows the format of a job submitted to SQS in JSON.

  {
      "job_id": "jobId_0",
      "retry": "3",
      "job_success_string": " ",
      "ssm_document": "AWS-RunPowerShellScript",
      "commands":
          [
              "cd C:\\ProgramData\\Amazon\\SSM; mkdir Result",
              "Copy-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -LocalFolder .\\",
              "C:\\ProgramData\\Amazon\\SSM\\jobId_0.bat",
              "Write-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -Folder .\\Result\\"
          ]
  }
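As a minimal sketch of how a job in this format could be submitted programmatically, the following boto3 snippet sends the job definition above to the scheduler's default queue (the queue name my-sqs-job-queue-name matches the default described later; adjust it to your own queue):

    import json

    import boto3

    sqs = boto3.client("sqs")

    # Default job queue created with the scheduler; replace with your queue name.
    queue_url = sqs.get_queue_url(QueueName="my-sqs-job-queue-name")["QueueUrl"]

    job = {
        "job_id": "jobId_0",
        "retry": "3",
        "job_success_string": " ",
        "ssm_document": "AWS-RunPowerShellScript",
        "commands": [
            "cd C:\\ProgramData\\Amazon\\SSM; mkdir Result",
            "Copy-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -LocalFolder .\\",
            "C:\\ProgramData\\Amazon\\SSM\\jobId_0.bat",
            "Write-S3object -Bucket my-bucket -KeyPrefix jobs/date/jobId_0 -Folder .\\Result\\",
        ],
    }

    # The scheduler state machine picks the message up and runs it via SSM Run Command.
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(job))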

 

Any EC2 instance associated with a serverless scheduler receives jobs from its designated SQS queue until the queue is empty. Then, the EC2 resource automatically terminates. If a job fails, it is retried up to the number of times specified in the job definition. You can include a specific string value that the scheduler searches for in the job execution output to confirm successful completion.

 

Tagging EC2 workers to get a serverless scheduler state machine

In Part 1 of the deployment, you must manage your EC2 Instance launch and termination. When launching an EC2 Instance, tag it with a specific tag key that triggers a state machine to manage that instance. The tag value is the name of the SQS queue that you want your state machine to poll jobs from.

In the following example, "my-scheduler-cloudformation-stack-name" is the tag key that the serverless scheduler app looks for on any new EC2 instance that starts. Next, "my-sqs-job-queue-name" is the default job queue created with the scheduler, but you can change this to the name of any queue you want the instance to retrieve jobs from when it is launched.

{"my-scheduler-cloudformation-stack-name":"my-sqs-job-queue-name"}
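For example, a worker could be launched and tagged with boto3 as in the sketch below. The AMI ID, instance type, and instance profile are placeholders; use an AMI with the SSM Agent installed and a profile that grants the permissions your jobs need:

    import boto3

    ec2 = boto3.client("ec2")

    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",              # placeholder AMI with the SSM Agent
        InstanceType="c5.large",                      # placeholder instance type
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": "my-worker-instance-profile"},  # placeholder profile
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{
                # Tag key = scheduler stack name, tag value = SQS job queue to poll
                "Key": "my-scheduler-cloudformation-stack-name",
                "Value": "my-sqs-job-queue-name",
            }],
        }],
    )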

 

Monitor jobs in DynamoDB

You can monitor job status in the DynamoDB job monitoring table. In the table you can find the job_id, the commands sent to Amazon EC2, job status, job output logs from Amazon EC2, and retries, among other things.

Alternatively, you can query DynamoDB for a given job_id via the AWS Command Line Interface:

aws dynamodb get-item --table-name job-monitoring \
                      --key '{"job_id": {"S": "/my-jobs/my-job-id.bat"}}'
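If you prefer a programmatic check, an equivalent boto3 call might look like this (table name and key follow the CLI example above):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Fetch the monitoring record for a single job by its job_id
    response = dynamodb.get_item(
        TableName="job-monitoring",
        Key={"job_id": {"S": "/my-jobs/my-job-id.bat"}},
    )
    print(response.get("Item"))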

 

Using the “job_success_string” parameter

For the prior DynamoDB table, we submitted two identical jobs using an example script that you can also use. The command sent to the instance is "echo Hello World", so the output from each job should be "Hello World". We also specified three allowed job retries. In the following image, there are two jobs in the SQS queue before they ran. Look closely at the different "job_success_string" values for each job and the identical command sent to both:

Example DynamoDB output showing the job information for the two submitted jobs.

From the image we see that Job2 was successful and Job1 retried three times before being permanently labeled as failed. We forced this outcome to demonstrate how the job success string works: we submitted Job1 with a "job_success_string" of "Hello EVERYONE", which does not appear in the job output "Hello World". For Job2 we set "job_success_string" to "Hello", because we knew this string would be in the output log.

Job outputs commonly contain text that only appears if the job succeeded. You can also add this text yourself in your executable file. With "job_success_string", you can confirm a job's successful output and use it to identify a specific value that you are looking for across jobs.
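Conceptually, the success check is a substring match against the job's output log. The following is a minimal sketch of that idea (not the scheduler's actual code); treating a blank success string as "no check" is an assumption made for this illustration:

    def job_succeeded(output_log: str, job_success_string: str) -> bool:
        # Assumption for this sketch: a blank/whitespace-only success string means
        # no output check is performed, so only command completion matters.
        if not job_success_string.strip():
            return True
        return job_success_string in output_log


    # Matches the Job1/Job2 scenario described above
    print(job_succeeded("Hello World", "Hello"))           # True  -> job succeeds
    print(job_succeeded("Hello World", "Hello EVERYONE"))  # False -> job is retried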

 

Part 2 – Resource Automation with the serverless scheduler

The additional services we deploy in Part 2 integrate with existing architectures to launch resources for your serverless scheduler. These services allow you to submit jobs simply by uploading input files and executable files to an S3 bucket.

Likewise, these additional resources can use any executable file format you want, including proprietary application-level scripts. The solution automates everything else. This includes creating and submitting jobs to the SQS job queue, spinning up compute resources when new jobs come in, and taking them back down when there are no jobs to run. When jobs are done, result files are copied to S3 for the user to retrieve. Similar to Part 1, you can still view the DynamoDB table for job status.

This architecture makes it easy to scale out to different teams and departments, and you can submit potentially hundreds of thousands of jobs while you remain in control of resources and cost.

 

Deeper Look at the S3 Architecture

The following diagram shows how you can submit jobs, monitor progress, and retrieve results. To submit jobs, upload all the needed input files and an executable script to S3. The suffix of the executable file (uploaded last) triggers an S3 event to start the process, and this suffix is configurable.

The S3 key of the executable file acts as the job id, and is kept as a reference to that job in DynamoDB. The Lambda (#2 in diagram below) uses the S3 key of the executable to create three SSM Run Commands.

  1. Synchronize all files in the same S3 folder to a working directory on the EC2 Instance.
  2. Run the executable file on EC2 Instances within a specified working directory.
  3. Synchronize the EC2 instance's working directory back to the S3 bucket, including the newly generated result files.

This Lambda (#2) then places the job on the SQS queue using the scheduler's JSON-formatted job definition shown earlier.
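The following is a simplified sketch of what that Lambda (#2) could look like, assuming a Linux worker, the AWS-RunShellScript document, and a JOB_QUEUE_URL environment variable; the actual function deployed by the template may differ in details such as working directory layout and retry settings:

    import json
    import os
    import urllib.parse

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = os.environ["JOB_QUEUE_URL"]  # assumed environment variable

    def handler(event, context):
        # S3 event fired by the executable-file upload; its key acts as the job id
        record = event["Records"][0]
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        job_folder, executable = key.rsplit("/", 1)

        job = {
            "job_id": key,
            "retry": "3",
            "job_success_string": " ",
            "ssm_document": "AWS-RunShellScript",
            "commands": [
                # 1. Sync the job folder from S3 to a working directory on the worker
                f"aws s3 sync s3://{bucket}/{job_folder} /tmp/{job_folder}",
                # 2. Run the executable within that working directory
                f"cd /tmp/{job_folder} && chmod +x ./{executable} && ./{executable}",
                # 3. Sync the working directory (including result files) back to S3
                f"aws s3 sync /tmp/{job_folder} s3://{bucket}/{job_folder}",
            ],
        }
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(job))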

IMPORTANT: Each set of job files should be given a unique job folder in S3 or more files than needed might be moved to the EC2 Instance.

 


Figure 5: Resource Automation using Serverless Scheduler – A deeper look

 

The Step Functions workflow uses the Lambda function (#3 in the prior diagram) and the Auto Scaling group to scale out EC2 capacity based on the number of jobs in the queue, up to the maximum number of workers (plus state machines) defined in the Auto Scaling group. When the job queue is empty, the running instances scale down to zero as they finish their remaining jobs.
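A rough sketch of that scaling logic, assuming a Lambda function (#3) that reads the queue depth and adjusts the Auto Scaling group (the environment variable names and the one-worker-per-pending-job rule are illustrative assumptions, not the template's exact implementation):

    import os

    import boto3

    sqs = boto3.client("sqs")
    autoscaling = boto3.client("autoscaling")

    QUEUE_URL = os.environ["JOB_QUEUE_URL"]       # assumed environment variable
    ASG_NAME = os.environ["WORKER_ASG_NAME"]      # assumed environment variable
    MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "10"))

    def handler(event, context):
        # Number of jobs currently waiting in the queue
        attributes = sqs.get_queue_attributes(
            QueueUrl=QUEUE_URL,
            AttributeNames=["ApproximateNumberOfMessages"],
        )
        pending_jobs = int(attributes["Attributes"]["ApproximateNumberOfMessages"])

        # One worker (plus its state machine) per pending job, capped at the group maximum
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=ASG_NAME,
            DesiredCapacity=min(pending_jobs, MAX_WORKERS),
            HonorCooldown=False,
        )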

 

Process Submitting Jobs and Retrieving Results

  1. As seen in step 1, upload input file(s) and an executable file into a unique job folder in S3 (such as /year/month/day/jobid/~job-files). Upload the executable file last, because it automatically starts the job. You can also use a script to upload multiple files at a time, but each job will need a unique directory (see the sketch after this list). There are many ways to make S3 buckets available to users, including AWS Storage Gateway, AWS Transfer for SFTP, AWS DataSync, the AWS Console, or any of the AWS SDKs leveraging S3 API calls.
  2. You can monitor job status by accessing the DynamoDB table directly via the AWS Management Console or use the AWS CLI to call DynamoDB via an API call.
  3. As seen in step 5, you can retrieve result files for jobs from the same S3 directory where you placed the input files. The DynamoDB table confirms when jobs are done. The SQS output queue can be used by applications that must automatically poll and retrieve results.
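As mentioned in step 1, a small script can handle the upload order for you. The sketch below uploads the input files first and the executable last, so the S3 event fires only after everything is in place; the bucket name, folder layout, and .bat suffix are assumptions to adapt to your environment:

    import boto3

    s3 = boto3.client("s3")

    bucket = "my-job-bucket"                      # assumed bucket name
    job_prefix = "2019/11/27/jobId_0"             # unique job folder per job
    input_files = ["input_1.dat", "input_2.dat"]  # job input files
    executable = "jobId_0.bat"                    # uploaded last; its suffix triggers the S3 event

    for filename in input_files:
        s3.upload_file(filename, bucket, f"{job_prefix}/{filename}")

    # Uploading the executable last starts the job
    s3.upload_file(executable, bucket, f"{job_prefix}/{executable}")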

You no longer need to create or access compute nodes as compute resources. These automatically scale up from zero when jobs come in, and then back down to zero when jobs are finished.

 

Deployment

Read the GitHub Repo for deployment instructions. CloudFormation launch stack links are available for the following AWS Regions:

eu-north-1
ap-south-1
eu-west-3
eu-west-2
eu-west-1
ap-northeast-3
ap-northeast-2
ap-northeast-1
sa-east-1
ca-central-1
ap-southeast-1
ap-southeast-2
eu-central-1
us-east-1
us-east-2
us-west-1
us-west-2

 

 

Additional Points on Usage Patterns

 

  • While the two solutions in this blog are aimed at HPC applications, they can be used to run any batch jobs. Many customers that run large data processing batch jobs in their data lakes could use the serverless scheduler.

 

  • You can build pipelines of different applications when the output of one job triggers another to do something else – an example being pre-processing, meshing, simulation, post-processing. You simply deploy the Resource Automation template several times, and tailor it so that the output bucket for one step is the input bucket for the next step.

 

  • You might use the "job_success_string" parameter for iteration/verification in cases where a shotgun approach is needed to run thousands of jobs, and only one has a chance of producing the right result. In this case, the "job_success_string" would identify the successful job among the potentially hundreds of thousands pushed to the SQS job queue.

 

Scale-out across teams and departments

Because all services used are serverless, you can deploy as many run environments as needed without increasing overall costs. Serverless workloads only accumulate cost when the services are used. So, you could deploy ten job environments and run one job in each, and your costs would be the same as if you had one job environment running ten jobs.

 

All you need is an S3 bucket to upload jobs to and an associated AMI that has the right applications and license configuration. Because a job configuration is passed to the scheduler at each job start, you can add new teams by creating an S3 bucket and pointing S3 events to a default Lambda function that pulls configurations for each job start.

 

Set up a CI/CD pipeline for continuous improvement of the scheduler

If you are advanced, we encourage you to clone the git repo and customize this solution. The serverless scheduler is less complex than other schedulers, because you only think about one worker and the process of one job’s run.

Ways you could tailor this solution:

  • Add intelligent job scheduling using Amazon SageMaker – It is hard to find data as ready for ML as log data, because every job you run has different run times and resource consumption. You could tailor this solution to use ML to predict the best instance type to use when workloads are submitted.
  • Add Custom License Checkout Logic – Simply add one Lambda function to your Step Functions workflow to make an API call to a license server before continuing with one or more jobs. You can start a new worker once a license is checked out; if a license is not available, the instance can terminate so you don't pay for idle capacity while waiting for licenses (a rough sketch follows this list).
  • Add Custom Metrics to DynamoDB – You can easily add metrics to DynamoDB because the solution already has baseline logging and monitoring capabilities.
  • Run on other AWS Services – There is a Lambda function in the Step Functions workflow called "Start_Job". You can tailor this Lambda to run your jobs on Amazon SageMaker, Amazon EMR, Amazon EKS, or Amazon ECS instead of EC2.
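As a rough illustration of the license checkout idea above, a Lambda step might look like the sketch below. The license server URL and response format are entirely hypothetical; a Choice state in the workflow could then either continue with the job or terminate the idle worker based on the result:

    import json
    import urllib.request

    LICENSE_SERVER_URL = "https://license.example.com/checkout"  # hypothetical endpoint

    def handler(event, context):
        # Ask the license server for a seat before the workflow continues
        request = urllib.request.Request(
            LICENSE_SERVER_URL,
            data=json.dumps({"feature": event.get("license_feature", "default")}).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=10) as response:
            body = json.loads(response.read())

        # Hypothetical response shape: {"available": true, "token": "..."}
        return {
            "license_available": bool(body.get("available")),
            "license_token": body.get("token"),
        }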

 

Conclusion

 

Although HPC workloads and EDA flows may still be dependent on current scheduling technologies, we illustrated the possibilities of decoupling your workloads from your existing shared scheduling environments. This post went deep into decoupled serverless scheduling, and we understand that it is difficult to unwind decades of dependencies. However, leveraging numerous AWS Services encourages you to think completely differently about running workloads.

But more importantly, it encourages you to Think Big. With this solution you can get up and running quickly, fail fast, and iterate. You can do this while scaling to your required number of resources, when you want them, and only pay for what you use.

Serverless computing catalyzes change across all industries, but that change is not yet obvious in the HPC and EDA industries. This solution is an opportunity for customers to take advantage of the nearly limitless capacity that AWS provides.

Please reach out with questions about HPC and EDA on AWS. You now have the architecture and the instructions to build your Serverless Decoupled Scheduling environment.  Go build!


About the Authors and Contributors

Authors 

 

Ludvig Nordstrom is a Senior Solutions Architect at AWS

 

 

 

 

Mark Duffield is a Tech Lead in Semiconductors at AWS

 

 

 

Contributors

 

Steve Engledow is a Senior Solutions Builder at AWS

 

 

 

 

Arun Thomas is a Senior Solutions Builder at AWS

 

 

Use AWS Fargate and Prowler to send security configuration findings about AWS services to Security Hub

Post Syndicated from Jonathan Rau original https://aws.amazon.com/blogs/security/use-aws-fargate-prowler-send-security-configuration-findings-about-aws-services-security-hub/

In this blog post, I’ll show you how to integrate Prowler, an open-source security tool, with AWS Security Hub. Prowler provides dozens of security configuration checks related to services such as Amazon Redshift, Amazon ElastiCache, Amazon API Gateway, and Amazon CloudFront. Integrating Prowler with Security Hub will provide posture information about resources not currently covered by existing Security Hub integrations or compliance standards. You can use Prowler checks to supplement the existing CIS AWS Foundations compliance standard Security Hub already provides, as well as other compliance-related findings you may be ingesting from partner solutions.

In this post, I’ll show you how to containerize Prowler using Docker and host it on the serverless container service AWS Fargate. By running Prowler on Fargate, you no longer have to provision, configure, or scale infrastructure, and it will only run when needed. Containers provide a standard way to package your application’s code, configurations, and dependencies into a single object that can run anywhere. Serverless applications automatically run and scale in response to events you define, rather than requiring you to provision, scale, and manage servers.

Solution overview

The following diagram shows the flow of events in the solution I describe in this blog post.
 


Figure 1: Prowler on Fargate Architecture

 

The integration works as follows:

  1. A time-based CloudWatch Event starts the Fargate task on a schedule you define.
  2. Fargate pulls a Prowler Docker image from Amazon Elastic Container Registry (ECR).
  3. Prowler scans your AWS infrastructure and writes the scan results to a CSV file.
  4. Python scripts in the Prowler container convert the CSV to JSON and load an Amazon DynamoDB table with formatted Prowler findings.
  5. A DynamoDB stream invokes an AWS Lambda function.
  6. The Lambda function maps Prowler findings into the AWS Security Finding Format (ASFF) before importing them into Security Hub (a simplified sketch of this mapping follows this list).
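The CloudFormation template in Step 3 deploys this Lambda function for you, but the sketch below shows roughly what the DynamoDB-stream-to-ASFF mapping involves. The severity rule, finding type, and AWS_ACCOUNT_ID environment variable are simplifying assumptions, not the exact code:

    import datetime
    import os

    import boto3

    securityhub = boto3.client("securityhub")

    ACCOUNT_ID = os.environ["AWS_ACCOUNT_ID"]  # assumed environment variable
    REGION = os.environ["AWS_REGION"]

    def handler(event, context):
        findings = []
        for record in event["Records"]:
            if record["eventName"] != "INSERT":
                continue
            item = record["dynamodb"]["NewImage"]
            now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
            findings.append({
                "SchemaVersion": "2018-10-08",
                "Id": item["TITLE_ID"]["S"],
                "ProductArn": f"arn:aws:securityhub:{REGION}:{ACCOUNT_ID}:product/{ACCOUNT_ID}/default",
                "GeneratorId": "prowler",
                "AwsAccountId": ACCOUNT_ID,
                "Types": ["Software and Configuration Checks"],
                "CreatedAt": now,
                "UpdatedAt": now,
                # Simplified severity rule: anything that is not a PASS gets MEDIUM
                "Severity": {"Label": "INFORMATIONAL" if item["RESULT"]["S"] == "PASS" else "MEDIUM"},
                "Title": item["TITLE_TEXT"]["S"],
                "Description": item.get("NOTES", {}).get("S") or item["TITLE_TEXT"]["S"],
                "Resources": [{"Type": "AwsAccount", "Id": f"AWS::::Account:{ACCOUNT_ID}"}],
            })
        if findings:
            securityhub.batch_import_findings(Findings=findings)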

Except for an ECR repository, you’ll deploy all of the above via AWS CloudFormation. You’ll also need the following prerequisites to supply as parameters for the CloudFormation template.

Prerequisites

  • A VPC with at least 2 subnets that have access to the Internet plus a security group that allows access on Port 443 (HTTPS).
  • An ECS task role with the permissions that Prowler needs to complete its scans. You can find more information about these permissions on the official Prowler GitHub page.
  • An ECS task execution IAM role to allow Fargate to publish logs to CloudWatch and to download your Prowler image from Amazon ECR.

Step 1: Create an Amazon ECR repository

In this step, you’ll create an ECR repository. This is where you’ll upload your Docker image for Step 2.

  1. Navigate to the Amazon ECR Console and select Create repository.
  2. Enter a name for your repository (I’ve named my example securityhub-prowler, as shown in figure 2), then choose Mutable as your image tag mutability setting, and select Create repository.
     

    Figure 2: ECR Repository Creation

Keep the browser tab in which you created the repository open so that you can easily reference the Docker commands you’ll need in the next step.

Step 2: Build and push the Docker image

In this step, you’ll create a Docker image that contains scripts that will map Prowler findings into DynamoDB. Before you begin step 2, ensure your workstation has the necessary permissions to push images to ECR.

  1. Create a Dockerfile via your favorite text editor, and name it Dockerfile.
    
    FROM python:latest
    
    # Declare Env Vars
    ENV MY_DYANMODB_TABLE=MY_DYANMODB_TABLE
    ENV AWS_REGION=AWS_REGION
    
    # Install Dependencies
    RUN \
        apt update && \
        apt upgrade -y && \
        pip install awscli && \
        apt install -y python3-pip
    
    # Place scripts
    ADD converter.py /root
    ADD loader.py /root
    ADD script.sh /root
    
    # Installs prowler, moves scripts into prowler directory
    RUN \
        git clone https://github.com/toniblyx/prowler && \
        mv root/converter.py /prowler && \
        mv root/loader.py /prowler && \
        mv root/script.sh /prowler
    
    # Runs prowler, ETLs output with converter and loads DynamoDB with loader
    WORKDIR /prowler
    RUN pip3 install boto3
    CMD bash script.sh
    

  2. Create a new file called script.sh and paste in the below code. This script will call the remaining scripts, which you’re about to create in a specific order.

    Note: Change the AWS Region in the Prowler command on line 3 to the region in which you’ve enabled Security Hub.

    
    #!/bin/bash
    echo "Running Prowler Scans"
    ./prowler -b -n -f us-east-1 -g extras -M csv > prowler_report.csv
    echo "Converting Prowler Report from CSV to JSON"
    python converter.py
    echo "Loading JSON data into DynamoDB"
    python loader.py
    

  3. Create a new file called converter.py and paste in the below code. This Python script will convert the Prowler CSV report into JSON, and both versions will be written to the local storage of the Prowler container.
    
    import csv
    import json
    
    # Set variables for within container
    CSV_PATH = 'prowler_report.csv'
    JSON_PATH = 'prowler_report.json'
    
    # Reads prowler CSV output
    csv_file = csv.DictReader(open(CSV_PATH, 'r'))
    
    # Create empty JSON list, read out rows from CSV into it
    json_list = []
    for row in csv_file:
        json_list.append(row)
    
    # Writes row into JSON file, writes out to docker from .dumps
    open(JSON_PATH, 'w').write(json.dumps(json_list))
    
    # open newly converted prowler output
    with open('prowler_report.json') as f:
        data = json.load(f)
    
    # remove data not needed for Security Hub BatchImportFindings    
    for element in data: 
        del element['PROFILE']
        del element['SCORED']
        del element['LEVEL']
        del element['ACCOUNT_NUM']
        del element['REGION']
    
    # writes out to a new file, prettified
    with open('format_prowler_report.json', 'w') as f:
        json.dump(data, f, indent=2)
    

  4. Create your last file, called loader.py and paste in the below code. This Python script will read values from the JSON file and send them to DynamoDB.
    
    from __future__ import print_function # Python 2/3 compatibility
    import boto3
    import json
    import decimal
    import os
    
    awsRegion = os.environ['AWS_REGION']
    prowlerDynamoDBTable = os.environ['MY_DYANMODB_TABLE']
    
    dynamodb = boto3.resource('dynamodb', region_name=awsRegion)
    
    table = dynamodb.Table(prowlerDynamoDBTable)
    
    # CHANGE FILE AS NEEDED
    with open('format_prowler_report.json') as json_file:
        findings = json.load(json_file, parse_float = decimal.Decimal)
        for finding in findings:
            TITLE_ID = finding['TITLE_ID']
            TITLE_TEXT = finding['TITLE_TEXT']
            RESULT = finding['RESULT']
            NOTES = finding['NOTES']
    
            print("Adding finding:", TITLE_ID, TITLE_TEXT)
    
            table.put_item(
               Item={
                   'TITLE_ID': TITLE_ID,
                   'TITLE_TEXT': TITLE_TEXT,
                   'RESULT': RESULT,
                   'NOTES': NOTES,
                }
            )
    

  5. From the ECR console, within your repository, select View push commands to get operating system-specific instructions and additional resources to build, tag, and push your image to ECR. See Figure 3 for an example.
     

    Figure 3: ECR Push Commands

    Note: If you’ve built Docker images previously within your workstation, pass the --no-cache flag with your docker build command.

  6. After you’ve built and pushed your Image, note the URI within the ECR console (such as 12345678910.dkr.ecr.us-east-1.amazonaws.com/my-repo), as you’ll need this for a CloudFormation parameter in step 3.

Step 3: Deploy CloudFormation template

Download the CloudFormation template from GitHub and create a CloudFormation stack. For more information about how to create a CloudFormation stack, see Getting Started with AWS CloudFormation in the CloudFormation User Guide.

You’ll need the values you noted in Step 2 and during the “Solution overview” prerequisites. The description of each parameter is provided on the Parameters page of the CloudFormation deployment (see Figure 4)
 


Figure 4: CloudFormation Parameters

After the CloudFormation stack finishes deploying, click the Resources tab to find your Task Definition (called ProwlerECSTaskDefinition). You’ll need this during Step 4.
 


Figure 5: CloudFormation Resources

Step 4: Manually run ECS task

In this step, you’ll run your ECS Task manually to verify the integration works. (Once you’ve tested it, this step will be automatic based on CloudWatch events.)

  1. Navigate to the Amazon ECS console and from the navigation pane select Task Definitions.
  2. As shown in Figure 6, select the check box for the task definition you deployed via CloudFormation, then select the Actions dropdown menu and choose Run Task.
     

    Figure 6: ECS Run Task

  3. Configure the following settings (shown in Figure 7), then select Run Task:
    1. Launch Type: FARGATE
    2. Platform Version: Latest
    3. Cluster: Select the cluster deployed by CloudFormation
    4. Number of tasks: 1
    5. Cluster VPC: Enter the VPC of the subnets you provided as CloudFormation parameters
    6. Subnets: Select 1 or more subnets in the VPC
    7. Security groups: Enter the same security group you provided as a CloudFormation parameter
    8. Auto-assign public IP: ENABLED
       

      Figure 7: ECS Task Settings

  4. Depending on the size of your account and the resources within it, your task can take up to an hour to complete. Follow the progress by looking at the Logs tab within the Task view (Figure 8) by selecting your task. The stdout from Prowler will appear in the logs.

    Note: Once the task has completed it will automatically delete itself. You do not need to take additional actions for this to happen during this or subsequent runs.

     


    Figure 8: ECS Task Logs

  5. Under the Details tab, monitor the status. When the status reads Stopped, navigate to the DynamoDB console.
  6. Select your table, then select the Items tab. Your findings will be indexed under the primary key NOTES, as shown in Figure 9. From here, the Lambda function will trigger each time new items are written into the table from Fargate and will load them into Security Hub.
     

    Figure 9: DynamoDB Items

  7. Finally, navigate to the Security Hub console, select the Findings menu, and wait for findings from Prowler to arrive in the dashboard as shown in figure 10.
     

    Figure 10: Prowler Findings in Security Hub

If you run into errors when running your Fargate task, refer to the Amazon ECS Troubleshooting guide. Log errors commonly come from missing permissions or disabled Regions – refer back to the Prowler GitHub for troubleshooting information.

Conclusion

In this post, I showed you how to containerize Prowler, run it manually, create a schedule with CloudWatch Events, and use custom Python scripts along with DynamoDB streams and Lambda functions to load Prowler findings into Security Hub. By using Security Hub, you can centralize and aggregate security configuration information from Prowler alongside findings from AWS and partner services.

From Security Hub, you can use custom actions to send one or a group of findings from Prowler to downstream services such as ticketing systems or to take custom remediation actions. You can also use Security Hub custom insights to create saved searches from your Prowler findings. Lastly, you can use Security Hub in a master-member format to aggregate findings across multiple accounts for centralized reporting.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the AWS Security Hub forum.

Want more AWS Security news? Follow us on Twitter.


Jonathan Rau

Jonathan is the Senior TPM for AWS Security Hub. He holds an AWS Certified Specialty-Security certification and is extremely passionate about cyber security, data privacy, and new emerging technologies, such as blockchain. He devotes personal time into research and advocacy about those same topics.

How to get started with security response automation on AWS

Post Syndicated from Cameron Worrell original https://aws.amazon.com/blogs/security/how-get-started-security-response-automation-aws/

At AWS, we encourage you to use automation to help quickly detect and respond to security events within your AWS environments. In addition to increasing the speed of detection and response, automation also helps you scale your security operations as you expand your workloads running on AWS. For these reasons, security automation is a key ­principle outlined in both the Well-Architected and Cloud Adoption frameworks as well as in the AWS Security Incident Response Guide.

In this blog post, you’ll learn to implement automated security response mechanisms within your AWS environments. This post will include common patterns, implementation considerations, and an example solution. Security response automation is a broad topic that spans many areas. The goal of this blog post is to introduce you to core concepts and help you get started.

A word from our lawyers: Please note that you are responsible for making your own independent assessment of the information in this post. This post: (a) is for informational purposes only, (b) represents current AWS product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers, or licensors.

What is security response automation?

Security response automation is a planned and programmed action taken to achieve a desired state for an application or resource based on a condition or event. When you implement security response automation, you should adopt an approach that draws from existing security frameworks. Frameworks are published materials consisting of standards, guidelines, and best practices that help organizations manage cybersecurity-related risk. Using frameworks helps you achieve consistency and scalability and enables you to focus more on the strategic aspects of your security program. You should work with compliance professionals within your organization to understand any specific security frameworks that may also be relevant for your AWS environment.

Our example solution is based on the NIST Cybersecurity Framework (CSF), which is designed to help organizations assess and improve their ability to prevent, detect, and respond to security events. According to the CSF, “cybersecurity incident response” supports your ability to contain the impact of potential cybersecurity incidents. Although automation is not a CSF requirement, automating responses to events enables you to create repeatable, predictable approaches to monitoring and responding to threats.

The five main steps in the CSF are identify, protect, detect, respond and recover. We’ve expanded the detect and respond steps to include automation and investigation activities.
 


Figure 1: The five steps in the CSF

The following definitions for each step in the diagram above are based on the CSF but have been adapted for our example in this blog post. Although we will focus on the detect, automate and respond steps, it’s important to understand the entire process flow.

  • Identify: Identify and understand the resources, applications, and data within your AWS environment.
  • Protect: Develop and implement appropriate controls and safeguards to ensure delivery of services.
  • Detect: Develop and implement appropriate activities to identify the occurrence of a cybersecurity event. This step includes the implementation of monitoring capabilities which will be discussed further in the next section.
  • Automate: Develop and implement planned, programmed actions that will achieve a desired state for an application or resource based on a condition or event.
  • Investigate: Perform a systematic examination of the security event to establish the root cause.
  • Respond: Develop and implement appropriate activities to take automated or manual actions regarding a detected security event.
  • Recover: Develop and implement appropriate activities to maintain plans for resilience and to restore any capabilities or services that were impaired due to a security event.

Security response automation on AWS

AWS CloudTrail, AWS Config, and Amazon EventBridge continuously record details about the resources and configuration changes in your AWS account. You can use this information to automatically detect resource changes and to react to deviations from your desired state.
 


Figure 2: Automated remediation flow

As shown in the diagram above, an automated remediation flow on AWS has three stages:

  • Monitor: Your automated monitoring tools collect information about resources and applications running in your AWS environment. For example, they might collect AWS CloudTrail information about activities performed in your AWS account, usage metrics from your Amazon EC2 instances, or flow log information about the traffic going to and from network interfaces in your Amazon Virtual Private Cloud (VPC).
  • Detect: When a monitoring tool detects a predefined condition—such as a breached threshold, anomalous activity, or configuration deviation—it raises a flag within the system. A triggering condition might be an anomalous activity detected by Amazon GuardDuty, a resource becoming out of compliance with an AWS Config Rule, or a high rate of blocked requests on an Amazon VPC security group or AWS WAF web access control list.
  • Respond: When a condition is flagged, an automated response is triggered that performs an action you’ve predefined—something intended to remediate or mitigate the flagged condition. Examples of automated response actions might include modifying a VPC security group, patching an Amazon EC2 instance, or rotating credentials.

You can use the event-driven flow described above to achieve many automated response patterns with varying degrees of complexity. Your response pattern could be as simple as invoking a single AWS Lambda function, or it could be a complex series of AWS Step Functions tasks with advanced logic. In this blog post, we’ll use two simple Lambda functions in our example solution.

How to define your response automation

Now that we’ve introduced the concept of security response automation, start thinking about security requirements within your environment that you’d like to enforce through automation. These design requirements might come from general best practices you’d like to follow, or they might be specific controls from compliance frameworks relevant for your business. Regardless, your objectives should be quantitative, not qualitative. Here are some examples of quantitative objectives:

  • Remote administrative network access to servers should be limited.
  • Server storage volumes should be encrypted.
  • AWS console logins should be protected by multi-factor authentication.

As an optional step, you can expand these objectives into user stories that define the conditions and remediation actions when there is an event. User stories are informal descriptions that briefly document a feature within a software system. User stories may be global and span across multiple applications or they may be specific to a single application. For example:

“Remote administrative network access to servers should be limited. Remote access ports include SSH TCP port 22 and RDP TCP port 3389. If open remote access ports are detected within the environment, they should be automatically closed and the owner will be notified.”

Once you’ve completed your user story, you can determine how to use automated remediation to help achieve these objectives in your AWS environment. User stories should be stored in a location that provides versioning support and can reference the associated automation code.
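To make the remote-access user story concrete, one possible remediation is a Lambda function that revokes world-open SSH/RDP ingress rules, as in the sketch below. This is an illustration only, not part of the sample solution deployed later in this post; in practice you would trigger it from a detection source such as an AWS Config rule and notify the resource owner as the story requires:

    import boto3

    ec2 = boto3.client("ec2")
    REMOTE_ACCESS_PORTS = {22, 3389}  # SSH and RDP

    def handler(event, context):
        # Revoke security group rules that expose remote access ports to 0.0.0.0/0
        for group in ec2.describe_security_groups()["SecurityGroups"]:
            for permission in group.get("IpPermissions", []):
                if permission.get("FromPort") not in REMOTE_ACCESS_PORTS:
                    continue
                open_ranges = [r for r in permission.get("IpRanges", [])
                               if r.get("CidrIp") == "0.0.0.0/0"]
                if not open_ranges:
                    continue
                ec2.revoke_security_group_ingress(
                    GroupId=group["GroupId"],
                    IpPermissions=[{
                        "IpProtocol": permission["IpProtocol"],
                        "FromPort": permission["FromPort"],
                        "ToPort": permission["ToPort"],
                        "IpRanges": open_ranges,
                    }],
                )
                # A real workflow would also notify the owner here (for example, via SNS)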

You should carefully consider the effect of your remediation mechanisms in order to prevent unintended impact on your resources and applications. Remediation actions such as instance termination, credential revocation, and security group modification can adversely affect application availability. Depending on the level of risk that’s acceptable to your organization, your automated mechanism might only provide a notification which can then be manually investigated prior to remediation. Once you’ve identified an automated remediation mechanism, you can build out the required components and test them in a non-production environment.

Sample response automation walkthrough

In the following section, we’ll walk you through an automated remediation for a simulated event that indicates potential unauthorized activity—the unintended disabling of CloudTrail logging. Outside parties might want to disable logging to prevent detection and recording of their unauthorized activity. Our response is to re-enable the CloudTrail logging and immediately notify the security contact. Here’s the user story for this scenario:

“CloudTrail logging should be enabled for all AWS accounts and regions. If CloudTrail logging is disabled, it will automatically be enabled and the security operations team will be notified.”

Note: The sample response automation below references Amazon EventBridge which extends and builds upon CloudWatch Events. Amazon EventBridge uses the same Amazon CloudWatch Events API, so the event structure and rules configuration are the same. This blog post uses base functionality that is identical in both EventBridge and CloudWatch Events.

Prerequisites

In order to use our sample remediation, you will need to enable Amazon GuardDuty and AWS Security Hub in the AWS Region you have selected. Both of these services include a 30-day free trial. See the AWS Security Hub pricing page and the Amazon GuardDuty pricing page for additional details.

Important: You’ll use AWS CloudTrail to test the sample remediation. Running more than one CloudTrail trail in your AWS account will result in charges based on the number of events processed while the trail is running. Charges for additional copies of management events recorded in a Region are applied based on the published pricing plan. To minimize the charges, follow the clean-up steps that we provide later in this post to remove the sample automation and delete the trail.

Deploy the sample response automation

In this section, we’ll show you how to deploy and test the CloudTrail logging remediation sample. Amazon GuardDuty generates the finding Stealth:IAMUser/CloudTrailLoggingDisabled when CloudTrail logging is disabled, and AWS Security Hub collects findings from GuardDuty using the standardized finding format mentioned earlier. We recommend that you deploy this sample into a non-production AWS account.

Select the Launch Stack button below to deploy a CloudFormation template with an automation sample in the us-east-1 Region. You can also download the template and implement it in another Region. The template consists of an Amazon EventBridge rule, an AWS Lambda function and the IAM permissions necessary for both components to execute. It takes several minutes for the CloudFormation stack build to complete.

Select this image to open a link that starts building the CloudFormation stack

  1. In the CloudFormation console, choose the Select Template form, and then select Next.
  2. On the Specify Details page, provide the email address for a security contact. (For the purpose of this walkthrough, it should be an email address you have access to.) Then select Next.
  3. On the Options page, accept the defaults, then select Next.
  4. On the Review page, confirm the details, then select Create.
  5. While the stack is being created, check the inbox of the email address you provided in step 2. Look for an email message with the subject AWS Notification – Subscription Confirmation. Select the link in the body of the email to confirm your subscription to the Amazon Simple Notification Service (Amazon SNS) topic. You should see a success message similar to the screenshot below:
     

    Figure 3: SNS subscription confirmation

  6. Return to the CloudFormation console. Once the Status field for the CloudFormation stack changes to CREATE_COMPLETE (as shown in figure 4), the solution is implemented and is ready for testing.
     

    Figure 4: CREATE_COMPLETE status

Test the sample automation

You’re now ready to test the automated response by creating a test trail in CloudTrail, then trying to stop it.

  1. From the AWS Management Console, choose Services > CloudTrail.
  2. Select Trails, then select Create Trail.
  3. On the Create Trail form:
    1. Enter a value for Trail name. We use test-trail in our example below.
    2. Under Management events, select Write-only (to minimize event volume).
       

      Figure 5: Create a CloudTrail trail

    3. Under Storage location, choose an existing S3 bucket or create a new one. Note that since S3 bucket names are globally unique, you must add characters (such as a random string) to the name. For example: my-test-trail-bucket-<random-string>.
  4. On the Trails page of the CloudTrail console, verify that the new trail has started. You should see a green checkmark in the Status column, as shown in figure 6.
     

    Figure 6: Verify new trail has started

  5. You’re now ready to act like an unauthorized user trying to cover their tracks! Stop the logging for the trail you just created:
    1. Select the new trail name to display its configuration page.
    2. Toggle the Logging switch in the top-right corner to OFF.
    3. When prompted with a warning dialog box, select Continue.
    4. Verify that the Logging switch is now off, as shown below.
       

      Figure 7: Verify logging switch is off

      You have now simulated a security event by disabling logging for one of the trails in the CloudTrail service. Within the next few seconds, the near real-time automated response will detect the stopped trail, restart it, and send an email notification. You can refresh the Trails page of the CloudTrail console to verify that the trail’s status is ON again.

      Within the next several minutes, the investigatory automated response will also begin. GuardDuty will detect the action that stopped the trail and enrich the data about the source of unexpected behavior. Security Hub will then ingest that information and optionally correlate with other security events.

      Following the steps below, you can monitor findings within Security Hub for the finding type TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled to be generated:

  6. In the AWS Management Console, choose Services > Security Hub
    1. Select Findings in the left pane.
    2. Select the Add filters field, then select Type.
    3. Select EQUALS, paste TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled into the field, then select Apply.
    4. Refresh your browser periodically until the finding is generated.

      Figure 8: Monitor Security Hub for your finding

While you wait on that detection, let’s dig into the components of automation.

How the sample automation works

This example incorporates two automated responses: a near real-time workflow and an investigatory workflow. The near real-time workflow provides a rapid response to an individual event, in this case the stopping of a trail. The goal is to restore the trail to a functioning state and alert security responders as quickly as possible. The investigatory workflow still includes a response to provide defense in depth and also uses services that support a more in-depth investigation of the incident.


Figure 9: Sample automation workflow

In the near real-time workflow, Amazon EventBridge monitors for the undesired activity. When a trail is stopped, AWS CloudTrail publishes an event on the EventBridge bus. An EventBridge rule detects the trail-stopping event and invokes a Lambda function to respond to the event by restarting the trail and notifying the security contact via an Amazon Simple Notification Service (SNS) topic.

In the investigative workflow, CloudTrail logs are monitored for undesired activities. For example, if a trail is stopped, there will be a corresponding log record. GuardDuty detects this activity and retrieves additional data points about the source IP that executed the API call. Two common examples of those additional data points in GuardDuty findings include whether the API call came from an IP address on a threat list, or whether it came from a network not commonly used in your AWS account. An AWS Lambda function responds by restarting the trail and notifying the security contact. Finally, the finding is imported into AWS Security Hub for additional investigation.

AWS Security Hub imports findings from AWS security services such as GuardDuty, Amazon Macie and Amazon Inspector, plus from any third-party product integrations you’ve enabled. All findings are provided to Security Hub in AWS Security Finding Format, which eliminates the need for data conversion. Security Hub correlates these findings to help you identify related security events and determine a root cause. Security Hub also publishes its findings to Amazon EventBridge to enable further processing by other AWS services such as AWS Lambda.

Respond step deep dive

Amazon EventBridge and AWS Lambda work together to respond to a security finding. Amazon EventBridge is a service that provides real-time access to changes in data in AWS services, your own applications, and Software-as-a-Service (SaaS) applications without writing code. In this example, EventBridge identifies a Security Hub finding that requires action and invokes a Lambda function that performs remediation. As shown in figure 10, the Lambda function both notifies the security operator via SNS and restarts the stopped CloudTrail.


Figure 10: Sample “respond” workflow

To set this response up, we looked for an event to indicate that a trail had stopped or was disabled. We knew that the GuardDuty finding Stealth:IAMUser/CloudTrailLoggingDisabled is raised when CloudTrail logging is disabled. Therefore, we configured the default event bus to look for this event. You can learn more about all of the available GuardDuty findings in the user guide.

How the code works

When Security Hub publishes a finding to EventBridge, it includes full details of the incident as discovered by GuardDuty. The finding is published in JSON format. If you review the details of the sample finding, note that it has several fields helping you identify the specific events that you’re looking for. Here are some of the relevant details:


{
   …
   "source":"aws.securityhub",
   …
   "detail":{
      "findings": [{
		…
    	"Types": [
			"TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"
			],
		…
      }]
   }
}

You can build an event pattern using these fields, which an EventBridge filtering rule can then use to identify events and to invoke the remediation Lambda function. Below is a snippet from the CloudFormation template we provided earlier that defines that event pattern for the EventBridge filtering rule:


# pattern matches the nested JSON format of a specific Security Hub finding
      EventPattern:
        source:
        - aws.securityhub
        detail-type:
          - "Security Hub Findings - Imported"
        detail:
          findings:
            Types:
              - "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"

Once the rule is in place, EventBridge continuously scans the event bus for this pattern. When EventBridge finds a match, it invokes the remediating Lambda function and passes the full details of the event to the function. The Lambda function then parses the JSON fields in the event so that it can act as shown in this Python code snippet:


# extract trail ARN by parsing the incoming Security Hub finding (in JSON format)
trailARN = event['detail']['findings'][0]['ProductFields']['action/awsApiCallAction/affectedResources/AWS::CloudTrail::Trail']   

# description contains useful details to be sent to security operations
description = event['detail']['findings'][0]['Description']

The code also issues a notification to security operators so they can review the findings and insights in Security Hub and other services to better understand the incident and to decide whether further manual actions are warranted. Here’s the code snippet that uses SNS to send out a note to security operators:


#Sending the notification that the AWS CloudTrail has been disabled.
snspublish = snsclient.publish(
	TargetArn = snsARN,
	Message="Automatically restarting CloudTrail logging.  Event description: \"%s\" " %description
	)

While notifications to human operators are important, the Lambda function will not wait to take action. It immediately remediates the condition by restarting the stopped trail in CloudTrail. Here’s a code snippet that restarts the trail to reenable logging:


#Enabling the AWS CloudTrail logging
try:
	client = boto3.client('cloudtrail')
	enablelogging = client.start_logging(Name=trailARN)
	logger.debug("Response on enable CloudTrail logging- %s" %enablelogging)
except ClientError as e:
	logger.error("An error occured: %s" %e)

After the trail has been restarted, API activity is once again logged and can be audited. This can help provide relevant data for the remaining steps in the incident response process. The data is especially important for the post-incident phase, when your team analyzes lessons learned to prevent future incidents. You can also use this phase to identify additional steps to automate in your incident response.

Clean up

After you’ve completed the sample security response automation, we recommend that you remove the resources created in this walkthrough example from your account in order to minimize any charges associated with the trail in CloudTrail and data stored in S3.

Important: Deleting resources in your account can negatively impact the applications running in your AWS account. Verify that applications and AWS account security do not depend on the resources you’re about to delete.

Here are the clean-up steps:

  1. Delete the CloudFormation stack.
  2. Delete the trail you created in CloudTrail.
  3. If you created an S3 bucket for CloudTrail logs, you can also delete that S3 bucket.
  4. New accounts can try GuardDuty at no cost for 30 days. You can suspend or disable GuardDuty before the free trial period ends to avoid charges.
  5. Security Hub comes with a 30-day free trial. You can avoid charges by disabling the service before the trial period is over.

Summary

You’ve learned the basic concepts and considerations behind security response automation on AWS and how to use Amazon EventBridge, Amazon GuardDuty, and AWS Security Hub to automatically re-enable AWS CloudTrail when it becomes disabled unexpectedly. As a next step, you may want to start building your own response automations and dive deeper into the AWS Security Incident Response Guide, the NIST Cybersecurity Framework (CSF), or the AWS Cloud Adoption Framework (CAF) Security Perspective. You can explore additional automatic remediation solutions on the AWS Security Blog. You can find the code used in this example on GitHub.

If you have feedback about this blog post, submit it in the Comments section below. If you have questions about using this solution, start a thread in the EventBridge, GuardDuty, or Security Hub forums, or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Cameron Worrell

Cameron is a Solutions Architect with a passion for security and enterprise transformation. He joined AWS in 2015.

Alex Tomic

Alex is an AWS Enterprise Solutions Architect focused on security and compliance. He joined AWS in 2014.

Nathan Case

Nathan is a Senior Security Strategist who joined AWS in 2016. He is always interested to see where our customers plan to go and how we can help them get there. He is also interested in intel, combined data lake sharing opportunities, and open source collaboration. In the end, Nathan loves technology and the fact that we can change the world to make it a better place.

Debugging with Amazon CloudWatch Synthetics and AWS X-Ray

Post Syndicated from Nizar Tyrewalla original https://aws.amazon.com/blogs/devops/debugging-with-amazon-cloudwatch-synthetics-and-aws-x-ray/

Today, AWS X-Ray launches support for Amazon CloudWatch Synthetics, enabling developers to trace end-to-end requests from configurable scripts called "canaries". These canaries run modular, lightweight test scripts 24×7, as often as once per minute, to monitor web endpoints and APIs. They continuously capture the behavior and availability of the endpoint or URL being monitored, reporting what end customers are seeing. Customers are alerted immediately when failures occur and can tie them back to the root cause using traces. Canaries provide a complete picture of all the services invoked in the request path. With this feature, developers and DevOps engineers can correlate failures in the application with the impact on end customers, determine the root cause of the failures, and identify the upstream and downstream services impacted.

Overview

X-Ray helps developers and DevOps engineers analyze and debug distributed applications, such as applications built with microservice architecture. Using X-Ray, you can understand how your application and its underlying services are performing in order to identify and troubleshoot the root causes of performance issues and errors. X-Ray helps you debug and triage distributed applications, wherever those applications are running, and whether the architecture is serverless, containers, Amazon EC2, on-premises, or a mixture of all of the above.

Amazon CloudWatch Synthetics is a fully managed synthetic monitoring service that allows developers and DevOps engineers to monitor their application endpoints and URLs using configurable scripts called "canaries" that run 24×7. Canaries alert you as soon as something does not work as expected, as defined by your script. CloudWatch Synthetics canaries can be customized to check for availability, latency, transactions, broken or dead links, step-by-step task completions, page load errors, load latencies for UI assets, complex wizard flows, or checkout flows in your applications.

CloudWatch Synthetics supports End-User Experience Monitoring (EUEM, or synthetic monitoring), Web Services Monitoring (REST APIs), URL Monitoring, and Website Content Monitoring (protection against unauthorized changes to your websites from phishing, code injection, and cross-site scripting). Canary traffic can continually verify your customers' experience even when you don't have any customer traffic. Canaries can discover issues as soon as your customers do.

Use cases

Here are some of the use cases for which developers and DevOps engineers can use AWS X-Ray with Amazon CloudWatch Synthetics.

Determine if there is a problem

You can determine if there is a problem with your CloudWatch Synthetics canaries by setting CloudWatch alarms based on certain thresholds. You can also determine an increase in errors, faults, throttling rates, or slow response times within your X-Ray Service Map.

Developers can set up thresholds on canaries in the Amazon CloudWatch Synthetics console, which creates CloudWatch alarms that indicate whether there were any issues. These thresholds are managed centrally and alert you based on the granularity of the test run. You can then use Amazon Simple Notification Service (SNS) or CloudWatch Events for alerts. For example, if a canary runs once per minute, you can set a notification for when it fails 5 times in 5 runs. You can further analyze and triage HAR files, screenshots, and logs in the CloudWatch Synthetics console.
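
As an illustration, the following boto3 sketch creates such an alarm programmatically. The canary name, threshold, and SNS topic ARN are placeholders; Synthetics publishes a SuccessPercent metric in the CloudWatchSynthetics namespace, but confirm the metric names and dimensions for your canary in the CloudWatch console.

import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the canary's success percentage drops below 90% (illustrative threshold)
cloudwatch.put_metric_alarm(
    AlarmName='my-canary-success-alarm',                          # placeholder alarm name
    Namespace='CloudWatchSynthetics',                             # namespace used by Synthetics canary metrics
    MetricName='SuccessPercent',
    Dimensions=[{'Name': 'CanaryName', 'Value': 'my-canary'}],    # placeholder canary name
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=90,
    ComparisonOperator='LessThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:111122223333:canary-alerts']  # placeholder SNS topic ARN
)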

Enable thresholds in CloudWatch Synthetics

Canary thresholds in Amazon CloudWatch

Screenshots in Amazon CloudWatch Synthetics console

 

 

HAR file in Amazon CloudWatch Synthetics console

Developers can also use the new synthetic canary client nodes, which have the type Client::Synthetics, in the AWS X-Ray console to pinpoint which synthetic canaries are experiencing issues with errors, faults, or throttling rates for the selected time frame. Select the synthetic canary node to view the response time distribution of the entire request. Zoom in to a specific time range and dive deep into analyzing trends using X-Ray Analytics, as shown in the following screenshot of a service map with a synthetic canary client node.

X-Ray Service Map with Synthetics node showing errors

 

 

Find the root cause of ongoing failures

You can find the root cause of ongoing failures in upstream and downstream services to reduce the mean time to resolution.

Developers can view the end-to-end path of requests within the X-Ray service map to easily understand the services invoked, as well as the upstream and downstream services. You can also use trace maps for individual traces to dissect each request and determine which service results in the most latency or is causing an error, as shown in the following diagrams.

 

X-Ray Service Map with end to end path of the request invoked by Synthetic canaries

X-Ray Trace Map to view individual request invoked by Synthetic canaries

Once you receive a CloudWatch alarm for failures in a synthetic canary, it becomes critical to determine the root cause of the issue and reduce the impact on end users. Developers can use X-Ray Analytics to address this. X-Ray performs statistical modeling on trace data to determine the probable root cause of an issue. As you can see in the following screenshots, the synthetic tests for the "www" console application running on AWS Elastic Beanstalk are failing because of a throughput capacity exception from the DynamoDB table, which impacts the product microservice. This can be quickly determined in the X-Ray Analytics console. Developers can also determine which of their canary tests are causing the issue by using the automatically created X-Ray annotation "Annotation.aws:canary_arn".

 

Analyze root cause of the issues in Synthetic canaries using X-Ray Analytics

 

Root cause of the issue using X-Ray Analytics

 

Identify performance bottlenecks and trends

You can identify performance bottlenecks and trends seen by your Synthetics canaries’ traces over time.

View trends in the performance of your website or endpoint over time by using the continuous traffic from your synthetic canaries to populate a response time histogram. Understand the p50, p90, and p95 percentiles for requests originating from Synthetics tests, and determine the proportion of failures and the time series activity to pinpoint when issues occurred.

Response time histogram and time series activity for Synthetic canaries

 

Compare latency, error/fault rates, and end-user experience with Synthetic canaries

You can compare latency, error/fault rates, and end-user experience with the Synthetic canaries.

Using X-Ray Analytics, developers can compare any two trends or collections of traces. For example, in the following diagram, we compare the experience of end users using the system (shown as the blue trend line) to what the Synthetics tests are reporting (shown as the green trend line), which started around 3:00 AM. We can easily identify that the Synthetics tests are reporting more 5xx errors than end users are seeing.

Compare end-user experience with Synthetics canaries using X-Ray Analytics

 

Identify if you have enough canary coverage for all the APIs and URLs

When debugging synthetic canaries, developers and DevOps engineers want to understand whether they have enough coverage for their APIs and URLs and whether their Synthetic canaries have adequate tests. Using X-Ray Analytics, developers can compare the experience of Synthetic canaries with that of the rest of their end users. The screenshot below shows a blue trend line for canaries and a green one for the rest of the users. You can also identify that two out of the three URLs don't have canary tests.

Identify which URLs and APIs have canaries running.

Slice and dice the X-Ray service map to focus only on Synthetics tests and their path to downstream services using X-Ray groups

Developers can have multiple APIs and applications running within an account. At that point, it becomes critical to focus on a specific set of workflows, such as the Synthetics tests for the application "www" running on AWS Elastic Beanstalk, as shown below. Developers can create an X-Ray group with the filter expression edge(id(name: "www", type: "client::Synthetics"), id(name: "www", type: "AWS::ElasticBeanstalk::Environment")) to focus only on traces generated by their Synthetics tests for "www".
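
Here is a minimal boto3 sketch of creating such a group; the group name is a placeholder, and the filter expression mirrors the one above.

import boto3

xray = boto3.client('xray')

# Create an X-Ray group scoped to traces flowing from the Synthetics canary client
# node named "www" to the "www" Elastic Beanstalk environment
xray.create_group(
    GroupName='www-synthetics',  # placeholder group name
    FilterExpression='edge(id(name: "www", type: "client::Synthetics"), '
                     'id(name: "www", type: "AWS::ElasticBeanstalk::Environment"))'
)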

Create X-Ray Groups for Synthetics canaries

 

Getting Started

Getting started with Synthetics is easy. Enable X-Ray tracing for your APIs, web application endpoints, and URLs being monitored by Synthetics canaries. Visit the documentation to learn more about getting started using AWS X-Ray with Amazon CloudWatch Synthetics.

Conclusion

In this post, I outlined how you can use AWS X-Ray to debug and triage canaries running in Amazon CloudWatch Synthetics. We looked at some of the use cases that enable developers and DevOps engineers to reduce time to resolution by viewing the end-to-end path of a request, determining the root cause of an issue, and understanding which upstream or downstream services are impacted.

Digital signing with the new asymmetric keys feature of AWS KMS

Post Syndicated from Raj Copparapu original https://aws.amazon.com/blogs/security/digital-signing-asymmetric-keys-aws-kms/

AWS Key Management Service (AWS KMS) now supports asymmetric keys. You can create, manage, and use public/private key pairs to protect your application data using the new APIs via the AWS SDK. Similar to the symmetric key features we’ve been offering, asymmetric keys can be generated as customer master keys (CMKs) where the private portion never leaves the service, or as a data key where the private portion is returned to your calling application encrypted under a CMK. The private portion of an asymmetric CMK is used in AWS KMS hardware security modules (HSMs) designed so that no one, including AWS employees, can access the plaintext key material. AWS KMS supports the following asymmetric key types: RSA 2048, RSA 3072, RSA 4096, ECC NIST P-256, ECC NIST P-384, ECC NIST P-521, and ECC SECG P-256k1.

We’ve talked with customers and know that one popular use case for asymmetric keys is digital signing. In this post, I will walk you through an example of signing and verifying files using some of the new APIs in AWS KMS.

Background

A common way to ensure the integrity of a digital message as it passes between systems is to use a digital signature. A sender uses a secret along with cryptographic algorithms to create a data structure that is appended to the original message. A recipient with access to that secret can cryptographically verify that the message hasn’t been modified since the sender signed it. In cases where the recipient doesn’t have access to the same secret used by the sender for verification, a digital signing scheme that uses asymmetric keys is useful. The sender can make the public portion of the key available to any recipient to verify the signature, but the sender retains control over creating signatures using the private portion of the key. Asymmetric keys are used for digital signature applications such as trusted source code, authentication/authorization tokens, document e-signing, e-commerce transactions, and secure messaging. AWS KMS supports what are known as raw digital signatures, where there is no identity information about the signer embedded in the signature object. A common way to attach identity information to a digital signature is to use digital certificates. If your application relies on digital certificates for signing and signature verification, we recommend you look at AWS Certificate Manager and Private Certificate Authority. These services allow you to programmatically create and deploy certificates with keys to your applications for digital signing operations. A common application of digital certificates is TLS termination on a web server to secure data in transit.

Signing and verifying files with AWS KMS

Assume that you have an application A that sends a file to application B in your AWS account. You want the file to be digitally signed so that the receiving application B can verify it hasn’t been tampered with in transit. You also want to make sure only application A can digitally sign files using the key because you don’t want application B to receive a file thinking it’s from application A when it was really from a different sender that had access to the signing key. Because AWS KMS is designed so that the private portion of the asymmetric key pair used for signing cannot be used outside the service or by unauthenticated users, you’re able to define and enforce policies so that only application A can sign with the key.

To start, application A will submit either the file itself or a digest of the file to the AWS KMS Sign API under an asymmetric CMK. If the file is less than 4KB, AWS KMS will compute a digest for you as a part of the signing operation. If the file is greater than 4KB, you must send only the digest you created locally and you must tell AWS KMS that you’re passing a digest in the MessageType parameter of the request. You can use any of several hashing functions in your local environment to create a digest of the file, but be aware that the receiving application in account B will need to be able to compute the digest using the same hash function in order to verify the integrity of the file. In my example, I’m using SHA256 as the hash function. Once the digest is created, AWS KMS uses the private portion of the asymmetric CMK to encrypt the digest using the signing algorithm specified in the API request. The result is a binary data object, which we’ll refer to as “the signature” throughout this post.

Once application B receives the file with the signature, it must create a digest of the file. It then passes this newly generated digest, the signature object, the signing algorithm used, and the CMK keyId to the Verify API. AWS KMS uses the corresponding public key of the CMK with the signing algorithm specified in the request to verify the signature. Instead of submitting the signature to the Verify API, application B could verify the signature locally by acquiring the public key. This might be an attractive option if application B didn’t have a way to acquire valid AWS credentials to make a request of AWS KMS. However, this method requires application B to have access to the necessary cryptographic algorithms and to have previously received the public portion of the asymmetric CMK. In my example, application B is running in the same account as application A, so it can acquire AWS credentials to make the Verify API request. I’ll describe how to verify signatures using both methods in a bit more detail later in the post.

Creating signing keys and setting up key policy permissions

To start, you need to create an asymmetric CMK. When calling the CreateKey API, you’ll pass one of the asymmetric values for the CustomerMasterKeySpec parameter. In my example, I’m choosing a key spec of ECC_NIST_P256 because keys used with elliptic curve algorithms tend to be more efficient than those used with RSA-based algorithms, and because it pairs with the ECDSA_SHA_256 signing algorithm used later in this post.

As a part of creating your asymmetric CMK, you need to attach a resource policy to the key to control which cryptographic operations the AWS principals representing applications A and B can use. A best practice is to use a different IAM principal for each application in order to scope down permissions. In this case, you want application A to only be able to sign files, and application B to only be able to verify them. I will assume each of these applications is running on Amazon EC2, and so I’ll create a couple of IAM roles.

  • The IAM role for application A (SignRole) will be given kms:Sign permission in the CMK key policy
  • The IAM role for application B (VerifyRole) will be given kms:Verify permission in the CMK key policy

The stanza in the CMK key policy document to allow signing should look like this (replace the account ID value of <111122223333> with your own):


{
	"Sid": "Allow use of the key for digital signing",
	"Effect": "Allow",
	"Principal": {"AWS":"arn:aws:iam::<111122223333>:role/SignRole"},
	"Action": "kms:Sign",
	"Resource": "*"
}

The stanza in the CMK key policy document to allow verification should look like this (replace the account ID value of <111122223333> with your own):


{
	"Sid": "Allow use of the key for verification",
	"Effect": "Allow",
	"Principal": {"AWS":"arn:aws:iam::<111122223333>:role/VerifyRole"},
	"Action": "kms:Verify",
	"Resource": "*"
}

Signing Workflow

Once you have created the asymmetric CMK and IAM roles, you’re ready to sign your file. Application A will create a message digest of the file and make a sign request to AWS KMS with the asymmetric CMK keyId and signing algorithm. The CLI command to do this is shown below. Replace the key-id parameter with your CMK’s specific keyId.


aws kms sign \
	--key-id <1234abcd-12ab-34cd-56ef-1234567890ab> \
	--message-type DIGEST \
	--signing-algorithm ECDSA_SHA_256 \
	--message fileb://ExampleDigest

I chose the ECDSA_SHA_256 signing algorithm for this example. See the Sign API specification for a complete list of supported signing algorithms.

After validating that the API call is authorized by the credentials available to SignRole, KMS generates a signature around the digest and returns the CMK keyId, signature, and the signing algorithm.
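
If you prefer the SDK over the CLI, a minimal boto3 sketch of the same signing workflow might look like the following; the file name and key ID are placeholders.

import hashlib

import boto3

kms = boto3.client('kms')

# Compute a SHA-256 digest of the file locally (required for files larger than 4 KB)
with open('example-file.bin', 'rb') as f:          # placeholder file name
    digest = hashlib.sha256(f.read()).digest()

# Sign the digest with the asymmetric CMK
response = kms.sign(
    KeyId='1234abcd-12ab-34cd-56ef-1234567890ab',  # placeholder key ID
    Message=digest,
    MessageType='DIGEST',
    SigningAlgorithm='ECDSA_SHA_256'
)
signature = response['Signature']                  # binary signature object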

Verify Workflow 1 — Calling the verify API

Once application B receives the file and the signature, it computes the SHA-256 digest over the copy of the file it received. It then makes a verify request to AWS KMS, passing this new digest, the signature it received from application A, the signing algorithm, and the CMK keyId. The CLI command to do this is shown below. Replace the key-id parameter with your CMK’s specific keyId.


aws kms verify \
	--key-id <1234abcd-12ab-34cd-56ef-1234567890ab> \
	--message-type DIGEST \
	--signing-algorithm ECDSA_SHA_256 \
	--message fileb://ExampleDigest \
	--signature fileb://Signature

After validating that the verify request is authorized, AWS KMS verifies the signature by first decrypting the signature using the public portion of the CMK. It then compares the decrypted result to the digest received in the verify request. If they match, it returns a SignatureValid boolean of True, indicating that the original digest created by the sender matches the digest created by the recipient. Because the original digest is unique to the original file, the recipient can know that the file was not tampered with during transit.

One advantage of using the AWS KMS Verify API is that the caller doesn’t have to keep track of the specific public key matching the private key used to create the signature; the caller only has to know the CMK keyId and signing algorithm used. Also, because all requests to AWS KMS are logged to AWS CloudTrail, you can audit that the signature and verification operations were both executed as expected. See the Verify API specification for more detail on available parameters.
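
A minimal boto3 sketch of the same verification call might look like this; the file names and key ID are placeholders.

import hashlib

import boto3

kms = boto3.client('kms')

# Recompute the digest over the received copy of the file
with open('example-file.bin', 'rb') as f:          # placeholder file name
    digest = hashlib.sha256(f.read()).digest()

with open('example-file.sig', 'rb') as f:          # placeholder signature file
    signature = f.read()

response = kms.verify(
    KeyId='1234abcd-12ab-34cd-56ef-1234567890ab',  # placeholder key ID
    Message=digest,
    MessageType='DIGEST',
    SigningAlgorithm='ECDSA_SHA_256',
    Signature=signature
)
print(response['SignatureValid'])                  # True if the signature matches the digest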

Verify Workflow 2 — Verifying locally using the public key

Apart from using the Verify API directly, you can choose to retrieve the public key in the CMK using the AWS KMS GetPublicKey API and verify the signature locally. You might want to do this if application B needs to verify multiple signatures at a high rate and you don’t want to make a network call to the Verify API each time. In this method, application B makes a GetPublicKey request to AWS KMS to retrieve the public key. The CLI command to do this is below. Replace the key-id parameter with your CMK’s specific keyId.

aws kms get-public-key \
	--key-id <1234abcd-12ab-34cd-56ef-1234567890ab>

Note that application B will need permission to make a GetPublicKey request to AWS KMS. The stanza in the CMK key policy document to allow the VerifyRole identity to download the public key should look like this (replace the account ID value of <111122223333> with your own):


{
	"Sid": "Allow retrieval of the public key for verification",
	"Effect": "Allow",
	"Principal": {"AWS":"arn:aws:iam::<111122223333>:role/VerifyRole"},
	"Action": "kms:GetPublicKey ",
	"Resource": "*"
}

Once application B has the public key, it can use your preferred cryptographic provider to perform the signature verification locally. Application B needs to keep track of the public key and signing algorithm used for each signature object it will verify locally. Using the wrong public key will fail to decrypt the signature from application A, making the signature verification operation unsuccessful.
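
As an illustration, here is a minimal sketch of local verification using the third-party cryptography package. It assumes an ECC_NIST_P256 CMK with the ECDSA_SHA_256 signing algorithm and that the file and signature have been saved locally; your preferred cryptographic provider may differ.

import boto3
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.serialization import load_der_public_key

# Retrieve the DER-encoded public key from AWS KMS
kms = boto3.client('kms')
public_key_der = kms.get_public_key(
    KeyId='1234abcd-12ab-34cd-56ef-1234567890ab'   # placeholder key ID
)['PublicKey']
public_key = load_der_public_key(public_key_der)

with open('example-file.bin', 'rb') as f:          # placeholder file name
    message = f.read()
with open('example-file.sig', 'rb') as f:          # placeholder signature file
    signature = f.read()

# Verify the ECDSA signature locally; verify() raises InvalidSignature on mismatch
try:
    public_key.verify(signature, message, ec.ECDSA(hashes.SHA256()))
    print("Signature is valid")
except InvalidSignature:
    print("Signature verification failed")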

Availability and pricing

Asymmetric keys and operations in AWS KMS are available now in the Northern Virginia, Oregon, Sydney, Ireland, and Tokyo AWS Regions with support for other regions planned. Pricing information for the new feature can be found at the AWS KMS pricing page.

Summary

I showed you a simple example of how you can use the new AWS KMS APIs to digitally sign and verify an arbitrary file. By having AWS KMS generate and store the private portion of the asymmetric key, you can limit use of the key for signing only to IAM principals you define. OIDC ID tokens, OAuth 2.0 access tokens, documents, configuration files, system update messages, and audit logs are but a few of the types of objects you might want to sign and verify using this feature.

You can also perform encrypt and decrypt operations under asymmetric CMKs in AWS KMS as an alternative to using the symmetric CMKs available since the service launched. Similar to how you can ask AWS KMS to generate symmetric keys for local use in your application, you can ask AWS KMS to generate and return asymmetric key pairs for local use to encrypt and decrypt data. Look for a future AWS Security Blog post describing these use cases. For more information about asymmetric key support, see the AWS KMS documentation page.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about the asymmetric key feature, please start a new thread on the AWS KMS Discussion Forum.

Want more AWS Security news? Follow us on Twitter.

Raj Copparapu

Raj Copparapu is a Senior Product Manager, Technical. He’s a member of the AWS KMS team and focuses on defining the product roadmap to satisfy customer requirements. He has spent over 5 years innovating on behalf of customers to deliver products that help customers secure their data in the cloud. Raj received his MBA from Duke’s Fuqua School of Business and spent his early career working as an engineer and a business intelligence consultant. In his spare time, Raj enjoys yoga and spending time with his kids.

Integrating CodePipeline with on-premises Bitbucket Server

Post Syndicated from Alex Rosa original https://aws.amazon.com/blogs/devops/integrating-codepipeline-with-on-premises-bitbucket-server/

This blog post demonstrates how to integrate AWS CodePipeline with an on-premises Bitbucket Server. If you want to integrate with Bitbucket Cloud, see Integrating Git with AWS CodePipeline. The provided AWS Lambda function can get the source code from a Bitbucket Server repository whenever the user sends a new code push and store it in a designated Amazon Simple Storage Service (Amazon S3) bucket.

The Bitbucket Server integration uses webhooks configured in the Bitbucket repository. Webhooks are ideal for this case and avoid the need for performing frequent polling to check for changes in the repository.

Some security protections are available with this solution:

  • The Amazon S3 bucket has server-side encryption enabled using AES-256, and every object created is encrypted by default.
  • The Lambda function accepts only events signed by the Bitbucket Server.
  • All environment variables used by the Lambda function are encrypted at rest using AWS KMS.

Overview

During the creation of the CloudFormation stack, you can select either Amazon API Gateway or Elastic Load Balancing to communicate with the Lambda function. The following diagram shows how the integration works.

 

This diagram shows the solution flow, from a user's code push to the Bitbucket Server through to the CodePipeline trigger, and what happens in between.

 

  1. The user pushes code to the Bitbucket repository.
  2. Based on that user action, the Bitbucket server generates a new webhook event and sends it to Elastic Load Balancing or API Gateway based on which endpoint type you selected during AWS CloudFormation stack creation.
  3. API Gateway or Elastic Load Balancing forwards the request to the Lambda function, which checks the message signature using the secret configured in the webhook (a sketch of this check follows the list). If the signature is valid, the Lambda function moves to the next step.
  4. The Lambda function calls the Bitbucket server API and requests that it generate a ZIP package with the content of the branch modified by the user in Step 1.
  5. The Lambda function sends the ZIP package to the Amazon S3 bucket.
  6. CodePipeline is triggered when it detects a new or updated file in the Amazon S3 bucket path.
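
Here is a minimal sketch of the signature check described in step 3. It assumes the Bitbucket Server sends an HMAC-SHA256 signature of the request body in an X-Hub-Signature header; confirm the exact header name and algorithm for your Bitbucket Server version.

import hashlib
import hmac


def is_valid_signature(body: bytes, received_header: str, secret: str) -> bool:
    """Compare the received webhook signature with one computed from the request body."""
    expected = 'sha256=' + hmac.new(
        secret.encode('utf-8'), body, hashlib.sha256
    ).hexdigest()
    # Constant-time comparison to avoid timing attacks
    return hmac.compare_digest(expected, received_header)


# Example usage with illustrative values
if __name__ == '__main__':
    body = b'{"eventKey": "repo:refs_changed"}'
    secret = 'my-webhook-secret'          # same value configured in the Bitbucket webhook
    header = 'sha256=' + hmac.new(secret.encode('utf-8'), body, hashlib.sha256).hexdigest()
    print(is_valid_signature(body, header, secret))  # True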

Requirements

Before starting the solution setup, make sure you have:

  • An Amazon S3 bucket available to store the Lambda function setup files
  •  NPM or Yarn to install the package dependencies
  • AWS CLI

Setup

Follow these steps for setup.

Creating a personal token on the Bitbucket server

Create a personal token on the Bitbucket server that the Lambda function uses to access the repository.

  1. Log in to the Bitbucket server.
  2. Choose your user avatar, then choose Manage Account.
  3. On the Account screen, choose Personal access tokens.
  4. Choose Create a token.
  5. Fill out the form with the token name. In the Permissions section, leave Read for Projects and Repositories as is.
  6. Choose Create to finish.

Launch a CloudFormation stack

Using the steps below you will upload the Lambda function and Lambda layer ZIP files to an Amazon S3 bucket and launch the AWS CloudFormation stack to create the resources on your AWS account.

1. Clone the Git repository containing the solution source code:

git clone https://github.com/aws-samples/aws-codepipeline-bitbucket-integration.git

2. Install the NodeJS packages with npm:

cd code
npm install
cd ..

3. Prepare the packages for deployment.

aws cloudformation package --template-file ./infra/infra.yaml --s3-bucket your_bucket_name --output-template-file package.yaml

4. Edit the AWS CloudFormation parameters file.

Open the file located at infra/parameters.json in your favorite text editor and replace the parameters as follows:

BitbucketSecret – Bitbucket webhook secret used to sign webhook events. You should define the secret and use the same value here and in the Bitbucket server webhook.
BitbucketServerUrl – URL of your Bitbucket Server, such as https://server:port.
BitbucketToken – Bitbucket server personal token used by the Lambda function to access the Bitbucket API.
EndpointType – Select the type of endpoint to integrate with the Lambda function. It can be the Application Load Balancer or the API Gateway.
LambdaSubnets – Subnets where the Lambda function runs.
LBCIDR – CIDR allowed to communicate with the Load Balancer. It should allow the Bitbucket server IP address. Leave it blank if you are using the API Gateway endpoint type.
LBSubnets – Subnets where the Application Load Balancer runs. Leave it blank if you are using the API Gateway endpoint type.
LBSSLCertificateArn – SSL Certificate to associate with the Application Load Balancer. Leave it blank if you are using the API Gateway endpoint type.
S3BucketCodePipelineName – Amazon S3 bucket name that this stack creates to store the Bitbucket repository content.
VPCID – VPC ID where the Application Load Balancer and the Lambda function run.
WebProxyHost – Hostname of your proxy server used by the Lambda function to access the Bitbucket server, such as myproxy.mydomain.com. If you don’t need a web proxy, leave it blank.
WebProxyPort – Port of your proxy server used by the Lambda function to access the Bitbucket server, such as 8080. If you don’t need a web proxy, leave it blank.

5. Create the CloudFormation stack:

aws cloudformation create-stack --stack-name CodePipeline-Bitbucket-Integration --template-body file://package.yaml --parameters file://infra/parameters.json --capabilities CAPABILITY_NAMED_IAM

Creating a webhook on the Bitbucket Server

Next, create the webhook on Bitbucket server to notify the Lambda function of push events to the repository:

  1. Log into the Bitbucket server and navigate to the repository page.
  2. Choose Repository settings.
  3. Select Webhook.
  4. Choose Create webhook.
  5. Fill out the form with the name of the webhook, such as CodePipeline.
  6. Fill out the URL field with the API Gateway or Load Balancer URL. To obtain this URL, choose the Outputs tab of the AWS CloudFormation stack.
  7. Fill out the Secret field with the same value used in the AWS CloudFormation stack.
  8. In the Events section, ensure Push is selected.
  9. Choose Create to finish.
  10. Repeat these steps for each repository in which you want to enable the integration.

Configuring your pipeline

Finally, change your pipeline on CodePipeline to use the Amazon S3 bucket created by the AWS CloudFormation stack as the source of your pipeline.

The Lambda function uploads the files to the Amazon S3 bucket using the following path structure:

Project Name/Repository Name/Branch Name.zip

Now, every time someone pushes code to the Bitbucket repository, your pipeline starts automatically.

Cleaning up

If you want to remove the integration and clean up the resources created at AWS, you need to delete the CloudFormation stack. Run the command below to delete the stack and associated resources.

aws cloudformation delete-stack --stack-name CodePipeline-Bitbucket-Integration 

Conclusion

This post demonstrated how to integrate your on-premises Bitbucket Server with CodePipeline.

Continuously monitor unused IAM roles with AWS Config

Post Syndicated from Michael Chan original https://aws.amazon.com/blogs/security/continuously-monitor-unused-iam-roles-aws-config/

Developing in the cloud encourages you to iterate frequently as your applications and resources evolve. You should also apply this iterative approach to the AWS Identity and Access Management (IAM) roles you create. Periodically ensuring that all the resources you’ve created are still being used can reduce operational complexity by eliminating the need to track unnecessary resources. It also improves security: identifying unused IAM roles helps reduce the potential for improper or unintended access to your critical infrastructure and workloads.

The IAM API now provides you with information about when a role has last been used to make an AWS request. In this post, I demonstrate how you can identify inactive roles using role last used information. Additionally, I’ll show you how to implement continuous monitoring of role activity using AWS Config.

AWS services and features used

This solution uses the following services and features:

  • AWS IAM: This service enables you to manage access to AWS services and resources securely. It provides an API to retrieve the timestamp of an IAM role’s last use when making an AWS request, and the region where the request was made.
  • AWS Config: This service allows you to continuously monitor and record your AWS resource configurations. It will periodically trigger your AWS Config rule (see next bullet) and will record compliance status.
  • AWS Config Rule: This resource represents your desired configuration settings for specific AWS resources or for an entire AWS account. This resource will check the compliance status of your AWS resources. You can provide the logic that determines compliance, which enables you to mark IAM roles in use as “compliant” and inactive roles as “non-compliant.”
  • AWS Lambda: This service lets you run code without provisioning or managing servers. Lambda will be used to execute API calls to retrieve role last used information and to provide compliance evaluations to AWS Config.
  • Amazon Simple Storage Service (Amazon S3): This is a highly available and durable object store. You’ll use it to store your Lambda code in .zip format prior to deploying your Lambda function.
  • AWS CloudFormation: This service provides a common language for you to describe and provision all the infrastructure resources in your cloud environment. You’ll use it to provision all the resources described in this solution.

Solution logic

This solution identifies unused IAM roles within your account. First, you’ll identify unused roles based on a time window (last number of days) you set. I use 60 days in my example, but this range is configurable. Second, you’ll use AWS Lambda to process all the roles in your account. Third, you’ll determine if they’re compliant based on their creation time and role last used information. Last, you’ll send your evaluations to AWS Config, which records the results and reports if each role is compliant or not. If not, you can take steps to remediate, such as denying all actions that the role can perform.

Prerequisites

This solution has the following prerequisites:

Solution architecture

 

Figure 1: Solution architecture

As shown in the diagram, AWS Config (1) executes the AWS Config custom rule daily, and this frequency is configurable (2), which in turn invokes the Lambda function (3). The Lambda function enumerates each role and determines its creation date and role last used timestamp, both of which are provided via IAM’s GetAccountAuthorizationDetails API (4). When the Lambda function has determined the compliance of all your roles, the function returns the compliance results to AWS Config (5). AWS Config retains the history of compliance changes evaluated by the rule. If configured, compliance notifications can be sent to an Amazon Simple Notification Service (Amazon SNS) topic. Compliance status is viewable either in the AWS Management Console or through use of the AWS CLI or AWS SDK.

Deploying the solution

The resources for this solution are deployed through AWS CloudFormation. You must prepare the Lambda function’s source code for packaging before AWS CloudFormation can deploy the complete solution into your account.

Step 1: Prepare the Lambda deployment

First, make sure you’re running a *nix prompt (Linux, Mac, or Windows subsystem for Linux). Follow the commands below to create an empty folder named iam-role-last-used where you’ll place your Lambda source code.


mkdir iam-role-last-used
cd iam-role-last-used
touch lambda_function.py

Note that the directory you create and the code it contains will later be compressed into a .zip file by the AWS CLI’s cloudformation package command. This command also uploads the deployment .zip file to your S3 bucket. The cloudformation deploy command will reference this bucket when deploying the solution.

Next, create a Lambda layer with the latest boto3 package. This ensures that your Lambda function is using an up-to-date boto3 SDK and allows you to control the dependencies in your function’s deployment package. You can do this by following steps 1 through 4 in these directions. Be sure to record the Lambda layer ARN that you create because you will use it later.

Finally, open the lambda_function.py file in your favorite editor or integrated development environment (IDE), and place the following code into the lambda_function.py file:


import boto3
from botocore.exceptions import ClientError
from botocore.config import Config
import datetime
import fnmatch
import json
import os
import re
import logging


logger = logging.getLogger()
logging.basicConfig(
    format="[%(asctime)s] %(levelname)s [%(module)s.%(funcName)s:%(lineno)d] %(message)s", datefmt="%H:%M:%S"
)
logger.setLevel(os.getenv('log_level', logging.INFO))

# Configure boto retries
BOTO_CONFIG = Config(retries=dict(max_attempts=5))

# Define the default resource to report to Config Rules
DEFAULT_RESOURCE_TYPE = 'AWS::IAM::Role'

CONFIG_ROLE_TIMEOUT_SECONDS = 60

# Set to True to get the lambda to assume the Role attached on the Config service (useful for cross-account).
ASSUME_ROLE_MODE = False

# Evaluation strings for Config evaluations
COMPLIANT = 'COMPLIANT'
NON_COMPLIANT = 'NON_COMPLIANT'


# This gets the client after assuming the Config service role either in the same AWS account or cross-account.
def get_client(service, execution_role_arn):
    if not ASSUME_ROLE_MODE:
        return boto3.client(service)
    credentials = get_assume_role_credentials(execution_role_arn)
    return boto3.client(service, aws_access_key_id=credentials['AccessKeyId'],
                        aws_secret_access_key=credentials['SecretAccessKey'],
                        aws_session_token=credentials['SessionToken'],
                        config=BOTO_CONFIG
                        )


def get_assume_role_credentials(execution_role_arn):
    sts_client = boto3.client('sts')
    try:
        assume_role_response = sts_client.assume_role(RoleArn=execution_role_arn,
                                                      RoleSessionName="configLambdaExecution",
                                                      DurationSeconds=CONFIG_ROLE_TIMEOUT_SECONDS)
        return assume_role_response['Credentials']
    except ClientError as ex:
        if 'AccessDenied' in ex.response['Error']['Code']:
            ex.response['Error']['Message'] = "AWS Config does not have permission to assume the IAM role."
        else:
            ex.response['Error']['Message'] = "InternalError"
            ex.response['Error']['Code'] = "InternalError"
        raise ex


# Validates role pathname whitelist as passed via AWS Config parameters and returns a list of comma separated patterns.
def validate_whitelist(unvalidated_role_pattern_whitelist):
    # Names of users, groups, roles must be alphanumeric, including the following common
    # characters: plus (+), equal (=), comma (,), period (.), at (@), underscore (_), and hyphen (-).

    if not unvalidated_role_pattern_whitelist:
        return None

    regex = re.compile(r'^[-a-zA-Z0-9+=,.@_/|*]+$')
    if not regex.search(unvalidated_role_pattern_whitelist):
        raise ValueError("[Error] Provided whitelist has invalid characters")

    return unvalidated_role_pattern_whitelist.split('|')


# This uses Unix filename pattern matching (as opposed to regular expressions), as documented here:
# https://docs.python.org/3.7/library/fnmatch.html.  Please note that if using a wildcard, e.g. "*", you should use
# it sparingly/appropriately.
# If the rolename matches the pattern, then it is whitelisted
def is_whitelisted_role(role_pathname, pattern_list):
    if not pattern_list:
        return False

    # If role_pathname matches pattern, then return True, else False
    # eg. /service-role/aws-codestar-service-role matches pattern /service-role/*
    # https://docs.python.org/3.7/library/fnmatch.html
    for pattern in pattern_list:
        if fnmatch.fnmatch(role_pathname, pattern):
            # whitelisted
            return True

    # not whitelisted
    return False


# Form an evaluation as a dictionary. Suited to report on scheduled rules.  More info here:
#   https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/config.html#ConfigService.Client.put_evaluations
def build_evaluation(resource_id, compliance_type, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=None):
    evaluation = {}
    if annotation:
        evaluation['Annotation'] = annotation
    evaluation['ComplianceResourceType'] = resource_type
    evaluation['ComplianceResourceId'] = resource_id
    evaluation['ComplianceType'] = compliance_type
    evaluation['OrderingTimestamp'] = notification_creation_time
    return evaluation


# Determine if any roles were used to make an AWS request
def determine_last_used(role_name, role_last_used, max_age_in_days, notification_creation_time):

    last_used_date = role_last_used.get('LastUsedDate', None)
    used_region = role_last_used.get('Region', None)

    if not last_used_date:
        compliance_result = NON_COMPLIANT
        reason = "No record of usage"
        logger.info(f"NON_COMPLIANT: {role_name} has never been used")
        return build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason)


    days_unused = (datetime.datetime.now() - last_used_date.replace(tzinfo=None)).days

    if days_unused > max_age_in_days:
        compliance_result = NON_COMPLIANT
        reason = f"Was used {days_unused} days ago in {used_region}"
        logger.info(f"NON_COMPLIANT: {role_name} has not been used for {days_unused} days, last use in {used_region}")
        return build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason)

    compliance_result = COMPLIANT
    reason = f"Was used {days_unused} days ago in {used_region}"
    logger.info(f"COMPLIANT: {role_name} used {days_unused} days ago in {used_region}")
    return build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason)


# Returns a list of dicts, each of which has the authorization details of a role.  More info here:
#   https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html#IAM.Client.get_account_authorization_details
def get_role_authorization_details(iam_client):

    roles_authorization_details = []
    roles_list = iam_client.get_account_authorization_details(Filter=['Role'])

    while True:
        roles_authorization_details += roles_list['RoleDetailList']
        if 'Marker' in roles_list:
            roles_list = iam_client.get_account_authorization_details(Filter=['Role'], MaxItems=100, Marker=roles_list['Marker'])
        else:
            break

    return roles_authorization_details


# Check the compliance of each role by determining if role last used is > than max_days_for_last_used
def evaluate_compliance(event, context):

    # Initialize our AWS clients
    iam_client = get_client('iam', event["executionRoleArn"])
    config_client = get_client('config', event["executionRoleArn"])

    # List of resource evaluations to return back to AWS Config
    evaluations = []

    # List of dicts of each role's authorization details as returned by boto3
    all_roles = get_role_authorization_details(iam_client)

    # Timestamp of when AWS Config triggered this evaluation
    notification_creation_time = str(json.loads(event['invokingEvent'])['notificationCreationTime'])

    # ruleParameters is received from AWS Config's user-defined parameters
    rule_parameters = json.loads(event["ruleParameters"])

    # Maximum allowed days that a role can be unused, or has been last used for an AWS request
    max_days_for_last_used = int(os.environ.get('max_days_for_last_used', '60'))
    if 'max_days_for_last_used' in rule_parameters:
        max_days_for_last_used = int(rule_parameters['max_days_for_last_used'])

    whitelisted_role_pattern_list = []
    if 'role_whitelist' in rule_parameters:
        whitelisted_role_pattern_list = validate_whitelist(rule_parameters['role_whitelist'])

    # Iterate over all our roles.  If the creation date of a role is <= max_days_for_last_used, it is compliant
    for role in all_roles:

        role_name = role['RoleName']
        role_path = role['Path']
        role_creation_date = role['CreateDate']
        role_last_used = role['RoleLastUsed']
        role_age_in_days = (datetime.datetime.now() - role_creation_date.replace(tzinfo=None)).days

        if is_whitelisted_role(role_path + role_name, whitelisted_role_pattern_list):
            compliance_result = COMPLIANT
            reason = "Role is whitelisted"
            evaluations.append(
                build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason))
            logger.info(f"COMPLIANT: {role} is whitelisted")
            continue

        if role_age_in_days <= max_days_for_last_used:
            compliance_result = COMPLIANT
            reason = f"Role age is {role_age_in_days} days"
            evaluations.append(
                build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason))
            logger.info(f"COMPLIANT: {role_name} - {role_age_in_days} is newer or equal to {max_days_for_last_used} days")
            continue

        evaluation_result = determine_last_used(role_name, role_last_used, max_days_for_last_used, notification_creation_time)
        evaluations.append(evaluation_result)

    # Iterate over our evaluations 100 at a time, as put_evaluations only accepts a max of 100 evals.
    evaluations_copy = evaluations[:]
    while evaluations_copy:
        config_client.put_evaluations(Evaluations=evaluations_copy[:100], ResultToken=event['resultToken'])
        del evaluations_copy[:100]

Here’s how the above code works. The AWS Config custom rule invokes the Lambda function, calling the evaluate_compliance() method. evaluate_compliance() does the following:

  1. Retrieves information on all roles from IAM using the GetAccountAuthorizationDetails API as mentioned previously. This includes each role’s creation date and role last used timestamp.
  2. Marks each role as compliant if the role name matches one of the patterns in your whitelisted_role_pattern_list. This pattern list is passed to your rule via a user-configurable AWS CloudFormation parameter named RolePatternWhitelist. “Whitelisting roles,” below, provides instructions about how to do this.
  3. Marks each role as compliant if the age of the role in days (role_age_in_days) is less than or equal to the parameter MaxDaysForLastUsed (max_days_for_last_used). This is set via a user-configurable parameter in your CloudFormation stack. You’ll use this parameter to set the time window for how long a role can be inactive.
  4. If neither of the above conditions are met, then determine_last_used() is called, and each role will be marked as non-compliant if days_unused is greater than max_age_in_days.
  5. Finally, evaluate_compliance() calls put_evaluations() against AWS Config to store your evaluations of each role.

Step 2: Deploy the AWS CloudFormation template

Next, create an AWS CloudFormation template file named  iam-role-last-used.yml. This template uses the AWS Serverless Application Model (AWS SAM), which is an extension of CloudFormation. AWS SAM simplifies the deployment so that you don’t have to manually upload your deployment .zip file to your Amazon S3 bucket. To ensure that your template knows the location of your code .zip file, place the file on the same directory level as the iam-role-last-used directory that you created above. Then copy and paste the code below and save it to the iam-role-last-used.yml file.


AWSTemplateFormatVersion: '2010-09-09'
Description: "Creates an AWS Config rule and Lambda to check all roles' last used compliance"
Transform: 'AWS::Serverless-2016-10-31'
Parameters:

  MaxDaysForLastUsed:
    Description: Checks the number of days allowed for a role to not be used before being non-compliant
    Type: Number
    Default: 60
    MaxValue: 365

  NameOfSolution:
    Type: String
    Default: iam-role-last-used
    Description: The name of the solution - used for naming of created resources

  RolePatternWhitelist:
    Description: Pipe separated whitelist of role pathnames using simple pathname matching
    Type: String
    Default: ''
    AllowedPattern: '[-a-zA-Z0-9+=,.@_/|*]+|^$'

  LambdaLayerArn:
    Type: String
    Description: The ARN for the Lambda Layer you will use.
  
Resources:
  LambdaInvokePermission:
    Type: 'AWS::Lambda::Permission'
    DependsOn: CheckRoleLastUsedLambda
    Properties: 
      FunctionName: !GetAtt CheckRoleLastUsedLambda.Arn
      Action: lambda:InvokeFunction
      Principal: config.amazonaws.com
      SourceAccount: !Ref 'AWS::AccountId'

  LambdaExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Sub '${NameOfSolution}-${AWS::Region}'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: /
      Policies:
      - PolicyName: !Sub '${NameOfSolution}'
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Effect: Allow
            Action:
            - config:PutEvaluations
            Resource: '*'
          - Effect: Allow
            Action:
            - iam:GetAccountAuthorizationDetails
            Resource: '*'
          - Effect: Allow
            Action:
            - logs:CreateLogStream
            - logs:PutLogEvents
            Resource:
            - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:*:log-group:/aws/lambda/${NameOfSolution}:log-stream:*'

  CheckRoleLastUsedLambda:
    Type: 'AWS::Serverless::Function'
    Properties:
      Description: "Checks IAM roles' last used info for AWS Config"
      FunctionName: !Sub '${NameOfSolution}'
      Handler: lambda_function.evaluate_compliance
      MemorySize: 256
      Role: !GetAtt LambdaExecutionRole.Arn
      Runtime: python3.7
      Timeout: 300
      CodeUri: ./iam-role-last-used
      Layers:
      - !Ref LambdaLayerArn

  LambdaLogGroup:
    Type: 'AWS::Logs::LogGroup'
    Properties: 
      LogGroupName: !Sub '/aws/lambda/${NameOfSolution}'
      RetentionInDays: 30

  ConfigCustomRule:
    Type: 'AWS::Config::ConfigRule'
    DependsOn:
    - LambdaInvokePermission
    - LambdaExecutionRole
    Properties:
      ConfigRuleName: !Sub '${NameOfSolution}'
      Description: Checks the number of days that an IAM role has not been used to make a service request. If the number of days exceeds the specified threshold, it is marked as non-compliant.
      InputParameters: !Sub '{"role_whitelist":"${RolePatternWhitelist}","max_days_for_last_used":"${MaxDaysForLastUsed}"}'
      Source: 
        Owner: CUSTOM_LAMBDA
        SourceDetails: 
        - EventSource: aws.config
          MaximumExecutionFrequency: TwentyFour_Hours
          MessageType: ScheduledNotification
        SourceIdentifier: !GetAtt CheckRoleLastUsedLambda.Arn

For your reference, below is a summary of the template.

  • Parameters (these are user-configurable variables):
    • MaxDaysForLastUsed—maximum number of days allowed for a role that has not been used to make an AWS request before becoming non-compliant
    • NameOfSolution—the name of the solution, used for naming of created resources
    • RolePatternWhitelist—a pipe (“|”) separated whitelist of role pathnames using simple pathname matching (see Whitelisting roles below)
    • LambdaLayerArn—the unique ARN for your Lambda layer
  • Resources (these are the AWS resources that will be created within your account):
    • LambdaInvokePermission—allows AWS Config to invoke your Lambda function
    • LambdaExecutionRole—the role and permissions that Lambda will assume to process your roles. The policies assigned to this role allow it to perform the iam:GetAccountAuthorizationDetails, config:PutEvaluations, logs:CreateLogStream, and logs:PutLogEvents actions. The PutEvaluations action allows the function to send evaluation results back to AWS Config. The CreateLogStream and PutLogEvents actions allow it to write the Lambda execution logs to AWS CloudWatch Logs.
    • CheckRoleLastUsedLambda—defines your Lambda function and its attributes
    • LambdaLogGroup—logs from Lambda will be written to this CloudWatch Log Group
    • ConfigCustomRule—defines your custom AWS Config rule and its attributes

With the CloudFormation template you created above, use the AWS CLI’s cloudformation package command to zip the deployment package and upload it to the S3 bucket that you specify, as shown below. Make sure to replace <YOUR S3 BUCKET> with your bucket name only. Do not include the s3:// prefix:


aws cloudformation package --region <YOUR REGION> --template-file iam-role-last-used.yml \
--s3-bucket <YOUR S3 BUCKET> \
--output-template-file iam-role-last-used-transformed.yml

This will create the file iam-role-last-used-transformed.yml, which adds a reference to the S3 bucket and the pathname needed by CloudFormation to deploy your Lambda function.

Finally, deploy the solution into your AWS account using the cloudformation deploy command below. You can provide different values for NameOfSolution, MaxDaysForLastUsed, or RolePatternWhitelist by using the --parameter-overrides option. Otherwise, defaults will be used. These are specified at the top of the AWS CloudFormation template pasted above, under the Parameters section.


aws cloudformation deploy --region <YOUR REGION> --template-file iam-role-last-used-transformed.yml \
--stack-name iam-role-last-used \
--parameter-overrides NameOfSolution='iam-role-last-used' \
MaxDaysForLastUsed=60 \
RolePatternWhitelist='/breakglass-role|/security-*' \
LambdaLayerArn='<YOUR LAMBDA LAYER ARN>' \
--capabilities CAPABILITY_NAMED_IAM

The deployment is complete after the AWS CLI indicates success. This typically takes only a few minutes:


Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - iam-role-last-used

Step 3: View your findings

Now that your deployment is complete, you can view your compliance findings by going to the AWS Config console.

  1. Select the same region where you deployed the CloudFormation template.
  2. Select Rules in the left pane, which brings up the current list of rules in your account.
  3. Select the iam-role-last-used rule to view the rule’s details, as shown in Figure 2.

When a successful evaluation is indicated in the Overall rule status field, the compliance evaluation is complete. The function may take a few minutes to finish, so results may not be available immediately. You can periodically refresh your web browser to check for completion.
 


Figure 2: AWS Config custom rule details

After the rule completes its evaluations of your roles, you’ll be able to view your compliance results on the same page. In the screenshot below, you can see that there are multiple non-compliant roles. You can switch between viewing compliant and non-compliant resources by selecting the dropdown menu under Compliance status.
 


Figure 3: Viewing the compliance status

For more insight, you can hover over the “i” symbol, which provides additional information about the role’s non-compliant status (see Figure 4).
 


Figure 4: Hover over the information icon

Step 4: Export a report of your compliance

Once a successful evaluation has completed, you may want to create an exportable report of compliance. You can use the AWS CLI to programmatically script and automatically generate reports for your application, infrastructure, and security teams. They can use these reports to review non-compliant roles and take action if the role is no longer needed. The AWS CLI command below demonstrates how you can achieve this. Note that the command below encompasses a single line:

aws configservice get-compliance-details-by-config-rule --config-rule-name iam-role-last-used --output text --query 'EvaluationResults[*].{A:EvaluationResultIdentifier.EvaluationResultQualifier.ResourceId,B:ComplianceType,C:Annotation}'

The output is tab-delimited and will be similar to the lines below. The first column displays the role name. The second column states the compliance status. The last column explains the reason for the compliance status:

AdminRole   COMPLIANT      Was last used in us-west-2 46 days ago
Ec2DevRole  NON_COMPLIANT  No record of usage
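
If you want to hand this output to other teams as a file, a short script along the following lines could turn the same query into a CSV report. This is a sketch, not part of the original solution: the rule name, file name, and the assumption that the AWS CLI is configured with config:GetComplianceDetailsByConfigRule permissions are all placeholders you should adapt.

#!/usr/bin/env bash
# Sketch: export the rule's compliance results as a CSV report
RULE_NAME="iam-role-last-used"
OUTPUT_FILE="iam-role-last-used-report.csv"

echo "RoleName,ComplianceStatus,Annotation" > "${OUTPUT_FILE}"

# Text output prints the A/B/C keys as tab-separated columns; convert tabs to commas
aws configservice get-compliance-details-by-config-rule \
  --config-rule-name "${RULE_NAME}" \
  --output text \
  --query 'EvaluationResults[*].{A:EvaluationResultIdentifier.EvaluationResultQualifier.ResourceId,B:ComplianceType,C:Annotation}' \
  | awk -F'\t' 'BEGIN{OFS=","} {print $1,$2,$3}' >> "${OUTPUT_FILE}"

echo "Report written to ${OUTPUT_FILE}"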

Remediation

Now that you have a report of non-compliant roles, you must decide what to do with them. If your teams agree that a role is not necessary, the remediation can be to simply delete the role. If unsure, you can retain the role but deny it from performing any action. You can do this by attaching a new permissions policy that will deny all actions for all resources. Re-enabling the role would be as easy as removing the added policy. Otherwise, if the role is necessary but not frequently used, you can whitelist the role through the method below.
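
For example, one way to quarantine a role without deleting it is to attach an explicit deny-all inline policy. The following AWS CLI sketch illustrates the idea; the role name comes from the sample report above and the policy name is a placeholder of my choosing:

aws iam put-role-policy \
  --role-name Ec2DevRole \
  --policy-name DenyAllQuarantine \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*"
      }
    ]
  }'

Because an explicit deny overrides any allow, this effectively disables the role while preserving its configuration. Removing the DenyAllQuarantine policy with aws iam delete-role-policy restores the role's original permissions.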

Whitelisting roles

Whitelisted roles will be reported as compliant by the custom rule even if left unused. You might have roles such as a security incident response or a break-glass role that require whitelisting.

The whitelist is supplied via the CloudFormation parameter RolePatternWhitelist and is stored as an AWS Config rule parameter. The syntax uses UNIX filename pattern matching. If you need to specify multiple patterns, you can use the | (pipe) character as a delimiter between each pattern. Each delimited pattern will then be matched against the role name, including the path. For example, if you wish to whitelist the breakglass-role, security-incident-response-role and security-audit-role roles, the whitelist patterns you provide to the AWS CloudFormation template might be:

/breakglass-role|/security-*

Important: Use wildcards (*) thoughtfully, as they will match anything.
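
The rule's Lambda function isn't reproduced in this post, but the matching semantics described above can be illustrated with a small Python sketch that uses the standard library's UNIX-style pattern matching. The whitelist string and role paths below are examples only:

import fnmatch

# Example whitelist, as it would be passed to the CloudFormation template
whitelist = "/breakglass-role|/security-*"
patterns = whitelist.split("|")

def is_whitelisted(role_path_and_name):
    """Return True if the role (path plus name) matches any whitelisted pattern."""
    return any(fnmatch.fnmatch(role_path_and_name, pattern) for pattern in patterns)

print(is_whitelisted("/breakglass-role"))      # True
print(is_whitelisted("/security-audit-role"))  # True
print(is_whitelisted("/admin-role"))           # False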

Enhancements

In this walkthrough, I’ve kept the architecture and code simple to make the solution easier to follow. You can further customize the solution through the following enhancements:

Conclusion

In this post, I’ve shown you how to use AWS IAM and AWS Config to implement a detective security control that provides visibility into your IAM roles and their last time of use. I’ve also shown how you can view the results in the AWS Management Console and export them using the AWS CLI. Finally, I’ve presented different options for remediation and a means to whitelist roles that are necessary but infrequently used. These techniques can augment your security and compliance program by preventing unintended access through your IAM roles.

Additional resources

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the IAM forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.


Michael Chan

Michael is a Developer Advocate for AWS Identity. Prior to this, he was a Professional Services Consultant who assisted customers with their journey to AWS. He enjoys understanding customer problems and working backwards to provide practical solutions.

Roland AbiHanna

Roland is a Sr. Solutions Architect with Amazon Web Services. He’s focused on helping enterprise customers realize their business needs through cloud solutions, specializing in DevOps and automation. Prior to AWS, Roland ran DevOps for a variety of start-ups in Europe and the Middle East. Outside of work, Roland enjoys hiking and searching for the perfect blend of hops, barley, and water.

Creating CI/CD pipelines for ASP.NET 4.x with AWS CodePipeline and AWS Elastic Beanstalk

Post Syndicated from Kirk Davis original https://aws.amazon.com/blogs/devops/creating-ci-cd-pipelines-for-asp-net-4-x-with-aws-codepipeline-and-aws-elastic-beanstalk/

By Kirk Davis, Specialized Solutions Architect, Microsoft Platform team

As customers migrate ASP.NET (on .NET Framework) applications to AWS, many choose to deploy these apps with AWS Elastic Beanstalk, which provides a managed .NET platform to deploy, scale, and update the apps. Customers often ask how to create CI/CD pipelines for these ASP.NET 4.x (.NET Framework) apps without needing to set up or manage Jenkins instances or other infrastructure.

You can easily create these pipelines using AWS CodePipeline as the orchestrator, AWS CodeBuild for performing builds, and AWS CodeCommit, GitHub, or other systems for source control. This blog post demonstrates how to set up a simplified CI/CD pipeline that you could expand on later to include unit tests, using a CodeCommit Git repository for source control.

Creating a project and adding a buildspec.yml file

The first step in setting up this simplified CI/CD pipeline is to create a project and add a buildspec.yml file.

Creating or choosing an ASP.NET web application (.NET Framework)

First, either create a new ASP.NET Web Application (.NET Framework) project or choose an existing application to use. You can choose MVC, Web API, or even Web Forms project types based on ASP.NET 4.x. Whichever type you choose, make sure it builds and runs locally.

To set up your first CodePipeline for an ASP.NET (.NET Framework) application, you may wish to use a simple app that doesn’t require databases or other resources and which consists of a single project. The following screenshot shows the project type to choose when you create a new project in Visual Studio 2019.

Visual Studio 2019's Create New Project dialog window showing "ASP.NET Web Application (.NET Framework)" project type selected.

Visual Studio Create New Project dialog

Adding the project to CodeCommit

Next, add your project to a CodeCommit Git repository. You can either create a new repository in the CodeCommit web console and then add your new or legacy application to it by following the steps in the CodeCommit documentation or create the new repository from within Visual Studio’s Team Explorer by taking advantage of AWS Toolkit for Visual Studio’s integration with CodeCommit.

If you wish to use Team Explorer to create and interact with the CodeCommit Git repository for your project, follow Step 2 in the Integrate Visual Studio with AWS CodeCommit documentation to create the connection, and then follow the steps under Create a CodeCommit Repository from Visual Studio in the same section. Alternatively, you can work with Git from the command line.

You can reduce the number of files being stored in Git by adding a .gitignore file specific to .NET projects using Visual Studio’s Team Explorer:

  1. Choose the Home icon in the Team Explorer toolbar.
  2. Choose Settings, then Repository Settings.
  3. Choose the Add option for Ignore file under Ignore & Attributes Files, as shown in the following screenshot.
Visual Studio's Team Explorer - Repository Settings pane, showing the Add link for Ignore and Attribute Files.

Team Explorer – Repository Settings

After adding a .gitignore file and optionally connecting Visual Studio to CodeCommit, push your code up to the remote in CodeCommit using either git push or Team Explorer. After pushing your changes, you can use the CodeCommit management console in your browser to verify that all your files are there.

Adding a buildspec.yml file to your project

CodeBuild, which does the actual compilation, essentially launches a container using a docker image you specify, then runs a series of commands to install any required software and perform the actual build or tests that you want. Finally, it takes whatever output files you specify—artifacts—and uploads them in a .zip file to Amazon S3 for the next stage of the CodePipeline pipeline. The commands that CodeBuild executes in the container are specified in a buildspec.yml file, which is part of the source code of your project. You can also add it directly to the CodeBuild configuration, but it’s more convenient to edit and track in source control. When running CodeBuild with Windows containers, the default shell for these commands is PowerShell.

Add a plain text file to the root of your ASP.NET project named buildspec.yml and then open the file in an editor. Ensure you add the file to your project to easily find and edit it later. For details on the structure and contents of buildspec.yml files, refer to the CodeBuild documentation.

You can use the following sample buildspec.yml file and simply replace the values for PROJECT and DOTNET_FRAMEWORK with the name and .NET Framework target version for your project.

version: 0.2

env:
  variables:
    PROJECT: AspNetMvcSampleApp
    DOTNET_FRAMEWORK: 4.6.1
phases:
  build:
    commands:
      - nuget restore
      - msbuild $env:PROJECT.csproj /p:TargetFrameworkVersion=v$env:DOTNET_FRAMEWORK /p:Configuration=Release /p:DeployIisAppPath="Default Web Site" /p:PackageAsSingleFile=false /p:OutDir=C:\codebuild\artifacts\ /t:Package
artifacts:
  files:
    - '**/*'
  base-directory: 'C:\codebuild\artifacts\_PublishedWebsites\${env:PROJECT}_Package\Archive\'

Walkthrough of the buildspec commands

Looking at the buildspec.yml file above, you can see that the only phase defined for this sample application is build. If you need to perform some action either before or after the build, you can add pre_build and post_build phases.
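
As a sketch of where those phases would fit (the commands shown are placeholders and not part of the sample above; the test runner path assumes a hypothetical test project and a runner available on PATH in your build image), the structure might look like this:

version: 0.2

phases:
  pre_build:
    commands:
      # Runs before the build phase, for example to restore packages or fetch secrets
      - nuget restore
  build:
    commands:
      - msbuild $env:PROJECT.csproj /p:Configuration=Release /t:Package
  post_build:
    commands:
      # Hypothetical test step that runs after the build completes
      - vstest.console.exe .\MyProject.Tests\bin\Release\MyProject.Tests.dll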

The first command executed in the build phase is nuget restore to download any NuGet packages your project references. Then, MSBuild kicks off the build itself. Using the /t:Package parameter generates the web deployment folder structure that Elastic Beanstalk expects for ASP.NET Framework applications, and includes the archive.xml, parameters.xml, and systemInfo.xml files.

By default, the output of this type of build is a .zip file. However, when used in conjunction with CodePipeline, CodeBuild always zips up the artifact files that you specify, even if they’re already zipped. To avoid this double zipping, use the /p:PackageAsSingleFile=false parameter, which outputs the folder structure in a folder called Archive instead. The /p:OutDir parameter specifies where MSBuild should write the files. This example uses C:\codebuild\artifacts\.

Finally, in the artifacts node, specify which files (or artifacts) CodeBuild should compress and provide to CodePipeline. The sample above includes all the files (the ‘**/*’) in the C:\codebuild\artifacts\_PublishedWebsites\${env:PROJECT}_Package\Archive\ folder, in which ${env:PROJECT} is automatically replaced by the value of the variable for the project name specified at the top of the file.

After you finish editing the buildspec.yml file, commit and push your changes to ensure the file is in your CodeCommit Git repository.

Create an Elastic Beanstalk application and initial deployment

The CodePipeline deployment provider for Elastic Beanstalk deploys to an existing Elastic Beanstalk application environment. So before you build out your pipeline, manually deploy your application and create the destination application and environment in Elastic Beanstalk. The easiest way to do this is using the AWS Toolkit for Visual Studio. If you don’t have it installed, use the Visual Studio Extensions tool to search for aws and install the toolkit.

Once it’s installed, open your project in Visual Studio, right-click the project node in the Solutions Explorer pane, and choose Publish to AWS Elastic Beanstalk. This launches the publish wizard.

For step-by-step instructions on using the publishing wizard, see Deploy a Traditional ASP.NET Application to Elastic Beanstalk.

Once the publish wizard has finished deploying to Elastic Beanstalk, you should see the URL in the Elastic Beanstalk environment pane in Visual Studio, as shown in the following screenshot.

Alternately, you can navigate to the Elastic Beanstalk management console in your browser, select your application and environment, and see the URL in the environment dashboard. Verify that your application is viewable in your browser.

The AWS Toolkit for Visual Studio's Elastic Beanstalk deployment pane, with the environment URL circled.

AWS Toolkit – Elastic Beanstalk Environment

Creating the CI/CD pipeline

Next, create the CodePipeline pipeline.

Adding the source stage

Now that your source code is in CodeCommit, and you have an existing Elastic Beanstalk app, create your pipeline:

  1. In your browser, navigate to the CodePipeline management console.
  2. Choose Create pipeline and give your pipeline a name. To keep things simple, you might want to use the same name as your CodeCommit repo.
  3. Choose Next.
  4. Under Source, choose CodeCommit.
  5. Select your repository name from the drop-down, and choose the branch you wish to use. If you haven’t added any branches, your only choice will be the master branch.

Creating the build stage

Next, create the build stage:

  1. After choosing Next, select AWS CodeBuild as the build provider.
  2. Select your region, then choose Create project, which will open CodeBuild in another browser window.
  3. In the CodeBuild window, you can optionally assign your build project a name and description.
  4. Under Environment, select the Custom image option, and select Windows as the environment type.
  5. For building ASP.NET 4.x (.NET Framework) web projects, it’s easiest to start out with Microsoft’s .NET Framework SDK docker image, which they host on their registry.
    Select Other registry, and use mcr.microsoft.com/dotnet/framework/sdk:[version-tag] as the registry URL. Replace version-tag with the .NET framework version. For .NET Framework 4.x, the most likely options are 4.7.1, 4.7.2 or 4.8. This example uses mcr.microsoft.com/dotnet/framework/sdk:4.7.2.

For details about the .NET Framework SDK container image, see the container image page on Dockerhub. The SDK includes the Visual Studio Build Tools, the NuGet CLI, and ASP.NET Web Targets.

Next, choose a group name for Amazon CloudWatch logs under Logs (near the bottom of the page). This will output detailed build logs for each build to CloudWatch. Leave the rest of the settings as they are.

Then choose Continue to CodePipeline to save the CodeBuild configuration and return to the CodePipeline wizard’s Add build stage step. Ensure your newly created build project is specified in Project name, then choose Next.

Adding the deploy stage

In the Add deploy stage step:

  1. Select AWS Elastic Beanstalk as the Deploy provider.
  2. Select your region.
  3. In the Application name field, select the Elastic Beanstalk application you previously deployed.
  4. Select the environment you previously deployed and choose Next.
  5. Review all your settings and choose Create pipeline.

Testing out the pipeline

To test out your pipeline, make an easily visible change to your application’s code, such as adding some text to the home page. Then, commit your changes and push.
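
For example, from the root of your working copy:

git add .
git commit -m "Update home page text to test the pipeline"
git push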

Within a few moments, the Source stage in your pipeline should move to in progress, followed by the Build stage. It can take 10 minutes or more for the build stage to complete, and then the Deploy stage should finish quickly.

After the Deploy stage status changes to Succeeded, choose AWS Elastic Beanstalk in that stage in the pipeline view, as shown in the following screenshot, to navigate to your Elastic Beanstalk application.

Select the environment to which you’re deploying and select the URL. You should see that your changes are now live.

After a successful build and deploy, your pipeline should appear as it does in the following screenshot.

Screenshot of a sample CodePipeline pipeline with all stages showing a successful build and deploy.

Screenshot of successful CodePipeline pipeline

Conclusion

In this blog post, I showed you how to create a simple CI/CD pipeline for ASP.NET 4.x web applications, built with the .NET Framework, using AWS services including CodeCommit, CodePipeline, CodeBuild and Elastic Beanstalk. You can extend this pipeline with additional build actions for things like unit tests, or by adding manual approval steps.

We welcome your feedback.

Add defense in depth against open firewalls, reverse proxies, and SSRF vulnerabilities with enhancements to the EC2 Instance Metadata Service

Post Syndicated from Colm MacCarthaigh original https://aws.amazon.com/blogs/security/defense-in-depth-open-firewalls-reverse-proxies-ssrf-vulnerabilities-ec2-instance-metadata-service/

Since it first launched over 10 years ago, the Amazon EC2 Instance Metadata Service (IMDS) has helped customers build secure and scalable applications. The IMDS solved a big security headache for cloud users by providing access to temporary, frequently rotated credentials, removing the need to hardcode or distribute sensitive credentials to instances manually or programmatically. Attached locally to every EC2 instance, the IMDS runs on a special “link local” IP address of 169.254.169.254, which means only software running on the instance can access it. For applications with access to IMDS, it makes available metadata about the instance, its network, and its storage. The IMDS also makes the AWS credentials available for any IAM role that is attached to the instance.

When you run applications in the cloud, application security is as critical as instance security; if the applications running on an instance have vulnerabilities or misconfigurations, there can be serious consequences. While application security plays an important role in a layered defense, AWS also constantly evaluates where to add layers, even within the instance, to minimize the damage that can result in these situations.

Today, AWS is making v2 of the EC2 Instance Metadata Service (IMDSv2) available. The existing instance metadata service (IMDSv1) is fully secure, and AWS will continue to support it. But IMDSv2 adds new “belt and suspenders” protections for four types of vulnerabilities that could be used to try to access the IMDS. These new protections go well beyond other types of mitigations, while working seamlessly with existing mitigations such as restricting IAM roles and using local firewall rules to restrict access to the IMDS. AWS is also making new versions of the AWS SDKs and CLIs available that support IMDSv2.

What’s new in IMDSv2

With IMDSv2, every request is now protected by session authentication. A session defines the beginning and end of a series of requests that software running on an EC2 instance uses to access the locally stored EC2 instance metadata and credentials. The software starts a session with a simple HTTP PUT request to IMDSv2. IMDSv2 returns a secret token to the software running on the EC2 instance, which uses the token as a password to make requests to IMDSv2 for metadata and credentials. Unlike traditional passwords, you don’t need to worry about getting the token to the software, because the software gets it for itself with the PUT request. The token is never stored by IMDSv2 and can never be retrieved by subsequent calls, so a session and its token are effectively destroyed when the process using the token terminates. There’s no limit on the number of requests within a single session, and there’s no limit on the number of IMDSv2 sessions. Sessions can last up to six hours and, for added security, a session token can only be used directly from the EC2 instance where that session began.

For example, this curl recipe retrieves a session token that’s valid for the full six hours (21600 seconds) and then uses that token to access the EC2 instance’s profile metadata:


TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`

curl http://169.254.169.254/latest/meta-data/profile -H "X-aws-ec2-metadata-token: $TOKEN"

If you need to write code against the IMDSv2 directly, you can get more detail on the new scheme in the EC2 User Guide.
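
As a rough illustration of the same flow in code (a sketch using the Python requests library, not an official client), the token is fetched with a PUT and then passed as a header on subsequent GET requests:

import requests

IMDS = "http://169.254.169.254"

# Start a session: the PUT returns a token valid for up to six hours (21600 seconds)
token = requests.put(
    f"{IMDS}/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
    timeout=2,
).text

# Use the token as a header on metadata requests for the life of the session
profile = requests.get(
    f"{IMDS}/latest/meta-data/profile",
    headers={"X-aws-ec2-metadata-token": token},
    timeout=2,
).text
print(profile)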

How these changes add defense in depth

IMDSv2’s changes are easy to use, and you’ll start using it automatically if you’re using the updated AWS SDKs and CLIs. These changes go beyond other types of mitigations to protect against misconfigured open web application firewalls, misconfigured open reverse proxies, unpatched SSRF vulnerabilities, and misconfigured open layer 3 firewalls and network address translation devices.

Protecting against open Website Application Firewalls

Some Web Application Firewall (WAF) services, such as AWS WAF, can’t be configured to act as open WAFs. However, some third-party WAFs can be misconfigured to allow attackers unauthorized access to the network behind the WAF, including the EC2 IMDS.

Many WAFs are designed to act invisibly, so that they can protect websites and applications without administrators having to change or reconfigure the applications that are behind the WAF. To be transparent, WAFs usually pass on all of the headers that come with a request, and do not add their own headers, such as the standard X-Forwarded-For header that other kinds of proxies add. In other words, applications behind a WAF get requests just as the requester sent them.

The AWS approach is to block open WAFs by using a type of request that open WAFs very rarely support, HTTP PUT requests. Although web services such as Amazon S3 use PUT requests for object storage, they’re an uncommon type of request for websites and browsers to use. Our analysis of third-party WAF products and open WAF misconfigurations found that the vast majority do not permit HTTP PUT requests. We’re using this PUT request to provide a new layer of defense that goes beyond any existing capabilities – we’ve architected the IMDSv2 service to require a PUT request at the beginning of a session, which will prevent open WAFs from being abused to access the IMDS in the vast majority of cases.

Protecting against open reverse proxies

As it happens, it’s also very rare for open reverse proxies to allow PUT requests, but IMDSv2 has another layer of defense against open reverse proxies. Reverse proxies, such as Apache httpd or Squid, can also be misconfigured to allow external requests that reach internal resources, but it’s still normal for these proxies to send an X-Forwarded-For HTTP header. That header itself is used to pass on the IP address of the original caller. IMDSv2 will also not issue session tokens to any caller with an X-Forwarded-For header, which is effective at blocking unauthorized access due to misconfigurations like an open reverse proxy.

Protecting against SSRF vulnerabilities

SSRF vulnerabilities allow attackers to make unauthorized requests from web applications. Since these requests come from the application itself, they can be used to access internal resources that the application has access to but that were not intended to be accessible to outsiders. SSRF vulnerabilities vary in their severity, and some are immune to other types of mitigations. For instance, blocking SSRFs through static headers in instance metadata requests is effective only when the vulnerability merely allows the attacker to control the URL that is being requested; however, AWS analysis found many SSRF vulnerabilities that allow attackers to set arbitrary headers because the SSRF vulnerability impacts the application’s own header processing.

IMDSv2’s combination of beginning a session with a PUT request, and then requiring the secret session token in other requests, is always strictly more effective than requiring only a static header. AWS analysis of real-world vulnerabilities found that this combination protects against the vast majority of SSRF vulnerabilities.

Protecting against open layer 3 firewalls and NATs

Last, there is a final layer of defense in IMDSv2 that is designed to protect EC2 instances that have been misconfigured as open routers, layer 3 firewalls, VPNs, tunnels, or NAT devices. With IMDSv2, the PUT response containing the secret token will, by default, not be able to travel outside the instance. This is accomplished by having the default Time To Live (TTL) on the low-level IP packets containing the secret token set to “1,” much lower than a typical value, such as “64.” Hardware and software that handle packets, including EC2 instances, subtract 1 from each packet’s TTL field whenever they pass it on. If the TTL gets to 0, the packet is discarded, and an error message is sent back to the sender. A packet with a TTL of “64” can therefore make sixty-four “hops” in a network before giving up, while a packet with a TTL of “1” can make just one hop. This feature allows legitimate traffic to get to an intended destination, but is designed to stop packets from endlessly running around in circles if there’s a loop in a network.

With IMDSv2, setting the TTL value to “1” means that requests from the EC2 instance itself will work because they’re returned to the caller (on the instance) before the subtraction occurs. But if the EC2 instance has been misconfigured as an open router, layer 3 firewall, VPN, tunnel, or NAT device, the response containing the token will have its TTL reduced to zero before leaving the instance, and the packet containing the response will be discarded on its way out of the instance, preventing transport to the attacker. The information simply won’t make it further than the EC2 instance itself, which means that an attacker won’t get the response back with the token, and with it the ability to access instance metadata, even if they’ve been successful at getting past all other defenses.

Making the transition

Both IMDSv1 and IMDSv2 will be available and enabled by default, and customers can choose which they will use. The IMDS can now be restricted to v2 only, or IMDS (v1 and v2) can also be disabled entirely. AWS recommends adopting v2 and restricting access to v2 only for added security. IMDSv1 remains available for customers who have tools and scripts using v1, and who are comfortable with the existing security posture of their instances.

A number of tools are available to make transitioning to v2 and disabling v1 seamless. Starting today, a new CloudWatch metric is available that provides visibility into the number of v1 calls being made on any given instance. Customers can use this metric to monitor how often v1 is still being accessed as Amazon Machine Images, the AWS SDKs, CLIs, cloud-init, and other software accessing the IMDS are updated, released, and upgraded. Once an instance can be launched, activated, and used in service with the metric at zero, it is safe to require v2 of the IMDS and disable v1. For more information on transitioning to IMDSv2, see the user guide.

Security can also be further enhanced while this transition is happening. AWS credentials provided by the IMDS now include an ec2:RoleDelivery IAM context key. Credentials provided by the older IMDSv1 have an ec2:RoleDelivery value of “1.0,” and credentials using the new scheme will have an ec2:RoleDelivery value of “2.0.” This context key makes it easy to enforce use of the new scheme on a service-by-service or resource-by-resource basis by using those context keys as conditions in IAM policies, resource policies, or AWS Organizations service control policies. For example, if all of the software accessing an S3 bucket has been upgraded to use IMDSv2, then that S3 bucket can be safely restricted to only allow access to role-account credentials that have the “2.0” value (or greater) for the context key. The effect is that credentials retrieved using IMDSv1 will be prevented from accessing the bucket. AWS CloudTrail is also being updated to record the new ec2:RoleDelivery parameters.
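
For instance, a bucket policy along the following lines would deny requests made with credentials retrieved through IMDSv1. This is a sketch: the bucket name is a placeholder, and you should verify the condition key usage against the current IAM documentation before relying on it.

aws s3api put-bucket-policy --bucket example-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireIMDSv2Credentials",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ],
      "Condition": {
        "NumericLessThan": { "ec2:RoleDelivery": "2.0" }
      }
    }
  ]
}'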

Hear about IMDSv2 at re:Invent

Mark Ryland will be talking in more detail about IMDSv2, and the transition to it, at AWS re:Invent in December. We’ll update this post soon with a link to the session in the re:Invent catalog.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Processing batch jobs quickly, cost-efficiently, and reliably with Amazon EC2 On-Demand and Spot Instances

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/processing-batch-jobs-quickly-cost-efficiently-and-reliably-with-amazon-ec2-on-demand-and-spot-instances/

This post is contributed by Alex Kimber, Global Solutions Architect

No one asks for their High Performance Computing (HPC) jobs to take longer, cost more, or have more variability in the time to get results. Fortunately, you can combine Amazon EC2 and Amazon EC2 Auto Scaling to make the delivery of batch workloads fast, cost-efficient, and reliable. Spot Instances offer spare AWS compute power at a considerable discount. Customers such as Yelp, NASA, and FINRA use them to reduce costs and get results faster.

This post outlines an approach that combines On-Demand Instances and Spot Instances to balance a predictable delivery of HPC results with an opportunistic approach to cost optimization.

 

Prerequisites

This approach will be demonstrated via a simple batch-processing environment with the following components:

  • A producer Python script to generate batches of tasks to process. You can develop this script in the AWS Cloud9 development environment. This solution also uses the environment to run the script and generate tasks.
  • An Amazon SQS queue to manage the tasks.
  • A consumer Python script to take incomplete tasks from the queue, simulate work, and then remove them from the queue after they’re complete (a minimal sketch of this consumer appears after this list).
  • Amazon EC2 Auto Scaling groups to model scenarios.
  • Amazon CloudWatch alarms to trigger the Auto Scaling groups and detect whether the queue is empty. The EC2 instances run the consumer script in a loop on startup.
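
The consumer script itself isn’t reproduced in this post; a minimal sketch of the pattern it follows, with a placeholder queue URL and simulated work, might look like this:

import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/batch-tasks"  # placeholder

while True:
    # Long-poll for a task
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    messages = resp.get("Messages", [])
    if not messages:
        continue  # the CloudWatch alarm on an empty queue handles scale-in
    for msg in messages:
        time.sleep(8 * 60)  # simulate eight minutes of work per task
        # Delete the task only after it completes, within the visibility timeout
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])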

 

Testing On-Demand Instances

In this scenario, an HPC batch of 6,000 tasks must complete within five hours. Each task takes eight minutes to complete on a single vCPU.

A simple approach to meeting the target is to provision 160 vCPUs using 20 c5.2xlarge On-Demand Instances. Each of the instances should complete 60 tasks per hour, completing the batch in approximately five hours. This approach provides an adequate level of predictability. You can test this approach with a simple Auto Scaling group configuration, set to create 20 c5.2xlarge instances if the queue has any pending visible messages. As expected, the batch takes approximately five hours, as shown in the following screenshot.

In the Ireland Region, using 20 c5.2xlarge instances for five hours results in a cost of $0.384 per hour for each instance.  The batch total is $38.40.
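
The Auto Scaling trigger described above could be wired together with the AWS CLI roughly as follows. This is a sketch rather than the exact configuration used in the test; the group, alarm, and queue names are placeholders, and the policy ARN is returned by the first command.

# Simple scaling policy: set the group to exactly 20 instances
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name batch-ondemand-asg \
  --policy-name scale-out-to-20 \
  --adjustment-type ExactCapacity \
  --scaling-adjustment 20

# Alarm on pending visible messages in the task queue, invoking the policy above
aws cloudwatch put-metric-alarm \
  --alarm-name batch-queue-has-messages \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=batch-tasks \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions <SCALING-POLICY-ARN>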

 

Testing On-Demand and Spot Instances

The alternative approach to the scenario also provisions sufficient capacity for On-Demand Instances to meet the target time, in this case 20 instances. This approach gives confidence that you can meet the batch target of five hours regardless of what other capacity you add.

You can then configure the Auto Scaling group to also add a number of Spot Instances. These instances are more numerous, with the aim of delivering the results at a lower cost and also allowing the batch to complete much earlier than it would otherwise. When the queue is empty it automatically terminates all of the instances involved to prevent further charges. This example configures the Auto Scaling group to have 80 instances in total, with 20 On-Demand Instances and 60 Spot Instances. Selecting multiple different instance types is a good strategy to help secure Spot capacity by diversification.

Spot Instances occasionally experience interruptions when AWS must reclaim the capacity with a two-minute warning. You can handle this occurrence gracefully by configuring your batch processor code to react to the interruption, such as checkpointing progress to some data store. This example has the SQS visibility timeout on the SQS queue set to nine minutes, so SQS re-queues any task that doesn’t complete in that time.
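
One hedged way to watch for the two-minute notice from inside the consumer is to poll the instance metadata endpoint, which returns 404 until an interruption is scheduled. The checkpointing itself is left out here as a sketch:

import requests

SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending():
    """Return True if EC2 has scheduled a Spot interruption for this instance."""
    try:
        # The endpoint returns 404 until an interruption notice is issued
        return requests.get(SPOT_ACTION_URL, timeout=1).status_code == 200
    except requests.RequestException:
        return False

A consumer could call this between tasks and, when it returns True, checkpoint and stop pulling new work; SQS re-queues any unfinished task once the nine-minute visibility timeout expires.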

To test the impact of the new configuration another 6000 tasks are submitted into the SQS queue. The Auto Scaling group quickly provisions 20 On-Demand and 60 Spot Instances.

The instances then quickly set to work on the queue.

The batch completes in approximately 30 minutes, which is a significant improvement. This result is due to the additional Spot Instance capacity, which gave a total of 2,140 vCPUs.

The batch used the following instances for 30 minutes.

 

Instance Type | Provisioning | Host Count | Hourly Instance Cost | Total 30-minute batch cost
c5.18xlarge   | Spot         | 15         | $1.2367              | $9.2753
c5.2xlarge    | Spot         | 22         | $0.1547              | $1.7017
c5.4xlarge    | Spot         | 12         | $0.2772              | $1.6632
c5.9xlarge    | Spot         | 11         | $0.6239              | $3.4315
c5.2xlarge    | On-Demand    | 13         | $0.3840              | $2.4960
c5.4xlarge    | On-Demand    | 3          | $0.7680              | $1.1520
c5.9xlarge    | On-Demand    | 4          | $1.7280              | $3.4560

The total cost is $23.18, which is approximately 60 percent of the On-Demand cost and allows you to compute the batch 10 times faster. This example also shows no interruptions to the Spot Instances.

 

Summary

This post demonstrated that by combining On-Demand and Spot Instances you can improve the performance of a loosely coupled HPC batch workload without compromising on the predictability of runtime. This approach balances reliability with improved performance while reducing costs. The use of Auto Scaling groups and CloudWatch alarms makes the solution largely automated, responding to demand and provisioning and removing capacity as required.

Post-quantum TLS now supported in AWS KMS

Post Syndicated from Andrew Hopkins original https://aws.amazon.com/blogs/security/post-quantum-tls-now-supported-in-aws-kms/

AWS Key Management Service (AWS KMS) now supports post-quantum hybrid key exchange for the Transport Layer Security (TLS) network encryption protocol that is used when connecting to KMS API endpoints. In this post, I’ll tell you what post-quantum TLS is, what hybrid key exchange is, why it’s important, how to take advantage of this new feature, and how to give us feedback.

What is post-quantum TLS?

Post-quantum TLS is a feature that adds new, post-quantum cipher suites to the protocol. AWS implements TLS using s2n, a streamlined open source implementation of TLS. In June, 2019, AWS introduced post-quantum s2n, which implements two proposed post-quantum hybrid cipher suites specified in this IETF draft. The cipher suites specify a key exchange that provides the security protections of both the classical and post-quantum schemes.

Why is this important?

A large-scale quantum computer would break the current public key cryptography that is used for key exchange in every TLS connection. While a large-scale quantum computer is not available today, it’s still important to think about and plan for your long-term security needs. TLS traffic recorded today could be decrypted by a large-scale quantum computer in the future. If you’re developing applications that rely on the long-term confidentiality of data passed over a TLS connection, you should consider a plan to migrate to post-quantum cryptography before a large-scale quantum computer is available for use by potential adversaries. AWS is working to prepare for this future, and we want you to be well-prepared, too.

We’re offering this feature now instead of waiting so you’ll have a way to measure the potential performance impact to your applications, and you’ll have the additional benefit of the protection afforded by the proposed post-quantum schemes today. While we believe the use of this feature raises the already high security bar for connecting to KMS endpoints, these new cipher suites will have an impact on bandwidth utilization, latency, and could also create issues for intermediate systems that proxy TLS connections. We’d like to get feedback from you on the effectiveness of our implementation so we can improve it over time.

Some background on post-quantum TLS

Today, all requests to AWS KMS use TLS with one of two key exchange schemes:

  • FFDHE (Finite Field Diffie-Hellman Ephemeral)
  • ECDHE (Elliptic Curve Diffie-Hellman Ephemeral)

FFDHE and ECDHE are industry standards for secure key exchange. KMS uses only ephemeral keys for TLS key negotiation; this ensures every connection uses a unique key and the compromise of one connection does not affect the security of another connection. They are secure today against known cryptanalysis techniques that use classical computers; however, they’re not secure against known attacks that use a large-scale quantum computer. In the future a sufficiently capable large-scale quantum computer could run Shor’s Algorithm to recover the TLS session key of a recorded session, and therefore gain access to the data inside. Protecting against a large-scale quantum computer requires using a post-quantum key exchange algorithm during the TLS handshake.

The possibility of large-scale quantum computing has spurred the development of new quantum-resistant cryptographic algorithms. The National Institute of Standards and Technology (NIST) has started the process of standardizing post-quantum cryptographic algorithms. AWS contributed to two NIST submissions:

  • BIKE (Bit Flipping Key Encapsulation)
  • SIKE (Supersingular Isogeny Key Encapsulation)

BIKE and SIKE are Key Encapsulation Mechanisms (KEMs); a KEM is a type of key exchange used to establish a shared symmetric key. Post-quantum s2n only uses ephemeral BIKE and SIKE keys.

The NIST standardization process isn’t expected to complete until 2024. Until then, there is a risk that the exclusive use of proposed algorithms like BIKE and SIKE could expose data in TLS connections to security vulnerabilities not yet discovered. To mitigate this risk and use these new post-quantum schemes safely today, we need a way to combine classical algorithms with the expected post-quantum security of the new algorithms submitted to NIST. The Hybrid Post-Quantum Key Encapsulation Methods for Transport Layer Security 1.2 IETF draft describes how to combine BIKE and SIKE with ECDHE to create two new cipher suites for TLS.

These two cipher suites use a hybrid key exchange that performs two independent key exchanges during the TLS handshake and then cryptographically combines the keys into a single TLS session key. This strategy combines the high assurance of a classical key exchange with the security of the proposed post-quantum key exchanges.

The effect of hybrid post-quantum TLS on performance

Post-quantum cipher suites have a different performance profile and bandwidth requirements than traditional cipher suites. We measured the latency and bandwidth for a single handshake on an EC2 c5.2xlarge instance. This provides a baseline for what to expect when you connect to KMS with the SDK. Your exact results will depend on your hardware (CPU speed and number of cores), existing workloads (how often you call KMS and what other work your application performs), and your network (location and capacity).

BIKE and SIKE have different performance tradeoffs: BIKE has faster computations and large keys, and SIKE has slower computations and smaller keys. The tables below show the results of the AWS measurements. ECDHE, a classic cryptographic key exchange algorithm, is included by itself for comparison.

Table 1
TLS Message       | ECDHE (bytes) | ECDHE w/ BIKE (bytes) | ECDHE w/ SIKE (bytes)
ClientHello       | 139           | 147                   | 147
ServerKeyExchange | 329           | 2,875                 | 711
ClientKeyExchange | 66            | 2,610                 | 470

Table 1 shows the amount of data (in bytes) sent in each TLS message. The ClientHello message is larger for post-quantum cipher suites because they include a new ClientHello extension. The key exchange messages are larger because they include BIKE or SIKE messages.

Table 2
Item                   | ECDHE (ms) | ECDHE w/ BIKE (ms) | ECDHE w/ SIKE (ms)
Server processing time | 0.11       | 20.26              | 95.53
Client processing time | 0.10       | 0.39               | 57.05
Total handshake time   | 1.19       | 25.58              | 155.08

Table 2 shows the time (in milliseconds) a client and server in the same region take to complete a handshake. Server processing time includes: key generation, signing the server key exchange message, and processing the client key exchange message. The client processing time includes: verifying the server’s certificate, processing the server key exchange message, and generating the client key exchange message. The total time was measured on the client from the start of the handshake to the end and includes network transfer time. All connections used RSA authentication with a 2048-bit key, and ECDHE used the secp256r1 curve. The BIKE test used the BIKE-1 Level 1 parameter and the SIKE test used the SIKEp503 parameter.

A TLS handshake is only performed once to setup a new connection. The SDK will reuse connections for multiple KMS requests when possible. This means that you don’t want to include measurements of subsequent round-trips under an existing TLS session, otherwise you will skew your performance data.

How to use hybrid post-quantum cipher suites

Note: The “AWS CRT HTTP Client” in the aws-crt-dev-preview branch of the aws-sdk-java-v2 repository is a beta release. This beta release and your use are subject to Section 1.10 (“Beta Service Participation”) of the AWS Service Terms.

To use the post-quantum cipher suites with AWS KMS, you’ll need the Developer Previews of the Java SDK 2.0 and the AWS Common Runtime. You’ll need to configure the AWS Common Runtime HTTP client to use s2n’s post-quantum hybrid cipher suites, and configure the AWS Java SDK 2.0 to use that HTTP client. This client can then be used when connecting to any KMS endpoints, but only those endpoints that are not using FIPS 140-2 validated crypto for the TLS termination. For example, kms.<region>.amazonaws.com supports the use of post-quantum cipher suites, while kms-fips.<region>.amazonaws.com does not.

To see a complete example of everything set up, check out the example application here.
 


Figure 1: GitHub and package layout

Figure 1 shows the GitHub and package layout. The steps below will walk you through building and configuring the SDK.

  1. Download the Java SDK v2 Common Runtime Developer Preview:
    
    $ git clone git@github.com:aws/aws-sdk-java-v2.git --branch aws-crt-dev-preview
    $ cd aws-sdk-java-v2
    

  2. Build the aws-crt-client JAR:
    
    $ mvn install -Pquick
    

  3. In your project add the AWS Common Runtime client to your Maven Dependencies:
    
    <dependency>
        <groupId>software.amazon.awssdk</groupId>
        <artifactId>aws-crt-client</artifactId>
        <version>2.10.7-SNAPSHOT</version>
    </dependency>
    

  4. Configure the new SDK and cipher suite in your application’s existing initialization code:
    
    if(!TLS_CIPHER_KMS_PQ_TLSv1_0_2019_06.isSupported()){
        throw new RuntimeException("Post Quantum Ciphers not supported on this Platform");
    }
    SdkAsyncHttpClient awsCrtHttpClient = AwsCrtAsyncHttpClient.builder()
              .tlsCipherPreference(TLS_CIPHER_KMS_PQ_TLSv1_0_2019_06)
              .build();
    KmsAsyncClient kms = KmsAsyncClient.builder()
             .httpClient(awsCrtHttpClient)
             .build();
    ListKeysResponse response = kms.listKeys().get();
    

Now, all connections made to AWS KMS in supported regions will use the new hybrid post-quantum cipher suites.

Things to try

Here are some ideas about how to use this post-quantum-enabled client:

  • Run load tests and benchmarks. These new cipher suites perform differently than traditional key exchange algorithms. You might need to adjust your connection timeouts to allow for the longer handshake times or, if you’re running inside an AWS Lambda function, extend the execution timeout setting.
  • Try connecting from different locations. Depending on the network path your request takes, you might discover that intermediate hosts, proxies, or firewalls with deep packet inspection (DPI) block the request. This could be due to the new cipher suites in the ClientHello or the larger key exchange messages. If this is the case, you might need to work with your Security team or IT administrators to update the relevant configuration to unblock the new TLS cipher suites. We’d like to hear from you about how your infrastructure interacts with this new variant of TLS traffic.

More info

If you’re interested in learning more about post-quantum cryptography, check out:

Conclusion

In this blog post, I introduced you to the topic of post-quantum security and covered what AWS and NIST are doing to address the issue. I also showed you how to begin experimenting with hybrid post-quantum key exchange algorithms for TLS when connecting to KMS endpoints.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about how to configure the HTTP client or its interaction with KMS endpoints, please start a new thread on the AWS KMS discussion forum.

Migrating Azure VM to AWS using AWS SMS Connector for Azure

Post Syndicated from Emma White original https://aws.amazon.com/blogs/compute/migrating-azure-vm-to-aws-using-aws-sms-connector-for-azure/

AWS SMS is an agentless service that facilitates and expedites the migration of your existing workloads to AWS. The service enables you to automate, schedule, and monitor incremental replications of active server volumes, which facilitates large-scale server migration coordination. Previously, you could only migrate virtual machines (VMs) running in VMware vSphere and Microsoft Hyper-V environments. Now, you can also use the simplicity and ease of AWS Server Migration Service (SMS) to migrate virtual machines running on Microsoft Azure. You can discover Azure VMs, group them into applications, and migrate a group of applications as a single unit without having to go through the hassle of coordinating the replication of the individual servers or decoupling application dependencies. SMS significantly reduces application migration time, as well as decreases the risk of errors in the migration process.

 

This post takes you step-by-step through how to provision the SMS virtual machine on Microsoft Azure, discover the virtual machines in a Microsoft Azure subscription, create a replication job, and finally launch the instance on AWS.

 

1- Provisioning the SMS virtual machine

To provision your SMS virtual machine on Microsoft Azure, complete the following steps.

  1. Download the three files listed under Step 1 of Installing the Server Migration Connector on Azure.
File                | URL
Installation script | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1
MD5 hash            | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1.md5
SHA256 hash         | https://s3.amazonaws.com/sms-connector/aws-sms-azure-setup.ps1.sha256

 

  2. To validate the integrity of the files, compare their checksums. You can use PowerShell 5.1 or newer.

 

2.1 To validate the MD5 hash of the aws-sms-azure-setup.ps1 script, run the following command and wait for an output similar to the following result:

Command to validate the MD5 hash of the aws-sms-azure-setup.ps1 script

2.2 To validate the SHA256 hash of the aws-sms-azure-setup.ps1 file, run the following command and wait for an output similar to the following result:

Command to validate the SHA256 hash of the aws-sms-azure-setup.ps1 file

2.3 Compare the returned values ​​by opening the aws-sms-azure-setup.ps1.md5 and aws-sms-azure-setup.ps1.sha256 files in your preferred text editor.

2.4 To validate if the PowerShell script has a valid Amazon Web Services signature, run the following command and wait for an output similar to the following result:

Command to validate whether the PowerShell script has a valid Amazon Web Services signature
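
The console output shown in the original screenshots isn’t reproduced here. As a sketch (assuming the downloaded files are in the current directory; these may not be the exact commands shown in the screenshots), the checks in steps 2.1 through 2.4 can be run with built-in PowerShell cmdlets:

# 2.1 / 2.2: compute the MD5 and SHA256 hashes of the installation script
Get-FileHash -Algorithm MD5 .\aws-sms-azure-setup.ps1
Get-FileHash -Algorithm SHA256 .\aws-sms-azure-setup.ps1

# 2.3: display the published values for comparison
Get-Content .\aws-sms-azure-setup.ps1.md5
Get-Content .\aws-sms-azure-setup.ps1.sha256

# 2.4: confirm the script carries a valid Amazon Web Services Authenticode signature
Get-AuthenticodeSignature .\aws-sms-azure-setup.ps1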

 

  3. Before running the script for provisioning the SMS virtual machine, you must have an Azure Virtual Network and an Azure Storage Account in which the connector temporarily stores metadata for the tasks that SMS performs against the Microsoft Azure subscription. A good recommendation is to use the same Azure Virtual Network as the Azure virtual machines being migrated, since the SMS virtual machine communicates with AWS endpoints and the Azure cloud services over REST APIs. It is not necessary for the SMS virtual machine to have a public IP or internet inbound rules.

 

  4. Run the installation script: .\aws-sms-azure-setup.ps1

Screenshot of running the installation script

  5. Enter the names of the existing Storage Account and Azure Virtual Network in the subscription:

Screenshot of where to enter Storage Account Name and Azure Virtual Network

  6. The Microsoft Azure modules are imported into the local PowerShell session, and you receive a prompt for credentials to access the subscription.

Azure login credentials

  7. A summary of the created resources appears, similar to the following:

Screenshot of created features

  8. Wait for the process to complete. It may take a few minutes:

screenshot of processing jobs

  9. After provisioning completes, an output appears containing the Object Id of the System Assigned Identity and the Private IP. Save this information, as it is used to register the connector with the SMS service in step 23.

Screenshot of the information to save

  10. To check the provisioned resources, log in to the Microsoft Azure Portal and select the Resource Group option. The provided AWS script created a role in Microsoft Azure IAM that allows the virtual machine to use the necessary services through REST APIs over HTTPS and to be authenticated via the Azure built-in Instance Metadata Service (IMDS).

Screenshot of provisioned resources log in Microsoft Azure Portal

  11. As a requirement, you need to create an IAM user that contains the necessary permissions for the SMS service to perform the migration. To do this, log in to your AWS account at https://aws.amazon.com/console, and under Services select IAM. Then select Users, and click Add user.

Screenshot of AWS console. add user

 

  12. On the Add user page, enter a user name and check the Programmatic access option. Click Next: Permissions.

Screenshot of adding a username

  13. Attach the existing policy named ServerMigrationConnector. This policy allows the AWS connector to connect and execute API requests against AWS. Click Next: Tags.

Adding policy ServerMigrationConnector

  14. Optionally add tags to the user. Click Next: Review.

Screenshot of option to add tags to the user

  15. Click Create user and save the Access Key and Secret Access Key. This information is used during the AWS SMS Connector setup.

Create User and save the access key and secret access key

 

  16. From a computer that has access to the Azure Virtual Network, access the SMS virtual machine configuration using a browser and the previously recorded private IP from the output of the script. In this example, the URL is https://10.0.0.4.

Screenshot of accessing the SMS Virtual Machine configuration

  17. On the main page of the SMS virtual machine, click Get Started Now.

Screenshot of the SMS virtual machine start page

  18. Read and accept the terms of the contract, then click Next.

Screenshot of accepting terms of contract

  19. Create a password that will be used later to log in to the connector management console and click Next.

Screenshot of creating a password

  20. Review the Network Info and click Next.

Screenshot of reviewing the network info

  21. Choose whether you would like to opt in to sending anonymous log data to AWS, then click Next.

Screenshot of option to add log data to AWS

  22. Insert the Access Key and Secret Access Key for an IAM user that has only the ServerMigrationConnector policy attached. Also, select the region in which the SMS endpoint will be used and click Next. This access key is the one created in steps 11 through 15.

Selet AWS Region, and Insert Access Key and Secret Key

  23. Enter the Object Id of the System Assigned Identity copied in step 9 and click Next.

Enter the Object Id of the System Assigned Identity

  24. Congratulations, you have successfully configured the Azure connector. Click Go to connector dashboard.

Screenshot of the successful configuration of the Azure connector

  25. Verify that the connector status is HEALTHY by clicking Connectors on the menu.

Screenshot of verifying that the connector status is healthy

 

2 – Replicating Azure Virtual Machines to Amazon Web Services

  1. Access the SMS console and go to the Servers option. Click Import Server Catalog or Re-Import Server Catalog if it has been previously executed.

Screenshot of SMS console and servers option

  2. Select the Azure virtual machines to be migrated and click Create Replication Job.

Screenshot of Azure virtual machines migration

  3. Select which type of licensing best suits your environment, such as:

– Auto (Current licensing autodetection)

– AWS (License Included)

– BYOL (Bring Your Own License).
See options: https://aws.amazon.com/windows/resources/licensing/

Screenshot of best type of licensing for your environment

  4. Select the appropriate replication frequency, when the replication should start, and the IAM service role. You can leave the role blank, and the SMS service will use the built-in service role “sms”.

Screenshot of replication jobs and IAM service role

  5. A summary of the settings is displayed. Click Create.
    Screenshot of the summary of settings displayed
  6. In the SMS console, go to the Replication Jobs option and follow the replication job status:

Overview of replication jobs

  7. After completion, access the EC2 console and go to AMIs; the AMIs generated by SMS appear in this list. In the example below, several AMIs were generated because the replication frequency is 1 hour.

List of AMIs generated by SMS

  8. Now navigate to the SMS console, click Launch Instance, and follow the on-screen process for creating a new Amazon EC2 instance.

SMS console and Launch Instance screenshot

 

3 – Conclusion

This solution provides a simple, agentless, non-intrusive way to migrate your workloads to AWS with the AWS Server Migration Service.

 

For more about Windows Workloads on AWS go to:  http://aws.amazon.com/windows

 

About the Author


 

 

Marcio Morales is a Senior Solution Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on running their Microsoft workloads on AWS.

Optimizing deep learning on P3 and P3dn with EFA

Post Syndicated from whiteemm original https://aws.amazon.com/blogs/compute/optimizing-deep-learning-on-p3-and-p3dn-with-efa/

This post is written by Rashika Kheria, Software Engineer, Purna Sanyal, Senior Solutions Architect, Strategic Account and James Jeun, Sr. Product Manager

The Amazon EC2 P3dn.24xlarge instance is the latest addition to the Amazon EC2 P3 instance family, with upgrades to several components. This high-end size of the P3 family allows users to scale out to multiple nodes for distributed workloads more efficiently.  With these improvements to the instance, you can complete training jobs in a shorter amount of time and iterate on your Machine Learning (ML) models faster.

 

This blog reviews the significant upgrades with p3dn.24xlarge, walks you through deployment, and shows an example ML use case for these upgrades.

 

Overview of P3dn instance upgrades

The most notable upgrades to the p3dn.24xlarge instance are the 100-Gbps network bandwidth and the new EFA network interface that allows for highly scalable internode communication. This means you can scale applications to use thousands of GPUs, which reduces time to get results. EFA’s operating-system-bypass networking mechanism and the underlying Scalable Reliable Protocol are built into the Nitro controllers. The Nitro controllers enable a low-latency, low-jitter channel for inter-instance communication. EFA support has been adopted in the mainline Linux kernel and integrated with libfabric and various distributions. AWS worked with NVIDIA for EFA to support the NVIDIA Collective Communications Library (NCCL). NCCL optimizes multi-GPU and multi-node communication primitives and helps achieve high throughput over NVLink interconnects.

 

The following diagram shows the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

the PCIe/NVLink communication topology used by the p3.16xlarge and p3dn.24xlarge instance types.

 

The following table summarizes the full set of differences between p3.16xlarge and p3dn.24xlarge.

Feature             p3.16xl                        p3dn.24xl
Processor           Intel Xeon E5-2686 v4          Intel Skylake 8175 (w/ AVX 512)
vCPUs               64                             96
GPU                 8x 16 GB NVIDIA Tesla V100     8x 32 GB NVIDIA Tesla V100
RAM                 488 GB                         768 GB
Network             25 Gbps ENA                    100 Gbps ENA + EFA
GPU Interconnect    NVLink – 300 GB/s              NVLink – 300 GB/s

 

P3dn.24xl offers more networking bandwidth than p3.16xl. Paired with EFA’s communication library, this feature increases scaling efficiencies drastically for large-scale, distributed training jobs. Other improvements include double the GPU memory for large datasets and batch sizes, increased system memory, and more vCPUs. This upgraded instance is the most performant GPU compute option on AWS.

 

The upgrades also improve your workload around distributed deep learning. The GPU memory improvement enables higher intranode batch sizes. The newer Layer-wise Adaptive Rate Scaling (LARS) has been tested with ResNet50 and other deep neural networks (DNNs) to allow for larger batch sizes. The increased batch sizes reduce wall-clock time per epoch with minimal loss of accuracy. Additionally, using 100-Gbps networking with EFA heightens performance with scale. Greater networking performance is beneficial when updating weights for a large number of parameters. You can see high scaling efficiency when running distributed training on GPUs for ResNet50 type models that primarily use images for object recognition. For more information, see Scalable multi-node deep learning training using GPUs in the AWS Cloud.

 

Natural language processing (NLP) also presents large compute requirements for model training. This large compute requirement is especially present with the arrival of large Transformer-based models like BERT and GPT-2, which have up to a billion parameters. The following describes how to set up distributed model trainings with scalability for both image and language-based models, and also notes how the AWS P3 and P3dn instances perform.

 

Optimizing your P3 family

First, optimize your P3 instances with an important environment update. This update applies to traditional TCP-based networking and is available in NCCL 2.4.8, the latest release as of this writing.

 

Two new environment variables are available, which allow you to take advantage of multiple TCP sockets per thread: NCCL_SOCKET_NTHREADS and NCCL_NSOCKS_PERTHREAD.

 

These environment variables allow the NCCL backend to exceed the 10-Gbps TCP single-stream bandwidth limitation in EC2.

 

Enter the following command:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_NSOCKS_PERTHREAD=4 -x NCCL_SOCKET_NTHREADS=4 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

 

The following graph shows the synthetic NCCL tests and their increased performance with the additional directives.

synthetic NCCL tests and their increased performance with the additional directives

You can achieve a two-fold increase in throughput after a threshold in the synthetic payload size (around 1 MB).

 

 

Deploying P3dn

 

The following steps walk you through spinning up a cluster of p3dn.24xlarge instances in a cluster placement group. This allows you to take advantage of all the new performance features within the P3 instance family. For more information, see Cluster Placement Groups in the Amazon EC2 User Guide.
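
If you script your environment, the cluster placement group can also be created ahead of time from the AWS CLI; a minimal sketch follows, where the group name p3dn-cluster is an arbitrary example:

# Create a cluster placement group for the p3dn instances
aws ec2 create-placement-group --group-name p3dn-cluster --strategy cluster
# When launching instances, reference it with --placement "GroupName=p3dn-cluster"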

This post deploys the following stack:

 

  1. On the Amazon EC2 console, create a security group.

 Make sure that both inbound and outbound traffic are open on all ports and protocols within the security group.

 

  2. Modify the user variables in the packer build script so that they are compatible with your environment.

The following is the modification code for your variables:

 

{
  "variables": {
    "Region": "us-west-2",
    "flag": "compute",
    "subnet_id": "<subnet-id>",
    "sg_id": "<security_group>",
    "build_ami": "ami-0e434a58221275ed4",
    "iam_role": "<iam_role>",
    "ssh_key_name": "<keyname>",
    "key_path": "/path/to/key.pem"
  },

  3. Build and launch the AMI by running the following packer command:

packer build nvidia-efa-fsx-al2.yml

This entire workflow takes care of setting up EFA, compiling NCCL, and installing the toolchain. After the build completes, you have an AMI ID that you can launch from the EC2 console. Make sure to enable EFA when launching.
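
As a hedged sketch, launching the resulting AMI from the CLI with an EFA interface attached might look like the following; the AMI ID, key name, placement group, subnet, and security group are placeholders:

# Launch a p3dn.24xlarge from the freshly built AMI with an EFA network interface attached
aws ec2 run-instances \
  --image-id <ami-id-from-packer-build> \
  --instance-type p3dn.24xlarge \
  --key-name <keyname> \
  --placement "GroupName=<cluster-placement-group>" \
  --network-interfaces "DeviceIndex=0,InterfaceType=efa,SubnetId=<subnet-id>,Groups=<security-group-id>"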

  4. Launch a second instance in a cluster placement group so you can run two-node tests.
  5. Enter the following code to make sure that all components are built correctly:

/opt/nccl-tests/build/all_reduce_perf 

  6. The following output of the command confirms that the build is using EFA:

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

INFO: Function: main Line: 49: NET/OFI Process rank 8 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.

INFO: Function: main Line: 53: NET/OFI Received 1 network devices

INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa

 

Synthetic two-node performance

This blog includes the NCCL-tests GitHub as part of the deployment stack. This shows synthetic benchmarking of the communication layer over NCCL and the EFA network.

When launching the two-node cluster, complete the following steps:

  1. Place the instances in the cluster placement group.
  2. SSH into one of the nodes.
  3. Fill out the hosts file (a minimal example follows below).
  4. Run the two-node test with the following code:

/opt/openmpi/bin/mpirun -n 16 -N 8 --hostfile hosts -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x FI_PROVIDER="efa" -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_SOCKET_IFNAME=eth0 --mca btl_tcp_if_exclude lo,docker0 /opt/nccl-tests/build/all_reduce_perf -b 16 -e 8192M -f 2 -g 1 -c 1 -n 100

This test makes sure that the node performance works the way it is supposed to.
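
For reference, the hosts file from step 3 is a standard Open MPI host file; a minimal sketch for two p3dn.24xlarge nodes, with placeholder private DNS names, looks like this:

# hosts: one entry per node, with 8 slots to match the 8 GPUs per p3dn.24xlarge
ip-172-31-10-11 slots=8
ip-172-31-10-12 slots=8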

The following graph compares the NCCL bandwidth performance using -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp". There is a three-fold increase in bus bandwidth when using EFA.

 

Graph of NCCL bus bandwidth: -x FI_PROVIDER="efa" vs. -x FI_PROVIDER="tcp"

Now that you have run the two node tests, you can move on to a deep learning use case.

FAIRSEQ ML training on a P3dn cluster

Fairseq(-py) is a sequence modeling toolkit that allows you to train custom models for translation, summarization, language modeling, and other text-generation tasks. FAIRSEQ MACHINE TRANSLATION distributed training requires a fast network to support the Allreduce algorithm. Fairseq provides reference implementations of various sequence-to-sequence models, including convolutional neural networks (CNN), long short-term memory (LSTM) networks, and transformer (self-attention) networks.

 

After you receive consistent 10 GB/s bus-bandwidth on the new P3dn instance, you are ready for FAIRSEQ distributed training.

To install fairseq from source and develop locally, complete the following steps:

  1. Copy the FAIRSEQ source code to one of the P3dn instances.
  2. Copy the FAIRSEQ training data into the data folder.
  3. Copy the FAIRSEQ test data into the data folder.

 

git clone https://github.com/pytorch/fairseq

cd fairseq

pip install --editable .

Now that you have FAIRSEQ installed, you can run the training model. Complete the following steps:

  1. Run FAIRSEQ training on a 1-node/8-GPU p3dn instance to check the performance and the accuracy of FAIRSEQ operations.
  2. Create a custom AMI.
  3. Build the other 31 instances from the custom AMI.

 

Use the following script for distributed All-Reduce FAIRSEQ training:

 

export RANK=$1 # the rank of this process, from 0 to 127 in case of 128 GPUs
export LOCAL_RANK=$2 # the local rank of this process, from 0 to 7 in case of 8 GPUs per machine
export NCCL_DEBUG=INFO
export NCCL_TREE_THRESHOLD=0;
export FI_PROVIDER="efa";
export FI_EFA_TX_MIN_CREDITS=64;
export LD_LIBRARY_PATH=/opt/amazon/efa/lib64/:/home/ec2-user/aws-ofi-nccl/install/lib/:/home/ec2-user/nccl/build/lib:$LD_LIBRARY_PATH;
echo $FI_PROVIDER
echo $LD_LIBRARY_PATH
python train.py data-bin/wmt18_en_de_bpej32k \
   --clip-norm 0.0 -a transformer_vaswani_wmt_en_de_big \
   --lr 0.0005 --source-lang en --target-lang de \
   --label-smoothing 0.1 --upsample-primary 16 \
   --attention-dropout 0.1 --dropout 0.3 --max-tokens 3584 \
   --log-interval 100  --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy --fp16 \
   --max-update 500000 --seed 3 --save-interval-updates 16000 \
   --share-all-embeddings --optimizer adam --adam-betas '(0.9, 0.98)' \
   --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
   --warmup-updates 4000 --min-lr 1e-09 \
   --distributed-port 12597 --distributed-world-size 32 \
   --distributed-init-method 'tcp://172.31.43.34:9218' --distributed-rank $RANK \
   --device-id $LOCAL_RANK \
   --max-epoch 3 \
   --no-progress-bar  --no-save

Now that you have completed and validated your base infrastructure layer, you can add additional components to the stack for various workflows. The following charts show time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training.

time-to-train improvement factors when scaling out to multiple GPUs for FAIRSEQ model training

 

Conclusion

EFA on p3dn.24xlarge allows you to take advantage of additional performance at scale with no change in code. With this updated infrastructure, you can decrease cost and time to results by using more GPUs to scale out and get more done on complex workloads like natural language processing. This blog provides much of the undifferentiated heavy lifting with the DLAMI integrated with EFA. Go power up your ML workloads with EFA!

 

Optimizing for cost, availability and throughput by selecting your AWS Batch allocation strategy

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/optimizing-for-cost-availability-and-throughput-by-selecting-your-aws-batch-allocation-strategy/

This post is contributed by Steve Kendrex, Senior Technical Product Manager, AWS Batch

 

Introduction

 

AWS offers a broad range of instances that are advantageous for batch workloads. The scale and provisioning speed of AWS’ compute instances allow you to get up and running at peak capacity in minutes without paying for downtime. Today, I’m pleased to introduce allocation strategies: a significant new capability in AWS Batch that  makes provisioning compute resources flexible and simple. In this blog post, I explain how the AWS Batch allocation strategies work, when you should use them for your workload, and provide an example CloudFormation script. This blog helps you get started on building your personalized Compute Environment (CE) most appropriate to your workloads.

Overview

AWS Batch is a fully managed, cloud-native batch scheduler. It manages the queuing and scheduling of your batch jobs, and the resources required to run your jobs. One of AWS Batch’s great strengths is the ability to manage instance provisioning as your workload requirements and budget needs change. AWS Batch takes advantage of AWS’s broad base of compute types. For example, you can launch compute-optimized and memory-optimized instances that can handle different workload types, without having to worry about building a cluster to meet peak demand.

Previously, AWS Batch had a cost-controlling approach to manage compute instances for your workloads. The service chose an instance that was the best fit for your jobs based on vCPU, memory, and GPU requirements, at the lowest cost. Now, the newly added allocation strategies provide flexibility. They allow AWS Batch to consider capacity and throughput in addition to cost when provisioning your instances. This allows you to leverage different priorities when launching instances depending on your workloads’ needs, such as: controlling cost, maximizing throughput, or minimizing Amazon EC2 Spot instances interruption rates.

There are now three instance allocation strategies from which to choose when creating an AWS Batch Compute Environment (CE). They are:

1. Spot Capacity Optimized

2. Best Fit Progressive

3. Best Fit

 

Spot Capacity Optimized

As the name implies, the Spot capacity optimized strategy is only available when launching Spot CEs in AWS Batch. In fact, I recommend the Spot capacity optimized strategy for most of your interruptible workloads running on Spot today. This strategy takes advantage of the recently released EC2 Auto Scaling and EC2 Fleet capacity optimized strategy. Next, I examine how this strategy behaves in AWS Batch.

Let’s say you’re running a simulation workload in AWS Batch. Your workload is Spot-appropriate (see this whitepaper to determine whether yours is), so you want to take advantage of the savings you can glean from using Spot. However, you also want to minimize your Spot interruption rate, so you’ve followed the Spot best practices. Your instances can run across multiple instance types and multiple Availability Zones. When creating your Spot CE in AWS Batch, input all the instance types with which your workload is compatible in the instance field, or select ‘optimal’, which allows AWS Batch to choose from among the M, C, or R instance families. The image below shows how this appears in the console:

AWS Batch console with SPOT_CAPACITY_OPTIMIZED selected


When evaluating your workload, AWS Batch selects from the allowed instance types listed in your Spot CE’s compute resources parameter that are capable of running your jobs. From those capable instance types, AWS Batch calculates the assortment with the most Spot capacity. AWS Batch then launches those instances on your behalf, and runs your jobs when those instances are available. This strategy gives you access to all AWS compute resources at a fraction of On-Demand cost. The Spot capacity optimized strategy works whether you’re trying to launch hundreds of thousands (or a million!) of vCPUs in Spot, or simply trying to lower your chance of interruption. Additionally, AWS Batch manages your instance pool to meet the capacity needed to run your workload as time passes.

For example, as your workloads run, demand in an Availability Zone may shift. This might lead to several of your instances being reclaimed. In that event, AWS Batch automatically attempts to scale a different instance type based on the deepest capacity pools. Assuming you set a retry attempt count, your jobs then automatically retry. Then, AWS Batch scales new instances until either it meets the desired capacity, or it runs out of instance types to launch based on those provided.  That is why I recommend that you give AWS Batch as many instance types and families as possible to choose from when running Spot capacity optimized. Additional detail on behavior can be found in the capacity optimized documentation.
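
As a hedged illustration of the retry setting mentioned above, a retry count can be supplied when submitting jobs; the job, queue, and job definition names below are placeholders:

# Ask AWS Batch to retry each job up to 3 times if its instance is reclaimed
aws batch submit-job \
  --job-name spot-simulation-job \
  --job-queue <your-job-queue> \
  --job-definition <your-job-definition> \
  --retry-strategy attempts=3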

To launch a Spot capacity optimized CE, follow these steps:

1. Navigate to the AWS Batch console.

2. Create a new Compute Environment.

3. Select “Spot Capacity Optimized” in the Allocation Strategy field.

4. Alternatively, you can use the CreateComputeEnvironment API; in the AllocationStrategy field, pass in “SPOT_CAPACITY_OPTIMIZED”. This command should look like the following:

…
"TestAllocationStrategyCE": {
  "Type": "AWS::Batch::ComputeEnvironment",
  "Properties": {
    "State": "ENABLED",
    "Type": "MANAGED",
    "ComputeResources": {
      "Subnets": [
        {"Ref": "TestSubnet"}
      ],
      "InstanceRole": {
        "Ref": "TestIamInstanceProfile"
      },
      "MinvCpus": 0,
      "InstanceTypes": [
        "optimal"
      ],
      "SecurityGroupIds": [
        {"Ref": "TestSecurityGroup"}
      ],
      "DesiredvCpus": 0,
      "MaxvCpus": 12,
      "AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
      "Type": "SPOT"
    },
    "ServiceRole": {
      "Ref": "TestAWSBatchServiceRole"
    }
  }
},
…

Once you follow these steps your Spot capacity optimized CE should be up and running.

 

Best Fit Progressive

Imagine you have a time-sensitive machine learning workload that is very compute intensive. You want to run this workload on C5 instances because you know that those have a preferable vCPU/memory ratio for your jobs. In a pinch, however, you know that M5 instances can run your workload perfectly well. You’re happy to take advantage of Spot prices. However, you also have a base level of throughput you need so you have to run part of the workload on On-Demand instances.  In this case, I recommend the best fit progressive strategy. This strategy is available in both On-Demand and Spot CEs, and I recommend it for most On-Demand workloads. The best fit progressive strategy allows you to let AWS Batch choose the best fit instance for your workload (based on your jobs’ vCPU and memory requirements). In this context, “best fit” means AWS Batch provisions the least number of instances capable of running your jobs at the lowest cost.

Sometimes, AWS Batch cannot provision enough of the best fit instances to meet your capacity. When this is the case, AWS Batch progressively looks for the next best fit instance type from what you specified in the compute resources parameter. Generally, AWS Batch attempts to spin up different instance sizes within the same family first, because it has already determined the vCPU and memory ratio that fits your workload. If it still cannot find enough instances that can run your jobs to meet your capacity, AWS Batch launches instances from a different family. These attempts continue until capacity is met, or until it runs out of available instance types from which to select.

To create a best fit progressive CE, follow the steps detailed in the Spot capacity optimized strategy section. However, specify the strategy BEST_FIT_PROGRESSIVE when creating a CE, for example:


…{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT_PROGRESSIVE",
"Type": "EC2" },
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important note: you can always restrict AWS Batch’s ability to launch instances by using the max vCPU setting in your CE. AWS Batch may go above Max vCPU to meet your capacity requirements for best fit progressive and Spot capacity optimized strategies. In this event, AWS Batch will never go above Max vCPU by more than a single instance (for example, no more than a single instance from among those specified in your CE compute resources parameter).

 

How to Combine Strategies

You can combine strategies using separate AWS Batch Compute Environments. Let’s take the case I mentioned earlier: you’re happy to take advantage of Spot prices, but you want a base level of throughput for your time-sensitive workloads.

This diagram shows an On-Demand CE with a secondary Spot CE, attached to the same queue


 

In this case, you can create two AWS Batch CEs:

1. Create an On-Demand CE that uses the best fit progressive strategy.

2. Set the max vCPU at the level of throughput that is necessary for your workload.

3. Create a Spot CE using the Spot capacity optimized strategy.

4. Attach both CEs to your job queue, with the On-Demand CE higher in order. Once you start submitting jobs to your queue, AWS Batch spins up your On-Demand CE first and starts placing jobs.

If AWS Batch hits the max vCPU limit of the On-Demand CE, it spins up instances in the next CE. In this case, the next CE is your Spot CE, and AWS Batch places any additional jobs on this CE. AWS Batch continues to place jobs on both CEs until the queue is empty.
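
A hedged sketch of that job queue wiring, expressed as CreateJobQueue input with the On-Demand CE first in order (the queue and CE names are placeholders):

{
  "jobQueueName": "MixedStrategyQueue",
  "state": "ENABLED",
  "priority": 10,
  "computeEnvironmentOrder": [
    { "order": 1, "computeEnvironment": "OnDemandBestFitProgressiveCE" },
    { "order": 2, "computeEnvironment": "SpotCapacityOptimizedCE" }
  ]
}

You would then pass this file to aws batch create-job-queue --cli-input-json file://job_queue.json.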

Please see this repository for sample CloudFormation code to replicate this environment. Or, click here for more examples of leveraging Spot with AWS Batch.

 

Best Fit

Imagine you have a well-defined genomics sequencing workload. You know that this workload performs best on M5 instances, and you run this workload On-Demand because it is not interruptible. You’ve run this workload on AWS Batch before and you’re happy with its current behavior. You’re willing to trade off occasional capacity constraints in return for the knowledge that you’re strictly controlling cost. In this case, the best fit strategy may be a good option. This strategy used to be AWS Batch’s only behavior. It examines the queue and picks the best fit instance type and size for the workload. As described earlier, best fit to AWS Batch means the least number of instances capable of running the workload, at the lowest cost. In general, we recommend the best fit strategy only when you want the lowest cost for your instances, and you’re willing to trade throughput and availability for that cost.

Note: AWS Batch will not launch instances above Max vCPU while using the best fit strategy. To launch a best fit CE, use a configuration similar to the following:

…{
  "Ref": "TestIamInstanceProfile"
},
"MinvCpus": 0,
"InstanceTypes": [
  "optimal"
],
"SecurityGroupIds": [
  {"Ref": "TestSecurityGroup"}
],
"DesiredvCpus": 0,
"MaxvCpus": 12,
"AllocationStrategy": "BEST_FIT",
"Type": "EC2" },
"ServiceRole": {
  "Ref": "TestAWSBatchServiceRole"
}
…

Important Note for AWS Batch Allocation Strategies with Spot Instances:

You always have the option to set a percentage of the On-Demand price when creating a Spot CE. When you set a percentage of the On-Demand price, AWS Batch only launches instances whose current Spot price is below that percentage of the On-Demand price. In general, setting a percentage of the On-Demand price lowers your availability, and should only be used if you want strict cost controls. If you want to enjoy the cost savings of Spot with better availability, I recommend that you do not set a percentage of the On-Demand price.
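
If you do choose to set a cap, it is expressed through the BidPercentage field of the CE's compute resources; a hedged CloudFormation fragment follows, where the 60 percent value is only an example:

"ComputeResources": {
  "Type": "SPOT",
  "AllocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
  "BidPercentage": 60,
  …
},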

Conclusion

With these new allocation strategies, you now have much greater flexibility to control how AWS Batch provisions your instances. This allows you to make better throughput and cost trade-offs depending on the sensitivity of your workload. To learn more about how these strategies behave, please visit the AWS Batch documentation. Feel free to experiment with AWS Batch on your own to get an idea of how they help you run your specific workload.

 

Thanks to Chad Schmutzer for his support on the CloudFormation template.

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

Post Syndicated from Bala Thekkedath original https://aws.amazon.com/blogs/compute/leveraging-efa-to-run-hpc-and-ml-workloads-on-aws-batch/

Leveraging Elastic Fabric Adapter to run HPC and ML Workloads on AWS Batch

 This post is contributed by  Sean Smith, Software Development Engineer II, AWS ParallelCluster & Arya Hezarkhani, Software Development Engineer II, AWS Batch and HPC

 

On August 2, 2019, AWS Batch announced support for Elastic Fabric Adapter (EFA). This enables you to run highly performant, distributed high performance computing (HPC) and machine learning (ML) workloads by using AWS Batch’s managed resource provisioning and job scheduling.

EFA is a network interface for Amazon EC2 instances that enables you to run applications requiring high levels of inter-node communication at scale on AWS. Its custom-built operating system bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, HPC applications using the Message Passing Interface (MPI) and ML applications using the NVIDIA Collective Communications Library (NCCL) can scale to thousands of cores or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS Cloud.

AWS Batch is a cloud-native batch scheduler that manages instance provisioning and job scheduling. AWS Batch automatically provisions instances according to job specifications, with the appropriate placement group, networking configurations, and any user-specified file system. It also automatically sets up the EFA interconnect to the instances it launches, which you specify through a single launch template parameter.

In this post, we walk through the setup of EFA on AWS Batch and run the NAS Parallel Benchmark (NPB), a benchmark suite that evaluates the performance of parallel supercomputers, using the open source implementation of MPI, OpenMPI.

 

Prerequisites

This walk-through assumes:

 

Configuring your compute environment

First, configure your compute environment to launch instances with the EFA device.

Creating an EC2 placement group

The first step is to create a cluster placement group. This is a logical grouping of instances within a single Availability Zone. The chief benefit of a cluster placement group is non-blocking, non-oversubscribed, fully bi-sectional network connectivity. Use a Region that supports EFA—currently, that is us-east-1, us-east-2, us-west-2, and eu-west-1. Run the following command:

$ aws ec2 create-placement-group --group-name "efa" --strategy "cluster" --region [your-region]

Creating an EC2 launch template

Next, create a launch template that contains a user-data script to install EFA libraries onto the instance. Launch templates enable you to store launch parameters so that you do not have to specify them every time you launch an instance. This will be the launch template used by AWS Batch to scale the necessary compute resources in your AWS Batch Compute Environment.

First, encode the user data into base64-encoding. This example uses the CLI utility base64 to do so.

 

$ echo "MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=\"==MYBOUNDARY==\"

--==MYBOUNDARY==
Content-Type: text/cloud-boothook; charset=\"us-ascii\"

cloud-init-per once yum_wget yum install -y wget

cloud-init-per once wget_efa wget -q --timeout=20 https://s3-us-west-2.amazonaws.com/

cloud-init-per once tar_efa tar -xf /tmp/aws-efa-installer-latest.tar.gz -C /tmp

pushd /tmp/aws-efa-installer
cloud-init-per once install_efa ./efa_installer.sh -y
popd

cloud-init-per once efa_info /opt/amazon/efa/bin/fi_info -p efa

--==MYBOUNDARY==--" | base64

 

Save the base64-encoded output, because you need it to create the launch template.

 

Next, make sure that your default security group is configured correctly. On the EC2 console, select the default security group associated with your default VPC and edit the inbound rules to allow SSH and All traffic to itself. This must be set explicitly to the security group ID for EFA to work, as seen in the following screenshot.

SecurityGroupInboundRules


 

Then edit the outbound rules and add a rule that allows all outbound traffic to the security group itself, as seen in the following screenshot. This is a requirement for EFA to work.

SecurityGroupOutboundRules

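If you would rather script these rules than click through the console, a hedged CLI sketch of the self-referencing ingress and egress rules follows; the security group ID is a placeholder:

SG_ID=sg-0123456789abcdef0

# Allow all inbound traffic from the security group itself
aws ec2 authorize-security-group-ingress --group-id $SG_ID \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"

# Allow all outbound traffic to the security group itself
aws ec2 authorize-security-group-egress --group-id $SG_ID \
  --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=$SG_ID}]"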

 

Now, create an ecsInstanceRole, the Amazon ECS instance profile that will be applied to Amazon EC2 instances in a Compute Environment. To create a role, follow these steps.

  1. Choose Roles, then Create Role.
  2. Select AWS Service, then EC2.
  3. Choose Permissions.
  4. Attach the managed policy AmazonEC2ContainerServiceforEC2Role.
  5. Name the role ecsInstanceRole.

 

You will create the launch template using the ID of the security group, the ID of a subnet in your default VPC, and the ecsInstanceRole that you created.

Next, choose an instance type that supports EFA, that’s denoted by the n in the instance name. This example uses c5n.18xlarge instances.

You also need an Amazon Machine Image (AMI) ID. This example uses the latest ECS-optimized AMI based on Amazon Linux 2. Grab the AMI ID that corresponds to the Region that you are using.

This example uses UserData to install EFA. This adds 1.5 minutes of bootstrap time to the instance launch. In production workloads, bake the EFA installation into the AMI to avoid this additional bootstrap delay.

Now create a file called launch_template.json with the following content, making sure to substitute the account ID, security group, subnet ID, AMI ID, and key name.

{
  "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
  "LaunchTemplateData": {
    "InstanceType": "c5n.18xlarge",
    "IamInstanceProfile": {
      "Arn": "arn:aws:iam::<Account Id>:instance-profile/ecsInstanceRole"
    },
    "NetworkInterfaces": [
      {
        "DeviceIndex": 0,
        "Groups": [
          "<Security Group>"
        ],
        "SubnetId": "<Subnet Id>",
        "InterfaceType": "efa",
        "Description": "NetworkInterfaces Configuration For EFA and Batch"
      }
    ],
    "Placement": {
      "GroupName": "efa"
    },
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [
          {
            "Key": "from-lt",
            "Value": "networkInterfacesConfig-EFA-Batch"
          }
        ]
      }
    ],
    "UserData": "TUlNRS1WZXJzaW9uOiAxLjAKQ29udGVudC1UeXBlOiBtdWx0aXBhcnQvbWl4ZWQ7IGJvdW5kYXJ5PSI9PU1ZQk9VTkRBUlk9PSIKCi0tPT1NWUJPVU5EQVJZPT0KQ29udGVudC1UeXBlOiB0ZXh0L2Nsb3VkLWJvb3Rob29rOyBjaGFyc2V0PSJ1cy1hc2NpaSIKCmNsb3VkLWluaXQtcGVyIG9uY2UgeXVtX3dnZXQgeXVtIGluc3RhbGwgLXkgd2dldAoKY2xvdWQtaW5pdC1wZXIgb25jZSB3Z2V0X2VmYSB3Z2V0IC1xIC0tdGltZW91dD0yMCBodHRwczovL3MzLXVzLXdlc3QtMi5hbWF6b25hd3MuY29tL2F3cy1lZmEtaW5zdGFsbGVyL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLU8gL3RtcC9hd3MtZWZhLWluc3RhbGxlci1sYXRlc3QudGFyLmd6CgpjbG91ZC1pbml0LXBlciBvbmNlIHRhcl9lZmEgdGFyIC14ZiAvdG1wL2F3cy1lZmEtaW5zdGFsbGVyLWxhdGVzdC50YXIuZ3ogLUMgL3RtcAoKcHVzaGQgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgpjbG91ZC1pbml0LXBlciBvbmNlIGluc3RhbGxfZWZhIC4vZWZhX2luc3RhbGxlci5zaCAteQpwb3AgL3RtcC9hd3MtZWZhLWluc3RhbGxlcgoKY2xvdWQtaW5pdC1wZXIgb25jZSBlZmFfaW5mbyAvb3B0L2FtYXpvbi9lZmEvYmluL2ZpX2luZm8gLXAgZWZhCgotLT09TVlCT1VOREFSWT09LS0K"
  }
}

Create a launch template from that file:

 

$ aws ec2 create-launch-template --cli-input-json file://launch_template.json
{
    "LaunchTemplate": {
        "LatestVersionNumber": 1,
        "LaunchTemplateId": "lt-*****************",
        "LaunchTemplateName": "EFA-Batch-LaunchTemplate",
        "DefaultVersionNumber": 1,
        "CreatedBy": "arn:aws:iam::************:user/desktop-user",
        "CreateTime": "2019-09-23T13:00:21.000Z"
    }
}

 

Creating a compute environment

Next, create an AWS Batch Compute Environment. This uses the information from the launch template EFA-Batch-LaunchTemplate created earlier.

 

{
  "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
  "type": "MANAGED",
  "state": "ENABLED",
  "computeResources": {
    "type": "EC2",
    "minvCpus": 0,
    "maxvCpus": 2088,
    "desiredvCpus": 0,
    "instanceTypes": [
      "c5n.18xlarge"
    ],
    "subnets": [
      "<same-subnet-as-in-LaunchTemplate>"
    ],
    "instanceRole": "arn:aws:iam::<account-id>:instance-profile/ecsInstanceRole",
    "launchTemplate": {
      "launchTemplateName": "EFA-Batch-LaunchTemplate",
      "version": "$Latest"
    }
  },
  "serviceRole": "arn:aws:iam::<account-id>:role/service-role/AWSBatchServiceRole"
}

 

Now, create the compute environment:

$ aws batch create-compute-environment --cli-input-json file://compute_environment.json
{
    "computeEnvironmentName": "EFA-Batch-ComputeEnvironment",
    "computeEnvironmentArn": "arn:aws:batch:us-east-1:<Account Id>:compute-environment"
}

 

Building the container image

To build the container, clone the repository that contains the Dockerfile used in this example.

First, clone the repository:

$ git clone https://github.com/aws-samples/aws-batch-efa.git

 

In that repository, there are several files, one of which is the following Dockerfile.

 

FROM amazonlinux:1
ENV USER efauser

RUN yum update -y
RUN yum install -y which util-linux make tar.x86_64 iproute2 gcc-gfortran openssh-serv
RUN pip-2.7 install supervisor

RUN useradd -ms /bin/bash $USER
ENV HOME /home/$USER

#####################################################
## SSH SETUP
ENV SSHDIR $HOME/.ssh
RUN mkdir -p ${SSHDIR} \
 && touch ${SSHDIR}/sshd_config \
 && ssh-keygen -t rsa -f ${SSHDIR}/ssh_host_rsa_key -N '' \
 && cp ${SSHDIR}/ssh_host_rsa_key.pub ${SSHDIR}/authorized_keys \
 && cp ${SSHDIR}/ssh_host_rsa_key ${SSHDIR}/id_rsa \
 && echo "  IdentityFile ${SSHDIR}/id_rsa" >> ${SSHDIR}/config \
 && echo "  StrictHostKeyChecking no" >> ${SSHDIR}/config \
 && echo "  UserKnownHostsFile /dev/null" >> ${SSHDIR}/config \
 && echo "  Port 2022" >> ${SSHDIR}/config \
 && echo 'Port 2022' >> ${SSHDIR}/sshd_config \
 && echo 'UsePrivilegeSeparation no' >> ${SSHDIR}/sshd_config \
 && echo "HostKey ${SSHDIR}/ssh_host_rsa_key" >> ${SSHDIR}/sshd_config \
 && echo "PidFile ${SSHDIR}/sshd.pid" >> ${SSHDIR}/sshd_config \
 && chmod -R 600 ${SSHDIR}/* \
 && chown -R ${USER}:${USER} ${SSHDIR}/

# check if ssh agent is running or not, if not, run
RUN eval `ssh-agent -s` && ssh-add ${SSHDIR}/id_rsa

#################################################
## EFA and MPI SETUP
RUN curl -O https://s3-us-west-2.amazonaws.com/aws-efa-installer/aws-efa-installer-1. \
 && tar -xf aws-efa-installer-1.5.0.tar.gz \
 && cd aws-efa-installer \
 && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify

RUN wget https://www.nas.nasa.gov/assets/npb/NPB3.3.1.tar.gz \
 && tar xzf NPB3.3.1.tar.gz
COPY make.def_efa /NPB3.3.1/NPB3.3-MPI/config/make.def
COPY suite.def    /NPB3.3.1/NPB3.3-MPI/config/suite.def

RUN cd /NPB3.3.1/NPB3.3-MPI \
 && make suite \
 && chmod -R 755 /NPB3.3.1/NPB3.3-MPI/

###################################################
## supervisor container startup

ADD conf/supervisord/supervisord.conf /etc/supervisor/supervisord.conf
ADD supervised-scripts/mpi-run.sh supervised-scripts/mpi-run.sh
RUN chmod 755 supervised-scripts/mpi-run.sh

EXPOSE 2022
ADD batch-runtime-scripts/entry-point.sh batch-runtime-scripts/entry-point.sh
RUN chmod 755 batch-runtime-scripts/entry-point.sh

CMD /batch-runtime-scripts/entry-point.sh

 

To build this Dockerfile, run the included Makefile with:

make

Now, push the created container image to Amazon Elastic Container Registry (ECR), so you can use it in your AWS Batch JobDefinition:

From the AWS CLI, create an ECR repository; we’ll call it aws-batch-efa:

$ aws ecr create-repository --repository-name aws-batch-efa

{
    "repository": {
        "registryId": "<Account-Id>",
        "repositoryName": "aws-batch-efa",
        "repositoryArn": "arn:aws:ecr:us-east-2:<Account-Id>:repository/aws-batch-efa",
        "createdAt": 1568154893.0,
        "repositoryUri": "<Account-Id>.dkr.ecr.us-east-2.amazonaws.com/aws-batch-efa"
    }
}

Edit the top of the makefile and add your AWS account number and AWS Region.

AWS_REGION=<REGION>
ACCOUNT_ID=<ACCOUNT-ID>

To push the image to the ECR repository, run:

make tag
make push
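
Under the hood, make tag and make push presumably wrap standard Docker and ECR commands; a hedged, hand-run equivalent would look roughly like this, where the account ID, Region, and the local image name aws-batch-efa:latest are assumptions:

# Authenticate Docker to your ECR registry (AWS CLI v1 syntax)
$(aws ecr get-login --no-include-email --region <REGION>)

# Tag the locally built image and push it to the aws-batch-efa repository
docker tag aws-batch-efa:latest <ACCOUNT-ID>.dkr.ecr.<REGION>.amazonaws.com/aws-batch-efa:latest
docker push <ACCOUNT-ID>.dkr.ecr.<REGION>.amazonaws.com/aws-batch-efa:latest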

 

Run the application

To run the application using AWS Batch multi-node parallel jobs, follow these steps.

 

Setting up the AWS Batch multi-node job definition

Set up the AWS Batch multi-node job definition and expose the EFA device to the container by following these steps.

 

First, create a file called job_definition.json with the following contents. This file holds the configurations for the AWS Batch JobDefinition. Specifically, this JobDefinition uses the newly supported field LinuxParameters.Devices to expose a particular device—in this case, the EFA device path /dev/infiniband/uverbs0—to the container. Be sure to substitute the image URI with the one you pushed to ECR in the previous step. This is used to start the container.

 

{
  "jobDefinitionName": "EFA-MPI-JobDefinition",
  "type": "multinode",
  "nodeProperties": {
    "numNodes": 8,
    "mainNode": 0,
    "nodeRangeProperties": [
      {
        "targetNodes": "0:",
        "container": {
          "user": "efauser",
          "image": "<Docker Image From Previous Section>",
          "vcpus": 72,
          "memory": 184320,
          "linuxParameters": {
            "devices": [
              {
                "hostPath": "/dev/infiniband/uverbs0"
              }
            ]
          },
          "ulimits": [
            {
              "hardLimit": -1,
              "name": "memlock",
              "softLimit": -1
            }
          ]
        }
      }
    ]
  }
}

 

$ aws batch register-job-definition --cli-input-json file://job_definition.json
{
    "jobDefinitionArn": "arn:aws:batch:us-east-1:<account-id>:job-definition/EFA-MPI-JobDefinition",
    "jobDefinitionName": "EFA-MPI-JobDefinition",
    "revision": 1
}

 

Run the job

Next, create a job queue. This job queue points at the compute environment created before. When jobs are submitted to it, they queue until instances are available to run them.

 

{
  "jobQueueName": "EFA-Batch-JobQueue",
  "state": "ENABLED",
  "priority": 10,
  "computeEnvironmentOrder": [
    {
      "order": 1,
      "computeEnvironment": "EFA-Batch-ComputeEnvironment"
    }
  ]
}

 

aws batch create-job-queue --cli-input-json file://job_queue.json

Now that you’ve created all the resources, submit the job. The numNodes=8 parameter tells the job definition to use eight nodes.

aws batch submit-job --job-name example-mpi-job --job-queue EFA-Batch-JobQueue --job-definition EFA-MPI-JobDefinition --node-overrides numNodes=8

 

NPB overview

NPB is a small set of benchmarks derived from computational fluid dynamics (CFD) applications. They consist of five kernels and three pseudo-applications. This example runs the 3D Fast Fourier Transform (FFT) benchmark, as it tests all-to-all communication. For this run, use c5n.18xlarge, as configured in the AWS Batch compute environment earlier. This is an excellent choice for this workload as it has an Intel Skylake processor (72 hyperthreaded cores) and 100 Gbps enhanced networking (ENA), which you can take advantage of with EFA.

 

This test runs the FT “C” Benchmark with eight nodes * 72 vcpus = 576 vcpus.

 

NAS Parallel Benchmarks 3.3 — FT Benchmark

No input file inputft.data. Using compiled defaults
Size : 512x 512x 512

Iterations : 20

Number of processes : 512
Processor array : 1x 512
Layout type : 1D

Initialization time = 1.3533580760000063

T = 1 Checksum = 5.195078707457D+02 5.149019699238D+02

T = 2 Checksum = 5.155422171134D+02 5.127578201997D+02
T = 3 Checksum = 5.144678022222D+02 5.122251847514D+02
T = 4 Checksum = 5.140150594328D+02 5.121090289018D+02
T = 5 Checksum = 5.137550426810D+02 5.121143685824D+02
T = 6 Checksum = 5.135811056728D+02 5.121496764568D+02
T = 7 Checksum = 5.134569343165D+02 5.121870921893D+02
T = 8 Checksum = 5.133651975661D+02 5.122193250322D+02
T = 9 Checksum = 5.132955192805D+02 5.122454735794D+02
T = 10 Checksum = 5.132410471738D+02 5.122663649603D+02
T = 11 Checksum = 5.131971141679D+02 5.122830879827D+02
T = 12 Checksum = 5.131605205716D+02 5.122965869718D+02
T = 13 Checksum = 5.131290734194D+02 5.123075927445D+02
T = 14 Checksum = 5.131012720314D+02 5.123166486553D+02
T = 15 Checksum = 5.130760908195D+02 5.123241541685D+02
T = 16 Checksum = 5.130528295923D+02 5.123304037599D+02
T = 17 Checksum = 5.130310107773D+02 5.123356167976D+02
T = 18 Checksum = 5.130103090133D+02 5.123399592211D+02
T = 19 Checksum = 5.129905029333D+02 5.123435588985D+02
T = 20 Checksum = 5.129714421109D+02 5.123465164008D+02

 

Result verification successful class = C

FT Benchmark Completed. Class = C

Size = 512x 512x 512
Iterations = 20

Time in seconds = 1.92
Total processes = 512
Compiled procs = 512
Mop/s total = 206949.17
Mop/s/process = 404.20

Operation type = floating point
Verification = SUCCESSFUL

 

Summary

In this post, we covered how to run MPI Batch jobs with an EFA-enabled elastic network interface using AWS Batch multi-node parallel jobs and an EC2 launch template. We used a launch template to configure the AWS Batch compute environment to launch an instance with the EFA device installed. We showed you how to expose the EFA device to the container. You also learned how to package an MPI benchmarking application, the NPB, as a Docker container, and how to run the application as an AWS Batch multi-node parallel job.

We hope you found the information in this post helpful and encouraging as to all the possibilities for HPC on AWS.

How to migrate symmetric exportable keys from AWS CloudHSM Classic to AWS CloudHSM

Post Syndicated from Mohamed AboElKheir original https://aws.amazon.com/blogs/security/migrate-symmetric-exportable-keys-aws-cloudhsm-classic-aws-cloudhsm/

In August 2017, we announced the “new” AWS CloudHSM service, which had a lot of improvements over AWS CloudHSM Classic (for clarity in this post I will refer to the services as New CloudHSM and CloudHSM Classic). These advantages in security, scalability, usability, and economy included FIPS 140-2 Level 3 certification, fully managed high availability and backup, a management console, and lower costs.

Now, we turn another page. The Luna 5 HSMs used for CloudHSM Classic are reaching end of life, and the CloudHSM Classic service is being subsequently decommissioned, so CloudHSM Classic users must migrate cryptographic key material to the New CloudHSM.

In this post, I’ll show you how to use the RSA OAEP (optimal asymmetric encryption padding) wrapping mechanism, which was introduced in the CloudHSM client version 2.0.0, to move key material from CloudHSM Classic to New CloudHSM without exposing the plain text of the key material outside the HSM boundaries. You’ll use an RSA public key to wrap the key material (export it in encrypted form) on CloudHSM Classic, then use the corresponding RSA Private Key to unwrap it on New CloudHSM.

NOTE: This solution only works for symmetric exportable keys. Asymmetric keys on CloudHSM Classic can’t be exported. To replace non-exportable and asymmetric keys, you must generate new keys on New CloudHSM, then use the old keys to decrypt and the new keys to re-encrypt your data.

Solution overview

My solution shows you how to use the CKDemo utility on CloudHSM Classic, and key_mgmt_util on New CloudHSM, to: generate an RSA wrapping key pair; use it to wrap keys on CloudHSM Classic; and then unwrap the keys on New CloudHSM. These are all done via the RSA OAEP mechanism.
The following diagram provides a summary of the steps involved in the solution:

Figure 1: Solution overview

Figure 1: Solution overview

  1. Generate the RSA wrapping key pair on New CloudHSM.
  2. Export the RSA Public Key to the New CloudHSM client instance.
  3. Move the RSA public key to the CloudHSM Classic client instance.
  4. Import the RSA public key to CloudHSM Classic.
  5. Wrap the key using the imported RSA public key.
  6. Move the wrapped key to the New CloudHSM client instance.
  7. Unwrap the key on New CloudHSM with the RSA Private Key.

NOTE: You can perform the same procedure using supported libraries, such as JCE (Java Cryptography Extension) and PKCS#11. For example, you can use the wrap_with_imported_rsa_key sample to import an RSA public key into CloudHSM Classic, use that key to wrap your CloudHSM Classic keys, and then use the rsa_wrapping sample (specifically the rsa_oaep_unwrap_key function) to unwrap the keys into New CloudHSM using the RSA OAEP mechanism.

Prerequisites

  1. An active New CloudHSM cluster with at least one active hardware security module (HSM). Follow the Getting Started Guide to create and initialize a New CloudHSM cluster.
  2. An Amazon Elastic Compute Cloud (Amazon EC2) instance with the New CloudHSM client installed and configured to connect to the New CloudHSM cluster. You can refer to the Getting Started Guide to configure and connect the client instance.
  3. New CloudHSM CU (crypto user) credentials.
  4. An EC2 instance with the CloudHSM Classic client installed and configured to connect to the CloudHSM Classic partition or the high-availability (HA) partition group that contains the keys you want to migrate. You can refer to this guide to install and configure a CloudHSM Classic Client.
  5. The Password of the CloudHSM Classic partition or HA partition group that contains the keys you want to migrate.
  6. The handle of the symmetric key on CloudHSM Classic you want to migrate.

Step 1: Generate the RSA wrapping key pair on CloudHSM

1.1. On the New CloudHSM client instance, run the key_mgmt_util command line tool, and log in as the CU, as described in Getting Started with key_mgmt_util.


Command:  loginHSM -u CU -s <CU user> -p <CU password>
    
	Cfm3LoginHSM returned: 0x00 : HSM Return: SUCCESS

	Cluster Error Status
	Node id 0 and err state 0x00000000 : HSM Return: SUCCESS

1.2. Run the following genRSAKeyPair command to generate an RSA key pair with the label classic_wrap. Take note of the private and public key handles, as they’ll be used in the coming steps.


Command:  genRSAKeyPair -m 2048 -e 65537 -l classic_wrap

	Cfm3GenerateKeyPair returned: 0x00 : HSM Return: SUCCESS

	Cfm3GenerateKeyPair:    public key handle: 407    private key handle: 408

	Cluster Error Status
	Node id 0 and err state 0x00000000 : HSM Return: SUCCESS

Step 2: Export the RSA public key to the New CloudHSM client instance

2.1. Run the following exportPubKey command to export the RSA public key to the New CloudHSM client instance using the public key handle you received in step 1.2 (407, in my example). This will export the public key to a file named wrapping_public.pem.


Command:  exportPubKey -k <public key handle> -out wrapping_public.pem

PEM formatted public key is written to wrapping_public.pem

	Cfm3ExportPubKey returned: 0x00 : HSM Return: SUCCESS

Step 3: Move the RSA public key to the CloudHSM Classic client instance

Move the RSA Public Key to the CloudHSM Classic client instance using scp (or any other tool you prefer).
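
For example, a minimal scp invocation from the New CloudHSM client instance (the user name and host are placeholders for your CloudHSM Classic client instance):

scp wrapping_public.pem ec2-user@<classic-client-instance>:~/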

Step 4: Import the RSA public key to CloudHSM Classic

4.1. On the CloudHSM Classic instance, use the cmu command as shown below to import the RSA public key with the label classic_wrap. You’ll need the partition or HA partition group password for this command, plus the slot number of the partition or HA partition group (you can get the slot number of your partition or HA partition group using the vtl listSlots command).


# cmu import -inputFile=wrapping_public.pem -label classic_wrap
Select token
 [1] Token Label: partition1
 [2] Token Label: partition2
 [3] Token Label: partition3
 Enter choice: <slot number>
Please enter password for token in slot 1 : <password>

4.2. Run the below command to get the handle (highlighted below) of the imported key.


# cmu list -label classic_wrap
Select token
 [1] Token Label: partition1
 [2] Token Label: partition2
 [3] Token Label: partition3
 Enter choice: <slot number>
Please enter password for token in slot 1 : <password>
handle=149	label=classic_wrap

4.3. Run the CKDemo utility.


# ckdemo

4.4. Open a session to the partition or HA partition group slot.


Enter your choice : 1

Slots available:
	slot#1 - LunaNet Slot
	slot#2 - LunaNet Slot
	...
Select a slot: <slot number>

SO[0], normal user[1], or audit user[2]? 1

Status: Doing great, no errors (CKR_OK)

4.5. Log in using the partition or HA partition group pin.


Enter your choice : 3
Security Officer[0]
Crypto-Officer  [1]
Crypto-User     [2]:
Audit-User      [3]: 1
Enter PIN          : <password>

Status: Doing great, no errors (CKR_OK)

4.6. Change the CKA_WRAP attribute of the imported RSA public key so that it can be used for wrapping. Use the imported public key handle you received in step 4.2 above (149, in my example).


Enter your choice : 25

Which object do you want to modify (-1 to list available objects) : <imported public key handle>

Edit template for set attribute operation.

(1) Add Attribute   (2) Remove Attribute   (0) Accept Template :1

 0 - CKA_CLASS                  1 - CKA_TOKEN
 2 - CKA_PRIVATE                3 - CKA_LABEL
 4 - CKA_APPLICATION            5 - CKA_VALUE
 6 - CKA_XXX                    7 - CKA_CERTIFICATE_TYPE
 8 - CKA_ISSUER                 9 - CKA_SERIAL_NUMBER
10 - CKA_KEY_TYPE              11 - CKA_SUBJECT
12 - CKA_ID                    13 - CKA_SENSITIVE
14 - CKA_ENCRYPT               15 - CKA_DECRYPT
16 - CKA_WRAP                  17 - CKA_UNWRAP
18 - CKA_SIGN                  19 - CKA_SIGN_RECOVER
20 - CKA_VERIFY                21 - CKA_VERIFY_RECOVER
22 - CKA_DERIVE                23 - CKA_START_DATE
24 - CKA_END_DATE              25 - CKA_MODULUS
26 - CKA_MODULUS_BITS          27 - CKA_PUBLIC_EXPONENT
28 - CKA_PRIVATE_EXPONENT      29 - CKA_PRIME_1
30 - CKA_PRIME_2               31 - CKA_EXPONENT_1
32 - CKA_EXPONENT_2            33 - CKA_COEFFICIENT
34 - CKA_PRIME                 35 - CKA_SUBPRIME
36 - CKA_BASE                  37 - CKA_VALUE_BITS
38 - CKA_VALUE_LEN             39 - CKA_LOCAL
40 - CKA_MODIFIABLE            41 - CKA_ECDSA_PARAMS
42 - CKA_EC_POINT              43 - CKA_EXTRACTABLE
44 - CKA_ALWAYS_SENSITIVE      45 - CKA_NEVER_EXTRACTABLE
46 - CKA_CCM_PRIVATE           47 - CKA_FINGERPRINT_SHA1
48 - CKA_OUID                  49 - CKA_X9_31_GENERATED
50 - CKA_PRIME_BITS            51 - CKA_SUBPRIME_BITS
52 - CKA_USAGE_COUNT           53 - CKA_USAGE_LIMIT
54 - CKA_EKM_UID               55 - CKA_GENERIC_1
56 - CKA_GENERIC_2             57 - CKA_GENERIC_3
58 - CKA_FINGERPRINT_SHA256
Select which one: 16
Enter boolean value: 1

CKA_WRAP=01

(1) Add Attribute   (2) Remove Attribute   (0) Accept Template :0

Status: Doing great, no errors (CKR_OK)

Step 5: Wrap the key using the imported RSA public key

5.1. Check whether the symmetric key you want to migrate is exportable. To do this, run the below command using the handle of the key you want to migrate, and confirm that the value of the CKA_EXTRACTABLE attribute (highlighted below) is equal to 1. Otherwise, the key can’t be exported.


Enter your choice : 27

Enter handle of object to display (-1 to list available objects): <handle of the key to be migrated>
Object handle=120
CKA_CLASS=00000004
CKA_TOKEN=01
CKA_PRIVATE=01
CKA_LABEL=Generated AES Key
CKA_KEY_TYPE=0000001f
CKA_ID=
CKA_SENSITIVE=01
CKA_ENCRYPT=01
CKA_DECRYPT=01
CKA_WRAP=01
CKA_UNWRAP=01
CKA_SIGN=01
CKA_VERIFY=01
CKA_DERIVE=01
CKA_START_DATE=
CKA_END_DATE=
CKA_VALUE_LEN=00000020
CKA_LOCAL=01
CKA_MODIFIABLE=01
CKA_EXTRACTABLE=01
CKA_ALWAYS_SENSITIVE=01
CKA_NEVER_EXTRACTABLE=00
CKA_CCM_PRIVATE=00
CKA_FINGERPRINT_SHA1=f8babf341748ba5810be21acc95c6d4d9fac75aa
CKA_OUID=29010002f90900005e850700
CKA_EKM_UID=
CKA_GENERIC_1=
CKA_GENERIC_2=
CKA_GENERIC_3=
CKA_FINGERPRINT_SHA256=7a8efcbff27703e281617be3c3d484dc58df6a78f6b144207c1a54ad32a98c00

Status: Doing great, no errors (CKR_OK)

5.2. Wrap the key using the imported RSA public key. This will create a file called wrapped.key that contains the wrapped key. Make sure to use the imported public key handle you received in step 4.2 above (149, in my example), and the handle of the key you want to migrate.


Enter your choice : 60
[1]DES-ECB        [2]DES-CBC        [3]DES3-ECB       [4]DES3-CBC
                                    [7]CAST3-ECB      [8]CAST3-CBC
[9]RSA            [10]TRANSLA       [11]DES3-CBC-PAD  [12]DES3-CBC-PAD-IPSEC
[13]SEED-ECB      [14]SEED-CBC      [15]SEED-CBC-PAD  [16]DES-CBC-PAD
[17]CAST3-CBC-PAD [18]CAST5-CBC-PAD [19]AES-ECB       [20]AES-CBC
[21]AES-CBC-PAD   [22]AES-CBC-PAD-IPSEC [23]ARIA-ECB  [24]ARIA-CBC
[25]ARIA-CBC-PAD
[26]RSA_OAEP    [27]SET_OAEP
Select mechanism for wrapping: 26

Enter filename of OAEP Source Data [0 for none]: 0

Enter handle of wrapping key (-1 to list available objects) : <imported public key handle>

Enter handle of key to wrap (-1 to list available objects) : <handle of the key to be migrated>
Wrapped key was saved in file wrapped.key

Status: Doing great, no errors (CKR_OK)

Step 6: Move the wrapped key to the New CloudHSM client instance

Move the wrapped key to the New CloudHSM client instance using scp (or any other tool you prefer).

Step 7: Unwrap the key on New CloudHSM with the RSA Private Key

7.1. On the New CloudHSM client instance, run key_mgmt_util and log in as the CU.


Command:  loginHSM -u CU -s <CU user> -p <CU password>

	Cfm3LoginHSM returned: 0x00 : HSM Return: SUCCESS

	Cluster Error Status
	Node id 0 and err state 0x00000000 : HSM Return: SUCCESS

7.2. Run the following unWrapKey command to unwrap the key using the RSA private key handle you received in step 1.2 (408, in my example). The output of the command shows the handle of the newly unwrapped key (highlighted below).


Command:  unWrapKey -f wrapped.key -w <private key handle> -m 8 -noheader -l unwrapped_aes -kc 4 -kt 31

	Cfm3CreateUnwrapTemplate2 returned: 0x00 : HSM Return: SUCCESS

	Cfm2UnWrapWithTemplate3 returned: 0x00 : HSM Return: SUCCESS

	Key Unwrapped.  Key Handle: 410

	Cluster Error Status
	Node id 0 and err state 0x00000000 : HSM Return: SUCCESS

Conclusion

Using RSA OAEP for key migration ensures that your key material doesn’t leave the HSM boundary in plain text, as it’s encrypted using an RSA public key before being exported from CloudHSM Classic, and it can only be decrypted by New CloudHSM through the RSA private key that is generated and kept on New CloudHSM.

My post provides an example of how to use the ckdemo and key_mgmt_util utilities for the migration, but the same procedure can also be performed using the supported software libraries, such as the Java JCE library or the PKCS#11 library, to migrate larger volumes of keys in an automated manner.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author photo

Mohamed AboElKheir

Mohamed AboElKheir is an Application Security Engineer who works with different teams to ensure AWS services, applications, and websites are designed and implemented to the highest security standards. He is a subject matter expert for CloudHSM and is always enthusiastic about assisting CloudHSM customers with advanced issues and use cases. Mohamed is passionate about InfoSec, specifically cryptography, penetration testing (he’s OSCP certified), application security, and cloud security (he’s AWS Security Specialty certified).

Visualizing Sensor Data in Amazon QuickSight

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualizing-sensor-data-in-amazon-quicksight/

This post is courtesy of Moheeb Zara, Developer Advocate, AWS Serverless

The Internet of Things (IoT) is a term used wherever physical devices are networked in some meaningful connected way. Often, this takes the form of sensor data collection and analysis. As the number of devices and size of data scales, it can become costly and difficult to keep up with demand.

Using AWS Serverless Application Model (AWS SAM), you can reduce the cost and time to market of an IoT solution. This guide demonstrates how to collect and visualize data from a low-cost, Wi-Fi connected IoT device using a variety of AWS services. Much of this can be accomplished within the AWS Free Usage Tier, which is necessary for the following instructions.

Services used

The following services are used in this example:

What’s covered in this post?

This post covers:

  • Connecting an Arduino MKR 1010 Wi-Fi device to AWS IoT Core.
  • Forwarding messages from an AWS IoT Core topic stream to a Lambda function.
  • Using a Kinesis Data Firehose delivery stream to store data in S3.
  • Analyzing and visualizing data stored in S3 using Amazon QuickSight.

Connect the device to AWS IoT Core using MQTT

The Arduino MKR 1010 is a low-cost, Wi-Fi enabled, IoT device, shown in the following image.

An Arduino MKR 1010 Wi-Fi microcontroller

Its analog and digital input and output pins can be used to read sensors or to write to actuators. Arduino provides a detailed guide on how to securely connect this device to AWS IoT Core. The following steps build upon it to push arbitrary sensor data to a topic stream and ultimately visualize that data using Amazon QuickSight.

  1. Start by following this comprehensive guide to using an Arduino MKR 1010 with AWS IoT Core. Upon completion, your device is connected to AWS IoT Core using MQTT (Message Queuing Telemetry Transport), a protocol for publishing and subscribing to messages using topics.
  2. In the Arduino IDE, choose File, Sketch, Include Library, and Manage Libraries.
  3. In the window that opens, search for ArduinoJson and select the library by Benoit Blanchon. Choose install.

4. Add #include <ArduinoJson.h> to the top of your sketch from the Arduino guide.

5. Modify the publishMessage() function with this code. It publishes a JSON message with two keys: time (ms) and the current value read from the first analog pin.

void publishMessage() {  
  Serial.println("Publishing message");

  // send message, the Print interface can be used to set the message contents
  mqttClient.beginMessage("arduino/outgoing");
  
  // create json message to send
  StaticJsonDocument<200> doc;
  doc["time"] = millis();
  doc["sensor_a0"] = analogRead(0);
  serializeJson(doc, mqttClient); // print to client
  
  mqttClient.endMessage();
}

6. Save and upload the sketch to your board.

Create a Kinesis Firehose delivery stream

Amazon Kinesis Data Firehose is a service that reliably loads streaming data into data stores, data lakes, and analytics tools. Amazon QuickSight requires a data store to create visualizations of the sensor data. This simple Kinesis Data Firehose delivery stream continuously uploads data to an S3 storage bucket. The next sections cover how to add records to this stream using a Lambda function.

  1. In the Kinesis Data Firehose console, create a new delivery stream, called SensorDataStream.
  2. Leave the default source as a Direct PUT or other sources and choose Next.
  3. On the next screen, leave all the default values and choose Next.
  4. Select Amazon S3 as the destination and create a new bucket with a unique name. This is where records are continuously uploaded so that they can be used by Amazon QuickSight.
  5. On the next screen, choose Create New IAM Role, Allow. This gives the Firehose delivery stream permission to upload to S3.
  6. Review and then choose Create Delivery Stream.

It can take some time to fully create the stream. In the meantime, continue on to the next section.
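While you wait, you can optionally confirm that the delivery stream accepts records. The following is a minimal boto3 sketch, assuming default credentials, the SensorDataStream name chosen above, and a sample record that mirrors the test payload used later in this post.

    import json
    import boto3

    firehose = boto3.client("firehose")

    # Put a single test record on the delivery stream created above
    sample = {"time": 1567023375013, "sensor_a0": 456}
    response = firehose.put_record(
        DeliveryStreamName="SensorDataStream",
        Record={"Data": (json.dumps(sample) + "\n").encode("utf-8")},
    )
    print("RecordId:", response["RecordId"])

Records are buffered before delivery, so allow a few minutes before checking the S3 bucket.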

Invoking Lambda using AWS IoT Core rules

Using AWS IoT Core rules, you can forward messages from devices to a Lambda function, which can perform actions such as uploading to an Amazon DynamoDB table or an S3 bucket, or running data against various Amazon Machine Learning services. In this case, the function transforms and adds a message to the Kinesis Data Firehose delivery stream, which then adds that data to S3.

AWS IoT Core rules use the MQTT topic stream to trigger interactions with other AWS services. An AWS IoT Core rule is created by using an SQL statement, a topic filter, and a rule action. The Arduino example publishes messages every five seconds on the topic arduino/outgoing. The following instructions show how to consume those messages with a Lambda function.
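The next two sections create the Lambda function and the rule from the console. As a rough programmatic sketch of the same rule, a boto3 call might look like the following; the rule name and function ARN are placeholders, and, unlike the console, you would also need to add a resource-based permission allowing iot.amazonaws.com to invoke the function.

    import boto3

    iot = boto3.client("iot")

    # Hypothetical rule: forward every message on arduino/outgoing to Lambda
    iot.create_topic_rule(
        ruleName="ArduinoToLambda",
        topicRulePayload={
            "sql": "SELECT * FROM 'arduino/outgoing'",
            "actions": [
                {
                    "lambda": {
                        "functionArn": "arn:aws:lambda:us-east-1:123456789012:function:ArduinoConsumeMessage"
                    }
                }
            ],
            "ruleDisabled": False,
        },
    )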

Create a Lambda function

Before creating an AWS IoT Core rule, you need a Lambda function to consume forwarded messages.

  1. In the AWS Lambda console, choose Create function.
  2. Name the function ArduinoConsumeMessage.
  3. For Runtime, choose Author From Scratch, Node.js 10.x. For Execution role, choose Create a new role with basic Lambda permissions. Choose Create.
  4. On the Execution role card, choose View the ArduinoConsumeMessage-role-xxxx on the IAM console.
  5. Choose Attach Policies. Then, search for and select AmazonKinesisFirehoseFullAccess.
  6. Choose Attach Policy. This applies the necessary permissions to add records to the Firehose delivery stream.
  7. In the Lambda console, in the Designer card, select the function name.
  8. Paste the following in the code editor, replacing SensorDataStream with the name of your own Firehose delivery stream. Choose Save.
const AWS = require('aws-sdk')

const firehose = new AWS.Firehose()
const StreamName = "SensorDataStream" // replace with the name of your delivery stream

exports.handler = async (event) => {

    console.log('Received IoT event:', JSON.stringify(event, null, 2))

    // Shape the record that ends up in S3: a timestamp and the sensor reading
    let payload = {
        time: new Date(event.time),
        sensor_value: event.sensor_a0
    }

    let params = {
        DeliveryStreamName: StreamName,
        Record: {
            Data: JSON.stringify(payload)
        }
    }

    // Put the record on the Kinesis Data Firehose delivery stream
    return await firehose.putRecord(params).promise()
}

Create an AWS IoT Core rule

To create an AWS IoT Core rule, follow these steps.

  1. In the AWS IoT console, choose Act.
  2. Choose Create.
  3. For Rule query statement, copy and paste SELECT * FROM 'arduino/outgoing'. This subscribes to the outgoing message topic used in the Arduino example.
  4. Choose Add action, Send a message to a Lambda function, Configure action.
  5. Select the function created in the last set of instructions.
  6. Choose Create rule.

At this stage, any message published to the arduino/outgoing topic forwards to the ArduinoConsumeMessage Lambda function, which transforms and puts the payload on the Kinesis Data Firehose stream and also logs the message to Amazon CloudWatch. If you’ve connected an Arduino device to AWS IoT Core, it publishes to that topic every five seconds.
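If you don’t have a device connected yet, you can also publish a test message yourself from a script. The following is a minimal boto3 sketch, assuming default credentials and the arduino/outgoing topic used above; the payload mirrors the test payload shown in the console steps that follow.

    import json
    import boto3

    iot_data = boto3.client("iot-data")

    # Publish a sample reading to the topic the rule subscribes to
    iot_data.publish(
        topic="arduino/outgoing",
        qos=0,
        payload=json.dumps({"time": 1567023375013, "sensor_a0": 456}).encode("utf-8"),
    )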

The following steps show how to test functionality using the AWS IoT console.

  1. In the AWS IoT console, choose Test.
  2. For Publish, enter the topic arduino/outgoing.
  3. Enter the following test payload:
    {
      "time": 1567023375013,
      "sensor_a0": 456
    }
  4. Choose Publish to topic.
  5. Navigate back to your Lambda function.
  6. Choose Monitoring, View logs in CloudWatch.
  7. Select a log item to view the message contents, as shown in the following screenshot.

Visualizing data with Amazon QuickSight

To visualize data with Amazon QuickSight, follow these steps.

  1. In the Amazon QuickSight console, sign up.
  2. Choose Manage Data, New Data Set. Select S3 as the data source.
  3. A manifest file is necessary for Amazon QuickSight to be able to fetch data from your S3 bucket. Copy the following into a file named manifest.json. Replace YOUR-BUCKET-NAME with the name of the bucket created for the Firehose delivery stream.
    {
       "fileLocations":[
          {
             "URIPrefixes":[
                "s3://YOUR-BUCKET-NAME/"
             ]
          }
       ],
       "globalUploadSettings":{
          "format":"JSON"
       }
    }
  4. Upload the manifest.json file.
  5. Choose Connect, then Visualize. You may have to give Amazon QuickSight explicit permissions to your S3 bucket.
  6. Finally, design the Amazon QuickSight visualizations in the drag and drop editor. Drag the two available fields into the center card to generate a Sum of Sensor_value by Time visual.

Conclusion

This post demonstrated visualizing data from a securely connected remote IoT device. This was achieved by connecting an Arduino to AWS IoT Core using MQTT, forwarding messages from the topic stream to Lambda using IoT Core rules, putting records on an Amazon Kinesis Data Firehose delivery stream, and using Amazon QuickSight to visualize the data stored within an S3 bucket.

With these building blocks, it is possible to implement highly scalable and customizable IoT data collection, analysis, and visualization. With the use of other AWS services, you can build a full end-to-end platform for an IoT product that can reliably handle volume. To further explore how hardware and AWS Serverless can work together, visit the Amazon Web Services page on Hackster.

One to Many: Evolving VPC Design

Post Syndicated from Androski Spicer original https://aws.amazon.com/blogs/architecture/one-to-many-evolving-vpc-design/

Since its inception, the Amazon Virtual Private Cloud (VPC) has acted as the embodiment of security and privacy for customers who are looking to run their applications in a controlled, private, secure, and isolated environment.

This logically isolated space has evolved, and in its evolution has increased the avenues that customers can take to create and manage multi-tenant environments with multiple integration points for access to resources on-premises.

This blog is a two-part series that begins with a look at the Amazon VPC as a single unit of networking in the AWS Cloud but eventually takes you to a world in which simplified architectures for establishing a global network of VPCs are possible.

From One VPC: Single Unit of Networking

To be successful with the AWS Virtual Private Cloud you first have to define success for today and what success might look like as your organization’s adoption of the AWS cloud increases and matures. In essence, your VPCs should be designed to satisfy the needs of your applications today and must be scalable to accommodate future needs.

Classless Inter-Domain Routing (CIDR) notations are used to denote the size of your VPC. AWS allows you to specify a CIDR block between /16 and /28. The largest, /16, provides you with 65,536 IP addresses and the smallest allowed CIDR block, /28, provides you with 16 IP addresses. Note that the first four IP addresses and the last IP address in each subnet CIDR block are reserved and cannot be assigned to an instance.
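As a quick sanity check of these numbers, the Python standard library’s ipaddress module can do the arithmetic, treating each CIDR block as a single subnet:

    import ipaddress

    for cidr in ("10.0.0.0/16", "10.0.0.0/28"):
        net = ipaddress.ip_network(cidr)
        # AWS reserves the first four addresses and the last address in a subnet
        usable = net.num_addresses - 5
        print(f"{cidr}: {net.num_addresses} total addresses, {usable} usable if used as one subnet")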

AWS VPC supports both IPv4 and IPv6. It is required that you specify an IPv4 CIDR range when creating a VPC. Specifying an IPv6 range is optional.

Customers can specify ANY IPv4 address space for their VPC. This includes but is not limited to RFC 1918 addresses.

After creating your VPC, you divide it into subnets. In an AWS VPC, subnets are not isolation boundaries around your application. Rather, they are containers for routing policies.

Isolation is achieved by attaching an AWS Security Group (SG) to the EC2 instances that host your application. SGs are stateful firewalls, meaning that connections are tracked to ensure return traffic is allowed. They control inbound and outbound access to the elastic network interfaces that are attached to an EC2 instance. These should be tightly configured, only allowing access as needed.
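As an illustration of a tightly configured rule, the following boto3 sketch allows inbound HTTPS only from a specific corporate CIDR; the security group ID and CIDR are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # placeholder security group ID
        IpPermissions=[
            {
                "IpProtocol": "tcp",
                "FromPort": 443,
                "ToPort": 443,
                "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "Corporate network only"}],
            }
        ],
    )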

It is our best practice to create subnets in categories. There are two main categories: public subnets and private subnets. At minimum, they should be designed as outlined in the following diagrams for IPv4 and IPv6 subnet design.

Recommended IPv4 subnet design pattern

Recommended IPv6 subnet design pattern

Subnet types are distinguished by whether applications and users on the internet can directly initiate access to the infrastructure within a subnet.

Public Subnets

Public subnets are attached to a route table that has a default route to the Internet via an Internet gateway.

Resources in a public subnet can have a public IP or Elastic IP (EIP) that has a NAT to the Elastic Network Interface (ENI) of the virtual machines or containers that hosts your application(s). This is a one-to-one NAT that is performed by the Internet gateway.
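A minimal boto3 sketch of what makes a subnet “public” follows; the route table, Internet gateway, and subnet IDs are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Default route to the Internet through the Internet gateway
    ec2.create_route(
        RouteTableId="rtb-0123456789abcdef0",
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId="igw-0123456789abcdef0",
    )

    # Associate the route table with the subnet that should be public
    ec2.associate_route_table(
        RouteTableId="rtb-0123456789abcdef0",
        SubnetId="subnet-0123456789abcdef0",
    )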

Illustration of public subnet access path to the Internet through the Internet Gateway (IGW)

Private Subnets

A private subnet contains infrastructure that isn’t directly accessible from the Internet. Unlike the public subnet, this infrastructure only has private IPs.

Infrastructure in a private subnet gains access to resources or users on the Internet through some form of NAT (network address translation) infrastructure.

AWS natively provides NAT capability through the use of the NAT Gateway service. Customers can also create NAT instances that they manage or leverage third-party NAT appliances from the AWS Marketplace.

In most scenarios, it is recommended to use the AWS NAT Gateway as it is highly available (in a single Availability Zone) and is provided as a managed service by AWS. It supports 5 Gbps of bandwidth per NAT gateway and automatically scales up to 45 Gbps.

An AWS NAT gateway’s high availability is confined to a single Availability Zone. For high availability across AZs, it is recommended to have a minimum of two NAT gateways (in different AZs). This allows you to switch to an available NAT gateway in the event that one should become unavailable.

This approach allows you to zone your Internet traffic, reducing cross Availability Zone connections to the Internet. More details on NAT gateway are available here.
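To make the per-AZ pattern concrete, the following boto3 sketch creates a NAT gateway in a public subnet and points a private route table at it; all IDs are placeholders, and in practice you would repeat this for each Availability Zone.

    import boto3

    ec2 = boto3.client("ec2")

    # NAT gateway lives in the public subnet of Availability Zone A
    nat = ec2.create_nat_gateway(
        SubnetId="subnet-0aaa1111bbbb2222c",          # public subnet (AZ a)
        AllocationId="eipalloc-0123456789abcdef0",    # Elastic IP allocation
    )["NatGateway"]

    # Private subnets in the same AZ route Internet-bound traffic through it
    ec2.create_route(
        RouteTableId="rtb-0ddd3333eeee4444f",         # private route table (AZ a)
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat["NatGatewayId"],
    )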

Illustration of an environment with a single NAT Gateway (NAT-GW)

Illustration of high availability with a multiple NAT Gateways (NAT-GW) attached to their own route table

Illustration of the failure of one NAT Gateway and the fail over to an available NAT Gateway by the manual changing of the default route next hop in private subnet A route table

AWS allocated IPv6 addresses are Global Unicast Addresses by default. That said, you can privatize these subnets by using an Egress-Only Internet Gateway (E-IGW), instead of a regular Internet gateway. E-IGWs are purpose-built to prevent users and applications on the Internet from initiating access to infrastructure in your IPv6 subnet(s).

Illustration of internet access for hybrid IPv6 subnets through an Egress-Only Internet Gateway (E-IGW)

Applications hosted on instances living within a private subnet can have different access needs. Some require access to the Internet while others require access to databases, applications, and users that are on-premises. For this type of access, AWS provides two avenues: the Virtual Gateway and the Transit Gateway. The Virtual Gateway can only support a single VPC at a time, while the Transit Gateway is built to simplify the interconnectivity of tens to hundreds of VPCs and then aggregate their connectivity to resources on-premises. Given that we are looking at the VPC as a single unit of networking, all diagrams below contain illustrations of the Virtual Gateway, which acts as a WAN concentrator for your VPC.

Illustration of private subnets connecting to data center via a Virtual Gateway (VGW)

 

Illustration of private subnets connecting to Data Center via a VGW

 

Illustration of private subnets connecting to Data Center using AWS Direct Connect as primary and IPsec as backup

The above diagram illustrates a WAN connection between a VGW attached to a VPC and a customer’s data center.

AWS provides two options for establishing a private connectivity between your VPC and on-premises network: AWS Direct Connect and AWS Site-to-Site VPN.

AWS Site-to-Site VPN configuration leverages IPSec, with each connection providing two redundant IPSec tunnels. AWS supports both static routing and dynamic routing (through the use of BGP).

BGP is recommended, as it allows dynamic route advertisement, high availability through failure detection, and fail over between tunnels in addition to decreased management complexity.

VPC Endpoints: Gateway & Interface Endpoints

Applications running inside your subnet(s) may need to connect to AWS public services (like Amazon S3, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), Amazon API Gateway, etc.) or applications in another VPC that lives in another account. For example, you may have a database in another account that you would like to expose to applications that live in a completely different account and subnet.

For these scenarios you have the option to leverage an Amazon VPC Endpoint.

There are two types of VPC Endpoints: Gateway Endpoints and Interface Endpoints.

Gateway Endpoints only support Amazon S3 and Amazon DynamoDB. Upon creation, a gateway is added to your specified route table(s) and acts as the destination for all requests to the service it is created for.

Interface Endpoints differ significantly and can only be created for services that are powered by AWS PrivateLink.

Upon creation, AWS creates an interface endpoint consisting of one or more Elastic Network Interfaces (ENIs). Each AZ can support one interface endpoint ENI. This acts as a point of entry for all traffic destined to a specific PrivateLink service.

When an interface endpoint is created, associated DNS entries are created that point to the endpoint and each ENI that the endpoint contains. To access the PrivateLink service you must send your request to one of these hostnames.

Ensure the Private DNS feature is enabled for AWS public and AWS Marketplace services.

Since interface endpoints leverage ENIs, customers can use cloud techniques they are already familiar with. The interface endpoint can be configured with a restrictive security group. These endpoints can also be easily accessed from both inside and outside the VPC. Access from outside a VPC can be accomplished through Direct Connect and VPN.
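As a hedged sketch of both endpoint types, the following boto3 calls create a gateway endpoint for Amazon S3 and an interface endpoint for Amazon SQS with Private DNS enabled; the VPC, subnet, security group, and route table IDs, as well as the Region in the service names, are placeholders.

    import boto3

    ec2 = boto3.client("ec2")

    # Gateway endpoint: adds a route for S3 to the specified route table(s)
    ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.us-east-1.s3",
        RouteTableIds=["rtb-0123456789abcdef0"],
    )

    # Interface endpoint: places an ENI in the subnet, protected by a security group
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId="vpc-0123456789abcdef0",
        ServiceName="com.amazonaws.us-east-1.sqs",
        SubnetIds=["subnet-0123456789abcdef0"],
        SecurityGroupIds=["sg-0123456789abcdef0"],
        PrivateDnsEnabled=True,
    )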

Illustration of a solution that leverages an interface and gateway endpoint

Customers can also create AWS Endpoint services for their applications or services running on-premises. This allows access to these services via an interface endpoint which can be extended to other VPCs (even if the VPCs themselves do not have Direct Connect configured).

VPC Sharing

At re:Invent 2018, AWS launched the feature VPC sharing, which helps customers control VPC sprawl by decoupling the boundary of an AWS account from the underlying VPC network that supports its infrastructure.

VPC sharing uses Amazon Resource Access Manager (RAM) to share subnets across accounts within the same AWS organization.

With VPC sharing, participating accounts in the same AWS organization create their application resources (such as EC2 instances) in subnets of a VPC that is owned and managed by a central networking account.

This allows customers to centralize the management of the network, its IP space, and the access paths to resources external to the VPC. Centralizing and reusing VPC components, such as NAT gateways and Direct Connect connections, reduces the cost of managing and maintaining the environment.
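As a minimal sketch of how a subnet is shared, the owner account could call AWS RAM as follows; the subnet ARN and participant account ID are placeholders, and both accounts are assumed to be in the same AWS organization.

    import boto3

    ram = boto3.client("ram")

    ram.create_resource_share(
        name="shared-app-subnets",
        resourceArns=[
            "arn:aws:ec2:us-east-1:111111111111:subnet/subnet-0123456789abcdef0"
        ],
        principals=["222222222222"],       # participant account ID
        allowExternalPrincipals=False,     # keep sharing inside the organization
    )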

Great, but there are times when a customer needs to build networks with multiple VPCs in and across AWS regions. How should this be done and what are the best practices?

This will be answered in part two of this blog.

 

 

Architecting multiple microservices behind a single domain with Amazon API Gateway

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/architecting-multiple-microservices-behind-a-single-domain-with-amazon-api-gateway/

This post is courtesy of Roberto Iturralde, Solutions Architect.

Today’s modern architectures are increasingly microservices-based, with separate engineering teams working independently on services with their own feature requirements and deployment pipelines. The benefits of this approach include increased agility and release velocity.

Microservice architectures also come with some challenges, particularly when they make up parts of a public service or API. These include enforcing engineering and security standards and collating application logs and metrics for a cross-service operational view.

It’s also important to have the microservices feel like a cohesive product to external customers, for authentication and metering in particular:

  • The engineering teams want autonomy.
  • The security team wants a cross-service view and to make it easy for the teams to adhere to the organization’s guidelines.
  • Customers want to feel like they’re using a unified product.

The AWS toolbox

AWS offers many services that you can weave together to meet these needs.

Amazon API Gateway is a fully managed service for deploying and managing a unified front door to your applications. It has features for routing your domain’s traffic to different backing microservices, enforcing consistent authentication and authorization with fine-grained permissions across them, and implementing consistent API throttling and usage metering. The microservice that backs a given API can live in another AWS account. You don’t have to expose it to the internet.

Amazon Cognito is a user management service with rich support for authentication and authorization of users. You can manage those users within Amazon Cognito or from other federated IdPs. Amazon Cognito can vend JSON Web Tokens and integrates natively with API Gateway to support OAuth scopes for fine-grained API access.

Amazon CloudWatch is a monitoring and management service that collects and visualizes data across AWS services. CloudWatch dashboards are customizable home pages that can contain graphs showing metrics and alarms. You can customize these to represent a specific microservice, a collection of microservices that comprise a product, or any other meaningful view with fine-grained access control to the dashboard.

AWS X-Ray is an analysis and debugging tool designed for distributed applications. It has tools to help gain insight into the performance of your microservices, and the APIs that front them, to measure and debug any potential customer impact.

AWS Service Catalog allows the central management and self-service creation of AWS resources that meet your organization’s guidelines and best practices. You can require separate permissions for managing catalog entries from deploying catalog entries, allowing a central team to define and publish templates for resources across the company.

Architectural options

There are many options for how you can combine these AWS services to meet your requirements. Your decisions may also depend on your expertise with AWS. The following features are common to all the designs below:

  • Amazon Route 53 has registered custom domains and hosts their DNS. You could also use an external registrar and DNS service.
  • AWS Certificate Manager (ACM) manages Transport Layer Security (TLS) certificates for the custom domains that route traffic to API Gateway APIs in a given account.
  • Amazon Cognito manages the users who access the APIs in API Gateway.
  • Service Catalog holds catalog products for API Gateway APIs that adhere to the organizational guidelines and best practices, such as security configuration and default API throttling. Microservice teams have permission to create an API pointed to their service and configure specific parameters, with approvals required for production environments. For more information, see Standardizing infrastructure delivery in distributed environments using AWS Service Catalog.

The following shows common design patterns and their high-level benefits and challenges.

Single AWS account

Microservices, their fronting API Gateway APIs, and supporting services are in the same AWS account. This account also includes core AWS services such as the following:

  • Route 53 for domain name registration and DNS
  • ACM for managing server certificates for your domain
  • Amazon Cognito for user management
  • Service Catalog for the catalog of best-practice product templates to use across the organization

Single AWS account example

Use this approach if you do not yet have a multi-account strategy or if you use AWS native tools for observability. With a single AWS account, the microservices can share the same networking topology, and so more easily communicate with each other when needed. With all the API Gateway APIs in the same AWS account, you can configure API throttling, metering, authentication, and authorization features for a unified experience for customers. You can also route traffic to a given API using subdomains or base path mapping in API Gateway.
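For example, base path mappings under a single custom domain could be created with a boto3 sketch like the following; the domain name, API IDs, and stage are placeholders.

    import boto3

    apigw = boto3.client("apigateway")

    # Map two microservice APIs under one custom domain:
    #   https://api.example.com/orders  -> orders API
    #   https://api.example.com/billing -> billing API
    for base_path, rest_api_id in [("orders", "a1b2c3d4e5"), ("billing", "f6g7h8i9j0")]:
        apigw.create_base_path_mapping(
            domainName="api.example.com",
            basePath=base_path,
            restApiId=rest_api_id,
            stage="prod",
        )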

A single AWS account can manage TLS certificates for AWS domains in one place. This feature is available to all API Gateway APIs. Having the microservices and their API Gateway APIs in the same AWS account gives more complete X-Ray service maps, given that X-Ray currently can’t analyze traces across AWS accounts. Similarly, you have a complete view of the metrics all AWS services publish to CloudWatch. This feature allows you to create CloudWatch dashboards that span the API Gateway APIs and their backing microservices.

There is an increased blast radius with this architecture, because the microservices share the same account. The microservices can impact each other through shared AWS service limits or mistakes by team members on other microservice teams. Most AWS services support tagging for cost allocation and granular access control, but there are some features of AWS services that do not. Because of this, it’s more difficult to separate the costs of each microservice completely.

Separate AWS accounts

When using separate AWS accounts, each API Gateway API lives in the same AWS account as its backing microservice. Separate AWS accounts hold the Service Catalog portfolio, domain registration (using Route 53), and aggregated logs from the microservices. The organization account, security account, and other core accounts are discussed further in the AWS Landing Zone Solution.

Separate AWS accounts

Use this architecture if you have a mature multi-account strategy and existing tooling for cross-account observability. In this approach, an AWS account encapsulates a microservice completely, for cost isolation and reduced blast radius. With the API Gateway API in the same account as the backing microservice, you have a complete view of the microservice in CloudWatch and X-Ray.

You can only meter API usage by microservice because API Gateway usage plans can’t track activity across accounts. Implement a process to ensure each customer’s API Gateway API key is the same across accounts for a smooth customer experience.

API Gateway base path mappings are local to an AWS account, so you must use subdomains to separate the microservices that comprise a product under a single domain. However, you can have a complete view of each microservice in the CloudWatch dashboards and X-Ray console for its AWS account. This creates a view across microservices that requires aggregation in a central AWS account or external tool.

Central API account

Using a central API account is similar to the separate account architecture, except the API Gateway APIs are in a central account.

Central API account

This architecture is the best approach for most users. It offers a balance of the benefits of microservice separation with the unification of particular services for a better end-user experience. Each microservice has an AWS account, which isolates it from the other services and reduces the risk of AWS service limit contention or accidents due to sharing the account with other engineering teams.

Because each microservice lives in a separate account, that account’s bill captures all the costs for that microservice. You can track the API costs, which are in the shared API account, using tags on API Gateway resources.

While the microservices are isolated in separate AWS accounts, the API Gateway throttling, metering, authentication, and authorization features are centralized for a consistent experience for customers. You can use subdomains or API Gateway base path mappings to route traffic to different API Gateway APIs. Also, the TLS certificates for your domains are centrally managed and available to all API Gateway APIs.

You can now split CloudWatch metrics, X-Ray traces, and application logs across accounts for a given microservice and its fronting API Gateway API. Unify these in a central AWS account or a third-party tool.

Conclusion

The breadth of the AWS Cloud presents many architectural options to customers. When designing your systems, it’s essential to understand the benefits and challenges of design decisions before implementing a solution.

This post walked you through three common architectural patterns for allowing independent microservice teams to operate behind a unified domain presented to your customers. The best approach for your organization depends on your priorities, experience, and familiarity with AWS.

How to use AWS Secrets Manager to securely store and rotate SSH key pairs

Post Syndicated from Maitreya Ranganath original https://aws.amazon.com/blogs/security/how-to-use-aws-secrets-manager-securely-store-rotate-ssh-key-pairs/

AWS Secrets Manager provides full lifecycle management for secrets within your environment. In this post, Maitreya and I will show you how to use Secrets Manager to store, deliver, and rotate SSH keypairs used for communication within compute clusters. Rotation of these keypairs is a security best practice, and sometimes a regulatory requirement. Traditionally, these keypairs have come with a number of tough challenges: for example, synchronizing key rotation across all compute nodes, enabling detailed logging and auditing, and managing which users are allowed to modify secrets.

However, rotating the keypair on all of a compute cluster’s nodes must be done in a tightly coordinated fashion, and failures generally result in availability risks. Moreover, the keypairs themselves are highly sensitive security credentials which must be carefully controlled with fine-grained access controls, detailed monitoring, and audit logging. These are precisely the types of tough challenges that AWS Secrets Manager solves for you.

In this post, we’ll show you how to secure, rotate, and use SSH keypairs for inter-cluster communication. You’ll use an AWS CloudFormation template to launch a cluster and configure Secrets Manager. Then we’ll show you how to use Secrets Manager to deliver the keypair to the cluster and use it for management operations, such as securely copying a file between nodes. Finally, we’ll use Secrets Manager to seamlessly rotate the keypair used by the cluster without any changes or outages. In this post, we’ve highlighted compute clusters, but you can use Secrets Manager to apply this solution directly to any SSH-based use case.

Solution overview

The following architecture diagram presents an overview of the solution:
 


Figure 1: Solution architecture

The sample architecture created by CloudFormation includes one master node, three worker nodes, AWS Secrets Manager (which uses a rotation AWS Lambda function), and AWS Systems Manager. Setting up the cluster is out of scope for this post; in our walkthrough, we’ll focus on the keypair rotation architecture.

Secrets Manager uses staging labels to identify different versions of a secret during rotation. A staging label is a text string. For example, by default, AWSCURRENT is attached to the current version of the secret, while AWSPENDING will be attached to new versions of the secret before they have been verified and deployed to corresponding resources.
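To make the label mechanics concrete, here is a minimal boto3 sketch (not the rotation function itself): it reads whichever version currently carries AWSCURRENT and then, much as the finishSecret step would, moves the label to a new version. The new version ID is a placeholder, and the secret name /dev/ssh matches the one you’ll create later in this post.

    import boto3

    sm = boto3.client("secretsmanager")

    # Read the version currently labeled AWSCURRENT
    current = sm.get_secret_value(SecretId="/dev/ssh", VersionStage="AWSCURRENT")

    # Promote a pending version: move AWSCURRENT from the old version to the new one
    sm.update_secret_version_stage(
        SecretId="/dev/ssh",
        VersionStage="AWSCURRENT",
        MoveToVersionId="EXAMPLE-NEW-VERSION-ID",      # placeholder
        RemoveFromVersionId=current["VersionId"],
    )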

As shown in the diagram:

  1. A secret is created in AWS Secrets Manager. The secret holds the SSH keypair that the master node will use to connect to the other nodes in the cluster. Upon keypair rotation, Secrets Manager will invoke a Lambda function (labeled 1.a in the diagram). The Lambda function will perform four steps:
    • 1.b: createSecret – create a new SSH keypair and store the private key as a new version of the secret.
    • 1.c: setSecret – label the newly created secret version with the label AWSPENDING and copy the public key to the worker nodes with AWS Systems Manager Run Command.

    The Lambda function will also perform two steps not shown in the diagram:

    • testSecret – verify that the new SSH keypair has been successfully deployed by invoking a test SSH connection.
    • finishSecret – set the staging label AWSCURRENT to the new secret version and remove the old keys from the worker nodes. This will also set the staging label AWSPREVIOUS to the old secret, allowing your administrator to have the ‘last known password’ if something goes wrong.

    An overview of the rotation Lambda function is available in the AWS Secrets Manager user guide. You have full control over the rotation function so that you can customize it to your needs. Note that no key is installed on the master node. Instead, the function will retrieve the private key from Secrets Manager only when it needs to securely communicate with the worker nodes. That private key is not saved on the master node’s filesystem but rather in volatile memory (per best practice, the private key variable is overwritten after successful authentication and deleted before the script exits); details about keeping secret data in volatile memory will follow later in this post.

  2. When the master node needs to communicate with any worker node, it will use an AWS SDK (Python Boto3) to read the SSH private key from Secrets Manager (2.a) and use the private key to establish an SSH tunnel with the worker nodes (2.b). The master node is authorized to read the private key from Secrets Manager because an AWS Identity and Access Management (IAM) role with a policy that allows it to access the secret is attached to the master node. The corresponding public key was deployed to each of the worker nodes during the rotation process in step one above.
  3. The secrets in Secrets Managers are encrypted with AWS Key Management System (KMS), and every version of the secret is encrypted with a unique data encryption key. The SSH key pair in the cluster will periodically rotate based on a configurable rotation interval, which you’ll configure from the Secrets Manager console later in this post. Each rotation repeats the process described in steps 1-2, resulting in a new version of the secret. Each new version will be encrypted using a new KMS data key, which provides an extra layer of security.
  4. The AWS Systems Manager Run Command will use the Amazon Elastic Compute Cloud (EC2) tag RotateSSHKeys with a value of True to identify the cluster’s worker node instances; a minimal Run Command sketch follows this list. Note that if you rely on tags as a security control, you must have clear governance and control over which users are able to change the tags and tag values on your EC2 instances.
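The following is a minimal sketch of that Run Command call, assuming the tag described in step 4 above; the shell command itself is illustrative only, not the rotation function’s actual implementation.

    import boto3

    ssm = boto3.client("ssm")

    # Target worker nodes by tag and append the new public key (illustrative command)
    ssm.send_command(
        Targets=[{"Key": "tag:RotateSSHKeys", "Values": ["True"]}],
        DocumentName="AWS-RunShellScript",
        Parameters={
            "commands": [
                "echo 'ssh-rsa AAAA...new-public-key' >> /home/ec2-user/.ssh/authorized_keys"
            ]
        },
    )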

Solution cost

Today, this solution will cost $0.48 an hour for the four t2.micro EC2 instances that comprise the sample cluster. Secrets Manager has a 30-day trial period; after that, one secret costs $0.40 per month and $0.05 per 10,000 API calls. There is no additional charge for AWS Systems Manager.

Deploying the sample solution

In this section, you’ll deploy a test stack that demonstrates the entire solution. After deployment, you’ll log in to the master node and securely copy a file to one of the worker nodes. Finally, you’ll use Secrets Manager to rotate and deploy a new SSH keypair. The CloudFormation templates and secret rotation code are available in the AWS GitHub repository.

Set up the sample deployment by selecting the AWS CloudFormation Launch Stack button below; by default, the stack will be deployed in the us-east-1 (N. Virginia) Region.
 
Select this image to open a link that starts building the CloudFormation stack

The template creates an Amazon Virtual Private Cloud (VPC), private and public subnets, EC2 instances (master node and mock cluster), and the IAM role and policies used for the EC2 instances.

  1. Select your EC2 SSH key pair and input your IP range as stack parameters. In the YourIPRange field, enter the CIDR of your machine or network only, as this ensures only hosts from your network can access the master server. You may leave all other parameters as default. This CloudFormation template launches four t2.micro instances in a new VPC. One instance will be tagged as MasterServer and the rest will be tagged WorkerServer1-3.

    Note: The SSH keypair referenced here will be used to connect from your local computer to the master node. It is distinct from the SSH keypair used by the master node to connect to the worker nodes.

     


    Figure 2: Enter the CIDR of your machine or network

    Important: For simplicity, the master node you’ll create in this walkthrough will be in a public subnet, making it accessible from the CIDR you provided above. However, this is not the most secure approach possible. Follow the guidance in the Amazon EC2 VPC documentation to securely configure your cluster in a private subnet, following the “defense in depth” principle.

  2. Monitor the status of the stack. When the status is CREATE_COMPLETE, the deployment is ready. Select the Outputs tab to find information about the newly created resources, and write down the master node’s public DNS and a worker node IP address. You’ll need both later in this post.
  3. Select the Launch Stack button to launch the AWS CloudFormation template that will deploy the Lambda function used by Secrets Manager. Accept the default values for the parameters. This template is designed for reusability; it can be applied to any SSH rotation use case.
     
    Select this image to open a link that starts building the CloudFormation stack

Next, create and configure a new secret from the Secrets Manager console to store the cluster communication SSH keypair.

Configuring a secret in AWS Secrets Manager

The CloudFormation template did not deploy a secret, so follow these steps to create a secret from the console and configure its rotation function. To create a new secret:

  1. Open the AWS Secrets Manager console and select Store New Secret.
  2. Select Other type of secrets, then select the Plaintext tab.
  3. As shown in Figure 3, enter {} to create an empty JSON value with no properties. This value will be initially populated with a keypair by the rotation Lambda function.
     

    Figure 3: Create an empty JSON value with no properties

  4. Keep the default encryption key and select Next. We’re keeping the default encryption key for the sake of simplicity in this example, but security best practices suggest using a Customer Master Key (CMK) that you’ve created.
  5. In Step 2: Name and description, name the secret /dev/ssh. The path of a secret can be used in the secret’s IAM policy to restrict users and roles to a secret or hierarchy of secrets. For example, the IAM policy could include /dev/* or /prod/* to control access to secrets in development or production, respectively.
  6. Add a description, then select Next.
     

    Figure 4: Add a description

  7. In Step 3: Configure rotation, choose Enable automatic rotation and enable a rotation interval of your choice, which you can configure using the rotation interval dropdown list.
  8. Select the Choose an AWS Lambda function drop-down and choose RotateSSH. This is the Lambda function that was deployed by the CloudFormation template.
  9. Select Next, then review your configuration and select Store. When the new secrets configuration is stored, the rotation Lambda function is immediately invoked, populating the value of the secret.
     

    Figure 5: Configure the rotation

Testing the sample solution

With the secret configuration completed and the instances up and running, you’re now going to securely copy a file from the master node to one of the worker nodes, using the SSH key stored in Secrets Manager to test the solution.

  1. Log in to the master node via SSH, using the EC2 key that you specified in the CloudFormation template.
  2. Once connected, securely copy a file from the master node to the worker node using SCP (secure copy protocol) by entering the command below. Replace <private-ip-of-worker> with the worker node IP address you noted earlier:
    
                python copy_file.py ec2-user <private-ip-of-worker>
            

Figure 6 shows the SSH login to the master node and the copy_file.py command being run against a worker node.
 


Figure 6: The ssh login to master node, and the copy_file.py command

During execution, the Python script uses the Secrets Manager get_secret_value API to retrieve the secret, which includes the private key. It then uses this key to establish a secure SSH connection with the worker node, without saving the private key to the master node’s storage.

You can review copy_file.py on the master node or on GitHub. In the get_private_key() function, you can read the secret value, which includes the private key:


    get_secret_value_response = client.get_secret_value(
        SecretId=secret_name
    )

In the copy_file function, a secure SSH connection is established using the private key held in memory, with Paramiko, a Python implementation of SSHv2:


    private_key_str = io.StringIO()
    # Write the private key to an in-memory file object (never to disk)
    private_key_str.write(private_key)
    private_key_str.seek(0)  # rewind so the key can be read from the beginning

    # Create the key object from the in-memory file
    key = paramiko.RSAKey.from_private_key(private_key_str)

    # Open a transport channel and authenticate with the public key method
    trans = paramiko.Transport(ip, 22)
    trans.start_client()
    trans.auth_publickey(user, key)
    del key
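The excerpt above ends after authentication; the full script on GitHub performs the actual copy. As an illustration only (an assumption, not the script’s actual code), one way to finish the job with Paramiko is to open an SFTP session over the authenticated transport:

    # Illustrative continuation (assumed, not part of the excerpt above)
    sftp = paramiko.SFTPClient.from_transport(trans)
    sftp.put("example.txt", "/home/ec2-user/example.txt")
    sftp.close()
    trans.close()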

To demonstrate the rotation of the SSH keypair, you’ll now manually invoke the rotation function:

  1. Return to the Secrets Manager console, select your /dev/ssh secret, and choose Retrieve Secret Value to see the key pair.
  2. Select Rotate secret immediately. In the pop-up window, confirm your choice by selecting Rotate.
     

    Figure 7: Set the “Secret value” and “Rotation configuration”

  3. Choose Rotate again to complete the rotation.
     

    Figure 8: Select “Rotate”

  4. Select the Close button to refresh the view, and then choose Retrieve Secret Value again.
  5. Once the rotation has completed, you can inspect the new keypair via the Secrets Manager console. Go back to the terminal and run the same python script to copy a file using SCP. Replace <private-ip-of-worker> with your own worker node IP address:
    
                    python copy_file.py ec2-user <private-ip-of-worker>
            

The file has now been transferred successfully using a new key pair, with no updates required.

Auditing and monitoring

You can monitor and audit all APIs used to create and rotate your keys in Secrets Manager via AWS CloudTrail. To view CloudTrail events, follow these steps:

  1. Open the CloudTrail console and select Event history.
  2. From the Filter dropdown field, select Event source, enter secret in the filter field, then select secretsmanager.amazonaws.com from the dropdown menu.
  3. From here, you can review Secrets Manager’s events, such as GetSecretValue, PutSecretValue, UpdateSecretVersionStage (which modifies the staging labels attached to a version of a secret), and RotationSucceeded, in the CloudTrail event history. These event logs help to audit secrets configuration, rotation, and access.
     

    Figure 9: The “Event history” window

Additionally, Secrets Manager can work with CloudWatch Events to trigger alerts when administrator-specified operations occur in an organization (for example, to notify you of a secret deletion attempt).
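You can also perform the same filtering programmatically. The following is a minimal boto3 sketch that mirrors the console filter above, assuming CloudTrail event history is available in the Region you query.

    import boto3

    cloudtrail = boto3.client("cloudtrail")

    # List recent Secrets Manager API activity
    events = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "secretsmanager.amazonaws.com"}
        ],
        MaxResults=10,
    )
    for event in events["Events"]:
        print(event["EventTime"], event["EventName"])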

Cleaning up the CloudFormation Stack

To delete the entire CloudFormation stack:

  1. Select the stack named RotateSSH from the CloudFormation console.
  2. Select Actions, and then Delete Stack. This will delete all AWS resources created by the stack.
  3. Repeat the steps above to delete the stack named MasterWorkers.
  4. From the AWS Secrets Manager console, delete the secret /dev/ssh. Read more about Deleting and Restoring a Secret in the AWS Secrets Manager User Guide.

Conclusion

In this post, we demonstrated how you can use AWS Secrets Manager to store, rotate, and deliver SSH keypairs in order to secure communication within a compute cluster. Keys are securely encrypted and stored in AWS Secrets Manager, which also rotates the keys and installs the public keys on all nodes for you. By using this method, you won’t have to manually deploy SSH keys on the various EC2 instances or manually rotate them. APIs associated with secrets management and rotation are logged in CloudTrail for auditing and monitoring. This key rotation solution is serverless: it does not require any servers to maintain and can scale rapidly.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the AWS Secrets Manager forum.

Want more AWS Security news? Follow us on Twitter.

Author

Assaf Namer

Assaf is a Senior Solutions Architect. He likes coding and hackathons, and enjoys helping customers build reliable and secure cloud solutions. Outside of work, Assaf enjoys spinning and tennis.

Author

Maitreya Ranganath

Maitreya is a Solutions Architect with the Enterprise team. He has a focus on Security and Compliance and enjoys helping customers architect secure, scalable, and cost-effective solutions on AWS.