Tag Archives: Expert (400)

Create a Multi-Region Python Package Publishing Pipeline with AWS CDK and CodePipeline

Post Syndicated from Brian Smitches original https://aws.amazon.com/blogs/devops/create-a-multi-region-python-package-publishing-pipeline-with-aws-cdk-and-codepipeline/

Customers can author and store internal software packages in AWS by leveraging the AWS CodeSuite (AWS CodePipeline, AWS CodeBuild, AWS CodeCommit, and AWS CodeArtifact).  As of the publish date of this blog post, there is no native way to replicate your CodeArtifact Packages across regions. This blog addresses how a custom solution built with the AWS Cloud Development Kit and AWS CodePipeline can create a Multi-Region Python Package Publishing Pipeline.

Whether it’s for resiliency or performance improvement, many customers want to deploy their applications across multiple regions. When applications are dependent on custom software packages, the software packages should be replicated to multiple regions as well. This post will walk through how to deploy a custom package publishing pipeline in your own AWS Account. This pipeline connects a Python package source code repository to build and publish pip packages to CodeArtifact Repositories spanning three regions (the primary and two replica regions). While this sample CDK Application is built specifically for pip packages, the underlying architecture can be reused for different software package formats, such as npm, Maven, NuGet, etc.

Solution overview

The following figure demonstrates the solution workflow:

  1. A CodePipeline pipeline orchestrates the building and publishing of the software package
    1. This pipeline is triggered by commits on the main branch of the CodeCommit repository
    2. A CodeBuild job builds the pip packages using twine to be distributed
    3. The publish stage (third column) uses three parallel CodeBuild jobs to publish the distribution package to the two CodeArtifact repositories in separate regions
  1. The first CodeArtifact Repository stores the package contents in the primary region.
  2. The second and third CodeArtifact Repository act as replicas and store the package contents in other regions.
Figure 1. A figure showing the architecture diagram

Figure 1.  Architecture diagram

All of these resources are defined in a single AWS CDK Application. The resources are defined in CDK Stacks that are deployed as AWS CloudFormation Stacks. AWS CDK can deploy the different stacks across separate regions.

Prerequisites

Before getting started, you will need the following:

  1. An AWS account
  2. An instance of the AWS Cloud9 IDE or an alternative local compute environment, such as your personal computer
  3. The following installed on your compute environment:
    1. AWS CDK
    2. AWS Command Line Interface (AWS CLI)
    3. npm
  1. The AWS Accounts must be bootstrapped for CDK in the necessary regions. The default configuration uses us-east-1, us-east-2 and us-west-2  as these three regions support CodeArtifact.

A new AWS Cloud9 IDE is recommended for this tutorial to isolate these actions in this post from your normal compute environment. See the Cloud9 Documentation for Creating an Environment.

Deploy the Python Package Publishing Pipeline into your AWS Account with the CDK

The source code can be found in this GitHub Repository.

  1. Fork the GitHub Repo into your account. This way you can experiment with changes as necessary to fit your workload.
  2. In your local compute environment, clone the GitHub Repository and cd into the project directory:
git clone [email protected]:<YOUR_GITHUB_USERNAME>/multi-region-
python-package-publishing-pipeline.git && cd multi-region-
python-package-publishing-pipeline
  1. Install the necessary node packages:
npm i
  1. (Optional) Override the default configurations for the CodeArtifact domainName, repositoryName, primaryRegion, and replicaRegions.
    1. navigate to ./bin/multiregion_package_publishing.ts and update the relevant fields.
    2. From the project’s root directory (multi-region-python-package-publishing-pipeline), deploy the AWS CDK application. This step may take 5-10 minutes.
cdk deploy --all
  1. When prompted “Do you wish to deploy these changes (y/n)?”, Enter y.

Viewing the deployed CloudFormation stacks

After the deployment of the AWS CDK application completes, you can view the deployed AWS CDK Stacks via CloudFormation. From the AWS Console, search “CloudFormation’ in the search bar and navigate to the service dashboard. In the primary region (us-east-1(N. Virginia)) you should see two stacks: CodeArtifactPrimaryStack-<region> and PackagePublishingPipelineStack.

Screenshot showing the CloudFormation Stacks in the primary region

Figure 2. Screenshot showing the CloudFormation Stacks in the primary region

Switch regions to one of the secondary regions us-west-2 (Oregon) or us-east-2 (Ohio) to see the remaining stacks named CodeArtifactReplicaStack-<region>. These correspond to the three AWS CDK Stacks from the architecture diagram.

Screenshot showing the CloudFormation stacks in a separate region

Figure 3. Screenshot showing the CloudFormation stacks in a separate region

Viewing the CodePipeline Package Publishing Pipeline

From the Console, select the primary region (us-east-1) and navigate to CodePipeline by utilizing the search bar. Select the Pipeline titled packagePipeline and inspect the state of the pipeline. This pipeline triggers after every commit from the CodeCommit repository named PackageSourceCode. If the pipeline is still in process, then wait a few minutes, as this pipeline can take approximately 7–8 minutes to complete all three stages (Source, Build, and Publish). Once it’s complete, the pipeline should reflect the following screenshot:

A screenshot showing the CodePipeline flow

Figure 4. A screenshot showing the CodePipeline flow

Viewing the Published Package in the CodeArtifact Repository

To view the published artifacts, go to the primary or secondary region and navigate to the CodeArtifact dashboard by utilizing the search bar in the Console. You’ll see a repository named package-artifact-repo. Select the repository and you’ll see the sample pip package named mypippackage inside the repository. This package is defined by the source code in the CodeCommit repository named PackageSourceCode in the primary region (us-east-1).

Screenshot of the package repository

Figure 5. Screenshot of the package repository

Create a new package version in CodeCommit and monitor the pipeline release

Navigate to your CodeCommit’s PackageSourceCode (us-east-1 CodeCommit > Repositories > PackageSourceCode. Open the setup.py file and select the Edit button. Make a simple modification, change the version = '1.0.0' to version = '1.1.0' and commit the changes to the Main branch.

A screenshot of the source package's code repository in CodeCommit

Figure 6. A screenshot of the source package’s code repository in CodeCommit

Now navigate back to CodePipeline and watch as the pipeline performs the release automatically. When the pipeline finishes, this new package version will live in each of the three CodeArtifact Repositories.

Install the custom pip package to your local Python Environment

For your development team to connect to this CodeArtifact Repository to download repositories, you must configure the pip tool to look in this repository. From your Cloud9 IDE (or local development environment), let’s test the installation of this package for Python3:

  1. Copy the connection instructions for the pip tool. Navigate to the CodeArtifact repository of your choice and select View connection instructions
    1. Select Copy to copy the snippet to your clipboard
Screenshot showing directions to connect to a code artifact repository

Figure 7. Screenshot showing directions to connect to a code artifact repository

  1. Paste the command from your clipboard
  2. Run pip install mypippackage==1.0.0
Screenshot showing CodeArtifact login

Figure 8. Screenshot showing CodeArtifact login

  1. Test the package works as expected by importing the modules
  2. Start the Python REPL by running python3 in the terminal
Screenshot of the package being imported

Figure 9. Screenshot of the package being imported

Clean up

Destroy all of the AWS CDK Stacks by running cdk destroy --all from the root AWS CDK application directory.

Conclusion

In this post, we walked through how to deploy a CodePipeline pipeline to automate the publishing of Python packages to multiple CodeArtifact repositories in separate regions. Leveraging the AWS CDK simplifies the maintenance and configuration of this multi-region solution by using Infrastructure as Code and predefined Constructs. If you would like to customize this solution to better fit your needs, please read more about the AWS CDK and AWS Developer Tools. Some links we suggest include the CodeArtifact User Guide (with sections covering npm, Python, Maven, and NuGet), the CDK API Reference, CDK Pipelines, and the CodePipeline User Guide.

About the authors:

Andrew Chen

Andrew Chen is a Solutions Architect with an interest in Data Analytics, Machine Learning, and DevOps. Andrew has previous experience in management consulting in which he worked as a technical architect for various cloud migration projects. In his free time, Andrew enjoys fishing, hiking, kayaking, and keeping up with financial markets.

Brian Smitches

Brian Smitches is a Solutions Architect with an interest in Infrastructure as Code and the AWS Cloud Development Kit. Brian currently supports Federal SMB Partners and has previous experience with Full Stack Application Development. In his personal time, Brian enjoys skiing, water sports, and traveling with friends and family.

Ingest streaming data to Apache Hudi tables using AWS Glue and Apache Hudi DeltaStreamer

Post Syndicated from Vishal Pathak original https://aws.amazon.com/blogs/big-data/ingest-streaming-data-to-apache-hudi-tables-using-aws-glue-and-apache-hudi-deltastreamer/

In today’s world with technology modernization, the need for near-real-time streaming use cases has increased exponentially. Many customers are continuously consuming data from different sources, including databases, applications, IoT devices, and sensors. Organizations may need to ingest that streaming data into data lakes built on Amazon Simple Storage Service (Amazon S3). You may also need to achieve analytics and machine learning (ML) use cases in near-real time. To ensure consistent results in those near-real-time streaming use cases, incremental data ingestion and atomicity, consistency, isolation, and durability (ACID) properties on data lakes have been a common ask.

To address such use cases, one approach is to use Apache Hudi and its DeltaStreamer utility. Apache Hudi is an open-source data management framework designed for data lakes. It simplifies incremental data processing by enabling ACID transactions and record-level inserts, updates, and deletes of streaming ingestion on data lakes built on top of Amazon S3. Hudi is integrated with well-known open-source big data analytics frameworks, such as Apache Spark, Apache Hive, Presto, and Trino, as well as with various AWS analytics services like AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift. The DeltaStreamer utility provides an easy way to ingest streaming data from sources like Apache Kafka into your data lake.

This post describes how to run the DeltaStreamer utility on AWS Glue to read streaming data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) and ingest the data into S3 data lakes. AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. With AWS Glue, you can create Spark, Spark Streaming, and Python shell jobs to extract, transform, and load (ETL) data. You can create AWS Glue Spark streaming ETL jobs using either Scala or PySpark that run continuously, consuming data from Amazon MSK, Apache Kafka, and Amazon Kinesis Data Streams and writing it to your target.

Solution overview

To demonstrate the DeltaStreamer utility, we use fictional product data that represents product inventory including product name, category, quantity, and last updated timestamp. Let’s assume we stream the data from data sources to an MSK topic. Now we want to ingest this data coming from the MSK topic into Amazon S3 so that we can run Athena queries to analyze business trends in near-real time.

The following diagram provides the overall architecture of the solution described in this post.

To simulate application traffic, we use Amazon Elastic Compute Cloud (Amazon EC2) to send sample data to an MSK topic. Amazon MSK is a fully managed service that makes it easy to build and run applications that use Apache Kakfa to process streaming data. To consume the streaming data from Amazon MSK, we set up an AWS Glue streaming ETL job that uses the Apache Hudi Connector 0.10.1 for AWS Glue 3.0, with the DeltaStreamer utility to write the ingested data to Amazon S3. The Apache Hudi Connector 0.9.0 for AWS Glue 3.0 also supports the DeltaStreamer utility.

As the data is being ingested, the AWS Glue streaming job writes the data into the Amazon S3 base path. The data in Amazon S3 is cataloged using the AWS Glue Data Catalog. We then use Athena, which is an interactive query service, to query and analyze the data using standard SQL.

Prerequisites

We use an AWS CloudFormation template to provision some resources for our solution. The template requires you to select an EC2 key pair. This key is configured on an EC2 instance that lives in the public subnet. We use this EC2 instance to ingest data to the MSK cluster running in a private subnet. Make sure you have a key in the AWS Region where you deploy the template. If you don’t have one, you can create a new key pair.

Create the Apache Hudi connection

To add the Apache Hudi Connector for AWS Glue, complete the following steps:

  1. On the AWS Glue Studio console, choose Connectors.
  2. Choose Go to AWS Marketplace.
  3. Search for and choose Apache Hudi Connector for AWS Glue.
  4. Choose Continue to Subscribe.
  5. Review the terms and conditions, then choose Accept Terms.

    After you accept the terms, it takes some time to process the request.
    When the subscription is complete, you see the Effective date populated next to the product.
  6. Choose Continue to Configuration.
  7. For Fulfillment option, choose Glue 3.0.
  8. For Software version, choose 0.10.1.
  9. Choose Continue to Launch.
  10. Choose Usage instructions, and then choose Activate the Glue connector from AWS Glue Studio.

    You’re redirected to AWS Glue Studio.
  11. For Name, enter Hudi-Glue-Connector.
  12. Choose Create connection and activate connector.

A message appears that the connection was successfully created. Verify that the connection is visible on the AWS Glue Studio console.

Launch the CloudFormation stack

For this post, we provide a CloudFormation template to create the following resources:

  • VPC, subnets, security groups, and VPC endpoints
  • AWS Identity and Access Management (IAM) roles and policies with required permissions
  • An EC2 instance running in a public subnet within the VPC with Kafka 2.12 installed and with the source data initial load and source data incremental load JSON files
  • An Amazon MSK server running in a private subnet within the VPC
  • An AWS Glue Streaming DeltaStreamer job to consume the incoming data from the Kafka topic and write it to Amazon S3
  • Two S3 buckets: one of the buckets stores code and config files, and other is the target for the AWS Glue streaming DeltaStreamer job

To create the resources, complete the following steps:

  1. Choose Launch Stack:
  2. For Stack name, enter hudi-deltastreamer-glue-blog.
  3. For ClientIPCIDR, enter the IP address of your client that you use to connect to the EC2 instance.
  4. For HudiConnectionName, enter the AWS Glue connection you created earlier (Hudi-Glue-Connector).
  5. For KeyName, choose the name of the EC2 key pair that you created as a prerequisite.
  6. For VpcCIDR, leave as is.
  7. Choose Next.
  8. Choose Next.
  9. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  10. Choose Create stack.

After the CloudFormation template is complete and the resources are created, the Outputs tab shows the following information:

  • HudiDeltastreamerGlueJob – The AWS Glue streaming job name
  • MSKCluster – The MSK cluster ARN
  • PublicIPOfEC2InstanceForTunnel – The public IP of the EC2 instance for tunnel
  • TargetS3Bucket – The S3 bucket name

Create a topic in the MSK cluster

Next, SSH to Amazon EC2 using the key pair you created and run the following commands:

  1. SSH to the EC2 instance as ec2-user:
    ssh -i <KeyName> ec2-user@<PublicIPOfEC2InstanceForTunnel>

    You can get the KeyName value on the Parameters tab and the public IP of the EC2 instance for tunnel on the Outputs tab of the CloudFormation stack.

  2. For the next command, retrieve the bootstrap server endpoint of the MSK cluster by navigating to msk-source-cluster on the Amazon MSK console and choosing View client information.
  3. Run the following command to create the topic in the MSK cluster hudi-deltastream-demo:
    ./kafka_2.12-2.6.2/bin/kafka-topics.sh --create \
    --topic hudi-deltastream-demo \
    --bootstrap-server "<replace text with value under private endpoint on MSK>" \
    --partitions 1 \
    --replication-factor 2 \
    --command-config ./config_file.txt

  4. Ingest the initial data from the deltastreamer_initial_load.json file into the Kafka topic:
    ./kafka_2.12-2.6.2/bin/kafka-console-producer.sh \
    --broker-list "<replace text with value under private endpoint on MSK>" \
    --topic hudi-deltastream-demo \
    --producer.config ./config_file.txt < deltastreamer_initial_load.json

The following is the schema of a record ingested into the Kafka topic:

{
  "type":"record",
  "name":"products",
  "fields":[{
     "name": "id",
     "type": "int"
  }, {
     "name": "category",
     "type": "string"
  }, {
     "name": "ts",
     "type": "string"
  },{
     "name": "name",
     "type": "string"
  },{
     "name": "quantity",
     "type": "int"
  }
]}

The schema uses the following parameters:

  • id – The product ID
  • category – The product category
  • ts – The timestamp when the record was inserted or last updated
  • name – The product name
  • quantity – The available quantity of the product in the inventory

The following code gives an example of a record:

{
    "id": 1, 
    "category": "Apparel", 
    "ts": "2022-01-02 10:29:00", 
    "name": "ABC shirt", 
    "quantity": 4
}

Start the AWS Glue streaming job

To start the AWS Glue streaming job, complete the following steps:

  1. On the AWS Glue Studio console, find the job with the value for HudiDeltastreamerGlueJob.
  2. Choose the job to review the script and job details.
  3. On the Job details tab, replace the value of the --KAFKA_BOOTSTRAP_SERVERS key with the Amazon MSK bootstrap server’s private endpoint.
  4. Choose Save to save the job settings.
  5. Choose Run to start the job.

When the AWS Glue streaming job runs, the records from the MSK topic are consumed and written to the target S3 bucket created by AWS CloudFormation. To find the bucket name, check the stack’s Outputs tab for the TargetS3Bucket key value.

The data in Amazon S3 is stored in Parquet file format. In this example, the data written to Amazon S3 isn’t partitioned, but you can enable partitioning by specifying hoodie.datasource.write.partitionpath.field=<column_name> as the partition field and setting hoodie.datasource.write.hive_style_partitioning to True in the Hudi configuration property in the AWS Glue job script.

In this post, we write the data to a non-partitioned table, so we set the following two Hudi configurations:

  • hoodie.datasource.hive_sync.partition_extractor_class is set to org.apache.hudi.hive.NonPartitionedExtractor
  • hoodie.datasource.write.keygenerator.class is set to org.apache.hudi.keygen.NonpartitionedKeyGenerator

DeltaStreamer options and configuration

DeltaStreamer has multiple options available; the following are the options set in the AWS Glue streaming job used in this post:

  • continuous – DeltaStreamer runs in continuous mode running source-fetch.
  • enable-hive-sync – Enables table sync to the Apache Hive Metastore.
  • schemaprovider-class – Defines the class for the schema provider to attach schemas to the input and target table data.
  • source-class – Defines the source class to read data and has many built-in options.
  • source-ordering-field – The field used to break ties between records with the same key in input data. Defaults to ts (the Unix timestamp of record).
  • target-base-path – Defines the path for the target Hudi table.
  • table-type – Indicates the Hudi storage type to use. In this post, it’s set to COPY_ON_WRITE.

The following are some of the important DeltaStreamer configuration properties set in the AWS Glue streaming job:

# Schema provider props (change to absolute path based on your installation)
hoodie.deltastreamer.schemaprovider.source.schema.file=s3://" + args("CONFIG_BUCKET") + "/artifacts/hudi-deltastreamer-glue/config/schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=s3://" + args("CONFIG_BUCKET") + "/artifacts/hudi-deltastreamer-glue/config/schema.avsc

# Kafka Source
hoodie.deltastreamer.source.kafka.topic=hudi-deltastream-demo

#Kafka props
bootstrap.servers=args("KAFKA_BOOTSTRAP_SERVERS")
auto.offset.reset=earliest
security.protocol=SSL

The configuration contains the following details:

  • hoodie.deltastreamer.schemaprovider.source.schema.file – The schema of the source record
  • hoodie.deltastreamer.schemaprovider.target.schema.file – The schema for the target record.
  • hoodie.deltastreamer.source.kafka.topic – The source MSK topic name
  • bootstap.servers – The Amazon MSK bootstrap server’s private endpoint
  • auto.offset.reset – The consumer’s behavior when there is no committed position or when an offset is out of range

Hudi configuration

The following are some of the important Hudi configuration options, which enable us to achieve in-place updates for the generated schema:

  • hoodie.datasource.write.recordkey.field – The record key field. This is the unique identifier of a record in Hudi.
  • hoodie.datasource.write.precombine.field – When two records have the same record key value, Apache Hudi picks the one with the largest value for the pre-combined field.
  • hoodie.datasource.write.operation – The operation on the Hudi dataset. Possible values include UPSERT, INSERT, and BULK_INSERT.

AWS Glue Data Catalog table

The AWS Glue job creates a Hudi table in the Data Catalog mapped to the Hudi dataset on Amazon S3. Because the hoodie.datasource.hive_sync.table configuration parameter is set to product_table, the table is visible under the default database in the Data Catalog.

The following screenshot shows the Hudi table column names in the Data Catalog.

Query the data using Athena

With the Hudi datasets available in Amazon S3, you can query the data using Athena. Let’s use the following query:

SELECT * FROM "default"."product_table";

The following screenshot shows the query output. The table product_table has four records from the initial ingestion: two records for the category Apparel, one for Cosmetics, and one for Footwear.

Load incremental data into the Kafka topic

Now suppose that the store sold some quantity of apparel and footwear and added a new product to its inventory, as shown in the following code. The store sold two items of product ID 1 (Apparel) and one item of product ID 3 (Footwear). The store also added the Cosmetics category, with product ID 5.

{"id": 1, "category": "Apparel", "ts": "2022-01-02 10:45:00", "name": "ABC shirt", "quantity": 2}
{"id": 3, "category": "Footwear", "ts": "2022-01-02 10:50:00", "name": "DEF shoes", "quantity": 5}
{"id": 5, "category": "Cosmetics", "ts": "2022-01-02 10:55:00", "name": "JKL Lip gloss", "quantity": 7}

Let’s ingest the incremental data from the deltastreamer_incr_load.json file to the Kafka topic and query the data from Athena:

./kafka_2.12-2.6.2/bin/kafka-console-producer.sh \
--broker-list "<replace text with value under private endpoint on MSK>" \
--topic hudi-deltastream-demo \
--producer.config ./config_file.txt < deltastreamer_incr_load.json

Within a few seconds, you should see a new Parquet file created in the target S3 bucket under the product_table prefix. The following is the screenshot from Athena after the incremental data ingestion showing the latest updates.

Additional considerations

There are some hard-coded Hudi options in the AWS Glue Streaming job scripts. These options are set for the sample table that we created for this post, so update the options based on your workload.

Clean up

To avoid any incurring future charges, delete the CloudFormation stack, which deletes all the underlying resources created by this post, except for the product_table table created in the default database. Manually delete the product_table table under the default database from the Data Catalog.

Conclusion

In this post, we illustrated how you can add the Apache Hudi Connector for AWS Glue and perform streaming ingestion into an S3 data lake using Apache Hudi DeltaStreamer with AWS Glue. You can use the Apache Hudi Connector for AWS Glue to create a serverless streaming pipeline using AWS Glue streaming jobs with the DeltaStreamer utility to ingest data from Kafka. We demonstrated this by reading the latest updated data using Athena in near-real time.

As always, AWS welcomes feedback. If you have any comments or questions on this post, please share them in the comments.


About the authors

Vishal Pathak is a Data Lab Solutions Architect at AWS. Vishal works with customers on their use cases, architects solutions to solve their business problems, and helps them build scalable prototypes. Prior to his journey in AWS, Vishal helped customers implement business intelligence, data warehouse, and data lake projects in the US and Australia.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys learning different use cases from customers and sharing knowledge about big data technologies with the wider community.

Anand Prakash is a Senior Solutions Architect at AWS Data Lab. Anand focuses on helping customers design and build AI/ML, data analytics, and database solutions to accelerate their path to production.

Using AWS Shield Advanced protection groups to improve DDoS detection and mitigation

Post Syndicated from Joe Viggiano original https://aws.amazon.com/blogs/security/using-aws-shield-advanced-protection-groups-to-improve-ddos-detection-and-mitigation/

Amazon Web Services (AWS) customers can use AWS Shield Advanced to detect and mitigate distributed denial of service (DDoS) attacks that target their applications running on Amazon Elastic Compute Cloud (Amazon EC2), Elastic Local Balancing (ELB), Amazon CloudFront, AWS Global Accelerator, and Amazon Route 53. By using protection groups for Shield Advanced, you can logically group your collections of Shield Advanced protected resources. In this blog post, you will learn how you can use protection groups to customize the scope of DDoS detection for application layer events, and accelerate mitigation for infrastructure layer events.

What is a protection group?

A protection group is a resource that you create by grouping your Shield Advanced protected resources, so that the service considers them to be a single protected entity. A protection group can contain many different resources that compose your application, and the resources may be part of multiple protection groups spanning different AWS Regions within an AWS account. Common patterns that you might use when designing protection groups include aligning resources to applications, application teams, or environments (such as production and staging), and by product tiers (such as free or paid). For more information about setting up protection groups, see Managing AWS Shield Advanced protection groups.

Why should you consider using a protection group?

The benefits of protection groups differ for infrastructure layer (layer 3 and layer 4) events and application layer (layer 7) events. For layer 3 and layer 4 events, protection groups can reduce the time it takes for Shield Advanced to begin mitigations. For layer 7 events, protection groups add an additional reporting mechanism. There is no change in the mechanism that Shield Advanced uses internally for detection of an event, and you do not lose the functionality of individual resource-level detections. You receive both group-level and individual resource-level Amazon CloudWatch metrics to consume for operational use. Let’s look at the benefits for each layer in more detail.

Layers 3 and 4: Accelerate time to mitigate for DDoS events

For infrastructure layer (layer 3 and layer 4) events, Shield Advanced monitors the traffic volume to your protected resource. An abnormal traffic deviation signals the possibility of a DDoS attack, and Shield Advanced then puts mitigations in place. By default, Shield Advanced observes the elevation of traffic to a resource over multiple consecutive time intervals to establish confidence that a layer 3/layer 4 event is under way. In the absence of a protection group, Shield Advanced follows the default behavior of waiting to establish confidence before it puts mitigation in place for each resource. However, if the resources are part of a protection group, and if the service detects that one resource in a group is targeted, Shield Advanced uses that confidence for other resources in the group. This can accelerate the process of putting mitigations in place for those resources.

Consider a case where you have an application deployed in different AWS Regions, and each stack is fronted with a Network Load Balancer (NLB). When you enable Shield Advanced on the Elastic IP addresses associated with the NLB in each Region, you can optionally add those Elastic IP addresses to a protection group. If an actor targets one of the NLBs in the protection group and a DDoS attack is detected, Shield Advanced will lower the threshold for implementing mitigations on the other NLBs associated with the protection group. If the scope of the attack shifts to target the other NLBs, Shield Advanced can potentially mitigate the attack faster than if the NLB was not in the protection group.

Note: This benefit applies only to Elastic IP addresses and Global Accelerator resource types.

Layer 7: Reduce false positives and improve accuracy of detection for DDoS events

Shield Advanced detects application layer (layer 7) events when you associate a web access control list (web ACL) in AWS WAF with it. Shield Advanced consumes request data for the associated web ACL, analyzes it, and builds a traffic baseline for your application. The service then uses this baseline to detect anomalies in traffic patterns that might indicate a DDoS attack.

When you group resources in a protection group, Shield Advanced aggregates the data from individual resources and creates the baseline for the whole group. It then uses this aggregated baseline to detect layer 7 events for the group resource. It also continues to monitor and report for the resources individually, regardless of whether they are part of protection groups or not.

Shield Advanced provides three types of aggregation to choose from (sum, mean, and max) to aggregate the volume data of individual resources to use as a baseline for the whole group. We’ll look at the three types of aggregation, with a use case for each, in the next section.

Note: Traffic aggregation is applicable only for layer 7 detection.

Case 1: Blue/green deployments

Blue/green is a popular deployment strategy that increases application availability and reduces deployment risk when rolling out changes. The blue environment runs the current application version, and the green environment runs the new application version. When testing is complete, live application traffic is directed to the green environment, and the blue environment is dismantled.

During blue/green deployments, the traffic to your green resources can go from zero load to full load in a short period of time. Shield Advanced layer 7 detection uses traffic baselining for individual resources, so newly created resources like an Application Load Balancer (ALB) that are part of a blue/green operation would have no baseline, and the rapid increase in traffic could cause Shield Advanced to declare a DDoS event. In this scenario, the DDoS event could be a false positive.

Figure 1: A blue/green deployment with ALBs in a protection group. Shield is using the sum of total traffic to the group to baseline layer 7 traffic for the group as a single unit

Figure 1: A blue/green deployment with ALBs in a protection group. Shield is using the sum of total traffic to the group to baseline layer 7 traffic for the group as a single unit

In the example architecture shown in Figure 1, we have configured Shield to include all resources of type ALB in a single protection group with aggregation type sum. Shield Advanced will use the sum of traffic to all resources in the protection group as an additional baseline. We have only one ALB (called blue) to begin with. When you add the green ALB as part of your deployment, you can optionally add it to the protection group. As traffic shifts from blue to green, the total traffic to the protection group remains the same even though the volume of traffic changes for the individual resources that make up the group. After the blue ALB is deleted, the Shield Advanced baseline for that ALB is deleted with it. At this point, the green ALB hasn’t existed for sufficient time to have its own accurate baseline, but the protection group baseline persists. You could still receive a DDoSDetected CloudWatch metric with a value of 1 for individual resources, but with a protection group you have the flexibility to set one or more alarms based on the group-level DDoSDetected metric. Depending on your application’s use case, this can reduce non-actionable event notifications.

Note: You might already have alarms set for individual resources, because the onboarding wizard in Shield Advanced provides you an option to create alarms when you add protection to a resource. So, you should review the alarms you already have configured before you create a protection group. Simply adding a resource to a protection group will not reduce false positives.

Case 2: Resources that have traffic patterns similar to each other

Client applications might interact with multiple services as part of a single transaction or workflow. These services can be behind their own dedicated ALBs or CloudFront distributions and can have traffic patterns similar to each other. In the example architecture shown in Figure 2, we have two services that are always called to satisfy a user request. Consider a case where you add a new service to the mix. Before protection groups existed, setting up such a new protected resource, such as ALB or CloudFront, required Shield Advanced to build a brand-new baseline. You had to wait for a certain minimum period before Shield Advanced could start monitoring the resource, and the service would need to monitor traffic for a few days in order to be accurate.

Figure 2: Deploying a new service and including it in a protection group with an existing baseline. Shield is using the mean aggregation type to baseline traffic for the group.

Figure 2: Deploying a new service and including it in a protection group with an existing baseline. Shield is using the mean aggregation type to baseline traffic for the group.

For improved accuracy of detection of level 7 events, you can cause Shield Advanced to inherit the baseline of existing services that are part of the same transaction or workflow. To do so, you can put your new resource in a protection group along with an existing service or services, and set the aggregation type to mean. Shield Advanced will take some time to build up an accurate baseline for the new service. However, the protection group has an established baseline, so the new service won’t be susceptible to decreased accuracy of detection for that period of time. Note that this setting will not stop Shield Advanced from sending notifications for the new service individually; however, you might prefer to take corrective action based on the detection for the group instead.

Case 3: Resources that share traffic in a non-uniform way

Consider the case of a CloudFront distribution with an ALB as origin. If the content is cached in CloudFront edge locations, the traffic reaching the application will be lower than that received by the edge locations. Similarly, if there are multiple origins of a CloudFront distribution, the traffic volumes of individual origins will not reflect the aggregate traffic for the application. Scenarios like invalidation of cache or an origin failover can result in increased traffic at one of the ALB origins. This could cause Shield Advanced to send “1” as the value for the DDoSDetected CloudWatch metric for that ALB. However, you might not want to initiate an alarm or take corrective action in this case.

Figure 3: CloudFront and ALBs in a protection group with aggregation type max. Shield is using CloudFront’s baseline for the group

Figure 3: CloudFront and ALBs in a protection group with aggregation type max. Shield is using CloudFront’s baseline for the group

You can combine the CloudFront distribution and origin (or origins) in a protection group with the aggregation type set to max. Shield Advanced will consider the CloudFront distribution’s traffic volume as the baseline for the protection group as a whole. In the example architecture in Figure 3, a CloudFront distribution fronts two ALBs and balances the load between the two. We have bundled all three resources (CloudFront and two ALBs) into a protection group. In case one ALB fails, the other ALB will receive all the traffic. This way, although you might receive an event notification for the active ALB at the individual resource level if Shield detects a volumetric event, you might not receive it for the protection group because Shield Advanced will use CloudFront traffic as the baseline for determining the increase in volume. You can set one or more alarms and take corrective action according to your application’s use case.

Conclusion

In this blog post, we showed you how AWS Shield Advanced provides you with the capability to group resources in order to consider them a single logical entity for DDoS detection and mitigation. This can help reduce the number of false positives and accelerate the time to mitigation for your protected applications.

A Shield Advanced subscription provides additional capabilities, beyond those discussed in this post, that supplement your perimeter protection. It provides integration with AWS WAF for level 7 DDoS detection, health-based detection for reducing false positives, enhanced visibility into DDoS events, assistance from the Shield Response team, custom mitigations, and cost-protection safeguards. You can learn more about Shield Advanced capabilities in the AWS Shield Advanced User Guide.

 
If you have feedback about this blog post, submit comments in the Comments section below. You can also start a new thread on AWS Shield re:Post to get answers from the community.

Want more AWS Security news? Follow us on Twitter.

Joe Viggiano

Joe Viggiano

Joe is a Sr. Solutions Architect helping media and entertainment companies accelerate their adoptions of cloud-based solutions.

Deepak Garg

Deepak Garg

Deepak is a Solutions Architect at AWS. He loves diving deep into AWS services and sharing his knowledge with customers. Deepak has background in Content Delivery Networks and Telecommunications.

How to centralize findings and automate deletion for unused IAM roles

Post Syndicated from Hong Pham original https://aws.amazon.com/blogs/security/how-to-centralize-findings-and-automate-deletion-for-unused-iam-roles/

Maintaining AWS Identity and Access Management (IAM) resources is similar to keeping your garden healthy over time. Having visibility into your IAM resources, especially the resources that are no longer used, is important to keep your AWS environment secure. Proactively detecting and responding to unused IAM roles helps you prevent unauthorized entities from gaining access to your AWS resources. In this post, I will show you how to apply resource tags on IAM roles and deploy serverless technologies on AWS to detect unused IAM roles and to require the owner of the IAM role (identified through tags) to take action.

You can use this solution to check for unused IAM roles in a standalone AWS account. As you grow your workloads in the cloud, you can run this solution for multiple AWS accounts by using AWS Organizations. In this solution, you use AWS Control Tower to create an AWS Organizations organization with a Security organizational unit (OU), and a Security account in this OU. In this blog post, you deploy the solution in the Security account belonging to a Security OU of an organization.

For more information and recommended best practices, see the blog post Managing the multi-account environment using AWS Organizations and AWS Control Tower. Following this best practice, you can create a Security OU, in which you provision one or more Security and Audit accounts that are dedicated for security automation and audit activities on behalf of the entire organization.

Solution architecture

The architecture diagram in Figure 1 demonstrates the solution workflow.

Figure 1: Solution workflow for standalone account or member account of an AWS Organization.

Figure 1: Solution workflow for standalone account or member account of an AWS Organization.

The solution is triggered periodically by an Amazon EventBridge scheduled rule and invokes a series of actions. You specify the frequency (in number of days) when you create the EventBridge rule. There are two options to run this solution, based on the needs of your organization.

Option 1: For a standalone account

Choose this option if you would like to check for unused IAM roles in a single AWS account. This AWS account might or might not belong to an organization or OU. In this blog post, I refer to this account as the standalone account.

Prerequisites

  1. You need an AWS account specifically for security automation. For this blog post, I refer to this account as the standalone Security account.
  2. You should deploy the solution to the standalone Security account, which has appropriate admin permission to audit other accounts and manage security automation.
  3. Because this solution uses AWS CloudFormation StackSets, you need to grant self-managed permissions to create stack sets in standalone accounts. Specifically, you need to establish a trust relationship between the standalone Security account and the standalone account by creating the AWSCloudFormationStackSetAdministrationRole IAM role in the standalone Security account, and the AWSCloudFormationStackSetExecutionRole IAM role in the standalone account.
  4. You need to have AWS Security Hub enabled in your standalone Security account, and you need to deploy the solution in the same AWS Region as your Security Hub dashboard.
  5. You need a tagging enforcement in place for IAM roles. This solution uses an IAM tag key Owner to identify the email address of the owner. The value of this tag key should be the email address associated with the owner of the IAM role. If the Owner tag isn’t available, the notification email is sent to the email address that you provided in the parameter ITSecurityEmail when you provisioned the CloudFormation stack.
  6. This solution uses Amazon Simple Email Service (Amazon SES) to send emails to the owner of the IAM roles. The destination address needs to be verified with Amazon SES. With Amazon SES, you can verify identity at the individual email address or at the domain level.

An EventBridge rule triggers the AWS Lambda function LambdaCheckIAMRole in the standalone Security account. The LambdaCheckIAMRolefunction assumes a role in the standalone account. This role is named after the Cloudformation stack name that you specify when you provision the solution. Then LambdaCheckIAMRole calls the IAM API action GetAccountAuthorizationDetails to get the list of IAM roles in the standalone account, and parses the data type RoleLastUsed to retrieve the date, time, and the Region in which the roles were last used. If the last time value is not available, the IAM role is skipped. Based on the CloudFormation parameter MaxDaysForLastUsed that you provide, LambdaCheckIAMRole determines if the last time used is greater than the MaxDaysForLastUsed value. LambdaCheckIAMRole also extracts tags associated with the IAM roles, and retrieves the email address of the IAM role owner from the value of the tag key Owner. If there is no Owner tag, then LambdaCheckIAMRole sends an email to a default email address provided by you from the CloudFormation parameter ITSecurityEmail.

Option 2: For all member accounts that belong to an organization or an OU

Choose this option if you want to check for unused IAM roles in every member account that belongs to an AWS Organizations organization or OU.

Prerequisites

  1. You need to have an AWS Organizations organization with a dedicated Security account that belongs to a Security OU. For this blog post, I refer to this account as the Security account.
  2. You should deploy the solution to the Security account that has appropriate admin permission to audit other accounts and to manage security automation.
  3. Because this solution uses CloudFormation StackSets to create stack sets in member accounts of the organization or OU that you specify, the Security account in the Security OU needs to be granted CloudFormation delegated admin permission to create AWS resources in this solution.
  4. You need Security Hub enabled in your Security account, and you need to deploy the solution in the same Region as your Security Hub dashboard.
  5. You need tagging enforcement in place for IAM roles. This solution uses the IAM tag key Owner to identify the owner email address. The value of this tag key should be the email address associated with the owner of the IAM role. If the Owner tag isn’t available, the notification email will be sent to the email address that you provided in the parameter ITSecurityEmail when you provisioned the CloudFormation stack.
  6. This solution uses Amazon SES to send emails to the owner of the IAM roles. The destination address needs to be verified with Amazon SES. With Amazon SES, you can verify identity at the individual email address or at the domain level.

An EventBridge rule triggers the Lambda function LambdaGetAccounts in the Security account to collect the account IDs of member accounts that belong to the organization or OU. LambdaGetAccounts sends those account IDs to an SNS topic. Each account ID invokes the Lambda function LambdaCheckIAMRole once.

Similar to the process for Option 1, LambdaCheckIAMRole in the Security account assumes a role in the member account(s) of the organization or OU, and checks the last time that IAM roles in the account were used.

In both options, if an IAM role is not currently used, the function LambdaCheckIAMRole generates a Security Hub finding, and performs BatchImportFindings for all findings to Security Hub in the Security account. At the same time, the Lambda function starts an AWS Step Functions state machine execution. Each execution is for an unused IAM role following this naming convention:
[target-account-id]-[unused IAM role name]-[time the execution created in Unix format]

You should avoid running this solution against special IAM roles, such as a break-glass role or a disaster recovery role. In the CloudFormation parameter RolePatternAllowedlist, you can provide a list of role name patterns to skip the check.

Use a Step Functions state machine to process approval

Figure 2 shows the state machine workflow for owner approval.

Figure 2: Owner approval state machine workflow

Figure 2: Owner approval state machine workflow

After the solution identifies an unused IAM role, it creates a Step Functions state machine execution. Figure 2 demonstrates the workflow of the execution. After the execution starts, the first Lambda task NotifyOwner (powered by the Lambda function NotifyOwnerFunction) sends an email to notify the IAM role owner. This is a callback task that pauses the execution until a taskToken is returned. The maximum pause for a callback task is 1 year. The execution waits until the owner responds with a decision to delete or keep the role, which is captured by a private API endpoint in Amazon API Gateway. You can configure a timeout to avoid waiting for callback task execution.

With a private API endpoint, you can build a REST API that is only accessible within your Amazon Virtual Private Cloud (Amazon VPC), or within your internal network connected to your VPC. Using a private API endpoint will prevent anyone from outside of your internal network from selecting this link and deleting the role. You can implement authentication and authorization with API Gateway to make sure that only the appropriate owner can delete a role.

If the owner denies role deletion, then the role remains intact until the next automation cycle runs, and the state machine execution stops immediately with a Fail status. If the owner approves role deletion, the next Lambda task Approve (powered by the function ApproveFunction) checks again if the role is not currently used. If the role isn’t in use, the Lambda task Approve attaches an IAM policy DenyAllCheckUnusedIAMRoleSolution to deny the role to perform any actions, and waits for 30 days. During this wait time, you can restore the IAM role by removing the IAM policy DenyAllCheckUnusedIAMRoleSolution from the role. The Step Functions state machine execution for this role is still in progress until the wait time expires.

After the wait time expires, the state machine execution invokes the Validate task. The Lambda function ValidateFunction checks again if the role is not in use after the amount of time calculated by adding MaxDaysForLastUsed and the preceding wait time. It also checks if the IAM policy DenyAllCheckUnusedIAMRoleSolution is attached to the role. If both of these conditions are true, the Lambda function follows a process to detach the IAM policies and delete the role permanently. The role can’t be recovered after deletion.

Note: To restore a role that has been marked for deletion, detach the DenyAll IAM policy from the role.

To deploy the solution using the AWS CLI

  1. Clone git repo from AWS Samples to get source code and CloudFormation templates.
    git clone https://github.com/aws-samples/aws-blog-automate-iam-role-deletion 
    cd /aws-blog-automate-iam-role-deletion

  2. Run the AWS CLI command below to upload CloudFormation templates and Lambda code to a S3 bucket in the Security Account. The S3 bucket needs to be in the same Region where you will deploy the solution.
    • To deploy the solution for a single account, use the following commands. Be sure to replace <YOUR_BUCKET_NAME> and <PATH_TO_UPLOAD_CODE> with your own values.
      #Deploy solution for a single target AWS Account
      aws cloudformation package \
      --template-file solution_scope_account.yml \
      --s3-bucket <YOUR_BUCKET_NAME> \
      --s3-prefix <PATH_TO_UPLOAD_CODE> \
      --output-template-file solution_scope_account.template

    • To deploy the solution for an organization or OU, use the following commands. Be sure to replace <YOUR_BUCKET_NAME> and <PATH_TO_UPLOAD_CODE> with your own values.
      #Deploy solution for an Organization/OU
      aws cloudformation package \
      --template-file solution_scope_organization.yml \
      --s3-bucket <YOUR_BUCKET_NAME> \
      --s3-prefix <PATH_TO_UPLOAD_CODE> \
      --output-template-file solution_scope_organization.template

  3. Validate the template generated by the CloudFormation package.
    • To validate the solution for a single account, use the following commands.
      #Deploy solution for a single target AWS Account
      aws cloudformation validate-template —template-body file://solution_scope_account.template

    • To validate the solution for an organization or OU, use the following commands.
      #Deploy solution for an Organization/OU
      aws cloudformation validate-template —template-body file://solution_scope_organization.template

  4. Deploy the solution in the same Region that you use for Security Hub. The stack takes 30 minutes to complete deployment.
    • To deploy the solution for a single account, use the following commands. Be sure to replace all of the placeholders with your own values.
      #Deploy solution for a single target AWS Account
      aws cloudformation deploy \
      --template-file solution_scope_account.template \
      --stack-name <UNIQUE_STACK_NAME> \
      --region <REGION> \
      --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
      --parameter-overrides AccountId='<STANDALONE ACCOUNT ID>' \
      Frequency=<DAYS> MaxDaysForLastUsed=<DAYS> \
      ITSecurityEmail='<YOUR IT TEAM EMAIL>' \
      RolePatternAllowedlist='<ALLOWED PATTERN>'

    • To deploy the solution for an organization, run the following commands to create CloudFormation stack in the Security Account of the organization.
      #Deploy solution for an Organization
      aws cloudformation deploy \
      --template-file solution_scope_organization.template \
      --stack-name <UNIQUE_STACK_NAME> \
      --region <REGION> \
      --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
      --parameter-overrides Scope=Organization \
      OrganizationId='<o-12345abcde>' \
      OrgRootId='<r-1234>'  \
      Frequency=<DAYS> MaxDaysForLastUsed=<DAYS> \
      ITSecurityEmail='<[email protected]>' \
      RolePatternAllowedlist='<ALLOWED PATTERN>'

    • To deploy the solution for an OU, run the following commands to create CloudFormation stack in the Security Account of the organization.
      #Deploy solution for an OU
      aws cloudformation deploy \
      --template-file solution_scope_organization.template \
      --stack-name <UNIQUE_STACK_NAME> \
      --region <REGION> \
      --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
      --parameter-overrides Scope=OrganizationalUnit \
      OrganizationId='<o-12345abcde>' \
      OrganizationalUnitId='<ou-1234-1234abcd>'  \
      Frequency=<DAYS> MaxDaysForLastUsed=<DAYS> \
      ITSecurityEmail=’<[email protected]>’ \
      RolePatternAllowedlist=’<ALLOWED PATTERN>

Test the solution

The solution is triggered by an EventBridge scheduled rule, so it doesn’t perform the checks immediately. To test the solution right away after the CloudFormation stacks are successfully created, follow these steps.

To manually trigger the automation for a single account

  1. Navigate to the AWS Lambda console and choose the function
    <CloudFormation stackname>-LambdaCheckIAMRole.
  2. Choose Test.
  3. Choose New event.
  4. For Name, enter a name for the event, and provide the current time in UTC Date Time format YYYY-MM-DDTHH:MM:SSZ. For example {“time”: “2022-01-22T04:36:52Z”}. The Lambda function uses this value to calculate how much time has passed since the last time that a role was used. Figure 5 shows an example of configuring a test event.
    Figure 5: Configure test event for standalone account

    Figure 5: Configure test event for standalone account

  5. Choose Test.

To manually trigger the automation for an organization or OU

  1. Choose the function
    [CloudFormation stackname]-LambdaGetAccounts.
  2. Choose Test.
  3. Choose New event.
  4. For Name, enter a name for the event. Leave the default values for the remaining fields.
  5. Choose Test.

Respond to unused IAM roles

After you’ve triggered the Lambda function, the automation runs the necessary checks. For each unused IAM role, it creates a Step Functions state machine execution.

To see the list of Step Functions state machine executions

  1. Navigate to the AWS Step Functions console.
  2. Choose state machine [CloudFormation stackname]OnwerApprovalStateMachine.
  3. Under the Executions tab, you will see the list of executions in running state following this naming convention: [target-account-id]-[unused IAM role name]-[time the execution created in Unix format]. Figure 6 shows an example list of executions.
    Figure 6: Each unused IAM role generates an execution in the Step Functions state machine

    Figure 6: Each unused IAM role generates an execution in the Step Functions state machine

Each execution sends out an email notification to the IAM role owner (if available through the Owner tag) or to the IT security email address that you provided in the CloudFormation stack parameter ITSecurityEmail. The email content is:

Subject: Please take action on this unused IAM Role
 
Hello!
 
This IAM Role arn:aws:iam::<AWS account>:role/<role name> is not in use for
more than 60 days.
 
Can you please delete the role by following this link: Approve link
 
Or keep this role by following this link: Deny Link

In the email, the Approve link and Deny link is the hyperlink to a private API endpoint with a parameter taskToken. If you try to access these links publicly, they won’t work. When you access the link, the taskToken is provided to the private API endpoint, which updates the Step Functions state machine.

To test the approval action using an API Gateway test

  1. Navigate to the AWS Step Functions console. Under State machines, choose the state machine that has the name [CloudFormation stackname]OwnerApprovalStateMachine
  2. On the Executions tab, there is a list of executions. Each execution represents a workflow for one IAM role, as shown in Figure 6. Choose the execution name that includes the IAM role name in the email that you received earlier.
  3. Scroll down to Execution event history.
  4. Expand the Step Notify Owner, enter TaskScheduled, find the item taskToken, and copy its value to a notepad, as shown in Figure 7.
    Figure 7: Retrieve taskToken from execution

    Figure 7: Retrieve taskToken from execution

  5. Navigate to the API Gateway console.
  6. Choose the API that has a name similar to [CloudFormation stackname]-PrivateAPIGW-[unique string]-ApprovalEndpoint.
  7. Choose which action to test: Deny or Approve.
    • To test the Deny action, under /deny resource, choose the GET method.
    • To test the Approve action, under /approve resource, choose the GET method.
  8. Choose Test.
  9. Under Query Strings, enter taskToken= and paste the taskToken you copied earlier from the state machine execution. Figure 8 shows how to pass the taskToken to API Gateway.
    Figure 8: Provide taskToken to API Gateway Method

    Figure 8: Provide taskToken to API Gateway Method

  10. Choose Test. After you test, the state machine resumes the workflow and finishes the automation. You won’t be able to change the action.
  11. Navigate to the AWS Step Functions console. Choose the state machine and go to the state machine execution.
    1. If you choose to deny the role deletion, the execution immediately stops as Fail.
    2. If you choose to approve the role deletion, the execution moves to the Wait task. This task removes IAM policies associated to the role and waits for a period of time before moving to the next task. By default, the wait time is 30 days. To change this number, go to the Lambda function [CloudFormation stackname]ApproveFunction, and update the variable wait_time_stamp.
    3. After the waiting period expires, the state machine triggers the Validate task to do a final validation on the role before deleting it. If the Validate task decides that the role is being used, it leaves the role intact. Otherwise, it deletes the role permanently.

Conclusion

In this blog post, you learned how serverless services such as Lambda, Step Functions, and API Gateway can work together to build security automation. We recommend testing this solution as a starting point. Then, you can build more features on top of the sample code and templates to customize it to perform checks, following guidance from your IT security team.

Here are a few suggestions that you can take to extend this solution.

  • This solution uses a private API Gateway to handle the approval response from the IAM role owner. You need to establish private connectivity between your internal network and AWS to invoke a private API Gateway. For instructions, see How to invoke a private API.
  • Add a mechanism to control access to API Gateway by using endpoint policies for interface VPC endpoints.
  • Archive the Security Hub finding after the IAM role is deleted using the AWS CLI or AWS Console.
  • Use a Step Functions state machine for other automation that needs human approval.
  • Add the capability to report on IAM roles that were skipped due to the absence of RoleLastUsed information.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Hong Pham

Hong Pham

Hong is a Senior Solutions Architect at AWS. For more than five years, she has helped many customers from start-ups to enterprises in different industries to adopt Cloud Computing. She was born in Vietnam and currently lives in Seattle, Washington.

Deploy and manage OpenAPI/Swagger RESTful APIs with the AWS Cloud Development Kit

Post Syndicated from Luke Popplewell original https://aws.amazon.com/blogs/devops/deploy-and-manage-openapi-swagger-restful-apis-with-the-aws-cloud-development-kit/

This post demonstrates how AWS Cloud Development Kit (AWS CDK) Infrastructure as Code (IaC) constructs and AWS serverless technology can be used to build and deploy a RESTful Application Programming Interface (API) defined in the OpenAPI specification. This post uses an example API that describes  Widget resources and demonstrates how to use an AWS CDK Pipeline to:

  • Deploy a RESTful API stage to Amazon API Gateway from an OpenAPI specification.
  • Build and deploy an AWS Lambda function that contains the API functionality.
  • Auto-generate API documentation and publish it to an Amazon Simple Storage Service (Amazon S3)-hosted website served by the Amazon CloudFront content delivery network (CDN) service. This provides technical and non-technical stakeholders with versioned, current, and accessible API documentation.
  • Auto-generate client libraries for invoking the API and deploy them to AWS CodeArtifact, which is a fully-managed artifact repository service. This allows API client development teams to integrate with different versions of the API in different environments.

The diagram shown in the following figure depicts the architecture of the AWS services and resources described in this post.

 The architecture described in this post consists of an AWS CodePipeline pipeline, provisioned using the AWS CDK, that deploys the Widget API to AWS Lambda and API Gateway. The pipeline then auto-generates the API’s documentation as a website served by CloudFront and deployed to S3. Finally, the pipeline auto-generates a client library for the API and deploys this to CodeArtifact.

Figure 1 – Architecture

The code that accompanies this post, written in Java, is available here.

Background

APIs must be understood by all stakeholders and parties within an enterprise including business areas, management, enterprise architecture, and other teams wishing to consume the API. Unfortunately, API definitions are often hidden in code and lack up-to-date documentation. Therefore, they remain inaccessible for the majority of the API’s stakeholders. Furthermore, it’s often challenging to determine what version of an API is present in different environments at any one time.

This post describes some solutions to these issues by demonstrating how to continuously deliver up-to-date and accessible API documentation, API client libraries, and API deployments.

AWS CDK

The AWS CDK is a software development framework for defining cloud IaC and is available in multiple languages including TypeScript, JavaScript, Python, Java, C#/.Net, and Go. The AWS CDK Developer Guide provides best practices for using the CDK.

This post uses the CDK to define IaC in Java which is synthesized to a cloud assembly. The cloud assembly includes one to many templates and assets that are deployed via an AWS CodePipeline pipeline. A unit of deployment in the CDK is called a Stack.

OpenAPI specification (formerly Swagger specification)

OpenAPI specifications describe the capabilities of an API and are both human and machine-readable. They consist of definitions of API components which include resources, endpoints, operation parameters, authentication methods, and contact information.

Project composition

The API project that accompanies this post consists of three directories:

  • app
  • api
  • cdk

app directory

This directory contains the code for the Lambda function which is invoked when the Widget API is invoked via API Gateway. The code has been developed in Java as an Apache Maven project.

The Quarkus framework has been used to define a WidgetResource class (see src/main/java/aws/sample/blog/cdkopenapi/app/WidgetResources.java ) that contains the methods that align with HTTP Methods of the Widget API.
api directory

The api directory contains the OpenAPI specification file ( openapi.yaml ). This file is used as the source for:

  • Defining the REST API using API Gateway’s support for OpenApi.
  • Auto-generating the API documentation.
  • Auto-generating the API client artifact.

The api directory also contains the following files:

  • openapi-generator-config.yaml : This file contains configuration settings for the OpenAPI Generator framework, which is described in the section CI/CD Pipeline.
  • maven-settings.xml: This file is used support the deployment of the generated SDKs or libraries (Apache Maven artifacts) for the API and is described in the CI/CD Pipeline section of this post.

This directory contains a sub directory called docker . The docker directory contains a Dockerfile which defines the commands for building a Docker image:

FROM ruby:2.6.5-alpine
 
RUN apk update \
 && apk upgrade --no-cache \
 && apk add --no-cache --repository http://dl-cdn.alpinelinux.org/alpine/v3.14/main/ nodejs=14.20.0-r0 npm \
 && apk add git \
 && apk add --no-cache build-base
 
# Install Widdershins node packages and ruby gem bundler 
RUN npm install -g widdershins \
 && gem install bundler 
 
# working directory
WORKDIR /openapi
 
# Clone and install the Slate framework
RUN git clone https://github.com/slatedocs/slate
RUN cd slate \
 && bundle install

The Docker image incorporates two open source projects, the NodeJS Widdershins library and the Ruby Slate-framework. These are used together to auto-generate the documentation for the API from the OpenAPI specification.  This Dockerfile is referenced and built by the  ApiStack class, which is described in the CDK Stacks section of this post.

cdk directory

This directory contains an Apache Maven Project developed in Java for provisioning the CDK stacks for the  Widget API.

Under the  src/main/java  folder, the package  aws.sample.blog.cdkopenapi.cdk  contains the files and classes that define the application’s CDK stacks and also the entry point (main method) for invoking the stacks from the CDK Toolkit CLI:

  • CdkApp.java: This file contains the  CdkApp class which provides the main method that is invoked from the AWS CDK Toolkit to build and deploy the  application stacks.
  • ApiStack.java: This file contains the   ApiStack class which defines the  OpenApiBlogAPI   stack and is described in the CDK Stacks section of this post.
  • PipelineStack.java: This file contains the   PipelineStack class which defines the OpenAPIBlogPipeline  stack and is described in the CDK Stacks section of this post.
  • ApiStackStage.java: This file contains the  ApiStackStage class which defines a CDK stage. As detailed in the CI/CD Pipeline section of this post, a DEV stage, containing the  OpenApiBlogAPI stack resources for a DEV environment, is deployed from the  OpenApiBlogPipeline pipeline.

CDK stacks

ApiStack

Note that the CDK bundling functionality is used at multiple points in the  ApiStack  class to produce CDK Assets. The post, Building, bundling, and deploying applications with the AWS CDK, provides more details regarding using CDK bundling mechanisms.

The  ApiStack  class defines multiple resources including:

  • Widget API Lambda function: This is bundled by the CDK in a Docker container using the Java 11 runtime image.
  • Widget  REST API on API Gateway: The REST API is created from an Inline API Definition which is passed as an S3 CDK Asset. This asset includes a reference to the  Widget API OpenAPI specification located under the  api folder (see  api/openapi.yaml ) and builds upon the SpecRestApi construct and API Gateway’s support for OpenApi.
  • API documentation Docker Image Asset: This is the Docker image that contains the open source frameworks (Widdershins and Slate) that are leveraged to generate the API documentation.
  • CDK Asset bundling functionality that leverages the API documentation Docker image to auto-generate documentation for the API.
  • An S3 Bucket for holding the API documentation website.
  • An origin access identity (OAI) which allows CloudFront to securely serve the S3 Bucket API documentation content.
  • A CloudFront distribution which provides CDN functionality for the S3 Bucket website.

Note that the  ApiStack class features the following code which is executed on the  Widget API Lambda construct:

CfnFunction apiCfnFunction = (CfnFunction)apiLambda.getNode().getDefaultChild();
apiCfnFunction.overrideLogicalId("APILambda");

The CDK, by default, auto-assigns an ID for each defined resource but in this case the generated ID is being overridden with “APILambda”. The reason for this is that inside of the  Widget API OpenAPI specification (see  api/openapi.yaml ), there is a reference to the Lambda function by name (“APILambda”) so that the function can be integrated as a proxy for each listed API path and method combination. The OpenAPI specification includes this name as a variable to derive the Amazon Resource Name (ARN) for the Lambda function:

uri:
	Fn::Sub: "arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${APILambda.Arn}/invocations"

PipelineStack

The  PipelineStack class defines a CDK CodePipline construct which is a higher level construct and pattern. Therefore, the construct doesn’t just map directly to a single CloudFormation resource, but provisions multiple resources to fulfil the requirements of the pattern. The post, CDK Pipelines: Continuous delivery for AWS CDK applications, provides more detail on creating pipelines with the CDK.

final CodePipeline pipeline = CodePipeline.Builder.create(this, "OpenAPIBlogPipeline")
.pipelineName("OpenAPIBlogPipeline")
.selfMutation(true)
      .dockerEnabledForSynth(true)
      .synth(synthStep)
      .build();

CI/CD pipeline

The diagram in the following figure shows the multiple CodePipeline stages and actions created by the CDK CodePipeline construct that is defined in the PipelineStack class.

The CI/CD pipeline’s stages include the Source stage, the Synth stage, the Update pipeline, the Assets stage, and the DEV stage.

Figure 2 – CI/CD Pipeline

The stages defined include the following:

  • Source stage: The pipeline is passed the source code contents from this stage.
  • Synth stage: This stage consists of a Synth Action that synthesizes the CloudFormation templates for the application’s CDK stacks and compiles and builds the project Lambda API function.
  • Update Pipeline stage: This stage checks the OpenAPIBlogPipeline stack and reinitiates the pipeline when changes to its definition have been deployed.
  • Assets stage: The application’s CDK stacks produce multiple file assets (for example, zipped Lambda code) which are published to Amazon S3. Docker image assets are published to a managed CDK framework Amazon Elastic Container Registry (Amazon ECR) repository.
  • DEV stage: The API’s CDK stack ( OpenApiBlogAPI ) is deployed to a hypothetical development environment in this stage. A post stage deployment action is also defined in this stage. Through the use of a CDK ShellStep construct, a Bash script is executed that deploys a generated client Java Archive (JAR) for the Widget API to CodeArtifact. The script employs the OpenAPI Generator project for this purpose:
CodeBuildStep codeArtifactStep = CodeBuildStep.Builder.create("CodeArtifactDeploy")
    .input(pipelineSource)
    .commands(Arrays.asList(
           	"echo $REPOSITORY_DOMAIN",
           	"echo $REPOSITORY_NAME",
           	"export CODEARTIFACT_TOKEN=`aws codeartifact get-authorization-token --domain $REPOSITORY_DOMAIN --query authorizationToken --output text`",
           	"export REPOSITORY_ENDPOINT=$(aws codeartifact get-repository-endpoint --domain $REPOSITORY_DOMAIN --repository $REPOSITORY_NAME --format maven | jq .repositoryEndpoint | sed 's/\\\"//g')",
           	"echo $REPOSITORY_ENDPOINT",
           	"cd api",
           	"wget -q https://repo1.maven.org/maven2/org/openapitools/openapi-generator-cli/5.4.0/openapi-generator-cli-5.4.0.jar -O openapi-generator-cli.jar",
     	          "cp ./maven-settings.xml /root/.m2/settings.xml",
        	          "java -jar openapi-generator-cli.jar batch openapi-generator-config.yaml",
                    "cd client",
                    "mvn --no-transfer-progress deploy -DaltDeploymentRepository=openapi--prod::default::$REPOSITORY_ENDPOINT"
))
      .rolePolicyStatements(Arrays.asList(codeArtifactStatement, codeArtifactStsStatement))
.env(new HashMap<String, String>() {{
      		put("REPOSITORY_DOMAIN", codeArtifactDomainName);
            	put("REPOSITORY_NAME", codeArtifactRepositoryName);
       }})
      .build();

Running the project

To run this project, you must install the AWS CLI v2, the AWS CDK Toolkit CLI, a Java/JDK 11 runtime, Apache Maven, Docker, and a Git client. Furthermore, the AWS CLI must be configured for a user who has administrator access to an AWS Account. This is required to bootstrap the CDK in your AWS account (if not already completed) and provision the required AWS resources.

To build and run the project, perform the following steps:

  1. Fork the OpenAPI blog project in GitHub.
  2. Open the AWS Console and create a connection to GitHub. Note the connection’s ARN.
  3. In the Console, navigate to AWS CodeArtifact and create a domain and repository.  Note the names used.
  4. From the command line, clone your forked project and change into the project’s directory:
git clone https://github.com/<your-repository-path>
cd <your-repository-path>
  1. Edit the CDK JSON file at  cdk/cdk.json  and enter the details:
"RepositoryString": "<your-github-repository-path>",
"RepositoryBranch": "<your-github-repository-branch-name>",
"CodestarConnectionArn": "<connection-arn>",
"CodeArtifactDomain": "<code-artifact-domain-name>",
"CodeArtifactRepository": "<code-artifact-repository-name>"

Please note that for setting configuration values in CDK applications, it is recommend to use environment variables or AWS Systems Manager parameters.

  1. Commit and push your changes back to your GitHub repository:
git push origin main
  1. Change into the  cdk directory and bootstrap the CDK in your AWS account if you haven’t already done so (enter “Y” when prompted):
cd cdk
cdk bootstrap
  1. Deploy the CDK pipeline stack (enter “Y” when prompted):
cdk deploy OpenAPIBlogPipeline

Once the stack deployment completes successfully, the pipeline  OpenAPIBlogPipeline will start running. This will build and deploy the API and its associated resources. If you open the Console and navigate to AWS CodePipeline, then you’ll see a pipeline in progress for the API.

Once the pipeline has completed executing, navigate to AWS CloudFormation to get the output values for the  DEV-OpenAPIBlog  stack deployment:

  1. Select the  DEV-OpenAPIBlog  stack entry and then select the Outputs column. Record the REST_URL value for the key that begins with   OpenAPIBlogRestAPIEndpoint .
  2. Record the CLOUDFRONT_URL value for the key  OpenAPIBlogCloudFrontURL .

The API ping method at https://<REST_URL>/ping can now be invoked using your browser or an API development tool like Postman. Other API other methods, as defined by the OpenApi specification, are also available for invocation (For example, GET https://<REST_URL>/widgets).

To view the generated API documentation, open a browser at https://< CLOUDFRONT_URL>.

The following figure shows the API documentation website that has been auto-generated from the API’s OpenAPI specification. The documentation includes code snippets for using the API from multiple programming languages.

The API’s auto-generated documentation website provides descriptions of the API’s methods and resources as well as code snippets in multiple languages including JavaScript, Python, and Java.

Figure 3 – Auto-generated API documentation

To view the generated API client code artifact, open the Console and navigate to AWS CodeArtifact. The following figure shows the generated API client artifact that has been published to CodeArtifact.

The CodeArtifact service user interface in the Console shows the different versions of the API’s auto-generated client libraries.

Figure 4 – API client artifact in CodeArtifact

Cleaning up

  1. From the command change to the  cdk directory and remove the API stack in the DEV stage (enter “Y” when prompted):
cd cdk
cdk destroy OpenAPIBlogPipeline/DEV/OpenAPIBlogAPI
  1. Once this has completed, delete the pipeline stack:
cdk destroy OpenAPIBlogPipeline
  1. Delete the S3 bucket created to support pipeline operations. Open the Console and navigate to Amazon S3. Delete buckets with the prefix  openapiblogpipeline .

Conclusion

This post demonstrates the use of the AWS CDK to deploy a RESTful API defined by the OpenAPI/Swagger specification. Furthermore, this post describes how to use the AWS CDK to auto-generate API documentation, publish this documentation to a web site hosted on Amazon S3, auto-generate API client libraries or SDKs, and publish these artifacts to an Apache Maven repository hosted on CodeArtifact.

The solution described in this post can be improved by:

  • Building and pushing the API documentation Docker image to Amazon ECR, and then using this image in CodePipeline API pipelines.
  • Creating stages for different environments such as TEST, PREPROD, and PROD.
  • Adding integration testing actions to make sure that the API Deployment is working correctly.
  • Adding Manual approval actions for that are executed before deploying the API to PROD.
  • Using CodeBuild caching of artifacts including Docker images and libraries.

About the author:

Luke Popplewell

Luke Popplewell works primarily with federal entities in the Australian Government. In his role as an architect, Luke uses his knowledge and experience to help organisations reach their goals on the AWS cloud. Luke has a keen interest in serverless technology, modernization, DevOps and event-driven architectures.

Best practices to optimize cost and performance for AWS Glue streaming ETL jobs

Post Syndicated from Gonzalo Herreros original https://aws.amazon.com/blogs/big-data/best-practices-to-optimize-cost-and-performance-for-aws-glue-streaming-etl-jobs/

AWS Glue streaming extract, transform, and load (ETL) jobs allow you to process and enrich vast amounts of incoming data from systems such as Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka (Amazon MSK), or any other Apache Kafka cluster. It uses the Spark Structured Streaming framework to perform data processing in near-real time.

This post covers use cases where data needs to be efficiently processed, delivered, and possibly actioned in a limited amount of time. This can cover a wide range of cases, such as log processing and alarming, continuous data ingestion and enrichment, data validation, internet of things, machine learning (ML), and more.

We discuss the following topics:

  • Development tools that help you code faster using our newly launched AWS Glue Studio notebooks
  • How to monitor and tune your streaming jobs
  • Best practices for sizing and scaling your AWS Glue cluster, using our newly launched features like auto scaling and the small worker type G 0.25X

Development tools

AWS Glue Studio notebooks can speed up the development of your streaming job by allowing data engineers to work using an interactive notebook and test code changes to get quick feedback—from business logic coding to testing configuration changes—as part of tuning.

Before you run any code in the notebook (which would start the session), you need to set some important configurations.

The magic %streaming creates the session cluster using the same runtime as AWS Glue streaming jobs. This way, you interactively develop and test your code using the same runtime that you use later in the production job.

Additionally, configure Spark UI logs, which will be very useful for monitoring and tuning the job.

See the following configuration:

%streaming
%%configure
{
"--enable-spark-ui": "true",
"--spark-event-logs-path": "s3://your_bucket/sparkui/"
}

For additional configuration options such as version or number of workers, refer to Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks.

To visualize the Spark UI logs, you need a Spark history server. If you don’t have one already, refer to Launching the Spark History Server for deployment instructions.

Structured Streaming is based on streaming DataFrames, which represent micro-batches of messages.
The following code is an example of creating a stream DataFrame using Amazon Kinesis as the source:

kinesis_options = {
  "streamARN": "arn:aws:kinesis:us-east-2:777788889999:stream/fromOptionsStream",
  "startingPosition": "TRIM_HORIZON",
  "inferSchema": "true",
  "classification": "json"
}
kinesisDF = glueContext.create_data_frame_from_options(
   connection_type="kinesis",
   connection_options=kinesis_options
)

The AWS Glue API helps you create the DataFrame by doing schema detection and auto decompression, depending on the format. You can also build it yourself using the Spark API directly:

kinesisDF = spark.readStream.format("kinesis").options(**kinesis_options).load()

After your run any code cell, it triggers the startup of the session, and the application soon appears in the history server as an incomplete app (at the bottom of the page there is a link to display incomplete apps) named GlueReplApp, because it’s a session cluster. For a regular job, it’s listed with the job name given when it was created.

History server home page

From the notebook, you can take a sample of the streaming data. This can help development and give an indication of the type and size of the streaming messages, which might impact performance.

Monitor the cluster with Structured Streaming

The best way to monitor and tune your AWS Glue streaming job is using the Spark UI; it gives you the overall streaming job trends on the Structured Streaming tab and the details of each individual micro-batch processing job.

Overall view of the streaming job

On the Structured Streaming tab, you can see a summary of the streams running in the cluster, as in the following example.

Normally there is just one streaming query, representing a streaming ETL. If you start multiple in parallel, it’s good if you give it a recognizable name, calling queryName() if you use the writeStream API directly on the DataFrame.

After a good number of batches are complete (such as 10), enough for the averages to stabilize, you can use Avg Input/sec column to monitor how many events or messages the job is processing. This can be confusing because the column to the right, Avg Process/sec, is similar but often has a higher number. The difference is that this process time tells us how efficient our code is, whereas the average input tells us how many messages the cluster is reading and processing.

The important thing to note is that if the two values are similar, it means the job is working at maximum capacity. It’s making the best use of the hardware but it likely won’t be able to cope with an increase in volume without causing delays.

In the last column is the latest batch number. Because they’re numbered incrementally from zero, this tells us how many batches the query has processed so far.

When you choose the link in the “Run ID” column of a streaming query, you can review the details with graphs and histograms, as in the following example.

The first two rows correspond to the data that is used to calculate the averages shown on the summary page.

For Input Rate, each data point is calculated by dividing the number of events read for the batch by the time passed between the current batch start and the previous batch start. In a healthy system that is able to keep up, this is equal to the configured trigger interval (in the GlueContext.forEachBatch() API, this is set using the option windowSize).

Because it uses the current batch rows with the previous batch latency, this graph is often unstable in the first batches until the Batch Duration (the last line graph) stabilizes.

In this example, when it stabilizes, it gets completely flat. This means that either the influx of messages is constant or the job is hitting the limit per batch set (we discuss how to do this later in the post).

Be careful if you set a limit per batch that is constantly hit, you could be silently building a backlog, but everything could look good in the job metrics. To monitor this, have a metric of latency measuring the difference between the message timestamp when it gets created and the time it’s processed.

Process Rate is calculated by dividing the number of messages in a batch by the time it took to process that batch. For instance, if the batch contains 1,000 messages, and the trigger interval is 10 seconds but the batch only needed 5 seconds to process it, the process rate would be 1000/5 = 200 msg/sec. while the input rate for that batch (assuming the previous batch also ran within the interval) is 1000/10 = 100 msg/sec.

This metric is useful to measure how efficient our code processing the batch is, and therefore it can get higher than the input rate (this doesn’t mean it’s processing more messages, just using less time). As mentioned earlier, if both metrics get close, it means the batch duration is close to the interval and therefore additional traffic is likely to start causing batch trigger delays (because the previous batch is still running) and increase latency.

Later in this post, we show how auto scaling can help prevent this situation.

Input Rows shows the number of messages read for each batch, like input rate, but using volume instead of rate.

It’s important to note that if the batch processes the data multiple times (for example, writing to multiple destinations), the messages are counted multiple times. If the rates are greater than the expected, this could be the reason. In general, to avoid reading messages multiple times, you should cache the batch while processing it, which is the default when you use the GlueContext.forEachBatch() API.

The last two rows tell us how long it takes to process each batch and how is that time spent. It’s normal to see the first batches take much longer until the system warms up and stabilizes.
The important thing to look for is that the durations are roughly stable and well under the configured trigger interval. If that’s not the case, the next batch gets delayed and could start a compounding delay by building a backlog or increasing batch size (if the limit allows taking the extra messages pending).

In Operation Duration, the majority of time should be spent on addBatch (the mustard color), which is the actual work. The rest are fixed overhead, therefore the smaller the batch process, the more percentage of time that will take. This represents the trade-off between small batches with lower latency or bigger batches but more computing efficient.

Also, it’s normal for the first batch to spend significant time in the latestOffset (the brown bar), locating the point at which it needs to start processing when there is no checkpoint.

The following query statistics show another example.

In this case, the input has some variation (meaning it’s not hitting the batch limit). Also, the process rate is roughly the same as the input rate. This tells us the system is at max capacity and struggling to keep up. By comparing the input rows and input rate, we can guess that the interval configured is just 3 seconds and the batch duration is barely able to meet that latency.

Finally, in Operation Duration, you can observe that because the batches are so frequent, a significant amount of time (proportionally speaking) is spent saving the checkpoint (the dark green bar).

With this information, we can probably improve the stability of the job by increasing the trigger interval to 5 seconds or more. This way, it checkpoints less often and has more time to process data, which might be enough to get batch duration consistently under the interval. The trade-off is that the latency between when a message is published and when it’s processed is longer.

Monitor individual batch processing

On the Jobs tab, you can see how long each batch is taking and dig into the different steps the processing involves to understand how the time is spent. You can also check if there are tasks that succeed after retry. If this happens continuously, it can silently hurt performance.

For instance, the following screenshot shows the batches on the Jobs tab of the Spark UI of our streaming job.

Each batch is considered a job by Spark (don’t confuse the job ID with the batch number; they only match if there is no other action). The job group is the streaming query ID (this is important only when running multiple queries).

The streaming job in this example has a single stage with 100 partitions. Both batches processed them successfully, so the stage is marked as succeeded and all the tasks completed (100/100 in the progress bar).

However, there is a difference in the first batch: there were 20 task failures. You know all the failed tasks succeeded in the retries, otherwise the stage would have been marked as failed. For the stage to fail, the same task would have to fail four times (or as configured by spark.task.maxFailures).

If the stage fails, the batch fails as well and possibly the whole job; if the job was started by using GlueContext.forEachBatch(), it has a number of retries as per the batchMaxRetries parameter (three by default).

These failures are important because they have two effects:

  • They can silently cause delays in the batch processing, depending on how long it took to fail and retry.
  • They can cause records to be sent multiple times if the failure is in the last stage of the batch, depending on the type of output. If the output is files, in general it won’t cause duplicates. However, if the destination is Amazon DynamoDB, JDBC, Amazon OpenSearch Service, or another output that uses batching, it’s possible that some part of the output has already been sent. If you can’t tolerate any duplicates, the destination system should handle this (for example, being idempotent).

Choosing the description link takes you to the Stages tab for that job. Here you can dig into the failure: What is the exception? Is it always in the same executor? Does it succeed on the first retry or took multiple?

Ideally, you want to identify these failures and solve them. For example, maybe the destination system is throttling us because doesn’t have enough provisioned capacity, or a larger timeout is needed. Otherwise, you should at least monitor it and decide if it is systemic or sporadic.

Sizing and scaling

Defining how to split the data is a key element in any distributed system to run and scale efficiently. The design decisions on the messaging system will have a strong influence on how the streaming job will perform and scale, and thereby affect the job parallelism.

In the case of AWS Glue Streaming, this division of work is based on Apache Spark partitions, which define how to split the work so it can be processed in parallel. Each time the job reads a batch from the source, it divides the incoming data into Spark partitions.

For Apache Kafka, each topic partition becomes a Spark partition; similarly, for Kinesis, each stream shard becomes a Spark partition. To simplify, I’ll refer to this parallelism level as number of partitions, meaning Spark partitions that will be determined by the input Kafka partitions or Kinesis shards on a one-to-one basis.

The goal is to have enough parallelism and capacity to process each batch of data in less time than the configured batch interval and therefore be able to keep up. For instance, with a batch interval of 60 seconds, the job lets 60 seconds of data build up and then processes that data. If that work takes more than 60 seconds, the next batch waits until the previous batch is complete before starting a new batch with the data that has built up since the previous batch started.

It’s a good practice to limit the amount of data to process in a single batch, instead of just taking everything that has been added since the last one. This helps make the job more stable and predictable during peak times. It allows you to test that the job can handle volume of data without issues (for example, memory or throttling).

To do so, specify a limit when defining the source stream DataFrame:

  • For Kinesis, specify the limit using kinesis.executor.maxFetchRecordsPerShard, and revise this number if the number of shards changes substantially. You might need to increase kinesis.executor.maxFetchTimeInMs as well, in order to allow more time to read the batch and make sure it’s not truncated.
  • For Kafka, set maxOffsetsPerTrigger, which divides that allowance equally between the number of partitions.

The following is an example of setting this config for Kafka (for Kinesis, it’s equivalent but using Kinesis properties):

kafka_properties= {
  "kafka.bootstrap.servers": "bootstrapserver1:9092",
  "subscribe": "mytopic",
  "startingOffsets": "latest",
  "maxOffsetsPerTrigger": "5000000"
}
# Pass the properties as options when creating the DataFrame
spark.spark.readStream.format("kafka").options(**kafka_properties).load()

Initial benchmark

If the events can be processed individually (no interdependency such as grouping), you can get a rough estimation of how many messages a single Spark core can handle by running with a single partition source (one Kafka partition or one Kinesis shard stream) with data preloaded into it and run batches with a limit and the minimum interval (1 second). This simulates a stress test with no downtime between batches.

For these repeated tests, clear the checkpoint directory, use a different one (for example, make it dynamic using the timestamp in the path), or just disable the checkpointing (if using the Spark API directly), so you can reuse the same data.
Leave a few batches to run (at least 10) to give time for the system and the metrics to stabilize.

Start with a small limit (using the limit configuration properties explained in the previous section) and do multiple reruns, increasing the value. Record the batch duration for that limit and the throughput input rate (because it’s a stress test, the process rate should be similar).

In general, larger batches tend to be more efficient up to a point. This is because the fixed overhead taken for each to checkpoint, plan, and coordinate the nodes is more significant if the batches are smaller and therefore more frequent.

Then pick your reference initial settings based on the requirements:

  • If a goal SLA is required, use the largest batch size whose batch duration is less than half the latency SLA. This is because in the worst case, a message that is stored just after a batch is triggered has to wait at least the interval and then the processing time (which should be less than the interval). When the system is keeping up, the latency in this worst case would be close to twice the interval, so aim for the batch duration to be less than half the target latency.
  • In the case where the throughput is the priority over latency, just pick the batch size that provides a higher average process rate and define an interval that allows some buffer over the observed batch duration.

Now you have an idea of the number of messages per core our ETL can handle and the latency. These numbers are idealistic because the system won’t scale perfectly linearly when you add more partitions and nodes. You can use the messages per core obtained to divide the total number of messages per second to process and get the minimum number of Spark partitions needed (each core handles one partition in parallel).

With this number of estimated Spark cores, calculate the number of nodes needed depending on the type and version, as summarized in the following table.

AWS Glue Version Worker Type vCores Spark Cores per Worker
2 G 1X 4 8
2 G 2X 8 16
3 G 0.25X 2 2
3 G 1X 4 4
3 G 2X 8 8

Using the newer version 3 is preferable because it includes more optimizations and features like auto scaling (which we discuss later). Regarding size, unless the job has some operation that is heavy on memory, it’s preferable to use the smaller instances so there aren’t so many cores competing for memory, disk, and network shared resources.

Spark cores are equivalent to threads; therefore, you can have more (or less) than the actual cores available in the instance. This doesn’t mean that having more Spark cores is going to necessarily be faster if they’re not backed by physical cores, it just means you have more parallelism competing for the same CPU.

Sizing the cluster when you control the input message system

This is the ideal case because you can optimize the performance and the efficiency as needed.

With the benchmark information you just gathered, you can define your initial AWS Glue cluster size and configure Kafka or Kinesis with the number of partitions or topics estimated, plus some buffer. Test this baseline setup and adjust as needed until the job can comfortably meet the total volume and required latency.

For instance, if we have determined that we need 32 cores to be well within the latency requirement for the volume of data to process, then we can create an AWS Glue 3.0 cluster with 9 G.1X nodes (a driver and 8 workers with 4 cores = 32) which reads from a Kinesis data stream with 32 shards.

Imagine that the volume of data in that stream doubles and we want to keep the latency requirements. To do so, we double the number of workers (16 + 1 driver = 17) and the number of shards on the stream (now 64). Remember this is just a reference and needs to be validated; in practice you might need more or less nodes depending on the cluster size, if the destination system can keep up, complexity of transformations, or other parameters.

Sizing the cluster when you don’t control the message system configuration

In this case, your options for tuning are much more limited.

Check if a cluster with the same number of Spark cores as existing partitions (determined by the message system) is able to keep up with the expected volume of data and latency, plus some allowance for peak times.

If that’s not the case, adding more nodes alone won’t help. You need to repartition the incoming data inside AWS Glue. This operation adds an overhead to redistribute the data internally, but it’s the only way the job can scale out in this scenario.

Let’s illustrate with an example. Imagine we have a Kinesis data stream with one shard that we don’t control, and there isn’t enough volume to justify asking the owner to increase the shards. In the cluster, significant computing for each message is needed; for each message, it runs heuristics and other ML techniques to take action depending on the calculations. After running some benchmarks, the calculations can be done promptly for the expected volume of messages using 8 cores working in parallel. By default, because there is only one shard, only one core will process all the messages sequentially.

To solve this scenario, we can provision an AWS Glue 3.0 cluster with 3 G 1X nodes to have 8 worker cores available. In the code repartition, the batch distributes the messages randomly (as evenly as possible) between them:

def batch_function(data_frame, batch_id):
    # Repartition so the udf is called in parallel for each partition
    data_frame.repartition(8).foreach(process_event_udf)

glueContext.forEachBatch(frame=streaming_df, batch_function=batch_function)

If the messaging system resizes the number of partitions or shards, the job picks up this change on the next batch. You might need to adjust the cluster capacity accordingly with the new data volume.

The streaming job is able to process more partitions than Spark cores are available, but might cause inefficiencies because the additional partitions will be queued and won’t start being processed until others finish. This might result in many nodes being idle while the remaining partitions finish and the next batch can be triggered.

When the messages have processing interdependencies

If the messages to be processed depend on other messages (either in the same or previous batches), that’s likely to be a limiting factor on the scalability. In that case, it might help to analyze a batch (job in Spark UI) to see where the time is spent and if there are imbalances by checking the task duration percentiles on the Stages tab (you can also reach this page by choosing a stage on the Jobs tab).

Auto scaling

Up to now, you have seen sizing methods to handle a stable stream of data with the occasional peak.
However, for variable incoming volumes of data, this isn’t cost-effective because you need to size for the worst-case scenario or accept higher latency at peak times.

This is where AWS Glue Streaming 3.0 auto scaling comes in. You can enable it for the job and define the maximum number of workers you want to allow (for example, using the number you have determined needed for the peak times).

The runtime monitors the trend of time spent on batch processing and compares it with the configured interval. Based on that, it makes a decision to increase or decrease the number of workers as needed, being more aggressive as the batch times get near or go over the allowed interval time.

The following screenshot is an example of a streaming job with auto scaling enabled.

Splitting workloads

You have seen how to scale a single job by adding nodes and partitioning the data as needed, which is enough on most cases. As the cluster grows, there is still a single driver and the nodes have to wait for the others to complete the batch before they can take additional work. If it reaches a point that increasing the cluster size is no longer effective, you might want to consider splitting the workload between separate jobs.

In the case of Kinesis, you need to divide the data into multiple streams, but for Apache Kafka, you can divide a topic into multiple jobs by assigning partitions to each one. To do so, instead of the usual subscribe or subscribePattern where the topics are listed, use the property assign to assign using JSON a subset of the topic partitions that the job will handle (for example, {"topic1": [0,1,2]}). At the time of this writing, it’s not possible to specify a range, so you need to list all the partitions, for instance building that list dynamically in the code.

Sizing down

For low volumes of traffic, AWS Glue Streaming has a special type of small node: G 0.25X, which provides two cores and 4 GB RAM for a quarter of the cost of a DPU, so it’s very cost-effective. However, even with that frugal capacity, if you have many small streams, having a small cluster for each one is still not practical.

For such situations, there are currently a few options:

  • Configure the stream DataFrame to feed from multiple Kafka topics or Kinesis streams. Then in the DataFrame, use the columns topic and streamName, for Kafka and Kinesis sources respectively, to determine how to handle the data (for example, different transformations or destinations). Make sure the DataFrame is cached, so you don’t read the data multiple times.
  • If you have a mix of Kafka and Kinesis sources, you can define a DataFrame for each, join them, and process as needed using the columns mentioned in the previous point.
  • The preceding two cases require all the sources to have the same batch interval and links their processing (for example, a busier stream can delay a slower one). To have independent stream processing inside the same cluster, you can trigger the processing of separate stream’s DataFrames using separate threads. Each stream is monitored separately in the Spark UI, but you’re responsible for starting and managing those threads and handle errors.

Settings

In this post, we showed some config settings that impact performance. The following table summarizes the ones we discussed and other important config properties to use when creating the input stream DataFrame.

Property Applies to Remarks
maxOffsetsPerTrigger Kafka Limit of messages per batch. Divides the limit evenly among partitions.
kinesis.executor.maxFetchRecordsPerShard Kinesis Limit per each shard, therefore should be revised if the number of shards changes.
kinesis.executor.maxFetchTimeInMs Kinesis When increasing the batch size (either by increasing the batch interval or the previous property), the executor might need more time, allotted by this property.
startingOffsets Kafka Normally you want to read all the data available and therefore use earliest. However, if there is a big backlog, the system might take a long time to catch up and instead use latest to skip the history.
startingposition Kinesis Similar to startingOffsets, in this case the values to use are TRIM_HORIZON to backload and LATEST to start processing from now on.
includeHeaders Kafka Enable this flag if you need to merge and split multiple topics in the same job (see the previous section for details).
kinesis.executor.maxconnections Kinesis When writing to Kinesis, by default it uses a single connection. Increasing this might improve performance.
kinesis.client.avoidEmptyBatches Kinesis It’s best to set it to true to avoid wasting resources (for example, generating empty files) when there is no data (like the Kafka connector does). GlueContext.forEachBatch prevents empty batches by default.

Further optimizations

In general, it’s worth doing some compression on the messages to save on transfer time (at the expense of some CPU, depending on the compression type used).

If the producer compresses the messages individually, AWS Glue can detect it and decompress automatically in most cases, depending on the format and type. For more information, refer to Adding Streaming ETL Jobs in AWS Glue.

If using Kafka, you have the option to compress the topic. This way, the compression is more effective because it’s done in batches, end-to-end, and it’s transparent to the producer and consumer.

By default, the GlueContext.forEachBatch function caches the incoming data. This is helpful if the data needs to be sent to multiple sinks (for example, as Amazon S3 files and also to update a DynamoDB table) because otherwise the job would read the data multiple times from the source. But it can be detrimental to performance if the volume of data is big and there is only one output.

To disable this option, set persistDataFrame as false:

glueContext.forEachBatch(
    frame=myStreamDataFrame,
    batch_function=processBatch,
    options={
        "windowSize": "30 seconds",
        "checkpointLocation": myCheckpointPath,
        "persistDataFrame":  "false"
    }
)

In streaming jobs, it’s common to have to join streaming data with another DataFrame to do enrichment (for example, lookups). In that case, you want to avoid any shuffle if possible, because it splits stages and causes data to be moved between nodes.

When the DataFrame you’re joining to is relatively small to fit in memory, consider using a broadcast join. However, bear in mind it will be distributed to the nodes on every batch, so it might not be worth it if the batches are too small.

If you need to shuffle, consider enabling the Kryo serializer (if using custom serializable classes you need to register them first to use it).

As in any AWS Glue jobs, avoid using custom udf() if you can do the same with the provided API like Spark SQL. User-defined functions (UDFs) prevent the runtime engine from performing many optimizations (the UDF code is a black box for the engine) and in the case of Python, it forces the movement of data between processes.

Avoid generating too many small files (especially columnar like Parquet or ORC, which have overhead per file). To do so, it might be a good idea to coalesce the micro-batch DataFrame before writing the output. If you’re writing partitioned data to Amazon S3, repartition based on columns can significantly reduce the number of output files created.

Conclusion

In this post, you saw how to approach sizing and tuning an AWS Glue streaming job in different scenarios, including planning considerations, configuration, monitoring, tips, and pitfalls.

You can now use these techniques to monitor and improve your existing streaming jobs or use them when designing and building new ones.


About the author

Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.

How to secure an enterprise scale ACM Private CA hierarchy for automotive and manufacturing

Post Syndicated from Anthony Pasquariello original https://aws.amazon.com/blogs/security/how-to-secure-an-enterprise-scale-acm-private-ca-hierarchy-for-automotive-and-manufacturing/

In this post, we show how you can use the AWS Certificate Manager Private Certificate Authority (ACM Private CA) to help follow security best practices when you build a CA hierarchy. This blog post walks through certificate authority (CA) lifecycle management topics, including an architecture overview, centralized security, separation of duties, certificate issuance auditing, and certificate sharing by means of templates. These topics provide best practices surrounding your ACM Private CA hierarchy so that you can build the right CA hierarchy for your organization.

With ACM Private CA, you can create private certificate authority hierarchies, including root and subordinate CAs, without the upfront investment and ongoing maintenance costs of operating your own private CA. You can issue certificates for authenticating internal users, computers, applications, services, servers or other devices, and code signing.

This post includes the following Amazon Web Services (AWS) services:

Solution overview

In this blog post, you’ll see an example automotive manufacturing company and their supplier companies. Each will have associated AWS accounts, which we will call Manufacturer Account(s) and Supplier Account(s), respectively.

Automotive manufacturing companies usually have modules that come from different suppliers. Modules, in the automotive context, are embedded systems that control electrical systems in the vehicle. These modules might be interconnected throughout the in-vehicle network or provide connectivity external to the vehicle, for example, for navigation or sending telemetry to off-board systems.

The architecture needs to allow the Manufacturer to retain control of their CA hierarchy, while giving their external Suppliers limited access to sign the certificates on these modules with the Manufacturer’s CA hierarchy. The architecture we provide here gives you the basic information you need to cover the following objectives:

  1. Creation of accounts that logically separate CAs in a hierarchy
  2. IAM role creation for specific personas to manage the CA lifecycle
  3. Auditing the CA hierarchy by using audit reports
  4. Cross-account sharing by using AWS RAM with certificate template scoping

Architecture overview

Figure 1 shows the solution architecture.

Figure 1: Multi-account certificate authority hierarchy using ACM Private CA

Figure 1: Multi-account certificate authority hierarchy using ACM Private CA

The Manufacturer has two categories of AWS accounts:

  1. A dedicated account to hold the Manufacturer’s root CA
  2. An account to hold their subordinate CA

Note: The diagram shows two subordinate CAs in the Manufacturer account. However, depending on your security needs, you can have a subordinate CA per account per supplier.

Additionally, each Supplier has one AWS account. These accounts will have the Manufacturer’s subordinate CA shared by using AWS RAM. The Manufacturer will have a subordinate CA for each Supplier.

Logically separate accounts

In order to minimize the scope of impact and scope users to actions within their duties, it’s critical that you logically separate AWS accounts based on workload within the CA hierarchy. The following section shows a recommendation for how to do that.

AWS account that holds the root CA

You, the Manufacturer, should place the ACM Private root CA within its own dedicated AWS account to segment and tightly control access to the root CA. This limits access at the account level and only uses the dedicated account for a single purpose: holding the root CA for your organization. This account will only have access from IAM principals that maintain the CA hierarchy through a federation service like AWS Single Sign-On (AWS SSO) or direct federation to IAM through an existing identity provider. This account also has AWS CloudTrail enabled and configured for business-specific alerting, including actions like creation, updating, or deletion of the root CA.

AWS account that holds the subordinate CAs

You, the Manufacturer, will have a dedicated account where the entire CA hierarchy below the root will be located. You should have a separate subordinate CA for each Supplier, and in some cases a separate subordinate CA for each hardware module the Supplier is building. The subordinate CAs can issue certificates for specific hardware modules within the Supplier account.

This Manufacturer account shares each subordinate CA to the respective Supplier’s AWS account by using AWS RAM. This provides joint control to the shared subordinate CA, creating isolation between individual Suppliers. AWS RAM allows Suppliers to control certificate issuance and revocation if this is allowed by the Manufacturer. Each Supplier is only shared certificate provisioning access through AWS RAM configuration, which means that you can tightly monitor and revoke access through AWS RAM. Given this sharing through AWS RAM, the Suppliers don’t have access to modify or delete the CA hierarchy itself and can only provision certificates from it.

Supplier AWS account(s)

These AWS accounts are owned by each respective Supplier. For example, you might partner with radio, navigation system, and telemetry suppliers. Each Supplier would have their own AWS account, which they control. The Supplier accepts an invitation from the manufacturer through AWS RAM, sharing the subordinate CA. The subordinate is allowed to take only certain actions, based on how the Manufacturer configured the share (more on this later in the post).

Separation of duties by means of IAM role creation

In order to follow least privilege best practices when you create a CA hierarchy with ACM Private CA, you must create IAM roles that are specific to each job function. The recommended method is to separate administrator and certificate issuer roles.

For this automotive manufacturing use case, we recommend the following roles:

  1. Manufacturer IAM roles:
    • A CA admin role with CA disable permission
    • A CA admin role with CA delete permission
  2. Supplier certificate issuer IAM role:

Manufacturer IAM role overview

In this flow, one IAM role is able to disable the CA, and a second principal can delete the CA. This enables two-person control for this highly privileged action—meaning that you need a two-person quorum to rotate the CA certificate.

Day-to-day CA admin policy (with CA disable)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "acm-pca:ImportCertificateAuthorityCertificate",
                "acm-pca:DeletePolicy",
                "acm-pca:PutPolicy",
                "acm-pca:TagCertificateAuthority",
                "acm-pca:ListTags",
                "acm-pca:GetCertificate",
                "acm-pca:CreateCertificateAuthority",
                "acm-pca:ListCertificateAuthorities",
                "acm-pca:UntagCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCertificate",
                "acm-pca:RevokeCertificate",
                "acm-pca:UpdateCertificateAuthority",
                "acm-pca:GetPolicy",
                "acm-pca:IssueCertificate",
                "acm-pca:DescribeCertificateAuthorityAuditReport",
                "acm-pca:CreateCertificateAuthorityAuditReport",
                "acm-pca:RestoreCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCsr",
                "acm-pca:DeletePermission",
                "acm-pca:DescribeCertificateAuthority",
                "acm-pca:CreatePermission",
                "acm-pca:ListPermissions"
            ],
            "Resource": “*”
        },
        {
            "Effect": "Deny",
            "Action": [
                "acm-pca:DeleteCertificateAuthority"
            ],
            "Resource": <Enter Root CA ARN Here>
        }
    ]
}

Privileged CA admin policy (with CA delete)

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "acm-pca:ImportCertificateAuthorityCertificate",
                "acm-pca:DeletePolicy",
                "acm-pca:PutPolicy",
                "acm-pca:TagCertificateAuthority",
                "acm-pca:ListTags",
                "acm-pca:GetCertificate",
                "acm-pca:UntagCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCertificate",
                "acm-pca:RevokeCertificate",
                "acm-pca:GetPolicy",
    "acm-pca:CreateCertificateAuthority",
                "acm-pca:ListCertificateAuthorities",
                "acm-pca:DescribeCertificateAuthorityAuditReport",
                "acm-pca:CreateCertificateAuthorityAuditReport",
                "acm-pca:RestoreCertificateAuthority",
                "acm-pca:GetCertificateAuthorityCsr",
                "acm-pca:DeletePermission",
    "acm-pca:IssueCertificate",
                "acm-pca:DescribeCertificateAuthority",
                "acm-pca:CreatePermission",
                "acm-pca:ListPermissions",
                "acm-pca:DeleteCertificateAuthority"
            ],
            "Resource": “*”
        },
        {
            "Effect": "Deny",
            "Action": [
                "acm-pca:UpdateCertificateAuthority"
            ],
            "Resource": <Enter Root CA ARN Here>
        }
    ]
}

We recommend that you, the Manufacturer, create a two-person process for highly privileged events like CA certificate rotation ceremonies. The preceding policies serve two purposes. First, they allow you to designate separation of management duties between day-to-day CA admin tasks and infrequent root CA rotation ceremonies. The day-to-day CA admin policy allows all ACM Private CA actions except the ability to delete the root CA. This is because the day-to-day CA admin should not be deleting the root CA. Meanwhile, the privileged CA admin policy has the ability to call DeleteCertificateAuthority. However, in order to call DeleteCertificateAuthority, you first need to have the day-to-day CA admin role disable the root CA.

This means that both roles listed here are necessary to perform a root CA deletion for a rotation or replacement ceremony. This arrangement creates a way to control the deletion of the CA resource by requiring two separate actors to disable and delete. It’s crucial that the two roles are assumed by two different people at the identity provider. Having one person assume both of these roles negates the increased security created by each role.

You might also consider enforcing tagging of CAs at the organization level so that each new CA has relevant tags. The blog post Securing resource tags used for authorization using a service control policy in AWS Organizations illustrates in detail how to secure tags using service control policies (SCPs), so that only authorized users can modify tags.

Supplier IAM role overview

Your Suppliers should also follow least privilege when creating IAM roles within their own accounts. However, as we’ll see in the Cross-account sharing by using AWS RAM section, even if the Suppliers don’t follow best practices, the Manufacturer’s ACM Private CA hierarchy is still isolated and secure.

That being said, here are common IAM roles that your Suppliers should create within their own accounts:

  1. Developers who provision certificates for development and QA workloads
  2. Developers who provision certificates for production

These certificate issuing roles give the Supplier the ability to issue end-entity certificates from the CA hierarchy. In this use case, the Supplier needs two different levels of permissions: non-production certificates and production certificates. To simplify the roles within IAM, the Supplier decided to use ABAC. These ABAC policies allow operations when the principal’s tag matches the resource tag. Because the Supplier has many similar policies, each with a different set of users, they use ABAC to create a single IAM policy that uses principal tags rather than creating multiple slightly different IAM policies.

Certificate issuing policy that uses ABAC

{
	"Version": "2012-10-17",
	"Statement": [
	{
		"Effect": "Allow",
		"Action": [
			"acm-pca:IssueCertificate",
			"acm-pca:ListTags",
			"acm-pca:GetCertificate",
			"acm-pca:ListCertificateAuthorities"
		],
		"Resource": "*",
		"Condition": {
			"StringEquals": {
				"aws:ResourceTag/access-project": "${aws:PrincipalTag/access-project}",
				"aws:ResourceTag/access-team": "${aws:PrincipalTag/access-team}"
			}
		}
	}
}

This single policy enables all personas to be scoped to least privilege access. If you look at the Condition portion of the IAM policy, you can see the power of ABAC. This condition verifies that the PrincipalTag matches the ResourceTag. The Supplier is federating into IAM roles through AWS SSO and tagging the Supplier’s principals within its selected identity providers.

Because you as the Manufacturer have tagged the subordinate CAs that are shared with the Supplier, the Supplier can use identity provider (IdP) attributes as tags to simplify the Supplier’s IAM strategy. In this example, the Supplier configures each relevant user in the IdP with the attribute (tag) key: access-team. This tag matches the tagging strategy used by the Manufacturer. Here’s the mapping for each persona within the use case:

  • Dev environment:
    • access-team: DevTeam
  • Production environment:
    • access-team: ProdTeam

You can choose to add or remove tags depending on your use case, and the preceding scenario serves as a simple example. This offloads the need to create new IAM policies as the number of subordinate CAs grow. If you decide to use ABAC, make sure that you require both principal tagging and resource tagging upon creation of each, because these tags become your authorization mechanism.

CA lifecycle: Audit report published by the Manufacturer

In terms of auditing and monitoring, we recommend that the Manufacturer have a mechanism to track how many certificates were issued for a specific Supplier or module. Within the Manufacturer accounts, you can generate audit reports through the console or CLI. This allows you, the manufacturer, to gather metrics on certificate issuance and revocation. Following is an example of a certificate issuance.

Figure 2: Audit report output for certificate issuance

Figure 2: Audit report output for certificate issuance

For more information on generating an audit report, see Using audit reports with your private CA.

Cross-account sharing by using AWS RAM

With AWS RAM, you can share CAs with another account. We recommend that you, as a Manufacturer, use AWS RAM to share CAs with Suppliers so that they can issue certificates without administrator access to the CA. This arrangement allows you as the Manufacturer to more easily limit and revoke access if you change Suppliers. The Suppliers can create certificates through the ACM console or through the CLI, API, or AWS CloudFormation. Manufacturers are only sharing the ability to create, manage, bind, and export certificates from the CA hierarchy. The CA hierarchy itself is contained within the Manufacturers’ accounts, and not within the Suppliers’ accounts. By using AWS RAM, the Suppliers don’t have any administrator access to the CA hierarchy. From a cost perspective, you can centrally control and monitor the costs of your private CA hierarchy without having to deal with cost-sharing across Suppliers.

Refer to How to use AWS RAM to share your ACM Private CA cross-account for a full walkthrough on how to use RAM with ACM Private CA.

Certificate templates with AWS RAM managed permissions

AWS RAM has the ability to create managed permissions in order to define the actions that can be performed on shared resources. For each shareable resource type, you can use AWS RAM managed permissions to define which permissions to grant to whom for shared resource types that support additional managed permissions. This means that when you use AWS RAM to share a resource (in this case ACM Private CA), you can now specify which IAM actions can take place on that resource. AWS RAM managed permissions integrate with the following ACM Private CA certificate templates:

  • Permission 1: BlankEndEntityCertificate_APICSRPassthrough
  • Permission 2: EndEntityClientAuthCertificate
  • Permission 3: EndEntityServerAuthCertificate
  • Permission 4: subordinatesubordinateCACertificate_PathLen0
  • Permission 5: RevokeCertificate

These five certificate templates allow a Manufacturer to scope its Suppliers to the certificate template provisioning level. This means that you can limit which certificate templates can be issued by the Suppliers.

Let’s assume you have a Supplier that is supplying a module that has infotainment media capability, and you, the manufacturer, want the Supplier to provision the end-entity client certificate but you don’t want them to be able to revoke that certificate. You can use AWS RAM managed permissions to scope that Supplier’s shared private CA to allow the EndEntityClientAuthCertificate issuance template, which implicitly denies RevokeCertificate template actions. This further scopes down what the Supplier is authorized to issue on the shared CA, gives the responsibility for revoking infotainment device certificates to the Manufacturer, but still allows the Supplier to load devices with a certificate upon creation.

Example of creating a resource share in AWS RAM by using the AWS CLI

This walkthrough shows you the general process of sharing a private CA by using AWS RAM and then accepting that shared resource in the partner account.

  1. Create your shared resource in AWS RAM from the Manufacturer subordinate CA account. Notice that in the example that follows, we selected one of the certificate templates within the managed permissions option. This limits the shared CA so that it can only issue certain types of certificate templates.

    Note: Replace the <variable> placeholders with your own values.

    aws ram create-resource-share
    		--name Shared_Private_CA
    		--resource-arns arn:aws:acm-pca:<region:111122223333>:certificate-authority/<xxxx-xxxx-xxxx-xxxx-example>
    		--permission-arns "arn:aws:ram::aws:permission/<AWSRAMBlankEndEntityCertificateAPICSRPassthroughIssuanceCertificateAuthority>"
    		--principals <444455556666>

  2. From the Supplier account, the Supplier administrator will accept the resource. Follow How to use AWS RAM to share your ACM Private CA cross-account to complete the shared resource acceptance and issue an end entity certificate.

Conclusion

In this blog post, you learned about the various considerations for building a secure public key infrastructure (PKI) hierarchy by using ACM Private CA through an example customer’s prescriptive setup. You learned how you can use AWS RAM to share CAs across accounts easily and securely. You also learned about sharing specific CAs through the ability to define permissions to specific principals across accounts, allowing for granular control of permissions on principals that might act on those resources.

The main takeaways of this post are how to create least privileged roles within IAM in order to scope down the activities of each persona and limit the potential scope of impact for your organization’s private CA hierarchy. Although these best practices are specific to manufacturer business requirements, you can alter them based on your business needs. With the managed permissions in AWS RAM, you can further scope down the actions that principals can perform with your CA by limiting the certificate templates allowed on that CA when you share it. Using all of these tools, you can help your PKI hierarchy to have a high level of security. To learn more, see the other ACM Private CA posts on the AWS Security Blog.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Anthony Pasquariello

Anthony Pasquariello

Anthony is a Senior Solutions Architect at AWS based in New York City. He specializes in modernization and security for our advanced enterprise customers. Anthony enjoys writing and speaking about all things cloud. He’s pursuing an MBA, and received his MS and BS in Electrical & Computer Engineering.

Omar Zoma

Omar Zoma

Omar is a senior AWS Security Solutions Architect that lives in metro Detroit. Omar is passionate about helping customers solve cloud and vehicle security problems at a global scale. In his free time, Omar trains hundreds of students a year in security and cloud through universities and training programs.

Use Amazon Cognito to add claims to an identity token for fine-grained authorization

Post Syndicated from Ajit Ambike original https://aws.amazon.com/blogs/security/use-amazon-cognito-to-add-claims-to-an-identity-token-for-fine-grained-authorization/

With Amazon Cognito, you can quickly add user sign-up, sign-in, and access control to your web and mobile applications. After a user signs in successfully, Cognito generates an identity token for user authorization. The service provides a pre token generation trigger, which you can use to customize identity token claims before token generation. In this blog post, we’ll demonstrate how to perform fine-grained authorization, which provides additional details about an authenticated user by using claims that are added to the identity token. The solution uses a pre token generation trigger to add these claims to the identity token.

Scenario

Imagine a web application that is used by a construction company, where engineers log in to review information related to multiple projects. We’ll look at two different ways of designing the architecture for this scenario: a standard design and a more optimized design.

Standard architecture

A sample standard architecture for such an application is shown in Figure 1, with labels for the various workflow steps:

  1. The user interface is implemented by using ReactJS (a JavaScript library for building user interfaces).
  2. The user pool is configured in Amazon Cognito.
  3. The back end is implemented by using Amazon API Gateway.
  4. AWS Lambda functions exist to implement business logic.
  5. The AWS Lambda CheckUserAccess function (5) checks whether the user has authorization to call the AWS Lambda functions (4).
  6. The project information is stored in an Amazon DynamoDB database.
Figure 1: Lambda functions that need the user’s projectID call the GetProjectID Lambda function

Figure 1: Lambda functions that need the user’s projectID call the GetProjectID Lambda function

In this scenario, because the user has access to information from several projects, several backend functions use calls to the CheckUserAccess Lambda function (step 5 in Figure 1) in order to serve the information that was requested. This will result in multiple calls to the function for the same user, which introduces latency into the system.

Optimized architecture

This blog post introduces a new optimized design, shown in Figure 2, which substantially reduces calls to the CheckUserAccess API endpoint:

  1. The user logs in.
  2. Amazon Cognito makes a single call to the PretokenGenerationLambdaFunction-pretokenCognito function.
  3. The PretokenGenerationLambdaFunction-pretokenCognito function queries the Project ID from the DynamoDB table and adds that information to the Identity token.
  4. DynamoDB delivers the query result to the PretokenGenerationLambdaFunction-pretokenCognito function.
  5. This Identity token is passed in the authorization header for making calls to the Amazon API Gateway endpoint.
  6. Information in the identity token claims is used by the Lambda functions that contain business logic, for additional fine-grained authorization. Therefore, the CheckUserAccess function (7) need not be called.

The improved architecture is shown in Figure 2.

Figure 2. Get the projectID and inset it in a custom claim in the Identity token

Figure 2. Get the projectID and inset it in a custom claim in the Identity token

The benefits of this approach are:

  1. The number of calls to get the Project ID from the DynamoDB table are reduced, which in turn reduces overall latency.
  2. The dependency on the CheckUserAccess Lambda function is removed from the business logic. This reduces coupling in the architecture, as depicted in the diagram.

In the code sample provided in this post, the user interface is run locally from the user’s computer, for simplicity.

Code sample

You can download a zip file that contains the code and the AWS CloudFormation template to implement this solution. The code that we provide to illustrate this solution is described in the following sections.

Prerequisites

Before you deploy this solution, you must first do the following:

  1. Download and install Python 3.7 or later.
  2. Download the AWS SDK for Python (Boto3) library by using the following pip command.
    pip install boto3
  3. Install the argparse package by using the following pip command.
    pip install argparse
  4. Install the AWS Command Line Interface (AWS CLI).
  5. Configure the AWS CLI.
  6. Download a code editor for Python. We used Visual Studio Code for this post.
  7. Install Node.js.

Description of infrastructure

The code provided with this post installs the following infrastructure in your AWS account.

Resource Description
Amazon Cognito user pool The users, added by the addUserInfo.py script, are added to this pool. The client ID is used to identify the web client that will connect to the user pool. The user pool domain is used by the web client to request authentication of the user.
Required AWS Identity and Access Management (IAM) roles and policies Policies used for running the Lambda function and connecting to the DynamoDB database.
Lambda function for the pre token generation trigger A Lambda function to add custom claims to the Identity token.
DynamoDB table with user information A sample database to store user information that is specific to the application.

Deploy the solution

In this section, we describe how to deploy the infrastructure, save the trigger configuration, add users to the Cognito user pool, and run the web application.

To deploy the solution infrastructure

  1. Download the zip file to your machine. The readme.md file in the addclaimstoidtoken folder includes a table that describes the key files in the code.
  2. Change the directory to addclaimstoidtoken.
    cd addclaimstoidtoken
  3. Review stackInputs.json. Change the value of the userPoolDomainName parameter to a random unique value of your choice. This example uses pretokendomainname as the Amazon Cognito domain name; you should change it to a unique domain name of your choice.
  4. Deploy the infrastructure by running the following Python script.
    python3 setup_pretoken.py

    After the CloudFormation stack creation is complete, you should see the details of the infrastructure created as depicted in Figure 3.

    Figure 3: Details of infrastructure

    Figure 3: Details of infrastructure

Now you’re ready to add users to your Amazon Cognito user pool.

To add users to your Cognito user pool

  1. To add users to the Cognito user pool and configure the DynamoDB store, run the Python script from the addclaimstoidtoken directory.
    python3 add_user_info.py
  2. This script adds one user. It will prompt you to provide a username, email, and password for the user.

    Note: Because this is sample code, advanced features of Cognito, like multi-factor authentication, are not enabled. We recommend enabling these features for a production application.

    The addUserInfo.py script performs two actions:

    • Adds the user to the Cognito user pool.
      Figure 4: User added to the Cognito user pool

      Figure 4: User added to the Cognito user pool

    • Adds sample data to the DynamoDB table.
      Figure 5: Sample data added to the DynamoDB table named UserInfoTable

      Figure 5: Sample data added to the DynamoDB table named UserInfoTable

Now you’re ready to run the application to verify the custom claim addition.

To run the web application

  1. Change the directory to the pre-token-web-app directory and run the following command.
    cd pre-token-web-app
  2. This directory contains a ReactJS web application that displays details of the identity token. On the terminal, run the following commands to run the ReactJS application.
    npm install
    npm start

    This should open http://localhost:8081 in your default browser window that shows the Login button.

    Figure 6: Browser opens to URL http://localhost:8081

    Figure 6: Browser opens to URL http://localhost:8081

  3. Choose the Login button. After you do so, the Cognito-hosted login screen is displayed. Log in to the website with the user identity you created by using the addUserInfo.py script in step 1 of the To add users to your Cognito user pool procedure.
    Figure 7: Input credentials in the Cognito-hosted login screen

    Figure 7: Input credentials in the Cognito-hosted login screen

  4. When the login is successful, the next screen displays the identity and access tokens in the URL. You can reveal the token details to verify that the custom claim has been added to the token by choosing the Show Token Detail button.
    Figure 8: Token details displayed in the browser

    Figure 8: Token details displayed in the browser

What happened behind the scenes?

In this web application, the following steps happened behind the scenes:

  1. When you ran the npm start command on the terminal command line, that ran the react-scripts start command from package.json. The port number (8081) was configured in the pre-token-web-app/.env file. This opened the web application that was defined in app.js in the default browser at the URL http://localhost:8081.
  2. The Login button is configured to navigate to the URL that was defined in the constants.js file. The constants.js file was generated during the running of the setup_pretoken.py script. This URL points to the Cognito-hosted default login user interface.
  3. When you provided the login information (username and password), Amazon Cognito authenticated the user. Before generating the set of tokens (identity token and access token), Cognito first called the pre-token-generation Lambda trigger. This Lambda function has the code to connect to the DynamoDB database. The Lambda function can then access the project information for the user that is stored in the userInfo table. The Lambda function read this project information and added it to the identity token that was delivered to the web application.

    Lambda function code

    const AWS = require("aws-sdk");
    
    // Create the DynamoDB service object
    var ddb = new AWS.DynamoDB({ apiVersion: "2012-08-10" });
    
    // PretokenGeneration Lambda
    exports.handler = async function (event, context) {
        var eventUserName = "";
        var projects = "";
    
        if (!event.userName) {
            return event;
        }
    
        var params = {
            ExpressionAttributeValues: {
                ":v1": {
                    S: event.userName
                }
            },
            KeyConditionExpression: "userName = :v1",
            ProjectionExpression: "projects",
            TableName: "UserInfoTable"
        };
    
        event.response = {
            "claimsOverrideDetails": {
                "claimsToAddOrOverride": {
                    "userName": event.userName,
                    "projects": null
                },
            }
        };
    
        try {
            let result = await ddb.query(params).promise();
            if (result.Items.length > 0) {
                const projects = result.Items[0]["projects"]["S"];
                console.log("projects = " + projects);
                event.response.claimsOverrideDetails.claimsToAddOrOverride.projects = projects;
            }
        }
        catch (error) {
            console.log(error);
        }
    
        return event;
    };

    The code for the Lambda function is as follows.

  4. After a successful login, Amazon Cognito redirected to the URL that was specified in the App Client Settings section, and added the token to the URL.
  5. The webpage detected the token in the URL and displayed the Show Token Detail button. When you selected the button, the webpage read the token in the URL, decoded the token, and displayed the information in the relevant text boxes.
  6. Notice that the Decoded ID Token box shows the custom claim named projects that displays the projectID that was added by the PretokenGenerationLambdaFunction-pretokenCognito trigger.

How to use the sample code in your application

We recommend that you use this sample code with the following modifications:

  1. The code provided does not implement the API Gateway and Lambda functions that consume the custom claim information. You should implement the necessary Lambda functions and read the custom claim for the event object. This event object is a JSON-formatted object that contains authorization data.
  2. The ReactJS-based user interface should be hosted on an Amazon Simple Storage Service (Amazon S3) bucket.
  3. The projectId of the user is available in the token. Therefore, when the token is passed by the Authorization trigger to the back end, this custom claim information can be used to perform actions specific to the project for that user. For example, getting all of that user’s work items that are related to the project.
  4. Because the token is valid for one hour, the information in the custom claim information is available to the user interface during that time.
  5. You can use the AWS Amplify library to simplify the communication between your web application and Amazon Cognito. AWS Amplify can handle the token retention and refresh token mechanism for the web application. This also removes the need for the token to be displayed in the URL.
  6. If you’re using Amazon Cognito to manage your users and authenticate them, using the Amazon Cognito user pool to control access to your API is easier, because you don’t have to write the authentication code in your authorizer.
  7. If you decide to use Lambda authorizers, note the following important information from the topic Steps to create an API Gateway Lambda authorizer: “In production code, you may need to authenticate the user before granting authorization. If so, you can add authentication logic in the Lambda function as well by calling an authentication provider as directed in the documentation for that provider.”
  8. Lambda authorizer is recommended if the final authorization (not just token validity) decision is made based on custom claims.

Conclusion

In this blog post, we demonstrated how to implement fine-grained authorization based on data stored in the back end, by using claims stored in an identity token that is generated by the Amazon Cognito pre token generation trigger. This solution can help you achieve a reduction in latency and improvement in performance.

For more information on the pre token generation Lambda trigger, refer to the Amazon Cognito Developer Guide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ajit Ambike

Ajit Ambike

Ajit Ambike is a Sr. Application Architect at Amazon Web Services. As part of AWS Energy team, he leads the creation of new business capabilities for the customers. Ajit also brings best practices to the customers and partners that accelerate the productivity of their teams.

Zafar Kapadia

Zafar Kapadia

Zafar Kapadia is a Sr. Customer Delivery Architect at AWS. He has over 17 years of IT experience and has worked on several Application Development and Optimization projects. He is also an avid cricketer and plays in various local leagues.

How to use AWS KMS RSA keys for offline encryption

Post Syndicated from Patrick Palmer original https://aws.amazon.com/blogs/security/how-to-use-aws-kms-rsa-keys-for-offline-encryption/

This blog post discusses how you can use AWS Key Management Service (AWS KMS) RSA public keys on end clients or devices and encrypt data, then subsequently decrypt data by using private keys that are secured in AWS KMS.

Asymmetric cryptography is a cryptographic system that uses key pairs. Each pair consists of a public key, which can be seen or accessed by anyone, and a private key, which can be accessed only by authorized people. This system has a useful property, which is that anything encrypted with a public key can only be decrypted by the corresponding private key. A popular method for generating key pairs and encrypting data is the RSA algorithm and cryptosystem.

For RSA key pairs, calculating the private key from the public key is seen as computationally infeasible, and therefore RSA key pairs can be used for both authentication and encryption. The features of asymmetric encryption allow separated parties to share information across an untrusted domain, such as the internet, without having to pre-share any other secrets. However, this type of encryption poses an issue of keeping the private key secure, because the private key has the power to decrypt all messages that are transmitted by a large number of end users.

AWS KMS provides simple APIs that you can use to securely generate, store, and manage keys, including RSA key pairs inside hardware security modules (HSMs). Key pairs are generated within FIPS 140-2 validated HSMs that are managed by AWS. You can then use these private keys through APIs to do actions such as decrypt ciphertexts, meaning that plaintext private keys never leave the HSM, which provides assurances of privacy for the private key. Additional APIs allow a customer to retrieve a plaintext copy of the corresponding public key, which allows disconnected or offline uses of RSA public keys.

Limits of asymmetric cryptography

A key drawback to asymmetric cryptography is the fact that you cannot encrypt large pieces of data. When you have a 2048-bit RSA key pair and encrypt something by using the cipher RSAES_OASEP_SHA_256, the largest amount of data that you can encrypt is 190 bytes.

In contrast, symmetric encryption ciphers that use a chained or counter-mode operation don’t have this limit, and they make it possible for you to encrypt data in the tens-of-gigabytes. Symmetric encryption algorithms such as the Advanced Encryption Standard (AES) also benefit from faster data encryption speeds due to smaller key sizes and less complex operations that can be built into hardware.

By combining these two algorithms in a hybrid cryptosystem, you give end clients with a public key the ability to encrypt large pieces of information. A client generates a random 256-bit AES key, which should be from a secure source such as /dev/urandom or a dedicated embedded chip. The client then encrypts its large payload by using a mode of operation such as AES-GCM or AES-CBC by using that 256-bit AES key. Next, the client encrypts that 256-bit AES key by using the RSA public key (see step 5 in Figure 1). End clients then transmit only encrypted data across insecure channels, maintaining privacy of the payload data.

A challenge that customers often face is that they want to use AWS KMS for its security properties, but also want to access their KMS keys from devices that don’t have AWS credentials embedded within them. Without AWS credentials, a device can’t call AWS APIs. This blog post shows how you can use a hybrid cryptosystem where RSA public keys can be downloaded or embedded into devices to overcome this challenge.

Prerequisites and initial considerations

This walkthrough assumes that you have some understanding of RSA ciphers and symmetric encryption schemes such as AES. The walkthrough uses OpenSSL for demonstration of the encryption process, but similar libraries can be used on a client-side device.

The walkthrough also assumes that you have an AWS Identity and Access Management (IAM) user with permissions to the AWS KMS service, and the AWS Command Line Interface (AWS CLI) installed with the relevant credentials.

When you create a KMS key, you will also generate a key policy that defines access to it. The default key policy allows all users in your account with AWS KMS actions in their IAM policies to access the KMS key. The key policy for a given KMS key is the primary method for determining access.

Important: You will incur charges for the services used in this example. You can find the cost of each service on the corresponding service pricing page. For more information, see AWS KMS Pricing.

Architectural overview

This post contains procedures for completing the following operations, which are also shown in Figure 1:

  1. Create an RSA key pair in AWS KMS.
  2. Download or pre-install the AWS KMS public key to an end-client device.
  3. Generate an AES 256-bit key on an end client.
  4. Encrypt a large payload of data on the end client by using the AES 256-bit key.
  5. Encrypt the AES 256-bit key with the AWS KMS public key.
  6. Transfer the encrypted payload and key.
  7. Decrypt the AES 256-bit key by using AWS KMS.
  8. Decrypt the payload data by using the now-shared AES 256-bit key.
Figure 1: The steps for hybrid encryption

Figure 1: The steps for hybrid encryption

This diagram shows an end client device, an untrusted network such as a cellular network, and the AWS Cloud. An RSA key pair is generated in AWS KMS, and then the public key can either be embedded in the end client, or pulled by the end client through HTTP(S) or other remote means. In all circumstances, only the public key persists on the end client, which means that no secrets are stored on the device.

How you host the public key on your end clients depends on what network access they have. For example, an embedded Internet of Things (IoT) device for mining vehicles might never connect to the internet, but could communicate with a central system through a private 5G network. In this circumstance, you would host this public key within that network for retrieval. For other disconnected IoT devices that can connect to the internet, such as smart-home appliances, you might want to host the public key on a web server at a predefined URL or through an API.

Note: Whenever you vend public keys over an untrusted channel, such as when you vend the public key through an API, you should make sure that the key can be verified in some way to confirm that it hasn’t been tampered with. This is typically done by vending keys over an HTTPS connection, where the integrity of the keys is provided by the X.509 certificate that was used in the TLS connection. The X.509 certificate also verifies an association with the key-pair owner, typically by domain name.

Implement the solution

The following steps can be used as a proof-of-concept to guide you through implementing a hybrid-cryptosystem by using a KMS public key on an example device.

Create keys in AWS KMS

In the first step of this solution, you create an RSA asymmetric key pair in AWS KMS (step 1 in the architectural overview). With AWS KMS, you can create key pairs in a variety of dimensions according to your security requirements or standards. For more information, see Choosing a KMS key type in the AWS KMS documentation.

To create a key pair in AWS KMS, use the CreateKey API. For this example, you will create an RSA key pair with RSA_2048 for the CustomerMasterKeySpec parameter and ENCRYPT_DECRYPT for the KeyUsage parameter in the AWS CLI. This post uses 2048-bit keys, but note that AWS KMS allows larger key sizes. The CLI will return a KeyId value that uniquely identifies the KMS key in your account, which you should take note of.

To create a KMS key by using the CLI

  • Enter the following command in the AWS CLI.
    aws kms create-key --key-spec RSA_2048 \
        --key-usage ENCRYPT_DECRYPT \
        --description "Example RSA Encryption Key Pair"

You can follow the Creating asymmetric KMS keys documentation to see how to use the AWS Management Console to create a KMS key pair with the same properties as shown here.

Note: When a KMS key is created, it will be logged by AWS CloudTrail, a service that monitors and records activity within your account. All API calls to the AWS KMS service are logged in CloudTrail, which you can use to audit access to KMS keys.

To allow your KMS key to be identified by a human-readable string rather than KeyId, you can assign an alias for the KMS key (replace the target-key-id value of <1234abcd-12ab-34cd-56ef-1234567890ab> with your KeyId). This makes it easier to use and manage.

To create a KMS key alias for your key by using the CLI

  • Enter the following command in the AWS CLI.
    aws kms create-alias \
        --alias-name alias/example-rsa-key \
        --target-key-id <1234abcd-12ab-34cd-56ef-1234567890ab>
    

Download the public key from AWS KMS

A benefit of asymmetric encryption is that you can distribute a public key to a large, untrusted network, and the public key can only be used for encryption. Decryption of those messages can only be conducted by the corresponding private key. You can use the AWS KMS Encrypt API to encrypt data with a KMS key pair (specifically the public key). However, because the AWS APIs are authenticated by using a signature, you must have access to AWS credentials to use these APIs, which you might not want to do on untrusted devices. Additionally, in a private 5G network, you might not have the capability to call the AWS KMS API endpoints from the end clients. Instead, you can download the public key from a local source or embed that into the end client at the time of manufacture.

To retrieve a copy of the public key from your AWS KMS key pair, you can use the GetPublicKey API. The following example shows how to use this with the AWS CLI command get-public-key and reference the key alias you set earlier.

To view the public key for your KMS key pair by using the CLI

  • Enter the following command in the AWS CLI.
    aws kms get-public-key --key-id alias/example-rsa-key

The return value from this API will contain several elements, including the PublicKey. The returned PublicKey value is the DER-encoded X.509, and because you’re using the AWS CLI, it is base64-encoded for readability purposes. By using the AWS CLI, you can query just the PublicKey return value, base64-decode it, and then save the key to a file on disk, as follows.

To use the AWS CLI to query only the public key, then base64 decode it and output it to a file

  • Enter the following command in the AWS CLI.
    aws kms get-public-key \
    --key-id alias/example-rsa-key \ 
    --output text \ 
    --query PublicKey | base64 -–decode > public_key.der

In this example, the local machine where you saved the public_key.der file will now represent the end-client device.

Note: If you call this API by using one of the AWS SDKs, such as boto3, then the PublicKey value is not base64-encoded.

Create an AES 256-bit symmetric key on the end client

Although the end client now has a copy of the public key from the associated KMS private key, the public key can’t be used for encrypting data that you plan on transmitting, due to the size limits on data that can be encrypted. Instead, you can use symmetric encryption. Typically, symmetric keys are smaller than asymmetric keys, the ciphers are faster when encrypting data, and the resulting ciphertext is similar in size to the original data.

To generate a symmetric key, you should use a source of random entropy. Some operating systems offer block access to hardware-based sources of random numbers, such as /dev/hwrng. To provide an example process in this blog post, you will use the OpenSSL rand utility, which uses a cryptographically secure pseudo random generator (CSPRNG) seeded by /dev/urandom. In production systems, you might have stronger sources of entropy to rely on, or compliance requirements for random number generation. In hardware-constrained environments, you should take extra care to make sure that sources of entropy are cryptographically secure. The following command uses OpenSSL to create an AES 256-bit (32 bytes) key and base64-encode it, then save it to disk in plaintext as key.b64.

Note: Anyone with access to this file system will have access to this key.

To use the OpenSSL rand command to create a symmetric key and output it to a file

  • Enter the following command.
    openssl rand -base64 32 > key.b64

Encrypt the data to be sent from the end client

Now that you have two different key types on the end client, you can use a hybrid cryptosystem to encrypt a large text file. First, you will generate a sample file to encrypt on your system. By outputting some bytes from /dev/urandom, you can create this file to the size you want. The following command outputs 200 random bytes, base64-encodes the file, and writes that to disk in a file called encrypt.me.

To generate a sample file from random data, which will be encrypted later

  • Enter the following command.
    head -c 200 /dev/urandom | base64 –-wrap=0 > encrypt.me

Next, you will encrypt the newly created file with the AES 256-bit key that you created earlier (which is base64-encoded). By using the OpenSSL command line, you will encrypt the file on disk and create a new file called encrypt.me.enc.

Note: For demonstration purposes, this solution uses OpenSSL to complete the encryption process. However, the command line OpenSSL enc utility doesn’t allow the cipher aes-256-gcm. Galois Counter Mode (GCM) is recommended when encrypting and sending data, because it includes authentication, so that that the ciphertext can’t be tampered with in transit. Instead, for this demonstration, you will use aes-256-cbc, which is not authenticated.

To use the OpenSSL enc command to encrypt your sample file with a symmetric key

  • • Enter the following command.
    openssl enc -aes-256-cbc \
    -in encrypt.me -out encrypt.me.enc \
    -pass file:./key.b64

Encrypt the AES 256-bit key

So that the data can be decrypted again, you will need to share the same AES 256-bit key with the recipient. To share that with only the person who can use the KMS private key that you created earlier, you can encrypt the symmetric key (key.b64) with the RSA public key that you retrieved earlier (public_key.der).

Again, you will use OpenSSL to see how this works and the required cipher options. When encrypting or decrypting with a KMS RSA key pair, you can use one of two encryption algorithms, either RSAES_OAEP_SHA_1 or RSAES_OAEP_SHA_256. These identify the cipher suites used in encryption that are currently supported by AWS KMS for encryption.

To use the OpenSSL pkeyutl command to encrypt your symmetric key with your local copy of your KMS public key

  • Enter the following command.
    openssl pkeyutl \
    	-in key.b64 -out key.b64.enc \
    	-inkey public_key.der -keyform DER -pubin -encrypt \
    	-pkeyopt rsa_padding_mode:oaep -pkeyopt rsa_oaep_md:sha256

This command creates a new file on disk called key.b64.enc. This file is the encrypted AES 256-bit key, which can now be transported securely across an insecure network, such as the internet. The last two options in the command define the padding mode used (OAEP) and the length of the message digest (SHA-256), which align with the options available to decrypt when you use the AWS KMS APIs.

Note: You should securely delete both the original payload file (encrypt.me) and the plaintext AES 256-bit key (key.b64) if you want to prevent anyone else from accessing these files. At this point, you will have three files on disk: public_key.der, encrypt.me.enc, and key.b64.enc. If you want to verify the decryption process later in this example, keep these files.

In production, you might never write any of these values to disk. Instead, you can keep all values in memory and only write the encrypted data (ciphertext) to disk, clearing memory after that process has completed.

You can now use the method of your choice to transfer the encrypted files across an unsecured network without compromising the privacy of those files. For smart-home appliance use cases, you can upload the encrypted files in Amazon Simple Storage Service (Amazon S3), a highly durable storage system that can be accessed from the internet, keeping in mind the preventative security practices that AWS recommends. Later, another service can pull these files from S3, and with the correct permissions for the KMS key, can decrypt the files by using the AWS KMS Decrypt API.

Decrypt the files

With access to the decrypt operation for the KMS key and the encrypted files, you can now retrieve the plaintext data file again. To do this, you will replicate the preceding steps, but in reverse. This involves decrypting the AWS 256-bit key by using the AWS KMS API, and then using that result to decrypt the encrypted data. You will need access to the AWS KMS API to complete these actions, because the private key exists in plaintext only within the AWS KMS HSMs.

To decrypt the files

  1. The first step is to decrypt the AWS 256-bit key. You will need to use the AWS CLI to submit the key.b64.enc file to the AWS KMS API, and specify the algorithm you used to encrypt the file (RSAES_OAEP_SHA_256). Use the following command to retrieve the AES 256-bit key in plaintext. Again, you’re using the –query selector to output only the plaintext, and then decode the base64 value.
    aws kms decrypt --key-id alias/example-rsa-key \ 
    		--ciphertext-blob fileb://key.b64.enc \
    		--encryption-algorithm RSAES_OAEP_SHA_256 --output text \
    		--query 'Plaintext' | base64 --decode > decrypted_key.b64

  2. The final step in decrypting the data is to reverse the CBC encryption process you used in OpenSSL. If another mode of symmetric encryption was used, such as AES-GCM, then you would need to decrypt by using that algorithm and the input AES 256-bit key. Use the following OpenSSL command to retrieve the original plaintext payload.
    openssl enc -d -aes-256-cbc \
    		-in encrypt.me.enc -out decrypted.file \
    		-pass file:./decrypted_key.b64

Conclusion

In this post, you learned how to combine AWS KMS asymmetric key pairs with locally created symmetric keys to encrypt and share data that exceeds 190 bytes, without storing a secret on a client device. By taking advantage of the RSA cryptosystem for offline encryption, you can reduce the exposure of plaintext data or secrets to devices outside of your control, and without having to complete complex key exchanges. By using the steps in this solution, you can more securely share large amounts of data, such as update files or configuration settings. To learn more about the asymmetric keys feature of AWS KMS, refer to the AWS KMS Developer Guide. If you have questions about the asymmetric keys feature, interact with us through AWS re:Post.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Patrick Palmer

Patrick is a security solutions architect at AWS. He has a passion for learning new technologies and cryptography across AWS services and having deep conversations with customers. He works on a team of security specialists who strive to continually delight customers. Outside of work, he spends time with his wife and two cats, occasionally playing video games when he can.

Manage application security and compliance with the AWS Cloud Development Kit and cdk-nag

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/manage-application-security-and-compliance-with-the-aws-cloud-development-kit-and-cdk-nag/

Infrastructure as Code (IaC) is an important part of Cloud Applications. Developers rely on various Static Application Security Testing (SAST) tools to identify security/compliance issues and mitigate these issues early on, before releasing their applications to production. Additionally, SAST tools often provide reporting mechanisms that can help developers verify compliance during security reviews.

cdk-nag integrates directly into AWS Cloud Development Kit (AWS CDK) applications to provide identification and reporting mechanisms similar to SAST tooling.

This post demonstrates how to integrate cdk-nag into an AWS CDK application to provide continual feedback and help align your applications with best practices.

Overview of cdk-nag

cdk-nag (inspired by cfn_nag) validates that the state of constructs within a given scope comply with a given set of rules. Additionally, cdk-nag provides a rule suppression and compliance reporting system. cdk-nag validates constructs by extending AWS CDK Aspects. If you’re interested in learning more about the AWS CDK Aspect system, then you should check out this post.

cdk-nag includes several rule sets (NagPacks) to validate your application against. As of this post, cdk-nag includes the AWS Solutions, HIPAA Security, NIST 800-53 rev 4, NIST 800-53 rev 5, and PCI DSS 3.2.1 NagPacks. You can pick and choose different NagPacks and apply as many as you wish to a given scope.

cdk-nag rules can either be warnings or errors. Both warnings and errors will be displayed in the console and compliance reports. Only unsuppressed errors will prevent applications from deploying with the cdk deploy command.

You can see which rules are implemented in each of the NagPacks in the Rules Documentation in the GitHub repository.

Walkthrough

This walkthrough will setup a minimal AWS CDK v2 application, as well as demonstrate how to apply a NagPack to the application, how to suppress rules, and how to view a report of the findings. Although cdk-nag has support for Python, TypeScript, Java, and .NET AWS CDK applications, we’ll use TypeScript for this walkthrough.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • A local installation of and experience using the AWS CDK.

Create a baseline AWS CDK application

In this section you will create and synthesize a small AWS CDK v2 application with an Amazon Simple Storage Service (Amazon S3) bucket. If you are unfamiliar with using the AWS CDK, then learn how to install and setup the AWS CDK by looking at their open source GitHub repository.

  1. Run the following commands to create the AWS CDK application:
mkdir CdkTest
cd CdkTest
cdk init app --language typescript
  1. Replace the contents of the lib/cdk_test-stack.ts with the following:
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    const bucket = new Bucket(this, 'Bucket')
  }
}
  1. Run the following commands to install dependencies and synthesize our sample app:
npm install
npx cdk synth

You should see an AWS CloudFormation template with an S3 bucket both in your terminal and in cdk.out/CdkTestStack.template.json.

Apply a NagPack in your application

In this section, you’ll install cdk-nag, include the AwsSolutions NagPack in your application, and view the results.

  1. Run the following command to install cdk-nag:
npm install cdk-nag
  1. Replace the contents of the bin/cdk_test.ts with the following:
#!/usr/bin/env node
import 'source-map-support/register';
import * as cdk from 'aws-cdk-lib';
import { CdkTestStack } from '../lib/cdk_test-stack';
import { AwsSolutionsChecks } from 'cdk-nag'
import { Aspects } from 'aws-cdk-lib';

const app = new cdk.App();
// Add the cdk-nag AwsSolutions Pack with extra verbose logging enabled.
Aspects.of(app).add(new AwsSolutionsChecks({ verbose: true }))
new CdkTestStack(app, 'CdkTestStack', {});
  1. Run the following command to view the output and generate the compliance report:
npx cdk synth

The output should look similar to the following (Note: SSE stands for Server-side encryption):

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S1: The S3 Bucket has server access logs disabled. The bucket should have server access logging enabled to provide detailed records for the requests that are made to the bucket.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S2: The S3 Bucket does not have public access restricted and blocked. The bucket should have public access restricted and blocked to prevent unauthorized access.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S10: The S3 Bucket does not require requests to use SSL. You can use HTTPS (TLS) to help prevent potential attackers from eavesdropping on or manipulating network traffic using person-in-the-middle or similar attacks. You should allow only encrypted connections over HTTPS (TLS) using the aws:SecureTransport condition on Amazon S3 bucket policies.

Found errors

Note that applying the AwsSolutions NagPack to the application rendered several errors in the console (AwsSolutions-S1, AwsSolutions-S2, AwsSolutions-S3, and AwsSolutions-S10). Furthermore, the cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains the errors as well:

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Remediating and suppressing errors

In this section, you’ll remediate the AwsSolutions-S10 error, suppress the  AwsSolutions-S1 error on a Stack level, suppress the  AwsSolutions-S2 error on a Resource level errors, and not remediate the  AwsSolutions-S3 error and view the results.

  1. Replace the contents of the lib/cdk_test-stack.ts with the following:
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { Bucket } from 'aws-cdk-lib/aws-s3';
import { NagSuppressions } from 'cdk-nag'

export class CdkTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // The local scope 'this' is the Stack. 
    NagSuppressions.addStackSuppressions(this, [
      {
        id: 'AwsSolutions-S1',
        reason: 'Demonstrate a stack level suppression.'
      },
    ])
    // Remediating AwsSolutions-S10 by enforcing SSL on the bucket.
    const bucket = new Bucket(this, 'Bucket', { enforceSSL: true })
    NagSuppressions.addResourceSuppressions(bucket, [
      {
        id: 'AwsSolutions-S2',
        reason: 'Demonstrate a resource level suppression.'
      },
    ])
  }
}
  1. Run the cdk synth command again:
npx cdk synth

The output should look similar to the following:

[Error at /CdkTestStack/Bucket/Resource] AwsSolutions-S3: The S3 Bucket does not default encryption enabled. The bucket should minimally have SSE enabled to help protect data-at-rest.

Found errors

The cdk.out/AwsSolutions-CdkTestStack-NagReport.csv contains more details about rule compliance, non-compliance, and suppressions.

Rule ID,Resource ID,Compliance,Exception Reason,Rule Level,Rule Info
"AwsSolutions-S1","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a stack level suppression.","Error","The S3 Bucket has server access logs disabled."
"AwsSolutions-S2","CdkTestStack/Bucket/Resource","Suppressed","Demonstrate a resource level suppression.","Error","The S3 Bucket does not have public access restricted and blocked."
"AwsSolutions-S3","CdkTestStack/Bucket/Resource","Non-Compliant","N/A","Error","The S3 Bucket does not default encryption enabled."
"AwsSolutions-S5","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 static website bucket either has an open world bucket policy or does not use a CloudFront Origin Access Identity (OAI) in the bucket policy for limited getObject and/or putObject permissions."
"AwsSolutions-S10","CdkTestStack/Bucket/Resource","Compliant","N/A","Error","The S3 Bucket does not require requests to use SSL."

Moreover, note that the resultant cdk.out/CdkTestStack.template.json template contains the cdk-nag suppression data. This provides transparency with what rules weren’t applied to an application, as the suppression data is included in the resources.

{
  "Metadata": {
    "cdk_nag": {
      "rules_to_suppress": [
        {
          "id": "AwsSolutions-S1",
          "reason": "Demonstrate a stack level suppression."
        }
      ]
    }
  },
  "Resources": {
    "BucketDEB6E181": {
      "Type": "AWS::S3::Bucket",
      "UpdateReplacePolicy": "Retain",
      "DeletionPolicy": "Retain",
      "Metadata": {
        "aws:cdk:path": "CdkTestStack/Bucket/Resource",
        "cdk_nag": {
          "rules_to_suppress": [
            {
              "id": "AwsSolutions-S2",
              "reason": "Demonstrate a resource level suppression."
            }
          ]
        }
      }
    },
  ...
  },
  ...
}

Reflecting on the Walkthrough

In this section, you learned how to apply a NagPack to your application, remediate/suppress warnings and errors, and review the compliance reports. The reporting and suppression systems provide mechanisms for the development and security teams within organizations to work together to identify and mitigate potential security/compliance issues. Security can choose which NagPacks developers should apply to their applications. Then, developers can use the feedback to quickly remediate issues. Security can use the reports to validate compliances. Furthermore, developers and security can work together to use suppressions to transparently document exceptions to rules that they’ve decided not to follow.

Advanced usage and further reading

This section briefly covers some advanced options for using cdk-nag.

Unit Testing with the AWS CDK Assertions Library

The Annotations submodule of the AWS CDK assertions library lets you check for cdk-nag warnings and errors without AWS credentials by integrating a NagPack into your application unit tests. Read this post for further information about the AWS CDK assertions module. The following is an example of using assertions with a TypeScript AWS CDK application and Jest for unit testing.

import { Annotations, Match } from 'aws-cdk-lib/assertions';
import { App, Aspects, Stack } from 'aws-cdk-lib';
import { AwsSolutionsChecks } from 'cdk-nag';
import { CdkTestStack } from '../lib/cdk_test-stack';

describe('cdk-nag AwsSolutions Pack', () => {
  let stack: Stack;
  let app: App;
  // In this case we can use beforeAll() over beforeEach() since our tests 
  // do not modify the state of the application 
  beforeAll(() => {
    // GIVEN
    app = new App();
    stack = new CdkTestStack(app, 'test');

    // WHEN
    Aspects.of(stack).add(new AwsSolutionsChecks());
  });

  // THEN
  test('No unsuppressed Warnings', () => {
    const warnings = Annotations.fromStack(stack).findWarning(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(warnings).toHaveLength(0);
  });

  test('No unsuppressed Errors', () => {
    const errors = Annotations.fromStack(stack).findError(
      '*',
      Match.stringLikeRegexp('AwsSolutions-.*')
    );
    expect(errors).toHaveLength(0);
  });
});

Additionally, many testing frameworks include watch functionality. This is a background process that reruns all of the tests when files in your project have changed for fast feedback. For example, when using the AWS CDK in JavaScript/Typescript, you can use the Jest CLI watch commands. When Jest watch detects a file change, it attempts to run unit tests related to the changed file. This can be used to automatically run cdk-nag-related tests when making changes to your AWS CDK application.

CDK Watch

When developing in non-production environments, consider using AWS CDK Watch with a NagPack for fast feedback. AWS CDK Watch attempts to synthesize and then deploy changes whenever you save changes to your files. Aspects are run during synthesis. Therefore, any NagPacks applied to your application will also run on save. As in the walkthrough, all of the unsuppressed errors will prevent deployments, all of the messages will be output to the console, and all of the compliance reports will be generated. Read this post for further information about AWS CDK Watch.

Conclusion

In this post, you learned how to use cdk-nag in your AWS CDK applications. To learn more about using cdk-nag in your applications, check out the README in the GitHub Repository. If you would like to learn how to create your own rules and NagPacks, then check out the developer documentation. The repository is open source and welcomes community contributions and feedback.

Author:

Arun Donti

Arun Donti is a Senior Software Engineer with Twitch. He loves working on building automated processes and tools that enable builders and organizations to focus on and deliver their mission critical needs. You can find him on GitHub.

A new Spark plugin for CPU and memory profiling

Post Syndicated from Bo Xiong original https://aws.amazon.com/blogs/devops/a-new-spark-plugin-for-cpu-and-memory-profiling/

Introduction

Have you ever wondered if there are low-hanging optimization opportunities to improve the performance of a Spark app? Profiling can help you gain visibility regarding the runtime characteristics of the Spark app to identify its bottlenecks and inefficiencies. We’re excited to announce the release of a new Spark plugin that enables profiling for JVM based Spark apps via Amazon CodeGuru. The plugin is open sourced on GitHub and published to Maven.

Walkthrough

This post shows how you can onboard this plugin with two steps in under 10 minutes.

  • Step 1: Create a profiling group in Amazon CodeGuru Profiler and grant permission to your Amazon EMR on EC2 role, so that profiler agents can emit metrics to CodeGuru. Detailed instructions can be found here.
  • Step 2: Reference codeguru-profiler-for-spark when submitting your Spark job, along with PROFILING_CONTEXT and ENABLE_AMAZON_PROFILER defined.

Prerequisites

Your app is built against Spark 3 and run on Amazon EMR release 6.x or newer. It doesn’t matter if you’re using Amazon EMR on Amazon Elastic Compute Cloud (Amazon EC2) or on Amazon Elastic Kubernetes Service (Amazon EKS).

Illustrative Example

For the purposes of illustration, consider the following example where profiling results are collected by the plugin and emitted to the “CodeGuru-Spark-Demo” profiling group.

spark-submit \
--master yarn \
--deploy-mode cluster \
--class \
--packages software.amazon.profiler:codeguru-profiler-for-spark:1.0 \
--conf spark.plugins=software.amazon.profiler.AmazonProfilerPlugin \
--conf spark.executorEnv.PROFILING_CONTEXT="{\\\"profilingGroupName\\\":\\\"CodeGuru-Spark-Demo\\\"}" \
--conf spark.executorEnv.ENABLE_AMAZON_PROFILER=true \
--conf spark.dynamicAllocation.enabled=false \t

An alternative way to specify PROFILING_CONTEXT and ENABLE_AMAZON_PROFILER is under the yarn-env.export classification for instance groups in the Amazon EMR web console. Note that PROFILING_CONTEXT, if configured in the web console, must escape all of the commas on top of what’s for the above spark-submit command.

[
  {
    "classification": "yarn-env",
    "properties": {},
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "ENABLE_AMAZON_PROFILER": "true",
          "PROFILING_CONTEXT": "{\\\"profilingGroupName\\\":\\\"CodeGuru-Spark-Demo\\\"\\,\\\"driverEnabled\\\":\\\"true\\\"}"
        },
        "configurations": []
      }
    ]
  }
]

Once the job above is launched on Amazon EMR, profiling results should show up in your CodeGuru web console in about 10 minutes, similar to the following screenshot. Internally, it has helped us identify issues, such as thread contentions (revealed by the BLOCKED state in the latency flame graph), and unnecessarily create AWS Java clients (revealed by the CPU Hotspots view).

Go to your profiling group under the Amazon CodeGuru web console. Click the “Visualize CPU” button to render a flame graph displaying CPU usage. Switch to the latency view to identify latency bottlenecks, and switch to the heap summary view to identify objects consuming most memory.

Troubleshooting

To help with troubleshooting, use a sample Spark app provided in the plugin to check if everything is set up correctly. Note that the profilingGroupName value specified in PROFILING_CONTEXT should match what’s created in CodeGuru.

spark-submit \
--master yarn \
--deploy-mode cluster \
--class software.amazon.profiler.SampleSparkApp \
--packages software.amazon.profiler:codeguru-profiler-for-spark:1.0 \
--conf spark.plugins=software.amazon.profiler.AmazonProfilerPlugin \
--conf spark.executorEnv.PROFILING_CONTEXT="{\\\"profilingGroupName\\\":\\\"CodeGuru-Spark-Demo\\\"}" \
--conf spark.executorEnv.ENABLE_AMAZON_PROFILER=true \
--conf spark.yarn.appMasterEnv.PROFILING_CONTEXT="{\\\"profilingGroupName\\\":\\\"CodeGuru-Spark-Demo\\\",\\\"driverEnabled\\\":\\\"true\\\"}" \
--conf spark.yarn.appMasterEnv.ENABLE_AMAZON_PROFILER=true \
--conf spark.dynamicAllocation.enabled=false \
/usr/lib/hadoop-yarn/hadoop-yarn-server-tests.jar

Running the command above from the master node of your EMR cluster should produce logs similar to the following:

21/11/21 21:27:21 INFO Profiler: Starting the profiler : ProfilerParameters{profilingGroupName='CodeGuru-Spark-Demo', threadSupport=BasicThreadSupport (default), excludedThreads=[Signal Dispatcher, Attach Listener], shouldProfile=true, integrationMode='', memoryUsageLimit=104857600, heapSummaryEnabled=true, stackDepthLimit=1000, samplingInterval=PT1S, reportingInterval=PT5M, addProfilerOverheadAsSamples=true, minimumTimeForReporting=PT1M, dontReportIfSampledLessThanTimes=1}
21/11/21 21:27:21 INFO ProfilingCommandExecutor: Profiling scheduled, sampling rate is PT1S
...
21/11/21 21:27:23 INFO ProfilingCommand: New agent configuration received : AgentConfiguration(AgentParameters={MaxStackDepth=1000, MinimumTimeForReportingInMilliseconds=60000, SamplingIntervalInMilliseconds=1000, MemoryUsageLimitPercent=10, ReportingIntervalInMilliseconds=300000}, PeriodInSeconds=300, ShouldProfile=true)
21/11/21 21:32:23 INFO ProfilingCommand: Attempting to report profile data: start=2021-11-21T21:27:23.227Z end=2021-11-21T21:32:22.765Z force=false memoryRefresh=false numberOfTimesSampled=300
21/11/21 21:32:23 INFO javaClass: [HeapSummary] Processed 20 events.
21/11/21 21:32:24 INFO ProfilingCommand: Successfully reported profile

Note that the CodeGuru Profiler agent uses a reporting interval of five minutes. Therefore, any executor process shorter than five minutes won’t be reflected by the profiling result. If the right profiling group is not specified, or it’s associated with a wrong EC2 role in CodeGuru, then the log will show a message similar to “CodeGuruProfilerSDKClient: Exception while calling agent orchestration” along with a stack trace including a 403 status code. To rule out any network issues (e.g., your EMR job running in a VPC without an outbound gateway or a misconfigured outbound security group), then you can remote into an EMR host and ping the CodeGuru endpoint in your Region (e.g., ping codeguru-profiler.us-east-1.amazonaws.com).

Cleaning up

To avoid incurring future charges, you can delete the profiling group configured in CodeGuru and/or set the ENABLE_AMAZON_PROFILER environment variable to false.

Conclusion

In this post, we describe how to onboard this plugin with two steps. Consider to give it a try for your Spark app? You can find the Maven artifacts here. If you have feature requests, bug reports, feedback of any kind, or would like to contribute, please head over to the GitHub repository.

Author:

Bo Xiong

Bo Xiong is a software engineer with Amazon Ads, leveraging big data technologies to process petabytes of data for billing and reporting. His main interests include performance tuning and optimization for Spark on Amazon EMR, and data mining for actionable business insights.

LGPD workbook for AWS customers managing personally identifiable information in Brazil

Post Syndicated from Rodrigo Fiuza original https://aws.amazon.com/blogs/security/lgpd-workbook-for-aws-customers-managing-personally-identifiable-information-in-brazil/

Portuguese version

AWS is pleased to announce the publication of the Brazil General Data Protection Law Workbook.

The General Data Protection Law (LGPD) in Brazil was first published on 14 August 2018, and started its applicability on 18 August 2020. Companies that manage personally identifiable information (PII) in Brazil as defined by LGPD will have to comply with and attend to the law.

To better help customers prepare and implement controls that focus on LGPD Chapter VII Security and Best Practices, AWS created a workbook based on industry best practices, AWS service offerings, and controls.

Amongst other topics, this workbook covers information security and AWS controls from:

In combination with Brazil General Data Protection Law Workbook, customers can use the detailed Navigating LGPD Compliance on AWS whitepaper.

AWS adheres to a shared responsibility model. Customers will have to observe which services offer privacy features and determine their applicability to their specific compliance requirements. Further information about data privacy at AWS can be found at our Data Privacy Center. Specific information about LGPD and data privacy at AWS in Brazil can be found on our Brazil Data Privacy page.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.
Want more AWS Security news? Follow us on Twitter.
 


Portuguese

Workbook da LGPD para Clientes AWS que gerenciam Informações de Identificação Pessoal no Brasil

A AWS tem o prazer de anunciar a publicação do Workbook Lei Geral de Proteção de Dados do Brasil.

A Lei Geral de Proteção de Dados (LGPD) teve sua primeira publicação em 14 de agosto de 2018 no Brasil e iniciou sua aplicabilidade em 18 de agosto de 2020. Empresas que gerenciam informações pessoais identificáveis (PII) conforme definido na LGPD terão que cumprir e atender às cláusulas da lei.

Para ajudar melhor os clientes a preparar e implementar controles que se concentram no Capítulo VII da LGPD “da Segurança e Boas Práticas”, a AWS criou uma pasta de trabalho com base nas melhores práticas do setor, ofertas de serviços e controles da AWS.

Entre outros tópicos, esta pasta de trabalho aborda a segurança da informação e os controles da AWS de:

Em combinação com o Workbook Lei Geral de Proteção de Dados do Brasil, os clientes podem usar o whitepaper detalhado Navegando na conformidade com a LGPD na AWS.

A AWS adere a um modelo de responsabilidade compartilhada. Clientes terão que observar quais serviços oferecem recursos de privacidade e determinar sua aplicabilidade aos seus requisitos específicos de compliance. Mais informações sobre a privacidade de dados na AWS podem ser encontradas em nosso Centro de Privacidade de Dados. Informações adicionais sobre LGPD e Privacidade de dados na AWS no Brasil podem ser encontradas em nossa página de Privacidade de Dados no Brasil.

Para saber mais sobre nossos programas de conformidade e segurança, consulte Programas de conformidade da AWS. Como sempre, valorizamos seus comentários e perguntas; entre em contato com a equipe de conformidade da AWS por meio da página Fale conosco.

Se você tiver feedback sobre esta postagem, envie comentários na seção Comentários abaixo.

Quer mais notícias sobre segurança da AWS? Siga-nos no Twitter.

Author

Rodrigo Fiuza

Rodrigo is a Security Audit Manager at AWS, based in São Paulo. He leads audits, attestations, certifications, and assessments across Latin America, Caribbean and Europe. Rodrigo has previously worked in risk management, security assurance, and technology audits for the past 12 years.

Building Blue/Green application deployment to Micro Focus Enterprise Server

Post Syndicated from Kevin Yung original https://aws.amazon.com/blogs/devops/building-blue-green-application-deployment-to-micro-focus-enterprise-server/

Organizations running mainframe production workloads often follow the traditional approach of application deployment. To release new features of existing applications into production, the application is redeployed using the new version of software on the existing infrastructure. This poses the following challenges:

  • The cutover of the application deployment from testing to production usually takes place during a planned outage window with associated downtime.
  • Rollback is difficult, since the earlier version of the software must be redeployed from scratch on the existing infrastructure. This may result in applications being unavailable for longer durations owing to the rollback.
  • Due to differences in testing and production environments, some defects may leak into production, affecting the application code quality and thus increasing the number of production outages

Automated, robust application deployment is recognized as a prime driver for moving from a Mainframe to AWS, as service stability, security, and quality can be better managed. In this post, you will learn how to build Blue/Green (zero-downtime) deployments for mainframe applications rehosted to Micro Focus Enterprise Server with AWS Developer Tools (AWS CodeBuild, CodePipeline, and CodeDeploy).

This is a continuation of our previous post “Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite”. In our last post, we explained how you can implement a pattern for continuous integration and testing of mainframe applications with AWS Developer tools and Micro Focus Enterprise Suite. If you haven’t already checked it out, then we strongly recommend that you read through it before proceeding to the rest of this post.

Overview of solution

In this section, we explain the three important design “ingredients” to be implemented in the overall solution:

  1. Implementation of Enterprise Server Performance and Availability Cluster (PAC)
  2. End-to-end design of CI/CD pipeline for multiple teams development
  3. Blue/green deployment process for a rehosted mainframe application

First, let’s look at the solution design for the Micro Focus Enterprise Server PAC cluster.

Overview of Micro Focus Enterprise Server Performance and Availability Cluster (PAC)

In the Blue/Green deployment solution, Micro Focus Enterprise Server is the hosting environment for mainframe applications with the software installed into Amazon EC2 instances. Application deployment in Amazon EC2 Auto Scaling is one of the critical requirements to build a Blue/Green deployment. Micro Focus Enterprise Server PAC technology is the feature that allows for the Auto Scaling of Enterprise Server instances. For details on how to build Micro Focus Enterprise PAC Cluster with Amazon EC2 Auto Scaling and Systems Manager, see our AWS Prescriptive Guidance document. An overview of the infrastructure architecture is shown in the following figure, and the following table explains the components in the architecture.

Infrastructure architecture overview for blue/green application deployment to Micro Focus Enterprise Server

Components Description
Micro Focus Enterprise Servers Deploy applications to Micro Focus Enterprise Servers PAC in Amazon EC2 Auto Scaling Group.
Micro Focus Enterprise Server Common Web Administration (ESCWA) Manage Micro Focus Enterprise Server PAC with ESCWA server, e.g., Adding or Removing Enterprise Server to/from a PAC.
Relational Database for both user and system data files Setup Amazon Aurora RDS Instance in Multi-AZ to host both user and system data files to be shared across the Enterprise server instances.
Micro Focus Enterprise Server Scale-Out Repository (SOR) Setup an Amazon ElastiCache Redis Instance and replicas in Multi-AZ to host user data.
Application endpoint and load balancer Setup a Network Load Balancer to provide a hostname for end users to connect the application, e.g., accessing the application through a 3270 emulator.

CI/CD Pipelines design supporting multi-streams of mainframe development

In a previous DevOps post, Automate thousands of mainframe tests on AWS with the Micro Focus Enterprise Suite, we introduced two levels of pipelines. The first level of pipeline is used by mainframe project teams to test project scope changes. The second level of the pipeline is used for system integration tests, where the pipeline will perform tests for all of the promoted changes from the project pipelines and perform extensive systems tests.

In this post, we are extending the two levels pipeline to add a production deployment pipeline. When system testing is complete and successful, the tested application artefacts are promoted to the production pipeline in preparation for live production release. The following figure depicts each stage of the three levels of CI/CD pipeline and the purpose of each stage.

Different levels of CI/CD pipeline - Project Team Pipeline, Systems Test Pipeline and Production Deployment Pipeline

Let’s look at the artifact promotion to production pipeline in greater detail. The Systems Test Pipeline promotes the tested artifacts in binary format into an Amazon S3 bucket and the S3 event triggers production pipeline to kick-off. This artifact promotion process can be gated using a manual approval action in CodePipeline. For customers who want to have a fully automated continuous deployment, the manual promotion approval step can be removed.

The following diagram shows the AWS Stages in AWS CodePipeline of the production deployment pipeline:

Stages in production deployment pipeline using AWS CodePipeline

After the production pipeline is kicked off, it downloads the new version artifact from the S3 bucket. See the details of how to setup the S3 bucket as a Source of CodePipeline in the document AWS CodePipeline Document S3 as Source

In the following section, we explain each of these pipeline stages in detail:

  1. It prepares and packages a new version of production configuration artifacts, for example, the Micro Focus Enterprise Server config file, blue/green deployment scripts etc.
  2. Use in the CodeBuild Project to kick off an application blue/green deployment with AWS CodeDeploy.
  3. Use a manual approval gate to wait for an operator to validate the new version of the application and approve to continue the production traffic switch
  4. Continue the blue/green deployment by allowing traffic to the new version of the application and block the traffic to the old version.
  5. After a successful Blue/Green switch and deployment, tag the production version in the code repository.

Now that you’ve seen the pipeline design, we will dive deep into the details of the blue/green deployment with AWS CodeDeploy.

Blue/green deployment with AWS CodeDeploy

In the blue/green deployment, we used the technique of swapping Auto Scaling Group behind an Elastic Load Balancer. Refer to the AWS Blue/Green deployment whitepaper for the details of the technique. As AWS CodeDeploy is a fully-managed service that automates software deployment, it is used to automate the entire Blue/Green process.

Firstly, the following best practices are applied to setup the Enterprise Server’s infrastructure:

  1. AWS Image Builder is used to install Micro Focus Enterprise Server software and AWS CodeDeploy Agent into Amazon Machine Image (AMI). Create an EC2 Launch Template with the Enterprise Server AMI ID.
  2. A Network Load Balancer is used to setup a TCP connection health check to validate that Micro Focus Enterprise Server is listening on the required ports, e.g., port 9270, so that connectivity is available for 3270 emulators.
  3. A script was created to confirm application deployment validity in each EC2 instance. This is achieved by using a PowerShell script that triggers a CICS transaction from the Micro Focus Enterprise Server command line interface.

In the CodePipeline, we created a CodeBuild project to create a new deployment with CodeDeploy. We will go into the details of the CodeBuild buildspec.yaml configuration.

In the CodeBuild buildspec.yaml’s pre_build section, we used the following steps:

In the pre-build stage, the CodeBuild will perform two steps:

  1. Create an initial Amazon EC2 Auto Scaling using Micro Focus Enterprise Server AMI and a Launch Template for the first-time deployment of the application.
  2. Use AWS CLI to update the initial Auto Scaling Group name into a Systems Manager Parameter Store, and it will later be used by CodeDeploy to create a copy during the blue/green deployment.

In the build stage, the buildspec will perform the following steps:

  1. Retrieve the Auto Scaling Group name of the Enterprise Servers from the Systems Manager Parameter Store.
  2. Then, a blue/green deployment configuration is created for the deployment group of the application. In the AWS CLI command, we use the WITH_TRAFFIC_CONTROL option to let us manually verify and approve before switching the traffic to the new version of the application. The command snippet is shown here.
BlueGreenConf=\
        "terminateBlueInstancesOnDeploymentSuccess={action=TERMINATE}"\
        ",deploymentReadyOption={actionOnTimeout=STOP_DEPLOYMENT,waitTimeInMinutes=600}" \
        ",greenFleetProvisioningOption={action=COPY_AUTO_SCALING_GROUP}"

DeployType="BLUE_GREEN,deploymentOption=WITH_TRAFFIC_CONTROL"

/usr/local/bin/aws deploy update-deployment-group \
      --application-name "${APPLICATION_NAME}" \
     --current-deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
     --auto-scaling-groups "${AsgName}" \
      --load-balancer-info targetGroupInfoList=[{name="${TARGET_GROUP_NAME}"}] \
      --deployment-style "deploymentType=$DeployType" \
      --Blue/Green-deployment-configuration "$BlueGreenConf"
  1. Next, the new version of application binary is released from the CodeBuild source DemoBinto the production S3 bucket.
release="bankdemo-$(date '+%Y-%m-%d-%H-%M').tar.gz"
RELEASE_FILE="s3://${PRODUCTION_BUCKET}/${release}"

/usr/local/bin/aws deploy push \
    --application-name ${APPLICATION_NAME} \
    --description "version - $(date '+%Y-%m-%d %H:%M')" \
    --s3-location ${RELEASE_FILE} \
    --source ${CODEBUILD_SRC_DIR_DemoBin}/
  1. Create a new deployment for the application to initiate the Blue/Green switch.
/usr/local/bin/aws deploy create-deployment \
    --application-name ${APPLICATION_NAME} \
    --s3-location bucket=${PRODUCTION_BUCKET},key=${release},bundleType=zip \
    --deployment-group-name "${DEPLOYMENT_GROUP_NAME}" \
    --description "Bankdemo Production Deployment ${release}"\
    --query deploymentId \
    --output text

After setting up the deployment options, the following is a snapshot of a deployment configuration from the AWS Management Console.

Snapshot of deployment configuration from AWS Management Console

In the AWS Post “Under the Hood: AWS CodeDeploy and Auto Scaling Integration”, we explain how AWS CodeDeploy sets up Auto Scaling lifecycle hooks to listen for Auto Scaling events. In the event of an EC2 instance launch and termination, AWS CodeDeploy can instruct its agent in the instance to run the prepared scripts.

In the following table, we list each stage in a blue/green deployment and the tasks that ran.

Hooks Tasks
BeforeInstall Create application folder structures in the newly launched Amazon EC2 and prepare for installation
  AfterInstall Enable Windows Firewall Rule for application traffic
Activate Micro Focus License using License Server
Prepare Production Database Connections
Import config to create Region in Micro Focus Enterprise Server
Deploy the latest application binaries into each of the Micro Focus Enterprise Servers
ApplicationStart Use AWS CLI to start a Systems Manager Automation “Scale-Out” runbook with the target of ESCWA server
The Automation runbook will add the newly launched Micro Focus Enterprise Server instance into a PAC
The Automation runbook will start the imported region in the newly launched Micro Focus Enterprise Server
Validate that the application is listening on a service port, for example, port 9270
Use the Micro Focus command “castran” to run an online transaction in Micro Focus Enterprise Server to validate the service status
AfterBlockTraffic Use AWS CLI to start a Systems Manager Automation “Scale-In” runbook with the target ESCWA server
The Automation runbook will try stopping the Region in the terminating EC2 instance
The Automation runbook will remove the Enterprise Server instance from the PAC

The tasks in the table are automated using PowerShell, and the scripts are used in appspec.yml config for CodeDeploy to orchestrate the deployment.

In the following appspec.yml, the locations of the binary files to be installed are defined in addition to the Micro Focus Enterprise Server Region XML config file. During the AfrerInstall stage, the XML config is imported into the Enterprise Server.

version: 0.0
os: windows
files:
  - source: scripts
    destination: C:\scripts\
  - source: online
    destination: C:\BANKDEMO\online\
  - source: common
    destination: C:\BANKDEMO\common\
  - source: batch
    destination: C:\BANKDEMO\batch\
  - source: scripts\BANKDEMO.xml
    destination: C:\BANKDEMO\
hooks:
  BeforeInstall: 
    - location: scripts\BeforeInstall.ps1
      timeout: 300
  AfterInstall: 
    - location: scripts\AfterInstall.ps1    
  ApplicationStart:
    - location: scripts\ApplicationStart.ps1
      timeout: 300
  ValidateService:
    - location: scripts\ValidateServer.cmd
      timeout: 300
  AfterBlockTraffic:
    - location: scripts\AfterBlockTraffic.ps1

Using the sample Micro Focus Bankdemo application, and the steps outlined above, we have setup a blue/green deployment process in Micro Focus Enterprise Server.

There are four important considerations when setting up blue/green deployment:

  1. For batch applications, the blue/green deployment should be invoked only outside of the scheduled “batch window”.
  2. For online applications, AWS CodeDeploy will deregister the Auto Scaling group from the target group of the Network Load Balancer. The deregistration may take a while as the server has to finish processing the ongoing requests before it can continue deployment of the new application instance. In this case, enabling Elastic Load Balancing connection draining feature with appropriate timeout value can minimize the risk of closing unfinished transactions. In addition, consider doing deployment in low-traffic windows to improve the deployment speeds.
  3. For application changes that require updates to the database schema, the version roll-forward and rollback can be managed via DB migrations tools, e.g., Flyway and Fluent Migrator.
  4. For testing in production environments, adherence to any regulatory compliance, such as full audit trail of events, must be considered.

Conclusion

In this post, we introduced the solution to use Micro Focus Enterprise Server PAC, Amazon EC2 Auto Scaling, AWS Systems Manager, and AWS CodeDeploy to automate the blue/green deployment of rehosted mainframe applications in AWS.

Through the blue/green deployment methodology, we can shift traffic between two identical clusters running different application versions in parallel. This mitigates the risks commonly associated with mainframe application deployment, namely downtime and rollback capacity, while ensure higher code quality in production through “Shift Right” testing.

A demo of the solution is available on the AWS Partner Micro Focus website [Solution-Demo]. If you’re interested in modernizing your mainframe applications, then please contact Micro Focus and AWS mainframe business development at [email protected].

Additional Information

About the authors

Kevin Yung

Kevin Yung

Kevin is a Senior Modernization Architect in AWS Professional Services Global Mainframe and Midrange Modernization (GM3) team. Kevin currently is focusing on leading and delivering mainframe and midrange applications modernization for large enterprise customers.

Krithika Palani Selvam

Krithika is a Senior Modernization Architect in AWS Professional Services Global Mainframe and Midrange Modernization (GM3) team. She is currently working with enterprise customers for migrating and modernizing mainframe and midrange applications to cloud.

Peter Woods

Peter Woods has been with Micro Focus for over 30 years <within the Application Modernisation & Connectivity portfolio>. His diverse range of roles has included Technical Support, Channel Sales, Product Management, Strategic Alliances Management and Pre-Sales and was primarily based in the UK. In 2017 Peter re-located to Melbourne, Australia and in his current role of AM2C APJ Regional Technical Leader and ANZ Pre-Sales Manager, he is charged with driving and supporting Application Modernisation sales activity across the APJ region.

Abraham Mercado Rondon

Abraham Rondon is a Solutions Architect working on Micro Focus Enterprise Solutions for the Application Modernization team based in Melbourne. After completing a degree in Statistics and before joining Micro Focus, Abraham had a long career in supporting Mainframe Applications in different countries doing progressive roles from Developer to Production Support, Business and Technical Analyst, and Project Team Lead.  Now, a vital part of the Micro Focus Application Modernization team, one of his main focus is Cloud implementations of mainframe DevOps and production workload rehost.

How to secure API Gateway HTTP endpoints with JWT authorizer

Post Syndicated from Siva Rajamani original https://aws.amazon.com/blogs/security/how-to-secure-api-gateway-http-endpoints-with-jwt-authorizer/

This blog post demonstrates how you can secure Amazon API Gateway HTTP endpoints with JSON web token (JWT) authorizers. Amazon API Gateway helps developers create, publish, and maintain secure APIs at any scale, helping manage thousands of API calls. There are no minimum fees, and you only pay for the API calls you receive.

Based on customer feedback and lessons learned from building the REST and WebSocket APIs, AWS launched HTTP APIs for Amazon API Gateway, a service built to be fast, low cost, and simple to use. HTTP APIs offer a solution for building APIs, as well as multiple mechanisms for controlling and managing access through AWS Identity and Access Management (IAM) authorizers, AWS Lambda authorizers, and JWT authorizers.

This post includes step-by-step guidance for setting up JWT authorizers using Amazon Cognito as the identity provider, configuring HTTP APIs to use JWT authorizers, and examples to test the entire setup. If you want to protect HTTP APIs using Lambda and IAM authorizers, you can refer to Introducing IAM and Lambda authorizers for Amazon API Gateway HTTP APIs.

Prerequisites

Before you can set up a JWT authorizer using Cognito, you first need to create three Lambda functions. You should create each Lambda function using the following configuration settings, permissions, and code:

  1. The first Lambda function (Pre-tokenAuthLambda) is invoked before the token generation, allowing you to customize the claims in the identity token.
  2. The second Lambda function (LambdaForAdminUser) acts as the HTTP API Gateway integration target for /AdminUser HTTP API resource route.
  3. The third Lambda function (LambdaForRegularUser) acts as the HTTP API Gateway integration target for /RegularUser HTTP API resource route.

IAM policy for Lambda function

You first need to create an IAM role using the following IAM policy for each of the three Lambda functions:

	{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "logs:CreateLogGroup",
			"Resource": "arn:aws:logs:us-east-1:<AWS Account Number>:*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"logs:CreateLogStream",
				"logs:PutLogEvents"
			],
			"Resource": [
				"arn:aws:logs:us-east-1:<AWS Account Number>:log-group:/aws/lambda/<Name of the Lambda functions>:*"
			]
		}
	]
} 

Settings for the required Lambda functions

For the three Lambda functions, use these settings:

Function name Enter an appropriate name for the Lambda function, for example:

  • Pre-tokenAuthLambda for the first Lambda
  • LambdaForAdminUser for the second
  • LambdaForRegularUser for the third
Runtime

Choose Node.js 12.x

Permissions Choose Use an existing role and select the role you created with the IAM policy in the Prerequisites section above.

Pre-tokenAuthLambda code

This first Lambda code, Pre-tokenAuthLambda, converts the authenticated user’s Cognito group details to be returned as the scope claim in the id_token returned by Cognito.

	exports.lambdaHandler = async (event, context) => {
		let newScopes = event.request.groupConfiguration.groupsToOverride.map(item => `${item}-${event.callerContext.clientId}`)
	event.response = {
		"claimsOverrideDetails": {
			"claimsToAddOrOverride": {
				"scope": newScopes.join(" "),
			}
		}
  	};
  	return event
}

LambdaForAdminUser code

This Lambda code, LambdaForAdminUser, acts as the HTTP API Gateway integration target and sends back the response Hello from Admin User when the /AdminUser resource path is invoked in API Gateway.

	exports.handler = async (event) => {

		const response = {
			statusCode: 200,
			body: JSON.stringify('Hello from Admin User'),
		};
		return response;
	};

LambdaForRegularUser code

This Lambda code, LambdaForRegularUser , acts as the HTTP API Gateway integration target and sends back the response Hello from Regular User when the /RegularUser resource path is invoked within API Gateway.

	exports.handler = async (event) => {

		const response = {
			statusCode: 200,
			body: JSON.stringify('Hello from Regular User'),
		};
		return response;
	};

Deploy the solution

To secure the API Gateway resources with JWT authorizer, complete the following steps:

  1. Create an Amazon Cognito User Pool with an app client that acts as the JWT authorizer
  2. Create API Gateway resources and secure them using the JWT authorizer based on the configured Amazon Cognito User Pool and app client settings.

The procedures below will walk you through the step-by-step configuration.

Set up JWT authorizer using Amazon Cognito

The first step to set up the JWT authorizer is to create an Amazon Cognito user pool.

To create an Amazon Cognito user pool

  1. Go to the Amazon Cognito console.
  2. Choose Manage User Pools, then choose Create a user pool.
    Figure 1: Create a user pool

    Figure 1: Create a user pool

  3. Enter a Pool name, then choose Review defaults.
    Figure 2: Review defaults while creating the user pool

    Figure 2: Review defaults while creating the user pool

  4. Choose Add app client.
    Figure 3: Add an app client for the user pool

    Figure 3: Add an app client for the user pool

  5. Enter an app client name. For this example, keep the default options. Choose Create app client to finish.
    Figure 4: Review the app client configuration and create it

    Figure 4: Review the app client configuration and create it

  6. Choose Return to pool details, and then choose Create pool.
    Figure 5: Complete the creation of user pool setup

    Figure 5: Complete the creation of user pool setup

To configure Cognito user pool settings

Now you can configure app client settings:

  1. On the left pane, choose App client settings. In Enabled Identity Providers, select the identity providers you want for the apps you configured in the App Clients tab.
  2. Enter the Callback URLs you want, separated by commas. These URLs apply to all selected identity providers.
  3. Under OAuth 2.0, select the from the following options.
    • For Allowed OAuth Flows, select Authorization code grant.
    • For Allowed OAuth Scopes, select phone, email, openID, and profile.
  4. Choose Save changes.
    Figure 6: Configure app client settings

    Figure 6: Configure app client settings

  5. Now add the domain prefix to use for the sign-in pages hosted by Amazon Cognito. On the left pane, choose Domain name and enter the appropriate domain prefix, then Save changes.
    Figure 7: Choose a domain name prefix for the Amazon Cognito domain

    Figure 7: Choose a domain name prefix for the Amazon Cognito domain

  6. Next, create the pre-token generation trigger. On the left pane, choose Triggers and under Pre Token Generation, select the Pre-tokenAuthLambda Lambda function you created in the Prerequisites procedure above, then choose Save changes.
    Figure 8: Configure Pre Token Generation trigger Lambda for user pool

    Figure 8: Configure Pre Token Generation trigger Lambda for user pool

  7. Finally, create two Cognito groups named admin and regular. Create two Cognito users named adminuser and regularuser. Assign adminuser to both admin and regular group. Assign regularuser to regular group.
    Figure 9: Create groups and users for user pool

    Figure 9: Create groups and users for user pool

Configuring HTTP endpoints with JWT authorizer

The first step to configure HTTP endpoints is to create the API in the API Gateway management console.

To create the API

  1. Go to the API Gateway management console and choose Create API.
    Figure 10: Create an API in API Gateway management console

    Figure 10: Create an API in API Gateway management console

  2. Choose HTTP API and select Build.
    Figure 11: Choose Build option for HTTP API

    Figure 11: Choose Build option for HTTP API

  3. Under Create and configure integrations, enter JWTAuth for the API name and choose Review and Create.
    Figure 12: Create Integrations for HTTP API

    Figure 12: Create Integrations for HTTP API

  4. Once you’ve created the API JWTAuth, choose Routes on the left pane.
    Figure 13: Navigate to Routes tab

    Figure 13: Navigate to Routes tab

  5. Choose Create a route and select GET method. Then, enter /AdminUser for the path.
    Figure 14: Create the first route for HTTP API

    Figure 14: Create the first route for HTTP API

  6. Repeat step 5 and create a second route using the GET method and /RegularUser for the path.
    Figure 15: Create the second route for HTTP API

    Figure 15: Create the second route for HTTP API

To create API integrations

  1. Now that the two routes are created, select Integrations from the left pane.
    Figure 16: Navigate to Integrations tab

    Figure 16: Navigate to Integrations tab

  2. Select GET for the /AdminUser resource path, and choose Create and attach an integration.
    Figure 17: Attach an integration to first route

    Figure 17: Attach an integration to first route

  3. To create an integration, select the following values

    Integration type: Lambda function
    Integration target: LambdaForAdminUser

  4. Choose Create.
    NOTE: LambdaForAdminUser is the Lambda function you previously created as part of the Prerequisites procedure LambdaForAdminUser code.
    Figure 18: Create an integration for first route

    Figure 18: Create an integration for first route

  5. Next, select GET for the /RegularUser resource path and choose Create and attach an integration.
    Figure 19: Attach an integration to second route

    Figure 19: Attach an integration to second route

  6. To create an integration, select the following values

    Integration type: Lambda function
    Integration target: LambdaForRegularUser

  7. Choose Create.
    NOTE: LambdaForRegularUser is the Lambda function you previously created as part of the Prerequisites procedure LambdaForRegularUser code.
    Figure 20: Create an integration for the second route

    Figure 20: Create an integration for the second route

To configure API authorization

  1. Select Authorization from the left pane, select /AdminUser path and choose Create and attach an authorizer.
    Figure 21: Navigate to Authorization left pane option to create an authorizer

    Figure 21: Navigate to Authorization left pane option to create an authorizer

  2. For Authorizer type select JWT and under Authorizer settings enter the following details:

    Name: JWTAuth
    Identity source: $request.header.Authorization
    Issuer URL: https://cognito-idp.us-east1.amazonaws.com/<your_userpool_id>
    Audience: <app_client_id_of_userpool>
  3. Choose Create.
    Figure 22: Create and attach an authorizer to HTTP API first route

    Figure 22: Create and attach an authorizer to HTTP API first route

  4. In the Authorizer for route GET /AdminUser screen, choose Add scope in the Authorization Scope section and enter scope name as admin-<app_client_id> and choose Save.
    Figure 23: Add authorization scopes to first route of HTTP API

    Figure 23: Add authorization scopes to first route of HTTP API

  5. Now select the /RegularUser path and from the dropdown, select the JWTAuth authorizer you created in step 3. Choose Attach authorizer.
    Figure 24: Attach an authorizer to HTTP API second route

    Figure 24: Attach an authorizer to HTTP API second route

  6. Choose Add scope and enter the scope name as regular-<app_client_id> and choose Save.
    Figure 25: Add authorization scopes to second route of HTTP API

    Figure 25: Add authorization scopes to second route of HTTP API

  7. Enter Test as the Name and then choose Create.
    Figure 26: Create a stage for HTTP API

    Figure 26: Create a stage for HTTP API

  8. Under Select a stage, enter Test, and then choose Deploy to stage.
    Figure 27: Deploy HTTP API to stage

    Figure 27: Deploy HTTP API to stage

Test the JWT authorizer

You can use the following examples to test the API authentication. We use Curl in this example, but you can use any HTTP client.

To test the API authentication

  1. Send a GET request to the /RegularUser HTTP API resource without specifying any authorization header.
    curl -s -X GET https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/RegularUser

    API Gateway returns a 401 Unauthorized response, as expected.

    {“message”:”Unauthorized”}

  2. The required $request.header.Authorization identity source is not provided, so the JWT authorizer is not called. Supply a valid Authorization header key and value. You authenticate as the regularuser, using the aws cognito-idp initiate-auth AWS CLI command.
    aws cognito-idp initiate-auth --auth-flow USER_PASSWORD_AUTH --client-id <Cognito User Pool App Client ID> --auth-parameters USERNAME=regularuser,PASSWORD=<Password for regularuser>

    CLI Command response:

    
    {
    	"ChallengeParameters": {},
    	"AuthenticationResult": {
    		"AccessToken": "6f5e4d3c2b1a111112222233333xxxxxzz2yy",
    		"ExpiresIn": 3600,
    		"TokenType": "Bearer",
    		"RefreshToken": "xyz123abc456dddccc0000",
    		"IdToken": "aaabbbcccddd1234567890"
    	}
    }

    The command response contains a JWT (IdToken) that contains information about the authenticated user. This information can be used as the Authorization header value.

    curl -H "Authorization: aaabbbcccddd1234567890" -s -X GET https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/RegularUser

  3. API Gateway returns the response Hello from Regular User. Now test access for the /AdminUser HTTP API resource with the JWT token for the regularuser.
    curl -H "Authorization: aaabbbcccddd1234567890" -s -X GET "https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/AdminUser"

    API Gateway returns a 403 – Forbidden response.
    {“message”:”Forbidden”}
    The JWT token for the regularuser does not have the authorization scope defined for the /AdminUser resource, so API Gateway returns a 403 – Forbidden response.

  4. Next, log in as adminuser and validate that you can successfully access both /RegularUser and /AdminUser resource. You use the cognito-idp initiate-auth AWS CLI command.
  5. aws cognito-idp initiate-auth --auth-flow USER_PASSWORD_AUTH --client-id <Cognito User Pool App Client ID> --auth-parameters USERNAME=adminuser,PASSWORD==<Password for adminuser>

    CLI Command response:

    
    {
    	"ChallengeParameters": {},
    	"AuthenticationResult": {
    		"AccessToken": "a1b2c3d4e5c644444555556666Y2X3Z1111",
    		"ExpiresIn": 3600,
    		"TokenType": "Bearer",
    		"RefreshToken": "xyz654cba321dddccc1111",
    		"IdToken": "a1b2c3d4e5c6aabbbcccddd"
    	}
    }

  6. Using Curl, you can validate that the adminuser JWT token now has access to both the /RegularUser resource and the /AdminUser resource. This is possible when adminuser is part of both Cognito groups, so the JWT token contains both authorization scopes.
    curl -H "Authorization: a1b2c3d4e5c6aabbbcccddd" -s -X GET https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/RegularUser

    API Gateway returns the response Hello from Regular User

    curl -H "Authorization: a1b2c3d4e5c6aabbbcccddd" -s -X GET https://a1b2c3d4e5.execute-api.us-east-1.amazonaws.com/AdminUser

    API Gateway returns the following response Hello from Admin User

Conclusion

AWS enabled the ability to manage access to an HTTP API in API Gateway in multiple ways: with Lambda authorizers, IAM roles and policies, and JWT authorizers. This post demonstrated how you can secure API Gateway HTTP API endpoints with JWT authorizers. We configured a JWT authorizer using Amazon Cognito as the identity provider (IdP). You can achieve the same results with any IdP that supports OAuth 2.0 standards. API Gateway validates the JWT that the client submits with API requests. API Gateway allows or denies requests based on token validation along with the scope of the token. You can configure distinct authorizers for each route of an API, or use the same authorizer for multiple routes.

To learn more, we recommend:

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Siva Rajamani

Siva is a Boston-based Enterprise Solutions Architect. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are Serverless, Application Integration, and Security.

Author

Sudhanshu Malhotra

Sudhanshu is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, Machine Learning, and Security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.

Author

Rajat Mathur

Rajat is a Sr. Solutions Architect at Amazon Web Services. Rajat is a passionate technologist who enjoys building innovative solutions for AWS customers. His core areas of focus are IoT, Networking and Serverless computing. In his spare time, Rajat enjoys long drives, traveling and spending time with family.

Security practices in AWS multi-tenant SaaS environments

Post Syndicated from Keith P original https://aws.amazon.com/blogs/security/security-practices-in-aws-multi-tenant-saas-environments/

Securing software-as-a-service (SaaS) applications is a top priority for all application architects and developers. Doing so in an environment shared by multiple tenants can be even more challenging. Identity frameworks and concepts can take time to understand, and forming tenant isolation in these environments requires deep understanding of different tools and services.

While security is a foundational element of any software application, specific considerations apply to SaaS applications. This post dives into the challenges, opportunities and best practices for securing multi-tenant SaaS environments on Amazon Web Services (AWS).

SaaS application security considerations

Single tenant applications are often deployed for a specific customer, and typically only deal with this single entity. While security is important in these environments, the threat profile does not include potential access by other customers. Multi-tenant SaaS applications have unique security considerations when compared to single tenant applications.

In particular, multi-tenant SaaS applications must pay special attention to identity and tenant isolation. These considerations are in addition to the security measures all applications must take. This blog post reviews concepts related to identity and tenant isolation, and how AWS can help SaaS providers build secure applications.

Identity

SaaS applications are accessed by individual principals (often referred to as users). These principals may be interactive (for example, through a web application) or machine-based (for example, through an API). Each principal is uniquely identified, and is usually associated with information about the principal, including email address, name, role and other metadata.

In addition to the unique identification of each individual principal, a SaaS application has another construct: a tenant. A paper on multi-tenancy defines a tenant as a group of one or more users sharing the same view on an application they use. This view may differ for different tenants. Each individual principal is associated with a tenant, even if it is only a 1:1 mapping. A tenant is uniquely identified, and contains information about the tenant administrator, billing information and other metadata.

When a principal makes a request to a SaaS application, the principal provides their tenant and user identifier along with the request. The SaaS application validates this information and makes an authorization decision. In well-designed SaaS applications, this authorization step should not rely on a centralized authorization service. A centralized authorization service is a single point of failure in an application. If it fails, or is overwhelmed with requests, the application will no longer be able to process requests.

There are two key techniques to providing this type of experience in a SaaS application: using an identity provider (IdP) and representing identity or authorization in a token.

Using an Identity Provider (IdP)

In the past, some web applications often stored user information in a relational database table. When a principal authenticated successfully, the application issued a session ID. For subsequent requests, the principal passed the session ID to the application. The application made authorization decisions based on this session ID. Figure 1 provides an example of how this setup worked.

Figure 1 - An example of legacy application authentication.

Figure 1 – An example of legacy application authentication.

In applications larger than a simple web application, this pattern is suboptimal. Each request usually results in at least one database query or cache look up, creating a bottleneck on the data store holding the user or session information. Further, because of the tight coupling between the application and its user management, federation with external identity providers becomes difficult.

When designing your SaaS application, you should consider the use of an identity provider like Amazon Cognito, Auth0, or Okta. Using an identity provider offloads the heavy lifting required for managing identity by having user authentication, including federation, handled by external identity providers. Figure 2 provides an example of how a SaaS provider can use an identity provider in place of the self-managed solution shown in Figure 1.

Figure 2 – An example of an authentication flow that involves an identity provider.

Figure 2 – An example of an authentication flow that involves an identity provider.

Once a user authenticates with an identity provider, the identity provider issues a standardized token. This token is the same regardless of how a user authenticates, which means your application does not need to build in support for multiple different authentication methods tenants might use.

Identity providers also commonly support federated access. Federated access means that a third party maintains the identities, but the identity provider has a trust relationship with this third party. When a customer tries to log in with an identity managed by the third party, the SaaS application’s identity provider handles the authentication transaction with the third-party identity provider.

This authentication transaction commonly uses a protocol like Security Assertion Markup Language (SAML) 2.0. The SaaS application’s identity provider manages the interaction with the tenant’s identity provider. The SaaS application’s identity provider issues a token in a format understood by the SaaS application. Figure 3 provides an example of how a SaaS application can provide support for federation using an identity provider.

Figure 3 - An example of authentication that involves a tenant-provided identity provider

Figure 3 – An example of authentication that involves a tenant-provided identity provider

For an example, see How to set up Amazon Cognito for federated authentication using Azure AD.

Representing identity with tokens

Identity is usually represented by signed tokens. JSON Web Signatures (JWS), often referred to as JSON Web Tokens (JWT), are signed JSON objects used in web applications to demonstrate that the bearer is authorized to access a particular resource. These JSON objects are signed by the identity provider, and can be validated without querying a centralized database or service.

The token contains several key-value pairs, called claims, which are issued by the identity provider. Besides several claims relating to the issuance and expiration of the token, the token can also contain information about the individual principal and tenant.

Sample access token claims

The example below shows the claims section of a typical access token issued by Amazon Cognito in JWT format.

{
  "sub": "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee",
  "cognito:groups": [
"TENANT-1"
  ],
  "token_use": "access",
  "auth_time": 1562190524,
  "iss": "https://cognito-idp.us-west-2.amazonaws.com/us-west-2_example",
  "exp": 1562194124,
  "iat": 1562190524,
  "origin_jti": "bbbbbbbbb-cccc-dddd-eeee-aaaaaaaaaaaa",
  "jti": "cccccccc-dddd-eeee-aaaa-bbbbbbbbbbbb",
  "client_id": "12345abcde",

The principal, and the tenant the principal is associated with, are represented in this token by the combination of the user identifier (the sub claim) and the tenant ID in the cognito:groups claim. In this example, the SaaS application represents a tenant by creating a Cognito group per tenant. Other identity providers may allow you to add a custom attribute to a user that is reflected in the access token.

When a SaaS application receives a JWT as part of a request, the application validates the token and unpacks its contents to make authorization decisions. The claims within the token set what is known as the tenant context. Much like the way environment variables can influence a command line application, the tenant context influences how the SaaS application processes the request.

By using a JWT, the SaaS application can process a request without frequent reference to an external identity provider or other centralized service.

Tenant isolation

Tenant isolation is foundational to every SaaS application. Each SaaS application must ensure that one tenant cannot access another tenant’s resources. The SaaS application must create boundaries that adequately isolate one tenant from another.

Determining what constitutes sufficient isolation depends on your domain, deployment model and any applicable compliance frameworks. The techniques for isolating tenants from each other depend on the isolation model and the applications you use. This section provides an overview of tenant isolation strategies.

Your deployment model influences isolation

How an application is deployed influences how tenants are isolated. SaaS applications can use three types of isolation: silo, pool, and bridge.

Silo deployment model

The silo deployment model involves customers deploying one set of infrastructure per tenant. Depending on the application, this may mean a VPC-per-tenant, a set of containers per tenant, or some other resource that is deployed for each tenant. In this model, there is one deployment per tenant, though there may be some shared infrastructure for cross-tenant administration. Figure 4 shows an example of a siloed deployment that uses a VPC-per-tenant model.

Figure 4 - An example of a siloed deployment that provisions a VPC-per-tenant

Figure 4 – An example of a siloed deployment that provisions a VPC-per-tenant

Pool deployment model

The pool deployment model involves a shared set of infrastructure for all tenants. Tenant isolation is implemented logically in the application through application-level constructs. Rather than having separate resources per tenant, isolation enforcement occurs within the application. Figure 5 shows an example of a pooled deployment model that uses serverless technologies.

Figure 5 - An example of a pooled deployment model using serverless technologies

Figure 5 – An example of a pooled deployment model using serverless technologies

In Figure 5, an AWS Lambda function that retrieves an item from an Amazon DynamoDB table shared by all tenants needs temporary credentials issued by the AWS Security Token Service. These credentials only allow the requester to access items in the table that belong to the tenant making the request. A requester gets these credentials by assuming an AWS Identity and Access Management (IAM) role. This allows a SaaS application to share the underlying infrastructure, while still isolating tenants from one another. See Isolation enforcement depends on service below for more details on this pattern.

Bridge deployment model

The bridge model combines elements of both the silo and pool models. Some resources may be separate, others may be shared. For example, suppose your application has a shared application layer and an Amazon Relational Database Service (RDS) instance per tenant. The application layer evaluates each request and connects to the database for the tenant that made the request.

This model is useful in a situation where each tenant may require a certain response time and one set of resources acts as a bottleneck. In the RDS example, the application layer could handle the requests imposed by the tenants, but a single RDS instance could not.

The decision on which isolation model to implement depends on your customer’s requirements, compliance needs or industry needs. You may find that some customers can be deployed onto a pool model, while larger customers may require their own silo deployment.

Your tiering strategy may also influence the type of isolation model you use. For example, a basic tier customer might be deployed onto pooled infrastructure, while an enterprise tier customer is deployed onto siloed infrastructure.

For more information about different tenant isolation models, read the tenant isolation strategies whitepaper.

Isolation enforcement depends on service

Most SaaS applications will need somewhere to store state information. This could be a relational database, a NoSQL database, or some other storage medium which persists state. SaaS applications built on AWS use various mechanisms to enforce tenant isolation when accessing a persistent storage medium.

IAM provides fine grain access controls access for the AWS API. Some services, like Amazon Simple Storage Service (Amazon S3) and DynamoDB, provide the ability to control access to individual objects or items with IAM policies. When possible, your application should use IAM’s built-in functionality to limit access to tenant resources. See Isolating SaaS Tenants with Dynamically Generated IAM Policies for more information about using IAM to implement tenant isolation.

AWS IAM also offers the ability to restrict access to resources based on tags. This is known as attribute-based access control (ABAC). This technique allows you to apply tags to supported resources, and make access control decisions based on which tags are applied. This is a more scalable access control mechanism than role-based access control (RBAC), because you do not need to modify an IAM policy each time a resource is added or removed. See How to implement SaaS tenant isolation with ABAC and AWS IAM for more information about how this can be applied to a SaaS application.

Some relational databases offer features that can enforce tenant isolation. For example, PostgreSQL offers a feature called row level security (RLS). Depending on the context in which the query is sent to the database, only tenant-specific items are returned in the results. See Multi-tenant data isolation with PostgreSQL Row Level Security for more information about row level security in PostgreSQL.

Other persistent storage mediums do not have fine grain permission models. They may, however, offer some kind of state container per tenant. For example, when using MongoDB, each tenant is assigned a MongoDB user and a MongoDB database. The secret associated with the user can be stored in AWS Secrets Manager. When retrieving a tenant’s data, the SaaS application first retrieves the secret, then authenticates with MongoDB. This creates tenant isolation because the associated credentials only have permission to access collections in a tenant-specific database.

Generally, if the persistent storage medium you’re using offers its own permission model that can enforce tenant isolation, you should use it, since this keeps you from having to implement isolation in your application. However, there may be cases where your data store does not offer this level of isolation. In this situation, you would need to write application-level tenant isolation enforcement. Application-level tenant isolation means that the SaaS application, rather than the persistent storage medium, makes sure that one tenant cannot access another tenant’s data.

Conclusion

This post reviews the challenges, opportunities and best practices for the unique security considerations associated with a multi-tenant SaaS application, and describes specific identity considerations, as well as tenant isolation methods.

If you’d like to know more about the topics above, the AWS Well-Architected SaaS Lens Security pillar dives deep on performance management in SaaS environments. It also provides best practices and resources to help you design and improve performance efficiency in your SaaS application.

Get Started with the AWS Well-Architected SaaS Lens

The AWS Well-Architected SaaS Lens focuses on SaaS workloads, and is intended to drive critical thinking for developing and operating SaaS workloads. Each question in the lens has a list of best practices, and each best practice has a list of improvement plans to help guide you in implementing them.

The lens can be applied to existing workloads, or used for new workloads you define in the tool. You can use it to improve the application you’re working on, or to get visibility into multiple workloads used by the department or area you’re working with.

The SaaS Lens is available in all Regions where the AWS Well-Architected Tool is offered, as described in the AWS Regional Services List. There are no costs for using the AWS Well-Architected Tool.

If you’re an AWS customer, find current AWS Partners that can conduct a review by learning about AWS Well-Architected Partners and AWS SaaS Competency Partners.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Security Hub forum. To start your 30-day free trial of Security Hub, visit AWS Security Hub.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Keith P

Keith is a senior partner solutions architect on the SaaS Factory team.

Andy Powell

Andy is the global lead partner for solutions architecture on the SaaS Factory team.

Deploy and Manage Gitlab Runners on Amazon EC2

Post Syndicated from Sylvia Qi original https://aws.amazon.com/blogs/devops/deploy-and-manage-gitlab-runners-on-amazon-ec2/

Gitlab CI is a tool utilized by many enterprises to automate their Continuous integration, continuous delivery and deployment (CI/CD) process. A Gitlab CI/CD pipeline consists of two major components: A .gitlab-ci.yml file describing a pipeline’s jobs, and a Gitlab Runner, an application that executes the pipeline jobs.

Setting up the Gitlab Runner is a time-consuming process. It involves provisioning the necessary infrastructure, installing the necessary software to run pipeline workloads, and configuring the runner. For enterprises running hundreds of pipelines across multiple environments, it is essential to automate the Gitlab Runner deployment process so as to be deployed quickly in a repeatable, consistent manner.

This post will guide you through utilizing Infrastructure-as-Code (IaC) to automate Gitlab Runner deployment and administrative tasks on Amazon EC2. With IaC, you can quickly and consistently deploy the entire Gitlab Runner architecture by running a script. You can track and manage changes efficiently. And, you can enforce guardrails and best practices via code. The solution presented here also offers autoscaling so that you save costs by terminating resources when not in use. You will learn:

  • How to deploy Gitlab Runner quickly and consistently across multiple AWS accounts.
  • How to enforce guardrails and best practices on the Gitlab Runner through IaC.
  • How to autoscale Gitlab Runner based on workloads to ensure best performance and save costs.

This post comes from a DevOps engineer perspective, and assumes that the engineer is familiar with the practices and tools of IaC and CI/CD.

Overview of the solution

The following diagram displays the solution architecture. We use AWS CloudFormation to describe the infrastructure that is hosting the Gitlab Runner. The main steps are as follows:

  1. The user runs a deploy script in order to deploy the CloudFormation template. The template is parameterized, and the parameters are defined in a properties file. The properties file specifies the infrastructure configuration, as well as the environment in which to deploy the template.
  2. The deploy script calls CloudFormation CreateStack API to create a Gitlab Runner stack in the specified environment.
  3. During stack creation, an EC2 autoscaling group is created with the desired number of EC2 instances. Each instance is launched via a launch template, which is created with values from the properties file. An IAM role is created and attached to the EC2 instance. The role contains permissions required for the Gitlab Runner to execute pipeline jobs. A lifecycle hook is attached to the autoscaling group on instance termination events. This ensures graceful instance termination.
  4. During instance launch, CloudFormation uses a cfn-init helper script to install and configure the Gitlab Runner:
    1. cfn-init installs the Gitlab Runner software on the EC2 instance.
    2. cfn-init configures the Gitlab Runner as a docker executor using a pre-defined docker image in the Gitlab Container Registry. The docker executor implementation lets the Gitlab Runner run each build in a separate and isolated container. The docker image contains the software required to run the pipeline workloads, thereby eliminating the need to install these packages during each build.
    3. cfn-init registers the Gitlab Runner to Gitlab projects specified in the properties file, so that these projects can utilize the Gitlab Runner to run pipelines.
  1. The user may repeat the same steps to deploy Gitlab Runner into another environment.

Architecture diagram previously explained in post.

Walkthrough

This walkthrough will demonstrate how to deploy the Gitlab Runner, and how easy it is to conduct Gitlab Runner administrative tasks via this architecture. We will walk through the following tasks:

  • Build a docker executor image for the Gitlab Runner.
  • Deploy the Gitlab Runner stack.
  • Update the Gitlab Runner.
  • Terminate the Gitlab Runner.
  • Add/Remove Gitlab projects from the Gitlab Runner.
  • Autoscale the Gitlab Runner based on workloads.

The code in this post is available at https://github.com/aws-samples/amazon-ec2-gitlab-runner.git

Prerequisites

For this walkthrough, you need the following:

  • A Gitlab account (all tiers including Gitlab Free self-managed, Gitlab Free SaaS, and higher tiers). This demo uses gitlab.com free tire.
  • A Gitlab Container Registry.
  • Git client to clone the source code provided.
  • An AWS account with local credentials properly configured (typically under ~/.aws/credentials).
  • The latest version of the AWS CLI. For more information, see Installing, updating, and uninstalling the AWS CLI.
  • Docker is installed and running on the localhost/laptop.
  • Nodejs and npm installed on the localhost/laptop.
  • A VPC with 2 private subnets and that is connected to the internet via NAT gateway allowing outbound traffic.
  • The following IAM service-linked role created in the AWS account: AWSServiceRoleForAutoScaling
  • An Amazon S3 bucket for storing Lambda deployment packages.
  • Familiarity with Git, Gitlab CI/CD, Docker, EC2, CloudFormation and Amazon CloudWatch.

Build a docker executor image for the Gitlab Runner

The Gitlab Runner in this solution is implemented as docker executor. The Docker executor connects to Docker Engine and runs each build in a separate and isolated container via a predefined docker image. The first step in deploying the Gitlab Runner is building a docker executor image. We provided a simple Dockerfile in order to build this image. You may customize the Dockerfile to install your own requirements.

To build a docker image using the sample Dockerfile:

  1. Create a directory where we will store our demo code. From your terminal run:
mkdir demo-repos && cd demo-repos
  1. Clone the source code repository found in the following location:
git clone https://github.com/aws-samples/amazon-ec2-gitlab-runner.git
  1. Create a new project on your Gitlab server. Name the project any name you like.
  2. Clone your newly created repo to your laptop. Ignore the warning about cloning an empty repository.
git clone <your-repo-url>
  1. Copy the demo repo files into your newly created repo on your laptop, and push it to your Gitlab repository. You may customize the Dockerfile before pushing it to Gitlab.
cp -r amazon-ec2-gitlab-runner/* <your-repo-dir>
cd <your-repo-dir>
git add .
git commit -m “Initial commit”
git push
  1. On the Gitlab console, go to your repository’s Package & Registries -> Container Registry. Follow the instructions provided on the Container Registry page in order to build and push a docker image to your repository’s container registry.

Deploy the Gitlab Runner stack

Once the docker executor image has been pushed to the Gitlab Container Registry, we can deploy the Gitlab Runner. The Gitlab Runner infrastructure is described in the Cloudformation template gitlab-runner.yaml. Its configuration is stored in a properties file called sample-runner.properties. A launch template is created with the values in the properties file. Then it is used to launch instances. This architecture lets you deploy Gitlab Runner to as many environments as you like by utilizing the configurations provided in the appropriate properties files.

During the provisioning process, utilize a cfn-init helper script to run a series of commands to install and configure the Gitlab Runner.

          commands:
            01InstallDocker:
              command: sudo yum -y install docker
            02StartDocker:
              command: sudo service docker start
            03DownloadGitlabRunner:
              command: sudo wget -O /usr/bin/gitlab-runner https://gitlab-runner-downloads.s3.amazonaws.com/latest/binaries/gitlab-runner-linux-amd64
            04ChmodGitlabRunner:
              command: sudo chmod a+x /usr/bin/gitlab-runner
            05AddUser:
              command: sudo useradd --comment 'GitLab Runner' --create-home gitlab-runner --shell /bin/bash
            06InstallGitlabRunner:
              command: sudo gitlab-runner install --user=gitlab-runner --working-directory=/home/gitlab-runner
            07SetRegion:
              command: !Sub 'aws configure set default.region ${AWS::Region}'
            08ConfigureDockerExecutor:
              command: !Sub 
                - |
                  for GitlabGroupToken in `aws ssm get-parameters --names /${AWS::StackName}/ci-tokens --query 'Parameters[0].Value' | sed -e "s/\"//g" | sed "s/,/ /g"`;do
                      sudo gitlab-runner register \
                      --non-interactive \
                      --url "${GitlabServerURL}" \
                      --registration-token $GitlabGroupToken \
                      --executor "docker" \
                      --docker-image "${DockerImagePath}" \
                      --description "Gitlab Runner with Docker Executor" \
                      --locked="${isLOCKED}" --access-level "${ACCESS}" \
                      --docker-volumes "/var/run/docker.sock:/var/run/docker.sock" \
                      --tag-list "${RunnerEnvironment}-${RunnerVersion}-docker"
                  done
                - isLOCKED: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, isLOCKED]
                  ACCESS: !FindInMap [GitlabRunnerRegisterOptionsMap, !Ref RunnerEnvironment, ACCESS]                              
            09StartGitlabRunner:
              command: sudo gitlab-runner start

The helper script ensures that the Gitlab Runner setup is consistent and repeatable for each deployment. If a configuration change is required, users simply update the configuration steps and redeploy the stack. Furthermore, all changes are tracked in Git, which allows for versioning of the Gitlab Runner.

To deploy the Gitlab Runner stack:

  1. Obtain the runner registration tokens of the Gitlab projects that you want registered to the Gitlab Runner. Obtain the token by selecting the project’s Settings > CI/CD and expand the Runners section.
  2. Update the sample-runner.properties file parameters according to your own environment. Refer to the gitlab-runner.yaml file for a description of these parameters. Rename the file if you like. You may also create an additional properties file for deploying into other environments.
  3. Run the deploy script to deploy the runner:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

<properties-file> is the name of the properties file.

<region> is the region where you want to deploy the stack.

<aws-profile> is the name of the CLI profile you set up in the prerequisites section.

<stack-name> is the name you chose for the CloudFormation stack.

For example:

./deploy-runner.sh sample-runner.properties us-east-1 dev amazon-ec2-gitlab-runner-demo

After the stack is deployed successfully, you will see the Gitlab Runner autoscaling group created in the EC2 console:

After the stack is deployed successfully, you will see the Gitlab Runner autoscaling group created in the EC2 console.

Under your Gitlab project Settings > CICD > Runners > Available specific runners, you will see the fully configured Gitlab Runner. The green circle indicates that the Gitlab Runner is ready for use.

Now go to your Gitlab project Settings  CICD  Runners  Available specific runners, you will see the fully configured Gitlab Runner. The green circle indicates that the Gitlab Runner is ready for use.

Updating the Gitlab Runner

There are times when you would want to update the Gitlab Runner. For example, updating the instance VolumeSize in order to resolve a disk space issue, or updating the AMI ID when a new AMI becomes available.

Utilizing the properties file and launch template makes it easy to update the Gitlab Runner. Simply update the Gitlab Runner configuration parameters in the properties file. Then, run the deploy script to udpate the Gitlab Runner stack. To ensure that the changes take effect immediately (e.g., existing instances are replaced by new instances with the new configuration), we utilize an AutoscalingRollingUpdate update policy to automatically update the instances in the autoscaling group.

    UpdatePolicy:
      AutoScalingRollingUpdate:
        MinInstancesInService: !Ref MinInstancesInService
        MaxBatchSize: !Ref MaxBatchSize
        PauseTime: "PT5M"
        WaitOnResourceSignals: true
        SuspendProcesses:
          - HealthCheck
          - ReplaceUnhealthy
          - AZRebalance
          - AlarmNotification
          - ScheduledActions

The policy tells CloudFormation that when changes are detected in the launch template, update the instances in batch size of MaxBatchSize, while keeping a number of instances (specified in MinInstanceInService) in service during the update.

Below is an example of updating the Gitlab Runner instance type.

To update the instance type of the runner instance:

  1. Update the “InstanceType” parameter in the properties file.

InstanceType=t2.medium

  1. Run the deploy-runner.sh script to update the CloudFormation stack:
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 

In the CloudFormation console, you will see that the launch template is updated first, then a rolling update is initiated. The instance type update requires a replacement of the original instance, so a temporary instance was launched and put in service. Then, the temporary instance was terminated when the new instance was launched successfully.

In the CloudFormation console, you will see that the launch template is updated first, then a rolling update is initiated. The instance type update requires a replacement of the original instance, so a temporary instance was launched and put in service. Then, the temporary instance was terminated when the new instance was launched successfully.

After the update is complete, you will see that on the Gitlab project’s console, the old Gitlab Runner, ez_5x8Rv, is replaced by the new Gitlab Runner, N1_UQ7yc.

After the update is complete, you will see that on the Gitlab project’s console, the old Gitlab Runner, ez_5x8Rv, is replaced by the new Gitlab Runner, N1_UQ7yc.

Terminate the Gitlab Runner

There are times when an autoscaling group instance must be terminated. For example, during an autoscaling scale-in event, or when the instance is being replaced by a new instance during a stack update, as seen previously. When terminating an instance, you must ensure that the Gitlab Runner finishes executing any running jobs before the instance is terminated, otherwise your environment could be left in an inconsistent state. Also, we want to ensure that the terminated Gitlab Runner is removed from the Gitlab project. We utilize an autoscaling lifecycle hook to achieve these goals.

The lifecycle hook works like this: A CloudWatch event rule actively listens for the EC2 Instance-terminate events. When one is detected, the event rule triggers a Lambda function. The Lambda function calls SSM Run Command to run a series of commands on the EC2 instances, via a SSM Document. The commands include stopping the Gitlab Runner gracefully when all running jobs are finished, de-registering the runner from Gitlab projects, and signaling the autoscaling group to terminate the instance.

The lifecycle hook works like this: A CloudWatch event rule actively listens for the EC2 Instance-terminate events. When one is detected, the event rule triggers a Lambda function. The Lambda function calls SSM Run Command to run a series of commands on the EC2 instances, via a SSM Document. The commands include stopping the Gitlab Runner gracefully when all running jobs are finished, de-registering the runner from Gitlab projects, and signaling the autoscaling group to terminate the instance.

There are also times when you want to terminate an instance manually. For example, when an instance is suspected to not be functioning properly. To terminate an instance from the Gitlab Runner autoscaling group, use the following command:

aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id="${InstanceId}" \
    --no-should-decrement-desired-capacity \
    --region="${region}" \
    --profile="${profile}"

The above command terminates the instance. The lifecycle hook ensures that the cleanup steps are conducted properly, and the autoscaling group launches another new instance to replace the old one.

Note that if you terminate the instance by using the “ec2 terminate-instance” command, then the autoscaling lifecycle hook actions will not be triggered.

Add/Remove Gitlab projects from the Gitlab Runner

As new projects are added to your enterprise, you may want to register them to the Gitlab Runner, so that those projects can utilize the Gitlab Runner to run pipelines. On the other hand, you would want to remove the Gitlab Runner from a project if it no longer wants to utilize the Gitlab Runner, or if it qualifies to utilize the Gitlab Runner. For example, if a project is no longer allowed to deploy to an environment configured by the Gitlab Runner. Our architecture offers a simple way to add and remove projects from the Gitlab Runner. To add new projects to the Gitlab Runner, update the RunnerRegistrationTokens parameter in the properties file, and then rerun the deploy script to update the Gitlab Runner stack.

To add new projects to the Gitlab Runner:

  1. Update the RunnerRegistrationTokens parameter in the properties file. For example:
RunnerRegistrationTokens=ps8RjBSruy1sdRdP2nZX,XbtZNv4yxysbYhqvjEkC
  1. Update the Gitlab Runner stack. This updates the SSM parameter which stores the tokens.
cd <your-repo-dir>
./deploy-runner.sh <properties-file> <region> <aws-profile> <stack-name> 
  1. Relaunch the instances in the Gitlab Runner autoscaling group. The new instances will use the new RunnerRegistrationTokens value. Run the following command to relaunch the instances:
./cycle-runner.sh <runner-autoscaling-group-name> <region> <optional-aws-profile>

To remove projects from the Gitlab Runner, follow the steps described above, with just one difference. Instead of adding new tokens to the RunnerRegistrationTokens parameter, remove the token(s) of the project that you want to dissociate from the runner.

Autoscale the runner based on custom performance metrics

Each Gitlab Runner can be configured to handle a fixed number of concurrent jobs. Once this capacity is reached for every runner, any new jobs will be in a Queued/Waiting status until the current jobs complete, which would be a poor experience for our team. Setting the number of concurrent jobs too high on our runners would also result in a poor experience, because all jobs leverage the same CPU, memory, and storage in order to conduct the builds.

In this solution, we utilize a scheduled Lambda function that runs every minute in order to inspect the number of jobs running on every runner, leveraging the Prometheus Metrics endpoint that the runners expose. If we approach the concurrent build limit of the group, then we increase the Autoscaling Group size so that it can take on more work. As the number of concurrent jobs decreases, then the scheduled Lambda function will scale the Autoscaling Group back in an effort to minimize cost. The Scaling-Up operation will ignore the Autoscaling Group’s cooldown period, which will help ensure that our team is not waiting on a new instance, whereas the Scale-Down operation will obey the group’s cooldown period.

Here is the logical sequence diagram for the work:

Sequence diagram

For operational monitoring, the Lambda function also publishes custom CloudWatch Metrics for the count of active jobs, along with the target and actual capacities of the Autoscaling group. We can utilize this information to validate that the system is working properly and determine if we need to modify any of our autoscaling parameters.

For operational monitoring, the Lambda function also publishes custom CloudWatch Metrics for the count of active jobs, along with the target and actual capacities of the Autoscaling group. We can utilize this information to validate that the system is working properly and determine if we need to modify any of our autoscaling parameters.

Congratulations! You have completed the walkthrough. Take some time to review the resources you have deployed, and practice the various runner administrative tasks that we have covered in this post.

Troubleshooting

Problem: I deployed the CloudFormation template, but no runner is listed in my repository.

Possible Cause: Errors have been encountered during cfn-init, causing runner registration to fail. Connect to your runner EC2 instance, and check /var/log/cfn-*.log files.

Cleaning up

To avoid incurring future charges, delete every resource provisioned in this demo by deleting the CloudFormation stack created in the “Deploy the Gitlab Runner stack” section.

Conclusion

This article demonstrated how to utilize IaC to efficiently conduct various administrative tasks associated with a Gitlab Runner. We deployed Gitlab Runner consistently and quickly across multiple accounts. We utilized IaC to enforce guardrails and best practices, such as tracking Gitlab Runner configuration changes, terminating the Gitlab Runner gracefully, and autoscaling the Gitlab Runner to ensure best performance and minimum cost. We walked through the deploying, updating, autoscaling, and terminating of the Gitlab Runner. We also saw how easy it was to clean up the entire Gitlab Runner architecture by simply deleting a CloudFormation stack.

About the authors

Sylvia Qi

Sylvia is a Senior DevOps Architect focusing on architecting and automating DevOps processes, helping customers through their DevOps transformation journey. In her spare time, she enjoys biking, swimming, yoga, and photography.

Sebastian Carreras

Sebastian is a Senior Cloud Application Architect with AWS Professional Services. He leverages his breadth of experience to deliver bespoke solutions to satisfy the visions of his customer. In his free time, he really enjoys doing laundry. Really.

Analyze AWS WAF logs using Amazon OpenSearch Service anomaly detection built on Random Cut Forests

Post Syndicated from Umesh Ramesh original https://aws.amazon.com/blogs/security/analyze-aws-waf-logs-using-amazon-opensearch-service-anomaly-detection-built-on-random-cut-forests/

This blog post shows you how to use the machine learning capabilities of Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) to detect and visualize anomalies in AWS WAF logs. AWS WAF logs are streamed to Amazon OpenSearch Service using Amazon Kinesis Data Firehose. Kinesis Data Firehose invokes an AWS Lambda function to transform incoming source data and deliver the transformed data to Amazon OpenSearch Service. You can implement this solution without any machine learning expertise. AWS WAF logs capture a number of attributes about the incoming web request, and you can analyze these attributes to detect anomalous behavior. This blog post focuses on the following two scenarios:

  • Identifying anomalous behavior based on a high number of web requests coming from an unexpected country (Country Code is one of the request fields captured in AWS WAF logs).
  • Identifying anomalous behavior based on HTTP method for a read-heavy application like a content media website that receives unexpected write requests.

Log analysis is essential for understanding the effectiveness of any security solution. It helps with day-to-day troubleshooting, and also with long-term understanding of how your security environment is performing.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits which could affect application availability, compromise security, or consume excessive resources. AWS WAF gives you control over which traffic sent to your web applications is allowed or blocked, by defining customizable web security rules. AWS WAF lets you define multiple types of rules to block unauthorized traffic.

Machine learning can assist in identifying unusual or unexpected behavior. Amazon OpenSearch Service is one of the commonly used services which offer log analytics for monitoring service logs, using dashboards and alerting mechanisms. Static, rule‑based analytics approaches are slow to adapt to evolving workloads, and can miss critical issues. With the announcement of real-time anomaly detection support in Amazon OpenSearch Service, you can use machine learning to detect anomalies in real‑time streaming data, and identify issues as they evolve so you can mitigate them quickly. Real‑time anomaly detection support uses Random Cut Forest (RCF), an unsupervised algorithm, which continuously adapts to evolving data patterns. Simply stated, RCF takes a set of random data points, divides them into multiple groups, each with the same number of points, and then builds a collection of models. As an unsupervised algorithm, RCF uses cluster analysis to detect spikes in time series data, breaks in periodicity or seasonality, and data point exceptions. The anomaly detection feature is lightweight, with the computational load distributed across Amazon OpenSearch Service nodes. Figure 1 shows the architecture of the solution described in this blog post.

Figure 1: End-to-end architecture

Figure 1: End-to-end architecture

The architecture flow shown in Figure 1 includes the following high-level steps:

  1. AWS WAF streams logs to Kinesis Data Firehose.
  2. Kinesis Data Firehose invokes a Lambda function to add attributes to the AWS WAF logs.
  3. Kinesis Data Firehose sends the transformed source records to Amazon OpenSearch Service.
  4. Amazon OpenSearch Service automatically detects anomalies.
  5. Amazon OpenSearch Service delivers anomaly alerts via Amazon Simple Notification Service (Amazon SNS).

Solution

Figure 2 shows examples of both an original and a modified AWS WAF log. The solution in this blog post focuses on Country and httpMethod. It uses a Lambda function to transform the AWS WAF log by adding fields, as shown in the snippet on the right side. The values of the newly added fields are evaluated based on the values of country and httpMethod in the AWS WAF log.

Figure 2: Sample processing done by a Lambda function

Figure 2: Sample processing done by a Lambda function

In this solution, you will use a Lambda function to introduce new fields to the incoming AWS WAF logs through Kinesis Data Firehose. You will introduce additional fields by using one-hot encoding to represent the incoming linear values as a “1” or “0”.

Scenario 1

In this scenario, the goal is to detect traffic from unexpected countries when serving user traffic expected to be from the US and UK. The function adds three new fields:

usTraffic
ukTraffic
otherTraffic

As shown in the lambda function inline code, we use the traffic_from_country function, in which we only want actions that ALLOW the traffic. Once we have that, we use conditions to check the country code. If the value of the country field in the web request captured in AWS WAF log is US, the usTraffic field in the transformed data will be assigned the value 1 while otherTraffic and ukTraffic will be assigned the value 0. The other two fields are transformed as shown in Table 1.

Original AWS WAF log Transformed AWS WAF log with new fields after one-hot encoding
Country usTraffic ukTraffic otherTraffic
US 1 0 0
UK 0 1 0
All other country codes 0 0 1

Table 1: One-hot encoding field mapping for country

Scenario 2

In the second scenario, you detect anomalous requests that use POST HTTP method.

As shown in the lambda function inline code, we use the filter_http_request_method function, in which we only want actions that ALLOW the traffic. Once we have that, we use conditions to check the HTTP _request method. If the value of the HTTP method in the AWS WAF log is GET, the getHttpMethod field is assigned the value 1 while headHttpMethod and postHttpMethod are assigned the value 0. The other two fields are transformed as shown in Table 2.

Original AWS WAF log Transformed AWS WAF log with new fields after one-hot encoding
HTTP method getHttpMethod headHttpMethod postHttpMethod
GET 1 0 0
HEAD 0 1 0
POST 0 0 1

Table 2: One-hot encoding field mapping for HTTP method

After adding these new fields, the transformed record from Lambda must contain the following parameters before the data is sent back to Kinesis Data Firehose

recordId The transformed record must contain the same original record ID as is received from the Kinesis Data Firehose.
result The status of the data transformation of the record (the status can be OK or Dropped).
data The transformed data payload.

AWS WAF logs are JSON files, and this anomaly detection feature works only on numeric data. This means that to use this feature for detecting anomalies in logs, you must pre-process your logs using a Lambda function.

Lambda function for one-hot encoding

Use the following Lambda function to transform the AWS WAF log by adding new attributes, as explained in Scenario 1 and Scenario 2.

import base64
import json

def lambda_handler(event,context):
    output = []
    
    try:
        # loop through records in incoming Event
        for record in event["records"]:
            # extract message
            message = json.loads(base64.b64decode(event["records"][0]["data"]))
            
            print('Country: ', message["httpRequest"]["country"])
            print('Action: ', message["action"])
            print('User Agent: ', message["httpRequest"]["headers"][1]["value"])
             
            timestamp = message["timestamp"]
            action = message["action"]
            country = message["httpRequest"]["country"]
            user_agent = message["httpRequest"]["headers"][1]["value"]
            http_method = message["httpRequest"]["httpMethod"]
            
            mobileUserAgent, browserUserAgent = filter_user_agent(user_agent)
            usTraffic, ukTraffic, otherTraffic = traffic_from_country(country, action)
            getHttpMethod, headHttpMethod, postHttpMethod = filter_http_request_method(http_method, action)
            
            # append new fields in message dict
            message["usTraffic"] = usTraffic
            message["ukTraffic"] = ukTraffic
            message["otherTraffic"] = otherTraffic
            message["mobileUserAgent"] = mobileUserAgent
            message["browserUserAgent"] = browserUserAgent
            message["getHttpMethod"] = getHttpMethod
            message["headHttpMethod"] = headHttpMethod
            message["postHttpMethod"] = postHttpMethod
            
            # base64-encoding
            data = base64.b64encode(json.dumps(message).encode('utf-8'))
            
            output_record = {
                "recordId": record['recordId'], # retain same record id from the Kinesis data Firehose
                "result": "Ok",
                "data": data.decode('utf-8')
            }
            output.append(output_record)
        return {"records": output}
    except Exception as e:
        print(e)
        
def filter_user_agent(user_agent):
    # returns one hot encoding based on user agent
    if "Mobile" in user_agent:
        mobile_user_agent = True
        return (1, 0)
    else:
        mobile_user_agent = False
        return (0, 1) # anomaly recorded
        
def traffic_from_country(country_code, action):
    # returns one hot encoding based on allowed traffic from countries
    if action == "ALLOW":
        if "US" in country_code:
            allowed_country_traffic = True
            return (1, 0, 0)
        elif "UK" in country_code:
            allowed_country_traffic = True
            return (0, 1, 0)
        else:
            allowed_country_traffic = False
            return (0, 0, 1) # anomaly recorded
            
def filter_http_request_method(http_method, action):
    # returns one hot encoding based on allowed http method type
    if action == "ALLOW":
        if "GET" in http_method:
            return (1, 0, 0)
        elif "HEAD" in http_method:
            return (0, 1, 0)
        elif "POST" in http_method:
            return (0, 0, 1) # anomaly recorded

After the transformation, the data that’s delivered to Amazon OpenSearch Service will have additional fields, as described in Table 1 and Table 2 above. You can configure an anomaly detector in Amazon OpenSearch Service to monitor these additional fields. The algorithm computes an anomaly grade and confidence score value for each incoming data point. Anomaly detection uses these values to differentiate an anomaly from normal variations in your data. Anomaly detection and alerting are plugins that are included in the available set of Amazon OpenSearch Service plugins. You can use these two plugins to generate a notification as soon as an anomaly is detected.

Deployment steps

In this section, you complete five high-level steps to deploy the solution. In this blog post, we are deploying this solution in the us-east-1 Region. The solution assumes you already have an active web application protected by AWS WAF rules. If you’re looking for details on creating AWS WAF rules, refer to Working with web ACLs and sample examples for more information.

Note: When you associate a web ACL with Amazon CloudFront as a protected resource, make sure that the Kinesis Firehose Delivery Stream is deployed in the us-east-1 Region.

The steps are:

  1. Deploy an AWS CloudFormation template
  2. Enable AWS WAF logs
  3. Create an anomaly detector
  4. Set up alerts in Amazon OpenSearch Service
  5. Create a monitor for the alerts

Deploy a CloudFormation template

To start, deploy a CloudFormation template to create the following AWS resources:

  • Amazon OpenSearch Service and Kibana (versions 1.5 to 7.10) with built-in AWS WAF dashboards.
  • Kinesis Data Firehose streams
  • A Lambda function for data transformation and an Amazon SNS topic with email subscription. 

To deploy the CloudFormation template

  1. Download the CloudFormation template and save it locally as Amazon-ES-Stack.yaml.
  2. Go to the AWS Management Console and open the CloudFormation console.
  3. Choose Create Stack.
  4. On the Specify template page, choose Upload a template file. Then select Choose File, and select the template file that you downloaded in step 1.
  5. Choose Next.
  6. Provide the Parameters:
    1. Enter a unique name for your CloudFormation stack.
    2. Update the email address for UserEmail with the address you want alerts sent to.
    3. Choose Next.
  7. Review and choose Create stack.
  8. When the CloudFormation stack status changes to CREATE_COMPLETE, go to the Outputs tab and make note of the DashboardLinkOutput value. Also note the credentials you’ll receive by email (Subject: Your temporary password) and subscribe to the SNS topic for which you’ll also receive an email confirmation request.

Enable AWS WAF logs

Before enabling the AWS WAF logs, you should have AWS WAF web ACLs set up to protect your web application traffic. From the console, open the AWS WAF service and choose your existing web ACL. Open your web ACL resource, which can either be deployed on an Amazon CloudFront distribution or on an Application Load Balancer.

To enable AWS WAF logs

  1. From the AWS WAF home page, choose Create web ACL.
  2. From the AWS WAF home page, choose  Logging and metrics
  3. From the AWS WAF home page, choose the web ACL for which you want to enable logging, as shown in Figure 3:
    Figure 3 – Enabling WAF logging

    Figure 3 – Enabling WAF logging

  4. Go to the Logging and metrics tab, and then choose Enable Logging. The next page displays all the delivery streams that start with aws-waf-logs. Choose the Kinesis Data Firehose delivery stream that was created by the Cloud Formation template, as shown in Figure 3 (in this example, aws-waf-logs-useast1). Don’t redact any fields or add filters. Select Save.

Create an Index template

Index templates lets you initialize new indices with predefined mapping. For example, in this case you predefined mapping for timestamp.

To create an Index template

  • Log into the Kibana dashboard. You can find the Kibana dashboard link in the Outputs tab of the CloudFormation stack. You should have received the username and temporary password (Ignore the period (.) at the end of the temporary password) by email, at the email address you entered as part of deploying the CloudFormation template. You will be logged in to the Kibana dashboard after setting a new password.
  • Choose Dev Tools in the left menu panel to access Kibana’s console.
  • The left pane in the console is the request pane, and the right pane is the response pane.
  • Select the green arrow at the end of the command line to execute the following PUT command.
    PUT  _template/awswaf
    {
        "index_patterns": ["awswaf-*"],
        "settings": {
        "number_of_shards": 1
        },
        "mappings": {
           "properties": {
              "timestamp": {
                "type": "date",
                "format": "epoch_millis"
              }
          }
      }
    }

  • You should see the following response:
    {
      "acknowledged": true
    }

The command creates a template named awswaf and applies it to any new index name that matches the regular expression awswaf-*

Create an anomaly detector

A detector is an individual anomaly detection task. You can create multiple detectors, and all the detectors can run simultaneously, with each analyzing data from different sources.

To create an anomaly detector

  1. Select Anomaly Detection from the menu bar, select Detectors and Create Detector.
    Figure 4- Home page view with menu bar on the left

    Figure 4- Home page view with menu bar on the left

  2. To create a detector, enter the following values and features:

    Name and description

    Name: aws-waf-country
    Description: Detect anomalies on other country values apart from “US” and “UK

    Data Source

    Index: awswaf*
    Timestamp field: timestamp
    Data filter: Visual editor
    Figure 5 – Detector features and their values

    Figure 5 – Detector features and their values

  3. For Detector operation settings, enter a value in minutes for the Detector interval to set the time interval at which the detector collects data. To add extra processing time for data collection, set a Window delay value (also in minutes). This tells the detector that the data isn’t ingested into Amazon OpenSearch Service in real time, but with a delay. The example in Figure 6 uses a 1-minute interval and a 2-minute delay.
    Figure 6 – Detector operation settings

    Figure 6 – Detector operation settings

  4. Next, select Create.
  5. Once you create a detector, select Configure Model and add the following values to Model configuration:

    Feature Name: waf-country-other
    Feature State: Enable feature
    Find anomalies based on: Field value
    Aggregation method: sum()
    Field: otherTraffic

    The aggregation method determines what constitutes an anomaly. For example, if you choose min(), the detector focuses on finding anomalies based on the minimum values of your feature. If you choose average(), the detector finds anomalies based on the average values of your feature. For this scenario, you will use sum().The value otherTraffic for Field is the transformed field in the Amazon OpenSearch Service logs that was added by the Lambda function.

    Figure 7 – Detector Model configuration

    Figure 7 – Detector Model configuration

  6. Under Advanced Settings on the Model configuration page, update the Window size to an appropriate interval (1 equals 1 minute) and choose Save and Start detector and Automatically start detector.

    We recommend you choose this value based on your actual data. If you expect missing values in your data, or if you want the anomalies based on the current value, choose 1. If your data is continuously ingested and you want the anomalies based on multiple intervals, choose a larger window size.

    Note: The detector takes 4 to 5 minutes to start. 

    Figure 8 – Detector window size

    Figure 8 – Detector window size

Set up alerts

You’ll use Amazon SNS as a destination for alerts from Amazon OpenSearch Service.

Note: A destination is a reusable location for an action.

To set up alerts:
  1. Go to the Kibana main menu bar and select Alerting, and then navigate to the Destinations tab.
  2. Select Add destination and enter a unique name for the destination.
  3. For Type, choose Amazon SNS and provide the topic ARN that was created as part of the CloudFormation resources (captured in the Outputs tab).
  4. Provide the ARN for an IAM role that was created as part of the CloudFormation outputs (SNSAccessIAMRole-********) that has the following trust relationship and permissions (at a minimum):
    {"Version": "2012-10-17",
      "Statement": [{"Effect": "Allow",
        "Principal": {"Service": "es.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }]
    }
    {"Version": "2012-10-17",
      "Statement": [{"Effect": "Allow",
        "Action": "sns:Publish",
        "Resource": "sns-topic-arn"
      }]
    }

    Figure 9 – Destination

    Figure 9 – Destination

    Note: For more information, see Adding IAM Identity Permissions in the IAM user guide.

  5. Choose Create.

Create a monitor

A monitor can be defined as a job that runs on a defined schedule and queries Amazon OpenSearch Service. The results of these queries are then used as input for one or more triggers.

To create a monitor for the alert

  1. Select Alerting on the Kibana main menu and navigate to the Monitors tab. Select Create monitor
  2. Create a new record with the following values:

    Monitor Name: aws-waf-country-monitor
    Method of definition: Define using anomaly detector
    Detector: aws-waf-country
    Monitor schedule: Every 2 minutes
  3. Select Create.
    Figure 10 – Create monitor

    Figure 10 – Create monitor

  4. Choose Create Trigger to connect monitoring alert with the Amazon SNS topic using the below values:

    Trigger Name: SNS_Trigger
    Severity Level: 1
    Trigger Type: Anomaly Detector grade and confidence

    Under Configure Actions, set the following values:

    Action Name: SNS-alert
    Destination: select the destination name you chose when you created the Alert above
    Message Subject: “Anomaly detected – Country”
    Message: <Use the default message displayed>
  5. Select Create to create the trigger.
    Figure 11 – Create trigger

    Figure 11 – Create trigger

    Figure 12 – Configure actions

    Figure 12 – Configure actions

Test the solution

Now that you’ve deployed the solution, the AWS WAF logs will be sent to Amazon OpenSearch Service.

Kinesis Data Generator sample template

When testing the environment covered in this blog outside a production context, we used Kinesis Data Generator to generate sample user traffic with the template below, changing the country strings in different runs to reflect expected records or anomalous ones. Other tools are also available.

{
"timestamp":"[{{date.now("DD/MMM/YYYY:HH:mm:ss Z")}}]",
"formatVersion":1,
"webaclId":"arn:aws:wafv2:us-east-1:066931718055:regional/webacl/FMManagedWebACLV2test-policy1596636761038/3b9e0dde-812c-447f-afe7-2dd16658e746",
"terminatingRuleId":"Default_Action",
"terminatingRuleType":"REGULAR",
"action":"ALLOW",
"terminatingRuleMatchDetails":[
],
"httpSourceName":"ALB",
"httpSourceId":"066931718055-app/Webgoat-ALB/d1b4a2c257e57f2f",
"ruleGroupList":[
{
"ruleGroupId":"AWS#AWSManagedRulesAmazonIpReputationList",
"terminatingRule":null,
"nonTerminatingMatchingRules":[
],
"excludedRules":null
}
],
"rateBasedRuleList":[
],
"nonTerminatingMatchingRules":[
],
"httpRequest":{
"clientIp":"{{internet.ip}}",
"country":"{{random.arrayElement(
["US","UK"]
)}}",
"headers":[
{
"name":"Host",
"value":"34.225.62.38"
},
{
"name":"User-Agent",
"value":"{{internet.userAgent}}"
},
{
"name":"Accept",
"value":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
},
{
"name":"Accept-Language",
"value":"en-GB,en;q=0.5"
},
{
"name":"Accept-Encoding",
"value":"gzip, deflate"
},
{
"name":"Upgrade-Insecure-Requests",
"value":"1"
}
],
"uri":"/config/getuser",
"args":"index=0",
"httpVersion":"HTTP/1.1",
"httpMethod":"{{random.arrayElement(
["GET","HEAD"]
)}}",
"requestId":null
}
}

You will receive an email alert via Amazon SNS if the traffic contains any anomalous data. You should also be able to view the anomalies recorded in Amazon OpenSearch Service by selecting the detector and choosing Anomaly results for the detector, as shown in Figure 13.

Figure 13 – Anomaly results

Figure 13 – Anomaly results

Conclusion

In this post, you learned how you can discover anomalies in AWS WAF logs across parameters like Country and httpMethod defined by the attribute values. You can further expand your anomaly detection use cases with application logs and other AWS Service logs. To learn more about this feature with Amazon OpenSearch Service, we suggest reading the Amazon OpenSearch Service documentation. We look forward to hearing your questions, comments, and feedback. 

If you found this post interesting and useful, you may be interested in https://aws.amazon.com/blogs/security/how-to-improve-visibility-into-aws-waf-with-anomaly-detection/ and https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/ as further alternative approaches.

 
If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security news? Follow us on Twitter.

uramesh

Umesh Kumar Ramesh

Umesh is a senior cloud infrastructure architect with AWS who delivers proof-of-concept projects and topical workshops, and leads implementation projects. He holds a bachelor’s degree in computer science and engineering from the National Institute of Technology, Jamshedpur (India). Outside of work, he enjoys watching documentaries, biking, practicing meditation, and discussing spirituality.

Anuj Butail

Anuj Butail

Anuj is a solutions architect at AWS. He is based out of San Francisco and helps customers in San Francisco and Silicon Valley design and build large scale applications on AWS. He has expertise in the area of AWS, edge services, and containers. He enjoys playing tennis, watching sitcoms, and spending time with his family.

mahekp

Mahek Pavagadhi

Mahek is a cloud infrastructure architect at AWS in San Francisco, CA. She has a master’s degree in software engineering with a major in cloud computing. She is passionate about cloud services and building solutions with it. Outside of work, she is an avid traveler who loves to explore local cafés.

Implement OAuth 2.0 device grant flow by using Amazon Cognito and AWS Lambda

Post Syndicated from Jeff Lombardo original https://aws.amazon.com/blogs/security/implement-oauth-2-0-device-grant-flow-by-using-amazon-cognito-and-aws-lambda/

In this blog post, you’ll learn how to implement the OAuth 2.0 device authorization grant flow for Amazon Cognito by using AWS Lambda and Amazon DynamoDB.

When you implement the OAuth 2.0 authorization framework (RFC 6749) for internet-connected devices with limited input capabilities or that lack a user-friendly browser—such as wearables, smart assistants, video-streaming devices, smart-home automation, and health or medical devices—you should consider using the OAuth 2.0 device authorization grant (RFC 8628). This authorization flow makes it possible for the device user to review the authorization request on a secondary device, such as a smartphone, that has more advanced input and browser capabilities. By using this flow, you can work around the limits of the authorization code grant flow with Proof Key for Code Exchange (PKCE)-defined OpenID Connect Core specifications. This will help you to avoid scenarios such as:

  • Forcing end users to define a dedicated application password or use an on-screen keyboard with a remote control
  • Degrading the security posture of the end users by exposing their credentials to the client application or external observers

One common example of this type of scenario is a TV HDMI streaming device where, to be able to consume videos, the user must slowly select each letter of their user name and password with the remote control, which exposes these values to other people in the room during the operation.

Solution overview

The OAuth 2.0 device authorization grant (RFC 8628) is an IETF standard that enables Internet of Things (IoT) devices to initiate a unique transaction that authenticated end users can securely confirm through their native browsers. After the user authorizes the transaction, the solution will issue a delegated OAuth 2.0 access token that represents the end user to the requesting device through a back-channel call, as shown in Figure 1.
 

Figure 1: The device grant flow implemented in this solution

Figure 1: The device grant flow implemented in this solution

The workflow is as follows:

  1. An unauthenticated user requests service from the device.
  2. The device requests a pair of random codes (one for the device and one for the user) by authenticating with the client ID and client secret.
  3. The Lambda function creates an authorization request that stores the device code, user code, scope, and requestor’s client ID.
  4. The device provides the user code to the user.
  5. The user enters their user code on an authenticated web page to authorize the client application.
  6. The user is redirected to the Amazon Cognito user pool /authorize endpoint to request an authorization code.
  7. The user is returned to the Lambda function /callback endpoint with an authorization code.
  8. The Lambda function stores the authorization code in the authorization request.
  9. The device uses the device code to check the status of the authorization request regularly. And, after the authorization request is approved, the device uses the device code to retrieve a set of JSON web tokens from the Lambda function.
  10. In this case, the Lambda function impersonates the device to the Amazon Cognito user pool /token endpoint by using the authorization code that is stored in the authorization request, and returns the JSON web tokens to the device.

To achieve this flow, this blog post provides a solution that is composed of:

  • An AWS Lambda function with three additional endpoints:
    • The /token endpoint, which will handle client application requests such as generation of codes, the authorization request status check, and retrieval of the JSON web tokens.
    • The /device endpoint, which will handle user requests such as delivering the UI for approval or denial of the authorization request, or retrieving an authorization code.
    • The /callback endpoint, which will handle the reception of the authorization code associated with the user who is approving or denying the authorization request.
  • An Amazon Cognito user pool with:
  • Finally, an Amazon DynamoDB table to store the state of all the processed authorization requests.

Implement the solution

The implementation of this solution requires three steps:

  1. Define the public fully qualified domain name (FQDN) for the Application Load Balancer public endpoint and associate an X.509 certificate to the FQDN
  2. Deploy the provided AWS CloudFormation template
  3. Configure the DNS to point to the Application Load Balancer public endpoint for the public FQDN

Step 1: Choose a DNS name and create an SSL certificate

Your Lambda function endpoints must be publicly resolvable when they are exposed by the Application Load Balancer through an HTTPS/443 listener.

To configure the Application Load Balancer component

  1. Choose an FQDN in a DNS zone that you own.
  2. Associate an X.509 certificate and private key to the FQDN by doing one of the following:
  3. After you have the certificate in ACM, navigate to the Certificates page in the ACM console.
  4. Choose the right arrow (►) icon next to your certificate to show the certificate details.
     
    Figure 2: Locating the certificate in ACM

    Figure 2: Locating the certificate in ACM

  5. Copy the Amazon Resource Name (ARN) of the certificate and save it in a text file.
     
    Figure 3: Locating the certificate ARN in ACM

    Figure 3: Locating the certificate ARN in ACM

Step 2: Deploy the solution by using a CloudFormation template

To configure this solution, you’ll need to deploy the solution CloudFormation template.

Before you deploy the CloudFormation template, you can view it in its GitHub repository.

To deploy the CloudFormation template

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.
    Select the Launch Stack button to launch the template

    Note: The stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template, modify it, and deploy it to the selected Region.

  2. During the stack configuration, provide the following information:
    • A name for the stack.
    • The ARN of the certificate that you created or imported in AWS Certificate Manager.
    • A valid email address that you own. The initial password for the Amazon Cognito test user will be sent to this address.
    • The FQDN that you chose earlier, and that is associated to the certificate that you created or imported in AWS Certificate Manager.
    Figure 4: Configure the CloudFormation stack

    Figure 4: Configure the CloudFormation stack

  3. After the stack is configured, choose Next, and then choose Next again. On the Review page, select the check box that authorizes CloudFormation to create AWS Identity and Access Management (IAM) resources for the stack.
     
    Figure 5: Authorize CloudFormation to create IAM resources

    Figure 5: Authorize CloudFormation to create IAM resources

  4. Choose Create stack to deploy the stack. The deployment will take several minutes. When the status says CREATE_COMPLETE, the deployment is complete.

Step 3: Finalize the configuration

After the stack is set up, you must finalize the configuration by creating a DNS CNAME entry in the DNS zone you own that points to the Application Load Balancer DNS name.

To create the DNS CNAME entry

  1. In the CloudFormation console, on the Stacks page, locate your stack and choose it.
     
    Figure 6: Locating the stack in CloudFormation

    Figure 6: Locating the stack in CloudFormation

  2. Choose the Outputs tab.
  3. Copy the value for the key ALBCNAMEForDNSConfiguration.
     
    Figure 7: The ALB CNAME output in CloudFormation

    Figure 7: The ALB CNAME output in CloudFormation

  4. Configure a CNAME DNS entry into your DNS hosted zone based on this value. For more information on how to create a CNAME entry to the Application Load Balancer in a DNS zone, see Creating records by using the Amazon Route 53 console.
  5. Note the other values in the Output tab, which you will use in the next section of this post.

    Output key Output value and function
    DeviceCognitoClientClientID The app client ID, to be used by the simulated device to interact with the authorization server
    DeviceCognitoClientClientSecret The app client secret, to be used by the simulated device to interact with the authorization server
    TestEndPointForDevice The HTTPS endpoint that the simulated device will use to make its requests
    TestEndPointForUser The HTTPS endpoint that the user will use to make their requests
    UserPassword The password for the Amazon Cognito test user
    UserUserName The user name for the Amazon Cognito test user

Evaluate the solution

Now that you’ve deployed and configured the solution, you can initiate the OAuth 2.0 device code grant flow.

Until you implement your own device logic, you can perform all of the device calls by using the curl library, a Postman client, or any HTTP request library or SDK that is available in the client application coding language.

All of the following device HTTPS requests are made with the assumption that the device is a private OAuth 2.0 client. Therefore, an HTTP Authorization Basic header will be present and formed with a base64-encoded Client ID:Client Secret value.

You can retrieve the URI of the endpoints, the client ID, and the client secret from the CloudFormation Output table for the deployed stack, as described in the previous section.

Initialize the flow from the client application

The solution in this blog post lets you decide how the user will ask the device to start the authorization request and how the user will be presented with the user code and URI in order to verify the request. However, you can emulate the device behavior by generating the following HTTPS POST request to the Application Load Balancer–protected Lambda function /token endpoint with the appropriate HTTP Authorization header. The Authorization header is composed of:

  • The prefix Basic, describing the type of Authorization header
  • A space character as separator
  • The base64 encoding of the concatenation of:
    • The client ID
    • The colon character as a separator
    • The client secret
     POST /token?client_id=AIDACKCEVSQ6C2EXAMPLE HTTP/1.1
     User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
     Host: <FQDN of the ALB protected Lambda function>
     Accept: */*
     Accept-Encoding: gzip, deflate
     Connection: Keep-Alive
     Authorization: Basic QUlEQUNLQ0VWUwJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY VORy9iUHhSZmlDWUVYQU1QTEVLRVkg
    

The following JSON message will be returned to the client application.

Server: awselb/2.0
Date: Tue, 06 Apr 2021 19:57:31 GMT
Content-Type: application/json
Content-Length: 33
Connection: keep-alive
cache-control: no-store
{
    "device_code": "APKAEIBAERJR2EXAMPLE",
    "user_code": "ANPAJ2UCCR6DPCEXAMPLE",
    "verification_uri": "https://<FQDN of the ALB protected Lambda function>/device",
    "verification_uri_complete":"https://<FQDN of the ALB protected Lambda function>/device?code=ANPAJ2UCCR6DPCEXAMPLE&authorize=true",
    "interval": <Echo of POLLING_INTERVAL environment variable>,
    "expires_in": <Echo of CODE_EXPIRATION environment variable>
}

Check the status of the authorization request from the client application

You can emulate the process where the client app regularly checks for the authorization request status by using the following HTTPS POST request to the Application Load Balancer–protected Lambda function /token endpoint. The request should have the same HTTP Authorization header that was defined in the previous section.

POST /token?client_id=AIDACKCEVSQ6C2EXAMPLE&device_code=APKAEIBAERJR2EXAMPLE&grant_type=urn:ietf:params:oauth:grant-type:device_code HTTP/1.1
 User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
 Host: <FQDN of the ALB protected Lambda function>
 Accept: */*
 Accept-Encoding: gzip, deflate
 Connection: Keep-Alive
 Authorization: Basic QUlEQUNLQ0VWUwJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY VORy9iUHhSZmlDWUVYQU1QTEVLRVkg

Until the authorization request is approved, the client application will receive an error message that includes the reason for the error: authorization_pending if the request is not yet authorized, slow_down if the polling is too frequent, or expired if the maximum lifetime of the code has been reached. The following example shows the authorization_pending error message.

HTTP/1.1 400 Bad Request
Server: awselb/2.0
Date: Tue, 06 Apr 2021 20:57:31 GMT
Content-Type: application/json
Content-Length: 33
Connection: keep-alive
cache-control: no-store
{
"error":"authorization_pending"
}

Approve the authorization request with the user code

Next, you can approve the authorization request with the user code. To act as the user, you need to open a browser and navigate to the verification_uri that was provided by the client application.

If you don’t have a session with the Amazon Cognito user pool, you will be required to sign in.

Note: Remember that the initial password was sent to the email address you provided when you deployed the CloudFormation stack.

If you used the initial password, you’ll be asked to change it. Make sure to respect the password policy when you set a new password. After you’re authenticated, you’ll be presented with an authorization page, as shown in Figure 8.
 

Figure 8: The user UI for approving or denying the authorization request

Figure 8: The user UI for approving or denying the authorization request

Fill in the user code that was provided by the client application, as in the previous step, and then choose Authorize.

When the operation is successful, you’ll see a message similar to the one in Figure 9.
 

Figure 9: The “Success” message when the authorization request has been approved

Figure 9: The “Success” message when the authorization request has been approved

Finalize the flow from the client app

After the request has been approved, you can emulate the final client app check for the authorization request status by using the following HTTPS POST request to the Application Load Balancer–protected Lambda function /token endpoint. The request should have the same HTTP Authorization header that was defined in the previous section.

POST /token?client_id=AIDACKCEVSQ6C2EXAMPLE&device_code=APKAEIBAERJR2EXAMPLE&grant_type=urn:ietf:params:oauth:grant-type:device_code HTTP/1.1
 User-Agent: Mozilla/4.0 (compatible; MSIE5.01; Windows NT)
 Host: <FQDN of the ALB protected Lambda function>
 Accept: */*
 Accept-Encoding: gzip, deflate
 Connection: Keep-Alive
 Authorization: Basic QUlEQUNLQ0VWUwJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY VORy9iUHhSZmlDWUVYQU1QTEVLRVkg

The JSON web token set will then be returned to the client application, as follows.

HTTP/1.1 200 OK
Server: awselb/2.0
Date: Tue, 06 Apr 2021 21:41:50 GMT
Content-Type: application/json
Content-Length: 3501
Connection: keep-alive
cache-control: no-store
{
"access_token":"eyJrEXAMPLEHEADER2In0.eyJznvbEXAMPLEKEY6IjIcyJ9.eYEs-zaPdEXAMPLESIGCPltw",
"refresh_token":"eyJjdEXAMPLEHEADERifQ. AdBTvHIAPKAEIBAERJR2EXAMPLELq -co.pjEXAMPLESIGpw",
"expires_in":3600

The client application can now consume resources on behalf of the user, thanks to the access token, and can refresh the access token autonomously, thanks to the refresh token.

Going further with this solution

This project is delivered with a default configuration that can be extended to support additional security capabilities or to and adapted the experience to your end-users’ context.

Extending security capabilities

Through this solution, you can:

  • Use an AWS KMS key issued by AWS KMS to:
    • Encrypt the data in the database;
    • Protect the configuration in the Amazon Lambda function;
  • Use AWS Secret Manager to:
    • Securely store sensitive information like Cognito application client’s credentials;
    • Enforce Cognito application client’s credentials rotation;
  • Implement additional Amazon Lambda’s code to enforce data integrity on changes;
  • Activate AWS WAF WebACLs to protect your endpoints against attacks;

Customizing the end-user experience

The following table shows some of the variables you can work with.

Name Function Default value Type
CODE_EXPIRATION Represents the lifetime of the codes generated 1800 Seconds
DEVICE_CODE_FORMAT Represents the format for the device code #aA A string where:
# represents numbers
a lowercase letters
A uppercase letters
! special characters
DEVICE_CODE_LENGTH Represents the device code length 64 Number
POLLING_INTERVAL Represents the minimum time, in seconds, between two polling events from the client application 5 Seconds
USER_CODE_FORMAT Represents the format for the user code #B A string where:
# represents numbers
a lowercase letters
b lowercase letters that aren’t vowels
A uppercase letters
B uppercase letters that aren’t vowels
! special characters
USER_CODE_LENGTH Represents the user code length 8 Number
RESULT_TOKEN_SET Represents what should be returned in the token set to the client application ACCESS+REFRESH A string that includes only ID, ACCESS, and REFRESH values separated with a + symbol

To change the values of the Lambda function variables

  1. In the Lambda console, navigate to the Functions page.
  2. Select the DeviceGrant-token function.
     
    Figure 10: AWS Lambda console—Function selection

    Figure 10: AWS Lambda console—Function selection

  3. Choose the Configuration tab.
     
    Figure 11: AWS Lambda function—Configuration tab

    Figure 11: AWS Lambda function—Configuration tab

  4. Select the Environment variables tab, and then choose Edit to change the values for the variables.
     
    Figure 12: AWS Lambda Function—Environment variables tab

    Figure 12: AWS Lambda Function—Environment variables tab

  5. Generate new codes as the device and see how the experience changes based on how you’ve set the environment variables.

Conclusion

Although your business and security requirements can be more complex than the example shown in this post, this blog post will give you a good way to bootstrap your own implementation of the Device Grant Flow (RFC 8628) by using Amazon Cognito, AWS Lambda, and Amazon DynamoDB.

Your end users can now benefit from the same level of security and the same experience as they have when they enroll their identity in their mobile applications, including the following features:

  • Credentials will be provided through a full-featured application on the user’s mobile device or their computer
  • Credentials will be checked against the source of authority only
  • The authentication experience will match the typical authentication process chosen by the end user
  • Upon consent by the end user, IoT devices will be provided with end-user delegated dynamic credentials that are bound to the exact scope of tasks for that device

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Cognito forum or reach out through the post’s GitHub repository.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jeff Lombardo

Jeff is a solutions architect expert in IAM, Application Security, and Data Protection. Through 16 years as a security consultant for enterprises of all sizes and business verticals, he delivered innovative solutions with respect to standards and governance frameworks. Today at AWS, he helps organizations enforce best practices and defense in depth for secure cloud adoption.

Parallel and dynamic SaaS deployments with AWS CDK Pipelines

Post Syndicated from Jani Muuriaisniemi original https://aws.amazon.com/blogs/devops/parallel-and-dynamic-saas-deployments-with-cdk-pipelines/

Software as a Service (SaaS) is an increasingly popular business model for independent software vendors (ISVs), including benefits such as a pay-as-you-go pricing model, scalability, and availability.

SaaS services can be built by using numerous architectural models. The silo model provides each tenant with dedicated resources and a shared-nothing architecture. Silo deployments also provide isolation between tenants’ compute resources and their data, and they help eliminate the noisy-neighbor problem. On the other hand, the pool model offers several benefits, such as lower maintenance overhead, simplified management and operations, and cost-saving opportunities, all due to a more efficient utilization of computing resources and capacity. In the bridge model, both silo and pool models are utilized side-by-side. The bridge model is a hybrid model, where parts of the system can be in a silo model, and parts in a pool.

End-customers benefit from SaaS delivery in numerous ways. For example, the service can be available from multiple locations, letting the customer choose what is best for them. The tenant onboarding process is often real-time and frictionless. To realize these benefits for their end-customers, SaaS providers need methods for reliable, fast, and multi-region capable provisioning and software lifecycle management.

This post will describe a deployment system for automating the provision and lifecycle management of workload components in pool or silo deployment models by using AWS Cloud Development Kit (AWS CDK) and CDK Pipelines. We will explore the system’s dynamic and database driven deployment model, as well as its multi-account and multi-region capabilities, and we will provision demo deployments of workload components in both the silo and pool models.

AWS Cloud Development Kit and CDK Pipelines

For this solution, we utilized AWS Cloud Development Kit (AWS CDK) and its CDK Pipelines construct library. AWS CDK is an open-source software development framework for modeling and provisioning cloud application resources by using familiar programming languages. AWS CDK lets you define your infrastructure as code and provision it through AWS CloudFormation.

CDK Pipelines is a high-level construct library with an opinionated implementation of a continuous deployment pipeline for your CDK applications. It is powered by AWS CodePipeline, a fully managed continuous delivery service that helps automate your release pipelines for fast and reliable application as well as infrastructure updates. No servers need to be provisioned or setup, and you only pay for what you use. This solution utilizes the recently released and stable CDK Pipelines modern API.

Business Scenario

As a baseline use case, we have selected the consideration of a fictitious ISV called Unicorn that wants to implement an SaaS business model.

Unicorn operates in several countries, and requires the storing of customer data within the customers’ chosen region. Currently, Unicorn needs two regions in order to satisfy its main customer base: one in EU and one in US. Unicorn expects rapid growth, and it needs a solution that can scale to thousands of tenants. Unicorn plans to have different tenant tiers with different isolation requirements. Their planned deployment model has the majority of tenants in shared pool instances, but they also plan to support dedicated silo instances for the tenants requiring it. The solution must also be easily extendable to new Regions as Unicorn’s business expands.

Unicorn is starting small with just a single development team responsible for currently the only component in their SaaS workload architecture. Following industry best practices, Unicorn has designed its workload architecture so that each component has a clear technical ownership boundary. The chosen solution must grow together with Unicorn, and support multiple independently developed and deployed components in the future.

Solution Overview

Today, many customers utilize AWS CodePipeline to build, test, and deploy their cloud applications. For an SaaS provider such as Unicorn, considering utilizing a single pipeline for managing every deployment presented concerns. At the scale that Unicorn requires, a single pipeline with potentially hundreds of actions runs the risk of becoming throughput limited. Moreover, a single pipeline would offer Unicorn limited control over how changes are released.

Our solution addresses this problem by having a separate dynamically provisioned pipeline for each pool and silo deployment. The solution is designed to manage multiple deployments of Unicorn’s single workload component, thereby aligning with their current needs — and with small changes, including future needs.

CDK Best Practices state that an AWS CDK application maps to a component as defined by the AWS Well-Architected Framework. A component is the code, configuration, and AWS Resources that together deliver against a workload requirement. And this is typically the unit of technical ownership. A component usually includes logical units (e.g., api, database), and can have a continuous deployment pipeline.

Utilizing CDK Pipelines provides a significant benefit: with no additional code, we can deploy cross-account and cross-region just as easily as we would to a single account and region. CDK Pipelines automatically creates and manages the required cross-account encryption keys and cross-region replication buckets. Furthermore, we only need to establish a trust relationship between the accounts during the CDK bootstrapping process.

The following diagram illustrates the solution architecture:

Solution Architecture Diagram

Figure 1: Solution architecture

Let’s look closer at the two primary high level solution flows: silo and pool pipeline provisioning (1 and 2), and component code deployment (3 and 4).

Provisioning is separated into a dedicated flow, so that code deployments do not interfere with tenant onboarding, and vice versa. At the heart of the provisioning flow is the deployment database (1), which is implemented by using an Amazon DynamoDB table.

Utilizing DynamoDB Streams and AWS Lambda Triggers, a new AWS CodeBuild provisioning project build (2) is automatically started after a record is inserted into the deployment database. The provisioning project directly provisions new silo and pool pipelines by using the “cdk deploy” command. Provisioning events are processed in parallel, so that the solution can handle possible bursts in Unicorn’s tenant onboarding volumes.

CDK best practices suggest that infrastructure and runtime code live in the same package. A single AWS CodeCommit repository (3) contains everything needed: the CI/CD pipeline definitions as well as the workload component code. This repository is the source artifact for every CodePipeline pipeline and CodeBuild project. The chapter “Managing application resources as code” describes related implementation details.

The CI/CD pipeline (4) is a CDK Pipelines pipeline, and it is responsible for the component’s Software Development Life Cycle (SDLC) activities. In addition to implementing the update release process, it is expected that most SaaS providers will also implement additional activities. This includes a variety of tests and pre-production environment deployments. The chapter “Controlling deployment updates” dives deeper into this topic.

Deployments have two parts: The pipeline (5) and the component resource stack(s) (6) that it manages. The pipelines are deployed to the central toolchain account and region, whereas the component resources are deployed to the AWS Account and Region, as specified in the deployments’ record in the deployment database.

Sample code for the solution is available in GitHub. The sample code is intended for utilization in conjunction with this post. Our solution is implemented in TypeScript.

Deployment Database

Our deployment database is an Amazon DynamoDB table, with the following structure:

Table structure explained in post.

Figure 2: DynamoDB table

  • ‘id’ is a unique identifier for each deployment.
  • ‘account’ is the AWS account ID for the component resources.
  • ‘region’ is the AWS region ID for the component resources.
  • ‘type’ is either ‘silo’ or ‘pool’, which defines the deployment model.

This design supports tenant deployment to multiple silo and pool deployments. Each of these can target any available and bootstrapped AWS Account and Region. For example, different pools can support tenants in different regions, with select tenants deployed to dedicated silos. As pools may be limited to how many tenants they can serve, the design also supports having multiple pools within a region, and it can easily be extended with an additional attribute to support the tiers concept.

Note that the deployment database does not contain tenant information. It is expected that such mapping is maintained in a separate tenant database, where each tenant record can map to the ID of the deployment that it is associated with.

Now that we have looked at our solution design and architecture, let’s move to the hands-on section, starting with the deployment requirements for the solution.

Prerequisites

The following tools are required to deploy the solution:

To follow this tutorial completely, you should have administrator access to at least one, but preferably two AWS accounts:

  • Toolchain: Account for the SDLC toolchain: the pipelines, the provisioning project, the repository, and the deployment database.
  • Workload (optional): Account for the component resources.

If you have only a single account, then the toolchain account can be used for both purposes. Credentials for the account(s) are assumed to be configured in AWS CLI profile(s).

The instructions in this post use the following placeholders, which you must replace with your specific values:

  • <TOOLCHAIN_ACCOUNT_ID>: The AWS Account ID for the toolchain account
  • <TOOLCHAIN_PROFILE_NAME>: The AWS CLI profile name for the toolchain account credentials
  • <WORKLOAD_ACCOUNT_ID>: The AWS Account ID for the workload account
  • <WORKLOAD_PROFILE_NAME>: The AWS CLI profile name for the workload account credentials

Bootstrapping

The toolchain account, and all workload account(s), must be bootstrapped prior to first-time deployment.

AWS CDK and our solutions’ dependencies must be installed to start with. The easiest way to do this is to install them locally with npm. First, we need to download our sample code, so that the we have the package.json configuration file available for npm.

Note that throughout these instructions, many commands are broken over multiple lines for readability. Take care to execute the commands completely. It is always safe to execute each code block as a whole.

Clone the sample code repository from GitHub, and then install the dependencies by using npm:

git clone https://github.com/aws-samples/aws-saas-parallel-deployments
cd aws-saas-parallel-deployments
npm ci 

CDK Pipelines requires use of modern bootstrapping. To ensure that this is enabled, start by setting the related environment variable:

export CDK_NEW_BOOTSTRAP=1

Then, bootstrap the toolchain account. You must bootstrap both the region where the toolchain stack is deployed, as well as every target region for component resources. Here, we will first bootstrap only the us-east-1 region, and later you can optionally bootstrap additional region(s).

To bootstrap, we use npx to execute the locally installed version of AWS CDK:

npx cdk bootstrap <TOOLCHAIN_ACCOUNT_ID>/us-east-1 --profile <TOOLCHAIN_PROFILE_NAME>

If you have a workload account that is separate from the toolchain account, then that account must also be bootstrapped. When bootstrapping the workload account, we will establish a trust relationship with the toolchain account. Skip this step if you don’t have a separate workload account.

The workload account boostrappings follows the security best practice of least privilege. First create an execution policy with the minimum permissions required to deploy our demo component resources. We provide a sample policy file in the solution repository for this purpose. Then, use that policy as the execution policy for the trust relationship between the toolchain account and the workload account

aws iam create-policy \
  --profile <WORKLOAD_PROFILE_NAME> \
  --policy-name CDK-Exec-Policy \
  --policy-document file://policies/workload-cdk-exec-policy.json
npx cdk bootstrap <WORKLOAD_ACCOUNT_ID>/us-east-1 \
  --profile <WORKLOAD_PROFILE_NAME> \
  --trust <TOOLCHAIN_ACCOUNT_ID> \
  --cloudformation-execution-policies arn:aws:iam::<WORKLOAD_ACCOUNT_ID>:policy/CDK-Exec-Policy

Toolchain deployment

Prior to being able to deploy for the first time, you must create an AWS CodeCommit repository for the solution. Create this repository in the toolchain account:

aws codecommit create-repository \
  --profile <TOOLCHAIN_PROFILE_NAME> \
  --region us-east-1 \
  --repository-name unicorn-repository

Next, you must push the contents to the CodeCommit repository. For this, use the git command together with the git-remote-codecommit extension in order to authenticate to the repository with your AWS CLI credentials. Our pipelines are configured to use the main branch.

git remote add unicorn codecommit::us-east-1://<TOOLCHAIN_PROFILE_NAME>@unicorn-repository
git push unicorn main

Now we are ready to deploy the toolchain stack:

export AWS_REGION=us-east-1
npx cdk deploy --profile <TOOLCHAIN_PROFILE_NAME>

Workload deployments

At this point, our CI/CD pipeline, provisioning project, and deployment database have been created. The database is initially empty.

Note that the DynamoDB command line interface demonstrated below is not intended to be the SaaS providers provisioning interface for production use. SaaS providers typically have online registration portals, wherein the customer signs up for the service. When new deployments are needed, then a record should automatically be inserted into the solution’s deployment database.

To demonstrate the solution’s capabilities, first we will provision two deployments, with an optional third cross-region deployment:

  1. A silo deployment (silo1) in the us-east-1 region.
  2. A pool deployment (pool1) in the us-east-1 region.
  3. A pool deployment (pool2) in the eu-west-1 region (optional).

To start, configure the AWS CLI environment variables:

export AWS_REGION=us-east-1
export AWS_PROFILE=<TOOLCHAIN_PROFILE_NAME>

Add the deployment database records for the first two deployments:

aws dynamodb put-item \
  --table-name unicorn-deployments \
  --item '{
    "id": {"S":"silo1"},
    "type": {"S":"silo"},
    "account": {"S":"<WORKLOAD_ACCOUNT_ID>"},
    "region": {"S":"us-east-1"}
  }'
aws dynamodb put-item \
  --table-name unicorn-deployments \
  --item '{
    "id": {"S":"pool1"},
    "type": {"S":"pool"},
    "account": {"S":"<WORKLOAD_ACCOUNT_ID>"},
    "region": {"S":"us-east-1"}
  }'

This will trigger two parallel builds of the provisioning CodeBuild project. Use the CodeBuild Console in order to observe the status and progress of each build.

Cross-region deployment (optional)

Optionally, also try a cross-region deployment. Skip this part if a cross-region deployment is not relevant for your use case.

First, you must bootstrap the target region in the toolchain and the workload accounts. Bootstrapping of eu-west-1 here is identical to the bootstrapping of the us-east-1 region earlier. First bootstrap the toolchain account:

npx cdk bootstrap <TOOLCHAIN_ACCOUNT_ID>/eu-west-1 --profile <TOOLCHAIN_PROFILE_NAME>

If you have a separate workload account, then we must also bootstrap it for the new region. Again, please skip this if you have only a single account:

npx cdk bootstrap <WORKLOAD_ACCOUNT_ID>/eu-west-1 \
  --profile <WORKLOAD_PROFILE_NAME> \
  --trust <TOOLCHAIN_ACCOUNT_ID> \
  --cloudformation-execution-policies arn:aws:iam::<WORKLOAD_ACCOUNT_ID>:policy/CDK-Exec-Policy

Then, add the cross-region deployment:

aws dynamodb put-item \
  --table-name unicorn-deployments \
  --item '{
    "id": {"S":"pool2"},
    "type": {"S":"pool"},
    "account": {"S":"<WORKLOAD_ACCOUNT_ID>"},
    "region": {"S":"eu-west-1"}
  }'

Validation of deployments

After the builds have completed, use the CodePipeline console to verify that the deployment pipelines were successfully created in the toolchain account:

CodePipeline console showing Pool-pool2-pipeline, Pool-pool1-pipeline and Silo-silo1-pipeline all Succeeded most recent execution.

Figure 3: CodePipeline console

Similarly, in the workload account, stacks containing your component resources will have been deployed to each configured region for the deployments. In this demo, we are deploying a single “hello world” container application utilizing AWS App Runner as runtime environment. Successful deployment can be verified by using CloudFormation Console:

Console showing Pool-pool1-resources with status of CREATE_COMPLETE

Figure 4: CloudFormation console

Now that we have successfully finished with our demo deployments, let’s look at how updates to the pipelines and the component resources can be managed.

Managing application resources as code

As highlighted earlier in the Solution Overview, every aspect of our solution shares a single source repository. With all of our code in a single source, we can easily deliver complex changes impacting multiple aspects of our solution. And all of this can be packaged, tested, and released as a single change set. For example, a change can introduce a new stage to the CI/CD pipeline, modify an existing stage in the silo and pool pipelines, and/or make code and resource changes to the component resources.

Managing the pipeline definitions is made simple by the self-mutate capability of the CDK Pipelines. Once initially deployed, each CDK Pipelines pipeline can update its own definition. This is implemented by using a separate SelfMutate stage in the pipeline definition. This stage is executed before any deployment actions, thereby ensuring that the pipeline always executes the latest version that is defined by the source code.

Managing how and when the pipelines trigger to execute also required attention. CDK Pipelines configures pipelines by default to utilize event-based polling of the source repository. While this is a reasonable default, and it is great for the CI/CD pipeline, it is undesired for our silo and pool pipelines. If all of these pipelines would execute automatically on code commits to the source repository, the CI/CD pipeline could not manage the release flow. To address this, we have configured the silo and pool pipelines with the trigger in the CodeCommitSourceOptions to NONE.

Controlling deployment updates

A key aspect of SaaS delivery is controlling how you roll out changes to tenants. Significant business risk can arise if changes are released to all tenants all-at-once in a single big bang.

This risk can be managed by utilizing a combination of silo and pool deployments. Reduce your risk by spreading tenants into multiple pools, and gradually rolling out your changes to these pools. Based on business needs and/or risk assessment, select customers can be provisioned into dedicated silo deployments, thereby allowing update control for those customers separately. Note that while all of a pool’s tenants get the same underlying update simultaneously, you can utilize feature flags to selectively enable new features only for specific tenants in the deployment.

In the demo solution, the CI/CD pipeline contains only a single custom stage “UpdateDeployments”. This CodeBuild action implements a simple “one-at-a-time” strategy. The code has been purposely written so that it is simple and provides you with a starting point to implement your own more complex strategy, as based on your unique business needs. In the default implementation, every silo and pool pipeline tracks the same “main” branch of the repository. Releases are governed by controlling when each pipeline executes to update its resources.

When designing your release strategy, look into how the planned process helps implement releases and changes with high quality and frequency. A typical starting point is a CI/CD pipeline with continuous automated deployments via multiple test and staging environments in order to validate your changes prior to deployment to any production tenants.

Furthermore, consider if utilizing a canary release strategy would help identify potential issues with your changes prior to rolling them out across all deployments in production. In a canary release, each change is first deployed only to a small subset of your deployments. Once you are satisfied with the change quality, then the change can either automatically or manually be released to the rest of your deployments. As an example, an AWS Step Functions state machine could be combined with the solution, and then utilized to control the release flow, execute validation tests, implement approval steps (either manual or automatic), and even conduct rollback if necessary.

Further considerations

The example in this post provisions every silo and pool deployment to a single AWS account. However, the solution is not limited to a single account, and it can deploy equally easily to multiple AWS accounts. When operating at scale, it is best-practice to spread your workloads to several accounts. The Organizing Your AWS Environment using Multiple Accounts whitepaper has in-depth guidance on strategies for spreading your workloads.

If combined with an AWS account-vending machine implementation, such as an AWS Control Tower Landing Zone, then the demo solution could be adapted so that new AWS accounts are provisioned automatically. This would be useful if your business requires full account-level deployment isolation, and you also want automated provisioning.

To meet Unicorn’s future needs for spreading their solution architecture over multiple separate components, the deployment database and associated lambda function could be decoupled from the rest of the toolchain components in order to provide a central deployment service. When provisioned as standalone, and amended with Amazon Simple Notification Service-based notifications sent to the component deployment systems for example, this central deployment service could be utilized for managing the deployments for multiple components.

In addition, you should analyze your deployment lifecycle transitions, and then consider what action should be taken when a tenant is disabled and/or deleted. Implementing a deployment archival/deletion process is not in the scope of this post.

Cleanup

To cleanup every resource deployed in this post, conduct the following actions:

  1. In the workload account:
    1. In us-east-1 Region, delete CloudFormation stacks named “pool-pool1-resources” and “silo-silo1-resources” and the CDK bootstrap stack “CDKToolKit”.
    2. In eu-west-1 Region, delete CloudFormation stack named “pool-pool2-resources” and the CDK Bootstrap stack “CDKToolKit”
  2. In the toolchain account:
    1. In us-east-1 Region, delete CloudFormation stacks “toolchain”, “pool-pool1-pipeline”, “pool-pool2-pipeline”, “silo-silo1-pipeline” and the CDK bootstrap stack “CDKToolKit”.
    2. In eu-west-1 Region, delete CloudFormation stack “pool-pool2-pipeline-support-eu-west-1” and the CDK bootstrap stack “CDKToolKit”
    3. Cleanup and delete S3 buckets “toolchain-*”, “pool-pool1-pipeline-*”, “pool-pool2-pipeline-*”, and “silo-silo1-pipeline-*”.

Conclusion

This solution demonstrated an implementation of an automated SaaS application component deployment factory. We covered how an ISV venturing into the SaaS model can utilize AWS CDK and CDK Pipelines in order to avoid a multitude of undifferentiated heavy lifting by leveraging and combining AWS CDK’s cross-region and cross-account capabilities with CDK Pipelines’ self-mutating deployment pipelines. Furthermore, we demonstrated how all of this can be written, managed, and released just like any other code you write. We also demonstrated how a single dynamic provisioning system can be utilized to operate in a mixed mode, with both silo and pool deployments.

Visit the AWS SaaS Factory Program page for further information on how AWS can help you on your SaaS journey — regardless of the stage you are currently in.

About the authors

Jani Muuriaisniemi

Jani is a Principal Solutions Architect at Amazon Web Services based out of Helsinki, Finland. With more than 20 years of industry experience, he works as a trusted advisor with a broad range of customers across different industries and segments, helping the customers on their cloud journey.

Jose Juhala

Jose is a Solutions Architect at Amazon Web Services based out of Tampere, Finland. He works with customers in Nordic and Baltic, from different industries, and guides them in their technical implementations architectural questions.

Emerging Solutions for Operations Research on AWS

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/emerging-solutions-for-operations-research-on-aws/

Operations research (OR) uses mathematical and analytical tools to arrive at optimal solutions for complex business problems like workforce scheduling. The mathematical techniques used to solve these problems, such as linear programming and mixed-integer programming, require the use of optimization software (solvers).  There are several popular and powerful solvers available, ranging from commercial options like IBM CPLEX to open-source packages like ORTools. While these solvers incorporate decades of algorithmic expertise and can solve large and complex problems effectively, they have some scalability limitations.

In this post, we’ll describe three alternatives that you can consider for solving OR problems (see Figure 1). None of these are as general purpose as traditional solvers, but they should be on your “emerging technologies” radar.

Figure 1. OR optimization options

Figure 1. OR optimization options

These include:

  1. A traditional solver running on a compute platform
  2. Reinforcement and machine learning (ML) algorithms running on Amazon SageMaker
  3. A quantum computing algorithm running on Amazon Braket. Experiments are collected in Amazon DynamoDB and the results are visualized in Amazon Elasticsearch Service.

A reference problem and solution

Let’s start with a reference problem and solve it with a traditional solver. We’ll tackle an inventory management issue (see Figure 2). We have a sales depot that supplies products for local sales outlets. For the depot’s Region, there are seven weeks of historical sales data for each product. We also know how much each product costs and for how much it can be sold. Finally, we know the overall weekly capacity of the depot. This depends on logistical constraints like the size of the warehouse and transportation availability. This scenario is loosely based on the Grupo Bimbo retailer’s Kaggle competition and dataset.

Figure 2. Sales depot inventory management scenario

Figure 2. Sales depot inventory management scenario

Our job is to place an inventory order to restock our sales depot each week. We quantify our work through a reward function. We want to maximize our revenue:

revenue = (sale price * number of units sold)

(Note that the sample dataset does not include cost of goods sold, only sale price.)

We use these constraints:

total units sold <= depot capacity
0 <= quantity sold of any given item <= forecasted demand for that item

There are many possible solutions to this problem. Using ORTools, we get an average reward (profit) of about $5,700, in about 1,000 simulations.

We can make the scenario slightly more realistic by acknowledging that our sales forecasts are not perfect. After we get the solution from the solver, we can penalize the reward (profit) by subtracting the cost of unsold goods. With this approach, we get a reward of about $2,450.

Solving OR problems with reinforcement learning

An alternative approach to the traditional solver is reinforcement learning (RL). RL is a field of ML that handles problems where the right answer is not immediately known, like playing a game of chess. RL fits our sales depot scenario, because we don’t know how well we will do until after we place the order and are able to view a week of sales activity.

Our sales depot problem resembles a knapsack problem. This is a common OR pattern where we want to fill a container (in this case, our sales depot) with as many items as possible until capacity is reached. Each item has a value (sales price) and a weight (cost). In RL we have to translate this into an observation space, an action space, a state, and a reward (see Figure 3).

The observation space is what our purchasing agent sees. This includes our depot capacity, the sales price, and the forecasted demand. The action space is what our agent can do. In the simplest case, it’s the number of each item to order for the depot, each week. The state is what the agent sees right now, and we model that as the sales results from last week. Finally, the reward function is our profit equation.

One important distinction between OR solvers and RL is that we can’t easily enforce hard constraints in RL. We can limit the amount of an individual product we purchase each week, but we can’t enforce an overall limit on the number of items purchased. We may exceed the capacity of our depot. The simplest way to handle that is to enforce a penalty. There are more sophisticated techniques available, such as interpreting our action as the percentage of budget to spend on each item. But let’s illustrate the simple case here.

Using an RL algorithm from the Ray RLLib package, our reward was $7,000 on average, including penalties for ordering too much of any given item.

Figure 3. Translating OR problem to RL

Figure 3. Translating OR problem to RL

Solving OR problems with machine learning

It’s possible to model a knapsack problem using ML rather than RL in some cases, and there are simple reference implementations available. The design assumes that we know, or can accurately estimate the reward for a given week. With our simple scenario, we can compute the reward using estimates of future sales. We can use this in a custom loss function to train a neural network.

Solving OR problems with quantum computing

Quantum computers are fundamentally different than the computers most of us use. The appeal of quantum computers is that they can tackle some types of problems much more efficiently than standard computers. Quantum computers can, in theory, solve prime number factoring for decryption in orders of magnitude faster than a standard computer. But they are still in their infancy and limited to the size of problem they can handle, due to hardware limitations.

D-Wave Systems, which make some of the types of quantum computers available through Amazon Braket, has a solver called QBSolv. QBSolv works on a specific type of optimization problem called quadratic unconstrained binary optimization (QUBO). It breaks large problems into smaller pieces that a quantum computer can handle. There is a reference pattern for translating a knapsack problem to a QUBO problem.

Running the sales depot problem through QBSolv on Amazon Braket and using a subset of the data, I was able to obtain a reward of $900. When I tried to run on the full dataset, I was not able to complete the decomposition step, likely due to a hardware limitation.

Conclusion

In this blog post, I review OR problems and traditional OR solvers. I then discussed three alternative approaches, RL, ML, and quantum computing. Each of these alternatives has drawbacks and none is a general-purpose replacement for traditional OR solvers.

However, RL and ML are potentially more scalable because you can train those solutions on a cluster of machines, rather than running an OR solver on a single machine. RL agents can also learn from experience, giving them flexibility to handle scenarios that may be difficult to incorporate into an OR solver. Quantum computing solutions are promising but the current state of the art for quantum computers limits their application to small-scale problems at the moment. All of these alternatives can potentially derive a solution more quickly than an OR solver.

Further Reading: