Tag Archives: Best practices

Automate marketing campaigns with real-time customer data using Amazon Pinpoint

Post Syndicated from Rushabh Lokhande original https://aws.amazon.com/blogs/messaging-and-targeting/automate-marketing-campaigns-with-real-time-customer-data-using-amazon-pinpoint/

Amazon Pinpoint offers marketers and developers one customizable tool to deliver customer communications across channels, segments, and campaigns at scale. Amazon Pinpoint makes it easy to run targeted campaigns and drive customer communications across different channels: email, SMS, push notifications, in-app messaging, or custom channels. Amazon Pinpoint campaigns enables you define which users to target, determine which messages to send, schedule the best time to deliver the messages, and then track the results of your campaign.

In many cases, the customer data resides in a third-party system such as a CRM, Customer Data Platform, Point of Sales, database and data warehouse. This customer data represents a valuable asset for your organization. Your marketing team needs to leverage each piece of this data to elevate the customer experience.

In this blog post we will demonstrate how you can leverage users’ clickstream data stored in database to build user segments and launch campaigns using Amazon Pinpoint. Also, we will showcase the full architecture of the data pipeline including other AWS services such as Amazon RDS, AWS Data Migration Service, Amazon Kinesis and AWS Lambda.

Let us understand our case study with an example: a customer currently has digital touch points such as a Website and a Mobile App to collect the users’ clickstreams and behavioral data where they are storing them in a MySQL database. Marketing teams want to leverage the collected data to deliver a personalized experience by leveraging Amazon Pinpoint capabilities.

You can find below the detail of a specific use case covered by the proposed solution:

  • All the clickstream and customer data are stored in MySQL DB
  • Your marketing team wants to create a personalized Amazon Pinpoint campaigns based on the user status and experience. Ex:
    • Customers who interested in specific offering to activate for them campaign based on their interest
    • Communicate with the preferred language of the user

Please note that this use case is used to showcase the proposed solution capabilities. However, it is not limited to this specific use case since you can leverage any customer collected dimension/attribute to create specific campaign to achieve a specific marketing use case.

In this post, we provide a guided journey on how marketers can collect, segment, and activate audience segments in real-time to increase their agility in managing campaigns.

Overview of solution

The use case covered in this post, focuses on demonstrating the flexibility offered by Amazon Pinpoint in both inbound (Ingestion) and outbound (Activation) stream of customer data. For the inbound stream, Amazon Pinpoint gives you a variety of ways to import your customer data, including:

  1. CSV/JSON import from the AWS console
  2. API operation to create a single or multiple endpoints
  3. Programmatically create and execute import jobs

We will focus on building a real-time inbound stream of customer data available within an Amazon RDS MySQL database specifically. It is important to mention that similar approach can be implemented to ingest data from third-party systems if any.

For the outbound stream, activating customer data using Amazon Pinpoint can be achieved using the following two methods:

  1. Campaign: a campaign is a messaging initiative that engages a specific audience segment.
  2. Journey: a journey is a customized, multi-step engagement experience.

The result of customer data activation cannot be completed without specifying the targeted channel. A channel represents the platform through which you engage your audience segment with messages. For example, Amazon Pinpoint customers can optimize how they target notifications to prospective customers through LINE message and email. They can deliver notifications with more information on prospected customer’s product information such as sales, new products etc. to the appropriate audience.

Amazon Pinpoint supports the following channels:

  • Push notifications
  • Email
  • SMS
  • Voice
  • In-app messages

In addition to these channels, you can also extend the capabilities to meet your specific use case by creating custom channels. You can use custom channels to send messages to your customers through any service that has an API including third-party services. For example, you can use custom channels to send messages through third-party services such as WhatsApp or Facebook Messenger. We will focus on developing an Amazon Pinpoint connector using custom channel to target your customers on third-party services through API.

Solution Architecture

The below diagram illustrates the proposed architecture to address the use case. Moving from left to right:

Fig 1: Architecture Diagram for the Solution

Fig 1: Architecture Diagram for the Solution

  1. Amazon RDS: This hosts customer database where you can have one or many tables contains customer data.
  2. AWS Data Migration Service (DMS): This acts as the glue between Amazon RDS MySQL and the downstream services by replicating any transformation that happens at the record level in the configured customer tables.
  3. Amazon Kinesis Data Streams: This is the destination endpoint for AWS DMS. It will carry all the transformed records for the next stage of the pipeline.
  4. AWS Lambda (inbound): The inbound AWS Lambda triggers the Kinesis Data Streams, process the mutated records, and ingest them in Amazon Pinpoint.
  5. Amazon Pinpoint: This act as the centralized place to define customer segments and launch campaigns.
  6. AWS Lambda (outbound): This act as the custom channel destination for the campaigns activated from Amazon Pinpoint.

To illustrate how to set up this architecture, we’ll walk you through the following steps:

  1. Deploying an AWS CDK stack to provision the following AWS Resources
  2. Validate the Deployment.
  3. Run a Sample Workflow – This workflow will run an AWS Glue PySpark job that uses a custom Python library, and an upgraded version of boto3.
  4. Cleaning up your resources.

Prerequisites

Make sure that you complete the following steps as prerequisites:

The Solution

Launching your AWS CDK Stack

Step 1a: Open your device’s command line or Terminal.

Step1b: Checkout Git repository to a local directory on your device:

git clone https://github.com/aws-samples/amazon-pinpoint-realtime-campaign-optimization-example.git

Step 2: Change directories to the new directory code location:

cd amazon-pinpoint-realtime-campaign-optimization-example

Step 3: Update your AWS account number and region:

  1. Edit config.py with your choice to tool or command line
  2. look for section “Account Setup” and update your account number and region

    Fig 2: Configuring config.py for account-id and region

    Fig 2: Configuring config.py for account-id and region

  3. look for section “VPC Parameters” and update your VPC and subnet info

    Fig 3: Configuring config.py for VPC and subnet information

    Fig 3: Configuring config.py for VPC and subnet information

Step 4: Verify if you are in the directory where app.py file is located:

ls -ltr app.py

Step 5: Create a virtual environment:

macOS/Linux:

python3 -m venv .env

Windows:

python -m venv .env

Step 6: Activate the virtual environment after the init process completes and the virtual environment is created:

macOS/Linux:

source .env/bin/activate

Windows:

.env\Scripts\activate.bat

Step 7: Install the required dependencies:

pip3 install -r requirements.txt

Step 8: Bootstrap the cdk app using the following command:

cdk bootstrap aws://<AWS_ACCOUNTID>/<AWS_REGION>

Replace the place holder AWS_ACCOUNTID and AWS_REGION with your AWS account ID and the region to be deployed.
This step provisions the initial resources, including an Amazon S3 bucket for storing files and IAM roles that grant permissions needed to perform deployments.

Fig 4: Bootstrapping CDK environment

Fig 4: Bootstrapping CDK environment

Please note, if you have already bootstrapped the same account previously, you cannot bootstrap account, in such case skip this step or use a new AWS account.

Step 9: Make sure that your AWS profile is setup along with the region that you want to deploy as mentioned in the prerequisite. Synthesize the templates. AWS CDK apps use code to define the infrastructure, and when run they produce or “synthesize” a CloudFormation template for each stack defined in the application:

cdk synthesize

Step 10: Deploy the solution. By default, some actions that could potentially make security changes require approval. In this deployment, you’re creating an IAM role. The following command overrides the approval prompts, but if you would like to manually accept the prompts, then omit the –require-approval never flag:

cdk deploy "*" --require-approval never

While the AWS CDK deploys the CloudFormation stacks, you can follow the deployment progress in your terminal.

Fig 5: AWS CDK Deployment progress in terminal

Fig 5: AWS CDK Deployment progress in terminal

Once the deployment is successful, you’ll see the successful status as follows:

Fig 6: AWS CDK Deployment completion success

Fig 6: AWS CDK Deployment completion success

Step 11: Log in to the AWS Console, go to CloudFormation, and see the output of the ApplicationStack:

Fig 7: AWS CloudFormation stack output

Fig 7: AWS CloudFormation stack output

Note the values of PinpointProjectId, PinpointProjectName, and RDSSecretName variables. We’ll use them in the next step to upload our artifacts

Testing The Solution

In this section we will create a full data flow using the below steps:

  1. Ingest data in the customer_tb table within the Amazon RDS MySQL DB instance
  2. Validate that AWS Data Migration Service created task is replicating the changes to the Amazon Kinesis Data Streams
  3. Validate that endpoints are created within Amazon Pinpoint
  4. Create Amazon Pinpoint Segment and Campaign and activate data to Webhook.site endpoint URL

Step 1: Connect to MySQL DB instance and create customer database

    1. Sign in to the AWS Management Console and open the AWS Cloud9 console at https://console.aws.amazon.com/cloud9 
    2. Click Create environment
      • Name: mysql-cloud9-01 (for example)
      • Click Next
      • Environment type: Create a new EC2 instance for environment (direct access)
      • Instance type: t2.micro
      • Timeout: 30 minutes
      • Platform: Amazon Linux 2
      • Network settings under VPC settings select the same VPC where the MySQL DB instance was created. (this is the same VPC and Subnet from step 3.3)
      • Click Next
      • Review and click Create environment
    3. Select the created AWS Cloud9 from the AWS Cloud9 console at https://console.aws.amazon.com/cloud9  and click Open in Cloud9. You will have access to AWS Cloud9 Linux shell.
    4. From Linux shell, update the operating system and :
      sudo yum update -y
    5. From Linux shell, update the operating system and :
      sudo yum install -y mysql
    6. To connect to the created MySQL RDS DB instance, use the below command in the AWS Cloud9 Linux shell:
      mysql -h <<host>> -P 3308 --user=<<username>> --password=<<password>>
      • To get values for dbInstanceIdentifier, username, and password
        • Navigate to the AWS Secrets Manager service
        • Open the secret with the name created by the CDK application
        • Select ‘Reveal secret value’ and copy the respective values and replace in your command
      • After you enter the password for the user, you should see output similar to the following.
      • Welcome to the MariaDB monitor.  Commands end with ; or \g.
        Your MySQL connection id is 27
        Server version: 8.0.32 Source distribution
        Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
        Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
        MySQL [(none)]>

Step 2: Ingest data in the customer_tb table within the Amazon RDS MySQL DB instance Once the connection to the MySQL DB instance established, using the same AWS Cloud9 Linux shell connected to the MySQL RDS DB execute following commands.

  • Create database pinpoint-test-db:
    CREATE DATABASE `pinpoint-test-db`;
  • Create table customer-tb:
    Use `pinpoint-test-db`;
    CREATE TABLE `customer_tb` (`userid` int NOT NULL,
                                `email` varchar(150) DEFAULT NULL,
                                `language` varchar(45) DEFAULT NULL,
                                `favourites` varchar(250) DEFAULT NULL,
                                PRIMARY KEY (`userid`);
  • You can verify the schema using the below SQL command:
  1. DESCRIBE `pinpoint-test-db`.customer_tb;
    Fig 8: Verify schema for customer_db table

    Fig 8: Verify schema for customer_db table

    • Insert records in customer_tb table:
    Use `pinpoint-test-db`;
    insert into customer_tb values (1,'[email protected]','english','football');
    insert into customer_tb values (2,'[email protected]','english','basketball');
    insert into customer_tb values (3,'[email protected]','french','football');
    insert into customer_tb values (4,'[email protected]','french','football');
    insert into customer_tb values (5,'[email protected]','french','basketball');
    insert into customer_tb values (6,'[email protected]','french','football');
    insert into customer_tb values (7,'[email protected]','french',null);
    insert into customer_tb values (8,'[email protected]','english','football');
    insert into customer_tb values (9,'[email protected]','english','football');
    insert into customer_tb values (10,'[email protected]','english',null);
    • Verify records in customer_tb table:
    select * from `pinpoint-test-db`.`customer_tb`;
    Fig 9: Verify data for customer_db table

    Fig 9: Verify data for customer_db table

    Step 3: Validate that AWS Data Migration Service created task is replicating the changes to the Amazon Kinesis Data Streams

      1. Sign in to the AWS Management Console and open the AWS DMS console at https://console.aws.amazon.com/dms/v2 
      2. From the navigation panel, choose Database migration tasks.
      3. Click on the created task created by CDK code ‘dmsreplicationtask-*’
      4. Start the replication task

        Fig 10: Starting AWS DMS Replication Task

        Fig 10: Starting AWS DMS Replication Task

      5. Make sure that Status is Replication ongoing

        Fig 11: AWS DMS Replication statistics

        Fig 11: AWS DMS Replication statistics

      6. Navigate to Table Statistics and make sure that the number of Inserts is equal to 10 and Load state is Table completed*

        Fig 12: AWS DMS Replication statistics

        Fig 12: AWS DMS Replication statistics

    Step 4: Validate that endpoints are created within Amazon Pinpoint

    1. Sign in to the AWS Management Console and open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/ 
    2. Click on Amazon Pinpoint Project Demo created by CDK stack “dev-pinpoint-project”
    3. From the left menu, click on Analytics and validate that the Active targetable endpoints are equal to 10 as shown below:
    Fig 13: Amazon Pinpoint endpoint summary

    Fig 13: Amazon Pinpoint endpoint summary

    Step 5: Create Amazon Pinpoint Segment and Campaign

    Step 5.1: Create Amazon Pinpoint Segment

    • Sign in to the AWS Management Console and open the Amazon Pinpoint console at https://console.aws.amazon.com/pinpoint/ 
    • Click on Amazon Pinpoint Project Demo created by CDK stack “dev-pinpoint-project”
    • from the left menu, click on Segments and click Create a segment
    • create Segment using the below configurations:
      • Name: English Speakers
      • Under criteria:
    • Attribute: Language
    • Operator: Conatins
    • Value: english
    Fig 14: Amazon Pinpoint segment summary

    Fig 14: Amazon Pinpoint segment summary

    • Click create segment

    Step 5.2: Create Amazon Pinpoint Campaign

    • from the left menu, click on Campaigns and click Create a campaign
    • set the Campaign name to test campaign and select Custom option for Channel as shown below:
    Fig 15: Amazon Pinpoint create campaign

    Fig 15: Amazon Pinpoint create campaign

    • Click Next
    • Select English Speakers from Segment as shown below and click Next:
    Fig 16: Amazon Pinpoint segment

    Fig 16: Amazon Pinpoint segment

    • Choose Lambda function channel type and select outbound lambda function with name pattern as ApplicationStack-lambdaoutboundfunction* from the dropdown as shown below:
    Fig 17: Amazon Pinpoint message creation

    Fig 17: Amazon Pinpoint message creation

    • Click Next
    • Choose At a specific time option and immediately to send the campaign as show below:
    Fig 18: Amazon Pinpoint campaign scheduling

    Fig 18: Amazon Pinpoint campaign scheduling

    If you push more messages or records into Amazon RDS (from step 2.4), you will need to create a new campaign (from step 4.2) to process the new messages.

    • Click Next, review the configuration and click Launch campaign.
    • Navigate to dev-pinpoint-project and select the campaign created in previous step. You should see status as ‘Complete’
    Fig 19: Amazon Pinpoint campaign processing status

    Fig 19: Amazon Pinpoint campaign processing status

    • Navigate to dev-pinpoint-project dashboard and select your campaign in ‘Campaign metrics’ dashboard, you will see the statistics for the processing.
    Fig 20: Amazon Pinpoint campaign metrics

    Fig 20: Amazon Pinpoint campaign metrics

    Accomplishments

    This is a quick summary of what we accomplished:

    1. Created an Amazon RDS MySQL DB instance and define customer_tb table schema
    2. Created an Amazon Kinesis Data Stream
    3. Replicated database changes from the Amazon RDS MySQL DB to Amazon Kinesis Data Stream
    4. Created an AWS Lambda function triggered by Amazon Kinesis Data Stream to ingest database records in Amazon Pinpoints as User endpoints using AWS SDK
    5. Created an Amazon Pinpoint Project, segment and campaign
    6. Created an AWS Lambda function as custom channel for Amazon Pinpoint campaign
    7. Tested end-to-end data flow from Amazon RDS MySQL DB instance to third party endpoint

    Next Steps

    You have now gained a good understanding of Amazon Pinpoint agnostic data flow but there are still many areas left for exploration. What this workshop hasn’t covered is the operation of other communication channels such as Email, SMS, Push notification and Voice outbound. You can enable the channels that are pertinent to your use case and send messages using campaigns or journeys.

    Clean Up

    Make sure that you clean up all of the other AWS resources that you created in the AWS CDK Stack deployment. You can delete these resources via the AWS CDK Destroy command as follows or the CloudFormation console.

    To destroy the resources using AWS CDK, follow these steps:

    • Follow Steps 1-5 from the ‘Launching your CDK Stack’ section.
    • Destroy the app by executing the following command:
    cdk destroy

    Summary

    In this post, you have now gained a good understanding of Amazon Pinpoint flexible real-time data flow. By implementing the steps detailed in this blog post, you can achieve a seamless integration of your customer data from Amazon RDS MySQL database to Amazon Pinpoint where you can leverage segments and campaigns to activate data using custom channels to third-party services via API. The demonstrated use case focuses on Amazon RDS MySQL database as a data source. However, there are still many areas left for exploration. What this post hasn’t covered is the operation of integrating customer data from other type of data sources such as MongoDB, Microsoft SQL Server, Google Cloud, etc. Also, other communication channels such as Email, SMS, Push notification and Voice outbound can be used in the activation layer. You can enable the channels that are pertinent to your use case and send messages using campaigns or journeys, and get a complete view of their customers across all touchpoints and can lead to less relevant marketing campaigns.

    About the Authors

  2. Bret Pontillo is a Senior Data Architect with AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.
    Rushabh Lokhande is a Data & ML Engineer with AWS Professional Services Analytics Practice. He helps customers implement big data, machine learning, and analytics solutions. Outside of work, he enjoys spending time with family, reading, running, and golf.
    Ghandi Nader is a Senior Partner Solution Architect focusing on the Adtech and Martech industry. He helps customers and partners innovate and align with the market trends related to the industry. Outside of work, he enjoys spending time cycling and watching formula one.

Quickly Restore Amazon EC2 Mac Instances using Replace Root Volume capability

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/new-reset-amazon-ec2-mac-instances-to-a-known-state-using-replace-root-volume-capability/

This post is written by Sebastien Stormacq, Principal Developer Advocate.

Amazon Elastic Compute Cloud (Amazon EC2) now supports replacing the root volume on a running EC2 Mac instance, enabling you to restore the root volume of an EC2 Mac instance to its initial launch state, to a specific snapshot, or to a new Amazon Machine Image (AMI).

Since 2021, we have offered on-demand and pay-as-you-go access to Amazon EC2 Mac instances, in the same manner as our Intel, AMD and Graviton-based instances. Amazon EC2 Mac instances integrate all the capabilities you know and love from macOS with dozens of AWS services such as Amazon Virtual Private Cloud (VPC) for network security, Amazon Elastic Block Store (EBS) for expandable storage, Elastic Load Balancing (ELB) for distributing build queues, Amazon FSx for scalable file storage, and AWS Systems Manager Agent (SSM Agent) for configuring, managing, and patching macOS environments.

Just like for every EC2 instance type, AWS is responsible for protecting the infrastructure that runs all of the services offered in the AWS cloud. To ensure that EC2 Mac instances provide the same security and data privacy as other Nitro-based EC2 instances, Amazon EC2 performs a scrubbing workflow on the underlying Dedicated Host as soon as you stop or terminate an instance. This scrubbing process erases the internal SSD, clears the persistent NVRAM variables, and updates the device firmware to the latest version enabling you to run the latest macOS AMIs. The documentation has more details about this process.

The scrubbing process ensures a sanitized dedicated host for each EC2 Mac instance launch and takes some time to complete. Our customers have shared two use cases where they may need to set back their instance to a previous state in a shorter time period or without the need to initiate the scrubbing workflow. The first use case is when patching an existing disk image to bring OS-level or applications-level updates to your fleet, without manually patching individual instances in-place. The second use case is during continuous integration and continuous deployment (CI/CD) when you need to restore an Amazon EC2 Mac instance to a defined well-known state at the end of a build.

To restart your EC2 Mac instance in its initial state without stopping or terminating them, we created the ability to replace the root volume of an Amazon EC2 Mac instance with another EBS volume. This new EBS volume is created either from a new AMI, an Amazon EBS Snapshot, or from the initial volume state during boot.

You just swap the root volume with a new one and initiate a reboot at OS-level. Local data, additional attached EBS volumes, networking configurations, and IAM profiles are all preserved. Additional EBS volumes attached to the instance are also preserved, as well as the instance IP addresses, IAM policies, and security groups.

Let’s see how Replace Root Volume works

To prepare and initiate an Amazon EBS root volume replacement, you can use the AWS Management Console, the AWS Command Line Interface (AWS CLI), or one of our AWS SDKs. For this demo, I used the AWS CLI to show how you can automate the entire process.

To start the demo, I first allocate a Dedicated Host and then start an EC2 Mac instance, SSH-connect to it, and install the latest version of Xcode. I use the open-source xcodeinstall CLI tool to download and install Xcode. Typically, you also download, install, and configure a build agent and additional build tools or libraries as required by your build pipelines.

Once the instance is ready, I create an Amazon Machine Image (AMI). AMIs are disk images you can reuse to launch additional and identical EC2 Mac instances. This can be done from any machine that has the credentials to make API calls on your AWS account. In the following, you can see the commands I issued from my laptop’s Terminal application.

#
# Find the instance’s ID based on the instance name tag
#
~ aws ec2 describe-instances \
--filters "Name=tag:Name,Values=RRV-Demo" \
--query "Reservations[].Instances[].InstanceId" \
--output text 

i-0fb8ffd5dbfdd5384

#
# Create an AMI based on this instance
#
~ aws ec2 create-image \
--instance-id i-0fb8ffd5dbfdd5384 \
--name "macOS_13.3_Gold_AMI"	\
--description "macOS 13.2 with Xcode 13.4.1"

{
 
"ImageId": "ami-0012e59ed047168e4"
}

It takes a few minutes to complete the AMI creation process.

After I created this AMI, I can use my instance as usual. I can use it to build, test, and distribute my application, or make any other changes on the root volume.

When I want to reset the instance to the state of my AMI, I initiate the replace root volume operation:

~ aws ec2 create-replace-root-volume-task	\
--instance-id i-0fb8ffd5dbfdd5384 \
--image-id ami-0012e59ed047168e4
{
"ReplaceRootVolumeTask": {
"ReplaceRootVolumeTaskId": "replacevol-07634c2a6cf2a1c61", "InstanceId": "i-0fb8ffd5dbfdd5384",
"TaskState": "pending", "StartTime": "2023-05-26T12:44:35Z", "Tags": [],
"ImageId": "ami-0012e59ed047168e4", "SnapshotId": "snap-02be6b9c02d654c83", "DeleteReplacedRootVolume": false
}
}

The root Amazon EBS volume is replaced with a fresh one created from the AMI, and the system triggers an OS-level reboot.

I can observe the progress with the DescribeReplaceRootVolumeTasks API

~ aws ec2 describe-replace-root-volume-tasks \
--replace-root-volume-task-ids replacevol-07634c2a6cf2a1c61

{
"ReplaceRootVolumeTasks": [
{
"ReplaceRootVolumeTaskId": "replacevol-07634c2a6cf2a1c61", "InstanceId": "i-0fb8ffd5dbfdd5384",
"TaskState": "succeeded", "StartTime": "2023-05-26T12:44:35Z",
"CompleteTime": "2023-05-26T12:44:43Z", "Tags": [],
"ImageId": "ami-0012e59ed047168e4", "DeleteReplacedRootVolume": false
}
]
}

After a short time, the instance becomes available again, and I can connect over ssh.

~ ssh [email protected]
Warning: Permanently added '3.0.0.86' (ED25519) to the list of known hosts.
Last login: Wed May 24 18:13:42 2023 from 81.0.0.0

┌───┬──┐	 |  |_ )
│ ╷╭╯╷ │	_| (	/
│ └╮	│   |\  |  |
│ ╰─┼╯ │ Amazon EC2
└───┴──┘ macOS Ventura 13.2.1
 
ec2-user@ip-172-31-58-100 ~ %

Additional thoughts

There are a couple of additional points to know before using this new capability:

  • By default, the old root volume is preserved. You can pass the –-delete-replaced-root-volume option to delete it automatically. Do not forget to delete old volumes and their corresponding Amazon EBS Snapshots when you don’t need them anymore to avoid being charged for them.
  • During the replacement, the instance will be unable to respond to health checks and hence might be marked as unhealthy if placed inside an Auto Scaled Group. You can write a custom health check to change that behavior.
  • When replacing the root volume with an AMI, the AMI must have the same product code, billing information, architecture type, and virtualization type as that of the instance.
  • When replacing the root volume with a snapshot, you must use snapshots from the same lineage as the instance’s current root volume.
  • The size of the new volume is the largest of the AMI’s block device mapping and the size of the old Amazon EBS root volume.
  • Any non-root Amazon EBS volume stays attached to the instance.
  • Finally, the content of the instance store (the internal SSD drive) is untouched, and all other meta-data of the instance are unmodified (the IP addresses, ENI, IAM policies etc.).

Pricing and availability

Replace Root Volume for EC2 Mac is available in all AWS Regions where Amazon EC2 Mac instances are available. There is no additional cost to use this capability. You are charged for the storage consumed by the Amazon EBS Snapshots and AMIs.

Check other options available on the API or AWS CLI and go configure your first root volume replacement task today!

Modernize a legacy real-time analytics application with Amazon Managed Service for Apache Flink

Post Syndicated from Bhupesh Sharma original https://aws.amazon.com/blogs/big-data/modernize-a-legacy-real-time-analytics-application-with-amazon-managed-service-for-apache-flink/

Organizations with legacy, on-premises, near-real-time analytics solutions typically rely on self-managed relational databases as their data store for analytics workloads. To reap the benefits of cloud computing, like increased agility and just-in-time provisioning of resources, organizations are migrating their legacy analytics applications to AWS. The lift and shift migration approach is limited in its ability to transform businesses because it relies on outdated, legacy technologies and architectures that limit flexibility and slow down productivity. In this post, we discuss ways to modernize your legacy, on-premises, real-time analytics architecture to build serverless data analytics solutions on AWS using Amazon Managed Service for Apache Flink.

Near-real-time streaming analytics captures the value of operational data and metrics to provide new insights to create business opportunities. In this post, we discuss challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions. We introduce you to Amazon Managed Service for Apache Flink Studio and get started querying streaming data interactively using Amazon Kinesis Data Streams.

Solution overview

In this post, we walk through a call center analytics solution that provides insights into the call center’s performance in near-real time through metrics that determine agent efficiency in handling calls in the queue. Key performance indicators (KPIs) of interest for a call center from a near-real-time platform could be calls waiting in the queue, highlighted in a performance dashboard within a few seconds of data ingestion from call center streams. These metrics help agents improve their call handle time and also reallocate agents across organizations to handle pending calls in the queue.

Traditionally, such a legacy call center analytics platform would be built on a relational database that stores data from streaming sources. Data transformations through stored procedures and use of materialized views to curate datasets and generate insights is a known pattern with relational databases. However, as data loses its relevance with time, transformations in a near-real-time analytics platform need only the latest data from the streams to generate insights. This may require frequent truncation in certain tables to retain only the latest stream of events. Also, the need to derive near-real-time insights within seconds requires frequent materialized view refreshes in this traditional relational database approach. Frequent materialized view refreshes on top of constantly changing base tables due to streamed data can lead to snapshot isolation errors. Also, a data model that allows table truncations at a regular frequency (for example, every 15 seconds) to store only relevant data in tables can cause locking and performance issues.

The following diagram provides the high-level architecture of a legacy call center analytics platform. In this traditional architecture, a relational database is used to store data from streaming data sources. Datasets used for generating insights are curated using materialized views inside the database and published for business intelligence (BI) reporting.

Modernizing this traditional database-driven architecture in the AWS Cloud allows you to use sophisticated streaming technologies like Amazon Managed Service for Apache Flink, which are built to transform and analyze streaming data in real time. With Amazon Managed Service for Apache Flink, you can gain actionable insights from streaming data with serverless, fully managed Apache Flink. You can use Amazon Managed Service for Apache Flink to quickly build end-to-end stream processing applications and process data continuously, getting insights in seconds or minutes. With Amazon Managed Service for Apache Flink, you can use Apache Flink code or Flink SQL to continuously generate time-series analytics over time windows and perform sophisticated joins across streams.

The following architecture diagram illustrates how a legacy call center analytics platform running on databases can be modernized to run on the AWS Cloud using Amazon Managed Service for Apache Flink. It shows a call center streaming data source that sends the latest call center feed in every 15 seconds. The second streaming data source constitutes metadata information about the call center organization and agents that gets refreshed throughout the day. You can perform sophisticated joins over these streaming datasets and create views on top of it using Amazon Managed Service for Apache Flink to generate KPIs required for the business using Amazon OpenSearch Service. You can analyze streaming data interactively using managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio in near-real time. The near-real-time insights can then be visualized as a performance dashboard using OpenSearch Dashboards.

In this post, you perform the following high-level implementation steps:

  1. Ingest data from streaming data sources to Kinesis Data Streams.
  2. Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the stream data within seconds of data ingestion.
  3. Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards.

Prerequisites

This post requires you to set up the Amazon Kinesis Data Generator (KDG) to send data to a Kinesis data stream using an AWS CloudFormation template. For the template and setup information, refer to Test Your Streaming Data Solution with the New Amazon Kinesis Data Generator.

We use two datasets in this post. The first dataset is fact data, which contains call center organization data. The KDG generates a fact data feed in every 15 seconds that contains the following information:

  • AgentId – Agents work in a call center setting surrounded by other call center employees answering customers’ questions and referring them to the necessary resources to solve their problems.
  • OrgId – A call center contains different organizations and departments, such as Care Hub, IT Hub, Allied Hub, Prem Hub, Help Hub, and more.
  • QueueId – Call queues provide an effective way to route calls using simple or sophisticated patterns to ensure that all calls are getting into the correct hands quickly.
  • WorkMode – An agent work mode determines your current state and availability to receive incoming calls from the Automatic Call Distribution (ACD) and Direct Agent Call (DAC) queues. Call Center Elite does not route ACD and DAC calls to your phone when you are in an Aux mode or ACW mode.
  • WorkSkill – Working as a call center agent requires several soft skills to see the best results, like problem-solving, bilingualism, channel experience, aptitude with data, and more.
  • HandleTime – This customer service metric measures the length of a customer’s call.
  • ServiceLevel – The call center service level is defined as the percentage of calls answered within a predefined amount of time—the target time threshold. It can be measured over any period of time (such as 30 minutes, 1 hour, 1 day, or 1 week) and for each agent, team, department, or the company as a whole.
  • WorkStates – This specifies what state an agent is in. For example, an agent in an available state is available to handle calls from an ACD queue. An agent can have several states with respect to different ACD devices, or they can use a single state to describe their relationship to all ACD devices. Agent states are reported in agent-state events.
  • WaitingTime – This is the average time an inbound call spends waiting in the queue or waiting for a callback if that feature is active in your IVR system.
  • EventTime – This is the time when the call center stream is sent (via the KDG in this post).

The following fact payload is used in the KDG to generate sample fact data:

{
"AgentId" : {{random.number(
{
"min":2001,
"max":2005
}
)}},
"OrgID" : {{random.number(
{
"min":101,
"max":105
}
)}},
"QueueId" : {{random.number(
{
"min":1,
"max":5
}
)}},
"WorkMode" : "{{random.arrayElement(
["TACW","ACW","AUX"]
)}}",
"WorkSkill": "{{random.arrayElement(
["Problem Solving","Bilingualism","Channel experience","Aptitude with data"]
)}}",
"HandleTime": {{random.number(
{
"min":1,
"max":150
}
)}},
"ServiceLevel":"{{random.arrayElement(
["Sev1","Sev2","Sev3"]
)}}",
"WorkSates": "{{random.arrayElement(
["Unavailable","Available","On a call"]
)}}",
"WaitingTime": {{random.number(
{
"min":10,
"max":150
}
)}},
"EventTime":"{{date.utc("YYYY-MM-DDTHH:mm:ss")}}"
}

The following screenshot shows the output of the sample fact data in an Amazon Managed Service for Apache Flink notebook.

The second dataset is dimension data. This data contains metadata information like organization names for their respective organization IDs, agent names, and more. The frequency of the dimension dataset is twice a day, whereas the fact dataset gets loaded in every 15 seconds. In this post, we use Amazon Simple Storage Service (Amazon S3) as a data storage layer to store metadata information (Amazon DynamoDB can be used to store metadata information as well). We use AWS Lambda to load metadata from Amazon S3 to another Kinesis data stream that stores metadata information. The following JSON file stored in Amazon S3 has metadata mappings to be loaded into the Kinesis data stream:

[{"OrgID": 101,"OrgName" : "Care Hub","Region" : "APAC"},
{"OrgID" : 102,"OrgName" : "Prem Hub","Region" : "AMER"},
{"OrgID" : 103,"OrgName" : "IT Hub","Region" : "EMEA"},
{"OrgID" : 104,"OrgName" : "Help Hub","Region" : "EMEA"},
{"OrgID" : 105,"OrgName" : "Allied Hub","Region" : "LATAM"}]

Ingest data from streaming data sources to Kinesis Data Streams

To start ingesting your data, complete the following steps:

  1. Create two Kinesis data streams for the fact and dimension datasets, as shown in the following screenshot. For instructions, refer to Creating a Stream via the AWS Management Console.

  1. Create a Lambda function on the Lambda console to load metadata files from Amazon S3 to Kinesis Data Streams. Use the following code:
    import boto3
    import json
    
    # Create S3 object
    s3_client = boto3.client("s3")
    S3_BUCKET = '<S3 Bucket Name>'
    kinesis_client = boto3.client("kinesis")
    stream_name = '<Kinesis Stream Name>'
        
    def lambda_handler(event, context):
      
      # Read Metadata file on Amazon S3
      object_key = "<S3 File Name.json>"  
      file_content = s3_client.get_object(
          Bucket=S3_BUCKET, Key=object_key)["Body"].read()
      
      #Decode the S3 object to json
      decoded_data = file_content.decode("utf-8").replace("'", '"')
      json_data = json.dumps(decoded_data)
      
      #Upload json data to Kinesis data stream
      partition_key = 'OrgID'
      for record in json_data:
        response = kinesis_client.put_record(
          StreamName=stream_name,
          Data=json.dumps(record),
          PartitionKey=partition_key)

Use managed Apache Zeppelin notebooks with Amazon Managed Service for Apache Flink Studio to transform the streaming data

The next step is to create tables in Amazon Managed Service for Apache Flink Studio for further transformations (joins, aggregations, and so on). To set up and query Kinesis Data Streams using Amazon Managed Service for Apache Flink Studio, refer to Query your data streams interactively using Amazon Managed Service for Apache Flink Studio and Python and create an Amazon Managed Service for Apache Flink notebook. Then complete the following steps:

  1. In the Amazon Managed Service for Apache Flink Studio notebook, create a fact table from the facts data stream you created earlier, using the following query.

The event time attribute is defined using a WATERMARK statement in the CREATE table DDL. A WATERMARK statement defines a watermark generation expression on an existing event time field, which marks the event time field as the event time attribute.

The event time refers to the processing of streaming data based on timestamps that are attached to each row. The timestamps can encode when an event happened. Processing time (PROCTime) refers to the machine’s system time that is running the respective operation.

%flink.ssql
CREATE TABLE <Fact Table Name> (
AgentId INT,
OrgID INT,
QueueId BIGINT,
WorkMode VARCHAR,
WorkSkill VARCHAR,
HandleTime INT,
ServiceLevel VARCHAR,
WorkSates VARCHAR,
WaitingTime INT,
EventTime TIMESTAMP(3),
WATERMARK FOR EventTime AS EventTime - INTERVAL '4' SECOND
)
WITH (
'connector' = 'kinesis',
'stream' = '<fact stream name>',
'aws.region' = '<AWS region ex. us-east-1>',
'scan.stream.initpos' = 'LATEST',
'format' = 'json',
'json.timestamp-format.standard' = 'ISO-8601'
);

  1. Create a dimension table in the Amazon Managed Service for Apache Flink Studio notebook that uses the metadata Kinesis data stream:
    %flink.ssql
    CREATE TABLE <Metadata Table Name> (
    AgentId INT,
    AgentName VARCHAR,
    OrgID INT,
    OrgName VARCHAR,
    update_time as CURRENT_TIMESTAMP,
    WATERMARK FOR update_time AS update_time
    )
    WITH (
    'connector' = 'kinesis',
    'stream' = '<Metadata Stream Name>',
    'aws.region' = '<AWS region ex. us-east-1> ',
    'scan.stream.initpos' = 'LATEST',
    'format' = 'json',
    'json.timestamp-format.standard' = 'ISO-8601'
    );

  1. Create a versioned view to extract the latest version of metadata table values to be joined with the facts table:
    %flink.ssql(type=update)
    CREATE VIEW versioned_metadata AS 
    SELECT OrgID,OrgName
      FROM (
          SELECT *,
          ROW_NUMBER() OVER (PARTITION BY OrgID
             ORDER BY update_time DESC) AS rownum 
          FROM <Metadata Table>)
    WHERE rownum = 1;

  1. Join the facts and versioned metadata table on orgID and create a view that provides the total calls in the queue in each organization in a span of every 5 seconds, for further reporting. Additionally, create a tumble window of 5 seconds to receive the final output in every 5 seconds. See the following code:
    %flink.ssql(type=update)
    
    CREATE VIEW joined_view AS
    SELECT streamtable.window_start, streamtable.window_end,metadata.OrgName,streamtable.CallsInQueue
    FROM
    (SELECT window_start, window_end, OrgID, count(QueueId) as CallsInQueue
      FROM TABLE(
        TUMBLE( TABLE <Fact Table Name>, DESCRIPTOR(EventTime), INTERVAL '5' SECOND))
      GROUP BY window_start, window_end, OrgID) as streamtable
    JOIN
        <Metadata table name> metadata
        ON metadata.OrgID = streamtable.OrgID

  1. Now you can run the following query from the view you created and see the results in the notebook:
%flink.ssql(type=update)

SELECT jv.window_end, jv.CallsInQueue, jv.window_start, jv.OrgName
FROM joined_view jv where jv.OrgName = ‘Prem Hub’ ;

Visualize KPIs of call center performance in near-real time through OpenSearch Dashboards

You can publish the metrics generated within Amazon Managed Service for Apache Flink Studio to OpenSearch Service and visualize metrics in near-real time by creating a call center performance dashboard, as shown in the following example. Refer to Stream the data and validate output to configure OpenSearch Dashboards with Amazon Managed Service for Apache Flink. After you configure the connector, you can run the following command from the notebook to create an index in an OpenSearch Service cluster.

%flink.ssql(type=update)

drop table if exists active_call_queue;
CREATE TABLE active_call_queue (
window_start TIMESTAMP,
window_end TIMESTAMP,
OrgID int,
OrgName varchar,
CallsInQueue BIGINT,
Agent_cnt bigint,
max_handle_time bigint,
min_handle_time bigint,
max_wait_time bigint,
min_wait_time bigint
) WITH (
‘connector’ = ‘elasticsearch-7’,
‘hosts’ = ‘<Amazon OpenSearch host name>’,
‘index’ = ‘active_call_queue’,
‘username’ = ‘<username>’,
‘password’ = ‘<password>’
);

The active_call_queue index is created in OpenSearch Service. The following screenshot shows an index pattern from OpenSearch Dashboards.

Now you can create visualizations in OpenSearch Dashboards. The following screenshot shows an example.

Conclusion

In this post, we discussed ways to modernize a legacy, on-premises, real-time analytics architecture and build a serverless data analytics solution on AWS using Amazon Managed Service for Apache Flink. We also discussed challenges with relational databases when used for real-time analytics and ways to mitigate them by modernizing the architecture with serverless AWS solutions.

If you have any questions or suggestions, please leave us a comment.


About the Authors

Bhupesh Sharma is a Senior Data Engineer with AWS. His role is helping customers architect highly-available, high-performance, and cost-effective data analytics solutions to empower customers with data-driven decision-making. In his free time, he enjoys playing musical instruments, road biking, and swimming.

Devika Singh is a Senior Data Engineer at Amazon, with deep understanding of AWS services, architecture, and cloud-based best practices. With her expertise in data analytics and AWS cloud migrations, she provides technical guidance to the community, on architecting and building a cost-effective, secure and scalable solution, that is driven by business needs. Beyond her professional pursuits, she is passionate about classical music and travel.

Using Generative AI, Amazon Bedrock and Amazon CodeGuru to Improve Code Quality and Security

Post Syndicated from Marcilio Mendonca original https://aws.amazon.com/blogs/devops/using-generative-ai-amazon-bedrock-and-amazon-codeguru-to-improve-code-quality-and-security/

Automated code analysis plays a key role in improving code quality and compliance. Amazon CodeGuru Reviewer provides automated recommendations that can assist developers in identifying defects and deviation from coding best practices. For instance, CodeGuru Security automatically flags potential security vulnerabilities such as SQL injection, hardcoded AWS credentials and cross-site request forgery, to name a few. After becoming aware of these findings, developers can take decisive action to remediate their code.

On the other hand, determining what the best course of action is to address a particular automated recommendation might not always be obvious. For instance, an apprentice developer may not fully grasp what a SQL injection attack means or what makes the code at hand particularly vulnerable. In another situation, the developer reviewing a CodeGuru recommendation might not be the same developer who wrote the initial code. In these cases, the developer will first need to get familiarized with the code and the recommendation in order to take proper corrective action.

By using Generative AI, developers can leverage pre-trained foundation models to gain insights on their code’s structure, the CodeGuru Reviewer recommendation and the potential corrective actions. For example, Generative AI models can generate text content, e.g., to explain a technical concept such as SQL injection attacks or the correct use of a given library. Once the recommendation is well understood, the Generative AI model can be used to refactor the original code so that it complies with the recommendation. The possibilities opened up by Generative AI are numerous when it comes to improving code quality and security.

In this post, we will show how you can use CodeGuru Reviewer and Bedrock to improve the quality and security of your code. While CodeGuru Reviewer can provide automated code analysis and recommendations, Bedrock offers a low-friction environment that enables you to gain insights on the CodeGuru recommendations and to find creative ways to remediate your code.

Solution Overview

The diagram below depicts our approach and the AWS services involved. It works as follows:

1. The developer pushes code to an AWS CodeCommit repository.
2. The repository is associated with CodeGuru Reviewer, so an automated code review is initiated.
3. Upon completion, the CodeGuru Reviewer console displays a list of recommendations for the code base, if applicable.
4. Once aware of the recommendation and the affected code, the developer navigates to the Bedrock console, chooses a foundation model and builds a prompt (we will give examples of prompts in the next session).
5. Bedrock generates content as a response to the prompt, including code generation.
6. The developer might optionally refine the prompt, for example, to gain further insights on the CodeGuru Reviewer recommendation or to request for alternatives to remediate the code.
7. The model can respond with generated code that addresses the issue which can then be pushed back into the repository.

CodeCommit, CodeGuru and Bedrock used together

CodeCommit, CodeGuru and Bedrock used together

Note that we use CodeCommit in our walkthrough but readers can use any Git sources supported by CodeGuru Reviewer.

Using Generative AI to Improve Code Quality and Security

Next, we’re going to walk you through a scenario where a developer needs to improve the quality of her code after CodeGuru Reviewer has provided recommendations. But before getting there, let’s choose a code repository and set the Bedrock inference parameters.

A good reference of source repository for exploring CodeGuru Reviewer recommendations is the Amazon CodeGuru Reviewer Python Detector repository. The repository contains a comprehensive list of compliant and non-compliant code which fits well in the context of our discussion.

In terms of Bedrock model, we use Anthropic Claude V1 (v1.3) in our analysis which is specialized in content generation including text and code. We set the required model parameters as follows: temperature=0.5, top_p=0.9, top_k=500, max_tokens=2048. We set temperature and top_p parameters so as to give the model a bit more flexibility to generate responses for the same question. Please check the inference parameter definitions on Bedrock’s user guide for further details on these parameters. Given the randomness level specified by our inference parameters, readers experimenting with the prompts provided in this post might observe slightly different answers than the ones presented.

Requirements

  • An AWS account with access to CodeCommit, CodeGuru and Bedrock
  • Bedrock access enabled in the account. On-demand access should be fine (check pricing here).
  • Download and install the AWS CLI and Git (to push code to CodeCommit)

Walkthrough

Follow the steps below to run CodeGuru Reviewer analysis on a repository and to build and run Bedrock prompts.

  • Clone the from GitHub to your local workstation
git clone https://github.com/aws-samples/amazon-codeguru-reviewer-python-detectors.git
  • Create a CodeCommit repository and add a new Git remote
aws codecommit create-repository --repository-name amazon-codeguru-reviewer-python-detectors

cd amazon-codeguru-reviewer-python-detectors/

git remote add codecommit https://git-codecommit.us-east-1.amazonaws.com/v1/repos/amazon-codeguru-reviewer-python-detectors
  • Associate CodeGuru Reviewer with the repository to enable repository analysis
aws codeguru-reviewer associate-repository --repository 'CodeCommit={Name=amazon-codeguru-reviewer-python-detectors}'

Save the association ARN value returned after the command is executed (e.g., arn:aws:codeguru-reviewer:xx-xxxx-x:111111111111:association:e85aa20c-41d76-03b-f788-cefd0d2a3590).

  • Push code to the CodeCommit repository using the codecommit git remote
git push codecommit main:main
  • Trigger CodeGuru Reviewer to run a repository analysis on the repository’s main branch. Use the repository association ARN you noted in a previous step here.
aws codeguru-reviewer create-code-review \
 --name codereview001 \
 --type '{"RepositoryAnalysis": {"RepositoryHead": {"BranchName": "main"}}}' \
 --repository-association-arn arn:aws:codeguru-reviewer:xx-xxxx-x:111111111111:association:e85aa20c-41d76-03b-f788-cefd0d2a3590

Navigate to the CodeGuru Reviewer Console to see the various recommendations provided (you might have to wait a few minutes for the code analysis to run).

Amazon CodeGuru reviewer

Amazon CodeGuru Reviewer

  • On the CodeGuru Reviewer console (see screenshot above), we select the first recommendation on file hashlib_contructor.py, line 12, and take note of the recommendation content: The constructors for the hashlib module are faster than new(). We recommend using hashlib.sha256() instead.
  • Now let’s extract the affected code. Click on the file name link (hashlib_contructor.py in the figure above) to open the corresponding code in the CodeCommit console.
AWS CodeCommit Repository

AWS CodeCommit Repository

  • The blue arrow in the CodeCommit console above indicates the non-compliant code highlighting the specific line (line 12). We select the wrapping python function from lines 5 through 15 to build our prompt. You may want to experiment reducing the scope to a single line or a given block of lines and check if it yields better responses.
Amazon Bedrock Playground Console

Amazon Bedrock Playground Console

  • We then navigate to the Bedrock console (see screenshot above).
    • Search for keyword Bedrock in the AWS console
    • Select the Bedrock service to navigate to the service console
    • Choose Playgrounds, then choose Text
    • Choose model Anthropic Claude V1 (1.3). If you don’t see this model available, please make sure to enable model access.
  • Set the Inference configuration as shown in the screenshot below including temperature, Top P and the other parameters. Please check the inference parameter definitions on Bedrock’s user guide for further details on these parameters.
  • Build a Bedrock prompt using three elements, as illustrated in the screenshot below:
    • The source code copied from CodeCommit
    • The CodeGuru Reviewer recommendation
    • A request to refactor the code to address the code analysis finding
A Prompt in the Amazon Bedrock Playground Console

A Prompt in the Amazon Bedrock Playground Console

  • Press the Run button. Notice that Bedrock will automatically add the words Human (at the top) and Assistant (at the bottom) to the prompt.  Wait a few seconds and a response is generated (in green). The response includes the refactored code and an explanation on how the code was fixed (see screenshot below).
A Prompt Response (or completion) in the Amazon Bedrock Playground Console

A Prompt Response (or completion) in the Amazon Bedrock Playground Console

Note that the original code was refactored to use ashlib.sha256() instead of  using new in the constructor: hashlib.new(‘sha256’, …). Note that the prompt also asks for an explanation on how the refactored code fixes the issue, so the response includes such details. If we were interested in the refactored code only, we could change the prompt and ask that it returns only the refactored code.

Gaining Insights and Improving Code Quality

So far, our walkthrough showed how to navigate through the CodeCommit, CodeGuru Reviewer and Bedrock consoles to gather relevant content that can be used to build and run a Bedrock prompt. We are now going to explore further prompts to gain insights on the code and the recommendations in order to build more specific prompt to refactor the code. We omit screenshots for brevity. Readers can follow the same process outlined in the previous session.

Learning about the code

Let’s now examine another CodeGuru Reviewer recommendation. Specifically, we want to find the SQL Injection vulnerability raised for file, line 12, as shown below.

Sample non-compliant code

Sample non-compliant code

CodeGuru Reviewer provided the following recommendation:

CodeGuru Reviewer recommendation

For the sake of the walkthrough, let’s assume the developer reviewing the code above was not the original developer who built the code. We start by asking Bedrock to explain the code, line by line. Note that we have not included the CodeGuru Reviewer recommendation in the prompt this time as our initial goal is to understand the code.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Explain the code logic above line by line.

Response:

Sure, here is a line by line explanation of the code:
def execute_query_noncompliant(request):
This defines a function called execute_query_noncompliant that takes a request as an argument.
import sqlite3
This imports the sqlite3 library to connect to a SQLite database.
name = request.GET.get("name")
This gets a "name" query parameter from the request GET parameters.
query = "SELECT * FROM Users WHERE name = " + name + ";"
This constructs an SQL query string by concatenating the name variable into the query. 
So if name was "John", the query would be:
SELECT * FROM Users WHERE name = John;
with sqlite3.connect("example.db") as connection:
This opens a connection to the "example.db" SQLite database.
   cursor = connection.cursor()
This gets a cursor to execute queries on the connection.
    # Noncompliant: user input is used without sanitization.
    cursor.execute(query)
This executes the query on the database. The comment notes that this is noncompliant 
because it uses unsanitized user input in the query.
    connection.commit()
    connection.close()

Bedrock provides a line-by-line explanation of the code including the library being used (sqllite3) and a comment about the code using an unsanitized [MOU1] user input. The explanation is detailed helping the developer to understand the code logic and structure.

The developer now has a good understanding of the code. Let’s now gain insights on why and how the code is vulnerable to SQL Injection.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Why is the code above vulnerable to SQL injection?

Response:

The code is vulnerable to SQL injection because it directly concatenates 
user input (the name parameter) into an SQL query string. This allows a 
malicious user to inject SQL code into the query.
For example, if the name parameter is passed as:

'; DROP TABLE Users; -- 
Then the resulting SQL query would be:

SELECT * FROM Users WHERE name = ''; DROP TABLE Users; -- ;
This would drop the Users table, causing damage to the database.
To fix this, the user input should be sanitized before being added to the 
SQL query string. This can be done by:

Escaping single quotes in the input:
name = request.GET.get("name").replace("'", "''")
query = f"SELECT * FROM Users WHERE name = '{name}';"

Using parameterized queries:
query = "SELECT * FROM Users WHERE name = ?"
cursor.execute(query, (name,))

Again, the response is detailed. It explains that the user input is being concatenated directly into a query string. It gives an example of a SQL injection attack that could cause the Users table to be dropped. It also provides two options to fix the code including escaping single quotes and using parameterized queries.

We continue to dive deep into SQL injection. Let’s build a prompt to gather further examples of possible SQL injection attacks.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()
Request:
Provide examples of SQL injection for the code above

Response:

Here are some examples of SQL injection for that code:

If the name parameter is passed as:
'; DROP TABLE Users; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = ''; DROP TABLE Users; -- ;
This would drop the Users table.

If the name parameter is passed as:
OR 1=1; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = OR 1=1; -- ;
This would select all rows from the Users table.

If the name parameter is passed as:
'; UPDATE Users SET password = 'hacked' WHERE 1=1; -- 

Then the executed query would be:
SELECT * FROM Users WHERE name = ''; UPDATE Users SET password = 'hacked' WHERE 1=1; -- ;

This would update all passwords in the Users table to 'hacked'.
So in summary, without sanitizing the input, this code is vulnerable to SQL injection attacks 
that can delete data, access unauthorized data, modify data, etc. The input should be sanitized 
by escaping special characters before using in the SQL query.

Bedrock responds by including two new forms of SQL injection attacks, one that lists all records from the Users table and another one that updates all passwords in the table.

At this point, the developer has learned a lot about the code, SQL injection attacks and the CodeGuru Reviewer recommendation. It’s now time to build a prompt to fix the code. We leverage what we learned in previous prompts including the term “parameterized queries”, as suggested by CodeGuru Reviewer, to build a specific request in the prompt. Note that we have also included the CodeGuru Reviewer recommendation in the prompt this time.

Prompt:

Source Code:
def execute_query_noncompliant(request):
    import sqlite3
    name = request.GET.get("name")
    query = "SELECT * FROM Users WHERE name = " + name + ";" 
    with sqlite3.connect("example.db") as connection:
        cursor = connection.cursor()
        # Noncompliant: user input is used without sanitization.
        cursor.execute(query)
        connection.commit()
        connection.close()

Code Analysis Finding: 
It seems that you are constructing an SQL query using formatted strings. This could 
make your code vulnerable to SQL injection attacks, which can lead to loss of critical 
data. We recommend that you use parameterized queries for constructing SQL queries.  
[Learn more](https://cwe.mitre.org/data/definitions/89.html)
Request:
Fix the code above using parameterized query.

Response:

Here is the fixed code using parameterized query:

def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Instead of concatenating the user input directly into the SQL string, we use a ? 
placeholder and pass the user input as a tuple argument to cursor.execute(). This 
prevents SQL injection since the input is treated as a literal value rather than 
part of the SQL syntax.

The response includes a refactored code that uses a parameterized query to prevent SQL injection attacks (cursor.execute(query, (name,)).


We’re excited with the improvements in the code and now decide to go beyond fixing the SQL injection vulnerability. The next prompt asks Bedrock to make two specific improvements to the refactored.

  • Add python type hints
  • Add exception handling

Prompt:

Source Code:
def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Request: 
Add exception handling and type hints to the code above.

Response:

Here is the code with exception handling and type hints:

def execute_query_compliant(request: Request) -> None: 
   """Executes a compliant SQL query"""
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   try:
      with sqlite3.connect("example.db") as connection: 
          cursor: sqlite3.Cursor = connection.cursor()
          # Compliant: Parameterized query prevents SQL injection
          cursor.execute(query, (name,)) 
          connection.commit()
   except sqlite3.Error as e: 
      print(f"SQLite error: {e}")
   finally: 
      connection.close()

I added:

Type hints:
request: Request indicating the request parameter is of type Request

cursor: sqlite3.Cursor indicating cursor is of type sqlite3.Cursor
Exception handling:
A try/except block catching sqlite3.Error exceptions. 
A finally block ensuring the connection is always closed. 
Let me know if you would like me to explain anything in the updated code!

Note that the generated code now includes a try/except/finally block to handle exceptions and type hints for the function parameter and return value.

However, some variables such as name and query are missing type hints. Here, we’d like to call out the importance of human supervision on automatically generated responses. The developer tweaks the previous prompt to ask that “all variables” have type hint added to them.

Prompt:

Source Code:
def execute_query_compliant(request):
   import sqlite3
   name = request.GET.get("name")
   query = "SELECT * FROM Users WHERE name = ?" 
   with sqlite3.connect("example.db") as connection:
      cursor = connection.cursor()
      # Compliant: Parameterized query prevents SQL injection
      cursor.execute(query, (name,)) 
      connection.commit()
      connection.close()

Request: 
Add exception handling and type hints to all variables in the code above.

Response:

Here is the code with exception handling and type hints:

def execute_query_compliant(request: Request) -> None: 
   """Executes a compliant SQL query"""
   import sqlite3
   name: str = request.GET.get("name")
   query: str = "SELECT * FROM Users WHERE name = ?" 
   try:
      with sqlite3.connect("example.db") as connection: 
         cursor: sqlite3.Cursor = connection.cursor()
         # Compliant: Parameterized query prevents SQL injection
         cursor.execute(query, (name,)) 
         connection.commit()
   except sqlite3.Error as e: 
        print(f"SQLite error: {e}")
   finally: 
        connection.close()

The new response refactors the code making sure all variables now have type hint. This shows how little changes to a prompt can produce much better results. In our example, by adding “all variables” to the prompt caused the response to indeed add type hints to all variables in the code provided.

Here is a summary of the activities performed via Bedrock prompting:

  • Gain insights on the code and the CodeGuru recommendation
    • Explain the code logic above line by line.
    • Why is the code above vulnerable to SQL injection?
    • Provide examples of SQL injection for the code above
  • Refactor and Improve the Code
    • Fix the code above using parameterized query
    • Add exception handling and type hints to the code above
    • Add exception handling and type hints to all variables in the code above.

The main takeaway is that by using a static analysis and security testing tool such as CodeGuru Reviewer in combination with a Generative AI service such as Bedrock, developers can significantly improve their code towards best practices and enhanced security. In addition, prompts which are more specific normally yield better results and that’s when CodeGuru Reviewer can be really helpful as it gives developers hints and keywords that can be used to build powerful prompts.

Cleaning Up

Don’t forget to delete the CodeCommit repository created if you no longer need it.

aws codecommit delete-repository -–repository-name amazon-codeguru-reviewer-python-detectors

Conclusion and Call to Action

In this blog, we discussed how CodeGuru Reviewer and Bedrock can be used in combination to improve code quality and security. While CodeGuru Reviewer provides a rich set of recommendations through automated code reviews, Bedrock gives developers the ability to gain deeper insights on the code and the recommendations as well as to refactor the original code to meet compliance and best practices.

We encourage readers to explore new Bedrock prompts beyond the ones introduced in this post and share their feedback with us.

Here are some ideas:

For a sample Python repository we recommend using the Amazon CodeGuru Reviewer Python Detector repository on GitHub which is publicly accessible to readers.

For Java developers, there’s a CodeGuru Reviewer Python Detector for Java repository alternative available.

Note: at the time of the writing of this post, Bedrock’s Anthropic Claude 2.0 model was not yet available so we invite readers to also experiment with the prompts provided using that model.

Special thanks to my colleagues Raghvender Arni and Mahesh Yadav for support and review of this post.
Author: Marcilio Mendonca

Marcilio Mendonca

Marcilio Mendonca is a Sr. Solutions Developer in the Prototyping And Customer Engineering (PACE) team at Amazon Web Services. He is passionate about helping customers rethink and reinvent their business through the art of prototyping, primarily in the realm of modern application development, Serverless and AI/ML. Prior to joining AWS, Marcilio was a Software Development Engineer with Amazon. He also holds a PhD in Computer Science. You can find Marcilio on LinkedIn at https://www.linkedin.com/in/marcilio/. Let’s connect!

Blue/Green deployments using AWS CDK Pipelines and AWS CodeDeploy

Post Syndicated from Luiz Decaro original https://aws.amazon.com/blogs/devops/blue-green-deployments-using-aws-cdk-pipelines-and-aws-codedeploy/

Customers often ask for help with implementing Blue/Green deployments to Amazon Elastic Container Service (Amazon ECS) using AWS CodeDeploy. Their use cases usually involve cross-Region and cross-account deployment scenarios. These requirements are challenging enough on their own, but in addition to those, there are specific design decisions that need to be considered when using CodeDeploy. These include how to configure CodeDeploy, when and how to create CodeDeploy resources (such as Application and Deployment Group), and how to write code that can be used to deploy to any combination of account and Region.

Today, I will discuss those design decisions in detail and how to use CDK Pipelines to implement a self-mutating pipeline that deploys services to Amazon ECS in cross-account and cross-Region scenarios. At the end of this blog post, I also introduce a demo application, available in Java, that follows best practices for developing and deploying cloud infrastructure using AWS Cloud Development Kit (AWS CDK).

The Pipeline

CDK Pipelines is an opinionated construct library used for building pipelines with different deployment engines. It abstracts implementation details that developers or infrastructure engineers need to solve when implementing a cross-Region or cross-account pipeline. For example, in cross-Region scenarios, AWS CloudFormation needs artifacts to be replicated to the target Region. For that reason, AWS Key Management Service (AWS KMS) keys, an Amazon Simple Storage Service (Amazon S3) bucket, and policies need to be created for the secondary Region. This enables artifacts to be moved from one Region to another. In cross-account scenarios, CodeDeploy requires a cross-account role with access to the KMS key used to encrypt configuration files. This is the sort of detail that our customers want to avoid dealing with manually.

AWS CodeDeploy is a deployment service that automates application deployment across different scenarios. It deploys to Amazon EC2 instances, On-Premises instances, serverless Lambda functions, or Amazon ECS services. It integrates with AWS Identity and Access Management (AWS IAM), to implement access control to deploy or re-deploy old versions of an application. In the Blue/Green deployment type, it is possible to automate the rollback of a deployment using Amazon CloudWatch Alarms.

CDK Pipelines was designed to automate AWS CloudFormation deployments. Using AWS CDK, these CloudFormation deployments may include deploying application software to instances or containers. However, some customers prefer using CodeDeploy to deploy application software. In this blog post, CDK Pipelines will deploy using CodeDeploy instead of CloudFormation.

A pipeline build with CDK Pipelines that deploys to Amazon ECS using AWS CodeDeploy. It contains at least 5 stages: Source, Build, UpdatePipeline, Assets and at least one Deployment stage.

Design Considerations

In this post, I’m considering the use of CDK Pipelines to implement different use cases for deploying a service to any combination of accounts (single-account & cross-account) and regions (single-Region & cross-Region) using CodeDeploy. More specifically, there are four problems that need to be solved:

CodeDeploy Configuration

The most popular options for implementing a Blue/Green deployment type using CodeDeploy are using CloudFormation Hooks or using a CodeDeploy construct. I decided to operate CodeDeploy using its configuration files. This is a flexible design that doesn’t rely on using custom resources, which is another technique customers have used to solve this problem. On each run, a pipeline pushes a container to a repository on Amazon Elastic Container Registry (ECR) and creates a tag. CodeDeploy needs that information to deploy the container.

I recommend creating a pipeline action to scan the AWS CDK cloud assembly and retrieve the repository and tag information. The same action can create the CodeDeploy configuration files. Three configuration files are required to configure CodeDeploy: appspec.yaml, taskdef.json and imageDetail.json. This pipeline action should be executed before the CodeDeploy deployment action. I recommend creating template files for appspec.yaml and taskdef.json. The following script can be used to implement the pipeline action:

##
#!/bin/sh
#
# Action Configure AWS CodeDeploy
# It customizes the files template-appspec.yaml and template-taskdef.json to the environment
#
# Account = The target Account Id
# AppName = Name of the application
# StageName = Name of the stage
# Region = Name of the region (us-east-1, us-east-2)
# PipelineId = Id of the pipeline
# ServiceName = Name of the service. It will be used to define the role and the task definition name
#
# Primary output directory is codedeploy/. All the 3 files created (appspec.json, imageDetail.json and 
# taskDef.json) will be located inside the codedeploy/ directory
#
##
Account=$1
Region=$2
AppName=$3
StageName=$4
PipelineId=$5
ServiceName=$6
repo_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages[] | .destinations[] | .repositoryName' | head -1) 
tag_name=$(cat assembly*$PipelineId-$StageName/*.assets.json | jq -r '.dockerImages | to_entries[0].key')  
echo ${repo_name} 
echo ${tag_name} 
printf '{"ImageURI":"%s"}' "$Account.dkr.ecr.$Region.amazonaws.com/${repo_name}:${tag_name}" > codedeploy/imageDetail.json                     
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-appspec.yaml > codedeploy/appspec.yaml 
sed 's#APPLICATION#'$AppName'#g' codedeploy/template-taskdef.json | sed 's#TASK_EXEC_ROLE#arn:aws:iam::'$Account':role/'$ServiceName'#g' | sed 's#fargate-task-definition#'$ServiceName'#g' > codedeploy/taskdef.json 
cat codedeploy/appspec.yaml
cat codedeploy/taskdef.json
cat codedeploy/imageDetail.json

Using a Toolchain

A good strategy is to encapsulate the pipeline inside a Toolchain to abstract how to deploy to different accounts and regions. This helps decoupling clients from the details such as how the pipeline is created, how CodeDeploy is configured, and how cross-account and cross-Region deployments are implemented. To create the pipeline, deploy a Toolchain stack. Out-of-the-box, it allows different environments to be added as needed. Depending on the requirements, the pipeline may be customized to reflect the different stages or waves that different components might require. For more information, please refer to our best practices on how to automate safe, hands-off deployments and its reference implementation.

In detail, the Toolchain stack follows the builder pattern used throughout the CDK for Java. This is a convenience that allows complex objects to be created using a single statement:

 Toolchain.Builder.create(app, Constants.APP_NAME+"Toolchain")
        .stackProperties(StackProps.builder()
                .env(Environment.builder()
                        .account(Demo.TOOLCHAIN_ACCOUNT)
                        .region(Demo.TOOLCHAIN_REGION)
                        .build())
                .build())
        .setGitRepo(Demo.CODECOMMIT_REPO)
        .setGitBranch(Demo.CODECOMMIT_BRANCH)
        .addStage(
                "UAT",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                             
        .build();

In the statement above, the continuous deployment pipeline is created in the TOOLCHAIN_ACCOUNT and TOOLCHAIN_REGION. It implements a stage that builds the source code and creates the Java archive (JAR) using Apache Maven.  The pipeline then creates a Docker image containing the JAR file.

The UAT stage will deploy the service to the SERVICE_ACCOUNT and SERVICE_REGION using the deployment configuration CANARY_10_PERCENT_5_MINUTES. This means 10 percent of the traffic is shifted in the first increment and the remaining 90 percent is deployed 5 minutes later.

To create additional deployment stages, you need a stage name, a CodeDeploy deployment configuration and an environment where it should deploy the service. As mentioned, the pipeline is, by default, a self-mutating pipeline. For example, to add a Prod stage, update the code that creates the Toolchain object and submit this change to the code repository. The pipeline will run and update itself adding a Prod stage after the UAT stage. Next, I show in detail the statement used to add a new Prod stage. The new stage deploys to the same account and Region as in the UAT environment:

... 
        .addStage(
                "Prod",
                EcsDeploymentConfig.CANARY_10_PERCENT_5_MINUTES,
                Environment.builder()
                        .account(Demo.SERVICE_ACCOUNT)
                        .region(Demo.SERVICE_REGION)
                        .build())                                                                                                                                      
        .build();

In the statement above, the Prod stage will deploy new versions of the service using a CodeDeploy deployment configuration CANARY_10_PERCENT_5_MINUTES. It means that 10 percent of traffic is shifted in the first increment of 5 minutes. Then, it shifts the rest of the traffic to the new version of the application. Please refer to Organizing Your AWS Environment Using Multiple Accounts whitepaper for best-practices on how to isolate and manage your business applications.

Some customers might find this approach interesting and decide to provide this as an abstraction to their application development teams. In this case, I advise creating a construct that builds such a pipeline. Using a construct would allow for further customization. Examples are stages that promote quality assurance or deploy the service in a disaster recovery scenario.

The implementation creates a stack for the toolchain and another stack for each deployment stage. As an example, consider a toolchain created with a single deployment stage named UAT. After running successfully, the DemoToolchain and DemoService-UAT stacks should be created as in the next image:

Two stacks are needed to create a Pipeline that deploys to a single environment. One stack deploys the Toolchain with the Pipeline and another stack deploys the Service compute infrastructure and CodeDeploy Application and DeploymentGroup. In this example, for an application named Demo that deploys to an environment named UAT, the stacks deployed are: DemoToolchain and DemoService-UAT

CodeDeploy Application and Deployment Group

CodeDeploy configuration requires an application and a deployment group. Depending on the use case, you need to create these in the same or in a different account from the toolchain (pipeline). The pipeline includes the CodeDeploy deployment action that performs the blue/green deployment. My recommendation is to create the CodeDeploy application and deployment group as part of the Service stack. This approach allows to align the lifecycle of CodeDeploy application and deployment group with the related Service stack instance.

CodePipeline allows to create a CodeDeploy deployment action that references a non-existing CodeDeploy application and deployment group. This allows us to implement the following approach:

  • Toolchain stack deploys the pipeline with CodeDeploy deployment action referencing a non-existing CodeDeploy application and deployment group
  • When the pipeline executes, it first deploys the Service stack that creates the related CodeDeploy application and deployment group
  • The next pipeline action executes the CodeDeploy deployment action. When the pipeline executes the CodeDeploy deployment action, the related CodeDeploy application and deployment will already exist.

Below is the pipeline code that references the (initially non-existing) CodeDeploy application and deployment group.

private IEcsDeploymentGroup referenceCodeDeployDeploymentGroup(
        final Environment env, 
        final String serviceName, 
        final IEcsDeploymentConfig ecsDeploymentConfig, 
        final String stageName) {

    IEcsApplication codeDeployApp = EcsApplication.fromEcsApplicationArn(
            this,
            Constants.APP_NAME + "EcsCodeDeployApp-"+stageName,
            Arn.format(ArnComponents.builder()
                    .arnFormat(ArnFormat.COLON_RESOURCE_NAME)
                    .partition("aws")
                    .region(env.getRegion())
                    .service("codedeploy")
                    .account(env.getAccount())
                    .resource("application")
                    .resourceName(serviceName)
                    .build()));

    IEcsDeploymentGroup deploymentGroup = EcsDeploymentGroup.fromEcsDeploymentGroupAttributes(
            this,
            Constants.APP_NAME + "-EcsCodeDeployDG-"+stageName,
            EcsDeploymentGroupAttributes.builder()
                    .deploymentGroupName(serviceName)
                    .application(codeDeployApp)
                    .deploymentConfig(ecsDeploymentConfig)
                    .build());

    return deploymentGroup;
}

To make this work, you should use the same application name and deployment group name values when creating the CodeDeploy deployment action in the pipeline and when creating the CodeDeploy application and deployment group in the Service stack (where the Amazon ECS infrastructure is deployed). This approach is necessary to avoid a circular dependency error when trying to create the CodeDeploy application and deployment group inside the Service stack and reference these objects to configure the CodeDeploy deployment action inside the pipeline. Below is the code that uses Service stack construct ID to name the CodeDeploy application and deployment group. I set the Service stack construct ID to the same name I used when creating the CodeDeploy deployment action in the pipeline.

   // configure AWS CodeDeploy Application and DeploymentGroup
   EcsApplication app = EcsApplication.Builder.create(this, "BlueGreenApplication")
           .applicationName(id)
           .build();

   EcsDeploymentGroup.Builder.create(this, "BlueGreenDeploymentGroup")
           .deploymentGroupName(id)
           .application(app)
           .service(albService.getService())
           .role(createCodeDeployExecutionRole(id))
           .blueGreenDeploymentConfig(EcsBlueGreenDeploymentConfig.builder()
                   .blueTargetGroup(albService.getTargetGroup())
                   .greenTargetGroup(tgGreen)
                   .listener(albService.getListener())
                   .testListener(listenerGreen)
                   .terminationWaitTime(Duration.minutes(15))
                   .build())
           .deploymentConfig(deploymentConfig)
           .build();

CDK Pipelines roles and permissions

CDK Pipelines creates roles and permissions the pipeline uses to execute deployments in different scenarios of regions and accounts. When using CodeDeploy in cross-account scenarios, CDK Pipelines deploys a cross-account support stack that creates a pipeline action role for the CodeDeploy action. This cross-account support stack is defined in a JSON file that needs to be published to the AWS CDK assets bucket in the target account. If the pipeline has the self-mutation feature on (default), the UpdatePipeline stage will do a cdk deploy to deploy changes to the pipeline. In cross-account scenarios, this deployment also involves deploying/updating the cross-account support stack. For this, the SelfMutate action in UpdatePipeline stage needs to assume CDK file-publishing and a deploy roles in the remote account.

The IAM role associated with the AWS CodeBuild project that runs the UpdatePipeline stage does not have these permissions by default. CDK Pipelines cannot grant these permissions automatically, because the information about the permissions that the cross-account stack needs is only available after the AWS CDK app finishes synthesizing. At that point, the permissions that the pipeline has are already locked-in­­. Hence, for cross-account scenarios, the toolchain should extend the permissions of the pipeline’s UpdatePipeline stage to include the file-publishing and deploy roles.

In cross-account environments it is possible to manually add these permissions to the UpdatePipeline stage. To accomplish that, the Toolchain stack may be used to hide this sort of implementation detail. In the end, a method like the one below can be used to add these missing permissions. For each different mapping of stage and environment in the pipeline it validates if the target account is different than the account where the pipeline is deployed. When the criteria is met, it should grant permission to the UpdatePipeline stage to assume CDK bootstrap roles (tagged using key aws-cdk:bootstrap-role) in the target account (with the tag value as file-publishing or deploy). The example below shows how to add permissions to the UpdatePipeline stage:

private void grantUpdatePipelineCrossAccoutPermissions(Map<String, Environment> stageNameEnvironment) {

    if (!stageNameEnvironment.isEmpty()) {

        this.pipeline.buildPipeline();
        for (String stage : stageNameEnvironment.keySet()) {

            HashMap<String, String[]> condition = new HashMap<>();
            condition.put(
                    "iam:ResourceTag/aws-cdk:bootstrap-role",
                    new String[] {"file-publishing", "deploy"});
            pipeline.getSelfMutationProject()
                    .getRole()
                    .addToPrincipalPolicy(PolicyStatement.Builder.create()
                            .actions(Arrays.asList("sts:AssumeRole"))
                            .effect(Effect.ALLOW)
                            .resources(Arrays.asList("arn:*:iam::"
                                    + stageNameEnvironment.get(stage).getAccount() + ":role/*"))
                            .conditions(new HashMap<String, Object>() {{
                                    put("ForAnyValue:StringEquals", condition);
                            }})
                            .build());
        }
    }
}

The Deployment Stage

Let’s consider a pipeline that has a single deployment stage, UAT. The UAT stage deploys a DemoService. For that, it requires four actions: DemoService-UAT (Prepare and Deploy), ConfigureBlueGreenDeploy and Deploy.

When using CodeDeploy the deployment stage is expected to have four actions: two actions to create CloudFormation change set and deploy the ECS or compute infrastructure, an action to configure CodeDeploy and the last action that deploys the application using CodeDeploy. In the diagram, these are (in the diagram in the respective order): DemoService-UAT.Prepare and DemoService-UAT.Deploy, ConfigureBlueGreenDeploy and Deploy.

The
DemoService-UAT.Deploy action will create the ECS resources and the CodeDeploy application and deployment group. The
ConfigureBlueGreenDeploy action will read the AWS CDK
cloud assembly. It uses the configuration files to identify the Amazon Elastic Container Registry (Amazon ECR) repository and the container image tag pushed. The pipeline will send this information to the
Deploy action.  The
Deploy action starts the deployment using CodeDeploy.

Solution Overview

As a convenience, I created an application, written in Java, that solves all these challenges and can be used as an example. The application deployment follows the same 5 steps for all deployment scenarios of account and Region, and this includes the scenarios represented in the following design:

A pipeline created by a Toolchain should be able to deploy to any combination of accounts and regions. This includes four scenarios: single-account and single-Region, single-account and cross-Region, cross-account and single-Region and cross-account and cross-Region

Conclusion

In this post, I identified, explained and solved challenges associated with the creation of a pipeline that deploys a service to Amazon ECS using CodeDeploy in different combinations of accounts and regions. I also introduced a demo application that implements these recommendations. The sample code can be extended to implement more elaborate scenarios. These scenarios might include automated testing, automated deployment rollbacks, or disaster recovery. I wish you success in your transformative journey.

Luiz Decaro

Luiz is a Principal Solutions architect at Amazon Web Services (AWS). He focuses on helping customers from the Financial Services Industry succeed in the cloud. Luiz holds a master’s in software engineering and he triggered his first continuous deployment pipeline in 2005.

Delegating permission set management and account assignment in AWS IAM Identity Center

Post Syndicated from Jake Barker original https://aws.amazon.com/blogs/security/delegating-permission-set-management-and-account-assignment-in-aws-iam-identity-center/

In this blog post, we look at how you can use AWS IAM Identity Center (successor to AWS Single Sign-On) to delegate the management of permission sets and account assignments. Delegating the day-to-day administration of user identities and entitlements allows teams to move faster and reduces the burden on your central identity administrators.

IAM Identity Center helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. Identity Center requires accounts to be managed by AWS Organizations. Administration of Identity Center can be delegated to a member account (an account other than the management account). We recommend that you delegate Identity Center administration to limit who has access to the management account and use the management account only for tasks that require the management account.

Delegated administration is different from the delegation of permission sets and account assignments, which this blog covers. For more information on delegated administration, see Getting started with AWS IAM Identity Center delegated administration. The patterns in this blog post work whether Identity Center is delegated to a member account or remains in the management account.

Permission sets are used to define the level of access that users or groups have to an AWS account. Permission sets can contain AWS managed policies, customer managed policies, inline policies, and permissions boundaries.

Solution overview

As your organization grows, you might want to start delegating permissions management and account assignment to give your teams more autonomy and reduce the burden on your identity team. Alternatively, you might have different business units or customers, operating out of their own organizational units (OUs), that want more control over their own identity management.

In this scenario, an example organization has three developer teams: Red, Blue, and Yellow. Each of the teams operate out of its own OU. IAM Identity Center has been delegated from the management account to the Identity Account. Figure 1 shows the structure of the example organization.

Figure 1: The structure of the organization in the example scenario

Figure 1: The structure of the organization in the example scenario

The organization in this scenario has an existing collection of permission sets. They want to delegate the management of permission sets and account assignments away from their central identity management team.

  • The Red team wants to be able to assign the existing permission sets to accounts in its OU. This is an accounts-based model.
  • The Blue team wants to edit and use a single permission set and then assign that set to the team’s single account. This is a permission-based model.
  • The Yellow team wants to create, edit, and use a permission set tagged with Team: Yellow and then assign that set to all of the accounts in its OU. This is a tag-based model.

We’ll look at the permission sets needed for these three use cases.

Note: If you’re using the AWS Management Console, additional permissions are required.

Use case 1: Accounts-based model

In this use case, the Red team is given permission to assign existing permission sets to the three accounts in its OU. This will also include permissions to remove account assignments.

Using this model, an organization can create generic permission sets that can be assigned to its AWS accounts. It helps reduce complexity for the delegated administrators and verifies that they are using permission sets that follow the organization’s best practices. These permission sets restrict access based on services and features within those services, rather than specific resources.

{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "sso:CreateAccountAssignment",
            "sso:DeleteAccountAssignment",
            "sso:ProvisionPermissionSet"
        ],
        "Resource": [
            "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
            "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "arn:aws:sso:::account/112233445566",
            "arn:aws:sso:::account/223344556677",
            "arn:aws:sso:::account/334455667788"

        ]
    }
  ]
}

In the preceding policy, the principal can assign existing permission sets to the three AWS accounts with the IDs 112233445566, 223344556677 and 334455667788. This includes administration permission sets, so carefully consider which accounts you allow the permission sets to be assigned to.

The arn:aws:sso:::instance/ssoins-<sso-ins-id> is the IAM Identity Center instance ID ARN. It can be found using either the AWS Command Line Interface (AWS CLI) v2 with the list-instances API or the AWS Management Console.

Use the AWS CLI

Use the AWS Command Line Interface (AWS CLI) to run the following command:

aws sso-admin list-instances

You can also use AWS CloudShell to run the command.

Use the AWS Management Console

Use the Management Console to navigate to the IAM Identity Center in your AWS Region and then select Choose your identity source on the dashboard.

Figure 2: The IAM Identity Center instance ID ARN in the console

Figure 2: The IAM Identity Center instance ID ARN in the console

Use case 2: Permission-based model

For this example, the Blue team is given permission to edit one or more specific permission sets and then assign those permission sets to a single account. The following permissions allow the team to use managed and inline policies.

This model allows the delegated administrator to use fine-grained permissions on a specific AWS account. It’s useful when the team wants total control over the permissions in its AWS account, including the ability to create additional roles with administrative permissions. In these cases, the permissions are often better managed by the team that operates the account because it has a better understanding of the services and workloads.

Granting complete control over permissions can lead to unintended or undesired outcomes. Permission sets are still subject to IAM evaluation and authorization, which means that service control policies (SCPs) can be used to deny specific actions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
            "sso:AttachManagedPolicyToPermissionSet",
            "sso:CreateAccountAssignment",
            "sso:DeleteAccountAssignment",
            "sso:DeleteInlinePolicyFromPermissionSet",
            "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet",
            "sso:DetachManagedPolicyFromPermissionSet",
            "sso:ProvisionPermissionSet",
            "sso:PutInlinePolicyToPermissionSet",
            "sso:UpdatePermissionSet"
        ],
        "Resource": [
            "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
            "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788",
            "arn:aws:sso:::account/445566778899"
        ]
    }
  ]
}

Here, the principal can edit the permission set arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788 and assign it to the AWS account 445566778899. The editing rights include customer managed policies, AWS managed policies, and inline policies.

If you want to use the preceding policy, replace the missing and example resource values with your own IAM Identity Center instance ID and account numbers.

In the preceding policy, the arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/ps-1122334455667788 is the permission set ARN. You can find this ARN through the console, or by using the AWS CLI command to list all of the permission sets:

aws sso-admin list-permission-sets --instance-arn <instance arn from above>

This permission set can also be applied to multiple accounts—similar to the first use case—by adding additional account IDs to the list of resources. Likewise, additional permission sets can be added so that the user can edit multiple permission sets and assign them to a set of accounts.

Use case 3: Tag-based model

For this example, the Yellow team is given permission to create, edit, and use permission sets tagged with Team: Yellow. Then they can assign those tagged permission sets to all of their accounts.

This example can be used by an organization to allow a team to freely create and edit permission sets and then assign them to the team’s accounts. It uses tagging as a mechanism to control which permission sets can be created and edited. Permission sets without the correct tag cannot be altered.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreatePermissionSet",
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:DeletePermissionSet",
                "sso:DescribePermissionSetProvisioningStatus",
                "sso:DescribeAccountAssignmentCreationStatus",
                "sso:DescribeAccountAssignmentDeletionStatus",
                "sso:TagResource"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:DescribePermissionSet",
                "sso:UpdatePermissionSet",
                "sso:DeletePermissionSet"
                	
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreatePermissionSet"
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": "sso:TagResource",
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow",
                    "aws:RequestTag/Team": "Yellow"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "sso:CreateAccountAssignment",
                "sso:DeleteAccountAssignment",
                "sso:ProvisionPermissionSet"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>",
                "arn:aws:sso:::account/556677889900",
                "arn:aws:sso:::account/667788990011",
                "arn:aws:sso:::account/778899001122"
            ]
        },
        {
            "Sid": "InlinePolicy",
            "Effect": "Allow",
            "Action": [
                "sso:GetInlinePolicyForPermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:DeleteInlinePolicyFromPermissionSet"
            ],
            "Resource": [
                "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
            ]
        },
        {
            "Sid": "InlinePolicyABAC",
            "Effect": "Allow",
            "Action": [
                "sso:GetInlinePolicyForPermissionSet",
                "sso:PutInlinePolicyToPermissionSet",
                "sso:DeleteInlinePolicyFromPermissionSet"
            ],
            "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceTag/Team": "Yellow"
                }
            }
        }
    ]
}

In the preceding policy, the principal is allowed to create new permission sets only with the tag Team: Yellow, and assign only permission sets tagged with Team: Yellow to the AWS accounts with ID 556677889900, 667788990011, and 778899001122.

The principal can only edit the inline policies of the permission sets tagged with Team: Yellow and cannot change the tags of the permission sets that are already tagged for another team.

If you want to use this policy, replace the missing and example resource values with your own IAM Identity Center instance ID, tags, and account numbers.

Note: The policy above assumes that there are no additional statements applying to the principal. If you require additional allow statements, verify that the resulting policy doesn’t create a risk of privilege escalation. You can review Controlling access to AWS resources using tags for additional information.

This policy only allows the delegation of permission sets using inline policies. Customer managed policies are IAM policies that are deployed to and are unique to each AWS account. When you create a permission set with a customer managed policy, you must create an IAM policy with the same name and path in each AWS account where IAM Identity Center assigns the permission set. If the IAM policy doesn’t exist, Identity Center won’t make the account assignment. For more information on how to use customer managed policies with Identity Center, see How to use customer managed policies in AWS IAM Identity Center for advanced use cases.

You can extend the policy to allow the delegation of customer managed policies with these two statements:

{
    "Sid": "CustomerManagedPolicy",
    "Effect": "Allow",
    "Action": [
        "sso:ListCustomerManagedPolicyReferencesInPermissionSet",
        "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
        "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet"
    ],
    "Resource": [
        "arn:aws:sso:::instance/ssoins-<sso-ins-id>"
    ]
},
{
    "Sid": "CustomerManagedABAC",
    "Effect": "Allow",
    "Action": [
        "sso:ListCustomerManagedPolicyReferencesInPermissionSet",
        "sso:AttachCustomerManagedPolicyReferenceToPermissionSet",
        "sso:DetachCustomerManagedPolicyReferenceFromPermissionSet"
    ],
    "Resource": "arn:aws:sso:::permissionSet/ssoins-<sso-ins-id>/*",
    "Condition": {
        "StringEquals": {
            "aws:ResourceTag/Team": "Yellow"
        }
    }
}

Note: Both statements are required, as only the resource type PermissionSet supports the condition key aws:ResourceTag/${TagKey}, and the actions listed require access to both the Instance and PermissionSet resource type. See Actions, resources, and condition keys for AWS IAM Identity Center for more information.

Best practices

Here are some best practices to consider when delegating management of permission sets and account assignments:

  • Assign permissions to edit specific permission sets. Allowing roles to edit every permission set could allow that role to edit their own permission set.
  • Only allow administrators to manage groups. Users with rights to edit group membership could add themselves to any group, including a group reserved for organization administrators.

If you’re using IAM Identity Center in a delegated account, you should also be aware of the best practices for delegated administration.

Summary

Organizations can empower teams by delegating the management of permission sets and account assignments in IAM Identity Center. Delegating these actions can allow teams to move faster and reduce the burden on the central identity management team.

The scenario and examples share delegation concepts that can be combined and scaled up within your organization. If you have feedback about this blog post, submit comments in the Comments section. If you have questions, start a new thread on AWS Re:Post with the Identity Center tag.

Want more AWS Security news? Follow us on Twitter.

Jake Barker

Jake Barker

Jake is a Senior Security Consultant with AWS Professional Services. He loves making security accessible to customers and eating great pizza.

Roberto Migli

Roberto Migli

Roberto is a Principal Solutions Architect at AWS. Roberto supports global financial services customers, focusing on security and identity and access management. In his free time he enjoys building electronics gadgets and spending time with his family.

Enhancing Resource Isolation in AWS CDK with the App Staging Synthesizer

Post Syndicated from Jehu Gray original https://aws.amazon.com/blogs/devops/enhancing-resource-isolation-in-aws-cdk-with-the-app-staging-synthesizer/

AWS Cloud Development Kit (CDK) has become a powerful tool for defining and provisioning AWS cloud resources. While CDK simplifies the process of infrastructure as code, managing resources across different projects and environments can still present challenges. In this blog post, we’ll explore a new experimental library, the App Staging Synthesizer, that enhances resource isolation and provides finer control over staging resources in CDK applications.

Background: The CDK Bootstrapping Model

Let’s consider a scenario where a company has two projects in the same account, Project A and Project B. Both projects are developed using the AWS CDK and deploy various AWS resources. However, the company wants to ensure that resources used in Project A are not discoverable or accessible to Project B. Prior to the introduction of the App Staging Synthesizer library in CDK, the default bootstrapping process created shared staging resources, such as a single Amazon S3 bucket and Amazon ECR repository, which are used by all CDK applications deployed in the CDK environment. In AWS CDK, a combination of region and an account is considered to be an environment. The traditional CDK bootstrapping method offers simplicity and consistency by providing a standardized set of shared staging resources for all CDK applications in an environment, which can be cost-effective for multiple applications. This shared model makes it challenging to control access and visibility between the projects in the same account, particularly in scenarios where resource isolation is crucial between different projects. In such scenarios, AWS recommends a best practice of separating projects that need critical isolation into different AWS accounts. However, it is recognized that there might be organizational or practical reasons preventing the immediate adoption of this recommendation. In such cases, mechanisms like the App Staging Synthesizer can provide a valuable workaround.

Introducing the App Staging Synthesizer:

Today, a growing trend among customers is the consolidation of their cloud accounts driven by the desire to optimize costs, bolster security and enhance compliance control. However, while consolidation offers several advantages, it can sometimes limit the flexibility to align ownership and decision making with individual accounts. This can lead to dependencies and conflicts in how workloads across accounts are secured and managed. The App Staging Synthesizer which is an experimental library designed to provide a more flexible approach to resource management and staging in CDK applications was designed to address these challenges. The AppStagingSynthesizer enhances resource isolation and cleanup control by creating separate staging resources for each application, reducing the risk of conflicts between resources and providing more granular management. It also enables better asset lifecycle management and customization of roles and resource handling, offering CDK developers a flexible and organized approach to resource deployment. Let’s delve into some of the advantages and key features of this library.

Advantages and Outcomes:

  1. Isolation and Access Control: The resources created for Project A are now completely isolated from Project B. Project B doesn’t have visibility or access to the staging resources of Project A, and vice versa. This ensures a higher level of data and resource security.
  2. Easier Resource Cleanup: When cleaning up or deleting resources, the Staging Stack specific to each project can be removed independently. This allows for a more streamlined and controlled cleanup process, mitigating the risk of inadvertently affecting other projects.
  3. Lifecycle Management: With separate ECR repositories for each CDK application, the company can apply lifecycle rules independently for retention and cost management. For example, they can configure each ECR repository to retain only the most recent 5 images, effectively cutting down on storage costs.
  4. Reduced Bootstrapping Complexity: As the only shared resources required are global Roles, the company now only needs to bootstrap every account in one Region instead of bootstrapping every Region. This simplifies the bootstrapping process, making it easier to manage with CloudFormation StackSets.

Key Features of the App Staging Synthesizer:

  • IStagingResources Interface: The App Staging Synthesizer introduces the IStagingResources interface, offering a framework to manage app-level bootstrap stacks. These stacks handle file assets and Docker assets for CDK applications.
  • DefaultStagingStack: Included in the library, the DefaultStagingStack is a pre-built implementation of the IStagingResources It comes with default configurations for staging resources, making it easier to get started.
  • AppStagingSynthesizer: This is a new CDK synthesizer that orchestrates the creation of staging resources for each CDK application. It seamlessly integrates with the application deployment process.
  • Deployment Roles: In addition to creating staging resources, the CDK App Staging Synthesizer also manages deployment roles. These roles are crucial for secure and controlled resource deployment, ensuring that only authorized processes can modify or access the resources.

 Implementation:

Let’s explore practical examples of using the App Staging Synthesizer within a CDK application.

Prerequisite:

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Install AWS CDK version 2.73.0 or later
  • A basic understanding of CDK. Please go through cdkworkshop.com to get hands-on learning about CDK and related concepts.
  • NOTE: To utilize the AppStagingSynthesizer, you should have an existing CDK application or should be working on a CDK application.

Using Default Staging Resources:

When configuring your CDK application to use deployment identities with the old bootstrap stack, it’s important to note that the existing staging resources, including the global S3 bucket and ECR repository, will still be created as part of the bootstrapping process. However, they will remain unused by this specific application, thanks to the App Staging Synthesizer.
While we won’t delve into the removal of these unused resources in this blogpost, it’s worth mentioning that for a more streamlined resource setup, you have the option to customize the bootstrap template to remove these resources if desired. This can help reduce clutter and ensure that only the necessary resources are retained within your CDK environment.

To get started, update your CDK App with the following code snippet:

const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.defaultResources({
appId: 'my-app-id',
// The following line is optional. By default, it is assumed you have bootstrapped in the same region(s) as the stack(s) you are deploying.
deploymentIdentities: DeploymentIdentities.defaultBootstrapRoles({ bootstrapRegion: 'us-east-1' }),
}),
});

This code snippet creates a DefaultStagingStack for a CDK App, allowing you to manage staging resources more effectively.

Customizing Roles:

You can customize roles for the synthesizer, which can be useful for several reasons such as:

  • Reuse of existing roles: In many AWS environments, organizations have existing IAM roles with specific permissions and policies that are aligned with their security and compliance requirements. Rather than creating new roles from scratch, you might want to leverage these existing roles to maintain consistency and adhere to established security practices.
  • Compatibility: In scenarios where you have pre-existing IAM roles that are being used across various AWS services or applications, customizing roles within the CDK App Staging Synthesizer allows you to seamlessly integrate CDK deployments into your existing IAM role management strategy.

Overall, customizing roles provides flexibility and control over resources used during CDK application deployments, enabling you to align CDK-based infrastructure with the organization’s policies. An example is:

const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.defaultResources({
appId: 'my-app-id',
deploymentIdentities: DeploymentIdentities.specifyRoles({
cloudFormationExecutionRole: BootstrapRole.fromRoleArn('arn:aws:iam::123456789012:role/Execute'),
deploymentRole: BootstrapRole.fromRoleArn('arn:aws:iam::123456789012:role/Deploy'),
}),
}),
});

This code snippet illustrates how you can specify custom roles for different stages of the deployment process.

Deploy Time S3 Assets:

Deploy-time S3 assets can be classified into two categories, each serving a distinct purpose:

  • Assets Used Only During Deployment: These assets are instrumental in handing off substantial data to other services for private copying during deployment. They play a vital role during initial deployment, and afterwards are retained solely for potential future rollbacks
  • Assets Accessed Throughout Application Lifespan: In contrast, some assets are accessed continuously throughout the runtime of your application. These could include script files utilized in CodeBuild projects, startup scripts for EC2 instances, or, in the case of CDK applications, ECR images that persist throughout the application’s life.

Marking Lambda Assets as Deploy-Time:

By default, Lambda assets are marked as deploy-time assets in the CDK App Staging Synthesizer. This means they fall into the first category mentioned above, serving as essential components during deployment. For instance, consider the following code snippet:

declare const stack: Stack;
new lambda.Function(stack, 'lambda', {
code: lambda.AssetCode.fromAsset(path.join(__dirname, 'assets')), // Lambda code bundle marked as deploy-time
handler: 'index.handler',
runtime: lambda.Runtime.PYTHON_3_9,
});

In this example, the Lambda code bundle is automatically identified as a deploy-time asset. This distinction ensures that it’s cleaned up after the configurable rollback window.

Creating Custom Deploy-Time Assets:

CDK offers the flexibility needed to create custom deploy-time assets. This can be achieved by utilizing the Asset construct from the AWS CDK library:

import { Asset } from 'aws-cdk-lib/aws-s3-assets';
declare const stack: Stack;
const asset = new Asset(stack, 'deploy-time-asset', {
deployTime: true, // Marking the asset as deploy-time
path: path.join(__dirname, './deploy-time-asset'),
});

By setting deployTime to true, the asset is explicitly marked as deploy-time. This allows you to maintain control over the lifecycle of these assets, ensuring they are retained for as long as needed. However, it is important to note that deploy-time assets eventually become eligible for cleanup.

Configuring Asset Lifecycles:
By default, the CDK retains deploy-time assets for a period of 30 days. However, there is flexibility to adjust this duration according to custom requirements. This can be achieved by specifying deployTimeFileAssetLifetime. The value set here determines how long you can roll back to a previous application version without the need for rebuilding and republishing assets:

const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.defaultResources({
appId: 'my-app-id',
deployTimeFileAssetLifetime: Duration.days(100), // Adjusting the asset retention period to 100 days
}),
});

By fine-tuning the lifecycle of deploy-time S3 assets, you gain more control over CDK deployments and ensure that CDK applications are equipped to handle rollbacks and updates with ease.

Optimizing ECR Repository Management with Lifecycle Rules:

The AWS CDK App Staging Synthesizer provides you with the capability to control the lifecycle of container images by leveraging lifecycle rules within ECR repositories. Let’s explore how this feature can help streamline your CDK workflows.

ECR repositories can accumulate numerous versions of Docker images over time. While retaining some historical versions is essential for rollback scenarios and reference, an unregulated growth of image versions can lead to increased storage costs and management complexity.

The AWS CDK App Staging Synthesizer offers a default configuration that stores a maximum of 3 revisions for a given Docker image asset. This ensures that you maintain access to previous image versions, facilitating seamless rollback operations. When more than 3 revisions of an asset exist in the ECR repository, the oldest one is purged.

Although by default, it’s set to 3, you can also adjust this value using the imageAssetVersionCount property:

const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.defaultResources({
appId: 'my-app-id',
imageAssetVersionCount: 10, // Customizing the image version count to retain up to 10 revisions
}),
});

By increasing or decreasing the imageAssetVersionCount, you can strike a balance between storage efficiency and the need to access historical image versions. This ensures that ECR repositories are optimized to the CDK application’s requirements.

Streamlining Cleanup: Auto Delete Staging Assets on Stack Deletion

Efficiently managing resources throughout the lifecycle of your CDK applications is essential, and this includes handling the cleanup of staging assets when stacks are deleted. The AWS CDK App Staging Synthesizer simplifies this process by providing an auto-delete feature for staging resources. In this section, we’ll explore how this feature works and how you can customize it according to your needs.

The Default Cleanup Behavior:
By default, the AWS CDK App Staging Synthesizer is designed to facilitate the cleanup of staging resources automatically when a stack is deleted. This means that associated resources, such as S3 buckets and ECR repositories, are configured with a RemovalPolicy.DESTROY and have autoDeleteObjects (for S3 buckets) or autoDeleteImages (for ECR repositories) turned on. Under the hood, custom resources are created to ensure a seamless cleanup process.

Customizing Cleanup Behavior:
While automatic cleanup is convenient for many scenarios, there may be situations where you want to retain staging resources even after stack deletion. This can be useful when you intend to reuse these resources or when you have specific cleanup processes outside of the default behavior. To retain staging assets and disable the auto-delete feature, you can specify autoDeleteStagingAssets: as false when configuring the AWS CDK App Staging Synthesizer:

const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.defaultResources({
appId: 'my-app-id',
autoDeleteStagingAssets: false, // Disabling auto-delete of staging assets
}),
});

By setting autoDeleteStagingAssets to false, you have full control over the cleanup of staging resources. This allows you to retain and manage these resources independently, giving you the flexibility to align CDK workflows with the organization’s specific practices.

Using an Existing Staging Stack:

While the AWS CDK App Staging Synthesizer offers powerful tools for managing staging resources, there may be scenarios where you already have a meticulously crafted staging stack in place. In such cases, you can seamlessly integrate the existing stack with the AppStagingSynthesizer using the customResources() method. Let’s explore how you can make the most of your pre-existing staging infrastructure.

The process is straightforward—supply your existing staging stack as a resource to the AppStagingSynthesizer using the customResources() method. It’s crucial to ensure that the custom stack adheres to the requirements of the IStagingResources interface for smooth integration.

Here’s an example:

// Create a new CDK App
const resourceApp = new App();

//Instantiate your custom staging stack (make sure it implements IstagingResources)
const resources = new CustomStagingStack(resourceApp, 'CustomStagingStack', {});

//Configure your CDK App to use the App Staging Synthesizer with your custom staging stack
const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.customResources({
resources,
}),
});

In this example, CustomStagingStack represents the pre-existing staging infrastructure. By providing it as a resource to the App Staging Synthesizer, you seamlessly integrate it into the CDK application’s deployment workflow.

Crafting Custom Staging Stacks for Environment Control:

For those seeking precise control over resource management in different environments, the AWS CDK App Staging Synthesizer offers a robust solution – custom staging stacks. This feature allows you to tailor resource configurations, permissions, and behaviors to meet the unique demands of each environment within the CDK application.

Subclassing DefaultStagingStack for a Quick Start:

If your customization requirements align with the available properties, you can start by subclassing DefaultStagingStack. This streamlined approach lets you inherit existing functionalities while tweaking specific behaviors as needed. Here’s how you can dive right in:

//Define custom staging stack
interface CustomStagingStackOptions extends DefaultStagingStackOptions {}

//Subclass DefaultStagingStack to create the custom stgaing stack
class CustomStagingStack extends DefaultStagingStack {
// Implement customizations here
}

Building Staging Resources from Scratch:

For more granular control, consider building the staging resources entirely from scratch. This approach allows you to define every aspect of the staging stack, from the ground up, by implementing the “IStagingResources” interface. Here’s an example:

// Define custom staging stack properties(if needed)
interface CustomStagingStackProps extends StackProps {}

//Create your custom staging stack that implements IStagingResources
class CustomStagingStack extends Stack implements IStagingResources {
constructor(scope: Construct, id: string, props: CustomStagingStackProps) {
super(scope, id, props);
}

// Implement methods to define your custom staging resources
public addFile(asset: FileAssetSource): FileStagingLocation {
return {
bucketName: 'myBucket',
assumeRoleArn: 'myArn',
dependencyStack: this,
};
}
public addDockerImage(asset: DockerImageAssetSource): ImageStagingLocation {
return {
repoName: 'myRepo',
assumeRoleArn: 'myArn',
dependencyStack: this,
};
}
}

Creating Custom Staging Resources:

Implementing custom staging resources also involves crafting a CustomFactory class to facilitate the creation of these resources in every environment where your CDK App is deployed. This approach offers a high level of customization while ensuring consistency across deployments. Here’s how it works:

// Define a custom factory for your staging resources
class CustomFactory implements IStagingResourcesFactory {
public obtainStagingResources(stack: Stack, context: ObtainStagingResourcesContext) {
const myApp = App.of(stack);

// Create a custom staging stack instance for the current environment
return new CustomStagingStack(myApp!, `CustomStagingStack-${context.environmentString}`, {});
}
}

//Incorporate your custom staging resources into the Application using the customer factory
const app = new App({
defaultStackSynthesizer: AppStagingSynthesizer.customFactory({
factory: new CustomFactory(),
oncePerEnv: true, // by default
}),
});

With this setup, you can create custom staging stacks for each environment, ensuring resource management tailored to your specific needs. Whether you choose to subclass DefaultStagingStack for a quick start or build resources from scratch, custom staging stacks empower you to achieve fine-grained control and consistency across CDK deployments.

Conclusion:

The App Staging Synthesizer introduces a powerful approach to managing staging resources in AWS CDK applications. With enhanced resource isolation and lifecycle control, it addresses the limitations of the default bootstrapping model. By integrating the App Staging Synthesizer into CDK applications, you can achieve better resource management, cleaner cleanup processes, and more control over cloud infrastructure.
Explore this experimental library and unleash the potential of fine-tuned resource management in CDK projects.

For more information and code examples, refer to the official documentation provided by AWS.

About the Authors:

Jehu Gray

Jehu Gray is an Enterprise Solutions Architect at Amazon Web Services where he helps customers design solutions that fits their needs. He enjoys exploring what’s possible with IaC.

Abiola Olanrewaju

Abiola Olanrewaju is an Enterprise Solutions Architect at Amazon Web Services where he helps customers design and implement scalable solutions that drive business outcomes. He has a keen interest in Data Analytics, Security and Automation.

Define per-team resource limits for big data workloads using Amazon EMR Serverless

Post Syndicated from Gaurav Sharma original https://aws.amazon.com/blogs/big-data/define-per-team-resource-limits-for-big-data-workloads-using-amazon-emr-serverless/

Customers face a challenge when distributing cloud resources between different teams running workloads such as development, testing, or production. The resource distribution challenge also occurs when you have different line-of-business users. The objective is not only to ensure sufficient resources be consistently available to production workloads and critical teams, but also to prevent adhoc jobs from using all the resources and delaying other critical workloads due to mis-configured or non-optimized code. Cost controls and usage tracking across these teams is also a critical factor.

In the legacy big data and Hadoop clusters as well as Amazon EMR provisioned clusters, this problem was overcome by Yarn resource management and defining what were called Yarn queues for different workloads or teams. Another approach was to allocate independent clusters for different teams or different workloads.

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it straightforward to run your big data workloads using open-source analytics frameworks such as Apache Spark and Hive without the need to configure, manage, or scale the clusters. With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run your workloads. You continue to get the benefits of Amazon EMR, such as open-source compatibility, concurrency, and optimized runtime performance for popular bigdata frameworks. EMR Serverless provides shorter job startup latency, automatic resource management and effective cost controls.

In this post, we show how to define per-team resource limits for big data workloads using EMR serverless.

Solution overview

EMR Serverless comes with a concept called an  EMR Serverless application, which is an isolated environment with the option to choose one of the open source analytics applications(Spark, Hive) to submit your workloads. You can include your own custom libraries, specify your EMR release version, and most importantly define the resource limits for the compute and memory resources. For instance, if your production Spark jobs run on Amazon EMR 6.9.0 and you need to test the same workload on Amazon EMR 6.10.0, you could use EMR Serverless to define EMR 6.10.0 as your version and test your workload using a predefined limit on resources.

The following diagram illustrates our solution architecture. We see that two different teams namely Prod team and Dev team are submitting their jobs independently to two different EMR Applications (namely ProdApp and DevApp respectively ) having dedicated resources.

EMR Serverless provides controls at the account, application and job level to limit the use of resources such as CPU, memory or disk. In the following sections, we discuss some of these controls.

Service quotas at account level

Amazon EMR Serverless has a default quota of 16 for maximum concurrent vCPUs per account. In other words, a new account can have a maximum of 16 vCPUs running at a given point in time in a particular Region across all EMR Serverless applications. However, this quota is auto-adjustable based on the usage patterns, which are monitored at the account and Region levels.

Resource limits and runtime configurations at the application level

In addition to quotas at the account levels, administrators can limit the use of resources at the application level using a feature known as “maximum capacity” which defines the maximum total vCPU, memory and disk capacity that can be consumed collectively by all the jobs running under this application.

You also have an option to specify common runtime and monitoring configurations at the application level which you would otherwise put in the specific job configurations. This helps create a standardized runtime environment for all the jobs running under an application. This can include settings like defining common connection setting your jobs need access to, log configurations that all your jobs will inherit by default, or Spark resource settings to help balance ad-hoc workloads. You can override these configurations at the job level, but defining them at the application can help reduce the configuration necessary for individual jobs.

For further details, refer to Declaring configurations at application level.

Runtime configurations at Job level

After you have set service, application quotas and runtime configurations at application level, you also have an option to override or add new configurations at the job level as well. For example, you can use different Spark job parameters to define how many maximum executors can be run by that specific job. One such parameter is spark.dynamicAllocation.maxExecutors which defines an upper bound for the number of executors in a job and therefore controls the number of workers in an EMR Serverless application because each executor runs within a single worker. This parameter is part of the dynamic allocation feature of Apache Spark, which allows you to dynamically scale the number of executors(workers) registered with the job up and down based on the workload. Dynamic allocation is enabled by default on EMR Serverless. For detailed steps, refer to Declaring configurations at application level.

With these configurations, you can control the resources used across accounts, applications, and jobs. For example, you can create applications with a predefined maximum capacity to constrain costs or configure jobs with resource limits in order to allow multiple ad hoc jobs to run simultaneously without consuming too many resources.

Best practices and considerations

Extending these usage scenarios further, EMR Serverless provides features and capabilities to implement the following design considerations and best practices based on your workload requirements:

  • To make sure that the users or teams submit their jobs only to their approved applications, you could use tag based AWS Identity and Access Management (IAM) policy conditions. For more details, refer to Using tags for access control.
  • You can use custom images as applications belonging to different teams that have distinct use-cases and software requirements. Using custom images is possible EMR 6.9.0 and onwards. Custom images allows you to package various application dependencies into a single container. Some of the important benefits of using custom images include the ability to use your own JDK and Python versions, apply your organization-specific security policies and integrate EMR Serverless into your build, test and deploy pipelines. For more information, refer to Customizing an EMR Serverless image.
  • If you need to estimate how much a Spark job would cost when run on EMR Serverless, you can use the open-source tool EMR Serverless Estimator. This tool analyzes Spark event logs to provide you with the cost estimate. For more details, refer to Amazon EMR Serverless cost estimator
  • We recommend that you determine your maximum capacity relative to the supported worker sizes by multiplying the number of workers by their size. For example, if you want to limit your application with 50 workers to 2 vCPUs, 16 GB of memory and 20 GB of disk, set the maximum capacity to 100 vCPU, 800 GB of memory, and 1000 GB of disk.
  • You can use tags when you create the EMR Serverless application to help search and filter your resources, or track the AWS costs using AWS Cost Explorer. You can also use tags for controlling who can submit jobs to a particular application or modify its configurations. Refer to Tagging your resources for more details.
  • You can configure the pre-initialized capacity at the time of application creation, which keeps the resources ready to be consumed by the time-sensitive jobs you submit.
  • The number of concurrent jobs you can run depends on important factors like maximum capacity limits, workers required for each job, and available IP address if using a VPC.
  • EMR Serverless will setup elastic network interfaces (ENIs) to securely communicate with resources in your VPC. Make sure you have enough IP addresses in your subnet for the job.
  • It’s a best practice to select multiple subnets from multiple Availability Zones. This is because the subnets you select determine the Availability Zones that are available to run the EMR Serverless application. Each worker uses an IP address in the subnet where it is launched. Make sure the configured subnets have enough IP addresses for the number of workers you plan to run.

Resource usage tracking

EMR Serverless not only allows cloud administrators to limit the resources for each application, it also enables them to monitor the applications and track the usage of resources across these applications. For more details, refer to  EMR Serverless usage metrics .

You can also deploy an AWS CloudFormation template to build a sample CloudWatch Dashboard for EMR Serverless which would help visualize various metrics for your applications and jobs. For more information, refer to EMR Serverless CloudWatch Dashboard.

Conclusion

In this post, we discussed how EMR Serverless empowers cloud and data platform administrators to efficiently distribute as well as restrict the cloud resources at different levels, for different organizational units, users and teams, as well as between critical and non-critical workloads. EMR Serverless resource limiting features make sure cloud cost is under control and resource usage is tracked effectively.

For more information on EMR Serverless applications and resource quotas, please refer to EMR Serverless User Guide and Configuring an application.


About the Authors

Gaurav Sharma is a Specialist Solutions Architect(Analytics) at Amazon Web Services (AWS), supporting US public sector customers on their cloud journey. Outside of work, Gaurav enjoys spending time with his family and reading books.

Damon Cortesi is a Principal Developer Advocate with Amazon Web Services. He builds tools and content to help make the lives of data engineers easier. When not hard at work, he still builds data pipelines and splits logs in his spare time.

Use AWS Secrets Manager to store and manage secrets in on-premises or multicloud workloads

Post Syndicated from Sreedar Radhakrishnan original https://aws.amazon.com/blogs/security/use-aws-secrets-manager-to-store-and-manage-secrets-in-on-premises-or-multicloud-workloads/

AWS Secrets Manager helps you manage, retrieve, and rotate database credentials, API keys, and other secrets throughout their lifecycles. You might already use Secrets Manager to store and manage secrets in your applications built on Amazon Web Services (AWS), but what about secrets for applications that are hosted in your on-premises data center, or hosted by another cloud service provider? You might even be in the process of moving applications out of your data center as part of a phased migration, where the application is partially in AWS, but other components still remain in your data center until the migration is complete. In this blog post, we’ll describe the potential benefits of using Secrets Manager for workloads outside AWS, outline some recommended practices for using Secrets Manager for hybrid workloads, and provide a basic sample application to highlight how to securely authenticate and retrieve secrets from Secrets Manager in a multicloud workload.

In order to make an API call to retrieve secrets from Secrets Manager, you need IAM credentials. While it is possible to use an AWS Identity and Access Management (IAM) user, AWS recommends using temporary, or short-lived, credentials wherever possible to reduce the scope of impact of an exposed credential. This means we will allow our hybrid application to assume an IAM role in this example. We’ll use IAM Roles Anywhere to provide a mechanism for our applications outside AWS to assume an IAM Role based on a trust configured with our Certificate Authority (CA).

IAM Roles Anywhere offers a solution for on-premises or multicloud applications to acquire temporary AWS credentials, helping to eliminate the necessity for creating and handling long-term AWS credentials. This removal of long-term credentials enhances security and streamlines the operational process by reducing the burden of managing and rotating the credentials.

In this post, we assume that you have a basic understanding of IAM. For more information on IAM roles, see the IAM documentation. We’ll start by examining some potential use cases at a high level, and then we’ll highlight recommended practices to securely fetch secrets from Secrets Manager from your on-premises or hybrid workload. Finally, we’ll walk you through a simple application example to demonstrate how to put these recommendations together in a workload.

Selected use cases for accessing secrets from outside AWS

Following are some example scenarios where it may be necessary to securely retrieve or manage secrets from outside AWS, such from applications hosted in your data center, or another cloud provider.

Centralize secrets management for applications in your data center and in AWS

It’s beneficial to offer your application teams a single, centralized environment for managing secrets. This can simplify managing secrets because application teams are only required to understand and use a single set of APIs to create, retrieve, and rotate secrets. It also provides consistent visibility into the secrets used across your organization because Secrets Manager is integrated with AWS CloudTrail to log API calls to the service, including calls to retrieve or modify a secret value.

In scenarios where your application is deployed either on-premises or in a multicloud environment, and your database resides in Amazon Relational Database Service (Amazon RDS), you have the opportunity to use both IAM Roles Anywhere and Secrets Manager to store and retrieve secrets by using short-term credentials. This approach allows central security teams to have confidence in the management of credentials and builder teams to have a well-defined pattern for credential management. Note that you can also choose to configure IAM database authentication with RDS, instead of storing database credentials in Secrets Manager, if this is supported by your database environment.

Hybrid or multicloud workloads

At AWS, we’ve generally seen that customers get the best experience, performance, and pricing when they choose a primary cloud provider. However, for a variety of reasons, some customers end up in a situation where they’re running IT operations in a multicloud environment. In these scenarios, you might have hybrid applications that run in multiple cloud environments, or you might have data stored in AWS that needs to be accessed from a different application or workload running in another cloud provider. You can use IAM Roles Anywhere to securely access or manage secrets in Secrets Manager for these use cases.

Phased application migrations to AWS

Consider a situation where you are migrating a monolithic application to AWS from your data center, but the migration is planned to take place in phases over a number of months. You might be migrating your compute into AWS well before your databases, or vice versa. In this scenario, you can use Secrets Manager to store your application secrets and access them from both on premises and in AWS. Because your secrets are accessible from both on premises and AWS through the same APIs, you won’t need to refactor your application to retrieve these secrets as the migration proceeds.

Recommended practices for retrieving secrets for hybrid and multicloud workloads

In this section, we’ll outline some recommended practices that will help you provide least-privilege access to your application secrets, wherever the access is coming from.

Client-side caching of secrets

Client-side caching of secrets stored in Secrets Manager can help you improve performance and decrease costs by reducing the number of API requests to Secrets Manager. After retrieving a secret from Secrets Manager, your application can get the secret value from its in-memory cache without making further API calls. The cached secret value is automatically refreshed after a configurable time interval, called the cache duration, to help ensure that the application is always using the latest secret value. AWS provides client-side caching libraries for .NET, Java, JDBC, Python, and Go to enable client-side caching. You can find more detailed information on client-side caching specific to Python libraries in this blog post.

Consider a hybrid application with an application server on premises, that needs to retrieve database credentials stored in Secrets Manager in order to query customer information from a database. Because the API calls to retrieve the secret are coming from outside AWS, they may incur increased latency simply based on the physical distance from the closest AWS data center. In this scenario, the performance gains from client-side caching become even more impactful.

Enforce least-privilege access to secrets through IAM policies

You can use a combination of IAM policy types to granularly restrict access to application secrets when you’re using IAM Roles Anywhere and Secrets Manager. You can use conditions in trust policies to control which systems can assume the role. In our example, this is based on the system’s certificate, meaning that you need to appropriately control access to these certificates. We use a policy condition to specify an IP address in our example, but you could also use a range of IP addresses. Other examples would be conditions that specify a time range for when resources can be accessed, conditions that allow or deny building resources in certain AWS Regions, and more. You can find example policies in the IAM documentation.

You should use identity policies to provide Secrets Manager with permissions to the IAM role being assumed, following the principle of least privilege. You can find IAM policy examples for Secrets Manager use cases in the Secrets Manager documentation.

By combining different policy types, like identity policies and trust policies, you can limit the scope of systems that can assume a role, and control what those systems can do after assuming a role. For example, in the trust policy for the IAM role with access to the secret in Secrets Manager, you can allow or deny access based on the common name of the certificate that’s being used to authenticate and retrieve temporary credentials in order to assume a role using IAM Roles Anywhere. You can then attach an identity policy to the role being assumed that provides only the necessary API actions for your application, such as the ability to retrieve a secret value—but not to a delete a secret. See this blogpost for more information on when to use different policy types.

Transform long-term secrets into short-term secrets

You may already be wondering, “why should I use short-lived credentials to access a long-term secret?” Frequently rotating your application secrets in Secrets Manager will reduce the impact radius of a compromised secret. Imagine that you rotate your application secret every day. If that secret is somehow publicly exposed, it will only be usable for a single day (or less). This can greatly reduce the risk of compromised credentials being used to get access to sensitive information. You can find more information about the value of using short-lived credentials in this AWS Well-Architected best practice.

Instead of using static database credentials that are rarely (or never) rotated, you can use Secrets Manager to automatically rotate secrets up to every four hours. This method better aligns the lifetime of your database secret with the lifetime of the short-lived credentials that are used to assume the IAM role by using IAM Roles Anywhere.

Sample workload: How to retrieve a secret to query an Amazon RDS database from a workload running in another cloud provider.

Now we’ll demonstrate examples of the recommended practices we outlined earlier, such as scoping permissions with IAM policies. We’ll also showcase a sample application that uses a virtual machine (VM) hosted in another cloud provider to access a secret in Secrets Manager.

The reference architecture in Figure 1 shows the basic sample application.

Figure 1: Application connecting to Secrets Manager by using IAM Roles Anywhere to retrieve RDS credentials

Figure 1: Application connecting to Secrets Manager by using IAM Roles Anywhere to retrieve RDS credentials

In the sample application, an application secret (for example, a database username and password) is being used to access an Amazon RDS database from an application server hosted in another cloud provider. The following process is used to connect to Secrets Manager in order to retrieve and use the secret:

  1. The application server makes a request to retrieve temporary credentials by using IAM Roles Anywhere.
  2. IAM validates the request against the relevant IAM policies and verifies that the certificate was issued by a CA configured as a trust anchor.
  3. If the request is valid, AWS Security Token Service (AWS STS) provides temporary credentials that the application can use to assume an IAM role.
  4. IAM Roles Anywhere returns temporary credentials to the application.
  5. The application assumes an IAM role with Secrets Manager permissions and makes a GetSecretValue API call to Secrets Manager.
  6. The application uses the returned database credentials from Secrets Manager to query the RDS database and retrieve the data it needs to process.

Configure IAM Roles Anywhere

Before you configure IAM Roles Anywhere, it’s essential to have an IAM role created with the required permission for Amazon RDS and Secrets Manager. If you’re following along on your own with these instructions, refer to this blog post and the IAM Roles Anywhere User Guide for the steps to configure IAM Roles Anywhere in your environment.

Obtain temporary security credentials

You have several options to obtain temporary security credentials using IAM Roles Anywhere:

  • Using the credential helper — The IAM Roles Anywhere credential helper is a tool that manages the process of signing the CreateSession API with the private key associated with an X.509 end-entity certificate and calls the endpoint to obtain temporary AWS credentials. It returns the credentials to the calling process in a standard JSON format. This approach is documented in the IAM Roles Anywhere User Guide.
  • Using the AWS SDKs

Use policy controls to appropriately scope access to secrets

In this section, we demonstrate the process of restricting access to temporary credentials by employing condition statements based on attributes extracted from the X.509 certificate. This additional step gives you granular control of the trust policy, so that you can effectively manage which resources can obtain credentials from IAM Roles Anywhere. For more information on establishing a robust data perimeter on AWS, refer to this blog post.

Prerequisites

  • IAM Roles Anywhere using AWS Private Certificate Authority or your own PKI as the trust anchor
  • IAM Roles Anywhere profile
  • An IAM role with Secrets Manager permissions

Restrict access to temporary credentials

You can restrict access to temporary credentials by using specific PKI conditions in your role’s trust policy, as follows:

  • Sessions issued by IAM Roles Anywhere have the source identity set to the common name (CN) of the subject you use in end-entity certificate authenticating to the target role.
  • IAM Roles Anywhere extracts values from the subject, issuer, and Subject Alternative Name (SAN) fields of the authenticating certificate and makes them available for policy evaluation through the sourceIdentity and principal tags.
  • To examine the contents of a certificate, use the following command:

    openssl x509 -text -noout -in certificate.pem

  • To establish a trust relationship for IAM Roles Anywhere, use the following steps:
    1. In the navigation pane of the IAM console, choose Roles.
    2. The console displays the roles for your account. Choose the name of the role that you want to modify, and then choose the Trust relationships tab on the details page.
    3. Choose Edit trust relationship.

Example: Restrict access to a role based on the common name of the certificate

The following example shows a trust policy that adds a condition based on the Subject Common Name (CN) of the certificate.

{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "rolesanywhere.amazonaws.com"
        },
        "Action": [
          "sts:AssumeRole",
          "sts:TagSession",
          "sts:SetSourceIdentity"
        ],
        "Condition": {
          "StringEquals": {
            "aws:PrincipalTag/x509Subject/CN": "workload-a.iamcr-test"
          },
          "ArnEquals": {
            "aws:SourceArn": [
              "arn:aws:rolesanywhere:region:account:trust-anchor/TA_ID"
            ]
          }
        }
      }
    ]
  }

If you try to access the temporary credentials using a different certificate which has a different CN, you will receive the error “Error when retrieving credentials from custom-process: 2023/07/0X 23:46:43 AccessDeniedException: Unable to assume role for arn:aws:iam::64687XXX207:role/RDS_SM_Role”.

Example: Restrict access to a role based on the issuer common name

The following example shows a trust policy that adds a condition based on the Issuer CN of the certificate.

 {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "rolesanywhere.amazonaws.com"
        },
        "Action": [
          "sts:AssumeRole",
          "sts:TagSession",
          "sts:SetSourceIdentity"
        ],
        "Condition": {
          "StringEquals": {
            "aws:PrincipalTag/x509Issuer/CN": "iamcr.test"
          },
          "ArnEquals": {
            "aws:SourceArn": [
              "arn:aws:rolesanywhere:region:account:trust-anchor/TA_ID"
            ]
          }
        }
      }
    ]
  }

Example: Restrict access to a role based on the subject alternative name (SAN)

The following example shows a trust policy that adds a condition based on the SAN fields of the certificate.

 {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "rolesanywhere.amazonaws.com"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession",
        "sts:SetSourceIdentity"
      ],
      "Condition": {
        "StringEquals": {
          "aws:PrincipalTag/x509SAN/DNS": "workload-a.iamcr.test"
        },
        "ArnEquals": {
          "aws:SourceArn": [
            "arn:aws:rolesanywhere:region:account:trust-anchor/TA_ID"
          ]
        }
      }
    }
  ]
}

Session policies

Define session policies to further scope down the sessions delivered by IAM Roles Anywhere. Here, for demonstration purposes, we added an inline policy to only allow requests coming from the specified IP address by using the following steps.

  1. Navigate to the Roles Anywhere console.
  2. Under Profiles, choose Create a profile.
  3. On the Create a profile page, enter a name for the profile.
  4. For Roles, select the role that you created in the previous step, and select the Inline policy.

The following example shows how to allow only the requests from a specific IP address. You will need to replace <X.X.X.X/32> in the policy example with your own IP address.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "NotIpAddress": {
          "aws:SourceIp": [
            "<X.X.X.X/32>"
          ]
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}

Retrieve secrets securely from a workload running in another cloud environment

In this section, we’ll demonstrate the process of connecting virtual machines (VMs) running in another cloud provider to an Amazon RDS MySQL database, where the database credentials are securely stored in Secrets Manager.

Create a database and manage Amazon RDS master database credentials in Secrets Manager

In this section, you will create a database instance and use Secrets Manager to manage the master database credentials.

To create an Amazon RDS database and manage master database credentials in Secrets Manager

  1. Open the Amazon RDS console and choose Create database.
  2. Select your preferred database creation method. For this post, we chose Standard create.
  3. Under Engine options, for Engine type, choose your preferred database engine. In this post, we use MySQL.
  4. Under Settings, for Credentials Settings, select Manage master credentials in AWS Secrets Manager.
    Figure 2: Manage master credentials in Secrets Manager

    Figure 2: Manage master credentials in Secrets Manager

  5. You have the option to encrypt the managed master database credentials. In this example, we will use the default AWS KMS key.
  6. (Optional) Choose other settings to meet your requirements. For more information, see Settings for DB instances.
  7. Choose Create Database, and wait a few minutes for the database to be created.

Retrieve and use temporary credentials to access a secret in Secrets Manager

The next step is to use the AWS Roles Anywhere service to obtain temporary credentials for an IAM role. These temporary credentials are essential for accessing AWS resources securely. Earlier, we described the options available to you to retrieve temporary credentials by using IAM Roles Anywhere. In this section, we will assume you’re using the credential helper to retrieve temporary credentials and make an API call to Secrets Manager.

After you retrieve temporary credentials and assume an IAM role with permissions to access the secret in Secrets Manager, you can run a simple script on the VM to get the database username and password from Secrets Manager and update the database. The steps are summarized here:

  • Use the credential helper to assume your IAM role with permissions to access the secret in Secrets Manager. You can find instructions to obtain temporary credentials in the IAM Roles Anywhere User Guide.
  • Retrieve secrets from Secrets Manager. Using the obtained temporary credentials, you can create a boto3 session object and initialize a secrets_client from boto3.client(‘secretsmanager’). The secrets_client is responsible for interacting with the Secrets Manager service. You will retrieve the secret value from Secrets Manager, which contains the necessary credentials (for example, database username and password) for accessing an RDS database.
  • Establish a connection to the RDS database. The retrieved secret value is parsed, extracting the database connection information. You can then establish a connection to the RDS database using the extracted details, such as username and password.
  • Perform database operations. Once the database connection is established, the script performs the operation to update a record in the database.

The following is an example Python script to retrieve credentials from Secrets Manager and connect to the RDS for database operations.

import mysql.connector
import boto3
import json

#Create client

client = boto3.client('secretsmanager')
response = client.get_secret_value(
    SecretId='rds!db-fXXb-11ce-4f05-9XX2-d42XXcd'
)
secretDict = json.loads(response['SecretString'])

#Connect to DB

mydb = mysql.connector.connect(
    host="rdsmysql.cpl0ov.us-east-1.rds.amazonaws.com",
    user=secretDict['username'],
    password=secretDict['password'],
    database="rds_mysql"
)
mycursor = mydb.cursor()

#Update DB 

sql = "INSERT INTO employees (id, name) VALUES (%s, %s)"
val = (12, "AWPS")
mycursor.execute(sql, val)
mydb.commit()
print(mycursor.rowcount, "record inserted.")

And that’s it! You’ve retrieved temporary credentials using IAM Roles Anywhere, assumed a role with permissions to access the database username and password in Secrets Manager, and then retrieved and used the database credentials to update a database from your application running on another cloud provider. This is a simple example application for the purpose of the blog post, but the same concepts will apply in real-world use cases.

Conclusion

In this post, we’ve demonstrated how you can securely store, retrieve, and manage application secrets and database credentials for your hybrid and multicloud workloads using Secrets Manager. We also outlined some recommended practices for least-privilege access to your secrets when accessing Secrets Manager from outside AWS by using IAM Roles Anywhere. Lastly, we demonstrated a simple example of using IAM Roles Anywhere to assume a role, then retrieve and use database credentials from Secrets Manager in a multicloud workload. To get started managing secrets, open the Secrets Manager console. To learn more about Secrets Manager, refer to the Secrets Manager documentation.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Sreedar Radhakrishnan

Sreedar Radhakrishnan

Sreedar is a Senior Solutions Architect at AWS, where he helps enterprise customers to design and build secure, scalable, and sustainable solutions on AWS. In his spare time, Sreedar enjoys playing badminton and spending time with his family.

Zach Miller

Zach Miller

Zach is a Senior Security Specialist Solutions Architect at AWS. His background is in data protection and security architecture, focused on a variety of security domains, including cryptography, secrets management, and data classification. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Akshay Aggarwal

Akshay Aggarwal

Akshay is a Senior Technical Product Manager on the AWS Secrets Manager team. As part of AWS Cryptography, Akshay drives technologies and defines best practices that help improve customer’s experience of building secure, reliable workloads in the AWS Cloud. Akshay is passionate about building technologies that are easy to use, secure, and scalable.

Enable Security Hub partner integrations across your organization

Post Syndicated from Joaquin Manuel Rinaudo original https://aws.amazon.com/blogs/security/enable-security-hub-partner-integrations-across-your-organization/

AWS Security Hub offers over 75 third-party partner product integrations, such as Palo Alto Networks Prisma, Prowler, Qualys, Wiz, and more, that you can use to send, receive, or update findings in Security Hub.

We recommend that you enable your corresponding Security Hub third-party partner product integrations when you use these partner solutions. By centralizing findings across your AWS and partner solutions in Security Hub, you can get a holistic cross-account and cross-Region view of your security risks. In this way, you can move beyond security reporting and start implementing automations on top of Security Hub that help improve your overall security posture and reduce manual efforts. For example, you can configure your third-party partner offerings to send findings to Security Hub and build standardized enrichment, escalation, and remediation solutions by using Security Hub automation rules, or other AWS services such as AWS Lambda or AWS Step Functions.

To enable partner integrations, you must configure the integration in each AWS Region and AWS account across your organization in AWS Organizations. In this blog post, we’ll show you how to set up a Security Hub partner integration across your entire organization by using AWS CloudFormation StackSets.

Overview

Figure 1 shows the architecture of the solution. The main steps are as follows:

  1. The deployment script creates a CloudFormation template that deploys a stack set across your AWS accounts.
  2. The stack in the member account deploys a CloudFormation custom resource using a Lambda function.
  3. The Lambda function iterates through target Regions and invokes the Security Hub boto3 method enable_import_findings_for_product to enable the corresponding partner integration.

When you add new accounts to the organizational units (OUs), StackSets deploys the CloudFormation stack and the partner integration is enabled.

Figure 1: Diagram of the solution

Figure 1: Diagram of the solution

Prerequisites

To follow along with this walkthrough, make sure that you have the following prerequisites in place:

  • Security Hub enabled across an organization in the Regions where you want to deploy the partner integration.
  • Trusted access with AWS Organizations enabled so that you can deploy CloudFormation StackSets across your organization. For instructions on how to do this, see Activate trusted access with AWS Organizations.
  • Permissions to deploy CloudFormation StackSets in a delegated administrator account for your organization.
  • AWS Command Line Interface (AWS CLI) installed.

Walkthrough

Next, we show you how to get started with enabling your partner integration across your organization using the following solution.

Step 1: Clone the repository

In the AWS CLI, run the following command to clone the aws-securityhub-deploy-partner-integration GitHub repository:

git clone https://github.com/aws-samples/aws-securityhub-partner-integration

Step 2: Set up the integration parameters

  1. Open the parameters.json file and configure the following values:
    • ProductName — Name of the product that you want to enable.
    • ProductArn — The unique Amazon Resource Name (ARN) of the Security Hub partner product. For example, the product ARN for Palo Alto PRISMA Cloud Enterprise, is arn:aws:securityhub:<REGION>:188619942792:product/paloaltonetworks/redlock; and for Prowler, it’s arn:aws:securityhub:<REGION>::product/prowler/prowler. To find a product ARN, see Available third-party partner product integrations.
    • DeploymentTargets — List of the IDs of the OUs of the AWS accounts that you want to configure. For example, use the unique identifier (ID) for the root to deploy across your entire organization.
    • DeploymentRegions — List of the Regions in which you’ve enabled Security Hub, and for which the partner integration should be enabled.
  2. Save the changes and close the file.

Step 3: Deploy the solution

  1. Open a command line terminal of your preference.
  2. Set up your AWS_REGION (for example, export AWS_REGION=eu-west-1) and make sure that your credentials are configured for the delegated administrator account.
  3. Enter the following command to deploy:
    ./setup.sh deploy

Step 4: Verify Security Hub partner integration

To test that the product integration is enabled, run the following command in one of the accounts in the organization. Replace <TARGET-REGION> with one of the Regions where you enabled Security Hub.

aws securityhub list-enabled-products-for-import --region <TARGET-REGION>

Step 5: (Optional) Manage new partners, Regions, and OUs

To add or remove the partner integration in certain Regions or OUs, update the parameters.json file with your desired Regions and OU IDs and repeat Step 3 to redeploy changes to your Security Hub partner integration. You can also directly update the CloudFormation parameters for the securityhub-integration-<PARTNER-NAME> from the CloudFormation console.

To enable new partner integrations, create a new parameters.json file version with the partner’s product name and product ARN to deploy a new stack using the deployment script from Step 3. In the next step, we show you how to disable the partner integrations.

Step 6: Clean up

If needed, you can remove the partner integrations by destroying the stack deployed. To destroy the stack, use the command line terminal configured with the credentials for the AWS StackSets delegated administrator account and run the following command:

 ./setup.sh destroy

You can also directly delete the stack mentioned in Step 5 from the CloudFormation console by accessing the stack page from the CloudFormation console, selecting the stack securityhub-integration-<PARTNER-NAME>, and then choosing Delete.

Conclusion

In this post, you learned how you to enable Security Hub partner integrations across your organization. Now you can configure the partner product of your choice to send, update, or receive Security Hub findings.

You can extend your security automation by using Security Hub automation rules, Amazon EventBridge events, and Lambda functions to start or enrich automated remediation of new ingested findings from partners. For an example of how to do this, see Automated Security Response on AWS.

Developer teams can opt in to configure their own chatbot in AWS Chatbot to receive notifications in Amazon Chime, Slack, or Microsoft Teams channels. Lastly, security teams can use existing bidirectional integrations with Jira Service Management or Jira Core to escalate severe findings to their developer teams.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Security, Identity, & Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Joaquin Manuel Rinaudo

Joaquin is a Principal Security Architect with AWS Professional Services. He is passionate about building solutions that help developers improve their software quality. Prior to AWS, he worked across multiple domains in the security industry, from mobile security to cloud and compliance related topics. In his free time, Joaquin enjoys spending time with family and reading science-fiction novels.

Shachar Hirshberg

Shachar Hirshberg

Shachar is a Senior Product Manager for AWS Security Hub with over a decade of experience in building, designing, launching, and scaling enterprise software. He is passionate about further improving how customers harness AWS services to enable innovation and enhance the security of their cloud environments. Outside of work, Shachar is an avid traveler and a skiing enthusiast.

Validate IAM policies with Access Analyzer using AWS Config rules

Post Syndicated from Anurag Jain original https://aws.amazon.com/blogs/security/validate-iam-policies-with-access-analyzer-using-aws-config-rules/

You can use AWS Identity and Access Management (IAM) Access Analyzer policy validation to validate IAM policies against IAM policy grammar and best practices. The findings generated by Access Analyzer policy validation include errors, security warnings, general warnings, and suggestions for your policy. These findings provide actionable recommendations that help you author policies that are functional and conform to security best practices.

You can use the IAM Policy Validator for AWS CloudFormation and the IAM Policy Validator for Terraform solutions to integrate Access Analyzer policy validation in a proactive manner within your continuous integration and continuous delivery CI/CD pipeline before deploying IAM policies to your Amazon Web Service (AWS) environment. Customers requested a similar capability to validate policies already deployed within their environments as part of the defense-in-depth strategy.

In this post, you learn how to set up and continuously validate and report on compliance of the IAM policies in your environment using AWS Config. AWS Config evaluates the configuration settings of your AWS resources with the help of AWS Config rules, which represent your ideal configuration settings. AWS Config continuously tracks the configuration changes that occur among your resources and checks whether these changes conform to the conditions in your rules. If a resource doesn’t conform to a rule, AWS Config flags the resource and the rule as noncompliant.

You can use this solution to validate identity-based and resource-based IAM policies attached to resources in your AWS environment that might have grammatical or syntactical errors or might not follow AWS best practices. The code used in this post is hosted in a GitHub repository.

Prerequisites

Before you get started, you need:

Step 1: Enable AWS Config to monitor global resources

To get started, enable AWS Config in your AWS account by following the instructions in the AWS Config Developer Guide.

Next, enable the recording of global resources:

  1. Open the AWS Management Console and go to the AWS Config console.
  2. Go to Settings and choose Edit to see the AWS Config recorder settings.
  3. Under General settings, select the Include globally recorded resource types to enable AWS Config to monitor IAM configuration items.
  4. Leave the other settings at their defaults.
  5. Choose Save.
    Figure 1: AWS Config settings page showing inclusion of globally recorded resource types

    Figure 1: AWS Config settings page showing inclusion of globally recorded resource types

  6. After choosing Save, you should see Recording is on at the top of the window.
    Figure 2: AWS Config settings page showing recorder settings

    Figure 2: AWS Config settings page showing recorder settings

    Note: You only need to enable globally recorded resource types in the AWS Region where you’ve configured AWS Config because they aren’t tied to a specific Region and can be used in other Regions. The globally recorded resource types that AWS Config supports are IAM users, groups, roles, and customer managed policies.

Step 2: Deploy the CloudFormation template

In this section, you deploy and test a sample AWS CloudFormation template that creates the following:

  • An AWS Config rule that reports the compliance of IAM policies.
  • An AWS Lambda function that implements and then makes the requests to IAM Access Analyzer and returns the policy validation findings.
  • An IAM role that’s used by the Lambda function with permissions to validate IAM policies using the Access Analyzer ValidatePolicy API.
  • An optional Amazon CloudWatch alarm and Amazon Simple Notification Service (Amazon SNS) topic to provide notification of Lambda function errors.

Follow the steps below to deploy the AWS CloudFormation template:

  1. To deploy the CloudFormation template using the following command, you must have the AWS Command Line Interface (AWS CLI) installed.
  2. Make sure you have configured your AWS CLI credentials.
  3. Clone the solution repository.
    git clone https://github.com/awslabs/aws-iam-access-analyzer-policy-validation-config-rule.git

  4. Navigate to the iam-access-analyzer-config-rule folder of the cloned repository.
    cd aws-iam-access-analyzer-policy-validation-config-rule

  5. Deploy the CloudFormation template using the AWS CLI.

    Note: Change the Region for the parameter — RegionToValidateGlobalResources — to the Region you enabled for global resources in Step 1. Optionally, you can add an email address if you want to receive notifications if the AWS Config rule stops working. Use the code that follows, replacing <us-east-1> with the Region you enabled and <EMAIL_ADDRESS> with your chosen address.

    aws cloudformation deploy \
        --stack-name iam-policy-validation-config-rule \
        --template-file templates/template.yaml \
        --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
        --parameter-overrides RegionToValidateGlobalResources='<us-east-1>' \
                              ErrorNotificationsEmailAddress='<EMAIL_ADDRESS>'

  6. After successful deployment, you will see the message Successfully created/updated stack – iam-policy-validation-config-rule.
    Figure 3: Successful CloudFormation stack creation reported on the terminal

    Figure 3: Successful CloudFormation stack creation reported on the terminal

    Note: If the CloudFormation stack creation fails, go to the CloudFormation console and select the iam-policy-validation-config-rule stack. Choose Events to review the failure reason.

  7. After deployment, open the CloudFormation console and select the iam-policy-validation-config-rule stack.
  8. Choose Resources to see the resources created by the template.

Step 3: Check noncompliant resources discovered by AWS Config

The AWS Config rule is designed to mark resources that have IAM policies as noncompliant if the resources have validation findings found using the IAM Access Analyzer ValidatePolicy API.

  1. Open the AWS Config console
  2. Choose Rules from the navigation pane on the left and select policy-validation-config-rule.
    Figure 4: AWS Config rules page showing the rule details

    Figure 4: AWS Config rules page showing the rule details

  3. Scroll down on the page and filter Resources in Scope to see the noncompliant resources.
    Figure 5: AWS Config rules page showing noncompliant resources

    Figure 5: AWS Config rules page showing noncompliant resources

    Note: If the AWS Config rule isn’t invoked yet, you can choose Actions and select Re-evaluate to invoke it.

    Figure 6: AWS Config rules page showing evaluation invocation

    Figure 6: AWS Config rules page showing evaluation invocation

Step 4: Modify the AWS Config rule for exceptions

You might want to exempt certain resources from specific policy validation checks. For example, you might need to deploy a more privileged role—such as an administrator role—to your environment and you don’t want that role’s policies to have policy validation findings.

Figure 7: AWS Config rules page showing a noncompliant administrator role

Figure 7: AWS Config rules page showing a noncompliant administrator role

This section shows you how to configure an exceptions file to exempt specific resources.

  1. Start by configuring an exceptions file similar to the one that follows to log general warning findings across the accounts in your organization to make sure your policies conform to best practices by setting ignoreWarningFindings to False.
  2. Additionally, you might want to create an exception that allows administrator roles to use the iam:PassRole action on another role. This combination of action and resource is usually reserved for privileged users. The example file below shows an exception for all the roles created with Administrator in the role path from account 12345678912.

    Example exceptions file:

    {
    "global":{
    "ignoreWarningFindings":false
    },
    "12345678912":{
    "ignoreFindingsWith":[
    {
    "issueCode":"PASS_ROLE_WITH_STAR_IN_ACTION_AND_RESOURCE",
    "resourceType":"AWS::IAM::Role",
    "resourceName":"Administrator/*"
    }
    ]
    }
    }
  3. After the exceptions file is ready, upload the JSON file to the S3 bucket you created as a part of the prerequisites.

    You can manage this exceptions file by hosting it in a central Git repository. When teams need to exempt a particular resource from these policy validation checks, they can submit a pull request to the central repository. An approver can then approve or reject this request and, if approved, deploy the updated exceptions file.

  4. Modify the bucket policy so that the bucket is accessible to your AWS Config rule if the rule is operating in a different account than the bucket was created in. Below is an example of a bucket policy that allows the accounts in your organization to read the exceptions file.
    {
          "Version": "2012-10-17",
          "Statement": [{
              "Effect": "Allow",
              "Principal": {"AWS": "*"},
              "Action": "s3:GetObject",
              "Resource": "arn:aws:s3:::EXAMPLE-BUCKET/my-exceptions-file.json",
              "Condition": {
                  "StringEquals": {
                      "aws:PrincipalOrgId": "<your organization id here>"
                  }
              }
          }]
    }

    Note: For more examples visit example policy validation exceptions file contents.

  5. Deploy the CloudFormation template again using the ExceptionsS3BucketName and ExceptionsS3FilePrefix parameters. The file prefix should be the full prefix of the S3 object exceptions file.
    aws cloudformation deploy \
        --stack-name iam-policy-validation-config-rule \
        --template-file templates/template.yaml \
        --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
        --parameter-overrides RegionToValidateGlobalResources='<us-east-1>' \
            		ExceptionsS3BucketName='EXAMPLE-BUCKET' \
           		 ExceptionsS3FilePrefix='my-exceptions-file.json'

  6. After you see the Successfully created/updated stack – iam-policy-validation-config-rule message on the terminal or command line and the AWS Config rule has been re-evaluated, the resources mentioned in the exception file should show as Compliant.
    Figure 8: Resource exception result

    Figure 8: Resource exception result

You can find additional customization options in the exceptions file schema.

Cleanup

To avoid recurring charges and to remove the resources used in testing the solution outlined in this post, use the CloudFormation console to delete the iam-policy-validation-config-rule CloudFormation stack.

Figure 9: AWS CloudFormation stack deletion

Figure 9: AWS CloudFormation stack deletion

Conclusion

In this post, we demonstrated how you can set up a centralized compliance and monitoring workflow using AWS IAM Access Analyzer policy validation with AWS Config rules to validate identity-based and resource-based policies attached to resources in your account. Using this solution, you can create a single pane of glass to monitor resources and govern centralized compliance for AWS Config-supported resources across accounts. You can also build and maintain exceptions customized to your environment as shown in the example policy validation exceptions file. You can visit the Access Analyzer policy checks reference page for a complete list of policy check validation errors and resolutions.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Matt Luttrell

Matt is a Sr. Solutions Architect on the AWS Identity Solutions team. When he’s not spending time chasing his kids around, he enjoys skiing, cycling, and the occasional video game.

Swara Gandhi

Swara Gandhi

Swara is a solutions architect on the AWS Identity Solutions team. She works on building secure and scalable end-to-end identity solutions. She is passionate about everything identity, security, and cloud.

Announcing updates to the AWS Well-Architected Framework guidance

Post Syndicated from Haleh Najafzadeh original https://aws.amazon.com/blogs/architecture/announcing-updates-to-the-aws-well-architected-framework-guidance/

We are excited to announce the availability of improved AWS Well-Architected Framework guidance. In this update, we have made changes across all six pillars of the framework: Operational ExcellenceSecurityReliabilityPerformance EfficiencyCost Optimization, and Sustainability.

In this release, we have made the implementation guidance for the new and updated best practices more prescriptive, including enhanced recommendations and steps on reusable architecture patterns targeting specific business outcomes in the Amazon Web Services (AWS) Cloud.

A brief history

The Well-Architected Framework is a collection of best practices that allow customers to evaluate and improve the design, implementation, and operations of their workloads in the cloud.

In 2012, the first version of the framework was published, leading to the 2015 release of the guidance whitepaper. We added the Operational Excellence pillar in 2016. The pillar-specific whitepapers and AWS Well-Architected Lenses were released in 2017, and the following year, the AWS Well-Architected Tool was launched.

In 2020, Well-Architected Framework guidance had a new release, along with more lenses, as well as API integration with the AWS Well-Architected Tool. The sixth pillar, Sustainability, was added in 2021. In 2022, dedicated pages were introduced for each consolidated best practice across all six pillars, with several best practices updated with improved prescriptive guidance. By April 2023, more than 50% of the Framework’s best practices have had their prescriptive guidance improved.

A brief history of the AWS Well-Architected Framework

A brief history of the AWS Well-Architected Framework

What’s new

As customers mature in their journey, they are seeking guidance to achieve accurate solutions that is prescriptive to their business, environments, and workloads. AWS Well-Architected is committed to providing such information to customers by continually evolving and updating our guidance.

The content updates and improvements in this release focus on having more complete coverage across the AWS service portfolio, helping customers make more informed decisions when developing implementation plans. Services that were added or expanded in coverage include: AWS Elastic Disaster Recovery, AWS Trusted Advisor, AWS Resilience Hub, AWS Config, AWS Security Hub, Amazon GuardDuty, AWS Organizations, AWS Control Tower, AWS Compute Optimizer, AWS Budgets, Amazon CodeWhisperer, Amazon CodeGuru, Amazon EventBridge, Amazon CloudWatch, Amazon Simple Notification Service, AWS Systems Manager, Amazon ElastiCache, and AWS Global Accelerator.

Pillar updates

Operational Excellence

The Operational Excellence Pillar has received updates to two of the five Design Principles and has a new Design Principle on observability, which highlights its importance and relevance throughout the pillar content. All 10 best practices in OPS05 have been updated, and we have consolidated 28 best practices into 16, across four questions (OPS04, OPS06, OPS08, and OPS09), as well as improving prescriptive guidance.

Security

In the Security Pillar, the Incident response in SEC10 underwent an update to align with the AWS Security Incident Response Guide, while introducing one new best practice, and improving the prescriptive guidance for others. Two best practices in SEC08 and SEC09 have received improved prescriptive guidance on securing workloads at rest and in transit.

Reliability

The Reliability Pillar has received prescriptive guidance improvements to one best practice in REL06, and six best practices in REL11, focused on how to best monitor, failover, remediate, and limit impacts of failures. The update addresses a wide variety of managed services and designs, including multi-Region-based resilience.

Performance Efficiency

The Performance Efficiency Pillar has been completely restructured, consolidating and merging guidance to reduce the number of best practices by 10 and the number of questions by three. We have added best practices around efficient caching and optimizing hardware acceleration. We have also improved the implementation guidance in all 32 best practices of the newly restructured Pillar.

Cost Optimization

The Cost Optimization Pillar has 10 best practices with improved implementation prescriptive guidance.

Sustainability

The Sustainability Pillar has received updates to the risk levels of seven best practices.

Conclusion

This Well-Architected release includes updates and improvements to 90 best practices: Operational Excellence (26), Security (8), Reliability (7), Performance Efficiency (32), Cost Optimization (10), and Sustainability (7). These changes are in addition to the 151 improved best practices released in 2023 (127 in April 10, 2023; and 24 in July 13, 2023), resulting in more than 73% of the existing Framework best practices updated at least once in the last year.

As of this release, 100% of Performance Efficiency, Cost Optimization, and Sustainability; 63% of Operational Excellence; 60% of Security; and 50% of Reliability Pillar content have been refreshed at least once since October 2022.

The content is available in 11 languages: English, Spanish, French, German, Italian, Japanese, Korean, Indonesian, Brazilian Portuguese, Simplified Chinese, and Traditional Chinese.

Updates in this release are also available in the AWS Well-Architected Tool, which can be used to review your workloads, address important design considerations, and help ensure that you follow the best practices and guidance of the AWS Well-Architected Framework.

Ready to get started? Review the updated AWS Well-Architected Framework Pillar best practices, as well as pillar-specific whitepapers.

Have questions about some of the new best practices or most recent updates? Join our growing community on AWS re:Post.

Apache Iceberg optimization: Solving the small files problem in Amazon EMR

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/apache-iceberg-optimization-solving-the-small-files-problem-in-amazon-emr/

In our previous post Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes, we discussed how you can implement solutions to improve operational efficiencies of your Amazon Simple Storage Service (Amazon S3) data lake that is using the Apache Iceberg open table format and running on the Amazon EMR big data platform. Iceberg tables store metadata in manifest files. As the number of data files increase, the amount of metadata stored in these manifest files also increases, leading to longer query planning time. The query runtime also increases because it’s proportional to the number of data or metadata file read operations. Compaction is the process of combining these small data and metadata files to improve performance and reduce cost. Compaction also gets rid of deleting files by applying deletes and rewriting a new file without deleting records. Currently, Iceberg provides a compaction utility that compacts small files at a table or partition level. But this approach requires you to implement the compaction job using your preferred job scheduler or manually triggering the compaction job.

In this post, we discuss the new Iceberg feature that you can use to automatically compact small files while writing data into Iceberg tables using Spark on Amazon EMR or Amazon Athena.

Use cases for processing small files

Streaming applications are prone to creating a large number of small files, which can negatively impact the performance of subsequent processing times. For example, consider a critical Internet of Things (IoT) sensor from a cold storage facility that is continuously sending temperature and health data into an S3 data lake for downstream data processing and triggering actions like emergency maintenance. Systems of this nature generate a huge number of small objects and need attention to compact them to a more optimal size for faster reading, such as 128 MB, 256 MB, or 512 MB. In this post, we show you a streaming sensor data use case with a large number of small files and the mitigation steps using the Iceberg open table format. For more information on streaming applications on AWS, refer to Real-time Data Streaming and Analytics.

Streaming Architecture

Solution overview

To compact the small files for improved performance, in this example, Amazon EMR triggers a compaction job after the write commit as a post-commit hook when defined thresholds (for example, number of commits) are met. By default, Amazon EMR waits for 10 commits to trigger the post-commit hook compaction utility.

This Iceberg event-based table management feature lets you monitor table activities during writes to make better decisions about how to manage each table differently based on events. As of this writing, only the optimize-data optimization is supported. To learn more about the available optimize data executors and catalog properties, refer to the README file in the GitHub repo.

To use the feature, you can use the iceberg-aws-event-based-table-management source code and provide the built JAR in the engine’s class-path. The following bootstrap action can place the JAR in the engine’s class-path:

sudo aws s3 cp s3://<path>/iceberg-aws-event-based-table-management-0.1.jar /usr/lib/spark/jars/

Note that the Iceberg AWS event-based table management feature works with Iceberg v1.2.0 and above (available from Amazon EMR 6.11.0).

In some use cases, you may want to run the event-based compaction jobs in a different EMR cluster in order to avoid any impact to the ETL jobs running in their current EMR cluster. You can get the metadata, including the cluster ID of your current ETL workflows, from the /mnt/var/lib/info/job-flow.json file and then use a different cluster to process the event-based compactions.

The notebook examples shown in the following sections are also available in the aws-samples GitHub repo.

Prerequisite

For this performance comparison exercise between a Spark external table and an Iceberg table and Iceberg with compaction, we generate a significant number of small files in Parquet format and store them in an S3 bucket. We used the Amazon Kinesis Data Generator (KDG) tool to generate sample sensor data information using the following template:

{"sensorId": {{random.number(5000)}},
 "currentTemperature": {{random.number(
        {
            "min":10,
            "max":150
        }
  )}},
 "status": "{{random.arrayElement(
        ["OK","FAIL","WARN"]
    )}}",
 "date_ts": "{{date.now("YYYY-MM-DD HH:mm:ss")}}"
}

We configured an Amazon Kinesis Data Firehose delivery stream and sent the generated data into a staging S3 bucket. Then we ran an AWS Glue extract, transform, and load (ETL) job to convert the JSON files into Parquet format. For our testing, we generated about 58,176 small objects with total size of 2 GB.

For running the Amazon EMR tests, we used Amazon EMR version emr-6.11.0 with Spark 3.3.2, and JupyterEnterpriseGateway 2.6.0. The cluster used had one primary node (r5.2xlarge) and two core nodes (r5.xlarge). We used a bootstrap action during cluster creation to enable event-based table management:

sudo aws s3 cp s3://<path>/iceberg-aws-event-based-table-management-0.1.jar /usr/lib/spark/jars/

Also, refer to our guidance on how to use an Iceberg cluster with Spark, which is a prerequisite for this exercise.

As part of the exercise, we see new steps are being added to the EMR cluster to trigger the compaction jobs. To enable adding new steps to the running cluster, we add the elasticmapreduce:AddJobFlowSteps action to the cluster’s default role, EMR_EC2_DefaultRole, as a prerequisite.

Performance of Iceberg reads with the compaction utility on Amazon EMR

In the following steps, we demonstrate how to use the compaction utility and what performance benefits you can achieve. We use an EMR notebook to demonstrate the benefits of the compaction utility. For instructions to set up an EMR notebook, refer to Amazon EMR Studio overview.

First, you configure your Spark session using the %%configure magic command. We use the Hive catalog for Iceberg tables.

  1. Before you run the following step, create an Amazon S3 bucket in your AWS account called <your-iceberg-storage-blog>. To check how to create an Amazon S3 bucket, follow the instructions given here. Update the your-iceberg-storage-blog bucket name in the following configuration with the actual bucket name you created to test this example:
    %%configure -f
    {
    "conf":{
        "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.dev.catalog-impl":"org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.dev.io-impl":"org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.dev.warehouse":"s3://<your-iceberg-storage-blog>/iceberg/"
        }
    }

  2. Create a new database for the Iceberg table in the AWS Glue Data Catalog named DB and provide the S3 URI specified in the Spark config as s3://<your-iceberg-storage-blog>/iceberg/db. Also, create another Database named iceberg_db in Glue for the parquet tables. Follow the instructions given in Working with databases on the AWS Glue console to create your Glue databases. Then create a new Spark table in Parquet format pointing to the bucket containing small objects in your AWS account. See the following code:
    spark.sql(""" CREATE TABLE iceberg_db.sensor_data_parquet_table (
        sensorid int,
        currenttemperature int,
        status string,
        date_ts timestamp)
    USING parquet
    location 's3://<your-bucket-with-parquet-files>/'
    """)

  3. Run an aggregate SQL to measure the performance of Spark SQL on the Parquet table with 58,176 small objects:
    spark.sql(""" select maxtemp, mintemp, avgtemp from
    (select
    max(currenttemperature) as maxtemp,
    min(currenttemperature) as mintemp,
    avg(currenttemperature) as avgtemp
    from iceberg_db.sensor_data_parquet_table
    where month(date_ts) between 2 and 10
    order by maxtemp, mintemp, avgtemp)""").show()

In the following steps, we create a new Iceberg table from the Spark/Parquet table using CTAS (Create Table As Select). Then we show how the automated compaction job can help improve query performance.

  1. Create a new Iceberg table using CTAS from the earlier AWS Glue table with the small files:
    spark.sql(""" CREATE TABLE dev.db.sensor_data_iceberg_format USING iceberg AS (SELECT * FROM iceberg_db.sensor_data_parquet_table)""")

  2. Validate that a new Iceberg snapshot was created for the new table:
    spark.sql(""" Select * from dev.db.sensor_data_iceberg_format.snapshots limit 5""").show()

We have confirmed that our S3 folder corresponds to the newly created Iceberg table. It shows that during the CTAS statement, it added 1,879 objects in the new folder with a total size of 1.3 GB. We can conclude that Iceberg did some optimization while loading data from the Parquet table.

  1. Now that you have data in the Iceberg table, run the previous aggregation SQL to check the runtime:
    spark.sql(""" select maxtemp, mintemp, avgtemp from
    (select
    max(currenttemperature) as maxtemp,
    min(currenttemperature) as mintemp,
    avg(currenttemperature) as avgtemp
    from dev.db.sensor_data_iceberg_format
    where month(date_ts) between 2 and 10
    order by maxtemp, mintemp, avgtemp)""").show()

The runtime for the preceding query ran on the Iceberg table with 1,879 objects in 1 minute, 39 seconds. There is already some significant performance improvement by converting the external Parquet table to an Iceberg table.

  1. Now let’s add the configurations needed to apply the automatic compaction of small files in the Iceberg tables. Note the last four newly added configurations in the following statement. The parameter optimize-data.commit-threshold suggests that the compaction will take place after the first successful commit. The default is 10 successful commits to trigger the compaction.
    %%configure -f
    {
    "conf":{
        "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.dev":"org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.dev.catalog-impl":"org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.dev.io-impl":"org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.dev.warehouse":"s3://<your-iceberg-storage-blog>/iceberg/",
        "spark.sql.catalog.dev.metrics-reporter-impl":"org.apache.iceberg.aws.manage.AwsTableManagementMetricsEvaluator",
        "spark.sql.catalog.dev.optimize-data.impl":"org.apache.iceberg.aws.manage.EmrOnEc2OptimizeDataExecutor",
        "spark.sql.catalog.dev.optimize-data.emr.cluster-id":"j-1N8J5NZI0KEU3",
        "spark.sql.catalog.dev.optimize-data.commit-threshold":"1"
        }
    }

  2. Run a quick sanity check to confirm that the configurations are working fine with Spark SQL.

  1. 10. To activate the automatic compaction process, add a new record to the existing Iceberg table using a Spark insert:
    spark.sql(""" Insert into dev.db.sensor_data_iceberg_format values(999123, 86, 'PASS', timestamp'2023-07-26 12:50:25') """)

  2. Navigate to the Amazon EMR console to check the cluster steps.

You should see a new step added that goes from Pending to Running and finally the Completed state. Every time the data in the Iceberg table is updated or inserted, based on configuration optimize-data.commit-threshold, the optimize job will automatically trigger to compact the underlying data.

  1. Validate that the record insert was successful.

  1. Check the snapshot table to see that a new snapshot is created for the table with the operation replace.

For every successful run of the background optimize job, a new entry will be added to the snapshot table.

  1. On the Amazon S3 console, navigate to the folder corresponding to the Iceberg table and see that the data files are compacted.

In our case, it was compacted from the previous smaller sizes to approximately 437 MB. The folder will still contain the previous smaller files for time travel unless you issue an expire snapshot command to remove them.

  1. Now you can run the same aggregate query and record the performance after the compaction.

Summary of Amazon EMR testing

The runtime for the preceding aggregation query on the compacted Iceberg table reduced to approximately 59 seconds from the previous runtime of 1 minute, 39 seconds. That is about a 40% improvement. The more small files you have in your source bucket, the bigger performance boost you can achieve with this post-hook compaction implementation. The examples shown in this blog were executed in a small Amazon EMR cluster with only two core nodes (r5.xlarge). To improve the performance of your Spark applications, Amazon EMR provides multiple optimization features that you can implement for your production workloads.

Performance of Iceberg reads with the compaction utility on Athena

To manage the Iceberg table based on events, you can start the Spark 3.3 SQL shell as shown in the following code. Make sure that the athena:StartQueryExecution and athena:GetQueryExecution permission policies are enabled.

spark-sql --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
          --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
          --conf spark.sql.catalog.my_catalog.warehouse=<s3-bucket> \
          --conf spark.sql.catalog.my_catalog.metrics-reporter-impl=org.apache.iceberg.aws.manage.AwsTableManagementMetricsEvaluator \
          --conf spark.sql.catalog.my_catalog.optimize-data.impl=org.apache.iceberg.aws.manage.AthenaOptimizeDataExecutor \
          --conf spark.sql.catalog.my_catalog.optimize-data.athena.output-bucket=<s3-bucket>

Clean up

After you complete the test, clean up your resources to avoid any recurring costs:

  1. Delete the S3 buckets that you created for this test.
  2. Delete the EMR cluster.
  3. Stop and delete the EMR notebook instance.

Conclusion

In this post, we showed how Iceberg event-based table management lets you manage each table differently based on events and compact small files to boost application performance. This event-based process significantly reduces the operational overhead of using the Iceberg rewrite_data_files procedure, which needs manual or scheduled operation.

To learn more about Apache Iceberg and implement this open table format for your transactional data lake use cases, refer to the following resources:


About the Authors

Avijit Goswami is a Principal Solutions Architect at AWS specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike, watch sports, and listen to music.

Rajarshi Sarkar is a Software Development Engineer at Amazon EMR/Athena. He works on cutting-edge features of Amazon EMR/Athena and is also involved in open-source projects such as Apache Iceberg and Trino. In his spare time, he likes to travel, watch movies, and hang out with friends.

Build and deploy to Amazon EKS with Amazon CodeCatalyst

Post Syndicated from Vineeth Nair original https://aws.amazon.com/blogs/devops/build-and-deploy-to-amazon-eks-with-amazon-codecatalyst/

Amazon CodeCatalyst is an integrated service for software development teams adopting continuous integration and deployment (CI/CD) practices into their software development process. CodeCatalyst puts all of the tools that development teams need in one place, allowing for a unified experience for collaborating on, building, and releasing software. You can also integrate AWS resources with your projects by connecting your AWS accounts to your CodeCatalyst space. By managing all of the stages and aspects of your application lifecycle in one tool, you can deliver software quickly and confidently.

Introduction

Containerization has revolutionized the way we develop, deploy, and scale applications. With the rise of managed container services like Amazon Elastic Kubernetes Service (EKS), developers can leverage the power of Kubernetes without worrying about the underlying infrastructure. In this post, we will focus on how DevOps teams can use CodeCatalyst to build and deploy applications to EKS clusters.

CodeCatalyst offers a collection of pre-built actions that encapsulate common container-related tasks such as building and pushing a container image to an ECR and deploying a Kubernetes manifest. In this walkthrough, we will leverage two actions that can greatly simplify the container build and deployment process. We start by building a simple container image with the ‘Push to Amazon ECR’ action from CodeCatalyst labs. This action simplifies the process of building, tagging and pushing an image to an Amazon Elastic Container Registry (ECR). We will also utilize the ‘Deploy to Kubernetes cluster’ action from AWS for pushing our Kubernetes manifests with our updated image.

Architecture diagram demonstrating how a developer uses Cloud9 and a repository to store code, then pushes the image to Amazon ECR and deploys it to Amazon EKS.
Figure 1: Architectural Diagram.

Prerequisites

To follow along with the post, you will need the following items:

Walkthrough

In this walkthrough, we will build a simple Nginx based application and push this to an ECR, we will then build and deploy this image to an EKS cluster. The emphasis of this post, will be on how to translate a fairly common pattern with microservices applications to a CodeCatalyst workflow. At the end of the post, our workflow will look like so:

The image shows how codecatalyst worflow configured. 1st stage is pusing the Image to Amazon ECR followed by Deploy to Amazon EKS
Figure 2: CodeCatalyst workflow.

Create the base workflow

To begin, we will create our workflow, in the CodeCatalyst project, Select CI/CD → Workflows → Create workflow:

Image shows how to create workflow which has Source repository and Branch to be selected from drop down option
Figure 3: Create workflow.

Leave the defaults for the Source Repository and Branch, select Create. We will have an empty workflow:

Image shows how an emapy workflow looks like. It doesn't have any sptes configured which need to be added in this file based on our requirement.
Figure 4: Empty workflow.

We can edit the workflow from within the CodeCatalyst console, or use a Dev Environment. We will create an initial commit of this workflow file, ignore any validation errors at this stage:

Image shows how to create a Dev environment. User need to provide a workflow name, commit message along with Repository name and Branch name.
Figure 5: Creating Dev environment.

Connect to CodeCatalyst Dev Environment

For this post, we will use an AWS Cloud9 Dev Environment. Our first step is to connect to the Dev environment. Select Code → Dev Environments. If you do not already a Dev Instance, you can create an instance by selecting Create Dev Environment.

Image shows how a Dev environment looks like. The environment we created in the previous step, will be listed here.
Figure 6: My Dev environment.

Create a CodeCatalyst secret

Prior to adding the code, we will add a CodeCatalyst secret that will be consumed by our workflow. Using CodeCatalyst secrets ensures that we do not store sensitive data in plaintext in our workflow file. To create the secrets in the CodeCatalyst console, browse to CICD -> Secrets. Select Create Secret with the following details:

The image shows how we can create a secret. We need to pass Name and Value along with option desctption to create a secret.
Figure 7: Adding secrets.

Name: eks_cluster_name

Value: <Your EKS Cluster name>

Connect to the CodeCatalyst Dev Environment

We already have a Dev Environment so we will select Resume Instance. A new browser tab opens for the IDE and will be available in less than a minute. Once the IDE is ready, we can go ahead and start creating the Dockerfile and Kubernetes manifest that make up our application

mkdir WebApp
cat <<EOF > WebApp/Dockerfile
FROM nginx
RUN apt-get update && apt-get install -y curl
EOF

The previous command block creates our Dockerfile, which we will build in our CodeCatalyst workflow from an Nginx base image and installs cURL. Next, we will add our Kubernetes manifest file to create a Kubernetes deployment and service for our application:

Create a directory called Manifests and a file inside the directory called demo-app.yaml. Update the file with code for deployment and Kubernetes Service.

The image shows the code structure of demo-app.yaml file
Figure 8: demo-app.yaml file.

The previous code block shows the Kubernetes manifest file for our deployment, along with a Kubernetes service. We modify the image value to include the URI for our ECR as this value is unique. Once we have created our Dockerfile and Kubernetes manifest, pull the latest changes to our repository, including our workflow file that we just created. In our environment, our repository is called eks-demo-app:

cd eks-demo-app && git pull

We can now edit this file in our IDE. In our example our workflow is Workflow_df84 , we will locate Workflow_df84.yaml in the .codecatalyst\workflows directory in our repository. From here we can double click on the file to launch in the IDE for editing:

Image shows empty workflow yaml file.
Figure 9: workflow file in yaml format.

Add the build steps to workflow

We can assign our workflow a name and configure the action for our build phase. The code outlined in the following diagram is our CodeCatalyst workflow definition

The image shows updated workflow file which has triggers and actions filled.
Figure 10: Workflow updated with build phase.

Kustomize starts from here

The image shows Kustomize steps added in the workflow file.
Figure 11: Workflow updated with Kustomize.

Deployment starts from here

The image shows deployment stage added in the workflow file.
Figure 12: Workflow updated with Deployment phase.

The workflow will now contain two CodeCatalyst actions – PushtoAmazonECR which builds and pushes our container image to the ECR. We have also added a dependent stage DeploytoKubernetesCluster which deploys our Kubernetes manifest.

To save our changes we select File -> Save, we can then commit these to our git repository by typing the following at the terminal:

git add . && git commit -m ‘adding workflow’ && git push

The previous command will commit and push our changes the CodeCatalyst source repository, as we have a branch trigger for main defined, this will trigger a run of the workflow. We can monitor the status of the workflow in the CodeCatalyst console by selecting CICD -> Workflows. Locate your workflow and click on Runs to view the status.

We will now have all two stages available, as depicted at the beginning of this walkthrough. We will now have a container image in our ECR along with the newly built image deployed to our EKS cluster.

Cleaning up

If you have been following along with this workflow, you should delete the resources that you have deployed to avoid further changes. First, delete the Amazon ECR repository and Amazon EKS cluster (along with associated IAM roles) using the AWS console. Second, delete the CodeCatalyst project by navigating to project settings and choosing to Delete Project.

Conclusion

In this post, we explained how teams can easily get started building, scanning, and deploying a microservice application to an EKS cluster using CodeCatalyst. We outlined the stages in our workflow that enabled us to achieve the end-to-end build and release cycle. We also demonstrated how to enhance the developer experience of integrating CodeCatalyst with our Cloud9 Dev Environment.

Call to Action

Learn more about CodeCatalyst here.

Vineeth Nair

Vineeth Nair is a DevOps Architect at Amazon Web Services (AWS), Professional Services. He collaborates closely with AWS customers to support and accelerate their journeys to the cloud and within the cloud ecosystem by building performant, resilient, scalable, secure and cost efficient solutions.

Richard Merritt

Richard Merritt is a DevOps Consultant at Amazon Web Services (AWS), Professional Services. He works with AWS customers to accelerate their journeys to the cloud by providing scalable, secure and robust DevOps solutions.

Process and analyze highly nested and large XML files using AWS Glue and Amazon Athena

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/process-and-analyze-highly-nested-and-large-xml-files-using-aws-glue-and-amazon-athena/

In today’s digital age, data is at the heart of every organization’s success. One of the most commonly used formats for exchanging data is XML. Analyzing XML files is crucial for several reasons. Firstly, XML files are used in many industries, including finance, healthcare, and government. Analyzing XML files can help organizations gain insights into their data, allowing them to make better decisions and improve their operations. Analyzing XML files can also help in data integration, because many applications and systems use XML as a standard data format. By analyzing XML files, organizations can easily integrate data from different sources and ensure consistency across their systems, However, XML files contain semi-structured, highly nested data, making it difficult to access and analyze information, especially if the file is large and has complex, highly nested schema.

XML files are well-suited for applications, but they may not be optimal for analytics engines. In order to enhance query performance and enable easy access in downstream analytics engines such as Amazon Athena, it’s crucial to preprocess XML files into a columnar format like Parquet. This transformation allows for improved efficiency and usability in analytics workflows. In this post, we show how to process XML data using AWS Glue and Athena.

Solution overview

We explore two distinct techniques that can streamline your XML file processing workflow:

  • Technique 1: Use an AWS Glue crawler and the AWS Glue visual editor – You can use the AWS Glue user interface in conjunction with a crawler to define the table structure for your XML files. This approach provides a user-friendly interface and is particularly suitable for individuals who prefer a graphical approach to managing their data.
  • Technique 2: Use AWS Glue DynamicFrames with inferred and fixed schemas – The crawler has a limitation when it comes to processing a single row in XML files larger than 1 MB. To overcome this restriction, we use an AWS Glue notebook to construct AWS Glue DynamicFrames, utilizing both inferred and fixed schemas. This method ensures efficient handling of XML files with rows exceeding 1 MB in size.

In both approaches, our ultimate goal is to convert XML files into Apache Parquet format, making them readily available for querying using Athena. With these techniques, you can enhance the processing speed and accessibility of your XML data, enabling you to derive valuable insights with ease.

Prerequisites

Before you begin this tutorial, complete the following prerequisites (these apply to both techniques):

  1. Download the XML files technique1.xml and technique2.xml.
  2. Upload the files to an Amazon Simple Storage Service (Amazon S3) bucket. You can upload them to the same S3 bucket in different folders or to different S3 buckets.
  3. Create an AWS Identity and Access Management (IAM) role for your ETL job or notebook as instructed in Set up IAM permissions for AWS Glue Studio.
  4. Add an inline policy to your role with the iam:PassRole action:
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["iam:PassRole"],
      "Effect": "Allow",
      "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": ["glue.amazonaws.com"]
        }
      }
    }
}
  1. Add a permissions policy to the role with access to your S3 bucket.

Now that we’re done with the prerequisites, let’s move on to implementing the first technique.

Technique 1: Use an AWS Glue crawler and the visual editor

The following diagram illustrates the simple architecture that you can use to implement the solution.

Processing and Analyzing XML file using AWS Glue and Amazon Athena

To analyze XML files stored in Amazon S3 using AWS Glue and Athena, we complete the following high-level steps:

  1. Create an AWS Glue crawler to extract XML metadata and create a table in the AWS Glue Data Catalog.
  2. Process and transform XML data into a format (like Parquet) suitable for Athena using an AWS Glue extract, transform, and load (ETL) job.
  3. Set up and run an AWS Glue job via the AWS Glue console or the AWS Command Line Interface (AWS CLI).
  4. Use the processed data (in Parquet format) with Athena tables, enabling SQL queries.
  5. Use the user-friendly interface in Athena to analyze the XML data with SQL queries on your data stored in Amazon S3.

This architecture is a scalable, cost-effective solution for analyzing XML data on Amazon S3 using AWS Glue and Athena. You can analyze large datasets without complex infrastructure management.

We use the AWS Glue crawler to extract XML file metadata. You can choose the default AWS Glue classifier for general-purpose XML classification. It automatically detects XML data structure and schema, which is useful for common formats.

We also use a custom XML classifier in this solution. It’s designed for specific XML schemas or formats, allowing precise metadata extraction. This is ideal for non-standard XML formats or when you need detailed control over classification. A custom classifier ensures only necessary metadata is extracted, simplifying downstream processing and analysis tasks. This approach optimizes the use of your XML files.

The following screenshot shows an example of an XML file with tags.

Create a custom classifier

In this step, you create a custom AWS Glue classifier to extract metadata from an XML file. Complete the following steps:

  1. On the AWS Glue console, under Crawlers in the navigation pane, choose Classifiers.
  2. Choose Add classifier.
  3. Select XML as the classifier type.
  4. Enter a name for the classifier, such as blog-glue-xml-contact.
  5. For Row tag, enter the name of the root tag that contains the metadata (for example, metadata).
  6. Choose Create.

Create an AWS Glue Crawler to crawl xml file

In this section, we are creating a Glue Crawler to extract the metadata from XML file using the customer classifier created in previous step.

Create a database

  1. Go to the AWS Glue console, choose Databases in the navigation pane.
  2. Click on Add database.
  3. Provide a name such as blog_glue_xml
  4. Choose Create Database

Create a Crawler

Complete the following steps to create your first crawler:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. On the Set crawler properties page, provide a name for the new crawler (such as blog-glue-parquet), then choose Next.
  4. On the Choose data sources and classifiers page, select Not Yet under Data source configuration.
  5. Choose Add a data store.
  6. For S3 path, browse to s3://${BUCKET_NAME}/input/geologicalsurvey/.

Make sure you pick the XML folder rather than the file inside the folder.

  1. Leave the rest of the options as default and choose Add an S3 data source.
  2. Expand Custom classifiers – optional, choose blog-glue-xml-contact, then choose Next and keep the rest of the options as default.
  3. Choose your IAM role or choose Create new IAM role, add the suffix glue-xml-contact (for example, AWSGlueServiceNotebookRoleBlog), and choose Next.
  4. On the Set output and scheduling page, under Output configuration, choose blog_glue_xml for Target database.
  5. Enter console_ as the prefix added to tables (optional) and under Crawler schedule, keep the frequency set to On demand.
  6. Choose Next.
  7. Review all the parameters and choose Create crawler.

Run the Crawler

After you create the crawler, complete the following steps to run it:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Open the crawler you created and choose Run.

The crawler will take 1–2 minutes to complete.

  1. When the crawler is complete, choose Databases in the navigation pane.
  2. Choose the database you crated and choose the table name to see the schema extracted by the crawler.

Create an AWS Glue job to convert the XML to Parquet format

In this step, you create an AWS Glue Studio job to convert the XML file into a Parquet file. Complete the following steps:

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Under Create job, select Visual with a blank canvas.
  3. Choose Create.
  4. Rename the job to blog_glue_xml_job.

Now you have a blank AWS Glue Studio visual job editor. On the top of the editor are the tabs for different views.

  1. Choose the Script tab to see an empty shell of the AWS Glue ETL script.

As we add new steps in the visual editor, the script will be updated automatically.

  1. Choose the Job details tab to see all the job configurations.
  2. For IAM role, choose AWSGlueServiceNotebookRoleBlog.
  3. For Glue version, choose Glue 4.0 – Support Spark 3.3, Scala 2, Python 3.
  4. Set Requested number of workers to 2.
  5. Set Number of retries to 0.
  6. Choose the Visual tab to go back to the visual editor.
  7. On the Source drop-down menu, choose AWS Glue Data Catalog.
  8. On the Data source properties – Data Catalog tab, provide the following information:
    1. For Database, choose blog_glue_xml.
    2. For Table, choose the table that starts with the name console_ that the crawler created (for example, console_geologicalsurvey).
  9. On the Node properties tab, provide the following information:
    1. Change Name to geologicalsurvey dataset.
    2. Choose Action and the transformation Change Schema (Apply Mapping).
    3. Choose Node properties and change the name of the transform from Change Schema (Apply Mapping) to ApplyMapping.
    4. On the Target menu, choose S3.
  10. On the Data source properties – S3 tab, provide the following information:
    1. For Format, select Parquet.
    2. For Compression Type, select Uncompressed.
    3. For S3 source type, select S3 location.
    4. For S3 URL, enter s3://${BUCKET_NAME}/output/parquet/.
    5. Choose Node Properties and change the name to Output.
  11. Choose Save to save the job.
  12. Choose Run to run the job.

The following screenshot shows the job in the visual editor.

Create an AWS Gue Crawler to crawl the Parquet file

In this step, you create an AWS Glue crawler to extract metadata from the Parquet file you created using an AWS Glue Studio job. This time, you use the default classifier. Complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. On the Set crawler properties page, provide a name for the new crawler, such as blog-glue-parquet-contact, then choose Next.
  4. On the Choose data sources and classifiers page, select Not Yet for Data source configuration.
  5. Choose Add a data store.
  6. For S3 path, browse to s3://${BUCKET_NAME}/output/parquet/.

Make sure you pick the parquet folder rather than the file inside the folder.

  1. Choose your IAM role created during the prerequisite section or choose Create new IAM role (for example, AWSGlueServiceNotebookRoleBlog), and choose Next.
  2. On the Set output and scheduling page, under Output configuration, choose blog_glue_xml for Database.
  3. Enter parquet_ as the prefix added to tables (optional) and under Crawler schedule, keep the frequency set to On demand.
  4. Choose Next.
  5. Review all the parameters and choose Create crawler.

Now you can run the crawler, which takes 1–2 minutes to complete.

You can preview the newly created schema for the Parquet file in the AWS Glue Data Catalog, which is similar to the schema of the XML file.

We now possess data that is suitable for use with Athena. In the next section, we perform data queries using Athena.

Query the Parquet file using Athena

Athena doesn’t support querying the XML file format, which is why you converted the XML file into Parquet for more efficient data querying and use dot notation to query complex types and nested structures.

The following example code uses dot notation to query nested data:

SELECT 
    idinfo.citation.citeinfo.origin,
    idinfo.citation.citeinfo.pubdate,
    idinfo.citation.citeinfo.title,
    idinfo.citation.citeinfo.geoform,
    idinfo.citation.citeinfo.pubinfo.pubplace,
    idinfo.citation.citeinfo.pubinfo.publish,
    idinfo.citation.citeinfo.onlink,
    idinfo.descript.abstract,
    idinfo.descript.purpose,
    idinfo.descript.supplinf,
    dataqual.attracc.attraccr, 
    dataqual.logic,
    dataqual.complete,
    dataqual.posacc.horizpa.horizpar,
    dataqual.posacc.vertacc.vertaccr,
    dataqual.lineage.procstep.procdate,
    dataqual.lineage.procstep.procdesc
FROM "blog_glue_xml"."parquet_parquet" limit 10;

Now that we’ve completed technique 1, let’s move on to learn about technique 2.

Technique 2: Use AWS Glue DynamicFrames with inferred and fixed schemas

In the previous section, we covered the process of handling a small XML file using an AWS Glue crawler to generate a table, an AWS Glue job to convert the file into Parquet format, and Athena to access the Parquet data. However, the crawler encounters limitations when it comes to processing XML files that exceed 1 MB in size. In this section, we delve into the topic of batch processing larger XML files, necessitating additional parsing to extract individual events and conduct analysis using Athena.

Our approach involves reading the XML files through AWS Glue DynamicFrames, employing both inferred and fixed schemas. Then we extract the individual events in Parquet format using the relationalize transformation, enabling us to query and analyze them seamlessly using Athena.

To implement this solution, you complete the following high-level steps:

  1. Create an AWS Glue notebook to read and analyze the XML file.
  2. Use DynamicFrames with InferSchema to read the XML file.
  3. Use the relationalize function to unnest any arrays.
  4. Convert the data to Parquet format.
  5. Query the Parquet data using Athena.
  6. Repeat the previous steps, but this time pass a schema to DynamicFrames instead of using InferSchema.

The electric vehicle population data XML file has a response tag at its root level. This tag contains an array of row tags, which are nested within it. The row tag is an array that contains a set of another row tags, which provide information about a vehicle, including its make, model, and other relevant details. The following screenshot shows an example.

Create an AWS Glue Notebook

To create an AWS Glue notebook, complete the following steps:

  1. Open the AWS Glue Studio console, choose Jobs in the navigation pane.
  2. Select Jupyter Notebook and choose Create.

  1. Enter a name for your AWS Glue job, such as blog_glue_xml_job_Jupyter.
  2. Choose the role that you created in the prerequisites (AWSGlueServiceNotebookRoleBlog).

The AWS Glue notebook comes with a preexisting example that demonstrates how to query a database and write the output to Amazon S3.

  1. Adjust the timeout (in minutes) as shown in the following screenshot and run the cell to create the AWS Glue interactive session.

Create basic Variables

After you create the interactive session, at the end of the notebook, create a new cell with the following variables (provide your own bucket name):

BUCKET_NAME='YOUR_BUCKET_NAME'
S3_SOURCE_XML_FILE = f's3://{BUCKET_NAME}/xml_dataset/'
S3_TEMP_FOLDER = f's3://{BUCKET_NAME}/temp/'
S3_OUTPUT_INFER_SCHEMA = f's3://{BUCKET_NAME}/infer_schema/'
INFER_SCHEMA_TABLE_NAME = 'infer_schema'
S3_OUTPUT_NO_INFER_SCHEMA = f's3://{BUCKET_NAME}/no_infer_schema/'
NO_INFER_SCHEMA_TABLE_NAME = 'no_infer_schema'
DATABASE_NAME = 'blog_xml'

Read the XML file inferring the schema

If you don’t pass a schema to the DynamicFrame, it will infer the schema of the files. To read the data using a dynamic frame, you can use the following command:

df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [S3_SOURCE_XML_FILE]},
    format="xml",
    format_options={"rowTag": "response"},
)

Print the DynamicFrame Schema

Print the schema with the following code:

df.printSchema()

The schema shows a nested structure with a row array containing multiple elements. To unnest this structure into lines, you can use the AWS Glue relationalize transformation:

df_relationalized = df.relationalize(
    "root", S3_TEMP_FOLDER
)

We are only interested in the information contained within the row array, and we can view the schema by using the following command:

df_relationalized.select("root_row.row").printSchema()

The column names contain row.row, which correspond to the array structure and array column in the dataset. We don’t rename the columns in this post; for instructions to do so, refer to Automate dynamic mapping and renaming of column names in data files using AWS Glue: Part 1. Then you can convert the data to Parquet format and create the AWS Glue table using the following command:


s3output = glueContext.getSink(
  path= S3_OUTPUT_INFER_SCHEMA,
  connection_type="s3",
  updateBehavior="UPDATE_IN_DATABASE",
  partitionKeys=[],
  compression="snappy",
  enableUpdateCatalog=True,
  transformation_ctx="s3output",
)
s3output.setCatalogInfo(
  catalogDatabase="blog_xml", catalogTableName="jupyter_notebook_with_infer_schema"
)
s3output.setFormat("glueparquet")
s3output.writeFrame(df_relationalized.select("root_row.row"))

AWS Glue DynamicFrame provides features that you can use in your ETL script to create and update a schema in the Data Catalog. We use the updateBehavior parameter to create the table directly in the Data Catalog. With this approach, we don’t need to run an AWS Glue crawler after the AWS Glue job is complete.

Read the XML file by setting a schema

An alternative way to read the file is by predefining a schema. To do this, complete the following steps:

  1. Import the AWS Glue data types:
    from awsglue.gluetypes import *

  2. Create a schema for the XML file:
    schema = StructType([ 
      Field("row", StructType([
        Field("row", ArrayType(StructType([
                Field("_2020_census_tract", LongType()),
                Field("__address", StringType()),
                Field("__id", StringType()),
                Field("__position", IntegerType()),
                Field("__uuid", StringType()),
                Field("base_msrp", IntegerType()),
                Field("cafv_type", StringType()),
                Field("city", StringType()),
                Field("county", StringType()),
                Field("dol_vehicle_id", IntegerType()),
                Field("electric_range", IntegerType()),
                Field("electric_utility", StringType()),
                Field("ev_type", StringType()),
                Field("geocoded_column", StringType()),
                Field("legislative_district", IntegerType()),
                Field("make", StringType()),
                Field("model", StringType()),
                Field("model_year", IntegerType()),
                Field("state", StringType()),
                Field("vin_1_10", StringType()),
                Field("zip_code", IntegerType())
        ])))
      ]))
    ])

  3. Pass the schema when reading the XML file:
    df = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [S3_SOURCE_XML_FILE]},
        format="xml",
        format_options={"rowTag": "response", "withSchema": json.dumps(schema.jsonValue())},
    )

  4. Unnest the dataset like before:
    df_relationalized = df.relationalize(
        "root", S3_TEMP_FOLDER
    )

  5. Convert the dataset to Parquet and create the AWS Glue table:
    s3output = glueContext.getSink(
      path=S3_OUTPUT_NO_INFER_SCHEMA,
      connection_type="s3",
      updateBehavior="UPDATE_IN_DATABASE",
      partitionKeys=[],
      compression="snappy",
      enableUpdateCatalog=True,
      transformation_ctx="s3output",
    )
    s3output.setCatalogInfo(
      catalogDatabase="blog_xml", catalogTableName="jupyter_notebook_no_infer_schema"
    )
    s3output.setFormat("glueparquet")
    s3output.writeFrame(df_relationalized.select("root_row.row"))

Query the tables using Athena

Now that we have created both tables, we can query the tables using Athena. For example, we can use the following query:

SELECT * FROM "blog_xml"."jupyter_notebook_no_infer_schema " limit 10;

The following screenshot shows the results.

Clean Up

In this post, we created an IAM role, an AWS Glue Jupyter notebook, and two tables in the AWS Glue Data Catalog. We also uploaded some files to an S3 bucket. To clean up these objects, complete the following steps:

  1. On the IAM console, delete the role you created.
  2. On the AWS Glue Studio console, delete the custom classifier, crawler, ETL jobs, and Jupyter notebook.
  3. Navigate to the AWS Glue Data Catalog and delete the tables you created.
  4. On the Amazon S3 console, navigate to the bucket you created and delete the folders named temp, infer_schema, and no_infer_schema.

Key Takeaways

In AWS Glue, there’s a feature called InferSchema in AWS Glue DynamicFrames. It automatically figures out the structure of a data frame based on the data it contains. In contrast, defining a schema means explicitly stating how the data frame’s structure should be before loading the data.

XML, being a text-based format, doesn’t restrict the data types of its columns. This can cause issues with the InferSchema function. For example, in the first run, a file with column A having a value of 2 results in a Parquet file with column A as an integer. In the second run, a new file has column A with the value C, leading to a Parquet file with column A as a string. Now there are two files on S3, each with a column A of different data types, which can create problems downstream.

The same happens with complex data types like nested structures or arrays. For example, if a file has one tag entry called transaction, it’s inferred as a struct. But if another file has the same tag, it’s inferred as an array

Despite these data type issues, InferSchema is useful when you don’t know the schema or defining one manually is impractical. However, it’s not ideal for large or constantly changing datasets. Defining a schema is more precise, especially with complex data types, but has its own issues, like requiring manual effort and being inflexible to data changes.

InferSchema has limitations, like incorrect data type inference and issues with handling null values. Defining a schema also has limitations, like manual effort and potential errors.

Choosing between inferring and defining a schema depends on the project’s needs. InferSchema is great for quick exploration of small datasets, whereas defining a schema is better for larger, complex datasets requiring accuracy and consistency. Consider the trade-offs and constraints of each method to pick what suits your project best.

Conclusion

In this post, we explored two techniques for managing XML data using AWS Glue, each tailored to address specific needs and challenges you may encounter.

Technique 1 offers a user-friendly path for those who prefer a graphical interface. You can use an AWS Glue crawler and the visual editor to effortlessly define the table structure for your XML files. This approach simplifies the data management process and is particularly appealing to those looking for a straightforward way to handle their data.

However, we recognize that the crawler has its limitations, specifically when dealing with XML files having rows larger than 1 MB. This is where technique 2 comes to the rescue. By harnessing AWS Glue DynamicFrames with both inferred and fixed schemas, and employing an AWS Glue notebook, you can efficiently handle XML files of any size. This method provides a robust solution that ensures seamless processing even for XML files with rows exceeding the 1 MB constraint.

As you navigate the world of data management, having these techniques in your toolkit empowers you to make informed decisions based on the specific requirements of your project. Whether you prefer the simplicity of technique 1 or the scalability of technique 2, AWS Glue provides the flexibility you need to handle XML data effectively.


About the Authors

Navnit Shuklaserves as an AWS Specialist Solution Architect with a focus on Analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled “Data Wrangling on AWS.

Patrick Muller works as a Senior Data Lab Architect at AWS. His main responsibility is to assist customers in turning their ideas into a production-ready data product. In his free time, Patrick enjoys playing soccer, watching movies, and traveling.

Amogh Gaikwad is a Senior Solutions Developer at Amazon Web Services. He helps global customers build and deploy AI/ML solutions on AWS. His work is mainly focused on computer vision, and natural language processing and helping customers optimize their AI/ML workloads for sustainability. Amogh has received his master’s in Computer Science specializing in Machine Learning.

Sheela Sonone is a Senior Resident Architect at AWS. She helps AWS customers make informed choices and tradeoffs about accelerating their data, analytics, and AI/ML workloads and implementations. In her spare time, she enjoys spending time with her family – usually on tennis courts.

Deploy AWS WAF faster with Security Automations

Post Syndicated from Harith Gaddamanugu original https://aws.amazon.com/blogs/security/deploy-aws-managed-rules-using-security-automations-for-aws-waf/

You can now deploy AWS WAF managed rules as part of the Security Automations for AWS WAF solution. In this post, we show you how to get started and set up monitoring for this automated solution with additional recommendations.

This article discusses AWS WAF, a service that assists you in protecting against typical web attacks and bots that might disrupt availability, compromise security, or consume excessive resources. As requests for your websites are received by the underlying service, they’re forwarded to AWS WAF for inspection against your rules. AWS WAF informs the underlying service to either block, allow, or take another configured action when a request fulfills the criteria stated in your rules. AWS WAF is tightly integrated with Amazon CloudFront, Application Load Balancer (ALB), Amazon API Gateway, and AWS AppSync—all of which are routinely used by AWS customers to provide content for their websites and applications.

To provide a simple, purpose-driven deployment approach, our solutions builder teams developed Security Automations for AWS WAF, a solution that can help organizations that don’t have dedicated security teams to quickly deploy an AWS WAF that filters common web-based malicious activity. Security Automations for AWS WAF deploys a set of preconfigured rules to help you protect your applications from common web exploits.

This solution can be installed in your AWS accounts by launching the provided AWS CloudFormation template.

Security Automations for AWS WAF provides the following features and benefits:

  • Helps secure your web applications with AWS managed rule groups
  • Provide layer 7 flood protection with a predefined HTTP flood custom rule
  • Helps block exploitation of vulnerabilities with a predefined scanners and probes custom rule
  • Detect and deflect intrusion from bots with a honeypot endpoint using a bad bot custom rule
  • Helps block malicious IP addresses based on AWS and external IP reputation lists
  • Building a monitoring dashboard with Amazon CloudWatch
  • Integration with AWS Service Catalog AppRegistry and AWS Systems Manager Application Manager
Figure 1: Design overview of the new Security Automations for AWS WAF solution

Figure 1: Design overview of the new Security Automations for AWS WAF solution

Getting started

Many customers begin their proofs of concept (POC) by using the AWS Management Console for AWS WAF to set up their very first AWS WAF, but quickly realize the benefits of automation, such as increased productivity, enforcing best practices, avoiding repetition, and so on. Manually managing AWS WAF can be time-consuming, especially if you want to duplicate complicated automations across multiple environments.

You can deploy this solution for new and existing supported AWS WAF resources. The implementation guide discusses architectural considerations, configuration steps, and operational best practices for deploying this solution in the AWS Cloud. It includes links to AWS CloudFormation templates and stacks that launch, configure, and run the AWS security, compute, storage, and other services required to deploy this solution on AWS, using AWS best practices for security and availability.

Before you launch the CloudFormation template, review the architecture and configuration considerations discussed in this guide. The template takes about 15 minutes to deploy and includes three basic steps:

Step 1. Launch the stack

  1. Launch the CloudFormation template into your AWS account and select the desired AWS Region.
  2. Enter values for the required parameters: Stack name and Application access log bucket name.
  3. Review the other template parameters and adjust if necessary.

Step 2. Associate the web ACL with your web application

Associate your CloudFront web distributions or ALBs with the web ACL that this solution generates. You can associate as many distributions or load balancers as you want.

Step 3. Configure web access logging

Turn on web access logging for your CloudFront web distributions or ALBs, and send the log files to the appropriate Amazon Simple Storage Service (Amazon S3) bucket. Save the logs in a folder matching the user-defined prefix. If no user-defined prefix is used, save the logs to AWSLogs (default log prefix AWSLogs/).

Customize the solution

This solution provides an example of how to use AWS WAF and other services to build security automations on the AWS Cloud. You can download the open source code from GitHub to apply customizations or build your own security automations that fit your needs. The solution builder team is planning to release a Terraform version for this solution in the near future.

Monitor the solution

This solution includes a Service Catalog AppRegistry resource to register the CloudFormation template and underlying resources as an application in both the Service Catalog AppRegistry and Systems Manager Application Manager. You can monitor the costs and operations data in the Systems Manager console, as shown in Figure 2 that follows.

Figure 2: Example of the application view for the Security Automations for AWS WAF stack in Application Manager

Figure 2: Example of the application view for the Security Automations for AWS WAF stack in Application Manager

CloudWatch dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, including visualizing AWS WAF logs as shown in Figure 3 that follows. The solution creates a simple dashboard that you can customize to monitor additional metrics, alarms and logs. If suspicious activity is reported, you can use the visuals to understand the traffic in more detail and drive incident response actions as needed. From here, you can investigate further by using specific queries with CloudWatch Logs Insights.

Figure 3: Example of an enhanced AWS WAF CloudWatch dashboard that can be built for monitoring your site traffic

Figure 3: Example of an enhanced AWS WAF CloudWatch dashboard that can be built for monitoring your site traffic

Conclusion

In this post, you learned about using the AWS Security Automation template to quickly deploy AWS WAF. If you prefer a simpler solution, we recommend using the one-click CloudFront AWS WAF setup, which offers a simple way to deploy AWS WAF for your CloudFront distribution. By choosing the approach that aligns with your requirements, you can enhance the security of your web applications and safeguard them against potential threats.

For more solutions, visit the AWS Solutions Library.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Harith Gaddamanugu

Harith Gaddamanugu

Harith works at AWS as a Sr. Edge Specialist Solutions Architect. He stays motivated by solving problems for customers across AWS Perimeter Protection and Edge services. When he is not working, he enjoys spending time outdoors with friends and family.

Integrating AWS WAF with your Amazon Lightsail instance

Post Syndicated from Macey Neff original https://aws.amazon.com/blogs/compute/integrating-aws-waf-with-your-amazon-lightsail-instance/

This blog post is written by Riaz Panjwani, Solutions Architect, Canada CSC and Dylan Souvage, Solutions Architect, Canada CSC.

Security is the top priority at AWS. This post shows how you can level up your application security posture on your Amazon Lightsail instances with an AWS Web Application Firewall (AWS WAF) integration. Amazon Lightsail offers easy-to-use virtual private server (VPS) instances and more at a cost-effective monthly price.

Lightsail provides security functionality built-in with every instance through the Lightsail Firewall. Lightsail Firewall is a network-level firewall that enables you to define rules for incoming traffic based on IP addresses, ports, and protocols. Developers looking to help protect against attacks such as SQL injection, cross-site scripting (XSS), and distributed denial of service (DDoS) can leverage AWS WAF on top of the Lightsail Firewall.

As of this post’s publishing, AWS WAF can only be deployed on Amazon CloudFront, Application Load Balancer (ALB), Amazon API Gateway, and AWS AppSync. However, Lightsail can’t directly act as a target for these services because Lightsail instances run within an AWS managed Amazon Virtual Private Cloud (Amazon VPC). By leveraging VPC peering, you can deploy the aforementioned services in front of your Lightsail instance, allowing you to integrate AWS WAF with your Lightsail instance.

Solution Overview

This post shows you two solutions to integrate AWS WAF with your Lightsail instance(s). The first uses AWS WAF attached to an Internet-facing ALB. The second uses AWS WAF attached to CloudFront. By following one of these two solutions, you can utilize rule sets provided in AWS WAF to secure your application running on Lightsail.

Solution 1: ALB and AWS WAF

This first solution uses VPC peering and ALB to allow you to use AWS WAF to secure your Lightsail instances. This section guides you through the steps of creating a Lightsail instance, configuring VPC peering, creating a security group, setting up a target group for your load balancer, and integrating AWS WAF with your load balancer.

AWS architecture diagram showing Amazon Lightsail integration with WAF using VPC peering across two separate VPCs. The Lightsail application is in a private subnet inside the managed VPC(vpc-b), with peering connection to your VPC(vpc-a) which has an ALB in a public subnet with WAF attached to it.

Creating the Lightsail Instance

For this walkthrough, you can utilize an AWS Free Tier Linux-based WordPress blueprint.

1. Navigate to the Lightsail console and create the instance.

2. Verify that your Lightsail instance is online and obtain its private IP, which you will need when configuring the Target Group later.

Screenshot of Lightsail console with a WordPress application set up showcasing the networking tab.

Attaching an ALB to your Lightsail instance

You must enable VPC peering as you will be utilizing an ALB in a separate VPC.

1. To enable VPC peering, navigate to your account in the top-right corner, select the Account dropdown, select Account, then select Advanced, and select Enable VPC Peering. Note the AWS Region being selected, as it is necessary later. For this example, select “us-east-2”. Screenshot of Lightsail console in the settings menu under the advanced section showcasing VPC peering.2. In the AWS Management Console, navigate to the VPC service in the search bar, select VPC Peering Connections and verify the created peering connection.

Screenshot of the AWS Console showing the VPC Peering Connections menu with an active peering connection.

3. In the left navigation pane, select Security groups, and create a Security group that allows HTTP traffic (port 80). This is used later to allow public HTTP traffic to the ALB.

4. Navigate to the Amazon Elastic Compute Cloud (Amazon EC2) service, and in the left pane under Load Balancing select Target Groups. Proceed to create a Target Group, choosing IP addresses as the target type.Screenshot of the AWS console setting up target groups with the IP address target type selected.

5. Proceed to the Register targets section, and select Other private IP address. Add the private IP address of the Lightsail instance that you created before. Select Include as Pending below and then Create target group (note that if your Lightsail instance is re-launched the target group must be updated as the private IP address may change).

6. In the left pane, select Load Balancers, select Create load balancers and choose Application Load Balancer. Ensure that you select the “Internet-facing” scheme, otherwise, you will not be able to connect to your instance over the internet.Screenshot of the AWS console setting up target groups with the IP address target type selected.

7. Select the VPC in which you want your ALB to reside. In this example, select the default VPC and all the Availability Zones (AZs) to make sure of the high availability of the load balancer.

8. Select the Security Group created in Step 3 to make sure that public Internet traffic can pass through the load balancer.

9. Select the target group under Listeners and routing to the target group you created earlier (in Step 5). Proceed to Create load balancer.Screenshot of the AWS console creating an ALB with the target group created earlier in the blog, selected as the listener.

10. Retrieve the DNS name from your load balancer again by navigating to the Load Balancers menu under the EC2 service.Screenshot of the AWS console with load balancer created.

11. Verify that you can access your Lightsail instance using the Load Balancer’s DNS by copying the DNS name into your browser.

Screenshot of basic WordPress app launched accessed via a web browser.

Integrating AWS WAF with your ALB

Now that you have ALB successfully routing to the Lightsail instance, you can restrict the instance to only accept traffic from the load balancer, and then create an AWS WAF web Access Control List (ACL).

1. Navigate back to the Lightsail service, select the Lightsail instance previously created, and select Networking. Delete all firewall rules that allow public access, and under IPv4 Firewall add a rule that restricts traffic to the IP CIDR range of the VPC of the previously created ALB.

Screenshot of the Lightsail console showing the IPv4 firewall.

2. Now you can integrate the AWS WAF to the ALB. In the Console, navigate to the AWS WAF console, or simply navigate to your load balancer’s integrations section, and select Create web ACL.

Screenshot of the AWS console showing the WAF configuration in the integrations tab of the ALB.

3. Choose Create a web ACL, and then select Add AWS resources to add the previously created ALB.Screenshot of creating and assigning a web ACL to the ALB.

4. Add any rules you want to your ACL, these rules will govern the traffic allowed or denied to your resources. In this example, you can add the WordPress application managed rules.Screenshot of adding the AWS WAF managed rule for WordPress applications.

5. Leave all other configurations as default and create the AWS WAF.

6. You can verify your firewall is attached to the ALB in the load balancer Integrations section.Screenshot of the AWS console showing the WAF integration detected in the integrations tab of the ALB.

Solution 2: CloudFront and AWS WAF

Now that you have set up ALB and VPC peering to your Lightsail instance, you can optionally choose to add CloudFront to the solution. This can be done by setting up a custom HTTP header rule in the Listener of your ALB, setting up the CloudFront distribution to use the ALB as an origin, and setting up an AWS WAF web ACL for your new CloudFront distribution. This configuration makes traffic limited to only accessing your application through CloudFront, and is still protected by WAF.AWS architecture diagram showing Amazon Lightsail integration with WAF using VPC peering across two separate VPCs. The Lightsail application is in a public subnet inside VPC-B, with peering connection to VPC-A which has an ALB in a private subnet fronted with CloudFront that has WAF attached.

1. Navigate to the CloudFront service, and click Create distribution.

2. Under Origin domain, select the load balancer that you had created previously.Screenshot of creating a distribution in CloudFront.

3. Scroll down to the Add custom header field, and click Add header.

4. Create your header name and value. Note the header name and value as you will need it later in the walkthrough.Screenshot of adding the custom header to your CloudFront distribution.

5. Scroll down to the Cache key and origin requests section. Under Cache policy, choose CachingDisabled.Screenshot of selecting the CachingDisabled cache policy inside the creation of the CloudFront distribution.

6. Scroll to the Web Application Firewall (WAF) section, and select Enable security protections.Screenshot of selecting “Enable security protections” inside the creation of the CloudFront distribution.

7. Leave all other configurations as default, and click Create distribution.

8. Wait until your CloudFront distribution has been deployed, and verify that you can access your Lightsail application using the DNS under Domain name.

Screenshot of the CloudFront distribution created with the status as enabled and the deployment finished.

9. Navigate to the EC2 service, and in the left pane under Load Balancing, select Load Balancers.

10. Select the load balancer you created previously, and under the Listeners tab, select the Listener you had created previously. Select Actions in the top right and then select Manage rules.Screenshot of the Listener section of the ALB with the Manage rules being selected.

11. Select the edit icon at the top, and then select the edit icon beside the Default rule.

Screenshot of the edit section inside managed rules.

12. Select the delete icon to delete the Default Action.

Screenshot of highlighting the delete button inside the edit rules section.

13. Choose Add action and then select Return fixed response.

Screenshot of adding a new rule “Return fixed response…”.

14. For Response code, enter 403, this will restrict access to CloudFront.

15. For Response body, enter “Access Denied”.

16. Select Update in the top right corner to update the Default rule.

Screenshot of the rule being successfully updated.

17. Select the insert icon at the top, then select Insert Rule.

Screenshot of inserting a new rule to the Listener.

18. Choose Add Condition, then select Http header. For Header type, enter the Header name, and then for Value enter the Header Value chosen previously.

19. Choose Add Action, then select Forward to and select the target group you had created in the previous section.

20. Choose Save at the top right corner to create the rule.

Screenshot of adding a new rule to the Listener, with the Http header selected as the custom-header and custom-value from the previous creation of the CloudFront distribution, with the Load Balancer selected as the target group.

21. Retrieve the DNS name from your load balancer again by navigating to the Load Balancers menu under the EC2 service.

22. Verify that you can no longer access your Lightsail application using the Load Balancer’s DNS.

Screenshot of the Lightsail application being accessed through the Load Balancer via a web browser with Access Denied being shown..

23. Navigate back to the CloudFront service and select the Distribution you had created. Under the General tab, select the Web ACL link under the AWS WAF section. Modify the Web ACL to leverage any managed or custom rule sets.

Screenshot of the CloudFront distribution focusing on the AWS WAF integration under the General tab Settings.

You have successfully integrated AWS WAF to your Lightsail instance! You can access your Lightsail instance via your CloudFront distribution domain name!

Clean Up Lightsail console resources

To start, you will delete your Lightsail instance.

  1. Sign in to the Lightsail console.
  2. For the instance you want to delete, choose the actions menu icon (⋮), then choose Delete.
  3. Choose Yes to confirm the deletion.

Next you will delete your provisioned static IP.

  1. Sign in to the Lightsail console.
  2. On the Lightsail home page, choose the Networking tab.
  3. On the Networking page choose the vertical ellipsis icon next to the static IP address that you want to delete, and then choose Delete.

Finally you will disable VPC peering.

  1. In the Lightsail console, choose Account on the navigation bar.
  2. Choose Advanced.
  3. In the VPC peering section, clear Enable VPC peering for all AWS Regions.

Clean Up AWS console resources

To start, you will delete your Load balancer.

  1. Navigate to the EC2 console, choose Load balancers on the navigation bar.
  2. Select the load balancer you created previously.
  3. Under Actions, select Delete load balancer.

Next, you will delete your target group.

  1. Navigate to the EC2 console, choose Target Groups on the navigation bar.
  2. Select the target group you created previously.
  3. Under Actions, select Delete.

Now you will delete your CloudFront distribution.

  1. Navigate to the CloudFront console, choose Distributions on the navigation bar.
  2. Select the distribution you created earlier and select Disable.
  3. Wait for the distribution to finish deploying.
  4. Select the same distribution after it is finished deploying and select Delete.

Finally, you will delete your WAF ACL.

  1. Navigate to the WAF console, and select Web ACLS on the navigation bar.
  2. Select the web ACL you created previously, and select Delete.

Conclusion

Adding AWS WAF to your Lightsail instance enhances the security of your application by providing a robust layer of protection against common web exploits and vulnerabilities. In this post, you learned how to add AWS WAF to your Lightsail instance through two methods: Using AWS WAF attached to an Internet-facing ALB and using AWS WAF attached to CloudFront.

Security is top priority at AWS and security is an ongoing effort. AWS strives to help you build and operate architectures that protect information, systems, and assets while delivering business value. To learn more about Lightsail security, check out the AWS documentation for Security in Amazon Lightsail.

Manage your workloads better using Amazon Redshift Workload Management

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/manage-your-workloads-better-using-amazon-redshift-workload-management/

With Amazon Redshift, you can run a complex mix of workloads on your data warehouse, such as frequent data loads running alongside business-critical dashboard queries and complex transformation jobs. We also see more and more data science and machine learning (ML) workloads. Each workload type has different resource needs and different service-level agreements (SLAs).

Amazon Redshift workload management (WLM) helps you maximize query throughput and get consistent performance for the most demanding analytics workloads by optimally using the resources of your existing data warehouse.

In Amazon Redshift, you implement WLM to define the number of query queues that are available and how queries are routed to those queues for processing. WLM queues are configured based on Redshift user groups, user roles, or query groups. When users belonging to a user group or role run queries in the database, their queries are routed to a queue as depicted in the following flowchart.

Role-based access control (RBAC) is a new enhancement that helps you simplify the management of security privileges in Amazon Redshift. You can use RBAC to control end-user access to data at a broad or granular level based on their job role. We have introduced support for Redshift roles in WLM queues, you will now find User roles along with User groups and Query groups as query routing mechanism.

This post provides examples of analytics workloads for an enterprise, and shares common challenges and ways to mitigate those challenges using WLM. We guide you through common WLM patterns and how they can be associated with your data warehouse configurations. We also show how to assign user roles to WLM queues and how to use WLM query insights to optimize configuration.

Use case overview

ExampleCorp is an enterprise using Amazon Redshift to modernize its data platform and analytics. They have variety of workloads with users from various departments and personas. The service-level performance requirements vary by the nature of the workload and user personas accessing the datasets. ExampleCorp would like to manage resources and priorities on Amazon Redshift using WLM queues. For this multitenant architecture by department, ExampleCorp can achieve read/write isolation using the Amazon Redshift data sharing feature and meet its unpredictable compute scaling requirements using concurrency scaling.

The following figure illustrates the user personas and access in ExampleCorp.

ExampleCorp has multiple Redshift clusters. For this post, we focus on the following:

  • Enterprise data warehouse (EDW) platform – This has all write workloads, along with some of the applications running reads via the Redshift Data API. The enterprise standardized data from the EDW cluster is accessed by multiple consumer clusters using the Redshift data sharing feature to run downstream reports, dashboards, and other analytics workloads.
  • Marketing data mart – This has predictable extract, transform, and load (ETL) and business intelligence (BI) workloads at specific times of day. The cluster admin understands the exact resource requirements by workload type.
  • Auditor data mart – This is only used for a few hours a day to run scheduled reports.

ExampleCorp would like to better manage their workloads using WLM.

Solution overview

As we discussed in the previous section, ExampleCorp has multiple Redshift data warehouses: one enterprise data warehouse and two downstream Redshift data warehouses. Each data warehouse has different workloads, SLAs, and concurrency requirements.

A database administrator (DBA) will implement appropriate WLM strategies on each Redshift data warehouse based on their use case. For this post, we use the following examples:

  • The enterprise data warehouse demonstrates Auto WLM with query priorities
  • The marketing data mart cluster demonstrates manual WLM
  • The auditors team uses their data mart infrequently for sporadic workloads; they use Amazon Redshift Serverless, which doesn’t require workload management

The following diagram illustrates the solution architecture.

Prerequisites

Before beginning this solution, you need the following:

  • An AWS account
  • Administrative access to Amazon Redshift

Let’s start by understanding some foundational concepts before solving the problem statement for ExampleCorp. First, how to choose between auto vs. manual WLM.

Auto vs. manual WLM

Amazon Redshift WLM enables you to flexibly manage priorities within workloads to meet your SLAs. Amazon Redshift supports Auto WLM or manual WLM for your provisioned Redshift data warehouse. The following diagram illustrates queues for each option.

Auto WLM determines the amount of resources that queries need and adjusts the concurrency based on the workload. When queries requiring large amounts of resources are in the system (for example, hash joins between large tables), the concurrency is lower. For additional information, refer to Implementing automatic WLM. You should use Auto WLM when your workload is highly unpredictable.

With manual WLM, you manage query concurrency and memory allocation, as opposed to auto WLM, where it’s managed by Amazon Redshift automatically. You configure separate WLM queues for different workloads like ETL, BI, and ad hoc and customize resource allocation. For additional information, refer to Tutorial: Configuring manual workload management (WLM) queues.

Use manual when When your workload pattern is predictable or if you need to throttle certain types of queries depending on the time of day, such as throttle down ingestion during business hours. If you need to guarantee multiple workloads are able to run at the same time, you can define slots for each workload.

Now that you have chosen automatic or manual WLM, let’s explore WLM parameters and properties.

Static vs. dynamic properties

The WLM configuration for a Redshift data warehouse is set using a parameter group under the database configuration properties.

The parameter group WLM settings are either dynamic or static. You can apply dynamic properties to the database without a cluster reboot, but static properties require a cluster reboot for changes to take effect. The following table summarizes the static vs. dynamic requirements for different WLM properties.

WLM Property Automatic WLM Manual WLM
Query groups Dynamic Static
Query group wildcard Dynamic Static
User groups Dynamic Static
User group wildcard Dynamic Static
User roles Dynamic Static
User role wildcard Dynamic Static
Concurrency on main Not applicable Dynamic
Concurrency Scaling mode Dynamic Dynamic
Enable short query acceleration Not applicable Dynamic
Maximum runtime for short queries Dynamic Dynamic
Percent of memory to use Not applicable Dynamic
Timeout Not applicable Dynamic
Priority Dynamic Not applicable
Adding or removing queues Dynamic Static

Note the following:

  • The parameter group parameters and WLM switch from manual to auto or vice versa and are static properties, and therefore require a cluster reboot.
  • For the WLM properties Concurrency on main, Percentage of memory to use, and Timeout, which are dynamic for manual WLM, the change only applies to new queries submitted after the value has changed and not for currently running queries.
  • The query monitoring rules, which we discuss later in this post, are dynamic and don’t require a cluster reboot.

In the next section, we discuss the concept of service class, meaning which queue does the query get submitted to and why.

Service class

Whether you use Auto or manual WLM, the user queries submitted go to the intended WLM queue via one of the following mechanisms:

  • User_Groups – The WLM queue directly maps to Redshift groups that would appear in the pg_group table.
  • Query_Groups – Queue assignment is based on the query_group label. For example, a dashboard submitted from the same reporting user can have separate priorities by designation or department.
  • User_Roles (latest addition) – The queue is assigned based on the Redshift roles.

WLM queues from a metadata perspective are defined as service class configuration. The following table lists common service class identifiers for your reference.

ID Service class
1–4 Reserved for system use.
5 Used by the superuser queue.
6–13 Used by manual WLM queues that are defined in the WLM configuration.
14 Used by short query acceleration.
15 Reserved for maintenance activities run by Amazon Redshift.
100–107 Used by automatic WLM queue when auto_wlm is true.

The WLM queues you define based on user_groups, query_groups, or user_roles fall in service class ID 6–13 for manual WLM and service class id 100–107 for automatic WLM.

Using Query_group, you can force a query to go to service class 5 and run in the superuser queue (provided you are an authorized superuser) as shown in the following code:

set query_group to 'superuser';
analyze table_xyz;
vacuum full table_xyz;
reset query_group;

For more details on how to assign a query to a particular service class, refer to Assigning queries to queues.

The short query acceleration (SQA) queue (service class 14) prioritizes short-running queries ahead of longer-running queries. If you enable SQA, you can reduce WLM queues that are dedicated to running short queries. In addition, long-running queries don’t need to contend with short queries for slots in a queue, so you can configure your WLM queues to use fewer query slots (a term used for available concurrency). Amazon Redshift uses an ML algorithm to analyze each eligible query and predict the query’s runtime. Auto WLM dynamically assigns a value for the SQA maximum runtime based on analysis of your cluster’s workload. Alternatively, you can specify a fixed value of 1–20 seconds when using manual WLM.

SQA is enabled by default in the default parameter group and for all new parameter groups. SQA can have a maximum concurrency of six queries.

Now that you understand how queries get submitted to a service class, it’s important to understand ways to avoid runaway queries and initiate an action for an unintended event.

Query monitoring rules

You can use Amazon Redshift query monitoring rules (QMRs) to set metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries.

The Redshift cluster automatically collects query monitoring metrics. You can query the system view SVL_QUERY_METRICS_SUMMARY as an aid to determine threshold values for defining the QMR. Then create the QMR based on following attributes:

  • Query runtime, in seconds
  • Query return row count
  • The CPU time for a SQL statement

For a complete list of QMRs, refer to WLM query monitoring rules.

Create sample parameter groups

For our ExampleCorp use case, we demonstrate automatic and manual WLM for a provisioned Redshift data warehouse and share a serverless perspective of WLM.

The following AWS CloudFormation template provides an automated way to create sample parameter groups that you can attach to your Redshift data warehouse for workload management.

Enterprise data warehouse Redshift cluster using automatic WLM

For the EDW cluster, we use Auto WLM. To configure the service class, we look at all three options: user_roles, user_groups, and query_groups.

Here’s a glimpse of how this can be set up in WLM queues and then used in your queries.

On the Amazon Redshift console, under Configurations in the navigation pane, choose Workload Management. You can create a new parameter group or modify an existing one created by you. Select the parameter group to edit its queues. There’s always a default queue (the last one in case of multiple queues defined), which is a catch-all for queries that don’t get routed to any specific queue.

User roles in WLM

With the introduction of user roles in WLM queues, now you can manage your workload by adding different roles to different queues. This can help you prioritize the queries based on the roles a user has. When a user runs a query, WLM will check if this user’s roles were added in any workload queues and assign the query to the first matching queue. To add roles into the WLM queue, you can go to the WLM page, create or modify an existing workload queue, add a user’s roles in the queue, and select Matching wildcards to add roles that get matched as wildcards.

For more information about how to convert from groups to roles, refer to Amazon Redshift Roles (RBAC), which walks you through a stored procedure to convert groups to roles.

In the following example, we have created the WLM queue EDW_Admins, which uses edw_admin_role created in Amazon Redshift to submit the workloads in this queue. The EDW_Admins queue is created with a high priority and automatic concurrency scaling mode.

User groups

Groups are collections of users who are all granted permissions associated with the group. You can use groups to simplify permission management by granting privileges just one time. If the members of a group get added or removed, you don’t need to manage them at a user level. For example, you can create different groups for sales, administration, and support and give the users in each group the appropriate access to the data they need for their work.

You can grant or revoke permissions at the user group level, and those changes will apply to all members of the group.

ETL, data analysts, or BI or decision support systems can use user groups to better manage and isolate their workloads. For our example, ETL WLM queue queries will be run with the user group etl. The data analyst group (BI) WLM queue queries will run using the bi user group.

Choose Add queue to add a new queue that you will use for user_groups, in this case ETL. If you would like these to be matched as wildcards (strings containing those keywords), select Matching wildcards. You can customize other options like query priority and concurrency scaling, explained earlier in this post. Choose Save to complete this queue setup.

In the following example, we have created two different WLM queues for ETL and BI. The ETL queue has a high priority and concurrency scaling mode is off, whereas the BI queue has a low priority and concurrency scaling mode is off.

Use the following code to create a group with multiple users:

-- Example of create group with multiple users
create group ETL with user etl_user1, etl_user2;
Create group BI with user bi_user1, bi_user2;

Query groups

Query_Groups are labels used for queries that are run within the same session. Think of these as tags that you may want to use to identify queries for a uniquely identifiable use case. In our example use case, the data analysts or BI or decision support systems can use query_groups to better manage and isolate their workloads. For our example, weekly business reports can run with the query_group label wbr. Queries from the marketing department can be run with a query_group of marketing.

The benefit of using query_groups is that you can use it to constrain results from the STL_QUERY and STV_INFLIGHT tables and the SVL_QLOG view. You can apply a separate label to every query that you run to uniquely identify queries without having to look up their IDs.

Choose Add queue to add a new queue that you will use for query_groups, in this case wbr or weekly_business_report. If you would like these to be matched as wildcards (strings containing those keywords), select Matching wildcards. You can customize other options like query priority and concurrency scaling options as explained earlier in this post. Choose Save to save this queue setup.

Now let’s see how you can force a query to use the query_groups queue just created.

You can assign a query to a queue at runtime by assigning your query to the appropriate query group. Use the SET command to begin a query group:

SET query_group TO wbr;
-- or
SET query_group TO weekly_business_report;

Queries following the SET command would go to the WLM queue Query_Group_WBR until you either reset the query group or end your current login session. For information about setting and resetting server configuration parameter, see SET and RESET, respectively.

The query group labels that you specify must be included in the current WLM configuration; otherwise, the SET query_group command has no effect on query queues.

For more query_groups examples, refer to WLM queue assignment rules.

Marketing Redshift cluster using manual WLM

Expanding on the marketing Redshift cluster use case of ExampleCorp, this cluster serves two types of workloads:

  • Running ETL for a period of 2 hours between 7:00 AM to 9:00 AM
  • Running BI reports and dashboards for the remaining time during the day

When you have such a clarity in the workloads, and your scope of usage is customizable by design, you may want to consider using manual WLM, where you can control the memory and concurrency resource allocation. Auto WLM will still be applicable, but manual WLM can also be a choice.

Let’s set up manual WLM in this case, with two WLM queues: ETL and BI.

To best utilize the resources, we use an AWS Command Line Interface (AWS CLI) command at the start of our ETL, which will make our WLM queues ETL-friendly, providing higher concurrency to the ETL queue. At the end of our ETL, we use an AWS CLI command to change the WLM queue to have BI-friendly resource settings. Modifying the WLM queues doesn’t require a reboot of your cluster; however, modifying the parameters or parameter group does.

If you were to use Auto WLM, this could have been achieved by dynamically changing the query priority of the ETL and BI queues.

By default, when you choose Create, the WLM created will be Auto WLM. You can switch to manual WLM by choosing Switch WLM mode. After switching WLM mode, choose Edit workload queues.

This will open the Modify workload queues page, where you can create your ETL and BI WLM queues.

After you add your ETL and BI queues, choose Save. You should have configured the following:

  • An ETL queue with 60% memory allocation and query concurrency of 9
  • A BI queue with 30% memory allocation and query concurrency of 4
  • A default queue with 10% memory allocation and query concurrency of 2

Your WLM queues should appear with settings as shown in the following screenshot.

Enterprises may prefer to complete these steps in an automated way. For the marketing data mart use case, the ETL starts at 7:00 AM. An ideal start to the ETL flow would be to have a job that makes your WLM settings ETL queue friendly. Here’s how you would modify concurrency and memory (both dynamic properties in manual WLM queues) to an ETL-friendly configuration:

aws redshift --region 'us-east-1' modify-cluster-parameter-group --parameter-group-name manual-wlm-demo --parameters '{"ParameterName": "wlm_json_configuration","ParameterValue": "[{\"query_group\": [], \"user_group\": [\"etl\"],\"query_group_wild_card\": 0,\"user_group_wild_card\": 0, \"query_concurrency\": 9, \"max_execution_time\": 0, \"memory_percent_to_use\": 60, \"name\": \"ETL\" }, {\"query_group\": [], \"user_group\": [\"bi\"],\"query_group_wild_card\": 0,\"user_group_wild_card\": 0, \"query_concurrency\": 3, \"max_execution_time\": 0, \"memory_percent_to_use\": 20, \"name\": \"BI\" }, { \"query_group\": [], \"user_group\": [], \"query_group_wild_card\": 0, \"user_group_wild_card\": 0, \"query_concurrency\": 3, \"max_execution_time\": 5400000, \"memory_percent_to_use\": 20, \"name\": \"Default queue\", \"rules\": [ { \"rule_name\": \"user_query_duration_threshold\", \"predicate\": [ { \"metric_name\": \"query_execution_time\", \"operator\": \">\", \"value\": 10800 } ], \"action\": \"abort\" } ] }, { \"short_query_queue\": \"true\" } ]","Description": "ETL Start, ETL Friendly"}';

The preceding AWS CLI command programmatically sets the configuration of your WLM queues without requiring a reboot of the cluster because the queue settings changed were all dynamic settings.

For the marketing data mart use case, at 9:00 AM or when the ETL is finished, you can have a job run an AWS CLI command to modify the WLM queue resource settings to a BI-friendly configuration as shown in the following code:

aws redshift --region 'us-east-1' modify-cluster-parameter-group --parameter-group-name manual-wlm-demo --parameters '{"ParameterName": "wlm_json_configuration","ParameterValue": "[{\"query_group\": [], \"user_group\": [\"etl\"],\"query_group_wild_card\": 0,\"user_group_wild_card\": 0, \"query_concurrency\": 1, \"max_execution_time\": 0, \"memory_percent_to_use\": 5, \"name\": \"ETL\" }, {\"query_group\": [], \"user_group\": [\"bi\"],\"query_group_wild_card\": 0,\"user_group_wild_card\": 0, \"query_concurrency\": 12, \"max_execution_time\": 0, \"memory_percent_to_use\": 80, \"name\": \"BI\" }, { \"query_group\": [], \"user_group\": [], \"query_group_wild_card\": 0, \"user_group_wild_card\": 0, \"query_concurrency\": 2, \"max_execution_time\": 5400000, \"memory_percent_to_use\": 15, \"name\": \"Default queue\", \"rules\": [ { \"rule_name\": \"user_query_duration_threshold\", \"predicate\": [ { \"metric_name\": \"query_execution_time\", \"operator\": \">\", \"value\": 10800 } ], \"action\": \"abort\" } ] }, { \"short_query_queue\": \"true\" } ]","Description": "ETL End, BI Friendly"}';

Note that in regards to a manual WLM configuration, the maximum slots you can allocate to a queue is 50. However, this doesn’t mean that in an automatic WLM configuration, a Redshift cluster always runs 50 queries concurrently. This can change based on the memory needs or other types of resource allocation on the cluster. We recommend configuring your manual WLM query queues with a total of 15 or fewer query slots. For more information, see Concurrency level.

In case of WLM timeout or a QMR hop action within manual WLM, a query can attempt to hop to the next matching queue based on WLM queue assignment rules. This action in manual WLM is called query queue hopping.

Auditor Redshift data warehouse using WLM in Redshift Serverless

The auditor data warehouse workload runs on the month, and quarter end. For this periodic workload, Redshift Serverless is well suited, both from a cost and ease of administration perspective. Redshift Serverless uses ML to learn from your workload to automatically manage workload and auto scaling of compute needed for your workload.

In Redshift Serverless, you can set up usage and query limits. The query limits let you set up the QMR. You can choose Manage query limits to automatically trigger the default abort action when queries go beyond performance boundaries. For more information, refer to Query monitoring metrics for Amazon Redshift Serverless.

For other detailed limits in Redshift Serverless, refer to Configure monitoring, limits, and alarms in Amazon Redshift Serverless to keep costs predictable.

Monitor using system views for operational metrics

The system views in Amazon Redshift are used to monitor the workload performance. You can view the status of queries, queues, and service classes by using WLM-specific system tables. You can query system tables to explore the following details:

  • View which queries are being tracked and what resources are allocated by the workload manager
  • See which queue a query has been assigned to
  • View the status of a query that is currently being tracked by the workload manager

You can download the sample SQL notebook system queries. You can import this in Query Editor V2.0. The queries in the sample notebook can help you explore your workloads being managed by WLM queues.

Conclusion

In this post, we covered real-world examples for Auto WLM and manual WLM patterns. We introduced user roles assignment to WLM queues, and shared queries on system views and tables to gather operational insights on your WLM configuration. We encourage you to explore using Redshift user roles with workload management. Use the script provided on AWS re:Post to convert groups to roles, and start using user roles for your WLM queues.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, Texas. He has over 17 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with utmost security and data governance.

Harshida Patel is a Principal specialist SA with AWS.

Nita Shah is an Analytics Specialist Solutions Architect at AWS based out of New York. She has been building data warehouse solutions for over 20 years and specializes in Amazon Redshift. She is focused on helping customers design and build enterprise-scale well-architected analytics and decision support platforms.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.

Automate Lambda code signing with Amazon CodeCatalyst and AWS Signer

Post Syndicated from Vineeth Nair original https://aws.amazon.com/blogs/devops/automate-lambda-code-signing-with-amazon-codecatalyst-and-aws-signer/

Amazon CodeCatalyst is an integrated service for software development teams adopting continuous integration and deployment practices into their software development process. CodeCatalyst puts the tools you need all in one place. You can plan work, collaborate on code build, test, and deploy applications with continuous integration/continuous delivery (CI/CD) tools. You can also integrate AWS resources with your projects by connecting your AWS accounts to your CodeCatalyst space. By managing all of the stages and aspects of your application lifecycle in one tool, you can deliver software quickly and confidently.

Introduction

In this post we will focus on how development teams can use Amazon CodeCatalyst with AWS Signer to fully manage the code signing process to ensure the trust and integrity of code assets. We will describe the process of building the AWS Lambda code using a CodeCatalyst workflow, we will then demonstrate the process of signing the code using a signer profile and deploying the signed code to our Lambda function.

In the Develop stage, the engineer commits the code to the Amazon CodeCatalyst repository using the Cloud 9 IDE. The CodeCatalyst workflow sends the index.py file from the repository and puts it into the S3 source bucket after compressing it. AWS Signer signs this content and pushes it to the S3 destination bucket. In the deploy stage, the signed zip file will be deployed into the AWS Lambda function.

Figure 1: Architecture Diagram.

Prerequisites

To follow along with the post, you will need the following items:

Walkthrough

During this tutorial, we will create a step-by-step guide to constructing a workflow utilizing CodeCatalyst. The objective is to employ the AWS Signer service to retrieve Python code from a specified source Amazon S3 bucket, compress and sign the code, and subsequently store it in a destination S3 bucket. Finally, we will utilize the signed code to deploy a secure Lambda function.

Create the base workflow

To begin we will create our workflow in the CodeCatalyst project.

Select CI/CD → Workflows → Create workflow:


Figure 2: Create workflow.

Leave the defaults for the Source Repository and Branch, select Create. We will have an empty workflow:


Figure 3: Empty workflow.

We can edit the workflow from the CodeCatalyst console, or use a Dev Environment. Initially, we will create an initial commit of this workflow file, ignore any validation errors at this stage:

In Commit workflow page, we can add the workflow file name, commit message. Repository name and Branch name can be selected from the drop-down option.
Figure 4: Commit workflow with workflow file name, message repository and branch name.

Connect to CodeCatalyst Dev Environment

We will use an AWS Cloud9 Dev Environment. Our first step is to connect to the dev environment.

Select Code → Dev Environments. If you do not already a Dev Instance you can create an instance by selecting Create Dev Environment.

My Dev Environment tab shows all Environment available.
Figure 5: Create Dev Environment.

We already have a Dev Environment, so will go ahead and select Resume Instance. A new browser tab opens for the IDE and will be available in less than one minute. Once the IDE is ready, we can go ahead and start building our workflow. First, open a terminal. You can then change into the source repository directory and pull the latest changes. In our example, our Git source repository name is lambda-signer

cd lambda-signer && git pull. We can now edit this file in our IDE.

Initially, we will create a basic Lambda code under artifacts directory:

mkdir artifacts
cat <<EOF > artifacts/index.py
def lambda_handler(event, context):
    print('Testing Lambda Code Signing using Signer') 
EOF

The previous command block creates our index.py file which will go inside the AWS Lambda function. When we testing the Lambda Function, we should see message “Testing Lambda Code Signing using Signer” in the console log.

As a next step, we will create the CDK directory and initiate it:

mkdir cdk;
cd cdk && cdk init --language python cdk

The previous command will create a directory called ‘cdk’ and then initiate cdk inside this directory. As a result, we will see another directory named ‘cdk’. We then need to update files inside this directory as per the following screenshot.

Shows the cdk directory structure. Inside this directory, there is a file called app.py. Also there is a subdirectory called cdk. Inside this subdirectory, there are 2 files named cdk_stack.py and lambda_stack.py.
Figure 6: Repository file structure.

Update the content of the files as per the code following snippets:

(Note: Update your region name by replacing the placeholder <Region Name> )

cdk_stack.py:

import os
from constructs import Construct
from aws_cdk import (
    Duration,
    Stack,
    aws_lambda as lambda_,
    aws_signer as signer,
    aws_s3 as s3,
    Aws as aws,
    CfnOutput
)


class CdkStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        
         # Set the AWS region
        os.environ["AWS_DEFAULT_REGION"] = "<Region Name>"

        # Create the signer profile
        signer_profile_name = "my-signer-profile-" + aws.ACCOUNT_ID
        print(f"signer_profile_name: {signer_profile_name}")
        
        signing_profile = signer.SigningProfile(self, "SigningProfile",
            platform=signer.Platform.AWS_LAMBDA_SHA384_ECDSA,
            signing_profile_name='my-signer-profile' + aws.ACCOUNT_ID,
            signature_validity=Duration.days(365)
        )

        self.code_signing_config = lambda_.CodeSigningConfig(self, "CodeSigningConfig",
            signing_profiles=[signing_profile]
        )
        

        source_bucket_name = "source-signer-bucket-" + aws.ACCOUNT_ID
        source_bucket = s3.Bucket(self, "SourceBucket",
            bucket_name=source_bucket_name,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            encryption=s3.BucketEncryption.S3_MANAGED,
            versioned=True
        )

        destination_bucket_name = "dest-signer-bucket-" + aws.ACCOUNT_ID
        self.destination_bucket = s3.Bucket(self, "DestinationBucket",
            bucket_name=destination_bucket_name,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
            encryption=s3.BucketEncryption.S3_MANAGED,
            versioned=True
        )
        resolved_signing_profile_name = self.resolve(signing_profile.signing_profile_name)

        CfnOutput(self,"signer-profile",value=signing_profile.signing_profile_name)
        CfnOutput(self,"src-bucket",value=source_bucket.bucket_name)
        CfnOutput(self,"dst-bucket",value=self.destination_bucket.bucket_name)

lambda_stack.py:

from constructs import Construct
from aws_cdk import (
    Duration,
    Stack,
    aws_lambda as lambda_,
    aws_signer as signer,
    aws_s3 as s3,
    Aws as aws,
    CfnOutput
)


class LambdaStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, dst_bucket:s3.Bucket,codesigning_config: lambda_.CodeSigningConfig, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        
         # Set the AWS region
        # Get the code from action inputs
        bucket_name = self.node.try_get_context("bucket_name")
        key = self.node.try_get_context("key")

        lambda_function = lambda_.Function(
            self,
            "Function",
		 function_name=’sample-signer-function’,
            code_signing_config=codesigning_config,
            runtime=lambda_.Runtime.PYTHON_3_9,
            handler="index.Lambda_handler",
            code=lambda_.Code.from_bucket(dst_bucket, key)
        )

app.py:

#!/usr/bin/env python3

import aws_cdk as cdk

from cdk.cdk_stack import CdkStack
from cdk.lambda_stack import LambdaStack


app = cdk.App()
signer_stack = CdkStack(app, "cdk")
lambda_stack = LambdaStack(app, "LambdaStack", dst_bucket=signer_stack.destination_bucket,codesigning_config=signer_stack.code_signing_config)

app.synth()

Finally, we will work on Workflow:

In our example, our workflow is Workflow_d892. We will locate Workflow_d892.yaml in the .codecatalyst\workflows directory in our repository.


Figure 7: Workflow yaml file.

Update workflow with remaining steps

We can assign our workflow a name and configure the action. We have five stages in this workflow:

  • CDKBootstrap: Prepare AWS Account for CDK deployment.
  • CreateSignerResources: Deploys Signer resources into AWS Account
  • ZipLambdaCode: Compresses the index.py file and store it in the source S3 bucket
  • SignCode: Sign the compressed python file and push it to the destination S3 bucket
  • Createlambda: Creates the Lambda Function using the signed code from destination S3 bucket.

Please insert the following values for your environment into the workflow file. The environment configuration will be as per the pre-requisite configuration for CodeCatalyst environment setup:

  • <Name of your Environment>: The Name of your CodeCatalyst environment
  • <AWS Account>: The AWS Account connection ID
  • <Role Name>: The CodeCatalyst role that is configured for the environment

(Note: Feel free to update the region configuration to meet your deployment requirements. Supported regions are listed here)

Name: Workflow_d892
SchemaVersion: "1.0"

# Optional - Set automatic triggers.
Triggers:
  - Type: Push
    Branches:
      - main

# Required - Define action configurations.
Actions:
  CDKBootstrap:
    # Identifies the action. Do not modify this value.
    Identifier: aws/[email protected]

    # Specifies the source and/or artifacts to pass to the action as input.
    Inputs:
      # Optional
      Sources:
        - WorkflowSource # This specifies that the action requires this Workflow as a source

    # Required; You can use an environment, AWS account connection, and role to access AWS resources.
    Environment:
      Name: <Name of your Environment>
      Connections:
        - Name: <AWS Account>
          Role: <Role Name> # Defines the action's properties.
    Configuration:
      # Required; type: string; description: AWS region to bootstrap
      Region: <Region Name>
  CreateSignerResources:
    # Identifies the action. Do not modify this value.
    Identifier: aws/[email protected]
    DependsOn:
      - CDKBootstrap
      # Specifies the source and/or artifacts to pass to the action as input.
    Inputs:
      # Optional
      Sources:
        - WorkflowSource # This specifies that the action requires this Workflow as a source

    # Required; You can use an environment, AWS account connection, and role to access AWS resources.
    Environment:
      Name: <Name of your Environment>
      Connections:
        - Name: <AWS Account>
          Role: <Role Name> 
    Configuration:
      # Required; type: string; description: Name of the stack to deploy
      StackName: cdk
      CdkRootPath: cdk
      Region: <Region Name>
      CfnOutputVariables: '["signerprofile","dstbucket","srcbucket"]'
      Context: '{"key": "placeholder"}'
  ZipLambdaCode:
    Identifier: aws/build@v1
    DependsOn:
    - CreateSignerResources
    Inputs:
      Sources:
        - WorkflowSource
    Environment:
      Name: <Name of your Environment>
      Connections:
        - Name: <AWS Account>
          Role: <Role Name> 
#
    Configuration:
      Steps:
        - Run: sudo yum install zip -y
        - Run: cd artifacts && zip lambda-${WorkflowSource.CommitId}.zip index.py
        - Run: aws s3 cp lambda-${WorkflowSource.CommitId}.zip s3://${CreateSignerResources.srcbucket}/tobesigned/lambda-${WorkflowSource.CommitId}.zip
        - Run: S3VER=$(aws s3api list-object-versions --output text --bucket ${CreateSignerResources.srcbucket} --prefix 'tobesigned/lambda-${WorkflowSource.CommitId}.zip' --query 'Versions[*].VersionId')
    Outputs:
      Variables:
      - S3VER
           
  SignCode:
    Identifier: aws/build@v1
    DependsOn:
    - ZipLambdaCode
    Inputs:
      Sources:
        - WorkflowSource
    Environment:
      Name: <Name of your Environment>
      Connections:
        - Name: AWS Account>
          Role: <Role Name> #
    Configuration:
      Steps:
        - Run: export AWS_REGION=<Region Name>
        - Run: SIGNER_JOB=$(aws signer start-signing-job --source --output text --query 'jobId' 's3={bucketName=${CreateSignerResources.srcbucket},key=tobesigned/lambda-${WorkflowSource.CommitId}.zip,version=${ZipLambdaCode.S3VER}}' --destination 's3={bucketName=${CreateSignerResources.dstbucket},prefix=signed-}' --profile-name ${CreateSignerResources.signerprofile})
    Outputs:
      Variables:
        - SIGNER_JOB
  CreateLambda:
    # Identifies the action. Do not modify this value.
    Identifier: aws/[email protected]
    DependsOn:
      - SignCode
      # Specifies the source and/or artifacts to pass to the action as input.
    Inputs:
      # Optional
      Sources:
        - WorkflowSource # This specifies that the action requires this Workflow as a source

    # Required; You can use an environment, AWS account connection, and role to access AWS resources.
    Environment:
      Name: <Name of your Environment>
      Connections:
        - Name: AWS Account>
          Role: <Role Name>
            # Defines the action's properties.
    Configuration:
      # Required; type: string; description: Name of the stack to deploy
      StackName: LambdaStack
      CdkRootPath: cdk
      Region: <Region Name>
      Context: '{"key": "signed-${SignCode.SIGNER_JOB}.zip"}'

We can copy/paste this code into our workflow. To save our changes, we select File -> Save. We can then commit these to our git repository by typing the following at the terminal:

git add . && git commit -m ‘adding workflow’ && git push

The previous command will commit and push the changes that we have made to the CodeCatalyst source repository. As we have a branch trigger for main defined, this will trigger a run of the workflow. We can monitor the status of the workflow in the CodeCatalyst console by selecting CICD -> Workflows. Locate your workflow and click on Runs to view the status.

CodeCatalyst CICD pipeline stage starts with CDKBootstrap stage. Stage 2 is CreateSignerResources. Stage3 is ZipLambdaCode. Stage4 is SignCode and Final Stage is CreateLambda.
Figure 8: Successful workflow execution.

To validate that our newly created Lambda function is using AWS Signed code, we can open the AWS Console in our target region > Lambda > click on the sample-signer-function to inspect the properties.

When open the AWS Lambda function, code tab shows a text message “Your function has signed code and can’t be edited inline"
Figure 9: AWS Lambda function with signed code.

Under the Code Source configuration property, you should see an informational message advising that ‘Your function has signed cofde and can’t be edited inline’. This confirms that the Lambda function is successfully using signed code.

Cleaning up

If you have been following along with this workflow, you should delete the resources that you have deployed to avoid further chargers. In the AWS Console > CloudFormation, locate the LambdaStack, then select and click Delete to remove the stack. Complete the same steps for the CDK stack.

Conclusion

In this post, we explained how development teams can easily get started signing code with AWS Signer and deploying it to Lambda Functions using Amazon CodeCatalyst. We outlined the stages in our workflow that enabled us to achieve the end-to-end release cycle. We also demonstrated how to enhance the developer experience of integrating CodeCatalyst with our AWS Cloud9 Dev Environment and leveraging the power of AWS CDK to use familiar programming languages such as Python to define our infrastructure as code resources.

Richard Merritt

Richard Merritt is a DevOps Consultant at Amazon Web Services (AWS), Professional Services. He works with AWS customers to accelerate their journeys to the cloud by providing scalable, secure and robust DevOps solutions.

Vineeth Nair

Vineeth Nair is a DevOps Architect at Amazon Web Services (AWS), Professional Services. He collaborates closely with AWS customers to support and accelerate their journeys to the cloud and within the cloud ecosystem by building performant, resilient, scalable, secure and cost efficient solutions.

 

Externalize Amazon MSK Connect configurations with Terraform

Post Syndicated from Ramc Venkatasamy original https://aws.amazon.com/blogs/big-data/externalize-amazon-msk-connect-configurations-with-terraform/

Managing configurations for Amazon MSK Connect, a feature of Amazon Managed Streaming for Apache Kafka (Amazon MSK), can become challenging, especially as the number of topics and configurations grows. In this post, we address this complexity by using Terraform to optimize the configuration of the Kafka topic to Amazon S3 Sink connector. By adopting this strategic approach, you can establish a robust and automated mechanism for handling MSK Connect configurations, eliminating the need for manual intervention or connector restarts. This efficient solution will save time, reduce errors, and provide better control over your Kafka data streaming processes. Let’s explore how Terraform can simplify and enhance the management of MSK Connect configurations for seamless integration with your infrastructure.

Solution overview

At a well-known AWS customer, the management of their constantly growing MSK Connect S3 Sink connector topics has become a significant challenge. The challenges lie in the overhead of managing configurations, as well as dealing with patching and upgrades. Manually handling Kubernetes (K8s) configs and restarting connectors can be cumbersome and error-prone, making it difficult to keep track of changes and updates. At the time of writing this post, MSK Connect does not offer native mechanisms to easily externalize the Kafka topic to S3 Sink configuration.

To address these challenges, we introduce Terraform, an infrastructure as code (IaC) tool. Terraform’s declarative approach and extensive ecosystem make it an ideal choice for managing MSK Connect configurations.

By externalizing Kafka topic to S3 configurations, organizations can achieve the following:

  • Scalability – Effortlessly manage a growing number of topics, ensuring the system can handle increasing data volumes without difficulty
  • Flexibility – Seamlessly integrate MSK Connect configurations with other infrastructure components and services, enabling adaptability to changing business needs
  • Automation – Automate the deployment and management of MSK Connect configurations, reducing manual intervention and streamlining operational tasks
  • Centralized management – Achieve improved governance with centralized management, version control, auditing, and change tracking, ensuring better control and visibility over the configurations

In the following sections, we provide a detailed guide on establishing Terraform for MSK Connect configuration management, defining and decentralizing Topic configurations, and deploying and updating configurations using Terraform.

Prerequisites

Before proceeding with the solution, ensure you have the following resources and access:

  • You need access to an AWS account with sufficient permissions to create and manage resources, including AWS Identity and Access Management (IAM) roles and MSK clusters.
  • To simplify the setup, use the provided AWS CloudFormation template. This template will create the necessary MSK cluster and required resources for this post.
  • For this post, we are using the latest Terraform version (1.5.6).

By ensuring you have these prerequisites in place, you will be ready to follow the instructions and streamline your MSK Connect configurations with Terraform. Let’s get started!

Setup

Setting up Terraform for MSK Connect configuration management includes the following:

  • Installation of Terraform and setting up the environment
  • Setting up the necessary authentication and permissions

Defining and decentralizing topic configurations using Terraform includes the following:

  • Understanding the structure of Terraform configuration files
  • Determining the required variables and resources
  • Utilizing Terraform’s modules and interpolation for flexibility

The decision to externalize the configuration was primarily driven by the customer’s business requirement. They anticipated the need to add topics periodically and wanted to avoid the need to bring down and write specific code each time. Given the limitations of MSK Connect (as of this writing), it’s important to note that MSK Connect can handle up to 300 workers. For this proof of concept (POC), we opted for a configuration with 100 topics directed to a single Amazon Simple Storage Service (Amazon S3) bucket. To ensure compatibility within the 300-worker limit, we set the MCU count to 1 and configured auto scaling with a maximum of 2 workers. This ensures that the configuration remains within the bounds of the 300-worker maximum.

To make the configuration more flexible, we specify the variables that can be utilized in the code.(variables.tf):

variable "aws_region" {
description = "The AWS region to deploy resources in."
type = string
}

variable "s3_bucket_name" {
description = "s3_bucket_name."
type = string
}

variable "topics" {
description = "topics"
type = string
}

variable "msk_connect_name" {
description = "Name of the MSK Connect instance."
type = string
}

variable "msk_connect_description" {
description = "Description of the MSK Connect instance."
type = string
}

# Rest of the variables...

To set up the AWS MSK Connector for the S3 Sink, we need to provide various configurations. Let’s examine the connector_configuration block in the code snippet provided in the main.tf file in more detail:

connector_configuration = {
"connector.class" = "io.confluent.connect.s3.S3SinkConnector"
"s3.region" = "us-east-1"
"flush.size" = "5"
"schema.compatibility" = "NONE"
"tasks.max" = "1"
"topics" = var.topics
"format.class" = "io.confluent.connect.s3.format.json.JsonFormat"
"partitioner.class" = "io.confluent.connect.storage.partitioner.DefaultPartitioner"
"value.converter.schemas.enable" = "false"
"value.converter" = "org.apache.kafka.connect.json.JsonConverter"
"storage.class" = "io.confluent.connect.s3.storage.S3Storage"
"key.converter" = "org.apache.kafka.connect.storage.StringConverter"
"s3.bucket.name" = var.s3_bucket_name
"topics.dir" = "cxdl-data/KairosTelemetry"
}

The kafka_cluster block in the code snippet defines the Kafka cluster details, including the bootstrap servers and VPC settings. You can reference the variables to specify the appropriate values:

kafka_cluster {
apache_kafka_cluster {
bootstrap_servers = var.bootstrap_servers

vpc {
security_groups = [var.security_groups]
subnets = [var.aws_subnet_example1_id, var.aws_subnet_example2_id, var.aws_subnet_example3_id]
}
}
}

To secure the connection between Kafka and the connector, the code snippet includes configurations for authentication and encryption:

  • The kafka_cluster_client_authentication block sets the authentication type to IAM, enabling the use of IAM for authentication
  • The kafka_cluster_encryption_in_transit block enables TLS encryption for data transfer between Kafka and the connector
  kafka_cluster_client_authentication {
    authentication_type = "IAM"
  }

  kafka_cluster_encryption_in_transit {
    encryption_type = "TLS"
  }

You can externalize the variables and provide dynamic values using a var.tfvars file. Let’s assume the content of the var.tfvars file is as follows:

aws_region = "us-east-1"
msk_connect_name = "confluentinc-MSK-connect-s3-2"
msk_connect_description = "My MSK Connect instance"
s3_bucket_name = "msk-lab-xxxxxxxxxxxx-target-bucket"
topics = "salesdb.salesdb.CUSTOMER,salesdb.salesdb.CUSTOMER_SITE,salesdb.salesdb.PRODUCT,salesdb.salesdb.PRODUCT_CATEGORY,salesdb.salesdb.SALES_ORDER,salesdb.salesdb.SALES_ORDER_ALL,salesdb.salesdb.SALES_ORDER_DETAIL,salesdb.salesdb.SALES_ORDER_DETAIL_DS,salesdb.salesdb.SUPPLIER"
bootstrap_servers = "b-2.mskclustermskconnectl.4xwlfx.c11.kafka.us-east-1.amazonaws.com:9098,b-3.mskclustermskconnectl.4xwlfx.c11.kafka.us-east-1.amazonaws.com:9098,b-1.mskclustermskconnectl.4xwlfx.c11.kafka.us-east-1.amazonaws.com:9098“
aws_subnet_example1_id = "subnet-016ef7bb5f5db5759"
aws_subnet_example2_id = "subnet-0114c390d379134fa"
aws_subnet_example3_id = "subnet-0f6352ad89a1454f2"
security_groups = "sg-07eb8f8e4559334e7"
aws_mskconnect_custom_plugin_example_arn = "arn:aws:kafkaconnect:us-east-1:xxxxxxxxxxxx:custom-plugin/confluentinc-kafka-connect-s3-10-0-3/e9aeb52e-d172-4dba-9de5-f5cf73f1cb9e-2"
aws_mskconnect_custom_plugin_example_latest_revision = "1"
aws_iam_role_example_arn = "arn:aws:iam::xxxxxxxxxxxx:role/msk-connect-lab-S3ConnectorIAMRole-3LBTU7YAV9CM"

Deploy and update configurations using Terraform

Once you’ve defined your MSK Connect infrastructure using Terraform, applying these configurations is a straightforward process for creating or updating your infrastructure. This becomes particularly convenient when a new topic needs to be added. Thanks to the externalized configuration, incorporating this change is now a seamless task. The steps are as follows:

  1. Download and install Terraform from the official website (https://www.terraform.io/downloads.html) for your operating system.
  2. Confirm the installation by running the terraform version command on your command line interface.
  3. Ensure that you have configured your AWS credentials using the AWS Command Line Interface (AWS CLI) or by setting environment variables. You can use the aws configure command to configure your credentials if you’re using the AWS CLI.
  4. Place the main.tf, variables.tf, and var.tfvars files in the same Terraform directory.
  5. Open a command line interface, navigate to the directory containing the Terraform files, and run the command terraform init to initialize Terraform and download the required providers.
  6. Run the command terraform plan -var-file="var.tfvars" to review the run plan.

This command shows the changes that Terraform will make to the infrastructure based on the provided variables. This step is optional but is often used as a preview of the changes Terraform will make.

  1. If the plan looks correct, run the command terraform apply -var-file="var.tfvars" to apply the configuration.

Terraform will create the MSK_Connect in your AWS account. This will prompt you for confirmation before proceeding.

  1. After the terraform apply command is complete, verify the infrastructure has been created or updated on the console.
  2. For any changes or updates, modify your Terraform files (main.tf, variables.tf, var.tfvars) as needed, and then rerun the terraform plan and terraform apply commands.
  3. When you no longer need the infrastructure, you can use terraform destroy -var-file="var.tfvars" to remove all resources created by your Terraform files.

Be careful with this command because it will delete all the resources defined in your Terraform files.

Conclusion

In this post, we addressed the challenges faced by a customer in managing MSK Connect configurations and described a Terraform-based solution. By externalizing Kafka topic to Amazon S3 configurations, you can streamline your configuration management processes, achieve scalability, enhance flexibility, automate deployments, and centralize management. We encourage you to use Terraform to optimize your MSK Connect configurations and explore further possibilities in managing your streaming data pipelines efficiently.

To get started with externalizing MSK Connect configurations using Terraform, refer to the provided implementation steps and the Getting Started with Terraform guide, MSK Connect documentation, Terraform documentation, and example GitHub repository.

Using Terraform to externalize the Kafka topic to Amazon S3 Sink configuration in MSK Connect offers a powerful solution for managing and scaling your streaming data pipelines. By automating the deployment, updating, and central management of configurations, you can ensure efficiency, flexibility, and scalability in your data processing workflows.


About the Author

RamC Venkatasamy is a Solutions Architect based in Bloomington, Illinois. He helps AWS Strategic customers transform their businesses in the cloud. With a fervent enthusiasm for Serverless, Event-Driven Architecture and GenAI.