Tag Archives: Technical How-to

How to use Amazon Verified Permissions for authorization

2022-12-12 Jeremy Ware

Post Syndicated from Jeremy Ware original https://aws.amazon.com/blogs/security/how-to-use-amazon-verified-permissions-for-authorization/

Applications with multiple users and shared data require permissions management. The permissions describe what each user of an application is permitted to do. Permissions are defined as allow or deny decisions for resources in the application.

To manage permissions, developers often combine attribute-based access control (ABAC) and role-based access control (RBAC) models with custom code coupled with business logic. This requires a review of the code to understand the permissions, and changes to the code to modify the permissions. Auditing permissions within an application can require the same level of time and effort as a full application code review. This can delay delivery and require additional time and resources to ascertain permissions across your application.

In this post, I will show you how to use Amazon Verified Permissions to define permissions within custom applications using the Cedar policy language. I’ll also show you how authorization requests are made.

Overview of Amazon Verified Permissions

Amazon Verified Permissions provides a prebuilt, flexible permissions system that you can use to build permissions based on both ABAC and RBAC in your applications. You define and manage fine-grained permissions using both permit policies, which grant permissions, and forbid policies, which restrict actions. This lets you focus on building or modernizing the application.

Amazon Verified Permissions maintains a centralized policy store, which helps you manage permissions throughout an application, authorize actions, and analyze permissions with automated reasoning. It also has an evaluation simulator tool to help you test your authorization decisions and author policies.

Policy creation

To author policies with Amazon Verified Permissions, use the purpose-built Cedar policy language to create specific permission policies that include traits of ABAC and RBAC. This allows you to apply granularity with least privilege in mind.

The following figure shows a permission policy for a document management application. In the figure, between the set of parentheses on lines 1-4 of the policy, RBAC is used, based on the principal’s UserGroup, to limit the permit action to registered users—and not guest or machine principals, for example. Between the brackets on lines 5–7 of the policy, ABAC is used, where resource.owner == principal limits access to the resource to only the owner.

Figure 1: Using the Cedar policy language to create permissions

Policies are developed in two ways:

Developers build out policies as part of the deployment of the application – Policy permissions that are defined as part of deployment are a great way for developers to set up guardrails on actions that should not cross set boundaries.
Policies are created through the use of the application by end users – Policy permissions that are configurable within the application provide the freedom for data to be shared between users.

We will walk you through these two approaches in the following sections.

Create policies as part of the deployment of the application

The following figure shows how a developer can configure a permit policy as part of the deployment of an application.

Figure 2: Creating policies as part of the deployment of the application

Deploying developer-configured policies with predefined permissions alongside the application is a familiar method for setting up guardrails in an application. Consider the document management application shown in Figure 3. There is a permit policy in place that allows users to view their own documents. Without a policy, the default result is a deny. You should also configure explicit forbid policies to act as guardrails to prevent overly permissive policies. In Figure 3, the policy restricts a user to only GET documents that they own or that are not tagged as private.
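The decision logic described above (deny by default, with forbid policies acting as guardrails over permit policies) can be modeled in a few lines of Python. This is only an illustrative sketch, not the Verified Permissions evaluator: the `Request` and `Policy` types and the two sample policies are hypothetical stand-ins for the Cedar policies shown in the figures.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Request:
    principal: str
    action: str
    owner: str             # owner of the requested document (hypothetical attribute)
    private: bool = False  # whether the document is tagged as private


@dataclass(frozen=True)
class Policy:
    effect: str                         # "permit" or "forbid"
    matches: Callable[[Request], bool]  # condition the policy applies to


def decide(policies: List[Policy], request: Request) -> str:
    """Return "ALLOW" or "DENY" for a request.

    A matching forbid always wins, a matching permit allows,
    and anything else falls back to the default deny.
    """
    matched = [p.effect for p in policies if p.matches(request)]
    if "forbid" in matched:
        return "DENY"
    if "permit" in matched:
        return "ALLOW"
    return "DENY"


# Hypothetical policies for a document management application.
POLICIES = [
    # Users may GET documents that they own.
    Policy("permit", lambda r: r.action == "GET" and r.owner == r.principal),
    # Guardrail: nobody but the owner may touch a private document.
    Policy("forbid", lambda r: r.private and r.owner != r.principal),
]
```

With these policies, `decide(POLICIES, Request("alice", "GET", "alice"))` allows the request, while a request from another user for the same document is denied because no permit policy matches it. If a broader permit policy is added later, the forbid policy still blocks access to private documents.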

Figure 3: Example of a permit policy using Cedar

Create policies within the application by end users

The following figure shows how end users can apply policies within the application.

Figure 4: How permissions can be applied using policies for application end users

In a document sharing application, the application usually provides a simple end-user experience with a menu containing point-and-click actions that allow the user to select predefined permissions, such as read, write, or delete. Abstracted by the application, these permissions are transformed into Amazon Verified Permissions policy statements and stored in the designated policy location for the application. When an end user tries to take actions protected by these permissions, the application queries the Amazon Verified Permissions backend to determine if the principal in question has permissions to do so.
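As a rough illustration of the authorization query described above, the following sketch builds a request for the `is_authorized` operation of the boto3 `verifiedpermissions` client. It assumes that client is available in your SDK version. The entity type names (`User`, `Action`, `Document`) and the helper function names are hypothetical and depend on how your policy store is modeled.

```python
from typing import Any, Dict


def build_is_authorized_request(
    policy_store_id: str, user_id: str, action_id: str, document_id: str
) -> Dict[str, Any]:
    """Build the keyword arguments for an is_authorized call.

    The entity types used here are assumptions for a document sharing app.
    """
    return {
        "policyStoreId": policy_store_id,
        "principal": {"entityType": "User", "entityId": user_id},
        "action": {"actionType": "Action", "actionId": action_id},
        "resource": {"entityType": "Document", "entityId": document_id},
    }


def is_allowed(request: Dict[str, Any]) -> bool:
    """Send the request to Verified Permissions and interpret the decision."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("verifiedpermissions")
    response = client.is_authorized(**request)
    return response["decision"] == "ALLOW"
```

An application would typically call `is_allowed(build_is_authorized_request(...))` before performing the protected action, and treat any result other than an explicit allow as a deny.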

You can allow users of the application to create policies directly with respect to their given environments or current permissions. For example, if the application is targeted to system administrators or engineers who are technically proficient, you might choose not to hide the policy generation process behind a UI. The Amazon Verified Permissions policy grammar is designed for users comfortable with text-based query languages. Figure 5 shows an example policy that allows a user to GET or POST documents that they own.

Figure 5: Amazon Verified Permissions policy grammar written with Cedar to define permissions

Conclusion

Amazon Verified Permissions is a scalable, fine-grained permissions management and authorization service that helps you build and modernize applications without relying heavily on coding authorization within your applications. By using the Cedar policy language, you can define granular access controls that use both RBAC and ABAC and help end users create policies within the application. This allows for alignment of authorization standards across applications and provides clear visibility into existing permissions for review and auditability.

To learn more about ABAC and RBAC and how to design policy statements, see the blog post Get the best out of Amazon Verified Permissions by using fine-grained authorization methods.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Configuration driven dynamic multi-account CI/CD solution on AWS

2022-12-12 Anshul Saxena

Post Syndicated from Anshul Saxena original https://aws.amazon.com/blogs/devops/configuration-driven-dynamic-multi-account-ci-cd-solution-on-aws/

Many organizations require durable automated code delivery for their applications. They leverage multi-account continuous integration/continuous deployment (CI/CD) pipelines to deploy code and run automated tests in multiple environments before deploying to Production. In cases where the testing strategy is release specific, you must update the pipeline before every release. Traditional pipeline stages are predefined and static in nature, and once the pipeline stages are defined it’s hard to update them. In this post, we present a configuration driven dynamic CI/CD solution per repository. The pipeline state is maintained and governed by configurations stored in Amazon DynamoDB. This gives you the advantage of automatically customizing the pipeline for every release based on the testing requirements.

By following this post, you will set up a dynamic multi-account CI/CD solution. Your pipeline will deploy and test a sample pet store API application. Refer to Automating your API testing with AWS CodeBuild, AWS CodePipeline, and Postman for more details on this application. New code deployments will be delivered with custom pipeline stages based on the pipeline configuration that you create. This solution uses services such as AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, Amazon DynamoDB, AWS Lambda, and AWS Step Functions.

Solution overview

The following diagram illustrates the solution architecture:

Figure 1: Architecture Diagram

Users insert, update, or delete an entry in the DynamoDB table.
The Step Function Trigger Lambda is invoked on all modifications.
The Step Function Trigger Lambda evaluates the incoming event and does the following:
1. On insert and update, triggers the Step Function.
2. On delete, finds the appropriate CloudFormation stack and deletes it.
Steps in the Step Function are as follows:
1. Collect Information (Pass State) – Filters the relevant information from the event, such as repositoryName and referenceName.
2. Get Mapping Information (Backed by CodeCommit event filter Lambda) – Retrieves the mapping information from the Pipeline config stored in the DynamoDB.
3. Deployment Configuration Exist? (Choice State) – If StatusCode == 200, the DynamoDB entry was found and the Initiate CloudFormation Stack step is invoked; otherwise, the Step Function exits with a Successful status.
4. Initiate CloudFormation Stack (Backed by stack create Lambda) – Constructs the CloudFormation parameters and creates/updates the dynamic pipeline based on the configuration stored in the DynamoDB via CloudFormation.
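The routing performed by the Step Function Trigger Lambda can be sketched as follows. This is a minimal illustration, not the code from the repository: it assumes the Lambda receives DynamoDB Streams records, that the table keys are named RepoName and RepoTag, and that a stack or execution can be identified by a hypothetical `<repo>-<branch>` name.

```python
from typing import Any, Dict, List, Tuple


def plan_actions(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Map DynamoDB Streams records to the action the trigger should take.

    Returns (action, target) pairs instead of calling AWS, so the routing
    can be inspected on its own. Key names and the target naming scheme
    are assumptions for illustration.
    """
    actions: List[Tuple[str, str]] = []
    for record in event.get("Records", []):
        keys = record["dynamodb"]["Keys"]
        target = f'{keys["RepoName"]["S"]}-{keys["RepoTag"]["S"]}'
        if record["eventName"] in ("INSERT", "MODIFY"):
            # Insert and update: start the Step Function execution.
            actions.append(("start_step_function", target))
        elif record["eventName"] == "REMOVE":
            # Delete: remove the matching CloudFormation stack.
            actions.append(("delete_stack", target))
    return actions
```

In a real handler, each planned action would be followed by the corresponding Step Functions or CloudFormation API call.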

Code deliverables

The code deliverables include the following:

AWS CDK app – The AWS CDK app contains the code for all the Lambdas, Step Functions, and CloudFormation templates.
sample-application-repo – This directory contains the sample application repository used for deployment.
automated-tests-repo – This directory contains the sample automated tests repository for testing the sample repo.

Deploying the CI/CD solution

Clone this repository to your local machine.
Follow the README to deploy the solution to your main CI/CD account. Upon successful deployment, the following resources should be created in the CI/CD account:
1. A DynamoDB table
2. Step Function
3. Lambda Functions
Navigate to the Amazon Simple Storage Service (Amazon S3) console in your main CI/CD account and search for a bucket with the name: cloudformation-template-bucket-<AWS_ACCOUNT_ID>. You should see two CloudFormation templates (templates/codepipeline.yaml and templates/childaccount.yaml) uploaded to this bucket.
Run the childaccount.yaml in every target CI/CD account (Alpha, Beta, Gamma, and Prod) by going to the CloudFormation Console. Provide the main CI/CD account number as the “CentralAwsAccountId” parameter, and execute.
Upon successful creation of the stack, two roles are created in the child accounts:
1. ChildAccountFormationRole
2. ChildAccountDeployerRole

Pipeline configuration

Make an entry into devops-pipeline-table-info for the Repository name and branch combination. A sample entry can be found in sample-entry.json.

The pipeline is highly configurable, and everything can be configured through the DynamoDB entry.

The following are the top-level keys:

RepoName: Name of the repository for which AWS CodePipeline is configured.
RepoTag: Name of the branch used in CodePipeline.
BuildImage: Build image used for application AWS CodeBuild project.
BuildSpecFile: Buildspec file used in the application CodeBuild project.
DeploymentConfigurations: This key holds the deployment configurations for the pipeline. Under this key are the environment specific configurations. In our case, we’ve named our environments Alpha, Beta, Gamma, and Prod. You can use any names you like, but make sure that the entries in the JSON match those in the codepipeline.yaml CloudFormation template, because there is a 1:1 mapping between them. Sub-level keys under DeploymentConfigurations are as follows:

EnvironmentName. This is the top-level key for environment specific configuration. In our case, it’s Alpha, Beta, Gamma, and Prod. Sub level keys under this are:
- <Env>AwsAccountId: AWS account ID of the target environment.
- Deploy<Env>: A key specifying whether or not the artifact should be deployed to this environment. Based on its value, the CodePipeline will have a deployment stage to this environment.
- ManualApproval<Env>: Key representing whether or not manual approval is required before deployment. Enter your email or set to false.
- Tests: Once again, this is a top-level key with sub-level keys. This key holds the information about the tests to run in specific environments. Each test that is enabled adds a step to the CodePipeline. The test information is also configurable: you can specify the test repository, branch name, buildspec file, and build image for the test CodeBuild project.
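To make the key layout above more concrete, here is a hypothetical entry expressed as a Python dictionary, together with a small helper that lists the stages such an entry would produce. The values, the value types, and the stage names are assumptions for illustration; sample-entry.json in the repository remains the authoritative format, and the layout under Tests is left empty here because it is defined there.

```python
from typing import Any, Dict, List

# Hypothetical entry; every value is a placeholder.
SAMPLE_ENTRY: Dict[str, Any] = {
    "RepoName": "sample-application",
    "RepoTag": "main",
    "BuildImage": "aws/codebuild/standard:6.0",
    "BuildSpecFile": "buildspec.yaml",
    "DeploymentConfigurations": {
        "Alpha": {
            "AlphaAwsAccountId": "111111111111",
            "DeployAlpha": "true",
            "ManualApprovalAlpha": "false",
            "Tests": {},
        },
        "Beta": {
            "BetaAwsAccountId": "222222222222",
            "DeployBeta": "false",
            "ManualApprovalBeta": "false",
            "Tests": {},
        },
        "Prod": {
            "ProdAwsAccountId": "333333333333",
            "DeployProd": "true",
            "ManualApprovalProd": "approver@example.com",
            "Tests": {},
        },
    },
}


def is_enabled(value: Any) -> bool:
    """Treat booleans and strings such as "false" uniformly."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() not in ("", "false", "no", "0")


def planned_stages(entry: Dict[str, Any]) -> List[str]:
    """List the approval and deployment stages implied by an entry."""
    stages: List[str] = []
    for env, cfg in entry["DeploymentConfigurations"].items():
        if not is_enabled(cfg.get(f"Deploy{env}", False)):
            continue  # environment is skipped entirely
        if is_enabled(cfg.get(f"ManualApproval{env}", False)):
            stages.append(f"Approve{env}")
        stages.append(f"Deploy{env}")
    return stages
```

For the sample entry, the helper skips Beta and places an approval step before the Prod deployment, which mirrors how the configuration drives the shape of the pipeline.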

Execute

Make an entry into the devops-pipeline-table-info DynamoDB table in the main CI/CD account. A sample entry can be found in sample-entry.json. Make sure to replace the configuration values with appropriate values for your environment. An explanation of the values can be found in the Pipeline Configuration section above.
After the entry is made in the DynamoDB table, you should see a CloudFormation stack being created. This CloudFormation stack will deploy the CodePipeline in the main CI/CD account by reading and using the entry in the DynamoDB table.

Customize the solution for different combinations, such as deploying to one environment while skipping others, by updating the pipeline configuration stored in the devops-pipeline-table-info DynamoDB table. The following is the pipeline configured for the sample-application repository’s main branch.

Figure 2: Dynamic Multi-Account CI/CD Pipeline

Clean up your dynamic multi-account CI/CD solution and related resources

To avoid ongoing charges for the resources that you created following this post, you should delete the following:

The pipeline configuration stored in the DynamoDB table
The CloudFormation stacks deployed in the target CI/CD accounts
The AWS CDK app deployed in the main CI/CD account
The retained S3 buckets (empty them before deleting)

Conclusion

This configuration-driven CI/CD solution provides the ability to dynamically create and configure your pipelines in DynamoDB. IDEMIA, a global leader in identity technologies, adopted this approach for deploying their microservices based application across environments. This solution created by AWS Professional Services allowed them to dynamically create and configure their pipelines per repository per release. As Kunal Bajaj, Tech Lead of IDEMIA, states, “We worked with AWS pro-serve team to create a dynamic CI/CD solution using lambdas, step functions, SQS, and other native AWS services to conduct cross-account deployments to our different environments while providing us the flexibility to add tests and approvals as needed by the business.”

How to secure your SaaS tenant data in DynamoDB with ABAC and client-side encryption

2022-12-07 Jani Muuriaisniemi

Post Syndicated from Jani Muuriaisniemi original https://aws.amazon.com/blogs/security/how-to-secure-your-saas-tenant-data-in-dynamodb-with-abac-and-client-side-encryption/

If you’re a SaaS vendor, you may need to store and process personal and sensitive data for large numbers of customers across different geographies. When processing sensitive data at scale, you have an increased responsibility to secure this data end-to-end. Client-side encryption of data, such as your customers’ contact information, provides an additional mechanism that can help you protect your customers and earn their trust.

In this blog post, we show how to implement client-side encryption of your SaaS application’s tenant data in Amazon DynamoDB with the Amazon DynamoDB Encryption Client. This is accomplished by leveraging AWS Identity and Access Management (IAM) together with AWS Key Management Service (AWS KMS) for a more secure and cost-effective isolation of the client-side encrypted data in DynamoDB, both at run-time and at rest.

Encrypting data in Amazon DynamoDB

Amazon DynamoDB supports data encryption at rest using encryption keys stored in AWS KMS. This functionality helps reduce operational burden and complexity involved in protecting sensitive data. In this post, you’ll learn about the benefits of adding client-side encryption to achieve end-to-end encryption in transit and at rest for your data, from its source to storage in DynamoDB. Client-side encryption helps ensure that your plaintext data isn’t available to any third party, including AWS.

You can use the Amazon DynamoDB Encryption Client to implement client-side encryption with DynamoDB. In the solution in this post, client-side encryption refers to the cryptographic operations that are performed on the application-side in the application’s Lambda function, before the data is sent to or retrieved from DynamoDB. The solution in this post uses the DynamoDB Encryption Client with the Direct KMS Materials Provider so that your data is encrypted by using AWS KMS. However, the underlying concept of the solution is not limited to the use of the DynamoDB Encryption Client. You can apply it to any client-side use of AWS KMS, for example using the AWS Encryption SDK.

For detailed information about using the DynamoDB Encryption Client, see the blog post How to encrypt and sign DynamoDB data in your application. This is a great place to start if you are not yet familiar with DynamoDB Encryption Client. If you are unsure about whether you should use client-side encryption, see Client-side and server-side encryption in the Amazon DynamoDB Encryption Client Developer Guide to help you with the decision.

AWS KMS encryption context

AWS KMS gives you the ability to add an additional layer of authentication for your AWS KMS API decrypt operations by using encryption context. The encryption context is one or more key-value pairs of additional data that you want associated with AWS KMS protected information.

Encryption context helps you defend against the risks of ciphertexts being tampered with, modified, or replaced — whether intentionally or unintentionally. Encryption context helps defend against both an unauthorized user replacing one ciphertext with another, as well as problems like operational events. To use encryption context, you specify associated key-value pairs on encrypt. You must provide the exact same key-value pairs in the encryption context on decrypt, or the operation will fail. Encryption context is not secret, and is not an access-control mechanism. The encryption context is a means of authenticating the data, not the caller.

The Direct KMS Materials Provider used in this blog post transparently generates a unique data key by using AWS KMS for each item stored in the DynamoDB table. It automatically sets the item’s partition key and sort key (if any) as AWS KMS encryption context key-value pairs.

The solution in this blog post relies on the partition key of each table item being defined in the encryption context. If you encrypt data with your own implementation, make sure to add your tenant ID to the encryption context in all your AWS KMS API calls.
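If you do encrypt data with your own implementation, the following sketch shows one way to pass the same encryption context on both the encrypt and decrypt side by using the boto3 KMS client. The helper function names are hypothetical. The `tenant_id` context key is chosen to match the condition key used in the ResourceAccessRole policy later in this post.

```python
from typing import Any, Dict


def encryption_context(tenant_id: str) -> Dict[str, str]:
    """Encryption context that binds a ciphertext to one tenant."""
    return {"tenant_id": tenant_id}


def generate_data_key_args(key_id: str, tenant_id: str) -> Dict[str, Any]:
    """Arguments for kms.generate_data_key, including the tenant context."""
    return {
        "KeyId": key_id,
        "KeySpec": "AES_256",
        "EncryptionContext": encryption_context(tenant_id),
    }


def decrypt_args(ciphertext_blob: bytes, key_id: str, tenant_id: str) -> Dict[str, Any]:
    """Arguments for kms.decrypt; the context must match the one used on encrypt."""
    return {
        "CiphertextBlob": ciphertext_blob,
        "KeyId": key_id,
        "EncryptionContext": encryption_context(tenant_id),
    }


def round_trip(key_id: str, tenant_id: str) -> bool:
    """Generate a data key and decrypt it again with the same context."""
    import boto3  # imported lazily so the builders work without AWS access

    kms = boto3.client("kms")
    data_key = kms.generate_data_key(**generate_data_key_args(key_id, tenant_id))
    decrypted = kms.decrypt(
        **decrypt_args(data_key["CiphertextBlob"], key_id, tenant_id)
    )
    return decrypted["Plaintext"] == data_key["Plaintext"]
```

Calling `kms.decrypt` with a context for a different tenant fails, because the key-value pairs no longer match those supplied when the data key was generated.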

For more information about the concept of AWS KMS encryption context, see the blog post How to Protect the Integrity of Your Encrypted Data by Using AWS Key Management Service and EncryptionContext. You can also see another example in Exercise 3 of the Busy Engineer’s Document Bucket Workshop.

Attribute-based access control for AWS

Attribute-based access control (ABAC) is an authorization strategy that defines permissions based on attributes. In AWS, these attributes are called tags. In the solution in this post, ABAC helps you create tenant-isolated access policies for your application, without the need to provision tenant specific AWS IAM roles.

If you are new to ABAC, or need a refresher on the concepts and the different isolation methods, see the blog post How to implement SaaS tenant isolation with ABAC and AWS IAM.

Solution overview

If you are a SaaS vendor expecting large numbers of tenants, it is important that your underlying architecture can cost effectively scale with minimal complexity to support the required number of tenants, without compromising on security. One way to meet these criteria is to store your tenant data in a single pooled DynamoDB table, and to encrypt the data using a single AWS KMS key.

Using a single shared KMS key to read and write encrypted data in DynamoDB for multiple tenants reduces your per-tenant costs. This may be especially relevant to manage your costs if you have users on your organization’s free tier, with no direct revenue to offset your costs.

When you use shared resources such as a single pooled DynamoDB table encrypted by using a single KMS key, you need a mechanism to help prevent cross-tenant access to the sensitive data. This is where you can use ABAC for AWS. By using ABAC, you can build an application with strong tenant isolation capabilities, while still using shared and pooled underlying resources for storing your sensitive tenant data.

You can find the solution described in this blog post in the aws-dynamodb-encrypt-with-abac GitHub repository. This solution uses ABAC combined with KMS encryption context to provide isolation of tenant data, both at rest and at run time. By using a single KMS key, the application encrypts tenant data on the client-side, and stores it in a pooled DynamoDB table, which is partitioned by a tenant ID.

Solution Architecture

Figure 1: Components of solution architecture

The presented solution implements an API with a single AWS Lambda function behind an Amazon API Gateway, and implements processing for two types of requests:

GET request: fetch any key-value pairs stored in the tenant data store for the given tenant ID.
POST request: store the provided key-value pairs in the tenant data store for the given tenant ID, overwriting any existing data for the same tenant ID.

The application is written in Python and uses AWS Lambda Powertools for Python. You deploy it by using the AWS CDK.

It also uses the DynamoDB Encryption Client for Python, which includes several helper classes that mirror the AWS SDK for Python (Boto3) classes for DynamoDB. This solution uses the EncryptedResource helper class which provides Boto3 compatible get_item and put_item methods. The helper class is used together with the KMS Materials Provider to handle encryption and decryption with AWS KMS transparently for the application.

Note: This example solution provides no authentication of the caller identity. See the section “Considerations for authentication and authorization” for further guidance.

How it works

Figure 2: Detailed architecture for storing new or updated tenant data

As requests are made into the application’s API, they are routed by API Gateway to the application’s Lambda function (1). The Lambda function begins to run with the IAM permissions that its IAM execution role (DefaultExecutionRole) has been granted. These permissions do not grant any access to the DynamoDB table or the KMS key. In order to access these resources, the Lambda function first needs to assume the ResourceAccessRole, which does have the necessary permissions. To implement ABAC more securely in this use case, it is important that the application maintains clear separation of IAM permissions between the assumed ResourceAccessRole and the DefaultExecutionRole.

As the application assumes the ResourceAccessRole using the AssumeRole API call (2), it also sets a TenantID session tag. Session tags are key-value pairs that can be passed when you assume an IAM role in AWS Security Token Service (AWS STS), and are a fundamental building block of ABAC on AWS. When the session credentials (3) are used to make a subsequent request, the request context includes the aws:PrincipalTag context key, which can be used to access the session’s tags. The section “The ResourceAccessRole policy” describes how the aws:PrincipalTag context key is used in IAM policy condition statements to implement ABAC for this solution. Note that for demonstration purposes, this solution receives the value for the TenantID tag directly from the request URL, and it is not authenticated.
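The AssumeRole step with a TenantID session tag can be sketched with boto3 as follows. This is a minimal illustration rather than the code from the repository: the function names, the session naming scheme, and the validation rule for the tenant ID are assumptions. The validation reuses the character set that AWS STS accepts for role session names, which is a deliberately conservative choice for a value that arrives from the request URL.

```python
import re
from typing import Any, Dict

# Characters accepted in an STS role session name. The length limit of 57
# leaves room for the hypothetical "tenant-" prefix (64 characters in total).
_TENANT_ID = re.compile(r"[\w+=,.@-]{1,57}", re.ASCII)


def assume_role_args(role_arn: str, tenant_id: str) -> Dict[str, Any]:
    """Build the arguments for sts.assume_role with a TenantID session tag."""
    if not _TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant ID: {tenant_id!r}")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": f"tenant-{tenant_id}",
        "Tags": [{"Key": "TenantID", "Value": tenant_id}],
    }


def tenant_session(role_arn: str, tenant_id: str):
    """Return a boto3 session that carries the tenant-scoped credentials."""
    import boto3  # imported lazily so the builder works without AWS access

    sts = boto3.client("sts")
    creds = sts.assume_role(**assume_role_args(role_arn, tenant_id))["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
```

Clients created from the returned session, for example `tenant_session(role_arn, "tenant-a").client("kms")`, make requests whose aws:PrincipalTag/TenantID value is the tenant ID passed here.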

The trust policy of the ResourceAccessRole defines the principals that are allowed to assume the role, and to tag the assumed role session. Make sure to limit the principals to the least needed for your application to function. In this solution, the application Lambda function is the only trusted principal defined in the trust policy.

Next, the Lambda function prepares to encrypt or decrypt the data (4). To do so, it uses the DynamoDB Encryption Client. The KMS Materials Provider and the EncryptedResource helper class are both initialized with sessions by using the temporary credentials from the AssumeRole API call. This allows the Lambda function to access the KMS key and DynamoDB table resources, with access restricted to operations on data belonging only to the specific tenant ID.

Finally, using the EncryptedResource helper class provided by the DynamoDB Encryption Client, the data is written to and read from the DynamoDB table (5).

Considerations for authentication and authorization

The solution in this blog post intentionally does not implement authentication or authorization of the client requests. Instead, the requested tenant ID from the request URL is passed as the tenant identity. Your own applications should always authenticate and authorize tenant requests. There are multiple ways you can achieve this.

Modern web applications commonly use OpenID Connect (OIDC) for authentication, and OAuth for authorization. JSON Web Tokens (JWTs) can be used to pass the resulting authorization data from client to the application. You can validate a JWT when using Amazon API Gateway with one of the following methods:

When using a REST or an HTTP API, you can use a Lambda authorizer
When using an HTTP API, you can use a JWT authorizer
You can validate the token directly in your application code

If you write your own authorizer code, you can pick a popular open source library or you can choose the AWS provided open source library. To learn more about using a JWT authorizer, see the blog post How to secure API Gateway HTTP endpoints with JWT authorizer.

Regardless of the chosen method, you must be able to map a suitable claim from the user’s JWT, such as the subject, to the tenant ID, so that it can be used as the session tag in this solution.

The ResourceAccessRole policy

The IAM access policy for the ResourceAccessRole is critical to the correct operation of ABAC in this solution. In the following policy, be sure to replace <region>, <account-id>, <table-name>, and <key-id> with your own values.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeTable",
                "dynamodb:GetItem",
                "dynamodb:PutItem"
            ],
            "Resource": [
                "arn:aws:dynamodb:<region>:<account-id>:table/<table-name>"
            ],
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": [
                        "${aws:PrincipalTag/TenantID}"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>",
            "Condition": {
                "StringEquals": {
                    "kms:EncryptionContext:tenant_id": "${aws:PrincipalTag/TenantID}"
                }
            }
        }
    ]
}

The policy defines two access statements, both of which apply separate ABAC conditions:

The first statement grants access to the DynamoDB table with the condition that the partition key of the item matches the TenantID session tag in the caller’s session.
The second statement grants access to the KMS key with the condition that one of the key-value pairs in the encryption context of the API call has a key called tenant_id with a value that matches the TenantID session tag in the caller’s session.

Warning: Do not use a ForAnyValue or ForAllValues set operator with the kms:EncryptionContext single-valued condition key. These set operators can create a policy condition that does not require values you intend to require, and allows values you intend to forbid.
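To reason about what the two statements allow, it can help to model their conditions in plain Python. The following is a simplified sketch of just these two conditions, not of IAM policy evaluation as a whole, and the function names are hypothetical.

```python
from typing import Dict, Iterable


def dynamodb_allowed(session_tags: Dict[str, str], leading_keys: Iterable[str]) -> bool:
    """Model of the ForAllValues:StringEquals condition on dynamodb:LeadingKeys.

    Every partition key in the request must equal the TenantID session tag.
    A request without leading keys (such as DescribeTable) passes vacuously,
    as it does with ForAllValues in IAM.
    """
    tenant = session_tags.get("TenantID")
    if tenant is None:
        return False  # the policy variable cannot be resolved
    return all(key == tenant for key in leading_keys)


def kms_allowed(session_tags: Dict[str, str], encryption_context: Dict[str, str]) -> bool:
    """Model of the StringEquals condition on kms:EncryptionContext:tenant_id.

    The context must contain tenant_id with exactly the session's tenant ID.
    """
    tenant = session_tags.get("TenantID")
    if tenant is None:
        return False
    return encryption_context.get("tenant_id") == tenant
```

For a session tagged with TenantID tenant-a, `dynamodb_allowed` accepts an item keyed by tenant-a and rejects one keyed by tenant-b, and `kms_allowed` rejects any KMS call whose encryption context is missing tenant_id or names another tenant.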

Deploying and testing the solution

Prerequisites

To deploy and test the solution, you need the following:

An AWS account
The AWS Command Line Interface (AWS CLI)
Node.js version compatible with AWS CDK version 2.37.0
Python 3.9
Git
Docker

Deploying the solution

After you have the prerequisites installed, run the following steps in a command line environment to deploy the solution. Make sure that your AWS CLI is configured with your AWS account credentials. Note that standard AWS service charges apply to this solution. For more information about pricing, see the AWS Pricing page.

To deploy the solution into your AWS account

Use the following command to download the source code:

git clone https://github.com/aws-samples/aws-dynamodb-encrypt-with-abac
cd aws-dynamodb-encrypt-with-abac

(Optional) You need an AWS CDK version compatible with the application (2.37.0) to deploy. The simplest way is to install a local copy with npm, but you can also use a globally installed version if you already have one. To install the AWS CDK locally with npm, use the following command:
```
npm install aws-cdk@2.37.0
```

Use the following commands to initialize a Python virtual environment:

python3 -m venv demoenv
source demoenv/bin/activate
python3 -m pip install -r requirements.txt

(Optional) If you have not used AWS CDK with this account and Region before, you first need to bootstrap the environment:
```
npx cdk bootstrap
```
Use the following command to deploy the application with the AWS CDK:
```
npx cdk deploy
```
Make note of the API endpoint URL https://<api url>/prod/ in the Outputs section of the CDK command. You will need this URL for the next steps.
```
Outputs:
DemoappStack.ApiEndpoint4F160690 = https://<api url>/prod/
```

Testing the solution with example API calls

With the application deployed, you can test the solution by making API calls against the API URL that you captured from the deployment output. You can start with a simple HTTP POST request to insert data for a tenant. The API expects a JSON string as the data to store, so make sure to post properly formatted JSON in the body of the request.

An example request using the curl command looks like the following:

curl https://<api url>/prod/tenant/<tenant-name> -X POST --data '{"email":"<email-address>"}'

You can then read the same data back with an HTTP GET request:

curl https://<api url>/prod/tenant/<tenant-name>

You can store and retrieve data for any number of tenants, and can store as many attributes as you like. Each time you store data for a tenant, any previously stored data is overwritten.
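If you prefer to script these calls, the same requests can be built with the Python standard library. This is a small sketch that mirrors the curl commands above. The `tenant_request` helper and the example URL are hypothetical, so replace the URL with the API endpoint from your deployment output.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional


def tenant_request(
    api_url: str, tenant: str, data: Optional[Dict[str, Any]] = None
) -> urllib.request.Request:
    """Build a GET request, or a POST request when data is provided."""
    base = api_url.rstrip("/")
    url = f"{base}/tenant/{urllib.parse.quote(tenant, safe='')}"
    if data is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(data).encode("utf-8")  # the API expects a JSON string
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Example usage (placeholder URL, not sent here):
# req = tenant_request("https://api.example.com/prod/", "tenant-a", {"email": "user@example.com"})
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before any data leaves your machine.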

Additional considerations

A tenant ID is used as the DynamoDB table’s partition key in the example application in this solution. You can replace the tenant ID with another unique partition key, such as a product ID, as long as the ID is consistently used in the IAM access policy, the IAM session tag, and the KMS encryption context. In addition, while this solution does not use a sort key in the table, you can modify the application to support a sort key with only a few changes. For more information, see Working with tables and data in DynamoDB.

Clean up

To clean up the application resources that you deployed while testing the solution, in the solution’s home directory, run the command cdk destroy.

Then, if you no longer plan to deploy to this account and Region using AWS CDK, you can also use the AWS CloudFormation console to delete the bootstrap stack (CDKToolkit).

Conclusion

In this post, you learned a method for simple and cost-efficient client-side encryption for your tenant data. By using the DynamoDB Encryption Client, you were able to implement the encryption with less effort, all while using a standard Boto3 DynamoDB Table resource compatible interface.

Adding to the client-side encryption, you also learned how to apply attribute-based access control (ABAC) to your IAM access policies. You used ABAC for tenant isolation by applying conditions for both the DynamoDB table access, as well as access to the KMS key that is used for encryption of the tenant data in the DynamoDB table. By combining client-side encryption with ABAC, you have increased your data protection with multiple layers of security.

You can start experimenting today on your own by using the provided solution. If you have feedback about this post, submit comments in the Comments section below. If you have questions on the content, consider submitting them to AWS re:Post.

Want more AWS Security news? Follow us on Twitter.

How to investigate and take action on security issues in Amazon EKS clusters with Amazon Detective – Part 2

2022-12-05 Marshall Jones

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-investigate-and-take-action-on-security-issues-in-amazon-eks-clusters-with-amazon-detective-part-2/

In part 1 of this two-part series, How to detect security issues in Amazon EKS cluster using Amazon GuardDuty, we walked through a real-world observed security issue in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster and saw how Amazon GuardDuty detected each phase by following MITRE ATT&CK tactics.

In this blog post, we’ll walk you through investigative techniques to use with Amazon Detective, paired with the GuardDuty EKS and malware findings from the security issue. After we have identified impacted resources through our investigation, we’ll provide example remediation tactics and preventative controls to address and help prevent security issues in EKS clusters.

Amazon Detective can help you investigate security issues and related resources in your account. Detective provides EKS coverage that you can enable within your accounts. When this coverage is enabled, Detective can help investigate and remediate potentially unauthorized EKS activity that results from misconfiguration of the control plane nodes or application. Although GuardDuty is not a prerequisite to enable Detective, it is recommended that you enable GuardDuty to enhance the visualization capabilities in Detective with GuardDuty findings.

Prerequisites

You must have the following services enabled in your AWS account to generate and investigate findings associated with EKS security events in a similar manner as outlined in this blog. If you do not have GuardDuty enabled, you can still investigate with Detective, but in a limited capacity.

Amazon GuardDuty, along with these features of GuardDuty:
- Kubernetes Protection
- Malware Protection
Amazon Detective, along with this feature of Detective:
- EKS audit logs

Investigate with Amazon Detective

In the five phases we walked through in part 1, we discussed GuardDuty findings and MITRE ATT&CK tactics that can help you detect and understand each phase of the unauthorized activity, from the initial misconfiguration to the impact on our application when the EKS cluster is used for crypto mining.

The next recommended step is to investigate the EKS cluster and any associated resources. Amazon Detective can help you to investigate whether there was any other related unauthorized activity in the environment. We will walk through Detective capabilities for visualizing and gathering important information to effectively respond to the security issue. If you’re interested in creating detailed incident response playbooks for your security team to follow in your own environment, refer to these sample AWS incident response playbooks.

Depending on your scenario, there are various resources you can use to start your investigation, such as Security Hub findings, GuardDuty findings, related Kubernetes subjects, or an AWS account’s AWS CloudTrail activity. For our walkthrough, we’ll start our investigation from the GuardDuty finding and use the EKS cluster resource to pivot to the Detective console, as shown in Figure 7. Although we initially focus on the EKS cluster, you could start from any entities that are supported in the Detective behavior graph structure in the Amazon Detective User Guide. For example, we could start directly with the Kubernetes subject system:anonymous and find activity associated with the anonymous user.

Figure 7: Example Detective popup from GuardDuty finding for EKS cluster

We’ll now go over the information that you would need to gather from Detective in order to investigate the example security issue.

To investigate EKS cluster findings with Detective

In the GuardDuty console, navigate to an individual finding and hover over Investigate with Detective. Choose one of the specific resources to start. In the image below, we selected the EKS cluster resource to investigate with Detective. You will need to gather some preliminary information about the IAM roles associated with the EKS cluster.
- Questions: When was the cluster created? What IAM role created the cluster? What IAM role is assigned to the cluster?
- Why it matters: If you are an incident responder, these details can potentially help you identify the owner of the cluster and help you determine what IAM principals are involved.
- What next: Start looking into each IAM principal’s activity, as seen in CloudTrail, to investigate whether the IAM entity itself is potentially compromised or what other resources may have been impacted.
Figure 8: Detective summary page for EKS cluster metadata details
Next, on the EKS cluster overview page, you can see the container details associated with the cluster.
- Question: What are some of the other container details for the cluster? Does anything look out of the ordinary? Is it using a public image? Is it missing a network policy?
- Why it matters: Based on the architecture related to this cluster, you might be able to use this information to determine whether there are unauthorized containers. The contents of unauthorized containers will depend on your organization but typically consist of public images or unauthorized RBAC, pod security policies, or network policy configurations. It’s important to keep in mind that when you look at data in Detective, the scope time is very important. When you pivot from a GuardDuty finding, the scope time will be set to the first time the GuardDuty finding was seen to the last time the finding was seen. The container details reflect the containers that were running during the selected scope time. Changing the scope time might change the containers that are listed in the table shown in Figure 9.
- What next: Information found on this page can help to highlight unauthorized resources or configurations that will need to be remediated. You will also need to look at how these resources were initially created and if there are missing guardrails that should have been created during the provisioning of the cluster.
Figure 9: Detective summary page for EKS container metadata details
Finally, you will see associated security findings with this specific EKS cluster, similar to Figure 10, at the bottom of the EKS cluster overview page in Detective.
- Question: Are there any other security findings associated with this cluster that I previously was not aware of?
- Why it matters: In our example scenario, we walked through the findings that were initially detected and the events that unfolded from those findings. After further investigation, you might see other findings that were not part of the original investigation. This can occur if your security team is only investigating specific findings or severity values. The finding for PrivilegeEscalation:Kubernetes/PrivilegedContainer informs you that a privileged container was launched on your Kubernetes cluster by using an image that has never before been used to launch privileged containers in your cluster. A privileged container has root level access to the host. The other finding, Persistence:Kubernetes/ContainerWithSensitiveMount, informs you that a container was launched with a configuration that included a sensitive host path with write access in the volumeMounts section. This makes the sensitive host path accessible and writable from inside the container. Any finding associated to the suspicious or compromised cluster is valuable because it provides additional insight into what the unauthorized entity was trying to accomplish after the initial detection.
- What next: With Detective, you might want to continue your investigation by selecting each of these findings and reviewing all details related to the finding. Depending on the findings, you could bring in additional team members to help investigate further. For this example, we will move on to the next step.
Figure 10: Example Detective summary of security findings associated with the EKS cluster
Shift from the EKS cluster overview section to the Kubernetes API activity section, similar to Figure 11 below. This will give you the opportunity to dig into the API activity associated with this cluster.
- Question: What other Kubernetes API activity was attempted from the cluster? Which API calls were successful? Which API calls failed? What was the unauthorized user trying to do?
- Why it matters: It’s important to determine which actions were successfully invoked by the unauthorized user so that appropriate remediation actions can be taken. You can look at trends of successful and failed API calls, and can even search by Subject, IP address, or Kubernetes API call.
- What next: You might want to look at all cluster role bindings from the days before the first GuardDuty finding was seen to determine if there was any other suspicious activity you should be investigating regarding the cluster.
Figure 11: Example Detective summary page for Kubernetes API activity on the EKS cluster
Next, you will want to look at the Newly observed Kubernetes API calls section, similar to Figure 12 below.
- Question: What are some of the more recent Kubernetes API calls? What are they trying to access right now and are they successful? Do I need to start taking action for other resources outside of EKS?
- Why it matters: This data shows Kubernetes subjects who were observed issuing API calls to this cluster for the first time during our scope time. Detective provides you this information by keeping a baseline of the activity associated with supported AWS resources. This can help you more quickly determine whether activity might be suspicious and worth looking into. In our example, we used the search functionality to look at API calls associated with the built-in Kubernetes secrets management. A common way to start your search is to see if an unauthorized user has successfully accessed any secrets, which can help you determine what information you might want to search in the overall API call volume section discussed in step 4.
- What next: If the unauthorized user has successfully accessed any secret, those secrets should be marked as compromised, and they should be rotated immediately.
Figure 12: Example Detective summary for newly observed Kubernetes API calls from the EKS cluster
You can also consider the following question when you look at the Newly observed Kubernetes API calls section.
- Question: Has the IP address associated with the finding been communicating with any other resources in our environment, and if so, what are the details of that communication?
- Why it matters: To answer this question, you can use Detective’s search functionality with wildcards to search for IP addresses that share the same first three octets. You can also search by using CIDR notation. Based on the results in the example in Figure 13, you can see that there are a number of related IP addresses associated with the environment. With this information, you can now look at the traffic associated with these different IPs and what resources they were communicating with.
Figure 13: Example Detective results page from a query against IP addresses associated with the EKS cluster
You can select one of the IP addresses in the search results to get more information related to it, similar to Figure 14 below.
- Question: What was the first time an IP address was observed in the environment? When was the last time it was observed?
- Why it matters: You can use this information to start isolating where unauthorized activity is coming from and what actions are being taken. You can also start creating a time series of unauthorized activity and scope.
- What next: You can repeat some of the previous investigation steps for each IP address, like looking at the different tabs to review New behavior, Resource interaction, and Kubernetes activity.
Figure 14: Example Detective results page for specific IP address and associated metadata details
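
The wildcard and CIDR searches mentioned in the steps above describe the same set of addresses: a wildcard on the last octet corresponds to a /24 block. The following minimal sketch uses Python’s standard `ipaddress` module to show that relationship. It is not how Detective implements search, the function names are hypothetical, and the addresses come from documentation-reserved ranges.

```python
import ipaddress


def wildcard_to_network(pattern: str) -> ipaddress.IPv4Network:
    """Convert a trailing-wildcard pattern such as '203.0.113.*' to a CIDR block."""
    octets = pattern.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected four dot-separated parts, got {pattern!r}")
    fixed = [o for o in octets if o != "*"]
    # Only trailing wildcards map cleanly onto a single CIDR prefix.
    if octets[: len(fixed)] != fixed:
        raise ValueError("wildcards must be trailing")
    padded = fixed + ["0"] * (4 - len(fixed))
    return ipaddress.ip_network(f"{'.'.join(padded)}/{8 * len(fixed)}")


def addresses_in_network(addresses, network):
    """Return the addresses (as given) that fall inside the network."""
    return [a for a in addresses if ipaddress.ip_address(a) in network]


if __name__ == "__main__":
    network = wildcard_to_network("203.0.113.*")
    print(network)
    print(addresses_in_network(["203.0.113.7", "198.51.100.1"], network))
```

A helper like this can be useful when you export addresses from your own logs and want to group them by the same block that you searched for in Detective.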

In summary, we began our investigation with a GuardDuty finding about an anonymous API request that was successful in using system:anonymous on one of our EKS clusters. We then used Detective to investigate and visualize activity associated with that EKS cluster, such as the volume of successful or unsuccessful API requests, where and when those actions were attempted, and other security findings associated with the resource. Once we have completed the investigation, we can confirm scope and impact of the security event and start moving towards taking action.

Remediation techniques for Amazon EKS

In this section, we will focus on how to remediate the security issue in our example. Your actions will vary based on your organization and the resources affected. It’s important to note that these actions will impact the EKS cluster and associated workloads, and should accordingly be performed by or coordinated with the cluster operator.

Before you take action on the EKS cluster, you will need to preserve forensic artifacts and evidence for the impacted EKS resources. The order of operations for these actions matters, because you want to get all the data from forensic artifacts in order to determine the overall impact to the resources affected. If you quarantine resources before you capture forensic artifacts, there is a risk that running processes will be interrupted or that the malware attempts to destroy resources that are valuable to a forensics investigation, to cover its tracks.

To preserve forensic evidence

Enable termination protection on the impacted worker node and change the shutdown behavior to Stop.
Label the offending pod or node with a label indicating that it is part of an active investigation.
Cordon the worker node.
Capture both volatile (temporary memory) and non-volatile (Amazon EBS snapshots) artifacts on the worker node.
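
The labeling and cordoning steps above can be sketched as kubectl invocations. The following Python helper only builds the argument lists so that an operator can review them before running anything. The function name, the label `investigation=active`, and the resource names are assumptions for illustration. Termination protection and artifact capture are not covered here.

```python
import shlex


def preservation_commands(pod, namespace, node, label="investigation=active"):
    """Return kubectl argument lists that label the pod and node, then cordon the node."""
    return [
        ["kubectl", "label", "pod", pod, "--namespace", namespace, label],
        ["kubectl", "label", "node", node, label],
        ["kubectl", "cordon", node],
    ]


if __name__ == "__main__":
    # Hypothetical resource names, shown only to illustrate the output.
    for argv in preservation_commands("web-5d4f", "default", "ip-10-0-1-23.ec2.internal"):
        print(shlex.join(argv))
```

Cordoning marks the node as unschedulable without evicting the pods that are already running, which keeps the affected workload in place for evidence capture.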

Now that you have the forensic evidence, you can start to quarantine your EKS resources to restrict unauthorized network communication. The main objective is to prevent the affected EKS pods from communicating with internal resources or exfiltrating data externally.

To quarantine EKS resources

Isolate the pod by creating a network policy that denies ingress and egress traffic to the pod.
Attach a security group to the host and remove inbound and outbound rules. Take this action if you believe the underlying host has been compromised.
Depending on existing inbound and outbound rules on the security group, the connections will either be tracked or untracked. Applying an isolation security group will drop untracked connections. For tracked connections, new connections with the host will not be allowed from the isolation security group, but existing tracked connections will not be interrupted.

Important: This action will affect all containers running on the host.
Attach a deny rule for the EKS resources in a network access control list (network ACL). Because network ACLs are stateless firewalls, all connections will be interrupted, whether they are tracked or untracked connections.

Important: This action will affect all subnets using the network ACL and all resources within those subnets.
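
As an illustration of the first quarantine step, the following sketch builds a Kubernetes NetworkPolicy manifest that selects pods by label and lists both policy types without any allow rules, which denies ingress and egress for the selected pods. The policy name, namespace, and label are assumptions. The policy takes effect only if the cluster runs a network plugin that enforces network policies, and only if no other policy allows traffic for the same pods.

```python
import json


def build_isolation_policy(namespace: str, pod_labels: dict) -> dict:
    """Build a NetworkPolicy that denies all ingress and egress for the selected pods."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # Hypothetical policy name, chosen for this example.
        "metadata": {"name": "quarantine-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            # Both types are listed with no ingress or egress rules,
            # so no traffic is allowed in either direction.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    policy = build_isolation_policy("default", {"status": "quarantine"})
    print(json.dumps(policy, indent=2))
```

kubectl accepts JSON manifests, so you could save the printed output to a file and apply it with `kubectl apply -f`, after labeling the affected pod with the matching label.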

At this point, the affected EKS resources are quarantined, but the cluster is still configured to allow anonymous, unauthenticated access. You will need to remove all unauthorized permissions that were created or added.

To remove unauthorized permissions

Update the RBAC configuration to remove system:anonymous access.
Revoke temporary security credentials that are assigned to the pod or worker node, if necessary. You can also remove the IAM role associated with the EKS resources.

Note: Removing IAM policies or attaching IAM policies to restrict permissions will affect the resources that are using the IAM role.
Remove any unauthorized ClusterRoleBinding created by the system:anonymous user.
Redeploy the compromised pod or workload resource.

The actions taken so far primarily target the EKS resource, but based on our Detective investigation, there are other actions you might need to take. Because secrets were involved that could be used outside of the EKS cluster, those secrets will need to be rotated wherever they are referenced. Detective will also suggest additional areas where you can investigate and remediate additional unauthorized activity in your AWS account.

It is important that your team go through game days or run-throughs for investigating and responding to different scenarios in order to make sure the team is prepared. You can run through the EKS security workshop to get your security team more familiar with remediation for EKS.

For more information about responding to EKS cluster related security issues, refer to GuardDuty EKS remediation in the GuardDuty User Guide and the EKS Best Practices Guide.

Preventative controls for EKS

This section covers several preventative controls that you can use to protect EKS clusters.

How can I prevent external access to the EKS cluster?

To help prevent external access to your EKS clusters, limit the exposure of your API server. You can achieve that in two ways:

Set the API server endpoint access to Private. This prevents anyone outside of the VPC from sending Kubernetes API requests to your EKS cluster.
Set an IP address allow list for the EKS cluster public access endpoint.
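
Both options map to the `resourcesVpcConfig` settings of the EKS UpdateClusterConfig API. The following sketch builds the request parameters for each option and validates the CIDR blocks locally. The function names and the cluster name are hypothetical, and the sketch assumes that you pass the result to an SDK call such as boto3’s `update_cluster_config`.

```python
import ipaddress


def private_endpoint_config(cluster_name: str) -> dict:
    """Parameters that turn off the public endpoint and keep only private access."""
    return {
        "name": cluster_name,
        "resourcesVpcConfig": {
            "endpointPublicAccess": False,
            "endpointPrivateAccess": True,
        },
    }


def allow_listed_endpoint_config(cluster_name: str, cidrs: list) -> dict:
    """Parameters that keep the public endpoint but restrict it to the given CIDR blocks."""
    # ip_network raises ValueError for malformed blocks or blocks with host bits set.
    validated = [str(ipaddress.ip_network(cidr)) for cidr in cidrs]
    return {
        "name": cluster_name,
        "resourcesVpcConfig": {
            "endpointPublicAccess": True,
            "publicAccessCidrs": validated,
        },
    }


if __name__ == "__main__":
    print(allow_listed_endpoint_config("demo-cluster", ["203.0.113.0/24"]))
    # Assuming boto3 is installed and credentials are configured:
    # boto3.client("eks").update_cluster_config(**private_endpoint_config("demo-cluster"))
```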

How can I prevent giving admin access to the EKS cluster?

To help prevent an EKS cluster user from granting any type of access to anonymous or unauthenticated users, you can set up a ValidatingAdmissionWebhook. This is a special type of Kubernetes admission controller that can be configured in the Kubernetes API. (To learn how to build serverless admission webhooks, see the blog post Building serverless admission webhooks for Kubernetes with AWS SAM.)

The ValidatingAdmissionWebhook will deny a Kubernetes API request that matches all of the following checks:

The request is creating or modifying a ClusterRoleBinding or RoleBinding.
The subjects section contains either of the following:
- The user system:anonymous
- The group system:unauthenticated
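
The checks above can be sketched as the decision function of a validating webhook. This is a minimal illustration under stated assumptions, not the implementation from the referenced blog post: the function name `review` is hypothetical, and a real webhook also needs an HTTPS server and a ValidatingWebhookConfiguration that routes RBAC binding requests to it. The function takes an `admission.k8s.io/v1` AdmissionReview request and returns the matching response.

```python
BINDING_KINDS = {"ClusterRoleBinding", "RoleBinding"}
BLOCKED_SUBJECTS = {
    ("User", "system:anonymous"),
    ("Group", "system:unauthenticated"),
}


def review(admission_review: dict) -> dict:
    """Deny binding writes whose subjects include anonymous or unauthenticated principals."""
    request = admission_review["request"]
    obj = request.get("object") or {}
    subjects = obj.get("subjects") or []

    is_binding = request.get("kind", {}).get("kind") in BINDING_KINDS
    is_write = request.get("operation") in {"CREATE", "UPDATE"}
    blocked = [s for s in subjects if (s.get("kind"), s.get("name")) in BLOCKED_SUBJECTS]

    # The request is denied only when all checks match.
    allowed = not (is_binding and is_write and blocked)
    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        names = ", ".join(s["name"] for s in blocked)
        response["status"] = {"code": 403, "message": f"binding grants access to {names}"}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}
```

Returning the request’s `uid` in the response is required so that the API server can match the decision to the request.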

How can I prevent malicious images from being deployed?

Now that you have set controls to prevent external access to the EKS cluster and prevent granting access to anonymous users, you can focus on preventing the deployment of potentially malicious images.

Malicious container images can have different origins, including:

Images stored in public or unauthorized registries
Images replacing the ones that are stored in authorized registries
Authorized images that contain software with existing or newly discovered vulnerabilities

You can address these sources of malicious images by doing the following:

Use admission controllers to verify that images meet your organization’s requirements, including for the image origin. You can also refer to this blog post to implement a solution with a webhook and admission controllers.
Enable tag immutability in your registry, a control that prevents an actor from maliciously replacing container images without changing the image’s tags. Additionally, you can enable an AWS Config rule to check tag immutability.
Configure another ValidatingAdmissionWebhook that will only accept images if they meet all of the following criteria.
1. Images that come from approved registries.
2. Images that pass the vulnerability scan during deployment time.
3. Images that are signed by a trusted party. Amazon Elastic Container Registry (Amazon ECR) is working on a product enhancement to store image signatures. Currently, you can use the open-source cosign tool to verify and store image signatures.
  
  Note: These criteria can vary based on your use case and internal security and compliance standards.

The above controls will help prevent the deployment of a vulnerable, unauthorized, or potentially malicious container image.
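
The first criterion, approved registries, can be sketched as a small check on the image reference. The registry host in `APPROVED_REGISTRIES` is a hypothetical example, and the parsing follows the common container convention that the first path component is a registry host only if it contains a dot or a colon, or is `localhost`. Vulnerability scanning and signature verification are outside the scope of this sketch.

```python
# Hypothetical allow list; replace with your organization's registries.
APPROVED_REGISTRIES = ("123456789012.dkr.ecr.us-east-1.amazonaws.com",)


def registry_of(image: str) -> str:
    """Return the registry host of an image reference, defaulting to Docker Hub."""
    first, _, rest = image.partition("/")
    # A leading component is a registry host only if it looks like a hostname.
    if rest and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"


def is_from_approved_registry(image: str, approved=APPROVED_REGISTRIES) -> bool:
    """Return True when the image's registry host is on the allow list."""
    return registry_of(image) in approved
```

A webhook could apply this check to every container image in a pod specification and deny the request if any image fails it.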

How can I prevent lateral movement inside the cluster?

To prevent lateral movement inside the cluster, it is recommended to use network policies, as follows:

Use Kubernetes network policies to enforce ingress and egress controls within the cluster. You can implement these policies by following the steps in the Securing your cluster with network policies EKS workshop.

It’s important to note that you could use security groups for the same purpose, but pod security groups should only be used if the cluster is compromised and when you want to control the traffic between a pod and a resource that resides in the VPC, not inter-pod traffic.

In this section, we’ve reviewed different preventative controls that could have helped mitigate our example security incident. With the first preventative control, we could have prevented external actors from connecting to the API server. The second control could have prevented granting access to anonymous users. The third control could have prevented the deployment of an unauthorized or vulnerable container image. Finally, the fourth control could have helped limit the impact of the deployed vulnerable images to only the pods where the images were deployed, making it harder to laterally move to other pods in the cluster.

Conclusion

In this post, we walked you through how to investigate an EKS cluster related security issue with Amazon Detective. We also provided some recommended remediation and preventative controls to put in place for the EKS cluster specific security issues. When you pair GuardDuty’s continuous threat detection and monitoring with Detective’s organization and visualization capabilities, you enable your security team to conduct faster and more effective investigations. By giving the security team the ability to quickly view an organized set of data associated with security events within your AWS account, you reduce the overall Mean Time to Respond (MTTR).

Now that you understand the investigative capabilities with Detective, it’s time to try things out! It is important that you provide a mechanism for your security team to practice detection, investigation, and remediation techniques using security incident response simulations. By periodically running simulations, your security team will be prepared to quickly respond to possible security events. For more detailed incident response playbooks that can assist you in preparing for events in your environment, see these sample AWS incident response playbooks.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on Amazon GuardDuty re:Post.


How to use Amazon Macie to preview sensitive data in S3 buckets

2022-12-01 Koulick Ghosh

Post Syndicated from Koulick Ghosh original https://aws.amazon.com/blogs/security/how-to-use-amazon-macie-to-preview-sensitive-data-in-s3-buckets/

Security teams use Amazon Macie to discover and protect sensitive data, such as names, payment card data, and AWS credentials, in Amazon Simple Storage Service (Amazon S3). When Macie discovers sensitive data, these teams will want to see examples of the actual sensitive data found. Reviewing a sampling of the discovered data helps them quickly confirm that the object is truly sensitive according to their data protection and privacy policies.

In this post, we walk you through how your data security teams are able to use a new capability in Amazon Macie to retrieve up to 10 examples of sensitive data found in your S3 objects, so that you are able to confirm the nature of the data at a glance. Additionally, we will discuss how you are able to control who is able to use this capability, so that only authorized personnel have permissions to view these examples.

The challenge customers face

After a Macie sensitive data discovery job is run, security teams start their work. The security team will review the Macie findings to investigate the discovered sensitive data and decide what actions to take to protect such data. The findings provide details that include the severity of the finding, information on the affected S3 object, and a summary of the type, location, and amount of sensitive data found. However, Macie findings only contain pointers to data that Macie found in the object. In order to complete their investigation, customers in the past had to do additional work to extract the contents of a sensitive object, such as navigating to a different AWS account where the object is located, downloading and manually searching for keywords in a file editor, or writing and refining SQL queries by using Amazon S3 Select. The investigations are further slowed down when the object type is one that is not easily readable without additional tooling, such as big-data file types like Avro and Parquet. By using the Macie capability to retrieve sensitive data samples, you are able to review the discovered data and make decisions concerning the finding remediation.

Prerequisites

To implement the ability to retrieve and reveal samples of sensitive data, you’ll need the following prerequisites:

Enable Amazon Macie in your AWS account. For instructions, see Getting started with Amazon Macie.
Set your account as the delegated Macie administrator account and enable Macie in at least one member account by using AWS Organizations. In this post, we will refer to the delegated administrator account as Account A and the member account as Account B.
Configure Macie detailed classification results in Account A.

Note: The detailed classification results contain a record for each Amazon S3 object that you configure the job to analyze, and include the location of up to 1,000 occurrences of each type of sensitive data that Macie found in an object. Macie uses the location information in the detailed classification results to retrieve the examples of sensitive data. The detailed classification results are stored in an S3 bucket of your choice. In this post, we will refer to this bucket as DOC-EXAMPLE-BUCKET1.
Create an S3 bucket that contains sensitive data in Account B. In this post, we will refer to this bucket as DOC-EXAMPLE-BUCKET2.

Note: You should enable server-side encryption on this bucket by using customer managed AWS Key Management Service (AWS KMS) keys (a type of encryption known as SSE-KMS).
(Optional) Add sensitive data to DOC-EXAMPLE-BUCKET2. This post uses a sample dataset that contains fake sensitive data. You are able to download this sample dataset, unarchive the .zip folder, and follow these steps to upload the objects to S3. This is a synthetic dataset generated by AWS that we will use for the examples in this post. All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not relate back to any individual person, nor is it intended to.
Create and run a sensitive data discovery job from Account A to analyze the contents of DOC-EXAMPLE-BUCKET2.
(Optional) Set up the AWS Command Line Interface (AWS CLI).

Configure Macie to retrieve and reveal examples of sensitive data

In this section, we’ll describe how to configure Macie so that you are able to retrieve and view examples of sensitive data from Macie findings.

To configure Macie (console)

In the AWS Management Console, in the Macie delegated administrator account (Account A), follow these steps from the Amazon Macie User Guide.

To configure Macie (AWS CLI)

Confirm that you have Macie enabled.

	$ aws macie2 get-macie-session --query 'status'
	// The expected response is "ENABLED"

Confirm that you have configured the detailed classification results bucket.

	$ aws macie2 get-classification-export-configuration

	// The expected response is:
	{
	    "configuration": {
	        "s3Destination": {
	            "bucketName": "DOC-EXAMPLE-BUCKET1",
	            "kmsKeyArn": "arn:aws:kms:<YOUR-REGION>:<YOUR-ACCOUNT-ID>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET1>"
	        }
	    }
	}

Create a new KMS key to encrypt the retrieved examples of sensitive data. Make sure that the key is created in the same AWS Region where you are operating Macie.

$ aws kms create-key
{
    "KeyMetadata": {
        "Origin": "AWS_KMS",
        "KeyId": "<YOUR-KEY-ID>",
        "Description": "",
        "KeyManager": "CUSTOMER",
        "Enabled": true,
        "KeySpec": "SYMMETRIC_DEFAULT",
        "CustomerMasterKeySpec": "SYMMETRIC_DEFAULT",
        "KeyUsage": "ENCRYPT_DECRYPT",
        "KeyState": "Enabled",
        "CreationDate": 1502910355.475,
        "Arn": "arn:aws:kms:<YOUR-AWS-REGION>:<AWS-ACCOUNT-A>:key/<YOUR-KEY-ID>",
        "AWSAccountId": "<AWS-ACCOUNT-A>",
        "MultiRegion": false,
        "EncryptionAlgorithms": [
            "SYMMETRIC_DEFAULT"
        ]
    }
}

Give this key the alias REVEAL-KMS-KEY.

$ aws kms create-alias --alias-name "alias/<REVEAL-KMS-KEY>" --target-key-id "<YOUR-KEY-ID>"

// This command produces no output when it succeeds.

Enable the feature in Macie and configure it to encrypt the data by using REVEAL-KMS-KEY. You do not specify a key policy for your new KMS key in this step. The key policy will be discussed later in the post.

$ aws macie2 update-reveal-configuration --configuration '{"status":"ENABLED","kmsKeyId":"alias/<REVEAL-KMS-KEY>"}'

// The expected response is:
{
    "configuration": {
        "kmsKeyId": "arn:aws:kms:<YOUR-REGION>:<YOUR-ACCOUNT-ID>:key/<REVEAL-KMS-KEY>",
        "status": "ENABLED"
    }
}

Control access to read sensitive data and protect data displayed in Macie

This new Macie capability uses the AWS Identity and Access Management (IAM) policies, S3 bucket policies, and AWS KMS key policies that you have defined in your accounts. This means that in order to see examples through the Macie console or by invoking the Macie API, the IAM principal needs to have read access to the S3 object and to decrypt the object if it is server-side encrypted. It’s important to note that Macie uses the IAM permissions of the AWS principal to locate, retrieve, and reveal the samples and does not use the Macie service-linked role to perform these tasks.

Using the setup discussed in the previous section, you will walk through how to control access to the ability to retrieve and reveal sensitive data examples. To recap, you created and ran a discovery job from the Amazon Macie delegated administrator account (Account A) to analyze the contents of DOC-EXAMPLE-BUCKET2 in a member account (Account B). You configured Macie to retrieve examples and to encrypt the examples of sensitive data with the REVEAL-KMS-KEY.

The next step is to create and use an IAM role that will be assumed by other users in Account A to retrieve and reveal examples of sensitive data discovered by Macie. In this post, we’ll refer to this role as MACIE-REVEAL-ROLE.

To apply the principle of least privilege and allow only authorized personnel to view the sensitive data samples, grant the following permissions so that Macie users who assume MACIE-REVEAL-ROLE will be able to successfully retrieve and reveal examples of sensitive data:

Step 1 – Update the IAM policy for MACIE-REVEAL-ROLE.
Step 2 – Update the KMS key policy for REVEAL-KMS-KEY.
Step 3 – Update the S3 bucket policy for DOC-EXAMPLE-BUCKET2 and the KMS key policy used for its server-side encryption in Account B.

After you grant these permissions, MACIE-REVEAL-ROLE can successfully retrieve and reveal examples of sensitive data in DOC-EXAMPLE-BUCKET2, as shown in Figure 1.

Figure 1: Macie runs the discovery job from the delegated administrator account in a member account, and MACIE-REVEAL-ROLE retrieves examples of sensitive data

Step 1: Update the IAM policy

Provide the following required permissions to MACIE-REVEAL-ROLE:

Allow GetObject from DOC-EXAMPLE-BUCKET2 in Account B.
Allow decryption of DOC-EXAMPLE-BUCKET2 if it is server-side encrypted with a customer managed key (SSE-KMS).
Allow GetObject from DOC-EXAMPLE-BUCKET1.
Allow decryption of the Macie discovery results.
Allow the necessary Macie actions to retrieve and reveal sensitive data examples.

To set up the required permissions

Use the following IAM policy to provide the permissions. Make sure to replace the placeholders with your own data.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGetFromCompanyDataBucket",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET2>/*"
        },
        {
            "Sid": "AllowKMSDecryptForCompanyDataBucket",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-B>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET2>"
        },
        {
            "Sid": "AllowGetObjectfromMacieResultsBucket",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET1>/*"
        },
        {
            "Sid": "AllowKMSDecryptForMacieRoleDiscoveryBucket",
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-A>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET1>"
        },
        {
            "Sid": "AllowActionsRetrieveAndReveal",
            "Effect": "Allow",
            "Action": [
                "macie2:GetMacieSession",
                "macie2:GetFindings",
                "macie2:GetSensitiveDataOccurrencesAvailability",
                "macie2:GetSensitiveDataOccurrences",
                "macie2:ListFindingsFilters",
                "macie2:GetBucketStatistics",
                "macie2:ListMembers",
                "macie2:ListFindings",
                "macie2:GetFindingStatistics",
                "macie2:GetAdministratorAccount",
                "macie2:GetClassificationExportConfiguration",
                "macie2:GetRevealConfiguration",
                "macie2:DescribeBuckets"
            ],
            "Resource": "*"
        }
    ]
}

Step 2: Update the KMS key policy

Next, update the KMS key policy that is used to encrypt sensitive data samples that you retrieve and reveal in your delegated administrator account.

To update the key policy

Allow MACIE-REVEAL-ROLE to access the KMS key that you created for protecting the retrieved sensitive data by adding the following statement to the key policy. Make sure to replace the placeholders with your own data.

        {
            "Sid": "AllowMacieRoleDecrypt",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS-ACCOUNT-A>:role/<MACIE-REVEAL-ROLE>"
            },
            "Action": [
                "kms:Decrypt",
                "kms:DescribeKey",
                "kms:GenerateDataKey"
            ],
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-A>:key/<REVEAL-KMS-KEY>"
        }

Step 3: Update the bucket policy of the S3 bucket

Finally, update the bucket policy of the S3 bucket in member accounts, and update the key policy of the key used for SSE-KMS.

To update the S3 bucket policy and KMS key policy

Add the following statement to the key policy of the KMS key that is used for server-side encryption of the DOC-EXAMPLE-BUCKET2 bucket in Account B.

        {
            "Sid": "AllowMacieRoleDecrypt",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS-ACCOUNT-A>:role/<MACIE-REVEAL-ROLE>"
            },
            "Action": "kms:Decrypt",
            "Resource": "arn:aws:kms:<AWS-REGION>:<AWS-ACCOUNT-B>:key/<KEY-USED-TO-ENCRYPT-DOC-EXAMPLE-BUCKET2>"
        }

Use the following bucket policy for DOC-EXAMPLE-BUCKET2 to allow cross-account access for MACIE-REVEAL-ROLE to get objects from this bucket.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowMacieRoleGet",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<AWS-ACCOUNT-A>:role/<MACIE-REVEAL-ROLE>"
            },
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET2>/*"
        }
    ]
}
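Policy documents that are copied between editors can pick up small defects that cause AWS to reject them. The following is a minimal sketch of a pre-flight check you could run on a policy document before you apply it. The function name find_policy_problems and the specific checks are assumptions chosen for illustration; this is not an AWS-provided validator.

```python
import json
import re
from typing import List

# IAM ARNs are global, so the Region field between "iam" and the account ID
# must be empty, as in arn:aws:iam::111122223333:role/Example.
IAM_ARN_WITH_REGION = re.compile(r'arn:aws:iam:[^:\s"]+:')


def find_policy_problems(policy_text: str) -> List[str]:
    """Return human-readable problems found in a JSON policy document."""
    problems = []
    # Typographic quotation marks are not valid JSON string delimiters.
    if "\u201c" in policy_text or "\u201d" in policy_text:
        problems.append("contains typographic (curly) quotation marks")
    if IAM_ARN_WITH_REGION.search(policy_text):
        problems.append("IAM ARN includes a Region field")
    try:
        json.loads(policy_text)
    except json.JSONDecodeError as err:
        problems.append(f"not valid JSON: {err.msg}")
    return problems
```

An empty list means that none of these particular checks failed; it does not mean that the policy grants the intended permissions.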

Retrieve and reveal sensitive data samples

Now that you’ve put in place the necessary permissions, users who assume MACIE-REVEAL-ROLE will be able to conveniently retrieve and reveal sensitive data samples.

To retrieve and reveal sensitive data samples

In the Macie console, in the left navigation pane, choose Findings, and select a specific finding. Under Sensitive Data, choose Review.

Figure 2: The finding details panel
On the Reveal sensitive data page, choose Reveal samples.

Figure 3: The Reveal sensitive data page
Under Sensitive data, you will be able to view up to 10 examples of the sensitive data found by Amazon Macie.

Figure 4: Examples of sensitive data revealed in the Amazon Macie console

You can find additional information on setting up the Macie Reveal function in the Amazon Macie User Guide.

Conclusion

In this post, we showed how you can retrieve and review examples of sensitive data that Amazon Macie found in Amazon S3. This capability will make it easier for your data protection teams to review the sensitive contents found in S3 buckets across the accounts in your AWS environment. With this information, security teams can quickly take remediation actions, such as updating the configuration of sensitive buckets, quarantining files with sensitive information, or sending a notification to the owner of the account where the sensitive data resides. In certain cases, you can add the examples to an allow list in Macie if you don’t want Macie to report those as sensitive data (for example, corporate addresses or sample data that is used for testing).

The following are links to additional resources that you will be able to use to expand your knowledge of Amazon Macie capabilities and features:

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Amazon Macie re:Post.

Want more AWS Security news? Follow us on Twitter.

Use Amazon Macie for automatic, continual, and cost-effective discovery of sensitive data in S3

2022-11-29 Jonathan Nguyen

Post Syndicated from Jonathan Nguyen original https://aws.amazon.com/blogs/security/use-amazon-macie-for-automatic-continual-and-cost-effective-discovery-of-sensitive-data-in-s3/

Customers have an increasing need to collect, store, and process data within their AWS environments for application modernization, reporting, and predictive analytics. The AWS Well-Architected security pillar and general data privacy and compliance regulations require that you appropriately identify and secure sensitive information. Knowing where your data is allows you to implement the appropriate security controls, which helps you meet a range of objectives, including compliance and data privacy.

With Amazon Macie, you can detect sensitive information stored in your organization’s Amazon Simple Storage Service (Amazon S3) storage. Macie provides sensitive data findings and additional metadata to help you protect your data in Amazon S3.

If you have many accounts with a lot of S3 buckets and data, you might find it complex, expensive, and time consuming to discover sensitive data in each bucket and account, and to evaluate the large number of findings. As your applications continue to scale, you want to have confidence that you continue to understand where the data is in your environment.

To help discover sensitive data across your entire S3 storage, you can now use a new feature in Macie—automated sensitive data discovery—to automatically build sensitive data profiles on S3 buckets and uncover the presence of sensitive data. The new feature continually and cost-efficiently samples data across your S3 storage. This reduces the data scanning needed to locate sensitive data so that you can focus your time, effort, and resources on additional investigation and remediation if sensitive data is found. This broad visibility can help you develop scalable, repeatable processes for ongoing and proactive protection of data.

In this blog post, we show you how to set up Macie automated sensitive data discovery in your AWS environment and walk you through the insights that it generates. We also share some common patterns on how you can use the findings to improve your data security posture.

Prerequisites

To get started, you’ll need the following prerequisites:

Activate Amazon Macie in your accounts for the AWS Regions of your choosing. Macie is a regional service, so it scans S3 buckets only in the Regions where it’s turned on.
Set up a delegated Macie administrator account, also referred to as the Macie admin account, for these Regions. A Macie admin account has visibility into the S3 buckets of member accounts. It also allows you to restrict access to automated sensitive data discovery results to the appropriate teams, without providing access into the management account.
To set up the delegated Macie administrator to centrally manage multiple Macie accounts, do one of the following:
- Option 1 (Recommended) – Add member accounts using AWS Organizations. Make sure to review best practices for setting up Macie for your organizations.
- Option 2 – Add member accounts by using Macie membership invitations.
For steps on how to implement these options, see Considerations and recommendations for invitation-based organizations in Amazon Macie.
Make sure that a Macie service-linked IAM role has appropriate permissions to read and decrypt S3 objects. For S3 objects that are server-side encrypted with AWS Key Management Service (AWS KMS), update the associated KMS key policies to grant the required permission for the Macie service-linked role to decrypt existing and future S3 objects.
Configure an S3 bucket for sensitive data results in the Macie admin account to access the results and allow for long-term storage and retention.

Activate automated sensitive data discovery in the delegated Macie administrator account

In this section, we walk you through how to activate automated sensitive data discovery in Macie.

For new Macie admin accounts, automated sensitive data discovery is turned on by default. For existing Macie accounts, you need to activate automated sensitive data discovery in the existing Macie admin accounts.

To activate automated sensitive data discovery in the existing Macie admin accounts

Navigate to the Amazon Macie console.
Under Settings, choose Automated discovery.
For Status, choose Enable, and then edit the following sections according to your needs:
- S3 buckets – By default, Macie selects and inspects samples of objects across all S3 buckets in your organization. You can exclude specific buckets from this analysis. For example, you might want to exclude an S3 bucket that stores AWS CloudTrail logs.
- Managed data identifiers – You can select managed data identifiers to include or exclude during automated sensitive data discovery. By default, Macie inspects and samples objects by using a set of managed data identifiers that AWS recommends. This includes most of the managed data identifiers that AWS supports, but excludes some that can potentially cause a high volume of alerts in buckets where you might not expect them. If you know specific data types that could exist within your environment, you can add those managed data identifiers specifically. If you want Macie to exclude detections that aren’t sensitive in your deployment, you can exclude them. For more details, see the Macie administrator user guide.
- Custom data identifiers – You can select custom data identifiers to include or exclude during automated sensitive data discovery.
- Allow lists – You can select allow lists to define specific text or a text pattern that you want Macie to exclude from automated sensitive data discovery.

Figure 1: Settings page for Macie automated sensitive data discovery

Note: When you make changes to the inclusion or exclusion of managed or custom data identifiers for S3 buckets managed by the Macie admin account, those changes apply only to new S3 objects that are discovered. The changes do not apply to detections for existing S3 objects that were previously scanned with automated sensitive data discovery.

How Macie samples data and assigns scores

Macie automated sensitive data discovery analyzes objects in the S3 buckets in your accounts where Macie is turned on. It organizes objects with similar S3 metadata, such as bucket names, object-key prefixes, file-type extensions, and storage class, into groups that are likely to have similar content. It then selects small, but representative, samples from each identified group of objects and scans them to detect the presence of sensitive data. Macie has a feedback loop that uses the results of previously scanned samples to prioritize the next set of samples to inspect.
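To make the grouping idea concrete, the following is a simplified illustration of grouping objects by metadata and drawing a small sample from each group. It is not Macie's actual algorithm, which isn't described in this post. The function names, the dictionary keys, and the fixed per-group sample size are all assumptions, and the feedback loop is omitted.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def group_key(bucket: str, key: str, storage_class: str) -> tuple:
    """Group by bucket name, object-key prefix, file extension, and storage class."""
    path = PurePosixPath(key)
    prefix = "" if str(path.parent) == "." else str(path.parent)
    return (bucket, prefix, path.suffix.lower(), storage_class)


def sample_by_group(objects, per_group: int = 2, seed: int = 0) -> dict:
    """Pick up to per_group object keys from every metadata group."""
    groups = defaultdict(list)
    for obj in objects:
        key = group_key(obj["bucket"], obj["key"], obj["storage_class"])
        groups[key].append(obj["key"])
    rng = random.Random(seed)  # seeded so that the sketch is repeatable
    return {
        key: rng.sample(keys, min(per_group, len(keys)))
        for key, keys in groups.items()
    }
```

Objects that share a prefix, extension, and storage class land in the same group, so scanning a few of them gives a reasonable first estimate for the rest of the group.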

This systematic exploration of your S3 storage can help identify the presence of unknown sensitive data for a fraction of the cost of targeted sensitive data discovery jobs. A single sample might not be conclusive, so Macie continues sampling to build a security-relevant, interactive map of your S3 buckets. It automatically detects new buckets in your accounts, and keeps track of the previously scanned objects that get deleted from existing buckets to make sure that your map stays up to date.

Review data sensitivity scoring

When you first activate automated sensitive data discovery, Macie assigns each of your S3 buckets a sensitivity score of 50. Then, Macie begins to continually select and scan a sample of objects in your S3 buckets across each member account. Based on the results, Macie adjusts the sensitivity score for each bucket, assigning new scores that range from 1–99. Macie increases the score if sensitive data is found, and decreases the score if sensitive data isn’t found.

Macie calculates this score based on the amount of data inspected, number of sensitive data types discovered, number of occurrences of each sensitive data type, and the nature of the sensitive data. The score can help you identify potential security risks, but it does not indicate the criticality that a given bucket, and its contents, might have for your organization.

Figure 2 shows an example Summary page for the delegated Macie administrator. This page summarizes the results of automated sensitive data discovery for the delegated administrator account and each member account.

Figure 2: Macie summary page showing S3 bucket metadata

From the Summary page, you can choose statistics, such as Publicly accessible or Sensitive, to investigate. When you choose a statistic, you will be redirected to the S3 buckets page that displays a filtered view based on the selected data.

On the S3 buckets page shown in Figure 3, Macie displays a heat map of consolidated information, grouped by account, on whether a bucket is sensitive, not sensitive, or not analyzed yet. Each square in the heat map represents an S3 bucket. In the figure, account 111122223333 has 79 buckets, including 4 buckets with sensitive data findings, 34 buckets that were scanned with no sensitive data found, and 41 buckets that are pending scanning.

Figure 3: Heat map of automated sensitive data discovery in Macie

For more information about an S3 bucket, select one of the squares in the heat map. This will show you the sensitivity score and other details, such as types of sensitive data, names of sensitive objects, and profiling statistics.

The following table summarizes Macie sensitivity score categories and how to interpret the heat map.

Data sensitivity score	Data sensitivity status	Data sensitivity heat map
-1	Unable to analyze	Macie was unable to analyze one or more S3 objects due to a permission issue.
1-49	Not sensitive	A darker shade of blue, and a lower sensitivity score, indicates that a greater proportion of objects in the bucket were scanned and fewer occurrences of sensitive data were found. A score closer to 1 indicates that Macie scanned most of the objects in the bucket and did not find occurrences of objects with sensitive data. A score closer to 49 indicates that Macie scanned a smaller proportion of objects in the bucket and did not find occurrences of objects with sensitive data.
50	Not analyzed	White shading indicates that Macie hasn’t analyzed objects yet.
51-99	Sensitive	A darker shade of red, and a higher sensitivity score, indicates that a greater proportion of objects in the bucket were scanned and more occurrences of sensitive data were found. A score closer to 99 indicates that Macie scanned a greater proportion of objects in the bucket, and found several occurrences of objects with sensitive data. A score closer to 51 indicates that Macie scanned a smaller proportion of objects and found some occurrences of objects with sensitive data.
100	Maximum score	A solid shade of red. Macie doesn’t assign this score, but you can manually assign it.
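If you process sensitivity scores in your own tooling, the table translates directly into a lookup. The following is a minimal sketch; the function name sensitivity_status is hypothetical, and the ranges are taken from the table above.

```python
def sensitivity_status(score: int) -> str:
    """Map a Macie sensitivity score to the status shown in the table."""
    if score == -1:
        return "Unable to analyze"
    if 1 <= score <= 49:
        return "Not sensitive"
    if score == 50:
        return "Not analyzed"
    if 51 <= score <= 99:
        return "Sensitive"
    if score == 100:
        return "Maximum score"
    # The table defines no status for any other value.
    raise ValueError(f"unexpected sensitivity score: {score}")
```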

Common use cases for Macie automated sensitive data discovery

In this section, we discuss how you can use automated sensitive data discovery in Macie to implement the following common patterns:

Activate continuous monitoring for broad visibility into the presence of sensitive data in your S3 buckets, including existing buckets where sensitive data was not found before.
Manually identify and prioritize a subset of S3 buckets so that you can conduct a full scan based on the sensitivity score.
Build automation that scans S3 buckets by using the sensitivity score and takes actions, such as sending notifications or performing remediation, so that buckets with sensitive data have proper guardrails.

Continuous monitoring of S3 buckets for sensitive data

The dynamic nature of applications and the speed of innovation increase the type and amount of data generated, stored, and processed over time. While development teams work on developing new features for your applications, security teams help the application teams understand where they should take action to protect data.

Discovering sensitive data is an ongoing activity that requires a continuous search for sensitive data in S3 buckets in each account that the Macie admin accounts manage. Macie continually searches for sensitive data and updates the information found on the Summary and S3 buckets pages in the Macie admin accounts.

To help you gain visibility across your S3 storage at an affordable cost, automated sensitive data discovery establishes a baseline profile of the sensitivity of each bucket, while analyzing only a fraction of S3 data for each account in a given month. After you activate this feature in the Macie admin accounts, Macie starts constructing an S3 bucket baseline within 48 hours.

Macie continues to refine bucket profiles and prioritizes those that it has the least information on. For example, Macie might prioritize buckets that were recently created in the monitored accounts or existing buckets from a member account that recently joined your organization. This provides continual visibility that achieves greater fidelity over time while scanning data at a predictable monthly rate.

Automated discovery uses the results of the automated data inspection to create a profile for each bucket. It also tracks previously scanned objects to make sure that each bucket profile is up to date. This means that if a previously scanned object is removed, Macie updates the profile of the bucket to make sure that you have the most current information.

You can also include or exclude specific managed and custom data identifiers from specific S3 buckets or from each S3 bucket that the Macie admin accounts manage. For example, to make sure that the sensitivity score is as accurate as possible, you can exclude specific data identifiers on select S3 buckets where you expect those identifiers.

Let’s walk through an example of how to exclude specific data identifiers on an S3 bucket. Imagine that your company has an S3 bucket where data scientists store a test dataset of fictitious names and addresses. The appropriate teams have verified that the test dataset isn’t sensitive and can be used to create test data models. You want to exclude name and address detections for this bucket while keeping these detections for the rest of your S3 storage.

To exclude the name and address identifiers, navigate to the specific S3 bucket, choose the identifiers to exclude (in this case, NAME and ADDRESS), and choose Exclude from score, as shown in Figure 4. Macie automatically excludes these identifiers from the sensitivity score for that S3 bucket only, for existing and new objects.

Figure 4: Macie S3 bucket list view with sensitivity scores and detections

Note: When you change the included or excluded managed or custom data identifiers for an S3 bucket, Macie automatically updates existing detections and sensitivity scores. Macie also applies these changes to new S3 objects that it scans with automated sensitive data discovery.

You can prioritize S3 buckets that need additional review by manually assigning them a maximum sensitivity score. When you select Assign maximum score on an S3 bucket, Macie sets the score to 100, regardless of the sensitive data detections that it found through automated sensitive data discovery. Automated sensitive data discovery continues to scan the bucket and create sensitive data detections unless you select Exclude from automated discovery.

You might want to assign maximum scores for S3 buckets that are publicly accessible, shared across multiple internal or external customers, or part of an environment where sensitive data shouldn’t be present. By assigning a maximum score to an S3 bucket, you can help ensure that your security and privacy teams regularly review high-priority buckets. You can decide whether to assign maximum scores based on your organization’s use cases and security policies.

Identify a subset of S3 buckets to conduct a full scan based on the sensitivity score

You can use sensitivity scores to prioritize specific S3 buckets for full Macie scanning jobs. By running full scanning jobs on specific buckets, you can focus your efforts on buckets where sensitive data could have the greatest impact on your organization. Because full scanning occurs on only a subset of your buckets, this strategy can help lower your overall costs for Macie.

To create a Macie job that scans S3 buckets based on the sensitivity score

Navigate to the Amazon Macie console.
In the left navigation pane, choose S3 buckets.
For Sensitivity, add a filter as follows:
- For From, enter a minimum sensitivity score.
- For To, enter a maximum sensitivity score.
If you leave the To field blank, Macie returns a list of buckets with a score greater than or equal to the value in the From field.

Note: Sensitivity scores can vary based on the objects analyzed and whether you have the settings configured for Assign maximum score, Automatically discover sensitive data, or both.
After you add the filter, you will see the S3 bucket results for the Sensitivity values that you entered, grouped by account. To view the buckets in list view, choose the list view icon. To view the buckets in group view, choose the group view icon.

Note: You can’t create Macie scan jobs from group view. To run Macie scan jobs, switch to list view.
Make sure that you are in list view, select the specific S3 buckets that you want to scan based on the Sensitivity score, and then choose Create Jobs.

Figure 5: List view of sensitivity scores for S3 buckets
Review the S3 buckets that you selected. To exclude specific buckets, choose Remove for each bucket. After you review your selection, choose Next.
Select a scheduled job or one-time job. If you select Scheduled job, select the update frequency and whether or not to include existing objects. Configure the sampling depth to be 100%. Optionally, you can configure additional object criteria.
Select managed data identifiers, custom data identifiers, allow lists, and general settings according to your needs.
Confirm the Macie job details and choose Submit to start scanning the S3 buckets based on the sensitivity score. When this job is complete, you will receive findings on sensitive data discovered from the job.
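The Sensitivity filter in the procedure above can be mirrored in code when you work with exported bucket data. The following is a minimal sketch: the function name and the dictionary keys name and score are hypothetical, and treating both bounds as inclusive is an assumption. As in the console, leaving the upper bound empty returns every bucket with a score greater than or equal to the lower bound.

```python
from typing import Dict, Iterable, List, Optional


def filter_by_sensitivity(
    buckets: Iterable[Dict],
    from_score: int,
    to_score: Optional[int] = None,
) -> List[Dict]:
    """Keep buckets whose score lies between from_score and to_score.

    When to_score is None, only the lower bound applies.
    """
    return [
        bucket
        for bucket in buckets
        if bucket["score"] >= from_score
        and (to_score is None or bucket["score"] <= to_score)
    ]
```

For example, filter_by_sensitivity(buckets, 51) selects every bucket that the scoring table classifies as Sensitive or that was assigned the maximum score.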

When you are considering whether to run a scheduled job or a one-time job, remember that S3 bucket sensitivity scores can change based on new objects, managed or custom identifiers, and allow lists used by Macie automated sensitive data discovery. If you run a scheduled job on buckets that meet certain sensitivity score criteria, the configurations for the job are immutable in order to support data privacy and protection audits or investigations. If a new bucket meets the sensitivity score criteria, you need to create a new scheduled job to include that bucket.

Use automation to scan S3 buckets by sensitivity score and take actions based on findings

You can use the GetResourceProfile API to query specific S3 buckets and return sensitivity profiling information. With the information returned from the API, you can develop custom automation to take specific actions on buckets based on their sensitivity scores. For example, you can use Amazon EventBridge and AWS Lambda functions to create Macie jobs based on the sensitivity scores of the S3 buckets managed by Macie, as shown in the following architecture.

Figure 6: Example architecture for automated jobs based on sensitivity scores

This architecture has the following steps:

An EventBridge rule runs periodically to invoke a Lambda function that invokes the GetResourceProfile API for S3 buckets managed by the Macie admin accounts.
The Lambda function takes the following actions:
1. Creates a list of S3 buckets with maximum sensitivity scores, or with automated sensitivity profiling scores that exceed a threshold value, and then stores the results in an Amazon DynamoDB table.
2. Creates a Macie job by using items in the DynamoDB table to conduct a one-time scan with 100% sampling depth of those S3 buckets. Upon job submission, you can add a last-scanned date to the table for tracking purposes, to help avoid the creation of multiple one-time jobs on the same bucket.
The delegated Macie administrator account starts scan jobs for S3 buckets in member accounts.
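The core of the Lambda function described in these steps could look like the following minimal sketch. It assumes that macie_client is a boto3 client created with boto3.client("macie2"), and it uses the GetResourceProfile and CreateClassificationJob operations. The function names, the job naming scheme, and the threshold handling are hypothetical, and the DynamoDB bookkeeping is left out for brevity.

```python
import uuid
from datetime import datetime, timezone
from typing import Iterable, List, Optional


def select_buckets(macie_client, bucket_arns: Iterable[str], threshold: int) -> List[str]:
    """Return names of buckets with the maximum score or a score above threshold."""
    selected = []
    for arn in bucket_arns:
        profile = macie_client.get_resource_profile(resourceArn=arn)
        score = profile["sensitivityScore"]
        if score == 100 or score > threshold:
            # S3 bucket ARNs have the form arn:aws:s3:::bucket-name.
            selected.append(arn.split(":::", 1)[1])
    return selected


def build_job_request(account_id: str, bucket_names: Iterable[str],
                      now: Optional[datetime] = None) -> dict:
    """Build keyword arguments for a one-time job with 100% sampling depth."""
    now = now or datetime.now(timezone.utc)
    return {
        "clientToken": str(uuid.uuid4()),  # idempotency token for the request
        "jobType": "ONE_TIME",
        "name": f"sensitivity-scan-{now:%Y%m%dT%H%M%SZ}",  # hypothetical naming scheme
        "samplingPercentage": 100,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }
```

A handler would then call macie_client.create_classification_job(**build_job_request(account_id, names)) once for each member account that has selected buckets.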

After you conduct your Macie scans either manually or with automation, you can implement semi- or fully automated response and remediation actions based on the sensitive data findings. The following are examples of automated response and remediation actions that you can take:

You can deploy a solution that automatically sends notifications to Slack if sensitive data is found in buckets with specific sensitivity scores.
You can use AWS Security Hub custom actions to develop pre-determined response and remediation actions on Macie sensitive data findings.

Conclusion

In this blog post, we showed you how to turn on Macie automated sensitive data discovery in your AWS environment and how to use the findings to continually manage your data security posture. This new feature can help you prioritize your remediation efforts and identify buckets on which to run full scans for sensitive data discovery. We also shared a design pattern to build automation by using Macie APIs for automated remediation of Macie findings.


Get the best out of Amazon Verified Permissions by using fine-grained authorization methods

2022-11-29 Jeff Lombardo

Post Syndicated from Jeff Lombardo original https://aws.amazon.com/blogs/security/get-the-best-out-of-amazon-verified-permissions-by-using-fine-grained-authorization-methods/

With the release of Amazon Verified Permissions, developers of custom applications can implement access control logic based on caller and resource information; group membership, hierarchy, and relationship; and session context, such as device posture, location, time, or method of authentication. With Amazon Verified Permissions, you can focus on building simple authorization policies and your applications—instead of, for example, building an authorization engine for your multi-tenant consumer applications.

Amazon Verified Permissions uses the Cedar policy language, which simplifies the implementation, review, and maintenance of large and complex access control strategies.

Amazon Verified Permissions includes schema definitions, policy statement grammar, and automated reasoning that scales across millions of permissions, which enables you to enforce the principles of default deny and of least privilege. These features facilitate the deployment of an in-depth fine-grained authorization model to support your Zero-Trust objectives.

In this blog post, we’ll discuss how you can use Amazon Verified Permissions to create authorization policies that are an improvement over traditional access control models, and we provide some best practices for the use of this feature.

What is fine-grained authorization? Is it a role-based or an attribute-based access control mechanism?

Traditionally, customers deploy access control strategies based on roles or attributes.

Role-based access control (RBAC) is an approach of granting access to resources through group memberships instead of individual users. This approach, although it simplifies the definition of entitlements, can become very complex when you scale out groups’ memberships, hierarchies, and nestings.

Consider a photo sharing application that allows users to upload photos and share those photos with friends. We have a user Alice who uploads their vacation photos to a folder named Austin2022. Alice decides to share these photos with friends.

Alice provides a link to their vacation photos to a friend named Bob. Using the link, Bob is able to view photos in the folder Austin2022, because Bob is in the user group Alice/Friends. That is, Bob has the role of Alice/Friends. If Bob were removed as Alice’s friend, Bob would not be able to view Alice’s photos. This is an example of how role-based access control works.

Attribute-based access control (ABAC) deviates from the static nature of RBAC by introducing access rules based on the characteristics of the following: the requestor identity; the attributes of the resources targeted; or contextual elements such as the request time, where the request originated, or the device used to make the request.

Let’s consider who can delete photos in the example photo sharing application. We want to make sure that only Alice can delete their photos. That is, we make an authorization decision based on the attribute owner of the resource photo.

Fine-grained authorization (FGA) is a model that combines the advantages of both RBAC and ABAC, so that customers can find the right balance between each approach for their individual use case. Understanding the FGA approach is key to writing policy statements in Amazon Verified Permissions.
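
To make the distinction concrete, here is a minimal Python sketch of the two checks from the photo sharing example. It is not how Amazon Verified Permissions is implemented; the GROUPS dictionary, the Photo class, and both function names are hypothetical and exist only to contrast a membership check (RBAC) with an attribute check (ABAC).

```python
from dataclasses import dataclass


@dataclass
class Photo:
    name: str
    owner: str   # attribute used by the ABAC check
    folder: str


# Hypothetical group store: group name -> set of member user names.
GROUPS = {"Alice/Friends": {"Bob"}}


def can_view(user: str, photo: Photo) -> bool:
    """RBAC: access follows from membership in the owner's Friends group."""
    return user in GROUPS.get(f"{photo.owner}/Friends", set())


def can_delete(user: str, photo: Photo) -> bool:
    """ABAC: access follows from the photo's owner attribute."""
    return photo.owner == user


photo = Photo(name="VacationPhoto_1.jpg", owner="Alice", folder="Austin2022")
print(can_view("Bob", photo), can_delete("Bob", photo))
```

A fine-grained model lets a single policy use both kinds of check at once, which is what the policy statements in the next section express.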

How does permissions policy statement language work?

To define a policy statement, Amazon Verified Permissions uses a policy language based on the PARC model, as AWS Identity and Access Management (IAM) does for IAM policies. PARC refers to the four objects in the policy language: principal, action, resource, and condition, and these are defined as follows:

The principal is the entity taking the action. Often this will be a human user, but it could also be another service or a device.
The action is the operation being performed, for which permission must be granted. Often the action will map to an API call.
The resource is the target of the call.
The condition limits when or where the principal can perform the action on the resource.

Using this language, you can create a policy that allows user Alice (the principal) to call deletePhoto (the action) on VacationPhoto_1.jpg (the resource) when Alice is logged in by using multi-factor authentication (the condition). After the Amazon Verified Permissions policy is authored, you will store it in your Amazon Verified Permissions policy store instance.

Policy statements are divided into two sections:

The policy head, which defines the targets of the policy (principal, action, resource) and whether the policy permits or forbids the action.
The Conditions section, which allows you to place conditions that authorize API actions only when specified criteria are met.

You can use the structure of the policy statements to tell at a glance whether a policy follows an RBAC, an ABAC, or an FGA approach, as shown in the following three examples.

// This style of policy can be used to implement a RBAC approach
permit(
  principal in UserGroup::"Alice/Friends",
  action in [
    Action::"readFile", 
    Action::"writeFile"
  ],
  resource in Folder::"Playa del Sol 2021"
);

// This style of policy can be used to implement an ABAC approach
permit(principal, action, resource)
when {
  principal.permitted_access_level >= resource.access_level
};

// This style of policy can be used to implement a hybrid approach
permit(
  principal in UserGroup::"Alice/Friends",
  action in [
    Action::"readFile", 
    Action::"writeFile"
  ],
  resource in Folder::"Playa del Sol 2021"
)
when {
  principal.permitted_access_level >= resource.access_level
};

Let’s go back to our example of Alice and Bob. Now, Alice can define a policy that allows their friends to view photos in their folder Austin2022, as follows.

permit(
    principal in UserGroup::"Alice/Friends",
    action == Action::"viewPhoto",
    resource in Folder::"Austin2022"
);

The policy head says to permit the viewPhoto action to be performed on resources in the folder Austin2022 for principals in user group Alice/Friends. There is no condition section for this policy. With the preceding policy, Bob can access the photos in Alice’s Austin2022 album as long as Bob is a member of the group Alice/Friends.

We can go back to the photo deletion workflow for a more complex scenario. To delete photos, you want to ensure that the requestor owns the photo. Additionally, you might require the user to be logged in via multi-factor authentication (MFA). This policy can be written as follows.

permit(
    principal,
    action == Action::"deletePhoto",
    resource == File::"photo"
)
when {
    resource.owner == principal.name
    && context.MFA == true
};

The policy head permits a user to call the action deletePhoto on photos. The condition section limits the policy to permit photo deletion only when the resource’s owner attribute is the same as the principal’s name attribute and the context object’s MFA attribute equals true.
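
The evaluation order described above (match the head, then test the condition) can be sketched in Python. This is an illustration under stated assumptions, not the Cedar evaluator: the Request class and the permits_delete function are hypothetical, and principals and resources are reduced to plain dictionaries of attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    principal: dict   # e.g. {"name": "Alice"}
    action: str       # e.g. "deletePhoto"
    resource: dict    # e.g. {"owner": "Alice"}
    context: dict = field(default_factory=dict)


def permits_delete(req: Request) -> bool:
    """Mirror of the deletePhoto policy: head first, then the when clause."""
    head_matches = req.action == "deletePhoto"
    condition = (
        req.resource["owner"] == req.principal["name"]
        and req.context.get("MFA") is True
    )
    return head_matches and condition


alice = {"name": "Alice"}
photo = {"owner": "Alice"}
print(permits_delete(Request(alice, "deletePhoto", photo, {"MFA": True})))
```

Dropping the MFA flag, changing the principal, or requesting a different action each makes the function return False, which matches the intent of the policy.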

Designing well-architected policy statements

In this section, we cover six best practices that help customers scale out efficiently.

Use immutable identifiers to reduce risk of collision

The policy statements in this blog post and in Amazon Verified Permissions documentation intentionally use human-readable values such as Bob for a Principal entity, or Alice/Friends for a Group entity. This is useful when discussing general concepts, but in production systems, customers should utilize unique and immutable values for entities. As an example, what would happen if Alice wants to change their user name?

Instead of creating a user named Alice, you should use an autogenerated and unique identifier such as a Universally Unique Identifier (UUID). Those are generally available from your user directory, JSON Web Token, or file system. That way, you can create a user object with the ID a1b2c3d4-5678-90ab-cdef-EXAMPLE11111 and the name attribute Alice. This would allow you to update Alice’s user name without needing to recreate the user object.
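
As a small illustration of this practice, the following Python sketch keys a user object by a generated UUID and keeps the display name as an attribute. The new_user helper and the dictionary layout are assumptions made for this example, not a format required by the service.

```python
import uuid


def new_user(name: str) -> dict:
    """Create a user entity keyed by a generated, immutable identifier."""
    return {"id": str(uuid.uuid4()), "attributes": {"name": name}}


alice = new_user("Alice")
original_id = alice["id"]

# Renaming touches only the attribute; the identifier that policies and
# group memberships reference stays the same.
alice["attributes"]["name"] = "Alicia"
print(alice["id"] == original_id)
```

Because policies would reference the identifier rather than the name, the rename does not require any policy to be rewritten.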

Reduce the number of policies by using entity grouping

Policy statements can only contain a single principal entity and a single resource entity. If you want the same policy to apply to multiple principals or resources, you can group common entities and use an in statement.

In this example, Bob’s user account could be stored as the following object.

{
    "EntityId": {
        "EntityType": "User",
        "EntityId": "Bob"
    },
    "Parents": [
        {
            "EntityType": "UserGroup",
            "EntityId": "Alice/Friends"
        }
    ],
    "Attributes": {
        "username": {
            "String": "Bob"
        },
        "email": {
            "String": "[email protected]"
        }
    }
}

And user group Alice/Friends could be stored as the following object.

{                 
  "EntityId": {                     
    "EntityType": "UserGroup",                     
    "EntityId": "Alice/Friends"
    }
}

The parent relationship defined in Bob’s user account object is what makes Bob a member of the group Alice/Friends.

Now you can define a policy that allows Bob to gain access to Alice’s vacation photos because he is in the group Alice/Friends, as follows.

permit(
    principal in UserGroup::"Alice/Friends",
    action == Action::"viewPhoto",
    resource in Folder::"Austin2022"
);
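
The membership test behind the in operator can be pictured as a walk up the parent relationships. The Python sketch below assumes a hypothetical in-memory PARENTS table that stands in for the Parents field of the stored objects; the is_in function is illustrative and is not part of any AWS SDK.

```python
# Hypothetical in-memory entity store: each key is (type, id) and each value
# lists the entity's direct parents, like the "Parents" field shown above.
PARENTS = {
    ("User", "Bob"): [("UserGroup", "Alice/Friends")],
    ("UserGroup", "Alice/Friends"): [],
    ("User", "Carol"): [],
}


def is_in(entity, group) -> bool:
    """Return True if entity is the group itself or a descendant of it."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        if current == group:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(PARENTS.get(current, []))
    return False


friends = ("UserGroup", "Alice/Friends")
print(is_in(("User", "Bob"), friends), is_in(("User", "Carol"), friends))
```

Adding or removing a friend then means editing one parent relationship, while the single policy above stays unchanged.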

Use namespaces to remove ambiguity

You can use namespaces to remove ambiguity. Returning to our application, let’s say that you want to give users the ability to delete their photos. But your moderators also need the ability to delete inappropriate photos. How can you distinguish between the user action deletePhoto and the administrator action deletePhoto? Namespaces give you this flexibility.

When creating your entities, you can add namespaces in the EntityType field, as in the following example.

{
  "EntityId": {
    "EntityType": "Admin::Action",
    "EntityId": "\"deletePhoto\""
  },
  "Parents": [],
  "Attributes": {
    "readOnly": {
      "String": "false"
    },
    "appliesTo": {
      "String": "\"Photo\""
    }
  }
}

You then use the namespace in your permit policy, as follows.

permit(
  principal,
  action == Admin::Action::"deletePhoto",
  resource == File::"Photo")
when {
  principal.role == "Moderator"
};

This policy requires a user to have the role Moderator to successfully use the administrator deletePhoto action.

Set permission guardrails with forbid statements

The Amazon Verified Permissions policy engine denies any action that is not explicitly allowed with a permit policy. But you might want to establish permission guardrails to ensure that an action will never be allowed. You can create forbid policies for this purpose.

Returning to our photo sharing application, suppose that you want to ensure that no user can delete a photo unless the user has been authenticated with MFA. You could use the following policy.

forbid(
  principal,
  action == Action::"deletePhoto",
  resource == File::"Photo"
)
unless {
  context.MFA == true
};

This permission guardrail will help prevent the accidental grant of overly permissive deletePhoto permissions.
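
The way permit and forbid results combine can be sketched in a few lines of Python. This is a simplified model of the decision rule (default deny, and a matching forbid overrides any permit), with hypothetical function names; it does not evaluate real policies.

```python
def decide(permit_matches: list, forbid_matches: list) -> str:
    """Combine per-policy results into a single decision.

    Each list holds one boolean per policy of that kind, saying whether
    the policy matched the request.
    """
    if any(forbid_matches):
        return "Deny"      # a matching forbid always wins
    if any(permit_matches):
        return "Allow"     # at least one permit and no forbid
    return "Deny"          # default deny: nothing explicitly allowed


def forbid_delete_without_mfa(action: str, context: dict) -> bool:
    """The guardrail above: matches deletePhoto unless MFA was used."""
    return action == "deletePhoto" and context.get("MFA") is not True


# A permit policy matched, but the request was made without MFA.
guardrail = forbid_delete_without_mfa("deletePhoto", {"MFA": False})
print(decide([True], [guardrail]))
```

Even if someone later adds a broad permit policy for deletePhoto, the guardrail keeps the decision at Deny for requests without MFA.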

Simplify statements with unless conditions

When you define complex conditions for a policy statement, you might face situations where a policy needs multiple negative conditions. To keep such statements simple, Amazon Verified Permissions provides an alternative keyword for the conditional expression: unless. For example, you might deny moderators the ability to delete photos unless they have flagged the photo as inappropriate, are authenticated using MFA, and are on the company’s network.

The unless keyword behaves the same as when, except that the policy applies only if the condition evaluates to false. With this additional expression, you can create statements that are less complex to review and maintain. The following example shows a condition that is negated inside a when clause.

// Allow access unless a resource was deleted more than 7 days ago
permit(
  principal in Group::"Alice/Friends",
  action == Action::"readPhoto",
  resource in Folder::"Playa del Sol 2021"
)
when {
  !(resource.status == "deleted"
   && resource.deletion_date < (context.time.now - 604800)) //7 days ago
};

The following example shows how you can simplify the previous policy by using an unless expression.

// Allow access unless a resource was deleted more than 7 days ago
permit(
  principal in Group::"Alice/Friends",
  action == Action::"readPhoto",
  resource in Folder::"Playa del Sol 2021"
)
unless {
  (resource.status == "deleted"
   && resource.deletion_date < (context.time.now - 604800)) //7 days ago
};
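
To check that the two forms agree, the following Python sketch evaluates both against a few sample resources. The dictionaries, the fixed now value, and the function names are assumptions made for this illustration; timestamps are plain integers in seconds.

```python
SEVEN_DAYS = 7 * 24 * 60 * 60  # 604800 seconds, the constant in the policies


def expired(resource: dict, now: int) -> bool:
    """The shared inner condition: deleted more than seven days ago."""
    return (
        resource["status"] == "deleted"
        and resource["deletion_date"] < now - SEVEN_DAYS
    )


def permit_with_when(resource: dict, now: int) -> bool:
    return not expired(resource, now)       # when { !(...) }


def permit_with_unless(resource: dict, now: int) -> bool:
    unless_clause = expired(resource, now)  # unless { ... }
    return unless_clause is False


now = 1_000_000
samples = [
    {"status": "active"},
    {"status": "deleted", "deletion_date": now - 100},
    {"status": "deleted", "deletion_date": now - SEVEN_DAYS - 1},
]
for sample in samples:
    print(permit_with_when(sample, now), permit_with_unless(sample, now))
```

Both functions return the same value for every sample, so the choice between the two forms is about readability rather than behavior.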

Rationalize policies with a template

You might face a situation where you are repeatedly creating the same rule for different contexts. In the following example, we demonstrate a policy that permits Alice to describe the folder Alice’s Org. Then we replicate the same policy for Bob and the folder Bob’s Org.

permit(
    principal == User::"Alice",
    action == Action::"describeFolder",
    resource == Folder::"Alice's Org"
)
when {
    resource.owner == principal.username
};

permit(
    principal == User::"Bob",
    action == Action::"describeFolder",
    resource == Folder::"Bob's Org"
)
when {
    resource.owner == principal.username
};

In this case, we recommend that you use a policy template to simplify the evaluation, as in the following example.

permit(
    principal == ?principal,
    action == Action::"describeFolder",
    resource == ?resource
)
when {
    resource.owner == principal.username
};

With a policy template, the statement inherits from a placeholder (in this example, ?principal and ?resource) and will be evaluated dynamically for each policy evaluation request, based on context that the application will provide.
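
One way to picture what a template saves you is plain text substitution. The Python sketch below fills the two placeholders for Alice and for Bob; it is only an illustration of the idea, and the fill_template helper and the User entity type are assumptions for this example rather than the mechanism the service uses.

```python
TEMPLATE = """permit(
    principal == ?principal,
    action == Action::"describeFolder",
    resource == ?resource
)
when {
    resource.owner == principal.username
};"""


def fill_template(template: str, principal: str, resource: str) -> str:
    """Substitute the two placeholders with concrete entity references."""
    return template.replace("?principal", principal).replace("?resource", resource)


alice_policy = fill_template(TEMPLATE, 'User::"Alice"', 'Folder::"Alice\'s Org"')
bob_policy = fill_template(TEMPLATE, 'User::"Bob"', 'Folder::"Bob\'s Org"')
print(alice_policy)
```

The rule itself is written and reviewed once; only the pair of principal and resource varies from one use to the next.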

Conclusion: Start authorizing with Amazon Verified Permissions

With Amazon Verified Permissions, you can create permission policies with expressiveness, performance, and readability in mind.

Using the best practices described in this post, you are ready to author policies with Amazon Verified Permissions. When combined with services like Amazon Cognito, Amazon API Gateway, an AWS Lambda authorizer, or AWS AppSync, Amazon Verified Permissions allows you to unlock in-depth and explicit access control logic securely using native AWS services.

Over the next months, AWS will release more resources to support our customers in their implementation of Amazon Verified Permissions. Learn more about Amazon Verified Permissions. Stay tuned and happy building.


Monitoring shared AWS Outposts rack capacity

2022-11-29 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/monitoring-shared-aws-outposts-rack-capacity/

This post is written by Adam Imeson, Sr. Hybrid Edge Specialist Solutions Architect.

AWS Outposts rack is a fully-managed service that offers the same AWS infrastructure, APIs, tools, and a subset of AWS services to any data center, colocation space, or on-premises facility for a consistent hybrid experience. Outposts rack is ideal for workloads that require low latency, access to on-premises systems, local data processing, data residency, and migration of applications with local system interdependencies.

An Outpost is a pool of AWS compute and storage capacity deployed at a customer site. In an Outposts rack deployment, an Outpost may comprise one or more racks connected together at the site. It’s common for customers to order their Outpost in a dedicated account and then integrate with their multi-account organizational architecture by sharing the Outpost via AWS Resource Access Manager (AWS RAM). This post will explain how to set up cross-account Amazon CloudWatch metrics so that disparate stakeholders within your organization can effectively monitor your Outpost’s capacity to meet their specific needs.

Overview

The AWS account that you use to order an Outpost owns that Outpost. This includes all metrics and health events pertaining to that Outpost. Many customers must integrate Outposts into their multi-account environments, as discussed in the “Best practices: AWS Outposts in a multi-account AWS environment” posts (part 1 and part 2). This post will go into more detail on how to monitor Outposts in these environments.

The nuance here stems from the different ways to share access to AWS resources. AWS RAM allows infrastructure resources to be shared across multiple accounts. Then, the consumer accounts can launch resources on the infrastructure as though they owned it. AWS Identity and Access Management (IAM) allows customers to modify a given account’s permissions such that users in other accounts can make AWS API calls that affect the given account.

An Outpost provides infrastructure resources, so customers can share Outposts via AWS RAM. CloudWatch metrics about Outposts are data which customers retrieve using AWS API calls, so customers can share access to those metrics using IAM.

In a typical customer’s AWS Organization, there are two cases to consider. First, when the customer is sharing an Outpost to multiple development accounts, each account needs to view metrics relevant to the Outpost so that the development accounts can deploy and operate their applications.

Second, when the customer has several accounts that each own different Outposts, the customer’s centralized monitoring account needs to track metrics relevant to each of the Outposts.

This post will explain strategies for both cases.

Customers must monitor the health of the Outpost’s connection to its regional control plane (the Outpost’s service link), as an Outpost is an extension of an AWS Availability Zone (AZ) and is designed to be connected to an AZ at all times. The health of the Outpost’s service link is a crucial variable when application owners are diagnosing disruptions to their application, and also when infrastructure owners are diagnosing disruptions to a site. Customers can monitor their service link’s status with the ConnectedStatus metric.

Customers also must monitor their Outposts’ current capacity. Outposts necessarily have a limited capacity footprint when compared to an AWS Region. Application owners must make informed decisions about capacity as they scale their apps over time or respond to occasional hardware failures. Infrastructure owners also must maintain a holistic view of capacity across all of the Outposts for which they are responsible so that they can plan for capacity expansion over time. Customers can monitor their Outposts’ capacity using the various capacity metrics that Outposts provide.

For an overview of how to set up a capacity dashboard and capacity-based CloudWatch alarms within a single account, see “Monitoring AWS Outposts capacity.” This post will expand on the single-account strategy by introducing cross-account capabilities. See also “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.” These two posts provide practical walkthroughs for setting up the metric flows explained below.

Setting up Outposts metric permissions for your organization

This post assumes that you have multiple Outposts in different accounts that are all part of the same Organization. You’re sharing these Outposts into accounts that development teams use to deploy and operate their applications. You also have a centralized monitoring account where your infrastructure team tracks various metrics across all accounts. Your Organization might look something like this:

The first Outpost is shared to Accounts A and B, and the second Outpost is only shared to Account B. This is just an example of how a customer might set up their environment so that Application A can deploy on Outpost 1, and Application B can deploy on both Outpost 1 and 2.

To enable centralized monitoring, each account shares CloudWatch metrics with the central monitoring account as described in “Cross-Account Cross-Region Dashboards with Amazon CloudWatch.”

Now there are application accounts which can launch on the desired Outposts, and all of the accounts are sharing metrics with the central monitoring account. The team responsible for procuring and managing the Outposts can now set up dashboards in the central monitoring account in accordance with “Monitoring AWS Outposts capacity” to get a holistic view of capacity. This is valuable for capacity planning as applications naturally grow over time.

However, this may not be sufficient for operations. Consider that each application team needs to understand how much capacity is available on the Outpost that they’re using. This is crucial for teams operating highly available applications to maintain awareness of whether they still have N+1 capacity available on the Outpost to use in the event of a hardware failure. This is also important for planning expansions to the application ahead of time, as application teams have the best understanding of the future needs of their applications. Finally, application teams can use the metrics to track the operational health of the Outpost, which is crucial for root-causing any application disruptions.

You can implement this by sharing CloudWatch metrics from the Outpost accounts to the application accounts which are consuming the Outposts’ capacity, as shown in the following diagram.

Walkthrough

Log in to your application account and navigate to the CloudWatch console. Open the Settings menu and choose Configure.

Scroll to the bottom. In the View cross-account cross-region section, choose Edit.

Choose your preferred account selection method from the three options and choose Save changes. I recommend the Custom account selector option, as it strikes a good balance between a simple setup and ease of use. If you choose this option, then input the Outpost owner account’s account ID and a human-readable name for the account. This name will appear in the drop-down when you’re using the CloudWatch console to view metrics from other accounts later.

Your application account is now prepared to view metrics from the Outpost owner account. Now log in to the account that owns the Outpost and navigate to the CloudWatch console. You still need to share the Outpost’s metrics to the application account. Open the Settings page again, and choose Configure in the Cross-account cross-region section as before. This time, choose Share data in the Share your CloudWatch data section:

Choose Add account and input the application account’s account ID. Then scroll to the bottom of the page and choose Launch CloudFormation template.

The AWS CloudFormation template will create the CloudWatch-CrossAccountSharingRole. This role gives CloudWatch read access to the AWS account that you specified, the application account. You can view and modify this role using the IAM console if you want to. For example, you might adjust the role to allow read access to an entire Organizational Unit (OU).
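
The post does not show the contents of the role, so the following Python sketch is an assumption about its general shape: a trust policy that lets a single viewer account assume the sharing role. The function name and the example account ID are hypothetical; review the role that the template actually creates in the IAM console before relying on this.

```python
import json


def sharing_role_trust_policy(viewer_account_id: str) -> dict:
    """Trust policy letting one viewer account assume the sharing role."""
    if not (viewer_account_id.isdigit() and len(viewer_account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{viewer_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# 111122223333 is a placeholder for the application account's ID.
policy = sharing_role_trust_policy("111122223333")
print(json.dumps(policy, indent=2))
```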

Now, log back in to the application account and navigate to the CloudWatch console. Choose All metrics in the left-side menu. In the Metrics section, select the Outpost owner account from the drop-down.

You can now view the metrics from the Outpost owner account and incorporate them into the dashboards in the application account. Now the application teams can track the Outposts’ ConnectedStatus metrics to be alerted on any disconnections from the Region, and they can track the Outposts’ capacity metrics as well. It’s a best practice to alarm on Outpost capacity metrics once consumption crosses a threshold defined by business needs.
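
The threshold logic behind such an alarm can be sketched in Python. The numbers, the default threshold of 0.8, and the function names below are assumptions for illustration; in practice the used and total counts would come from the Outposts capacity metrics, and the alarm itself would be a CloudWatch alarm.

```python
def utilization(used: int, total: int) -> float:
    """Fraction of capacity slots in use for one instance type."""
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return used / total


def should_alarm(used: int, total: int, threshold: float = 0.8) -> bool:
    """True once utilization reaches the business-defined threshold."""
    return utilization(used, total) >= threshold


def has_n_plus_one(used: int, total: int, slots_per_host: int) -> bool:
    """True if one host's worth of slots is still free after current usage."""
    return total - used >= slots_per_host


# Hypothetical Outpost: 10 slots of one instance type, 4 slots per host.
print(should_alarm(9, 10), has_n_plus_one(6, 10, slots_per_host=4))
```

The second check reflects the N+1 concern raised earlier: a team can be below its alarm threshold and still lack enough free slots to absorb the loss of one host.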

Conclusion

Outposts rack allows customers to deploy AWS infrastructure into virtually any data center, colocation space, or on-premises facility. Outposts are tied to the AWS account that ordered them, and customers can share Outposts among AWS accounts within the same Organization. When multiple teams within a customer’s Organization are interacting with the same Outpost, that introduces additional monitoring surface area for capacity and service health. This post explains how customers can accommodate their teams’ different needs by sharing Outposts metrics around their Organization along with their Outposts. As best practices, customers should share their Outposts capacity and ConnectedStatus metrics to teams who are running applications on Outposts. Customers’ operations teams should also work with their stakeholders to define a maximum capacity utilization threshold for a given Outpost and alarm on that threshold.

Deploy AWS Organizations resources by using CloudFormation

2022-11-29 Matt Luttrell

Post Syndicated from Matt Luttrell original https://aws.amazon.com/blogs/security/deploy-aws-organizations-resources-by-using-cloudformation/

AWS recently announced that AWS Organizations now supports AWS CloudFormation. This feature allows you to create and update AWS accounts, organizational units (OUs), and policies within your organization by using CloudFormation templates. With this latest integration, you can efficiently codify and automate the deployment of your resources in AWS Organizations.

You can now manage your AWS organization resources using infrastructure as code (IaC) and make changes in a central place. This can help reduce the time required to build a new organization, expand or modify the existing organization, replicate your organization infrastructure, or apply and update policies across multiple accounts and OUs. You can also delete organization resources by deleting the stacks.

In this blog post, we will show you how to create various AWS Organizations resources for a multi-account organization by using a CloudFormation template.

How does it work?

A CloudFormation template describes your desired resources and their dependencies so that you can launch and configure them together as a stack. You can use a template to create, update, and delete an entire stack as a single unit instead of managing resources individually.

With CloudFormation support for AWS Organizations, you can now do the following:

Create, delete, or update an organizational unit (OU). An OU is a container for accounts that allows you to organize your accounts to apply policies according to your needs.
Create accounts in your organization, add tags, and attach them to OUs.
Add or remove a tag on an OU.
Create, delete, or update a service control policy (SCP), backup policy, tag policy and artificial intelligence (AI) services opt-out policy.
Add or remove a tag on an SCP, backup policy, tag policy, and AI services opt-out policy.
Attach or detach an SCP, backup policy, tag policy, and AI services opt-out policy to a target (root, OU, or account).

To create AWS Organizations resources using CloudFormation, you will need to use your organization’s management account. As of this writing, the new resource types may only be deployed from the organization’s management account or a delegated administrator account.

Overview of the new resource types

The following are the three new resource types available for the implementation and management of an account, OU, and organizations policy in CloudFormation:

AWS::Organizations::Account – Creates an account that is automatically a member of the organization whose credentials made the request.
AWS::Organizations::OrganizationalUnit – Creates an OU within a root or parent OU.
AWS::Organizations::Policy – Creates a policy of a specified type that you can attach to a root, OU, or individual account.

Prerequisites

This blog post assumes that you have AWS Organizations enabled in your management account. You also need the tag policy and service control policy types enabled in your management account. For instructions on how to create an organization, see Create your organization.

You should also review the following important points for creating resources in AWS Organizations:

AWS Organizations supports the creation of a single account at a time. If you include multiple accounts in a single CloudFormation template, you should use the DependsOn attribute so that your accounts are created sequentially.
Before you can create a policy of a given type, you must first enable that policy type in your organization.
The number of levels deep that you can nest OUs depends on the policy types that you have enabled for the root. For SCPs, the limit is five.
To modify the AccountName, Email, and RoleName for the account resource parameters, you must sign in to the AWS Management Console as the AWS account root user.
Since the CloudFormation template in this blog post deploys Account and Organizational Unit resources, you must deploy it in your organization’s management account.

For a complete list of dependencies, see the AWS Organizations resource type reference.
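
The first point above (create accounts one at a time by chaining DependsOn) can be sketched as a small Python generator for the Resources section. The function name, the logical IDs, and the example.com addresses are hypothetical; the property names match those used in the template later in this post.

```python
import json


def account_resources(accounts: list, parent_ou_ref: str) -> dict:
    """Build AWS::Organizations::Account resources created one after another.

    accounts is a list of (logical_id, account_name, email) tuples.
    """
    resources = {}
    previous = None
    for logical_id, name, email in accounts:
        resource = {
            "Type": "AWS::Organizations::Account",
            "Properties": {
                "AccountName": name,
                "Email": email,
                "ParentIds": [{"Ref": parent_ou_ref}],
            },
        }
        if previous is not None:
            resource["DependsOn"] = previous  # wait for the prior account
        resources[logical_id] = resource
        previous = logical_id
    return resources


resources = account_resources(
    [
        ("AccountA", "AccountA", "account-a@example.com"),
        ("AccountB", "AccountB", "account-b@example.com"),
        ("AccountC", "AccountC", "account-c@example.com"),
    ],
    parent_ou_ref="ProductionOU",
)
print(json.dumps(resources, indent=2))
```

Each account after the first depends on the one before it, so CloudFormation creates them sequentially instead of in parallel.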

Use a CloudFormation template with the new AWS Organizations resources

In this section, we will walk you through a sample CloudFormation template that incorporates the newly supported AWS Organizations resources. CloudFormation provisions and configures the resources for you, so that you don’t have to individually create and configure them and determine resource dependencies.

The template will create the following resources and structure.

Three organizational units
- Infrastructure – Within the organizational root
- Production – Within the Infrastructure OU
- Security – Within the organizational root
One account
- AccountA – Within the Production child OU
Two service control policies
- PreventLeavingOrganization – Attached to the organizational root
- PreventCloudTrailDisablement – Attached to the Security OU
One tag policy
- DefineTagKeyCase – Attached to the Production child OU

Note: The above OU and account layout is only an example for the purpose of this blog post. Refer to the Organizing Your AWS Environment Using Multiple Accounts whitepaper for more information on multi-account strategy best practices and recommendations.

Download the template

Download the CloudFormation template. The following shows the contents of the template:

AWSTemplateFormatVersion: '2010-09-09'
Description: "AWS Organizations using Cloudformation - Creates OU, nested OU, account and organizations policies"

Parameters:
  OrganizationRoot:
    Description: 'Organization ID'
    Type: String 

Resources:
  InfrastructureOU:
      Type: AWS::Organizations::OrganizationalUnit
      Properties:
          Name: Infrastructure
          ParentId: !Ref OrganizationRoot

  SecurityOU:
      Type: AWS::Organizations::OrganizationalUnit
      Properties:
          Name: Security
          ParentId: !Ref OrganizationRoot

  ProductionOU:
      Type: AWS::Organizations::OrganizationalUnit 
      Properties:
          Name: Production
          ParentId: { "Ref" : "InfrastructureOU" }
      DependsOn: InfrastructureOU

  AccountA:
      Type: AWS::Organizations::Account
      Properties:
          AccountName: AccountA
          Email: [email protected]
          ParentIds: [{"Ref": "ProductionOU"}]            

  PreventLeavingOrganizationSCP:
      Type: AWS::Organizations::Policy
      Properties:
          TargetIds: [{"Ref": "OrganizationRoot"}]
          Name: PreventLeavingOrganization
          Description: Prevent member accounts from leaving the organization
          Type: SERVICE_CONTROL_POLICY
          Content: >-
            {
                "Version": "2012-10-17",
                "Statement": [
                    {
                        "Effect": "Deny",
                        "Action": [
                            "organizations:LeaveOrganization"
                        ],
                        "Resource": "*"
                    }
                ]
            }
          Tags:
            - Key: DoNotDelete
              Value: 'True'

  PreventCloudTrailDisablementSCP:
      Type: AWS::Organizations::Policy
      Properties:
          TargetIds: [{"Ref": "SecurityOU"}]
          Name: PreventCloudTrailDisablement
          Description: Prevent users from disabling CloudTrail or altering its configuration
          Type: SERVICE_CONTROL_POLICY
          Content: >-
            {
              "Version": "2012-10-17",
              "Statement": [
                {
                  "Effect": "Deny",
                  "Action": [
                    "cloudtrail:DeleteTrail",
                    "cloudtrail:PutEventSelectors",
                    "cloudtrail:StopLogging", 
                    "cloudtrail:UpdateTrail" 

                  ],
                  "Resource": "*"
                }
              ]
            }

  TagPolicy:
      Type: AWS::Organizations::Policy
      Properties:
          TargetIds: [{"Ref": "ProductionOU"}]
          Name: DefineTagKeyCase
          Description: CostCenter tag should comply with case specified in the policy
          Type: TAG_POLICY
          Content: >-
            {
                "tags": {
                  "CostCenter": {
                      "tag_key": {
                        "@@assign": "CostCenter",
                        "@@operators_allowed_for_child_policies": ["@@none"]
                        }
                      }
                    }
            }
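
Because each Content property carries its policy as a JSON string, a syntax slip in that string is easy to miss when editing the template. The following Python sketch is one possible pre-check, not part of the AWS tooling: it parses a shortened copy of the PreventLeavingOrganization content and checks each statement’s Effect. The check_policy_content function is hypothetical.

```python
import json

# A shortened copy of the PreventLeavingOrganization content from the template.
SCP_CONTENT = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": ["organizations:LeaveOrganization"],
            "Resource": "*"
        }
    ]
}
"""


def check_policy_content(content: str) -> dict:
    """Parse policy content and check each statement's Effect."""
    document = json.loads(content)  # raises ValueError on malformed JSON
    for statement in document.get("Statement", []):
        if statement.get("Effect") not in ("Allow", "Deny"):
            raise ValueError("Effect must be Allow or Deny")
    return document


document = check_policy_content(SCP_CONTENT)
print(document["Statement"][0]["Action"])
```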

Create a stack with the template

In this section, you will create a stack by using the CloudFormation template that you downloaded.

To create the stack

Create the AWS Organizations resources outlined in the template by creating an IAM role for CloudFormation using the following IAM permissions policy and trust policy.

Permissions policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOnlyPermissions",
            "Effect": "Allow",
            "Action": [
                "organizations:Describe*",
                "organizations:List*",
                "account:GetContactInformation",
                "account:GetAlternateContact"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowCreationOfResources",
            "Effect": "Allow",
            "Action": [
                "organizations:CreateAccount",
                "organizations:CreateOrganizationalUnit",
                "organizations:CreatePolicy"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowModificationOfResources",
            "Effect": "Allow",
            "Action": [
                "organizations:UpdateOrganizationalUnit",
                "organizations:AttachPolicy",
                "organizations:TagResource",
                "account:PutContactInformation"
            ],
            "Resource": "*"
        }
    ]
}

Trust policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "cloudformation.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
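
As a rough illustration, the role could also be created programmatically instead of in the console. The following Python sketch assumes that boto3 is installed and that credentials for the management account are configured. The function names and the role and policy names passed in are hypothetical and not part of the original walkthrough.

```python
import json

# Trust policy from this post: allows CloudFormation to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {"Service": "cloudformation.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def build_create_role_request(role_name, trust_policy):
    """Build the keyword arguments for the IAM CreateRole call."""
    return {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(trust_policy),
    }


def create_cloudformation_role(role_name, policy_name, permissions_policy):
    """Create the role and attach the permissions policy inline.

    Assumption: boto3 is available; it is imported lazily so that the
    request builder above can be used without it.
    """
    import boto3

    iam = boto3.client("iam")
    iam.create_role(**build_create_role_request(role_name, TRUST_POLICY))
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(permissions_policy),
    )
```

Here, permissions_policy is the permissions policy shown above, loaded as a Python dictionary.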

Sign in to the management account for your organization, navigate to the CloudFormation console, and choose Create stack.
Choose With new resources (standard), upload the template file, and choose Next.

Figure 1: CloudFormation console showing creation of stack
Enter a name for the stack (for example, CloudFormationForAWSOrganizations). For OrganizationRoot, enter your organization’s root ID. You can find the root ID in the AWS Organizations console.
Choose Create stack.
On the Configure stack options page, in the Permissions section, choose the IAM role that you granted permissions to previously, as shown in Figure 2. Then choose Next.

Figure 2: Set IAM role permissions for CloudFormation

You will see a screen showing stack creation in progress.

Figure 3: CloudFormation console showing stack creation in progress
When the stack has been created, choose the Resources tab to see the resources created.

Figure 4: CloudFormation console showing stack resources created

Confirm and visualize the resources created by using the console

In this section, you will use the console to confirm and visualize the resources created.

To confirm and visualize the resources

Navigate to the AWS Organizations console.
In the left navigation pane, choose AWS accounts to see the OUs and account that were created.

Figure 5: AWS Organizations console showing the organization structure

Confirm the service control policy created and attached to the organization’s root

In this section, you will confirm that the SCP was created and attached to the organization’s root.

Note: When you enable SCPs on an organization, an AWS full access policy is attached by default at each level (root, OU, and account) of your organization. Because you can attach policies to multiple levels of the organization, accounts can inherit multiple policies with an effect of deny. For more details, see inheritance for service control policies.

To confirm the SCP was created and attached to the root

To view the service control policy, choose Root, and then in the section Applied policies, review the list of policies. The PreventLeavingOrganization SCP prevents the use of the LeaveOrganization API so that member accounts can’t remove their accounts from the organization.

Figure 6: AWS Organizations console showing the organization’s root
To confirm that the DoNotDelete tag was attached to the PreventLeavingOrganization SCP, choose the policy name and then choose the Tags tab.

Figure 7: SCP with tags attached to it in Organizations

Confirm the service control policy created and attached to the Security OU

In this section, you will confirm that the PreventCloudTrailDisablement SCP was created and attached to the Security OU, which prevents users or roles in the accounts in the Security OU from disabling an AWS CloudTrail log.

To confirm that the SCP was created and attached to the Security OU

From the left navigation pane, choose AWS accounts, and then choose Security.
On the Security page, choose the Policies tab to see a list of policies.
To review and confirm the contents of the policy, choose PreventCloudTrailDisablement.

Figure 8: SCP attached to the Security OU in Organizations

Confirm the account and tag policy created and attached to the Production OU

In this step, you will confirm that the account and tag policy were created and attached to the Production OU.

To confirm creation of the account and tag policy in the Production OU

On the Production page, choose the Children tab to confirm that the account named AccountA was created.

Figure 9: The Production OU and account A in Organizations
To confirm that the DefineTagKeyCase tag policy was attached to the Production OU, do the following:
1. From the left navigation pane, choose AWS accounts, and then choose Production.
2. Choose the Policies tab to see the list of policies.
3. In the Tag policies section, under Applied policies, choose DefineTagKeyCase to confirm the contents of the policy. This policy defines the tag key and the capitalization that you want accounts in the Production OU to standardize on.
  
  Figure 10: SCP and tag policy attached to the Production OU in Organizations

Conclusion

In this blog post, you learned how to create AWS Organizations resources, including organizational units, accounts, service control policies, and tag policies by using CloudFormation. You can use this new feature to model the state of your infrastructure as code and to help deploy your AWS resources in a safe, repeatable manner at scale.

To learn more about managing AWS Organizations resources with CloudFormation, see AWS Organizations resource type reference in the CloudFormation documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

BloomIP Automatically Identifies production issues with Amazon DevOps Guru

2022-11-28 David Ernst

Post Syndicated from David Ernst original https://aws.amazon.com/blogs/devops/bloomip-automatically-identifies-production-issues-with-amazon-devops-guru/

Operational excellence is critical for BloomIP’s customers. In this post, you will see how we built a solution to automate the detection of trends and issues in production workloads by implementing Amazon DevOps Guru for our clients.

BloomIP ensures your business is ready for what’s ahead, with security, scalability, performance, and cost control. We are a cloud solutions partner that gets to know both the people and processes in your business.

The Challenge

Identifying operational issues within applications and services is time-consuming. It requires developers and cloud engineers to spend valuable time manually debugging with multiple tools. We needed to quickly identify any operational issues related to our clients’ applications, including load balancer errors or delays that users experience when accessing an application. Keeping an application up and running during certain times of the day is crucial to the success of a client’s business, so we needed to identify any downtime or performance patterns and quickly address related issues.

Analyzing an AWS environment after any incident requires a combination of tools such as Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray. We spent hours poring over the information in each tool to try to identify patterns and troubleshooting steps. Even then, identifying issues that correlate across those tools was a manual process.

Automating Identification of Operational Issues

To address the challenge of tedious, manual analysis across different tools, we implemented Amazon DevOps Guru for many of our clients. Amazon DevOps Guru automatically ingests the related data from the services mentioned above and applies machine learning techniques to analyze abnormal behaviors and recommend fixes. Amazon DevOps Guru organizes its findings into reactive and proactive insights.

We capture Amazon DevOps Guru insights as events using Amazon EventBridge and send them to an Amazon SNS topic, which then notifies us through email and Slack.
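
The routing described above can be sketched in Python. This is a minimal illustration, not BloomIP’s actual configuration. The event source and detail-type values are assumptions that should be checked against the DevOps Guru EventBridge documentation, and the rule name and topic ARN are hypothetical parameters. The small matches helper only imitates top-level exact matching so that the pattern can be checked locally.

```python
import json


def insight_event_pattern():
    """Event pattern for newly opened DevOps Guru insights.

    Assumption: the source and detail-type strings below match what
    DevOps Guru emits; verify them before relying on this pattern.
    """
    return {
        "source": ["aws.devops-guru"],
        "detail-type": ["DevOps Guru New Insight Open"],
    }


def matches(pattern, event):
    """Local stand-in for EventBridge matching (top-level exact match only)."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def create_rule(rule_name, topic_arn):
    """Create the rule and point it at an SNS topic.

    Assumption: boto3 is installed and credentials are configured. The
    topic's access policy must also allow EventBridge to publish to it.
    """
    import boto3

    events = boto3.client("events")
    events.put_rule(Name=rule_name, EventPattern=json.dumps(insight_event_pattern()))
    events.put_targets(Rule=rule_name, Targets=[{"Id": "sns-topic", "Arn": topic_arn}])
```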

Architecture diagram showing a typical three-tier web app that uses AWS services and integrates with Amazon DevOps Guru, Amazon EventBridge, and an Amazon SNS topic to send notifications through email and Slack

Figure 1. Architecture diagram

Results

BloomIP is leveraging DevOps Guru to scale its operations across multiple customers. Amazon DevOps Guru was easy to enable; it provides us with a single console experience to search and visualize operational data. In addition to detecting anomalies, we can see graphs and timelines related to the numerous anomalous metrics and more contextual information such as relevant events and log snippets. This helps us quickly understand the anomaly scope. Because it integrates data across multiple sources such as Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray, Amazon DevOps Guru reduces the need for us to use numerous tools.

“We were looking at a way to effortlessly scale our observability needs across multiple clients while ensuring we had the proper coverage. DevOps Guru gives us additional insight and assurance by quickly pointing out anomalies in our client’s environments. With ML-powered recommendations, DevOps Guru has allowed us to remediate repeated production issues automatically.” – Joshua Haynes, Director of Engineering, BloomIP

Conclusion

Amazon DevOps Guru gives BloomIP a streamlined way to visualize operational data. It integrates data from multiple sources, including Amazon CloudWatch, AWS Config, AWS CloudTrail, AWS CloudFormation, and AWS X-Ray, which reduces the need to use multiple tools. DevOps Guru gives you a single-console dashboard to look for and visualize anomalies in your operational data.

Start monitoring your AWS applications with Amazon DevOps Guru today.


Lower your Amazon OpenSearch Service storage cost with gp3 Amazon EBS volumes

2022-11-24 Siddhant Gupta

Post Syndicated from Siddhant Gupta original https://aws.amazon.com/blogs/big-data/lower-your-amazon-opensearch-service-storage-cost-with-gp3-amazon-ebs-volumes/

Amazon OpenSearch Service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open-source, distributed search and analytics suite comprising OpenSearch, a distributed search and analytics engine, and OpenSearch Dashboards, a UI and visualization tool. When you use Amazon OpenSearch Service, you configure a set of data nodes to store indexes and serve queries. The service supports instance types for data nodes with different storage options. Some supported Amazon Elastic Compute Cloud (Amazon EC2) instance types, like the R6GD or I3, have local NVMe disks. Others use Amazon Elastic Block Store (Amazon EBS) storage.

In July 2022, OpenSearch Service launched support for the next-generation general purpose SSD (gp3) EBS volumes. OpenSearch Service data nodes require low-latency, high-throughput storage to provide fast indexing and querying. With gp3 EBS volumes, you get higher baseline performance (IOPS and throughput) at a 9.6% lower cost than with the previously offered gp2 EBS volume type. You can provision additional IOPS and throughput independent of volume size using gp3. gp3 volumes are also more stable because they don’t use burst credits. OpenSearch Service support for gp3 volumes includes doubling the limit on per-data node volume sizes. With these larger volumes, you can reduce the cost of passive data by increasing the amount of storage per node.

We recommend that you consider gp3 as the best Amazon EBS option for price/performance and flexibility. In this post, I discuss the basics of gp3 and various cost-saving use cases. Migrating from previous generation storage (gp2, PIOPS, and magnetic) volumes to the latest generation gp3 volumes allows you to reduce monthly storage costs and optimize instance utilization.

Comparing gp2 and gp3

gp3 is the successor to the general purpose SSD gp2 volume. The key benefits of gp3 include higher baseline performance, 9.6% lower cost, and the ability to provision higher performance regardless of volume size. The following table summarizes the key differences between gp2 and gp3.

Volume type	gp3	gp2
Volume size	Depends on instance type. Max OpenSearch Service supports 24 TiB for R6g.12Xlarge. For the latest instance limits, see Amazon OpenSearch Service quotas.	Depends on instance type. Max OpenSearch Service supports 12 TiB for R6g.12Xlarge.
Baseline IOPS	3,000 IOPS for volume size up to 1,024 GiB. For volumes above 1,024 GiB, you get 3 IOPS/GiB, without burst credit complexity.	3 IOPS/GiB (minimum 100 IOPS) to a maximum of 16,000 IOPS. Volumes smaller than 1 TiB can also burst up to 3,000 IOPS.
Max IOPS/volume	16,000	16,000
Baseline throughput	125 MiB/s free for volume size up to 170 GiB, or 250 MiB/s free for volume above 170 GiB.	Between 125 MiB/s and 250 MiB/s, depending on the volume size.
Max throughput/volume	1,000 MiB/s	250 MiB/s
Price for us-east-1 Region	Storage – $0.122/GB-month. IOPS – 3,000 IOPS free for volumes up to 1,024 GiB, or 3 IOPS/GiB free for volumes above 1,024 GiB. $0.008/provisioned IOPS-month over free limits. Throughput – 125 MiB/s free for volumes up to 170 GiB, or +250 MiB/s free for every 3 TiB for volumes above 170 GiB. $0.064/provisioned MiB/s-month over free limits.	Storage – $0.135/GB-month. IOPS and throughput provisioning not allowed.
Instance supported	T3, C5, M5, R5, C6g, M6g, and R6g	T2, C4, M4, R4, T3, C5, M5, R5, C6g, M6g, and R6g

Lower your monthly bills with gp3

The ability to provision IOPS and throughput independent of volume size and support for denser (twice as large) volume sizes are two significant advantages of gp3 adoption. Together, these benefits enable multiple use cases to lower your monthly bills. In this section, we present a few examples of pricing comparisons for OpenSearch domains.

gp2 vs. gp3

This is the most common scenario, in which existing gp2 customers switch to gp3 and immediately begin saving 9.6% due to the lower monthly price per GB for gp3 storage. You can also benefit from the fact that gp3 supports volume sizes two times larger for the R5, R6g, M5, and M6g instance families. This means that you don’t need to spin up new instances for denser storage requirements and can achieve higher storage on the same instance. OpenSearch Service currently supports a maximum of 24 TiB of gp3 storage on R6g.12Xlarge instances.

PIOPS (io1) vs. gp3

OpenSearch Service supports the PIOPS SSD (io1) EBS volume type. You can switch to gp3 and provision additional IOPS and throughput to meet your specific performance requirements. The following table compares the monthly cost of PIOPS (io1) and gp3 storage with R5.large.search instances for storage requirements of 6 TiB and 16000 IOPS. In this example, you would save 65% with gp3 adoption.

.	PIOPS (io1)	gp3
Instance cost	6 instances * $0.186/hr = $830/month (r5.large.search can support up to 1 TiB storage for io1; to support 6 TiB we require six instances.)	3 instances * $0.167/hr = $372/month (r6g.large.search can support up to 2 TiB storage for gp3; to support 6 TiB we require three instances.)
Storage cost (6 TiB)	6,597 GB * $0.169/GB-month = $1115/month Notes: (a) Price for PIOPS(io1) is $0.169 per GB/month. (b) 6TiB = 6597 GB	6,597 GB * $0.122/GB-month = $805/month Notes: (a) Price for gp3 storage is $0.122 per GB/month. (b) 6TiB = 6597 GB
PIOPS cost (16000 PIOPS)	16000 IOPS * $0.088/IOPS-month = $1408/month Note: io1 PIOPS rate is $0.088 per IOPS-month.	18,000 IOPS are included in the price for a 6 TiB gp3 volume, so you don’t pay extra. Note: 3 IOPS/GiB of storage is included in the price.
Total monthly bills	$3,353/month	$1,177/month
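
The arithmetic behind this table can be reproduced with a short Python sketch. The rates and quantities are the ones quoted in the table. The 744-hour month (31 days) is an assumption that happens to reproduce the table’s instance figures, and the function and variable names are illustrative only.

```python
HOURS_PER_MONTH = 744  # assumption: 31-day month, consistent with the table's figures


def monthly_cost(instances, hourly_rate, storage_gb, storage_rate,
                 provisioned_iops=0, iops_rate=0.0):
    """Monthly cost in USD: instances + storage + separately billed IOPS."""
    instance_cost = instances * hourly_rate * HOURS_PER_MONTH
    storage_cost = storage_gb * storage_rate
    iops_cost = provisioned_iops * iops_rate
    return instance_cost + storage_cost + iops_cost


# PIOPS (io1): six r5.large.search nodes, 6 TiB (6,597 GB), 16,000 provisioned IOPS
io1_total = monthly_cost(6, 0.186, 6597, 0.169, provisioned_iops=16000, iops_rate=0.088)

# gp3: three r6g.large.search nodes, same storage, IOPS covered by the baseline
gp3_total = monthly_cost(3, 0.167, 6597, 0.122)

savings = (io1_total - gp3_total) / io1_total
print(f"io1: ${io1_total:,.0f}  gp3: ${gp3_total:,.0f}  savings: {savings:.0%}")
```

The same function can be reused for the other comparisons in this section by substituting their instance counts and rates.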

I3 vs. gp3

I3 instances include Non-Volatile Memory Express (NVMe) SSD-based instance storage that is optimized for low latency, very high random I/O performance, and high sequential read throughput, and that delivers high IOPS. However, I3 uses older third-generation CPUs, and the largest supported storage size is 15 TiB with the i3.16xlarge.search instance. You should consider using the latest generation instances, such as R6g, with gp3 storage to get lower cost and better performance than I3 instances.

To comprehend the cost advantage, let’s compare I3 and gp3 for 12 TiB of data storage needs. By switching to gp3 along with the current generation of instances, you can reduce your monthly bills by 56%, according to the calculations in the following table.

.	I3.4xlarge	gp3 with R6g.xlarge
On-demand instance cost for us-east-1 Region	4 instances * $1.99/hr = $5,922/month Note: I3.4xlarge.search supports up to 3.8 TiB, so we require four instances to manage 12 TiB storage. Instance cost is $1.99/hr.	4 instances * $0.335/hr = $996/month Note: R6g.xlarge.search supports up to 3 TiB with gp3, so we require four instances to manage 12 TiB. Instance cost is $0.335/hr.
Storage cost (12 TiB)	N/A (included in instance price)	13,194 GB * $0.122/GB-month = $1,610/month Notes: (a) 12 TiB = 13,194 GB (b) Storage cost is $0.122 per GB / month
Total monthly bills	$5,922/month	$2,606/month

UltraWarm vs. gp3

UltraWarm is designed to provide inexpensive access to infrequently accessed data, such as logs older than 30 days. Warm storage is useful for indexes that aren’t actively being written to, are queried less frequently, and don’t require high performance. If you have large, query-intensive workloads and are using UltraWarm to optimize costs but are encountering higher query volumes than it can handle, you should consider moving some of the data to hot nodes with gp3 storage. UltraWarm remains the least expensive option for warm (less frequently accessed) data, but you shouldn’t use it for hot data use cases. A combination of low-cost gp3 storage and denser instances can help you achieve cost-optimized, higher performance for hot data.

The following table shows the monthly costs associated with running a 30 TiB UltraWarm workload, along with a comparison to the potential monthly costs of gp2 and gp3. With gp3, you can save up to 36% compared to gp2. Please note that UltraWarm setup does require hot data nodes; however, we excluded them in the UltraWarm column to focus on UltraWarm replacement costs with hot data nodes using gp2 and gp3.

.	UltraWarm	All Hot (gp2 with R6g.8xlarge)	All Hot (gp3 with R6g.8xlarge)
Instance cost (On-demand)	2 UW large instances * $2.68/hr = $3,987/month Note: ultrawarm1.large.search supports max 20 TiB, so we need two instances.	4 instances * $2.677/hr = $7,966/month Note: r6g.8xlarge.search supports max 8 TiB with gp2, so we require four instances.	2 Instances * $2.677/hr= $3,984/month Note: r6g.8xlarge.search supports max 16 TiB with gp3, so we only require two instances.
Storage cost (30 TiB)	32,985 GB * $0.024/GB-month = $792/month Notes: (1) Storage price is $0.024 per GB/month. (2) 30 TiB = 32,985 GB	32,985 GB * $0.135/GB-month = $4,453/month Notes: (1) Storage price is $0.135 per GB/month. (2) 30 TiB = 32,985 GB	32,985 GB * $0.122/GB-month = $4,024/month Notes: (1) Storage price is $0.122 per GB/month. (2) 30 TiB = 32,985 GB
Total Monthly Bills	$4,779/month	$12,419/month	$8,008/month

All the preceding use cases are from a cost perspective. Before making any changes to the production environment, we recommend validating performance in a test environment for your unique workload and ensuring that configuration changes don’t result in performance degradation.

Optimize instance cost with gp3’s denser storage

OpenSearch Service increased the maximum volume size supported per instance for gp3 by 100% when compared to gp2 for the R5, R6g, M5, and M6g instance families, due to gp3’s improved baseline performance. You can optimize your instance needs by taking advantage of the increased storage per instance. For example, R6g.large supports up to 2 TiB with gp3, but only 1 TiB with gp2. If you require support for 6 TiB of data storage, you can reconfigure your domains from six R6g.large data nodes to three in order to reduce your instance costs. For OpenSearch Service EBS instance-specific volume limits, refer to EBS volume size quotas.
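
As a quick sanity check on node counts, the following Python sketch computes how many data nodes a given amount of storage requires at a given per-node volume limit. It uses the R6g.large limits quoted above (1 TiB with gp2, 2 TiB with gp3). It considers storage capacity only and ignores replicas, overhead, and compute requirements, so treat the result as a lower bound. The function name is illustrative.

```python
import math


def data_nodes_needed(storage_tib, max_tib_per_node):
    """Minimum number of data nodes needed to hold storage_tib of data."""
    return math.ceil(storage_tib / max_tib_per_node)


for storage_tib in (6, 12):
    gp2_nodes = data_nodes_needed(storage_tib, 1)  # R6g.large with gp2: 1 TiB per node
    gp3_nodes = data_nodes_needed(storage_tib, 2)  # R6g.large with gp3: 2 TiB per node
    print(f"{storage_tib} TiB: {gp2_nodes} nodes with gp2, {gp3_nodes} nodes with gp3")
```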

Upgrade from gp2 to gp3

To use the EBS gp3 volume type, you must first upgrade your domain’s instances to supported instance types if they don’t already support gp3. For a list of OpenSearch Service supported instances, see EBS volume size quotas. The transition from gp2 to gp3 is seamless. You can upgrade domain configurations from existing EBS volume types such as gp2, Magnetic, and PIOPS (io1) to gp3 through the OpenSearch Service console or the UpdateDomainConfig API. The configuration change initiates a blue/green deployment, which runs in the background without interrupting your online traffic, prevents data loss, and, depending on the data size, completes in a few hours.
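
A minimal Python sketch of the API route follows. It assumes that boto3 is installed and credentials are configured. The domain name and sizes are placeholders, and the helper names are illustrative. The request builder is kept separate from the call so that you can inspect the request before you send it.

```python
def build_gp3_update(domain_name, volume_size_gib, iops=None, throughput=None):
    """Build an UpdateDomainConfig request that switches the domain to gp3.

    iops and throughput are optional; leave them unset to use the
    baseline performance that is included in the storage price.
    """
    ebs_options = {
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": volume_size_gib,  # GiB per data node
    }
    if iops is not None:
        ebs_options["Iops"] = iops
    if throughput is not None:
        ebs_options["Throughput"] = throughput  # MiB/s
    return {"DomainName": domain_name, "EBSOptions": ebs_options}


def apply_update(request):
    """Send the request; this starts a blue/green deployment on the domain."""
    import boto3  # assumption: boto3 is available in your environment

    client = boto3.client("opensearch")
    return client.update_domain_config(**request)


request = build_gp3_update("my-domain", 2048)  # "my-domain" is a placeholder
print(request)
```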

gp3 baseline performance, and additional provisioning limits

One of gp3’s key features is the ability to scale IOPS and throughput independent of volume size. When your application requires more performance, you can scale up to 16,000 IOPS and 1,000 MiB/s throughput for an additional fee. OpenSearch Service EBS gp3 delivers a baseline performance of 3,000 IOPS and 125 MiB/s throughput at any volume size. In addition, OpenSearch Service provisions additional IOPS and throughput for larger volumes to ensure optimal performance. For volumes above 1,024 GiB, you receive 3 IOPS/GiB, and for volumes above 170 GiB, you get an incremental 250 MiB/s for every 3 TiB of storage.

The following table outlines OpenSearch Service baseline IOPS and throughput, as well as the maximum amount you can provision. Note that your instance type may have additional limitations regarding how much and for how long it can support these performance baselines in a 24-hour period. For more information about instances and their limits, refer to Amazon EBS-optimized instances.

Additional performance customers can provision

..	Baseline (included in storage price)		Additional performance customers can provision
Volume Storage (in GiB)	IOPS	throughput (MiB/s)	IOPS	throughput (MiB/s)
170	3,000	125	13,000	875
172	3,000	250	13,000	750
1,024	3,000	250	13,000	750
1,025	3,075	250	12,925	750
3,000	9,000	250	7,000	750
3,001	9,003	500	6,997	500
6,000	18,000	500	NA	500
6,001	18,003	750	NA	250
9,001	27,003	1,000	NA	NA
24,000	72,000	2,000	NA	NA
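
The rows in this table follow a simple rule that can be expressed in a few lines of Python. This sketch is inferred from the table rather than taken from an official formula. In particular, the 3,000 GiB throughput step is an assumption read off the rows (the text describes it as “every 3 TiB”), and the function names are illustrative.

```python
import math

MAX_IOPS = 16_000        # maximum provisionable IOPS per volume
MAX_THROUGHPUT = 1_000   # maximum provisionable throughput per volume, in MiB/s


def gp3_baseline(volume_gib):
    """Return (IOPS, throughput in MiB/s) included in the storage price."""
    iops = 3_000 if volume_gib <= 1_024 else 3 * volume_gib
    if volume_gib <= 170:
        throughput = 125
    else:
        # Assumption: 250 MiB/s per started 3,000 GiB step, as the table rows suggest.
        throughput = 250 * math.ceil(volume_gib / 3_000)
    return iops, throughput


def gp3_headroom(volume_gib):
    """Return the extra (IOPS, throughput) you can still provision; 0 means NA."""
    iops, throughput = gp3_baseline(volume_gib)
    return max(0, MAX_IOPS - iops), max(0, MAX_THROUGHPUT - throughput)


for size in (170, 1_025, 3_001, 9_001):
    print(size, gp3_baseline(size), gp3_headroom(size))
```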

Do you need additional performance?

In the majority of use cases, you don’t need to provision additional IOPS and throughput, and gp3 baseline performance should suffice. You can use Amazon CloudWatch metrics to find the usage patterns, and if you observe current limits of IOPS and throughput bottlenecking your index and query performance, you should provision additional performance. For more information, refer to EBS volume metrics.

Conclusion

This post explains how OpenSearch Service general purpose SSD gp3 volumes can significantly reduce monthly storage and instance costs, making them more cost-effective than gp2 volumes. Migrating to gp3 volumes with the same size and performance configuration as gp2 is the quickest and simplest way to reduce costs. You should also consider reducing instance costs by taking advantage of gp3’s support for denser storage per data node.

For more details, check out Amazon OpenSearch Service pricing and Configuration API reference for Amazon OpenSearch Service.

About the author

Siddhant Gupta is a Sr. Technical Product Manager at Amazon Web Services based in Hyderabad, India. Siddhant has been with Amazon for over five years and is currently working with the OpenSearch Service team, helping with new Region launches, pricing strategy, and bringing EC2 and EBS innovations to OpenSearch Service customers. He is passionate about analytics and machine learning. In his free time, he loves traveling, fitness activities, spending time with his family, and reading non-fiction books.

How to detect security issues in Amazon EKS clusters using Amazon GuardDuty – Part 1

2022-11-22 Marshall Jones

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/how-to-detect-security-issues-in-amazon-eks-clusters-using-amazon-guardduty-part-1/

In this two-part blog post, we’ll discuss how to detect and investigate security issues in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with Amazon GuardDuty and Amazon Detective.

Amazon Elastic Kubernetes Service (Amazon EKS) is a managed service that you can use to run and scale container workloads by using Kubernetes in the AWS Cloud, which can help increase the speed of deployment and portability of modern applications. Amazon EKS provides secure, managed Kubernetes clusters on the AWS control plane by default. Kubernetes configurations such as pod security policies, runtime security, and network policies and configurations are specific to your organization’s use case, and securing them adequately is the customer’s responsibility under the AWS shared responsibility model.

Amazon GuardDuty can help you continuously monitor and detect suspicious activity related to AWS resources in your account. GuardDuty for EKS protection is a feature that you can enable within your accounts. When this feature is enabled, GuardDuty can help detect potentially unauthorized EKS activity resulting from misconfiguration of the control plane nodes or application.

In this post, we’ll walk through the events leading up to a real-world security issue that occurred due to EKS cluster misconfiguration, discuss how those misconfigurations could be used by a malicious actor, and how Amazon GuardDuty monitors and identifies suspicious activity throughout the EKS security event. In part 2 of the post, we’ll cover Amazon Detective investigation capabilities, possible remediation techniques, and preventative controls for EKS cluster related security issues.

Prerequisites

You must have Amazon GuardDuty enabled in your AWS account in order to monitor and generate findings associated with an EKS cluster related security issue in your environment.

This walkthrough uses Amazon GuardDuty, along with these features of GuardDuty:
- Kubernetes Protection
- Malware Protection

EKS security issue walkthrough

Before jumping into the security issue, it is important to understand how the AWS shared responsibility model applies to the Amazon EKS managed service. AWS is responsible for the EKS managed Kubernetes control plane and the infrastructure to deliver EKS in a secure and reliable manner. You have the ability to configure EKS and how it interacts with other applications and services, where you are responsible for making sure that secure configurations are being used.

The following scenario is based on a real-world observed event, where a malicious actor used Kubernetes compromise tactics and techniques to expose and access an EKS cluster. We use this example to show how you can use AWS security services to identify and investigate each step of this security event. For a security event in your own environment, the order of operations and the investigative and remediation techniques used might be different. The scenario is broken down into the following phases and associated MITRE ATT&CK tactics:

Phase 1 – EKS cluster misconfiguration
Phase 2 (Discovery) – Discovery of vulnerable EKS clusters
Phase 3 (Initial Access) – Credential access to obtain Kubernetes secrets
Phase 4 (Persistence) – Impact to persist unauthorized access to the cluster
Phase 5 (Impact) – Impact to manipulate resources for unauthorized activity

Phase 1 – EKS cluster misconfiguration

By default, when you provision an EKS cluster, the API cluster endpoint is set to public, meaning that it can be accessed from the internet. Despite being accessible from the internet, the endpoint is still considered secure because it requires all API requests to be authenticated by AWS Identity and Access Management (IAM) and then authorized by Kubernetes role-based access control (RBAC). Also, the entity (user or role) that creates the EKS cluster is automatically granted system:masters permissions, which allows the entity to modify the EKS cluster’s RBAC configuration.

This example scenario starts with a developer who has access to administer EKS clusters in an AWS account. The developer wants to work from their home network and doesn’t want to connect to their enterprise VPN for IAM role federation. They configure an EKS cluster API without setting up the proper authentication and authorization components. Instead, the developer grants explicit access to the system:anonymous user in the cluster’s RBAC configuration. (Alternatively, an unauthorized RBAC configuration could be introduced into your environment after a developer unknowingly installs a malicious helm chart from the internet without reviewing or inspecting it first.)

In Kubernetes, HTTP requests that are neither authenticated nor rejected are treated as anonymous requests and are identified as the system:anonymous user, which belongs to the system:unauthenticated group. This means that any entity on the internet can access the cluster and make API requests that are permitted by the role. There aren’t many legitimate use cases for this type of activity, because it’s considered a best practice to use RBAC instead. Anonymous requests are primarily used for setting up health endpoints and custom authentication.

By monitoring EKS audit logs, GuardDuty identifies this activity and generates the finding Policy:Kubernetes/AnonymousAccessGranted, as shown in Figure 1. This finding informs you that a user on your Kubernetes cluster successfully created a ClusterRoleBinding or RoleBinding to bind the user system:anonymous to a role. This action enables unauthenticated access to the API operations permitted by the role.
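
To check for the same misconfiguration yourself, you can scan the cluster’s RBAC bindings for anonymous subjects. The following Python sketch assumes that the bindings are already loaded as dictionaries, for example from the items list that kubectl get clusterrolebindings -o json returns. It is an illustrative check, not how GuardDuty produces its finding, and the function name is hypothetical.

```python
# Subject names that represent unauthenticated callers.
ANONYMOUS_SUBJECTS = {"system:anonymous", "system:unauthenticated"}


def anonymous_bindings(bindings):
    """Return (kind, binding name, role name) for bindings that grant anonymous access."""
    flagged = []
    for binding in bindings:
        subjects = binding.get("subjects") or []  # subjects can be missing or null
        if any(subject.get("name") in ANONYMOUS_SUBJECTS for subject in subjects):
            flagged.append(
                (binding.get("kind"), binding["metadata"]["name"], binding["roleRef"]["name"])
            )
    return flagged


# Example input with hypothetical binding names.
sample = [
    {
        "kind": "ClusterRoleBinding",
        "metadata": {"name": "anonymous-admin"},
        "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
        "subjects": [{"kind": "User", "name": "system:anonymous"}],
    },
    {
        "kind": "ClusterRoleBinding",
        "metadata": {"name": "dev-view"},
        "roleRef": {"kind": "ClusterRole", "name": "view"},
        "subjects": [{"kind": "Group", "name": "developers"}],
    },
]
print(anonymous_bindings(sample))
```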

Figure 1: Example GuardDuty finding for Kubernetes anonymous access granted

Phase 2 (Discovery) – Discovery of vulnerable EKS clusters

Port scanning is a method that malicious actors use to determine if resources are publicly exposed, with open ports and known vulnerabilities. As an increasing number of open-source tools allows users to search for endpoints connected to the internet, finding these endpoints has become even easier. Security teams can use these open-source tools to their advantage by proactively scanning for and identifying externally exposed resources in their organization.

This brings us to the discovery phase of our misconfigured EKS cluster. The discovery phase is defined by MITRE as follows: “Discovery consists of techniques an adversary may use to gain knowledge about the system and internal network. These techniques help adversaries observe the environment and orient themselves before deciding how to act.”

By granting system:anonymous access to the EKS cluster in our example, the developer allowed requests from any public unauthenticated source. This can result in external web crawlers probing the cluster API, which can often happen within seconds of the system:anonymous access being granted. GuardDuty identifies this activity and generates the finding Discovery:Kubernetes/SuccessfulAnonymousAccess, as shown in Figure 2. This finding informs you that an API operation to discover resources in a cluster was successfully invoked by the system:anonymous user. Remember that every API call made by system:anonymous is unauthenticated, so any entity can act as this user within the EKS cluster. (Calls to /healthz and /version are always unauthenticated, regardless of the user identity.)

In the screenshot, under the Action section in the finding details, you can see that the anonymous user made a get request to “/”. This is a generic request that is not specific to a Kubernetes cluster, which may indicate that the crawler is not specifically targeting Kubernetes clusters. You can further see that the Status code is 200, indicating that the request was successful. If this activity is malicious, then the actor is now aware that there is an exposed resource.

Figure 2: Example GuardDuty finding for Kubernetes successful anonymous access

Phase 3 (Initial Access) – Credential access to obtain Kubernetes secrets

Next, in this phase, you might start observing more targeted API calls for establishing initial access from unauthorized users. MITRE defines initial access as “techniques that use various entry vectors to gain their initial foothold within a network. Techniques used to gain a foothold include targeted spearphishing and exploiting weaknesses on public-facing web servers. Footholds gained through initial access may allow for continued access, like valid accounts and use of external remote services, or may be limited-use due to changing passwords.”

In our example, the malicious actor has established initial access to the EKS cluster, which is evident in the next GuardDuty finding, CredentialAccess:Kubernetes/SuccessfulAnonymousAccess, as shown in Figure 3. This finding informs you that an API call to access credentials or secrets was successfully invoked by the system:anonymous user. The observed API call is commonly associated with the credential access tactic, where an adversary is attempting to collect passwords, usernames, and access keys for a Kubernetes cluster.

You can see that in this GuardDuty finding, in the Action section, the Request uri is targeted at a Kubernetes cluster, specifically /api/v1/namespaces/kube-system/secrets. This request seems to be targeting the secrets management capabilities that are built into Kubernetes. You can find more information about this secrets management capability in the Kubernetes documentation.

Figure 3: Example GuardDuty finding for Kubernetes successful credential access from anonymous user

Phase 4 (Persistence) – Impact to persist unauthorized access to the cluster

The next phase of this scenario is likely to be an impact in the EKS cluster to enable persistence by the malicious actor. MITRE defines impact as “techniques that adversaries use to disrupt availability or compromise integrity by manipulating business and operational processes.” Following the MITRE definitions, “Persistence consists of techniques that adversaries use to keep access to systems across restarts, changed credentials, and other interruptions that could cut off their access. Techniques used for persistence include any access, action, or configuration changes that let them maintain their foothold on systems, such as replacing or hijacking legitimate code or adding startup code.”

In the GuardDuty finding Impact:Kubernetes/SuccessfulAnonymousAccess, shown in Figure 4, you can see the Kubernetes user details and Action sections that indicate that a successful Kubernetes API call was made to create a ClusterRoleBinding by the system:anonymous username. This finding informs you that a write API operation to tamper with resources was successfully invoked by the system:anonymous user. The observed API call is commonly associated with the impact stage of an attack, when an adversary is tampering with resources in your cluster. This activity shows that the system:anonymous user has now created their own role binding to enable persistent access to the EKS cluster. If the user is malicious, they can now access the cluster even if access is removed in the RBAC configuration for the system:anonymous user.

Figure 4: Example GuardDuty finding for Kubernetes successful credential change by anonymous user

Phase 5 (Impact) – Impact to manipulate resources for unauthorized activity

The fifth phase of this scenario is where the unauthorized user is likely to focus on impact techniques in order to use the access for malicious purposes. MITRE says of the impact phase: “Techniques used for impact can include destroying or tampering with data. In some cases, business processes can look fine, but may have been altered to benefit the adversaries’ goals. These techniques might be used by adversaries to follow through on their end goal or to provide cover for a confidentiality breach.” Typically, once a malicious actor has access into a system, they will introduce malware to the system to manipulate the compromised resource and possibly also other resources.

With the introduction of GuardDuty Malware Protection, when an Amazon Elastic Compute Cloud (Amazon EC2) or container-related GuardDuty finding that indicates potentially suspicious activity is generated, GuardDuty initiates an agentless scan of the volumes to detect the presence of malware. Existing GuardDuty customers need to enable Malware Protection, and for new customers this feature is on by default when they enable GuardDuty for the first time. Malware Protection comes with a 30-day free trial for both existing and new GuardDuty customers. You can see a list of findings that initiate a malware scan in the GuardDuty User Guide.

In this example, the malicious actor now uses access to the cluster to perform unauthorized cryptocurrency mining. GuardDuty monitors the DNS requests from the EC2 instances used to host the EKS cluster. This allows GuardDuty to identify a DNS request made to a domain name associated with a cryptocurrency mining pool, and generate the finding CryptoCurrency:EC2/BitcoinTool.B!DNS, as shown in Figure 5.

Figure 5: Example GuardDuty finding for EC2 instance querying bitcoin domain name

Because this is an EC2 related GuardDuty finding and GuardDuty Malware Protection is enabled in the account, GuardDuty then conducts an agentless scan on the volumes of the EC2 instance to detect malware. If the scan results in a successful detection of one or more malicious files, another GuardDuty finding for Execution:EC2/MaliciousFile is generated, as shown in Figure 6.

Figure 6: Example GuardDuty finding for detection of a malicious file on EC2

The first GuardDuty finding detects crypto mining activity, while the subsequent Malware Protection finding provides context on the malware associated with this activity. This context is very valuable for the incident response process.
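The finding types seen across the five phases follow GuardDuty's naming format, ThreatPurpose:ResourceTypeAffected/ThreatFamilyName.DetectionMechanism!Artifact, where the last two parts are optional. As a minimal sketch (the `FindingType` tuple and `parse_finding_type` helper are hypothetical names introduced here, not part of any AWS SDK), the following Python splits a finding type into those parts so that findings can be grouped by threat purpose during triage.

```python
from typing import NamedTuple, Optional


class FindingType(NamedTuple):
    threat_purpose: str
    resource_type: str
    threat_family: str
    detection_mechanism: Optional[str]
    artifact: Optional[str]


def parse_finding_type(finding_type: str) -> FindingType:
    """Split a GuardDuty finding type string into its named parts."""
    purpose, rest = finding_type.split(":", 1)
    resource, rest = rest.split("/", 1)
    # The artifact (after "!") and detection mechanism (after ".") are optional.
    rest, _, artifact = rest.partition("!")
    family, _, mechanism = rest.partition(".")
    return FindingType(purpose, resource, family, mechanism or None, artifact or None)


# Finding types from the walkthrough in this post.
findings = [
    "Discovery:Kubernetes/SuccessfulAnonymousAccess",
    "CredentialAccess:Kubernetes/SuccessfulAnonymousAccess",
    "Impact:Kubernetes/SuccessfulAnonymousAccess",
    "CryptoCurrency:EC2/BitcoinTool.B!DNS",
    "Execution:EC2/MaliciousFile",
]
for finding in findings:
    print(parse_finding_type(finding))
```

Grouping on `threat_purpose` gives a rough view of how far a security event has progressed through the phases described above.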

Conclusion

In this post, we walked you through five phases that show how an initial misconfiguration could result in a malicious actor gaining control of EKS resources within an AWS account, and how GuardDuty is able to continually monitor and detect the progression of the security event. As previously stated, this is just one example where a misconfiguration in an EKS cluster could result in a security event.

Now that you have a good understanding of GuardDuty capabilities to continuously monitor and detect EKS security events, you will need to establish processes and procedures to enable your security team to investigate these events. You can enable Amazon Detective to help accelerate your security team’s mean time to respond (MTTR) by providing an efficient mechanism to analyze, investigate, and identify the root cause of security events. Follow along in part 2 of this series, How to investigate and take action on an Amazon EKS cluster related security issue with Amazon Detective, where we’ll cover techniques you can use with Amazon Detective to identify impacted EKS resources in your AWS account, possible remediation actions to take on the cluster, and preventative controls you can implement.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a thread on Amazon GuardDuty re:Post.

Build your Apache Hudi data lake on AWS using Amazon EMR – Part 1

2022-11-22 Suthan Phillips

Post Syndicated from Suthan Phillips original https://aws.amazon.com/blogs/big-data/part-1-build-your-apache-hudi-data-lake-on-aws-using-amazon-emr/

Apache Hudi is an open-source transactional data lake framework that greatly simplifies incremental data processing and data pipeline development. It does this by bringing core warehouse and database functionality directly to a data lake on Amazon Simple Storage Service (Amazon S3) or Apache HDFS. Hudi provides table management, instantaneous views, efficient upserts/deletes, advanced indexes, streaming ingestion services, data and file layout optimizations (through clustering and compaction), and concurrency control, all while keeping your data in open-source file formats such as Apache Parquet and Apache Avro. Furthermore, Apache Hudi is integrated with open-source big data analytics frameworks, such as Apache Spark, Apache Hive, Apache Flink, Presto, and Trino.

In this post, we cover best practices when building Hudi data lakes on AWS using Amazon EMR. This post assumes that you have an understanding of Hudi data layout, file layout, and table and query types. The configuration and features can change with new Hudi versions; the concepts in this post apply to Hudi versions 0.11.0 (Amazon EMR release 6.7), 0.11.1 (Amazon EMR release 6.8), and 0.12.1 (Amazon EMR release 6.9).

Specify the table type: Copy on Write vs. Merge on Read

When we write data into Hudi, we have the option to specify the table type: Copy on Write (CoW) or Merge on Read (MoR). This decision has to be made at the initial setup, and the table type can’t be changed after the table has been created. These two table types offer different trade-offs between ingest and query performance, and the data files are stored differently based on the chosen table type. If you don’t specify it, the default storage type CoW is used.
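As a minimal sketch of how the table type could be set explicitly, the following Python builds a dictionary of Hudi write options. The `hudi_table_options` helper and the table name are hypothetical; the option keys `hoodie.table.name` and `hoodie.datasource.write.table.type` are Hudi write configurations. In a PySpark job, a dictionary like this would typically be passed to the writer along with the other required options (record key, partition path, and so on).

```python
def hudi_table_options(table_name: str, merge_on_read: bool = False) -> dict:
    """Return write options that fix the table type when the table is created."""
    return {
        "hoodie.table.name": table_name,
        # COPY_ON_WRITE is the default; it can't be changed after table creation.
        "hoodie.datasource.write.table.type": (
            "MERGE_ON_READ" if merge_on_read else "COPY_ON_WRITE"
        ),
    }


cow_options = hudi_table_options("orders_cow")
mor_options = hudi_table_options("orders_mor", merge_on_read=True)
print(cow_options)
print(mor_options)

# In PySpark (not executed here), the options might be used like this:
# df.write.format("hudi").options(**mor_options).mode("append").save(path)
```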

The following table summarizes the feature comparison of the two storage types.

CoW	MoR
Data is stored in base files (columnar Parquet format).	Data is stored as a combination of base files (columnar Parquet format) and log files with incremental changes (row-based Avro format).
COMMIT: Each new write creates a new version of the base files, which contain merged records from older base files and newer incoming records. Each write atomically adds a commit action to the timeline, guaranteeing that a write (and all its changes) either entirely succeeds or is entirely rolled back.	DELTA_COMMIT: Each new write creates incremental log files for updates, which are associated with the base Parquet files. For inserts, it creates a new version of the base file similar to CoW. Each write adds a delta commit action to the timeline.
Write
In case of updates, write latency is higher than MoR due to the merge cost, because it needs to rewrite each affected Parquet file in full with the merged updates. Additionally, writing in the columnar Parquet format (for CoW updates) has higher latency than writing in the row-based Avro format (for MoR updates).	No merge cost for updates during write time, and the write operation is faster because it just appends the data changes to the new log file corresponding to the base file each time.
Compaction isn’t needed because all data is directly written to Parquet files.	Compaction is required to merge the base and log files to create a new version of the base file.
Higher write amplification because new versions of base files are created for every write. Write cost will be O(number of files in storage modified by the write).	Lower write amplification because updates go to log files. Write cost will be O(1) for update-only datasets and can get higher when there are new inserts.
Read
CoW table supports snapshot query and incremental queries.	MoR offers two ways to query the same underlying storage: ReadOptimized tables and Near-Realtime tables (snapshot queries). ReadOptimized tables support read-optimized queries, and Near-Realtime tables support snapshot queries and incremental queries.
Read-optimized queries aren’t applicable for CoW because data is already merged to base files while writing.	Read-optimized queries show the latest compacted data, which doesn’t include the freshest updates in the not yet compacted log files.
Snapshot queries have no merge cost during read.	Snapshot queries merge data while reading if not compacted and therefore can be slower than CoW while querying the latest data.

CoW is the default storage type and is preferred for simple read-heavy use cases. Use cases with the following characteristics are recommended for CoW:

Tables with a lower ingestion rate and use cases without real-time ingestion
Use cases requiring the freshest data with minimal read latency because merging cost is taken care of at the write phase
Append-only workloads where existing data is immutable

MoR is recommended for tables with write-heavy and update-heavy use cases. Use cases with the following characteristics are recommended for MoR:

Faster ingestion requirements and real-time ingestion use cases
Varying or bursty write patterns (for example, ingesting bulk random deletes in an upstream database) due to the zero-merge cost for updates during write time
Streaming use cases
Mix of downstream consumers, where some are looking for fresher data by paying some additional read cost, and others need faster reads with some trade-off in data freshness

For streaming use cases demanding strict ingestion performance with MoR tables, we suggest running the table services (for example, compaction and cleaning) asynchronously, which is discussed in the upcoming Part 3 of this series.

For more details on table types and use cases, refer to How do I choose a storage type for my workload?

Select the record key, key generator, preCombine field, and record payload

This section discusses the basic configurations for the record key, key generator, preCombine field, and record payload.

Record key

Every record in Hudi is uniquely identified by a Hoodie key (similar to primary keys in databases), which is usually a pair of record key and partition path. With Hoodie keys, you can enable efficient updates and deletes on records, as well as avoid duplicate records. Hudi partitions have multiple file groups, and each file group is identified by a file ID. Hudi maps Hoodie keys to file IDs, using an indexing mechanism.

A record key that you select from your data can be unique within a partition or across partitions. If the selected record key is unique within a partition, it can be uniquely identified in the Hudi dataset using the combination of the record key and partition path. You can also combine multiple fields from your dataset into a compound record key. Record keys cannot be null.
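To make the relationship between record key, partition path, and Hoodie key concrete, here is a small Python sketch. The `HoodieKey` tuple and `make_hoodie_key` helper are hypothetical illustrations, and joining compound key values with ":" is an assumption chosen for readability, not Hudi's exact encoding.

```python
from typing import NamedTuple


class HoodieKey(NamedTuple):
    record_key: str
    partition_path: str


def make_hoodie_key(record: dict, key_fields: list, partition_field: str) -> HoodieKey:
    """Build a (record key, partition path) pair from one or more key fields."""
    values = [record.get(field) for field in key_fields]
    if any(value is None for value in values):
        raise ValueError("record key fields cannot be null")
    # Joining with ":" is an illustrative choice, not Hudi's exact encoding.
    record_key = ":".join(str(value) for value in values)
    return HoodieKey(record_key, str(record[partition_field]))


row_1 = {"order_id": 7, "region": "us", "dt": "2022-07-01"}
row_2 = {"order_id": 7, "region": "us", "dt": "2022-07-02"}

# Same record key, different partitions: two distinct Hoodie keys.
key_1 = make_hoodie_key(row_1, ["order_id"], "dt")
key_2 = make_hoodie_key(row_2, ["order_id"], "dt")
print(key_1, key_2, key_1 == key_2)

# A compound record key built from two fields.
print(make_hoodie_key(row_1, ["order_id", "region"], "dt"))
```

When the record key is unique only within a partition, as in this example, it is the pair that identifies the record across the dataset.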

Key generator

Key generators are different implementations to generate record keys and partition paths based on the values specified for these fields in the Hudi configuration. The right key generator has to be configured depending on the type of key (simple or composite key) and the column data type used in the record key and partition path columns (for example, TimestampBasedKeyGenerator is used for timestamp data type partition path). Hudi provides several key generators out of the box, which you can specify in your job using the following configuration.

Configuration Parameter	Description	Value
`hoodie.datasource.write.keygenerator.class`	Key generator class, which generates the record key and partition path	Default value is SimpleKeyGenerator

The following table describes the different types of key generators in Hudi.

Key Generators	Use-case
`SimpleKeyGenerator`	Use this key generator if your record key refers to a single column by name and similarly your partition path also refers to a single column by name.
`ComplexKeyGenerator`	Use this key generator when record key and partition paths comprise multiple columns. Columns are expected to be comma-separated in the config value (for example, `"hoodie.datasource.write.recordkey.field": "col1,col4"`).
`GlobalDeleteKeyGenerator`	Use this key generator when you can’t determine the partition of incoming records to be deleted and need to delete only based on record key. This key generator ignores the partition path while generating keys to uniquely identify Hudi records. When using this key generator, set the config `hoodie.[bloom|simple|hbase].index.update.partition.path` to false in order to avoid redundant data written to the storage.
`NonPartitionedKeyGenerator`	Use this key generator for non-partitioned datasets because it returns an empty partition for all records.
`TimestampBasedKeyGenerator`	Use this key generator for a timestamp data type partition path. With this key generator, the partition path column values are interpreted as timestamps. The record key is the same as before, which is a single column converted to string. If using TimestampBasedKeyGenerator, a few more configs need to be set.
`CustomKeyGenerator`	Use this key generator to take advantage of the benefits of SimpleKeyGenerator, ComplexKeyGenerator, and TimestampBasedKeyGenerator all at the same time. With this you can configure record key and partition paths as a single field or a combination of fields. This is helpful if you want to generate nested partitions with each partition key of different types (for example, `field_3:simple,field_5:timestamp`). For more information, refer to CustomKeyGenerator.

The key generator class can be automatically inferred by Hudi if the specified record key and partition path require a SimpleKeyGenerator or ComplexKeyGenerator, depending on whether there are single or multiple record key or partition path columns. For all other cases, you need to specify the key generator.
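The inference rule described above can be sketched as follows. This is an illustration of the rule as stated in this post, not Hudi's actual implementation; the `infer_key_generator` function is a hypothetical name, while the two class names are the fully qualified Hudi key generator classes.

```python
from typing import Optional

SIMPLE = "org.apache.hudi.keygen.SimpleKeyGenerator"
COMPLEX = "org.apache.hudi.keygen.ComplexKeyGenerator"


def infer_key_generator(record_key_fields: str, partition_path_fields: str) -> Optional[str]:
    """Return the key generator implied by the field counts, or None if it
    has to be set explicitly (for example, for non-partitioned tables)."""
    record_keys = [c.strip() for c in record_key_fields.split(",") if c.strip()]
    partitions = [c.strip() for c in partition_path_fields.split(",") if c.strip()]
    if not record_keys or not partitions:
        return None
    if len(record_keys) == 1 and len(partitions) == 1:
        return SIMPLE
    return COMPLEX


print(infer_key_generator("order_id", "dt"))
print(infer_key_generator("col1,col4", "dt"))
print(infer_key_generator("order_id", ""))
```

Note that a count-based rule like this can't see column data types, so a timestamp partition path still needs `TimestampBasedKeyGenerator` (or `CustomKeyGenerator`) to be specified in `hoodie.datasource.write.keygenerator.class`.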

The following flow chart explains how to select the right key generator for your use case.

PreCombine field

This is a mandatory field that Hudi uses to deduplicate the records within the same batch before writing them. When two records have the same record key, they go through the preCombine process, and the record with the largest value for the preCombine key is picked by default. This behavior can be customized through custom implementation of the Hudi payload class, which we describe in the next section.

The following table summarizes the configurations related to preCombine.

Configuration Parameter	Description	Value
`hoodie.datasource.write.precombine.field`	The field used in preCombining before the actual write. It helps select the latest record whenever there are multiple updates to the same record in a single incoming data batch.	The default value is ts. You can configure it to any column in your dataset that you want Hudi to use to deduplicate the records whenever there are multiple records with the same record key in the same batch. Currently, you can only pick one field as the preCombine field. Select a column with the timestamp data type or any column that can determine which record holds the latest version, like a monotonically increasing number.
`hoodie.combine.before.upsert`	During upsert, this configuration controls whether deduplication should be done for the incoming batch before ingesting into Hudi. This is applicable only for upsert operations.	The default value is true. We recommend keeping it at the default to avoid duplicates.
`hoodie.combine.before.delete`	Same as the preceding config, but applicable only for delete operations.	The default value is true. We recommend keeping it at the default to avoid duplicates.
`hoodie.combine.before.insert`	When inserted records share the same key, the configuration controls whether they should be first combined (deduplicated) before writing to storage.	The default value is false. We recommend setting it to true if the incoming inserts or bulk inserts can have duplicates.
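The preCombine step can be pictured with a few lines of plain Python. This is a simplified model of the behavior described in this section, not Hudi code; the `precombine` function, the `id` and `ts` field names, and the sample records are all made up for illustration, and letting the later record win a tie is an assumption.

```python
def precombine(batch, record_key="id", precombine_field="ts"):
    """Keep one record per key: the one with the largest preCombine value."""
    latest = {}
    for record in batch:
        key = record[record_key]
        current = latest.get(key)
        # Assumption: on a tie, the record seen later in the batch wins.
        if current is None or record[precombine_field] >= current[precombine_field]:
            latest[key] = record
    return list(latest.values())


batch = [
    {"id": 1, "ts": 100, "status": "created"},
    {"id": 1, "ts": 105, "status": "shipped"},
    {"id": 2, "ts": 101, "status": "created"},
]
deduped = precombine(batch)
print(deduped)
```

Only the `ts=105` version of record 1 survives, which is why the preCombine field should be a column that reliably identifies the latest version of a record.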

Record payload

Record payload defines how to merge new incoming records against old stored records for upserts.

The default OverwriteWithLatestAvroPayload payload class always overwrites the stored record with the latest incoming record. This works fine for batch jobs and most use cases. But let’s say you have a streaming job and want to prevent the late-arriving data from overwriting the latest record in storage. You need to use a different payload class implementation (DefaultHoodieRecordPayload) to determine the latest record in storage based on an ordering field, which you provide.

In the following example, Commit 1 has HoodieKey 1, Val 1, preCombine 10, and in-flight Commit 2 has HoodieKey 1, Val 2, preCombine 5.

If using the default OverwriteWithLatestAvroPayload, the Val 2 version of the record will be the final version of the record in storage (Amazon S3) because it’s the latest version of the record.

If using DefaultHoodieRecordPayload, Hudi will honor Val 1 when merging multiple versions of the record, because the Val 2 version has a lower preCombine value (preCombine 5) than the Val 1 version (preCombine 10).
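The difference between the two payload classes in this example can be modeled in plain Python. The two functions below are simplified stand-ins named after the payload classes, not the Hudi implementations, and the `preCombine` dictionary key plays the role of the ordering field. Letting the incoming record win when ordering values are equal is an assumption.

```python
def overwrite_with_latest(stored, incoming):
    """Model of OverwriteWithLatestAvroPayload: the incoming record always wins."""
    return incoming


def default_hoodie_record_payload(stored, incoming, ordering_field="preCombine"):
    """Model of DefaultHoodieRecordPayload: compare the ordering field first."""
    # Assumption: the incoming record wins when ordering values are equal.
    if stored is None or incoming[ordering_field] >= stored[ordering_field]:
        return incoming
    return stored


stored = {"key": 1, "value": "Val 1", "preCombine": 10}    # Commit 1
incoming = {"key": 1, "value": "Val 2", "preCombine": 5}   # late-arriving Commit 2

print(overwrite_with_latest(stored, incoming)["value"])
print(default_hoodie_record_payload(stored, incoming)["value"])
```

The first call keeps Val 2 and the second keeps Val 1, matching the two outcomes described above.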

You can select a payload class while writing to the Hudi table using the configuration hoodie.datasource.write.payload.class.

Some useful in-built payload class implementations are described in the following table.

Payload Class	Description
OverwriteWithLatestAvroPayload (`org.apache.hudi.common.model.OverwriteWithLatestAvroPayload`)	Chooses the latest incoming record to overwrite any previous version of the records. Default payload class.
DefaultHoodieRecordPayload (`org.apache.hudi.common.model.DefaultHoodieRecordPayload`)	Uses `hoodie.payload.ordering.field` to determine the final record version while writing to storage.
EmptyHoodieRecordPayload (`org.apache.hudi.common.model.EmptyHoodieRecordPayload`)	Use this as payload class to delete all the records in the dataset.
AWSDmsAvroPayload (`org.apache.hudi.common.model.AWSDmsAvroPayload`)	Use this as payload class if AWS DMS is used as source. It provides support for seamlessly applying changes captured via AWS DMS. This payload implementation performs insert, delete, and update operations on the Hudi table based on the operation type for the CDC record obtained from AWS DMS.

Partitioning

Partitioning is the physical organization of files within a table. Partitions act as virtual columns and can impact the maximum parallelism we can use when writing.

Extremely fine-grained partitioning (for example, over 20,000 partitions) can create excessive overhead for the Spark engine managing all the small tasks, and can degrade query performance by reducing file sizes. Also, an overly coarse-grained partition strategy, without clustering and data skipping, can negatively impact both read and upsert performance with the need to scan more files in each partition.

Right partitioning helps improve read performance by reducing the amount of data scanned per query. It also improves upsert performance by limiting the number of files scanned to find the file group in which a specific record exists during ingest. A column frequently used in query filters would be a good candidate for partitioning.

For large-scale use cases with evolving query patterns, we suggest coarse-grained partitioning (such as date), while using fine-grained data layout optimization techniques (clustering) within each partition. This opens the possibility of data layout evolution.

By default, Hudi creates the partition folders with just the partition values. We recommend using Hive style partitioning, in which the name of the partition columns is prefixed to the partition values in the path (for example, year=2022/month=07 as opposed to 2022/07). This enables better integration with Hive metastores, such as using msck repair to fix partition paths.

To support Apache Hive style partitions in Hudi, we have to enable it in the config hoodie.datasource.write.hive_style_partitioning.

The following table summarizes the key configurations related to Hudi partitioning.

Configuration Parameter	Description	Value
`hoodie.datasource.write.partitionpath.field`	Partition path field. This is a required configuration that you need to pass while writing the Hudi dataset.	There is no default value set for this. Set it to the column that you have determined for partitioning the data. We recommend that it doesn’t cause extremely fine-grained partitions.
`hoodie.datasource.write.hive_style_partitioning`	Determines whether to use Hive style partitioning. If set to true, the names of partition folders follow `<partition_column_name>=<partition_value>` format.	Default value is false. Set it to true to use Hive style partitioning.
`hoodie.datasource.write.partitionpath.urlencode`	Indicates if we should URL encode the partition path value before creating the folder structure.	Default value is false. Set it to true if you want to URL encode the partition path value. For example, if you’re using the date format "`yyyy-MM-dd HH:mm:ss`", URL encoding needs to be set to true because the : character would otherwise result in an invalid path.
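To illustrate what the Hive style and URL encoding settings do to the folder layout, the following Python sketch builds partition paths both ways. The `partition_path` function is hypothetical, and Python's `urllib.parse.quote` is used as a stand-in for Hudi's own encoder, which may escape a different set of characters.

```python
from urllib.parse import quote


def partition_path(columns: dict, hive_style: bool = False, url_encode: bool = False) -> str:
    """Join partition values into a path, optionally as name=value segments."""
    parts = []
    for name, value in columns.items():
        value = str(value)
        if url_encode:
            # Stand-in for Hudi's encoder; escapes ":" and other unsafe characters.
            value = quote(value, safe="")
        parts.append(f"{name}={value}" if hive_style else value)
    return "/".join(parts)


print(partition_path({"year": "2022", "month": "07"}))
print(partition_path({"year": "2022", "month": "07"}, hive_style=True))
print(partition_path({"event_time": "2022-07-01 10:15:00"}, hive_style=True, url_encode=True))
```

The second call produces the `year=2022/month=07` layout recommended above, and the third shows how a timestamp value containing `:` is made safe for use in a path.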

Note that if the data isn’t partitioned, you need to specifically use NonPartitionedKeyGenerator for the record key, which is explained in the previous section. Additionally, Hudi doesn’t allow partition columns to be changed or evolved.

Choose the right index

After we select the storage type in Hudi and determine the record key and partition path, we need to choose the right index for upsert performance. Apache Hudi employs an index to locate the file group that an update/delete belongs to. This enables efficient upsert and delete operations and enforces uniqueness based on the record keys.

Global index vs. non-global index

When picking the right indexing strategy, the first decision is whether to use a global (table level) or non-global (partition level) index. The main difference between global and non-global indexes is the scope of key uniqueness constraints. Global indexes enforce uniqueness of the keys across all partitions of a table. The non-global index implementations enforce this constraint only within a specific partition. Global indexes offer stronger uniqueness guarantees, but they come with a higher update/delete cost. For example, global deletes with just the record key need to scan the entire dataset. HBase indexes are an exception here, but come with an operational overhead.

For large-scale global index use cases, use an HBase index or record-level index (available in Hudi 0.13) because for all other global indexes, the update/delete cost grows with the size of the table, O(size of the table).

When using a global index, be aware of the configuration hoodie.[bloom|simple|hbase].index.update.partition.path, which is already set to true by default. For existing records getting upserted to a new partition, enabling this configuration will help delete the old record in the old partition and insert it in the new partition.
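The following Python sketch models the difference in uniqueness scope and the effect of the update-partition-path setting. The `NonGlobalIndex` and `GlobalIndex` classes are toy models invented for this illustration; they track only where a record key lives, and the behavior with the setting disabled (the record stays in its old partition) is stated here as an assumption.

```python
class NonGlobalIndex:
    """Uniqueness is enforced per partition: (partition, record key)."""

    def __init__(self):
        self.locations = set()

    def upsert(self, record_key, partition):
        self.locations.add((partition, record_key))


class GlobalIndex:
    """Uniqueness is enforced across the table: record key -> partition."""

    def __init__(self, update_partition_path=True):
        self.partition_of = {}
        self.update_partition_path = update_partition_path

    def upsert(self, record_key, partition):
        old = self.partition_of.get(record_key)
        if old is None or self.update_partition_path:
            # New key, or delete from the old partition and insert into the new one.
            self.partition_of[record_key] = partition
        # Otherwise (assumption): the record is updated in its old partition.


non_global = NonGlobalIndex()
moving = GlobalIndex(update_partition_path=True)
pinned = GlobalIndex(update_partition_path=False)
for index in (non_global, moving, pinned):
    index.upsert("order-1", "2022/07")
    index.upsert("order-1", "2022/08")  # same key arrives with a new partition

print(sorted(non_global.locations))
print(moving.partition_of)
print(pinned.partition_of)
```

With the non-global index, the same record key ends up in both partitions; with the global index, exactly one copy exists, and the setting decides which partition holds it.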

Hudi index options

After picking the scope of the index, the next step is to decide which indexing option best fits your workload. The following table explains the indexing options available in Hudi as of 0.11.0.

Indexing Option	How It Works	Characteristic	Scope
Simple Index	Performs a join of the incoming upsert/delete records against keys extracted from the involved partition in case of non-global datasets and the entire dataset in case of global or non-partitioned datasets.	Easiest to configure. Suitable for basic use cases like small tables with evenly spread updates. A simple index is also the right choice for larger tables where updates are spread randomly across all partitions, because it directly joins with the relevant fields from every data file without any initial pruning. In the case of random upserts, a Bloom index adds overhead without enough pruning benefit: the Bloom filters could report a possible match for most of the files, so Hudi ends up comparing ranges and filters against all these files.	Global/Non-global
Bloom Index (default index in EMR Hudi)	Employs Bloom filters built out of the record keys, optionally also pruning candidate files using record key ranges. Bloom filter is stored in the data file footer while writing the data.	More efficient filter compared to simple index for use cases like late-arriving updates to fact tables and deduplication in event tables with ordered record keys such as timestamp. Hudi implements a dynamic Bloom filter mechanism to reduce false positives provided by Bloom filters. In general, the probability of false positives increases with the number of records in a given file. Check the Hudi FAQ for Bloom filter configuration best practices.	Global/Non-global
Bucket Index	It distributes records to buckets using a hash function based on the record keys or a subset of them. It uses the same hash function to determine which file group to match with incoming records. New indexing option since Hudi 0.11.0.	Simple to configure. It has better upsert throughput performance compared to the Bloom filter. As of Hudi 0.11.1, only fixed bucket number is supported. This will no longer be an issue with the upcoming consistent hashing bucket index feature, which can dynamically change bucket numbers.	Non-global
HBase Index	The index mapping is managed in an external HBase table.	Best lookup time, especially for large numbers of partitions and files. It comes with additional operational overhead because you need to manage an external HBase table.	Global
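As a rough sketch of how the choice of indexing option and scope could translate into configuration, the following Python maps the rows of the preceding table to values of the `hoodie.index.type` write option. The `index_options` helper and the lowercase option labels are hypothetical; the value names are assumed to match the index types accepted by the Hudi versions covered in this post, so check them against the configuration reference for your release.

```python
# (indexing option, global scope?) -> value for hoodie.index.type
INDEX_TYPES = {
    ("simple", False): "SIMPLE",
    ("simple", True): "GLOBAL_SIMPLE",
    ("bloom", False): "BLOOM",
    ("bloom", True): "GLOBAL_BLOOM",
    ("bucket", False): "BUCKET",
    ("hbase", True): "HBASE",
}


def index_options(option: str, global_scope: bool) -> dict:
    """Return the index write option for an indexing option and scope."""
    try:
        index_type = INDEX_TYPES[(option.lower(), global_scope)]
    except KeyError:
        scope = "global" if global_scope else "non-global"
        raise ValueError(f"{option} index does not support {scope} scope") from None
    return {"hoodie.index.type": index_type}


print(index_options("bloom", global_scope=False))
print(index_options("simple", global_scope=True))
```

Unsupported combinations, such as a global bucket index, raise a `ValueError` in this sketch, mirroring the Scope column of the table.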

Use cases suitable for simple index

Simple indexes are most suitable for workloads with evenly spread updates over partitions and files on small tables, and also for larger tables with dimension-table workloads, because updates are random across all partitions. A common example is a CDC pipeline for a dimension table. In this case, updates end up touching a large number of files and partitions. Therefore, a join with no other pruning is most efficient.

Use cases suitable for Bloom index

Bloom indexes are suitable for most production workloads with uneven update distribution across partitions. For workloads with most updates to recent data like fact tables, Bloom filter rightly fits the bill. It can be clickstream data collected from an ecommerce site, bank transactions in a FinTech application, or CDC logs for a fact table.

When using a Bloom index, be aware of the following configurations:

hoodie.bloom.index.use.metadata – By default, it is set to false. When this flag is on, the Hudi writer gets the index metadata information from the metadata table and doesn’t need to open Parquet file footers to get the Bloom filters and stats. You prune out the files by just using the metadata table and therefore have improved performance for larger tables.
hoodie.bloom.index.prune.by.ranges – Enable or disable range pruning based on use case. By default, it’s already set to true. When this flag is on, range information from files is used to speed up index lookups. This is helpful if the selected record key is monotonically increasing. You can set any record key to be monotonically increasing by adding a timestamp prefix. If the record key is completely random and has no natural ordering (such as UUIDs), it’s better to turn this off, because range pruning will only add extra overhead to the index lookup.
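Why range pruning helps with ordered keys and not with random ones can be seen in a small Python model. The `candidate_files` function, the file names, and the key ranges are made up for illustration; each file is represented only by the minimum and maximum record key it contains.

```python
def candidate_files(record_key, file_ranges):
    """Return the files whose [min, max] key range could contain record_key."""
    return [
        name
        for name, (low, high) in file_ranges.items()
        if low <= record_key <= high
    ]


# Monotonically increasing keys: each file covers a distinct range.
ordered_files = {"f1": (1000, 1999), "f2": (2000, 2999), "f3": (3000, 3999)}

# Random keys: every file spans almost the whole key space.
random_files = {"f1": (12, 3990), "f2": (7, 3950), "f3": (45, 3999)}

print(candidate_files(2500, ordered_files))
print(candidate_files(2500, random_files))
```

With ordered keys, only one file remains to be checked against its Bloom filter. With random keys, the range check rules out nothing, so it only adds overhead to the lookup.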

Use cases suitable for bucket index

Bucket indexes are suitable for upsert use cases on huge datasets with a large number of file groups within partitions and relatively even data distribution across partitions, where you can also achieve relatively even data distribution on the bucket hash field column. They can have better upsert performance in these cases because no index lookup is involved: file groups are located based on a hashing mechanism, which is very fast. This is totally different from both simple and Bloom indexes, where an explicit index lookup step is involved during write. Each bucket has a one-to-one mapping with a Hudi file group, and because the total number of buckets (defined by hoodie.bucket.index.num.buckets, default 4) is fixed, this can potentially lead to skewed data (data distributed unevenly across buckets) and scalability issues (buckets can grow over time). These issues will be addressed in the upcoming consistent hashing bucket index, which is going to be a special type of bucket index.
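The idea of locating a file group by hashing, with no lookup step, can be sketched in Python as follows. The `bucket_for` function is hypothetical, and CRC32 is used only because it is deterministic and available in the standard library; Hudi's actual hash function is not assumed here. The default of 4 buckets mirrors the default of hoodie.bucket.index.num.buckets mentioned above.

```python
import zlib
from collections import Counter


def bucket_for(record_key: str, num_buckets: int = 4) -> int:
    """Map a record key to a bucket (file group) with a deterministic hash."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


# The same key always lands in the same bucket, so no index lookup is needed.
print(bucket_for("order-42"), bucket_for("order-42"))

# With a fixed number of buckets, each bucket grows as the data grows.
counts = Counter(bucket_for(f"order-{i}") for i in range(1000))
print(sorted(counts.items()))
```

Because the bucket count is fixed, the per-bucket record counts printed here keep growing with the dataset, which is the scalability concern described above.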

Use cases suitable for HBase index

HBase indexes are suitable for use cases where ingestion performance can’t be met using the other index types. These are mostly use cases with global indexes and large numbers of files and partitions. HBase indexes provide the best lookup time but come with significant operational overhead, particularly if you aren’t already running HBase for other workloads.

For more information on choosing the right index and indexing strategies for common use cases, refer to Employing the right indexes for fast updates, deletes in Apache Hudi. As you have already seen, Hudi index performance depends heavily on the actual workload. We encourage you to evaluate different indexes for your workload and choose the one which is best suited for your use case.

Migration guidance

With Apache Hudi growing in popularity, one of the fundamental challenges is to efficiently migrate existing datasets to Apache Hudi. Apache Hudi maintains record-level metadata to perform core operations such as upserts and incremental pulls. To take advantage of Hudi’s upsert and incremental processing support, you need to add Hudi record-level metadata to your original dataset.

Using bulk_insert

The recommended way to migrate data to Hudi is to perform a full rewrite using bulk_insert. bulk_insert performs no look-up for existing records and skips writer optimizations like small file handling. Performing a one-time full rewrite is a good opportunity to write your data in Hudi format with all the metadata and indexes generated, and also to potentially control file size and sort data by record keys.

You can set the sort mode in a bulk_insert operation using the configuration hoodie.bulkinsert.sort.mode. bulk_insert offers the following sort modes to configure.

Sort Modes	Description
`NONE`	No sorting is done to the records. You can get the fastest performance (comparable to writing parquet files with spark) for initial load with this mode.
`GLOBAL_SORT`	Use this to sort records globally across Spark partitions. It is less performant in initial load than other modes as it repartitions data by partition path and sorts it by record key within each partition. This helps in controlling the number of files generated in the target thereby controlling the target file size. Also, the generated target files will not have overlapping min-max values for record keys which will further help speed up index look-ups during upserts/deletes by pruning out files based on record key ranges in bloom index.
`PARTITION_SORT`	Use this to sort records within Spark partitions. It is more performant for initial load than `GLOBAL_SORT`. If the Spark partitions in your data frame already map fairly well to the Hudi partitions (the data frame is already repartitioned by partition column), this mode is preferred because you get records sorted by record key within each partition.

We recommend using GLOBAL_SORT mode if you can handle the one-time cost. The default sort mode changed from GLOBAL_SORT to NONE starting with EMR 6.9 (Hudi 0.12.1). During bulk_insert with GLOBAL_SORT, two configurations control the sizes of target files generated by Hudi.

Configuration Parameter	Description	Value
`hoodie.bulkinsert.shuffle.parallelism`	The number of files generated from the bulk insert is determined by this configuration. The higher the parallelism, the more Spark tasks process the data.	Default value is 200. To control file size and achieve maximum performance (more parallelism), we recommend setting this to a value such that the generated files are roughly equal in size to `hoodie.parquet.max.file.size`. If you set the parallelism very high, the max file size can’t be honored because each Spark task works on a smaller amount of data.
`hoodie.parquet.max.file.size`	Target size for Parquet files produced by Hudi write phases.	Default value is 120 MB. If the Spark partitions generated with `hoodie.bulkinsert.shuffle.parallelism` are larger than this size, Hudi splits them and generates multiple files so as not to exceed the max file size.

Let’s say we have a 100 GB Parquet source dataset and we’re bulk inserting with GLOBAL_SORT into a partitioned Hudi table with 10 evenly distributed Hudi partitions. We want the preferred target file size of 120 MB (the default value for hoodie.parquet.max.file.size). The Hudi bulk insert shuffle parallelism should be calculated as follows:

The total data size in MB is 100 * 1024 = 102400 MB
hoodie.bulkinsert.shuffle.parallelism should be set to 102400/120 = ~854

Note that in practice, even with GLOBAL_SORT, each Spark partition can map to more than one Hudi partition. Use this calculation only as a rough estimate; you can end up with more files than the parallelism specified.
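The estimate above can be sketched in a few lines of Python. This is a minimal sketch, not part of Hudi: the helper names `estimate_bulk_insert_parallelism` and `bulk_insert_options` are hypothetical, and the option keys are the ones discussed in this section plus `hoodie.datasource.write.operation`. It assumes evenly distributed source data, so treat the result as a starting point.

```python
import math


def estimate_bulk_insert_parallelism(dataset_size_gb: float, target_file_size_mb: int = 120) -> int:
    """Rough estimate: total data size divided by the target file size, rounded up."""
    total_size_mb = dataset_size_gb * 1024
    return math.ceil(total_size_mb / target_file_size_mb)


def bulk_insert_options(dataset_size_gb: float, target_file_size_mb: int = 120) -> dict:
    """Build Hudi write options for a one-time GLOBAL_SORT bulk_insert (hypothetical helper)."""
    parallelism = estimate_bulk_insert_parallelism(dataset_size_gb, target_file_size_mb)
    return {
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
        "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
        # hoodie.parquet.max.file.size is expressed in bytes.
        "hoodie.parquet.max.file.size": str(target_file_size_mb * 1024 * 1024),
    }


# The 100 GB example from the text.
options = bulk_insert_options(100)
print(options["hoodie.bulkinsert.shuffle.parallelism"])  # 854
```

With a Spark DataFrame `df`, options like these would typically be passed through `df.write.format("hudi").options(**options)`, together with the table name, record key, and partition path settings that your table requires.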

Using bootstrapping

For customers operating at scale on hundreds of terabytes or petabytes of data, migrating your datasets to start using Apache Hudi can be time-consuming. Apache Hudi provides a feature called bootstrap to help with this challenge.

The bootstrap operation contains two modes: METADATA_ONLY and FULL_RECORD.

FULL_RECORD is the same as full rewrite, where the original data is copied and rewritten with the metadata as Hudi files.

The METADATA_ONLY mode is the key to accelerating the migration progress. The conceptual idea is to decouple the record-level metadata from the actual data by writing only the metadata columns in the Hudi files generated, while the data isn’t copied over and stays in its original location. This significantly reduces the amount of data written, thereby improving the time to migrate and get started with Hudi. However, this comes at the expense of read performance, because reads incur the overhead of merging Hudi files and original data files to get the complete record. Therefore, you may not want to use it for frequently queried partitions.

You can pick and choose these modes at partition level. One common strategy is to tier your data. Use FULL_RECORD mode for a small set of hot partitions, which are accessed frequently, and METADATA_ONLY for a larger set of cold partitions.

Consider the following:

There is some read performance penalty for the METADATA_ONLY partitions, and it should only be used for archived partitions. For more details, refer to Efficient Migration of Large Parquet Tables to Apache Hudi.
The original dataset needs to be in Parquet format to use bootstrap.
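A tiered bootstrap could be configured along the following lines. This is a hedged sketch: the option keys and the `BootstrapRegexModeSelector` class reflect the Apache Hudi bootstrap configuration as we understand it, and you should verify them against the Hudi version on your EMR release. The S3 path, the partition regex, and the helper name `tiered_bootstrap_options` are hypothetical.

```python
def tiered_bootstrap_options(source_path: str, hot_partition_regex: str) -> dict:
    """Build bootstrap options that use FULL_RECORD for partitions matching the regex.

    Partitions that do not match the regex are expected to be bootstrapped
    in the other mode (METADATA_ONLY) by the regex mode selector.
    """
    return {
        "hoodie.datasource.write.operation": "bootstrap",
        # Location of the original Parquet dataset (stays in place for METADATA_ONLY).
        "hoodie.bootstrap.base.path": source_path,
        "hoodie.bootstrap.mode.selector": (
            "org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector"
        ),
        "hoodie.bootstrap.mode.selector.regex": hot_partition_regex,
        "hoodie.bootstrap.mode.selector.regex.mode": "FULL_RECORD",
    }


# Hypothetical layout: partitions named like year=2022/month=11.
options = tiered_bootstrap_options(
    "s3://my-bucket/source-parquet/",  # placeholder path
    r"year=2022/month=(11|12)",        # treat the two most recent months as hot
)
```

The design choice here is to name the small hot set explicitly in the regex, so that any partition you forget to list defaults to the cheaper METADATA_ONLY path.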

Catalog sync

Hudi supports syncing Hudi table partitions and columns to a catalog. On AWS, you can either use the AWS Glue Data Catalog or Hive metastore as the metadata store for your Hudi tables. To register and synchronize the metadata with your regular write pipeline, you need to either enable hive sync or run the hive_sync_tool or AwsGlueCatalogSyncTool command line utility.

We recommend enabling the hive sync feature with your regular write pipeline to make sure the catalog is up to date. If you don’t expect a new partition to be added or the schema changed as part of each batch, then we recommend enabling hoodie.datasource.meta_sync.condition.sync as well so that it allows Hudi to determine if hive sync is necessary for the job.

If you have frequent ingestion jobs and need to maximize ingestion performance, you can disable hive sync and run the hive_sync_tool asynchronously.

If you have the timestamp data type in your Hudi data, we recommend setting hoodie.datasource.hive_sync.support_timestamp to true to convert the int64 (timestamp_micros) to the Hive type timestamp. Otherwise, the values appear as bigint when you query the data.

The following table summarizes the configurations related to hive_sync.

Configuration Parameter	Description	Value
`hoodie.datasource.hive_sync.enable`	To register or sync the table to a Hive metastore or the AWS Glue Data Catalog.	Default value is false. We recommend setting the value to true to make sure the catalog is up to date, and it needs to be enabled in every single write to avoid an out-of-sync metastore.
`hoodie.datasource.hive_sync.mode`	This configuration sets the mode for HiveSyncTool to connect to the Hive metastore server. For more information, refer to Sync modes.	Valid values are hms, jdbc, and hiveql. If the mode isn’t specified, it defaults to jdbc. Both hms and jdbc talk to the underlying thrift server, but jdbc needs a separate JDBC driver. We recommend setting it to ‘hms’, which uses the Hive metastore client to sync Hudi tables using thrift APIs directly. This helps when using the AWS Glue Data Catalog because you don’t need to install Hive as an application on the EMR cluster (because it doesn’t need the server).
`hoodie.datasource.hive_sync.database`	Name of the destination database that we should sync the Hudi table to.	Default value is default. Set this to the database name of your catalog.
`hoodie.datasource.hive_sync.table`	Name of the destination table that we should sync the Hudi table to.	In Amazon EMR, the value is inferred from the Hudi table name. You can set this config if you need a different table name.
`hoodie.datasource.hive_sync.support_timestamp`	To convert logical type `TIMESTAMP_MICROS` as hive type timestamp.	Default value is false. Set it to true to convert to hive type timestamp.
`hoodie.datasource.meta_sync.condition.sync`	If true, only sync on conditions like schema change or partition change.	Default value is false.
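The recommendations in the preceding table can be collected into a set of write options. The following is a minimal sketch that uses only the configuration keys listed above; the helper name `hive_sync_options` and the database and table names are hypothetical.

```python
def hive_sync_options(database: str, table: str,
                      has_timestamps: bool = False,
                      conditional_sync: bool = False) -> dict:
    """Build Hudi catalog sync options using the recommended hms sync mode."""
    options = {
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
    }
    if has_timestamps:
        # Expose TIMESTAMP_MICROS columns as Hive timestamps instead of bigint.
        options["hoodie.datasource.hive_sync.support_timestamp"] = "true"
    if conditional_sync:
        # Only sync when the schema or partitions change.
        options["hoodie.datasource.meta_sync.condition.sync"] = "true"
    return options


# Hypothetical database and table names.
options = hive_sync_options("analytics_db", "trips", has_timestamps=True)
```

These options would be merged into the options of each write so that the catalog stays in sync with the table.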

Writing and reading Hudi datasets, and their integration with other AWS services

There are different ways you can write the data to Hudi using Amazon EMR, as explained in the following table.

Hudi Write Options	Description
Spark DataSource	You can use this option to do upsert, insert, or bulk insert for the write operation. Refer to Work with a Hudi dataset for an example of how to write data using DataSourceWrite.
Spark SQL	You can easily write data to Hudi with SQL statements. This eliminates the need to write Scala or PySpark code and lets you adopt a low-code paradigm.
Flink SQL, Flink DataStream API	If you’re using Flink for real-time streaming ingestion, you can use the high-level Flink SQL or Flink DataStream API to write the data to Hudi.
DeltaStreamer	DeltaStreamer is a self-managed tool that supports standard data sources like Apache Kafka, Amazon S3 events, DFS, AWS DMS, JDBC, and SQL sources, built-in checkpoint management, schema validations, as well as lightweight transformations. It can also operate in a continuous mode, in which a single self-contained Spark job can pull data from source, write it out to Hudi tables, and asynchronously perform cleaning, clustering, compactions, and catalog syncing, relying on Spark’s job pools for resource management. It’s easy to use and we recommend using it for all the streaming and ingestion use cases where a low-code approach is preferred. For more information, refer to Streaming Ingestion.
Spark structured streaming	For use cases that require complex data transformations of the source data frame written in Spark DataFrame APIs or advanced SQL, we recommend the structured streaming sink. The streaming source can be used to obtain change feeds out of Hudi tables for streaming or incremental processing use cases.
Kafka Connect Sink	If you standardize on the Apache Kafka Connect framework for your ingestion needs, you can also use the Hudi Connect Sink.

Refer to the following support matrix for query support on specific query engines. The following table explains the different options to read the Hudi dataset using Amazon EMR.

Hudi Read options	Description
Spark DataSource	You can read Hudi datasets directly from Amazon S3 using this option. The tables don’t need to be registered with Hive metastore or the AWS Glue Data Catalog for this option. You can use this option if your use case doesn’t require a metadata catalog. Refer to Work with a Hudi dataset for an example of how to read data using DataSourceReadOptions.
Spark SQL	You can query Hudi tables with DML/DDL statements. The tables need to be registered with Hive metastore or the AWS Glue Data Catalog for this option.
Flink SQL	After the Flink Hudi tables have been registered to the Flink catalog, they can be queried using the Flink SQL.
PrestoDB/Trino	The tables need to be registered with Hive metastore or the AWS Glue Data Catalog for this option. This engine is preferred for interactive queries. There is a new Trino connector in upcoming Hudi 0.13, and we recommend reading datasets through this connector when using Trino for performance benefits.
Hive	The tables need to be registered with Hive metastore or the AWS Glue Data Catalog for this option.

Apache Hudi is well integrated with AWS services, and these integrations work when AWS Glue Data Catalog is used, with the exception of Athena, where you can also use a data source connector to an external Hive metastore. The following table summarizes the service integrations.

AWS Service	Description
Amazon Athena	You can use Athena for a serverless option to query a Hudi dataset on Amazon S3. Currently, it supports snapshot queries and read-optimized queries, but not incremental queries. For more details, refer to Using Athena to query Apache Hudi datasets.
Amazon Redshift Spectrum	You can use Amazon Redshift Spectrum to run analytic queries against tables in your Amazon S3 data lake with Hudi format. Currently, it supports only CoW tables. For more details, refer to Creating external tables for data managed in Apache Hudi.
AWS Lake Formation	AWS Lake Formation is used to secure data lakes and define fine-grained access control on the database and table level. Hudi is not currently supported with Amazon EMR Lake Formation integration.
AWS DMS	You can use AWS DMS to ingest data from upstream relational databases to your S3 data lakes into a Hudi dataset. For more details, refer to Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service.

Conclusion

This post covered best practices for configuring Apache Hudi data lakes using Amazon EMR. We discussed the key configurations in migrating your existing dataset to Hudi and shared guidance on how to determine the right options for different use cases when setting up Hudi tables.

The upcoming Part 2 of this series focuses on optimizations that can be done on this setup, along with monitoring using Amazon CloudWatch.

About the Authors

Suthan Phillips is a Big Data Architect for Amazon EMR at AWS. He works with customers to provide best practice and technical guidance and helps them achieve highly scalable, reliable and secure solutions for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

How Etleap and Amazon Redshift Serverless optimize costs for ETL

2022-11-22 Caius Brindescu

Post Syndicated from Caius Brindescu original https://aws.amazon.com/blogs/big-data/how-etleap-and-amazon-redshift-serverless-optimize-costs-for-etl/

Amazon Redshift Serverless lets you avoid managing infrastructure while only paying for what you use. Etleap provides data integration software that is natively built on AWS. It’s an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation.

In this post, we share how you can minimize the usage of resources for some workload patterns and maximize savings while seamlessly managing data pipelines. We illustrate an example of how Redshift Serverless and Etleap’s load synchronization feature can reduce active Redshift Serverless time, further optimizing extract, transform, and load (ETL) costs.

Introduction to Redshift Serverless

Redshift Serverless makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse clusters. With Redshift Serverless, you pay for the compute only when the data warehouse is in use. This is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes. As your demand evolves with new workloads and more concurrent users, Redshift Serverless automatically provisions the right compute resources, and your data warehouse scales seamlessly and automatically.

You can create a Redshift Serverless data warehouse either using the default settings or custom settings. Redshift Serverless creates a default workgroup and associates that to the default namespace. You can also create multiple Redshift Serverless endpoints per AWS account and Region using namespaces and workgroups.

A namespace is a collection of database objects and users, with properties such as database name and password, permissions, and encryption and security. The following screenshot shows an example of a namespace configuration on the Redshift Serverless console.

A workgroup is a collection of compute resources, which includes network and security settings. Workgroup configuration allows you to create a private or public serverless endpoint that you can use to connect with your applications. The following screenshot shows an example workgroup on the Redshift Serverless console.

When the Redshift Serverless endpoint is available, choose Query data to launch the Amazon Redshift Query Editor v2 to create database objects, load data, and analyze and visualize data. You can also connect to Redshift Serverless endpoints using your preferred SQL client tools via Amazon Redshift JDBC/ODBC drivers.

With Redshift Serverless, you pay separately for the compute and storage you use. Compute capacity is measured in Redshift Processing Units (RPUs), and you pay for the workloads in RPU-hours with a minimum charge of 60 seconds, metered on a per-second basis. Data lake queries are also part of the same RPU-hours, and Redshift Serverless doesn’t charge separately for the per-TB based pricing of Amazon Redshift Spectrum. The default base capacity is 128 RPUs, but you can adjust it from 32 RPUs to 512 RPUs in units of 8 using the Redshift Serverless console. For storage, you pay for data stored in Amazon Redshift-managed storage and storage used for manual snapshots, similar to what you would pay with Amazon Redshift provisioned RA3 instances.

To control your costs, you can specify usage limits and define actions that Amazon Redshift automatically takes if those limits are reached. You can specify usage limits in RPU-hours and associate them with a daily, weekly, or monthly duration. Setting higher usage limits can improve the overall throughput of the system, especially for workloads that need to handle high concurrency while maintaining consistently high performance.

Why Etleap customers need Redshift Serverless

Etleap gives customers robust and flexible pipelines without the hassle of coding and managing infrastructure. Redshift Serverless has a similar benefit, letting you run Amazon Redshift without worrying about provisioning and maintaining a data warehouse.

With the close Etleap-AWS integration, you can get started working with multiple data sources in Redshift Serverless in minutes.

Redshift Serverless can also reduce users’ costs because it automatically scales data warehouse capacity up and down to match usage and only charges when the serverless instance is active. ETL workloads are often batch-based and characterized by spikes, so the dynamic scaling of Redshift Serverless reduces unnecessary costs.

The following diagram illustrates this solution architecture.

Etleap uses AWS Database Migration Service (AWS DMS), Amazon EMR, and Amazon Simple Storage Service (Amazon S3) to process data from databases, files, applications, and streams into Redshift Serverless.

Optimize costs for Redshift Serverless

One of the main sources of cost savings when using Redshift Serverless comes from its auto-pausing feature. When a Redshift Serverless instance is idle, it will auto-pause and you aren’t charged during this period of inactivity.

However, high frequency ETL pipelines (such as those from streams or CDC sources) can constantly resume the Redshift Serverless instance, negating the cost benefit. To maximize the advantages of the auto-pausing feature of Redshift Serverless, Etleap provides the option of load synchronization. As shown in the following figure, this reduces the number of load batches, thereby lowering active Redshift Serverless instance time and cost.

It sometimes makes sense to maximize the frequency of data ingestion, but not all use cases justify the higher cost of an always-on Amazon Redshift instance. Etleap users can set their load frequency at a cost-efficient once-per-hour or as frequently as every 5 minutes.

Amazon Redshift users typically run some SQL transformations after data is loaded in the warehouse. Etleap’s models feature lets you define the SQL transformations and their dependencies and control when these transformations are run. As with data loading, however, if these aren’t designed thoughtfully, there is a risk that models will trigger updates that unnecessarily wake up an idle Redshift Serverless instance, negating the cost savings of the Redshift Serverless auto-pausing feature.

To avoid this, Etleap schedules the models to update immediately after all the dependent tables have been updated. This maximizes the instance usage while it’s awake and allows it to pause when the loads and updates have completed.

Cost savings example

Let’s illustrate the cost savings benefits of Redshift Serverless by means of an example. A customer has set a 1-hour load synchronization schedule and has 100 pipelines and 10 models. Although by default Redshift Serverless has a provisioned base capacity of 128 RPUs, a provisioned base capacity of 32 RPUs is sufficient for the load requirements of this example. A typical average load time for Etleap customers into Amazon Redshift is 6 seconds. In Etleap, we perform a maximum of five loads at a time to avoid overloading the Redshift Serverless instance.

Here is an example of how the sequence would work for the pipelines:

When the hourly schedule triggers, Etleap begins the extraction and transformation of source data for all pipelines with new data to process.
After all the pipelines have finished extraction and transformation, Etleap begins to load the data into Amazon Redshift. This resumes the serverless instance. At an average of 6 seconds per load and five loads running in parallel, it takes 120 seconds to load all the pipelines (100 / 5 pipeline cycles * 6 seconds each).
When the load is complete, Etleap triggers the model updates. A typical model in Etleap takes about 130 seconds to update. As with loads, Etleap limits models to five simultaneous updates to reduce the load on the Redshift Serverless instance. Therefore, updating all 10 models takes 260 seconds of total instance run time (130 seconds * 10/5 model cycles).
At this point, you’re being charged for 380 seconds of active workload, and Redshift Serverless will become idle after some time.

Additionally, Etleap runs daily vacuum operations on applicable tables to minimize storage and improve query efficiency. The length of this process depends on the tables and the number of updates and deletes. For a customer with this amount of pipeline volume, 20 minutes is a typical length of time to vacuum the tables, adding that much daily runtime for the instance.

This results in a total daily runtime of 172 minutes ((380 seconds * 24 daily cycles / 60) + 20 minutes), which translates into a cost of $34.40 per day for a 32 RPU serverless instance. This is 88% lower cost than a comparable Amazon Redshift provisioned environment without the benefits of Etleap and Redshift Serverless: an always-on provisioned Amazon Redshift cluster with similar performance (1 year reserved instance pricing for 16 ra3.xlplus nodes running 24 hours/day).
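The arithmetic in this example can be reproduced with a short script. This is a sketch under stated assumptions: the per-RPU-hour price of $0.375 is not stated in the post and is the value implied by the figures above ($34.40 for 172 minutes at 32 RPUs); actual pricing varies by Region. The function names are hypothetical.

```python
import math

LOAD_SECONDS = 6        # typical average load time per pipeline
MODEL_SECONDS = 130     # typical model update time
MAX_PARALLEL = 5        # Etleap runs at most five loads or model updates at a time
ASSUMED_PRICE_PER_RPU_HOUR = 0.375  # assumption implied by the post's figures


def cycle_seconds(pipelines: int, models: int) -> int:
    """Active instance time for one load synchronization cycle."""
    load_batches = math.ceil(pipelines / MAX_PARALLEL)
    model_batches = math.ceil(models / MAX_PARALLEL)
    return load_batches * LOAD_SECONDS + model_batches * MODEL_SECONDS


def daily_runtime_and_cost(pipelines: int, models: int, cycles_per_day: int = 24,
                           vacuum_minutes: int = 20, base_rpus: int = 32):
    """Return (daily runtime in minutes, daily compute cost in USD)."""
    runtime_minutes = cycle_seconds(pipelines, models) * cycles_per_day / 60 + vacuum_minutes
    cost = runtime_minutes / 60 * base_rpus * ASSUMED_PRICE_PER_RPU_HOUR
    return runtime_minutes, round(cost, 2)


runtime, cost = daily_runtime_and_cost(100, 10)
print(runtime, cost)  # 172.0 34.4
```

Changing `cycles_per_day` shows how the load synchronization schedule drives cost: each additional cycle adds another 380 seconds of active time in this scenario.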

Other ETL optimizations on Etleap using Redshift Serverless

Etleap natively supports Redshift Serverless by updating its ETL solution to ensure you can continue to seamlessly ingest diverse data sources.

Redshift Serverless offers new system views that are used for tracking and managing ingestion, and Etleap uses these system views to natively handle tracking of ingestion loads and vacuuming operations in its platform. For example, Etleap uses sys_query_history to determine which loads are in progress or complete, which helps avoid double loading a batch.

Redshift Serverless automatically initiates optimizations such as sort and vacuum in the background and doesn’t charge for these automatic optimizations. As a best practice, after Etleap load synchronization, Etleap periodically runs the vacuum function on applicable tables, which reduces storage and improves query performance. Etleap uses the vacuum_sort_benefit column in svv_table_info, which provides the statistics for each table, informing which would benefit from vacuuming.
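To illustrate the idea, the following sketch builds a query that lists tables ranked by `vacuum_sort_benefit` in `svv_table_info`. It is not Etleap's implementation: the helper name `vacuum_candidates_query` and the 10 percent threshold are hypothetical, and you would run the resulting SQL with whichever Redshift client you already use.

```python
def vacuum_candidates_query(min_benefit_pct: float = 10.0) -> str:
    """Return SQL that lists tables whose estimated vacuum sort benefit exceeds a threshold."""
    threshold = float(min_benefit_pct)  # coerce to a number before formatting into SQL
    return (
        'SELECT "schema", "table", vacuum_sort_benefit '
        "FROM svv_table_info "
        f"WHERE vacuum_sort_benefit > {threshold} "
        "ORDER BY vacuum_sort_benefit DESC;"
    )


query = vacuum_candidates_query()
print(query)
```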

Summary

In this post, we described how Redshift Serverless frees you from managing data warehouse infrastructure and reduces costs. In particular, we illustrated a data integration pattern where Etleap can ensure further cost savings through its load synchronization feature by choosing a cost-efficient once-per-hour load frequency. Although this is a good fit for use cases where you prefer cost efficiency over real-time data insights, Etleap also allows you to set the load interval as low as 5 minutes for use cases where near-real-time data insights are important.

Start using Redshift Serverless to run and scale analytics without having to manage data warehouse infrastructure and take advantage of further cost savings through Etleap’s load synchronization feature. To get started with Etleap, start a free trial or request a tailored demo.

About the Authors

Caius Brindescu is an engineer at Etleap with over 4 years of experience in developing ETL software. In addition to development work, he helps customers make the most out of Etleap and Amazon Redshift. He holds a PhD from Oregon State University and one AWS certification (Big Data – Specialty).

Maneesh Sharma is a Senior Database Engineer at AWS with more than a decade of experience designing and implementing large-scale data warehouse and analytics solutions. He collaborates with various Amazon Redshift Partners and customers to drive better integration.

Sathisan Vannadil is a Senior Partner Solutions Architect at Amazon Web Services (AWS). His primary focus is on helping independent software vendor (ISV) partners design and build solutions at scale on AWS. Prior to AWS, Sathisan held diverse technical positions and has over 20 years of experience in the field of data and analytics.

Get started with data integration from Amazon S3 to Amazon Redshift using AWS Glue interactive sessions

2022-11-21 Vikas Omer

Post Syndicated from Vikas Omer original https://aws.amazon.com/blogs/big-data/get-started-with-data-integration-from-amazon-s3-to-amazon-redshift-using-aws-glue-interactive-sessions/

Organizations are placing a high priority on data integration, especially to support analytics, machine learning (ML), business intelligence (BI), and application development initiatives. Data is growing exponentially and is generated by increasingly diverse data sources. Data integration becomes challenging when you process data at scale and take on the heavy lifting of managing the infrastructure it requires. This is one of the key reasons why organizations are constantly looking for easy-to-use and low maintenance data integration solutions to move data from one location to another or to consolidate their business data from several sources into a centralized location to make strategic business decisions.

Most organizations use Spark for their big data processing needs. If you’re looking to simplify data integration, and don’t want the hassle of spinning up servers, managing resources, or setting up Spark clusters, we have the solution for you.

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, ML, and application development. AWS Glue provides both visual and code-based interfaces to make data integration simple and accessible for everyone.

If you prefer a code-based experience and want to interactively author data integration jobs, we recommend interactive sessions. Interactive sessions is a recently launched AWS Glue feature that allows you to interactively develop AWS Glue processes, run and test each step, and view the results.

There are different options to use interactive sessions. You can create and work with interactive sessions through the AWS Command Line Interface (AWS CLI) and API. You can also use Jupyter-compatible notebooks to visually author and test your notebook scripts. Interactive sessions provide a Jupyter kernel that integrates almost anywhere that Jupyter does, including integrating with IDEs such as PyCharm, IntelliJ, and Visual Studio Code. This enables you to author code in your local environment and run it seamlessly on the interactive session backend. You can also start a notebook through AWS Glue Studio; all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds. When the code is ready, you can configure, schedule, and monitor job notebooks as AWS Glue jobs.

If you haven’t tried AWS Glue interactive sessions before, this post is highly recommended. We work through a simple scenario where you might need to incrementally load data from Amazon Simple Storage Service (Amazon S3) into Amazon Redshift or transform and enrich your data before loading into Amazon Redshift. In this post, we use interactive sessions within an AWS Glue Studio notebook to load the NYC Taxi dataset into an Amazon Redshift Serverless cluster, query the loaded dataset, save our Jupyter notebook as a job, and schedule it to run using a cron expression. Let’s get started.

Solution overview

We walk you through the following steps:

Set up an AWS Glue Jupyter notebook with interactive sessions.
Use the notebook’s magics, including AWS Glue connection and bookmarks.
Read data from Amazon S3, and transform and load it into Redshift Serverless.
Save the notebook as an AWS Glue job and schedule it to run.

Prerequisites

For this walkthrough, we must complete the following prerequisites:

Upload Yellow Taxi Trip Records data and the taxi zone lookup table datasets into Amazon S3. Steps to do that are listed in the next section.
Prepare the necessary AWS Identity and Access Management (IAM) policies and roles to work with AWS Glue Studio Jupyter notebooks, interactive sessions, and AWS Glue.
Create the AWS Glue connection for Redshift Serverless.

Upload datasets into Amazon S3

Download Yellow Taxi Trip Records data and taxi zone lookup table data to your local environment. For this post, we download the January 2022 data for yellow taxi trip records data in Parquet format. The taxi zone lookup data is in CSV format. You can also download the data dictionary for the trip record dataset.

On the Amazon S3 console, create a bucket called my-first-aws-glue-is-project-<random number> in the us-east-1 Region to store the data. S3 bucket names must be unique across all AWS accounts in all the Regions.
Create folders nyc_yellow_taxi and taxi_zone_lookup in the bucket you just created and upload the files you downloaded.
Your folder structures should look like the following screenshots.

Prepare IAM policies and role

Let’s prepare the necessary IAM policies and role to work with AWS Glue Studio Jupyter notebooks and interactive sessions. To get started with notebooks in AWS Glue Studio, refer to Getting started with notebooks in AWS Glue Studio.

Create IAM policies for the AWS Glue notebook role

Create the policy AWSGlueInteractiveSessionPassRolePolicy with the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::<AWS account ID>:role/AWSGlueServiceRole-GlueIS"
        }
    ]
}

This policy allows the AWS Glue notebook role to be passed to interactive sessions so that the same role can be used in both places. Note that AWSGlueServiceRole-GlueIS is the role that we create for the AWS Glue Studio Jupyter notebook in a later step. Next, create the policy AmazonS3Access-MyFirstGlueISProject with the following permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<your s3 bucket name>",
                "arn:aws:s3:::<your s3 bucket name>/*"
            ]
        }
    ]
}

This policy allows the AWS Glue notebook role to access data in the S3 bucket.
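If you prefer to generate these two policy documents rather than paste them, a small script can fill in the account ID and bucket name. This is a minimal sketch using only the Python standard library; the helper names and the example account ID and bucket name are hypothetical placeholders.

```python
import json


def pass_role_policy(account_id: str, role_name: str = "AWSGlueServiceRole-GlueIS") -> str:
    """Policy document that allows passing the notebook role to interactive sessions."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
            }
        ],
    }
    return json.dumps(document, indent=4)


def s3_read_policy(bucket_name: str) -> str:
    """Policy document that allows listing the bucket and reading its objects."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }
    return json.dumps(document, indent=4)


# Placeholder values for illustration only.
print(pass_role_policy("123456789012"))
print(s3_read_policy("my-first-aws-glue-is-project-12345"))
```

The resulting JSON strings can be pasted into the IAM console, or passed as the `PolicyDocument` argument if you create the policies with the AWS SDK.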

Create an IAM role for the AWS Glue notebook

Create a new AWS Glue role called AWSGlueServiceRole-GlueIS with the following policies attached to it:

Create the AWS Glue connection for Redshift Serverless

Now we’re ready to configure a Redshift Serverless security group to connect with AWS Glue components.

On the Redshift Serverless console, open the workgroup you’re using.
You can find all the namespaces and workgroups on the Redshift Serverless dashboard.
Under Data access, choose Network and security.
Choose the link for the Redshift Serverless VPC security group. You’re redirected to the Amazon Elastic Compute Cloud (Amazon EC2) console.
In the Redshift Serverless security group details, under Inbound rules, choose Edit inbound rules.
Add a self-referencing rule to allow AWS Glue components to communicate:
1. For Type, choose All TCP.
2. For Protocol, choose TCP.
3. For Port range, include all ports.
4. For Source, use the same security group as the group ID.
Similarly, add the following outbound rules:
1. A self-referencing rule with Type as All TCP, Protocol as TCP, Port range including all ports, and Destination as the same security group as the group ID.
2. An HTTPS rule for Amazon S3 access. The s3-prefix-list-id value is required in the security group rule to allow traffic from the VPC to the Amazon S3 VPC endpoint.

If you don’t have an Amazon S3 VPC endpoint, you can create one on the Amazon Virtual Private Cloud (Amazon VPC) console.

You can check the value for s3-prefix-list-id on the Managed prefix lists page on the Amazon VPC console.
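The rules described above can also be written down as data. The following sketch expresses them as Python dictionaries in the shape of the EC2 IpPermissions structure (the structure accepted, for example, by the boto3 EC2 client methods authorize_security_group_ingress and authorize_security_group_egress). The snippet only builds the dictionaries and makes no AWS calls; the security group ID and prefix list ID are placeholders.

```python
def self_referencing_rule(group_id: str) -> dict:
    """All TCP ports, with the security group itself as the peer."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def s3_https_rule(prefix_list_id: str) -> dict:
    """HTTPS (port 443) to the Amazon S3 prefix list of the VPC endpoint."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "PrefixListIds": [{"PrefixListId": prefix_list_id}],
    }


# Placeholder IDs; use your Redshift Serverless security group ID and the
# s3-prefix-list-id shown on the Managed prefix lists page.
security_group_id = "sg-0123456789abcdef0"
s3_prefix_list_id = "pl-0123456789abcdef0"

inbound_rules = [self_referencing_rule(security_group_id)]
outbound_rules = [
    self_referencing_rule(security_group_id),
    s3_https_rule(s3_prefix_list_id),
]

print(inbound_rules)
print(outbound_rules)
```

Keeping the rules as data makes it easy to review them before applying them through the console or an SDK.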

Next, go to the Connectors page on AWS Glue Studio and create a new JDBC connection called redshiftServerless to your Redshift Serverless cluster (unless one already exists). You can find the Redshift Serverless endpoint details under your workgroup’s General Information section. The connection setting looks like the following screenshot.

Write interactive code on an AWS Glue Studio Jupyter notebook powered by interactive sessions

Now you can get started with writing interactive code using AWS Glue Studio Jupyter notebook powered by interactive sessions. Note that it’s a good practice to keep saving the notebook at regular intervals while you work through it.

On the AWS Glue Studio console, create a new job.
Select Jupyter Notebook and select Create a new notebook from scratch.
Choose Create.
For Job name, enter a name (for example, myFirstGlueISProject).
For IAM Role, choose the role you created (AWSGlueServiceRole-GlueIS).
Choose Start notebook job.
After the notebook is initialized, you can see some of the available magics and a cell with boilerplate code. To view all the magics of interactive sessions, run %help in a cell to print a full list. With the exception of %%sql, running a cell of only magics doesn’t start a session, but sets the configuration for the session that starts when you run your first cell of code. For this post, we configure AWS Glue with version 3.0, three G.1X workers, idle timeout, and an Amazon Redshift connection with the help of available magics.

Let’s enter the following magics into our first cell and run it:

%glue_version 3.0
%number_of_workers 3
%worker_type G.1X
%idle_timeout 60
%connections redshiftServerless

We get the following response:

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.35 
Setting Glue version to: 3.0
Previous number of workers: 5
Setting new number of workers to: 3
Previous worker type: G.1X
Setting new worker type to: G.1X
Current idle_timeout is 2880 minutes.
idle_timeout has been set to 60 minutes.
Connections to be included:
redshiftServerless

Let’s run our first code cell (boilerplate code) to start an interactive notebook session within a few seconds:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

We get the following response:

Authenticating with environment variables and user-defined glue_role_arn:arn:aws:iam::xxxxxxxxxxxx:role/AWSGlueServiceRole-GlueIS
Attempting to use existing AssumeRole session credentials.
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 3
Session ID: 7c9eadb1-9f9b-424f-9fba-d0abc57e610d
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
--job-bookmark-option job-bookmark-enable
Waiting for session 7c9eadb1-9f9b-424f-9fba-d0abc57e610d to get into ready status...
Session 7c9eadb1-9f9b-424f-9fba-d0abc57e610d has been created

Next, read the NYC yellow taxi data from the S3 bucket into an AWS Glue dynamic frame:

nyc_taxi_trip_input_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3", 
    connection_options = {
        "paths": ["s3://<your-s3-bucket-name>/nyc_yellow_taxi/"]
    }, 
    format = "parquet",
    transformation_ctx = "nyc_taxi_trip_input_dyf"
)

Let’s count the number of rows, look at the schema and a few rows of the dataset.

Count the rows with the following code:

nyc_taxi_trip_input_df = nyc_taxi_trip_input_dyf.toDF()
nyc_taxi_trip_input_df.count()

We get the following response:

2463931

View the schema with the following code:

nyc_taxi_trip_input_df.printSchema()

We get the following response:

root
 |-- VendorID: long (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: long (nullable = true)
 |-- DOLocationID: long (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)

View a few rows of the dataset with the following code:

nyc_taxi_trip_input_df.show(5)

We get the following response:

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|airport_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
|       2| 2022-01-18 15:04:43|  2022-01-18 15:12:51|            1.0|         1.13|       1.0|                 N|         141|         229|           2|        7.0|  0.0|    0.5|       0.0|         0.0|                  0.3|        10.3|                 2.5|        0.0|
|       2| 2022-01-18 15:03:28|  2022-01-18 15:15:52|            2.0|         1.36|       1.0|                 N|         237|         142|           1|        9.5|  0.0|    0.5|      2.56|         0.0|                  0.3|       15.36|                 2.5|        0.0|
|       1| 2022-01-06 17:49:22|  2022-01-06 17:57:03|            1.0|          1.1|       1.0|                 N|         161|         229|           2|        7.0|  3.5|    0.5|       0.0|         0.0|                  0.3|        11.3|                 2.5|        0.0|
|       2| 2022-01-09 20:00:55|  2022-01-09 20:04:14|            1.0|         0.56|       1.0|                 N|         230|         230|           1|        4.5|  0.5|    0.5|      1.66|         0.0|                  0.3|        9.96|                 2.5|        0.0|
|       2| 2022-01-24 16:16:53|  2022-01-24 16:31:36|            1.0|         2.02|       1.0|                 N|         163|         234|           1|       10.5|  1.0|    0.5|       3.7|         0.0|                  0.3|        18.5|                 2.5|        0.0|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+
only showing top 5 rows

Now, read the taxi zone lookup data from the S3 bucket into an AWS Glue dynamic frame:

nyc_taxi_zone_lookup_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3", 
    connection_options = {
        "paths": ["s3://<your-s3-bucket-name>/taxi_zone_lookup/"]
    }, 
    format = "csv",
    format_options= {
        'withHeader': True
    },
    transformation_ctx = "nyc_taxi_zone_lookup_dyf"
)

Let’s count the number of rows, look at the schema and a few rows of the dataset.

Count the rows with the following code:

nyc_taxi_zone_lookup_df = nyc_taxi_zone_lookup_dyf.toDF()
nyc_taxi_zone_lookup_df.count()

We get the following response:

265

View the schema with the following code:

nyc_taxi_zone_lookup_df.printSchema()

We get the following response:

root
 |-- LocationID: string (nullable = true)
 |-- Borough: string (nullable = true)
 |-- Zone: string (nullable = true)
 |-- service_zone: string (nullable = true)

View a few rows with the following code:

nyc_taxi_zone_lookup_df.show(5)

We get the following response:

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
+----------+-------------+--------------------+------------+
only showing top 5 rows
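The schema above reports LocationID as a string because a CSV file carries no type information. As a plain-Python analogy (a stand-in using the standard csv module, not the AWS Glue reader), the sketch below reads two of the rows shown above and demonstrates that every value arrives as text until it is cast explicitly.

```python
import csv
import io

# Two rows copied from the output above, preceded by the header line.
SAMPLE_CSV = """LocationID,Borough,Zone,service_zone
1,EWR,Newark Airport,EWR
2,Queens,Jamaica Bay,Boro Zone
"""

# DictReader uses the first line as column names, similar in spirit to
# reading the file with a header option enabled.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

print(rows[0])
print(type(rows[0]["LocationID"]).__name__)  # every CSV value is read as text

# An explicit cast is needed before LocationID can be used as an integer key.
location_ids = [int(row["LocationID"]) for row in rows]
print(location_ids)
```

This is why the lookup table needs a type conversion before LocationID can be joined against the integer location IDs in the trip data.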

Based on the data dictionary, let’s recalibrate the data types of the attributes in both dynamic frames:

nyc_taxi_trip_apply_mapping_dyf = ApplyMapping.apply(
    frame = nyc_taxi_trip_input_dyf, 
    mappings = [
        ("VendorID","Long","VendorID","Integer"), 
        ("tpep_pickup_datetime","Timestamp","tpep_pickup_datetime","Timestamp"), 
        ("tpep_dropoff_datetime","Timestamp","tpep_dropoff_datetime","Timestamp"), 
        ("passenger_count","Double","passenger_count","Integer"), 
        ("trip_distance","Double","trip_distance","Double"),
        ("RatecodeID","Double","RatecodeID","Integer"), 
        ("store_and_fwd_flag","String","store_and_fwd_flag","String"), 
        ("PULocationID","Long","PULocationID","Integer"), 
        ("DOLocationID","Long","DOLocationID","Integer"),
        ("payment_type","Long","payment_type","Integer"), 
        ("fare_amount","Double","fare_amount","Double"),
        ("extra","Double","extra","Double"), 
        ("mta_tax","Double","mta_tax","Double"),
        ("tip_amount","Double","tip_amount","Double"), 
        ("tolls_amount","Double","tolls_amount","Double"), 
        ("improvement_surcharge","Double","improvement_surcharge","Double"), 
        ("total_amount","Double","total_amount","Double"), 
        ("congestion_surcharge","Double","congestion_surcharge","Double"), 
        ("airport_fee","Double","airport_fee","Double")
    ],
    transformation_ctx = "nyc_taxi_trip_apply_mapping_dyf"
)

nyc_taxi_zone_lookup_apply_mapping_dyf = ApplyMapping.apply(
    frame = nyc_taxi_zone_lookup_dyf, 
    mappings = [ 
        ("LocationID","String","LocationID","Integer"), 
        ("Borough","String","Borough","String"), 
        ("Zone","String","Zone","String"), 
        ("service_zone","String", "service_zone","String")
    ],
    transformation_ctx = "nyc_taxi_zone_lookup_apply_mapping_dyf"
)

Now let’s check their schema:

nyc_taxi_trip_apply_mapping_dyf.toDF().printSchema()

We get the following response:

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- airport_fee: double (nullable = true)

nyc_taxi_zone_lookup_apply_mapping_dyf.toDF().printSchema()

We get the following response:

root
 |-- LocationID: integer (nullable = true)
 |-- Borough: string (nullable = true)
 |-- Zone: string (nullable = true)
 |-- service_zone: string (nullable = true)
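To make the four-part mapping tuples easier to read, here is a small pure-Python sketch of the idea: each tuple names a source field, its source type, a target field, and a target type. The apply_mapping function and the CASTS table are hypothetical helpers written for this illustration; they are not the AWS Glue ApplyMapping implementation and cover only the three target types used here.

```python
from typing import Any, Callable, Dict, List, Tuple

# Converters for the target types used in this sketch (an assumed subset).
CASTS: Dict[str, Callable[[Any], Any]] = {
    "Integer": int,
    "Double": float,
    "String": str,
}

Mapping = Tuple[str, str, str, str]  # (source, source_type, target, target_type)


def apply_mapping(record: Dict[str, Any], mappings: List[Mapping]) -> Dict[str, Any]:
    """Rename and cast the mapped fields; missing or null values stay None."""
    out: Dict[str, Any] = {}
    for source, _source_type, target, target_type in mappings:
        value = record.get(source)
        out[target] = None if value is None else CASTS[target_type](value)
    return out


# Values taken from the first sample row of the trip data shown earlier.
record = {"VendorID": 2, "passenger_count": 1.0, "trip_distance": 1.13}
mappings = [
    ("VendorID", "Long", "VendorID", "Integer"),
    ("passenger_count", "Double", "passenger_count", "Integer"),
    ("trip_distance", "Double", "trip_distance", "Double"),
]

print(apply_mapping(record, mappings))
```

In this sketch, passenger_count changes from 1.0 to 1, which mirrors the double-to-integer change visible in the schema output.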

Let’s add the column trip_duration to calculate the duration of each trip in minutes to the taxi trip dynamic frame:

# Function to calculate trip duration in minutes
def trip_duration(start_timestamp,end_timestamp):
    minutes_diff = (end_timestamp - start_timestamp).total_seconds() / 60.0
    return(minutes_diff)

# Transformation function for each record
def transformRecord(rec):
    rec["trip_duration"] = trip_duration(rec["tpep_pickup_datetime"], rec["tpep_dropoff_datetime"])
    return rec

nyc_taxi_trip_final_dyf = Map.apply(
    frame = nyc_taxi_trip_apply_mapping_dyf, 
    f = transformRecord, 
    transformation_ctx = "nyc_taxi_trip_final_dyf"
)
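To sanity-check the duration arithmetic outside of AWS Glue, the following standalone sketch applies the same calculation to the pickup and drop-off times of the first sample row shown earlier. The None handling is an assumption added for illustration, in case a record is missing a timestamp; the function in the post does not include it.

```python
from datetime import datetime
from typing import Optional


def trip_duration_minutes(
    start: Optional[datetime], end: Optional[datetime]
) -> Optional[float]:
    """Return the trip length in minutes, or None if a timestamp is missing."""
    if start is None or end is None:
        return None
    return (end - start).total_seconds() / 60.0


# Pickup and drop-off times from the first sample row of the trip data.
pickup = datetime(2022, 1, 18, 15, 4, 43)
dropoff = datetime(2022, 1, 18, 15, 12, 51)

# 8 minutes and 8 seconds, or about 8.13 minutes.
print(trip_duration_minutes(pickup, dropoff))
print(trip_duration_minutes(pickup, None))
```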

Let’s count the number of rows, look at the schema and a few rows of the dataset after applying the above transformation.

Get a record count with the following code:

nyc_taxi_trip_final_df = nyc_taxi_trip_final_dyf.toDF()
nyc_taxi_trip_final_df.count()

We get the following response:

2463931

View the schema with the following code:

nyc_taxi_trip_final_df.printSchema()

We get the following response:

root
 |-- extra: double (nullable = true)
 |-- tpep_dropoff_datetime: timestamp (nullable = true)
 |-- trip_duration: double (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- airport_fee: double (nullable = true)
 |-- payment_type: integer (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- RatecodeID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp (nullable = true)
 |-- VendorID: integer (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- passenger_count: integer (nullable = true)

View a few rows with the following code:

nyc_taxi_trip_final_df.show(5)

We get the following response:

+-----+---------------------+------------------+-------------+-------+---------------------+------------+--------------------+------------+-----------+------------+-----------+----------+--------------------+--------+------------+----------+------------+------------------+---------------+
|extra|tpep_dropoff_datetime|     trip_duration|trip_distance|mta_tax|improvement_surcharge|DOLocationID|congestion_surcharge|total_amount|airport_fee|payment_type|fare_amount|RatecodeID|tpep_pickup_datetime|VendorID|PULocationID|tip_amount|tolls_amount|store_and_fwd_flag|passenger_count|
+-----+---------------------+------------------+-------------+-------+---------------------+------------+--------------------+------------+-----------+------------+-----------+----------+--------------------+--------+------------+----------+------------+------------------+---------------+
|  0.0|  2022-01-18 15:12:51| 8.133333333333333|         1.13|    0.5|                  0.3|         229|                 2.5|        10.3|        0.0|           2|        7.0|         1| 2022-01-18 15:04:43|       2|         141|       0.0|         0.0|                 N|              1|
|  0.0|  2022-01-18 15:15:52|              12.4|         1.36|    0.5|                  0.3|         142|                 2.5|       15.36|        0.0|           1|        9.5|         1| 2022-01-18 15:03:28|       2|         237|      2.56|         0.0|                 N|              2|
|  3.5|  2022-01-06 17:57:03| 7.683333333333334|          1.1|    0.5|                  0.3|         229|                 2.5|        11.3|        0.0|           2|        7.0|         1| 2022-01-06 17:49:22|       1|         161|       0.0|         0.0|                 N|              1|
|  0.5|  2022-01-09 20:04:14| 3.316666666666667|         0.56|    0.5|                  0.3|         230|                 2.5|        9.96|        0.0|           1|        4.5|         1| 2022-01-09 20:00:55|       2|         230|      1.66|         0.0|                 N|              1|
|  1.0|  2022-01-24 16:31:36|14.716666666666667|         2.02|    0.5|                  0.3|         234|                 2.5|        18.5|        0.0|           1|       10.5|         1| 2022-01-24 16:16:53|       2|         163|       3.7|         0.0|                 N|              1|
+-----+---------------------+------------------+-------------+-------+---------------------+------------+--------------------+------------+-----------+------------+-----------+----------+--------------------+--------+------------+----------+------------+------------------+---------------+
only showing top 5 rows

Next, load both the dynamic frames into our Amazon Redshift Serverless cluster:

nyc_taxi_trip_sink_dyf = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = nyc_taxi_trip_final_dyf, 
    catalog_connection = "redshiftServerless", 
    connection_options =  {"dbtable": "public.f_nyc_yellow_taxi_trip","database": "dev"}, 
    redshift_tmp_dir = "s3://aws-glue-assets-<AWS-account-ID>-us-east-1/temporary/", 
    transformation_ctx = "nyc_taxi_trip_sink_dyf"
)

nyc_taxi_zone_lookup_sink_dyf = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = nyc_taxi_zone_lookup_apply_mapping_dyf, 
    catalog_connection = "redshiftServerless", 
    connection_options = {"dbtable": "public.d_nyc_taxi_zone_lookup", "database": "dev"}, 
    redshift_tmp_dir = "s3://aws-glue-assets-<AWS-account-ID>-us-east-1/temporary/", 
    transformation_ctx = "nyc_taxi_zone_lookup_sink_dyf"
)

Now let’s validate the data loaded into the Amazon Redshift Serverless cluster by running a few queries in Amazon Redshift query editor v2. You can also use your preferred query editor.

First, we count the number of records and select a few rows in both the target tables (f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup):
```
SELECT 'f_nyc_yellow_taxi_trip' AS table_name, COUNT(1) FROM "public"."f_nyc_yellow_taxi_trip"
UNION ALL
SELECT 'd_nyc_taxi_zone_lookup' AS table_name, COUNT(1) FROM "public"."d_nyc_taxi_zone_lookup";
```
The number of records in f_nyc_yellow_taxi_trip (2,463,931) and d_nyc_taxi_zone_lookup (265) match the number of records in our input dynamic frame. This validates that all records from files in Amazon S3 have been successfully loaded into Amazon Redshift.

You can view some of the records for each table with the following commands:
```
SELECT * FROM public.f_nyc_yellow_taxi_trip LIMIT 10;
```
```
SELECT * FROM public.d_nyc_taxi_zone_lookup LIMIT 10;
```

One of the insights that we want to generate from the datasets is to get the top five routes with their trip duration. Let’s run the SQL for that on Amazon Redshift:

SELECT 
    CASE WHEN putzl.zone >= dotzl.zone 
        THEN putzl.zone || ' - ' || dotzl.zone 
        ELSE  dotzl.zone || ' - ' || putzl.zone 
    END AS "Route",
    COUNT(1) AS "Frequency",
    ROUND(SUM(trip_duration),1) AS "Total Trip Duration (mins)"
FROM 
    public.f_nyc_yellow_taxi_trip ytt
INNER JOIN 
    public.d_nyc_taxi_zone_lookup putzl ON ytt.pulocationid = putzl.locationid
INNER JOIN 
    public.d_nyc_taxi_zone_lookup dotzl ON ytt.dolocationid = dotzl.locationid
GROUP BY 
    "Route"
ORDER BY 
    "Frequency" DESC, "Total Trip Duration (mins)" DESC
LIMIT 5;
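The CASE expression in this query treats a trip from A to B and a trip from B to A as the same route by always placing the zone that sorts later first. The following Python sketch mirrors that logic on a handful of hypothetical trips (the durations are made up for illustration), assuming simple string comparison behaves like the comparison in the query.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def route_name(pickup_zone: str, dropoff_zone: str) -> str:
    """Mirror the CASE expression: the zone that sorts later goes first."""
    if pickup_zone >= dropoff_zone:
        return f"{pickup_zone} - {dropoff_zone}"
    return f"{dropoff_zone} - {pickup_zone}"


def top_routes(
    trips: List[Tuple[str, str, float]], limit: int = 5
) -> List[Tuple[str, int, float]]:
    """Group by route, then order by frequency and total duration, descending."""
    stats: Dict[str, List[float]] = defaultdict(lambda: [0, 0.0])
    for pickup_zone, dropoff_zone, minutes in trips:
        entry = stats[route_name(pickup_zone, dropoff_zone)]
        entry[0] += 1
        entry[1] += minutes
    rows = [(route, int(freq), round(total, 1)) for route, (freq, total) in stats.items()]
    rows.sort(key=lambda row: (-row[1], -row[2]))
    return rows[:limit]


# Hypothetical trips: (pickup zone, drop-off zone, trip duration in minutes).
trips = [
    ("Alphabet City", "Jamaica Bay", 30.0),
    ("Jamaica Bay", "Alphabet City", 28.5),
    ("Newark Airport", "Alphabet City", 45.0),
]

for row in top_routes(trips):
    print(row)
```

The first two trips run in opposite directions but are counted under one route, which is the effect of the CASE expression.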

Transform the notebook into an AWS Glue job and schedule it

Now that we have authored the code and tested its functionality, let’s save it as a job and schedule it.

Let’s first enable job bookmarks. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval.

Add the following magic command after the first cell that contains other magic commands initialized during authoring the code:
```
%%configure
{
    "--job-bookmark-option": "job-bookmark-enable"
}
```
To initialize job bookmarks, we run the following code with the name of the job as the default argument (myFirstGlueISProject for this post). Job bookmarks store the states for a job. You should always have job.init() in the beginning of the script and the job.commit() at the end of the script. These two functions are used to initialize the bookmark service and update the state change to the service. Bookmarks won’t work without calling them.

Add the following piece of code after the boilerplate code:

params = []
if '--JOB_NAME' in sys.argv:
    params.append('JOB_NAME')
args = getResolvedOptions(sys.argv, params)
if 'JOB_NAME' in args:
    jobname = args['JOB_NAME']
else:
    jobname = "myFirstGlueISProject"
job.init(jobname, args)
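The branching above exists because a notebook session does not receive a --JOB_NAME argument, whereas a scheduled job run does. The sketch below reproduces only that decision in plain Python so it can be tested locally; resolve_job_name is a hypothetical stand-in and does not use or replicate awsglue's getResolvedOptions.

```python
from typing import List

DEFAULT_JOB_NAME = "myFirstGlueISProject"


def resolve_job_name(argv: List[str], default: str = DEFAULT_JOB_NAME) -> str:
    """Return the value that follows --JOB_NAME, or the default when absent."""
    if "--JOB_NAME" in argv:
        index = argv.index("--JOB_NAME")
        if index + 1 < len(argv):
            return argv[index + 1]
    return default


# Interactive session: no --JOB_NAME argument, so the default is used.
print(resolve_job_name(["script.py"]))

# Job run: the argument is supplied (the value here is a made-up example).
print(resolve_job_name(["script.py", "--JOB_NAME", "nightly-load"]))
```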

Then comment out all the lines of code that were authored to verify the desired outcome and aren’t necessary for the job to deliver its purpose:

#nyc_taxi_trip_input_df = nyc_taxi_trip_input_dyf.toDF()
#nyc_taxi_trip_input_df.count()
#nyc_taxi_trip_input_df.printSchema()
#nyc_taxi_trip_input_df.show(5)

#nyc_taxi_zone_lookup_df = nyc_taxi_zone_lookup_dyf.toDF()
#nyc_taxi_zone_lookup_df.count()
#nyc_taxi_zone_lookup_df.printSchema()
#nyc_taxi_zone_lookup_df.show(5)

#nyc_taxi_trip_apply_mapping_dyf.toDF().printSchema()
#nyc_taxi_zone_lookup_apply_mapping_dyf.toDF().printSchema()

#nyc_taxi_trip_final_df = nyc_taxi_trip_final_dyf.toDF()
#nyc_taxi_trip_final_df.count()
#nyc_taxi_trip_final_df.printSchema()
#nyc_taxi_trip_final_df.show(5)

Save the notebook.

You can check the corresponding script on the Script tab. Note that job.commit() is automatically added at the end of the script.
Let’s run the notebook as a job.
First, truncate f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup tables in Amazon Redshift using the query editor v2 so that we don’t have duplicates in both the tables:
```
truncate "public"."f_nyc_yellow_taxi_trip";
truncate "public"."d_nyc_taxi_zone_lookup";
```
Choose Run to run the job.
You can check its status on the Runs tab. The job completed in less than 5 minutes with three G.1X workers (3 DPUs).
Let’s check the count of records in f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup tables in Amazon Redshift:
```
SELECT 'f_nyc_yellow_taxi_trip' AS table_name, COUNT(1) FROM "public"."f_nyc_yellow_taxi_trip"
UNION ALL
SELECT 'd_nyc_taxi_zone_lookup' AS table_name, COUNT(1) FROM "public"."d_nyc_taxi_zone_lookup";
```
With job bookmarks enabled, even if you run the job again with no new files in corresponding folders in the S3 bucket, it doesn’t process the same files again. The following screenshot shows a subsequent job run in my environment, which completed in less than 2 minutes because there were no new files to process.
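Conceptually, a bookmark is persisted state that records what earlier runs have already handled. The toy model below illustrates the idea with a set of object keys: files are recorded only when commit is called, which loosely corresponds to the role of job.commit(). This is a deliberately simplified sketch with hypothetical class and key names; it does not describe how AWS Glue implements job bookmarks internally.

```python
from typing import Iterable, List, Set


class FileBookmark:
    """Toy bookmark: remembers which object keys earlier runs committed."""

    def __init__(self) -> None:
        self._committed: Set[str] = set()
        self._pending: Set[str] = set()

    def new_files(self, listing: Iterable[str]) -> List[str]:
        """Return the keys that no committed run has processed yet."""
        fresh = sorted(key for key in listing if key not in self._committed)
        self._pending = set(fresh)
        return fresh

    def commit(self) -> None:
        """Record the pending keys as processed."""
        self._committed |= self._pending
        self._pending = set()


bookmark = FileBookmark()
listing = ["nyc_yellow_taxi/january.parquet"]  # hypothetical object key

print(bookmark.new_files(listing))  # first run: the file is new
bookmark.commit()

print(bookmark.new_files(listing))  # rerun: nothing left to process
bookmark.commit()

listing.append("nyc_yellow_taxi/february.parquet")  # hypothetical new file
print(bookmark.new_files(listing))  # only the new file is returned
```

If commit is never called, the next run sees the same files as new again, which is why the bookmark calls matter.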

Now let’s schedule the job.
On the Schedules tab, choose Create schedule.
For Name, enter a name (for example, myFirstGlueISProject-testSchedule).
For Frequency, choose Custom.
Enter a cron expression so the job runs every Monday at 6:00 AM.
Add an optional description.
Choose Create schedule.

The schedule has been saved and activated. You can edit, pause, resume, or delete the schedule from the Actions menu.
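For the cron expression in the schedule above, AWS Glue uses six fields (minutes, hours, day of month, month, day of week, year) evaluated in UTC, so every Monday at 6:00 AM can be written as 0 6 ? * MON *. The helper below is a hypothetical convenience for building such expressions; whether you need the surrounding cron(...) wrapper depends on where you enter the value, so the sketch shows both forms.

```python
VALID_DAYS = {"SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"}


def weekly_cron_fields(day_of_week: str, hour: int, minute: int = 0) -> str:
    """Build the six cron fields for a weekly schedule (times are in UTC)."""
    if day_of_week not in VALID_DAYS:
        raise ValueError(f"unknown day of week: {day_of_week}")
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Fields: minutes hours day-of-month month day-of-week year.
    # '?' means no specific day of month, because the day of week is set.
    return f"{minute} {hour} ? * {day_of_week} *"


def as_schedule_expression(fields: str) -> str:
    """Wrap the fields in cron(...), the form used in schedule expressions."""
    return f"cron({fields})"


fields = weekly_cron_fields("MON", 6)
print(fields)
print(as_schedule_expression(fields))
```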

Clean up

To avoid incurring future charges, delete the AWS resources you created.

Delete the AWS Glue job (myFirstGlueISProject for this post).
Delete the Amazon S3 objects and bucket (my-first-aws-glue-is-project-<random number> for this post).
Delete the AWS IAM policies and roles (AWSGlueInteractiveSessionPassRolePolicy, AmazonS3Access-MyFirstGlueISProject and AWSGlueServiceRole-GlueIS).
Delete the Amazon Redshift tables (f_nyc_yellow_taxi_trip and d_nyc_taxi_zone_lookup).
Delete the AWS Glue JDBC Connection (redshiftServerless).
Also delete the self-referencing Redshift Serverless security group, and Amazon S3 endpoint (if you created it while following the steps for this post).

Conclusion

In this post, we demonstrated how to do the following:

Set up an AWS Glue Jupyter notebook with interactive sessions
Use the notebook’s magics, including the AWS Glue connection onboarding and bookmarks
Read the data from Amazon S3, and transform and load it into Amazon Redshift Serverless
Configure magics to enable job bookmarks, save the notebook as an AWS Glue job, and schedule it using a cron expression

The goal of this post is to give you step-by-step fundamentals to get you going with AWS Glue Studio Jupyter notebooks and interactive sessions. You can set up an AWS Glue Jupyter notebook in minutes, start an interactive session in seconds, and greatly improve the development experience with AWS Glue jobs. Interactive sessions have a 1-minute billing minimum with cost control features that reduce the cost of developing data preparation applications. You can build and test applications from the environment of your choice, even on your local environment, using the interactive sessions backend.

Interactive sessions provide a faster, cheaper, and more flexible way to build and run data preparation and analytics applications. To learn more about interactive sessions, refer to Job development (interactive sessions), and start exploring a whole new development experience with AWS Glue. Additionally, check out the following posts to walk through more examples of using interactive sessions with different options:

About the Authors

Vikas Omer is a principal analytics specialist solutions architect at Amazon Web Services. Vikas has a strong background in analytics, customer experience management (CEM), and data monetization, with over 13 years of experience in the industry globally. With six AWS Certifications, including Analytics Specialty, he is a trusted analytics advocate to AWS customers and partners. He loves traveling, meeting customers, and helping them become successful in what they do.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver results like this post. In his spare time, he enjoys playing video games with his family.

Gal Heyne is a Product Manager for AWS Glue and has over 15 years of experience as a product manager, data engineer and data architect. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design elegant, powerful and easy to use data products. Gal has a Master’s degree in Data Science from UC Berkeley and she enjoys traveling, playing board games and going to music concerts.

Share and publish your Snowflake data to AWS Data Exchange using Amazon Redshift data sharing

2022-11-17 Raks Khare

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/share-and-publish-your-snowflake-data-to-aws-data-exchange-using-amazon-redshift-data-sharing/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Today, tens of thousands of AWS customers—from Fortune 500 companies, startups, and everything in between—use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

In this post, we discuss a customer who is currently using Snowflake to store analytics data. The customer needs to offer this data to clients who are using Amazon Redshift via AWS Data Exchange, the world’s most comprehensive service for third-party datasets. We explain in detail how to implement a fully integrated process that will automatically ingest data from Snowflake into Amazon Redshift and offer it to clients via AWS Data Exchange.

Overview of the solution

The solution consists of four high-level steps:

Configure Snowflake to push the changed data for identified tables into an Amazon Simple Storage Service (Amazon S3) bucket.
Use a custom-built Redshift Auto Loader to load this Amazon S3 landed data to Amazon Redshift.
Merge the data from the change data capture (CDC) S3 staging tables to Amazon Redshift tables.
Use Amazon Redshift data sharing to license the data to customers via AWS Data Exchange as a public or private offering.

The following diagram illustrates this workflow.

Prerequisites

To get started, you need the following prerequisites:

A Snowflake account in the same Region as your Amazon Redshift cluster.
An S3 bucket. Refer to Create your first S3 bucket for more details.
An Amazon Redshift cluster with encryption enabled and an AWS Identity and Access Management (IAM) role with permission to the S3 bucket. See Create a sample Amazon Redshift cluster and Create an IAM role for Amazon Redshift for more details.
A database schema from Snowflake to Amazon Redshift that is migrated using the AWS Schema Conversion Tool (AWS SCT). For more information, refer to Accelerate Snowflake to Amazon Redshift migration using AWS Schema Conversion Tool.
An IAM role and external Amazon S3 stage for Snowflake access to the S3 bucket you created earlier. For instructions, refer to Configuring Secure Access to Amazon S3. Name this external stage unload_to_s3, pointing to the s3-redshift-loader-source folder of the target S3 bucket. It will be referenced in COPY commands later in this post for offloading the data to Amazon S3. Once created, you should see an external stage created as shown in the following screenshot.
You must be a registered provider on AWS Data Exchange. For more information, see Providing data products on AWS Data Exchange.

Configure Snowflake to track the changed data and unload it to Amazon S3

In Snowflake, identify the tables that you need to replicate to Amazon Redshift. For the purpose of this demo, we use the data in the TPCH_SF1 schema’s Customer, LineItem, and Orders tables of the SNOWFLAKE_SAMPLE_DATA database, which comes out of the box with your Snowflake account.

Make sure that the Snowflake external stage name unload_to_s3 created in the prerequisites is pointing to the S3 prefix s3-redshift-loader-source created in the previous step.
Create a new schema BLOG_DEMO in the DEMO_DB database:

CREATE SCHEMA demo_db.blog_demo;

Duplicate the Customer, LineItem, and Orders tables in the TPCH_SF1 schema to the BLOG_DEMO schema:

CREATE TABLE CUSTOMER AS 
SELECT * FROM snowflake_sample_data.tpch_sf1.CUSTOMER;
CREATE TABLE ORDERS AS
SELECT * FROM snowflake_sample_data.tpch_sf1.ORDERS;
CREATE TABLE LINEITEM AS 
SELECT * FROM snowflake_sample_data.tpch_sf1.LINEITEM;

Verify that the tables have been duplicated successfully:

SELECT table_catalog, table_schema, table_name, row_count, bytes
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'BLOG_DEMO'
ORDER BY ROW_COUNT;


Create table streams to track data manipulation language (DML) changes made to the tables, including inserts, updates, and deletes:

CREATE OR REPLACE STREAM CUSTOMER_CHECK ON TABLE CUSTOMER;
CREATE OR REPLACE STREAM ORDERS_CHECK ON TABLE ORDERS;
CREATE OR REPLACE STREAM LINEITEM_CHECK ON TABLE LINEITEM;

Perform DML changes to the tables (for this post, we run UPDATE on all tables and MERGE on the customer table):

UPDATE customer 
SET c_comment = 'Sample comment for blog demo' 
WHERE c_custkey between 0 and 10; 
UPDATE orders 
SET o_comment = 'Sample comment for blog demo' 
WHERE o_orderkey between 1800001 and 1800010; 
UPDATE lineitem 
SET l_comment = 'Sample comment for blog demo' 
WHERE l_orderkey between 3600001 and 3600010;

MERGE INTO customer c 
USING 
( 
SELECT n_nationkey 
FROM snowflake_sample_data.tpch_sf1.nation s 
WHERE n_name = 'UNITED STATES') n 
ON n.n_nationkey = c.c_nationkey 
WHEN MATCHED THEN UPDATE SET c.c_comment = 'This is US based customer1';

Validate that the stream tables have recorded all changes:
```
SELECT * FROM CUSTOMER_CHECK; 
SELECT * FROM ORDERS_CHECK; 
SELECT * FROM LINEITEM_CHECK;
```
For example, we can query the following customer key value to verify how the events were recorded for the MERGE statement on the customer table:

SELECT * FROM CUSTOMER_CHECK where c_custkey = 60027;

We can see the METADATA$ISUPDATE column as TRUE, and we see DELETE followed by INSERT in the METADATA$ACTION column.
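
To make the stream output concrete, the following Python sketch models how an UPDATE shows up as a DELETE/INSERT pair. The row dictionaries are hypothetical stand-ins for rows returned by a query on CUSTOMER_CHECK, and the second customer key is invented for illustration; only the METADATA$ACTION and METADATA$ISUPDATE column names come from the stream itself.

```python
from collections import defaultdict


def classify_changes(stream_rows, key="C_CUSTKEY"):
    """Summarize stream rows per key as 'insert', 'delete', 'update', or 'mixed'."""
    by_key = defaultdict(list)
    for row in stream_rows:
        by_key[row[key]].append(row)

    summary = {}
    for k, rows in by_key.items():
        actions = {r["METADATA$ACTION"] for r in rows}
        is_update = any(r["METADATA$ISUPDATE"] for r in rows)
        if is_update and actions == {"DELETE", "INSERT"}:
            # An update is recorded as the old row (DELETE) plus the new row (INSERT).
            summary[k] = "update"
        elif actions == {"INSERT"}:
            summary[k] = "insert"
        elif actions == {"DELETE"}:
            summary[k] = "delete"
        else:
            summary[k] = "mixed"
    return summary


# Hypothetical sample rows; real stream rows also carry all table columns.
rows = [
    {"C_CUSTKEY": 60027, "METADATA$ACTION": "DELETE", "METADATA$ISUPDATE": True},
    {"C_CUSTKEY": 60027, "METADATA$ACTION": "INSERT", "METADATA$ISUPDATE": True},
    {"C_CUSTKEY": 70001, "METADATA$ACTION": "INSERT", "METADATA$ISUPDATE": False},
]
print(classify_changes(rows))
```

This pairing is why the merge step later in this post can delete every staged key from the target table and then re-insert only the INSERT rows.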

Run the COPY command to offload the CDC data from the stream tables to the S3 bucket using the external stage unload_to_s3. In the following code, we also copy the data to S3 folders ending with _stg so that when Redshift Auto Loader automatically creates these tables in Amazon Redshift, they are created and marked as staging tables:

COPY INTO @unload_to_s3/customer_stg/
FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.customer_check)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE HEADER = TRUE;

COPY INTO @unload_to_s3/orders_stg/
FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.orders_check)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE HEADER = TRUE;

COPY INTO @unload_to_s3/lineitem_stg/ 
FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.lineitem_check) 
FILE_FORMAT = (TYPE = PARQUET) 
OVERWRITE = TRUE HEADER = TRUE;

Verify the data in the S3 bucket. There will be three sub-folders created in the s3-redshift-loader-source folder of the S3 bucket, and each will have .parquet data files.

You can also automate the preceding COPY commands using tasks, which can be scheduled to run at a set frequency for automatic copy of CDC data from Snowflake to Amazon S3.

Use the ACCOUNTADMIN role to assign the EXECUTE TASK privilege. In this scenario, we’re assigning the privileges to the SYSADMIN role:
```
USE ROLE accountadmin;
GRANT EXECUTE TASK, EXECUTE MANAGED TASK ON ACCOUNT TO ROLE sysadmin;
```

Use the SYSADMIN role to create three separate tasks to run three COPY commands every 5 minutes:

USE ROLE sysadmin;

/* Task to offload Customer CDC table */
CREATE TASK sf_rs_customer_cdc
WAREHOUSE = SMALL
SCHEDULE = 'USING CRON */5 * * * * UTC'
AS
COPY INTO @unload_to_s3/customer_stg/
FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.customer_check)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE HEADER = TRUE;

/* Task to offload Orders CDC table */
CREATE TASK sf_rs_orders_cdc
WAREHOUSE = SMALL
SCHEDULE = 'USING CRON */5 * * * * UTC'
AS
COPY INTO @unload_to_s3/orders_stg/
FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.orders_check)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE HEADER = TRUE;

/* Task to offload Lineitem CDC table */
CREATE TASK sf_rs_lineitem_cdc
WAREHOUSE = SMALL
SCHEDULE = 'USING CRON */5 * * * * UTC'
AS
COPY INTO @unload_to_s3/lineitem_stg/
FROM (select *, sysdate() as LAST_UPDATED_TS from demo_db.blog_demo.lineitem_check)
FILE_FORMAT = (TYPE = PARQUET)
OVERWRITE = TRUE HEADER = TRUE;

When the tasks are first created, they’re in a SUSPENDED state.

Alter the three tasks and set them to RESUME state:

ALTER TASK sf_rs_customer_cdc RESUME;
ALTER TASK sf_rs_orders_cdc RESUME;
ALTER TASK sf_rs_lineitem_cdc RESUME;

Validate that all three tasks have been resumed successfully:

SHOW TASKS;

Now the tasks will run every 5 minutes and look for new data in the stream tables to offload to Amazon S3.

As soon as data is migrated from Snowflake to Amazon S3, Redshift Auto Loader automatically infers the schema and instantly creates corresponding tables in Amazon Redshift. Then, by default, it starts loading data from Amazon S3 to Amazon Redshift every 5 minutes. You can also change the default setting of 5 minutes.

On the Amazon Redshift console, launch the query editor v2 and connect to your Amazon Redshift cluster.
Browse to the dev database, public schema, and expand Tables.
You can see three staging tables created with the same name as the corresponding folders in Amazon S3.
Validate the data in one of the tables by running the following query:

SELECT * FROM "dev"."public"."customer_stg";
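
The minute field of a task's cron schedule determines how often the task fires. As a rough illustration, the following Python sketch expands a minute field into the minutes of the hour it matches. It handles only `*`, a single number, and `*/n`, which is a simplification of full cron syntax, and the helper name is hypothetical.

```python
def minutes_matched(minute_field):
    """Return the minutes of an hour (0-59) matched by a cron minute field.

    Supports only '*', 'N', and '*/N'; real cron also allows lists and ranges.
    """
    if minute_field == "*":
        return list(range(60))
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        return list(range(0, 60, step))
    return [int(minute_field)]


print(len(minutes_matched("*/5")))  # number of runs per hour with a step of 5
print(minutes_matched("5"))         # a bare number matches a single minute
```

In other words, a schedule meant to fire every 5 minutes needs `*/5` in the minute field; a bare `5` fires once per hour, at five minutes past.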

Configure the Redshift Auto Loader utility

The Redshift Auto Loader makes data ingestion to Amazon Redshift significantly easier because it automatically loads data files from Amazon S3 to Amazon Redshift. The files are mapped to the respective tables by simply dropping files into preconfigured locations on Amazon S3. For more details about the architecture and internal workflow, refer to the GitHub repo.

We use an AWS CloudFormation template to set up Redshift Auto Loader. Complete the following steps:

Launch the CloudFormation template.
Choose Next.
For Stack name, enter a name.

Provide the parameters listed in the following table.

CloudFormation Template Parameter	Allowed Values	Description
`RedshiftClusterIdentifier`	Amazon Redshift cluster identifier	Enter the Amazon Redshift cluster identifier.
`DatabaseUserName`	Database user name in the Amazon Redshift cluster	The Amazon Redshift database user name that has access to run the SQL script.
`DatabaseName`	Database name in Amazon Redshift	The name of the Amazon Redshift primary database where the SQL script is run.
`DatabaseSchemaName`	Schema name in Amazon Redshift	The Amazon Redshift schema name where the tables are created.
`RedshiftIAMRoleARN`	Default or the valid IAM role ARN attached to the Amazon Redshift cluster	The IAM role ARN associated with the Amazon Redshift cluster. If your default IAM role is set for the cluster and has access to your S3 bucket, leave it at the default.
`CopyCommandOptions`	Copy option; default is delimiter '|' gzip	Provide the additional COPY command data format parameters. If InitiateSchemaDetection = Yes, then the process attempts to detect the schema and automatically set the suitable COPY command options. If schema detection fails or InitiateSchemaDetection = No, then this value is used as the default COPY command options to load data.
`SourceS3Bucket`	S3 bucket name	The S3 bucket where the data is stored. Make sure the IAM role that is associated to the Amazon Redshift cluster has access to this bucket.
`InitiateSchemaDetection`	Yes/No	Set to Yes to dynamically detect the schema prior to file load and create a table in Amazon Redshift if it doesn’t exist already. If a table already exists, then it won’t drop or recreate the table in Amazon Redshift. If schema detection fails, the process uses the default COPY options as specified in `CopyCommandOptions`.

The Redshift Auto Loader uses the COPY command to load data into Amazon Redshift. For this post, set CopyCommandOptions as follows, and configure any supported COPY command options:

delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'


Choose Next.
Accept the default values on the next page and choose Next.
Select the acknowledgement check box and choose Create stack.
Monitor the progress of the stack creation and wait until it is complete.
To verify the Redshift Auto Loader configuration, sign in to the Amazon S3 console and navigate to the S3 bucket you provided.
You should see that a new directory, s3-redshift-loader-source, has been created.

Copy all the data files exported from Snowflake under s3-redshift-loader-source.
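
The relationship between InitiateSchemaDetection and CopyCommandOptions described in the parameter table can be sketched in Python as follows. This illustrates the documented behavior only and is not the utility's actual code; `detect_options` is a hypothetical callable standing in for schema detection, and the detected value shown is just an example.

```python
def resolve_copy_options(initiate_schema_detection, default_options, detect_options):
    """Pick COPY options: detected ones if detection is on and succeeds,
    otherwise the CopyCommandOptions default."""
    if initiate_schema_detection == "Yes":
        try:
            return detect_options()
        except Exception:
            # Detection failed: fall back to the configured default.
            return default_options
    return default_options


DEFAULT = "delimiter '|' dateformat 'auto' TIMEFORMAT 'auto'"


def failing_detection():
    # Hypothetical failure mode used only to exercise the fallback path.
    raise ValueError("schema could not be inferred")


print(resolve_copy_options("Yes", DEFAULT, lambda: "FORMAT AS PARQUET"))
print(resolve_copy_options("Yes", DEFAULT, failing_detection))
print(resolve_copy_options("No", DEFAULT, lambda: "FORMAT AS PARQUET"))
```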

Merge the data from the CDC S3 staging tables to Amazon Redshift tables

To merge your data from Amazon S3 to Amazon Redshift, complete the following steps:

Create a temporary staging table merge_stg and insert all the rows from the S3 staging table that have metadata$action as INSERT, using the following code. This includes all the new inserts as well as the updates.
```
CREATE TEMP TABLE merge_stg
AS
SELECT * FROM
(
SELECT *, DENSE_RANK() OVER (PARTITION BY c_custkey ORDER BY last_updated_ts DESC) AS rnk
FROM customer_stg
)
WHERE rnk = 1 AND metadata$action = 'INSERT';
```
The preceding code uses the window function DENSE_RANK() to select the latest entries for a given c_custkey: it assigns a rank to each row for that c_custkey, ordering the rows by last_updated_ts in descending order. We then select the rows with rnk = 1 and metadata$action = 'INSERT' to capture all the inserts.
Use the S3 staging table customer_stg to delete the records from the base table customer, which are marked as deletes or updates:
```
DELETE FROM customer 
USING customer_stg 
WHERE customer.c_custkey = customer_stg.c_custkey;
```
This deletes all the rows that are present in the CDC S3 staging table, which takes care of rows marked for deletion and updates.

Use the temporary staging table merge_stg to insert the records marked for updates or inserts:

INSERT INTO customer 
SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment 
FROM merge_stg;

Truncate the staging table, because we have already updated the target table:

truncate customer_stg;

You can also run the preceding steps as a stored procedure:

CREATE OR REPLACE PROCEDURE merge_customer()
AS $$
BEGIN
/*CREATING TEMP TABLE TO GET THE MOST LATEST RECORDS FOR UPDATES/NEW INSERTS*/
CREATE TEMP TABLE merge_stg AS
SELECT * FROM
(
SELECT *, DENSE_RANK() OVER (PARTITION BY c_custkey ORDER BY last_updated_ts DESC ) AS rnk
FROM customer_stg
)
WHERE rnk = 1 AND metadata$action = 'INSERT';
/* DELETING FROM THE BASE TABLE USING THE CDC STAGING TABLE ALL THE RECORDS MARKED AS DELETES OR UPDATES*/
DELETE FROM customer
USING customer_stg
WHERE customer.c_custkey = customer_stg.c_custkey;
/*INSERTING NEW/UPDATED RECORDS IN THE BASE TABLE*/ 
INSERT INTO customer
SELECT c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment
FROM merge_stg;
truncate customer_stg;
END;
$$ LANGUAGE plpgsql;
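
To see the effect of the three merge steps without a cluster, here is a minimal in-memory Python sketch of the same logic. It is an illustration under simplifying assumptions: tables are lists of dictionaries, timestamps are plain integers, the rows carry only two business columns, and the sample data is invented.

```python
def merge_cdc(target, staging, key="c_custkey"):
    """Apply CDC staging rows to target rows and return the new target."""
    # Step 1: per key, keep rows from the latest unload (rank 1) that are INSERTs.
    latest_ts = {}
    for row in staging:
        k = row[key]
        latest_ts[k] = max(latest_ts.get(k, row["last_updated_ts"]), row["last_updated_ts"])
    merge_stg = [
        r for r in staging
        if r["last_updated_ts"] == latest_ts[r[key]] and r["metadata$action"] == "INSERT"
    ]
    # Step 2: delete every target row whose key appears in the staging table.
    staged_keys = {r[key] for r in staging}
    result = [r for r in target if r[key] not in staged_keys]
    # Step 3: insert the new and updated rows, dropping the CDC bookkeeping columns.
    for r in merge_stg:
        result.append({c: v for c, v in r.items()
                       if c not in ("metadata$action", "last_updated_ts")})
    return result


# Invented sample data: one update (74360), one delete (2), one new row (3).
target = [
    {"c_custkey": 74360, "c_comment": "old"},
    {"c_custkey": 1, "c_comment": "keep"},
    {"c_custkey": 2, "c_comment": "gone"},
]
staging = [
    {"c_custkey": 74360, "c_comment": "old", "metadata$action": "DELETE", "last_updated_ts": 1},
    {"c_custkey": 74360, "c_comment": "new", "metadata$action": "INSERT", "last_updated_ts": 1},
    {"c_custkey": 2, "c_comment": "gone", "metadata$action": "DELETE", "last_updated_ts": 1},
    {"c_custkey": 3, "c_comment": "brand new", "metadata$action": "INSERT", "last_updated_ts": 1},
]
merged = merge_cdc(target, staging)
print(sorted(r["c_custkey"] for r in merged))
```

Deleting by staged key before inserting is what lets one pass handle deletes and updates together: a deleted customer has no INSERT row to put back, while an updated customer is replaced by its newest version.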

For example, let’s look at the before and after states of the customer table when there’s been a change in data for a particular customer.

The following screenshot shows the new changes recorded in the customer_stg table for c_custkey = 74360.
We can see two records for the customer with c_custkey = 74360: one with metadata$action as DELETE and one with metadata$action as INSERT. That means the record with this c_custkey was updated at the source, and these changes need to be applied to the target customer table in Amazon Redshift.

The following screenshot shows the current state of the customer table before these changes have been merged using the preceding stored procedure:

Now, to update the target table, we can run the stored procedure as follows:

CALL merge_customer();

The following screenshot shows the final state of the target table after the stored procedure is complete.

Run the stored procedure on a schedule

You can also run the stored procedure on a schedule via Amazon EventBridge. The scheduling steps are as follows:

On the EventBridge console, choose Create rule.
For Name, enter a meaningful name, for example, Trigger-Snowflake-Redshift-CDC-Merge.
For Event bus, choose default.
For Rule Type, select Schedule.
Choose Next.
For Schedule pattern, select A schedule that runs at a regular rate, such as every 10 minutes.
For Rate expression, enter Value as 5 and choose Unit as Minutes.
Choose Next.
For Target types, choose AWS service.
For Select a Target, choose Redshift cluster.
For Cluster, choose the Amazon Redshift cluster identifier.
For Database name, choose dev.
For Database user, enter a user name with access to run the stored procedure. It uses temporary credentials to authenticate.
Optionally, you can also use AWS Secrets Manager for authentication.
For SQL statement, enter CALL merge_customer().
For Execution role, select Create a new role for this specific resource.
Choose Next.
Review the rule parameters and choose Create rule.

After the rule has been created, it automatically triggers the stored procedure in Amazon Redshift every 5 minutes to merge the CDC data into the target table.
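
If you prefer to script the rule instead of using the console, the same configuration can be approximated with the AWS SDK for Python (Boto3). The sketch below only builds the request parameters for the EventBridge `put_rule` and `put_targets` calls. The ARNs, account ID, user name, and target Id are placeholders, not values from this post, and the IAM role is assumed to already exist with permission to run statements on the cluster.

```python
def build_schedule_rule(cluster_arn, role_arn, db_user, database="dev",
                        rule_name="Trigger-Snowflake-Redshift-CDC-Merge"):
    """Build keyword arguments for EventBridge put_rule and put_targets."""
    rule = {
        "Name": rule_name,
        "ScheduleExpression": "rate(5 minutes)",
        "State": "ENABLED",
    }
    targets = {
        "Rule": rule_name,
        "Targets": [{
            "Id": "merge-customer",  # placeholder target identifier
            "Arn": cluster_arn,
            "RoleArn": role_arn,
            "RedshiftDataParameters": {
                "Database": database,
                "DbUser": db_user,
                "Sql": "CALL merge_customer();",
            },
        }],
    }
    return rule, targets


def create_rule(rule, targets):
    """Send the requests; needs boto3 and AWS credentials, so it is not called here."""
    import boto3
    events = boto3.client("events")
    events.put_rule(**rule)
    events.put_targets(**targets)


# Placeholder ARNs and user name for illustration only.
rule, targets = build_schedule_rule(
    cluster_arn="arn:aws:redshift:us-east-1:111122223333:cluster:example-cluster",
    role_arn="arn:aws:iam::111122223333:role/example-eventbridge-role",
    db_user="example_user",
)
print(rule["ScheduleExpression"])
```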

Configure Amazon Redshift to share the identified data with AWS Data Exchange

Now that you have the data stored inside Amazon Redshift, you can publish it to customers using AWS Data Exchange.

In Amazon Redshift, using any query editor, create the data share and add the tables to be shared:

CREATE DATASHARE salesshare MANAGEDBY ADX;
ALTER DATASHARE salesshare ADD SCHEMA tpch_sf1;
ALTER DATASHARE salesshare ADD TABLE tpch_sf1.customer;


On the AWS Data Exchange console, create your dataset.
Select Amazon Redshift datashare.
Create a revision in the dataset.
Add assets to the revision (in this case, the Amazon Redshift data share).
Finalize the revision.

After you create the dataset, you can publish it to the public catalog or directly to customers as a private product. For instructions on how to create and publish products, refer to NEW – AWS Data Exchange for Amazon Redshift.

Clean up

To avoid incurring future charges, complete the following steps:

Delete the CloudFormation stack used to create the Redshift Auto Loader.
Delete the Amazon Redshift cluster created for this demonstration.
If you were using an existing cluster, drop the created external table and external schema.
Delete the S3 bucket you created.
Delete the Snowflake objects you created.

Conclusion

In this post, we demonstrated how you can set up a fully integrated process that continuously replicates data from Snowflake to Amazon Redshift and then uses Amazon Redshift to offer data to downstream clients over AWS Data Exchange. You can use the same architecture for other purposes, such as sharing data with other Amazon Redshift clusters within the same account, cross-accounts, or even cross-Regions if needed.

About the Authors

Raks Khare is an Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers architect data analytics solutions at scale on the AWS platform.

Ekta Ahuja is a Senior Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Ahmed Shehata is a Senior Analytics Specialist Solutions Architect at AWS based in Toronto. He has more than two decades of experience helping customers modernize their data platforms. Ahmed is passionate about helping customers build efficient, performant, and scalable analytics solutions.

You can now assign multiple MFA devices in IAM

2022-11-17 Liam Wadman

Post Syndicated from Liam Wadman original https://aws.amazon.com/blogs/security/you-can-now-assign-multiple-mfa-devices-in-iam/

At Amazon Web Services (AWS), security is our top priority, and configuring multi-factor authentication (MFA) on accounts is an important step in securing your organization.

Now, you can add multiple MFA devices to AWS account root users and AWS Identity and Access Management (IAM) users in your AWS accounts. This helps you to raise the security bar in your accounts and limit access management to highly privileged principals, such as root users. Previously, you could only have one MFA device associated with root users or IAM users, but now you can associate up to eight MFA devices of the currently supported types with root users and IAM users.

In this blog post, we review the current MFA features for IAM, share use cases for multiple MFA devices, and show you how to manage and sign in with the additional MFA devices for better resiliency and flexibility.

Overview of MFA for IAM

First, let’s recap some of the benefits and available MFA configurations for IAM.

The use of MFA is an important security best practice on AWS. With MFA, you have an additional layer of protection to help prevent unauthorized individuals from gaining access to your systems and data. MFA can help protect your AWS environments if a password associated with your root user or IAM user became compromised.

As a security best practice, AWS recommends that you avoid using root users or IAM users to manage access to your accounts. Instead, you should use AWS IAM Identity Center (successor to AWS Single Sign-On) to manage access to your accounts. You should only use root users for tasks that they are required for.

To help meet different customer needs, AWS supports three types of MFA devices for IAM, including FIDO security keys, virtual authenticator applications, and time-based one-time password (TOTP) hardware tokens. You should select the device type that aligns with your security and operational requirements. You can associate different types of MFA devices with an IAM principal.

Use cases for multiple MFA devices

There are several use cases in which associating multiple MFA devices with an IAM principal is beneficial to the security and operational efficiency of your organization, such as the following:

In the event of a lost, stolen, or inaccessible MFA device, you can use one of the remaining MFA devices to access the account without performing the AWS account recovery procedure. If an MFA device is lost or stolen, it’s best practice to disassociate the lost or stolen device from the root users or IAM users that it’s associated with.
Geographically dispersed teams, or teams working remotely, can use hardware-based MFA to access AWS, without shipping a single hardware device or coordinating a physical exchange of a single hardware device between team members.
If the holder of an MFA device isn’t available, you can maintain access to your root users and IAM users by using a different MFA device associated with an IAM principal.
You can store additional MFA devices in a secure physical location, such as a vault or safe, while retaining physical access to another MFA device for redundancy.

How to manage multiple MFA devices in IAM

You can register up to eight MFA devices, in any combination of the currently supported MFA types, with your root users and IAM users.

To register an MFA device

Sign in to the AWS Management Console and do the following:
- For a root user, choose My Security Credentials.
- For an IAM user, choose Security credentials.
For Multi-factor authentication (MFA), choose Assign MFA device.
Select the type of MFA device that you want to use and then choose Next.

With multiple MFA devices, you only need one MFA device to sign in to the console or to create a session through the AWS Command Line Interface (AWS CLI) as that principal.

You don’t need to make permissions changes in order for your organization to start taking advantage of multiple MFA devices. The root users and IAM users in your accounts that manage MFA devices today can use their existing IAM permissions to enable additional MFA devices.

Changes to CloudTrail log entries

In support of this new feature, the identifier of the MFA device used will now be added to the console sign-in events of the root user and IAM user that use MFA. With these changes to AWS CloudTrail log entries, you can now view both the user and the MFA device used to authenticate to AWS. This provides better traceability and auditability for your accounts.

You can find this information in the MFAIdentifier field in CloudTrail, within additionalEventData. You don’t need to take action for this information to be logged. The following is a sample log from CloudTrail that includes the MFAIdentifier.

"additionalEventData": {
    "LoginTo": "https://console.aws.amazon.com/console/home?state=hashArgs%23&isauthcode=true",
    "MobileVersion": "No",
    "MFAIdentifier": "arn:aws:iam::111122223333:mfa/root-account-mfa-device",
    "MFAUsed": "YES"
}

The identifier of the MFA device used for AWS CLI sessions with the sts:GetSessionToken action is logged in the requestParameters field.

"requestParameters": {
    "serialNumber": "arn:aws:iam::111122223333:mfa/root-account-mfa-device"
}
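
When you audit these events programmatically, both locations can be checked with a small helper. The following Python sketch assumes each CloudTrail record has already been parsed into a dictionary; the function name is hypothetical, and the sample events contain only the fields shown above.

```python
def mfa_device_used(event):
    """Return the MFA device ARN recorded in a CloudTrail event, or None."""
    # Console sign-in events carry the device under additionalEventData.
    extra = event.get("additionalEventData") or {}
    if "MFAIdentifier" in extra:
        return extra["MFAIdentifier"]
    # sts:GetSessionToken events carry it under requestParameters.
    params = event.get("requestParameters") or {}
    return params.get("serialNumber")


console_event = {
    "additionalEventData": {
        "MobileVersion": "No",
        "MFAIdentifier": "arn:aws:iam::111122223333:mfa/root-account-mfa-device",
        "MFAUsed": "YES",
    }
}
cli_event = {
    "requestParameters": {
        "serialNumber": "arn:aws:iam::111122223333:mfa/root-account-mfa-device"
    }
}
print(mfa_device_used(console_event))
print(mfa_device_used(cli_event))
```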

Sign-in experience with multiple MFA devices

In this section, we’ll show you how to sign in to the console as an IAM principal with multiple MFA devices associated with it.

To authenticate as an IAM principal with multiple MFA devices

Sign in to the IAM console as an IAM principal.
Authenticate with the principal’s password.
For Additional verification required, select the type of MFA device that you want to use to continue authenticating, and then choose Next:

Figure 1: MFA device selection when authenticating to the console as an IAM user or root user with different types of MFA devices available
You will then be prompted to authenticate with the type of device that you selected.

Figure 2: Prompt to authenticate with a FIDO security key

Conclusion

In this blog post, you learned about the new multiple MFA devices feature in IAM, and how to set up and manage multiple MFA devices in IAM. Associating multiple MFA devices with your root users and IAM users can make it simpler for you to manage access to them. This feature is available now for AWS customers, except for customers operating in AWS GovCloud (US) Regions or in the AWS China Regions. For more information about how to configure multiple MFA devices on your root users and IAM users, see the documentation on MFA in IAM. There is no extra charge to use MFA devices in IAM.

AWS offers a free MFA security key to eligible AWS account owners in the United States. To determine eligibility and order a key, see the ordering portal.

If you have questions, post them in the AWS Identity and Access Management re:Post topic or reach out to AWS Support.


Running AI-ML Object Detection Model to Process Confidential Data using Nitro Enclaves

2022-11-15 Sheila Busser

Post Syndicated from Sheila Busser original https://aws.amazon.com/blogs/compute/running-ai-ml-object-detection-model-to-process-confidential-data-using-nitro-enclaves/

This blog post was written by Antoine Awad, Solutions Architect; Kevin Taylor, Senior Solutions Architect; and Joel Desaulniers, Senior Solutions Architect.

Machine Learning (ML) models are used for inferencing of highly sensitive data in many industries such as government, healthcare, financial, and pharmaceutical. These industries require tools and services that protect their data in transit and at rest, and that isolate data while in use. During processing, threats may originate from the technology stack, such as the operating system or programs installed on the host, and we need to protect against them. Having a process that enforces the separation of roles and responsibilities within an organization minimizes the ability of personnel to access sensitive data. In this post, we walk you through how to run ML inference inside AWS Nitro Enclaves to illustrate how your sensitive data is protected during processing.

We are using a Nitro Enclave to run ML inference on sensitive data which helps reduce the attack surface area when the data is decrypted for processing. Nitro Enclaves enable you to create isolated compute environments within Amazon EC2 instances to protect and securely process highly sensitive data. Enclaves have no persistent storage, no interactive access, and no external networking. Communication between your instance and your enclave is done using a secure local channel called a vsock. By default, even an admin or root user on the parent instance will not be able to access the enclave.

Overview

Our example use-case demonstrates how to deploy an AI/ML workload and run inferencing inside Nitro Enclaves to securely process sensitive data. We use an image to demonstrate the process of how data can be encrypted, stored, transferred, decrypted and processed when necessary, to minimize the risk to your sensitive data. The workload uses an open-source AI/ML model to detect objects in an image, representing the sensitive data, and returns a summary of the type of objects detected. The image below is used for illustration purposes to provide clarity on the inference that occurs inside the Nitro Enclave. It was generated by adding bounding boxes to the original image based on the coordinates returned by the AI/ML model.

Figure 1 – Image of airplanes with bounding boxes

To encrypt this image, we use a Python script (the Encryptor app; see Figure 2) that runs on an EC2 instance. In a real-world scenario, this step would be performed in a secure environment, such as a Nitro Enclave or a secured workstation, before transferring the encrypted data. The Encryptor app uses AWS KMS envelope encryption with a symmetric customer master key (CMK) to encrypt the data.

Figure 2 – Image Encryption with AWS KMS using Envelope Encryption

Note, it’s also possible to use asymmetrical keys to perform the encryption/decryption.

Now that the image is encrypted, let’s look at each component and its role in the solution architecture (see Figure 3 for reference).

The Client app reads the encrypted image file and sends it to the Server app over the vsock (secure local communication channel).
The Server app, running inside a Nitro Enclave, extracts the encrypted data key and sends it to AWS KMS for decryption. Once the data key is decrypted, the Server app uses it to decrypt the image and run inference on it to detect the objects in the image. Once the inference is complete, the results are returned to the Client app without exposing the original image or sensitive data.
To allow the Nitro Enclave to communicate with AWS KMS, we use the KMS Enclave Tool which uses the vsock to connect to AWS KMS and decrypt the encrypted key.
The vsock-proxy (packaged with the Nitro CLI) routes incoming traffic from the KMS Tool to AWS KMS provided that the AWS KMS endpoint is included on the vsock-proxy allowlist. The response from AWS KMS is then sent back to the KMS Enclave Tool over the vsock.

As part of the request to AWS KMS, the KMS Enclave Tool extracts and sends a signed attestation document to AWS KMS containing the enclave’s measurements to prove its identity. AWS KMS will validate the attestation document before decrypting the data key. Once validated, the data key is decrypted and securely returned to the KMS Tool which securely transfers it to the Server app to decrypt the image.

Figure 3 – Solution architecture diagram for this blog post

Environment Setup

Prerequisites

Before we get started, you will need the following prerequisites to deploy the solution:

AWS account
AWS Identity and Access Management (IAM) role with appropriate access

AWS CloudFormation Template

We are going to use AWS CloudFormation to provision our infrastructure.

Download the CloudFormation (CFN) template nitro-enclave-demo.yaml. This template orchestrates an EC2 instance with the required networking components such as a VPC, Subnet and NAT Gateway.
Log in to the AWS Management Console and select the AWS Region where you’d like to deploy this stack. In the example, we select Canada (Central).
Open the AWS CloudFormation console at: https://console.aws.amazon.com/cloudformation/
Choose Create Stack, Template is ready, Upload a template file. Choose File to select nitro-enclave-demo.yaml that you saved locally.
Choose Next, enter a stack name such as NitroEnclaveStack, choose Next.
On the subsequent screens, leave the defaults, and continue to select Next until you arrive at the Review step.
At the Review step, scroll to the bottom, place a checkmark in “I acknowledge that AWS CloudFormation might create IAM resources with custom names.”, and click “Create stack”.
The stack status is initially CREATE_IN_PROGRESS. It will take around 5 minutes to complete. Click the Refresh button periodically to refresh the status. Upon completion, the status changes to CREATE_COMPLETE.
Once completed, click on the “Resources” tab and search for “NitroEnclaveInstance”, then click on its “Physical ID” to navigate to the EC2 instance.
On the Amazon EC2 page, select the instance and click “Connect”.
Choose “Session Manager” and click “Connect”.

EC2 Instance Configuration

Now that the EC2 instance has been provisioned and you are connected to it, follow these steps to configure it:

Install the Nitro Enclaves CLI which will allow you to build and run a Nitro Enclave application:

sudo amazon-linux-extras install aws-nitro-enclaves-cli -y
sudo yum install aws-nitro-enclaves-cli-devel -y

Verify that the Nitro Enclaves CLI was installed successfully by running the following command:
```
nitro-cli --version
```

To download the application from GitHub and build a Docker image, install Git, add ssm-user to the ne and docker groups, and start the Docker service by running the following commands:

sudo yum install git -y
sudo usermod -aG ne ssm-user
sudo usermod -aG docker ssm-user
sudo systemctl start docker && sudo systemctl enable docker

Nitro Enclave Configuration

A Nitro Enclave is an isolated environment which runs within the EC2 instance, hence we need to specify the resources (CPU & Memory) that the Nitro Enclaves allocator service dedicates to the enclave.

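Enter the following commands to set the memory available for the Nitro Enclaves allocator service to allocate to your enclave, and then start and enable the service. These commands change only memory_mib; cpu_count keeps the value already in the file: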

ALLOCATOR_YAML=/etc/nitro_enclaves/allocator.yaml
MEM_KEY=memory_mib
DEFAULT_MEM=20480
sudo sed -r "s/^(\s*${MEM_KEY}\s*:\s*).*/\1${DEFAULT_MEM}/" -i "${ALLOCATOR_YAML}"
sudo systemctl start nitro-enclaves-allocator.service && sudo systemctl enable nitro-enclaves-allocator.service

To verify the configuration has been applied, run the following command and note the values for memory_mib and cpu_count:
```
cat /etc/nitro_enclaves/allocator.yaml
```
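
To clarify what the sed substitution does, here is an equivalent Python sketch that rewrites only the memory_mib line and leaves everything else untouched. The sample file content is an assumption made for illustration; your allocator.yaml may contain additional comments or different values.

```python
import re


def set_memory_mib(yaml_text, mem_mib):
    """Replace the value on the memory_mib line, keeping its key and spacing."""
    pattern = re.compile(r"^([ \t]*memory_mib[ \t]*:[ \t]*).*$", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + str(mem_mib), yaml_text)


# Assumed sample content, not copied from a real instance.
sample = "---\nmemory_mib: 512\ncpu_count: 2\n"
print(set_memory_mib(sample, 20480))
```

As with the sed command, the capture group preserves the key and any surrounding whitespace, so only the value changes.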

Creating a Nitro Enclave Image

Download the Project and Build the Enclave Base Image

Now that the EC2 instance is configured, download the workload code and build the enclave base Docker image. This image contains the Nitro Enclaves Software Development Kit (SDK) which allows an enclave to request a cryptographically signed attestation document from the Nitro Hypervisor. The attestation document includes unique measurements (SHA384 hashes) that are used to prove the enclave’s identity to services such as AWS KMS.

Clone the GitHub project:

cd ~/ && git clone https://github.com/aws-samples/aws-nitro-enclaves-ai-ml-object-detection.git

Navigate to the cloned project’s folder and build the “enclave_base” image:
```
cd ~/aws-nitro-enclaves-ai-ml-object-detection/enclave-base-image
sudo docker build ./ -t enclave_base
```
Note: The above step will take approximately 8-10 minutes to complete.

Build and Run The Nitro Enclave Image

To build the Nitro Enclave image of the workload, build a docker image of your application and then use the Nitro CLI to build the Nitro Enclave image:

Download TensorFlow pre-trained model:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
mkdir -p models/faster_rcnn_openimages_v4_inception_resnet_v2_1 && cd models/
wget -O tensorflow-model.tar.gz https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1?tf-hub-format=compressed
tar -xvf tensorflow-model.tar.gz -C faster_rcnn_openimages_v4_inception_resnet_v2_1

Navigate to the use-case folder and build the docker image for the application:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
sudo docker build ./ -t nitro-enclave-container-ai-ml:latest

Use the Nitro CLI to build an Enclave Image File (.eif) using the docker image you built in the previous step:

sudo nitro-cli build-enclave --docker-uri nitro-enclave-container-ai-ml:latest --output-file nitro-enclave-container-ai-ml.eif

The previous step produces the platform configuration register (PCR) hashes and a Nitro Enclave image file (.eif). Take note of the PCR0 value, which is a hash of the enclave image file. Example PCR0:
```
{
    "Measurements": {
        "PCR0": "7968aee86dc343ace7d35fa1a504f955ee4e53f0d7ad23310e7df535a187364a0e6218b135a8c2f8fe205d39d9321923"
        ...
    }
}
```
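
If you prefer not to copy the hash by hand, a small helper can pull it out of the JSON and sanity-check it. The sketch below assumes you pass it only the JSON document shown above (any other lines printed by the build step would need to be stripped first). The function name is hypothetical.

```python
import json
import re

def extract_pcr0(build_output_json: str) -> str:
    """Return the PCR0 value from the JSON printed by the build step."""
    measurements = json.loads(build_output_json)["Measurements"]
    pcr0 = measurements["PCR0"].strip().lower()
    # A SHA384 digest is 48 bytes, which is 96 hexadecimal characters.
    if not re.fullmatch(r"[0-9a-f]{96}", pcr0):
        raise ValueError("PCR0 does not look like a SHA384 hex digest")
    return pcr0

# The example PCR0 from this post, wrapped in the same JSON shape.
sample_output = json.dumps({
    "Measurements": {
        "PCR0": "7968aee86dc343ace7d35fa1a504f955ee4e53f0d7ad23310e7df535a187364a0e6218b135a8c2f8fe205d39d9321923"
    }
})
print(extract_pcr0(sample_output))
```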
Launch the Nitro Enclave container using the Enclave Image File (.eif) generated in the previous step and allocate resources to it. You should allocate at least 4 times the EIF file size for enclave memory. This is necessary because the tmpfs filesystem uses half of the memory and the remainder of the memory is used to uncompress the initial initramfs where the application executable resides. For CPU allocation, you should allocate CPU in full cores, that is, in multiples of 2 vCPUs on x86 hyper-threaded instances.
In our case, we are going to allocate 14 GiB (14,336 MiB) for the enclave:
```
sudo nitro-cli run-enclave --cpu-count 2 --memory 14336 --eif-path nitro-enclave-container-ai-ml.eif
```
Note: Allow a few seconds for the server to boot up prior to running the Client app in the below section “Object Detection using Nitro Enclaves”.
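
The "at least 4 times the EIF file size" rule can be turned into a quick calculation. The sketch below is a minimal illustration: the function name is made up, and the 3.5 GiB file size is a hypothetical value chosen for the example, not the measured size of this project's EIF.

```python
MIB = 1024 * 1024

def min_enclave_memory_mib(eif_size_bytes: int, factor: int = 4) -> int:
    """Smallest whole number of MiB that is at least `factor` times the EIF size."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (eif_size_bytes * factor + MIB - 1) // MIB

# Hypothetical EIF size of 3.5 GiB; use os.path.getsize() on your own .eif file.
eif_size = int(3.5 * 1024 * MIB)
print(min_enclave_memory_mib(eif_size))
```

Treat the result as a lower bound for the --memory argument; it must also fit within the memory_mib value configured for the allocator service.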

Update the KMS Key Policy to Include the PCR0 Hash

Now that you have the PCR0 value for your enclave image, update the KMS key policy to only allow your Nitro Enclave container access to the KMS key.

Navigate to AWS KMS in your AWS Console and make sure you are in the same region where your CloudFormation template was deployed
Select “Customer managed keys”
Search for a key with alias “EnclaveKMSKey” and click on it
Click “Edit” on the “Key Policy”
Scroll to the bottom of the key policy and replace the value of “EXAMPLETOBEUPDATED” for the “kms:RecipientAttestation:PCR0” key with the PCR0 hash you noted in the previous section and click “Save changes”
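
The same replacement can be sketched in code, which is handy if you want to review the change before pasting the policy back into the console. The statement below is a hypothetical stand-in for the real key policy: the account ID, role name, Sid, and the StringEqualsIgnoreCase operator are assumptions for illustration. Only the placeholder text and the condition key come from the steps above.

```python
import json

PCR0_KEY = "kms:RecipientAttestation:PCR0"
PLACEHOLDER = "EXAMPLETOBEUPDATED"

def set_pcr0(policy_json: str, pcr0: str) -> str:
    """Replace the placeholder PCR0 value in every condition block that has one."""
    policy = json.loads(policy_json)
    replaced = 0
    for statement in policy.get("Statement", []):
        for values in statement.get("Condition", {}).values():
            if values.get(PCR0_KEY) == PLACEHOLDER:
                values[PCR0_KEY] = pcr0
                replaced += 1
    if replaced == 0:
        raise ValueError("placeholder not found in key policy")
    return json.dumps(policy, indent=2)

# Hypothetical statement; account ID, role name, and operator are illustrative.
sample_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowEnclaveDecrypt",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/ExampleEnclaveInstanceRole"},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {"StringEqualsIgnoreCase": {PCR0_KEY: PLACEHOLDER}},
    }],
})
pcr0 = "7968aee86dc343ace7d35fa1a504f955ee4e53f0d7ad23310e7df535a187364a0e6218b135a8c2f8fe205d39d9321923"
updated = set_pcr0(sample_policy, pcr0)
print(updated)
```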

AI/ML Object Detection using a Nitro Enclave

Now that you have an enclave image file, run the components of the solution.

Requirements Installation for Client App

Install the python requirements using the following command:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
pip3 install -r requirements.txt

Set the Region that your CloudFormation stack is deployed in. In our case, we selected Canada (Central):
```
CFN_REGION=ca-central-1
```
Run the following command to encrypt the image using the AWS KMS key “EnclaveKMSKey”. The command reads the Region from CFN_REGION, so make sure that you replaced “ca-central-1” in the previous step with the Region where you deployed your CloudFormation template:
```
python3 ./envelope-encryption/encryptor.py --filePath ./images/air-show.jpg --cmkId alias/EnclaveKMSkey --region $CFN_REGION
```
Verify that the output contains: file encrypted? True
Note: The previous command generates two files: an encrypted image file and an encrypted data key file. The data key file is generated so we can demonstrate an attempt from the parent instance at decrypting the data key.

Launching VSock Proxy

Launch the VSock Proxy, which proxies requests from the Nitro Enclave to an external endpoint, in this case AWS KMS. Note that the file vsock-proxy-config.yaml contains the allow list of endpoints that an enclave can communicate with.

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
vsock-proxy 8001 "kms.$CFN_REGION.amazonaws.com" 443 --config vsock-proxy-config.yaml &

Object Detection using Nitro Enclaves

Send the encrypted image to the enclave to decrypt the image and use the AI/ML model to detect objects and return a summary of the objects detected:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
python3 client.py --filePath ./images/air-show.jpg.encrypted | jq -C '.'

The previous step takes around a minute to complete when first called. Inside the enclave, the server application decrypts the image, runs it through the AI/ML model to generate a list of objects detected and returns that list to the client application.

Attempt to Decrypt Data Key using Parent Instance Credentials

To prove that the parent instance is not able to decrypt the content, attempt to decrypt the data key using the parent’s credentials:

cd ~/aws-nitro-enclaves-ai-ml-object-detection/src
aws kms decrypt --ciphertext-blob fileb://images/air-show.jpg.data_key.encrypted --region $CFN_REGION

Note: The command is expected to fail with AccessDeniedException, since the parent instance is not allowed to decrypt the data key.

Cleaning up

Open the AWS CloudFormation console at: https://console.aws.amazon.com/cloudformation/.
Select the stack you created earlier, such as NitroEnclaveStack.
Choose Delete, then choose Delete Stack.
The stack status is initially DELETE_IN_PROGRESS. Click the Refresh button periodically to refresh its status. The status changes to DELETE_COMPLETE after it’s finished and the stack name no longer appears in your list of active stacks.

Conclusion

In this post, we showcase how to process sensitive data with Nitro Enclaves using an AI/ML model deployed on Amazon EC2, as well as how to integrate an enclave with AWS KMS to restrict access to an AWS KMS CMK so that only the Nitro Enclave is allowed to use the key and decrypt the image.

We encrypt the sample data with envelope encryption to illustrate how to protect, transfer and securely process highly sensitive data. This process would be similar for any kind of sensitive information such as personally identifiable information (PII), healthcare or intellectual property (IP) which could also be the AI/ML model.

Dig deeper by exploring how to further restrict your AWS KMS CMK using additional PCR hashes such as PCR1 (hash of the Linux kernel and bootstrap), PCR2 (hash of the application), and other hashes available to you.

Also, try our comprehensive Nitro Enclave workshop which includes use-cases at different complexity levels.

Reducing Your Organization’s Carbon Footprint with Amazon CodeGuru Profiler

2022-11-15 Isha Dua

Post Syndicated from Isha Dua original https://aws.amazon.com/blogs/devops/reducing-your-organizations-carbon-footprint-with-codeguru-profiler/

It is crucial to examine every functional area when firms reorient their operations toward sustainable practices. Making informed decisions is necessary to reduce the environmental effect of an IT stack when creating, deploying, and maintaining it. To build a sustainable business for our customers and for the world we all share, we have deployed data centers that provide the efficient, resilient service our customers expect while minimizing our environmental footprint—and theirs. While we work to improve the energy efficiency of our datacenters, we also work to help our customers improve their operations on the AWS cloud. This two-pronged approach is based on the concept of the shared responsibility between AWS and AWS’ customers. As shown in the diagram below, AWS focuses on optimizing the sustainability of the cloud, while customers are responsible for sustainability in the cloud, meaning that AWS customers must optimize the workloads they have on the AWS cloud.

Figure 1. Shared responsibility model for sustainability

Just by migrating to the cloud, AWS customers become significantly more sustainable in their technology operations. On average, AWS customers use 77% fewer servers, 84% less power, and a 28% cleaner power mix, ultimately reducing their carbon emissions by 88% compared to when they ran workloads in their own data centers. These improvements are attributable to the technological advancements and economies of scale that AWS datacenters bring. However, there are still significant opportunities for AWS customers to make their cloud operations more sustainable. To uncover this, we must first understand how emissions are categorized.

The Greenhouse Gas Protocol organizes carbon emissions into the following scopes, along with relevant emission examples within each scope for a cloud provider such as AWS:

Scope 1: All direct emissions from the activities of an organization or under its control. For example, fuel combustion by data center backup generators.
Scope 2: Indirect emissions from electricity purchased and used to power data centers and other facilities. For example, emissions from commercial power generation.
Scope 3: All other indirect emissions from activities of an organization from sources it doesn’t control. AWS examples include emissions related to data center construction, and the manufacture and transportation of IT hardware deployed in data centers.

From an AWS customer perspective, emissions from customer workloads running on AWS are accounted for as indirect emissions, and part of the customer’s Scope 3 emissions. Each workload deployed generates a fraction of the total AWS emissions from each of the previous scopes. The actual amount varies per workload and depends on several factors including the AWS services used, the energy consumed by those services, the carbon intensity of the electric grids serving the AWS data centers where they run, and the AWS procurement of renewable energy.

At a high level, AWS customers approach optimization initiatives at three levels:

Application (Architecture and Design): Using efficient software designs and architectures to minimize the average resources required per unit of work.
Resource (Provisioning and Utilization): Monitoring workload activity and modifying the capacity of individual resources to prevent idling due to over-provisioning or under-utilization.
Code (Code Optimization): Using code profilers and other tools to identify the areas of code that use up the most time or resources as targets for optimization.

In this blogpost, we will concentrate on code-level sustainability improvements and how they can be realized using Amazon CodeGuru Profiler.

How CodeGuru Profiler improves code sustainability

Amazon CodeGuru Profiler collects runtime performance data from your live applications and provides recommendations that can help you fine-tune your application performance. Using machine learning algorithms, CodeGuru Profiler can help you find your most CPU-intensive lines of code, which contribute the most to your Scope 3 emissions. CodeGuru Profiler then suggests ways to improve the code to make it less CPU demanding. CodeGuru Profiler provides different visualizations of profiling data to help you identify what code is running on the CPU and see how much time is consumed, and it suggests ways to reduce CPU utilization. Optimizing your code with CodeGuru Profiler leads to the following:

Improvements in application performance
Reduction in cloud cost, and
Reduction in the carbon emissions attributable to your cloud workload.

When your code performs the same task with less CPU, your applications run faster, customer experience improves, and your cost falls alongside your cloud emissions. CodeGuru Profiler generates the recommendations that help you make your code faster by using an agent that continuously samples stack traces from your application. The stack traces indicate how much time the CPU spends on each function or method in your code. This information is then transformed into CPU and latency data that is used to detect anomalies. When anomalies are detected, CodeGuru Profiler generates recommendations that clearly outline what you should do to remediate the situation. Although CodeGuru Profiler has several visualizations that help you explore your code, in many cases, customers can implement these recommendations without reviewing the visualizations. Let’s demonstrate this with a simple example.
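
To build some intuition for how sampled stack traces turn into "time spent per function", here is a minimal sketch of the general sampling idea. It is not CodeGuru Profiler's actual algorithm, and the function name and sample stacks are invented for illustration.

```python
from collections import Counter

def cpu_share_by_function(samples):
    """Estimate each function's share of CPU time from sampled stack traces.

    Each sample is a list of function names, outermost caller first. A function
    is credited once per sample in which it appears anywhere on the stack, so
    time spent in callees also counts toward their callers.
    """
    if not samples:
        return {}
    counts = Counter()
    for stack in samples:
        for function in set(stack):
            counts[function] += 1
    total = len(samples)
    return {name: count / total for name, count in counts.items()}

# Four hypothetical samples taken at regular intervals.
samples = [
    ["lambda_handler", "create_client"],
    ["lambda_handler", "create_client"],
    ["lambda_handler", "create_client"],
    ["lambda_handler", "print"],
]
shares = cpu_share_by_function(samples)
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.0%}")
```

In this toy data, client creation shows up in three of four samples, which is the kind of disproportionate share a profiler would flag.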

Demonstration: Using CodeGuru Profiler to optimize a Lambda function

In this demonstration, the inefficiencies in an AWS Lambda function will be identified by CodeGuru Profiler.

Building our Lambda Function (10 mins)

To keep this demonstration quick and simple, let’s create a simple Lambda function that displays ‘Hello World’. Before writing the code for this function, let’s review two important concepts. First, when writing Python code that runs on AWS and calls AWS services, two critical steps are required:

Importing the AWS SDK for Python (Boto3), and
Creating the AWS SDK service client.

The Python code lines (that will be part of our function) that execute these steps listed above are shown below:

import boto3 #this will import AWS SDK library for Python
VariableName = boto3.client('dynamodb') #this will create the AWS SDK service client

Secondly, functionally, AWS Lambda functions comprise two sections:

Initialization code
Handler code

The first time a function is invoked (i.e., a cold start), Lambda downloads the function code, creates the required runtime environment, runs the initialization code, and then runs the handler code. During subsequent invocations (warm starts), to keep execution time low, Lambda bypasses the initialization code and goes straight to the handler code. AWS Lambda is designed such that the SDK service client created during initialization persists into the handler code execution. For this reason, AWS SDK service clients should be created in the initialization code. If the code lines for creating the AWS SDK service client are placed in the handler code, the AWS SDK service client will be recreated every time the Lambda function is invoked, needlessly increasing the duration of the Lambda function during cold and warm starts. This inadvertently increases CPU demand (and cost), which in turn increases the carbon emissions attributable to the customer’s code. Below, you can see the green and brown versions of the same Lambda function.
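
The difference is easiest to see in code, so here is a minimal sketch of the two versions. To keep it runnable without AWS credentials, a hypothetical make_client function stands in for boto3.client and simply counts how many clients get built. The handler names are also illustrative.

```python
creation_count = 0

def make_client(service_name):
    """Hypothetical stand-in for boto3.client(); counts how often a client is built."""
    global creation_count
    creation_count += 1
    return {"service": service_name}

# Initialization code: runs once per execution environment (cold start only).
client = make_client("dynamodb")

def green_handler(event, context):
    # Reuses the module-level client on every invocation.
    return "Hello World"

def brown_handler(event, context):
    # Rebuilds the client on every invocation, cold or warm.
    client = make_client("dynamodb")
    return "Hello World"

# Simulate three warm invocations of each version.
for _ in range(3):
    green_handler({}, None)
clients_after_green = creation_count

for _ in range(3):
    brown_handler({}, None)
clients_after_brown = creation_count

print(clients_after_green, clients_after_brown)
```

The green handler leaves the count at the single client built during initialization, while the brown handler adds one client per invocation.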

Now that we understand the importance of structuring our Lambda function code for efficient execution, let’s create a Lambda function that recreates the SDK service client. We will then watch CodeGuru Profiler flag this issue and generate a recommendation.

Open AWS Lambda from the AWS Console and click on Create function.
Select Author from scratch, name the function ‘demo-function’, select Python 3.9 under runtime, select x86_64 under Architecture.
Expand Permissions, then choose whether to create a new execution role or use an existing one.
Expand Advanced settings, and then select Function URL.
For Auth type, choose AWS_IAM or NONE.
Select Configure cross-origin resource sharing (CORS). By selecting this option during function creation, your function URL allows requests from all origins by default. You can edit the CORS settings for your function URL after creating the function.
Choose Create function.
In the code editor tab of the code source window, copy and paste the code below:

#initialization code
import json
import boto3

#handler code
def lambda_handler(event, context):
  client = boto3.client('dynamodb') #create AWS SDK service client on every invocation (the inefficiency this demo flags)
  #simple codeblock for demonstration purposes
  output = 'Hello World'
  print(output)
  #handler function return

  return output

Ensure that the handler code is properly indented.

Save the code, Deploy, and then Test.
For the first execution of this Lambda function, a test event configuration dialog will appear. On the Configure test event dialog window, leave the selection as the default (Create new event), enter ‘demo-event’ as the Event name, and leave the hello-world template as the Event template.
When you run the code by clicking on Test, the console should return ‘Hello World’.
To simulate actual traffic, let’s run a curl script that will invoke the Lambda function roughly every 0.06 seconds. On a bash terminal, run the following command, replacing {Lambda Function URL} with your function URL:

while true; do curl {Lambda Function URL}; sleep 0.06; done

If you do not have Git Bash installed, you can use AWS Cloud9, which supports curl commands.

Enabling CodeGuru Profiler for our Lambda function

We will now set up CodeGuru Profiler to monitor our Lambda function. For Lambda functions running on Java 8 (Amazon Corretto), Java 11, and Python 3.8 or 3.9 runtimes, CodeGuru Profiler can be enabled through a single click in the configuration tab in the AWS Lambda console. Other runtimes can be enabled by following a series of steps that can be found in the CodeGuru Profiler documentation for Java and Python.

Our demo code is written in Python 3.9, so we will enable Profiler from the configuration tab in the AWS Lambda console.

On the AWS Lambda console, select the demo-function that we created.
Navigate to Configuration > Monitoring and operations tools, and click Edit on the right side of the page.

Scroll down to Amazon CodeGuru Profiler and click the button next to Code profiling to turn it on. After enabling Code profiling, click Save.

Note: CodeGuru Profiler requires 5 minutes of Lambda runtime data to generate results. After your Lambda function provides this runtime data, which may need multiple runs if your Lambda function has a short runtime, it will display within the Profiling group page in the CodeGuru Profiler console. The profiling group will be given a default name (i.e., aws-lambda-<lambda-function-name>), and it will take approximately 15 minutes after CodeGuru Profiler receives the runtime data for this profiling group to appear. Be patient. Although our function duration is ~33ms, our curl script invokes the application once every 0.06 seconds. This should give CodeGuru Profiler sufficient information to profile our function in a couple of hours. After 5 minutes, our profiling group should appear in the list of active profiling groups as shown below.

Depending on how frequently your Lambda function is invoked, it can take up to 15 minutes to aggregate profiles, after which you can see your first visualization in the CodeGuru Profiler console. The granularity of the first visualization depends on how active your function was during those first 5 minutes of profiling—an application that is idle most of the time doesn’t have many data points to plot in the default visualization. However, you can remedy this by looking at a wider time period of profiled data, for example, a day or even up to a week, if your application has very low CPU utilization. For our demo function, a recommendation should appear after about an hour. By this time, the profiling groups list should show that our profiling group now has one recommendation.

Profiler has now flagged the repeated creation of the SDK service client with every invocation.

From the information provided, we can see that our CPU is spending 5x more computing time than expected on the recreation of the SDK service client. The estimated cost impact of this inefficiency is also provided. In production environments, the cost impact of seemingly minor inefficiencies can scale very quickly to several kilograms of CO2 and hundreds of dollars as the invocation frequency and the number of Lambda functions increase.

CodeGuru Profiler integrates with Amazon DevOps Guru, a fully managed service that makes it easy for developers and operators to improve the performance and availability of their applications. Amazon DevOps Guru analyzes operational data and application metrics to identify behaviors that deviate from normal operating patterns. Once these operational anomalies are detected, DevOps Guru presents intelligent recommendations that address current and predicted future operational issues. By integrating with CodeGuru Profiler, customers can now view operational anomalies and code optimization recommendations on the DevOps Guru console. The integration, which is enabled by default, is only applicable to Lambda resources that are supported by CodeGuru Profiler and monitored by both DevOps Guru and CodeGuru.

We can now stop the curl loop (Control+C) so that the Lambda function stops running. Next, we delete the profiling group that was created when we enabled profiling in Lambda, and then delete the Lambda function or repurpose as needed.

Conclusion

Cloud sustainability is a shared responsibility between AWS and our customers. While we work to make our data centers more sustainable, customers also have to work to make their code, resources, and applications more sustainable, and CodeGuru Profiler can help you improve code sustainability, as demonstrated above. To start profiling your code today, visit the CodeGuru Profiler documentation page. To start monitoring your applications, head over to the Amazon DevOps Guru documentation page.

Detect and block advanced bot traffic

2022-11-10 Etienne Munnich

Post Syndicated from Etienne Munnich original https://aws.amazon.com/blogs/security/detect-and-block-advanced-bot-traffic/

Automated scripts, known as bots, can generate significant volumes of traffic to your mobile applications, websites, and APIs. Targeted bots take this a step further by targeting website content, such as product availability or pricing.

Traffic from targeted bots can result in a poor user experience by competing against legitimate user traffic for website access to high-demand inventory, increasing business risk through chargebacks from fraudulent transactions, and increasing infrastructure costs.

In 2021, AWS released AWS WAF Bot Control for Common Bots to help you detect and control common bots. In October 2022, AWS released a new feature, AWS WAF Bot Control for Targeted Bots, that can help you detect and protect against bots that use advanced techniques to actively avoid detection.

In this post, I provide an overview of Bot Control for Targeted Bots and show you how to enable Bot Control to detect and block both common and targeted bots.

Overview of Bot Control for Targeted Bots

Bot Control for Targeted Bots provides sophisticated bot detection and mitigation by creating an intelligent baseline of traffic patterns. Bot Control for Targeted Bots uses browser fingerprinting techniques and client-side JavaScript interrogation methods to help protect your application from advanced bots that mimic human traffic patterns and actively try to evade detection.

Bot Control detects anomalies in usage patterns and provides new flexible mitigation options to isolate bad bots. These options include dynamic rate-limiting, challenge actions, and the ability to block based on labels and confidence scores.

With Bot Control for Targeted Bots, you can use bot protection rules to allow verified common bot traffic and, at the same time, to challenge unwanted advanced bot traffic. You can achieve both tasks from the same configuration page without making application or architectural changes. You can also configure fine-grained rule sets. For example, you can configure blocking actions for high-risk bots while allowing for exceptions for known IP ranges.

This release also introduces token domains, which let you use the same AWS WAF web ACL across multiple domain names and Amazon CloudFront distributions to simplify client-side configuration. For example, you can use token domains to accept tokens that are generated by www.example.com for api.example.com and vice versa. In addition, you can now specify a resource path directly in the managed rule configuration, enabling you to require a token only for API calls, but not for cached content, such as images.

Bot Control for Targeted Bots sends metrics to Amazon CloudWatch to identify application access trends. The metrics include the percentage of human traffic compared to bot traffic and the count of requests for sensitive web pages such as login and checkout pages. Each rule in Bot Control produces a unique label so that you can review CloudWatch metrics and filter logs to understand traffic patterns. By using these mechanisms, you can identify, isolate, and remediate operational issues.

Walkthrough

In this walkthrough, I will show you how to set up Bot Control for Targeted Bots to help protect a CloudFront distribution.

You will set up an AWS WAF web ACL with an AWS Managed Rule for Bot Control for Targeted Bots. The rule detects bots and then decides the appropriate action:

Dynamically rate limit verified bots – Based on traffic history, Bot Control creates an intelligent baseline and then applies rate limits to abnormally high volumes.
Enable the challenge action – You have a new option, called challenge, along with the already supported options of count, allow, block, and CAPTCHA. The challenge option initiates a process of challenge interstitial, which means that Bot Control provides a challenge to the browser and creates a domain token when the challenge is resolved.

Set up Bot Control for Targeted Bots

In this section, I will show you how to set up Bot Control for Targeted Bots by creating a new web ACL or editing an existing one.

To set up Bot Control for Targeted Bots

Open the AWS WAF console, and then do one of the following:
- To create a new web ACL, choose Create a new web ACL.
- To edit an existing web ACL, choose the name of the ACL.
On the Rules tab, for the Add rules drop-down, select Add managed rule groups.
Add a Bot Control rule set to the web ACL. Choose Edit to edit the rule.
For Bot Control inspection level, select the inspection level for Bot Control. For this walkthrough, we chose Targeted to enable Bot Control for Targeted Bots.

Figure 1: Bot Control – Select inspection level
Review and select the actions to be taken on each category of bots detected, and then choose Save rule. In our example, we set allow, challenge, and count rules for the categories, as shown in Figure 2.

Figure 2: Bot Control – Select actions for each category

You can select different actions for each category based on your application security needs:
- Allow: Allows the request to be sent to a protected resource.
- Block: Blocks the request, returning an HTTP 403 (Forbidden) response.
- Count: Allows the request to be sent to the protected resource while counting detections. The count shows you bot activity that is occurring without blocking or challenging. When you turn on rules for the first time, this information can help you see what the detections are, before you change the actions.
- CAPTCHA and Challenge: use CAPTCHA puzzles and silent challenges with tokens to track successful client responses.
In this example you will configure a scope-down statement to apply Bot Control for a given URI path only.
On the same page as in the previous step, you can add a scope-down statement to ensure that you use Targeted Bots, and incur its charges, only for the requests where you need protection. There are more examples of how to use scope-down statements in our documentation.

Select “Enable scope-down statement” and configure the rule to inspect the URI path as shown in Figure 3.

Figure 3: Bot Control – Add the scope-down statement
To add domain names to be protected, scroll to the bottom of the web ACL and choose Edit. In the Token domain list – optional section, enter the domain name or names to which the token verification applies. Tokens that are generated are valid for these domains.

Create the SDK link for the AWS WAF integration

In this section, I’ll show you how to find the AWS WAF SDK and add it to your application pages.

The token SDK manages the token authorization and includes the tokens in the requests that you send to your protected resources. By adding the SDK link to application pages, you can help ensure that the remote procedure calls by your client contain a valid token.

To add the SDK to your application pages

In the AWS WAF console, in the left navigation pane, choose Application integration SDKs.
Under JavaScript SDK, copy the provided code snippet. This code snippet allows for creation of the cryptographic token in the background when the application loads for the first time, providing a better customer experience.
Add the code snippet to your pages. For example, paste the provided script code within the <head> section of the HTML.

When this integration is in place on your application’s pages, you can add AWS WAF rules in your web ACL to block requests that don’t contain a valid token. Replace the <Web ACL integration URL> with the provided integration URL from the AWS WAF console or copy the script tag from the console:

<script type="text/javascript" src="<Web ACL integration URL>/challenge.js" defer></script>

Figure 4 shows the SDK link for application pages.

Figure 4: Bot Control – Add SDK link to application pages

Review metrics

Now that you’ve set up the web ACL and application, you can use the bot visualization dashboard to review bot traffic patterns. Bot rules emit metrics corresponding to their labels, helping you identify which rule within the AWS Managed Rule for Bot Control for Targeted Bots initiated an action. You can also use these labels and rule actions to filter AWS WAF logs so that you can further examine a request.

To view AWS WAF metrics for the distribution

In the AWS WAF console, in the left navigation pane, select Web ACLs.
Select the web ACL that Bot Control is enabled on and then choose the Bot Control tab to view the metrics.

Figure 5: Bot Control – Review web ACL metrics

Best practices

In this section, I describe best practices for your Bot Control setup.

Set priority ordering of AWS WAF rules to help lower costs

You can set the priority of rule groups in a web ACL so that requests are matched more efficiently. AWS WAF takes the action associated with the first rule that a request matches. If the incoming traffic matches the broader criteria (such as IP set rules at priority 1), the associated action is taken. That request is never analyzed by the Bot Control rule and hence does not incur the Bot Control request analysis fees. For example, the following list shows rules ranked in order from highest priority (1) to lowest priority (5):

Use allow and deny lists – provide IP addresses to allow or deny
AWS Managed Rule groups for IP reputation – block bots and other threats
General rate limit – help prevent HTTP flood across the protected resource
AWS WAF Bot Control rule group – scoped-down to exclude static content such as images
Rate limit for login pages – scoped-down for specific URLs and HTTP POST methods

Figure 6 shows the prioritized rules in AWS WAF.

Figure 6: AWS WAF – Web ACL rule order
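
The effect of rule ordering can be illustrated with a toy evaluator. This is a simplified model, not the AWS WAF engine: it only handles terminating actions (allow and block), and the rule names, IP address, and request fields are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    priority: int
    matches: Callable[[dict], bool]
    action: str  # "allow" or "block"; both end evaluation in this sketch

def evaluate(rules, request, default_action="allow"):
    """Return (action, names of rules evaluated); lowest priority number runs first."""
    evaluated = []
    for rule in sorted(rules, key=lambda r: r.priority):
        evaluated.append(rule.name)
        if rule.matches(request):
            return rule.action, evaluated
    return default_action, evaluated

rules = [
    Rule("bot-control", 4, lambda r: r.get("bot", False), "block"),
    Rule("ip-deny-list", 1, lambda r: r["ip"] in {"203.0.113.7"}, "block"),
]

# A denied IP is blocked before the Bot Control rule is ever reached.
action, evaluated = evaluate(rules, {"ip": "203.0.113.7", "bot": True})
print(action, evaluated)
```

Because the deny-list rule terminates evaluation first, the request never reaches the rule that would incur analysis fees.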

Use scope-down statements

You can use scope-down statements to limit the requests evaluated for a rule group. For example, a scope-down statement that excludes checking requests for static assets, such as images for a given URI and HTTP method (GET), can help reduce Bot Control costs.
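
As a sketch of what such a scope-down statement might look like in the web ACL's JSON, the snippet below builds a Bot Control rule statement that skips requests whose URI path starts with a given prefix. The field names follow the AWS WAFv2 rule JSON format as I understand it, and the /static/ prefix is a hypothetical path. Compare the output with the JSON editor in the AWS WAF console before relying on it.

```python
import json

def bot_control_statement(excluded_prefix: str) -> dict:
    """Bot Control (targeted level) that ignores URI paths under excluded_prefix."""
    return {
        "ManagedRuleGroupStatement": {
            "VendorName": "AWS",
            "Name": "AWSManagedRulesBotControlRuleSet",
            "ManagedRuleGroupConfigs": [
                {"AWSManagedRulesBotControlRuleSet": {"InspectionLevel": "TARGETED"}}
            ],
            # Only requests that do NOT match the inner statement are inspected.
            "ScopeDownStatement": {
                "NotStatement": {
                    "Statement": {
                        "ByteMatchStatement": {
                            "SearchString": excluded_prefix,
                            "FieldToMatch": {"UriPath": {}},
                            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                            "PositionalConstraint": "STARTS_WITH",
                        }
                    }
                }
            },
        }
    }

statement = bot_control_statement("/static/")  # hypothetical static-asset path
print(json.dumps(statement, indent=2))
```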

Block requests without tokens

If a request’s token is absent or rejected, you can block that request. For example, you might want to block requests on login or payment processing pages. To block requests with a missing or rejected token, add a rule that runs after the Bot Control rule and blocks requests matching the labels rejected and absent:

awswaf:managed:token:rejected – The request token is present but is either corrupt or has an expired challenge timestamp.
awswaf:managed:token:absent – The request doesn’t have a token.
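
A rule matching those two labels might be expressed as follows. This is a sketch in the AWS WAFv2 rule JSON shape as I understand it. The rule name, metric name, and priority value are placeholders; the priority must be a higher number than the one your Bot Control rule uses so that the labels already exist when this rule runs.

```python
import json

TOKEN_LABELS = ("awswaf:managed:token:rejected", "awswaf:managed:token:absent")

def block_missing_token_rule(priority: int, name: str = "block-missing-or-rejected-token") -> dict:
    """Rule that blocks requests labeled with a rejected or absent token."""
    return {
        "Name": name,  # placeholder name
        "Priority": priority,
        "Statement": {
            "OrStatement": {
                "Statements": [
                    {"LabelMatchStatement": {"Scope": "LABEL", "Key": label}}
                    for label in TOKEN_LABELS
                ]
            }
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": name,  # placeholder metric name
        },
    }

rule = block_missing_token_rule(priority=10)  # hypothetical priority
print(json.dumps(rule, indent=2))
```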

Use SDK integration

After you add the token domains and the provided script to your application pages, you can add a rule to block requests that don’t have a token. Use of the SDK helps AWS WAF verify the client application with silent challenges and provide AWS token acquisition and management. The SDK provides the full functionality of both AWS WAF Bot Control and AWS WAF Fraud Control, reducing the need for multiple SDKs if either or both rule groups are used in the web ACL.

Create CloudWatch alarms

You can add CloudWatch alarms to help you assess whether there is activity outside of the norm for your application. For example, you can monitor for a high number of token-absent metrics for a given time period.

Configure a billing alarm

To help you track costs, you can configure a billing alarm that sends an alert when you have exceeded the threshold for your expected costs.

Pricing and availability

Bot Control for Targeted Bots is available today in AWS Regions where AWS WAF is available, excluding AWS GovCloud (US) and China Regions. For information on pricing, see AWS WAF Pricing.

Conclusion

In this post, you learned how to use Bot Control for Targeted Bots to add visibility into bot activity on your website or applications. With Bot Control for common and targeted bots, you can detect, challenge, and block unwanted bot activity. Because Bot Control is customizable, you can tailor how you address legitimate bots while protecting against bots that use advanced techniques to actively avoid detection. For more information and to get started today, see AWS WAF Bot Control.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
