Tag Archives: Intermediate (200)

How to manage certificate lifecycles using ACM event-driven workflows

Post Syndicated from Shahna Campbell original https://aws.amazon.com/blogs/security/how-to-manage-certificate-lifecycles-using-acm-event-driven-workflows/

With AWS Certificate Manager (ACM), you can simplify certificate lifecycle management by using event-driven workflows to notify or take action on expiring TLS certificates in your organization. Using ACM, you can provision, manage, and deploy public and private TLS certificates for use with integrated AWS services like Amazon CloudFront and Elastic Load Balancing (ELB), as well as for your internal resources and infrastructure. For a full list of integrated services, see Services integrated with AWS Certificate Manager.

By implementing event-driven workflows for certificate lifecycle management, you can help increase the visibility of upcoming and actual certificate expirations, and notify application teams that their action is required to renew a certificate. You can also use event-driven workflows to automate provisioning of private certificates to your internal resources, like a web server based on Amazon Elastic Compute Cloud (Amazon EC2).

In this post, we describe the ACM event types that Amazon EventBridge supports. EventBridge is a serverless event router that you can use to build event-driven applications at scale. ACM publishes these events for important occurrences, such as when a certificate becomes available, approaches expiration, or fails to renew. We also highlight example use cases for the event types supported by ACM. Lastly, we show you how to implement an event-driven workflow to notify application teams that they need to take action to renew a certificate for their workloads. You can also use these types of workflows to send the relevant event information to AWS Security Hub or to initiate certificate automation actions through AWS Lambda.

To view a video walkthrough and demo of this workflow, see AWS Certificate Manager: How to create event-driven certificate workflows.

ACM event types and selected use cases

In October 2022, ACM released support for three new event types:

  • ACM Certificate Renewal Action Required
  • ACM Certificate Expired 
  • ACM Certificate Available

Before this release, ACM had a single event type: ACM Certificate Approaching Expiration. By default, ACM creates Certificate Approaching Expiration events daily for active, ACM-issued certificates starting 45 days prior to their expiration. To learn more about the structure of these event types, see Amazon EventBridge support for ACM. The following examples highlight how you can use the different event types in the context of certificate lifecycle operations.

Notify stakeholders that action is required to complete certificate renewal

ACM emits an ACM Certificate Renewal Action Required event when customer action must be taken before a certificate can be renewed. For instance, if permissions aren’t appropriately configured to allow ACM to renew private certificates issued from AWS Private Certificate Authority (AWS Private CA), ACM will publish this event when automatic renewal fails at 45 days before expiration. Similarly, ACM might not be able to renew a public certificate because an administrator needs to confirm the renewal by email validation, or because Certification Authority Authorization (CAA) record changes prevent automatic renewal through domain validation. ACM will make further renewal attempts at 30 days, 15 days, 3 days, and 1 day before expiration, or until customer action is taken, the certificate expires, or the certificate is no longer eligible for renewal. ACM publishes an event for each renewal attempt.

It’s important to notify the appropriate parties — for example, the Public Key Infrastructure (PKI) team, security engineers, or application developers — that they need to take action to resolve these issues. You might notify them by email, or by integrating with your workflow management system to open a case that the appropriate engineering or support teams can track.

Notify application teams that a certificate for their workload has expired

You can use the ACM Certificate Expired event type to notify application teams that a certificate associated with their workload has expired. The teams should quickly investigate and validate that the expired certificate won’t cause an outage or cause application users to see a message stating that a website is insecure, which could impact their trust. To increase visibility for support teams, you can publish this event to Security Hub or a support ticketing system. For an example of how to publish these events as findings in Security Hub, see Responding to an event with a Lambda function.

Use automation to export and place a renewed private certificate

ACM sends an ACM Certificate Available event when a managed public or private certificate is ready for use. ACM publishes the event on issuance, renewal, and import. When a private certificate becomes available, you might still need to take action to deploy it to your resources, such as installing the private certificate for use in an EC2 web server. This includes a new private certificate that AWS Private CA issues as part of managed renewal through ACM. You might want to notify the appropriate teams that the new certificate is available for export from ACM, so that they can use the ACM APIs, AWS Command Line Interface (AWS CLI), or AWS Management Console to export the certificate and manually distribute it to your workload (for example, an EC2-based web server). For integrated services such as ELB, ACM binds the renewed certificate to your resource, and no action is required.

You can also use this event to invoke a Lambda function that exports the private certificate and places it in the appropriate directory for the relevant server, provide it to other serverless resources, or put it in an encrypted Amazon Simple Storage Service (Amazon S3) bucket to share with a third party for mutual TLS or a similar use case.

How to build a workflow to notify administrators that action is required to renew a certificate

In this section, we’ll show you how to configure notifications to alert the appropriate stakeholders that they need to take an action to successfully renew an ACM certificate.

To follow along with this walkthrough, make sure that you have an AWS Identity and Access Management (IAM) role with the appropriate permissions for EventBridge and Amazon Simple Notification Service (Amazon SNS). When a rule runs in EventBridge, a target associated with that rule is invoked, and in order to make API calls on Amazon SNS, EventBridge needs a resource-based IAM policy.

The following IAM permissions work for the example below (and for tidying up afterwards):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "events:*"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:events:*:*:*"
    },
    {
      "Action": [
        "iam:PassRole"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "events.amazonaws.com"
        }
      }
    }  
  ]
}

The following is a sample resource-based policy that allows EventBridge to publish to an Amazon SNS topic. Make sure to replace <region>, <account-id>, and <topic-name> with your own data.

{
  "Sid": "PublishEventsToMyTopic",
  "Effect": "Allow",
  "Principal": {
    "Service": "events.amazonaws.com"
  },
  "Action": "sns:Publish",
  "Resource": "arn:aws:sns:<region>:<account-id>:<topic-name>"
}

The first step is to create an SNS topic by using the console to link multiple endpoints such as AWS Lambda and Amazon Simple Queue Service (Amazon SQS), or send a notification to an email address.

To create an SNS topic

  1. Open the Amazon SNS console.
  2. In the left navigation pane, choose Topics.
  3. Choose Create Topic.
  4. For Type, choose Standard.
  5. Enter a name for the topic, and (optional) enter a display name.
  6. Choose the triangle next to the Encryption — optional panel title.
    1. Select the Encryption toggle (optional) to encrypt the topic. This will enable server-side encryption (SSE) to help protect the contents of the messages in Amazon SNS topics.
    2. For this demonstration, we are going to use the default AWS managed KMS key. Using Amazon SNS with AWS Key Management Service (AWS KMS) provides encryption at rest, and the data keys that encrypt the SNS message data are stored with the data protected. To learn more about SNS data encryption, see Data encryption.
  7. Keep the defaults for all other settings.
  8. Choose Create topic.

When the topic has been successfully created, a notification bar appears at the top of the screen, and you will be routed to the page for the newly created topic. Note the Amazon Resource Name (ARN) listed in the Details panel because you’ll need it for the next section.

Next, you need to create a subscription to the topic to set a destination endpoint for the messages that are pushed to the topic to be delivered.

To create a subscription to the topic

  1. In the Subscriptions section of the SNS topic page you just created, choose Create subscription.
  2. On the Create subscription page, in the Details section, do the following:
    1. For Protocol, choose Email.
    2. For Endpoint, enter the email address where the ACM Certificate Renewal Action Required event alerts should be sent.
    3. Keep the default Subscription filter policy and Redrive policy settings for this demonstration.
    4. Choose Create subscription.
  3. To finalize the subscription, an email will be sent to the email address that you entered as the endpoint. To validate your subscription, choose Confirm Subscription in the email when you receive it.
  4. A new web browser will open with a message verifying that the subscription status is Confirmed and that you have been successfully subscribed to the SNS topic.

Next, create the EventBridge rule that will be invoked when an ACM Certificate Renewal Action Required event occurs. This rule uses the SNS topic that you just created as a target.

To create an EventBridge rule

  1. Navigate to the EventBridge console.
  2. In the left navigation pane, choose Rules.
  3. Choose Create rule.
  4. In the Rule detail section, do the following:
    1. Define the rule by entering a Name and an optional Description.
    2. In the Event bus dropdown menu, select the default event bus.
    3. Keep the default values for the rest of the settings.
    4. Choose Next.
  5. For Event source, make sure that AWS events or EventBridge partner events is selected, because the event source is ACM.
  6. In the Sample event panel, under Sample events, choose ACM Certificate Renewal Action Required as the sample event. This helps you verify your event pattern.
  7. In the Event pattern panel, for Event Source, make sure that AWS services is selected. 
  8. For AWS service, choose Certificate Manager.
  9. Under Event type, choose ACM Certificate Renewal Action Required.
  10. Choose Test pattern.
  11. In the Event pattern section, a notification will appear stating Sample event matched the event pattern to confirm that the correct event pattern was created.
  12. Choose Next.
  13. In the Target 1 panel, do the following:
    1. For Target types, make sure that AWS service is selected.
    2. Under Select a target, choose SNS topic.
    3. In the Topic dropdown list, choose your desired topic.
    4. Choose Next.
  14. (Optional) Add tags to the topic.
  15. Choose Next.
  16. Review the settings for the rule, and then choose Create rule.

Now you are listening to this event and will be notified when a customer action must be taken before a certificate can be renewed.

For another example of how to use Amazon SNS and email notifications, see How to monitor expirations of imported certificates in AWS Certificate Manager (ACM). For an example of how to use Lambda to publish findings to Security Hub to provide visibility to administrators and security teams, see Responding to an event with a Lambda function. Other options for responding to this event include invoking a Lambda function to export and distribute private certificates, or integrating with a messaging or collaboration tool for ChatOps.

Conclusion 

In this blog post, you learned about the new EventBridge event types for ACM, and some example use cases for each of these event types. You also learned how to use these event types to create a workflow with EventBridge and Amazon SNS that notifies the appropriate stakeholders when they need to take action, so that ACM can automatically renew a TLS certificate.

By using these events to increase awareness of upcoming certificate lifecycle events, you can make it simpler to manage TLS certificates across your organization. For more information about certificate management on AWS, see the ACM documentation or get started using ACM today in the AWS Management Console.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Shahna Campbell

Shahna Campbell

Shahna is a solutions architect at AWS, working within the specialist organization with a focus on security. Previously, Shahna worked within the healthcare field clinically and as an application specialist. Shahna is passionate about cybersecurity and analytics. In her free time she enjoys hiking, traveling, and spending time with family.

Mani Subramanian

Manikandan Subramanian

Manikandan is a principal engineer at AWS Cryptography. His primary focus is on public key infrastructure (PKI), and he helps ensure secure communication using TLS certificates for AWS customers. Mani is also passionate at designing APIs, is an API bar raiser, and has helped launch multiple AWS services. Outside of work, Mani enjoys cooking and watching Formula One.

Zach Miller

Zach Miller

Zach is a Senior Security Specialist Solutions Architect at AWS. His background is in data protection and security architecture, focused on a variety of security domains, including cryptography, secrets management, and data classification. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

How to write and execute integration tests for AWS CDK applications

Post Syndicated from Svenja Raether original https://aws.amazon.com/blogs/devops/how-to-write-and-execute-integration-tests-for-aws-cdk-applications/

Automated integration testing validates system components and boosts confidence for new software releases. Performing integration tests on resources deployed to the AWS cloud enables the validation of AWS Identity and Access Management (IAM) policies, service limits, application configuration, and runtime code. For developers that are currently leveraging AWS Cloud Development Kit (AWS CDK) as their Infrastructure as Code tool, there is a testing framework available that makes integration testing easier to implement in the software release.

AWS CDK is an open-source framework for defining and provisioning AWS cloud infrastructure using supported programming languages. The framework includes constructs for writing and running unit and integration tests. The assertions construct can be used to write unit tests and assert against the generated CloudFormation templates. CDK integ-tests construct can be used for defining integration test cases and can be combined with CDK integ-runner for executing these tests. The integ-runner handles automatic resource provisioning and removal and supports several customization options. Unit tests using assertion functions are used to test configurations in the CloudFormation templates before deploying these templates, while integration tests run assertions in the deployed resources. This blog post demonstrates writing automated integration tests for an example application using AWS CDK.

Solution Overview

Architecture Diagram for the serverless data enrichment application

Figure 1: Serverless data enrichment application

The example application shown in Figure 1 is a sample serverless data enrichment application. Data is processed and enriched in the system as follows:

  1. Users publish messages to an Amazon Simple Notification Service (Amazon SNS) topic. Messages are encrypted at rest using an AWS Key Management Service (AWS KMS) customer-managed key.
  2. Amazon Simple Queue Service (Amazon SQS) queue is subscribed to the Amazon SNS topic, where published messages are delivered.
  3. AWS Lambda consumes messages from the Amazon SQS queue, adding additional data to the message. Messages that cannot be processed successfully are sent to a dead-letter queue.
  4. Successfully enriched messages are stored in an Amazon DynamoDB table by the Lambda function.
Architecture diagram for the integration test with one assertion

Figure 2: Integration test with one assertion

For this sample application, we will use AWS CDK’s integration testing framework to validate the processing for a single message as shown in Figure 2. To run the test, we configure the test framework to do the following steps:

  1. Publish a message to the Amazon SNS topic. Wait for the application to process the message and save to DynamoDB.
  2. Periodically check the Amazon DynamoDB table and verify that the saved message was enriched.

Prerequisites

The following are the required to deploy this solution:

The structure of the sample AWS CDK application repository is as follows:

  • /bin folder contains the top-level definition of the AWS CDK app.
  • /lib folder contains the stack definition of the application under test which defines the application described in the section above.
  • /lib/functions contains the Lambda function runtime code.
  • /integ-tests contains the integration test stack where we define and configure our test cases.

The repository is a typical AWS CDK application except that it has one additional directory for the test case definitions. For the remainder of this blog post, we focus on the integration test definition in /integ-tests/integ.sns-sqs-ddb.ts and walk you through its creation and the execution of the integration test.

Writing integration tests

An integration test should validate expected behavior of your AWS CDK application. You can define an integration test for your application as follows:

  1. Create a stack under test from the CdkIntegTestsDemoStack definition and map it to the application.
    // CDK App for Integration Tests
    const app = new cdk.App();
    
    // Stack under test
    const stackUnderTest = new CdkIntegTestsDemoStack(app, ‘IntegrationTestStack’, {
      setDestroyPolicyToAllResources: true,
      description:
        “This stack includes the application’s resources for integration testing.”,
    });
  2. Define the integration test construct with a list of test cases. This construct offers the ability to customize the behavior of the integration runner tool. For example, you can force the integ-runner to destroy the resources after the test run to force the cleanup.
    // Initialize Integ Test construct
    const integ = new IntegTest(app, ‘DataFlowTest’, {
      testCases: [stackUnderTest], // Define a list of cases for this test
      cdkCommandOptions: {
        // Customize the integ-runner parameters
        destroy: {
          args: {
            force: true,
          },
        },
      },
      regions: [stackUnderTest.region],
    });
  3. Add an assertion to validate the test results. In this example, we validate the single message flow from the Amazon SNS topic to the Amazon DynamoDB table. The assertion publishes the message object to the Amazon SNS topic using the AwsApiCall method. In the background this method utilizes a Lambda-backed CloudFormation custom resource to execute the Amazon SNS Publish API call with the AWS SDK for JavaScript.
    /**
     * Assertion:
     * The application should handle single message and write the enriched item to the DynamoDB table.
     */
    const id = 'test-id-1';
    const message = 'This message should be validated';
    /**
     * Publish a message to the SNS topic.
     * Note - SNS topic ARN is a member variable of the
     * application stack for testing purposes.
     */
    const assertion = integ.assertions
      .awsApiCall('SNS', 'publish', {
        TopicArn: stackUnderTest.topicArn,
        Message: JSON.stringify({
          id: id,
          message: message,
        }),
      })
  4. Use the next helper method to chain API calls. In our example, a second Amazon DynamoDB GetItem API call gets the item whose primary key equals the message id. The result from the second API call is expected to match the message object including the additional attribute added as a result of the data enrichment.
    /**
     * Validate that the DynamoDB table contains the enriched message.
     */
      .next(
        integ.assertions
          .awsApiCall('DynamoDB', 'getItem', {
            TableName: stackUnderTest.tableName,
            Key: { id: { S: id } },
          })
          /**
           * Expect the enriched message to be returned.
           */
          .expect(
            ExpectedResult.objectLike({
              Item: { id: { S: id, },
                message: { S: message, },
                additionalAttr: { S: 'enriched', },
              },
            }),
          )
  5. Since it may take a while for the message to be passed through the application, we run the assertion asynchronously by calling the waitForAssertions method. This means that the Amazon DynamoDB GetItem API call is called in intervals until the expected result is met or the total timeout is reached.
    /**
     * Timeout and interval check for assertion to be true.
     * Note - Data may take some time to arrive in DynamoDB.
     * Iteratively executes API call at specified interval.
     */
          .waitForAssertions({
            totalTimeout: Duration.seconds(25),
            interval: Duration.seconds(3),
          }),
      );
  6. The AwsApiCall method automatically adds the correct IAM permissions for both API calls to the AWS Lambda function. Given that the example application’s Amazon SNS topic is encrypted using an AWS KMS key, additional permissions are required to publish the message.
    // Add the required permissions to the api call
    assertion.provider.addToRolePolicy({
      Effect: 'Allow',
      Action: [
        'kms:Encrypt',
        'kms:ReEncrypt*',
        'kms:GenerateDataKey*',
        'kms:Decrypt',
      ],
      Resource: [stackUnderTest.kmsKeyArn],
    });

The full code for this blog is available on this GitHub project.

Running integration tests

In this section, we show how to run integration test for the introduced sample application using the integ-runner to execute the test case and report on the assertion results.

Install and build the project.

npm install 

npm run build

Run the following command to initiate the test case execution with a list of options.

npm run integ-test

The directory option specifies in which location the integ-runner needs to recursively search for test definition files. The parallel-regions option allows to define a list of regions to run tests in. We set this to us-east-1 and ensure that the AWS CDK bootstrapping has previously been performed in this region. The update-on-failed option allows to rerun the integration tests if the snapshot fails. A full list of available options can be found in the integ-runner Github repository.

Hint: if you want to retain your test stacks during development for debugging, you can specify the no-clean option to retain the test stack after the test run.

The integ-runner initially checks the integration test snapshots to determine if any changes have occurred since the last execution. Since there are no previous snapshots for the initial run, the snapshot verification fails. As a result, the integ-runner begins executing the integration tests using the ephemeral test stack and displays the result.

Verifying integration test snapshots...

  NEW        integ.sns-sqs-ddb 2.863s

Snapshot Results: 

Tests:    1 failed, 1 total

Running integration tests for failed tests...

Running in parallel across regions: us-east-1
Running test <your-path>/cdk-integ-tests-demo/integ-tests/integ.sns-sqs-ddb.js in us-east-1
  SUCCESS    integ.sns-sqs-ddb-DemoTest/DefaultTest 587.295s
       AssertionResultsAwsApiCallDynamoDBgetItem - success

Test Results: 

Tests:    1 passed, 1 total
The AWS CloudFormation console deploys the IntegrationTestStack and DataFlowDefaultTestDeployAssert stack

Figure 3: AWS CloudFormation deploying the IntegrationTestStack and DataFlowDefaultTestDeployAssert stacks

The integ-runner generates two AWS CloudFormation stacks, as shown in Figure 3. The IntegrationTestStack stack includes the resources from our sample application, which serves as an isolated application representing the stack under test. The DataFlowDefaultTestDeployAssert stack contains the resources required for executing the integration tests as shown in Figure 4.

AWS CloudFormation displays the resources for the DataFlowDefaultTestDeployAssert stack

Figure 4: AWS CloudFormation resources for the DataFlowDefaultTestDeployAssert stack

Cleaning up

Based on the specified RemovalPolicy, the resources are automatically destroyed as the stack is removed. Some resources such as Amazon DynamoDB tables have the default RemovalPolicy set to Retain in AWS CDK. To set the removal policy to Destroy for the integration test resources, we leverage Aspects.

/**
 * Aspect for setting all removal policies to DESTROY
 */
class ApplyDestroyPolicyAspect implements cdk.IAspect {
  public visit(node: IConstruct): void {
    if (node instanceof CfnResource) {
      node.applyRemovalPolicy(cdk.RemovalPolicy.DESTROY);
    }
  }
}
Deleting AWS CloudFormation stack from the AWS Console

Figure 5: Deleting AWS CloudFormation stacks from the AWS Console

If you set the no-clean argument as part of the integ-runner CLI options, you need to manually destroy the stacks. This can be done from the AWS Console, via AWS CloudFormation as shown in Figure 5 or by using the following command.

cdk destroy --all

To clean up the code repository build files, you can run the following script.

npm run clean

Conclusion

The AWS CDK integ-tests construct is a valuable tool for defining and conducting automated integration tests for your AWS CDK applications. In this blog post, we have introduced a practical code example showcasing how AWS CDK integration tests can be used to validate the expected application behavior when deployed to the cloud. You can leverage the techniques in this guide to write your own AWS CDK integration tests and improve the quality and reliability of your application releases.

For information on how to get started with these constructs, please refer to the following documentation.

Call to Action

Integ-runner and integ-tests constructs are experimental and subject to change. The release notes for both stable and experimental modules are available in the AWS CDK Github release notes. As always, we welcome bug reports, feature requests, and pull requests on the aws-cdk GitHub repository to further shape these alpha constructs based on your feedback.

About the authors

Iris Kraja

Iris is a Cloud Application Architect at AWS Professional Services based in New York City. She is passionate about helping customers design and build modern AWS cloud native solutions, with a keen interest in serverless technology, event-driven architectures and DevOps. Outside of work, she enjoys hiking and spending as much time as possible in nature.

Svenja Raether

Svenja is an Associate Cloud Application Architect at AWS Professional Services based in Munich.

Ahmed Bakry

Ahmed is a Security Consultant at AWS Professional Services based in Amsterdam. He obtained his master’s degree in Computer Science at the University of Twente and specialized in Cyber Security. And he did his bachelor degree in Networks Engineering at the German University in Cairo. His passion is developing secure and robust applications that drive success for his customers.

Philip Chen

Philip is a Senior Cloud Application Architect at AWS Professional Services. He works with customers to design cloud solutions that are built to achieve business goals and outcomes. He is passionate about his work and enjoys the creativity that goes into architecting solutions.

Dimensional modeling in Amazon Redshift

Post Syndicated from Bernard Verster original https://aws.amazon.com/blogs/big-data/dimensional-modeling-in-amazon-redshift/

Amazon Redshift is a fully managed and petabyte-scale cloud data warehouse that is used by tens of thousands of customers to process exabytes of data every day to power their analytics workload. You can structure your data, measure business processes, and get valuable insights quickly can be done by using a dimensional model. Amazon Redshift provides built-in features to accelerate the process of modeling, orchestrating, and reporting from a dimensional model.

In this post, we discuss how to implement a dimensional model, specifically the Kimball methodology. We discuss implementing dimensions and facts within Amazon Redshift. We show how to perform extract, transform, and load (ELT), an integration process focused on getting the raw data from a data lake into a staging layer to perform the modeling. Overall, the post will give you a clear understanding of how to use dimensional modeling in Amazon Redshift.

Solution overview

The following diagram illustrates the solution architecture.

In the following sections, we first discuss and demonstrate the key aspects of the dimensional model. After that, we create a data mart using Amazon Redshift with a dimensional data model including dimension and fact tables. Data is loaded and staged using the COPY command, the data in the dimensions is loaded using the MERGE statement, and facts will be joined to the dimensions where insights are derived from. We schedule the loading of the dimensions and facts using the Amazon Redshift Query Editor V2. Lastly, we use Amazon QuickSight to gain insights on the modeled data in the form of a QuickSight dashboard.

For this solution, we use a sample dataset (normalized) provided by Amazon Redshift for event ticket sales. For this post, we have narrowed down the dataset for simplicity and demonstration purposes. The following tables show examples of the data for ticket sales and venues.

According to the Kimball dimensional modeling methodology, there are four key steps in designing a dimensional model:

  1. Identify the business process.
  2. Declare the grain of your data.
  3. Identify and implement the dimensions.
  4. Identify and implement the facts.

Additionally, we add a fifth step for demonstration purposes, which is to report and analyze business events.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Identify the business process

In simple terms, identifying the business process is identifying a measurable event that generates data within an organization. Usually, companies have some sort of operational source system that generates their data in its raw format. This is a good starting point to identify various sources for a business process.

The business process is then persisted as a data mart in the form of dimensions and facts. Looking at our sample dataset mentioned earlier, we can clearly see the business process is the sales made for a given event.

A common mistake made is using departments of a company as the business process. The data (business process) needs to be integrated across various departments, in this case, marketing can access the sales data. Identifying the correct business process is critical—getting this step wrong can impact the entire data mart (it can cause the grain to be duplicated and incorrect metrics on the final reports).

Declare the grain of your data

Declaring the grain is the act of uniquely identifying a record in your data source. The grain is used in the fact table to accurately measure the data and enable you to roll up further. In our example, this could be a line item in the sales business process.

In our use case, a sale can be uniquely identified by looking at the transaction time when the sale took place; this will be the most atomic level.

Identify and implement the dimensions

Your dimension table describes your fact table and its attributes. When identifying the descriptive context of your business process, you store the text in a separate table, keeping the fact table grain in mind. When joining the dimensions table to the fact table, there should only be a single row associated to the fact table. In our example, we use the following table to be separated into a dimensions table; these fields describe the facts that we will measure.

When designing the structure of the dimensional model (the schema), you can either create a star or snowflake schema. The structure should closely align with the business process; therefore, a star schema is best fit for our example. The following figure shows our Entity Relationship Diagram (ERD).

In the following sections, we detail the steps to implement the dimensions.

Stage the source data

Before we can create and load the dimensions table, we need source data. Therefore, we stage the source data into a staging or temporary table. This is often referred to as the staging layer, which is the raw copy of the source data. To do this in Amazon Redshift, we use the COPY command to load the data from the dimensional-modeling-in-amazon-redshift public S3 bucket located on the us-east-1 Region. Note that the COPY command uses an AWS Identity and Access Management (IAM) role with access to Amazon S3. The role needs to be associated with the cluster. Complete the following steps to stage the source data:

  1. Create the venue source table:
CREATE TABLE public.venue (
    venueid bigint,
    venuename character varying(100),
    venuecity character varying(30),
    venuestate character(2),
    venueseats bigint
) DISTSTYLE AUTO
        SORTKEY
    (venueid);
  1. Load the venue data:
COPY public.venue
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/venue.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1
  1. Create the sales source table:
CREATE TABLE public.sales (
    salesid integer,
    venueid character varying(256),
    saletime timestamp without time zone,
    qtysold BIGINT,
    commission numeric(18,2),
    pricepaid numeric(18,2)
) DISTSTYLE AUTO;
  1. Load the sales source data:
COPY public.sales
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/sales.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ','
REGION 'us-east-1'
IGNOREHEADER 1
  1. Create the calendar table:
CREATE TABLE public.DimCalendar(
    dateid smallint,
        caldate date,
        day varchar(20),
        week smallint,
        month varchar(20),
        qtr varchar(20),
        year smallint,
        holiday boolean
) DISTSTYLE AUTO
SORTKEY
    (dateid);
  1. Load the calendar data:
COPY public.DimCalendar
FROM 's3://redshift-blogs/dimensional-modeling-in-amazon-redshift/date.csv'
IAM_ROLE '<Your IAM role arn>'
DELIMITER ',' 
REGION 'us-east-1'
IGNOREHEADER 1

Create the dimensions table

Designing the dimensions table can depend on your business requirement—for example, do you need to track changes to the data over time? There are seven different dimension types. For our example, we use type 1 because we don’t need to track historical changes. For more about type 2, refer to Simplify data loading into Type 2 slowly changing dimensions in Amazon Redshift. The dimensions table will be denormalized with a primary key, surrogate key, and a few added fields to indicate changes to the table. See the following code:

create schema SalesMart;
CREATE TABLE SalesMart.DimVenue( 
    "VenueSkey" int IDENTITY(1,1) primary key
    ,"VenueId" VARCHAR NOT NULL
    ,"VenueName" VARCHAR NULL
    ,"VenueCity" VARCHAR NULL
    ,"VenueState" VARCHAR NULL
    ,"VenueSeats" INT NULL
    ,"InsertedDate" DATETIME NOT NULL
    ,"UpdatedDate" DATETIME NOT NULL
) 
diststyle AUTO;

A few notes on creating the dimensions table creation:

  • The field names are transformed into business-friendly names
  • Our primary key is VenueID, which we use to uniquely identify a venue at which the sale took place
  • Two additional rows will be added, indicating when a record was inserted and updated (to track changes)
  • We are using an AUTO distribution style to give Amazon Redshift the responsibility to choose and adjust the distribution style

Another important factor to consider in dimensional modelling is the usage of surrogate keys. Surrogate keys are artificial keys that are used in dimensional modelling to uniquely identify each record in a dimension table. They are typically generated as a sequential integer, and they don’t have any meaning in the business domain. They offer several benefits, such as ensuring uniqueness and improving performance in joins, because they’re typically smaller than natural keys and as surrogate keys they don’t change over time. This allows us to be consistent and join facts and dimensions more easily.

In Amazon Redshift, surrogate keys are typically created using the IDENTITY keyword. For example, the preceding CREATE statement creates a dimension table with a VenueSkey surrogate key. The VenueSkey column is automatically populated with unique values as new rows are added to the table. This column can then be used to join the venue table to the FactSaleTransactions table.

A few tips for designing surrogate keys:

  • Use a small, fixed-width data type for the surrogate key. This will improve performance and reduce storage space.
  • Use the IDENTITY keyword, or generate the surrogate key using a sequential or GUID value. This will ensure that the surrogate key is unique and can’t be changed.

Load the dim table using MERGE

There are numerous ways to load your dim table. Certain factors need to be considered—for example, performance, data volume, and perhaps SLA loading times. With the MERGE statement, we perform an upsert without needing to specify multiple insert and update commands. You can set up the MERGE statement in a stored procedure to populate the data. You then schedule the stored procedure to run programmatically via the query editor, which we demonstrate later in the post. The following code creates a stored procedure called SalesMart.DimVenueLoad:

CREATE OR REPLACE PROCEDURE SalesMart.DimVenueLoad()
AS $$
BEGIN
MERGE INTO SalesMart.DimVenue USING public.venue as MergeSource
ON SalesMart.DimVenue.VenueId = MergeSource.VenueId
WHEN MATCHED
THEN
UPDATE
SET VenueName = ISNULL(MergeSource.VenueName, 'Unknown')
, VenueCity = ISNULL(MergeSource.VenueCity, 'Unknown')
, VenueState = ISNULL(MergeSource.VenueState, 'Unknown')
, VenueSeats = ISNULL(MergeSource.VenueSeats, -1)
, UpdatedDate = GETDATE()
WHEN NOT MATCHED
THEN
INSERT (
VenueId
, VenueName
, VenueCity
, VenueState
, VenueSeats
, UpdatedDate
, InsertedDate
)
VALUES (
ISNULL(MergeSource.VenueId, -1)
, ISNULL(MergeSource.VenueName, 'Unknown')
, ISNULL(MergeSource.VenueCity, 'Unknown')
, ISNULL(MergeSource.VenueState, 'Unknown')
, ISNULL(MergeSource.VenueSeats, -1)
, ISNULL(GETDATE() , '1900-01-01')
, ISNULL(GETDATE() , '1900-01-01')
);
END;
$$
LANGUAGE plpgsql;

A few notes on the dimension loading:

  • When a record in inserted for the first time, the inserted date and updated date will be populated. When any values change, the data is updated and the updated date reflects the date when it was changed. The inserted date remains.
  • Because the data will be used by business users, we need to replace NULL values, if any, with more business-appropriate values.

Identify and implement the facts

Now that we have declared our grain to be the event of a sale that took place at a specific time, our fact table will store the numeric facts for our business process.

We have identified the following numerical facts to measure:

  • Quantity of tickets sold per sale
  • Commission for the sale

Implementing the Fact

There are three types of fact tables (transaction fact table, periodic snapshot fact table, and accumulating snapshot fact table). Each serves a different view of the business process. For our example, we use a transaction fact table. Complete the following steps:

  1. Create the fact table
CREATE TABLE SalesMart.FactSaleTransactions( 
    CalendarDate date NOT NULL
    ,SaleTransactionTime DATETIME NOT NULL
    ,VenueSkey INT NOT NULL
    ,QuantitySold BIGINT NOT NULL
    ,SaleComission NUMERIC NOT NULL
    ,InsertedDate DATETIME DEFAULT GETDATE()
) diststyle AUTO;

An inserted date with a default value is added, indicating if and when a record was loaded. You can use this when reloading the fact table to remove the already loaded data to avoid duplicates.

Loading the fact table consists of a simple insert statement joining your associated dimensions. We join from the DimVenue table that was created, which describes our facts. It’s best practice but optional to have calendar date dimensions, which allow the end-user to navigate the fact table. Data can either be loaded when there is a new sale, or daily; this is where the inserted date or load date comes in handy.

We load the fact table using a stored procedure and use a date parameter.

  1. Create the stored procedure with the following code. To keep the same data integrity that we applied in the dimension load, we replace NULL values, if any, with more business appropriate values:
create or replace procedure SalesMart.FactSaleTransactionsLoad(loadate datetime)
language plpgsql
as
    $$
begin
--------------------------------------------------------------------
/*** Delete records loaded for the day, should there be any ***/
--------------------------------------------------------------------
Delete from SalesMart.FactSaleTransactions
where cast(InsertedDate as date) = CAST(loadate as date);
RAISE INFO 'Deleted rows for load date: %', loadate;
--------------------------------------------------------------------
/*** Insert records ***/
--------------------------------------------------------------------
INSERT INTO SalesMart.FactSaleTransactions (
CalendarDate    
,SaleTransactionTime    
,VenueSkey  
,QuantitySold  
,Salecomission
)
SELECT DISTINCT
    ISNULL(c.caldate, '1900-01-01') as CalendarDate
    ,ISNULL(a.saletime, '1900-01-01') as SaleTransactionTime
    ,ISNULL(b.VenueSkey, -1) as VenueSkey
    ,ISNULL(a.qtysold, 0) as QuantitySold
    ,ISNULL(a.commission, 0) as SaleComission
FROM
    public.sales as a
 
LEFT JOIN SalesMart.DimVenue as b
on a.venueid = b.venueid
 
LEFT JOIN public.DimCalendar as c
on to_char(a.saletime,'YYYYMMDD') = to_char(c.caldate,'YYYYMMDD');
--Optional filter, should you want to load only the latest data from source
--where cast(a.saletime as date) = cast(loadate as date);
  
end;
$$;
  1. Load the data by calling the procedure with the following command:
call SalesMart.FactSaleTransactionsLoad(getdate())

Schedule the data load

We can now automate the modeling process by scheduling the stored procedures in Amazon Redshift Query Editor V2. Complete the following steps:

  1. We first call the dimension load and after the dimension load runs successfully, the fact load begins:
BEGIN;
----Insert Dim Loads
call SalesMart.DimVenueLoad();

----Insert Fact Loads. They will only run if the DimLoad is successful
call SalesMart.FactSaleTransactionsLoad(getdate());
END;

If the dimension load fails, the fact load will not run. This ensures consistency in the data because we don’t want to load the fact table with outdated dimensions.

  1. To schedule the load, choose Schedule in Query Editor V2.

  1. We schedule the query to run every day at 5:00 AM.
  2. Optionally, you can add failure notifications by enabling Amazon Simple Notification Service (Amazon SNS) notifications.

Report and analysis the data in Amazon Quicksight

QuickSight is a business intelligence service that makes it easy to deliver insights. As a fully managed service, QuickSight lets you easily create and publish interactive dashboards that can then be accessed from any device and embedded into your applications, portals, and websites.

We use our data mart to visually present the facts in the form of a dashboard. To get started and set up QuickSight, refer to Creating a dataset using a database that’s not autodiscovered.

After you create your data source in QuickSight, we join the modeled data (data mart) together based on our surrogate key skey. We use this dataset to visualize the data mart.

Our end dashboard will contain the insights of the data mart and answer critical business questions, such as total commission per venue and dates with the highest sales. The following screenshot shows the final product of the data mart.

Clean up

To avoid incurring future charges, delete any resources you created as part of this post.

Conclusion

We have now successfully implemented a data mart using our DimVenue, DimCalendar, and FactSaleTransactions tables. Our warehouse is not complete; as we can expand the data mart with more facts and implement more marts, and as the business process and requirements grow over time, so will the data warehouse. In this post, we gave an end-to-end view on understanding and implementing dimensional modeling in Amazon Redshift.

Get started with your Amazon Redshift dimensional model today.


About the Authors

Bernard Verster is an experienced cloud engineer with years of exposure in creating scalable and efficient data models, defining data integration strategies, and ensuring data governance and security. He is passionate about using data to drive insights, while aligning with business requirements and objectives.

Abhishek Pan is a WWSO Specialist SA-Analytics working with AWS India Public sector customers. He engages with customers to define data-driven strategy, provide deep dive sessions on analytics use cases, and design scalable and performant analytical applications. He has 12 years of experience and is passionate about databases, analytics, and AI/ML. He is an avid traveler and tries to capture the world through his camera lens.

Protect APIs with Amazon API Gateway and perimeter protection services

Post Syndicated from Pengfei Shao original https://aws.amazon.com/blogs/security/protect-apis-with-amazon-api-gateway-and-perimeter-protection-services/

As Amazon Web Services (AWS) customers build new applications, APIs have been key to driving the adoption of these offerings. APIs simplify client integration and provide for efficient operations and management of applications by offering standard contracts for data exchange. APIs are also the front door to hosted applications that need to be effectively secured, monitored, and metered to provide resilient infrastructure.

In this post, we will discuss how to help protect your APIs by building a perimeter protection layer with Amazon CloudFront, AWS WAF, and AWS Shield and putting it in front of Amazon API Gateway endpoints. Amazon API Gateway is a fully managed AWS service that you can use to create, publish, maintain, monitor, and secure REST, HTTP, and WebSocket APIs at any scale.

Solution overview

CloudFront, AWS WAF, and Shield provide a layered security perimeter that co-resides at the AWS edge and provides scalable, reliable, and high-performance protection for applications and content. For more information, see the AWS Best Practices for DDoS Resiliency whitepaper.

By using CloudFront as the front door to APIs that are hosted on API Gateway, globally distributed API clients can get accelerated API performance. API Gateway endpoints that are hosted in an AWS Region gain access to scaled distributed denial of service (DDoS) mitigation capacity across the AWS global edge network.

When you protect CloudFront distributions with AWS WAF, you can protect your API Gateway API endpoints against common web exploits and bots that can affect availability, compromise security, or consume excessive resources. AWS Managed Rules for AWS WAF help provide protection against common application vulnerabilities or other unwanted traffic, without the need for you to write your own rules. AWS WAF rate-based rules automatically block traffic from source IPs when they exceed the thresholds that you define, which helps to protect your application against web request floods, and alerts you to sudden spikes in traffic that might indicate a potential DDoS attack.

Shield mitigates infrastructure layer DDoS attacks against CloudFront distributions in real time, without observable latency. When you protect a CloudFront distribution with Shield Advanced, you gain additional detection and mitigation against large and sophisticated DDoS attacks, near real-time visibility into attacks, and integration with AWS WAF. When you configure Shield Advanced automatic application layer DDoS mitigation, Shield Advanced responds to application layer (layer 7) attacks by creating, evaluating, and deploying custom AWS WAF rules.

To take advantage of the perimeter protection layer built with CloudFront, AWS WAF, and Shield, and to help avoid exposing API Gateway endpoints directly, you can use the following approaches to restrict API access through CloudFront only. For more information about these approaches, see the Security Overview of Amazon API Gateway whitepaper.

  1. CloudFront can insert the X-API-Key header before it forwards the request to API Gateway, and API Gateway validates the API key when receiving the requests. For more information, see Protecting your API using Amazon API Gateway and AWS WAF — Part 2.
  2. CloudFront can insert a custom header (not X-API-Key) with a known secret that is shared with API Gateway. An AWS Lambda custom request authorizer that is configured in API Gateway validates the secret. For more information, see Restricting access on HTTP API Gateway Endpoint with Lambda Authorizer.
  3. CloudFront can sign the request with AWS Signature Version 4 by using Lambda@Edge before it sends the request to API Gateway. Configured AWS Identity and Access Management (IAM) authorization in API Gateway validates the signature and verifies the identity of the requester.

Although the X-API-Key header approach is straightforward to implement at a lower cost, it’s only applicable to customers who are using REST API endpoints. If the X-API-Key header already exists, CloudFront will overwrite it. The custom header approach addresses this limitation, but it has an additional cost due to the use of the Lambda authorizer. With both approaches, there is an operational overhead for managing keys and rotating the keys periodically. Also, it isn’t a security best practice to use long-term secrets for authorization.

By using the AWS Signature Version 4 approach, you can minimize this type of operational overhead through the use of requests signed with Signature Version 4 in Lambda@Edge. The signing uses temporary credentials that AWS Security Token Service (AWS STS) provides, and built-in API Gateway IAM authorization performs the request signature validation. There is an additional Lambda@Edge cost in this approach. This approach supports the three API endpoint types available in API Gateway — REST, HTTP, and WebSocket — and it helps secure requests by verifying the identity of the requester, protecting data in transit, and protecting against potential replay attacks. We describe this approach in detail in the next section.

Solution architecture

Figure 1 shows the architecture of the Signature Version 4 solution.

Figure 1: High-level flow of a client request with sequence of events

Figure 1: High-level flow of a client request with sequence of events

The sequence of events that occurs when the client sends a request is as follows:

  1. A client sends a request to an API endpoint that is fronted by CloudFront.
  2. AWS WAF inspects the request at the edge location according to the web access control list (web ACL) rules that you configured. With Shield Advanced automatic application-layer mitigation enabled, when Shield Advanced detects a DDoS attack and identifies the attack signatures, Shield Advanced creates AWS WAF rules inside an associated web ACL to mitigate the attack.
  3. CloudFront handles the request and invokes the Lambda@Edge function before sending the request to API Gateway.
  4. The Lambda@Edge function signs the request with Signature Version 4 by adding the necessary headers.
  5. API Gateway verifies the Lambda@Edge function with the necessary permissions and sends the request to the backend.
  6. An unauthorized client sends a request to an API Gateway endpoint, and it receives the HTTP 403 Forbidden message.

Solution deployment

The sample solution contains the following main steps:

  1. Preparation
  2. Deploy the CloudFormation template
  3. Enable IAM authorization in API Gateway
  4. Confirm successful viewer access to the CloudFront URL
  5. Confirm that direct access to the API Gateway API URL is blocked
  6. Review the CloudFront configuration
  7. Review the Lambda@Edge function and its IAM role
  8. Review the AWS WAF web ACL configuration
  9. (Optional) Protect the CloudFront distribution with Shield Advanced

Step 1: Preparation

Before you deploy the solution, you will first need to create an API Gateway endpoint.

To create an API Gateway endpoint

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Select this image to open a link that starts building the CloudFormation stack

    Note: The stack will launch in the US East (N. Virginia) Region (us-east-1). To deploy the solution to another Region, download the solution’s CloudFormation template, and deploy it to the selected Region.

    When you launch the stack, it creates an API called PetStoreAPI that is deployed to the prod stage.

  2. In the Stages navigation pane, expand the prod stage, select GET on /pets/{petId}, and then copy the Invoke URL value of https://api-id.execute-api.region.amazonaws.com/prod/pets/{petId}. {petId} stands for a path variable.
  3. In the address bar of a browser, paste the Invoke URL value. Make sure to replace {petId} with your own information (for example, 1), and press Enter to submit the request. A 200 OK response should return with the following JSON payload:
    {
      "id": 1,
      "type": "dog",
      "price": 249.99
    }

In this post, we will refer to this API Gateway endpoint as the CloudFront origin.

Step 2: Deploy the CloudFormation template

The next step is to deploy the CloudFormation template of the solution.

The CloudFormation template includes the following:

  • A CloudFront distribution that uses an API Gateway endpoint as the origin
  • An AWS WAF web ACL that is associated with the CloudFront distribution
  • A Lambda@Edge function that is used to sign the request with Signature Version 4 and that the CloudFront distribution invokes before the request is forwarded to the origin on the CloudFront distribution
  • An IAM role for the Lambda@Edge function

To deploy the CloudFormation template

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account.

    Select this image to open a link that starts building the CloudFormation stack

    Note: The stack will launch in the US East N. Virginia Region (us-east-1). To deploy the solution to another Region, download the solution’s CloudFormation template, provide the required parameters, and deploy it to the selected Region.

  2. On the Specify stack details page, update with the following:
    1. For Stack name, enter APIProtection
    2. For the parameter APIGWEndpoint, enter the API Gateway endpoint in the following format. Make sure to replace <Region> with your own information.

    {api-id}.execute-api.<Region>.amazonaws.com

  3. Choose Next to continue the stack deployment.

It takes a couple of minutes to finish the deployment. After it finishes, the Output tab lists the CloudFront domain URL, as shown in Figure 2.

Figure 2: CloudFormation template output

Figure 2: CloudFormation template output

Step 3: Enable IAM authorization in API Gateway

Before you verify the solution, you will enable IAM authorization on the API endpoint first, which enforces Signature Version 4 verification at API Gateway. The following steps are applied for a REST API; you could also enable IAM authorization on an HTTP API or WebSocket API.

To enable IAM authorization in API Gateway

  1. In the API Gateway console, choose the name of your API.
  2. In the Resources pane, choose the GET method for the resource /pets. In the Method Execution pane, choose Method Request.
  3. Under Settings, for Authorization, choose the pencil icon (Edit). Then, in the dropdown list, choose AWS_IAM, and choose the check mark icon (Update).
  4. Repeat steps 2 and 3 for the resource /pets/{petId}.
  5. Deploy your API so that the changes take effect. When deploying, choose prod as the stage.
Figure 3: Enable IAM authorization in API Gateway

Figure 3: Enable IAM authorization in API Gateway

Step 4: Confirm successful viewer access to the CloudFront URL

Now that you’ve deployed the setup, you can verify that you are able to access the API through the CloudFront distribution.

To confirm viewer access through CloudFront

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Outputs tab, copy the value for the CFDistribution entry and append /prod/pets to it, then open the URL in a new browser tab or window. The result should look similar to the following, which confirms successful viewer access through CloudFront.
    Figure 4: Successful API response when accessing API through CloudFront distribution

    Figure 4: Successful API response when accessing API through CloudFront distribution

Step 5: Confirm that direct access to the API Gateway API URL is blocked

Next, verify whether direct access to the API Gateway API endpoint is blocked.

Copy your API Gateway endpoint URL and append /prod/pets to it, then open the URL in a new browser tab or window. The result should look similar to the following, which confirms that direct viewer access through API Gateway is blocked.

Figure 5: API error response when attempting to access API Gateway directly

Figure 5: API error response when attempting to access API Gateway directly

Step 6: Review CloudFront configuration

Now that you’ve confirmed that access to the API Gateway endpoint is restricted to CloudFront only, you will review the CloudFront configuration that enables this restriction.

To review the CloudFront configuration

  1. In the CloudFormation console, choose the APIProtection stack. On the stack Resources tab, under the CFDistribution entry, copy the distribution ID.
  2. In the CloudFront console, select the distribution that has the distribution ID that you noted in the preceding step. On the Behaviors tab, select the behavior with path pattern Default (*).
  3. Choose Edit and scroll to the Cache key and origin requests section. You can see that Origin request policy is set to AllViewerExceptHostHeader, which allows CloudFront to forward viewer headers, cookies, and query strings to origins except the Host header. This policy is intended for use with the API Gateway origin.
  4. Scroll down to the Function associations – optional section.
    Figure 6: CloudFront configuration – Function association with origin request

    Figure 6: CloudFront configuration – Function association with origin request

    You can see that a Lambda@Edge function is associated with the origin request event; CloudFront invokes this function before forwarding requests to the origin. You can also see that the Include body option is selected, which exposes the request body to Lambda@Edge for HTTP methods like POST/PUT, and the request payload hash will be used for Signature Version 4 signing in the Lambda@Edge function.

Step 7: Review the Lambda@Edge function and its IAM role

In this step, you will review the Lambda@Edge function code and its IAM role, and learn how the function signs the request with Signature Version 4 before forwarding to API Gateway.

To review the Lambda@Edge function code

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Resources tab, choose the Sigv4RequestLambdaFunction link to go to the Lambda function, and review the function code. You can see that it follows the Signature Version 4 signing process and uses an AWS access key to calculate the signature. The AWS access key is a temporary security credential provided when the IAM role for Lambda is being assumed.

To review the IAM role for Lambda

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Resources tab, choose the Sigv4RequestLambdaFunctionExecutionRole link to go to the IAM role. Expand the permission policy to review the permissions. You can see that the policy allows the API Gateway endpoint to be invoked.
            {
                "Action": [
                    "execute-api:Invoke"
                ],
                "Resource": [
                    "arn:aws:execute-api:<region>:<account-id>:<api-id>/*/*/*"
                ],
                "Effect": "Allow"
            }

Because IAM authorization is enabled, when API Gateway receives the request, it checks whether the client has execute-api:Invoke permission for the API and route before handling the request.

Step 8: Review AWS WAF web ACL configuration

In this step, you will review the web ACL configuration in AWS WAF.

AWS Managed Rules for AWS WAF helps provide protection against common application vulnerabilities or other unwanted traffic. The web ACL for this solution includes several AWS managed rule groups as an example. The Amazon IP reputation list managed rule group helps to mitigate bots and reduce the risk of threat actors by blocking problematic IP addresses. The Core rule set (CRS) managed rule group helps provide protection against exploitation of a wide range of vulnerabilities, including some of the high risk and commonly occurring vulnerabilities described in the OWASP Top 10. The Known bad inputs managed rule group helps to reduce the risk of threat actors by blocking request patterns that are known to be invalid and that are associated with exploitation or discovery of vulnerabilities, like Log4J.

AWS WAF supports rate-based rules to block requests originating from IP addresses that exceed the set threshold per 5-minute time span, until the rate of requests falls below the threshold. We have used one such rule in the following example, but you could layer the rules for better security posture. You can configure multiple rate-based rules, each with a different threshold and scope (like URI, IP list, or country) for better protection. For more information on best practices for AWS WAF rate-based rules, see The three most important AWS WAF rate-based rules.

To review the web ACL configuration

  1. In the CloudFormation console, choose the APIProtection stack.
  2. On the stack Outputs tab, choose the EdgeLayerWebACL link to go to the web ACL configuration, and then choose the Rules tab to review the rules for this web ACL. On the Rules tab, you can see that the web ACL includes the following rule and rule groups.
    Figure 7: AWS WAF web ACL configuration

    Figure 7: AWS WAF web ACL configuration

  3. Choose the Associated AWS resources tab. You should see that the CloudFront distribution is associated to this web ACL.

Step 9: (Optional) Protect the CloudFront distribution with Shield Advanced

In this optional step, you will protect your CloudFront distribution with Shield Advanced. This adds additional protection on top of the protection provided by AWS WAF managed rule groups and rate-based rules in the web ACL that is associated with the CloudFront distribution.

Note: Proceed with this step only if you have subscribed to an annual subscription to Shield Advanced.

AWS Shield is a managed DDoS protection service that is offered in two tiers: AWS Shield Standard and AWS Shield Advanced. All AWS customers benefit from the automatic protection of Shield Standard, at no additional cost. Shield Standard helps defend against the most common, frequently occurring network and transport layer DDoS attacks that target your website or applications. AWS Shield Advanced is a paid service that requires a 1-year commitment—you pay one monthly subscription fee, plus usage fees based on gigabytes (GB) of data transferred out. Shield Advanced provides expanded DDoS attack protection for your applications.

Besides providing visibility and additional detection and mitigation against large and sophisticated DDoS attacks, Shield Advanced also gives you 24/7 access to the Shield Response Team (SRT) and cost protection against spikes in your AWS bill that might result from a DDoS attack against your protected resources. When you use both Shield Advanced and AWS WAF to help protect your resources, AWS waives the basic AWS WAF fees for web ACLs, rules, and web requests for your protected resources. You can grant permission to the SRT to act on your behalf, and also configure proactive engagement so that SRT contacts you directly when the availability and performance of your application is impacted by a possible DDoS attack.

Shield Advanced automatic application-layer DDoS mitigation compares current traffic patterns to historic traffic baselines to detect deviations that might indicate a DDoS attack. When you enable automatic application-layer DDoS mitigation, if your protected resource doesn’t yet have a history of normal application traffic, we recommend that you set to Count mode until a history of normal application traffic has been established. Shield Advanced establishes baselines that represent normal traffic patterns after protecting resources for at least 24 hours and is most accurate after 30 days. To mitigate against application layer attacks automatically, change the AWS WAF rule action to Block after you’ve established a normal traffic baseline.

To help protect your CloudFront distribution with Shield Advanced

  1. In the WAF & Shield console, in the AWS Shield section, choose Protected Resources, and then choose Add resources to protect.
  2. For Resource type, select CloudFront distribution, and then choose Load resources.
  3. In the Select resources section, select the CloudFront distribution that you used in Step 6 of this post. Then choose Protect with Shield Advanced.
  4. In the Automatic application layer DDoS mitigation section, choose Enable. Leave the AWS WAF rule action as Count, and then choose Next.
  5. (Optional, but recommended) Under Associated health check, choose one Amazon Route 53 health check to associate with the protection, and then choose Next. The Route 53 health check is used to enable health-based detection, which can improve responsiveness and accuracy in attack detection and mitigation. Associating the protected resource with a Route 53 health check is also one of the prerequisites to be protected with proactive engagement. You can create the health check by following these best practices.
  6. (Optional) In the Select SNS topic to notify for DDoS detected alarms section, select the SNS topic that you want to use for notification for DDoS detected alarms, then choose Next.
  7. Choose Finish configuration.

With automatic application-layer DDoS mitigation configured, Shield Advanced creates a rule group in the web ACL that you have associated with your resource. Shield Advanced depends on the rule group for automatic application-layer DDoS mitigation.

To review the rule group created by Shield Advanced

  1. In the CloudFormation console, choose the APIProtection stack. On the stack Outputs tab, look for the EdgeLayerWebACL entry.
  2. Choose the EdgeLayerWebACL link to go to the web ACL configuration.
  3. Choose the Rules tab, and look for the rule group with the name that starts with ShieldMitigationRuleGroup, at the bottom of the rule list. This rule group is managed by Shield Advanced, and is not viewable.
    Figure 8: Shield Advanced created rule group for DDoS mitigation

    Figure 8: Shield Advanced created rule group for DDoS mitigation

Considerations

Here are some further considerations as you implement this solution:

Conclusion

In this blog post, we introduced managing public-facing APIs through API Gateway, and helping protect API Gateway endpoints by using CloudFront and AWS perimeter protection services (AWS WAF and Shield Advanced). We walked through the steps to add Signature Version 4 authentication information to the CloudFront originated API requests, providing trusted access to the APIs. Together, these actions present a best practice approach to build a DDoS-resilient architecture that helps protect your application’s availability by preventing many common infrastructure and application layer DDoS attacks.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Pengfei Shao

Pengfei Shao

Pengfei is a Senior Technical Account Manager at AWS based in Stockholm, with more than 20 years of experience in Telecom and IT industry. His main focus is to help AWS Enterprise Support customers to remain operationally healthy, secure, and cost efficient in AWS. He is also focusing on AWS Edge Services domain, and loves to work with customers to solve their technical challenges.

Manoj Gupta

Manoj Gupta

Manoj is a Senior Solutions Architect at AWS. He’s passionate about building well-architected cloud-focused solutions by using AWS services with security, networking, and serverless as his primary focus areas. Before AWS, he worked in application and system architecture roles, building solutions across various industries. Outside of work, when he gets free time, he enjoys the outdoors and walking trails with his family.

Migrate data from Google Cloud Storage to Amazon S3 using AWS Glue

Post Syndicated from Qiushuang Feng original https://aws.amazon.com/blogs/big-data/migrate-data-from-google-cloud-storage-to-amazon-s3-using-aws-glue/

Today, we are pleased to announce a new AWS Glue connector for Google Cloud Storage that allows you to move data bi-directionally between Google Cloud Storage and Amazon Simple Storage Service (Amazon S3).

We’ve seen that there is a demand to design applications that enable data to be portable across cloud environments and give you the ability to derive insights from one or more data sources. One of the data sources you can now quickly integrate with is Google Cloud Storage, a managed service for storing both unstructured data and structured data. With this connector, you can bring the data from Google Cloud Storage to Amazon S3.

In this post, we go over how the new connector works, introduce the connector’s functions, and provide you with key steps to set it up. We provide you with prerequisites, share how to subscribe to this connector in AWS Marketplace, and describe how to create and run AWS Glue for Apache Spark jobs with it.

AWS Glue is a serverless data integration service that makes it simple to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue natively integrates with various data stores such as MySQL, PostgreSQL, MongoDB, and Apache Kafka, along with AWS data stores such as Amazon S3, Amazon Redshift, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon S3. AWS Glue Marketplace connectors allow you to discover and integrate additional data sources, such as software as a service (SaaS) applications and your custom data sources. With just a few clicks, you can search and select connectors from AWS Marketplace and begin your data preparation workflow in minutes.

How the connector works

This connector relies on the Spark DataSource API in AWS Glue and calls Hadoop’s FileSystem interface. The latter has implemented libraries for reading and writing various distributed or traditional storage. This connector also includes the Google Cloud Storage Connector for Hadoop, which lets you run Apache Hadoop or Apache Spark jobs directly on data in Google Cloud Storage. AWS Glue loads the library from the Amazon Elastic Container Registry (Amazon ECR) repository during initialization (as a connector), reads the connection credentials using AWS Secrets Manager, and reads data source configurations from input parameters. When AWS Glue has internet access, the Spark job in AWS Glue can read from and write to Google Cloud Storage.

Solution overview

The following architecture diagram shows how AWS Glue connects to Google Cloud Storage for data ingestion.

In the following sections, we show you how to create a new secret for Google Cloud Storage in Secrets Manager, subscribe to the AWS Glue connector, and move data from Google Cloud Storage to Amazon S3.

Prerequisites

You need the following prerequisites:

  • An account in Google Cloud and your data path in Google Cloud Storage. Prepare the GCP account keys file in advance and upload them to the S3 bucket. For instructions, refer to Create a service account key.
  • A Secrets Manager secret to store a Google Cloud secret.
  • An AWS Identity and Access Management (IAM) role for the AWS Glue job with the following policies:
    • AWSGlueServiceRole, which allows the AWS Glue service role access to related services.
    • AmazonEC2ContainerRegistryReadOnly, which provides read-only access to Amazon EC2 Container Registry repositories. This policy is for using AWS Marketplace’s connector libraries.
    • A Secrets Manager policy, which provides read access to the secret in Secrets Manager.
    • An S3 bucket policy for the S3 bucket that you need to load ETL (extract, transform, and load) data from Google Cloud Storage.

We assume that you are already familiar with the key concepts of Secrets Manager, IAM, and AWS Glue. Regarding IAM, these roles should be granted the permissions needed to communicate with AWS services and nothing more, according to the principle of least privilege.

Create a new secret for Google Cloud Storage in Secrets Manager

Complete the following steps to create a secret in Secrets Manager to store the Google Cloud Storage credentials:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, select Other type of secret.
  3. Enter your key as keyS3Uri and the value as your key file in the s3 bucket, for example, s3://keys/project-gcs-connector **.json.
  4. Leave the rest of the options at their default.
  5. Choose Next.
  6. Provide a name for the secret, such as googlecloudstorage_credentials.
  7. Follow the rest of the steps to store the secret.

Subscribe to the AWS Glue connector for Google Cloud Storage

To subscribe to the connector, complete the following steps:

  1. Navigate to the Google Cloud Storage Connector for AWS Glue on AWS Marketplace.
  2. On the product page for the connector, use the tabs to view information about the connector. If you decide to purchase this connector, choose Continue to Subscribe.
  3. Review the pricing terms and the seller’s End User License Agreement, then choose Accept Terms.
  4. Continue to the next step by choosing Continue to Configuration.
  5. On the Configure this software page, choose the fulfillment options and the version of the connector to use. We have provided two options for the Google Cloud Storage Connector, AWS Glue 3.0 and AWS Glue 4.0. In this example, we focus on AWS Glue 4.0. After selecting Glue 3.0 or Glue 4.0, select corresponding AWS Glue version when you configure the AWS Glue job.
  6. Choose Continue to Launch.
  7. On the Launch this software page, you can review the Usage Instructions provided by AWS. When you’re ready to continue, choose Activate the Glue connector in AWS Glue Studio.

The console will display the Create marketplace connection page in AWS Glue Studio.

Move data from Google Cloud Storage to Amazon S3

To move your data to Amazon S3, you must configure the custom connection and then set up an AWS Glue job.

Create a custom connection in AWS Glue

An AWS Glue connection stores connection information for a particular data store, including login credentials, URI strings, virtual private cloud (VPC) information, and more. Complete the following steps to create your connection:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose Create connection.
  3. For Connector, choose Google Cloud Storage Connector for AWS Glue.
  4. For Name, enter a name for the connection (for example, GCSConnection).
  5. Enter an optional description.
  6. For AWS secret, enter the secret you created (googlecloudstorage_credentials).
  7. Choose Create connection and activate connector.

The connector and connection information is now visible on the Connectors page.

Create an AWS Glue job and configure connection options

Complete the following steps:

  1. On the AWS Glue console, choose Connectors in the navigation pane.
  2. Choose the connection you created (GCSConnection).
  3. Choose Create job.
  4. On the Node properties tab in the node details pane, enter the following information:
    • For Name, enter Google Cloud Storage Connector for AWS Glue. This name should be unique among all the nodes for this job.
    • For Node type, choose the Google Cloud Storage Connector.
  5. On the Data source properties tab, provide the following information:
    • For Connection, choose the connection you created (GCSConnection).
    • For Key, enter path, and for Value, enter your Google Cloud Storage URI (for example, gs://bucket/covid-csv-data/).
    • Enter another key-value pair. For Key, enter fileFormat. For Value, enter csv, because our sample data is this format.
  6. On the Job details tab, for IAM Role, choose the IAM role mentioned in the prerequisites.
  7. For Glue version, choose your AWS Glue version.
  8. Continue to create your ETL job. For instructions, refer to Creating ETL jobs with AWS Glue Studio.
  9. Choose Run to run your job.

After the job succeeds, we can check the logs in Amazon CloudWatch.

The data is ingested into Amazon S3, as shown in the following screenshot.                        

We are now able to import data from Google Cloud Storage to Amazon S3.

Scaling considerations

In this example, we set the AWS Glue capacity as 10 DPU (Data Processing Units). A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. To scale your AWS Glue job, you can increase the number of DPU, and also take advantage of Auto Scaling. With Auto Scaling enabled, AWS Glue automatically adds and removes workers from the cluster depending on the workload. This removes the need for you to experiment and decide on the number of workers to assign for your AWS Glue ETL jobs. If you choose the maximum number of workers, AWS Glue will adapt the right size of resources for the workload.

Clean up

To clean up your resources, complete the following steps:

  1. Remove the AWS Glue job and secret in Secrets Manager with the following command:
    aws glue delete-job —job-name <your_job_name> aws glue delete-connection —connection-name <your_connection_name>
    aws secretsmanager delete-secret —secret-id <your_secretsmanager_id> 

  2. Cancel the Google Cloud Storage Connector for AWS Glue’s subscription:
    • On the AWS Marketplace console, go to the Manage subscriptions page.
    • Select the subscription for the product that you want to cancel.
    • On the Actions menu, choose Cancel subscription.
    • Read the information provided and select the acknowledgement check box.
    • Choose Yes, cancel subscription.
  3. Delete the data in the S3 buckets.

Conclusion

In this post, we showed how to use AWS Glue and the new connector for ingesting data from Google Cloud Storage to Amazon S3. This connector provides access to Google Cloud Storage, facilitating cloud ETL processes for operational reporting, backup and disaster recovery, data governance, and more.

This connector enables your data to be portable across Google Cloud Storage and Amazon S3. We welcome any feedback or questions in the comments section.

References


About the authors

Qiushuang Feng is a Solutions Architect at AWS, responsible for Enterprise customers’ technical architecture design, consulting, and design optimization on AWS Cloud services. Before joining AWS, Qiushuang worked in IT companies such as IBM and Oracle, and accumulated rich practical experience in development and analytics.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is passionate about architecting fast-growing data environments, diving deep into distributed big data software like Apache Spark, building reusable software artifacts for data lakes, and sharing knowledge in AWS Big Data blog posts.

 Greg Huang is a Senior Solutions Architect at AWS with expertise in technical architecture design and consulting for the China G1000 team. He is dedicated to deploying and utilizing enterprise-level applications on AWS Cloud services. He possesses nearly 20 years of rich experience in large-scale enterprise application development and implementation, having worked in the cloud computing field for many years. He has extensive experience in helping various types of enterprises migrate to the cloud. Prior to joining AWS, he worked for well-known IT enterprises such as Baidu and Oracle.

Maciej Torbus is a Principal Customer Solutions Manager within Strategic Accounts at Amazon Web Services. With extensive experience in large-scale migrations, he focuses on helping customers move their applications and systems to highly reliable and scalable architectures in AWS. Outside of work, he enjoys sailing, traveling, and restoring vintage mechanical watches.

Automate secure access to Amazon MWAA environments using existing OpenID Connect single-sign-on authentication and authorization

Post Syndicated from Ajay Vohra original https://aws.amazon.com/blogs/big-data/automate-secure-access-to-amazon-mwaa-environments-using-existing-openid-connect-single-sign-on-authentication-and-authorization/

Customers use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to run Apache Airflow at scale in the cloud. They want to use their existing login solutions developed using OpenID Connect (OIDC) providers with Amazon MWAA; this allows them to provide a uniform authentication and single sign-on (SSO) experience using their adopted identity providers (IdP) across AWS services. For ease of use for end-users of Amazon MWAA, organizations configure a custom domain endpoint to their Apache Airflow UI endpoint. For teams operating and managing multiple Amazon MWAA environments, securing and customizing each environment is a repetitive but necessary task. Automation through infrastructure as code (IaC) can alleviate this heavy lifting to achieve consistency at scale.

This post describes how you can integrate your organization’s existing OIDC-based IdPs with Amazon MWAA to grant secure access to your existing Amazon MWAA environments. Furthermore, you can use the solution to provision new Amazon MWAA environments with the built-in OIDC-based IdP integrations. This approach allows you to securely provide access to your new or existing Amazon MWAA environments without requiring AWS credentials for end-users.

Overview of Amazon MWAA environments

Managing multiple user names and passwords can be difficult—this is where SSO authentication and authorization comes in. OIDC is a widely used standard for SSO, and it’s possible to use OIDC SSO authentication and authorization to access Apache Airflow UI across multiple Amazon MWAA environments.

When you provision an Amazon MWAA environment, you can choose public or private Apache Airflow UI access mode. Private access mode is typically used by customers that require restricting access from only within their virtual private cloud (VPC). When you use public access mode, the access to the Apache Airflow UI is available from the internet, in the same way as an AWS Management Console page. Internet access is needed when access is required outside of a corporate network.

Regardless of the access mode, authorization to the Apache Airflow UI in Amazon MWAA is integrated with AWS Identity and Access Management (IAM). All requests made to the Apache Airflow UI need to have valid AWS session credentials with an assumed IAM role that has permissions to access the corresponding Apache Airflow environment. For more details on the permissions policies needed to access the Apache Airflow UI, refer to Apache Airflow UI access policy: AmazonMWAAWebServerAccess.

Different user personas such as developers, data scientists, system operators, or architects in your organization may need access to the Apache Airflow UI. In some organizations, not all employees have access to the AWS console. It’s fairly common that employees who don’t have AWS credentials may also need access to the Apache Airflow UI that Amazon MWAA exposes.

In addition, many organizations have multiple Amazon MWAA environments. It’s common to have an Amazon MWAA environment setup per application or team. Each of these Amazon MWAA environments can be run in different deployment environments like development, staging, and production. For large organizations, you can easily envision a scenario where there is a need to manage multiple Amazon MWAA environments. Organizations need to provide secure access to all of their Amazon MWAA environments using their existing OIDC provider.

Solution Overview

The solution architecture integrates an existing OIDC provider to provide authentication for accessing the Amazon MWAA Apache Airflow UI. This allows users to log in to the Apache Airflow UI using their OIDC credentials. From a system perspective, this means that Amazon MWAA can integrate with an existing OIDC provider rather than having to create and manage an isolated user authentication and authorization through IAM internally.

The solution architecture relies on an Application Load Balancer (ALB) setup with a fully qualified domain name (FQDN) with public (internet) or private access. This ALB provides SSO access to multiple Amazon MWAA environments. The user-agent (web browser) call flow for accessing an Apache Airflow UI console to the target Amazon MWAA environment includes the following steps:

  1. The user-agent resolves the ALB domain name from the Domain Name System (DNS) resolver.
  2. The user-agent sends a login request to the ALB path /aws_mwaa/aws-console-sso with a set of query parameters populated. The request uses the required parameters mwaa_env and rbac_role as placeholders for the target Amazon MWAA environment and the Apache Airflow role-based access control (RBAC) role, respectively.
  3. Once it receives the request, the ALB redirects the user-agent to the OIDC IdP authentication endpoint. The user-agent authenticates with the OIDC IdP with the existing user name and password.
  4. If user authentication is successful, the OIDC IdP redirects the user-agent back to the configured ALB with a redirect_url with the authorization code included in the URL.
  5. The ALB uses the authorization code received to obtain the access_token and OpenID JWT token with openid email scope from the OIDC IdP. It then forwards the login request to the Amazon MWAA authenticator AWS Lambda function with the JWT token included in the request header in the x-amzn-oidc-data parameter.
  6. The Lambda function verifies the JWT token found in the request header using ALB public keys. The function subsequently authorizes the authenticated user for the requested mwaa_env and rbac_role stored in an Amazon DynamoDB table. The use of DynamoDB for authorization here is optional; the Lambda code function is_allowed can be customized to use other authorization mechanisms.
  7. The Amazon MWAA authenticator Lambda function redirects the user-agent to the Apache Airflow UI console in the requested Amazon MWAA environment with the login token in the redirect URL. Additionally, the function provides the logout functionality.

Amazon MWAA public network access mode

For the Amazon MWAA environments configured with public access mode, the user agent uses public routing over the internet to connect to the ALB hosted in a public subnet.

The following diagram illustrates the solution architecture with a numbered call flow sequence for internet network reachability.

Amazon MWAA public network access mode architecture diagram

Amazon MWAA private network access mode

For Amazon MWAA environments configured with private access mode, the user agent uses private routing over a dedicated AWS Direct Connect or AWS Client VPN to connect to the ALB hosted in a private subnet.

The following diagram shows the solution architecture for Client VPN network reachability.

Amazon MWAA private network access mode architecture diagram

Automation through infrastructure as code

To make setting up this solution easier, we have released a pre-built solution that automates the tasks involved. The solution has been built using the AWS Cloud Development Kit (AWS CDK) using the Python programming language. The solution is available in our GitHub repository and helps you achieve the following:

  • Set up a secure ALB to provide OIDC-based SSO to your existing Amazon MWAA environment with default Apache Airflow Admin role-based access.
  • Create new Amazon MWAA environments along with an ALB and an authenticator Lambda function that provides OIDC-based SSO support. With the customization provided, you can define the number of Amazon MWAA environments to create. Additionally, you can customize the type of Amazon MWAA environments created, including defining the hosting VPC configuration, environment name, Apache Airflow UI access mode, environment class, auto scaling, and logging configurations.

The solution offers a number of customization options, which can be specified in the cdk.context.json file. Follow the setup instructions to complete the integration to your existing Amazon MWAA environments or create new Amazon MWAA environments with SSO enabled. The setup process creates an ALB with an HTTPS listener that provides the user access endpoint. You have the option to define the type of ALB that you need. You can define whether your ALB will be public facing (internet accessible) or private facing (only accessible within the VPC). It is recommended to use a private ALB with your new or existing Amazon MWAA environments configured using private UI access mode.

The following sections describe the specific implementation steps and customization options for each use case.

Prerequisites

Before you continue with the installation steps, make sure you have completed all prerequisites and run the setup-venv script as outlined within the README.md file of the GitHub repository.

Integrate to a single existing Amazon MWAA environment

If you’re integrating with a single existing Amazon MWAA environment, follow the guides in the Quick start section. You must specify the same ALB VPC as that of your existing Amazon MWAA VPC. You can specify the default Apache Airflow RBAC role that all users will assume. The ALB with an HTTPS listener is configured within your existing Amazon MWAA VPC.

Integrate to multiple existing Amazon MWAA environments

To connect to multiple existing Amazon MWAA environments, specify only the Amazon MWAA environment name in the JSON file. The setup process will create a new VPC with subnets hosting the ALB and the listener. You must define the CIDR range for this ALB VPC such that it doesn’t overlap with the VPC CIDR range of your existing Amazon MWAA VPCs.

When the setup steps are complete, implement the post-deployment configuration steps. This includes adding the ALB CNAME record to the Amazon Route 53 DNS domain.

For integrating with Amazon MWAA environments configured using private access mode, there are additional steps that need to be configured. These include configuring VPC peering and subnet routes between the new ALB VPC and the existing Amazon MWAA VPC. Additionally, you need to configure network connectivity from your user-agent to the private ALB endpoint resolved by your DNS domain.

Create new Amazon MWAA environments

You can configure the new Amazon MWAA environments you want to provision through this solution. The cdk.context.json file defines a dictionary entry in the MwaaEnvironments array. Configure the details that you need for each of the Amazon MWAA environments. The setup process creates an ALB VPC, ALB with an HTTPS listener, Lambda authorizer function, DynamoDB table, and respective Amazon MWAA VPCs and Amazon MWAA environments in them. Furthermore, it creates the VPC peering connection between the ALB VPC and the Amazon MWAA VPC.

If you want to create Amazon MWAA environments with private access mode, the ALB VPC CIDR range specified must not overlap with the Amazon MWAA VPC CIDR range. This is required for the automatic peering connection to succeed. It can take between 20–30 minutes for each Amazon MWAA environment to finish creating.

When the environment creation processes are complete, run the post-deployment configuration steps. One of the steps here is to add authorization records to the created DynamoDB table for your users. You need to define the Apache Airflow rbac_role for each of your end-users, which the Lambda authorizer function matches to provide the requisite access.

Verify access

Once you’ve completed with the post-deployment steps, you can log in to the URL using your ALB FQDN. For example, If your ALB FQDN is alb-sso-mwaa.example.com, you can log in to your target Amazon MWAA environment, named Env1, assuming a specific Apache Airflow RBAC role (such as Admin), using the following URL: https://alb-sso-mwaa.example.com/aws_mwaa/aws-console-sso?mwaa_env=Env1&rbac_role=Admin. For the Amazon MWAA environments that this solution created, you need to have appropriate Apache Airflow rbac_role entries in your DynamoDB table.

The solution also provides a logout feature. To log out from an Apache Airflow console, use the normal Apache Airflow console logout. To log out from the ALB, you can, for example, use the URL https://alb-sso-mwaa.example.com/logout.

Clean up

Follow the readme documented steps in the section Destroy CDK stacks in the GitHub repo, which shows how to clean up the artifacts created via the AWS CDK deployments. Remember to revert any manual configurations, like VPC peering connections, that you might have made after the deployments.

Conclusion

This post provided a solution to integrate your organization’s OIDC-based IdPs with Amazon MWAA to grant secure access to multiple Amazon MWAA environments. We walked through the solution that solves this problem using infrastructure as code. This solution allows different end-user personas in your organization to access the Amazon MWAA Apache Airflow UI using OIDC SSO.

To use the solution for your own environments, refer to Application load balancer single-sign-on for Amazon MWAA. For additional code examples on Amazon MWAA, refer to Amazon MWAA code examples.


About the Authors

Ajay Vohra is a Principal Prototyping Architect specializing in perception machine learning for autonomous vehicle development. Prior to Amazon, Ajay worked in the area of massively parallel grid-computing for financial risk modeling.

Jaswanth Kumar is a customer-obsessed Cloud Application Architect at AWS in NY. Jaswanth excels in application refactoring and migration, with expertise in containers and serverless solutions, coupled with a Masters Degree in Applied Computer Science.

Aneel Murari is a Sr. Serverless Specialist Solution Architect at AWS based in the Washington, D.C. area. He has over 18 years of software development and architecture experience and holds a graduate degree in Computer Science. Aneel helps AWS customers orchestrate their workflows on Amazon Managed Apache Airflow (MWAA) in a secure, cost effective and performance optimized manner.

Parnab Basak is a Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new solutions that are cloud native using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space helping customers adopt AWS services for their workflow orchestration needs.

Introducing field-based coloring experience for Amazon QuickSight

Post Syndicated from Bhupinder Chadha original https://aws.amazon.com/blogs/big-data/introducing-field-based-coloring-experience-for-amazon-quicksight/

Color plays a crucial role in visualizations. It conveys meaning, captures attention, and enhances aesthetics. You can quickly grasp important information when key insights and data points pop with color. However, it’s important to use color judiciously to enhance readability and ensure correct interpretation. Color should also be accessible and consistent to enable users to establish visual patterns and comprehend data effectively.

In line with data visualization best practices, Amazon QuickSight is announcing the launch of field-based coloring options, which provides a fresh approach to configuring colors across visuals in addition to the visual-level color settings. With field-based colors, you can now enjoy the following benefits:

  • Consistent coloring across visuals using the same Color field
  • The ability to assign custom colors to dimension values at the field level
  • The ability to persist default color consistency during visual interactions, such as filtering and sorting

Consistent coloring experience across visuals

At present, users in QuickSight can either assign colors to their charts using themes or the on-visual menu. In addition to these options, the launch of field-based coloring allows authors to specify colors on a per-field basis, simplifying the process of setting colors and ensuring consistency across all visuals that use the same field. The following example shows that, before this feature was available, both charts using the color field Ship region displayed different colors across the field values.

With the implementation of field colors, authors now have the capability to maintain consistent color schemes across visuals that utilize the same field. This is achieved by defining distinct colors for each field value, which ensures uniformity throughout. In contrast to the previous example, both charts now showcase consistent colors for the Ship region field.

Consistent coloring experience with visual interaction

In the past, the default coloring logic used to be based on the sorting order, which means that colors would stay the same for a given sort order. However, this caused inconsistency because the same values could display different colors when the sorting order changed or when they were filtered. The following example shows that the colors for each segment field (Online, In-Store, and Catalog) on the donut chart differ from the colors on the bar chart after sorting.

The assigned colors persist and remain unchanged during any visual interaction, such as sorting or filtering, by defining field-based colors. Notice that, after sorting the donut chart another way, the legend order changes, but the colors remain the same.

How to customize field colors

In this section, we demonstrate the various ways you can customize field colors.

Edit field color

There are two ways to add or edit field-based color:

  • Fields list pane – Select the field in your analysis and choose Edit field colors from the context menu. This allows you to choose your own colors for each value.
  • On-visual menu – To define or modify colors another way, you can simply select the legend or the desired data point. Access the context menu and choose Edit field colors. This opens the Edit field colors pane, which is filtered to display the selected value and allows for easy and convenient color customization.

Note the following considerations:

  • Colors defined at a visual level override field-based colors.
  • You can assign colors to a maximum of 50 values per field. If you want more than 50, you’ll need to reset a previously assigned color to continue.

Reset visual color

If your visuals have colors assigned through the on-visual menu, the field-based colors aren’t visible. This is because on-visual colors take precedence over the field-based color settings. However, you can easily reset the visual-based colors to reveal the underlying field-based colors in such cases.

Reset field colors

If you want to change the color of a specific value, simply choose the reset icon next to the edited color. Alternatively, if you want to reset all colors, choose Reset colors at the bottom. This restores all edited values to their default color assignment.

Unused color (stale color assignment)

When values that you’ve assigned colors to no longer appear in data, QuickSight labels the values as unused. You can view the unused color assignments and choose to delete them if you’d like.

Conclusion

Field-based coloring options in QuickSight simplify the process of achieving consistent and visually appealing visuals. The persistence of default colors during interactions, such as filtering and sorting, enhances the user experience. Start using field-based coloring today for consistent coloring experience and to enable better comparisons and pattern recognition for effective data interpretation and decision-making.


About the author

Bhupinder Chadha is a senior product manager for Amazon QuickSight focused on visualization and front end experiences. He is passionate about BI, data visualization and low-code/no-code experiences. Prior to QuickSight he was the lead product manager for Inforiver, responsible for building a enterprise BI product from ground up. Bhupinder started his career in presales, followed by a small gig in consulting and then PM for xViz, an add on visualization product.

How to scan EC2 AMIs using Amazon Inspector

Post Syndicated from Luke Notley original https://aws.amazon.com/blogs/security/how-to-scan-ec2-amis-using-amazon-inspector/

Amazon Inspector is an automated vulnerability management service that continually scans Amazon Web Services (AWS) workloads for software vulnerabilities and unintended network exposure. Amazon Inspector supports vulnerability reporting and deep inspection of Amazon Elastic Compute Cloud (Amazon EC2) instances, container images stored in Amazon Elastic Container Registry (Amazon ECR), and AWS Lambda functions. Operating system and programming language support is extensive, ranging from Bottlerocket to Windows Server.

Many customers use Amazon EC2 Auto Scaling groups as part of their resilience and scaling architecture for their workloads. With Auto Scaling groups, you can scale and deploy rapidly by using Amazon Machine Images (AMIs). However, AMIs within your environment can quickly become outdated as new vulnerabilities are discovered. A security best practice is to perform routine vulnerability assessments of your AMIs to identify whether newfound vulnerabilities apply to them. If you identify a vulnerability, you can update the AMI with the appropriate security patches, test the AMI in lower environments, and deploy the updated AMI in your environment. At this time, Amazon Inspector only supports scanning of running EC2 instances.

In this blog post, we’ll share a solution that you can use with Amazon EventBridge, AWS Lambda, AWS Step Functions, Amazon Simple Notification Service (Amazon SNS)­­, and Amazon Simple Storage Service (Amazon S3) to scan AMIs and generate Amazon Inspector finding reports to help ensure that your AMIs are scanned for known vulnerabilities and updated prior to deployment. Then, we will show you how to periodically scan selected EC2 AMIs based on a tagging strategy, and take automated actions.

Prerequisites

The solution provided in this post has a number of items that you will need to review and address before you deploy the solution:

  1. Make sure that the AMI to be scanned by Amazon Inspector is based from one of the operating systems that AWS supports for EC2 scanning.
  2. To successfully complete a scan, Amazon Inspector requires the EC2 instance to be a managed instance in AWS Systems Manager that has the Systems Manager Agent installed and running, and has an attached AWS Identity and Access Management (IAM) instance profile that allows Systems Manager to manage the instance. For more information, see Scanning Amazon EC2 instances with Amazon Inspector.
  3. If you use customer managed keys to encrypt Amazon Elastic Block Store (Amazon EBS) volumes and you have a default EC2 configuration set to encrypt EBS volumes, you will need to configure additional key policy permissions. For the customer managed key that encrypts EBS volumes, add the following example policy statement to the key policy. Make sure to replace <111122223333> with your own AWS account ID.
    {
                "Sid": "Allow use of the key by AMI Scanner State Machine",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam:: <111122223333>:role/service-role/AMIScanner-Statemachine-role"
                },
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:ReEncrypt*",
                    "kms:GenerateDataKey*",
                    "kms:DescribeKey"
                ],
                "Resource": "*"
            },

    If you don’t add this additional policy, the Step Functions state machine won’t allow the EC2 instances to launch. For more information, see Key policy sections that allow access to the customer managed key.

  4. The solution in this blog post requires that you activate Amazon Inspector in your AWS account. If you haven’t activated Amazon Inspector yet, learn more about the free trial and pricing, and follow the steps in the Amazon Inspector documentation to set up the service and start monitoring your account. Alternatively, you can activate Amazon Inspector by using the AWS Command Line Interface (AWS CLI) and this GitHub example.

Solution overview and architecture

In this solution, you will use the follow AWS services and features:

  • Task orchestration
    • AWS Step Functions state machine workflows are used in this solution to verify that conditions are successfully validated before moving to the next task. This helps ensure that the Amazon Inspector scanning of the temporary instance launched in the first state machine is completed before the second state machine starts. This can help reduce the overall cost of the solution and can help prevent the first state machine from reaching state transition limitations.
    • Lambda functions handle the logic for retrieving AMIs to be scanned, launching temporary instances, creating Amazon EventBridge rules, tagging AMIs, and exporting Amazon Inspector reports to Amazon S3.
  • AMI tagging
    • To use this solution, you need to tag the AMIs that Amazon Inspector will scan, because a Lambda function will use these tags to start the solution orchestration. For this post, we use the tag InspectorScan with a value of true. With AMI tagging, you can configure automated processes as part of your deployment pipelines to implement the tagging.
  • Storage of exported Amazon Inspector findings
    • Amazon S3 helps you store the exported Amazon Inspector findings report and use them in a standardized format for multiple use cases across AWS services, or use Amazon Athena to query the reports, which we will cover later in the post. Each scanned report is stored in the S3 bucket and is named in the form AMI-NAME/guid.JSON or AMI-NAME/guid.CSV, depending on the export format that you specify.
    • You can also use S3 event notifications to alert different operational teams that there are Amazon Inspector scan results that require review.
  • Encryption of Amazon Inspector findings reports
    • AWS Key Management Service (AWS KMS) is used to encrypt the findings report. The AWS KMS key used must be a customer managed, symmetric KMS encryption key, and importantly, the key must be in the same AWS Region as the S3 bucket that you configured to store the report. The solution in this post creates a new KMS key, as well as a key policy that is configured to grant permissions for Amazon Inspector to use the key.
  • Event tracking and scheduling
    • This solution uses an Amazon EventBridge rule to listen for completed Amazon Inspector scan events for each temporary EC2 instance launch. When the EventBridge rule finds a matched event, the rule passes the required parameters and invokes the second Step Functions state machine. The event pattern used in this solution uses the following format:
      	{
      			"source": ["aws.inspector2"],
      			"detail-type": ["Inspector2 Scan"],
      			"resources": ["i-abcdef01234567890"]
      		}

    • You can schedule the AMI scanning by using an EventBridge rule that invokes a Lambda function that runs on a schedule. The Lambda function uses a cron expression to occur weekly. You can configure this parameter according to your requirements. Initially, this rule will be disabled to allow you to configure and enable the rule at a later stage.
    • Amazon SNS sends notifications during the AMI scanning solution process. From the SNS topic, you can configure different subscriptions, depending on your preferred use case and environment. An example of a subscription could be a shared mailbox email address for the security team or incident ticketing system.

Figure 1 shows the solution architecture.

Figure 1: Amazon Inspector scanning of an AMI

Figure 1: Amazon Inspector scanning of an AMI

The high-level workflow of the solution is as follows:

  1. You can use EventBridge to create a scheduled rule to invoke a Lambda function. You can set the rule for daily, weekly, or monthly, depending on your use case.
  2. The Lambda function searches for AMIs with the appropriate tags and passes these as parameters to the Step Functions workflow.
  3. The first Step Functions state machine is invoked for each AMI to be scanned.
  4. The first Step Functions workflow deploys a temporary EC2 instance from the AMI that is defined.
  5. A Lambda function is invoked to create an EventBridge rule.
  6. An EventBridge rule is created to listen for the successful Amazon Inspector scanned event of the temporary EC2 instance.
  7. A Lambda function is invoked to tag the EC2 instance.
  8. The temporary EC2 instance is tagged, showing Amazon Inspector that scanning is in progress.
  9. The first Step Functions workflow sends a notification to an SNS topic.
  10. The EventBridge rule parses the required parameters and invokes the second Step Functions state machine.
  11. A Lambda function is invoked to generate an Amazon Inspector report and export the findings to an S3 bucket.
  12. The scanned Amazon Inspector AMI results are saved to an S3 bucket.
  13. The Step Functions workflow terminates the temporary EC2 instance that can reduce cost and clean up the process.
  14. A Lambda function is invoked to delete the temporary EventBridge rule.
  15. The temporary EventBridge rule and targets are deleted.
  16. A Lambda function is invoked to tag the AMI.
  17. The scanned AMI is updated with tagging metadata.
  18. The second Step Functions workflow sends a final notification to an SNS topic.

Deploy the solution

The solution will be deployed with the scheduled rule in Amazon EventBridge disabled to allow you to create your tagging strategy and to familiarize yourself with the solution. Later in this post, we’ll cover how to enable the Amazon EventBridge scheduled rule.

Step 1: Deploy the CloudFormation template

For this next step, make sure that you deploy the CloudFormation template provided for multi-AMI scanning in the AWS account and Region where you want to test this solution.

To deploy the CloudFormation template

  1. Choose the following Launch Stack button to launch a CloudFormation stack in your account. Note that the stack will launch in the N. Virginia (us-east-1) Region. To deploy this solution into other AWS Regions, download the solution’s CloudFormation template, modify it, and deploy it to the selected Region.

    Select this image to open a link that starts building the CloudFormation stack

    Make sure that you configure the following parameters in the CloudFormation template so that it deploys successfully:

    • AMITagName — The AMI tag name to check if the AMI should be scanned by Amazon Inspector.
    • AMITagValue — The AMI tag value to check if the AMI should be scanned by Amazon Inspector.
    • InspectorReportFormat — The report format, which can be either CSV or JSON.
    • InstanceSubnetID — The subnet ID to launch the temporary EC2 instance into.
    • InstanceType — The instance type to deploy the AMI to for temporary scanning purposes.
    • KmsKeyAdministratorRole — The existing IAM role that needs to have administrator access to the KMS key created for the solution. This key provides access to encrypt and decrypt the Amazon Inspector report.
    • S3ReportBucketName — The name of the S3 bucket to be created.
    • SnsTopic — The name of the new SNS topic to be created. This name defines the SNS topic that notifications are published to.
  2. Review the stack name and the parameters for the template.
  3. On the Quick create stack screen, scroll to the bottom and select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack. The deployment of the CloudFormation stack will take 3–4 minutes.

After the CloudFormation stack has deployed successfully, you can use the deployed solution.

Step 2: Manually run the first Step Functions workflow

The first Step Functions state machine requires parameters to be passed in; the SingleAMI Lambda function accomplishes this. You can start the Lambda function by creating a test event and passing the correct JSON text and parameters. The following parameters are available in the output section of the CloudFormation stack that the solution deployed:

  • AmiId — The ID of the AMI to be used for deploying the EC2 instance. This is the EC2 AMI to be scanned.
  • EC2InstanceProfile — The Amazon Resource Name (ARN) of the EC2 instance profile that the CloudFormation stack created.
  • InstanceType — The type of EC2 instance to use for deployment.
  • KmsKeyName — The ARN of the KMS key to be used for encrypting and decrypting the Amazon Inspector report that the CloudFormation stack created.
  • S3Bucket — The name of the S3 bucket to which the Amazon Inspector reports will be exported. The S3 bucket was created previously by the CloudFormation stack.
  • S3ReportFormat — The report format that Amazon Inspector will use to export the findings report; either the JSON or the CSV format is valid.
  • SnsTopc — The ARN of the SNS topic to which notifications will be sent. This SNS topic was created previously by the CloudFormation stack.
  • StateMachineArn — The ARN of the first Step Functions state machine, which the Lambda function will run first.
  • SubnetId — The ID of the VPC subnet to which the EC2 instance will be attached and launched into. This is a required parameter and could be a subnet that is created specifically for this scanning purpose.

The following is an example parameter configuration and JSON that you can use to run the Lambda function. Make sure to replace each <user input placeholder> with your own information.

{
"AmiId" : "<AMI-ABCDEF01234567890>",
"Ec2InstanceProfile" : "arn:aws:iam::<111122223333>:instance-profile/Ec2InstanceLaunchRole",
"InstanceType" : "t3.medium",
"KMSKeyName" : "arn:aws:kms:region-name:<111122223333>:key/<a1b2c3d4-5678-90ab-cdef-EXAMPLE11111>",
"S3Bucket" : "<DOC-EXAMPLE-BUCKET-111122223333>",
"S3ReportFormat" : "CSV",
"SnsTopic" : "arn:aws:sns:region-name-2:<111122223333>:InspectorScanner",
"StateMachine": "arn:aws:states:region-name:<111122223333>:stateMachine:AMIScanner-Part1-LaunchEC2",
"SubnetId" : "<SUBNET-ABCDEF01234567890>"
}

After the first state machine is finished, the EventBridge rule listens for the successful Amazon Inspector scan event. An SNS notification is sent, similar to the following.


{"AWS Inspector AMI Scan status":"EC2 instance","For AMI":"ami-abcdef01234567890","Temporarily launched AMI using instance":"i-abcdef01234567890"}

After Amazon Inspector has finished scanning the EC2 instance, and the second state machine completes successfully, the Amazon Inspector finding report appears in the S3 bucket and notifications appear on the SNS topic that was created. The following is an example of an SNS notification.


{"AWS Inspector AMI Scan completed":"Successfully","For AMI":"ami-abcdef01234567890","AWS Inspector report located at S3 Bucket":"DOC-EXAMPLE-BUCKET-111122223333","Temporarily launched AMI using instance":"i-abcdef01234567890"}

Enable scheduled scanning

You can enable the EventBridge scheduled rule to handle multiple AMIs and automatic scheduling. The scheduled rule invokes a Lambda function on a scheduled basis that identifies AMIs with the appropriate tags and passes parameters to the Step Functions workflow.

To enable the rule

  • In the EventBridge rules console, navigate to AMIScanner-ScheduledSolutionTask, and choose Enable.
    Figure 2: Enable Amazon EventBridge scheduled rule

    Figure 2: Enable Amazon EventBridge scheduled rule

Extend the solution

With Amazon Athena, you can run SQL queries on raw data that is stored in S3 buckets. The Amazon Inspector reports are exported to S3, and you can query the data and create tables by using AWS Glue crawlers. To make sure that AWS Glue can crawl the S3 data, you need to add the role that you create for AWS Glue to the AWS KMS key permissions, so that AWS Glue can decrypt the S3 data. The following is an example policy JSON that you can update. Make sure to replace the AWS account ID <111122223333> and S3 bucket name <DOC-EXAMPLE-BUCKET-111122223333> with your own information.

{
"Sid": "Allow the AWS Glue crawler usage of the KMS key",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::<111122223333>:role/service-role/AWSGlueServiceRole-S3InspectorReports"
},
"Action": [
"kms:Decrypt",
"kms:GenerateDataKey*"
],
"Resource": "arn:aws:s3:::<DOC-EXAMPLE-BUCKET-111122223333>"
},

After an AWS Glue Data Catalog has been built, you can run the crawler on a scheduled basis to help keep the catalog up to date with the latest Amazon Inspector findings as they are exported into the S3 bucket.

Using Amazon Athena, you can run queries against the Amazon Inspector reports to generate output data that is relevant to your environment. For example, to list the AMIs that are affected by high-severity findings, you can run the following SQL query. Make sure to replace <DOC-EXAMPLE-BUCKET-111122223333> with your own information.

SELECT DISTINCT partition_0 from "<DOC-EXAMPLE-BUCKET-111122223333>" where severity='HIGH'

With the results, you can use AWS Systems Manager to update the relevant AMIs to include the latest patches and update the launch template used in your Auto Scaling groups.

To further extend this solution, you can also use Amazon QuickSight to visualize the data by connecting to the AWS Glue table and producing dashboards for consumption.

Conclusion

By performing security assessments of your AMIs on a regular basis, you can gain greater visibility and control over the security of your EC2 instances that are created from those AMIs. In this blog post, you learned how to set up AMI vulnerability assessments, and how the results of these continuous vulnerability assessments can help you keep your environment up to date with security patches. For additional hands-on walkthroughs for Amazon Inspector, see Amazon Inspector workshops. You can find the code for this blog post in the inspector-ami-scanning-solution GitHub repository.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Luke Notley

Luke Notley

Luke is a Senior Solutions Architect with Amazon Web Services and is based in Western Australia. Luke has a passion for helping customers connect business outcomes with technology and assisting customers throughout their cloud journey, helping them design scalable, flexible, and resilient architectures. In his spare time, he enjoys traveling, coaching basketball teams, and DJing.

How Amazon Finance Automation built a data mesh to support distributed data ownership and centralize governance

Post Syndicated from Nitin Arora original https://aws.amazon.com/blogs/big-data/how-amazon-finance-automation-built-a-data-mesh-to-support-distributed-data-ownership-and-centralize-governance/

Amazon Finance Automation (FinAuto) is the tech organization of Amazon Finance Operations (FinOps). Its mission is to enable FinOps to support the growth and expansion of Amazon businesses. It works as a force multiplier through automation and self-service, while providing accurate and on-time payments and collections. FinAuto has a unique position to look across FinOps and provide solutions that help satisfy multiple use cases with accurate, consistent, and governed delivery of data and related services.

In this post, we discuss how the Amazon Finance Automation team used AWS Lake Formation and the AWS Glue Data Catalog to build a data mesh architecture that simplified data governance at scale and provided seamless data access for analytics, AI, and machine learning (ML) use cases.

Challenges

Amazon businesses have grown over the years. In the early days, financial transactions could be stored and processed on a single relational database. In today’s business world, however, even a subset of the financial space dedicated to entities such as Accounts Payable (AP) and Accounts Receivable (AR) requires separate systems handling terabytes of data per day. Within FinOps, we can curate more than 300 datasets and consume many more raw datasets from dozens of systems. These datasets can then be used to power front end systems, ML pipelines, and data engineering teams.

This exponential growth necessitated a data landscape that was geared towards keeping FinOps operating. However, as we added more transactional systems, data started to grow in operational data stores. Data copies were common, with duplicate pipelines creating redundant and often out-of-sync domain datasets. Multiple curated data assets were available with similar attributes. To resolve these challenges, FinAuto decided to build a data services layer based on a data mesh architecture. FinAuto wanted to verify that the data domain owners would retain ownership of their datasets while users got access to the data by using a data mesh architecture.

Solution overview

Being customer focused, we started by understanding our data producers’ and consumers’ needs and priorities. Consumers prioritized data discoverability, fast data access, low latency, and high accuracy of data. Producers prioritized ownership, governance, access management, and reuse of their datasets. These inputs reinforced the need of a unified data strategy across the FinOps teams. We decided to build a scalable data management product that is based on the best practices of modern data architecture. Our source system and domain teams were mapped as data producers, and they would have ownership of the datasets. FinAuto provided the data services’ tools and controls necessary to enable data owners to apply data classification, access permissions, and usage policies. It was necessary for domain owners to continue this responsibility because they had visibility to the business rules or classifications and applied that to the dataset. This enabled producers to publish data products that were curated and authoritative assets for their domain. For example, the AR team created and governed their cash application dataset in their AWS account AWS Glue Data Catalog.

With many such partners building their data products, we needed a way to centralize data discovery, access management, and vending of these data products. So we built a global data catalog in a central governance account based on the AWS Glue Data Catalog. The FinAuto team built AWS Cloud Development Kit (AWS CDK), AWS CloudFormation, and API tools to maintain a metadata store that ingests from domain owner catalogs into the global catalog. This global catalog captures new or updated partitions from the data producer AWS Glue Data Catalogs. The global catalog is also periodically fully refreshed to resolve issues during metadata sync processes to maintain resiliency. With this structure in place, we then needed to add governance and access management. We selected AWS Lake Formation in our central governance account to help secure the data catalog, and added secure vending mechanisms around it. We also built a front-end discovery and access control application where consumers can browse datasets and request access. When a consumer requests access, the application validates the request and routes them to a respective producer via internal tickets for approval. Only after the data producer approves the request are permissions provisioned in the central governance account through Lake Formation.

Solution tenets

A data mesh architecture has its own advantages and challenges. By democratizing the data product creation, we removed dependencies on a central team. We made reuse of data possible with data discoverability and minimized data duplicates. This also helped remove data movement pipelines, thereby reducing data transfer and maintenance costs.

We realized, however, that our implementation could potentially impact day-to-day tasks and inhibit adoption. For example, data producers need to onboard their dataset to the global catalog, and complete their permissions management before they can share that with consumers. To overcome this obstacle, we prioritized self-service tools and automation with a reliable and simple-to-use interface. We made interaction, including producer-consumer onboarding, data access request, approvals, and governance, quicker through the self-service tools in our application.

Solution architecture

Within Amazon, we isolate different teams and business processes with separate AWS accounts. From a security perspective, the account boundary is one of the strongest security boundaries in AWS. Because of this, the global catalog resides in its own locked-down AWS account.

The following diagram shows AWS account boundaries for producers, consumers, and the central catalog. It also describes the steps involved for data producers to register their datasets as well as how data consumers get access. Most of these steps are automated through convenience scripts with both AWS CDK and CloudFormation templates for our producers and consumer to use.

Solution Architecture Diagram

The workflow contains the following steps:

  1. Data is saved by the producer in their own Amazon Simple Storage Service (Amazon S3) buckets.
  2. Data source locations hosted by the producer are created within the producer’s AWS Glue Data Catalog.
  3. Data source locations are registered with Lake Formation.
  4. An onboarding AWS CDK script creates a role for the central catalog to use to read metadata and generate the tables in the global catalog.
  5. The metadata sync is set up to continuously sync data schema and partition updates to the central data catalog.
  6. When a consumer requests table access from the central data catalog, the producer grants Lake Formation permissions to the consumer account AWS Identity and Access Management (IAM) role and tables are visible in the consumer account.
  7. The consumer account accepts the AWS Resource Access Manager (AWS RAM) share and creates resource links in Lake Formation.
  8. The consumer data lake admin provides grants to IAM users and roles mapping to data consumers within the account.

The global catalog

The basic building block of our business-focused solutions are data products. A data product is a single domain attribute that a business understands as accurate, current, and available. This could be a dataset (a table) representing a business attribute like a global AR invoice, invoice aging, aggregated invoices by a line of business, or a current ledger balance. These attributes are calculated by the domain team and are available for consumers who need that attribute, without duplicating pipelines to recreate it. Data products, along with raw datasets, reside within their data owner’s AWS account. Data producers register their data catalog’s metadata to the central catalog. We have services to review source catalogs to identify and recommend classification of sensitive data columns such as name, email address, customer ID, and bank account numbers. Producers can review and accept those recommendations, which results in corresponding tags applied to the columns.

Producer experience

Producers onboard their accounts when they want to publish a data product. Our job is to sync the metadata between the AWS Glue Data Catalog in the producer account with the central catalog account, and register the Amazon S3 data location with Lake Formation. Producers and data owners can use Lake Formation for fine-grained access controls on the table. It is also now searchable and discoverable via the central catalog application.

Consumer experience

When a data consumer discovers the data product that they’re interested in, they submit a data access request from the application UI. Internally, we route the request to the data owner for the disposition of the request (approval or rejection). We then create an internal ticket to track the request for auditing and traceability. If the data owner approves the request, we run automation to create an AWS RAM resource share to share with the consumer account covering the AWS Glue database and tables approved for access. These consumers can now query the datasets using the AWS analytics services of their choice like Amazon Redshift Spectrum, Amazon Athena, and Amazon EMR.

Operational excellence

Along with building the data mesh, it’s also important to verify that we can operate with efficiency and reliability. We recognize that the metadata sync process is at the heart of this global data catalog. As such, we are hypervigilant of this process and have built alarms, notifications, and dashboards to verify that this process doesn’t fail silently and create a single point of failure for the global data catalog. We also have a backup repair service that syncs the metadata from producer catalogs into the central governance account catalog periodically. This is a self-healing mechanism to maintain reliability and resiliency.

Empowering customers with the data mesh

The FinAuto data mesh hosts around 850 discoverable and shareable datasets from multiple partner accounts. There are more than 300 curated data products to which producers can provide access and apply governance with fine-grained access controls. Our consumers use AWS analytics services such as Redshift Spectrum, Athena, Amazon EMR, and Amazon QuickSight to access their data. This capability with standardized data vending from the data mesh, along with self-serve capabilities, allows you to innovate faster without dependency on technical teams. You can now get access to data faster with automation that continuously improves the process.

By serving the FinOps team’s data needs with high availability and security, we enabled them to effectively support operation and reporting. Data science teams can now use the data mesh for their finance-related AI/ML use cases such as fraud detection, credit risk modeling, and account grouping. Our finance operations analysts are now enabled to dive deep into their customer issues, which is most important to them.

Conclusion

FinOps implemented a data mesh architecture with Lake Formation to improve data governance with fine-grained access controls. With these improvements, the FinOps team is now able to innovate faster with access to the right data at the right time in a self-serve manner to drive business outcomes. The FinOps team will continue to innovate in this space with AWS services to further provide for customer needs.

To learn more about how to use Lake Formation to build a data mesh architecture, see Design a data mesh architecture using AWS Lake Formation and AWS Glue.


About the Authors

Nitin Arora PicNitin Arora is a Sr. Software Development Manager for Finance Automation in Amazon. He has over 18 years of experience building business critical, scalable, high-performance software. Nitin leads several data and analytics initiatives within Finance, which includes building Data Mesh. In his spare time, he enjoys listening to music and read.

Pradeep Misra PicPradeep Misra is a Specialist Solutions Architect at AWS. He works across Amazon to architect and design modern distributed analytics and AI/ML platform solutions. He is passionate about solving customer challenges using data, analytics, and AI/ML. Outside of work, Pradeep likes exploring new places, trying new cuisines, and playing board games with his family. He also likes doing science experiments with his daughters.

Rajesh Rao PicRajesh Rao is a Sr. Technical Program Manager in Amazon Finance. He works with Data Services teams within Amazon to build and deliver data processing and data analytics solutions for Financial Operations teams. He is passionate about delivering innovative and optimal solutions using AWS to enable data-driven business outcomes for his customers.

Andrew Long PicAndrew Long, the lead developer for data mesh, has designed and built many of the big data processing systems that have fueled Amazon’s financial data processing infrastructure. His work encompasses a range of areas, including S3-based table formats for Spark, diverse Spark performance optimizations, distributed orchestration engines and the development of data cataloging systems. Additionally, Andrew finds pleasure in sharing his knowledge of partner acrobatics.

Satyen GauravKumar Satyen Gaurav, is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers to build products and services using AWS big data technologies, for providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling and learning strategic challenges of chess.

Configure end-to-end data pipelines with Etleap, Amazon Redshift, and dbt

Post Syndicated from Zygimantas Koncius original https://aws.amazon.com/blogs/big-data/configure-end-to-end-data-pipelines-with-etleap-amazon-redshift-and-dbt/

This blog post is co-written with Zygimantas Koncius from Etleap.

Organizations use their data to extract valuable insights and drive informed business decisions. With a wide array of data sources, including transactional databases, log files, and event streams, you need a simple-to-use solution capable of efficiently ingesting and transforming large volumes of data in real time, ensuring data cleanliness, structural integrity, and data team collaboration.

In this post, we explain how data teams can quickly configure low-latency data pipelines that ingest and model data from a variety of sources, using Etleap’s end-to-end pipelines with Amazon Redshift and dbt. The result is robust and flexible data products with high scalability and best-in-class query performance.

Introduction to Amazon Redshift

Amazon Redshift is a fast, fully-managed, self-learning, self-tuning, petabyte-scale, ANSI-SQL compatible, and secure cloud data warehouse. Thousands of customers use Amazon Redshift to analyze exabytes of data and run complex analytical queries. Amazon Redshift Serverless makes it straightforward to run and scale analytics in seconds without having to manage the data warehouse. It automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you only pay for the resources you use. Amazon Redshift helps you break down the data silos and allows you to run unified, self-service, real-time, and predictive analytics on all data across your operational databases, data lake, data warehouse, and third-party datasets with built-in governance. Amazon Redshift delivers up to five times better price performance than other cloud data warehouses out of the box and helps you keep costs predictable.

Introduction to dbt

dbt is a SQL-based transformation workflow that is rapidly emerging as the go-to standard for data analytics teams. For straightforward use cases, dbt provides a simple yet robust SQL transformation development pattern. For more advanced scenarios, dbt models can be expanded using macros created with the Jinja templating language and external dbt packages, providing additional functionality.

One of the key advantages of dbt is its ability to foster seamless collaboration within and across data analytics teams. A strong emphasis on version control empowers teams to track and review the history of changes made to their models. A comprehensive testing framework ensures that your models consistently deliver accurate and reliable data, while modularity enables faster development via component reusability. Combined, these features can improve your data team’s velocity, ensure higher data quality, and empower team members to assume ownership.

dbt is popular for transforming big datasets, so it’s important that the data warehouse that runs the transformations provide a lot of computational capacity at the lowest possible cost. Amazon Redshift is capable of fulfilling both of these requirements, with features such as concurrency scaling, RA3 nodes, and Redshift Serverless.

To take advantage of dbt’s capabilities, you can use dbt Core, an open-source command-line tool that serves as the interface to using dbt. By running dbt Core along with dbt’s Amazon Redshift adapter, you can compile and run your models directly within your Amazon Redshift data warehouse.

Introduction to Etleap

Etleap is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation. Etleap simplifies the data pipeline building experience. A cloud-native platform that seamlessly integrates with AWS infrastructure, Etleap consolidates data without the need for coding. Automated issue detection pinpoints problems so data teams can stay focused on analytics initiatives, not data pipelines. Etleap integrates key Amazon Redshift features into its product, such as streaming ingestion, Redshift Serverless, and data sharing.

In Etleap, pre-load transformations are primarily used for cleaning and structuring data, whereas post-load SQL transformations enable multi-table joins and dataset aggregations. Bridging the gap between data ingestion and SQL transformations comes with multiple challenges, such as dependency management, scheduling issues, and monitoring the data flow. To help you address these challenges, Etleap introduced end-to-end pipelines that use dbt Core models to combine data ingestion with modeling.

Etleap end-to-end data pipelines

The following diagram illustrates Etleap’s end-to-end pipeline architecture and an example data flow.

Etleap end-to-end data pipelines combine data ingestion with modeling in the following way: a cron schedule first triggers ingestion of data required by the models. Once all the ingestion is complete, a user-defined dbt build is run, which performs post-load SQL transformations and aggregations on the data that has just been ingested by ingestion pipelines.

End-to-end pipelines offer several advantages over running dbt workflows in isolation, including dependency management, scheduling and latency, Amazon Redshift workload synchronization, and managed infrastructure.

Dependency management

In a typical dbt use case, the data that dbt performs SQL transformations on is ingested by an extract, transform, and load (ETL) tool such as Etleap. Tables ingested by ETL processes in dbt projects are usually referenced as dbt sources. Those source references need to be maintained either manually or using custom solutions. This is often a laborious and error-prone process. Etleap eliminates these processes by automatically keeping your dbt source list up to date. Additionally, any changes made to the dbt project or ingestion pipeline will be validated by Etleap, ensuring that the changes are compatible and won’t disrupt your dbt builds.

Scheduling and latency

End-to-end pipelines allow you to monitor and minimize end-to-end latency. This is achieved by using a single end-to-end pipeline schedule, which eliminates the need for an independent ingestion pipeline and dbt job-level schedules. When the schedule triggers the end-to-end pipeline, the ingestion processes will run. The dbt workflow will start only after the data for every table used in the dbt SQL models is up to date. This removes the need for additional scheduling components outside of Etleap, which reduces data stack complexity. It also ensures that all data involved in dbt transformations is at least as recent as the scheduled trigger time. Consequently, data in all the final tables or views will be up to date as of the scheduled trigger time.

Amazon Redshift workload synchronization

Due to pipelines and dbt builds running on the same schedule and triggering only the required parts of data ingestion and dbt transformations, higher workload synchronization is achieved. This means that customers using Redshift Serverless can further minimize their compute usage, driving their costs down further.

Managed infrastructure

One of the challenges when using dbt Core is the need to set up and maintain your own infrastructure in which the dbt jobs can be run efficiently and securely. As a software as a service (SaaS) provider, Etleap provides highly scalable and secure dbt Core infrastructure out of the box, so there’s no infrastructure management required by your data teams.

Solution overview

To illustrate how end-to-end pipelines can address a data analytics team’s needs, we use an example based on Etleap’s own customer success dashboard.

For Etleap’s customer success team, it’s important to track changes in the number of ingestion pipelines customers have. To meet the team’s requirements, the data analyst needs to ingest the necessary data from internal systems into an Amazon Redshift cluster. They then need to develop dbt models and schedule an end-to-end pipeline. This way, Etleap’s customer success team has dashboard-ready data that is consistently up to date.

Ingest data from the sources

In Etleap’s case, the internal entities are stored in a MySQL database, and customer relationships are managed via HubSpot. Therefore, the data analyst must first ingest all data from the MySQL user and pipeline tables as well as the companies entity from HubSpot into their Amazon Redshift cluster. They can achieve this by logging into Etleap and configuring ingestion pipelines through the UI.

Develop the dbt models

After the data has been loaded into Amazon Redshift, the data analyst can begin creating dbt models by using queries that join the HubSpot data with internal entities. The first model, user_pipelines.sql, joins the users table with the pipelines table based on the foreign key user_id stored in the pipelines table, as shown in the following code. Note the use of source notation to reference the source tables, which were ingested using ingestion pipelines.

select u.domain, p.name, p.create_date
from {{source('mysql', 'users')}} u
join {{source('mysql', 'pipelines')}} p on p.user_id = u.id
user_pipelines.sql model

The second model, company_pipelines.sql, joins the HubSpot companies table with the user_pipelines table, which is created by the first dbt model, based on the email domain. Note the usage of ref notation to reference the first model:

select c.name as company_name, up.name as user_name, up.create_date as pipeline_create_date
from {{source('hubspot', 'companies')}} hc
join {{ref('user_pipelines')}} up on up.domain = hc.domain
company_pipelines.sql model

After creating these models in the dbt project, the data analyst will have achieved the data flow summarized in the following figure.

Test the dbt workflow

Finally, the data analyst can define a dbt selector to select the newly created models and run the dbt workflow locally. This creates the views and tables defined by the models in their Amazon Redshift cluster.

The resulting company_pipelines table enables the team to track metrics, such as the number of pipelines created by each customer or the number of pipelines created on any particular day.

Schedule an end-to-end pipeline in Etleap

After the data analyst has developed the initial models and queries, they can schedule an Etleap end-to-end pipeline by choosing the selector and defining a desired cron schedule. The end-to-end pipeline matches the sources to pipelines and takes care of running the ingestion pipelines as well as dbt builds on a defined schedule, ensuring high freshness of the data.

The following screenshot of the Etleap UI shows the configuration of an end-to-end pipeline, including its cron schedule, which models are included in the dbt build, and the mapping of inferred dbt sources to Etleap pipelines.

Summary

In this post, we described how Etleap’s end-to-end pipelines enable data teams to simplify their data integration and transformation workflows as well as achieve higher data freshness. In particular, we illustrated how data teams can use Etleap with dbt and Amazon Redshift to run their data ingestion pipelines with post-load SQL transformations with minimal effort required by the team.

Start using Amazon Redshift or Amazon Redshift Serverless to take advantage of their powerful SQL transformations. To get started with Etleap, start a free trial or request a tailored demo.


About the authors

Zygimantas Koncius is an engineer at Etleap with 3 years of experience in developing robust and performant ETL software. In addition to development work, he maintains Etleap infrastructure and provides deep-level technical customer support.

Sudhir Gupta is a Principal Partner Solutions Architect, Analytics Specialist at AWS with over 18 years of experience in Databases and Analytics. He helps AWS partners and customers design, implement, and migrate large-scale data & analytics (D&A) workloads. As a trusted advisor to partners, he enables partners globally on AWS D&A services, builds solutions/accelerators, and leads go-to-market initiatives.

How to enforce multi-party approval for creating Matter-compliant certificate authorities

Post Syndicated from Ram Ramani original https://aws.amazon.com/blogs/security/how-to-enforce-multi-party-approval-for-creating-matter-compliant-certificate-authorities/

Customers who build smart home devices using the Matter protocol from the Connectivity Standards Alliance (CSA) need to create and maintain digital certificates, called device attestation certificates (DACs), to allow their devices to interoperate with devices from other vendors. DACs must be issued by a Matter device attestation certificate authority (CA). The CSA mandates multi-party approval for creating Matter compliant CAs. You can use AWS Private CA to create your Matter device attestation CA, which will store two important certificates: the product attestation authority (PAA) and product attestation intermediate (PAI) certificate. The PAA is the root CA that signs the PAI; the PAI is the intermediate CA that issues DACs. In this blog post, we will show how to implement multi-party approval for creation of these two certificates within AWS Private CA.

The CSA allows the use of delegated service providers (DSP) to provide you with public key infrastructure (PKI) services to create your Matter device attestation CA. You can use AWS Private CA as a DSP to create a Matter device attestation CA to issue DACs.

You should carefully plan to implement and demonstrate compliance with the Matter PKI Certificate Policy (CP) requirements when you issue Matter certificates by using the CA infrastructure services provided by AWS Private CA. Matter PKI CP is not just a technical standard; it also covers people, processes, and technology. For information about the requirements to comply with the Matter PKI CP and a reference list of acronyms, see the Matter PKI Compliance Guide. In this blog post, we address one of the Matter requirements for technical security controls for implementing multi-party approval for the creation of PAA and PAI certificates

Note: The solution presented in this post uses AWS Systems Manager Change Manager, a capability of AWS Systems Manager, for demonstrating multi-party approval as required by the Matter CP for the creation of the PAA and PAI. Additionally, the solution also uses AWS Systems Manager documents (SSM documents), which contain the code to automate the creation of PAA and PAI DAC certificates.

Implementing multi-party approval: Personas and IAM roles

For the process of achieving the multi-party approval required for Matter compliance, we will use the following human personas:

  • Jane Doe and Paulo Santos as the two approvers responsible for signing off on the creation of PAA and PAI.
  • Shirley Rodriguez as the persona responsible for setting up the prerequisite infrastructure and creating the change template that governs the multi-party approval process and specifying the human personas who are authorized to approve change requests.
  • Richard Roe as the persona responsible for reviewing and approving change template changes made by Shirley Rodriguez, to verify the separation of duties.

AWS offers support for identity federation to enable federated single sign-on (SSO). This allows users to sign into the AWS Management Console or call AWS API operations by using the credentials of an IAM role. To establish a secure authentication and authorization model, we highly recommend that you map the identities of the human personas to IAM roles.

As a prerequisite, Shirley Rodriguez will create the following AWS Identity and Access Management (IAM) roles that support the multi-party approval operations:

  • TmpltReview-Role — Richard Roe will assume this role to review and approve changes to the change template that is used to run the SSM document to create the matter CAs.
  • CreatePAA-Role and CreatePAI-Role — Clone the solution GitHub repository and create the roles by using the policies from the repository:
    • CreatePAA-Role — This role is assumed by the AWS Systems Manager service to create the PAA.
    • CreatePAI-Role — This role is assumed by the AWS Systems Manager service to create the PAI.
  • MatterCA-Admin-1 and MatterCA-Admin-2 — Jane Doe will use the MatterCA-Admin-1 role, while Paulo Santos will use the MatterCA-Admin-2 role. These individuals will serve as the two approvers for the multi-party approval process.

Note: It’s important that one person cannot approve an action by themselves. If a person is allowed to assume the MatterCA-Admin-1 role, they must not be allowed to assume the MatterCA-Admin-2 role also. If the same person can assume both roles, then that person can bypass the requirement for two different people to approve.

To create the IAM roles

  1. Create IAM roles MatterCA-Admin-1 and MatterCA-Admin-2, and attach the following AWS-managed policies:
  2. You should configure the trust relationship to allow Jane Doe to use the Matter-CA-Admin-1-Role and Paulo Santos to use the Matter-CA-Admin-2-Role for the multi-party approval process. This is intended to restrict Jane Doe and Paulo Santos from assuming each other’s roles. Use the following policy as a guide, and make sure to replace <AccountNumber> and <Role_Name> with your own information, depending on the federated identity models that you have chosen.
    {
    
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "aws:PrincipalARN":"arn:aws:iam::<AccountNumber>:role/<Role-Name>"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
  3. Create the IAM role TmpltReview-Role, and attach the following policies.
    • AmazonSSMReadOnlyAccess
    • Attach the following custom inline policy to enable review and approval of the change template.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TemplateReviewer",
                "Effect": "Allow",
                "Action": "ssm:UpdateDocumentMetadata",
                "Resource": "*"
            }
        ]
    }
  4. Modify the trust relationship to allow only Richard Roe to use the role, as shown in the following policy. Make sure to replace <AccountNumber> and <Role-Name> with your own information.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS":"aws:PrincipalARN":"arn:aws:iam::<AccountNumber>:role/<Role-Name>"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
  5. Create the IAM role CreatePAA-Role, which will be used by the AWS Systems Manager change template to run the SSM document to create PAA.
    1. Attach the following inline policy to the CreatePAA-Role.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "acm-pca:ImportCertificateAuthorityCertificate",
                      "acm-pca:IssueCertificate",
                      "acm-pca:CreateCertificateAuthority",
                      "acm-pca:GetCertificate",
                      "acm-pca:GetCertificateAuthorityCsr",
                      "acm-pca:DescribeCertificateAuthority"
                  ],
                  "Resource": "*"
              }
          ]
      }
    2. Modify the trust relationship for CreatePAA-Role to allow only the AWS Systems Manager service to assume this role, as shown following.
      {
      
      
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "ssm.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole",
                  "Condition": {}
              }
          ]
  6. Create the IAM role CreatePAI-Role, which will be used by the change template to run the SSM document to create the PAI certificate.
    1. Attach the following policy as an inline policy on the CreatePAI-Role.
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": [
                      "acm-pca:ImportCertificateAuthorityCertificate",
                      "acm-pca:CreateCertificateAuthority",
                      "acm-pca:GetCertificateAuthorityCsr",
                      "acm-pca:DescribeCertificateAuthority"
                  ],
                  "Resource": "*"
              },
              {
                  "Effect": "Allow",
                  "Action": [
                      "acm-pca:GetCertificateAuthorityCertificate",
                      "acm-pca:GetCertificate",
                      "acm-pca:IssueCertificate"
                  ],
                  "Resource": “*”
              }
          ]
      }
  7. Modify the trust relationship for CreatePAI-Role to allow only AWS Systems Manager to assume this role, as shown following.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "ssm.amazonaws.com"
                },
                "Action": "sts:AssumeRole",
                "Condition": {}
            }
        ]
    }
    

Preventive security controls recommended for this solution

We recommend that you apply the following security controls for this solution:

  • Dedicate an AWS account to this solution – It is important that the only users who can perform actions on the PAA and PAI are the users in this account. By deploying these items in a dedicated AWS account, you limit the number of users who might have elevated privileges, but don’t have cause to use those privileges here.
  • SCPs (service control policies) – The IAM policies in this solution do not prevent someone with privileges, such as an administrator, from bypassing your expected controls and approving usage of the CA. SCPs, if they are applied by using AWS Organizations, can restrict the creation of CAs (certificate authorities) exclusively to CreatePAA-Role and CreatePAI-Role.

    The following example SCP will enforce this type of restriction. Make sure to replace <AccountNumber> with your own information. With a strong SCP, even the root account will not be able to perform these operations or change these security controls.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RestrictCACreation",
                "Effect": "Deny",
                "Action": ["acm-pca:CreateCertificateAuthority"],
                "Resource": "*",
                "Condition": {
                    "StringNotLike": {
                        "aws:PrincipalARN": [
                            "arn:aws:iam::<AccountNumber>:role/CreatePAA-Role",
                            "arn:aws:iam::<AccountNumber>:role/CreatePAI-Role"
                        ]
                    }
                }
            }
        ]
     }

AWS Systems Manager configuration

Shirley Rodriguez will download the following sample Systems Manager (SSM) documents from our GitHub repository and perform the listed steps in this section.

The content in these yaml files will be used in the next steps to create SSM documents.

Create the SSM document

The first step is to create the SSM document that automates resource creation of the PAA and PAI in AWS Private CA.

To create the SSM document

  1. Open the Systems Manager console.
  2. In the left navigation pane, under Shared Resources, choose Documents.
  3. Choose the Owned by me tab, choose Create document, and from the dropdown list, choose Automation.
    Figure 1: Create the automation document

    Figure 1: Create the automation document

  4. Under Create automation, choose the Editor tab, and then choose Edit.
    Figure 2: Automation document editor

    Figure 2: Automation document editor

  5. Copy the sample automation code from the file CreatePAA.yaml that you downloaded from the GitHub repository, and paste it into the editor.
  6. For Name, enter CreatePAA, and then choose Create automation.
  7. To check that the CreatePAA document was created successfully, choose the Owned by me tab. You should see the Systems Manager document, as shown in Figure 3.
    Figure 3: Successful creation of the CreatePAA document

    Figure 3: Successful creation of the CreatePAA document

  8. Repeat the preceding steps for creating the PAI. Make sure that you paste the code from the file CreatePAI.yaml into the editor and enter the name CreatePAI to create the PAI CA.
  9. To check that the CreatePAI document was created successfully, choose the Owned by me tab. You should see the CreatePAI Systems Manager document, as shown in Figure 4.
    Figure 4: Successful creation of the PAA and PAI documents

    Figure 4: Successful creation of the PAA and PAI documents

You’ve now completed the creation of an SSM document that contains the automation code to create certificate authorities PAA and PAI. The next step is to create Change Manager templates, which will use the SSM document and apply multi-party approval before the creation of the PAA and PAI.

Create the Change Manager templates

Shirley Rodriguez will next create two change templates that run the SSM documents: one for the PAA and one for the PAI.

To create the change templates

  1. Open the Systems Manager console.
  2. In the left navigation pane, under Change Management, choose Change Manager.
  3. On the Change Manager page, in the top right, choose Create template.
  4. For Name, enter CreatePAATemplate.
  5. In the Builder section, add a description (optional), and for Runbook, search and select CreatePAA. Keep the defaults for the other selections.
    Figure 5: Select the runbook CreatePAA in the change template

    Figure 5: Select the runbook CreatePAA in the change template

  6. Scroll down to the Change request approvals section and choose Add approval level. This is where you configure multi-party approval for the change template.
  7. Because there are two approvers, for Number of approvers required at this level, choose 1 from the dropdown.
  8. Choose Add approver, choose Template specified approvers, and then select the MatterCA-Admin-1. Then choose Add another approval level for the second approver.
    Figure 6: Add first level approver for the template

    Figure 6: Add first level approver for the template

  9. Choose Template specified approvers, and then select the MatterCA-Admin-2 role for multi-party approval. These roles can now approve the change request.
    Figure 7: Add second level approver for the template.

    Figure 7: Add second level approver for the template.

  10. Keep the defaults for the rest of the options, and at the bottom of the screen, choose Save and preview.
  11. On the preview screen, review the configurations, and then on the top right, choose Submit for review. This pushes the template to be reviewed by template reviewer Richard Roe. In the Templates tab, the template status shows as Pending review.
    Figure 8: Template with a status of pending review

    Figure 8: Template with a status of pending review

  12. Repeat the preceding steps to create the PAI change template. Make sure to name it CreatePAITemplate, and at step 5, for Runbook, select the CreatePAI document.
    Figure 9: Both templates ready for review

    Figure 9: Both templates ready for review

You’ve successfully created two change templates, CreatePAATemplate and CreatePAITemplate, that generate a change request that contains an SSM document with automation code for building the PAA and PAI. These change requests are configured with multi-party approval before running the SSM document. However, before you can proceed with running the change template, it must undergo review and approval by the template reviewer Richard Roe.

Review and approve the Change Manager templates

First you need to make sure that TmpltReview-Role is added as a reviewer and approver of change templates. Shirley Rodriguez will follow the steps in this section to add TmpltReview-Role as change template reviewer.

To add the change template reviewer

  1. Follow the instructions in the Systems Manager documentation to configure the IAM role TmpltReview-Role to review and approve the change template. Figure 10 shows how this setup looks in the Systems Manager console.
    Figure 10: The template reviewer role in Settings

    Figure 10: The template reviewer role in Settings

    Now you have TmpltReview-Role added as a reviewer. Change templates that are created or modified will now need to be reviewed and approved by this role. Richard Roe will use the role TmpltReview-Role for governance of change templates, to make sure changes made by Shirley Rodriguez are in alignment with the organization’s compliance needs for Matter.

  2. Richard Roe will follow the steps in the Systems Manager documentation for reviewing and approving change templates, to approve CreatePAATemplate and CreatePAITemplate. After the change template is approved, its status changes to Approved, and it’s ready to be run.
    Figure 11: Change template approval details

    Figure 11: Change template approval details

You now have the change templates CreatePAATemplate and CreatePAITemplate in approved status.

Create the PAA and PAI with multi-party approval for Matter compliance

Up to this point, these instructions have described one-time configurations of AWS Systems Manager to set up the IAM roles, SSM documents, and change templates that are required to enforce multi-party approval. Now you are ready to use these change templates to create the PAA and PAI and perform multi-party approval.

Shirley Rodriguez will generate change requests that require approval from Jane Doe and Paulo Santos. This manual approval process will then run the SSM documents to create the PAA and PAI.

Create a change request for the PAA

Perform the following steps to create a change request for the PAA.

To create a change request for the PAA

  1. Open the Systems Manager console.
  2. In the left navigation pane, choose Change Manager, and then choose Create request.
  3. Search for and select CreatePAATemplate, and then choose Next.
  4. For Name, enter the name CreatePAA_ChangeRequest.
  5. (Optional) For Change request information, provide additional information about the change request.
  6. For Workflow start time, choose Run the operation as soon as possible after approval to run the change immediately after the request is approved.
  7. For Change request approvals, validate that the list of First-level approvals includes the change request approvers MatterCA-Admin-1 and MatterCA-Admin-2, which you configured previously in the section Create Change Manager template. Then choose Next.
    Figure 12: Change request approvers

    Figure 12: Change request approvers

  8. For Automation assume role, select the IAM role CreatePAA_Role for creating the PAA.
    Figure 13: Change request role

    Figure 13: Change request role

  9. For Runbook parameters, enter the PAA certificate details for CommonName, Organization, VendorId, and ValidityInYears, and then choose Next.
  10. Review the change request content and then, at the bottom of the screen, choose Submit for approval. Optionally, you can set up an SNS topic to notify the approvers.

You have successfully created a change request that is currently awaiting approval from Jane Doe and Paulo Santos. Let’s now move on to the approval steps.

Multi-party approval: Approve the change request for the PAA

Each of the approvers should now follow the steps in this section for approval. Jane Doe will use the IAM role MatterCA-Admin-1, while Paulo Santos will need to use the IAM role MatterCA-Admin-2.

To approve the change request for the PAA

  1. Open the Systems Manager console and do the following.
    1. In the navigation pane, choose Change Manager.
    2. Choose the Approvals tab, select the CreatePAA change request, and then choose Approve.
    Figure 14: Change request approval

    Figure 14: Change request approval

    After Jane Doe and Paulo Santos each follow these steps to approve the change request, the change request will run and will complete with status “Success,” and the PAA will be created in AWS Private CA.

  2. Check that the status of the change request is Success, as shown in Figure 15.
    Figure 15: The change request ran successfully

    Figure 15: The change request ran successfully

Validate that the PAA is created in AWS Private CA

Next, you need to validate that the PAA was created successfully and copy its Amazon Resource Name (ARN) to use for PAI creation.

To validate the creation of the PAA and retrieve its ARN

  1. In the AWS Private CA console, choose the PAA CA that you created in the previous step.
  2. Copy the ARN of the PAA root CA. You will use PAA CA ARN when you set up the PAI change request.
    Figure 16: ARN of the PAA root CA PAA

    Figure 16: ARN of the PAA root CA PAA

After successfully completing these steps, you have created the certificate authority PAA by using AWS Private CA with multi-party approval. You can now proceed to the next section, where we will demonstrate how to use this PAA to issue a CA for PAI.

Create a change request for the PAI

Perform the following steps to create a change request for the PAI.

Note: Creation of a valid PAA is a prerequisite for creating the PAI in the following steps.

To create a change request for the PAI

  1. After the PAA is created successfully, you can complete the creation of the PAI by repeating the same steps that you did in the Create a change request for the PAA section, but with the following changes:
    1. At step 3, make sure that you search for and select CreatePAITemplate.
      Figure 17: CreatePAITemplate template

      Figure 17: CreatePAITemplate template

    2. At step 4, for Name, enter CreatePAI_ChangeRequest.
    3. At step 8, for Automation assume role, select the IAM role CreatePAI_Role.
      Figure 18: Change request IAM role selection

      Figure 18: Change request IAM role selection

    4. At step 9, for Runbook parameters, enter the PAA CA ARN that you made note of earlier, along with the CommonName, Organization, VendorId, ProductId, and ValidityInYears for the PAI, and then choose Next.

    Multi-party approval: Approve the change request for the PAI

    Each of the approvers should now follow the steps in this section for approval for the PAI. Jane Doe will need to use IAM role MatterCA-Admin-1, and Paulo Santos will need to use IAM role MatterCA-Admin-2.

    To approve the change request for the PAI

    1. Open the Systems Manager console and do the following:
      1. In the navigation pane, choose Change Manager.
      2. Choose the Approvals tab, select the CreatePAI change request, and choose Approve.

      After both approvers (Jane Doe and Paulo Santos) approve the change request, the change request will run, and the PAA will be created inside AWS Private CA.

    2. Check that the status of the change request shows Success, as shown in Figure 19.
      Figure 19: The change requests for the PAA and PAI have run successfully

      Figure 19: The change requests for the PAA and PAI have run successfully

    3. In the AWS Private CA console, verify that the PAA and PAI have been created successfully, as shown in Figure 20.
      Figure 20: PAA and PAI in AWS Private CA

      Figure 20: PAA and PAI in AWS Private CA

    After successfully completing these steps, you have created the certificate authority PAI by using AWS Private CA with multi-party approval. You can now issue DAC certificates by using this PAI. Next, we will show how to validate the logs to confirm multi-party approval.

    Demonstrate compliance with multi-party approval requirements of the Matter CP by using the change manager timeline

    To keep audit records that show that multi-party approval was used to create the PAA and PAI to issue DACs, you can use the Change Manager timeline.

    To retrieve the change manager timeline report

    1. Open the Systems Manager console.
    2. In the left navigation pane, choose Change Manager.
    3. Choose Requests, and then select either the CreatePAA_ChangeRequest or the CreatePAI_ChangeRequest change request.
    4. Select the Timeline tab. Figure 21 shows an example of a timeline with complete runbook steps for PAA creation. It also shows the two approvers, Jane Doe and Paulo Santos, approving the change request. You can use this information to demonstrate multi-party approval.
      Figure 21: Audit trail for approval and creation of the PAA

      Figure 21: Audit trail for approval and creation of the PAA

      Likewise, Figure 22 shows an example of a timeline with complete runbook steps for creating the PAI by using multi-party approval.

      Figure 22: Audit trail for approval and creation of the PAI

      Figure 22: Audit trail for approval and creation of the PAI

    Conclusion

    In this post, you learned how to use AWS Private CA to facilitate the creation of Matter CAs in compliance with the Matter PKI CP. By using AWS Systems Manager, you can effectively fulfill the technical security control outlined in the Matter PKI CP for implementing multi-party approval for the creation of PAA and PAI certificates.

    If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Private Certificate Authority re:Post or contact AWS Support.

    Want more AWS Security news? Follow us on Twitter.

    Author photo: Ram Ramani

    Ram Ramani

    Ram is a Principal Security Solutions Architect at AWS with deep expertise in data protection and privacy. Ram is currently helping customers accelerate their Matter compliance needs using AWS services.

    Pravin Nair

    Pravin Nair

    Pravin is a seasoned Senior Security Solution Architect focused on data protection and privacy. Specializing in encryption, infrastructure security, and privacy, he assists customers in developing secure and scalable solutions that align with their business requirements. Pravin’s expertise helps to provide optimal data protection while addressing evolving security challenges.

    Lukas Rash

    Lukas Rash

    Lukas is a Software Engineer in the AWS Cryptography organization. He is passionate about building robust cloud services to help customers improve the security of their systems. He specializes in building software to help customers implement their public key infrastructures.

Consolidating controls in Security Hub: The new controls view and consolidated findings

Post Syndicated from Emmanuel Isimah original https://aws.amazon.com/blogs/security/consolidating-controls-in-security-hub-the-new-controls-view-and-consolidated-findings/

In this blog post, we focus on two recently released features of AWS Security Hub: the consolidated controls view and consolidated control findings. You can use these features to manage controls across standards and to consolidate findings, which can help you significantly reduce finding noise and administrative overhead.

Security Hub is a cloud security posture management service that you can use to apply security best practice controls, such as “EC2 instances should not have a public IP address.” With Security Hub, you can check that your environment is properly configured and that your existing configurations don’t pose a security risk. Security Hub has more than 200 controls that cover more than 30 AWS services, such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3), and AWS Lambda. In addition, Security Hub has integrations with more than 85 partner products. Security Hub can centralize findings across your AWS accounts and AWS Regions into a single delegated administrator account in your aggregation Region of choice, creating a single pane of glass to view findings. This can help you to triage, investigate, and respond to findings in a simpler way and improve your security posture.

The Security Hub controls are grouped into the following security standards:

With the new features — consolidated controls view and consolidated control findings—you can now do the following:

  • Enable or disable controls across standards in a single action. Previously, if you wanted to maintain the same enablement status of controls between standards, you had to take the same action across multiple standards (up to six times!).
  • If you choose to turn on consolidated control findings, you will receive only a single finding for a security check, even if the security check is enabled across several standards. This reduces the number of findings and helps you focus on the most important misconfigured resources in your AWS environment. It allows you to apply actions and updates (such as suppressing the finding or changing its severity) one time rather than having to do so multiple times across non-consolidated findings.

Overview of new features

Now we’ll discuss some of the details of how you can use the two new features to streamline the management of controls.

The new consolidated controls view

On the new Controls page, now available in the Security Hub console as shown in Figure 1, you can view and configure security controls across standards from one central location.

Figure 1: Security Hub Controls page

Figure 1: Security Hub Controls page

Before this release, controls had to be managed within the context of individual security standards. Even if the same control was part of multiple standards, the control had different IDs in each of them. With this recent release, Security Hub now assigns controls a unique security control ID across standards, so that it’s simpler for you to reference the controls and view their findings. Following the current naming convention of the AWS FSBP standard, the consolidated control IDs start with the relevant service in scope for the control. In fact, whenever possible, the new consolidated control ID is the same as the previous FSBP control ID.

For example, before this release, control IAM.6 in FSBP was also referenced as 1.14 in CIS 1.2, and 1.6 in CIS 1.4, PCI.IAM.4, and CT.IAM.6. After the release, the control is now referenced as IAM.6 in the Security Hub standards. This change does not affect the pre-existing API calls for Security Hub, such as UpdateStandardsControl, where you can still provide the previous StandardControlARN in order to make the call.

By using the new Controls view, you can understand the status of controls across your system, view control findings, and prioritize next steps without context switching. The following information is available on the Controls page of the Security Hub console:

  • An overall security score, which is based on the proportion of passed controls to the total number of enabled controls.
  • A breakdown of security checks across controls, with the percentage of failed security checks highlighted. Because many controls can contain multiple security checks and multiple findings, this value might be different from the security score, which considers controls as a single object. You can use this metric, as well as your security score, to monitor your progress as you work to remediate findings.
  • A list of controls that are categorized into different tabs based on enablement and compliance status. If you are an administrator of an organization within Security Hub, the enablement and compliance status will reflect the aggregate status of the entire organization. In your finding aggregation Region, the status will also be aggregated across linked Regions.

From the controls page, you can select a control to view its details (including its title and the standards it belongs to), and view and act on the findings generated by the control.

Security Hub also offers new API operations that match the capabilities of the controls page. Unlike the pre-existing API operations, these new API operations use the consolidated control IDs (also known as security control IDs) to provide a way to know and manage the relationship between controls and standards. You can use these API operations to manage each Security Hub control across standards, to make sure that the status of controls in the standards is aligned. The new API operations include the following:

We also provide an example script that makes use of these API calls and applies them across accounts and Regions so that your configuration is consistent. You can use our script to enable or disable Security Hub controls across your various accounts or Regions.

Consolidating control findings between standards

Before we released the consolidated control findings feature, Security Hub generated separate findings per standard for each related control. Now, you can turn on consolidated control findings, and after doing so, Security Hub will produce a single finding per security check, even when the underlying control is shared across multiple standards. Having a single finding per check across standards will help you investigate, update, and remediate failed findings more quickly, while also reducing finding noise.

As an example, we can look at control CloudTrail.2, which is shared between standards supported by Security Hub. Before you turn on this capability, you might potentially receive up to six findings for each security check generated by this control—with one finding for each security standard. After you turn on consolidated control findings, these older findings will be archived and Security Hub will generate one finding per security check in this control, regardless of how many security standards you have enabled. For an example of how the standard-specific findings compare to the new consolidated finding, see Sample control findings. The following is an example of a consolidated finding for the CloudTrial.2 control; we’ve highlighted the part that shows this finding is shared across standards.

{
  "SchemaVersion": "2018-10-08",
  "Id": "arn:aws:securityhub:us-east-2:123456789012:security-control/CloudTrail.2/finding/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111",
  "ProductArn": "arn:aws:securityhub:us-east-2::product/aws/securityhub",
  "ProductName": "Security Hub",
  "CompanyName": "AWS",
  "Region": "us-east-2",
  "GeneratorId": "security-control/CloudTrail.2",
  "AwsAccountId": "123456789012",
  "Types": [
    "Software and Configuration Checks/Industry and Regulatory Standards"
  ],
  "FirstObservedAt": "2022-10-06T02:18:23.076Z",
  "LastObservedAt": "2022-10-28T16:10:06.956Z",
  "CreatedAt": "2022-10-06T02:18:23.076Z",
  "UpdatedAt": "2022-10-28T16:10:00.093Z",
  "Severity": {
    "Label": "MEDIUM",
    "Normalized": "40",
    "Original": "MEDIUM"
  },
  "Title": "CloudTrail should have encryption at-rest enabled",
  "Description": "This AWS control checks whether AWS CloudTrail is configured to use the server-side encryption (SSE) AWS Key Management Service (AWS KMS) customer master key (CMK) encryption. The check will pass if the KmsKeyId is defined.",
  "Remediation": {
    "Recommendation": {
      "Text": "For directions on how to correct this issue, consult the AWS Security Hub controls documentation.",
      "Url": "https://docs.aws.amazon.com/console/securityhub/CloudTrail.2/remediation"
    }
  },
  "ProductFields": {
    "RelatedAWSResources:0/name": "securityhub-cloud-trail-encryption-enabled-fe95bf3f",
    "RelatedAWSResources:0/type": "AWS::Config::ConfigRule",
    "aws/securityhub/ProductName": "Security Hub",
    "aws/securityhub/CompanyName": "AWS",
    "Resources:0/Id": "arn:aws:cloudtrail:us-east-2:123456789012:trail/AWSMacieTrail-DO-NOT-EDIT",
    "aws/securityhub/FindingId": "arn:aws:securityhub:us-east-2::product/aws/securityhub/arn:aws:securityhub:us-east-2:123456789012:security-control/CloudTrail.2/finding/a1b2c3d4-5678-90ab-cdef-EXAMPLE11111"
  }
  "Resources": [
    {
      "Type": "AwsCloudTrailTrail",
      "Id": "arn:aws:cloudtrail:us-east-2:123456789012:trail/AWSMacieTrail-DO-NOT-EDIT",
      "Partition": "aws",
      "Region": "us-east-2"
    }
  ],
  "Compliance": {
    "Status": "FAILED",
    "RelatedRequirements": [
        "PCI DSS v3.2.1/3.4",
        "CIS AWS Foundations Benchmark v1.2.0/2.7",
        "CIS AWS Foundations Benchmark v1.4.0/3.7"
    ],
    "SecurityControlId": "CloudTrail.2",
    "AssociatedStandards": [
  { "StandardsId": "standards/aws-foundational-security-best-practices/v/1.0.0"},
  { "StandardsId": "standards/pci-dss/v/3.2.1"},
  { "StandardsId": "ruleset/cis-aws-foundations-benchmark/v/1.2.0"},
  { "StandardsId": "standards/cis-aws-foundations-benchmark/v/1.4.0"},
  { "StandardsId": "standards/service-managed-aws-control-tower/v/1.0.0"},
  ]
  },
  "WorkflowState": "NEW",
  "Workflow": {
    "Status": "NEW"
  },
  "RecordState": "ACTIVE",
  "FindingProviderFields": {
    "Severity": {
      "Label": "MEDIUM",
      "Normalized": "40",
      "Original": "MEDIUM"
    },
    "Types": [
      "Software and Configuration Checks/Industry and Regulatory Standards"
    ]
  }
}

To turn on consolidated control findings

  1. Open the Security Hub console.
  2. In the left navigation pane, choose Settings, and then choose the General tab.
  3. Under Controls, turn on Consolidated control findings, and then choose Save.
Figure 2: Turn on consolidated control findings

Figure 2: Turn on consolidated control findings

If you are using the Security Hub integration with AWS Organizations or have invited member accounts through a manual invitation process, consolidated control findings can only be turned on by the administrator account. When this action is taken in the administrator account, the action will also be reflected in each member account in the current Region. It can take up to 18 hours for Security Hub to archive existing standard-specific findings and generate the new, standard-agnostic, findings.

You can also enable consolidated control findings by using the API (calling the UpdateSecurityHubConfiguration API with the ControlFindingGenerator parameter equal to SECURITY_CONTROL), or by using the AWS CLI (running the update-security-hub-configuration command with control-finding-generator equal to SECURITY_CONTROL), as in the following example.

aws securityhub ‐‐region <Region of choice> update-security-hub-configuration ‐‐control-finding-generator SECURITY_CONTROL

Much like the console behavior, if you have an organizational setup in Security Hub, this API action can only be taken by the administrator, and it will be reflected in each member account in the same Region.

What to expect when you enable consolidated control findings

To allow for these new capabilities to be launched, changes to the AWS Security Finding Format (ASFF) are required. This format is used by Security Hub for findings it generates from its controls or ingests from external providers. When you turn on finding consolidation, Security Hub will archive old standard-specific findings and generate standard-agnostic findings instead. This action will only affect control findings that Security Hub generates, and it will not affect findings ingested from partner products. However, in Security Hub findings, turning on consolidated control findings might cause some updates that you previously made to findings to be archived. Despite this one-time change, after the migration is complete (it can take up to 18 hours), you will be able to update finding fields in a single action and the updates will apply across standards, without the need to make multiple updates.

One field affected by the new capabilities is the Workflow field, which provides information about the status of the investigation into a finding. Manipulating this field can also update the overall compliance status of the control that the finding is related to. For example, if you have a control with one failed finding (and the rest have passed), and the failed finding comes from a resource for which you’d like to make an exception, you can decide to suppress that failed finding by updating the Workflow field. If you suppress failed findings in a control, its compliance status can change to pass.

Before turning on consolidated control findings, if you want to maintain an aligned compliance status in controls that belong to multiple standards, you have to update the Workflow status of findings in each standard. After turning on finding consolidation, you will only have to update the Workflow status once, and the suppression will be applied across standards, helping you to reduce the number of steps needed to suppress the same findings across standards.

As mentioned earlier, when you turn on this new capability, some updates made to the previous, standard-specific findings will be archived and will not be included in the new consolidated control findings generated by Security Hub. In the case of the Workflow status, the new consolidated findings will be created with a value of NEW (for failed findings) or RESOLVED (for new findings) in the Workflow field. However, after you have onboarded to the new finding format, you can update the value of the Workflow field, as well as other fields, and this value will be maintained without requiring you to make continuous updates. For the full list of fields that can be affected by the migration to the consolidated finding format, see Consolidated control findings – ASFF changes. Before you turn on finding consolidation, we suggest that you check if your custom automations refer to those affected fields. If they do, you can update your automations and test them by using the Sample control findings in the documentation.

Conclusion

This blog post covers new Security Hub features that make it simpler for you to manage controls across standards. With the new consolidated control findings feature, you can focus on the most relevant findings and reduce noise, which is why we recommend that you review the new feature and its associated changes and turn it on at your earliest convenience.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Security Hub forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Emmanuel Isimah

Emmanuel Isimah

Emmanuel is a solutions architect covering hypergrowth customers in the Digital Native Business sector. He has a background in networking, security, and containers. Emmanuel helps customers build and secure innovative cloud solutions, solving their business problems by using data-driven approaches. Emmanuel’s areas of depth include security and compliance, cloud operations, and containers.

Level up your React app with Amazon QuickSight: How to embed your dashboard for anonymous access

Post Syndicated from Adrianna Kurzela original https://aws.amazon.com/blogs/big-data/level-up-your-react-app-with-amazon-quicksight-how-to-embed-your-dashboard-for-anonymous-access/

Using embedded analytics from Amazon QuickSight can simplify the process of equipping your application with functional visualizations without any complex development. There are multiple ways to embed QuickSight dashboards into application. In this post, we look at how it can be done using React and the Amazon QuickSight Embedding SDK.

Dashboard consumers often don’t have a user assigned to their AWS account and therefore lack access to the dashboard. To enable them to consume data, the dashboard needs to be accessible for anonymous users. Let’s look at the steps required to enable an unauthenticated user to view your QuickSight dashboard in your React application.

Solution overview

Our solution uses the following key services:

After loading the web page on the browser, the browser makes a call to API Gateway, which invokes a Lambda function that calls the QuickSight API to generate a dashboard URL for an anonymous user. The Lambda function needs to assume an IAM role with the required permissions. The following diagram shows an overview of the architecture.

Architecture

Prerequisites

You must have the following prerequisites:

Set up permissions for unauthenticated viewers

In your account, create an IAM policy that your application will assume on behalf of the viewer:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. On the JSON tab, enter the following policy code:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "quicksight:GenerateEmbedUrlForAnonymousUser"
            ],
            "Resource": [
                "arn:aws:quicksight:*:*:namespace/default",
				"arn:aws:quicksight:*:*:dashboard/<YOUR_DASHBOARD_ID1>",
				"arn:aws:quicksight:*:*:dashboard/<YOUR_DASHBOARD_ID2>"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Make sure to change the value of <YOUR_DASHBOARD_ID> to the value of the dashboard ID. Note this ID to use in a later step as well.

For the second statement object with logs, permissions are optional. It allows you to create a log group with the specified name, create a log stream for the specified log group, and upload a batch of log events to the specified log stream.

In this policy, we allow the user to perform the GenerateEmbedUrlForAnonymousUser action on the dashboard ID within the list of dashboard IDs inserted in the placeholder.

  1. Enter a name for your policy (for example, AnonymousEmbedPolicy) and choose Create policy.

Next, we create a role and attach this policy to the role.

  1. Choose Roles in the navigation pane, then choose Create role.

Identity and access console

  1. Choose Lambda for the trusted entity.
  2. Search for and select AnonymousEmbedPolicy, then choose Next.
  3. Enter a name for your role, such as AnonymousEmbedRole.
  4. Make sure the policy name is included in the Add permissions section.
  5. Finish creating your role.

You have just created the AnonymousEmbedRole execution role. You can now move to the next step.

Generate an anonymous embed URL Lambda function

In this step, we create a Lambda function that interacts with QuickSight to generate an embed URL for an anonymous user. Our domain needs to be allowed. There are two ways to achieve the integration of Amazon QuickSight:

  1. By adding the URL to the list of allowed domains in the Amazon QuickSight admin console (explained later in [Optional] Add your domain in QuickSight section).
  2. [Recommended] By adding the embed URL request during runtime in the API call. Option 1 is recommended when you need to persist the allowed domains. Otherwise, the domains will be removed after 30 minutes, which is equivalent to the session duration. For other use cases, it is recommended to use the second option (described and implemented below).

On the Lambda console, create a new function.

  1. Select Author from scratch.
  2. For Function name, enter a name, such as AnonymousEmbedFunction.
  3. For Runtime¸ choose Python 3.9.
  4. For Execution role¸ choose Use an existing role.
  5. Choose the role AnonymousEmbedRole.
  6. Choose Create function.
  7. On the function details page, navigate to the Code tab and enter the following code:

import json, boto3, os, re, base64

def lambda_handler(event, context):
print(event)
try:
def getQuickSightDashboardUrl(awsAccountId, dashboardIdList, dashboardRegion, event):
#Create QuickSight client
quickSight = boto3.client('quicksight', region_name=dashboardRegion);

#Construct dashboardArnList from dashboardIdList
dashboardArnList=[ 'arn:aws:quicksight:'+dashboardRegion+':'+awsAccountId+':dashboard/'+dashboardId for dashboardId in dashboardIdList]
#Generate Anonymous Embed url
response = quickSight.generate_embed_url_for_anonymous_user(
AwsAccountId = awsAccountId,
Namespace = 'default',
ExperienceConfiguration = {'Dashboard':{'InitialDashboardId': dashboardIdList[0]}},
AuthorizedResourceArns = dashboardArnList,
SessionLifetimeInMinutes = 60,
AllowedDomains = ['http://localhost:3000']
)
return response

#Get AWS Account Id
awsAccountId = context.invoked_function_arn.split(':')[4]

#Read in the environment variables
dashboardIdList = re.sub(' ','',os.environ['DashboardIdList']).split(',')
dashboardNameList = os.environ['DashboardNameList'].split(',')
dashboardRegion = os.environ['DashboardRegion']

response={}

response = getQuickSightDashboardUrl(awsAccountId, dashboardIdList, dashboardRegion, event)

return {'statusCode':200,
'headers': {"Access-Control-Allow-Origin": "http://localhost:3000",
"Content-Type":"text/plain"},
'body':json.dumps(response)
}

except Exception as e: #catch all
return {'statusCode':400,
'headers': {"Access-Control-Allow-Origin": "http://localhost:3000",
"Content-Type":"text/plain"},
'body':json.dumps('Error: ' + str(e))
}

If you don’t use localhost, replace http://localhost:3000 in the returns with the hostname of your application. To move to production, don’t forget to replace http://localhost:3000 with your domain.

  1. On the Configuration tab, under General configuration, choose Edit.
  2. Increase the timeout from 3 seconds to 30 seconds, then choose Save.
  3. Under Environment variables, choose Edit.
  4. Add the following variables:
    1. Add DashboardIdList and list your dashboard IDs.
    2. Add DashboardRegion and enter the Region of your dashboard.
  5. Choose Save.

Your configuration should look similar to the following screenshot.

  1. On the Code tab, choose Deploy to deploy the function.

Environment variables console

Set up API Gateway to invoke the Lambda function

To set up API Gateway to invoke the function you created, complete the following steps:

  1. On the API Gateway console, navigate to the REST API section and choose Build.
  2. Under Create new API, select New API.
  3. For API name, enter a name (for example, QuicksightAnonymousEmbed).
  4. Choose Create API.

API gateway console

  1. On the Actions menu, choose Create resource.
  2. For Resource name, enter a name (for example, anonymous-embed).

Now, let’s create a method.

  1. Choose the anonymous-embed resource and on the Actions menu, choose Create method.
  2. Choose GET under the resource name.
  3. For Integration type, select Lambda.
  4. Select Use Lambda Proxy Integration.
  5. For Lambda function, enter the name of the function you created.
  6. Choose Save, then choose OK.

API gateway console

Now we’re ready to deploy the API.

  1. On the Actions menu, choose Deploy API.
  2. For Deployment stage, select New stage.
  3. Enter a name for your stage, such as embed.
  4. Choose Deploy.

[Optional] Add your domain in QuickSight

If you added Allowed domains in Generate an anonymous embed URL Lambda function part, feel free to move to Turn on capacity pricing section.

To add your domain to the allowed domains in QuickSight, complete the following steps:

  1. On the QuickSight console, choose the user menu, then choose Manage QuickSight.

Quicksight dropdown menu

  1. Choose Domains and Embedding in the navigation pane.
  2. For Domain, enter your domain (http://localhost:<PortNumber>).

Make sure to replace <PortNumber> to match your local setup.

  1. Choose Add.

Quicksight admin console

Make sure to replace the localhost domain with the one you will use after testing.

Turn on capacity pricing

If you don’t have session capacity pricing enabled, follow the steps in this section. It’s mandatory to have this function enabled to proceed further.

Capacity pricing allows QuickSight customers to purchase reader sessions in bulk without having to provision individual readers in QuickSight. Capacity pricing is ideal for embedded applications or large-scale business intelligence (BI) deployments. For more information, visit Amazon QuickSight Pricing.

To turn on capacity pricing, complete the following steps:

  1. On the Manage QuickSight page, choose Your Subscriptions in the navigation pane.
  2. In the Capacity pricing section, select Get monthly subscription.
  3. Choose Confirm subscription.

To learn more about capacity pricing, see New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding in public websites, and developer portal for embedded analytics.

Set up your React application

To set up your React application, complete the following steps:

  1. In your React project folder, go to your root directory and run npm i amazon-quicksight-embedding-sdk to install the amazon-quicksight-embedding-sdk package.
  2. In your App.js file, replace the following:
    1. Replace YOUR_API_GATEWAY_INVOKE_URL/RESOURCE_NAME with your API Gateway invoke URL and your resource name (for example, https://xxxxxxxx.execute-api.xx-xxx-x.amazonaws.com/embed/anonymous-embed).
    2. Replace YOUR_DASHBOARD1_ID with the first dashboardId from your DashboardIdList. This is the dashboard that will be shown on the initial render.
    3. Replace YOUR_DASHBOARD2_ID with the second dashboardId from your DashboardIdList.

The following code snippet shows an example of the App.js file in your React project. The code is a React component that embeds a QuickSight dashboard based on the selected dashboard ID. The code contains the following key components:

  • State hooks – Two state hooks are defined using the useState() hook from React:
    • dashboard – Holds the currently selected dashboard ID.
    • quickSightEmbedding – Holds the QuickSight embedding object returned by the embedDashboard() function.
  • Ref hook – A ref hook is defined using the useRef() hook from React. It’s used to hold a reference to the DOM element where the QuickSight dashboard will be embedded.
  • useEffect() hook – The useEffect() hook is used to trigger the embedding of the QuickSight dashboard whenever the selected dashboard ID changes. It first fetches the dashboard URL for the selected ID from the QuickSight API using the fetch() method. After it retrieves the URL, it calls the embed() function with the URL as the argument.
  • Change handler – The changeDashboard() function is a simple event handler that updates the dashboard state whenever the user selects a different dashboard from the drop-down menu. As soon as new dashboard ID is set, the useEffect hook is triggered.
  • 10-millisecond timeout – The purpose of using the timeout is to introduce a small delay of 10 milliseconds before making the API call. This delay can be useful in scenarios where you want to avoid immediate API calls or prevent excessive requests when the component renders frequently. The timeout gives the component some time to settle before initiating the API request. Because we’re building the application in development mode, the timeout helps avoid errors caused by the double run of useEffect within StrictMode. For more information, refer to Updates to Strict Mode.

See the following code:

import './App.css';
import * as React from 'react';
import { useEffect, useRef, useState } from 'react';
import { createEmbeddingContext } from 'amazon-quicksight-embedding-sdk';

 function App() {
  const dashboardRef = useRef([]);
  const [dashboardId, setDashboardId] = useState('YOUR_DASHBOARD1_ID');
  const [embeddedDashboard, setEmbeddedDashboard] = useState(null);
  const [dashboardUrl, setDashboardUrl] = useState(null);
  const [embeddingContext, setEmbeddingContext] = useState(null);

  useEffect(() => {
    const timeout = setTimeout(() => {
      fetch("YOUR_API_GATEWAY_INVOKE_URL/RESOURCE_NAME"
      ).then((response) => response.json()
      ).then((response) => {
        setDashboardUrl(response.EmbedUrl)
      })
    }, 10);
    return () => clearTimeout(timeout);
  }, []);

  const createContext = async () => {
    const context = await createEmbeddingContext();
    setEmbeddingContext(context);
  }

  useEffect(() => {
    if (dashboardUrl) { createContext() }
  }, [dashboardUrl])

  useEffect(() => {
    if (embeddingContext) { embed(); }
  }, [embeddingContext])

  const embed = async () => {

    const options = {
      url: dashboardUrl,
      container: dashboardRef.current,
      height: "500px",
      width: "600px",
    };

    const newEmbeddedDashboard = await embeddingContext.embedDashboard(options);
    setEmbeddedDashboard(newEmbeddedDashboard);
  };

  useEffect(() => {
    if (embeddedDashboard) {
      embeddedDashboard.navigateToDashboard(dashboardId, {})
    }
  }, [dashboardId])

  const changeDashboard = async (e) => {
    const dashboardId = e.target.value
    setDashboardId(dashboardId)
  }

  return (
    <>
      <header>
        <h1>Embedded <i>QuickSight</i>: Build Powerful Dashboards in React</h1>
      </header>
      <main>
        <p>Welcome to the QuickSight dashboard embedding sample page</p>
        <p>Please pick a dashboard you want to render</p>
        <select id='dashboard' value={dashboardId} onChange={changeDashboard}>
          <option value="YOUR_DASHBOARD1_ID">YOUR_DASHBOARD1_NAME</option>
          <option value="YOUR_DASHBOARD2_ID">YOUR_DASHBOARD2_NAME</option>
        </select>
        <div ref={dashboardRef} />
      </main>
    </>
  );
};

export default App

Next, replace the contents of your App.css file, which is used to style and layout your web page, with the content from the following code snippet:

body {
  background-color: #ffffff;
  font-family: Arial, sans-serif;
  margin: 0;
  padding: 0;
}

header {
  background-color: #f1f1f1;
  padding: 20px;
  text-align: center;
}

h1 {
  margin: 0;
}

main {
  margin: 20px;
  text-align: center;
}

p {
  margin-bottom: 20px;
}

a {
  color: #000000;
  text-decoration: none;
}

a:hover {
  text-decoration: underline;
}

i {
  color: orange;
  font-style: normal;
}

Now it’s time to test your app. Start your application by running npm start in your terminal. The following screenshots show examples of your app as well as the dashboards it can display.

Example application page with the visualisation

Example application page with the visualisation

Conclusion

In this post, we showed you how to embed a QuickSight dashboard into a React application using the AWS SDK. Sharing your dashboard with anonymous users allows them to access your dashboard without granting them access to your AWS account. There are also other ways to share your dashboard anonymously, such as using 1-click public embedding.

Join the Quicksight Community to ask, answer and learn with others and explore additional resources.


About the Author

author headshot

Adrianna is a Solutions Architect at AWS Global Financial Services. Having been a part of Amazon since August 2018, she has had the chance to be involved both in the operations as well as the cloud business of the company. Currently, she builds software assets which demonstrate innovative use of AWS services, tailored to a specific customer use cases. On a daily basis, she actively engages with various aspects of technology, but her true passion lies in combination of web development and analytics.

Getting started guide for near-real time operational analytics using Amazon Aurora zero-ETL integration with Amazon Redshift

Post Syndicated from Rohit Vashishtha original https://aws.amazon.com/blogs/big-data/getting-started-guide-for-near-real-time-operational-analytics-using-amazon-aurora-zero-etl-integration-with-amazon-redshift/

Amazon Aurora zero-ETL integration with Amazon Redshift was announced at AWS re:Invent 2022 and is now available in public preview for Amazon Aurora MySQL-Compatible Edition 3 (compatible with MySQL 8.0) in regions us-east-1, us-east-2, us-west-2, ap-northeast-1 and eu-west-1. For more details, refer to the What’s New Post.

In this post, we provide step-by-step guidance on how to get started with near-real time operational analytics using this feature.

Challenges

Customers across industries today are looking to increase revenue and customer engagement by implementing near-real time analytics use cases like personalization strategies, fraud detection, inventory monitoring, and many more. There are two broad approaches to analyzing operational data for these use cases:

  • Analyze the data in-place in the operational database (e.g. read replicas, federated query, analytics accelerators)
  • Move the data to a data store optimized for running analytical queries such as a data warehouse

The zero-ETL integration is focused on simplifying the latter approach.

A common pattern for moving data from an operational database to an analytics data warehouse is via extract, transform, and load (ETL), a process of combining data from multiple sources into a large, central repository (data warehouse). ETL pipelines can be expensive to build and complex to manage. With multiple touchpoints, intermittent errors in ETL pipelines can lead to long delays, leaving applications that rely on this data to be available in the data warehouse with stale or missing data, further leading to missed business opportunities.

For customers that need to run unified analytics across data from multiple operational databases, solutions that analyze data in-place may work great for accelerating queries on a single database, but such systems have a limitation of not being able to aggregate data from multiple operational databases.

Zero-ETL

At AWS, we have been making steady progress towards bringing our zero-ETL vision to life. With Aurora zero-ETL integration with Amazon Redshift, you can bring together the transactional data of Aurora with the analytics capabilities of Amazon Redshift. It minimizes the work of building and managing custom ETL pipelines between Aurora and Amazon Redshift. Data engineers can now replicate data from multiple Aurora database clusters into the same or a new Amazon Redshift instance to derive holistic insights across many applications or partitions. Updates in Aurora are automatically and continuously propagated to Amazon Redshift so the data engineers have the most recent information in near-real time. Additionally, the entire system can be serverless and can dynamically scale up and down based on data volume, so there’s no infrastructure to manage.

When you create an Aurora zero-ETL integration with Amazon Redshift, you continue to pay for Aurora and Amazon Redshift usage with existing pricing (including data transfer). The Aurora zero-ETL integration with Amazon Redshift feature is available at no additional cost.

With Aurora zero-ETL integration with Amazon Redshift, the integration replicates data from the source database into the target data warehouse. The data becomes available in Amazon Redshift within seconds, allowing users to use the analytics features of Amazon Redshift and capabilities like data sharing, workload optimization autonomics, concurrency scaling, machine learning, and many more. You can perform real-time transaction processing on data in Aurora while simultaneously using Amazon Redshift for analytics workloads such as reporting and dashboards.

The following diagram illustrates this architecture.

Solution overview

Let’s consider TICKIT, a fictional website where users buy and sell tickets online for sporting events, shows, and concerts. The transactional data from this website is loaded into an Aurora MySQL 3.03.1 (or higher version) database. The company’s business analysts want to generate metrics to identify ticket movement over time, success rates for sellers, and the best-selling events, venues, and seasons. They would like to get these metrics in near-real time using a zero-ETL integration.

The integration is set up between Amazon Aurora MySQL-Compatible Edition 3.03.1 (source) and Amazon Redshift (destination). The transactional data from the source gets refreshed in near-real time on the destination, which processes analytical queries.

You can use either the provisioned or serverless option for both Amazon Aurora MySQL-Compatible Edition as well as Amazon Redshift. For this illustration, we use a provisioned Aurora database and an Amazon Redshift Serverless data warehouse. For the complete list of public preview considerations, please refer to the feature AWS documentation.

The following diagram illustrates the high-level architecture.

The following are the steps needed to set up zero-ETL integration. For complete getting started guides, refer to the following documentation links for Aurora and Amazon Redshift.

  1. Configure the Aurora MySQL source with a customized DB cluster parameter group.
  2. Configure the Amazon Redshift Serverless destination with the required resource policy for its namespace.
  3. Update the Redshift Serverless workgroup to enable case-sensitive identifiers.
  4. Configure the required permissions.
  5. Create the zero-ETL integration.
  6. Create a database from the integration in Amazon Redshift.

Configure the Aurora MySQL source with a customized DB cluster parameter group

To create an Aurora MySQL database, complete the following steps:

  1. On the Amazon RDS console, create a DB cluster parameter group called zero-etl-custom-pg.

Zero-ETL integrations require specific values for the Aurora DB cluster parameters that control binary logging (binlog). For example, enhanced binlog mode must be turned on (aurora_enhanced_binlog=1).

  1. Set the following binlog cluster parameter settings:
    1. binlog_backup=0
    2. binlog_replication_globaldb=0
    3. binlog_format=ROW
    4. aurora_enhanced_binlog=1
    5. binlog_row_metadata=FULL
    6. binlog_row_image=FULL
  2. Choose Save changes.
  3. Choose Databases in the navigation pane, then choose Create database.
  4. For Available versions, choose Aurora MySQL 3.03.1 (or higher).
  5. For Templates, select Production.
  6. For DB cluster identifier, enter zero-etl-source-ams.
  7. Under Instance configuration, select Memory optimized classes and choose a suitable instance size (the default is db.r6g.2xlarge).
  8. Under Additional configuration, for DB cluster parameter group, choose the parameter group you created earlier (zero-etl-custom-pg).
  9. Choose Create database.

In a couple of minutes, it should spin up an Aurora MySQL database as the source for zero-ETL integration.

Configure the Redshift Serverless destination

For our use case, create a Redshift Serverless data warehouse by completing the following steps:

  1. On the Amazon Redshift console, choose Serverless dashboard in the navigation pane.
  2. Choose Create preview workgroup.
  3. For Workgroup name, enter zero-etl-target-rs-wg.
  4. For Namespace, select Create a new namespace and enter zero-etl-target-rs-ns.
  5. Navigate to the namespace zero-etl-target-rs-ns and choose the Resource policy tab.
  6. Choose Add authorized principals.
  7. Enter either the Amazon Resource Name (ARN) of the AWS user or role, or the AWS account ID (IAM principals) that are allowed to create integrations in this namespace.

An account ID is stored as an ARN with root user.

  1. Add an authorized integration source to the namespace and specify the ARN of the Aurora MySQL DB cluster that’s the data source for the zero-ETL integration.
  2. Choose Save changes.

You can get the ARN for the Aurora MySQL source on the Configuration tab as shown in the following screenshot.

Update the Redshift Serverless workgroup to enable case-sensitive identifiers

Use the AWS Command Line Interface (AWS CLI) to run the update-workgroup action:

aws redshift-serverless update-workgroup --workgroup-name zero-etl-target-rs-wg --config-parameters parameterKey=enable_case_sensitive_identifier,parameterValue=true --region us-east-1

You can use AWS CloudShell or another interface like Amazon Elastic Compute Cloud (Amazon EC2) with an AWS user configuration that can update the Redshift Serverless parameter group. The following screenshot illustrates how to run this on CloudShell.

The following screenshot shows how to run the update-workgroup command on Amazon EC2.

Configure required permissions

To create a zero-ETL integration, your user or role must have an attached identity-based policy with the appropriate AWS Identity and Access Management (IAM) permissions. The following sample policy allows the associated principal to perform the following actions:

  • Create zero-ETL integrations for the source Aurora DB cluster.
  • View and delete all zero-ETL integrations.
  • Create inbound integrations into the target data warehouse. This permission is not required if the same account owns the Amazon Redshift data warehouse and this account is an authorized principal for that data warehouse. Also note that Amazon Redshift has a different ARN format for provisioned and serverless:
    • Provisioned clusterarn:aws:redshift:{region}:{account-id}:namespace:namespace-uuid
    • Serverlessarn:aws:redshift-serverless:{region}:{account-id}:namespace/namespace-uuid

Complete the following steps to configure the permissions:

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. Create a new policy called rds-integrations using the following JSON:
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "rds:CreateIntegration"
            ],
            "Resource": [
                "arn:aws:rds:{region}:{account-id}:cluster:source-cluster",
                "arn:aws:rds:{region}:{account-id}:integration:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds:DescribeIntegration"
            ],
            "Resource": ["*"]
        },
        {
            "Effect": "Allow",
            "Action": [
                "rds:DeleteIntegration"
            ],
            "Resource": [
                "arn:aws:rds:{region}:{account-id}:integration:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:CreateInboundIntegration"
            ],
            "Resource": [
                "arn:aws:redshift:{region}:{account-id}:cluster:namespace-uuid"
            ]
        }]
    }

Policy preview:

If you see IAM policy warnings for the RDS policy actions, this is expected because the feature is in public preview. These actions will become part of IAM policies when the feature is generally available. It’s safe to proceed.

  1. Attach the policy you created to your IAM user or role permissions.

Create the zero-ETL integration

To create the zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Choose Create zero-ETL integration.
  3. For Integration name, enter a name, for example zero-etl-demo.
  4. For Aurora MySQL source cluster, browse and choose the source cluster zero-etl-source-ams.
  5. Under Destination, for Amazon Redshift data warehouse, choose the Redshift Serverless destination namespace (zero-etl-target-rs-ns).
  6. Choose Create zero-ETL integration.

To specify a target Amazon Redshift data warehouse that’s in another AWS account, you must create a role that allows users in the current account to access resources in the target account. For more information, refer to Providing access to an IAM user in another AWS account that you own.

Create a role in the target account with the following permissions:

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Action":[
            "redshift:DescribeClusters",
            "redshift-serverless:ListNamespaces"
         ],
         "Resource":[
            "*"
         ]
      }
   ]
}

The role must have the following trust policy, which specifies the target account ID. You can do this by creating a role with a trusted entity as an AWS account ID in another account.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "AWS": "arn:aws:iam::{external-account-id}:root"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}

The following screenshot illustrates creating this on the IAM console.

Then while creating the zero-ETL integration, choose the destination account ID and the name of the role you created to proceed further, for Specify a different account option.

You can choose the integration to view the details and monitor its progress. It takes a few minutes to change the status from Creating to Active. The time varies depending on size of the dataset already available in the source.

Create a database from the integration in Amazon Redshift

To create your database, complete the following steps:

  1. On the Redshift Serverless dashboard, navigate to the zero-etl-target-rs-ns namespace.
  2. Choose Query data to open Query Editor v2.
  3. Connect to the preview Redshift Serverless data warehouse by choosing Create connection.
  4. Obtain the integration_id from the svv_integration system table:

    select integration_id from svv_integration; ---- copy this result, use in the next sql

  5. Use the integration_id from the previous step to create a new database from the integration:
    CREATE DATABASE aurora_zeroetl FROM INTEGRATION '<result from above>';

The integration is now complete, and an entire snapshot of the source will reflect as is in the destination. Ongoing changes will be synced in near-real time.

Analyze the near-real time transactional data

Now we can run analytics on TICKIT’s operational data.

Populate the source TICKIT data

To populate the source data, complete the following steps:

  1. Connect to your Aurora MySQL cluster and create a database/schema for the TICKIT data model, verify that the tables in that schema have a primary key, and initiate the load process:
    mysql -h <amazon_aurora_mysql_writer_endpoint> -u admin -p

You can use the script from the following HTML file to create the sample database demodb (using the tickit.db model) in Amazon Aurora MySQL-Compatible edition.

  1. Run the script to create the tickit.db model tables in the demodb database/schema:
  2. Load data from Amazon Simple Storage Service (Amazon S3), record the finish time for change data capture (CDC) validations at destination, and observe how active the integration was.

The following are common errors associated with load from Amazon S3:

  • For the current version of the Aurora MySQL cluster, we need to set the aws_default_s3_role parameter in the DB cluster parameter group to the role ARN that has the necessary Amazon S3 access permissions.
  • If you get an error for missing credentials (for example, Error 63985 (HY000): S3 API returned error: Missing Credentials: Cannot instantiate S3 Client), you probably haven’t associated your IAM role to the cluster. In this case, add the intended IAM role to the source Aurora MySQL cluster.

Analyze the source TICKIT data in the destination

On the Redshift Serverless dashboard, open Query Editor v2 using the database you created as part of the integration setup. Use the following code to validate the seed or CDC activity:

SELECT * FROM SYS_INTEGRATION_ACTIVITY;

Choose the cluster or workgroup and database created from integration on the drop-down menu and run tickit.db sample analytic queries.

Monitoring

You can query the following system views and tables in Amazon Redshift to get information about your Aurora zero-ETL integrations with Amazon Redshift:

In order to view the integration-related metrics published to Amazon CloudWatch, navigate to Amazon Redshift console. Choose Zero-ETL integrations from left navigation pane and click on the integration links to display activity metrics.

Available metrics on the Redshift console are Integration metrics and table statistics, with table statistics providing details of each table replicated from Aurora MySQL to Amazon Redshift.

Integration metrics contains table replication success/failure counts and lag details:

Clean up

When you delete a zero-ETL integration, Aurora removes it from your Aurora cluster. Your transactional data isn’t deleted from Aurora or Amazon Redshift, but Aurora doesn’t send new data to Amazon Redshift.

To delete a zero-ETL integration, complete the following steps:

  1. On the Amazon RDS console, choose Zero-ETL integrations in the navigation pane.
  2. Select the zero-ETL integration that you want to delete and choose Delete.
  3. To confirm the deletion, choose Delete.

Conclusion

In this post, we showed you how to set up Aurora zero-ETL integration from Amazon Aurora MySQL-Compatible Edition to Amazon Redshift. This minimizes the need to maintain complex data pipelines and enables near-real time analytics on transactional and operational data.

To learn more about Aurora zero-ETL integration with Amazon Redshift, visit documentation for Aurora and Amazon Redshift.


About the Authors

Rohit Vashishtha is a Senior Analytics Specialist Solutions Architect at AWS based in Dallas, Texas. He has 17 years of experience architecting, building, leading, and maintaining big data platforms. Rohit helps customers modernize their analytic workloads using the breadth of AWS services and ensures that customers get the best price/performance with utmost security and data governance.

Vijay Karumajji is a Database Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

BP Yau is a Sr Partner Solutions Architect at AWS. He is passionate about helping customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate its Oracle data warehouse to Amazon Redshift and build its next generation big data analytics platform using AWS technologies.

Jyoti Aggarwal is a Product Manager on the Amazon Redshift team based in Seattle. She has spent the last 10 years working on multiple products in the data warehouse industry.

Adam Levin is a Product Manager on the Amazon Aurora team based in California. He has spent the last 10 years working on various cloud database services.

Enforce boundaries on AWS Glue interactive sessions

Post Syndicated from Nicolas Jacob Baer original https://aws.amazon.com/blogs/big-data/enforce-boundaries-on-aws-glue-interactive-sessions/

AWS Glue interactive sessions allow engineers to build, test, and run data preparation and analytics workloads in an interactive notebook. Interactive sessions provide isolated development environments, take care of the underlying compute cluster, and allow for configuration to stop idling resources.

Glue interactive sessions provides default recommended configurations, and also allows users to customize the session to meet their needs. For example, you can provision more workers to experiment on a larger dataset or set the idle timeout for long-running workloads. With the flexibility to change these options depending on the workload, you may need ensure that the options are changed within specific boundaries and apply a control mechanism.

In this post, we present the process of deploying a reusable solution to enforce AWS Glue interactive session limits on three options: connection, number of workers, and maximum idle time. The first option addresses the need for applying custom inspection and controls on traffic, for example by enforcing an interactive session to only be run inside a VPC. The other two enforce limits on costs and usage of AWS Glue resources by enforcing an upper boundary on the number of workers and idle time per session. You can further extend the solution for other properties or services within AWS Glue.

Overview of solution

The proposed architecture is built on serverless components and runs whenever a new AWS Glue interactive session is created.

Architecture Diagram of the Solution

The workflow steps are as follows:

  1. A data engineer creates a new AWS Glue interactive session either through the AWS Management Console or in a Jupyter notebook locally.
  2. The interactive session produces a new event to AWS CloudTrail for the CreateSession event with all relevant information to identify and inspect a session as soon as the session is initiated.
  3. An Amazon EventBridge rule filters the CloudTrail events and invokes an AWS Lambda function to inspect the CreateSession event.
  4. The Lambda function inspects the CreateSession event and checks for all defined boundary conditions. Currently, the boundaries configurable with this solution are limited to maximum number of workers, idle timeout in minutes, and deployment with connection enforced.
  5. If any of the defined boundary conditions are not met, for example too many workers are provisioned for the session, depending on the provided configuration, the function ends the interactive session immediately and sends an email via Amazon Simple Notification Service (Amazon SNS). If the session hasn’t started yet, the function will wait for it to start before taking any action.
  6. If the session was stopped, an email is sent to an SNS topic. There is no information available in the interactive session notebook on the reason for the ending of the session. Therefore, additional context information is provided through the SNS topic to the data engineers.
  7. If the function fails, the sessions are logged in a dead-letter queue inside Amazon Simple Queue Service (Amazon SQS). Furthermore, the queue is monitored and in case of a message, it will trigger an Amazon CloudWatch alarm.

The following steps walk you through how to build and deploy the solution. The code is available in the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of the deployed resources

All the necessary resources are defined in an AWS CloudFormation file located under cfn/template.yaml. To deploy those resources, we use AWS Serverless Application Model (AWS SAM), which enables us to conveniently build and package all the dependencies and also manages the AWS CloudFormation steps for us.

The CloudFormation stack deploys the following resources:

  • A Lambda function with its library, both defined under the directory src/functions. The function is the control. It will validate that the session is started within the limits defined.
  • An EventBridge rule. This event listens to CloudTrail and in case of a new interactive session, will trigger the control Lambda function.
  • An SQS dead-letter queue (DLQ) attached to the Lambda function. This keeps a record of events that triggered a Lambda function failure.
  • Two CloudWatch alarms monitoring the Lambda function failures and the messages in the DLQ.

If notification via email is enabled, two more resources are deployed:

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, and an AWS Key Management Service (AWS KMS) key to ensure that the exchanged data is encrypted.

Deploy the solution

To facilitate the deployment lifecycle, including the setup of the user local environment, we provide a Makefile that describes all the necessary steps. Make sure you have your AWS credentials renewed and have access to your account. For more information, refer to Configuration and credential file settings.

  • Explore the Makefile and adjust the Region and stack name as needed by modifying the values of the variables AWS_REGION and STACK_NAME.
  • Set KILL_SESSION = "True" if you want to immediately stop the interactive session that has been found out of boundaries. Allowed values are True or False; the default is True.
  • Set NOTIFICATION_EMAIL_ADDRESS = <[email protected]> in the Makefile if you want get notified when a session has been found out of boundaries.
  • Set values for your controls:
    • ENFORCE_VPC_CONNECTION to stop sessions not running inside a VPC (true or false).
    • MAX_WORKERS to set the maximum number of workers for a session (numeric).
    • MAX_IDLE_TIMEOUT_MINUTES to define the maximum idle time for sessions in minutes (numeric).
  • Install all the prerequisite libraries:
    make install-pre-requisites

    These will be installed under a newly created Python virtual environment inside this repository in the directory .venv.

  • Deploy the new stack:
    make deploy

    This command will complete the following tasks:

    • Check if the prerequisites are met.
    • Perform pytest unittest on the Python files.
    • Validate the CloudFormation template.
    • Build the artifacts (Lambda function and Lambda layers).
    • Deploy the resources via AWS SAM.

Test the solution

Refer to Introducing AWS Glue interactive sessions for Jupyter for information about running an interactive session. If you follow the instructions in the post (see the section Run your first code cell and author your AWS Glue notebook), the initialization of the interactive session should fail with an error similar to the following.

Example of code in the cell:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
sc3 = SparkContext.getOrCreate()
glueContext1 = GlueContext(sc3)
spark = glueContext1.spark_session
job = Job(glueContext1)

Received output:

Authenticating with profile=XXXXXXXX
glue_role_arn defined by user: arn:aws:iam::XXXXXXXXXX:role/XXXXXXXX
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: XXXXXXXXXXXXX
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session xxxxxxxxx to get into ready status...
Session xxxxxxxxx has been created
Exception encountered while running statement: An error occurred (EntityNotFoundException) when calling the GetStatement operation: Session ID xxxxxxxxx not found

If you enabled the email feature, you should also get an email notification.

You can also check on the AWS Glue console that your session ID isn’t listed.

Clean up

Clean up the deployed resources by running the following command:

make clean-up

Note that the resources deployed from following the recommended post, Introducing AWS Glue interactive sessions for Jupyter, will not be removed with the previous command.

Limitations

The delivery guarantee for CloudTrail events to EventBridge are best effort. This means CloudTrail will attempt to deliver all events to EventBridge, but in some rare cases, an event might not be delivered. For more information, refer to Events from AWS services.

Conclusion

This post described how to build, deploy, and test a solution to enforce boundary conditions on AWS Glue interactive sessions in order to enforce constraints on the number of workers, idle timeouts, and AWS Glue connection.

You can adapt this solution based on your needs and further extend it to allow controls on other options.

To learn more about how to use AWS Glue interactive sessions, refer to Introducing AWS Glue interactive sessions for Jupyter and Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions.


About the Authors

Nicolas Jacob Baer is a Senior Cloud Application Architect with a strong focus on data engineering and machine learning, based in Switzerland. He works closely with enterprise customers to design data platforms and build advanced analytics/ml use-cases.

Luca Mazzaferro is a Senior DevOps Architect at Amazon Web Services. He likes to have infrastructure automated, reproducible and secured. In his free time he likes to cook, especially pizza.

Kemeng Zhang is a Cloud Application Architect with a strong focus on machine learning and UX, based in Switzerland. She works closely with customers to design user experiences and build advanced analytics/ml use-cases.

Mark Walser, a Senior Global Data Architect at Amazon Web Services, collaborates with customers to develop innovative Big Data solutions that solve business problems and speed up the adoption of AWS services. Outside of work, he finds pleasure in running, swimming, and all things related to technology.

Gal blog picGal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering and BI, based in California. She is passionate about developing a deep understanding of customer’s business needs and collaborating with engineers to design easy to use data products.

Get started managing partitions for Amazon S3 tables backed by the AWS Glue Data Catalog

Post Syndicated from Anderson dos Santos original https://aws.amazon.com/blogs/big-data/get-started-managing-partitions-for-amazon-s3-tables-backed-by-the-aws-glue-data-catalog/

Large organizations processing huge volumes of data usually store it in Amazon Simple Storage Service (Amazon S3) and query the data to make data-driven business decisions using distributed analytics engines such as Amazon Athena. If you simply run queries without considering the optimal data layout on Amazon S3, it results in a high volume of data scanned, long-running queries, and increased cost.

Partitioning is a common technique to lay out your data optimally for distributed analytics engines. By partitioning your data, you can restrict the amount of data scanned by downstream analytics engines, thereby improving performance and reducing the cost for queries.

In this post, we cover the following topics related to Amazon S3 data partitioning:

  • Understanding table metadata in the AWS Glue Data Catalog and S3 partitions for better performance
  • How to create a table and load partitions in the Data Catalog using Athena
  • How partitions are stored in the table
  • Different ways to add partitions in a table on the Data Catalog
  • Partitioning data stored in Amazon S3 while ingestion and catalog

Understanding table metadata in the Data Catalog and S3 partitions for better performance

A table in the AWS Glue Data Catalog is the metadata definition that organizes the data location, data type, and column schema, which represents the data in a data store. Partitions are data organized hierarchically, defining the location where the data for a particular partition resides. Partitioning your data allows you to limit the amount of data scanned by S3 SELECT, thereby improving performance and reducing cost.

There are a few factors to consider when deciding the columns on which to partition. For example, if you’re using columns as filters, don’t use a column that is partitioning too finely, or don’t choose a column where your data is heavily skewed to one partition value. You can partition your data by any column. Partition columns are usually designed by a common query pattern in your use case. For example, a common practice is to partition the data based on year/month/day because many queries tend to run time series analyses in typical use cases. This often leads to a multi-level partitioning scheme. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns.

Let’s look at an example of how partitioning works.

Files corresponding to a single day’s worth of data are placed under a prefix such as s3://my_bucket/logs/year=2023/month=06/day=01/.

If your data is partitioned per day, every day you have a single file, such as the following:

  • s3://my_bucket/logs/year=2023/month=06/day=01/file1_example.json
  • s3://my_bucket/logs/year=2023/month=06/day=02/file2_example.json
  • s3://my_bucket/logs/year=2023/month=06/day=03/file3_example.json

We can use a WHERE clause to query the data as follows:

SELECT * FROM table WHERE year=2023 AND month=06 AND day=01

The preceding query reads only the data inside the partition folder year=2023/month=06/day=01 instead of scanning through the files under all partitions. Therefore, it only scans the file file1_example.json.

Systems such as Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value, eliminating unnecessary (partition) requests to Amazon S3. This capability can improve the performance of applications that specifically need to read a limited number of partitions. For more information about partitioning with Athena and Redshift Spectrum, refer to Partitioning data in Athena and Creating external tables for Redshift Spectrum, respectively.

How to create a table and load partitions in the Data Catalog using Athena

Let’s begin by understanding how to create a table and load partitions using DDL (Data Definition Language) queries in Athena. Note that to demonstrate the various methods of loading partitions into the table, we need to delete and recreate the table multiple times throughout the following steps.

First, we create a database for this demo.

  1. On the Athena console, choose Query editor.

If this is your first time using the Athena query editor, you need to configure and specify an S3 bucket to store the query results.

  1. Create a database with the following command:
CREATE DATABASE partitions_blog;

  1. In the Data pane, for Database, choose the database partitions_blog.
  2. Create the table impressions following the example in Hive JSON SerDe. Replace <myregion> in s3://<myregion>.elasticmapreduce/samples/hive-ads/tables/impressions with the Region identifier where you run Athena (for example, s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions).
  3. Run the following query to create the table:
CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string,
    number string,
    processid string,
    browsercookie string,
    requestendtime string,
    timers struct
                <
                 modellookup:string, 
                 requesttime:string
                >,
    threadid string, 
    hostname string,
    sessionid string
)   
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions';

The following screenshot shows the query in the query editor.

  1. Run the following query to review the data:
SELECT * FROM impressions;

You can’t see any results because the partitions aren’t loaded yet.

If the partition isn’t loaded into a partitioned table, when the application downloads the partition metadata, the application will not be aware of the S3 path that needs to be queried. For more information, refer to Why do I get zero records when I query my Amazon Athena table.

  1. Load the partitions using the command MSCK REPAIR TABLE.

The MSCK REPAIR TABLE command was designed to manually add partitions that are added to or removed from the file system, such as HDFS or Amazon S3, but are not present in the metastore.

  1. Query the table again to see the results.

After the MSCK REPAIR TABLE command scans Amazon S3 and adds partitions to AWS Glue for Hive-compatible partitions, the records under the registered partitions are now returned.

How partitions are stored in the table metadata

We can list the table partitions in Athena by running the SHOW PARTITIONS command, as shown in the following screenshot.

We also can see the partition metadata on the AWS Glue console. Complete the following steps:

  1. On the AWS Glue console, choose Tables in the navigation pane under Data Catalog.
  2. Choose the impressions table in the partitions_blog database.
  3. On the Partitions tab, choose View Properties next to a partition to view its details.

The following screenshot shows an example of the partition properties.

We can also get the partitions using the AWS Command Line Interface (AWS CLI) command get-partitions, as shown in the following screenshot.

From the get-partitions, the element “Values” defines the partition value and “Location” defines the S3 path to be queried by the application:

"Values": [
                "2009-04-12-19-05"
            ]

When querying the data from the partition dt="2009-04-12-19-05", the application lists and reads only the files in the S3 path s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt="2009-04-12-19-05".

Different ways to add partitions in a table on the Data Catalog

There are multiple ways to load partitions into the table. You can create tables and partitions directly using the AWS Glue API, SDKs, AWS CLI, DDL queries on Athena, using AWS Glue crawlers, or using AWS Glue ETL jobs.

For the next examples, we need to drop and recreate the table. Run the following command in the Athena query editor:

DROP table impressions;

After that, recreate the table:

CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string,
    number string,
    processid string,
    browsercookie string,
    requestendtime string,
    timers struct
                <
                 modellookup:string, 
                 requesttime:string
                >,
    threadid string, 
    hostname string,
    sessionid string
)   
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions';

Creating partitions individually

If the data arrives in an S3 bucket at a scheduled time, for example every hour or once a day, you can individually add partitions. One way of doing so is by running an ALTER TABLE ADD PARTITION DDL query on Athena.

We use Athena for this query as an example. You can do the same from Hive on Amazon EMR, Spark on Amazon EMR, AWS Glue for Apache Spark jobs, and more.

To load partitions using Athena, we need to use the ALTER TABLE ADD PARTITION command, which can create one or more partitions in the table. ALTER TABLE ADD PARTITION supports partitions created on Amazon S3 with camel case (s3://bucket/table/dayOfTheYear=20), Hive format (s3://bucket/table/dayoftheyear=20), and non-Hive style partitioning schemes used by AWS CloudTrail logs, which use separate path components for date parts, such as s3://bucket/data/2021/01/26/us/6fc7845e.json.

To load partitions into a table, you can run the following query in the Athena query editor:

ALTER TABLE impressions 
  ADD PARTITION (dt = '2009-04-12-19-05');


Refer to ALTER TABLE ADD PARTITION for more information.

Another option is using AWS Glue APIs. AWS Glue provides two APIs to load partitions into table create_partition() and batch_create_partition(). For the API parameters, refer to CreatePartition.

The following example uses the AWS CLI:

aws glue create-partition \
    --database-name partitions_blog \
    --table-name impressions \
    --partition-input '{
                            "Values":["2009-04-14-13-00"],
                            "StorageDescriptor":{
                                "Location":"s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-14-13-00",
                                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                                "SerdeInfo": {
                                    "SerializationLibrary": "org.apache.hive.hcatalog.data.JsonSerDe"
                                }
                            }
                        }'

Both commands (ALTER TABLE in Athena and the AWS Glue API create-partition) will create partition enhancing from the table definition.

Load multiple partitions using MSCK REPAIR TABLE

You can load multiple partitions in Athena. MSCK REPAIR TABLE is a DDL statement that scans the entire S3 path defined in the table’s Location property. Athena lists the S3 path searching for Hive-compatible partitions, then loads the existing partitions into the AWS Glue table’s metadata. A table needs to be created in the Data Catalog, and the data source must be from Amazon S3 before it can run. You can create a table with AWS Glue APIs or by running a CREATE TABLE statement in Athena. After the table creation, run MSCK REPAIR TABLE to load the partitions.

The parameter DDL query timeout in the service quotas defines how long a DDL statement can run. The runtime increases accordingly to the number of folders or partitions in the S3 path.

The MSCK REPAIR TABLE command is best used when creating a table for the first time or when there is uncertainty about parity between data and partition metadata. It supports folders created in lowercase and using Hive-style partitions format (for example, year=2023/month=6/day=01). Because MSCK REPAIR TABLE scans both the folder and its subfolders to find a matching partition scheme, you should keep data for separate tables in separate folder hierarchies.

Every MSCK REPAIR TABLE command lists the entire folder specified in the table location. If you add new partitions frequently (for example, every 5 minutes or every hour), consider scheduling an ALTER TABLE ADD PARTITION statement to load only the partitions defined in the statement instead of scanning the entire S3 path.

The partitions created in the Data Catalog by MSCK REPAIR TABLE enhance the schema from the table definition. Note that Athena doesn’t charge for DDL statements, making MSCK REPAIR TABLE a more straightforward and affordable way to load partitions.

Add multiple partitions using an AWS Glue crawler

An AWS Glue crawler offers more features when loading partitions into the table. A crawler automatically identifies partitions in Amazon S3, extracts metadata, and creates table definitions in the Data Catalog. Crawlers can crawl the following file-based and table-based data stores.

Crawlers can help automate table creation and loading partitions into tables. They are charged per hour, and bill per second. You can optimize the crawler’s performance by altering parameters like the sample size or by specifying it to crawl new folders only.

If the schema of the data changes, the crawler will update the table and partition schemas accordingly. The crawler configuration options have parameters such as update the table definition in the Data Catalog, add new columns only, and ignore the change and don’t update the table in the Data Catalog, which tell the crawler how to update the table when needed and evolve the table schema.

Crawlers can create and update multiple tables from the same data source. When an AWS Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure and which directories are partitions for the table.

To create an AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane under Data Catalog.
  2. Choose Create crawler.
  3. Provide a name and optional description, then choose Next.
  4. Under Data source configuration, select Not yet and choose Add a data source.
  5. For Data source, choose S3.
  6. For S3 path, enter the path of the impression data (s3://us-east-1.elasticmapreduce/samples/hive-ads/tables/impressions).
  7. Select a preference for subsequent crawler runs.
  8. Choose Add an S3 data source.
  9. Select your data source and choose Next.
  10. Under IAM role, either choose an existing AWS Identity and Access Management (IAM) role or choose Create new IAM role.
  11. Choose Next.
  12. For Target database, choose partitions_blog.
  13. For Table name prefix, enter crawler_.

We use the table prefix to add a custom prefix in front of the table name. For example, if you leave the prefix field empty and start the crawler on s3://my-bucket/some-table-backup, it creates a table with the name some-table-backup. If you add crawler_ as a prefix, it a creates table called crawler_some-table-backup.

  1. Choose your crawler schedule, then choose Next.
  2. Review your settings and create the crawler.
  3. Select your crawler and choose Run.

Wait for the crawler to finish running.

You can go to Athena and check the table was created:

SHOW PARTITIONS crawler_impressions;

Partitioning data stored in Amazon S3 while ingestion and cataloging

The previous examples work with data that already exists in Amazon S3. If you’re using AWS Glue jobs to write data on Amazon S3, you have the option to create partitions with DynamicFrames by enabling the “enableUpdateCatalog=True” parameter. Refer to Creating tables, updating the schema, and adding new partitions in the Data Catalog from AWS Glue ETL jobs for more information.

DynamicFrame supports native partitioning using a sequence of keys, using the partitionKeys option when you create a sink. For example, the following Python code writes out a dataset to Amazon S3 in Parquet format into directories partitioned by the ‘year’ field. After ingesting the data and registering partitions from the AWS Glue job, you can utilize these partitions from queries running on other analytics engines such as Athena.

## Create partitioned table in Glue Data Catalog using DynamicFrame

#Read Dataset
datasource0 = glueContext.create_dynamic_frame.from_catalog(
      database = "default", 
      table_name = "flight_delays_pq", 
      transformation_ctx = "datasource0")

#Create Sink
sink = glueContext.getSink(
    connection_type="s3", 
    path="s3://BUCKET/glueetl/",
    enableUpdateCatalog=True,
    partitionKeys=[ "year"])
    
sink.setFormat("parquet", useGlueParquetWriter=True)

sink.setCatalogInfo(catalogDatabase="default", catalogTableName="test_table")

#Write data, create table and add partitions
sink.writeFrame(datasource0)
job.commit()

Conclusion

This post showed multiple methods for partitioning your Amazon S3 data, which helps reduce costs by avoiding unnecessary data scanning and also improves the overall performance of your processes. We further described how AWS Glue makes effective metadata management for partitions possible, allowing you to optimize your storage and query operations in AWS Glue and Athena. These partitioning methods can help optimize scanning high volumes of data or long-running queries, as well as reduce the cost of scanning.

We hope you try out these options!


About the authors

Anderson Santos is a Senior Solutions Architect at Amazon Web Services. He works with AWS Enterprise customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS.

Arun Pradeep Selvaraj is a Senior Solutions Architect and is part of Analytics TFC at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build and reinvent. He is creative, fast-paced, deeply customer-obsessed and leverages the working backwards process to build modern architectures to help customers solve their unique challenges.

Patrick Muller is a Senior Solutions Architect and a valued member of the Datalab. With over 20 years of expertise in analytics, data warehousing, and distributed systems, he brings extensive knowledge to the table. Patrick’s passion lies in evaluating new technologies and assisting customers with innovative solutions. During his free time, he enjoys watching soccer.

Amazon OpenSearch Service’s vector database capabilities explained

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/

OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. It comprises a search engine, OpenSearch, which delivers low-latency search and aggregations, OpenSearch Dashboards, a visualization and dashboarding tool, and a suite of plugins that provide advanced capabilities like alerting, fine-grained access control, observability, security monitoring, and vector storage and processing. Amazon OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud.

As an end-user, when you use OpenSearch’s search capabilities, you generally have a goal in mind—something you want to accomplish. Along the way, you use OpenSearch to gather information in support of achieving that goal (or maybe the information is the original goal). We’ve all become used to the “search box” interface, where you type some words, and the search engine brings back results based on word-to-word matching. Let’s say you want to buy a couch in order to spend cozy evenings with your family around the fire. You go to Amazon.com, and you type “a cozy place to sit by the fire.” Unfortunately, if you run that search on Amazon.com, you get items like fire pits, heating fans, and home decorations—not what you intended. The problem is that couch manufacturers probably didn’t use the words “cozy,” “place,” “sit,” and “fire” in their product titles or descriptions.

In recent years, machine learning (ML) techniques have become increasingly popular to enhance search. Among them are the use of embedding models, a type of model that can encode a large body of data into an n-dimensional space where each entity is encoded into a vector, a data point in that space, and organized such that similar entities are closer together. An embedding model, for instance, could encode the semantics of a corpus. By searching for the vectors nearest to an encoded document — k-nearest neighbor (k-NN) search — you can find the most semantically similar documents. Sophisticated embedding models can support multiple modalities, for instance, encoding the image and text of a product catalog and enabling similarity matching on both modalities.

A vector database provides efficient vector similarity search by providing specialized indexes like k-NN indexes. It also provides other database functionality like managing vector data alongside other data types, workload management, access control and more. OpenSearch’s k-NN plugin provides core vector database functionality for OpenSearch, so when your customer searches for “a cozy place to sit by the fire” in your catalog, you can encode that prompt and use OpenSearch to perform a nearest neighbor query to surface that 8-foot, blue couch with designer arranged photographs in front of fireplaces.

Using OpenSearch Service as a vector database

With OpenSearch Service’s vector database capabilities, you can implement semantic search, Retrieval Augmented Generation (RAG) with LLMs, recommendation engines, and search rich media.

Semantic search

With semantic search, you improve the relevance of retrieved results using language-based embeddings on search documents. You enable your search customers to use natural language queries, like “a cozy place to sit by the fire” to find their 8-foot-long blue couch. For more information, refer to Building a semantic search engine in OpenSearch to learn how semantic search can deliver a 15% relevance improvement, as measured by normalized discounted cumulative gain (nDCG) metrics compared with keyword search. For a concrete example, our Improve search relevance with ML in Amazon OpenSearch Service workshop explores the difference between keyword and semantic search, based on a Bidirectional Encoder Representations from Transformers (BERT) model, hosted by Amazon SageMaker to generate vectors and store them in OpenSearch. The workshop uses product question answers as an example to show how keyword search using the keywords/phrases of the query leads to some irrelevant results. Semantic search is able to retrieve more relevant documents by matching the context and semantics of the query. The following diagram shows an example architecture for a semantic search application with OpenSearch Service as the vector database.

Architecture diagram showing how to use Amazon OpenSearch Service to perform semantic search to improve relevance

Retrieval Augmented Generation with LLMs

RAG is a method for building trustworthy generative AI chatbots using generative LLMs like OpenAI, ChatGPT, or Amazon Titan Text. With the rise of generative LLMs, application developers are looking for ways to take advantage of this innovative technology. One popular use case involves delivering conversational experiences through intelligent agents. Perhaps you’re a software provider with knowledge bases for product information, customer self-service, or industry domain knowledge like tax reporting rules or medical information about diseases and treatments. A conversational search experience provides an intuitive interface for users to sift through information through dialog and Q&A. Generative LLMs on their own are prone to hallucinations—a situation where the model generates a believable but factually incorrect response. RAG solves this problem by complementing generative LLMs with an external knowledge base that is typically built using a vector database hydrated with vector-encoded knowledge articles.

As illustrated in the following diagram, the query workflow starts with a question that is encoded and used to retrieve relevant knowledge articles from the vector database. Those results are sent to the generative LLM whose job is to augment those results, typically by summarizing the results as a conversational response. By complementing the generative model with a knowledge base, RAG grounds the model on facts to minimize hallucinations. You can learn more about building a RAG solution in the Retrieval Augmented Generation module of our semantic search workshop.

Architecture diagram showing how to use Amazon OpenSearch Service to perform retrieval-augmented generation

Recommendation engine

Recommendations are a common component in the search experience, especially for ecommerce applications. Adding a user experience feature like “more like this” or “customers who bought this also bought that” can drive additional revenue through getting customers what they want. Search architects employ many techniques and technologies to build recommendations, including Deep Neural Network (DNN) based recommendation algorithms such as the two-tower neural net model, YoutubeDNN. A trained embedding model encodes products, for example, into an embedding space where products that are frequently bought together are considered more similar, and therefore are represented as data points that are closer together in the embedding space. Another possibility
is that product embeddings are based on co-rating similarity instead of purchase activity. You can employ this affinity data through calculating the vector similarity between a particular user’s embedding and vectors in the database to return recommended items. The following diagram shows an example architecture of building a recommendation engine with OpenSearch as a vector store.

Architecture diagram showing how to use Amazon OpenSearch Service as a recommendation engine

Media search

Media search enables users to query the search engine with rich media like images, audio, and video. Its implementation is similar to semantic search—you create vector embeddings for your search documents and then query OpenSearch Service with a vector. The difference is you use a computer vision deep neural network (e.g. Convolutional Neural Network (CNN)) such as ResNet to convert images into vectors. The following diagram shows an example architecture of building an image search with OpenSearch as the vector store.

Architecture diagram showing how to use Amazon OpenSearch Service to search rich media like images, videos, and audio files

Understanding the technology

OpenSearch uses approximate nearest neighbor (ANN) algorithms from the NMSLIB, FAISS, and Lucene libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides, this method offers the best search scalability for large datasets. The engine details are as follows:

  • Non-Metric Space Library (NMSLIB) – NMSLIB implements the HNSW ANN algorithm
  • Facebook AI Similarity Search (FAISS) – FAISS implements both HNSW and IVF ANN algorithms
  • Lucene – Lucene implements the HNSW algorithm

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. You can follow the general information in this section to help determine which engine will best meet your requirements.

In general, NMSLIB and FAISS should be selected for large-scale use cases. Lucene is a good option for smaller deployments, but offers benefits like smart filtering where the optimal filtering strategy—pre-filtering, post-filtering, or exact k-NN—is automatically applied depending on the situation. The following table summarizes the differences between each option.

.

NMSLIB-HNSW

FAISS-HNSW

FAISS-IVF

Lucene-HNSW

Max Dimension

16,000

16,000

16,000

1024

Filter

Post filter

Post filter

Post filter

Filter while search

Training Required

No

No

Yes

No

Similarity Metrics

l2, innerproduct, cosinesimil, l1, linf

l2, innerproduct

l2, innerproduct

l2, cosinesimil

Vector Volume

Tens of billions

Tens of billions

Tens of billions

< Ten million

Indexing latency

Low

Low

Lowest

Low

Query Latency & Quality

Low latency & high quality

Low latency & high quality

Low latency & low quality

High latency & high quality

Vector Compression

Flat

Flat

Product Quantization

Flat

Product Quantization

Flat

Memory Consumption

High

High

Low with PQ

Medium

Low with PQ

High

Approximate and exact nearest-neighbor search

The OpenSearch Service k-NN plugin supports three different methods for obtaining the k-nearest neighbors from an index of vectors: approximate k-NN, score script (exact k-NN), and painless extensions (exact k-NN).

Approximate k-NN

The first method takes an approximate nearest neighbor approach—it uses one of several algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints, and more scalable search. Approximate k-NN is the best choice for searches over large indexes (that is, hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the score script method or painless extensions.

Score script

The second method extends the OpenSearch Service score script functionality to run a brute force, exact k-NN search over knn_vector fields or fields that can represent binary objects. With this approach, you can run k-NN search on a subset of vectors in your index (sometimes referred to as a pre-filter search). This approach is preferred for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indexes may lead to high latencies.

Painless extensions

The third method adds the distance functions as painless extensions that you can use in more complex combinations. Similar to the k-NN score script, you can use this method to perform a brute force, exact k-NN search across an index, which also supports pre-filtering. This approach has slightly slower query performance compared to the k-NN score script. If your use case requires more customization over the final score, you should use this approach over score script k-NN.

Vector search algorithms

The simple way to find similar vectors is to use k-nearest neighbors (k-NN) algorithms, which compute the distance between a query vector and the other vectors in the vector database. As we mentioned earlier, the score script k-NN and painless extensions search methods use the exact k-NN algorithms under the hood. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. Approximate nearest neighbor (ANN) search methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. There are different ANN search algorithms; for example, locality sensitive hashing, tree-based, cluster-based, and graph-based. OpenSearch implements two ANN algorithms: Hierarchical Navigable Small Worlds (HNSW) and Inverted File System (IVF). For a more detailed explanation of how the HNSW and IVF algorithms work in OpenSearch, see blog post “Choose the k-NN algorithm for your billion-scale use case with OpenSearch”.

Hierarchical Navigable Small Worlds

The HNSW algorithm is one of the most popular algorithms out there for ANN search. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

Inverted File

The IVF algorithm separates your index vectors into a set of buckets, then, to reduce your search time, only searches through a subset of these buckets. However, if the algorithm just randomly split up your vectors into different buckets, and only searched a subset of them, it would yield a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.

Vector similarity metrics

All search engines use a similarity metric to rank and sort results and bring the most relevant results to the top. When you use a plain text query, the similarity metric is called TF-IDF, which measures the importance of the terms in the query and generates a score based on the number of textual matches. When your query includes a vector, the similarity metrics are spatial in nature, taking advantage of proximity in the vector space. OpenSearch supports several similarity or distance measures:

  • Euclidean distance – The straight-line distance between points.
  • L1 (Manhattan) distance – The sum of the differences of all of the vector components. L1 distance measures how many orthogonal city blocks you need to traverse from point A to point B.
  • L-infinity (chessboard) distance – The number of moves a King would make on an n-dimensional chessboard. It’s different than Euclidean distance on the diagonals—a diagonal step on a 2-dimensional chessboard is 1.41 Euclidean units away, but 2 L-infinity units away.
  • Inner product – The product of the magnitudes of two vectors and the cosine of the angle between them. Usually used for natural language processing (NLP) vector similarity.
  • Cosine similarity – The cosine of the angle between two vectors in a vector space.
  • Hamming distance – For binary-coded vectors, the number of bits that differ between the two vectors.

Advantage of OpenSearch as a vector database

When you use OpenSearch Service as a vector database, you can take advantage of the service’s features like usability, scalability, availability, interoperability, and security. More importantly, you can use OpenSearch’s search features to enhance the search experience. For example, you can use Learning to Rank in OpenSearch to integrate user clickthrough behavior data into your search application and improve search relevance. You can also combine OpenSearch text search and vector search capabilities to search documents with keyword and semantic similarity. You can also use other fields in the index to filter documents to improve relevance. For advanced users, you can use a hybrid scoring model to combine OpenSearch’s text-based relevance score, computed with the Okapi BM25 function and its vector search score to improve the ranking of your search results.

Scale and limits

OpenSearch as vector database support billions of vector records. Keep in mind the following calculator regarding number of vectors and dimensions to size your cluster.

Number of vectors

OpenSearch VectorDB takes advantage of the sharding capabilities of OpenSearch and can scale to billions of vectors at single-digit millisecond latencies by sharding vectors and scale horizontally by adding more nodes. The number of vectors that can fit in a single machine is a function of the off-heap memory availability on the machine. The number of nodes required will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. The more nodes, the more memory and better performance. The amount of memory available per node is computed as memory_available = (node_memoryjvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. This is set to half of the instance’s RAM, capped at approximately 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. This is set to 0.5.

Total cluster memory estimation depends on total number of vector records and algorithms. HNSW and IVF have different memory requirements. You can refer to Memory Estimation for more details.

Number of dimensions

OpenSearch’s current dimension limit for the vector field knn_vector is 16,000 dimensions. Each dimension is represented as a 32-bit float. The more dimensions, the more memory you’ll need to index and search. The number of dimensions is usually determined by the embedding models that translate the entity to a vector. There are a lot of options to choose from when building your knn_vector field. To determine the correct methods and parameters to choose, refer to Choosing the right method.

Customer stories:

Amazon Music

Amazon Music is always innovating to provide customers with unique and personalized experiences. One of Amazon Music’s approaches to music recommendations is a remix of a classic Amazon innovation, item-to-item collaborative filtering, and vector databases. Using data aggregated based on user listening behavior, Amazon Music has created an embedding model that encodes music tracks and customer representations into a vector space where neighboring vectors represent tracks that are similar. 100 million songs are encoded into vectors, indexed into OpenSearch, and served across multiple geographies to power real-time recommendations. OpenSearch currently manages 1.05 billion vectors and supports a peak load of 7,100 vector queries per second to power Amazon Music recommendations.

The item-to-item collaborative filter continues to be among the most popular methods for online product recommendations because of its effectiveness at scaling to large customer bases and product catalogs. OpenSearch makes it easier to operationalize and further the scalability of the recommender by providing scale-out infrastructure and k-NN indexes that grow linearly with respect to the number of tracks and similarity search in logarithmic time.

The following figure visualizes the high-dimensional space created by the vector embedding.

A visualization of the vector encoding of Amazon Music entries in the large vector space

Brand protection at Amazon

Amazon strives to deliver the world’s most trustworthy shopping experience, offering customers the widest possible selection of authentic products. To earn and maintain our customers’ trust, we strictly prohibit the sale of counterfeit products, and we continue to invest in innovations that ensure only authentic products reach our customers. Amazon’s brand protection programs build trust with brands by accurately representing and completely protecting their brand. We strive to ensure that public perception mirrors the trustworthy experience we deliver. Our brand protection strategy focuses on four pillars: (1) Proactive Controls (2) Powerful Tools to Protect Brands (3) Holding Bad Actors Accountable (4) Protecting and Educating Customers. Amazon OpenSearch Service is a key part of Amazon’s Proactive Controls.

In 2022, Amazon’s automated technology scanned more than 8 billion attempted changes daily to product detail pages for signs of potential abuse. Our proactive controls found more than 99% of blocked or removed listings before a brand ever had to find and report it. These listings were suspected of being fraudulent, infringing, counterfeit, or at risk of other forms of abuse. To perform these scans, Amazon created tooling that uses advanced and innovative techniques, including the use of advanced machine learning models to automate the detection of intellectual property infringements in listings across Amazon’s stores globally. A key technical challenge in implementing such automated system is the ability to search for protected intellectual property within a vast billion-vector corpus in a fast, scalable and cost effective manner. Leveraging Amazon OpenSearch Service’s scalable vector database capabilities and distributed architecture, we successfully developed an ingestion pipeline that has indexed a total of 68 billion, 128- and 1024-dimension vectors into OpenSearch Service to enable brands and automated systems to conduct infringement detection, in real-time, through a highly available and fast (sub-second) search API.

Conclusion

Whether you’re building a generative AI solution, searching rich media and audio, or bringing more semantic search to your existing search-based application, OpenSearch is a capable vector database. OpenSearch supports a variety of engines, algorithms, and distance measures that you can employ to build the right solution. OpenSearch provides a scalable engine that can support vector search at low latency and up to billions of vectors. With OpenSearch and its vector DB capabilities, your users can find that 8-foot-blue couch easily, and relax by a cozy fire.


About the Authors

Jon Handler is a Senior Principal Solutions Architect with AWSJon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph. D. in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consultant service for customers to help customer design and build modern data platform. Jianwei has been working in big data domain as software developer, consultant and tech leader.

Dylan Tong is a Senior Product Manager at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

Vamshi Vijay Nakkirtha is a Software Engineering Manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various plugins, like k-NN, GeoSpatial, and dashboard-maps.

Use AWS Private Certificate Authority to issue device attestation certificates for Matter

Post Syndicated from Ramesh Nagappan original https://aws.amazon.com/blogs/security/use-aws-private-certificate-authority-to-issue-device-attestation-certificates-for-matter/

In this blog post, we show you how to use AWS Private Certificate Authority (CA) to create Matter device attestation CAs to issue device attestation certificates (DAC). By using this solution, device makers can operate their own device attestation CAs, building on the solid security foundation provided by AWS Private CA. This post assumes that you are familiar with the public key infrastructure (PKI) — if needed, you can review the topic What is public key infrastructure?.

Overview

Matter, governed by the Connectivity Standard Alliance (CSA), is a new open standard for seamless and secure cross-vendor connectivity for smart home devices. It provides a common language for devices manufactured by different vendors to communicate on a smart home network by using the Wi-Fi and Thread wireless protocols. The Matter 1.0 specification, released in October 2022, supports smart home bridges and controllers, door locks, HVAC controls, lighting and electrical devices, media devices, safety and security sensors, and window coverings and shades. Future versions of the Matter specification will support additional smart home devices such as cameras, home appliances, robot vacuums, and other market segments such as electric vehicle (EV) charging and energy management.

To help achieve security and interoperability, Matter requires device certification and authenticity checks before devices can join a smart home network (known as the Matter fabric). The authenticity of a Matter device is established by using an X.509 certificate called a DAC, which is provisioned by device makers. When a customer purchases a Matter-compliant smart home device and uses a smart home hub to add that device to their Matter fabric, the hub validates the DAC before allowing the device to operate on the smart home’s Matter fabric.

Matter requires DACs to be issued by a device attestation CA that is compliant with the Matter PKI certificate policy (CP). Device vendors can use AWS Private CA to host their device attestation CAs. AWS Private CA is a highly available service that helps organizations secure their applications and devices by using private certificates. In Q1 2023, AWS Private CA launched support for Matter certificates, published the Matter PKI Compliance Customer Guide, and released open source samples to help customers create and operate Matter-compliant device attestation CAs.

In this post, we demonstrate how to build a Matter device attestation CA hierarchy by using AWS Private CA. The examples will use AWS Cloud Development Kit (AWS CDK) and AWS CloudFormation to create and provision the CAs and to issue Matter-compliant DACs. We also show how you can configure other AWS services like Amazon CloudWatch, AWS CloudTrail, and Amazon Simple Storage Service (Amazon S3) to help perform the event logging, record keeping, and audit log retention that is required in the Matter standard.

The Matter CA hierarchy for DACs

When a smart home device vendor makes the decision to adopt the Matter standard, they first apply to the CSA to get a vendor ID for their organization, and one or more product IDs for the products that will be Matter-certified. Using the vendor ID and product ID, the vendor can then set up their Matter CA hierarchy to issue DACs that will be provisioned on to their Matter-certified devices. Matter prescribes a two-level CA hierarchy for issuing DACs. The product attestation authority (PAA) is at the top of the hierarchy and forms the root of trust for the DACs that chain up to it. The product attestation intermediate (PAI) is at the second level of the CA hierarchy and is the CA that issues DACs. Figure 1 illustrates a sample Matter CA hierarchy for a vendor that manufactures devices with two different product IDs.

Figure 1: Matter CA three-tier hierarchy and the role of PAA, PAIs, and DACs

Figure 1: Matter CA three-tier hierarchy and the role of PAA, PAIs, and DACs

PAA certificates must be approved by the CSA before they are made available to the Matter community through the Distributed Compliance Ledger (DCL). The DCL is a shared database based on blockchain technology that holds the data elements that are necessary to attest the validity of a Matter device. CSA provides access to a public DCL node that it maintains, and private nodes can be created for dedicated access. To indicate whether a particular device is Matter compliant, a certification declaration (CD) is used. A CD is a CSA-created cryptographic document encoded in the Cryptographic Message Syntax (CMS) format described in RFC 5652. The data elements in the DCL include the list of PAAs (the roots of trust) and the vendor IDs, product IDs, and CD for each Matter certified product. The DCL also contains a description of each Matter certified product, which includes elements such as the device name, vendor, firmware version, and localized strings to support internationalization.

After the Matter CA hierarchy is set up, the CA operator must complete a certification practice statement (CPS) that describes how the CA operator will comply with the physical, operational, and logical controls specified in the Matter PKI CP. These controls address a large list of scenarios. These include the creation of the CA keys, storage of these key pairs, physical access controls for the HSMs, and separation of roles that perform administrative and operational tasks. The CA operators must submit their CPS, along with the PAA certificate, to the CSA for final approval. After the CPS is approved by the CSA, the PAA certificate is added to the list of approved Matter root certificates. This will now allow a Matter-compliant smart home network hub to verify the validity of the DAC that is provisioned on the Matter-certified device.

Build Matter CA hierarchies by using AWS Private CA

AWS Private CA has open sourced samples on GitHub that help you to create Matter-ready CA hierarchies and can issue DACs for your devices. These samples use the AWS CDK and CloudFormation templates to set up and configure services like Amazon CloudWatch, AWS CloudTrail and Amazon S3, to help you meet the Matter PKI compliance requirements. In the following sections, we walk you through an introduction to these samples, and discuss how you can use them to build your Matter CA hierarchies in AWS Private CA.

Solution architecture and building blocks

Figure 2 shows the Matter device attestation PKI architecture that is created when you run the GitHub samples. The architecture works as follows:

  1. A PAA is created by a PKI administrator, using the generatePaa key.
  2. You create PAIs by using the generatePaiCnt key in a single AWS Region, or you can choose to house the PAIs in different Regions.
  3. AWS CloudTrail captures the AWS API actions that are performed.
  4. Amazon CloudWatch receives a stream of filtered events specific to AWS Private CA and stores it in a dedicated MatterAudit log group.
  5. The logs are stored in Amazon S3 buckets.
  6. Logs can only be read by using the auditor-specific AWS Identity and Access Management (IAM) role that is configured for read-only access.
  7. The sample script configures the S3 buckets to automatically move log data to Amazon S3 Glacier storage by using S3 Lifecycle rules (Figure 2 shows an example of monthly backup plan for S3), in order to meet the long-term log retention requirements of the Matter standard.
  8. To issue DACs by using the PAIs created by the samples, you upload your certificate signing requests (CSRs) to the input/output S3 bucket.
  9. New CSRs are picked up by an AWS Lambda function through the use of Amazon Simple Queue Service (Amazon SQS).
  10. The DAC Signing Lambda function uses a PAI to generate a certificate.
  11. The Lambda function writes the generated certificates (which are the DACs) in the input/output S3 bucket in the Privacy Enhanced Mail (PEM) format.
Figure 2: Matter device attestation PKI architecture topology that uses AWS Private CA

Figure 2: Matter device attestation PKI architecture topology that uses AWS Private CA

Prerequisites

For a typical deployment, make sure you have a working host with the following software dependencies installed:

Using the following Git command, download sample scripts from the Matter PKI CDK samples on GitHub.

$ git clone https://github.com/aws-samples/aws-private-ca-matter-infrastructure.git
$ cd aws-private-ca-matter-infrastructure

Finally, set up the AWS CDK Toolkit with your AWS account’s credentials (see the AWS Cloud Development Kit documentation for details).

Deploy the PAA

The first step is to create and deploy a Matter PAA by running the following command:

$ cdk deploy ‐‐context generatePaa=1 ‐‐parameters vendorId=FFF1 ‐‐parameters validityInDays=3600

After the command completes, you should see the following output:

MatterStackPAA.CloudTrailArn = arn:aws:cloudtrail:us-east-1:111111111111:trail/MatterStackPAA-MatterAuditTrail586D7D12-akSFjLJyEumz

MatterStackPAA.LogGroupName = MatterAudit

MatterStackPAA.PAA = VID=FFF1 CN=PAA arn:aws:acm-pca:us-east-1:111111111111:certificate-authority/99999999-9999-9999-9999-999999999999

MatterStackPAA.PAACertArn = arn:aws:acm-pca:us-east-1:111111111111:certificate-authority/99999999-9999-9999-9999-999999999999/certificate/99999999999999999999999999999999

MatterStackPAA.PAACertLink = https://console.aws.amazon.com/acm-pca/home?region=us-east-1#/details?arn=arn:aws:acm-pca:us-east-1:111111111111:certificate-authority/99999999-9999-9999-9999-999999999999&tab=certificate

Stack ARN:
    arn:aws:cloudformation:us-east-1:111111111111:stack/MatterStackPAA/11111111-1111-1111-1111-111111111111

Note the following fields in the output:

  • CloudTrailArn — The Amazon Resource Name (ARN) of the CloudTrail trail that is collecting logs of performed operations.
  • LogGroupName — The ARN of the CloudWatch log group where you can find logs of Matter-relevant operations.
  • PAA — The vendor ID, common name, and ARN of the AWS Private CA root certificate authority created as the PAA.
  • PAACertArn — The ARN of the certificate of the PAA in AWS Private CA.
  • PAACertLink — The link to the PAA certificate in the AWS Private CA console.

Deploy one or more PAIs

To deploy two PAIs that chain up to the PAA that you created in the previous step, run the following command:

$ PAA_ARN=$(awk '/MatterStackPAA.PAA = /{ print substr($0, match($0, "arn")); }' DeployPAA.log)
$ cdk deploy --context generatePaiCnt=2 ‐‐parameters vendorId=FFF1 ‐‐parameters productIds=0100,0102 ‐‐parameters validityInDays=365 ‐‐parameters paaArn=$PAA_ARN

If you want to create more or fewer PAIs, you can modify the generatePaiCnt and productIds parameters to reflect the correct number of PAIs that you want.

After the command completes, you should see the following output:

MatterStackPAI.CertArnPAI0 = arn:aws:acm-pca:us-east-1:111111111111:certificate-authority/99999999-9999-9999-9999-999999999999/certificate/11111111111111111111111111111111

MatterStackPAI.CertArnPAI1 = arn:aws:acm-pca:us-east-1:111111111111:certificate-authority/99999999-9999-9999-9999-999999999999/certificate/22222222222222222222222222222222

MatterStackPAI.CertLinkPAI0 = https://console.aws.amazon.com/acm-pca/home?region=us-west-1#/details?arn=arn:aws:acm-pca:us-west-1:111111111111:certificate-authority/11111111-1111-1111-1111-111111111111&tab=certificate

MatterStackPAI.CertLinkPAI1 = https://console.aws.amazon.com/acm-pca/home?region=us-west-1#/details?arn=arn:aws:acm-pca:us-west-1:111111111111:certificate-authority/22222222-2222-2222-2222-222222222222&tab=certificate

MatterStackPAI.PAI0 = VID=FFF1 PID=0100 CN=PAI0 arn:aws:acm-pca:us-west-1:111111111111:certificate-authority/11111111-1111-1111-1111-111111111111

MatterStackPAI.PAI1 = VID=FFF1 PID=0102 CN=PAI1 arn:aws:acm-pca:us-west-1:111111111111:certificate-authority/22222222-2222-2222-2222-222222222222

Note the following fields in the output:

  • CertArnPAI0 — The ARN of the certificate of the first PAI in AWS Private CA.
  • CertArnPAI1 — The ARN of the certificate of the second PAI in AWS Private CA.
  • CertLinkPAI0 — The link to the first PAI certificate in the AWS Private CA console.
  • CertLinkPAI1 — The link to the second PAI certificate in the AWS Private CA console.
  • PAI0 — The vendor ID, product ID, common name, and ARN of the AWS Private CA subordinate certificate authority created as the first PAI.
  • PAI1 — The vendor ID, product ID, common name, and ARN of the AWS Private CA subordinate certificate authority created as the second PAI.

Deploy multi-Region PAIs

The following command is an example of how you can generate a PAI in a different Region. To specify a Region, either use the CDK_DEFAULT_REGION environment variable or specify a different AWS profile by using the –profile parameter (see AWS CDK documentation for more details).

$ CDK_DEFAULT_REGION=us-east-1
$ cdk deploy ‐‐context generatePaiCnt=1 ‐‐parameters vendorId=FFF1 ‐‐parameters productIds=0100 ‐‐parameters validityInDays=365 ‐‐parameters paaArn=<PAA-ARN-FROM-PAA-OUTPUT>

Issue DACs

After the Matter CA hierarchy is set up, you can issue DACs by writing your CSRs directly to the DacInputS3ToSQSS3Bucket S3 bucket. The SqsToDacIssuingLambda function signs the CSRs by using the specified PAI to create a DAC, and then the function writes that DAC back to the S3 bucket.

To test this procedure by using OpenSSL

  1. Use the following command to create a key pair for the DAC:
    $ openssl ecparam -name prime256v1 -out certificates/ecparam

  2. Create the DAC CSR with DAC as the common name; the following is an example:
    $ openssl req -config certificates/config.ssl -newkey param:certificates/ecparam -keyout key-DAC.pem -out cert-DAC.csr -sha256 -subj "/CN=DAC"

  3. Request CSR signing by using a PAI identified by its UUID:
    $ aws s3 cp cert-DAC.csr s3://Matterstackpai-dacinputs3tosqss3bucket<remainder of your bucket name>/arn:aws:acm-pca:<region>:<account>:certificate-authority/<PAI UUID>/<PID>/cert-DAC.csr

  4. The PAI will respond with the certificate as soon as it finishes processing the CSR, and the certificate will be made available in the same S3 bucket:
    $ aws s3 cp s3://Matterstackpai-dacinputs3tosqss3bucket<remainder of your bucket name>/arn:aws:acm-pca:<region>:<account>:certificate-authority/<PAI UUID>/<PID>/cert-DAC.pem

Audit trails, record keeping, and retention

The Matter PKI CP mandates controls and processes that are required to operate Matter device attestation CAs. The Matter PKI CDK samples on GitHub configure IAM roles, AWS CloudTrail, Amazon CloudWatch, and Amazon S3 to help you meet these requirements. Specifically, the samples configure the following:

  • Permissions
    • The /MatterPKI/MatterIssueDACRole role, for issuing and revoking DACs from PAIs in AWS Private CA.
    • The /MatterPKI/MatterIssuePAIRole role, for issuing and revoking PAIs from the PAA in AWS Private CA.
    • The /MatterPKI/MatterManagePAARole role, for monitoring and managing the PAA.
    • The /MatterPKI/MatterAuditorRole role, for viewing, collecting, and managing the logs that pertain to Matter PKI resources.
  • Logging
    • Audit logs should be maintained for operations performed in AWS that are related to the Matter PKI. This is accomplished by using AWS CloudTrail to store logs of API calls in S3.
    • In order to preserve the integrity of the data in the audit logging S3 bucket, the sample scripts configure S3 Object Lock to make objects write-once, read-many.
    • A resource policy is applied to the S3 bucket that grants IAM role access only to the auditor.
    • Data in this S3 bucket is backed up to a vault by using AWS Backup. From there, the data can be restored in case of an emergency.
    • The logs are retained in S3 for two months after creation and then are automatically moved into S3 Glacier for the rest of their five-year lifecycle, where they can be restored if necessary.
    • In order to make sure that the logs can be quickly queried by an auditor, the sample script sets up CloudTrail to continuously send events that are filtered to include only events relevant to Matter PKI to the AWS CloudWatch log group MatterAudit. You should restrict access to the stored logs by enabling log integrity validation to assure that the log remains computationally infeasible to modify, delete or forge.

Verify and validate certificates for Matter compliance

The Matter standards working group provides the CHIP Certificate Tool (chip-cert), a CLI utility that you can use to verify the conformance of the certificates that form a DAC chain. If you use the sample script we provided earlier to generate DACs, this tool is run as part of the certificate issuance process to verify that the certificate and chain are Matter compliant and no misconfiguration has taken place.

Use the following command to validate a DAC chain by using chip-cert manually.

$./chip-cert validate-att-cert ‐‐dac <dac-certificate-file-in-PEM> ‐‐pai <pai-certificate-file-in-PEM> ‐‐paa <paa-certificate-file-in-PEM> 

Best practices for assuring security, scalability, and resiliency

For a list of best practices and recommendations for using AWS Private CA effectively, see AWS Private CA best practices.

Cost considerations

When you install the Matter PKI CDK, you will incur charges for the deployment and use of AWS Private CA and the associated services. To delete a private CA permanently, see Deleting your private CA.

Conclusion

In this blog post, we discussed in depth the use of AWS Private CA to create and operate Matter device attestation CAs in order to issue DACs. We walked you through how you can use our open source samples on GitHub to create Matter PAAs and PAIs, and how to configure other AWS services like Amazon CloudWatch, AWS CloudTrail and Amazon S3 to help you meet the compliance requirements that are stipulated by the Matter PKI CP. We also showed you how you can use the Matter PAIs to issue DACs for your smart home devices and verify that the issued DACs meet the Matter requirements. To learn more about using the GitHub samples, see the README file located in the GitHub repository. We welcome your feedback to help us improve the samples.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ramesh Nagappan

Ramesh Nagappan

Ramesh is a Principal Security Engineer in the Devices and Services Trust Security (DSTS) organization at Amazon. With extensive experience, he remains focused on technologies related to applied cryptography, identity management, IoT, and Cloud infrastructure.

Leonid Wilde

Leonid Wilde

Leonid is a Software Engineer in the AWS Cryptography organization who also worked in various teams within Amazon. In his previous life, he worked on various products involving cryptographic algorithms, compilers, drivers, networking, and more.

Lukas Rash

Lukas Rash

Lukas is a Software Engineer in the AWS Cryptography organization. He is passionate about building robust cloud services to help customers improve the security of their systems.

Lorey Spade

Lorey Spade

Lorey is a Senior Industry Specialist in the AWS Cryptography organization, specializing in public key infrastructure compliance. She has extensive experience in the area of information technology auditing and compliance.

Efficiently crawl your data lake and improve data access with an AWS Glue crawler using partition indexes

Post Syndicated from Srividya Parthasarathy original https://aws.amazon.com/blogs/big-data/efficiently-crawl-your-data-lake-and-improve-data-access-with-aws-glue-crawler-using-partition-indexes/

In today’s world, customers manage vast amounts of data in their Amazon Simple Storage Service (Amazon S3) data lakes, which requires convoluted data pipelines to continuously understand the changes in the data layout and make them available to consuming systems. AWS Glue crawlers provide a straightforward way to catalog data in the AWS Glue Data Catalog that removes the heavy lifting when it comes to schema management and data classification. AWS Glue crawlers extract the data schema and partitions from Amazon S3 to automatically populate the Data Catalog, keeping the metadata current.

But with data growing exponentially over time, the number of partitions in a given table can grow significantly. Because analytics services like Amazon Athena query a table containing millions of partitions, the time needed to retrieve the partition increases and can cause query runtime to increase.

Today, AWS Glue crawler support has been expanded to automatically add partition indexes for newly discovered tables to optimize query processing on the partitioned dataset. Now, when the crawler creates a new Data Catalog table during a crawler run, it also creates a partition index by default, with the largest permutation of all numeric and string type partition columns as keys. The Data Catalog then creates a searchable index based on these keys, reducing the time required to retrieve and filter partition metadata on tables with millions of partitions. The creation of partition indexes benefits the analytics workloads running on Athena, Amazon EMR, Amazon Redshift Spectrum, and AWS Glue.

In this post, we describe how to create partition indexes with an AWS Glue crawler and compare the query performance improvement when accessing the crawled data with and without a partition index from Athena.

Solution overview

We use an AWS CloudFormation template to create our solution resources. In the following steps, we demonstrate how to configure the AWS Glue crawler to create a partition index using either the AWS Glue console or the AWS Command Line Interface (AWS CLI). Then we compare the query performance improvements using Athena.

Prerequisites

To follow along with this post, you must have access to an AWS Identity and Access Management (IAM) administrator role to create resources using AWS CloudFormation.

Set up your solution resources

The CloudFormation template generates the following resources:

  • IAM roles and policies
  • An AWS Glue database to hold the schema
  • An AWS Glue crawler pointing to a highly partitioned dataset
  • An Athena workgroup and bucket to store query results

Complete the following steps to set up the solution resources:

  1. Log in to the AWS Management Console as an IAM administrator.
  2. Choose Launch Stack to deploy the CloudFormation template:
  3. For DatabaseName, keep the default blog_partition_index_crawlerdb.
  4. Choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create stack.
  7. When the stack is complete, on the AWS CloudFormation console, navigate to the Outputs tab of the stack.
  8. Note down values of DatabaseName and GlueCrawlerName.

Some of the resources that this stack deploys incur costs when in use.

Edit and run the AWS Glue crawler

To configure and run the AWS Glue crawler, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Locate the crawler blog-partition-index-crawler and choose Edit.
  3. In the Set output and scheduling section, under Advanced options, select Create partition indexes automatically.
  4. Review and update the crawler settings.

Alternatively, you can configure your crawler using the AWS CLI (provide your IAM role and Region):

aws glue create-crawler --name blog-partition-index-crawler --targets '{ "S3Targets": [{ "Path": "s3://awsglue-datasets/examples/highly-partitioned-table/"}] }' --database-name "blog_partition_index_crawlerdb" --role <Crawler_IAM_role> --configuration "{\"Version\":1.0,\"CreatePartitionIndex\":true}" --region <region_name>
  1. Now run the crawler and verify that the crawler run is complete.

This is highly partitioned dataset and will take approximately 90 minutes to complete.

Verify the partitioned table

In the AWS Glue database blog_partition_index_crawlerdb, verify that the table highly_partitioned_table is created.

By default, the crawler determines an index based on the largest permutation of partition columns of valid column types in the same order of partition columns, which are either numeric or string. For the table created by the crawler (highly_partitioned_table), we have partition columns year (string), month (string), day (string), and hour (string).

Based on this definition, the crawler created an index on the permutation of year, month, day, and hour. The crawler created the indexes prefixed with crawler_ on any partition index created by default.

Verify the same by navigating to the table highly_partitioned_table on the AWS Glue console and choosing the Indexes tab.

The crawler was able to crawl the S3 data source and successfully populate the partition indexes for the table.

Compare the query performance improvements using Athena

First, we query the table in Athena without using the partition index. To verify the tables using Athena, complete the following steps:

  1. On the Athena console, choose crawler-primary-workgroup as the Athena workgroup and choose Acknowledge.
  2. Run the following query:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year='1980' and month='01' and day ='01'

The following screenshot shows the query took approximately 32 seconds without filtering enabled using the partition index.

  1. Now we enable the partition index on the Athena query:
    ALTER TABLE blog_partition_index_crawlerdb.highly_partitioned_table
    SET TBLPROPERTIES ('partition_filtering.enabled' = 'true')

  2. Run the following query again and note the runtime:
    select count(*), sum(value) from blog_partition_index_crawlerdb.highly_partitioned_table where year=‘1980’ and month=‘01’ and day =‘01’

The following screenshot shows the query took only 700 milliseconds, which is much faster with filtering enabled using the partition index.

Clean up

To avoid unwanted charges to your AWS account, you can delete the AWS resources:

  1. Sign in to the CloudFormation console as the IAM admin used for creating the CloudFormation stack.
  2. Delete the CloudFormation stack you created.

Conclusion

In this post, we explained how to configure an AWS crawler to create partition indexes and compared the query performance when accessing the data with indexes from Athena.

If no partition indexes are present on the table, AWS Glue loads all the partitions of the table, and then filters the loaded partitions, which results in inefficient retrieval of metadata. Analytics services like Redshift Spectrum, Amazon EMR, and AWS Glue ETL Spark DataFrames can now utilize indexes for fetching partitions, resulting in significant query performance.

For more information on partition indexes and query performance across various analytical engines, refer to Improve Amazon Athena query performance using AWS Glue Data Catalog partition indexes and Improve query performance using AWS Glue partition indexes.

Special thanks to everyone who contributed to this crawler feature launch: Yuhang Chen, Kyle Duong,and Mita Gavade.


About the authors

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building data mesh solutions and sharing them with the community.

Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.

Improved resiliency with backpressure and admission control for Amazon OpenSearch Service

Post Syndicated from Ketan Verma original https://aws.amazon.com/blogs/big-data/improved-resiliency-with-backpressure-and-admission-control-for-amazon-opensearch-service/

Amazon OpenSearch Service is a managed service that makes it simple to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. Last year, we introduced Shard Indexing Backpressure and admission control, which monitors cluster resources and incoming traffic to selectively reject requests that would otherwise pose stability risks like out of memory and impact cluster performance due to memory contentions, CPU saturation and GC overhead, and more.

We are now excited to introduce Search Backpressure and CPU-based admission control for OpenSearch Service, which further enhances the resiliency of clusters. These improvements are available for all OpenSearch versions 1.3 or higher.

Search Backpressure

Backpressure prevents a system from being overwhelmed with work. It does so by controlling the traffic rate or by shedding excessive load in order to prevent crashes and data loss, improve performance, and avoid total failure of the system.

Search Backpressure is a mechanism to identify and cancel in-flight resource-intensive search requests when a node is under duress. It’s effective against search workloads with anomalously high resource usage (such as complex queries, slow queries, many hits, or heavy aggregations), which could otherwise cause node crashes and impact the cluster’s health.

Search Backpressure is built on top of the task resource tracking framework, which provides an easy-to-use API to monitor each task’s resource usage. Search Backpressure uses a background thread that periodically measures the node’s resource usage and assigns a cancellation score to each in-flight search task based on factors like CPU time, heap allocations, and elapsed time. A higher cancellation score corresponds to a more resource-intensive search request. Search requests are cancelled in descending order of their cancellation score to recover nodes quickly, but the number of cancellations is rate-limited to avoid wasteful work.

The following diagram illustrates the Search Backpressure workflow.

Search requests return an HTTP 429 “Too Many Requests” status code upon cancellation. OpenSearch returns partial results if only some shards fail and partial results are allowed. See the following code:

{
    "error": {
        "root_cause": [
            {
                "type": "task_cancelled_exception",
                "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]"
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "SearchTask was cancelled",
        "phase": "fetch",
        "grouped": true,
        "failed_shards": [
            {
                "shard": 0,
                "index": "nyc_taxis",
                "node": "9gB3PDp6Speu61KvOheDXA",
                "reason": {
                    "type": "task_cancelled_exception",
                    "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]"
                }
            }
        ],
        "caused_by": {
            "type": "task_cancelled_exception",
            "reason": "cancelled task with reason: heap usage exceeded [403mb >= 77.6mb], elapsed time exceeded [1.7m >= 45s]"
        }
    },
    "status": 429
}

Monitoring Search Backpressure

You can monitor the detailed Search Backpressure state using the node stats API:

curl -X GET "https://{endpoint}/_nodes/stats/search_backpressure"

You can also view the cluster-wide summary of cancellations using Amazon CloudWatch. The following metrics are now available in the ES/OpenSearchService namespace:

  • SearchTaskCancelled – The number of coordinator node cancellations
  • SearchShardTaskCancelled – The number of data node cancellations

The following screenshot shows an example of tracking these metrics on the CloudWatch console.

CPU-based admission control

Admission control is a gatekeeping mechanism that proactively limits the number of requests to a node based on its current capacity, both for organic increases and spikes in traffic.

In addition to the JVM memory pressure and request size thresholds, it now also monitors each node’s rolling average CPU usage to reject incoming _search and _bulk requests. It prevents nodes from being overwhelmed with too many requests leading to hot spots, performance problems, request timeouts, and other cascading failures. Excessive requests return an HTTP 429 “Too Many Requests” status code upon rejection.

Handling HTTP 429 errors

You’ll receive HTTP 429 errors if you send excessive traffic to a node. It indicates either insufficient cluster resources, resource-intensive search requests, or an unintended spike in the workload.

Search Backpressure provides the reason for rejection, which can help fine-tune resource-intensive search requests. For traffic spikes, we recommend client-side retries with exponential backoff and jitter.

You can also follow these troubleshooting guides to debug excessive rejections:

Conclusion

Search Backpressure is a reactive mechanism to shed excessive load, while admission control is a proactive mechanism to limit the number of requests to a node beyond its capacity. Both work in tandem to improve the overall resiliency of an OpenSearch cluster.

Search Backpressure is available in OpenSearch, and we are always looking for external contributions. You can refer to the RFC to get started.


About the authors

Ketan Verma is a Senior SDE working on Amazon OpenSearch Service. He is passionate about building large-scale distributed systems, improving performance, and simplifying complex ideas with simple abstractions. Outside work, he likes to read and improve his home barista skills.

Suresh N S is a Senior SDE working on Amazon OpenSearch Service. He is passionate towards solving problems in large scale distributed systems.

Pritkumar Ladani is an SDE-2 working on Amazon OpenSearch Service. He likes to contribute to open source software development, and is passionate about distributed systems. He is an amateur badminton player and enjoys trekking.

Bukhtawar Khan is a Principal Engineer working on Amazon OpenSearch Service. He is interested in building distributed and autonomous systems. He is a maintainer and an active contributor to OpenSearch.