Tag Archives: Technical How-to

Deliver Amazon CloudWatch logs to Amazon OpenSearch Serverless

Post Syndicated from Balaji Mohan original https://aws.amazon.com/blogs/big-data/deliver-amazon-cloudwatch-logs-to-amazon-opensearch-serverless/

Amazon CloudWatch Logs collect, aggregate, and analyze logs from different systems in one place. CloudWatch provides subcriptions as a real-time feed of these logs to other services like Amazon Kinesis Data Streams, AWS Lambda, and Amazon OpenSearch Service. These subscriptions are a popular mechanism to enable custom processing and advanced analysis of log data to gain additional valuable insights. At the time of publishing this blog post, these subscription filters support delivering logs to Amazon OpenSearch Service provisioned clusters only. Customers are increasingly adopting Amazon OpenSearch Serverless as a cost-effective option for infrequent, intermittent and unpredictable workloads.

In this blog post, we will show how to use Amazon OpenSearch Ingestion to deliver CloudWatch logs to OpenSearch Serverless in near real-time. We outline a mechanism to connect a Lambda subscription filter with OpenSearch Ingestion and deliver logs to OpenSearch Serverless without explicitly needing a separate subscription filter for it.

Solution overview

The following diagram illustrates the solution architecture.

  1. CloudWatch Logs: Collects and stores logs from various AWS resources and applications. It serves as the source of log data in this solution.
  2. Subscription filter : A CloudWatch Logs subscription filter filters and routes specific log data from CloudWatch Logs to the next component in the pipeline.
  3. CloudWatch exporter Lambda function: This is a Lambda function that receives the filtered log data from the subscription filter. Its purpose is to transform and prepare the log data for ingestion into the OpenSearch Ingestion pipeline.
  4. OpenSearch Ingestion: This is a component of OpenSearch Service. The Ingestion pipeline is responsible for processing and enriching the log data received from the CloudWatch exporter Lambda function before storing it in the OpenSearch Serverless collection.
  5. OpenSearch Service: This is fully managed service that stores and indexes log data, making it searchable and available for analysis and visualization. OpenSearch Service offers two configurations: provisioned domains and serverless. In this setup, we use serverless, which is an auto-scaling configuration for OpenSearch Service.

Prerequisites

Deploy the solution

With the prerequisites in place, you can create and deploy the pieces of the solution.

Step 1: Create PipelineRole for ingestion

  • Open the AWS Management Console for AWS Identity and Access Management (IAM).
  • Choose Policies, and then choose Create policy.
  • Select JSON and paste the following policy into the editor:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "aoss:BatchGetCollection",
                "aoss:APIAccessAll"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:aoss:us-east-1:{accountId}:collection/{collectionId}"
        },
        {
            "Action": [
                "aoss:CreateSecurityPolicy",
                "aoss:GetSecurityPolicy",
                "aoss:UpdateSecurityPolicy"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aoss:collection": "{collection}"
                }
            }
        }
    ]
}

// Replace {accountId}, {collectionId}, and {collection} with your own values
  • Choose Next, choose Next, and name your policy collection-pipeline-policy.
  • Choose Create policy.
  • Next, create a role and attach the policy to it. Choose Roles, and then choose Create role.
  • Select Custom trust policy and paste the following policy into the editor:
{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Effect":"Allow",
         "Principal":{
            "Service":"osis-pipelines.amazonaws.com"
         },
         "Action":"sts:AssumeRole"
      }
   ]
}
  • Choose Next, and then search for and select the collection-pipeline-policy you just created.
  • Choose Next and name the role PipelineRole.
  • Choose Create role.

Step 2: Configure the network and data policy for OpenSearch collection

  • In the OpenSearch Service console, navigate to the Serverless menu.
  • Create a VPC endpoint by following the instruction in Create an interface endpoint for OpenSearch Serverless.
  • Go to Security and choose Network policies.
  • Choose Create network policy.
  • Configure the following policy
[
  {
    "Rules": [
      {
        "Resource": [
          "collection/{collection name}"
        ],
        "ResourceType": "collection"
      }
    ],
    "AllowFromPublic": false,
    "SourceVPCEs": [
      "{VPC Enddpoint Id}"
    ]
  },
  {
    "Rules": [
      {
        "Resource": [
          "collection/{collection name}"
        ],
        "ResourceType": "dashboard"
      }
    ],
    "AllowFromPublic": true
  }
]
  • Go to Security and choose Data access policies.
  • Choose Create access policy.
  • Configure the following policy:
[
  {
    "Rules": [
      {
        "Resource": [
          "index/{collection name}/*"
        ],
        "Permission": [
          "aoss:CreateIndex",
          "aoss:UpdateIndex",
          "aoss:DescribeIndex",
          "aoss:ReadDocument",
          "aoss:WriteDocument"
        ],
        "ResourceType": "index"
      }
    ],
    "Principal": [
      "arn:aws:iam::{accountId}:role/PipelineRole",
      "arn:aws:iam::{accountId}:role/Admin"
    ],
    "Description": "Rule 1"
  }
]

Step 3: Create an OpenSearch Ingestion pipeline

  • Navigate to the OpenSearch Service.
  • Go to the Ingestion pipelines section.
  • Choose Create pipeline.
  • Define the pipeline configuration.
version: "2"
 cwlogs-ingestion-pipeline:

  source:

    http:

      path: /logs/ingest

  sink:

    - opensearch:

        # Provide an AWS OpenSearch Service domain endpoint

        hosts: ["https://{collectionId}.{region}.aoss.amazonaws.com"]

        index: "cwl-%{yyyy-MM-dd}"

        aws:

          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com

          sts_role_arn: "arn:aws:iam::{accountId}:role/PipelineRole"

          # Provide the region of the domain.

          region: "{region}"

          serverless: true

          serverless_options:

            network_policy_name: "{Network policy name}"
 # To get the values for the placeholders: 
 # 1. {collectionId}: You can find the collection ID by navigating to the Amazon OpenSearch Serverless Collection in the AWS Management Console, and then clicking on the Collection. The collection ID is listed under the "Overview" section. 
 # 2. {region}: This is the AWS region where your Amazon OpenSearch Service domain is located. You can find this information in the AWS Management Console when you navigate to the domain. 
 # 3. {accountId}: This is your AWS account ID. You can find your account ID by clicking on your username in the top-right corner of the AWS Management Console and selecting "My Account" from the dropdown menu. 
 # 4. {Network policy name}: This is the name of the network policy you have configured for your Amazon OpenSearch Serverless Collection. If you haven't configured a network policy, you can leave this placeholder as is or remove it from the configuration.
 # After obtaining the necessary values, replace the placeholders in the configuration with the actual values.            

Step 4: Create a Lambda function

  • Create a Lambda layer for requests and sigv4 packages. Run the following commands in AWS Cloudshell.
mkdir lambda_layers
 cd lambda_layers
 mkdir python
 cd python
 pip install requests -t ./
 pip install requests_auth_aws_sigv4 -t ./
 cd ..
 zip -r python_modules.zip .


 aws lambda publish-layer-version --layer-name Data-requests --description "My Python layer" --zip-file fileb://python_modules.zip --compatible-runtimes python3.x
import base64
 import gzip
 import json
 import logging
 import json
 import jmespath
 import requests
 from datetime import datetime
 from requests_auth_aws_sigv4 import AWSSigV4
 import boto3


 LOGGER = logging.getLogger(__name__)
 LOGGER.setLevel(logging.INFO)


 def lambda_handler(event, context):

    """Extract the data from the event"""

    data = jmespath.search("awslogs.data", event)

    """Decompress the logs"""

    cwLogs = decompress_json_data(data)

    """Construct the payload to send to OpenSearch Ingestion"""

    payload = prepare_payload(cwLogs)

    print(payload)

    """Ingest the set of events to the pipeline"""    

    response = ingestData(payload)

    return {

        'statusCode': 200

    }
 def decompress_json_data(data):

    compressed_data = base64.b64decode(data)

    uncompressed_data = gzip.decompress(compressed_data)

    return json.loads(uncompressed_data)


 def prepare_payload(cwLogs):

    payload = []

    logEvents = cwLogs['logEvents']

    for logEvent in logEvents:

        request = {}

        request['id'] = logEvent['id']

        dt = datetime.fromtimestamp(logEvent['timestamp'] / 1000) 

        request['timestamp'] = dt.isoformat()

        request['message'] = logEvent['message'];

        request['owner'] = cwLogs['owner'];

        request['log_group'] = cwLogs['logGroup'];

        request['log_stream'] = cwLogs['logStream'];

        payload.append(request)

    return payload

 def ingestData(payload):

    ingestionEndpoint = '{OpenSearch Pipeline Endpoint}'

    endpoint = 'https://' + ingestionEndpoint

    headers = {'Content-Type': 'application/json', 'Accept':'application/json'}

    r = requests.request('POST', f'{endpoint}/logs/ingest', json=payload, auth=AWSSigV4('osis'), headers=headers)

    LOGGER.info('Response received: ' + r.text)

    return r
  • Replace {OpenSearch Pipeline Endpoint}’ with the endpoint of your OpenSearch Ingestion pipeline.
  • Attach the following inline policy in execution role.
{

    "Version": "2012-10-17",

    "Statement": [

        {

            "Sid": "PermitsWriteAccessToPipeline",

            "Effect": "Allow",

            "Action": "osis:Ingest",

            "Resource": "arn:aws:osis:{region}:{accountId}:pipeline/{OpenSearch Pipeline Name}"

        }

    ]
 }
  • Deploy the function.

Step 5: Set up a CloudWatch Logs subscription

  • Grant permission to a specific AWS service or AWS account to invoke the specified Lambda function. The following command grants permission to the CloudWatch Logs service to invoke the cloud-logs Lambda function for the specified log group. This is necessary because CloudWatch Logs cannot directly invoke a Lambda function without being granted permission. Run the following command in CloudShell to add permission.
aws lambda add-permission
 --function-name "{function name}"
 --statement-id "{function name}"
 --principal "logs.amazonaws.com"
 --action "lambda:InvokeFunction"
 --source-arn "arn:aws:logs:{region}:{accountId}:log-group:{log_group}:*"
 --source-account "{accountId}"
  • Create a subscription filter for a log group. The following command creates a subscription filter on the log group, which forwards all log events (because the filter pattern is an empty string) to the Lambda function. Run the following command in Cloudshell to create the subscription filter.
aws logs put-subscription-filter
 --log-group-name {log_group}
 --filter-name {filter name}
 --filter-pattern ""
 --destination-arn arn:aws:lambda:{region}:{accountId}:function:{function name}

Step 6: Testing and verification

  • Generate some logs in your CloudWatch log group. Run the following command in Cloudshell to create sample logs in log group.
aws logs put-log-events --log-group-name {log_group} --log-stream-name {stream_name} --log-events "[{\"timestamp\":{timestamp in millis} , \"message\": \"Simple Lambda Test\"}]"
  • Check the OpenSearch collection to ensure logs are indexed correctly.

Clean up

Remove the infrastructure for this solution when not in use to avoid incurring unnecessary costs.

Conclusion

You saw how to set up a pipeline to send CloudWatch logs to an OpenSearch Serverless collection within a VPC. This integration uses CloudWatch for log aggregation, Lambda for log processing, and OpenSearch Serverless for querying and visualization. You can use this solution to take advantage of the pay-as-you-go pricing model for OpenSearch Serverless to optimize operational costs for log analysis.

To further explore, you can:


About the Authors

Balaji Mohan is a senior modernization architect specializing in application and data modernization to the cloud. His business-first approach ensures seamless transitions, aligning technology with organizational goals. Using cloud-native architectures, he delivers scalable, agile, and cost-effective solutions, driving innovation and growth.

Souvik Bose is a Software Development Engineer working on Amazon OpenSearch Service.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Accelerate your Terraform development with Amazon Q Developer

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/accelerate-your-terraform-development-with-amazon-q-developer/

This post demonstrates how Amazon Q Developer, a generative AI-powered assistant for software development, helps create Terraform templates. Terraform is an infrastructure as code (IaC) tool that provisions and manages infrastructure on AWS safely and predictably. When used in an integrated development environment (IDE), Amazon Q Developer assists with software development, including code generation, explanation, and improvements. This blog highlights the 5 most used cases to show how Amazon Q Developer can generate Terraform code snippets:

  1. Secure Networking Deployment: Generate Terraform code snippet to create an Amazon Virtual Private Cloud (VPC) with subnets, route tables, and security groups.
  2. Multi-Account CI/CD Pipelines with AWS CodePipeline: Generate Terraform code snippet for setting up continuous integration and continuous delivery (CI/CD) pipelines using AWS CodePipeline across multiple AWS accounts.
  3. Event-Driven Architecture: Generate Terraform code snippet to set up an event-driven architecture using Amazon EventBridge, AWS Lambda, Amazon API Gateway, and other serverless services.
  4. Container Orchestration (Amazon ECS Fargate): Generate Terraform code snippet for deploying and managing container orchestration using Amazon Elastic Container Service (ECS) with Fargate.
  5. Machine Learning Workflows: Generate Terraform module for deploying machine learning workflows using AWS SageMaker.

Terraform Project Structure

First, you want to understand the best practices and requirements for setting up a multi-environment Terraform Infrastructure as Code (IaC) project. Following industry practices from the start is crucial to manage your multi-environment infrastructure and this is where Amazon Q will recommend a standard folder and file structure for organising Terraform templates and managing infrastructure deployments.

Interface showing Amazon Q Developer's recommendations for structuring a Terraform project

Amazon Q Developer recommended organising Terraform templates in separate folders for each environment, with sub-folders to group resources into modules. It provided best practices like version control, remote backend, and tagging for reusability and scalability. We can ask Amazon Q Developer to generate a sample example for better understanding.

Screenshot of the Terraform folder structure generated by Amazon Q Developer

You can see that the recommended folder structure includes a root project folder and sub-folders for environments and modules. The sub-folders manage multiple environments (like development, testing, production), reusable modules (like Virtual Private Cloud (VPC), Elastic Compute Cloud (EC2)) for various components and it demonstrates references to manage Terraform templates for individual environments and components. Let’s explore the top 5 most common use cases one by one.

1. Secure Networking Deployment

Once we setup Terraform project structure, we will ask Amazon Q Developer to give recommendations for networking services and requirements. Amazon Q Developer suggests to use Amazon Virtual Private Cloud (VPC), Subnets, Internet Gateways, Network Access Control Lists (ACLs) and Security Groups.

Amazon Q Developer interface displaying recommendations for core Amazon VPC components

Now, we can ask Amazon Q Developer to generate a Terraform template for building components like Virtual Private Cloud (VPC) and subnets. The Terraform code generated by Amazon Q Developer for a VPC, private subnet, and public subnet. It also added tags to each resource, following best practices. To leverage the suggested code snippet in your template, open the vpc.tf file and either insert it at the cursor or copy it into the file. Additionally, Amazon Q Developer created an Internet Gateway, Route Tables, and associated them with the VPC.

Generated Terraform code snippet for VPC and Subnets shown in the Amazon Q Developer interface

2. Multi-Account CI/CD Pipelines with AWS CodePipeline

Let’s discuss the second use case: a CI/CD (Continuous Integration/Continuous Deployment) pipeline using AWS CodePipeline to deploy Terraform code across multiple AWS accounts. Assume this is your first time working on this use case. Use Amazon Q Developer to understand the pipeline’s design stages and requirements for creating your pipeline across AWS accounts. I asked Amazon Q Developer about the pipeline stages and requirements to deploy across AWS accounts.

Amazon Q Developer displaying generated stages and dependencies for a CI/CD pipeline

Amazon Q Developer provided all the main stages for the CI/CD (Continuous Integration/Continuous Deployment) pipeline:

  1. Source stage pulls the Terraform code from the source code repository.
  2. Plan stage includes linting, validation, or running terraform plan to view the Terraform plan before deploying the code.
  3. Build stage performs additional testing for the infrastructure to ensure all components will be created successfully.
  4. Deploy stage runs terraform apply.

Amazon Q Developer also provided all the requirements needed before creating the CI/CD pipeline. These requirements include terraform is installed in the pipeline environment, IAM roles that will be assumed by the pipeline stages to deploy the terraform code across different accounts, an Amazon S3 bucket to store the state of the terraform code, source code repository to store our terraform files, use parameters to store sensitive data like AWS accounts IDs and AWS CodePipeline to create our CI/CD pipeline.

You will use Amazon Q Developer now to generate the terraform code for the CI/CD pipeline based on stages design proposed by the services in the previous question.

Interface showing Terraform code snippets for CI/CD pipeline stages generated by Amazon Q Developer

We would like to highlight three things here:

1. Amazon Q Developer suggested all the stages for our Continuous Integration/Continuous Deployment (CI/CD) design:

  • Source Stage: this stage pulls the code from the source code repository.
  • Build Stage: this stage will validates the infrastructure code, run a Terraform plan stage, and can automate any review.
  • Deploy Stage: this stage deploys the terraform code to the target AWS account.

2. Amazon Q Developer generated the build stage before the plan and this is expected in any CI/CD pipeline. First, we start with building the artifacts, running any tests or validation steps. If this stage is passed successfully, it will then proceed to terraform plan.
3. Amazon’s Q Developer suggests deploying to separate stages for each target account, aligning with the best practice of using different AWS accounts for development, testing, and production environments. This reduces the blast radius and allows configuring a separate stage for each environment/account.

3. Event-Driven Architecture

For the next use case, we will start by asking Amazon Q Developer to explain Event-driven architecture. As shown below, Amazon Q Developer highlights the important aspects that we need to consider when designing event-driven architecture like:

  • An event that represents an actual change in the application state like S3 object upload.
  • The event source that produces the actual event like Amazon S3.
  • An event route that routes the event between source and target.
  • An event target that consumes the event like AWS Lambda function.

Amazon Q Developer providing an explanation of event-driven architecture components

Assume you want to build a sample Event-driven architecture using Amazon SNS as a simple pub/sub. In order to build such architecture, we will need:

  • An Amazon SNS topic to publish events to an Amazon SQS queue subscriber.
  • An Amazon SQS queue which subscribe to SNS topic.

Sample architecture of Event-driven approach

Asking Amazon Q Developer to create a terraform code for the above use case, it generated a code that can get us started in seconds. The code includes:

  • Create an SNS Topic and SQS queue
  • Subscribe SQS queue to SNS topic
  • AWS IAM policy to allow SNS to write to SQS

Terraform code snippet for an event-driven application generated by Amazon Q Developer

4. Container Orchestration (Amazon ECS Fargate)

Now to our next use case, we want to build an Amazon ECS cluster with Fargate. Before we start, we will ask Amazon Q Developer about Amazon ECS and the difference between using Amazon EC2 or Fargate option.

Amazon Q Developer interface explaining Amazon ECS and its compute options

Amazon Q Developer not only provides details about Amazon ECS capabilities and well-defined description about the difference between Amazon EC2 and Fargate as compute options, and also a guide when to use what to help in decision making process.

Now, we will ask Amazon Q Developer to suggest a terraform code to create an Amazon ECS cluster with Fargate as the compute option.

Generated Terraform code snippet for deploying Amazon ECS on Fargate shown by Amazon Q Developer

Amazon Q Developer suggested the key resources in the Terraform code for the ECS cluster, the resources are:

  • An Amazon ECS Cluster.
  • A sample task definition for an Nginx container with all the required configuration for the container to run.
  • ECS service with Fargate as launch type for the container to run.

5. Secure Machine Learning Workflows

Now let us explore final use case on setting up Machine Learning workflow. Assuming I am new to ML on AWS, I will first try to understand the best practices and requirements for ML workflows. We can ask Amazon Q Developer to get the recommendations for the resources, security policies etc.

Amazon Q Developer's recommendations for best practices in secure ML workflows

Amazon Q Developer provided recommendations to provision SageMaker resources in private VPC, implement authentication and authorization using AWS IAM, encrypt data at rest and in transit using AWS KMS and setup secure CI/CD. Once I get the recommendations, I can identify the resources which I need to build MLOps workflow.

Interface showing suggested resources for building an MLOPs workflow by Amazon Q Developer

Amazon Q Developer suggested resources such as a SageMaker Studio instance, model registry, endpoints, version control, CI/CD Pipeline, CloudWatch etc. for ML model building, training and deployment using SageMaker. Now I can start building Terraform templates to deploy these resources. To get the code recommendations, I can ask Amazon Q Developer to give a Terraform snippet for all these resources.

Amazon Q Developer will generate Terraform snippets for isolated VPC, Security Group, S3 bucket and SageMaker pipeline. Now, as shown earlier, we can leverage Amazon Q Developer to generate Terraform code for each and every resource using comments.

Conclusion

In this blog post, we showed you how to use Amazon Q Developer, a generative AI–powered assistant from AWS, in your IDE to quickly improve your development experience when using an infrastructure-as-code tool like Terraform to create your AWS infrastructure. However, we would like to highlight some important points:

  • Code Validation and Testing: Always thoroughly validate and test the Terraform code generated by Amazon Q Developer in a safe environment before deploying it to production. Automated code generation tools can sometimes produce code that may need adjustments based on your specific requirements and configurations.
  • Security Considerations: Ensure that all security practices are followed, such as least privilege principles for IAM roles, proper encryption methods for sensitive data, and continuous monitoring for security compliance.
  • Limitations of Amazon Q Developer: While Amazon Q Developer provides valuable assistance, it may not cover all edge cases or specific customizations needed for your infrastructure. Always review the generated code and modify it as necessary to fit your unique use cases.
  • Updates and Maintenance: Infrastructure and services on AWS are continuously evolving. Make sure to keep your Terraform code and the use of Amazon Q Developer updated with the latest best practices and AWS service changes.
  • Real-Time Assistance: Use Amazon Q Developer in your IDE to enhance generated code as it provides real-time recommendations based on your existing code and comments. The suggestions are based on your existing code and comments, which are in this case generated by Amazon Q Developer.

To get started with Amazon Q Developer for debugging today navigate to Amazon Q Developer in IDE and simply start asking questions about debugging. Additionally, explore Amazon Q Developer workshop for additional hands-on use cases.

For any inquiries or assistance with Amazon Q Developer, please reach out to your AWS account team.

About the authors:

Dr. Rahul Sharad Gaikwad

Dr. Rahul is a Lead DevOps Consultant at AWS, specializing in migrating and modernizing customer workloads on the AWS Cloud. He is a technology enthusiast with a keen interest in Generative AI and DevOps. He received his Ph.D. in AIOps in 2022. He is recipient of the Indian Achievers’ Award (2024), Best PhD Thesis Award (2024), Research Scholar of the Year Award (2023) and Young Researcher Award (2022).

Omar Kahil

Omar is a Professional Services Senior consultant who helps customers adopt DevOps culture and best practices. He also works to simplify the adoption of AWS services by automating and implementing complex solutions.

Balance deployment speed and stability with DORA metrics

Post Syndicated from Rostislav Markov original https://aws.amazon.com/blogs/devops/balance-deployment-speed-and-stability-with-dora-metrics/

Development teams adopt DevOps practices to increase the speed and quality of their software delivery. The DevOps Research and Assessment (DORA) metrics provide a popular method to measure progress towards that outcome. Using four key metrics, senior leaders can assess the current state of team maturity and address areas of optimization.

This blog post shows you how to make use of DORA metrics for your Amazon Web Services (AWS) environments. We share a sample solution which allows you to bootstrap automatic metric collection in your AWS accounts.

Benefits of collecting DORA metrics

DORA metrics offer insights into your development teams’ performance and capacity by measuring qualitative aspects of deployment speed and stability. They also indicate the teams’ ability to adapt by measuring the average time to recover from failure. This helps product owners in defining work priorities, establishing transparency on team maturity, and developing a realistic workload schedule. The metrics are appropriate for communication with senior leadership. They help commit leadership support to resolve systemic issues inhibiting team satisfaction and user experience.

Use case

This solution is applicable to the following use case:

  • Development teams have a multi-account AWS setup including a tooling account where the CI/CD tools are hosted, and an operations account for log aggregation and visualization.
  • Developers use GitHub code repositories and AWS CodePipeline to promote code changes across application environment accounts.
  • Tooling, operations, and application environment accounts are member accounts in AWS Control Tower or workload accounts in the Landing Zone Accelerator on AWS solution.
  • Service impairment resulting from system change is logged as OpsItem in AWS Systems Manager OpsCenter.

Overview of solution

The four key DORA metrics

The ‘four keys’ measure team performance and ability to react to problems:

  1. Deployment Frequency measures the frequency of successful change releases in your production environment.
  2. Lead Time For Changes measures the average time for committed code to reach production.
  3. Change Failure Rate measures how often changes in production lead to service incidents/failures, and is complementary to Mean Time Between Failure.
  4. Mean Time To Recovery measures the average time from service interruption to full recovery.

The first two metrics focus on deployment speed, while the other two indicate deployment stability (Figure 1). We recommend organizations to set their own goals (that is, DORA metric targets) based on service criticality and customer needs. For a discussion of prior DORA benchmark data and what it reveals about the performance of development teams, consult How DORA Metrics Can Measure and Improve Performance.

Balance between deployment speed and stability in software delivery, utilizing DORA metrics across four quadrants. The horizontal axis depicts speed, progressing from low, infrequent deployments and higher time for changes on the left to rapid, frequent deployments with lower time for changes on the right. Vertically, the stability increases from the bottom, characterized by longer service restoration and higher failure rates, to the top, indicating quick restoration and fewer failures. The top-right quadrant represents the ideal state of high speed and stability, serving as the target for optimized software delivery and high performance.

Figure 1. Overview of DORA metrics

Consult the GitHub code repository Balance deployment speed and stability with DORA metrics for a detailed description of the metric calculation logic. Any modifications to this logic should be made carefully.

For example, the Change Failure Rate focuses on changes that impair the production system. Limiting the calculation to tags (such as hotfixes) on pull requests would exclude issues related to the build process. It’s important to match system change records that lead to actual impairments in production. Limiting the calculation to the number of failed deployments from the deployment pipeline only considers deployments that didn’t reach production. We use AWS Systems Manager OpsCenter as the system of records for change-related outages, rather than relying solely on data from CI/CD tools.

Similarly, Mean Time To Recovery measures the duration from a service impairment in production to a successful pipeline run. We encourage teams to track both pipeline status and recovery time, as frequent pipeline failure can indicate insufficient local testing and potential pipeline engineering issues.

Gathering DORA events

Our metric calculation process runs in four steps:

  1. In the tooling account, we send events from CodePipeline to the default event bus of Amazon EventBridge.
  2. Events are forwarded to custom event buses which process them according to the defined metrics and any filters we may have set up.
  3. The custom event buses call AWS Lambda functions which forward metric data to Amazon CloudWatch. CloudWatch gives us an aggregated view of each of the metrics. From Amazon CloudWatch, you can send the metrics to another designated dashboard like Amazon Managed Grafana.
  4. As part of the data collection, the Lambda function will also query GitHub for the relevant commit to calculate the lead time for changes metric. It will query AWS Systems Manager for OpsItem data for change failure rate and mean time to recovery metrics. You can create OpsItems manually as part of your change management process or configure CloudWatch alarms to create OpsItems automatically.

Figure 2 visualizes these steps. This setup can be replicated to a group of accounts of one or multiple teams.

This figure visualizes the aforementioned four steps of our metric calculation process. AWS Lambda functions process all events and publish custom metrics in Amazon CloudWatch.

Figure 2. DORA metric setup for AWS CodePipeline deployments

Walkthrough

Follow these steps to deploy the solution in your AWS accounts.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Deploying the solution

Clone the GitHub code repository Balance deployment speed and stability with DORA metrics.

Before you start deploying or working with this code base, there are a few configurations you need to complete in the constants.py file in the cdk/ directory. Open the file in your IDE and update the following constants:

  1. TOOLING_ACCOUNT_ID & TOOLING_ACCOUNT_REGION: These represent the AWS account ID and AWS region for AWS CodePipeline (that is, your tooling account).
  2. OPS_ACCOUNT_ID & OPS_ACCOUNT_REGION: These are for your operations account (used for centralized log aggregation and dashboard).
  3. TOOLING_CROSS_ACCOUNT_LAMBDA_ROLE: The IAM Role for cross-account access that allows AWS Lambda to post metrics from your tooling account to your operations account/Amazon CloudWatch dashboard.
  4. DEFAULT_MAIN_BRANCH: This is the default branch in your code repository that’s used to deploy to your production application environment. It is set to “main” by default, as we assumed feature-driven development (GitFlow) on the main branch; update if you use a different naming convention.
  5. APP_PROD_STAGE_NAME: This is the name of your production stage and set to “DeployPROD” by default. It’s reserved for teams with trunk-based development.

Setting up the environment

To set up your environment on MacOS and Linux:

  1. Create a virtual environment:
    $ python3 -m venv .venv
  2. Activate the virtual environment: On MacOS and Linux:
    $ source .venv/bin/activate

Alternatively, to set up your environment on Windows:

  1. Create a virtual environment:
    % .venv\Scripts\activate.bat
  2. Install the required Python packages:
    $ pip install -r requirements.txt

To configure the AWS Command Line Interface (AWS CLI):

  1. Follow the configuration steps in the AWS CLI User Guide.
    $ aws configure sso
  2. Configure your user profile (for example, Ops for operations account, Tooling for tooling account). You can check user profile names in the credentials file.

Deploying the CloudFormation stacks

  1. Switch directory
    $ cd cdk
  2. Bootstrap CDK
    $ cdk bootstrap –-profile Ops
  3. Synthesize the AWS CloudFormation template for this project:
    $ cdk synth
  4. To deploy a specific stack (see Figure 3 for an overview), specify the stack name and AWS account number(s) in the following command:
    $ cdk deploy <Stack-Name> --profile {Tooling, Ops}

    To launch the DoraToolingEventBridgeStack stack in the Tooling account:

    $ cdk deploy DoraToolingEventBridgeStack --profile Tooling

    To launch the other stacks in the Operations account (including DoraOpsGitHubLogsStack, DoraOpsDeploymentFrequencyStack, DoraOpsLeadTimeForChangeStack, DoraOpsChangeFailureRateStack, DoraOpsMeanTimeToRestoreStack, DoraOpsMetricsDashboardStack):

    $ cdk deploy DoraOps* --profile Ops

The following figure shows the resources you’ll launch with each CloudFormation stack. This includes six AWS CloudFormation stacks in operations account. The first stack sets up log integration for GitHub commit activity. Four stacks contain a Lambda function which creates one of the DORA metrics. The sixth stack creates the consolidated dashboard in Amazon CloudWatch.

Figure 3. Resources provisioned with this solution

Testing the deployment

To run the provided tests:

$ pytest

Understanding what you’ve built

Deployed resources in tooling account

The DoraToolingEventBridgeStack includes Amazon EventBridge rules with a target of the central event bus in the operations account, plus an AWS IAM role with cross-account access to put events in the operations account. The event pattern for invoking our EventBridge rules listens for deployment state changes in AWS CodePipeline:

{
  "detail-type": ["CodePipeline Pipeline Execution State Change"],
  "source": ["aws.codepipeline"]
}

Deployed resources in operations account

  1. The Lambda function for Deployment Frequency tracks the number of successful deployments to production, and posts the metric data to Amazon CloudWatch. You can add a dimension with the repository name in Amazon CloudWatch to filter on particular repositories/teams.
  2. The Lambda function for the Lead Time For Change metric calculates the duration from the first commit to successful deployment in production. This covers all factors contributing to lead time for changes, including code reviews, build, test, as well as the deployment itself.
  3. The Lambda function for Change Failure Rate keeps track of the count of successful deployments and the count of system impairment records (OpsItems) in production. It publishes both as metrics to Amazon CloudWatch and the latter calculates the ratio, as shown in below example.
    This visual shows three graphed metrics in Amazon CloudWatch: metric “m1” calculating number of failed deployments, metric “m2” calculating number of total deployments, and metric “m3” calculating change failure rate by dividing m1 with m2 and multiplying by 100.
  4. The Lambda function for Mean Time To Recovery keeps track of all deployments with status SUCCEEDED in production and whose repository branch name references an existing OpsItem ID. For every matching event, the function gets the creation time of the OpsItem record and posts the duration between OpsItem creation and successful re-deployment to the CloudWatch dashboard.

All Lambda functions publish metric data to Amazon CloudWatch using the PutMetricData API. The final calculation of the four keys is performed on the CloudWatch dashboard. The solution includes a simple CloudWatch dashboard so you can validate the end-to-end data flow and confirm that it has deployed successfully:

This simple CloudWatch dashboard displays the four DORA metrics for three reporting periods: per day, per week, and per month.

Cleaning up

Remember to delete example resources if you no longer need them to avoid incurring future costs.

You can do this via the CDK CLI:

$ cdk destroy <Stack-Name> --profile {Tooling, Ops}

Alternatively, go to the CloudFormation console in each AWS account, select the stacks related to DORA and click on Delete. Confirm that the status of all DORA stacks is DELETE_COMPLETE.

Conclusion

DORA metrics provide a popular method to measure the speed and stability of your deployments. The solution in this blog post helps you bootstrap automatic metric collection in your AWS accounts. The four keys help you gain consensus on team performance and provide data points to back improvement suggestions. We recommend using the solution to gain leadership support for systemic issues inhibiting team satisfaction and user experience. To learn more about developer productivity research, we encourage you to also review alternative frameworks including DevEx and SPACE.

Further resources

If you enjoyed this post, you may also like:

Author bio

Rostislav Markov

Rostislav is principal architect with AWS Professional Services. As technical leader in AWS Industries, he works with AWS customers and partners on their cloud transformation programs. Outside of work, he enjoys spending time with his family outdoors, playing tennis, and skiing.

Ojesvi Kushwah

Ojesvi works as a Cloud Infrastructure Architect with AWS Professional Services supporting global automotive customers. She is passionate about learning new technologies and building observability solutions. She likes to spend her free time with her family and animals.

Integrate Amazon MWAA with Microsoft Entra ID using SAML authentication

Post Syndicated from Satya Chikkala original https://aws.amazon.com/blogs/big-data/integrate-amazon-mwaa-with-microsoft-entra-id-using-saml-authentication/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) provides a fully managed solution for orchestrating and automating complex workflows in the cloud. Amazon MWAA offers two network access modes for accessing the Apache Airflow web UI in your environments: public and private. Customers often deploy Amazon MWAA in private mode and want to use existing login authentication mechanisms and single sign-on (SSO) features to have seamless integration with the corporate Active Directory (AD). Also, the end-users don’t need to log in to the AWS Management Console to access the Airflow UI.

In this post, we illustrate how to configure an Amazon MWAA environment deployed in private network access mode with customer managed VPC endpoints and authenticate users using SAML federated identity using Microsoft Entra ID and Application Load Balancer (ALB). Users can seamlessly log in to the Airflow UI with their corporate credentials and access the DAGs. This solution can be modified for Amazon MWAA public network access mode as well.

Solution overview

The architectural components involved in authenticating the Amazon MWAA environment using SAML SSO are depicted in the following diagram. The infrastructure components include two public subnets and three private subnets. The public subnets are required for the internet-facing ALB. Two private subnets are used to set up the Amazon MWAA environment, and the third private subnet is used to host the AWS Lambda authorizer function. This subnet will have a NAT gateway attached to it, because the function needs to verify the signer to confirm the JWT header has the expected LoadBalancer ARN.

The workflow consists of the following steps:

  1. For SAML configuration, Microsoft Entra ID serves as the identity provider (IdP).
  2. Amazon Cognito serves as the service provider (SP).
  3. ALB has built-in support for Amazon Cognito and authenticates requests.
  4. Post-authentication, ALB forwards the requests to the Lambda authorizer function. The Lambda function decodes the user’s JWT token and validates whether the user’s AD group is mapped to the relevant AWS Identity and Access Management (IAM) role.
  5. If valid, the function creates a web login token and redirects to the Amazon MWAA environment for successful login.

The following are the high-level steps to deploy the solution:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket for artifacts.
  2. Create an SSL certificate and upload it to AWS Certificate Manager (ACM).
  3. Deploy the Amazon MWAA infrastructure stack using AWS CloudFormation.
  4. Configure Microsoft Entra ID services and integrate the Amazon Cognito user pool.
  5. Deploy the ALB CloudFormation stack.
  6. Log in to Amazon MWAA using Microsoft Entra ID user credentials.

Prerequisites

Before you get started, make sure you have the following prerequisites:

  • An AWS account
  • Appropriate IAM permissions to deploy AWS CloudFormation stack resources
  • A Microsoft Azure account is required for creating the Microsoft Entra ID app (IdP config) and Microsoft Entra ID P2.
  • A public certificate for the ALB in the AWS Region where the infrastructure is being deployed and a custom domain name relevant to the certificate.

Create an S3 bucket

In this step, we create an S3 bucket to store your Airflow DAGs, custom plugins in a plugins.zip file, and Python dependencies in a requirements.txt file. This bucket is used by the Amazon MWAA environment to fetch DAGs and dependency files.

  1. On the Amazon S3 console, choose the Region where you want to create a bucket.
  2. In the navigation pane, choose Buckets.
  3. Choose Create bucket.
  4. For Bucket type, select General purpose.
  5. For Bucket name, enter a name for your bucket (for this post, mwaa-sso-blog-<your-aws-account-number>).
  6. Choose Create bucket. 

  7. Navigate to the bucket and choose Create folder.
  8. For Folder name, enter a name (for this post, we name the folder dags).
  9. Choose Create folder.


Import certificates into ACM

ACM is integrated with Elastic Load Balancing (ALB). In this step,  you can request a public certificate using ACM or import a certificate into ACM. To import organization certificates linked to a custom DNS into ACM, you must provide the certificate and its private key. To import a certificate signed by a non-AWS Certificate Authority (CA), you must also include the private and public keys of the certificate.

  1. On the ACM console, choose Import certificate in the navigation pane.
  2. For Certificate body, enter the contents of the cert.pem file.
  3. For Certificate private key, enter the contents of the privatekey.pem file.
  4. Choose Next.


  5. Choose Review and import.
  6. Review the metadata about your certificate and choose Import.

After the import is successful, the status of the imported certificate will show as Issued.

Create the Azure AD service, users, groups, and enterprise application

For the SSO integration with Azure, an enterprise application is required, which acts as the IdP for the SAML flow. We add relevant users and groups to the application and configure the SP (Amazon Cognito) details.

Airflow comes with five default roles: Public, Admin, Op, User, Viewer. In this post, we focus on three: Admin , User and Viewer. We create three roles and three corresponding users and assign memberships appropriately.

  1. Log in to the Azure portal.
  2. Navigate to Enterprise applications and choose New application.

  3. Enter a name for your application (for example, mwaa-environment) and choose Create.



    You can now view the details of your application.


    Now you create two groups.

  4. In the search bar, search for Microsoft Entra ID.

  5. On the Add menu, choose Group.

  6. For Group type, choose a type (for this post, Security).
  7. Enter a group name (for example, airflow-admins) and description.
  8. Choose Create.


  9. Repeat these steps to create two more groups, named airflow-users and airflow-viewers.
  10. Note the object IDs for each group (these are required in a later step).


    Next, you create users.
  11. On the Overview page, on the Add menu, choose User and Create new user.
  12. Enter a name for your user (for example, mwaa-user), display name, and password.
  13. Choose Review + create.


  14. Repeat these steps to create a user called mwaa-admin.
  15. In your airflow-users group details page, choose Members in the navigation pane.
  16. Choose Add members.
  17. Search for and select the users you created and choose Select.


  18. Repeat these steps to add the users to each group.

  19. Navigate to your application and choose Assign users and groups.

  20. Choose Add user/group.

  21. Search for and select the groups you created, then choose Select.

 

Deploy the Amazon MWAA environment stack

For this solution, we provide two CloudFormation templates that set up the services illustrated in the architecture. Deploying the CloudFormation stacks in your account incurs AWS usage charges.

The first CloudFormation stack creates the following resources:
  • A VPC with two public subnets and three private subnets and relevant route tables, NAT gateway, internet gateway, and security group
  • VPC endpoints required for the Amazon MWAA environment
  • An Amazon Cognito user pool and user pool domain
  • Application Load Balancer
Deploy the stack by completing the following steps:
  1. Choose Launch Stack to launch the CloudFormation stack.

  2. For Stack name, enter a name (for example, sso-blog-mwaa-infra-stack).

  3.  Enter the following parameters:

    1. For MWAAEnvironmentName, enter the environment name.

    2. For MwaaS3Bucket, enter the S3 artifacts bucket you created.

    3. For VpcCIDR, enter the specify IP range (CIDR notation) for this VPC.

    4. For PrivateSubnet1CIDR, enter the IP range (CIDR notation) for the private subnet in the first Availability Zone.

    5.  For PrivateSubnet2CIDR, enter the IP range (CIDR notation) for the private subnet in the second Availability Zone.

    6. For PrivateSubnet3CIDR, enter the IP range (CIDR notation) for the private subnet in the third Availability Zone.

    7. For PublicSubnet1CIDR, enter the IP range (CIDR notation) for the public subnet in the first Availability Zone.

    8. For PublicSubnet2CIDR, enter the IP range (CIDR notation) for the public subnet in the second Availability Zone.

  4. Choose Next

  5. Review the template and choose Create stack.

After the stack is deployed successfully, you can view the resources on the stack’s Outputs tab on the AWS CloudFormation console. Note the ALB URL, Amazon Cognito user pool ID, and domain.

 

Integrate the Amazon MWAA application with the Azure enterprise application

Next, you configure the SAML configuration in the enterprise application by adding the SP details and redirect URLs (in this case, the Amazon Cognito details and ALB URL).

  1. In the Azure portal, navigate to your environment.
  2. Choose Set up single sign on.
  3. For Identifier, enter urn:amazon:cognito:sp:<your cognito user_id>.
  4. For Reply URL, enter https://<Your user pool domain>/saml2/idpresponse.
  5. For Sign on URL, enter https://<Your application load balancer DNS>.
  6. In the Attributes & Claims section, choose Add a group claim.
  7. Select Security groups.
  8. For Source attribute, choose Group ID.
  9. Choose Save.
  10. Note the values for App Federation Metadata Url and Login URL.


Deploy the ALB stack

When the SAML configuration is complete on the Azure end, the IdP details have to be configured in Amazon Cognito. When users access the ALB URL, they will be authenticated against the corporate identity using SAML through Amazon Cognito. After they’re authenticated, they’re redirected to the Lambda function for authorization against the group they belong to. The user’s group is then validated against matching IAM role. If it’s valid, the Lambda function adds the web login token to the URL, and the user will gain access to the Amazon MWAA environment.

This CloudFormation stack creates the following resources:

  • Two target groups: the Lambda target group and Amazon MWAA target group
  • Listener rules for the ALB to redirect URL requests to the relevant target groups
  • A user pool client and SAML provider (Azure) details to the Amazon Cognito user pool
  • IAM roles for Admin, User, and Viewer personas required for Airflow
  • The Lambda authorizer function to validate the JWT token and map Azure groups to IAM roles for appropriate Airflow UI access

Deploy the stack by completing the following steps:

  1. Choose Launch Stack to launch the CloudFormation stack:
  2. For Stack name, enter a name (for example, sso-blog-mwaa-alb-stack).

  3. Enter the following parameters:

    1. For MWAAEnvironmentName, enter your environment name.

    2. For ALBCertificateArn, enter the certificate ARN required for ALB. 

    3. For AzureAdminGroupID, enter the group name for the Azure Admin persona.

    4. For AzureUserGroupID, enter the group name for the Azure User persona.

    5. For AzureViewerGroupID, enter the group name for the Azure Viewer persona.

    6. For EntraIDLoginURL, enter the Azure IdP URI.

    7. For AppFederationMetadataURL, enter the URL of the metadata file for the SAML provider. 

  4. Choose Next.

  5. Review the template and choose Create stack.

Test the solution

Now that the SAML configuration and relevant AWS services are created, it’s time to access the Amazon MWAA environment.

  1. Open your web browser and enter the ALB DNS name.
    The SP initiates the sign-in request process and the browser redirects you to the Microsoft login page for credentials.
  2. Enter the Admin user credentials.

    The SAML request sign-in process completes and the SAML response is redirected to the Amazon Cognito user pool attached to the ALB.

    The listener rules will validate the query URL and pass the requests to the Lambda authorizer to validate the JWT and assign the appropriate group (Azure) to role (AWS) mapping.


  3. Repeat the steps to log in with User and Viewer credentials and observe the differences in access.

Clean up

When you’re done experimenting with this solution, it’s essential to clean up your resources to avoid incurring AWS charges.

  1. On the AWS CloudFormation console, delete the stacks you created.
  2. Remove the SSM parameters and private webserver and database VPC endpoints (created by the Lambda events function):
    aws ssm delete-parameters --names "MyFirstParameter" "MySecondParameter"
    aws ec2 delete-vpc-endpoints --vpc-endpoint-ids "Endpoint1" "Endpoint2"

  3. Delete the users, groups, and enterprise application in the Azure environment.

Conclusion

In this post, we demonstrated how to integrate Amazon MWAA with organization Azure AD services. We walked through the solution that solves this problem using infrastructure as code. This solution allows different end-user personas in your organization to access the Amazon MWAA Airflow UI using SAML SSO.

For additional details and code examples for Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.


About the Authors

Satya Chikkala is a Solutions Architect at Amazon Web Services. Based in Melbourne, Australia, he works closely with enterprise customers to accelerate their cloud journey. Beyond work, he is very passionate about nature and photography.

Vijay Velpula is a Data Lake Architect with AWS Professional Services. He assists customers in building modern data platforms by implementing big data and analytics solutions. Outside of his professional responsibilities, Velpula enjoys spending quality time with his family, as well as indulging in travel, hiking, and biking activities.

Federating access to Amazon DataZone with AWS IAM Identity Center and Okta

Post Syndicated from Carlos Gallegos original https://aws.amazon.com/blogs/big-data/federating-access-to-amazon-datazone-with-aws-iam-identity-center-and-okta/

Many customers rely today on Okta or other identity providers (IdPs) to federate access to their technology stack and tools. With federation, security teams can centralize user management in a single place, which helps simplify and brings agility to their day-to-day operations while keeping highest security standards.

To help develop a data-driven culture, everyone inside an organization can use Amazon DataZone. To realize the benefits of using Amazon DataZone for governing data and making it discoverable and available across different teams for collaboration, customers integrate it with their current technology stack. Handling access through their identity provider and preserving a familiar single sign-on (SSO) experience enables customers to extend the use of Amazon DataZone to users across teams in the organization without any friction while keeping centralized control.

Amazon DataZone is a fully managed data management service that makes it faster and simpler for customers to catalog, discover, share, and govern data stored across Amazon Web Services (AWS), on premises, and third-party sources. It also makes it simpler for data producers, analysts, and business users to access data throughout an organization so that they can discover, use, and collaborate to derive data-driven insights.

You can use AWS IAM Identity Center to securely create and manage identities for your organization’s workforce, or sync and use identities that are already set up and available in Okta or other identity provider, to keep centralized control of them. With IAM Identity Center you can also manage the SSO experience of your organization centrally, across your AWS accounts and applications.

This post guides you through the process of setting up Okta as an identity provider for signing in users to Amazon DataZone. The process uses IAM Identity Center and its native integration with Amazon DataZone to integrate with external identity providers. Note that, even though this post focuses on Okta, the presented pattern relies on the SAML 2.0 standard and so can be replicated with other identity providers.

Prerequisites

To build the solution presented in this post, you must have:

Process overview

Throughout this post you’ll follow these high-level steps:

  1. Establish a SAML connection between Okta and IAM Identity Center
  2. Set up automatic provisioning of users and groups in IAM Identity Center so that users and groups in the Okta domain are created in Identity Center.
  3. Assign users and groups to your AWS accounts in IAM Identity Center by assuming an AWS Identity and Access Management (IAM) role.
  4. Access the AWS Management Console and Amazon DataZone portal through Okta SSO.
  5. Manage Amazon DataZone specific permissions in the Amazon DataZone portal.

Setting up user federation with Okta and IAM Identity Center

This guide follows the steps in Configure SAML and SCIM with Okta and IAM Identity Center.

Before you get started, review the following items in your Okta setup:

  • Every Okta user must have a First name, Last name, Username and Display name value specified.
  • Each Okta user has only a single value per data attribute, such as email address or phone number. Users that have multiple values will fail to synchronize. If there are users that have multiple values in their attributes, remove the duplicate attributes before attempting to provision the user in IAM Identity Center. For example, only one phone number attribute can be synchronized. Because the default phone number attribute is work phone, use the work phone attribute to store the user’s phone number, even if the phone number for the user is a home phone or a mobile phone.
  • If you update a user’s address you must have streetAddress, city, state, zipCode and the countryCode value specified. If any of these values aren’t specified for the Okta user at the time of synchronization, the user (or changes to the user) won’t be provisioned.

Okta account

1) Establish a SAML connection between Okta and AWS IAM Identity Center

Now, let’s establish a SAML connection between Okta and AWS IAM Identity Center. First, you’ll create an application in Okta to establish the connection:

  1. Sign in to the Okta admin dashboard, expand Applications, then select Applications.
  2. On the Applications page, choose Browse App Catalog.
  3. In the search box, enter AWS IAM Identity Center, then select the app to add the IAM Identity Center app.

IAM identity center app in Okta

  1. Choose the Sign On tab.

IAM identity center app in Okta - sign on

  1. Under SAML Signing Certificates, select Actions, and then select View IdP Metadata. A new browser tab opens showing the document tree of an XML file. Select all of the XML from <md:EntityDescriptor> to </md:EntityDescriptor> and copy it to a text file.
  2. Save the text file as metadata.xml.

Identity provider metadata in Okta

Leave the Okta admin dashboard open, you will continue using it in the later steps.

Second, you’re going to set up Okta as an external identity provider in IAM Identity Center:

  1. Open the IAM Identity Center console as a user with administrative privileges.
  2. Choose Settings in the navigation pane.
  3. On the Settings page, choose Actions, and then select Change identity source.

Identity provider source in IAM identity center

  1. Under Choose identity source, select External identity provider, and then choose Next.

Identity provider source in IAM identity center

  1. Under Configure external identity provider, do the following:
    1. Under Service provider metadata, choose Download metadata file to download the IAM Identity Center metadata file and save it on your system. You will provide the Identity Center SAML metadata file to Okta later in this tutorial.
      1. Copy the following items to a text file for easy access (you’ll need these values later):
        • IAM Identity Center Assertion Consumer Service (ACS) URL
        • IAM Identity Center issuer URL
    2. Under Identity provider metadata, under IdP SAML metadata, choose Choose file and then select the metadata.xml file you created in the previous step.
    3. Choose Next.
  2. After you read the disclaimer and are ready to proceed, enter accept.
  3. Choose Change identity source.

Identity provider source in IAM identity center

Leave the AWS console open, because you will use it in the next procedure.

  1. Return to the Okta admin dashboard and choose the Sign On tab of the IAM Identity Center app, then choose Edit.
  2. Under Advanced Sign-on Settings enter the following:
    1. For ACS URL, enter the value you copied for IAM Identity Center Assertion Consumer Service (ACS) URL.
    2. For Issuer URL, enter the value you copied for IAM Identity Center issuer URL.
    3. For Application username format, select one of the options from the drop-down menu.
      Make sure the value you select is unique for each user. For this tutorial, select Okta username.
  3. Choose Save.

IAM identity center app in Okta - sign on

2) Set up automatic provisioning of users and groups in AWS IAM Identity Center

You are now able to set up automatic provisioning of users from Okta into IAM Identity Center. Leave the Okta admin dashboard open and return to the IAM Identity Center console for the next step.

  1. In the IAM Identity Center console, on the Settings page, locate the Automatic provisioning information box, and then choose Enable. This enables automatic provisioning in IAM Identity Center and displays the necessary System for Cross-domain Identity Management (SCIM) endpoint and access token information.

Automatic provisioning in IAM identity center

  1. In the Inbound automatic provisioning dialog box, copy each of the values for the following options:
    • SCIM endpoint
    • Access token

You will use these values to configure provisioning in Okta later.

  1. Choose Close.

Automatic provisioning in IAM identity center

  1. Return to the Okta admin dashboard and navigate to the IAM Identity Center app.
  2. On the AWS IAM Identity Center app page, choose the Provisioning tab, and then in the navigation pane, under Settings, choose Integration.
  3. Choose Edit, and then select the check box next to Enable API integration to enable provisioning.
  4. Configure Okta with the SCIM provisioning values from IAM Identity Center that you copied earlier:
    1. In the Base URL field, enter the SCIM endpoint Make sure that you remove the trailing forward slash at the end of the URL.
    2. In the API Token field, enter the Access token value.
  5. Choose Test API Credentials to verify the credentials entered are valid. The message AWS IAM Identity Center was verified successfully! displays.
  6. Choose Save. You are taken to the Settings area, with Integration selected.

API Integration in Okta

  1. Review the following setup before moving forward. In the Provisioning tab, in the navigation pane under Settings, choose To App. Check that all options are enabled. They should be enabled by default, but if not, enable them.

Application provision in Okta

3) Assign users and groups to your AWS accounts in AWS IAM Identity Center by assuming an AWS IAM role

By default, no groups nor users are assigned to your Okta IAM Identity Center app. Complete the following steps to synchronize users with IAM Identity Center.

  1. In the Okta IAM Identity Center app page, choose the Assignments tab. You can assign both people and groups to the IAM Identity Center app.
    1. To assign people:
      1. In the Assignments page, choose Assign, and then choose Assign to people.
      2. Select the Okta users that you want to have access to the IAM Identity Center app. Choose Assign, choose Save and Go Back, and then choose Done.
        This starts the process of provisioning the individual users into IAM Identity Center.

      Users assignment in Okta

    1. To assign groups:
      1. Choose the Push Groups tab. You can create rules to automatically provision Okta groups into IAM Identity Center.

      Groups assignment in Okta

      1. Choose the Push Groups drop-down list and select Find groups by rule.
      2. In the By rule section, set a rule name and a condition. For this post we’re using AWS SSO Rule as rule name and starts with awssso as a group name condition. This condition can be different depending on the name of the group you want to sync.
      3. Choose Create Rule

      Okta SSO group rule

      1. (Optional) To create a new group choose Directory in the navigation pane, and then choose Groups.

      Group creation in Okta

      1. Choose Add group and enter a name, and then choose Save.

      Group creation in Okta

      1. After you have created the group, you can assign people to it. Select the group name to manage the group’s users.

      Group user assign in Okta

      1. Choose Assign people and select the users that you want to assign to the group.

      Group user assign in Okta

      1. You will see the users that are assigned to the group.

      Group user assign in Okta

      1. Going back to Applications in the navigation pane, select the AWS IAM Identity Center app and choose the Push Groups tab. You should have the groups that match the rule synchronized between Okta and IAM Identity Center. The group status should be set to Active after the group and its members are updated in Identity Center.

      Active groups in Okta

  1. Return to the IAM Identity Center console. In the navigation pane, choose Users. You should see the user list that was updated by Okta.

Active users in IAM identity center

  1. In the left navigation, select Groups, you should see the group list that was updated by Okta.

Active groups in IAM identity center

Congratulations! You have successfully set up a SAML connection between Okta and AWS and have verified that automatic provisioning is working.

OPTIONAL: If you need to provide Amazon DataZone console access to the Okta users and groups, you can manage these permissions through the IAM Identity Center console.

  1. In the IAM Identity Center navigation pane, under Multi-account permissions, choose AWS accounts.
  2. On the AWS accounts page, the Organizational structure displays your organizational root with your accounts underneath it in the hierarchy. Select the checkbox for your management account, then choose Assign users or groups.

IAM Roles in IAM identity center

  1. The Assign users and groups workflow displays. It consists of three steps:
    1. For Step 1: Select users and groups choose the user that will be performing the administrator job function. Then choose Next.
    2. For Step 2: Select permission sets choose Create permission set to open a new tab that steps you through the three sub-steps involved in creating a permission set.
      1. For Step 1: Select permission set type complete the following:
        • In Permission set type, choose Predefined permission set.
        • In Policy for predefined permission set, choose AdministratorAccess.
      2. Choose Next.
      3. For Step 2: Specify permission set details, keep the default settings, and choose Next.
        The default settings create a permission set named AdministratorAccess with session duration set to one hour. You can also specify reduced permissions with a custom policy just to allow Amazon DataZone console access.
      4. For Step 3: Review and create, verify that the Permission set type uses the AWS managed policy AdministratorAccess or your custom policy. Choose Create. On the Permission sets page, a notification appears informing you that the permission set was created. You can close this tab in your web browser now.
  2. On the Assign users and groups browser tab, you are still on Step 2: Select permission sets from which you started the create permission set workflow.
  3. In the Permissions sets area, Refresh. The AdministratorAccess permission or your custom policy set you created appears in the list. Select the checkbox for that permission set, and then choose Next.

IAM Roles in IAM identity center

    1. For Step 3: Review and submit review the selected user and permission set, then choose Submit.
      The page updates with a message that your AWS account is being configured. Wait until the process completes.
    2. You are returned to the AWS accounts page. A notification message informs you that your AWS account has been re-provisioned, and the updated permission set is applied. When a user signs in, they will have the option of choosing the AdministratorAccess role or a custom policy role.

4) Access the AWS console and Amazon DataZone portal through Okta SSO

Now, you can test your user access into the console and Amazon DataZone portal using the Okta external identity application.

  1. Sign in to the Okta dashboard using a test user account.
  2. Under My Apps, select the AWS IAM Identity Center icon.

IAM identity center access in Okta

  1. Complete the authentication process using your Okta credentials.

IAM identity center access in Okta

4.1) For administrative users

  1. You’re signed in to the portal and can see the AWS account icon. Expand that icon to see the list of AWS accounts that the user can access. In this tutorial, you worked with a single account, so expanding the icon only shows one account.
  2. Select the account to display the permission sets available to the user. In this tutorial you created the AdministratorAccess permission set.
  3. Next to the permission set are links for the type of access available for that permission set. When you created the permission set, you specified both management console and programmatic access be enabled, so those two options are present. Select Management console to open the console.

AWS Management console

  1. The user is signed in to the console. Using the search bar, look for Amazon DataZone service and open it.
  2. Open the Amazon DataZone console and make sure you have enabled SSO users through IAM Identity Center. In case you haven’t, you can follow the steps in Enable IAM Identity Center for Amazon DataZone.

Note: In this post, we followed the default IAM Identity Center for Amazon DataZone configuration, which has implicit user assignment mode enabled. With this option, any user added to your Identity Center directory can access your Amazon DataZone domain automatically. If you opt for using explicit user assignment instead, remember that you need to manually add users to your Amazon DataZone domain in the Amazon DataZone console for them to have access.
To learn more about how to manage user access to an Amazon DataZone domain, see Manage users in the Amazon DataZone console.

  1. Choose the Open data portal to access the Amazon DataZone Portal.

DataZone console

4.2) For all other users

  1. Choose the Applications tab in the AWS access portal window and choose the Amazon DataZone data portal application link.

DataZone application

  1. In the Amazon DataZone data portal, choose SIGN IN WITH SSO to continue

DataZone portal

Congratulations! Now you’re signed in to the Amazon DataZone data portal using your user that’s managed by Okta.

DataZone portal

5) Manage Amazon DataZone specific permissions in the Amazon DataZone portal

After you have access to the Amazon DataZone portal, you can work with projects, the data assets within, environments, and other constructs that are specific to Amazon DataZone. A project is the overarching construct that brings together people, data, and analytics tools. A project has two roles: owner and contributor. Next, you’ll learn how a user can be made an owner or contributor of existing projects.

These steps must be completed by the existing project owner in the Amazon DataZone portal:

  1. Open the Amazon DataZone portal, select the project in the drop-down list on the left top of the portal and choose the project you own

DataZone project

  1. In the project window, choose the Members tab to see the current users in the project and add a new one.

DataZone project members

  1. Choose Add Members to add a new user. Make sure the User type is SSO User to add an Okta user. Look for the Okta user in the name drop-down list, select it, and select a project role for it. Finally, choose Add Members to add the user.

DataZone project members

  1. The Okta user has been granted the selected project role and can interact with the project, assets, and tools.

DataZone project members

  1. You can also grant permissions to SSO Groups. Choose Add members, then select SSO group in the drop-down list, next select the Group name, set the assigned project role, and choose Add Members.

DataZone project members

  1. The Okta group has been granted the project role and can interact with the project, assets, and tools.

DataZone project members

You can also manage SSO user and group access to the Amazon DataZone data portal from the console. See Manage users in the Amazon DataZone console for additional details.

Clean up

To ensure a seamless experience and avoid any future charges, we kindly request that you follow these steps:

By following these steps, you can effectively clean up the resources utilized in this blog post and prevent any unnecessary charges from accruing.

Summary

In this post, you followed a step-by-step guide to set up and use Okta to federate access to Amazon DataZone with AWS IAM Identity Center. You also learned how to group users and manage their permission in Amazon DataZone. As a final thought, now that you’re familiar with the elements involved in the integration of an external identity provider such as Okta to federate access to Amazon DataZone, you’re ready to try it with other identity providers.

To learn more about, see Managing Amazon DataZone domains and user access.


About the Authors

Carlos Gallegos is a Senior Analytics Specialist Solutions Architect at AWS. Based in Austin, TX, US. He’s an experienced and motivated professional with a proven track record of delivering results worldwide. He specializes in architecture, design, migrations, and modernization strategies for complex data and analytics solutions, both on-premises and on the AWS Cloud. Carlos helps customers accelerate their data journey by providing expertise in these areas. Connect with him on LinkedIn.

Jose Romero is a Senior Solutions Architect for Startups at AWS. Based in Austin, TX, US. He’s passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on LinkedIn.

Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, fast-paced, deeply customer-obsessed and uses the working backwards process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.

How to deploy an Amazon OpenSearch cluster to ingest logs from Amazon Security Lake

Post Syndicated from Kevin Low original https://aws.amazon.com/blogs/security/how-to-deploy-an-amazon-opensearch-cluster-to-ingest-logs-from-amazon-security-lake/

January 30, 2025: This post was republished to make the instructions clearer and compatible with OCSF 1.1.


Customers often require multiple log sources across their AWS environment to empower their teams to respond and investigate security events. In part one of this two-part blog post, I show you how you can use Amazon OpenSearch Service to ingest logs collected by Amazon Security Lake to facilitate near real-time monitoring.

Many customers use Security Lake to automatically centralize security data from Amazon Web Services (AWS) environments, software as a service (SaaS) providers, on-premises workloads, and cloud sources into a purpose-built data lake in their AWS environment. OpenSearch Service is a managed service that customers can use to deploy, operate, and scale OpenSearch clusters in the AWS Cloud. It natively integrates with Security Lake to enable customers to perform interactive log analytics and searches across large datasets, create enterprise visualization and dashboards, and perform analysis across disparate applications and logs. With Amazon OpenSearch Security Analytics, customers can also gain visibility into the security posture of their organization’s infrastructure, monitor for anomalous activity, detect potential security threats in near real time, and initiate alerts to pre-configured destinations.

Without using Amazon OpenSearch Service, customers would need to build, deploy and manage infrastructure for an analytics solution, such as an ELK stack.

Prerequisites

Security Lake should already be deployed. For details on how to deploy Security Lake, see Getting started with Amazon Security Lake. You will need AWS Identity and Access Management (IAM) permissions to manage Security Lake, OpenSearch Service, Amazon Cognito, AWS Secrets Manager, and Amazon Elastic Compute Cloud (Amazon EC2), and to create IAM roles to follow along with this post. The solution can be deployed in any AWS Region that has at least 3 Availability Zones, supports Security Lake, OpenSearch, and OpenSearch Ingestion.

Solution overview

The architecture diagram in Figure 1 shows the completed architecture of the solution.

  1. The OpenSearch Service cluster is deployed within a virtual private cloud (VPC) across three Availability Zones for high availability.
  2. The OpenSearch Service cluster ingests logs from Security Lake using an OpenSearch Ingestion pipeline.
  3. The cluster is accessed by end users through a public-facing proxy hosted on an Amazon EC2 instance.
    1. To reduce costs, the template doesn’t deploy a dead letter queue (DLQ) for the OpenSearch Ingestion pipeline. You can add one later if you want.
    2. Instead of a public facing proxy, you can deploy a VPN to access your cluster.
  4. Authentication to the cluster is managed with Amazon Cognito.

Figure 1: Solution architecture

Figure 1: Solution architecture

Planning the deployment

This section will help you plan your OpenSearch service deployment, including what nodes you should choose, the amount of storage to allocate, and where to deploy the cluster.

Deciding instances for the OpenSearch Service master and data nodes

First, determine what instance type to use for the master and data nodes. If your workload generates less than 100 GB of Security Lake logs per day, we recommend using three m6g.large.search master nodes and three r6g.large.search data nodes. You can start small and scale up or scale out later. For more information about deciding the size and number of instances, see Get started with Amazon OpenSearch Service. Note the instance types that you have selected on a text editor because you will use this as an input for the AWS CloudFormation template that you will deploy later.

Configuring storage

To optimize your storage costs, you need to plan your data strategy. In this architecture, Security Lake is used for long-term log storage. Because Security Lake uses Amazon Simple Storage Service (Amazon S3), you can optimize long-term storage costs. You can configure OpenSearch Service to ingest priority logs based on the recent data that you can use for near-real time detection and alerting. Your team can query logs in Security Lake using its Zero-ETL integration with OpenSearch Service to analyze older logs.

Therefore, Security Lake should serve as your primary long-term log storage, with OpenSearch Service storing only the most recent logs.

The number of days of logs in OpenSearch Service will depend on how many days’ worth of data you need to investigate at a given time. I recommend storing 15 days of data in OpenSearch Service. This allows you to react to and investigate the most immediate security events while optimizing storage costs for older logs.

The next step is to determine the volume of logs generated by Security Lake.

  1. Sign in to the Security Lake delegated administrator account.
  2. Go to the AWS Management Console for Security Lake. Choose Usage in the navigation pane.
  3. On the Usage screen, select Last 30 days as the range of usage.
  4. Add the total Actual usage for the last 30 days for the data sources that you intend to send to OpenSearch. If you have used Security Lake for less than 30 days, you can use the Total predicted usage per month. Divide this figure by 30 to get the daily data volume.

Figure 2: Select range of usage

Figure 2: Select range of usage

To determine the total storage needed, multiply the data generated by Security Lake per day by the retention period you chose, then by 1.1 to account for the indexes, then multiply that number by 1.15 for overhead storage. For more information about calculating storage, see Get started with Amazon OpenSearch Service.

To determine the amount of Amazon Elastic Block Store (Amazon EBS) storage that you need per node, take the total amount of storage and divide it by the number of nodes that you have. Round that number up to the nearest whole number. You can increase the amount of storage after deployment when you have a better understanding of your workload. Make a note of this number in a text editor because you’ll use it as an input in the CloudFormation template later.

Example 1: 10 GB of Security Lake logs generated per day, stored for 30 days in OpenSearch Service in three nodes

  • 10 GB of Security Lake logs stored for 30 days = 10 GB * 30 = 300 GB
  • Account for additional space for indexes and overhead space = 300 GB * 1.1 * 1.15 = 379.5 GB
  • Divide the storage required across three nodes, rounded up = 379.5/3 ≈ 127 GB per node
  • You would need 127 GB per node in OpenSearch Service

Example 2: 200 GB of Security Lake logs generated per day, stored for 15 days in OpenSearch Service across six nodes

  • 200 GB of Security Lake logs stored for 15 days = 200 GB * 15 = 3000 GB
  • Account for additional space for indexes and overhead space = 3000 GB * 1.1 * 1.15 = 3795 GB
  • Divide the storage required across three nodes, rounded up = 3795/6 ≈ 633 GB per node
  • You would need 633 GB per node in OpenSearch Service

Where to deploy the cluster?

If you have an AWS Control Tower deployment or have a deployment modelled after the AWS Security Reference Architecture (AWS SRA), Security Lake should be deployed in the Log Archive account. Because security best practices recommend that the Log Archive account should not be frequently accessed, the OpenSearch Service cluster should be deployed into your Audit account or Security Tooling account.

You need to deploy your Security Lake subscriber in the same Region as your Security Lake roll-up Region. If you have more than one roll-up Region, choose the Region that collects logs from the Regions you want to monitor.

Your cluster needs to be deployed in the same Region as your Security Lake subscriber be able to access data.

Setting up the Security Lake subscriber

Before deploying the solution, create a Security Lake subscriber in your Security Lake roll-up Region so that OpenSearch Service can access data from Amazon Security Lake.

  1. Access the Security Lake console in your Log Archive account.
  2. Choose Subscribers in the navigation pane.
  3. Choose Create subscriber.
  4. On the Create subscriber page, enter a name, such as OpenSearch-subscriber.
  5. Under Data Access, select Under S3 notification type, select SQS queue.
  6. Under Subscriber credentials, enter the AWS account ID for the account you plan to deploy the OpenSearch cluster to, which should be your Security Tooling
  7. Enter OpenSearchIngestion-<AWS account ID> under External ID.

    Figure 3: Configuring the Security Lake subscriber

    Figure 3: Configuring the Security Lake subscriber

  8. Leave All log and event sources selected and choose Create.

After the subscriber has been created, you will need to collect information to facilitate the deployment.

To gather necessary information:

  1. Select the subscriber that you just created.
  2. Derive the S3 bucket name from the S3 bucket ARN and store it in a text editor. The Amazon Resource Name (ARN) is formatted as arn:aws:s3:::<bucket name>. The bucket name should look like aws-security-data-lake-<region>-xxxxx.

    Figure 4: Derive the S3 bucket name from the Subscriber details page

    Figure 4: Derive the S3 bucket name from the Subscriber details page

  3. Go to the Amazon Simple Queue Service (Amazon SQS) console and select the SQS queue created as part of the Security Lake subscriber. It should look like AmazonSecurityLake-xxxxxxxxx-Main-Queue. Note the queue’s ARN and URL in your text editor.

    Figure 5: Relevant details from the SQS queue

    Figure 5: Relevant details from the SQS queue

Deploy the solution

To deploy the solution in your Security Tooling account, use a CloudFormation template. This template deploys the OpenSearch Service cluster, OpenSearch Ingestion pipeline, and an AWS Lambda function to initialize the cluster.

To deploy the OpenSearch cluster:

  1. To deploy the CloudFormation template that builds the OpenSearch service cluster, select the Launch Stack button.

    Select this image to open a link that starts building the CloudFormation stack

  2. In the CloudFormation console, make sure that you are in the correct AWS account. You should be in your Security Tooling account. Also make sure that you have selected the same Region as your Security Lake subscriber.
  3. Enter a name for your stack. A name like os-stack-<day>-<month> can help you keep track of deployments.
  4. Enter the instance types and Amazon EBS volume size that you noted earlier.
  5. Enter the IP address range that you want to allow to access the proxy’s security group. You should limit this to your corporate IP range. You can set it as 0.0.0/0 if you want to expose it to the public internet.
  6. Fill in the details of the Security Lake bucket and the subscriber Amazon SQS queue ARN, URL, and Region.

    Figure 6: Add stack parameters

    Figure 6: Add stack parameters

  7. Check the acknowledgements in the Capabilities section.
  8. Choose Create stack to begin deploying the resources.
  9. It will take 20–30 minutes to deploy the multiple nested templates. Wait for the main stack (not the nested ones) to achieve the CREATE_COMPLETE status before proceeding to the next step.

    Note: If you encounter failures while deployment, you can download the CloudFormation file here and select Preserve successfully provisioned resources under Stack failure options while deploying. This will allow you to troubleshoot the stack deployment.

  10. Go to the Outputs pane of the main CloudFormation stack. Save the DashboardsProxyURL, OpenSearchInitRoleARN, and PipelineRole values in a text editor to refer to later.

    Figure 7: The stacks in the CREATE_COMPLETE state with the outputs panel shown

    Figure 7: The stacks in the CREATE_COMPLETE state with the outputs panel shown

  11. Open the DashboardsProxyURL value in a new tab.

    Note: Because the proxy relies on a self-signed certificate, you will get an insecure certificate warning. You can safely ignore this warning and proceed. For a production workload, you should issue a trusted private certificate from your internal public key infrastructure or use AWS Private Certificate Authority.

  12. You will be presented with the Amazon Cognito sign-in page. Use administrator as the username.
  13. Access Secrets Manager to find the password. Select the secret that was created as part of the stack.

    Figure 9: Retrieve the secret value

    Figure 8: The Cognito password in Secrets Manager

  14. Choose Retrieve secret value to get the password.

    Figure 9: Retrieve the secret value

    Figure 9: Retrieve the secret value

  15. After signing in, you will be prompted to change your password and will be redirected to the OpenSearch dashboard.
  16. If you see a pop-up that states Start by adding your own data, select Explore on my own. On the next page, Introducing new OpenSearch Dashboards look & feel, choose Dismiss.
  17. If you see a pop-up that states Select your tenant, select Global, and then choose Confirm.

    Figure 10: Select and confirm your tenant

    Figure 10: Select and confirm your tenant

To initialize the OpenSearch cluster:

  1. Choose the menu icon (three stacked horizontal lines) on the top left and select Security under the Management section.

    Figure 11: Navigating to the Security page in the OpenSearch console

    Figure 11: Navigating to the Security page in the OpenSearch console

  2. Select Roles. On the Roles page, search for the all_access role and select it.
  3. Select Mapped users, and then select Manage mapping.
  4. On the Map user screen, choose Add another backend role. Paste the value for the OpenSearchInitRoleARN from the list of CloudFormation outputs. Choose Map.

    Figure 12: Mapping the role on the Security page in the OpenSearch console

    Figure 12: Mapping the role on the Security page in the OpenSearch console

  5. Leave this tab open and return to the AWS Management console. Go to the AWS Lambda console and select the function named xxxxxx-OS_INIT.
  6. In the function screen, choose Test, and then Create new test event.

    Figure 13: Creating the test event in the Lambda console

    Figure 13: Creating the test event in the Lambda console

  7. Choose Invoke. The function should run for about 30 seconds. The execution results should show the component templates that have been created. This Lambda function creates the component and index templates to ingest Open Cybersecurity Framework (OCSF) formatted data, a set of indices and aliases that correspond with the OCSF classes generated by Security Lake, and a rollover policy that will rollover the index daily or if it becomes larger than 40 GB.

    Figure 14: Invoking the Lambda function in the Lambda console

    Figure 14: Invoking the Lambda function in the Lambda console

To set up the pipeline

  1. Return to the Map user page on the OpenSearch console.
  2. Choose Add another backend role. Paste the value of the PipelineRole from the CloudFormation template output. Choose This will allow the OpenSearch Ingestion to write to the cluster.

    Figure 15: Mapping the OpenSearch Ingestion role

    Figure 15: Mapping the OpenSearch Ingestion role

  3. Access the Amazon S3 console in the Log Archive account where Security Lake is hosted.
  4. Select the Security Lake bucket in your roll-up Region. It should look like aws-security-data-lake-region-xxxxxxxxxx.
  5. Choose Permissions, then Edit under Bucket policy.
  6. Add this policy to the end of the existing bucket policy. Replace the Principal with the ARN of the PipelineRole and the name of your Security Lake bucket in the Resource section.
    {
                "Sid": "Cross Account Permissions",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "<Pipeline role ARN>"
                },
                "Action": "s3:*",
                "Resource": [
                    "arn:aws:s3:::<Security Lake bucket name>/*",
                    "arn:aws:s3:::<Security Lake bucket name>"
                ]
            }

    Figure 16: The modified S3 bucket access policy

    Figure 16: The modified S3 bucket access policy

  7. Choose Save changes.

To upload the index patterns and dashboards

  1. Download the Security-lake-objects.ndjson file by right-clicking on this link and selecting Save link as.
  2. Access the Dashboards Management page through the navigation menu.
  3. Choose Saved objects in the navigation pane.
  4. On the Saved Objects page, choose Import on the right side of the screen.

    Figure 17: Import saved objects

    Figure 17: Import saved objects

  5. Choose Import and select the Security-lake-objects.ndjson file that you downloaded previously.
  6. Leave Create new objects with unique IDs selected and choose Import.
  7. You can now view the ingested logs on the Discover page and visualizations on the Dashboards page, which you can find on the navigation bar.

    Figure 18: The Discover page displaying ingested logs

    Figure 18: The Discover page displaying ingested logs

Clean up

To avoid unwanted charges, delete the main CloudFormation template, named os-stack-<day>-<month> (not the nested stacks).

Figure 19: Select the main stack in the CloudFormation console

Figure 19: Select the main stack in the CloudFormation console

Modify the Security Lake bucket policy in the logging account to remove the section you added that trusted the PipelineRole. Be careful not to modify the rest of the policy because it could impact the functioning of Security Lake and other subscribers.

Figure 20: The S3 bucket policy with the relevant sections that needed to be deleted

Figure 20: The S3 bucket policy with the relevant sections that needed to be deleted

Conclusion

In this post, you learned how to plan an OpenSearch deployment with Amazon OpenSearch Service to ingest logs from Amazon Security Lake. With this solution, you’re able to aggregate and manage logs with Security Lake and visualize and monitor those logs with OpenSearch Service. After deployment, monitor the OpenSearch Service metrics to determine if you need to scale this up or out for improved performance. In part 2, I will show you how to set up the Security Analytics detector to generate alerts to security findings in near-real time.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Kevin Low
Kevin Low

Kevin is a Security Solutions Architect at AWS who helps the largest customers across ASEAN build securely. He specializes in threat detection and incident response and is passionate about integrating resilience and security. Outside of work, he loves spending time with his wife and dog, a poodle called Noodle.

Get started with the new Amazon DataZone enhancements for Amazon Redshift

Post Syndicated from Carmen Manzulli original https://aws.amazon.com/blogs/big-data/get-started-with-the-new-amazon-datazone-enhancements-for-amazon-redshift/

In today’s data-driven landscape, organizations are seeking ways to streamline their data management processes and unlock the full potential of their data assets, while controlling access and enforcing governance. That’s why we introduced Amazon DataZone.

Amazon DataZone is a powerful data management service that empowers data engineers, data scientists, product managers, analysts, and business users to seamlessly catalog, discover, analyze, and govern data across organizational boundaries, AWS accounts, data lakes, and data warehouses.

On March 21, 2024, Amazon DataZone introduced several exciting enhancements to its Amazon Redshift integration that simplify the process of publishing and subscribing to data warehouse assets like tables and views, while enabling Amazon Redshift customers to take advantage of the data management and governance capabilities or Amazon DataZone.

These updates empower the experience for both data users and administrators.

Data producers and consumers can now quickly create data warehouse environments using preconfigured credentials and connection parameters provided by their Amazon DataZone administrators.

Additionally, these enhancements grant administrators greater control over who can access and use the resources within their AWS accounts and Redshift clusters, and for what purpose.

As an administrator, you can now create parameter sets on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and an AWS secret. You can use these parameter sets to create environment profiles and authorize Amazon DataZone projects to use these environment profiles for creating environments.

In turn, data producers and data consumers can now select an environment profile to create environments without having to provide the parameters themselves, saving time and reducing the risk of issues.

In this post, we explain how you can use these enhancements to the Amazon Redshift integration to publish your Redshift tables to the Amazon DataZone data catalog, and enable users across the organization to discover and access them in a self-service fashion. We present a sample end-to-end customer workflow that covers the core functionalities of Amazon DataZone, and include a step-by-step guide of how you can implement this workflow.

The same workflow is available as video demonstration on the Amazon DataZone official YouTube channel.

Solution overview

To get started with the new Amazon Redshift integration enhancements, consider the following scenario:

  • A sales team acts as the data producer, owning and publishing product sales data (a single table in a Redshift cluster called catalog_sales)
  • A marketing team acts as the data consumer, needing access to the sales data in order to analyze it and build product adoption campaigns

At a high level, the steps we walk you through in the following sections include tasks for the Amazon DataZone administrator, Sales team, and Marketing team.

Prerequisites

For the workflow described in this post, we assume a single AWS account, a single AWS Region, and a single AWS Identity and Access Management (IAM) user, who will act as Amazon DataZone administrator, Sales team (producer), and Marketing team (consumer).

To follow along, you need an AWS account. If you don’t have an account, you can create one.

In addition, you must have the following resources configured in your account:

  • An Amazon DataZone domain with admin, sales, and marketing projects
  • A Redshift namespace and workgroup

If you don’t have these resources already configured, you can create them by deploying an AWS CloudFormation stack:

  1. Choose Launch Stack to deploy the provided CloudFormation template.
  2. For AdminUserPassword, enter a password, and take note of this password to use in later steps.
  3. Leave the remaining settings as default.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.
  5. When the stack deployment is complete, on the Amazon DataZone console, choose View domains in the navigation pane to see the new created Amazon DataZone domain.
  6. On the Amazon Redshift Serverless console, in the navigation pane, choose Workgroup configuration and see the new created resource.

You should be logged in using the same role that you used to deploy the CloudFormation stack and verify that you’re in the same Region.

As a final prerequisite, you need to create a catalog_sales table in the default Redshift database (dev).

  1. On the Amazon Redshift Serverless console, selected your workgroup and choose Query data to open the Amazon Redshift query editor.
  2. In the query editor, choose your workgroup and select Database user name and password as the type of connection, then provide your admin database user name and password.
  3. Use the following query to create the catalog_sales table, which the Sales team will publish in the workflow:
    CREATE TABLE catalog_sales AS 
    SELECT 146776932 AS order_number, 23 AS quantity, 23.4 AS wholesale_cost, 45.0 as list_price, 43.0 as sales_price, 2.0 as discount, 12 as ship_mode_sk,13 as warehouse_sk, 23 as item_sk, 34 as catalog_page_sk, 232 as ship_customer_sk, 4556 as bill_customer_sk
    UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
    UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
    UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
    UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
    UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
    UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
    UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
    UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
    UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
    UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

Now you’re ready to get started with the new Amazon Redshift integration enhancements.

Amazon DataZone administrator tasks

As the Amazon DataZone administrator, you perform the following tasks:

  1. Configure the DefaultDataWarehouseBlueprint.
    • Authorize the Amazon DataZone admin project to use the blueprint to create environment profiles.
    • Create a parameter set on top of DefaultDataWarehouseBlueprint by providing parameters such as cluster, database, and AWS secret.
  2. Set up environment profiles for the Sales and Marketing teams.

Configure the DefaultDataWarehouseBlueprint

Amazon DataZone blueprints define what AWS tools and services are provisioned to be used within an Amazon DataZone environment. Enabling the data warehouse blueprint will allow data consumers and data producers to use Amazon Redshift and the Query Editor for data sharing, accessing, and consuming.

  1. On the Amazon DataZone console, choose View domains in the navigation pane.
  2. Choose your Amazon DataZone domain.
  3. Choose Default Data Warehouse.

If you used the CloudFormation template, the blueprint is already enabled.

Part of the new Amazon Redshift experience involves the Managing projects and Parameter sets tabs. The Managing projects tab lists the projects that are allowed to create environment profiles using the data warehouse blueprint. By default, this is set to all projects. For our purpose, let’s grant only the admin project.

  1. On the Managing projects tab, choose Edit.

  1. Select Restrict to only managing projects and choose the AdminPRJ project.
  2. Choose Save changes.

With this enhancement, the administrator can control which projects can use default blueprints in their account to create environment profile

The Parameter sets tab lists parameters that you can create on top of DefaultDataWarehouseBlueprint by providing parameters such as Redshift cluster or Redshift Serverless workgroup name, database name, and the credentials that allow Amazon DataZone to connect to your cluster or workgroup. You can also create AWS secrets on the Amazon DataZone console. Before these enhancements, AWS secrets had to be managed separately using AWS Secrets Manager, making sure to include the proper tags (key-value) for Amazon Redshift Serverless.

For our scenario, we need to create a parameter set to connect a Redshift Serverless workgroup containing sales data.

  1. On the Parameter sets tab, choose Create parameter set.
  2. Enter a name and optional description for the parameter set.
  3. Choose the Region containing the resource you want to connect to (for example, our workgroup is in us-east-1).
  4. In the Environment parameters section, select Amazon Redshift Serverless.

If you already have an AWS secret with credentials to your Redshift Serverless workgroup, you can provide the existing AWS secret ARN. In this case, the secret must be tagged with the following (key-value): AmazonDataZoneDomain: <Amazon DataZone domain ID>.

  1. Because we don’t have an existing AWS secret, we create a new one by choosing Create new AWS Secret.
  2. In the pop-up, enter a secret name and your Amazon Redshift credentials, then choose Create new AWS Secret.

Amazon DataZone creates a new secret using Secrets Manager and makes sure the secret is tagged with the domain in which you’re creating the parameter set.

  1. Enter the Redshift Serverless workgroup name and database name to complete the parameters list. If you used the provided CloudFormation template, use sales-workgroup for the workgroup name and dev for the database name.
  2. Choose Create parameter set.

You can see the parameter set created for your Redshift environment and the blueprint enabled with a single managing project configured.

 

Set up environment profiles for the Sales and Marketing teams

Environment profiles are predefined templates that encapsulate technical details required to create an environment, such as the AWS account, Region, and resources and tools to be added to projects. The next Amazon DataZone administrator task consists of setting up environment profiles, based on the default enabled blueprint, for the Sales and Marketing teams.

This task will be performed from the admin project in the Amazon DataZone data portal, so let’s follow the data portal URL and start creating an environment profile for the Sales team to publish their data.

  1. On the details page of your Amazon DataZone domain, in the Summary section, choose the link for your data portal URL.

When you open the data portal for the first time, you’re prompted to create a project. If you used the provided CloudFormation template, the projects are already created.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, SalesEnvProfile) and optional description (for example, Sales DWH Environment Profile) for the new environment profile.
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint (you’ll only see blueprints where the admin project is listed as a managing project).
  6. Choose the current enabled account and the parameter set you previously created.

Then you will see each pre-compiled value for Redshift Serverless. Under Authorized projects, you can pick the authorized projects allowed to use this environment profile to create an environment. By default, this is set to All projects.

  1. Select Authorized projects only.
  2. Choose Add projects and choose the SalesPRJ project.
  3. Configure the publishing permissions for this environment profile. Because the Sales team is our data producer, we select Publish from any schema.
  4. Choose Create environment profile.

Next, you create a second environment profile for the Marketing team to consume data. To do this, you repeat similar steps made for the Sales team.

  1. Choose the AdminPRJ project.
  2. On the Environments page, choose Create environment profile.
  3. Enter a name (for example, MarketingEnvProfile) and optional description (for example, Marketing DWH Environment Profile).
  4. For Owner, choose AdminPRJ.
  5. For Blueprint, select the DefaultDataWarehouse blueprint.
  6. Select the parameter set you created earlier.
  7. This time, keep All projects as the default (alternatively, you could select Authorized projects only and add MarketingPRJ).
  8. Configure the publishing permissions for this environment profile. Because the Marketing team is our data consumer, we select Don’t allow publishing.
  9. Choose Create environment profile.

With these two environment profiles in place, the Sales and Marketing teams can start working on their projects on their own to create their proper environments (resources and tools) with fewer configurations and less risk to incur errors, and publish and consume data securely and efficiently within these environments.

To recap, the new enhancements offer the following features:

  • When creating an environment profile, you can choose to provide your own Amazon Redshift parameters or use one of the parameter sets from the blueprint configuration. If you choose to use the parameter set created in the blueprint configuration, the AWS secret only requires the AmazonDataZoneDomain tag (the AmazonDataZoneProject tag is only required if you choose to provide your own parameter sets in the environment profile).
  • In the environment profile, you can specify a list of authorized projects, so that only authorized projects can use this environment profile to create data warehouse environments.
  • You can also specify what data authorized projects are allowed to be published. You can choose one of the following options: Publish from any schema, Publish from the default environment schema, and Don’t allow publishing.

These enhancements grant administrators more control over Amazon DataZone resources and projects and facilitate the common activities of all roles involved.

Sales team tasks

As a data producer, the Sales team performs the following tasks:

  1. Create a sales environment.
  2. Create a data source.
  3. Publish sales data to the Amazon DataZone data catalog.

Create a sales environment

Now that you have an environment profile, you need to create an environment in order to work with data and analytics tools in this project.

  1. Choose the SalesPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, SalesDwhEnv) and optional description (for example, Environment DWH for Sales) for the new environment.
  4. For Environment profile, choose SalesEnvProfile.

Data producers can now select an environment profile to create environments, without the need to provide their own Amazon Redshift parameters. The AWS secret, Region, workgroup, and database are ported over to the environment from the environment profile, streamlining and simplifying the experience for Amazon DataZone users.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

The environment will be automatically provisioned by Amazon DataZone with the preconfigured credentials and connection parameters, allowing the Sales team to publish Amazon Redshift tables seamlessly.

Create a data source

Now, let’s create a new data source for our sales data.

  1. Choose the SalesPRJ project.
  2. On the Data page, choose Create data source.
  3. Enter a name (for example, SalesDataSource) and optional description.
  4. For Data source type, select Amazon Redshift.
  5. For Environment¸ choose SalesDevEnv.
  6. For Redshift credentials, you can use the same credentials you provided during environment creation, because you’re still using the same Redshift Serverless workgroup.
  7. Under Data Selection, enter the schema name where your data is located (for example, public) and then specify a table selection criterion (for example, *).

Here, the * indicates that this data source will bring into Amazon DataZone all the technical metadata from the database tables of your schema (in this case, a single table called catalog_sales).

  1. Choose Next.

On the next page, automated metadata generation is enabled. This means that Amazon DataZone will automatically generate the business names of the table and columns for that asset. 

  1. Leave the settings as default and choose Next.
  2. For Run preference, select when to run the data source. Amazon DataZone can automatically publish these assets to the data catalog, but let’s select Run on demand so we can curate the metadata before publishing.
  3. Choose Next.
  4. Review all settings and choose Create data source.
  5. After the data source has been created, you can manually pull technical metadata from the Redshift Serverless workgroup by choosing Run.

When the data source has finished running, you can see the catalog_sales asset correctly added to the inventory.

Publish sales data to the Amazon DataZone data catalog

Open the catalog_sales asset to see details of the new asset (business metadata, technical metadata, and so on).

In a real-world scenario, this pre-publishing phase is when you can enrich the asset providing more business context and information, such as a readme, glossaries, or metadata forms. For example, you can start accepting some metadata automatically generated recommendations and rename the asset or its columns in order to make them more readable, descriptive, and easy to search and understand from a business user.

For this post, simply choose Publish asset to complete the Sales team tasks.

Marketing team tasks

Let’s switch to the Marketing team and subscribe to the catalog_sales asset published by the Sales team. As a consumer team, the Marketing team will complete the following tasks:

  1. Create a marketing environment.
  2. Discover and subscribe to sales data.
  3. Query the data in Amazon Redshift.

Create a marketing environment

To subscribe and access Amazon DataZone assets, the Marketing team needs to create an environment.

  1. Choose the MarketingPRJ project.
  2. On the Environments page, choose Create environment.
  3. Enter a name (for example, MarketingDwhEnv) and optional description (for example, Environment DWH for Marketing).
  4. For Environment profile, choose MarketingEnvProfile.

As with data producers, data consumers can also benefit from a pre-configured profile (created and managed by the administrator) in order to speed up the environment creation process, avoiding mistakes and reducing risks of errors.

  1. Review your data warehouse parameters to confirm everything is correct.
  2. Choose Create environment.

Discover and subscribe to sales data

Now that we have a consumer environment, let’s search the catalog_sales table in the Amazon DataZone data catalog.

  1. Enter sales in the search bar.
  2. Choose the catalog_sales table.
  3. Choose Subscribe.
  4. In the pop-up window, choose your marketing consumer project, provide a reason for the subscription request, and choose Subscribe.

When you get a subscription request as a data producer, Amazon DataZone will notify you through a task in the sales producer project. Because you’re acting as both subscriber and publisher here, you will see a notification.

  1. Choose the notification, which will open the subscription request.

You can see details including which project has requested access, who is the requestor, and why access is needed.

  1. To approve, enter a message for approval and choose Approve.

Now that subscription has been approved, let’s go back to the MarketingPRJ. On the Subscribed data page, catalog_sales is listed as an approved asset, but access hasn’t been granted yet. If we choose the asset, you can see that Amazon DataZone is working on the backend to automatically grant the access. When it’s complete, you’ll see the subscription as granted and the message “Asset added to 1 environment.”

Query data in Amazon Redshift

Now that the marketing project has access to the sales data, we can use the Amazon Redshift Query Editor V2 to analyze the sales data.

  1. Under MarketingPRJ, go to the Environments page and select the marketing environment.
  2. Under the analytics tools, choose Query data with Amazon Redshift, which redirects you to the query editor within the environment of the project.
  3. To connect to Amazon Redshift, choose your workgroup and select Federated user as the connection type.

When you’re connected, you will see the catalog_sales table under the public schema.

  1. To make sure that you have access to this table, run the following query:
SELECT * FROM catalog_sales LIMIT 10

As a consumer, you’re now able to explore data and create reports, or you can aggregate data and create new assets to publish in Amazon DataZone, becoming a producer of a new data product to share with other users and departments.

Clean up

To clean up your resources, complete the following steps:

  1. On the Amazon DataZone console, delete the projects used in this post. This will delete most project-related objects like data assets and environments.
  2. Clean up all Amazon Redshift resources (workgroup and namespace) to avoid incurring additional charges.

Conclusion

In this post, we demonstrated how you can get started with the new Amazon Redshift integration in Amazon DataZone. We showed how to streamline the experience for data producers and consumers and how to grant administrators control over data resources.

Embrace these enhancements and unlock the full potential of Amazon DataZone and Amazon Redshift for your data management needs.

Resources

For more information, refer to the following resources:

 


About the author

Carmen is a Solutions Architect at AWS, based in Milan (Italy). She is a Data Lover that enjoys helping companies in the adoption of Cloud technologies, especially with Data Analytics and Data Governance. Outside of work, she is a creative people who loves being in contact with nature and sometimes practicing adrenaline activities.

Monitoring Apache Iceberg metadata layer using AWS Lambda, AWS Glue, and AWS CloudWatch

Post Syndicated from Michael Greenshtein original https://aws.amazon.com/blogs/big-data/monitoring-apache-iceberg-metadata-layer-using-aws-lambda-aws-glue-and-aws-cloudwatch/

In the era of big data, data lakes have emerged as a cornerstone for storing vast amounts of raw data in its native format. They support structured, semi-structured, and unstructured data, offering a flexible and scalable environment for data ingestion from multiple sources. Data lakes provide a unified repository for organizations to store and use large volumes of data. This enables more informed decision-making and innovative insights through various analytics and machine learning applications.

Despite their advantages, traditional data lake architectures often grapple with challenges such as understanding deviations from the most optimal state of the table over time, identifying issues in data pipelines, and monitoring a large number of tables. As data volumes grow, the complexity of maintaining operational excellence also increases. Monitoring and tracking issues in the data management lifecycle are essential for achieving operational excellence in data lakes.

This is where Apache Iceberg comes into play, offering a new approach to data lake management. Apache Iceberg is an open table format designed specifically to improve the performance, reliability, and scalability of data lakes. It addresses many of the shortcomings of traditional data lakes by providing features such as ACID transactions, schema evolution, row-level updates and deletes, and time travel.

In this blog post, we’ll discuss how the metadata layer of Apache Iceberg can be used to make data lakes more efficient. You will learn about an open-source solution that can collect important metrics from the Iceberg metadata layer. Based on collected metrics, we will provide recommendations on how to improve the efficiency of Iceberg tables. Additionally, you will learn how to use Amazon CloudWatch anomaly detection feature to detect ingestion issues.

Deep dive into Iceberg’s Metadata layer

Before diving into a solution, let’s understand how the Apache Iceberg metadata layer works. The Iceberg metadata layer provides an open specification instructing integrated big data engines such as Spark or Trino how to run read and write operations and how to resolve concurrency issues. It’s crucial for maintaining inter-operability between different engines. It stores detailed information about tables such as schema, partitioning, and file organization in versioned JSON and Avro files. This ensures that each change is tracked and reversible, enhancing data governance and auditability.

Apache Iceberg metadata layer architecture diagram

History and versioning: Iceberg’s versioning feature captures every change in table metadata as immutable snapshots, facilitating data integrity, historical views, and rollbacks.

File organization and snapshot management: Metadata closely manages data files, detailing file paths, formats, and partitions, supporting multiple file formats like Parquet, Avro, and ORC. This organization helps with efficient data retrieval through predicate pushdown, minimizing unnecessary data scans. Snapshot management allows concurrent data operations without interference, maintaining data consistency across transactions.

In addition to its core metadata management capabilities, Apache Iceberg also provides specialized metadata tables—snapshots, files, and partitions—that provide deeper insights and control over data management processes. These tables are dynamically generated and provide a live view of the metadata for query purposes, facilitating advanced data operations:

  • Snapshots table: This table lists all snapshots of a table, including snapshot IDs, timestamps, and operation types. It enables users to track changes over time and manage version history effectively.
  • Files table: The files table provides detailed information on each file in the table, including file paths, sizes, and partition values. It is essential for optimizing read and write performance.
  • Partitions table: This table shows how data is partitioned across different files and provides statistics for each partition, which is crucial for understanding and optimizing data distribution.

Metadata tables enhance Iceberg’s functionality by making metadata queries straightforward and efficient. Using these tables, data teams can gain precise control over data snapshots, file management, and partition strategies, further improving data system reliability and performance.

Before you get started

The next section describes a packaged open source solution using Apache Iceberg’s metadata layer and AWS services to enhance monitoring across your Iceberg tables.

Before we deep dive into the suggested solution, let’s mention Iceberg MetricsReporter, which is a native way to emit metrics for Apache Iceberg. It supports two types of reports: one for commits and one for scans. The default output is log based. It produces log files as a result of commit or scan operations. To submit metrics to CloudWatch or any other monitoring tool, users need to create and configure a custom MetricsReporter implementation. MetricsReporter is supported in Apache Iceberg v1.1.0 and later versions, and customers who want to use it must enable it through Spark configuration on their existing pipelines.

The following is deployed independently and doesn’t require any configuration changes to existing data pipelines. It can immediately start monitoring all the tables within the AWS account and AWS Region where it’s deployed. This solution introduces an additional latency of metrics arrival between 20 and 80 seconds compared to MetricsReporter but offers seamless integration without the need for custom configurations or changes to current workflows.

Solution overview

This solution is specifically designed for customers who run Apache Iceberg on Amazon Simple Storage Service (Amazon S3) and use AWS Glue as their data catalog.

Solution architecture diagram

Key features

This solution uses an AWS Lambda deployment package to collect metrics from Apache Iceberg tables. The metrics are then submitted to CloudWatch where you can create metrics visualizations to help recognize trends and anomalies over time.

The solution is designed to be lightweight, focusing on collecting metrics directly from the Iceberg metadata layer without scanning the actual data layer. This approach significantly reduces the compute capacity required, making it efficient and cost-effective. Key features of the solution include:

  • Time-series metrics collection: The solution monitors Iceberg tables continuously to identify trends and detect anomalies in data ingestion rates, partition skewness, and more.
  • Event-driven architecture: The solution uses Amazon EventBridge to launch a Lambda function when the state of an AWS Glue Data Catalog table changes. This ensures real-time metrics collection every time a transaction is committed to an Iceberg table.
  • Efficient data retrieval: Incorporates minimal compute resources by utilizing AWS Glue interactive sessions and the pyiceberg library to directly access Iceberg metadata tables such as snapshots, partitions, and files.

Metrics tracked

As of the blog release date, the solution collects over 25 metrics. These metrics are categorized into several groups:

  • Snapshot metrics: Include total and changes in data files, delete files, records added or removed, and size changes.
  • Partition and file metrics: Aggregated and per-partition metrics like average, maximum, minimum record counts and file sizes, which help in understanding data distribution and help optimizing storage.

To see the complete list of metrics, go to the GitHub repository.

Visualizing data with CloudWatch dashboards

The solution also provides a sample CloudWatch dashboard to visualize the collected metrics. Metrics visualization is important for real-time monitoring and detecting operational issues. The provided helper script simplifies the set up and deployment of the dashboard.

Amazon CloudWatch dashboard

You can go to the GitHub repository to learn more about how to deploy the solution in your AWS account.

What are the vital metrics for Apache Iceberg tables?

This section discusses specific metrics from Iceberg’s metadata and explains why they’re important for monitoring data quality and system performance. The metrics are broken down into three parts: insight, challenge, and action. This provides a clear path for practical application. In this section, we provide only a subset of the available metrics that the solution can collect, for a complete list, see the solution Github page.

1. snapshot.added_data_files, snapshot.added_records

  • Metric insight: The number of data files and number of records added to the table during the last transaction. The ingestion rate measures the speed at which new data is added to the data lake. This metric helps identify bottlenecks or inefficiencies in data pipelines, guiding capacity planning and scalability decisions.
  • Challenge: A sudden drop in the ingestion rate can indicate failures in data ingestion pipelines, source system outages, configuration errors or traffic spikes.
  • Action: Teams need to establish real-time monitoring and alert systems to detect drops in ingestion rates promptly, allowing quick investigations and resolutions.

2. files.avg_record_count, files.avg_file_size

  • Metric insight: These metrics provide insights into the distribution and storage efficiency of the table. Small file sizes might suggest excessive fragmentation.
  • Challenge: Excessively small file sizes can indicate inefficient data storage leading to increased read operations and higher I/O costs.
  • Action: Implementing regular data compaction processes helps consolidate small files, optimizing storage and enhancing content delivery speeds as demonstrated by a streaming service. Data Catalog offers automatic compaction of Apache Iceberg tables. To learn more about compacting Apache Iceberg tables, see Enable compaction in Working with tables on the AWS Glue console.

3. partitions.skew_record_count, partitions.skew_file_count

  • Metric insight: The metrics indicate the asymmetry of the data distribution across the available table partitions. A skewness value of zero, or very close to zero, suggests that the data is balanced. Positive or negative skewness values might indicate a problem.
  • Challenge: Imbalances in data distribution across partitions can lead to inefficiencies and slow query responses.
  • Action: Regularly analyze data distribution metrics to adjust partitioning configuration. Apache Iceberg allows you to transform partitions dynamically, which enables optimization of table partitioning as query patterns or data volumes change, without impacting your existing data.

4. snapshot.deleted_records, snapshot.total_delete_files, snapshot.added_position_deletes

  • Metric insight: Deletion metrics in Apache Iceberg provide important information on the volume and nature of data deletions within a table. These metrics help track how often data is removed or updated, which is essential for managing data lifecycle and compliance with data retention policies.
  • Challenge: High values in these metrics can indicate excessive deletions or updates, which might lead to fragmentation and decreased query performance.
  • Action: To address these challenges, run compaction periodically to ensure deleted rows do not persist in new files. Regularly review and adjust data retention policies and consider expiring old snapshots to keep only necessary amount of data files. You can run compaction operation on specific partitions using Amazon Athena Optimize

Effective monitoring is essential for making informed decisions about necessary maintenance actions for Apache Iceberg tables. Determining the right timing for these actions is crucial. Implementing timely preventative maintenance ensures high operational efficiency of the data lake and helps to address potential issues before they become significant problems.

Using Amazon CloudWatch for anomaly detection and alerts

This section assumes that you have completed the solution setup and collected operational metrics from your Apache Iceberg tables into Amazon CloudWatch.

Now you can start setting up some alerts and detect anomalies.

We guide you on setting up the anomaly detection and configuring alerts in CloudWatch to monitor the snapshot.added_records metric, which indicates the ingestion rate of data written into an Apache Iceberg table.

Set up anomaly detection

CloudWatch anomaly detection applies machine learning algorithms to continuously analyze system metrics, determine normal baselines, and identify items that are outside of the established patterns. Here is how you configure it:

Amazon CloudWatch anomaly detection screenshot

  1. Select Metrics: In the AWS Management Console for Cloudwatch, go to the Metrics  tab and search for and select snapshot.added_records.
  2. Create anomaly detection models: Choose the Graphed metrics tab and click the Pulse icon to enable anomaly detection.
  3. Set Sensitivity: The second parameter of the ANOMALY_DETECTION_BAND (m1, 5) is to adjust the sensitivity of the anomaly detection. The goal is to balance detecting real issues and reducing false positives.

Configure alerts

After the anomaly detection model is set up, set up an alert to notify operations teams about potential issues:

  1. Create alarm: Choose the bell icon under Actions on the same Graphed metrics tab.
  2. Alarm settings: Set the alarm to notify the operations team when the snapshot.added_records metric is outside the anomaly detection band for two consecutive periods. This helps reduce the risk of false alerts.
  3. Alarm actions: Configure CloudWatch to send an alarm email to the operations team. In addition to sending emails, CloudWatch alarm actions can automatically launch remediation processes, such as scaling operations or initiating data compaction.

Best practices

  • Regularly review and adjust models: As data patterns evolve, periodically review and adjust anomaly detection models and alarm settings to remain effective.
  • Comprehensive coverage: Ensure that all critical aspects of the data pipeline are monitored, not just a few metrics.
  • Documentation and communication: Maintain clear documentation of what each metric and alarm represent and ensure that your operations team understands the monitoring set up and response procedures. Set up the alerting mechanisms to send notifications through appropriate channels such as email, corporate messenger, or telephone to ensure your operations team stays informed and can quickly address the issues.
  • Create playbooks and automate remediation tasks: Establish detailed playbooks that describe step-by-step responses for common scenarios identified by alerts. Additionally, automate remediation tasks where possible to speed up response times and reduce the manual burden on teams. This ensures consistent and effective responses to all incidents.

CloudWatch anomaly detection and alerting features help organizations proactively manage their data lakes. This ensures data integrity, reduces downtime, and maintains high data quality. As a result, it enhances operational efficiency and supports robust data governance.

Conclusion

In this blog post, we explored Apache Iceberg’s transformative impact on data lake management. Apache Iceberg addresses the challenges of big data with features like ACID transactions, schema evolution, and snapshot isolation, enhancing data reliability, query performance, and scalability.

We delved into Iceberg’s metadata layer and related metadata tables such as snapshots, files, and partitions that allow easy access to crucial information about the current state of the table. These metadata tables facilitate the extraction of performance-related data, enabling teams to monitor and optimize the data lake’s efficiency.

Finally, we showed you a practical solution for monitoring Apache Iceberg tables using Lambda, AWS Glue, and CloudWatch. This solution uses Iceberg’s metadata layer and CloudWatch monitoring capabilities to provide a proactive operational framework. This framework detects trends and anomalies, ensuring robust data lake management.


About the Author

AvatarMichael Greenshtein is a Senior Analytics Specialist at Amazon Web Services. He is an experienced data professional with over 8 years in cloud computing and data management. Michael is passionate about open-source technology and Apache Iceberg.

Accelerate incident response with Amazon Security Lake – Part 2

Post Syndicated from Frank Phillis original https://aws.amazon.com/blogs/security/accelerate-incident-response-with-amazon-security-lake-part-2/

This blog post is the second of a two-part series where we show you how to respond to a specific incident by using Amazon Security Lake as the primary data source to accelerate incident response workflow. The workflow is described in the Unintended Data Access in Amazon S3 incident response playbook, published in the AWS incident response playbooks repository.

The first post in this series outlined prerequisite information and provided steps for setting up Security Lake. The post also highlighted how Security Lake can add value to your incident response capabilities and how that aligns with the National Institute of Standards and Technology (NIST) SP 800-61 Computer Security Incident Handling Guide. We demonstrated how you can set up Security Lake and related services in alignment with the AWS Security Reference Architecture (AWS SRA).

The following diagram shows the service architecture that we configured in the previous post. The highlighted services (Amazon Macie, Amazon GuardDuty, AWS CloudTrail, AWS Security Hub, Amazon Security Lake, Amazon Athena, and AWS Organizations) are relevant to the example referenced in this post, which focuses upon the phases of incident response outlined in NIST SP 800-61.

Figure 1: Example architecture configured in the previous blog post

Figure 1: Example architecture configured in the previous blog post

The first phase of the NIST SP 800-61 framework is preparation, which was covered in part 1 of this two-part blog series. The following sections cover phase 2 (detection and analysis) and phase 3 (containment, eradication, and recovery) of the NIST framework, and demonstrate how Security Lake accelerates your incident response workflow by providing a central datastore of security logs and findings in a standardized format.

Consider a scenario where your security team has received an Amazon GuardDuty alert through Amazon EventBridge for anomalous user behavior. As a result, the team becomes aware of potential unintended data access. GuardDuty noted unusual IAM user activity for the AWS Identity and Access Management (IAM) user Service_Backup and generated a finding. The security team had set up a rule in EventBridge that sends an alert email notifying them of GuardDuty findings relating to IAM user activity. At this point, the security team is unsure if any malicious activity has occurred, however, the username is not familiar. The security team should investigate further to determine whether the finding is a false positive by querying data in Security Lake. In line with the NIST incident response framework, the team moves to phase 2 of this investigation.

Phase 2: Acquire, preserve, and document evidence

The security team wants to investigate which API calls the unfamiliar user has been making. First, the team checks AWS CloudTrail management activity for the user, and they can do that by using Security Lake. They want a list of that user’s activity, which will help them in several ways:

  1. Give them a list of API calls that warrant further investigation (especially Create*)
  2. Give an indication whether the activity is unusual or not (in the context of “typical” user activity within the account, team, user group, or individual user)
  3. Give an indication of when potentially malicious activity might have started (user history)

The team can use Amazon Athena to query CloudTrail management events that were captured by Security Lake. In some cases, compromised user accounts might have existed for a long time previously as legitimate users and made tens of thousands of API calls. In such a case, how would a security team identify the calls that might need further investigation? A quick way to get a summary list of API calls the user has made would be to run a query like the following:

SELECT DISTINCT api.operation FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE lower(actor.user.name) = 'service_backup'

From the results, the team can determine information and queries of interest to focus on for further investigation. To begin with, the security team uses the preceding query to enumerate the number and type of API calls made by the user, as shown in Figure 2.

Figure 2: Example API call summary made by an IAM user

Figure 2: Example API call summary made by an IAM user

The initial query results in Figure 2 show API calls that could indicate privilege elevation (creating users, attaching user policies, and similar calls are a good indicator). There are other API calls that indicate that additional resources may have been created, such as Amazon Simple Storage Service (Amazon S3) buckets and Amazon Elastic Compute Cloud (Amazon EC2) instances.

Note that in this early phase, the team didn’t time-bound the query. However, if there is a high degree of confidence that the team can focus on a specific time or date range, the query can be further modified. How would the team decide on a time period to focus on? When the team received the alert email from EventBridge, that email included information about the GuardDuty finding, including the time it was observed. The team can use that time as an anchor to search around.

The team now wants to look at the user’s activity in a bit more detail. To do that, they can use a query that returns more detail for each of the API calls the user has made:

SELECT time_dt AS "Time Date", metadata.product.name AS "Log Source", cloud.region, actor.user.type, actor.user.name, actor.user.account.uid AS "User AWS Acc ID", api.operation AS "API Operation", status AS "API Response", api.service.name AS "Service", api.request.data AS "Request Data"
FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE lower(actor.user.name) = 'service_backup';
Figure 3: Example CloudTrail activity for an IAM user

Figure 3: Example CloudTrail activity for an IAM user

Figure 3 shows the result of the example query. The team observes that the user created an S3 bucket and performed other management plane actions, including creating IAM users and attaching the administrator access policy to a newly created user. Attempts were made to create other resources such as Amazon EC2 instances, but these were not successful. So the team needs to do further investigation on the newly created IAM users and S3 buckets, but they don’t need to take further action on EC2 instances, at least for this user.

The team starts to focus on investigating the IAM permissions of the Service_Backup user because some resources created by this user could lead to privilege elevation (for example: CreateUser >> AttachPolicy >> CreateAccessKey). The team verifies that the policy attached to that new user was AdminAccess. The activities of this newly created user should be investigated. Now, with a broad idea of that user’s activity and the time the activity occurred, the security team wants to focus on IAM activity, so that they can understand what resources have been created. Those resources will likely also need to be investigated.

The team can use the following query to find more detail about the resources that the user has created, modified, or deleted. In this case, the user has also created additional IAM users and roles. The team uses timestamps to limit query results to a specific time range during which they believe the suspicious activity occurred, and can also focus on IAM activity specifically.

SELECT time_dt AS "Time", cloud.region AS "Region", actor.session.issuer AS "Role Issuer", actor.user.name AS "User Name", api.operation AS "API Call", json_extract(api.request.data, '$.userName') AS "Target Principal", json_extract(api.request.data, '$.policyArn') AS "Target Policy", json_extract(api.request.data, '$.roleName') AS "Target Role", actor.user.uid AS "User ID", user.name AS "Target Principal", status AS "Response Status", accountid AS "AWS Account"
FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE (lower(api.operation) LIKE 'create%' OR lower(api.operation) LIKE 'attach%' OR lower(api.operation) LIKE 'delete%')
AND lower(api.service.name) = 'iam.amazonaws.com'
AND lower(actor.user.name) = 'service_backup'
AND time_dt BETWEEN TIMESTAMP '2024-03-01 00:00:00.000' AND TIMESTAMP  '2024-05-31 23:59:00.000';
Figure 4: CloudTrail IAM activity for a specific user with additional resource detail

Figure 4: CloudTrail IAM activity for a specific user with additional resource detail

This additional detail helps the security team focus on the resources created by the Service_Backup user. These will need to be investigated further and most likely quarantined. If further analysis is required, resources can be disabled, or (in some instances) copied over to a forensic account to conduct the analysis.

Having identified newly created IAM resources that require further investigation, the team now continues by focusing on the resources created in S3. Has the Service_Backup user put objects into that bucket? Have they interacted with objects in any other buckets? To do that, the team queries the S3 data events by using Athena, as follows:

SELECT time_dt AS "Time Date", cloud.region AS "Region", actor.user.type, actor.user.name AS "User Name", api.operation AS "API Call", status AS "Status", api.request.data AS "Request Data", accountid AS "AWS Account ID"
FROM amazon_security_lake_table_us_west_2_s3_data_2_0
WHERE lower(actor.user.name) = 'service_backup';

The security team discovers that the Service_Backup user created an S3 bucket named breach.notify and uploaded a file named data-locked-xxx.txt in the bucket (the bucket name is also returned in the query results shown in Figure 3). Additionally, they see GetObject API calls for potentially sensitive data, followed by DeleteObject API calls for the same data and additional potentially sensitive data files. These are a group of CSV files, for example cc-data-2.csv, as shown in Figure 5.

Figure 5: Example Amazon S3 API activity for an IAM user

Figure 5: Example Amazon S3 API activity for an IAM user

Now the security team has two important goals:

  1. Protect and recover their data and resources
  2. Make sure that any applications that are reliant on those resources or data are available and serving customers

The security team knows that their S3 buckets do contain sensitive data, and they need a quick way to understand which files may be of value to a threat actor. Because the security team was able to quickly investigate their S3 data logs and determine that files have indeed been downloaded and deleted, they already have actionable information. To enrich that with additional context, the team needs a way to verify whether the file contains sensitive data. Amazon Macie can be configured to detect sensitive data in S3 buckets and natively integrate with Security Hub. The team had already configured Macie to scan their buckets and provide alerts if potentially sensitive data is discovered. The team can continue to use Athena to query Security Hub data stored in Security Lake, to see if the Macie results could be related to those files or buckets. The team looks for such findings that were generated around the time that the breach.notify S3 bucket was created, with the unusually named object uploaded, through the following Athena query:

SELECT time_dt AS "Date/Time", metadata.product.name AS "Log Source", cloud.account.uid AS "AWS Account ID", cloud.region AS "AWS Region", resources[1].type AS "Resource Type", resources[1].uid AS "Bucket ARN", resources[2].type AS "Resource Type 2", resources[2].uid AS "Object Name", finding_info.desc AS "Finding summary" FROM amazon_security_lake_table_us_west_2_sh_findings_2_0
WHERE cloud.account.uid = '<YOUR_AWS_ACCOUNT_ID>'
AND lower(metadata.product.name) = 'macie'
AND time_dt BETWEEN TIMESTAMP '2024-03-10 00:00:00.000' AND TIMESTAMP  '2024-03-14 23:59:00.000';
Figure 6: Example Amazon Macie finding summary for data in S3 buckets

Figure 6: Example Amazon Macie finding summary for data in S3 buckets

As Figure 6 shows, the team used the preceding query to pull just the information they needed to help them understand whether there is likely sensitive information in the bucket, which files contain that information, and what kind of information it is. From these results, it appears that the files listed do contain sensitive information, which appears to be credit card related data.

It’s time to stop and briefly review what the team now knows, and what still needs to be done. The team established that the Service_Backup user created additional IAM users and assigned wide-ranging permissions to those users. They also found that the Service_Backup user downloaded what appears to be confidential data, and then deleted those files from the customer’s buckets. Meanwhile, the Service_Backup user created a new S3 bucket and stored ransom files in it. In addition to this, that user also created IAM roles and attempted to create EC2 instances. In our example scenario, the first part of the investigation is complete.

There are a few things to note about what the team has done so far with Security Lake. First, because they’ve set up Security Lake across their entire organization, the team can query results from accounts in their organization, for various types of resources. That in itself saves a significant amount of time during the investigative process. Additionally, the team has seamlessly queried different sets of data to get an outcome—so far they have queried CloudTrail management events, S3 data events, and Macie findings through Security Hub—with the preceding queries done through Security Lake, directly from the Athena console, with no account or role switching, and no tool switching.

Next, we’ll move on to the containment step.

Phase 3: Containment, eradication, and recovery

Having gathered sufficient evidence in the previous phases to act, it’s now time for the team to contain this incident and focus on the AWS API by using either the AWS Management Console, the AWS CLI, or other tools. For the purposes of this blog post, we’ll use the CLI tools.

The team needs to perform several actions to contain the incident. They want to reduce the risk of further data exposure and the creation, modification, or deletion of resources in these AWS accounts. First, they will disable the Service_Backup user’s access and subsequently investigate and assess whether to disable the access of the IAM principal which created that user. Additionally, because Service_Backup created other IAM users, those users must also be investigated using the same process outlined earlier.

Next, the security team needs to determine whether or how they can restore the sensitive data that has been deleted from the bucket. If the bucket has versioning enabled, the act of deleting the object will result in the next most recent version becoming the current version. Alternatively, if they are using AWS Backup to protect their Amazon S3 data, they will be able to restore the most recently backed-up version. It’s worth noting that there could be other ways to restore that data—for example, the organization might have configured cross-Region replication for S3 buckets or other methods to protect their data.

After completing the steps to help prevent further access of the impacted IAM users, and restoring relevant data in impacted S3 buckets, the team now turns their attention to the additional resources created by the now-disabled user. Because these resources include IAM resources, the team needs a list of what has been created and deleted. They could see that information from earlier queries, but now decide to focus just on IAM resources by using the following example query;

SELECT time_dt AS "Time", cloud.region AS "Region", actor.session.issuer AS "Role Issuer", actor.user.name AS "User Name", api.operation AS "API Call", json_extract(api.request.data, '$.userName') AS "Target Principal", json_extract(api.request.data, '$.policyArn') AS "Target Policy", json_extract(api.request.data, '$.roleName') AS "Target Role", actor.user.uid AS "User ID", user.name AS "Target Principal", status AS "Response Status", accountid AS "AWS Account"
FROM amazon_security_lake_table_us_west_2_cloud_trail_mgmt_2_0
WHERE (lower(api.operation) LIKE 'create%' OR lower(api.operation) LIKE 'attach%' OR lower(api.operation) LIKE 'delete%')
AND lower(api.service.name) = 'iam.amazonaws.com'
AND time_dt BETWEEN TIMESTAMP '2024-03-01 00:00:00.000' AND TIMESTAMP  '2024-05-31 23:59:00.000';

This query returns a concise and informative list of activities for several users. There is a separate column for the role name or the user ID, corresponding to IAM roles and users, respectively, as shown in Figure 7.

Figure 7: IAM mutating API activity

Figure 7: IAM mutating API activity

The team uses the AWS CLI to revoke IAM role session credentials, to verify and, if necessary, to modify the role’s trust policy. They also capture an image of the EC2 instance for forensic analysis and terminate the instance. They will copy the data they want to save from the questionable S3 buckets and then delete the buckets, or at least remove the bucket policy.

After completing these tasks, the security team now confirms with the application owners that application recovery is complete or ongoing. They will subsequently review the event and undertake phase 4 of the NIST framework (post-incident activity) to find the root cause, look for opportunities for improvement, and work on remediating any configuration or design flaws that led to the initial breach.

Conclusion

This is the second post in a two-part series about accelerating security incident response with Security Lake. We used anomalous IAM user activity as an incident example to show how you can use Security Lake as a central repository for your security logs and findings, to accelerate the incident response process.

With Security Lake, your security team is empowered to use analytics tools like Amazon Athena to run queries against a central point of security logs and findings from various security data sources, including management logs and S3 data logs from AWS CloudTrail, Amazon Macie findings, Amazon GuardDuty findings, and more.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Frank Phillis

Frank Phillis
Frank is a Senior Solutions Architect (Security) at AWS. He enables customers to get their security architecture right. Frank specializes in cryptography, identity, and incident response. He’s the creator of the popular AWS Incident Response playbooks and regularly speaks at security events. When not thinking about tech, Frank can be found with his family, riding bikes, or making music.
You can follow Frank on LinkedIn.

Jerry Chen

Jerry Chen
Jerry is a Senior Cloud Optimization Success Solutions Architect at AWS. He focuses on cloud security and operational architecture design for AWS customers and partners.
You can follow Jerry on LinkedIn.

Testing your applications with Amazon Q Developer

Post Syndicated from Svenja Raether original https://aws.amazon.com/blogs/devops/testing-your-applications-with-amazon-q-developer/

Testing code is a fundamental step in the field of software development. It ensures that applications are reliable, meet quality standards, and work as intended. Automated software tests help to detect issues and defects early, reducing impact to end-user experience and business. In addition, tests provide documentation and prevent regression as code changes over time.

In this blog post, we show how the integration of generative AI tools like Amazon Q Developer can further enhance unit testing by automating test scenarios and generating test cases.

Amazon Q Developer helps developers and IT professionals with all of their tasks across the software development lifecycle – from coding, testing, and upgrading, to troubleshooting, performing security scanning and fixes, optimizing AWS resources, and creating data engineering pipelines. It integrates into your Integrated Development Environment (IDE) and aids with providing answers to your questions. Amazon Q Developer supports you across the Software Development Lifecycle (SLDC) by enriching feature and test development with step-by-step instructions and best practices. It learns from your interactions and training itself over time to output personalized, and tailored answers.

Solution overview

In this blog, we show how to use Amazon Q Developer to:

  • learn about software testing concepts and frameworks
  • identify unit test scenarios
  • write unit test cases
  • refactor test code
  • mock dependencies
  • generate sample data

Note: Amazon Q Developer may generate an output different from this blog post’s examples due to its nondeterministic nature.

Using Amazon Q Developer to learn about software testing frameworks and concepts

As you start gaining experience with testing, Amazon Q Developer can accelerate your learning through conversational Q&A directly within the AWS Management Console or the IDE. It can explain topics, provide general advice and share helpful resources on testing concepts and frameworks. It gives personalized recommendations on resources which makes the learning experience more interactive and accelerates the time to get started with writing unit tests. Let’s introduce an example conversation to demonstrate how you can leverage Amazon Q Developer for learning before attempting to write your first test case.

Example – Select and install frameworks

A unit testing framework is a software tool used for test automation, design and management of test cases. Upon starting a software project, you may be faced with the selection of a framework depending on the type of tests, programming language, and underlying technology. Let’s ask for recommendations around unit testing frameworks for Python code running in an AWS Lambda function.

In the Visual Studio Code IDE, a user asks Amazon Q Developer for recommendations on unit test frameworks suitable for AWS Lambda functions written in Python using the following prompt: “Can you recommend unit testing frameworks for AWS Lambda functions in Python?”. Amazon Q Developer returns a numbered list of popular unit testing frameworks, including Pytest, unittest, Moto, AWS SAM Local, and AWS Lambda Powertools. Amazon Q Developer returns two URLs in the Sources section. The first URL links to an article called “AWS Lambda function testing in Python – AWS Lambda” from the AWS documentation, and the other URL to an AWS DevOps Blog post with the title “Unit Testing AWS Lambda with Python and Mock AWS Services”.

Figure 1: Recommend unit testing frameworks for AWS Lambda functions in Python

In Figure 1, Amazon Q Developer answers with popular frameworks (pytest, unittest, Moto, AWS SAM Command Line Interface, and Powertools for AWS Lambda) including a brief description for each of them. It provides a reference to the sources of its response at the bottom and suggested follow up questions. As a next step, you may want to refine what you are looking for with a follow-up question. Amazon Q Developer uses the context from your previous questions and answers to give more precise answers as you continue the conversation. For example, one of the frameworks it suggested using was pytest. If you don’t know how to install that locally, you can ask something like “What are the different options to install pytest on my Linux machine?”. As shown in Figure 2, Amazon Q Developer provides installation recommendation using Python.

In the Visual Studio Code IDE, a user asks Amazon Q Developer about the different options to install pytest on a Linux machine using the following prompt: “What are the different options to install pytest on my Linux machine?”. Amazon Q Developer replies with four different options: using pip, using a package manager, using a virtual environment, and using a Python distribution. Each option includes the steps to install pytest. A source URL is included for an article called “How to install pytest in Python? – Be on the Right Side of Change”.

Figure 2: Options to install pytest

Example – Explain concepts

Amazon Q Developer can also help you to get up to speed with testing concepts, such as mocking service dependencies. Let’s ask another follow up question to explain the benefits of mocking AWS services.

In the Visual Studio Code IDE, a user asks Amazon Q Developer about the key benefits of mocking AWS Services’ API calls to create unit tests for Lambda function code using the following prompt: “What are the key benefits of mocking AWS Services' API calls to create unit tests for Lambda function code?”. Amazon Q Developer replies with six key benefits, including: Isolation of the Lambda function code, Faster feedback loop, Consistent and repeatable tests, Cost savings, Improved testability, and Easier debugging. Each benefit has a brief explanation attached to it. The Sources section includes one URL to an AWS DevOps Blog post with the title “Unit Testing AWS Lambda with Python and Mock AWS Services”.

Figure 3: Benefits of mocking AWS services

The above conversation in Figure 3 shows how Amazon Q Developer can help to understand concepts. Let’s learn more about Moto.

In the Visual Studio Code IDE, a user asks Amazon Q Developer: “What is Moto used for?”. Amazon Q Developer provides a brief explanation of the Moto library, and how it can be used to create simulations of various AWS resources, such as: AWS Lambda, Amazon Dynamo DB, Amazon S3, Amazon EC2, AWS IAM, and many other AWS services. Amazon Q Developer provides a simple code example of how to use Moto to mock the AWS Lambda service in a Pytest test case by using the @mock_lambda decorator.

Figure 4: Follow up question about Moto

In Figure 4, Amazon Q Developer gives more details about the Moto framework and provides a short example code snippet for mocking an AWS Lambda service.

Best Practice – Write clear prompts

Writing clear prompts helps you to get the desired answers from Amazon Q. A lack of clarity and topic understanding may result in unclear questions and irrelevant or off-target responses. Note how those prompts contain specific description of what the answer should provide. For example, Figure 1 includes the programming language (Python) and service (AWS Lambda) to be considered in the expected answer. If unfamiliar with a topic, leverage Amazon Q Developer as part of your research, to better understand that topic.

Using Amazon Q Developer to identify unit test cases

Understanding the purpose and intended functionality of the code is important for developing relevant test cases. We introduce an example use case in Python, which handles payroll calculation for different hour rates, hours worked, and tax rates.

"""
This module provides a Payroll class to calculate the net pay for an employee.

The Payroll class takes in the hourly rate, hours worked, and tax rate, and
calculates the gross pay, tax amount, and net pay.
"""

class Payroll:
    """
    A class to handle payroll calculations.
    """

    def __init__(self, hourly_rate: float, hours_worked: float, tax_rate: float):
        self._validate_inputs(hourly_rate, hours_worked, tax_rate)
        self.hourly_rate = hourly_rate
        self.hours_worked = hours_worked
        self.tax_rate = tax_rate

    def _validate_inputs(self, hourly_rate: float, hours_worked: float, tax_rate: float) -> None:
        """
        Validate the input values for the Payroll class.

        Args:
            hourly_rate (float): The employee's hourly rate.
            hours_worked (float): The number of hours the employee worked.
            tax_rate (float): The tax rate to be applied to the employee's gross pay.

        Raises:
            ValueError: If the hourly rate, hours worked, or tax rate is not a positive number, 
            or if the tax rate is not between 0 and 1.
        """
        if hourly_rate <= 0:
            raise ValueError("Hourly rate must be a non-negative number.")
        if hours_worked < 0:
            raise ValueError("Hours worked must be a non-negative number.")
        if tax_rate < 0 or tax_rate >= 1:
            raise ValueError("Tax rate must be between 0 and 1.")

    def gross_pay(self) -> float:
        """
        Calculate the employee's gross pay.

        Returns:
            float: The employee's gross pay.
        """
        return self.hourly_rate * self.hours_worked

    def tax_amount(self) -> float:
        """
        Calculate the tax amount to be deducted from the employee's gross pay.

        Returns:
            float: The tax amount.
        """
        return self.gross_pay() * self.tax_rate

    def net_pay(self) -> float:
        """
        Calculate the employee's net pay after deducting taxes.

        Returns:
            float: The employee's net pay.
        """
        return self.gross_pay() - self.tax_amount()

The example shows how Amazon Q Developer can be used to identify test scenarios before writing the actual cases. Let’s ask Amazon Q Developer to suggest test cases for the Payroll class.

In the Visual Studio Code IDE, a user has a Payroll Python class open on the right side of the editor. The Payroll class is a module used to calculate the net pay for an employee. It takes the hourly rate, hours worked, and tax rate to calculate the gross pay, tax amount, and net pay. The user asks Amazon Q Developer to list unit test scenarios for the Payroll class using the following prompt: “Can you list unit test scenarios for the Payroll class?”. Amazon Q Developer provides eight different unit test scenarios: Test valid input values, Test invalid input values, Test edge cases, Test methods behavior, Test error handling, Test data types, Test rounding behavior, and Test consistency.

Figure 5: Suggest unit test scenarios

In Figure 5, Amazon Q Developer provides a list of different scenarios specific to the Payroll class, including valid, error, and edge cases.

Using Amazon Q Developer to write unit tests

Developers can have a collaborative conversation with Amazon Q Developer, which helps to unpack the code and think through testing cases to check that important test cases are captured but also edge cases are identified. This section focuses on how to facilitate the quick generation of unit test cases, based on the cases recommended in the previous section. Let’s start with a question around best practices when writing unit tests with pytest.

In the Visual Studio Code IDE, a user asks Amazon Q Developer “What are the best practices for unit testing with pytest?”. Amazon Q Developer replies with ten best practices, including: Keep tests simple and focused, Use descriptive test function names, Organize your tests, Run tests multiple times in random order, Utilize static code analysis tools, Focus on behavior not implementation, Use fixtures to set up test data, Parameterize your tests, and Integrate with continuous integration. A source URL is also provided for a Medium article called “Python Unit Tests with pytest Guide”

Figure 6: Recommended best practices for testing with pytest

In Figure 6, Amazon Q Developer provides a list of best practices for writing effective unit tests. Let’s follow up by asking to generate one of the suggested test cases.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to generate a test case using pytest to make sure that the Payroll class raises a ValueError when the hourly rate holds a negative value. Amazon Q Developer generates a code sample using the pytest.raises() context manager to satisfy the requirement. It also provides instructions on how to run the text by prefixing the test module with pytest and running the command in the terminal. The user can now click on Insert at cursor button to insert the test case into the module, and run the test.

Figure 7: Generate a unit test case

Amazon Q Developer includes code in its response which you can copy or insert directly into your file by choosing Insert at cursor. Figure 7 displays valid unit tests covering some of the suggested scenarios and best practices, such as being simple and holding descriptive naming. It also states how to run the test using a command for the terminal.

Best Practice – Provide context

Context allows Amazon Q Developer to offer tailored responses that are more in sync with the conversation. In the chat interface, the flow of the ongoing conversation and past interactions are a critical contextual element. Other ways to provide context are selecting the code-under-test, keeping any relevant files, such as test examples, open in the editor and leveraging conversation context such as asking for best practices and example test scenarios before writing the test cases.

Using Amazon Q Developer to refactor unit tests

To improve code quality, Amazon Q Developer can be used to recommend improvements and refactor parts of the code base. To illustrate Amazon Q Developer refactoring functionality, we prepared test cases for the Payroll class which deviate from some of the suggested best practices.

Example – Send to Amazon Q Refactor

Let’s follow up by asking to refactor the code built-in Amazon Q > Refactor functionality.

In the Visual Studio Code IDE, a user selects the code in test_payroll_refactor module and asks Amazon Q Developer to refactor it via the Amazon Q Developer Refactor functionality, accessible through right click. This code contains ambiguous function and variable names, and might be hard to read without context. Amazon Q Developer then generates the refactored code and outlines the changes made: renamed test functions, removed variable names, and unnecessary comments, as functions are now self-explanatory. The user can now use the Insert at Cursor feature to add the code to the test_payroll_refactored module, and run the tests.

Figure 8: Refactor test cases

In Figure 8, the refactoring renamed the function and variable names to be more descriptive and therefore removed the comments. The recommendation is inserted in the second file to verify it runs correctly.

Best Practice – Apply human judgement and continuously interact with Amazon Q Developer

Note that code generations should always be reviewed and adjusted before used in your projects. Amazon Q Developer can provide you with initial guidance, but you might not get a perfect answer. A developer’s judgement should be applied to value usefulness of the generated code and iterations should be used to continuously improve the results.

Using Amazon Q Developer for mocking dependencies and generating sample data

More complex application architectures may require developers to mock dependencies and use sample data to test specific functionalities. The second code example contains a save_to_dynamodb_table function that writes a job_posting object into a specific Amazon DynamoDB table. This function references the TABLE_NAME environment variable to specify the name of the table in which the data should be saved.

We break down the tasks for Amazon Q Developer into three smaller steps for testing: Generate a fixture for mocking the TABLE_NAME environment variable name, generate instances of the given class to be used as test data, and generate the test.

Example – Generate fixtures

Pytest provides the capability to define fixtures to set defined, reliable, and consistent context for tests. Let’s ask Amazon Q Developer to write a fixture for the TABLE_NAME environment variable.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to create a pytest fixture to mock the TABLE_NAME environment variable using the following prompt: “Create a pytest fixture which mocks my TABLE_NAME environment variable”. Amazon Q Developer replies with a code example showing how the fixture uses the os.environ dictionary to temporarily set the environment variable value to a mock value just for the duration of any test case using the fixture. The code example also includes a yield keyword to pause the fixture, and then delete the mock environment variable to restore the actual value once the test is completed.

Figure 9: Generate a pytest fixture

The result in Figure 9 shows that Amazon Q Developer generated a simple fixture for the TABLE_NAME environment variable. It provides code showing how to use the fixture in the actual test case with additional comments for its content.

Example – Generate data

Amazon Q Developer provides capabilities that can help you generate input data for your tests based on a schema, data model, or table definition. The save_to_dynamodb_table saves an instance of the job posting class to the table. Let’s ask Amazon Q Developer to create a sample instance based on this definition.

In the Visual Studio Code IDE, a user asks Amazon Q Developer to create a sample valid instance of the selected JobPosting class using the following prompt: “Create a sample valid instance of the selected JobPostings class”. The JobPosting class includes multiple fields, including: id, title, description, salary, location, company, employment type, and application_deadline. Amazon Q Developer provides a valid snippet for the JobPosting class, incorporating a UUID for the id field, nested amount and currency for the Salary class, and generic sample values for the remaining fields.

Figure 10: Generate sample data

The answer shows a valid instance of the class in Figure 10 containing common example values for the fields.

Example – Generate unit test cases with context

The code being tested relies on an external library, boto3. To make sure that this dependency is included, we leave a comment specifying that boto3 should be mocked using the Moto library. Additionally, we tell Amazon Q Developer to consider the test instance named job_posting and the fixture named mock_table_name for reference. Developers can now provide a prompt to generate the test case using the context from previous tasks or use comments as inline prompts to generate the test within the test file itself.

In the Visual Studio Code IDE, a user is leveraging inline prompts to generate an autocomplete suggestion from Amazon Q Developer. The inline prompt response is trying to fill out the test_save_to_dynamodb_table function with a mock test asserting the previous JobPosting fields. The user can decide to accept or reject the provided code completion suggestion.

Figure 11: Inline prompts for generating unit test case

Figure 11 shows the recommended code using inline prompts, which can be accepted as the unit test for the save_to_dynamodb_table function.

Best Practice – Break down larger tasks into smaller ones

For cases where Amazon Q Developer does not have much context or example code to refer to, such as writing unit tests from scratch, it is helpful to break down the tasks into smaller tasks. Amazon Q Developer will get more context with each step and can result in more effective responses.

Conclusion

Amazon Q Developer is a powerful tool that simplifies the process of writing and executing unit tests for your application. The examples provided in this post demonstrated that it can be a helpful companion throughout different stages of your unit test process. From initial learning to investigation and writing of test cases, the Chat, Generate, and Refactor capabilities allow you to speed up and improve your test generation. Using clear and concise prompts, context, an iterative approach, and small scoped tasks to interact with Amazon Q Developer improves the generated answers.

To learn more about Amazon Q Developer, see the following resources:

About the authors

Iris Kraja

Iris is a Cloud Application Architect at AWS Professional Services based in New York City. She is passionate about helping customers design and build modern AWS cloud native solutions, with a keen interest in serverless technology, event-driven architectures and DevOps. Outside of work, she enjoys hiking and spending as much time as possible in nature.

Svenja Raether

Svenja is a Cloud Application Architect at AWS Professional Services based in Munich.

Davide Merlin

Davide is a Cloud Application Architect at AWS Professional Services based in Jersey City. He specializes in backend development of cloud-native applications, with a focus on API architecture. In his free time, he enjoys playing video games, trying out new restaurants, and watching new shows.

How to build a CA hierarchy across multiple AWS accounts and Regions for global organization

Post Syndicated from Jiaqing Xue original https://aws.amazon.com/blogs/security/how-to-build-a-ca-hierarchy-across-multiple-aws-accounts-and-regions-for-global-organization/

Building a certificate authority (CA) hierarchy using AWS Private Certificate Authority has been made simple in Amazon Web Services (AWS); however, the CA tree will often reside in one AWS Region in one account. Many AWS customers run their businesses in multiple Regions using multiple AWS accounts and have described the process of creating a CA hierarchy to reflect their organizational structure as complex. This blog post guides you through building a two-level CA hierarchy with CAs in multiple Regions and in multiple accounts.

Certificates, like those provided through AWS Private CA, are usually used to establish encrypted network connections to meet security requirements of an organization’s data, such as database connections. For example, a software as a service (SaaS) company headquartered in Singapore that offers managed database services to users across multiple Regions, such as Oregon (us-west-2), Singapore (ap-southeast-1), and Tokyo (ap-northeast-1). To comply with the regulatory requirements of various countries and Regions, the company needs to build a CA hierarchy with a root CA located in Singapore and subordinate CAs distributed globally.

Another example is a large manufacturing company selling millions of Internet of Things (IoT) devices globally that needs to issue certificates to those devices for identity, authentication, and secure network connections. Different device types or business requirements are handled by different teams that have their own independent AWS accounts within the organization. To meet the certificate policy requirements (such as key length, validity period, and so on) for different types of devices, preserve business flexibility, and control the blast radius of risks, these teams use their own CAs to issue certificates. A customer like this needs a CA hierarchy that aligns with its organizational structure. Subordinate CAs should be deployed to the AWS accounts in different business departments, with each department managing its own subordinate CAs. Each device, or end-entity, typically requires a unique certificate, but the number of private certificates each CA will issue is 1,000,000 by default. Because this number is easy to exceed in a mass-production scenario, a CA hierarchy with multiple subordinate CAs can help mitigate this limitation.

In a CA hierarchy, the root CA is responsible for issuing certificates to subordinate CAs, and subordinate CAs are responsible for issuing end-entity certificates. Because the root CA is the root of the trust chain, loss of trust in this CA can mean that the end-entity certificates it issued will become invalid. Therefore, it’s recommended to place the root CA in a dedicated account and apply strict access control and auditing on its usage. We also recommend you strictly limit the scope of use of the root CA. For more best practices, see AWS Private CA best practices and Designing a CA hierarchy.

Key concepts: Certificate template

In AWS Private CA, when you use AWS Command Line Interface (AWS CLI) or AWS APIs to issue a certificate, you need to supply the certificate template Amazon Resource Name (ARN) to specify the X509v3 parameters of the certificate. A certificate template defines what type of certificates can be issued and allows CA administrators to control what certificates are issued and with what attributes. If you provide no ARN, AWS Private CA will automatically apply the EndEntityCertificate/V1 template to issue an end-entity certificate.

AWS Private CA offers certificate templates that encapsulate best practices for the basic constraint values in X509v3 parameters of certificates. For example, if it’s an end-entity certificate, the cA value of Basic constraints in the parameter is False; if issuing a CA certificate, it is necessary to set the cA value in the parameter to True and set the pathLenConstraint according to the level of the CA using this certificate in the CA hierarchy.

When issuing certificates, you must select the appropriate certificate template based on the certificate’s intended purpose. For more information about certificate templates, see Understanding certificate templates.

Journey of a two-level CA hierarchy

Now that you understand the core concepts of a CA hierarchy, we will show you how to use two CAs located in different accounts and different Regions to establish a CA hierarchy and use the subordinate CA to issue an end-entity certificate associated with a web application. In the example, you will create two CAs: one in the Oregon (us-west-2) Region acting as the root CA and the other one in the Singapore (ap-southeast-1) Region acting as the subordinate CA.

The following figure shows what you will build:

Figure 1: Architecture of a two-level CA hierarchy

Figure 1: Architecture of a two-level CA hierarchy

Prerequisites

You need the following prerequisites to build and implement the solution in this post.

  1. Designate two different AWS accounts for creating the hierarchy. In this example, we use one called Root CA account and the another called Subordinate CA account.
  2. Verify that you can sign in to AWS CloudShell in the Root CA account, and verify that your console user or role has the AWS Identity and Access Management (IAM) policy AWSCertificateManagerPrivateCAFullAccess associated or has permissions to create CAs and issue certificates in AWS Private CA.
  3. Verify that you have access to a role in the Subordinate CA account that has appropriate permissions to create CAs in AWS Private CA, issue certificates using AWS Certificate Manager (ACM), and create an Application Load Balancer (ALB) for a sample web application.

Build the hierarchy

With the prerequisites in place, you can start building the hierarchy, including the:

  • Root CA
  • Subordinate CA
  • Subordinate CA certificate
  • End-entity certificate

To build the root CA

  1. In the AWS Management Console of the root CA account, go to AWS Private CA and make sure that you are in the Oregon (us-west-2) Region.
  2. Create a CA according to this guide. Select Root in the CA type options to indicate that it’s a root CA.

    Figure 2: Create root CA

    Figure 2: Create root CA

  3. After creation, the CA will be in Pending certificate status. Finish configuring your CA by choosing Actions and selecting Install CA certificate.

    Figure 3: Root CA in pending certificate status

    Figure 3: Root CA in pending certificate status

  4. The root CA will change to Active status.

    Figure 4: Root CA in active status

    Figure 4: Root CA in active status

To build the subordinate CA

  1. To build the subordinate CA, sign in to your second account designated for this and make sure your Region is different than that of the root CA. For this example, I’m creating this in the Singapore (ap-southeast-1) Region. Similar to the root CA, create a CA in the console and select Subordinate as the CA type options.

    Figure 5: Create a subordinate CA

    Figure 5: Create a subordinate CA

  2. As with the root CA, the subordinate CA is in the Pending certificate state, so you need to install the certificate for it.

    Figure 6: Subordinate CA in pending certificate status

    Figure 6: Subordinate CA in pending certificate status

To issue the subordinate CA certificate

In a CA hierarchy, the certificate of the subordinate CA must be issued by the higher-level CA, which in this case is the root CA. To configure the subordinate CA to be Active, you need to get the certificate signing request (CSR) of the subordinate CA and use the root CA to sign it and issue a certificate for it.

  1. Remaining within the Subordinate CA account, get the CSR of the subordinate CA for signing. In the details page of the subordinate CA, open the Install subordinate CA certificate page by choosing Actions and selecting Install CA certificate, the CSR will be displayed when you select External private CA as the CA type. Export this to a file called sub-ca.csr on your local machine.

    Although the subordinate CA was managed by AWS Private CA, it will be considered as an external private CA because it’s in a different account and a different Region from the root CA.

    Figure 7: Details page of Install subordinate CA certificate

    Figure 7: Details page of Install subordinate CA certificate

  2. In the example scenario, you want the subordinate CA to only issue certificates directly to end-entities, so its certificate must be signed using the SubordinateCACertificate_PathLen0/V1 template on the root CA. This encodes a limitation into the CA that declares that the subordinate CA cannot sign a CSR for another CA.

    To do this, switch to the Root CA account and use CloudShell to run commands with the AWS CLI. Recreate the file sub-ca.csr using a text editor within CloudShell, and then use the following command to issue subordinate CA certificate. Replace the root-ca-arn with the ARN of the root CA you created previously:

    aws acm-pca issue-certificate --region us-west-2 \
      --certificate-authority-arn <root-ca-arn> \
      --csr file://sub-ca.csr \
      --signing-algorithm SHA256WITHRSA \
      --template-arn arn:aws:acm-pca:::template/SubordinateCACertificate_PathLen0/V1 \
     --validity Value=5,Type=YEARS

    The output looks like the following. Note the CertificateArn that is returned, because it will be used as <sub-ca-certificate-arn> in the following steps:

    {
        "CertificateArn": "<sub-ca-certificate-arn>"
    }

  3. Remaining in the Root CA account, export the subordinate CA’s certificate and certificate chain. Store the text that is returned from the CLI in a file on your local machine for later use in the Subordinate CA account. Replace the <root-ca-arn> and <sub-ca-certificate-arn> with the actual values.
    1. Export the subordinate CA’s certificate:
      aws acm-pca get-certificate --region us-west-2 --certificate-authority-arn <root-ca-arn> --certificate-arn <sub-ca-certificate-arn> --query “Certificate” --output text

    2. Export the subordinate CA’s certificate chain:
      aws acm-pca get-certificate --region us-west-2 --certificate-authority-arn <root-ca-arn> --certificate-arn <sub-ca-certificate-arn> --query “CertificateChain” --output text

  4. Switching back to the Subordinate CA account, import the subordinate CA’s certificate and root certificate chain to activate the subordinate CA. On the install subordinate CA certificate, choose Actions and select Install CA certificate.
    1. Use the exported certificate from the previous step to fill in the certificate body.
    2. Use the exported certificate chain from the previous step to fill in the certificate chain.
    3. Choose Confirm and install.

      Figure 8: Import certificate and certificate chain to the subordinate CA

      Figure 8: Import certificate and certificate chain to the subordinate CA

  5. Now, the subordinate CA is in Active status.

    Figure 9: Subordinate CA in active status

    Figure 9: Subordinate CA in active status

To validate the end-entity certificate

Now that you have established a two-level CA hierarchy, you can issue an end-entity certificate and deploy a web application in the Subordinate CA account. Use a web browser to view the web application and validate the CA hierarchy.

  1. Go to the AWS Certificate Manager console of the subordinate CA account and request a private certificate. In the Certificate authority details section, select the subordinate CA that was previously built.

    Figure 10: Issue certificate using subordinate CA in ACM

    Figure 10: Issue certificate using subordinate CA in ACM

  2. Deploy a sample web application behind an Application Load Balancer (ALB) in the Subordinate CA account. This sample web application uses a feature of the ALB listener that returns a fixed response without needing to configure backend compute resources. Configure the listener to use the certificate issued in the previous step to handle HTTPS traffic.

    Figure 11: Details of the ALB

    Figure 11: Details of the ALB

  3. Check the hierarchical relationship of CAs by visiting the web application and viewing the certificate details in your web browser.

    Figure 12: Certificate details in the web browser

    Figure 12: Certificate details in the web browser

From this image, you can see that the certificate trust chain of the CA hierarchy has been established successfully. The end-entity certificate (on the left) was issued by the CA Example Corp SG L2 CA, which you can identify from its Issuer Name. In the middle is the certificate of Example Corp SG L2 CA, and its issuer is Example Corp Root CA. You can see the certificate of Example Corp Root CA on the right, which is the root CA, and its Issuer Name is itself. When the subject and issuer attributes match, this is indicative of a root CA.

Conclusion

In this post, we showed you how to build a CA hierarchy solution across AWS accounts and Regions, using AWS Private CA. Using this as a guide, you can build a complex CA hierarchy to help meet your global business security and compliance requirements.

Read more

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Private Certificate Authority re:Post or contact AWS Support.

Jiaqing Xue

Jiaqing Xue
Jiaqing is a Solutions Architect at AWS. He focuses on helping customers build more secure, modern applications on AWS. In his spare time, he enjoys coding.

Patrick Palmer

Patrick Palmer
Patrick is a Principal Security Specialist Solutions Architect. He helps customers around the world use AWS services in a secure manner and specializes in cryptography. When not working, he enjoys spending time with his growing family and playing video games.

Xinran Liu

Xinran Liu
Xinran is a Solutions Architect at AWS. Edge computing and DevOps are his areas of interest. He has 14 years of experience leading large-scale engineering projects with expertise in designing and managing both on-premises and cloud-based infrastructures.

Matt Yu

Matt Yu
Matt is a Security Solutions Architect at AWS, responsible for providing customers with end-to-end cloud security solutions. He has extensive experience in identity and access management, threat detection, and cloud security protection.

How ATPCO enables governed self-service data access to accelerate innovation with Amazon DataZone

Post Syndicated from Brian Olsen original https://aws.amazon.com/blogs/big-data/how-atpco-enables-governed-self-service-data-access-to-accelerate-innovation-with-amazon-datazone/

This blog post is co-written with Raj Samineni  from ATPCO.

In today’s data-driven world, companies across industries recognize the immense value of data in making decisions, driving innovation, and building new products to serve their customers. However, many organizations face challenges in enabling their employees to discover, get access to, and use data easily with the right governance controls. The significant barriers along the analytics journey constrain their ability to innovate faster and make quick decisions.

ATPCO is the backbone of modern airline retailing, enabling airlines and third-party channels to deliver the right offers to customers at the right time. ATPCO’s reach is impressive, with its fare data covering over 89% of global flight schedules. The company collaborates with more than 440 airlines and 132 channels, managing and processing over 350 million fares in its database at any given time. ATPCO’s vision is to be the platform driving innovation in airline retailing while remaining a trusted partner to the airline ecosystem. ATPCO aims to empower data-driven decision-making by making high quality data discoverable by every business unit, with the appropriate governance on who can access what.

In this post, using one of ATPCO’s use cases, we show you how ATPCO uses AWS services, including Amazon DataZone, to make data discoverable by data consumers across different business units so that they can innovate faster. We encourage you to read Amazon DataZone concepts and terminologies first to become familiar with the terms used in this post.

Use case

One of ATPCO’s use cases is to help airlines understand what products, including fares and ancillaries (like premium seat preference), are being offered and sold across channels and customer segments. To support this need, ATPCO wants to derive insights around product performance by using three different data sources:

  • Airline Ticketing data – 1 billion airline ticket sales data processed through ATPCO
  • ATPCO pricing data – 87% of worldwide airline offers are powered through ATPCO pricing data. ATPCO is the industry leader in providing pricing and merchandising content for airlines, global distribution systems (GDSs), online travel agencies (OTAs), and other sales channels for consumers to visually understand differences between various offers.
  • De-identified customer master data – ATPCO customer master data that has been de-identified for sensitive internal analysis and compliance.

In order to generate insights that will then be shared with airlines as a data product, an ATPCO analyst needs to be able to find the right data related to this topic, get access to the data sets, and then use it in a SQL client (like Amazon Athena) to start forming hypotheses and relationships.

Before Amazon DataZone, ATPCO analysts needed to find potential data assets by talking with colleagues; there wasn’t an easy way to discover data assets across the company. This slowed down their pace of innovation because it added time to the analytics journey.

Solution

To address the challenge, ATPCO sought inspiration from a modern data mesh architecture. Instead of a central data platform team with a data warehouse or data lake serving as the clearinghouse of all data across the company, a data mesh architecture encourages distributed ownership of data by data producers who publish and curate their data as products, which can then be discovered, requested, and used by data consumers.

Amazon DataZone provides rich functionality to help a data platform team distribute ownership of tasks so that these teams can choose to operate less like gatekeepers. In Amazon DataZone, data owners can publish their data and its business catalog (metadata) to ATPCO’s DataZone domain. Data consumers can then search for relevant data assets using these human-friendly metadata terms. Instead of access requests from data consumer going to a ATPCO’s data platform team, they now go to the publisher or a delegated reviewer to evaluate and approve. When data consumers use the data, they do so in their own AWS accounts, which allocates their consumption costs to the right cost center instead of a central pool. Amazon DataZone also avoids duplicating data, which saves on cost and reduces compliance tracking. Amazon DataZone takes care of all of the plumbing, using familiar AWS services such as AWS Identity and Access Management (IAM), AWS Glue, AWS Lake Formation, and AWS Resource Access Manager (AWS RAM) in a way that is fully inspectable by a customer.

The following diagram provides an overview of the solution using Amazon DataZone and other AWS services, following a fully distributed AWS account model, where data sets like airline ticket sales, ticket pricing, and de-identified customer data in this use case are stored in different member accounts in AWS Organizations.

Implementation

Now, we’ll walk through how ATPCO implemented their solution to solve the challenges of analysts discovering, getting access to, and using data quickly to help their airline customers.

There are four parts to this implementation:

  1. Set up account governance and identity management.
  2. Create and configure an Amazon DataZone domain.
  3. Publish data assets.
  4. Consume data assets as part of analyzing data to generate insights.

Part 1: Set up account governance and identity management

Before you start, compare your current cloud environment, including data architecture, to ATPCO’s environment. We’ve simplified this environment to the following components for the purpose of this blog post:

  1. ATPCO uses an organization to create and govern AWS accounts.
  2. ATPCO has existing data lake resources set up in multiple accounts, each owned by different data-producing teams. Having separate accounts helps control access, limits the blast radius if things go wrong, and helps allocate and control cost and usage.
  3. In each of their data-producing accounts, ATPCO has a common data lake stack: An Amazon Simple Storage Service (Amazon S3) bucket for data storage, AWS Glue crawler and catalog for updating and storing technical metadata, and AWS LakeFormation (in hybrid access mode) for managing data access permissions.
  4. ATPCO created two new AWS accounts: one to own the Amazon DataZone domain and another for a consumer team to use for analytics with Amazon Athena.
  5. ATPCO enabled AWS IAM Identity Center and connected their identity provider (IdP) for authentication.

We’ll assume that you have a similar setup, though you might choose differently to suit your unique needs.

Part 2: Create and configure an Amazon DataZone domain

After your cloud environment is set up, the steps in Part 2 will help you create and configure an Amazon DataZone domain. A domain helps you organize your data, people, and their collaborative projects, and includes a unique business data catalog and web portal that publishers and consumers will use to share, collaborate, and use data. For ATPCO, their data platform team created and configured their domain.

Step 2.1: Create an Amazon DataZone domain

Persona: Domain administrator

Go to the Amazon DataZone console in your domain account. If you use AWS IAM Identity Center for corporate workforce identity authentication, then select the AWS Region in which your Identity Center instance is deployed. Choose Create domain.

  1. Enter a name and description.
  2. Leave Customize encryption settings (advanced) cleared.
  3. Leave the radio button selected for Create and use a new role. AWS creates an IAM role in your account on your behalf with the necessary IAM permissions for accessing Amazon DataZone APIs.
  4. Leave clear the quick setup option for Set-up this account for data consumption and publishing because we don’t plan to publish or consume data in our domain account.
  5. Skip Add new tag for now. You can always come back later to edit the domain and add tags.
  6. Choose Create Domain.

After a domain is created, you will see a domain detail page similar to the following. Notice that IAM Identity Center is disabled by default.

Step 2.2: Enable IAM Identity Center for your Amazon DataZone domain and add a group

Persona: Domain administrator

By default, your Amazon domain, its APIs, and its unique web portal are accessible by IAM principals in this AWS account with the necessary datazone IAM permissions. ATPCO wanted its corporate employees to be able to use Amazon DataZone with their corporate single sign-on SSO credentials without needing secondary federation to IAM roles. AWS Identity Center is the AWS cross-service solution for passing identity provider credentials. You can skip this step if you plan to use IAM principals directly for accessing Amazon DataZone.

Navigate to your Amazon DataZone domain’s detail page and choose Enable IAM Identity Center.

  • Scroll down to the User management section and select Enable users in IAM Identity Center. When you do, User and group assignment method options appear below. Turn on Require assignments. This means that you need to explicitly allow (add) users and groups to access your domain. Choose Update domain.

Now let’s add a group to the domain to provide its members with access. Back on your domain’s detail page, scroll to the bottom and choose the User management tab. Choose Add, and select Add SSO Groups from the drop-down.

  1. Enter the first letters of the group name and select it from the options. After you’ve added the desired groups, choose Add group(s).
  2. You can confirm that the groups are added successfully on the domain’s detail page, under the User management tab by selecting SSO Users and then SSO Groups from the drop-down.

Step 2.3: Associate AWS accounts with the domain for segregated data publishing and consumption

Personas: Domain administrator and AWS account owners

Amazon DataZone supports a distributed AWS account structure, where data assets are segregated from data consumption (such as Amazon Athena usage), and data assets are in their own accounts (owned by their respective data owners). We call these associated accounts. Amazon DataZone and the other AWS services it orchestrates take care of the cross-account data sharing. To make this work, domain and account owners need to perform a one-time account association: the domain needs to be shared with the account, and the account owner needs to configure it for use with Amazon DataZone. For ATPCO, there are four desired associated accounts, three of which are the accounts with data assets stored in Amazon S3 and cataloged in AWS Glue (airline ticketing data, pricing data, and de-identified customer data), and a fourth account that is used for an analyst’s consumption.

The first part of associating an account is to share the Amazon DataZone domain with the desired accounts (Amazon DataZone uses AWS RAM to create the resource policy for you). In ATPCO’s case, their data platform team manages the domain, so a team member does these steps.

  1. Todo this in the Amazon DataZone console, sign in to the domain account and navigate to the domain detail page, and then scroll down and choose the Associated Accounts tab. Choose Request association.
  2. Enter the AWS account ID of the first account to be associated.
  3. Choose Add another account and repeat step one for the remaining accounts to be associated. For ATPCO, there were four to-be associated accounts.
  4. When complete, choose Request Association.

The second part of associating an account is for the account owner to then configure their account for use by Amazon DataZone. Essentially, this process means that the account owner is allowing Amazon DataZone to perform actions in the account, like granting access to Amazon DataZone projects after a subscription request is approved.

  1. Sign in to the associated account and go to the Amazon DataZone console in the same Region as the domain. On the Amazon DataZone home page, choose View requests.
  2. Select the name of the inviting Amazon DataZone domain and choose Review request.

  1. Choose the Amazon DataZone blueprint you want to enable. We select Data Lake in this example because ATPCO’s use case has data in Amazon S3 and consumption through Amazon Athena.

  1. Leave the defaults as-is in the Permissions and resources The Glue Manage Access role allows Amazon DataZone to use IAM and LakeFormation to manage IAM roles and permissions to data lake resources after you approve a subscription request in Amazon DataZone. The Provisioning role allows Amazon DataZone to create S3 buckets and AWS Glue databases and tables in your account when you allow users to create Amazon DataZone projects and environments. The Amazon S3 bucket for data lake is where you specify which S3bucket is used by Amazon DataZone when users store data with your account.

  1. Choose Accept & configure association. This will take you to the associated domains table for this associated account, showing which domains the account is associated with. Repeat this process for other to-be associated accounts.

After the associations are configured by accounts, you will see the status reflected in the Associated accounts tab of the domain detail page.

Step 2.4: Set up environment profiles in the domain

Persona: Domain administrator

The final step to prepare the domain is making the associated AWS accounts usable by Amazon DataZone domain users. You do this with an environment profile, which helps less technical users get started publishing or consuming data. It’s like a template, with pre-defined technical details like blueprint type, AWS account ID, and Region. ATPCO’s data platform team set up an environment profile for each associated account.

To do this in the Amazon DataZone console, the data platform team member sign in to the domain account and navigates to the domain detail page, and chooses Open data portal in the upper right to go to the web-based Amazon DataZone portal.

  1. Choose Select project in the upper-left next to the DataZone icon and select Create Project. Enter a name, like Domain Administration and choose Create. This will take you to your new project page.
  2. In the Domain Administration project page, choose the Environments tab, and then choose Environment profiles in the navigation pane. Select Create environment profile.
    1. Enter a name, such as Sales – Data lake blueprint.
    2. Select the Domain Administration project as owner, and the DefaultDataLake as the blueprint.
    3. Select the AWS account with sales data as well as the preferred Region for new resources, such as AWS Glue and Athena consumption.
    4. Leave All projects and Any database
    5. Finalize your selection by choosing Create Environment Profile.

Repeat this step for each of your associated accounts. As a result, Amazon DataZone users will be able to create environments in their projects to use AWS resources in specific AWS accounts forpublishing or consumption.

Part 3: Publish assets

With Part 2 complete, the domain is ready for publishers to sign in and start publishing the first data assets to the business data catalog so that potential data consumers find relevant assets to help them with their analyses. We’ll focus on how ATPCO published their first data asset for internal analysis—sales data from their airline customers. ATPCO already had the data extracted, transformed, and loaded in a staged S3 bucket and cataloged with AWS Glue.

Step 3.1: Create a project

Persona: Data publisher

Amazon DataZone projects enable a group of users to collaborate with data. In this part of the ATPCO use case, the project is used to publish sales data as an asset in the project. By tying the eventual data asset to a project (rather than a user), the asset will have long-lived ownership beyond the tenure of any single employee or group of employees.

  1. As a data publisher, obtain theURL of the domain’s data portal from your domain administrator, navigate to this sign-in page and authenticate with IAM or SSO. After you’re signed in to the data portal, choose Create Project, enter a name (such as Sales Data Assets) and choose Create.
  2. If you want to add teammates to the project, choose Add Members. On the Project members page, choose Add Members, search for the relevant IAM or SSO principals, and select a role for them in the project. Owners have full permissions in the project, while contributors are not able to edit or delete the project or control membership. Choose Add Members to complete the membership changes.

Step 3.2: Create an environment

Persona: Data publisher

Projects can be comprised of several environments. Amazon DataZone environments are collections of configured resources (for example, an S3 bucket, an AWS Glue database, or an Athena workgroup). They can be useful if you want to manage stages of data production for the same essential data products with separate AWS resources, such as raw, filtered, processed, and curated data stages.

  1. While signed in to the data portal and in the Sales Data Assets project, choose the Environments tab, and then select Create Environment. Enter a name, such as Processed, referencing the processed stage of the underlying data.
  2. Select the Sales – Data lake blueprint environment profile the domain administrator created in Part 2.
  3. Choose Create Environment. Notice that you don’t need any technical details about the AWS account or resources! The creation process might take several minutes while Amazon DataZone sets up Lake Formation, Glue, and Athena.

Step 3.3: Create a new data source and run an ingestion job

Persona: Data publisher

In this use case, ATPCO has cataloged their data using AWS Glue. Amazon DataZone can use AWS Glue as a data source. Amazon DataZone data source (for AWS Glue) is a representation of one or more AWS Glue databases, with the option to set table selection criteria based on their name. Similar to how AWS Glue crawlers scan for new data and metadata, you can run an Amazon DataZone ingestion job against an Amazon DataZone data source (again, AWS Glue) to pull all of the matching tables and technical metadata (such as column headers) as the foundation for one or more data assets. An ingestion job can be run manually or automatically on a schedule.

  1. While signed in to the data portal and in the Sales Data Assets project, choose the Data tab, and then select Data sources. Choose Create Data Source, and enter a name for your data source, such as Processed Sales data in Glue, select AWS Glue as the type, and choose Next.
  2. Select the Processed environment from Step 3.2. In the database name box, enter a value or select from the suggested AWS Glue databases that Amazon DataZone identified in the AWS account. You can add additional criteria and another AWS Glue database.
  3. For Publishing settings, select No. This allows you to review and enrich the suggested assets before publishing them to the business data catalog.
  4. For Metadata generation methods, keep this box selected. Amazon DataZone will provide you with recommended business names for the data assets and its technical schema to publish an asset that’s easier for consumers to find.
  5. Clear Data quality unless you have already set up AWS Glue data quality. Choose Next.
  6. For Run preference, select to run on demand. You can come back later to run this ingestion job automatically on a schedule. Choose Next.
  7. Review the selections and choose Create.

To run the ingestion job for the first time, choose Run in the upper right corner. This will start the job. The run time is dependent on the quantity of databases, tables, and columns in your data source. You can refresh the status by choosing Refresh.

Step 3.4: Review, curate, and publish assets

Persona: Data publisher

After the ingestion job is complete, the matching AWS Glue tables will be added to the project’s inventory. You can then review the asset, including automated metadata generated by Amazon DataZone, add additional metadata, and publish the asset.

  • While signed in to the data portal and in the Sales Data Assets project, go to the Data tab, and select Inventory. You can review each of the data assets generated by the ingestion job. Let’s select the first result. In the asset detail page, you can edit the asset’s name and description to make it easier to find, especially in a list of search results.
  • You can edit the Read Me section and add rich descriptions for the asset, with markdown support. This can help reduce the questions consumers message the publisher with for clarification.
  • You can edit the technical schema (columns), including adding business names and descriptions. If you enabled automated metadata generation, then you’ll see recommendations here that you can accept or reject.
  • After you are done enriching the asset, you can choose Publish to make it searchable in the business data catalog.

Have the data publisher for each asset follow Part 3. For ATPCO, this means two additional teams followed these steps to get pricing and de-identified customer data into the data catalog.

Part 4: Consume assets as part of analyzing data to generate insights

Now that the business data catalog has three published data assets, data consumers will find available data to start their analysis. In this final part, an ATPCO data analyst can find the assets they need, obtain approved access, and analyze the data in Athena, forming the precursor of a data product that ATPCO can then make available to their customer (such as an airline).

Step 4.1: Discover and find data assets in the catalog

Persona: Data consumer

As a data consumer, obtain the URL of the domain’s data portal from your domain administrator, navigate to in the sign-in page, and authenticate with IAM or SSO. In the data portal, enter text to find data assets that match what you need to complete your analysis. In the ATPCO example, the analyst started by entering ticketing data. This returned the sales asset published above because the description noted that the data was related to “sales, including tickets and ancillaries (like premium seat selection preferences).”

The data consumer reviews the detail page of the sales asset, including the description and human-friendly terms in the schema, and confirms that it’s of use to the analysis. They then choose Subscribe. The data consumer is prompted to select a project for the subscription request, in which case they follow the same instructions as creating a project in Step 3.1, naming it Product analysis project. Enter a short justification of the request. Choose Subscribe to send the request to the data publisher.

Repeat Steps 4.2 and 4.3 for each of the needed data assets for the analysis. In the ATPCO use case, this meant searching for and subscribing to pricing and customer data.

While waiting for the subscription requests to be approved, the data consumer creates an Amazon DataZone environment in the Product analysis project, similar to Step 3.2. The data consumer selects an environment profile for their consumption AWS account and the data lake blueprint.

Step 4.2: Review and approve subscription request

Persona: Data publisher

The next time that a member of the Sales Data Assets project signs in to the Amazon DataZone data portal, they will see a notification of the subscription request. Select that notification or navigate in the Amazon DataZone data portal to the project. Choose the Data tab and Incoming requests and then the Requested tab to find the request. Review the request and decide to either Approve or Reject, while providing a disposition reason for future reference.

Step 4.3: Analyze data

Persona: Data consumer

Now that the data consumer has subscribed to all three data assets needed (by repeating steps 4.1-4.2 for each asset), the data consumer navigates to the Product analysis project in the Amazon DataZone data portal. The data consumer can verify that the project has data asset subscriptions by choosing the Data tab and Subscribed data.

Because the project has an environment with the data lake blueprint enabled in their consumption AWS account, the data consumer will see an icon in the right-side tab called Query Data: Amazon Athena. By selecting this icon, they’re taken to the Amazon Athena console.

In the Amazon Athena console, the data consumer sees the data assets their DataZone project is subscribed to (from steps 4.1-4.2). They use the Amazon Athena query editor to query the subscribed data.

Conclusion

In this post, we walked you through an ATPCO use case to demonstrate how Amazon DataZone allows users across an organization to easily discover relevant data products using business terms. Users can then request access to data and build products and insights faster. By providing self-service access to data with the right governance guardrails, Amazon DataZone helps companies tap into the full potential of their data products to drive innovation and data-driven decision making. If you’re looking for a way to unlock the full potential of your data and democratize it across your organization, then Amazon DataZone can help you transform your business by making data-driven insights more accessible and productive.

To learn more about Amazon DataZone and how to get started, refer to the Getting started guide. See the YouTube playlist for some of the latest demos of Amazon DataZone and short descriptions of the capabilities available.


About the Author

Brian Olsen is a Senior Technical Product Manager with Amazon DataZone. His 15 year technology career in research science and product has revolved around helping customers use data to make better decisions. Outside of work, he enjoys learning new adventurous hobbies, with the most recent being paragliding in the sky.

Mitesh Patel is a Principal Solutions Architect at AWS. His passion is helping customers harness the power of Analytics, machine learning and AI to drive business growth. He engages with customers to create innovative solutions on AWS.

Raj Samineni is the Director of Data Engineering at ATPCO, leading the creation of advanced cloud-based data platforms. His work ensures robust, scalable solutions that support the airline industry’s strategic transformational objectives. By leveraging machine learning and AI, Raj drives innovation and data culture, positioning ATPCO at the forefront of technological advancement.

Sonal Panda is a Senior Solutions Architect at AWS with over 20 years of experience in architecting and developing intricate systems, primarily in the financial industry. Her expertise lies in Generative AI, application modernization leveraging microservices and serverless architectures to drive innovation and efficiency.

How to migrate your AWS CodeCommit repository to another Git provider

Post Syndicated from Rodney Bozo original https://aws.amazon.com/blogs/devops/how-to-migrate-your-aws-codecommit-repository-to-another-git-provider/

Customers can migrate their AWS CodeCommit Git repositories to other Git providers using several methods, such as cloning the repository, mirroring, or migrating specific branches. This blog describes a basic use case to mirror a repository to a generic provider, and links to instructions for mirroring to more specific providers. Your exact steps could vary depending on the type or complexity of your repository, and the decisions made on what and how you want to migrate. This post only describes how to migrate Git repository data, and does not describe exporting other data from CodeCommit such as pull requests.

Pre-requisites

  1. Before you can migrate your CodeCommit repository to another provider, make sure that you have the necessary credentials and permissions to both the AWS Management Console and the other provider’s account. For migrating to GitHub and Gitlab, use CodeCommit static credentials as described in HTTPS users using Git credentials. If you choose to use the generic migration option process described below, any type of CodeCommit credentials can be used. To learn more about setting up AWS CodeCommit access control see Setting up for AWS CodeCommit.
  2. In the AWS CodeCommit console, select the clone URL for the repository you will migrate. The correct clone URL (HTTPS, SSH, or HTTPS (CRC)) depends on which credential type and network protocol you have chosen to use.

Figure 1: Clone repositories

Migrating your AWS CodeCommit repository to a GitLab repository

Using the CodeCommit clone URL in combination with the HTTPS Git repository credentials, follow the guidance in GitLab’s documentation for importing source code from a repository by URL.

Migrating your AWS CodeCommit repository to a GitHub repository

Using the CodeCommit clone URL in combination with the HTTPS Git repository credentials, follow the guidance in GitHub’s documentation for importing source code.

Generic migration to a different repository provider

1.     Clone the AWS CodeCommit Repository
Clone the repository from AWS CodeCommit to your local machine using Git. If you’re using HTTPS, you can do this by running the following command:

git clone --mirror https://your-aws-repository-url your-aws-repository

Replace your-aws-repository-url with the URL of your AWS CodeCommit repository.
Replace your-aws-repository with a name for this repository. Example:

git clone https://git-codecommit.us-east-2.amazonaws.com/v1/repos/MyDemoRepo my-demo-repo

2.     Set up new remote repository

Navigate to the directory of your cloned AWS CodeCommit repository. Then, add the repository URL from the new repository provider as a remote:

git remote add <provider name> <provider-repository-url>

Replace <provider name> with the provider name of your choice. (Example: gitlab)
Replace <provider-repository-url> with the URL of your new repository provider’s repository.

3.     Push your local repository to the new remote repository

This will push all branches and tags to your new repository provider’s repository. The provider name must match the provider name from step 2.

git push <provider name> --mirror

Notes:

  • The remote repository should be empty
  • The remote repository may have protected branches not allowing force push. In this case, navigate to your new repository provider and disable branch protections to allow force push.

4.     Verify the Migration

Once the push is complete, verify that all files, branches, and tags have been successfully migrated to the new repository provider. You can do this by browsing your repository online or by cloning it to another location and checking it locally.

5.     Update Remote URLs

If you plan to continue working with the migrated repository locally, you may want to update the remote URL to point to the new provider’s repository instead of AWS CodeCommit. You can do this using the following command:

git remote set-url origin <provider-repository-url>

Replace <provider-repository-url> with the URL of your new repository provider’s repository.

6.     Update CI/CD Pipelines and fix protected branches

If you have CI/CD pipelines set up that interact with your repository, such as GitLab, GitHub or AWS CodePipeline, update their configuration to reflect the new repository URL. If you removed protected branch permissions in Step 3 you may want to add these back to your main branch.

7.     Inform Your Team

If you’re migrating a repository that others are working on, be sure to inform your team about the migration and provide them with the new repository URL.

8.     Delete the, now migrated, AWS CodeCommit repository

This action cannot be undone. Navigate back to the AWS CodeCommit console and delete the repository that you have migrated using the “Delete Repository” button.

Figure 2: Delete repositories

Conclusion

This post described a few methods to migrate your existing AWS CodeCommit repository to another Git provider. After migration, you have the option to continue to use your current AWS CodeCommit repository, but doing so will likely require a regular sync operation between AWS CodeCommit and the new repository provider. For more information about repository migration, please see the following resources:

Migrate a repository incrementally – This guide is written to migrate a repository to CodeCommit incrementally but can also be used for other Git providers.

Configure SAML federation with Amazon OpenSearch Serverless and Keycloak

Post Syndicated from Arpad Csoke original https://aws.amazon.com/blogs/big-data/configure-saml-federation-with-amazon-opensearch-serverless-and-keycloak/

Amazon OpenSearch Serverless is a serverless version of Amazon OpenSearch Service, a fully managed open search and analytics platform. On Amazon OpenSearch Service you can run petabyte-scale search and analytics workloads without the heavy lifting of managing the underlying OpenSearch Service clusters and Amazon OpenSearch Serverless supports workloads up to 30TB of data for time-series collections. Amazon OpenSearch Serverless provides an installation of OpenSearch Dashboards with every collection created.

The network configuration for an OpenSearch Serverless collection controls how the collection can be accessed over the network. You have the option to make the collection publicly accessible over the internet from any network, or to restrict access to the collection only privately through OpenSearch Serverless-managed virtual private cloud (VPC) endpoints. This network access setting can be defined separately for the collection’s OpenSearch endpoint (used for data operations) and its corresponding OpenSearch Dashboards endpoint (used for visualizing and analyzing data). In this post, we work with a publicly accessible OpenSearch Serverless collection.

SAML enables users to access multiple applications or services with a single set of credentials, eliminating the need for separate logins for each application or service. This improves the user experience and reduces the overhead of managing multiple credentials. We provide SAML authentication for OpenSearch Serverless. With this you can use your existing identity provider (IdP) to offer single sign-on (SSO) for the OpenSearch Dashboards endpoints of serverless collections. OpenSearch Serverless supports IdPs that adhere to the SAML 2.0 standard, including services like AWS IAM Identity Center, Okta, Keycloak, Active Directory Federation Services (AD FS), and Auth0. This SAML authentication mechanism is solely intended for accessing the OpenSearch Dashboards interface through a web browser.

In this post, we show you how to configure SAML authentication for controlling access to public OpenSearch Dashboards using Keycloak as an IdP.

Solution overview

The following diagram illustrates a sample architecture of a solution that allows users to authenticate to OpenSearch Dashboards using SSO with Keycloak.

The sign-in flow includes the following steps:

  1. A user accesses OpenSearch Dashboards in a browser and chooses an IdP from the list.
  2. OpenSearch Serverless generates a SAML authentication request.
  3. OpenSearch Service redirects the request back to the browser.
  4. The browser redirects the user to the selected IdP (Keycloak). Keycloak provides a login page, where users can provide their login credentials.
  5. If authentication was successful, Keycloak returns the SAML response to the browser.
  6. The SAML assertions is sent back to OpenSearch Serverless.
  7. OpenSearch Serverless validates the SAML assertion, and logs the user in to OpenSearch Dashboards.

Prerequisites

To get started, you should have the following prerequisites:

  1. An active OpenSearch Serverless collection
  2. A working Keycloak server (on premises or in the cloud)
  3. The following AWS Identity and Access Management (IAM) permissions to configure SAML authentication in OpenSearch Serverless:
    • aoss:CreateSecurityConfig – Create a SAML provider.
    • aoss:ListSecurityConfig – List all SAML providers in the current account.
    • aoss:GetSecurityConfig – View SAML provider information.
    • aoss:UpdateSecurityConfig – Modify a given SAML provider configuration, including the XML metadata.
    • aoss:DeleteSecurityConfig – Delete a SAML provider.

Create and configure a client in Keycloak

Complete the following steps to create your Keycloak client:

  1. Login to your Keycloak admin page.
  2. In the navigation pane, choose Client.
  3. Choose Create client
  4. For Client type, choose SAML.
  5. For Client ID enter aws:opensearch:AWS_ACCOUNT_ID, where AWS_ACCOUNT_ID is your AWS account ID.
  6. Enter a name and description for your client.
  7. Choose Next.
  8. For Valid redirect URIs, enter the address of the assertion consumer service (ACS), where REGION is the AWS Region in which you have created the OpenSearch Serverless collection.
  9. For Master SAML Processing URL, also enter the preceding ACS address.
  10. Complete your client creation.
  11. After you create the client, you have to disable the Signing keys config setting, because OpenSearch Serverless signed and encrypted requests are not supported. For more details, refer to Considerations.
  12. After you have created the client and disabled the client signature, you can export the SAML 2.0 IdP Metadata by choosing the link on the Realm settings page. You need this metadata, when you create the SAML provider in OpenSearch Serverless.

Create a SAML provider

When your OpenSearch Serverless collection is active, you then create a SAML provider. This SAML provider can be assigned to any collection in the same Region. Complete the following steps:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose SAML authentication under Security.
  2. Choose Create SAML provider.
  3. Enter a name and description for your SAML provider.
  4. Enter the IdP metadata you downloaded earlier from Keycloak.
  5. Under Additional settings, you can optionally add custom user ID and group attributes (for this example, we leave this empty).
  6. Choose Create a SAML provider.

You have now configured a SAML provider for OpenSearch Serverless. Next, you configure the data access policy for accessing collections.

Create a data access policy

After you have configured SAML provider, you have to create data access policies for OpenSearch Serverless to allow access to the users.

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Data access policies under Security.
  2. Choose Create access policy.
  3. Enter a name and optional description for your access policy.
  4. For Policy definition method, select Visual editor.
  5. For Rule name, enter a name.
  6. Under Select principals, for Add principals, choose SAML users and groups.

  7. For SAML provider name, choose the provider you created before.
  8. Choose Save.

  9. Specify the user or group in the format user/USERNAME or group/GROUPNAME. The value of the USERNAME or GROUPNAME should match the value you specified in Keycloak for user-/groupname.
  10. Choose Save.
  11. Choose Grant to grant permissions to resources.
  12. In the Grant resources and permissions section, you can specify access you want to provide for a given user at the collection level, and also at the index pattern level.
    For more information about how to set up more granular access for your users, refer to Supported OpenSearch API operations and permissions and Supported policy permissions.
  13. Choose Save.
  14. You can create additional rules if needed.
  15. Choose Create to create the data access policy.

Now, you have data access policy that will allow users to access the OpenSearch Dashboards and perform the allowed actions there.

Access the OpenSearch Dashboards

Complete the following steps to sign in to the OpenSearch Dashboards:

  1. On the OpenSearch Service console, under Serverless in the navigation pane, choose Dashboard.
  2. In the Collection section, locate your collection and choose Dashboard.

    The OpenSearch login page will open in a new browser tab.
  3. Choose your IdP provider on the dropdown menu and choose Login.

    You will be redirected to the Keycloak sign-in page.
  4. Log in with your SSO credentials.

After a successful login, you will be redirected to OpenSearch Dashboards, and you can perform the actions allowed by the data access policy.

You have successfully federated OpenSearch Dashboards with Keycloak as an IdP.

Cleaning up

When you’re done with this solution, delete the resources you created if you no longer need them.

  1. Delete your OpenSearch Serverless collection.
  2. Delete your data access policy.
  3. Delete the SAML provider.

Conclusion

In this post, we demonstrated how to set up Keycloak as an IdP to access an OpenSearch Serverless dashboard using SAML authentication. For more details, refer to SAML authentication for Amazon OpenSearch Serverless


About the Author

Arpad Csoke is a Solutions Architect at Amazon Web Services. His responsibilities include helping large enterprise customers understand and utilize the AWS environment, acting as a technical consultant to contribute to solving their issues.

How ActionIQ built a truly composable customer data platform using Amazon Redshift

Post Syndicated from Mackenzie Johnson original https://aws.amazon.com/blogs/big-data/how-actioniq-built-a-truly-composable-customer-data-platform-using-amazon-redshift/

This post is written in collaboration with Mackenzie Johnson and Phil Catterall from ActionIQ.

ActionIQ is a leading composable customer data (CDP) platform designed for enterprise brands to grow faster and deliver meaningful experiences for their customers. ActionIQ taps directly into a brand’s data warehouse to build smart audiences, resolve customer identities, and design personalized interactions to unlock revenue across the customer lifecycle. Enterprise brands including Albertsons, Atlassian, Bloomberg, e.l.f. Beauty, DoorDash, HP, and more use ActionIQ to drive growth through better customer experiences.

High costs associated with launching campaigns, the security risk of duplicating data, and the time spent on SQL requests have created a demand for a better solution for managing and activating customer data. Organizations are demanding secure, cost efficient, and time efficient solutions to power their marketing outcomes.

This post will demonstrate how ActionIQ built a connector for Amazon Redshift to tap directly into your data warehouse and deliver a secure, zero-copy CDP. It will cover how you can get started with building a truly composable CDP with Amazon Redshift—from the solution architecture to setting up and testing the connector.

The challenge

Copying or moving data means heavy and complex logistics, along with added incurred cost and security risks associated with replicating data. On the logistics side, data engineering teams have to set up additional extract, transform, and load (ETL) pipelines out of their Amazon Redshift warehouse into ActionIQ, then configure ActionIQ to ingest the data on a recurring basis. Additional ETL jobs means more moving parts, which introduces more potential points of failure, such as breaking schema changes, partial data transfers, delays, and more. All of this requires additional observability overhead to help your team alert on and manage issues as they come up.

These additional ETL jobs add latency to the end-to-end process from data collection to activation, which makes it more likely that your campaigns are activating on stale data and missing key audience members. That will have implications for the customer experience, thereby directly affecting their ability to drive revenue.

The solution

Our solution aims to reduce the logistics already discussed and enables up-to-the-minute data by establishing a secure connection and pushing queries directly down to your data warehouse. Instead of loading full datasets into ActionIQ, the query is pushed to the data warehouse, making it do the hard querying and aggregation work, and wait for the result set.

With Amazon Redshift as your data warehouse, you can run complex workloads with consistently high performance while minimizing the time and effort spent in copying data over to the data warehouse through the use of features like zero-ETL integration with transactional data stores, streaming ingestion, and data sharing. You can also train machine learning models and make predictions directly from your Amazon Redshift data warehouse using familiar SQL commands.

Solution architecture

Within AWS, ActionIQ has a virtual private cloud (VPC) and you have your own VPC. We work within our own private area in AWS, with our own locks and access restrictions. Because ActionIQ is going to have access to your Amazon Redshift data warehouse, this implies that an outside organization (ActionIQ) will be able to make direct database queries to the production database environment.

A Solution architecture diagram showing AWS PrivaeLink set up between ActionIQ's VPC and the customer's VPC

For your information security (infosec) teams to approve this design, we need very clear and tight guardrails to ensure that:

  • ActionIQ only has access to what is absolutely necessary
  • No unintended third party can access these assets

ActionIQ needs to communicate securely to satisfy every information security requirement. To do that, within those AWS environments, you must set up AWS PrivateLink with ActionIQ to create a secure connection between the two VPCs. PrivateLink sets up a secure tunnel between the two VPCs, thus avoiding any opening of either VPC to the public internet. After PrivateLink is set up, ActionIQ needs to be granted privileges to the relevant database objects in your Amazon Redshift data warehouse.

In Amazon Redshift, you must create a distinct database, a service account specifically for ActionIQ, and views to populate the data to be shared with ActionIQ. The views need to adhere to ActionIQ’s data model guidelines, which aren’t rigid, but nonetheless require some structure, such as a clear profile_id that is used in all the views for easy joins between the various data sets.

Getting started with ActionIQ

When starting a hybrid compute integration with Amazon Redshift, it’s key to align your data to ActionIQ’s table types in the following manner:

  • Customer base table: A single dimension table with one record per customer that contains all customers.
  • User info tables: Dimension tables that describe customers and join to the customer base table. They often contain slow-moving or static demographic information and are typically matched one to one with customer records.
  • Event tables: Fact or log-like tables contain events or actions your customers take. The primary key is typically a user_id and timestamp.
  • Entity tables: Dimension tables that describe non-customer objects. They often provide additional information to augment the data in event tables. For example, an entity table could be a product table that contains product metadata and joins to a transaction event table on a product_id.

Visual representation of the relationships of the high level entities in the customer, event and product subject areas

Note: User info and event tables can join on any available identifier to the customer base table, not just the base user ID.

Now you can set up the connection and declare the views in the ActionIQ UI. After ActionIQ establishes a table with master profiles, users can begin to interpret those tables and work with them to build out campaigns.

Establish a secure connection

After setting up PrivateLink, the remaining steps to prepare for hybrid compute are the following:

  1. Create a separate database in Amazon Redshift to define the shared dataset with ActionIQ.
  2. Create a service account for ActionIQ in Amazon Redshift.
  3. Grant READ access to the service account for the dedicated database.
  4. Define the views that will be shared with ActionIQ.

Allow listing

If your data warehouse is on a private network, you must add ActionIQ’s IP addresses to your network’s allow list to allow ActionIQ to access your cloud warehouse. For more information on how to set this up, see the Configure inbound rules for SQL clients.

Database set up

Create an Amazon Redshift user for ActionIQ

  1. Sign in to the Amazon Redshift console
  2. From the navigation menu, choose the Query Editor and connect to your database.
  3. Create a user for ActionIQ:
    CREATE USER actioniq PASSWORD 'password';

  4. Grant permissions to the tables within a given schema you want to give ActionIQ access to:
GRANT USAGE ON SCHEMA 'yourschema' TO actioniq;
GRANT SELECT ON ALL TABLES IN SCHEMA 'yourschema' TO actioniq;

You can then run commands in the query editor to create a new user and to grant permission to the data sets you want to access through ActionIQ.

The result is that ActionIQ now has a programmatic query access to a dedicated database in your Amazon Redshift data warehouse, and that access is limited to that database.

In order to make this easy to govern, we recommend the following guidelines on the shared views:

  • As much as possible, the shared objects should be views and not tables.
  • The views should never use select *, but should explicitly specify each field desired in the view. This has multiple benefits:
    • The schema is robust; even if the underlying table changes, it won’t initiate a change in the shared view
    • It makes it very clear which fields are accessible by ActionIQ and which are not, thereby enabling a proper governance approval process.
  • Limiting privileges to READ access means the data warehouse administrators can be structurally certain that the data views won’t change unless they want them to.

The importance of providing views instead of actual tables is two-fold:

  1. A view doesn’t replicate data. The whole point is to avoid data replication, and we don’t want to replicate data within Amazon Redshift either. With a view, which is essentially a query definition on top of the actual data tables, we avoid the need to replicate data at all. There is a legitimate question of “why not give access to tables directly?” which brings us to the second point.
  2. Tables and data schema change on their own schedule, and ActionIQ needs a stable data schema to work with. By defining a view, we’re also defining a contract for sharing data between you and ActionIQ. The underlying data table can change, and the view definition can absorb this change, without modifying the structure of what the view delivers. This stability is critical for any enterprise software as a service to work effectively with a large organization.

On the ActionIQ side, there’s no caching or persisting of data of any kind. This means that ActionIQ launches a query whenever a scheduled campaign is launched and requires data, and whenever a user of the platform asks for an audience count. In other words, queries will generally happen during business hours, but technically can happen at any time.

Testing the connector

ActionIQ deploys the Amazon Redshift connector and tested queries to validate the success of the connector. After the audience is defined and validated, ActionIQ sends the SQL query to Amazon Redshift and it returns information. We also validate the results with Amazon Redshift to ensure that the logic is correct as intended.

With this, you experience a lean and more transparent deployment process. You can see the queries ActionIQ sends to Amazon Redshift, because the queries are logged. You can see what’s going on, what is attributable to ActionIQ, and can see the growth of adoption and usage.

Image showing connection set up to an Amazon Redshift database

A Connector defines the credentials and other parameters needed to connect to a cloud database or warehouse. The Connector screen is used to create, view and manage your connectors.

Key considerations

Organizations need strong data governance. ActionIQ requires a contract of what the data is going to look like within the defined view. With the dynamic nature of data, strong governance workflows with defined fields are required to run the connector smoothly to achieve the ultimate outcome—driving revenue through marketing campaigns.

Because ActionIQ is used as the central hub of marketing orchestration and activation, it needs to process a lot of queries. Because marketing activity can have significant spikes in activity, it’s prudent to plan for the maximum load on the underlying database.

In one scenario, you might have spikey workloads. With Amazon Redshift Serverless, your data warehouse will be able to scale automatically to manage those spikes. That means Amazon Redshift can absorb large and sudden spikes in queries from ActionIQ without much technical planning.

If workload isolation is a priority and you want to run the ActionIQ workloads using dedicated compute resources, you can use the data sharing feature to create a data share that can be accessed by a dedicated Redshift serverless endpoint. This would allow ActionIQ to query up-to-date data from a separate Redshift serverless instance without the need to copy any data while maintaining complete workload isolation.

The data team needs data to run business intelligence. ActionIQ is driving marketing activation and creating a new data set for the universal contact history—essentially the log of all marketing contacts from the activity. ActionIQ provides this dataset back to Amazon Redshift, which can then be included in the BI reports for ROI measurements.

Conclusion

For your information security teams, the ActionIQ’s Amazon Redshift connector presents a viable solution because ActionIQ doesn’t replicate the data, and the controls outlined establish how ActionIQ accesses the data. Key benefits include:

  • Control: Choose where data is stored and queried to improve security and fit existing technology investments.
  • Performance: Reduce operational effort, increase productivity and cut down on unnecessary technology costs.
  • Power: Use the auto-scaling capabilities of Amazon Redshift for running your workload.

For business teams, the ActionIQ Amazon Redshift connector is querying the freshest data possible. With the connector, there is zero data latency—an important consideration with key audience members that are primed to convert.

ActionIQ is excited to launch the Amazon Redshift connector to activate your data where it lives—within your Amazon Redshift data warehouse—for a zero-copy, real-time experience that drives outcomes with your customers. To learn more about how organizations are modernizing their data platforms using Amazon Redshift, visit the Amazon Redshift page.

Enhance your Amazon Redshift investment with ActionIQ.


About the authors

Mackenzie Johnson is a Senior Manager at ActionIQ. She is an innovative marketing strategist who’s passionate about the convergence of complementary technologies and amplifying joint value. With extensive experience across digital transformation storytelling, she thrives on educating enterprise businesses about the impact of CX based on a data-driven approach.

Phil Catterall is a Senior Product Manager at ActionIQ and leads product development on ActionIQ’s foundational data management, processing, and query federation capabilities. He’s passionate about designing and building scalable data products to empower business users in new ways.

Sain Das is a Senior Product Manager on the Amazon Redshift team and leads Amazon Redshift GTM for partner programs including the Powered by Amazon Redshift and Redshift Ready programs.

Leveraging Amazon Q Developer for Efficient Code Debugging and Maintenance

Post Syndicated from Dr. Rahul Sharad Gaikwad original https://aws.amazon.com/blogs/devops/leveraging-amazon-q-developer-for-efficient-code-debugging-and-maintenance/

In this post, we guide you through five common components of efficient code debugging. We also show you how Amazon Q Developer can significantly reduce the time and effort required to manually identify and fix errors across numerous lines of code. With Amazon Q Developer on your side, you can focus on other aspects of software development, such as design and innovation. This leads to faster development cycles and more frequent releases, as you as a software developer can allocate less time to debugging and more time to building new features and functionality.

Debugging is a crucial part of the software development process. It involves identifying and fixing errors or bugs in the code to ensure that it runs smoothly and efficiently in the production environment. Traditionally, debugging has been a time-consuming and labor-intensive process, requiring developers to manually search through lines of code to investigate and subsequently fix errors. With Amazon Q Developer, a generative AI–powered assistant that helps you learn, plan, create, deploy, and securely manage applications faster, the debugging process becomes more efficient and streamlined. This enables you to easily understand the code, detect anomalies, fix issues and even automatically generate the test cases for your application code.

Let’s consider the following code example, which utilizes Amazon Bedrock, Amazon Polly, and Amazon Simple Storage Service (Amazon S3) to generate a speech explanation of a given prompt. A prompt is the message sent to a model in order to generate a response. It constructs a JSON payload with the prompt (for example – “explain black holes to 8th graders”) and configuration for the Claude model, and invokes the model via the Bedrock Runtime API. Amazon Polly converts the Large Language Model (LLM) response into an audio output format which is then uploaded to an Amazon S3 bucket location:

Sample Code Snippet:

import boto3
import json
# Define boto clients
brt = boto3.client(service_name='bedrock-runtime')
polly = boto3.client('polly')
s3 = boto3.client('s3')
body = json.dumps({
"prompt": "\n\nHuman: explain black holes to 8th graders\n\nAssistant:",
"max_tokens_to_sample": 300,
"temperature": 0.1,
"top_p": 0.9,
})
# Define variable
modelId = 'anthropic.claude-v2'
accept = 'application/json'
contentType = 'application/json'
response = brt.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
response_body = json.loads(response.get('body').read())
# Print response
print(response_body.get('completion'))
# Send response to Polly
polly_response = polly.synthesize_speech(
Text=response_body.get('completion'),
OutputFormat='mp3',
VoiceId='Joanna'
)
# Write audio to file
with open('output.mp3', 'wb') as file:
file.write(polly_response['AudioStream'].read())
# Upload file to S3 bucket
s3.upload_file('output.mp3', '<REPLACE_WITH_YOUR_BUCKET_NAME>', 'speech.mp3')

Top common ways to debug code:

1. Code Explanation:

The first step in debugging your code is to understand what the code is doing. Very often you have to build upon code from others. Amazon Q Developer comes with an inbuilt feature that allows it to quickly explain the selected code and identify the main components and functions. This feature utilizes natural language processing (NLP) and summarization capability of LLMs, making your code easy to understand, and helping you identify potential issues. To gain deeper insights into your codebase, simply select your code, right-click and select ‘Send to Amazon Q’ option from menu. Finally, click on “Explain” option.

Image showing the interface for sending selected code to Amazon Q Developer for explanation

Amazon Q Developer generates an explanation of the code, highlighting the key components and functions. You can then use this summary to quickly identify potential issues and minimize your debugging efforts. It demonstrates how Amazon Q Developer accepts your code as input and the summarizes it for you. It explains the code in a natural language, which can help developers understand and debug the code.

If you have any follow-up question regarding the given explanation, you can also ask Amazon Q Developer for more in-depth review of a specific task. With respect to the code context, Amazon Q Developer automatically understands the existing code functionality and provides more insights. We ask Amazon Q Developer to explain how the Amazon Polly service is being used for synthesizing in the sample code snippet. It provides detailed step-by-step clarification from the retrieval to the upload process. Using this approach, you can ask any clarifying questions similar to asking a colleague to help you better understand a topic.

Amazon Q Developer explains how Polly is used to Synthesize in the given code

2. Amplify debugging with Logs

The sample code is deficient in terms of support for debugging including proper exception handling mechanisms. This makes it difficult to identify and handle issues during the execution or runtime of the application, further causing delays in the overall development process. You can instruct Amazon Q Developer to add a new code block for debugging your application against potential failures. Amazon Q Developer analyzes the existing code and then recommends to add a new code snippet to include printing and logging of debug messages and exception handling. It embeds ‘try-except’ blocks to catch potential exceptions that may occur. You use the recommended code after reviewing and modifying it. For example, you need to update the variable REPLACE_WITH_YOUR_BUCKET_NAME with your actual Amazon S3 bucket name.

With Amazon Q, you seamlessly enhance your codebase by selecting specific code segments and providing them as prompts. This enables you to explore suggestions for incorporating the required exception handling mechanisms or strategic logging statements.

Amazon Q Developer interface illustrating the addition of debugging code to the original code snippet

Logs are an essential tool for debugging your application code. They allow you to track the execution of your code and identify where errors are occurring. With Amazon Q Developer, you can view and analyze log data from your application, and identify issues and errors that may not be immediately apparent from the code. It demonstrates that the given code does not have the logging statements embedded.

You can provide a prompt to Amazon Q Developer in the chat window asking it to write a code to enable logging and add log statements. This will automatically add logging statements wherever needed. It is useful to log inputs, outputs, and errors when making API calls or invoking models. You may copy the entire recommended code to the clipboard, review it, and then modify as needed. For example, you can decide which logging level you want to use, INFO or DEBUG.

Example of logging statements inserted into the code by Amazon Q Developer for enhanced debugging

3. Anomaly Detection

Another powerful use case for Amazon Q Developer is anomaly detection in your application code. Amazon Q Developer uses machine learning algorithms to identify unusual patterns in your code and highlight potential issues. It can detect issues that may be difficult to detect, such as an array out of bounds, infinite loops, or concurrency issues. For demonstration purposes, we intentionally introduce a simple anomaly in the code. We ask Amazon Q Developer to detect the anomaly in the code when attempting to generate text from the response. A simple prompt about detecting anomalies in the given code generates a useful output. It is able to detect the anomaly in the ‘response_body’ dictionary. Additionally, Amazon Q Developer recommends best practices to check the status code and handle the errors with a code snippet.

Process of submitting selected code to Amazon Q Developer for automated bug fixing

With code anomaly detection at your fingertips, you can promptly identify issues that may be impacting the application’s performance or user experience.

4. Automated Bug Fixing

After conducting all the necessary analysis, you can use Amazon Q Developer to fix bugs and issues in the code. Amazon Q Developer’s automated bug-fixing feature saves you hours you would otherwise spend on debugging and testing the code without its help. Starting with the code example which has an anomaly, you can identify and fix the code issue by simply selecting the code and sending it to Amazon Q Developer to apply fixes.

Process of submitting selected code to Amazon Q Developer for automated bug fixing

Amazon Q Developer identifies and suggests various ways to fix common issues in your code, such as syntax errors, logical errors, and performance issues. It can also recommend optimizations and improvements to your code, helping you to deliver a better user experience and improve your application’s performance.

5. Automated Test Case Generation

Testing is an integral part of code development and debugging. Running high quality test cases improve the quality of the code. With Amazon Q Developer, you can automatically generate the test cases quickly and easily. Amazon Q Developer uses its Large Language Model core to generate test cases based on your code, identifying potential issues and ensuring that your code is comprehensive and reliable.

With automated test case generation, you save time and effort while testing your code, ensuring that it is robust and reliable. A natural language prompt is sent to Amazon Q Developer to suggest test cases for given application code. You can notice that Amazon Q Developer provides a list of possible test cases that can aid in debugging and generating further test cases.

Recommendations and solutions provided by Amazon Q Developer to address issues in the code blocks

After receiving suggestions for test cases, you can also ask Amazon Q Developer to generate code snippet for some or all of the identified test cases.

Test cases proposed by Amazon Q Developer for evaluating the example application code

Conclusion:

Amazon Q Developer revolutionizes the debugging process for developers by leveraging advanced and natural language understanding and generative AI capabilities. From explaining and summarizing code to detecting anomalies, implementing automatic bug fixes, and generating test cases, Amazon Q Developer streamlines common aspects of debugging. Developers can now spend less time manually hunting for errors and more time innovating and building features. By harnessing the power of Amazon Q, organizations can accelerate development cycles, improve code quality, and deliver superior software experiences to their customers.

To get started with Amazon Q Developer for debugging today navigate to Amazon Q Developer in IDE and simply start asking questions about debugging. Additionally, explore Amazon Q Developer workshop for additional hands-on use cases.

For any inquiries or assistance with Amazon Q Developer, please reach out to your AWS account team.

About the authors:

Dr. Rahul Sharad Gaikwad

Dr. Rahul is a Lead DevOps Consultant at AWS, specializing in migrating and modernizing customer workloads on the AWS Cloud. He is a technology enthusiast with a keen interest in Generative AI and DevOps. He received his Ph.D. in AIOps in 2022. He is recipient of the Indian Achievers’ Award (2024), Best PhD Thesis Award (2024), Research Scholar of the Year Award (2023) and Young Researcher Award (2022).

Nishant Dhiman

Nishant is a Senior Solutions Architect at AWS with an extensive background in Serverless, Generative AI, Security and Mobile platform offerings. He is a voracious reader and a passionate technologist. He loves to interact with customers and believes in giving back to community by learning and sharing. Outside of work, he likes to keep himself engaged with podcasts, calligraphy and music.

Streamline your data governance by deploying Amazon DataZone with the AWS CDK

Post Syndicated from Bandana Das original https://aws.amazon.com/blogs/big-data/streamline-your-data-governance-by-deploying-amazon-datazone-with-the-aws-cdk/

Managing data across diverse environments can be a complex and daunting task. Amazon DataZone simplifies this so you can catalog, discover, share, and govern data stored across AWS, on premises, and third-party sources.

Many organizations manage vast amounts of data assets owned by various teams, creating a complex landscape that poses challenges for scalable data management. These organizations require a robust infrastructure as code (IaC) approach to deploy and manage their data governance solutions. In this post, we explore how to deploy Amazon DataZone using the AWS Cloud Development Kit (AWS CDK) to achieve seamless, scalable, and secure data governance.

Overview of solution

By using IaC with the AWS CDK, organizations can efficiently deploy and manage their data governance solutions. This approach provides scalability, security, and seamless integration across all teams, allowing for consistent and automated deployments.

The AWS CDK is a framework for defining cloud IaC and provisioning it through AWS CloudFormation. Developers can use any of the supported programming languages to define reusable cloud components known as constructs. A construct is a reusable and programmable component that represents AWS resources. The AWS CDK translates the high-level constructs defined by you into equivalent CloudFormation templates. AWS CloudFormation provisions the resources specified in the template, streamlining the usage of IaC on AWS.

Amazon DataZone core components are the building blocks to create a comprehensive end-to-end solution for data management and data governance. The following are the Amazon DataZone core components. For more details, see Amazon DataZone terminology and concepts.

  • Amazon DataZone domain – You can use an Amazon DataZone domain to organize your assets, users, and their projects. By associating additional AWS accounts with your Amazon DataZone domains, you can bring together your data sources.
  • Data portal – The data portal is outside the AWS Management Console. This is a browser-based web application where different users can catalog, discover, govern, share, and analyze data in a self-service fashion.
  • Business data catalog – You can use this component to catalog data across your organization with business context and enable everyone in your organization to find and understand data quickly.
  • Projects – In Amazon DataZone, projects are business use case-based groupings of people, assets (data), and tools used to simplify access to AWS analytics.
  • Environments – Within Amazon DataZone projects, environments are collections of zero or more configured resources on which a given set of AWS Identity and Access Management (IAM) principals (for example, users with a contributor permissions) can operate.
  • Amazon DataZone data source – In Amazon DataZone, you can publish an AWS Glue Data Catalog data source or Amazon Redshift data source.
  • Publish and subscribe workflows – You can use these automated workflows to secure data between producers and consumers in a self-service manner and make sure that everyone in your organization has access to the right data for the right purpose.

We use an AWS CDK app to demonstrate how to create and deploy core components of Amazon DataZone in an AWS account. The following diagram illustrates the primary core components that we create.

In addition to the core components deployed with the AWS CDK, we provide a custom resource module to create Amazon DataZone components such as glossaries, glossary terms, and metadata forms, which are not supported by AWS CDK constructs (at the time of writing).

Prerequisites

The following local machine prerequisites are required before starting:

Deploy the solution

Complete the following steps to deploy the solution:

  1. Clone the GitHub repository and go to the root of your downloaded repository folder:
    git clone https://github.com/aws-samples/amazon-datazone-cdk-example.git
    cd amazon-datazone-cdk-example

  2. Install local dependencies:
    $ npm ci ### this will install the packages configured in package-lock.json

  3. Sign in to your AWS account using the AWS CLI by configuring your credential file (replace <PROFILE_NAME> with the profile name of your deployment AWS account):
    $ export AWS_PROFILE=<PROFILE_NAME>

  4. Bootstrap the AWS CDK environment (this is a one-time activity and not needed if your AWS account is already bootstrapped):
    $ npm run cdk bootstrap

  5. Run the script to replace the placeholders for your AWS account and AWS Region in the config files:
    $ ./scripts/prepare.sh <<YOUR_AWS_ACCOUNT_ID>> <<YOUR_AWS_REGION>>

The preceding command will replace the AWS_ACCOUNT_ID_PLACEHOLDER and AWS_REGION_PLACEHOLDER values in the following config files:

  • lib/config/project_config.json
  • lib/config/project_environment_config.json
  • lib/constants.ts

Next, you configure your Amazon DataZone domain, project, business glossary, metadata forms, and environments with your data source.

  1. Go to the file lib/constants.ts. You can keep the DOMAIN_NAME provided or update it as needed.
  2. Go to the file lib/config/project_config.json. You can keep the example values for projectName and projectDescription or update them. An example value for projectMembers has also been provided (as shown in the following code snippet). Update the value of the memberIdentifier parameter with an IAM role ARN of your choice that you would like to be the owner of this project.
    "projectMembers": [
                {
                    "memberIdentifier": "arn:aws:iam::AWS_ACCOUNT_ID_PLACEHOLDER:role/Admin",
                    "memberIdentifierType": "UserIdentifier"
                }
            ]

  3. Go to the file lib/config/project_glossary_config.json. An example business glossary and glossary terms are provided for the projects; you can keep them as is or update them with your project name, business glossary, and glossary terms.
  4. Go to the lib/config/project_form_config.json file. You can keep the example metadata forms provided for the projects or update your project name and metadata forms.
  5. Go to the lib/config/project_enviornment_config.json file. Update EXISTING_GLUE_DB_NAME_PLACEHOLDER with the existing AWS Glue database name in the same AWS account where you are deploying the Amazon DataZone core components with the AWS CDK. Make sure you have at least one existing AWS Glue table in this AWS Glue database to publish as a data source within Amazon DataZone. Replace DATA_SOURCE_NAME_PLACEHOLDER and DATA_SOURCE_DESCRIPTION_PLACEHOLDER with your choice of Amazon DataZone data source name and description. An example of a cron schedule has been provided (see the following code snippet). This is the schedule for your data source run; you can keep the same or update it.
    "Schedule":{
       "schedule":"cron(0 7 * * ? *)"
    }

Next, you update the trust policy of the AWS CDK deployment IAM role to deploy a custom resource module.

  1. On the IAM console, update the trust policy of the IAM role for your AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- by adding the following permissions. Replace ${ACCOUNT_ID} and ${REGION} with your specific AWS account and Region.
         {
             "Effect": "Allow",
             "Principal": {
                 "Service": "lambda.amazonaws.com"
             },
             "Action": "sts:AssumeRole",
             "Condition": {
                 "ArnLike": {
                     "aws:SourceArn": [
                         
                         "arn:aws:lambda:${REGION}:{ACCOUNT_ID}:function:DataZonePreqStack-GlossaryLambda*",
                         "arn:aws:lambda:${REGION}:{ACCOUNT_ID}:function:DataZonePreqStack-GlossaryTermLambda*",
                         "arn:aws:lambda:${REGION}:{ACCOUNT_ID}:function:DataZonePreqStack-FormLambda*"
                     ]
                 }
             }
         }

Now you can configure data lake administrators in Lake Formation.

  1. On the Lake Formation console, choose Administrative roles and tasks in the navigation pane.
  2. Under Data lake administrators, choose Add and add the IAM role for AWS CDK deployment that starts with cdk-hnb659fds-cfn-exec-role- as an administrator.

This IAM role needs permissions in Lake Formation to create resources, such as an AWS Glue database. Without these permissions, the AWS CDK stack deployment will fail.

  1. Deploy the solution:
    $ npm run cdk deploy --all

  2. During deployment, enter y if you want to deploy the changes for some stacks when you see the prompt Do you wish to deploy these changes (y/n)?.
  3. After the deployment is complete, sign in to your AWS account and navigate to the AWS CloudFormation console to verify that the infrastructure deployed.

You should see a list of the deployed CloudFormation stacks, as shown in the following screenshot.

  1. Open the Amazon DataZone console in your AWS account and open your domain.
  2. Open the data portal URL available in the Summary section.
  3. Find your project in the data portal and run the data source job.

This is a one-time activity if you want to publish and search the data source immediately within Amazon DataZone. Otherwise, wait for the data source runs according to the cron schedule mentioned in the preceding steps.

Troubleshooting

If you get the message "Domain name already exists under this account, please use another one (Service: DataZone, Status Code: 409, Request ID: 2d054cb0-0 fb7-466f-ae04-c53ff3c57c9a)" (RequestToken: 85ab4aa7-9e22-c7e6-8f00-80b5871e4bf7, HandlerErrorCode: AlreadyExists), change the domain name under lib/constants.ts and try to deploy again.

If you get the message "Resource of type 'AWS::IAM::Role' with identifier 'CustomResourceProviderRole1' already exists." (RequestToken: 17a6384e-7b0f-03b3 -1161-198fb044464d, HandlerErrorCode: AlreadyExists), this means you’re accidentally trying to deploy everything in the same account but a different Region. Make sure to use the Region you configured in your initial deployment. For the sake of simplicity, the DataZonePreReqStack is in one Region in the same account.

If you get the message “Unmanaged asset” Warning in the data asset on your datazone project, you must explicitly provide Amazon DataZone with Lake Formation permissions to access tables in this external AWS Glue database. For instructions, refer to Configure Lake Formation permissions for Amazon DataZone.

Clean up

To avoid incurring future charges, delete the resources. If you have already shared the data source using Amazon DataZone, then you have to remove those manually first in the Amazon DataZone data portal because the AWS CDK isn’t able to automatically do that.

  1. Unpublish the data within the Amazon DataZone data portal.
  2. Delete the data asset from the Amazon DataZone data portal.
  3. From the root of your repository folder, run the following command:
    $ npm run cdk destroy --all

  4. Delete the Amazon DataZone created databases in AWS Glue. Refer to the tips to troubleshoot Lake Formation permission errors in AWS Glue if needed.
  5. Remove the created IAM roles from Lake Formation administrative roles and tasks.

Conclusion

Amazon DataZone offers a comprehensive solution for implementing a data mesh architecture, enabling organizations to address advanced data governance challenges effectively. Using the AWS CDK for IaC streamlines the deployment and management of Amazon DataZone resources, promoting consistency, reproducibility, and automation. This approach enhances data organization and sharing across your organization.

Ready to streamline your data governance? Dive deeper into Amazon DataZone by visiting the Amazon DataZone User Guide. To learn more about the AWS CDK, explore the AWS CDK Developer Guide.


About the Authors

Bandana Das is a Senior Data Architect at Amazon Web Services and specializes in data and analytics. She builds event-driven data architectures to support customers in data management and data-driven decision-making. She is also passionate about enabling customers on their data management journey to the cloud.

Gezim Musliaj is a Senior DevOps Consultant with AWS Professional Services. He is interested in various things CI/CD, data, and their application in the field of IoT, massive data ingestion, and recently MLOps and GenAI.

Sameer Ranjha is a Software Development Engineer on the Amazon DataZone team. He works in the domain of modern data architectures and software engineering, developing scalable and efficient solutions.

Sindi Cali is an Associate Consultant with AWS Professional Services. She supports customers in building data-driven applications in AWS.

Bhaskar Singh is a Software Development Engineer on the Amazon DataZone team. He has contributed to implementing AWS CloudFormation support for Amazon DataZone. He is passionate about distributed systems and dedicated to solving customers’ problems.

How to use the AWS Secrets Manager Agent

Post Syndicated from Eduardo Patrocinio original https://aws.amazon.com/blogs/security/how-to-use-the-aws-secrets-manager-agent/

AWS Secrets Manager is a service that helps you manage, retrieve, and rotate database credentials, application credentials, OAuth tokens, API keys, and other secrets throughout their lifecycles. You can use Secrets Manager to replace hard-coded credentials in application source code with a runtime call to the Secrets Manager service to retrieve credentials dynamically when you need them. Storing the credentials in Secrets Manager helps to avoid unintended access by anyone who inspects your application’s source code, configuration, or components.

In this blog post, we introduce a new feature, the Secrets Manager Agent, and walk through how you can use it to retrieve Secretes Manager secrets.

New approach: Secrets Manager Agent

Previously, if you had an application that used Secrets Manager and needed to retrieve secrets, you had to use the AWS SDK or one of our existing caching libraries. Both these options are specific to a certain coding language and allow only limited scope for customization.

The Secrets Manager Agent is a client-side agent that allows you to standardize consumption of secrets from Secrets Manager across your AWS compute environments. (AWS has published the code for the agent as open source code.) Secrets Manager Agent pulls and caches secrets in your compute environment and allows your applications to consume secrets directly from the in-memory cache. The Secrets Manager Agent opens a localhost port inside your application environment. With this port, you fetch the secret value from the local agent instead of making network calls to the service. This allows you to improve the overall availability of your application while reducing your API calls. Because the Secrets Manager Agent is language agnostic, you can install the binary file of the agent on many types of AWS compute environments.

Although you can use this feature to retrieve and cache secrets in your application’s compute environment, the access controls for Secrets Manager secrets remain unchanged. This means that AWS Identity and Access Management (IAM) principals need the same permissions as if they were to retrieve each of the secrets. You will need to provide GetSecretValue and DescribeSecret permissions to the secrets that you want to consume by using the Secrets Manager Agent.

The Secrets Manager Agent offers protection against server-side request forgery (SSRF). When you install the Secrets Manager Agent, the script generates a random SSRF token on startup and stores it in the file /var/run/awssmatoken. The token is readable by the awssmatokenreader group that the install script creates. The Secrets Manager Agent denies requests that don’t have an SSRF token in the header or that have an invalid SSRF token.

Solution overview

The Secrets Manager Agent provides a language-agnostic way to consume secrets in your application code. It supports various AWS compute services, such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), and AWS Lambda functions. In this solution, we share how you can install the Secrets Manager Agent on an EC2 machine and retrieve secrets in your application code by using CURL commands. See the AWS Secrets Manager Agent documentation to learn how you can use this agent with other types of compute services.

Prerequisites

You need to have the following:

  1. An AWS account
  2. The AWS Command Line Interface (AWS CLI) version 2
  3. jq

Follow the steps on the Install or update to the latest version of the AWS CLI page to install the AWS CLI and the Configure the AWS CLI page to configure it.

Create the secret

The first step will be to create a secret in Secrets Manager by using the AWS CLI.

To create a secret

  • Enter the following command in a terminal to create a secret:
    aws secretsmanager create-secret --name MySecret --description "My Secret" \
      --secret-string "{\"user\": \"my_user\", \"password:\": \"my-password\"}"

    You will see an output like the following:

    % aws secretsmanager create-secret —name MySecret —description "My Secret" \
     —secret-string "{\"user\": \"my_user\", \"password:\": \"my-password\"}"
    {
     "ARN": "arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:MySecret-LrBlpm",
     "Name": "MySecret",
     "VersionId": "b5e73e9b-6ec5-4144-a176-3648304b2d60"
    }

    Record the secret ARN as <SECRET_ARN>, because you will use it in the next section.

Create the IAM role

The Lambda function, the EC2 instance, and the ECS task definition need an IAM role that grants permission to retrieve the secret you just created.

To create the IAM role

  1. Using an editor, create a file named ec2_iam_policy.json with the following content:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "ec2.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            } 
        ]
    }

  2. Type the following command in a terminal to create the IAM role:
    aws iam create-role --role-name ec2-secret-execution-role \
      --assume-role-policy-document file://ec2_iam_policy.json

  3. Create a file named iam_permission.json with the following content, replacing <SECRET_ARN> with the secret ARN you noted earlier:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret"
                ],
                "Resource": "<SECRET_ARN>"
            }
        ]
    }

  4. Type the following command to create a policy:
    aws iam create-policy \
      --policy-name get-secret-policy \
      --policy-document file://iam_permission.json

    Record the Arn as <POLICY_ARN>, because you will need that value next.

  5. Type the following command to add this policy to the IAM role, replacing <POLICY_ARN> with the value you just noted:
    aws iam attach-role-policy \
      --role-name ec2-secret-execution-role \
      --policy-arn <POLICY_ARN>

  6. Type the following command to add the AWS Systems Manager policy to the role:
    aws iam attach-role-policy \
      --role-name ec2-secret-execution-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Launch an EC2 instance

Use the steps in this section to launch an EC2 instance.

To create an instance profile

  1. Type the following command to create an instance profile:
    aws iam create-instance-profile --instance-profile-name secret-profile

  2. Type the following command to associate this instance profile with the role you just created:
    aws iam add-role-to-instance-profile --instance-profile-name secret-profile \
      --role-name ec2-secret-execution-role

To create a security group

  • Type the following command to create a security group:
    aws ec2 create-security-group --group-name secret-security-group \
      --description "Secrets Manager Security Group"

    Record the group ID as <GROUP_ID>, because you will need this value in the next step.

To launch an EC2 instance

  1. Run the following command to launch an EC2 instance, replacing <GROUP_ID> with the security group ID:
    aws ec2 run-instances \
      --image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2 \
      --instance-type t3.micro \
      --security-group-ids <GROUP_ID> \
      --iam-instance-profile Name=secret-profile \
      --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=secret-instance}]'

    Record the InstanceId value as <INSTANCE_ID>.

  2. Check the status of this launch by running the following command:
    aws ec2 describe-instances --filters Name=tag:Name,Values=secret-instance | \
      jq ".Reservations[0].Instances[0].State"

    You will see a response like the following, which shows that the instance is running:

    % aws ec2 describe-instances —filters Name=tag:Name,Values=secret-instance | jq ".Reservations[0].Instances[0].State"
    {
     "Code": 16,
     "Name": "running"
    }

  3. After the instance is in running state, type the following command to connect to the EC2 instance, replacing <INSTANCE_ID> with the value you noted earlier:
    aws ssm start-session --target <INSTANCE_ID>

Leave the session open, because you will use it in the next step.

Install the Secrets Manager Agent to the EC2 instance

Use the steps in this section to install the Secrets Manager Agent in the EC2 instance. You will run these commands in the EC2 instance you created earlier.

To download the Secrets Manager Agent code

  1. Type the following command to install git in the EC2 instance:
    sudo yum install -y git 

  2. Type the following command to download the Secrets Manager Agent code:
    cd ~;git clone https://github.com/awslabs/aws-secretsmanager-agent

To install the Secrets Manager Agent

  • Type the following command to install the Secrets Manager Agent:
    cd aws-secretsmanager-agent/release
    sudo ./install

To grant permission to read the token file

  • Type the following command to copy the token file and grant permission for the current user (ec2-user) to read it:
    sudo cp /var/run/awssmatoken /tmp
    sudo chown ssm-user /tmp/awssmatoken

Retrieve the secret

Now you can use the local web server to retrieve the agent. Processes running in this EC2 instance can retrieve the secret with a REST API call from the web server.

To retrieve a secret

Retrieving a secret is now possible for the process in this EC2 instance, thanks to the local agent.

  1. Run the following command to retrieve the secret:
    curl -H "X-Aws-Parameters-Secrets-Token: $(</tmp/awssmatoken)” localhost:2773/secretsmanager/get?secretId=MySecret

    You will see the following output:

    $ curl -H "X-Aws-Parameters-Secrets-Token: $(</tmp/awssmatoken)" localhost:2773/secretsmanager/get?secretId=MySecret
    {"ARN":"arn:aws:secretsmanager:us-east-1:XXXXXXXXXXXX:secret:MySecret-3z00LH","Name":"MySecret","VersionId":"e7b07d00-a0e8-41b9-b76e-45bdd8daca4f","SecretString":"{\"user\": \"my_user\", \"password:\": \"my-password\"}","VersionStages":["AWSCURRENT"],"CreatedDate":"1716912317.961"}

  2. Exit from the EC2 instance by typing exit.

Clean up

Follow the steps in this section to clean up the resources created by the solution.

To terminate the EC2 instance and associated resources

  1. Type the following command to stop the EC2 instance, replacing <INSTANCE_ID> with the EC2 InstanceId received at the time of instance launch:
    aws ec2 terminate-instances --instance-ids <INSTANCE_ID>

  2. Run the following command to delete the security group:
    aws ec2 delete-security-group --group-name secret-security-group

  3. Run the following command to delete the IAM role from the instance profile:
    aws iam remove-role-from-instance-profile --instance-profile-name secret-profile \
      --role-name ec2-secret-execution-role

  4. Run these commands to delete the instance profile:
    aws iam delete-instance-profile --instance-profile-name secret-profile

To clean up the IAM role

  1. Run the following command to delete the policy role, replacing <POLICY_ARN> with the value you noted earlier:
    aws iam detach-role-policy --role-name ec2-secret-execution-role \
      --policy-arn <POLICY_ARN>

  2. Run the following command to detach the policy from the role:
    aws iam detach-role-policy --role-name ec2-secret-execution-role \
      --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

  3. Run the following command to delete the IAM role:
    aws iam delete-role --role-name ec2-secret-execution-role

To clean up the secret

  • Run the following command to delete the secret:
    aws secretsmanager delete-secret --secret-id MySecret

Conclusion

In this post, we introduced the Secrets Manager Agent and showed how to install it in an EC2 instance, allowing the retrieval of secrets from Secrets Manager. An application can call this web server to retrieve secrets without using the AWS SDK. See the AWS Secrets Manager Agent documentation to learn more about how you can use this Secrets Manager Agent in other compute environments.

To learn more about AWS Secrets Manager, see the AWS Secrets Manager documentation.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Eduardo Patrocinio

Eduardo Patrocinio
Eduardo is a distinguished Principal Solutions Architect on the AWS Strategic Accounts team, bringing unparalleled expertise to the forefront of cloud technology. With an impressive career spanning over 25 years, Eduardo has been a driving force in designing and delivering innovative customer solutions within the dynamic realms of Cloud and Service Management.

Akshay Aggarwal

Akshay Aggarwal
Akshay is a Senior Technical Product Manager on the AWS Secrets Manager team. As part of AWS Cryptography, Akshay drives technologies and defines best practices that help improve customer’s experience of building secure, reliable workloads in the AWS Cloud. Akshay is passionate about building technologies that are easy to use, secure, and scalable.

Understanding Google Postmaster Tools (spam complaints) for Amazon SES email senders

Post Syndicated from Bruno Giorgini original https://aws.amazon.com/blogs/messaging-and-targeting/understanding-google-postmaster-tools-spam-complaints-for-amazon-ses-email-senders/

Introduction

Amazon Simple Email Service (SES) includes a robust set of built-in tools, such as the Virtual Deliverability Manager (VDM), to help senders ensure optimal email deliverability. Additionally, deliverability data from email service providers like Postmaster Tools by Google can provide invaluable insights for all sending domain owners, including those using SES for bulk or transactional email. Postmaster Tools offers detailed metrics on factors like delivery errors, spam rates, domain reputation, and recipient feedback for Gmail-hosted inboxes. Combining this external data with SES email sending events is critical for maintaining a healthy sender reputation. By leveraging both SES-native tools and resources like Postmaster Tools, senders can identify and address deliverability issues, ensuring their SES-powered emails reach intended recipients across providers.

Many, but not all, mailbox providers will send recipient feedback in the form of “complaints” that can each be attributed directly to the message that the recipient found to be objectionable. These complaints are available in the SES email sending event type “Complaint”. Gmail does not send spam complaint events because their priority is to protect the privacy of their users from the tracking techniques employed by spammers and data brokers. Gmail requires bulk senders to adopt “easy unsubscribe” mechanisms to reduce the need for their users to report messages as spam, and they will show spam complaint metrics in Postmaster Tools. This blog will show you how to maximize value in the spam complaint metric provided by Postmaster Tools.

Amazon SES now supports custom values in the Feedback-ID header in messages sent through SES. This feature provides additional details to help customers identify deliverability trends. Together with Postmaster Tools, customers can group complaints by identifiers of their choice, such as sender business unit or campaign ID. This makes it easier to track deliverability performance associated with independent workloads and campaigns, and accelerates troubleshooting when diagnosing complaint rates.

This image describes the flow for spam complaints to the email sender

Figure 1: Email Feedback Loop

This blog will guide you through implementing and using Feedback Loops within Postmaster Tools to identify email campaigns receiving high complaint volumes from Gmail users. It covers the history and background of feedback loops, the specific requirements for implementing them with Postmaster Tools, and practical examples using AWS CLI and Boto3 to send SES emails with the necessary Feedback-ID header. By the end, you’ll understand how to effectively set up and use Postmaster Tools to monitor and improve your SES email deliverability.

History and Background of FBLs

Traditional Feedback Loops (herein “FBLs”) have been a cornerstone of email deliverability for many years. Initially developed by Internet Service Providers (ISPs), FBLs serve as a mechanism for recipients to report spam complaints to the sender. This feedback is crucial for email service providers and senders to identify problematic email campaigns, take corrective actions, and maintain a healthy sender reputation.

FBLs operate by allowing recipients to mark emails as spam, which then sends a report to the sender’s email service provider. This report typically includes details about the email that triggered the complaint, enabling the sender to investigate and address any issues. By analyzing these reports, senders can refine their email lists, improve content, and ensure that their emails comply with best practices and regulatory requirements. Senders who receive a higher volume of spam complaints are more likely to be blocked or have their emails routed to the spam folder. While high spam complaints are not the sole reason for deliverability issues, they are often the underlying cause.

Postmaster Tools by Gmail is not a traditional FBL. Postmaster Tools will show complaint feedback metrics, but the complaints are not attributable to any individual recipient.

Requirements for using Postmaster Tools FBL with SES

The FBL helps identify campaigns with high complaint rates from Gmail users, specifically useful for email service providers to detect potential abuse of their services.

Note: Data in Postmaster Tools only applies to messages sent to personal Gmail accounts. A personal Gmail account is an account that ends in @gmail.comor @googlemail.com.

  • Implementation of FBL:
    • Feedback-ID Header: SES embeds a header called Feedback-ID containing parameters (Identifiers) uniquely identifying the account and SenderID (AmazonSES)
    • Header Format: The Feedback-ID header consists of four parameters, separated by colons:
      a:b:c:SenderId
      Where:
      • SenderId is a mandatory parameter that uniquely identifies the sender.
      • In the case of Amazon SES (Simple Email Service), the SenderId is always “AmazonSES” and cannot be overridden.
Header Parameter  Description
a  First parameter in the Feedback-ID header. SES users can customize through ses:feedback-id-a EmailTag
b  Second parameter in the Feedback-ID header. SES users can customize through ses:feedback-id-b EmailTag.
c  Third parameter in the Feedback-ID header. SES uses this to identify the sender account
SenderID  Fourth parameter in the Feedback-ID header. Mandatory parameter that uniquely identifies the sender. For Amazon SES, this is always “AmazonSES” and cannot be overridden.
  • Sender Data Handling:
    • DKIM signing by a sender-owned domain is required to prevent spoofing.
    • The domain must be added and verified in Postmaster Tools.
    • Complaint data is aggregated by distinct values on each of the 4 fields of Feedback-ID.
  • Feedback-ID header Requirements:
    • When sending emails through Amazon SES, users are limited to a single verified header value per traffic stream.
      • This means that the Feedback-ID header cannot contain an individualized value for each destination email address.
      • Instead, the Feedback-ID header needs to contain an identifier that can be used to match a larger campaign or batch of emails, rather than a unique value per recipient.
      • This constraint helps maintain a consistent sender reputation, improves deliverability monitoring and troubleshooting within tools like Postmaster Tools. The Feedback-ID acts as a grouping mechanism, rather than a per-message identifier
    • Identifiers must be unique and non-repetitive across fields.
  • Feedback-ID Example:
    • CampaignIDX:CustomerID2:1.us-west-2.TDQeKqHkSNfQztk25wIeVIGTuNmGDud4r1l7dUlxOio=:AmazonSES
      • Each Identifier is used to report spam percentages independently if unusual rates occur.
      • Amazon SES lets customers set the part a and part b of the Feedback-ID header using the EmailTag ses:feedback-id-a and ses:feedback-id-b
      • Amazon SES will combine these tags into a single Feedback-ID header with the format: Feedback-ID=a:b:region.accountId:AmazonSES

The next steps will cover what’s needed to leverage FBLs with SES.

Step 1 – Add Your Domain(s) To Google’s Postmaster Tools

  • In order to verify with Postmaster Tools that you’re authorized to track the feedback from your domain, you first need to register your ownership of the domain with Postmaster Tools by visiting https://gmail.com/postmaster/.
Verify a new domain with Google Postmaster Tools

Figure 2: Step 1 to verify a domain in Google Postmaster Tools

  • After entering in your domain, you’d be prompted to add a TXT record into your DNS configuration.
Step 2 to verify a domain in Google Postmaster Tools

Figure 3: Step 2 to verify a domain in Google Postmaster Tools

  • Update your sending domain(s) DNS records accordingly.
    • The example below specifies how to create the TXT record in Route53. If you’re using another DNS service provider, please refer to their documentation.
Create a new record in Route53

Figure 4: Create a new record in Route53

    • Navigate to the Route53 Console and click on Hosted zones , specify the hosted zone that contains the domain you want to verify and then Create record.
This image describe the creation of a TXT record including the value provided by Google Postmaster to verify the domain

Figure 5: Add a TXT record with the provided value for verification

    • Following the screenshot, create a TXT record type and paste the value assigned by Google for verification in step 2 here.
  • Go to Postmaster Tools and click on Verify. After successful verification of your domain in Postmaster Tools, you should see the Status column changed from Not Verified to Verified. You can verify your compliance status with the requirements in the Dashboard (2) link.
In this picture we show an example of how the domain would appear once verified in Google Postmaster Tools

Figure 6: Domain verified

  • Follow the recommendations provided in the Postmaster Tools dashboard to fully comply with the requirements (example below):
Email sender requirements

Figure 7: Email sender requirements compliance status recommendations

  • Once you have completed all the verification and configuration steps, you should see compliant checkmarks next to all available requirements (see example below):
Email sender requirements

Figure 8: Email sender requirements compliant status

Step 2 – Add Feedback-ID headers to your SES emails

  • Use this command line to send an email with Feedback-ID using the AWS CLI:
aws sesv2 send-email --from-email-address [email protected] \
   --destination '{"ToAddresses":["[email protected]"]}' \
   --content '{"Simple":{"Subject":{"Data":"Test Subject","Charset":"UTF-8"},"Body":{"Text":{"Data":"Test Data","Charset":"UTF-8"}}}}' \
   --email-tags '[{"Name": "ses:feedback-id-a","Value":"feedback-id-part-a-value"}]'

The values of ses:feedback-id-a and ses:feedback-id-b are specified using the --email-tags option.

  • Alternatively, use Boto3 to send an email with Feedback-ID with the following Python script:
import boto3
from botocore.exceptions import ClientError

def send_email(region_name):
    # Create a new SES client
    ses = boto3.client('sesv2', region_name=region_name)

    # Replace sender and recipient values
    SENDER = "Sender Name <[email protected]>"
    RECIPIENT = "[email protected]"
    CONFIGURATION_SET = "SES_Config_Set"
    SUBJECT = "Amazon SES Test (SDK for Python)"
    BODY_TEXT = "Amazon SES Test (Python)\r\nThis email was sent with Amazon SES using the AWS SDK for Python (Boto)."
    BODY_HTML = """<html>
    <head></head>
    <body>
      <h1>Amazon SES Test (SDK for Python)</h1>
      <p>This email was sent with
        <a href='https://aws.amazon.com/ses/'>Amazon SES</a> using the
        <a href='https://aws.amazon.com/sdk-for-python/'>
          AWS SDK for Python (Boto)</a>.</p>
    </body>
    </html>"""
    CHARSET = "UTF-8"

    try:
        # Send email
        response = ses.send_email(
            FromEmailAddress=SENDER,
            Destination={'ToAddresses': [RECIPIENT]},
            ConfigurationSetName=CONFIGURATION_SET,
            Content={
                "Simple": {
                    "Subject": {
                        "Charset": CHARSET,
                        "Data": SUBJECT
                    },
                    "Body": {
                        "Text": {
                            "Charset": CHARSET,
                            "Data": BODY_TEXT
                        },
                        "Html": {
                            "Charset": CHARSET,
                            "Data": BODY_HTML
                        }
                    },
                    "Headers": [
                        {
                            "Name": "List-Unsubscribe",
                            "Value": "<https://unsubscribe.example.email/[email protected]&topic=topic1>"
                        },
                        {
                            "Name": "List-Unsubscribe-Post",
                            "Value": "One-Click"
                        }
                    ]
                }
            },
            EmailTags=[
                {
                    'Name': 'ses:feedback-id-a',
                    'Value': 'campaign1'
                },
                {
                    'Name': 'ses:feedback-id-b',
                    'Value': 'line-of-business'
                }
            ] #the ses:feedback-id-a and ses:feedback-id-b are specified as a list using EmailTags
        )
        print("Email sent! Response:", response)
        print("Message ID:", response['MessageId'])

    except ClientError as e:
        print(e.response['Error']['Message'])

# Call the function to send the email
send_email(region_name='us-west-2')  # Specify the region here

Step 3 – Viewing FBL results in Postmaster Tools

In order to see any results in the Postmaster Tool dashboard (see examples below), you must send a substantial daily volume of email through the domain(s) you’ve registered. If you see the message “No Data to Display”, your reputation may already be too low, more likely the volume of email traffic sent since you configured the Postmaster tool is insufficient (return to the dashboard in later, after you’ve sent 1,000s of emails).

Figure 9: Feedback loop example image

Figure 9: Feedback loop example image

The image shows a section of the Postmaster Tools dashboard, specifically the Feedback Loop section. This dashboard provides insights into the spam complaint rates and the number of feedback loop identifiers flagged across a given time period, in this case, the last 120 days.

Conclusion

High-volume email senders should look to the combination of Amazon SES’ powerful framework for monitoring in concert with Postmaster Tools to improve and ensure email deliverability. Implementing the Feedback-ID header in your SES emails can significantly enhance your ability to track and troubleshoot deliverability issues. Use Postmaster Tools and the Feedback Loop via Feedback-ID headers in SES emails to gain detailed insights into complaint rates and other key metrics, enabling you to maintain a healthy sender reputation and ensure their emails reach the intended recipients.

Call to Action:

  1. Set Up Postmaster Tools for your sending domain(s)
  2. Verify Your Domain: Register and verify your domain with Postmaster Tools to access valuable insights and track your compliance status.
  3. Set Up Feedback-ID: Start embedding the Feedback-ID header in your emails sent via Amazon SES to take advantage of detailed complaint data and improve your email campaigns.
  4. Monitor and Adjust: Regularly check the Postmaster Tools dashboard to monitor your spam rates and feedback loop identifiers. Use this data to refine your email content and sending practices.
  5. Leverage AWS CLI and Boto3: Utilize the provided AWS CLI commands and Boto3 scripts to automate the process of sending emails with Feedback-ID headers, ensuring consistent and accurate tracking.

By following these steps, you can enhance your email deliverability, reduce spam complaints, and maintain a strong sender reputation. For more information on using Amazon SES and Google’s Postmaster Tools, refer to the Amazon SES Documentation and the Postmaster Tools Guide.

Migrate from Apache Solr to OpenSearch

Post Syndicated from Aswath Srinivasan original https://aws.amazon.com/blogs/big-data/migrate-from-apache-solr-to-opensearch/

OpenSearch is an open source, distributed search engine suitable for a wide array of use-cases such as ecommerce search, enterprise search (content management search, document search, knowledge management search, and so on), site search, application search, and semantic search. It’s also an analytics suite that you can use to perform interactive log analytics, real-time application monitoring, security analytics and more. Like Apache Solr, OpenSearch provides search across document sets. OpenSearch also includes capabilities to ingest and analyze data. Amazon OpenSearch Service is a fully managed service that you can use to deploy, scale, and monitor OpenSearch in the AWS Cloud.

Many organizations are migrating their Apache Solr based search solutions to OpenSearch. The main driving factors include lower total cost of ownership, scalability, stability, improved ingestion connectors (such as Data Prepper, Fluent Bit, and OpenSearch Ingestion), elimination of external cluster managers like Zookeeper, enhanced reporting, and rich visualizations with OpenSearch Dashboards.

We recommend approaching a Solr to OpenSearch migration with a full refactor of your search solution to optimize it for OpenSearch. While both Solr and OpenSearch use Apache Lucene for core indexing and query processing, the systems exhibit different characteristics. By planning and running a proof-of-concept, you can ensure the best results from OpenSearch. This blog post dives into the strategic considerations and steps involved in migrating from Solr to OpenSearch.

Key differences

Solr and OpenSearch Service share fundamental capabilities delivered through Apache Lucene. However, there are some key differences in terminology and functionality between the two:

  • Collection and index: In OpenSearch, a collection is called an index.
  • Shard and replica: Both Solr and OpenSearch use the terms shard and replica.
  • API-driven Interactions: All interactions in OpenSearch are API-driven, eliminating the need for manual file changes or Zookeeper configurations. When creating an OpenSearch index, you define the mapping (equivalent to the schema) and the settings (equivalent to solrconfig) as part of the index creation API call.

Having set the stage with the basics, let’s dive into the four key components and how each of them can be migrated from Solr to OpenSearch.

Collection to index

A collection in Solr is called an index in OpenSearch. Like a Solr collection, an index in OpenSearch also has shards and replicas.

Although the shard and replica concept is similar in both the search engines, you can use this migration as a window to adopt a better sharding strategy. Size your OpenSearch shards, replicas, and index by following the shard strategy best practices.

As part of the migration, reconsider your data model. In examining your data model, you can find efficiencies that dramatically improve your search latencies and throughput. Poor data modeling doesn’t only result in search performance problems but extends to other areas. For example, you might find it challenging to construct an effective query to implement a particular feature. In such cases, the solution often involves modifying the data model.

Differences: Solr allows primary shard and replica shard collocation on the same node. OpenSearch doesn’t place the primary and replica on the same node. OpenSearch Service zone awareness can automatically ensure that shards are distributed to different Availability Zones (data centers) to further increase resiliency.

The OpenSearch and Solr notions of replica are different. In OpenSearch, you define a primary shard count using number_of_primaries that determines the partitioning of your data. You then set a replica count using number_of_replicas. Each replica is a copy of all the primary shards. So, if you set number_of_primaries to 5, and number_of_replicas to 1, you will have 10 shards (5 primary shards, and 5 replica shards). Setting replicationFactor=1 in Solr yields one copy of the data (the primary).

For example, the following creates a collection called test with one shard and no replicas.

http://localhost:8983/solr/admin/collections?
  _=action=CREATE
  &maxShardsPerNode=2
  &name=test
  &numShards=1
  &replicationFactor=1
  &wt=json

In OpenSearch, the following creates an index called test with five shards and one replica

PUT test
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}

Schema to mapping

In Solr schema.xml OR managed-schema has all the field definitions, dynamic fields, and copy fields along with field type (text analyzers, tokenizers, or filters). You use the schema API to manage schema. Or you can run in schema-less mode.

OpenSearch has dynamic mapping, which behaves like Solr in schema-less mode. It’s not necessary to create an index beforehand to ingest data. By indexing data with a new index name, you create the index with OpenSearch managed service default settings (for example: "number_of_shards": 5, "number_of_replicas": 1) and the mapping based on the data that’s indexed (dynamic mapping).

We strongly recommend you opt for a pre-defined strict mapping. OpenSearch sets the schema based on the first value it sees in a field. If a stray numeric value is the first value for what is really a string field, OpenSearch will incorrectly map the field as numeric (integer, for example). Subsequent indexing requests with string values for that field will fail with an incorrect mapping exception. You know your data, you know your field types, you will benefit from setting the mapping directly.

Tip: Consider performing a sample indexing to generate the initial mapping and then refine and tidy up the mapping to accurately define the actual index. This approach helps you avoid manually constructing the mapping from scratch.

For Observability workloads, you should consider using Simple Schema for Observability. Simple Schema for Observability (also known as ss4o) is a standard for conforming to a common and unified observability schema. With the schema in place, Observability tools can ingest, automatically extract, and aggregate data and create custom dashboards, making it easier to understand the system at a higher level.

Many of the field types (data types), tokenizers, and filters are the same in both Solr and OpenSearch. After all, both use Lucene’s Java search library at their core.

Let’s look at an example:

<!-- Solr schema.xml snippets -->
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="name" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="address" type="text_general" indexed="true" stored="true"/>
<field name="user_token" type="string" indexed="false" stored="true"/>
<field name="age" type="pint" indexed="true" stored="true"/>
<field name="last_modified" type="pdate" indexed="true" stored="true"/>
<field name="city" type="text_general" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>

<copyField source="name" dest="text"/>
<copyField source="address" dest="text"/>

<fieldType name="string" class="solr.StrField" sortMissingLast="true" />
<fieldType name="pint" class="solr.IntPointField" docValues="true"/>
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
PUT index_from_solr
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_general": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "keyword",
        "copy_to": "text"
      },
      "address": {
        "type": "text",
        "analyzer": "text_general"
      },
      "user_token": {
        "type": "keyword",
        "index": false
      },
      "age": {
        "type": "integer"
      },
      "last_modified": {
        "type": "date"
      },
      "city": {
        "type": "text",
        "analyzer": "text_general"
      },
      "text": {
        "type": "text",
        "analyzer": "text_general"
      }
    }
  }
}

Notable things in OpenSearch compared to Solr:

  1. _id is always the uniqueKey and cannot be defined explicitly, because it’s always present.
  2. Explicitly enabling multivalued isn’t necessary because any OpenSearch field can contain zero or more values.
  3. The mapping and the analyzers are defined during index creation. New fields can be added and certain mapping parameters can be updated later. However, deleting a field isn’t possible. A handy ReIndex API can overcome this problem. You can use the Reindex API to index data from one index to another.
  4. By default, analyzers are for both index and query time. For some less-common scenarios, you can change the query analyzer at search time (in the query itself), which will override the analyzer defined in the index mapping and settings.
  5. Index templates are also a great way to initialize new indexes with predefined mappings and settings. For example, if you continuously index log data (or any time-series data), you can define an index template so that all the indices have the same number of shards and replicas. It can also be used for dynamic mapping control and component templates

Look for opportunities to optimize the search solution. For instance, if the analysis reveals that the city field is solely used for filtering rather than searching, consider changing its field type to keyword instead of text to eliminate unnecessary text processing. Another optimization could involve disabling doc_values for the user_token field if it’s only intended for display purposes. doc_values are disabled by default for the text datatype.

SolrConfig to settings

In Solr, solrconfig.xml carries the collection configuration. All sorts of configurations pertaining to everything from index location and formatting, caching, codec factory, circuit breaks, commits and tlogs all the way up to slow query config, request handlers, and update processing chain, and so on.

Let’s look at an example:

<codecFactory class="solr.SchemaCodecFactory">
<str name="compressionMode">`BEST_COMPRESSION`</str>
</codecFactory>

<autoCommit>
    <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
    <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
    <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
    </autoSoftCommit>

<slowQueryThresholdMillis>1000</slowQueryThresholdMillis>

<maxBooleanClauses>${solr.max.booleanClauses:2048}</maxBooleanClauses>

<requestHandler name="/query" class="solr.SearchHandler">
    <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    </lst>
</requestHandler>

<searchComponent name="spellcheck" class="solr.SpellCheckComponent"/>
<searchComponent name="suggest" class="solr.SuggestComponent"/>
<searchComponent name="elevator" class="solr.QueryElevationComponent"/>
<searchComponent class="solr.HighlightComponent" name="highlight"/>

<queryResponseWriter name="json" class="solr.JSONResponseWriter"/>
<queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" startup="lazy"/>
<queryResponseWriter name="xslt" class="solr.XSLTResponseWriter"/>

<updateRequestProcessorChain name="script"/>

Notable things in OpenSearch compared to Solr:

  1. Both OpenSearch and Solr have BEST_SPEED codec as default (LZ4 compression algorithm). Both offer BEST_COMPRESSION as an alternative. Additionally OpenSearch offers zstd and zstd_no_dict. Benchmarking for different compression codecs is also available.
  2. For near real-time search, refresh_interval needs to be set. The default is 1 second which is good enough for most use cases. We recommend increasing refresh_interval to 30 or 60 seconds to improve indexing speed and throughput, especially for batch indexing.
  3. Max boolean clause is a static setting, set at node level using the indices.query.bool.max_clause_count setting.
  4. You don’t need an explicit requestHandler. All searches use the _search or _msearch endpoint. If you’re used to using the requestHandler with default values then you can use search templates.
  5. If you’re used to using /sql requestHandler, OpenSearch also lets you use SQL syntax for querying and has a Piped Processing Language.
  6. Spellcheck, also known as Did-you-mean, QueryElevation (known as pinned_query in OpenSearch), and highlighting are all supported during query time. You don’t need to explicitly define search components.
  7. Most API responses are limited to JSON format, with CAT APIs as the only exception. In cases where Velocity or XSLT is used in Solr, it must be managed on the application layer. CAT APIs respond in JSON, YAML, or CBOR formats.
  8. For the updateRequestProcessorChain, OpenSearch provides the ingest pipeline, allowing the enrichment or transformation of data before indexing. Multiple processor stages can be chained to form a pipeline for data transformation. Processors include GrokProcessor, CSVParser, JSONProcessor, KeyValue, Rename, Split, HTMLStrip, Drop, ScriptProcessor, and more. However, it’s strongly recommended to do the data transformation outside OpenSearch. The ideal place to do that would be at OpenSearch Ingestion, which provides a proper framework and various out-of-the-box filters for data transformation. OpenSearch Ingestion is built on Data Prepper, which is a server-side data collector capable of filtering, enriching, transforming, normalizing, and aggregating data for downstream analytics and visualization.
  9. OpenSearch also introduced search pipelines, similar to ingest pipelines but tailored for search time operations. Search pipelines make it easier for you to process search queries and search results within OpenSearch. Currently available search processors include filter query, neural query enricher, normalization, rename field, scriptProcessor, and personalize search ranking, with more to come.
  10. The following image shows how to set refresh_interval and slowlog. It also shows you the other possible settings.
  11. Slow logs can be set like the following image but with much more precision with separate thresholds for the query and fetch phases.

Before migrating every configuration setting, assess if the setting can be adjusted based on your current search system experience and best practices. For instance, in the preceding example, the slow logs threshold of 1 second might be intensive for logging, so that can be revisited. In the same example, max.booleanClauses might be another thing to look at and reduce.

Differences: Some settings are done at the cluster level or node level and not at the index level. Including settings such as max boolean clause, circuit breaker settings, cache settings, and so on.

Rewriting queries

Rewriting queries deserves its own blog post; however we want to at least showcase the autocomplete feature available in OpenSearch Dashboards, which helps ease query writing.

Similar to the Solr Admin UI, OpenSearch also features a UI called OpenSearch Dashboards. You can use OpenSearch Dashboards to manage and scale your OpenSearch clusters. Additionally, it provides capabilities for visualizing your OpenSearch data, exploring data, monitoring observability, running queries, and so on. The equivalent for the query tab on the Solr UI in OpenSearch Dashboard is Dev Tools. Dev Tools is a development environment that lets you set up your OpenSearch Dashboards environment, run queries, explore data, and debug problems.

Now, let’s construct a query to accomplish the following:

  1. Search for shirt OR shoe in an index.
  2. Create a facet query to find the number of unique customers. Facet queries are called aggregation queries in OpenSearch. Also known as aggs query.

The Solr query would look like this:

http://localhost:8983/solr/solr_sample_data_ecommerce/select?q=shirt OR shoe
  &facet=true
  &facet.field=customer_id
  &facet.limit=-1
  &facet.mincount=1
  &json.facet={
   unique_customer_count:"unique(customer_id)"
  }

The image below demonstrates how to re-write the above Solr query into an OpenSearch query DSL:

Conclusion

OpenSearch covers a wide variety of uses cases, including enterprise search, site search, application search, ecommerce search, semantic search, observability (log observability, security analytics (SIEM), anomaly detection, trace analytics), and analytics. Migration from Solr to OpenSearch is becoming a common pattern. This blog post is designed to be a starting point for teams seeking guidance on such migrations.

You can try out OpenSearch with the OpenSearch Playground. You can get started with Amazon OpenSearch Service, a managed implementation of OpenSearch in the AWS Cloud.


About the Authors

Aswath Srinivasan is a Senior Search Engine Architect at Amazon Web Services currently based in Munich, Germany. With over 17 years of experience in various search technologies, Aswath currently focuses on OpenSearch. He is a search and open-source enthusiast and helps customers and the search community with their search problems.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.