Tag Archives: Technical How-to

Testing and evaluating GuardDuty detections

2025-01-28 Marshall Jones

Post Syndicated from Marshall Jones original https://aws.amazon.com/blogs/security/testing-and-evaluating-guardduty-detections/

Amazon GuardDuty is a threat detection service that continuously monitors, analyzes, and processes Amazon Web Services (AWS) data sources and logs in your AWS environment. GuardDuty uses threat intelligence feeds, such as lists of malicious IP addresses and domains, file hashes, and machine learning (ML) models to identify suspicious and potentially malicious activity in your AWS environment. When GuardDuty identifies a potential security issue, it creates a GuardDuty finding that gives you information about what the potential security issue is, the resources involved, and contextualized information that’s key to remediating the issue. GuardDuty helps you monitor for the latest threats by continually expanding threat detection to emerging and common threats.

Whether you’re new to GuardDuty or are a long-time user, it’s recommended that you understand the different GuardDuty finding types and finding details and practice responding to them as suggested in the security pillar of AWS Well-Architected.

In this blog post, I dive deep into an open source tool for testing GuardDuty findings and then walk through three examples of how you can use this tool to test and improve your response to GuardDuty findings.

Overview

If you want to learn more about GuardDuty, you can read about the finding types in this AWS documentation. However, customers often want realistic findings in their environment to understand what a finding looks like and to practice responding hands on. While you can use GuardDuty to create sample findings in your environment, these findings are approximations populated with placeholder values and look different from real findings. Additionally, you cannot practice remediation with these findings because they’re not tied to actual resources in your account. This can be helpful if you only want to see what details are in a finding, but if you want to practice a real-world scenario, these sample findings might not be adequate.

To address this use case and provide customers with a secure and reliable way to test the threat detection capabilities of GuardDuty, the GuardDuty service team launched an open source project called GuardDuty Tester. The GuardDuty Tester creates infrastructure in your environment to simulate different security issues so that you can test GuardDuty findings that mirror actual security issues that you might encounter, such as crypto mining or a reverse shell being created on an Amazon Elastic Compute Cloud (Amazon EC2) instance. The GuardDuty Tester was originally released in 2018 as an AWS CloudFormation template and was focused more on testing investigation workflows than on a wide range of finding types. AWS has since released an updated version that uses the AWS Cloud Development Kit (AWS CDK) to make the infrastructure code easier to read and expanded the test coverage to over 100 unique finding types and resource combinations.

The ability to create findings across different resource types such as Amazon EC2, Amazon Simple Storage Service (Amazon S3), and Amazon Elastic Kubernetes Service (Amazon EKS) is a valuable resource for your security team, allowing them to simulate various types of threats with isolated infrastructure so that you don’t need to compromise your deployed workloads to improve response actions and techniques. Remember that the GuardDuty Tester doesn’t cover every possible scenario, but is instead focused on threat intelligence and rules-based findings. Anomaly-based findings, which require learning about how you operate your environment, aren’t included in the GuardDuty Tester.

Getting started with the GuardDuty Tester

The GuardDuty Tester is deployed by using the AWS CDK to create the required infrastructure and scripts to generate the GuardDuty findings. For safety, AWS recommends that you deploy the GuardDuty Tester in a nonproduction environment in an account that’s used specifically for this purpose. This way, your security team can differentiate between test GuardDuty findings and findings for other workloads that they’re monitoring.

In this post, I won’t walk through configuring the GuardDuty Tester because this is already documented in the GuardDuty documentation. Instead, I will go over what you need to know about the GuardDuty Tester and some of the benefits.

Figure 1 shows the GuardDuty Tester architecture, which includes the resources necessary to create GuardDuty findings for various protection plans such as Amazon S3 buckets, Amazon EC2 instances, and an Amazon EKS cluster. The tester also deploys a dedicated GuardDuty Tester instance where you will run the scripts needed to create the GuardDuty findings.

Figure 1: GuardDuty Tester architecture

The GuardDuty Tester provides key features including:

A wide range of threat scenario simulations: Resources that the GuardDuty Tester can create findings for include Amazon S3, AWS Identity and Access Management (IAM), Amazon Elastic Container Service (Amazon ECS) for both Amazon EC2 and AWS Fargate hosted workloads, Amazon EKS, and AWS Lambda and covers over 105 threat scenarios. This includes GuardDuty runtime monitoring as well as other GuardDuty protection plans.
Access through AWS Systems Manager: The GuardDuty Tester provides secure access by using Systems Manager to minimize open ports to the internet and allowing access only through Systems Manager.
Modular scripts: With an expanded library of tests available, the GuardDuty Tester accepts user parameters to set the scope of the tests to run, which gives you greater flexibility for different testing scenarios.

Setting up the GuardDuty Tester environment is straightforward and requires only a few commands. As outlined in the documentation and the README file in the repository, there are a number of prerequisites to set up the stack. These prerequisites include Python 3+, git, the AWS Command Line Interface (AWS CLI), AWS Systems Manager Session Manager plugin, npm, Docker, and a subscription to Kali Linux image for Amazon EC2. You will have to subscribe to the Kali Linux instance in AWS Marketplace, but will be charged for the instance only while the GuardDuty Tester is deployed. After these prerequisites are met, you can clone the repository, install the packages, and deploy the GuardDuty Tester to your AWS account.

Deploying the GuardDuty Tester can take 20–30 minutes, but if you’re following along with this post, I assume that you have deployed the GuardDuty Tester into your environment and have started your Systems Manager session as stated on Part A of Step 3 – Run tester scripts in the GuardDuty documentation. Now, I will dive into the first testing example.

Manual investigation

The first test use case is about getting familiar with what GuardDuty findings look like and the details that a finding gives you. This might be one of your first steps after turning on GuardDuty, or this might be an activity that you perform to help new team members understand GuardDuty findings.

To start a manual investigation:

Run the following command in your Systems Manager session to view the GuardDuty Tester options.
```
Python3 guardduty_tester.py --help
```
Run the following command in your Systems Manager session to create the first test finding.
```
Python3 guardduty_tester.py - -ec2 - -runtime only - -tactics impact
```
Before creating the findings, the GuardDuty Tester prompts you to confirm that it’s allowed to change GuardDuty settings in the environment. For example, if you’ve chosen to create findings related to the GuardDuty runtime monitoring feature but don’t have this feature enabled, the GuardDuty Tester will enable it for the tests and then disable it after testing is complete.

Note: This will start the 30-day trial of the enabled features in this account, in this AWS Region, even if the feature is disabled after testing is complete. More information about GuardDuty pricing and free trials can be found on the GuardDuty pricing page.
After choosing y which indicates “yes”, the GuardDuty Tester reports the number of domain reputation findings it’s expecting. Figure 2 shows an example of the expected findings. You can learn more about domain reputation findings in the GuardDuty finding documentation.

Figure 2: Generated GuardDuty findings in the console
After the GuardDuty Tester is finished, wait a few minutes and then go to the AWS Management Console for GuardDuty to see the findings. In this example, there are four new GuardDuty findings as expected from step 4 and shown in Figure 3. With the findings generated, you can start your manual investigation.

Figure 3: GuardDuty finding details

In the preceding figure, you can see some of the finding details presented—such as the action type and the process information—that can help you quickly identify what trigger started the suspicious communication. From here, I encourage you to use this finding to practice your runbooks for investigation and response. For example, you might start with validating and triaging the finding before moving into evidence collection and remediation. If you don’t have incident response runbooks already built, you can use this finding as an example to get started. There are multiple open source examples such as AWS incident response playbooks and AWS customer response playbook. A runbook will help your team evaluate the information provided in the GuardDuty finding and understand what else they need to know about your specific environment to properly respond to the finding. For example, in the finding, you will have resource and actor information but not things such as who is the account owner or point of contact for security for that account.

Creating alerts

The next use case highlights how to create alerts based on GuardDuty findings. When setting up alerting automation with tools such as Amazon Simple Notification Service (Amazon SNS) and Slack, you should create a finding using the GuardDuty Tester to test that you’ve configured your alert correctly. See Creating custom responses to GuardDuty findings for information about creating alerts with either of these tools. Figure 4 shows a sample EventBridge rule that will send GuardDuty findings to SNS.

Figure 4: EventBridge rule to send GuardDuty findings to SNS

For this post, I assume that you’ve already configured an Amazon EventBridge rule and Amazon SNS alert.

To test alerts:

Run the following command in your Systems Manager session to create a privileged container finding.
```
Python3 guardduty_tester.py - -finding ‘PrivilegedEscalation:Kubernetes/PrivilegedContainer’
```
Shortly after creating this finding, you should see an SNS alert based on the finding type.

Figure 5: SNS notification from a GuardDuty finding

If you’ve configured the alert correctly, you will see an email similar to Figure 5. The email demonstrates that SNS notifications were successfully configured and tested using the GuardDuty Tester. If this is a new finding, you will receive this SNS notification shortly after the GuardDuty Tester generates the finding, but if this is an updated finding, then the timing will be based on the notification frequency configured in the account.

There are many ways that customers consume GuardDuty findings in their environments. Whether you’re using Amazon SNS or another mechanism such as a chat application, ticketing system, or a security information and event management (SIEM) solution, you can use this example of an EventBridge rule and the GuardDuty Tester to test out your notification pipeline.

Automated response

For the third use case, I show you how to create an automated action based on a GuardDuty finding. In this example, I create a finding based on an EC2 instance connecting to a Bitcoin mining domain, then based on this finding, I use Lambda to tag the instance to assist with identification during that investigation steps that follow. Although this is a simple example, it shows you what you can do by combining EventBridge rules and Lambda functions. If you want to create an automated response for GuardDuty runtime monitoring findings that requires making a host-level modification, you can use EventBridge rules with AWS Systems Manager Run Command to run commands locally on a host to remediate a security issue.

Start by creating a Lambda function that will take a GuardDuty event delivered by EventBridge, pull out the instance ID information, and then use that as a parameter in the create_tags API call. See the following example code.

import json
import boto3
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        # Extract the necessary information from the GuardDuty finding
        instance_id = event['detail']['resource']['instanceDetails']['instanceId']
        account_id = event['detail']['accountId']
        region = event['detail']['region']

        # Create an EC2 client
        ec2 = boto3.client('ec2', region_name=region)

        # Add the "infected" and "cryptomining" tag value pair to the instance
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[
                {
                    'Key': 'infected',
                    'Value': 'cryptomining'
                }
            ]
        )

        logger.info(f"Tagged instance {instance_id} with 'infected=cryptomining' in account {account_id} and region {region}")
        return {
            'statusCode': 200,
            'body': 'Instance tagged successfully'
        }
    except Exception as e:
        logger.error(f"Error tagging instance {instance_id}: {str(e)}")
        return {
            'statusCode': 500,
            'body': f"Error tagging instance: {str(e)}"
        }

Next, I create an EventBridge rule specific to the Bitcoin mining finding that I want to test, shown in Figure 6. The target is the Lambda function that I just created.

Figure 6: EventBridge rule for crypto mining GuardDuty finding

Now that the EventBridge rule is in place with the Lambda function as the target, I can use the GuardDuty Tester to trigger a Bitcoin mining finding and test my solution with the following command.

Python3 guardduty_tester.py - - finding ‘CryptoCurrency:EC2/BitcoinTool.B!DNS’

After the finding is generated, I go to my EC2 instance, where there’s a new instance tag with a key of infected and a value of cryptomining, shown in Figure 7.

Figure 7: Updated tags after automated response

Although this is a general example, you can use the same approach across various actions that you might take in response to a GuardDuty finding and then test them using the GuardDuty Tester. Examples include using Lambda to add logic in AWS WAF, a network access control list (network ACL), or AWS Network Firewall to block suspicious traffic, or use Systems Manager Run Command to end a malicious process that’s running on a host.

Conclusion

The updated GuardDuty Tester represents a significant advancement in helping organizations validate and gain confidence in GuardDuty threat detection. The GuardDuty Tester now provides more comprehensive coverage of GuardDuty runtime monitoring and protection plans across various AWS services.

By using the GuardDuty Tester and following the use cases in this post, you can proactively assess your threat detection readiness, identify potential gaps, and implement necessary measures to help you fortify your AWS environments against evolving cyber threats.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

AWS Firewall Manager retrofitting: Harmonizing central security with application team flexibility

2025-01-28 Ian Olson

Post Syndicated from Ian Olson original https://aws.amazon.com/blogs/security/aws-firewall-manager-retrofitting-harmonizing-central-security-with-application-team-flexibility/

AWS Firewall Manager is a powerful tool that organizations can use to define common AWS WAF rules with centralized security policies. These policies specify which accounts and resources are in scope. Firewall Manager creates a web access control list (web ACL) that adheres to the organization’s policy requirements and associates it with the in-scope resources. Figure 1 shows a Firewall Manager security policy and web ACL created in each in-scope account.

Figure 1: A Firewall Manager security policy and Firewall Manager created web ACLs in each in-scope account

In this post, we’ll talk about the benefits of retrofitting and how you can use this feature to allow Firewall Manager to manage existing web ACLs. When retrofitting is enabled, a Firewall Manager security policy doesn’t replace existing web ACLs. Instead, Firewall Manager adds the top and bottom rule sections to existing web ACLs associated with in-scope resources. For application teams, Firewall Manger no longer restricts how they configure and deploy AWS WAF. Teams can use either the AWS Management Console or infrastructure as code (IaC) tools to customize rules in web ACLs, even if those web ACLs are managed by Firewall Manager.

Firewall Manager before retrofitting

Firewall Manager offers significant benefits, but the existing approach results in several challenges:

Compatibility with infrastructure as code (IaC): Firewall Manager creates and associates AWS WAF web ACLs with in-scope resources. IaC tools expect to create and manage resources (in other words, own their lifecycle). Application teams cannot use IaC to manage the WAF rules and other web ACL configuration components that are created by Firewall Manager; there are custom solutions that inject locally defined rules into Firewall Manager-created web ACLs, but these are complex and have risks such as drift. For AWS WAF customers and application teams, this introduces an operational challenge.
Existing WAF migration: Customers who are already using AWS WAF must migrate existing rules to Firewall Manager-managed web ACLs.
Application-specific rule complexity: Forcing all in-scope resources in the same account and AWS Region to use the same web ACL makes application-specific or exception-based rules more complex. In addition, changes to one application’s rules could impact others that share the same web ACL.
Increased costs: When many applications share a single web ACL, application-specific rules are part of a single web ACL, which can increase total WAF capacity units (WCU) usage, sometimes resulting in higher AWS WAF request costs.

Firewall Manager with retrofitting addresses challenges

To address these challenges, Firewall Manager now offers the ability to retrofit existing web ACLs. Let’s get specific about when and how Firewall Manager retrofitting works:

Firewall Manager will only retrofit a web ACL when all associated resources (for example, Application Load Balancers, API Gateways, and Amazon CloudFront distributions) are in scope. If a not-in-scope resource is also associated, Firewall Manager will not retrofit or update future security policy changes to a retrofitted web ACL. In that scenario, associated in-scope resources and the web ACL are marked noncompliant with security policies.
Retrofitting only modifies customer-created web ACLs. Retrofitting will not act on web ACLs retrofitted by another security policy or managed by Firewall Manager. If either scenario occurs, the web ACL and associated resources are marked noncompliant with security policies.
Retrofitting adds the following on top of an existing web ACL that is associated with one or more in-scope resources:
- Retrofitting adds first rule groups and last rule groups defined in a security policy to the web ACL. Existing rules within the web ACL are not changed. The order of rule evaluation changes is the following:
  1. Security policy–defined first rule group rules
  2. Security policy–defined last rule group rules
- Retrofitting adds a WAF logging configuration if one is defined by the security policy. If the web ACL already has a logging configuration, Firewall Manager does not replace the existing logging configuration and marks the web ACL noncompliant.
- Retrofitting does not verify or configure other attributes that are defined by the security policy. This includes the following properties: default action, custom request headers, web ACL Captcha or challenge configurations, and token domain list. These properties are used only when Firewall Manager creates a web ACL.
If an in-scope resource that supports AWS WAF does not have a web ACL, Firewall Manager creates and associates a Firewall Manager-managed web ACL with that resource.

Figure 2 shows the rules and logging configuration before and after a Firewall Manager security policy retrofits a web ACL that is associated with in-scope resources.

Figure 2: Using retrofitting to update an existing web ACL

This new retrofitting capability solves the previous challenges in the following ways:

IaC compatible: Application teams can provision and manage AWS WAF with IAC tools. Firewall Manager retrofits existing web ACLs by adding rules defined in the security policy to web ACLs created by IaC tools. Application teams can manage AWS WAF exactly the same as when Firewall Manager was not in use.
Existing WAF integration: Customers with existing AWS WAF deployments can adopt Firewall Manager without the need to migrate existing WAF rules.
Application-specific rules: Multiple resources in the same account can use separate web ACLs, which simplifies application-specific rules.
Help prevent additional costs: Application-specific rules are applied only to the relevant web ACLs. This helps prevent increased AWS WAF request costs from shared web ACLs that have a high WCU usage.

By enhancing Firewall Manager to retrofit existing web ACLs, customers can use the power of centralized WAF management without restricting how AWS WAF is deployed and configured by application teams in member accounts.

Firewall Manager security policy for AWS WAF – Enabling retrofitting

Figure 3 shows an example of a Firewall Manager security policy.

Figure 3: An example Firewall Manager security policy that uses the new Retrofit existing webACLs feature

This security policy defines WAF rules in the first rule group section. It applies to all Application Load Balancers (ALBs) across your organization in AWS Organizations in the Region where this security policy is created. The policy action is set to automatically remediate. Under Web ACL management, there is a new section, Managed web ACL source, with two options, Default and Retrofit existing webACLs. Default is the existing behavior: Firewall Manager creates and associates a Firewall Manager managed web ACL. Retrofit existing webACLs applies the WAF rules and logging configuration (if any) defined by a security policy to existing web ACLs when they are associated with an in-scope resource. This policy specifies Retrofit existing webACLs. If an in-scope resource does not have a web ACL, Firewall Manager still creates and associates a web ACL by default.

Retrofitting in action

Let’s walk through what happens when you set Managed web ACL source to Retrofit existing webACLs. Figure 4 shows two ALBs that are in-scope of our security policy, LoadBalancer1 and LoadBalancer2.

Figure 4: Two existing ALBs, one with an existing web ACL and the other without

LoadBalancer1 has the following characteristics:

LoadBalancer1 does NOT have a web ACL associated.
After the Firewall Manager security policy applies, LoadBalancer1 is associated with a Firewall Manager created web ACL, as shown in Figure 5.

Figure 5: LoadBalancer1 is now associated with a Firewall Manager managed and created web ACL
The Firewall Manager-created web ACL contains the WAF rules defined in the security policy.

LoadBalancer2 has the following characteristics:

LoadBalancer2 has an existing customer created web ACL associated with it, as shown in Figure 6.

Figure 6: LoadBalancer2 is associated with a customer-created web ACL
This web ACL was created by the application team with an application-specific rule. The web ACL could have been created with the AWS Management Console, AWS CloudFormation, or other IaC tools like Terraform.
After the Firewall Manager security policy takes effect, LoadBalancer2 remains associated with the existing customer-created web ACL MyCustomWebACL.
Retrofitting adds WAF rules in the first rule group according to the security policy, as shown in Figure 7. Existing WAF rules are not changed, and rules and other aspects of the web ACL are not changed and can continue to be managed exactly as you would without Firewall Manager present.

Figure 7: Firewall Manager has retrofitted a customer-created web ACL and added the security policy–defined first rule groups rules

Figure 8 shows that both ALBs are now compliant with our security policy. LoadBalancer1 has a web ACL created by Firewall Manager; future ALBs that don’t have a web ACL associated would also become associated to this web ACL. LoadBalancer2 is associated with a web ACL the application team previously created. The application-defined WAF rules are not changed and the WAF rules defined by the security policy are added to this web ACL.

Figure 8: Multiple ALBs in scope for the same security policy

Associate a web ACL with Firewall Manager already in place

Let’s continue from our previous scenario. The application team for LoadBalancer1 now configures application-specific rules for their ALB by using the following process.

The application team creates a web ACL and defines their own WAF rules. They might do this with the console, AWS CloudFormation, or other IaC tools like Terraform.
The application team associates LoadBalancer1 with the custom web ACL.

Figure 9: From the web ACL, only LoadBalancer1 is associated with the app team’s web ACL
After a moment, Firewall Manager detects the change to an in-scope resource (LoadBalancer1). Firewall Manager retrofits the application team’s web ACL AppTeamNewWebACL, making it align with the security policy.

Figure 10: Firewall Manager has retrofitted the app team–created web ACL and added the security policy–defined first rule group rules

The diagram in Figure 11 shows the workflow that happens when the application team creates their own web ACL and associates it with LoadBalancer1. Firewall Manager detects the association change and retrofits the application team’s web ACL again, bringing LoadBalancer1 into alignment with security policies.

Figure 11: Firewall Manager retrofitting a web ACL in response to being associated with an existing in-scope resource

Out-of-scope resources associated with a retrofitted web ACL

Let’s make a change to our security policy. In Figure 12, the policy scope has been updated to only apply to resources when they have the resource tag Tier: Production. Application teams add this tag to LoadBalancer1 and LoadBalancer2.

Figure 12: The Firewall Manager security policy scope has been updated to include a resource tag

Later, an application team creates LoadBalancer3 and associates it with AppTeamNewWebACL (Figure 13).

Figure 13: A new ALB associated with a custom WAF web ACL

The application team does not tag LoadBalancer3, making it out of scope with the security policy, as shown in Figure 14.

Figure 14: A new ALB that is out of scope with a security policy

This web ACL is currently retrofitted by our security policy and associated with LoadBalancer2 (in-scope), as shown in Figure 15.

Figure 15: A retrofitted web ACL associated with in-scope and out-of-scope ALBs

Firewall Manager detects that an out-of-scope resource is associated with a retrofitted web ACL. As shown in Figure 16, Firewall Manager marks the web ACL AppTeamNewWebACL noncompliant. It also marks LoadBalancer2 noncompliant.

Figure 16: Shows retrofit-specific reasons a resource or web ACL will be marked noncompliant. In this case, a not-in-scope resource is associated with a web ACL where another associated resource is in-scope.

As long as the web ACL is noncompliant, Firewall Manager will not retrofit a non-retrofitted web ACL, and future security policy changes will not be applied to that web ACL. Existing retrofitted WAF rules for that web ACL are not modified or removed. When the web ACL later becomes compliant, it will again retrofit the latest state of the security policy. Figure 17 shows how Firewall Manager keeps the existing retrofitting but does not apply updates. Using our example, one of three things would need to happen for the web ACL to become compliant:

LoadBalancer3 can be made in-scope with the current security policy. The application team can then add the resource tag Tier: Production to this ALB.
The resource tag can be removed from the policy scope.
LoadBalancer3 can be disassociated with the web ACL.

Note: We recommend that you promptly address not-in-scope resources associated with web ACLs that are shared by in-scope resources. For example, if the preceding scenario was not addressed and LoadBalancer3 was deleted three months later, Firewall Manager would at that point retrofit the web ACL (3 months later). This is not necessarily a problem, but could trigger unexpected changes to a web ACL’s rules.

Figure 17: Firewall Manager retains existing but not future changes as long as a not-in-scope resource is associated with this web ACL

In summary: Firewall Manager initially retrofits and will apply future updates to existing web ACLs while all associated resources are in-scope. When an out-of-scope resource is associated, an initial retrofit is delayed and security policy retrofit updates are paused until the out-of-scope resource is addressed.

Figure 18 demonstrates how Firewall Manager will not perform an initial retrofit of a web ACL associated with both in-scope and not-in-scope resources.

Figure 18: Firewall Manger will only perform an initial retrofit when only in-scope resources are associated with a web ACL

Conclusion

Firewall Manager verifies that in-scope resources adhere to relevant security policies or are marked noncompliant. You can enable retrofitting for Firewall Manager for AWS WAF to seamlessly enforce these security policies without changing how your application teams manage the configuration of their WAF rules.

Note: Retrofitting is only available for AWS Firewall Manager security policies for AWS WAF; it is not available for AWS WAF Classic.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Firewall Manager re:Post or contact AWS Support.

Generate vector embeddings for your data using AWS Lambda as a processor for Amazon OpenSearch Ingestion

2025-01-21 Jagadish Kumar

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/generate-vector-embeddings-for-your-data-using-aws-lambda-as-a-processor-for-amazon-opensearch-ingestion/

On Nov 22, 2024, Amazon OpenSearch Ingestion launched support for AWS Lambda processors. With this launch, you now have more flexibility enriching and transforming your logs, metrics, and trace data in an OpenSearch Ingestion pipeline. Some examples include using foundation models (FMs) to generate vector embeddings for your data and looking up external data sources like Amazon DynamoDB to enrich your data.

Amazon OpenSearch Ingestion is a fully managed, serverless data pipeline that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and Amazon OpenSearch Serverless collections.

Processors are components within an OpenSearch Ingestion pipeline that enable you to filter, transform, and enrich events using your desired format before publishing records to a destination of your choice. If no processor is defined in the pipeline configuration, then the events are published in the format specified by the source component. You can incorporate multiple processors within a single pipeline, and they are run sequentially as defined in the pipeline configuration.

OpenSearch Ingestion gives you the option of using Lambda functions as processors along with built-in native processors when transforming data. You can batch events into a single payload based on event count or size before invoking Lambda to optimize the pipeline for performance and cost. Lambda enables you to run code without provisioning or managing servers, eliminating the need to create workload-aware cluster scaling logic, maintain event integrations, or manage runtimes.

In this post, we demonstrate how to use the OpenSearch Ingestion’s Lambda processor to generate embeddings for your source data and ingest them to an OpenSearch Serverless vector collection. This solution uses the flexibility of OpenSearch Ingestion pipelines with a Lambda processor to dynamically generate embeddings. The Lambda function will invoke the Amazon Titan Text Embeddings Model hosted in Amazon Bedrock, allowing for efficient and scalable embedding creation. This architecture simplifies various use cases, including recommendation engines, personalized chatbots, and fraud detection systems.

Integrating OpenSearch Ingestion, Lambda, and OpenSearch Serverless creates a fully serverless pipeline for embedding generation and search. This combination offers automatic scaling to match workload demands and a usage-driven model. Operations are simplified because AWS manages infrastructure, updates, and maintenance. This serverless approach allows you to focus on developing search and analytics solutions rather than managing infrastructure.

Note that Amazon OpenSearch Service also provides Neural search which transforms text into vectors and facilitates vector search both at ingestion time and at search time. During ingestion, neural search transforms document text into vector embeddings and indexes both the text and its vector embeddings in a vector index. Neural search is available for managed clusters running version 2.9 and above.

Solution overview

This solution builds embeddings on a dataset stored in Amazon Simple Storage Service (Amazon S3). We use the Lambda function to invoke the Amazon Titan model on the payload delivered by OpenSearch Ingestion.

Prerequisites

You should have an appropriate role with permissions to invoke your Lambda function and Amazon Bedrock model and also write to the OpenSearch Serverless collection.

To provide access to the collection, you must configure an AWS Identity and Access Management (IAM) pipeline role with a permissions policy that grants access to the collection. For more details, see Granting Amazon OpenSearch Ingestion pipelines access to collections. The following is example code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowinvokeFunction",
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
                
            ],
            "Resource": "arn:aws:lambda:{{region}}:{{account-id}}:function:{{function-name}}"
            
        }
    ]
}

The role must have the following trust relationship, which allows OpenSearch Ingestion to assume it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "osis-pipelines.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Create an ingestion pipeline

You can create a pipeline using a blueprint. For this post, we select the AWS Lambda custom enrichment blueprint.

We use the IMDB title basics dataset, which that contains movie information, including originalTitle, runtimeMinutes, and genres.

The OpenSearch Ingestion pipeline uses a Lambda processor to create embeddings for the field original_title and store the embeddings as original_title_embeddings along with other data.

See the following pipeline code:

version: "2"
s3-log-pipeline:
  source:
    s3:
      acknowledgments: true
      compression: "none"
      codec:
        csv:
      aws:
        # Provide the region to use for aws credentials
        region: "us-west-2"
        # Provide the role to assume for requests to SQS and S3
        sts_role_arn: "<<arn:aws:iam::123456789012:role/ Example-Role>>"
      scan:
        buckets:
          - bucket:
              name: "lambdaprocessorblog"
      
  processor:
     - aws_lambda:
        function_name: "generate_embeddings_bedrock"
        response_events_match: true
        tags_on_failure: ["lambda_failure"]
        batch:
          key_name: "documents"
          threshold:
            event_count: 4
        aws:
          region: us-west-2
          sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"
  sink:
    - opensearch:
        hosts:
          - 'https://myserverlesscollection.us-region.aoss.amazonaws.com'
        index: imdb-data-embeddings
        aws:
          sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"
          region: us-west-2
          serverless : true

Let’s take a closer look at the Lambda processor in the ingestion pipeline .Pay attention to the key_name, parameter. You can choose any value for key_name and your Lambda function will need to reference this key in your Lambda function when processing the payload from OpenSearch Ingestion. The payload size is determined by the batch setting. When batching is enabled in the Lambda processor, OpenSearch Ingestion groups multiple events into a single payload before invoking the Lambda function. A batch is sent to Lambda when any of the following thresholds are met:

- event_count – The number of events reaches the specified limit

- maximum_size – The total size of the batch reaches the specified size (for example, 5 MB) and is configurable up to 6MB (Invocation payload limit for AWS Lambda)

Lambda function

The Lambda function receives the data from OpenSearch Ingestion, invokes Amazon Bedrock to generate the embedding, and adds it to the source record. “documents” is used to reference the events coming in from OpenSearch Ingestion and matches the key_name declared in the pipeline. We add the embedding from Amazon Bedrock back to the original record. This new record with the appended embedding value is then sent to the OpenSearch Serverless sink by OpenSearch Ingestion. See the following code:

import json
import boto3
import os

# Initialize Bedrock client
bedrock = boto3.client('bedrock-runtime')

def generate_embedding(text):
    """Generate embedding for the given text using Bedrock."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        contentType="application/json",
        accept="application/json",
        body=json.dumps({"inputText": text})
    )
    embedding = json.loads(response['body'].read())['embedding']
    return embedding

def lambda_handler(event, context):
    # Assuming the input is a list of JSON documents
    documents = event['documents']
    
    processed_documents = []
    
    for doc in documents:
        if originalTitle' in doc:
            # Generate embedding for the 'originalTitle' field
            embedding = generate_embedding(doc[originalTitle'])
            
            # Add the embedding to the document
            doc['originalTitle_embeddings'] = embedding
        
        processed_documents.append(doc)
    
    # Return the processed documents
    return  processed_documents

In case of any exceptions while using the lambda processor, all the documents in the batch are considered failed events and are forwarded the next chain of processors if any or to the sink with a failed tag. The tag can be configured to the pipeline with the tags_on_failure parameter and the errors are also sent to CloudWatch logs for further action.

After the pipeline runs, you can see that the embeddings were created and stored as originalTitle_embeddings within the document in a k-NN index, imdb-data-embeddings. The following screenshot shows an example.

Summary

In this post, we showed how you can use Lambda as part of your OpenSearch Ingestion pipeline to enable complex transformation and enrichment of your data. For more details on the feature, refer to Using an OpenSearch Ingestion pipeline with AWS Lambda.

About the Authors

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Sam Selvan is a Principal Specialist Solution Architect with Amazon OpenSearch Service.

Srikanth Govindarajan is a Software Development Engineer at Amazon Opensearch Service. Srikanth is passionate about architecting infrastructure and building scalable solutions for search, analytics, security, AI and machine learning based usecases.

Automate topic provisioning and configuration using Terraform with Amazon MSK

2025-01-16 Vijay Kardile

Post Syndicated from Vijay Kardile original https://aws.amazon.com/blogs/big-data/automate-topic-provisioning-and-configuration-using-terraform-with-amazon-msk/

As organizations deploy Amazon Managed Streaming for Apache Kafka (Amazon MSK) clusters across multiple use cases, the manual management of topic configurations can be challenging. This can lead to several issues:

Inefficiency – Manual configuration is time-consuming and error-prone, especially for large deployments. Maintaining consistency across multiple configurations can be difficult. To avoid this, Kafka administrators often set the create.topics.enable property on brokers, which leads to cluster operation inefficiency.
Human error – Manual configuration increases the risk of mistakes that can disrupt data flow and impact applications relying on Amazon MSK.
Scalability challenges – Scaling an Amazon MSK environment with manual configuration is cumbersome. Adding new topics or modifying existing ones requires manual intervention, hindering agility.

These challenges highlight the need for a more automated and robust approach to MSK topic configuration management.

In this post, we address this problem by using Terraform to optimize the configuration of MSK topics. This solution supports both provisioned and serverless MSK clusters.

Solution overview

Customers want a better way to manage the overhead of topics and their configurations. Manually handling topic configurations can be cumbersome and error-prone, making it difficult to keep track of changes and updates.

To address these challenges, you can use Terraform, an infrastructure as code (IaC) tool by HashiCorp. Terraform allows you to manage and provision infrastructure declaratively. It uses human-readable configuration files written in HashiCorp Configuration Language (HCL) to define the desired state of infrastructure resources. These resources can span virtual machines, networks, databases, and a vast array of cloud provider-specific offerings.

Terraform offers a compelling solution to the challenges of manual Kafka topic configuration. Terraform allows you to define and manage your Kafka topics through code. This approach provides several key benefits:

Automation – Terraform automates the creation, modification, and deletion of MSK topics.
Consistency and repeatability – Terraform configurations provide consistent topic structures and settings across your entire Amazon MSK environment. This simplifies management and reduces the likelihood of configuration drift.
Scalability – Terraform enables you to provision and manage large numbers of MSK topics, facilitating the growth of your Amazon MSK environment.
Version control – Terraform configurations are stored in version control systems, allowing you to track changes, roll back if needed, and collaborate effectively on your Amazon MSK infrastructure.

By using Terraform for MSK topic configuration management, you can streamline your operations, minimize errors, and have a robust and scalable Amazon MSK environment.

In this post, we provide a comprehensive guide for using Terraform to manage Amazon MSK configurations. We explore the process of installing Terraform on Amazon Elastic Compute Cloud (Amazon EC2), defining and decentralizing topic configurations, and deploying and updating configurations in an automated manner.

Prerequisites

Before proceeding with the solution, make sure you have the following resources and access:

To simplify the setup, use the provided AWS CloudFormation template. This template will create the necessary Amazon MSK provisioned cluster and required resources for this post. You can create an MSK Serverless cluster using the Amazon MSK console and use it in this solution. This is sample template, not production ready, and AWS Identity and Access Management (IAM) policies should be implemented using best practices and the principle of least privilege. For more details, see Get started with AWS managed policies and move toward least-privilege permissions. An EC2 instance will be created as a part of this template. The MSK cluster and EC2 instance will be created on a single virtual private cloud (VPC); however, you can install Terraform in a different account or on different VPC. For more details, see Connect Kafka client applications securely to your Amazon MSK cluster from different VPCs and AWS accounts.
For this post, we use the latest Terraform version (1.10.x) and Terraform plugins – Mongey/Kafka provider. In Terraform, plugins are binary executables responsible for implementing resource types and providers. The plugins are installed automatically when we initialize a Terraform configuration using the terraform init
You need access to an AWS account with sufficient permissions to create and manage resources, including IAM roles and MSK clusters. For more information, see IAM access control.

By making sure you have these prerequisites in place, you will be ready to streamline your topic configurations with Terraform.

Install Terraform on your client machine

When your cluster and client machine are ready, SSH to your client machine (Amazon EC2) and install Terraform.

Run the following commands to install Terraform:

sudo yum update -y
sudo yum install -y yum-utils shadow-utils
sudo yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
sudo yum -y install terraform

Run the following command to check the installation:
```
terraform -v
```

This indicates that Terraform installation is successful and you are ready to automate your MSK topic configuration.

Provision an MSK topic using Terraform

To provision the MSK topic, complete the following steps:

Create a new file called main.tf and copy the following code into this file, replacing the BOOTSTRAP_SERVERS and AWS_REGION information with the details for your cluster. For instructions on retrieving the bootstrap_servers information for IAM authentication from your MSK cluster, see Getting the bootstrap brokers for an Amazon MSK cluster. This script is common for Amazon MSK provisioned and MSK Serverless.
```
terraform {
required_providers {
kafka = {
source = "Mongey/kafka" }}}
provider "kafka" {
bootstrap_servers = [{BOOTSTRAP_SERVERS}]
tls_enabled       = true
sasl_mechanism    = "aws-iam"
sasl_aws_region   ={AWS_REGION}
sasl_aws_profile  = "dev" }
resource "kafka_topic" "sampleTopic" {
name               = "sampleTopic"
replication_factor = 1
partitions         = 50 }
```

Add IAM bootstrap servers endpoints in a comma separated list format:

BOOTSTRAP_SERVERS = ["b-2.mskcluster…. ","b-3.mskcluster…. ","b-1.mskcluster…. "]

Run the command terraform init to initialize Terraform and download the required providers.

The terraform init command initializes a working directory containing Terraform configuration files(main.tf). This is the first command that should be run after writing a new Terraform configuration.

Run the command terraform plan to review the run plan.

This command shows the changes that Terraform will make to the infrastructure based on the provided configuration. This step is optional but is often used as a preview of the changes Terraform will make.

If the plan looks correct, run the command terraform apply to apply the configuration.
When prompted for confirmation before proceeding, enter yes.

The terraform apply command runs the actions proposed in a Terraform plan. Terraform will create the sampleTopic topic in your MSK cluster.

After the terraform apply command is complete, verify the infrastructure has been created with the help of the kafka-topics.sh utility:
```
kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…..amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--list
```

You can use the kafka-toipcs.sh tool with the --list option to retrieve a list of topics associated with your MSK cluster. For more information, refer to the createtopic documentation.

Update the MSK topic configuration using Terraform

To update the MSK topic configuration, let’s assume we want to change the number of partitions from 50 to 10 on our topic. We need to perform the following steps:

Verify the number of partitions on the topic using the --describe command:

kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…...amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--describe 
--topic sampleTopic

This command will show 50 partitions on the sampleTopic topic.

Modify the Terraform file main.tf and change the value of the partitions parameter to 10:

resource "kafka_topic" "sampleTopic" {
name               = " sampleTopic "
replication_factor = 1
partitions         = 10 }

Run the command terraform plan to review the run plan.

If the plan shows the changes, run the command terraform apply to apply the configuration.
When prompted for confirmation before proceeding, enter yes.

Terraform will drop and recreate the sampleTopic topic with the changed configuration.

Verify the changed number of partitions on the topic, ad rerun the --describe command:

kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…...amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--describe --topic sampleTopic

Now, this command will show 10 partitions on the sampleTopic topic.

Delete the MSK topic using Terraform

When you no longer need the infrastructure, you can remove all resources created by your Terraform file.

Run the command terraform destroy to remove the topic.
When prompted for confirmation before proceeding, enter yes.

Terraform will delete the sampleTopic topic from your MSK cluster.

To verify, rerun the --list command:

kafka/bin/kafka-topics.sh 
--bootstrap-server "b-1…..amazonaws.com:9098" 
--command-config ./kafka/bin/client.properties  
--list

Now, this command will not show the sampleTopic topic.

Conclusion

In this post, we addressed the common challenges associated with manual MSK topic configuration management and presented a robust Terraform-based solution. Using Terraform for automated topic provisioning and configuration streamlines your processes, fosters scalability, and enhances flexibility. Additionally, it facilitates automated deployments and centralized management.

We encourage you to explore Terraform as a means to optimize Amazon MSK configurations and unlock further efficiencies within your streaming data pipelines.

About the author

Vijay Kardile is a Sr. Technical Account Manager with Enterprise Support, India. With over two decades of experience in IT Consulting and Engineering, he specializes in Analytics services, particularly Amazon EMR and Amazon MSK. He has empowered numerous enterprise clients by facilitating their adoption of various AWS services and offering expert guidance on attaining operational excellence.

Batch data ingestion into Amazon OpenSearch Service using AWS Glue

2025-01-13 Ravikiran Rao

Post Syndicated from Ravikiran Rao original https://aws.amazon.com/blogs/big-data/batch-data-ingestion-into-amazon-opensearch-service-using-aws-glue/

Organizations constantly work to process and analyze vast volumes of data to derive actionable insights. Effective data ingestion and search capabilities have become essential for use cases like log analytics, application search, and enterprise search. These use cases demand a robust pipeline that can handle high data volumes and enable efficient data exploration.

Apache Spark, an open source powerhouse for large-scale data processing, is widely recognized for its speed, scalability, and ease of use. Its ability to process and transform massive datasets has made it an indispensable tool in modern data engineering. Amazon OpenSearch Service—a community-driven search and analytics solution—empowers organizations to search, aggregate, visualize, and analyze data seamlessly. Together, Spark and OpenSearch Service offer a compelling solution for building powerful data pipelines. However, ingesting data from Spark into OpenSearch Service can present challenges, especially with diverse data sources.

This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.

Overview of solution

AWS Glue is a serverless data integration service that simplifies data preparation and integration tasks for analytics, machine learning, and application development. In this post, we focus on batch data ingestion into OpenSearch Service using Spark on AWS Glue.

AWS Glue offers multiple integration options with OpenSearch Service using various open source and AWS managed libraries, including:

In the following sections, we explore each integration method in detail, guiding you through the setup and implementation. As we progress, we incrementally build the architecture diagram shown in the following figure, providing a clear path for creating robust data pipelines on AWS. Each implementation is independent of the others. We chose to showcase them separately, because in a real-world scenario, only one of the three integration methods is likely to be used.

Image showing the high level architecture diagram

You can find the code base in the accompanying GitHub repo. In the following sections, we walk through the steps to implement the solution.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Access to a valid AWS account
The latest AWS Command Line Interface (AWS CLI) installed on your local machine
git, awk, curl, and bash installed on your local machine
Permission to create AWS resources
Familiarity with Apache Spark, AWS Glue, and Amazon OpenSearch Service

Clone the repository to your local machine

Clone the repository to your local machine and set the BLOG_DIR environment variable. All the relative paths assume BLOG_DIR is set to the repository location in your machine. If BLOG_DIR is not being used, adjust the path accordingly.

git clone [email protected]:aws-samples/opensearch-glue-integration-patterns.git
cd opensearch-glue-integration-patterns
export BLOG_DIR=$(pwd)

Deploy the AWS CloudFormation template to create the necessary infrastructure

The main focus of this post is to demonstrate how to use the mentioned libraries in Spark on AWS Glue to ingest data into OpenSearch Service. Though we center on this core topic, several key AWS components will need to be pre-provisioned for the integration examples, such as a Amazon Virtual Private Cloud (Amazon VPC), multiple Subnets, an AWS Key Management Service (AWS KMS) key, an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue role, and an OpenSearch Service cluster with domains for OpenSearch Service and Elasticsearch. To simplify the setup, we’ve automated the provisioning of this core infrastructure using the cloudformation/opensearch-glue-infrastructure.yaml AWS CloudFormation template.

Run the following commands

The CloudFormation template will deploy the necessary networking components (such as VPC and subnets), Amazon CloudWatch logging, AWS Glue role, and OpenSearch Service and Elasticsearch domains required to implement the proposed architecture. Use a strong password (8–128 characters, three of which are lowercase, uppercase, numbers, or special characters, and no /, “, or spaces) and adhere to your organization’s security standards for ESMasterUserPassword and OSMasterUserPassword in the following command:

cd ${BLOG_DIR}/cloudformation/
aws cloudformation deploy \
--template-file ${BLOG_DIR}/cloudformation/opensearch-glue-infrastructure.yaml \
--stack-name GlueOpenSearchStack \
--capabilities CAPABILITY_NAMED_IAM \
--region <AWS_REGION> \
--parameter-overrides \
ESMasterUserPassword=<ES_MASTER_USER_PASSWORD> \
OSMasterUserPassword=<OS_MASTER_USER_PASSWORD>

You should see a success message such as "Successfully created/updated stack – GlueOpenSearchStack" after the resources have been provisioned successfully. Provisioning this CloudFormation stack typically takes approximately 30 minutes to complete.

On the AWS CloudFormation console, locate the GlueOpenSearchStack stack, and confirm that its status is CREATE_COMPLETE.

Image showing the "CREATE_COMPLETE" status of cloudformation template

You can review the deployed resources on the Resources tab, as shown in the following screenshot.The screenshot does not display all the created resources.

Image showing the "Resources" tab of cloudformation template

Additional setup steps

In this section, we collect essential information, including the S3 bucket name and the OpenSearch Service and Elasticsearch domain endpoints. These details are required for executing the code in subsequent sections.

Capture the details of the provisioned resources

Use the following AWS CLI command to extract and save the output values from the CloudFormation stack to a file named GlueOpenSearchStack_outputs.txt. We refer to the values in this file in upcoming steps.

aws cloudformation describe-stacks \
--stack-name GlueOpenSearchStack \
--query 'sort_by(Stacks[0].Outputs[], &OutputKey)[].{Key:OutputKey,Value:OutputValue}' \
--output table \
--no-cli-pager \
--region <AWS_REGION> > ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Download NY Green Taxi December 2022 dataset and copy to S3 bucket

The purpose of this post is to demonstrate the technical implementation of ingesting data into OpenSearch Service using AWS Glue. Understanding the dataset itself is not essential, aside from its data format, which we discuss in AWS Glue notebooks in later sections. To learn more about the dataset, you can find additional information on the NYC Taxi and Limousine Commission website.

We specifically request that you download the December 2022 dataset, because we have tested the solution using this particular dataset:

S3_BUCKET_NAME=$(awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt)
mkdir -p ${BLOG_DIR}/datasets && cd ${BLOG_DIR}/datasets
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-12.parquet
aws s3 cp green_tripdata_2022-12.parquet s3://${S3_BUCKET_NAME}/datasets/green_tripdata_2022-12.parquet

Download the required JARs from the Maven repository and copy to S3 bucket

We’ve specified a particular JAR file version to ensure stable deployment experience. However, we recommend adhering to your organization’s security best practices and reviewing any known vulnerabilities in the version of the JAR files before deployment. AWS does not guarantee the security of any open-source code used here. Additionally, please verify the downloaded JAR file’s checksum against the published value to confirm its integrity and authenticity.

mkdir -p ${BLOG_DIR}/jars && cd ${BLOG_DIR}/jars
# OpenSearch Service jar
curl -O https://repo1.maven.org/maven2/org/opensearch/client/opensearch-spark-30_2.12/1.0.1/opensearch-spark-30_2.12-1.0.1.jar
aws s3 cp opensearch-spark-30_2.12-1.0.1.jar s3://${S3_BUCKET_NAME}/jars/opensearch-spark-30_2.12-1.0.1.jar
# Elasticsearch jar
curl -O https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/7.17.23/elasticsearch-spark-30_2.12-7.17.23.jar
aws s3 cp elasticsearch-spark-30_2.12-7.17.23.jar s3://${S3_BUCKET_NAME}/jars/elasticsearch-spark-30_2.12-7.17.23.jar

In the following sections, we implement the individual data ingestion methods as outlined in the architecture diagram.

Ingest data into OpenSearch Service using the OpenSearch Spark library

In this section, we load an OpenSearch Service index using Spark and the OpenSearch Spark library. We demonstrate this implementation by using AWS Glue notebooks, employing basic authentication using user name and password.

To demonstrate the ingestion mechanisms, we have provided the Spark-and-OpenSearch-Code-Steps.ipynb notebook with detailed instructions. Follow the steps in this section in conjunction with the instructions in the notebook.

Set up the AWS Glue Studio notebook

Complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-OpenSearch-Code-Steps.ipynb.
For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a name for the notebook (for example, Spark-and-OpenSearch-Code-Steps) and choose Save.

Image showing AWS Glue OpenSearch Notebook

Replace the placeholder values in the notebook

Complete the following steps to update the placeholders in the notebook:

In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:

cd ${BLOG_DIR}
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:

awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 in the notebook, replace <OPEN-SEARCH-DOMAIN-WITHOUT-HTTPS> with the OpenSearch Service domain name. You can get the domain name by executing the following command:

awk -F '|' '$2 ~ /OpenSearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell of the notebook to load data into the OpenSearch Service domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Spark write modes (append vs. overwrite)

It is recommended to write data incrementally into OpenSearch Service indexes using the append mode, as demonstrated in Step 8 in the notebook. However, in certain cases, you may need to refresh the entire dataset in the OpenSearch Service index. In these scenarios, you can use the overwrite mode, though it is not advised for large indexes. When using overwrite mode, the Spark library deletes rows from the OpenSearch Service index one by one and then rewrites the data, which can be inefficient for large datasets. To avoid this, you can implement a preprocessing step in Spark to identify insertions and updates, and then write the data into OpenSearch Service using append mode.

Ingest data into Elasticsearch using the Elasticsearch Hadoop library

In this section, we load an Elasticsearch index using Spark and the Elasticsearch Hadoop Library. We demonstrate this implementation by using AWS Glue as the engine for Spark.

Set up the AWS Glue Studio notebook

Complete the following steps to set up the notebook:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-Elasticsearch-Code-Steps.ipynb.
For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a name for the notebook (for example, Spark-and-ElasticSearch-Code-Steps) and choose Save.

Image showing AWS Glue Elasticsearch Notebook

Replace the placeholder values in the notebook

Complete the following steps:

In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:

awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:

awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 in the notebook, replace <ELASTIC-SEARCH-DOMAIN-WITHOUT-HTTPS> with the Elasticsearch domain name. You can get the domain name by executing the following command:

awk -F '|' '$2 ~ /ElasticsearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell in the notebook to load data to the Elasticsearch domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Ingest data into OpenSearch Service using the AWS Glue OpenSearch Service connection

In this section, we load an OpenSearch Service index using Spark and the AWS Glue OpenSearch Service connection.

Create the AWS Glue job

Complete the following steps to create an AWS Glue Visual ETL job:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Visual ETL

This will open the AWS Glue job visual editor. Image showing AWS console page for AWS Glue to open Visual ETL