Tag Archives: Amazon OpenSearch

Implement fine-grained access control using Amazon OpenSearch Service and JSON Web Tokens

2025-08-28 Ramya Bhat

Post Syndicated from Ramya Bhat original https://aws.amazon.com/blogs/big-data/implement-fine-grained-access-control-using-amazon-opensearch-service-and-json-web-tokens/

This post demonstrates how to build a secure search application using Amazon OpenSearch Service and JSON Web Tokens (JWTs). We discuss the basics of OpenSearch Service and JWTs and how to implement user authentication and authorization through an existing identity provider (IdP). The focus is on enforcing fine-grained access control based on user roles and permissions.

JWT authentication and authorization for your OpenSearch Service domain provides a robust mechanism that addresses requirements for fine-grained access control. An IdP is a service that stores and manages user identities and their access rights, enabling centralized user authentication across multiple applications. The IdP issues JWTs, which are secure tokens containing claims about the authenticated user. By using JWTs from the IdP, you can:

Implement secure, role-based access control to search results
Validate user permissions before granting access to sensitive data
Maintain a centralized authentication mechanism across your search application
Make sure only authorized users can view data based on their predefined roles

The JWT integration helps organizations:

Define granular permissions within the IdP
Authenticate users using bearer tokens across different applications
Protect sensitive information through token-based access management
Reduce complexity of managing multiple authentication systems

Key benefits of the solution include:

Standardized token-based authentication
Centralized permission management
Simplified single sign-on (SSO) experience
Flexible and scalable access control mechanism

The ability to dynamically filter sensitive information based on token claims enhances data security while reducing the complexity of managing multiple authentication systems. This capability is made possible through the fine-grained access control (FGAC) feature in OpenSearch Service, which enforces document- and field-level access based on user roles.

Use case overview

In this post, we explore a user workflow with multiple roles and access level requirements. A research institution wants to build a secure search application with controlled access to biomedical databases specifically PubMed (a comprehensive database of biomedical literature) and Clinical Trials (a registry of medical research studies). Different research teams require varying levels of access to these datasets based on their roles and clearance levels. The following hierarchical access structure defines the user roles and their corresponding permission levels for accessing PubMed and Clinical Trials databases:

PubMed Admin – Full read access to all PubMed data (for senior research groups)
PubMed Limited – Restricted access to specific fields and documents (for researchers with limited access)
Clinical Trials Admin – Full read access to all Clinical Trials data (for principal investigators and senior trial managers)
Clinical Trials Limited – Restricted read access to specific trial information and aggregated data (for trial researchers with limited access)
Research Basic – Read-only access to specific public data in PubMed and Clinical Trials (for general research staff and interns)
Research Full Access – Full read and write access to all indices, with permissions to update or modify data

To implement this use case, we use JWTs generated by the supported IdP, which encode role-specific information. This setup makes sure OpenSearch Service can validate tokens before returning search results, dynamically filtering sensitive data based on the user’s JWT claims and fine-grained access control settings.

Solution overview

The technical workflow for using JWT authorization with OpenSearch Service involves several key stages:

User authentication – Users log in through the existing authentication system linked to the IdP
JWT generation – Upon successful authentication, the IdP generates a JWT containing specific role information
Search query submission – Users submit search queries to OpenSearch Service along with their JWT
Token validation – OpenSearch Service validates and decodes the JWT to verify user permissions
Result filtering – Search results are filtered based on the user’s permissions defined in the JWT
Data retrieval – Only authorized data is returned to the user, enforcing compliance with privacy standards

This workflow provides a standardized approach to authentication and authorization while streamlining user interactions with the search application. The solution makes sure each user sees only the information appropriate to their role, maintaining data privacy and organizational security standards.

You must enable JWT authentication and authorization, and fine-grained access control during the OpenSearch Service domain creation process. For more information, refer to Configuring JWT authentication and authorization and Fine-grained access control in Amazon OpenSearch Service.

The following diagram illustrates the solution architecture.

AWS architecture diagram showing authentication and search flow between services. The diagram shows integration with Amazon OpenSearch Service for queries and Amazon Cognito for authentication. The flow is marked with numbered steps (1-7) indicating the sequence of operations from client login through Cognito to executing authenticated OpenSearch queries.

This solution demonstrates authentication using Amazon Cognito as the IdP to generate the JWT. However, you can use another supported IdP. The ID token includes group membership information that OpenSearch Service maps to roles configured using fine-grained access control.

The user flow consists of the following steps:

The client initiates authentication by logging in with Amazon Cognito user credentials. Amazon Cognito returns an authorization code.
The client sends the authorization code to an Amazon API Gateway /token endpoint for ID token exchange.
API Gateway forwards the authorization code to an AWS Lambda function.
The Lambda function sends a token exchange request to Amazon Cognito with the authorization code.
The Lambda function receives the ID token from Amazon Cognito and returns it to the client.
The client sends an OpenSearch Service query to the API Gateway /search endpoint, including the ID token. API Gateway validates the ID token (JWT) with Amazon Cognito.
API Gateway forwards the request to a Lambda function.
The Lambda function checks if JWT authentication and authorization is enabled for the OpenSearch Service domain with the respective public key of the Amazon Cognito user pool. If not, it will enable and configure this feature for the OpenSearch Service domain. The Lambda function forwards the query and ID token to OpenSearch Service.
OpenSearch Service validates the JWT with Amazon Cognito:
1. OpenSearch Service verifies user permissions against fine-grained access control based on group membership.
2. OpenSearch Service returns query results to the client if authorization succeeds.

The following diagram illustrates the request flow.

Request flow diagram showing authentication and search flow between services.

Prerequisites

Before you deploy the solution, make sure you have the following prerequisites:

An AWS account
Familiarity with the Python programming language
Familiarity with AWS Identity and Access Management (IAM) and OpenSearch Service

Deploy solution resources

To deploy the solution resources, we use an AWS CloudFormation template. Launch the AWS CloudFormation template with the following Launch Stack button.

Enter an appropriate stack name. This name is used as a prefix for resources like OpenSearch Service domains and Lambda functions. Keep the default settings, and choose Create.

The stack deployment takes approximately 15–20 minutes. When deployment is complete, the stack status shows as CREATE_COMPLETE.

The outputs for this CloudFormation stack show important information regarding the deployed resources. This information will be referenced throughout different sections of this post.

On the Outputs tab, note the following values:

OpenSearchDashboardURL
SharedLambdaRoleArn

On the Resources tab, locate the following information:

OpenSearchMasterUserSecret: Choose the Physical ID link, then choose Retrieve Secret Value. Note the user name and password required for OpenSearch Service domain login.
IngestDataAndCreateBackendRoles: Choose the Physical ID link to open the Lambda function, needed in later steps.
UserPool: Choose the Physical ID link to open the Amazon Cognito user pool, needed in later steps.
RestAPI: Choose the Physical ID link to open the API Gateway endpoint, needed in later sections.

AWS CloudFormation Resources tab showing a list of deployed resources in a stack. The tab displays columns for Logical ID, Physical ID, Type, and Status of each resource. This view helps track and manage infrastructure components created by the CloudFormation template.

AWS CloudFormation Outputs tab displaying exported values and information from the stack. The tab shows a table with columns for Output Key, Output Value, and Description. This view allows users to see and access important configuration values and endpoints created by the stack.

In a separate browser tab, log in to the OpenSearch dashboard using OpenSearchDashboardsURL and user credentials noted previously.

Assign permissions to the IAM role associated with the Lambda function

Complete the following steps to map your IAM role to both the all_access and security_manager roles in OpenSearch Service:

In OpenSearch Dashboards, choose Security in the navigation pane, then choose Roles.
Open the all_access role.
In the Mapped users section, choose Manage mapping.
For Backend role, enter the IAM role Amazon Resource name (ARN). This is the value you copied from the CloudFormation stack output for SharedLambdaRoleArn.
Choose Map to confirm.

Interface showing mapping of users to all_access OpenSearch Service role

On the Roles page, open the security_manager role.
In the Mapped users section, choose Manage mapping.
For Backend role, enter the same IAM role ARN.
Choose Map to confirm the changes.

Interface showing mapping of users to security_manager OpenSearch Service role

These steps ensure the IAM role attached to the Lambda function has the necessary permissions to ingest data (all_access) and create roles (security_manager) within the OpenSearch Service domain.

In this sample setup, the Lambda function handles bulk ingestion and role creation without granting any direct access to users, and all_access is provided to the Lambda role solely to enable ingestion. FGAC in OpenSearch provides in-depth access control, allowing you to further tighten the Lambda role permissions by granting only the necessary CRUD operations, rather than full access for ingestion. For more details, refer to Defining users and roles and Fine-grained access control in OpenSearch.

Run the Lambda function to ingest data into the OpenSearch Service domain

On the CloudFormation stack’s Resources tab, locate the IngestDataAndCreateBackendRoles Lambda function. Open the Lambda function, choose Test, and execute it. You can confirm the function’s successful execution by checking Amazon CloudWatch Logs.

This Lambda function is designed to perform bulk ingestion and role creation in the OpenSearch Service domain. It ingests sample clinical research data into OpenSearch Service, creating two indexes (pubmed and clinical_trials), and sets up required OpenSearch Service roles. We explore these roles in detail in the next section.

Map roles and users in OpenSearch Service

In this step, we define two key OpenSearch Service roles:

pubmed-admin – Grants full read access to the PubMed index containing biomedical literature and research abstracts, intended for senior research groups
pubmed-limited – Provides restricted read access to only specific fields (journal, title, and abstract, where journal is a masked field), intended for researchers with limited data access

We have already created these roles by running the Lambda function in the previous section. The following code is the pubmed-admin OpenSearch Service role description:

The following code is the pubmed-limited OpenSearch Service role description:

The pubmed-admin and pubmed-limited roles serve different purposes, and their main distinction lies in how they control data visibility. Document-level security (DLS) lets you restrict a role to a subset of documents in an index, while field-level security (FLS) lets you control which document fields a user can see. The limited role is configured with FLS to expose only the journal, title, and abstract fields, while masked fields anonymize sensitive data such as journal. On top of these, you can apply DLS to hide specific records, for example, to prevent users from viewing documents from certain journals or publication years. In your use cases, use DLS and FLS to control document and field visibility for different users. These roles are fully configurable; you can add, remove, or update document and field access at any time to match evolving security or business requirements.

To enforce access control, users need to be mapped to appropriate OpenSearch Service roles on OpenSearch Dashboards. Complete the following steps to map users to the OpenSearch Service roles:

On OpenSearch Dashboards, choose Security in the navigation pane, then choose Roles.
Open the pubmed-admin role.
In the Mapped users section, choose Manage mapping.
For Backend role, enter pubmed_admin_group.
Choose Map to confirm the mapping.

Interface showing mapping of users to pubmed-admin OpenSearch Service role

On the Roles page, open the pubmed-limited role.
In the Mapped users section, choose Manage mapping.
For Backend role, enter pubmed_limited_group.
Choose Map to confirm the mapping.

Interface showing mapping of users to pubmed-limited OpenSearch Service role

Backend roles simplify access management in OpenSearch Service. Instead of mapping individual users to OpenSearch service roles, you can map roles to backend roles that users share. This approach lets you map IdP groups directly to the OpenSearch service roles. OpenSearch Service provides options when configuring your OpenSearch Service domain to map JWT claims to OpenSearch Service roles using the roles key.

In this solution, the JWT contains a field called cognito:groups that will be mapped as the roles key. In every JWT, this field has a value for the appropriate group the user belongs to. Based on the field value in the JWT and the mapping defined in the previous step for different research groups, OpenSearch Service domain dynamically assigns permissions:

If the JWT contains “cognito:groups”: [“pubmed_admin_group”], the user is granted pubmed_admin access
If the JWT contains “cognito:groups”: [“pubmed_limited_group”], the user is granted pubmed_limited access

Take a look at the examples below to understand what a JWT header and payload look like.

Sample JWT header:

{ "kid": "ksBAnCwgFgjaSVlETXx/xeUtvuPkZkacu10Xexample=", "alg": "RS256" }

Sample JWT payload:

{
    "at_hash": "Q7Bljd1Hj4bvC40example",
    "sub": "246894e8-a081-70ab-8fc0-25729example",
    "cognito:groups": [
        "pubmed_limited_group"
    ],
    "email_verified": true,
    "iss": "https://cognito-idp.us-east-1.amazonaws.com/us-east-1_B2example",
    "cognito:username": "PubMedAdminUser",
    "origin_jti": "096e366f-ce11-40e8-9e82-c4a15example",
    "aud": "q72b4a6o3sc2am2c235cqi2vc",
    "event_id": "0545ea01-3026-4563-8d1c-05a07example",
    "token_use": "id",
    "auth_time": 1739269731,
    "exp": 1739273331,
    "iat": 1739269731,
    "jti": "b39d6a3f-1670-4aaa-840a-1a92fexample",
    "email": "[email protected]“
}

Create users in Amazon Cognito

In this section, we create the following Amazon Cognito users:

PubMedAdminUser
PubMedLimitedUser
ClinicalTrialsAdminUser
ClinicalTrialsLimitedUser
ResearchBasicUser

The email address required for each user should be unique. If your email domain supports email alias, you can add a suffix to your own email address by using [email protected]. The following screenshot shows our users.

screenshot of Users section of Cognito User pool showing the target state after all the users are created.

On the CloudFormation stack’s Resources tab, locate the UserPool Amazon Cognito user pool that you noted earlier. Open the user pool in a new browser tab.

To create the Amazon Cognito users, complete the following steps for each user:

On the Amazon Cognito console, choose Users in the navigation pane.
Choose Create user.
For Alias attributes used to sign in, select Email.
For User name, enter a unique user name.
For Email address, enter a unique email address for each user.
Select Mark email address as verified.
Choose Create User.

screenshot of Information to be provided for creating each of the user

Create groups in Amazon Cognito

We create the following groups in Amazon Cognito:

pubmed_admin_group
pubmed_limited_group
clinical_trials_admin_group
clinical_trials_limited_group
research_basic_group

The following screenshot shows created groups.

screenshot of Groups section of Cognito User pool showing the target state after all the groups are created.

To create the Amazon Cognito groups, complete the following steps for each group:

On the Amazon Cognito console, choose Groups in the navigation pane.
Choose Create group.
For Group name, enter a unique name.
Choose Create group.

Add Amazon Cognito users to groups

The users should be added to the groups as follows:

Add PubMedAdminUser to the pubmed_admin_group group
Add PubMedLimitedUser to the pubmed_limited_group group
Add ClinicalTrialsAdminUser to the clinical_trials_admin_group group
Add ClinicalTrialsLimitedUser to the clinical_trials_limited_group group
Add ResearchBasicUser to the research_basic_group group

To add users to their respective group, complete the following steps for each group:

On the Amazon Cognito console, choose Groups in the navigation pane.
Choose the group to which you want to add a user.
Choose Add user to group.
Choose the user and choose Add.

Log in to generate a JWT

Before running the test queries in the next section, you must obtain the id_token (JWT) for the specified users. The tokens will expire in 60 minutes. If the token is expired for a user, you must log in again to get a fresh token. To log in with your user to get the id_token, complete the following steps:

On the Amazon Cognito console, open your user pool.
Choose App clients in the navigation pane.
Choose the app client.
Choose View login page.

screenshot of the App clients section of the userpool

Enter the user name that you used when creating the user.
Enter the temporary password that you set when creating the user.
For first-time logins, you will be prompted to create a new password. Enter a new password that meets the following requirements:
1. At least 8 characters
2. Contains uppercase and lowercase letters
3. Contains at least one number
4. Contains at least one special character
Copy the id_token value you generated (without quotation marks).

Query data in OpenSearch Service

This example demonstrates how OpenSearch Service filters search results based on user permissions. We test searches using JWTs for two different users to verify access controls. Each user’s search results are limited to the indexes and documents allowed by their assigned roles.

On the CloudFormation stack’s Resources tab, locate the RestAPI value that you noted earlier. Open the API gateway in a new browser tab.

Complete the following steps to test the search API for each of the scenarios mentioned in this section:

On the API Gateway console, choose Resources in the navigation pane.
Choose the /search resource.
Choose the POST method.
Choose Test.

Screenshot of the Test section for the search API in Amazon API Gateway.

When submitting queries to OpenSearch Service, make sure all double quotation marks are escaped to prevent syntax errors. Additionally, make sure you complete your query before your JWT expires, or you will need to generate a new token. If you attempt to use an expired token, it will result in an error.

For Scenarios 1 and 2, log in with your PubMedAdmin user, and for Scenarios 3 and 4, log in with your PubMedLimitedUser to obtain the required id_token.

Scenario 1

In this first query, we query the pubmed index with the credentials of user PubMedAdminUser, which is part of pubmed_admin_group:

{
  "query": {
    "match_all": {}
  }
}

Add the following values to the respective input fields:

For Query strings, enter query="{\"query\":{\"match_all\":{}}}"&index=pubmed
For Headers, enter id_token:<id-token-for-PubMedAdminUser>

values to be used for testing scenario 1

The following screenshot shows our query results.

Result of the search API call made for scenario 1

Users with the pubmed_admin role have full access to the PubMed index and can perform unrestricted searches across all fields and document types. This query successfully returns documents with the HTTP 200 status code because the user has complete read permissions on this index.

Scenario 2

Next, we query the clinical-trials index with the credentials of user PubMedAdminUser, who is part of pubmed_admin_group:

{
  "query": {
    "match_all": {}
  }
}

Add the following values to the respective input fields:

For Query strings, enter query="{\"query\":{\"match_all\":{}}}"&index=clinical-trials
For Headers, enter id_token:<id-token-for-PubMedAdminUser>

values to be used for testing scenario 2

The following screenshot shows our query results.

Result of the search API call made for scenario 2

Despite having admin privileges for PubMed data, this user receives a 403 Forbidden response when attempting to access the clinical-trials index. The error message indicates the lack of necessary permissions for performing search operations on this index.

Scenario 3

Now we query allowed fields in the pubmed index with the credentials of user PubMedLimitedUser, which is part of pubmed_limited_group:

{
    "query": {
        "match": {
            "title": "molecular biology"
        }
    }
}

Add the following values to the respective input fields:

For Query strings, enter query="{\"query\":{\"match\":{\"title\": \"molecular biology\"}}}"&index=pubmed
For Headers, enter id_token:<id-token-for-PubMedLimitedUser>

values to be used for testing scenario 3

The following screenshot shows our query results.

Result of the search API call made for scenario 3

Users with the pubmed_limited role can successfully query specific fields like title, but with restricted access to sensitive information. The query returns results with the HTTP 200 status code, but the journal field is anonymized due to field-level security policies. Users can search and view certain fields while having sensitive data automatically masked or excluded from their results.

Scenario 4

Lastly, we query unauthorized fields in the pubmed index with the credentials of user PubMedLimitedUser, which is part of pubmed_limited_group:

{
    "query": {
        "match": {
            "research_group": "RG_345"
        }
    }
}

Add the following values to the respective input fields:

For Query strings, enter query="{\"query\":{\"match\":{\"research_group\":\"RG_345\"}}}"&index=pubmed
For Headers, enter id_token:<id-token-for-PubMedLimitedUser>

values to be used for testing scenario 4

The following screenshot shows our query results.

Result of the search API call made for scenario 4

When a user with the pubmed_limited role attempts to query the restricted research_group field, OpenSearch returns a successful response (HTTP 200) but with empty results. This behavior occurs because field-level security is enforcing access controls instead of returning a HTTP 403 error, it silently filters out the restricted field from both the query and results. This security-by-obscurity approach means that users can’t determine whether their query failed due to lack of permissions or genuine absence of matching documents.

Clean up

To avoid incurring further AWS usage charges, delete the resources created in this post by deleting the CloudFormation stack. This step will remove all resources except Lambda layers. To delete the Lambda layers, navigate to the Layers page on the Lambda console, and delete the layers named <CloudFormation-Stack-Name>-requests and <CloudFormation-Stack-Name>-crypt.

Conclusion

In this post, we discussed how JWTs provide a robust and scalable authentication mechanism that can be integrated with existing IdPs. We also demonstrated how to seamlessly integrate fine-grained access control across search applications. Organizations can define granular permissions within their IdP, making sure sensitive information remains protected. The JWT integration with OpenSearch Service enables secure, efficient access control, so users can only access role-appropriate information while simplifying compliance and access management.

If you have feedback about this post, leave them in the comments section. If you have questions about this post, start a new thread on AWS Security, Identity, and Compliance re:Post or contact AWS Support.

About the authors

Ramya Bhat is a Data Analytics Consultant at AWS, specializing in the design and implementation of cloud-based data platforms. She builds enterprise-grade solutions across search, data warehousing, and ETL that enable organizations to modernize data ecosystems and derive insights through scalable analytics. She has delivered customer engagements across healthcare, insurance, fintech, and media sectors.

Shubhansu Sawaria is a Sr. Delivery Consultant – SRC at AWS, based in Bangalore, India. He specializes in designing and implementing comprehensive AWS Cloud security solutions. He has developed security solutions for startups, banks, and healthcare organizations. His expertise helps organizations elevate their cloud security infrastructures, achieve compliance objectives, and provide robust data protection.

Soujanya Konka is a Sr. Solutions Architect and Analytics Specialist at AWS, focused on helping customers build their ideas in the cloud. She has expertise in designing and implementing enterprise search solutions and advanced data analytics at scale.

Build enterprise-scale log ingestion pipelines with Amazon OpenSearch Service

2025-08-21 Akhil B

Post Syndicated from Akhil B original https://aws.amazon.com/blogs/big-data/build-enterprise-scale-log-ingestion-pipelines-with-amazon-opensearch-service/

Organizations of all sizes generate massive volumes of logs across their applications, infrastructure, and security systems to gain operational insights, troubleshoot issues, and maintain regulatory compliance. However, implementing log analytic solutions presents significant challenges, including complex data ingestion pipelines and the need to balance cost and performance while scaling to handle petabytes of data.

Amazon OpenSearch Service addresses these challenges by providing high-performance search and analytics capabilities, making it straightforward to deploy and manage OpenSearch clusters in the AWS Cloud without the infrastructure management overhead. A well-designed log analytics solution can help support proactive management in a variety of use cases, including debugging production issues, monitoring application performance, or meeting compliance requirements.

In this post, we share field-tested patterns for log ingestion that have helped organizations successfully implement logging at scale, while maintaining optimal performance and managing costs effectively.

Solution overview

Organizations can choose from several data ingestion architectures, such as:

Log shippers like FluentBit or Fluentd to send logs directly
Amazon Data Firehose for serverless data delivery
Custom solutions using AWS Lambda
Amazon OpenSearch Ingestion for managed data ingestion

Irrespective of the chosen pattern, a scalable log ingestion architecture should comprise the following logical layers:

Collect layer – This is the initial stage where logs are gathered from various sources, including application logs, system logs, and more.
Buffer layer – This layer acts as a temporary storage layer to handle spikes in log volume and prevents data loss during downstream processing issues. This layer also maintains system stability during high load.
Process layer – This layer transforms the unstructured logs into structured formats while adding relevant metadata and contextual information needed for effective analysis.
Store layer – This layer is the final destination for processed logs (OpenSearch in this case), which supports various access patterns, including querying, historical analysis, and data visualization.

OpenSearch Ingestion offers a purpose-built, fully managed experience that simplifies the data ingestion process. In this post, we focus on using OpenSearch Ingestion to load logs from Amazon Simple Storage Service (Amazon S3) into an OpenSearch Service domain, a common and efficient pattern for log analytics.

OpenSearch Ingestion is a fully managed, serverless data ingestion service that streamlines the process of loading data into OpenSearch Service domains or Amazon OpenSearch Serverless collections. It’s powered by Data Prepper, an open source data collector that filters, enriches, transforms, normalizes, and aggregates data for downstream analysis and visualization.

OpenSearch Ingestion uses pipelines as a mechanism that consists of the following major components:

Source – The input component of a pipeline. It defines the mechanism through which a pipeline consumes records.
Buffer – A persistent, disk-based buffer that stores data across multiple Availability Zones to enhance durability. OpenSearch Ingestion dynamically allocates OCUs for buffering, which increases pricing as you may need additional OCUs to maintain ingestion throughput.
Processors – The intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline.
Sink – The output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. A sink can also be another pipeline, so you can chain multiple pipelines together.

Because of its serverless nature, OpenSearch Ingestion automatically scales to accommodate varying workload demands, alleviating the need for manual infrastructure management while providing built-in monitoring capabilities. Users can focus on their data processing logic rather than spending time on operational complexities, making it an efficient solution for managing data pipelines in OpenSearch environments.

The following diagram illustrates the architecture of the log ingestion pipeline.

Let’s walk through how this solution processes Apache logs from ingestion to visualization:

The source application generates Apache logs that need to be analyzed and stores them in an S3 bucket, which acts as the central storage location for incoming log data. When a new log file is uploaded to the S3 bucket (ObjectCreate event), Amazon S3 automatically triggers an event notification that is configured to send messages to a designated Amazon Simple Queue Service (Amazon SQS) queue.
The SQS queue reliably manages and tracks the notifications of new files uploaded to Amazon S3, making sure the file event is delivered to the OpenSearch Ingestion pipeline. A dead-letter queue (DLQ) is configured to capture failed event processing.
The OpenSearch Ingestion pipeline monitors the SQS queue, retrieving messages that contain information about newly uploaded log files. When a message is received, the pipeline reads the corresponding log file from Amazon S3 for processing.
After the log file is retrieved, the OpenSearch Ingestion pipeline parses the content, and uses the OpenSearch Bulk API to efficiently ingest the processed log data into the OpenSearch Service domain, where it becomes available for search and analysis.
The ingested data can be visualized and analyzed through OpenSearch Dashboards, which provides a user-friendly interface for creating custom visualizations, dashboards, and performing real-time analysis of the log data with features like search, filtering, and aggregations.

In the following sections, we guide you through the steps to ingest application log files from Amazon S3 into OpenSearch Service using OpenSearch Ingestion. Additionally, we demonstrate how to visualize the ingested data using OpenSearch Dashboards.

Prerequisites

This post assumes you have the following:

An AWS account
The AWS Command Line Interface (AWS CLI) installed
The AWS CDK Toolkit installed
Python 3 installed

Deploy the solution

The solution uses a Python AWS Cloud Development Kit (AWS CDK) project to deploy an OpenSearch Service domain and associated components. This project demonstrates event-based data ingestion into the OpenSearch Service domain in a no code approach using OpenSearch Ingestion pipelines.

The deployment is automated using the AWS CDK and comprises the following steps:

Clone the GitHub repo.

git clone [email protected]:aws-samples/sample-log-ingestion-pipeline-for-amazon-opensearch-service.git

Create a virtual environment and install the Python dependencies:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Update the following environment variables in cdk.json:
1. domain_name: The OpenSearch domain to be created in your AWS account.
2. user_name: The user name for the internal primary user to be created within the OpenSearch domain.
3. user_password: The password for the internal primary user.

This deployment creates a public-facing OpenSearch domain but is secured through fine-grained access control (FGAC). For production workloads, consider deploying within a virtual private cloud (VPC) with additional security measures. For more information, see Security in Amazon OpenSearch Service.

Bootstrap the AWS CDK stack and initiate the deployment. Provide your AWS account number and the AWS Region where you want deploy the solution:

cdk bootstrap <Account ID>/<region>
cdk deploy --all

The process takes about 30–45 minutes to complete.

Verify the solution resources

When the previous steps are complete, you can check for the created resources.

You can confirm the existence of the stacks on the AWS CloudFormation console. As shown in the following screenshot, the CloudFormation stacks have been created and deployed by cdk bootstrap and cdk deploy.

On the OpenSearch Service console, under Managed clusters in the navigation pane, choose Domains. You can confirm the domain created.

On the OpenSearch Service console, under Ingestion in the navigation pane, choose Pipelines. You can see the pipeline apache-log-pipeline created.

Configure security options

To configure your security roles, complete the following steps:

On the AWS CloudFormation console, open the stack CdkIngestionStack, and on the Outputs tab, copy the Amazon Resource Name (ARN) of osi-pipeline-role.

Open the OpenSearch Service console in the deployed Region within your AWS account and choose the domain you created.
Choose the link for OpenSearch Dashboards URL.
In the login prompt, enter the user credentials that were provided in cdk.json.

After a successful login, the OpenSearch Dashboards console will be displayed.

If you’re prompted to select a tenant, select the Global tenant.
In the Security options, navigate to the Roles section and choose the all_access role.
On the all_access role page, navigate to mapped_users and choose Manage.
Choose Add another backend role under Backend roles and enter the IAM role ARN you copied.
Confirm by choosing Map.

Create an index template

The next step is to create an index template. Complete the following steps:

On the Dev Tools console, copy the contents of the file index_template.txt within the opensearch_object directory.
Enter the code in the Dev Tools console.

This index template defines the mapping and settings for our OpenSearch index.

Choose the play icon to submit the request and create a template.

In the Dashboard Management section, choose Saved Objects and choose Import.
Choose Import and choose the apache_access_log_dashboard.ndjson file within the opensearch_object directory.
Choose Check for existing objects.
Choose Automatically overwrite conflicts and choose Import.

Ingest data

Now you can proceed with the data ingestion.

On the Amazon S3 console, open the S3 bucket opensearch-logging-blog-<Account ID>.
Upload the data file apache_access_log.gz (within the apache_log_data directory). The file can be uploaded in any prefix.

For this solution, we use Apache access logs as our example data source. Although this pipeline is configured for Apache log format, it can be modified to support other log types by adjusting the pipeline configuration. See Overview of Amazon OpenSearch Ingestion for details about configuring different log formats.

After a few minutes, navigate to the Discover tab in OpenSearch Dashboards, where you can find that the data is ingested.
Confirm that the apache* index pattern is selected.

5. On the Dashboards tab, choose Apache Log Dashboard.

The dashboard will be populated by the data and visuals should be displayed.

Operational best practices

When designing your log analytics platform on OpenSearch Service, make sure you follow the recommended operational best practices for cluster configuration, data management, performance, monitoring, and cost optimization. For detailed guidance, refer to Operational best practices for Amazon OpenSearch Service.

Clean up

To avoid ongoing charges for the resources that you created, delete them by completing the following steps:

On the Amazon S3 console, open the bucket opensearch-logging-blog-<Account ID> and choose Empty.
Follow the prompts to delete the contents of the bucket.
Delete the AWS CDK stacks using the following command:

cdk destroy --all --force

Conclusion

As organizations continue to generate increasing volumes of log data, having a well-architected logging solution becomes crucial for maintaining operational visibility and meeting compliance requirements.

Implementing a robust logging infrastructure requires careful planning. In this post, we explored a field-tested approach in building a scalable, efficient, and cost-effective logging solution using OpenSearch Ingestion.

This solution serves as a starting point that can be customized based on specific organizational needs while maintaining the core principles of scalability, reliability, and cost-effectiveness.

Remember that logging infrastructure is not a “set-and-forget” system. Regular monitoring, periodic reviews of storage patterns, and adjustments to index management policies will help make sure your logging solution continues to serve your organization’s evolving needs effectively.

To dive deeper into OpenSearch Ingestion implementation, explore our comprehensive Amazon OpenSearch Service Workshops, which include hands-on labs and reference architectures. For additional insights, see Build a serverless log analytics pipeline using Amazon OpenSearch Ingestion with managed Amazon OpenSearch Service. You can also visit our Migration Hub if you’re ready to migrate legacy or self-managed workloads to OpenSearch Service.

About the authors

Akhil B is a Data Analytics Consultant at AWS Professional Services, specializing in cloud-based data solutions. He partners with customers to design and implement scalable data analytics platforms, helping organizations transform their traditional data infrastructure into modern, cloud-based solutions on AWS. His expertise helps organizations optimize their data ecosystems and maximize business value through modern analytics capabilities.

Chanpreet Singh is a Senior Consultant at AWS, specializing in the Data and AI/ML space. He has over 18 years of industry experience and is passionate about helping customers design, prototype, and scale Big Data and Generative AI applications using AWS native and open-source tech stacks. In his spare time, Chanpreet loves to explore nature, read, and spend time with his family.

Create an OpenSearch dashboard with Amazon OpenSearch Service

2025-08-05 Smita Singh

Post Syndicated from Smita Singh original https://aws.amazon.com/blogs/big-data/create-an-opensearch-dashboard-with-amazon-opensearch-service/

Effective log analysis is essential for maintaining the health and performance of modern applications. Amazon OpenSearch Service stands out as a powerful, fully managed solution for log analytics and observability. With its advanced indexing, full-text search, and real-time analytics capabilities, OpenSearch Service makes it possible for organizations to seamlessly ingest, process, and search log data across diverse sources—including AWS services like Amazon CloudWatch, VPC Flow Logs, and more.

With OpenSearch Dashboards, you can turn indexed log data into actionable visualizations that reveal insights and help detect anomalies. By querying data stored in OpenSearch Service, you can extract relevant information and display it using a variety of visualization types—such as line charts, bar graphs, pie charts, heatmaps, and more. These tools make it effortless to monitor system behavior, spot trends, and quickly identify issues in your environment.

This post demonstrates how to harness OpenSearch Dashboards to analyze logs visually and interactively. With this solution, IT administrators, developers, and DevOps engineers can create custom dashboards to monitor system behavior, detect anomalies early, and troubleshoot issues faster through interactive charts and graphs.

Solution overview

In this post, we show how to create an index pattern in OpenSearch Dashboards, create two types of visualizations, and display these visualizations on a custom dashboard. We also demonstrate how to export and import visualizations.

Prerequisites

Before diving into log analysis with OpenSearch Dashboards, you must have the following:

A properly configured OpenSearch Service domain
A working log collection and ingestion pipeline

Amazon OpenSearch Service 101: Create your first search application with OpenSearch guides you through setting up your OpenSearch Service domain and configuring the log ingestion pipeline.

For this post, we work with the following log sources, which have already been ingested into an OpenSearch Service cluster as part of the prerequisite steps:

Access OpenSearch Dashboards

Complete the following steps to access OpenSearch Dashboards:

On the OpenSearch Service console, choose Domains in the navigation pane.
Check if your domain status shows as Active.
Choose your domain to open the domain details page.
Choose the OpenSearch Dashboards URL to open it in a new browser window.

Authenticate into OpenSearch Dashboards using one of the supported methods.

Create an index pattern

After you’re logged in to OpenSearch Dashboards, you must create an index pattern. An index pattern allows OpenSearch Dashboards to locate indexes to search. Complete the following steps

In OpenSearch Dashboards, expand the navigation pane and choose Dashboard Management under Management.
Choose Index patterns in the navigation pane.

Choose Create index pattern.
For Index pattern name, enter a name (for example, log-aws-cloudtrail-*).
Choose Next step.

For Time field¸ choose @timestamp.
Choose Create index pattern.

Create visualizations

Now that the index pattern is created, let’s create some visualizations. For this post, we create a pie chart and an area graph.

Create a pie chart

Complete the following steps to create a pie chart:

In OpenSearch Dashboards, choose Visualize in the navigation pane.

Choose Create visualization.

Choose Pie as the visualization type.
For Source¸ choose log-aws-cloudtrail-*.

Under Buckets¸ choose Add and Split slices.

For Aggregation, choose Terms.

For Field, choose eventName.
For Size, enter 10.

Leave all other parameters as default and choose Update.
Choose Save to save the visualization.

Sample ndjson file for the pie chart – EventNamePie.ndjson

Please refer Export and import visualizations for how to import the samples.

The following screenshot shows our pie chart, which displays different types of events and their occurrence percentage in the last 30 minutes.

Create an area graph

Complete the following steps to create an area graph:

In OpenSearch Dashboards, choose Visualize in the navigation pane.
Choose Create visualization.
Choose Area as the visualization type.

For Source¸ choose log-aws-cloudtrail-*.

Under Buckets¸ choose Add and X-axis.

For Aggregation, choose Date Histogram.
For Field, choose @timestamp.
Leave all other parameters as default and choose Update

Under Advanced¸ choose Add and Split series.

For Aggregation, choose Terms.
For Field, choose eventName.
For Size, enter 10.
Leave all other parameters as default and choose Update.

Choose Save.
Update the time range to Last 60 minutes.
Choose Refresh and Save.

The following screenshot shows an area graph with different types of events and their occurrence count in the last 60 minutes.

Sample ndjson file for Area chart – EventNameArea.ndjson

Please refer Export and import visualizations for how to import the samples.

Create a dashboard

Now we will combine the visualizations we just created into a dashboard. A dashboard serves as a customizable interface that consolidates multiple visualizations, saved searches, and various content into a comprehensive view of data. Users can combine diverse visual elements—including charts, graphs, metrics, and tables—into a single cohesive display that can be arranged and resized on a flexible grid layout. You can simultaneously apply filters and time ranges across multiple visualizations, creating a coordinated analytical experience. Complete the following steps to create a dashboard:

In OpenSearch Dashboards, choose Dashboards in the navigation pane.
Choose Create new dashboard.

Choose Add on the menu bar.

Search for and choose the visualizations you created.

You can resize panels by dragging their corners to adjust dimensions. To modify the layout arrangement, you can drag the top portion of panels, which allows you to organize them horizontally in a row formation. When working with tabular visualizations, the system provides a convenient option to export your results in CSV format for further analysis or reporting purposes.

Choose Save.
Change the time range to Last 60 minutes.
Choose Refresh and Save.

Sample ndjson file for dashboard – CloudTrailSummary.ndjson

Please refer Export and import visualizations for how to import the samples.

The following screenshot shows the CloudTrail dashboard displaying both visualizations.

Export and import visualizations

In OpenSearch, an NDJSON file is used to import and export saved objects, such as dashboards, visualizations, maps, and index template. The NDJSON file provides a streamlined approach for handling large datasets by representing each JSON object on a separate line. This format enables efficient import/export operations, simplified data migration between environments, and seamless sharing of complex dashboard configurations. Organizations can back up and restore critical visualizations, saved searches, and dashboard settings while maintaining their integrity. The format’s structure reduces memory overhead during large transfers and improves processing speed for bulk operations. NDJSON’s human-readable nature also facilitates troubleshooting and manual editing when necessary, making it an invaluable tool for maintaining OpenSearch Dashboards deployments across development, testing, and production environments.

Export a visualization

Complete the following steps to export a visualization:

In OpenSearch Dashboards, choose Saved objects in the navigation pane.
Search for and select your object (in this case, a visualization), then choose Export.

The NDJSON file is downloaded in your local host.

Import a visualization

Complete the following steps to import a visualization:

In OpenSearch Dashboards, choose Saved objects in the navigation pane.
Choose Import.
Choose the first NDJSON file to be imported from your local host.
Select Create new objects with random IDs.
Choose Import.

Choose Done.

Choose Import.

You can now open the imported object.

The following screenshot shows our updated dashboard.

Clean up

To clean up your resources, delete the OpenSearch Service domain and relevant information stored or visualizations created on the domain. You will not be able to recover the data after you delete it.

On the OpenSearch Service console, choose Domains in the navigation pane.
Select the domain you created and choose Delete.

Conclusion

OpenSearch Dashboards is a powerful tool for transforming raw log data into actionable visualizations that drive insights and decision-making. In this post, we’ve shown how to create visualizations like pie charts and area graphs, build comprehensive dashboards, and efficiently export and import your work using NDJSON files. By using the fully managed OpenSearch Service features, organizations can focus on extracting valuable insights rather than managing infrastructure, ultimately enhancing their observability posture and operational efficiency.

To further enhance your OpenSearch proficiency, consider exploring advanced visualization options such as heat maps, gauge charts, and geographic maps that can represent your data in more specialized ways. Implementing automated alerting based on predefined thresholds will help you proactively identify anomalies before they become critical issues. You can also use OpenSearch’s powerful machine learning capabilities for sophisticated anomaly detection and predictive analytics to gain deeper insights from your log data. As your implementation grows, customizing security settings with fine-grained access controls will provide appropriate data visibility across different teams in your organization.

For comprehensive learning resources, refer to the Amazon OpenSearch Service Developer Guide, watch Create your first OpenSearch Dashboard on YouTube, explore best practices in Amazon OpenSearch blog posts, and gain hands-on experience through workshops available in AWS Workshops.

About the Authors

Smita Singh is a Senior Solutions Architect at AWS. She focuses on defining technical strategic vision and works on architecture, design, and implementation of modern, scalable platforms for large-scale global enterprises and SaaS providers. She is a data, analytics, and generative AI enthusiast and is passionate about building innovative, highly scalable, resilient, fault-tolerant, self-healing, multi-tenant platform solutions and accelerators.

Dipayan Sarkar is a Specialist Solutions Architect for Analytics at AWS, where he helps customers modernize their data platform using AWS analytics services. He works with customers to design and build analytics solutions, enabling businesses to make data-driven decisions.

Amazon OpenSearch Service 101: Create your first search application with OpenSearch

2025-06-25 Sriharsha Subramanya Begolli

Post Syndicated from Sriharsha Subramanya Begolli original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-101-create-your-first-search-application-with-opensearch/

Organizations today face the challenge of managing and deriving insights from an ever-expanding universe of data in real time. Industrial Internet of Things (IoT) sensors stream millions of temperature, pressure, and performance metrics from field equipment every second. Ecommerce platforms need to surface relevant products from vast catalogs instantly. Security teams must analyze system logs in real time to detect threats. As data volumes grow, organizations increasingly struggle with fragmented monitoring tools that create critical visibility gaps and slow incident response times. The cost of commercial observability solutions becomes prohibitive, forcing teams to manage multiple separate tools and increasing both operational overhead and troubleshooting complexity. Across these diverse scenarios, the ability to efficiently search, analyze, and visualize data in real time has become crucial for business success.

Amazon OpenSearch Service addresses these challenges by providing a fully managed search and analytics service. This managed service configures, manages, and scales OpenSearch clusters so you can focus on your search workloads and end customers. Amazon OpenSearch Serverless further makes it straightforward to run search and log analytics workloads by automatically scaling compute and storage resources up and down to match your application’s demands—with no infrastructure to manage. Whether you’re processing continuous streams of IoT telemetry, enabling product discovery, or performing security analytics, OpenSearch Service scales to meet your needs.

In this post, we walk you through a search application building process using Amazon OpenSearch Service. Whether you’re a developer new to search or looking to understand OpenSearch fundamentals, this hands-on post shows you how to build a search application from scratch—starting with the initial setup; diving into core components such as indexing, querying, result presentation; and culminating in the execution of your first search query.

Components of OpenSearch Service

Before building your first search application, it’s important to understand some key architectural components in OpenSearch. The fundamental unit of information in OpenSearch is a document stored in JSON format. These documents are organized into indices—collections of related documents that function similar to database tables. When you search for information, OpenSearch queries these indices to find matching documents.

OpenSearch operates on a distributed architecture where multiple servers, called nodes, work together in a cluster or domain. Each cluster can utilize dedicated master nodes that focus solely on cluster management tasks, such as maintaining cluster state, managing indices, and orchestrating shard allocation. These specialized nodes enhance cluster stability by offloading cluster management duties from data nodes. Data nodes, on the other hand, handle the storage, indexing, and querying of data—essentially performing the heavy lifting of data operations. Together, they provide scalability, availability, and efficient data processing in the cluster. Configure dedicated coordinator nodes that specialize in routing and distributing search and indexing requests across the cluster. These nodes reduce the load on data nodes, which allows them to focus on data storage, indexing, and search operations.

Coordinator nodes in OpenSearch are most beneficial in the following scenarios:

Large cluster deployments – When managing substantial data volumes across many nodes.
Query-intensive workloads – For environments handling frequent search queries or aggregations, especially those with complex date histograms or multiple aggregations, benefit from faster query processing.
Heavy dashboard utilization – OpenSearch Dashboards can be resource-intensive. Offloading this responsibility to dedicated coordinator nodes reduces the strain on data nodes.

To manage large datasets efficiently, OpenSearch splits indices into smaller pieces called shards. Each shard is distributed across the cluster, with a recommended size of 10–50 GB for optimal performance. For reliability and high availability, OpenSearch maintains replica copies of these shards on different nodes, which means that your data remains accessible even if some nodes fail.

Search operations in OpenSearch are powered by inverted indices, a data structure that maps terms to the documents containing them. The BM25 ranking algorithm helps make sure that search results are relevant to users’ queries. Although searches happen in near real time, with configurable refresh intervals, individual document retrievals are immediate.

This architecture provides the foundation for handling high-volume IoT data streams, complex full-text search operations, and real-time analytics, all while maintaining fault tolerance. Understanding these components will help you make informed decisions as you build your search application.OpenSearch Dashboards is a visualization and analytics tool for exploring, analyzing, and visualizing data in real time. It provides an intuitive interface for querying, monitoring, and reporting on OpenSearch data using visualizations such as charts, graphs, and maps. Key features include interactive dashboards, alerting, anomaly detection, security monitoring, and trace analytics.

Sample Amazon OpenSearch Service tutorial application overview

The following architecture diagram demonstrates how to build and deploy a scalable, fully managed search application on Amazon Web Services (AWS). The architecture uses Amazon OpenSearch Service for indexing and searching data. The UI application is deployed on AWS App Runner and interacts with Amazon OpenSearch Service through secure serverless Amazon API Gateway and AWS Lambda.

Here is the end-to-end workflow for our application detailing how user requests are handled from initial access through to data retrieval or indexing:

Users access the application through AWS App Runner, which hosts the frontend interface.
Amazon Cognito handles user authentication and authorization for secure access to the application.
When users interact with the application, their requests are sent to API Gateway. API Gateway communicates with Amazon Cognito to verify user authentication status. It serves as the primary entry point for all API operations and routes the requests appropriately. It forwards requests to Lambda functions within the virtual private cloud (VPC).
Lambda functions process the requests, performing either:
Data indexing operations into OpenSearch Service
Search queries against the OpenSearch Service cluster
The OpenSearch Service cluster resides within a private subnet in a VPC for enhanced security.

Prerequisites

Before you deploy the solution, review the prerequisites.

Install the sample app

The entire infrastructure is deployed using AWS Cloud Development Kit (AWS CDK), with cluster configurations customizable through the cdk.json file on GitHub. This deployment approach provides consistent and repeatable infrastructure creation while maintaining security best practices. The steps to deploy this infrastructure are available in this README file. After deployment, you’ll access a comprehensive search application built with Cloudscape React components that includes:

Interactive search functionality – Test various OpenSearch query methods including prefix match keyword searches, phrase matching, fuzzy searches, and field-specific queries against the sample product dataset
Document management tools – Bulk index the product catalog with a single click or delete and recreate the index as needed for testing purposes
Educational resources – Access embedded guides explaining OpenSearch concepts, query syntax, and best practices

Index the documents

After you’ve deployed this search application, the first step is to index some documents into OpenSearch Service. Sign in to the search application UI and follow these steps:

To trigger a bulk index process, under Index Documents in the navigation pane, choose Bulk Index Product Catalog.
Choose Index Product catalog, as shown in the following screenshot.

The Lambda function indexes a comprehensive ecommerce product catalog into your newly created OpenSearch Service cluster. This sample dataset includes detailed fashion and lifestyle products spanning multiple categories. Each product record contains rich metadata, including title, detailed description, category, color, and price.

Keyword searches

OpenSearch Service offers multiple search features. For an exhaustive list, refer to Search features. We focus on a few keyword search types to help you get started with OpenSearch.

With the product catalog in OpenSearch, you can perform prefix searches through the search application’s intuitive interface. To better understand the search functionality, expand the Guide section at the top of the interface. This interactive guide explains how various kinds of searches work, complete with a practical example in context of the product catalog dataset. The guide includes best practices and a link to the detailed documentation to help you make the most of OpenSearch’s powerful query capabilities.

You can do a prefix search on any of the three key search fields: Title, Description, or Color.

A typical prefix match query looks like this:

{
  "query": {
    "match_phrase_prefix": {
      "attribute_name": {
        "query": "attribute_value",
        "max_expansions": 10,
        "slop": 1
      }
    }
  }
}

You can use this query pattern to find documents where specific fields begin with your search term, offering an intuitive “starts with” search experience.

The following image illustrates a practical example of the Prefix Match search. Entering “Ru” in the title field matches products with titles such as “Running”, “Runners” and “Ruby.” Prefix Match search is particularly useful when users only remember the beginning of a product name or are searching across multiple variations or simply exploring product categories.

Multi Match search enables searching across multiple fields simultaneously. For example, you can search for “Coral” across product title, description, and color fields simultaneously. The search query can be customized using field boosting in which matches in certain fields carry more weight than others.

A typical multi match query looks like this:

{
  "query": {
    "multi_match": {
      "query": "Coral",
      "fields": [
        "title^3",
        "description",
        "color"
      ],
      "type": "best_fields"
    }
  }
}

You can explore Wildcard Match, Range Filter, and other search features through the search application. For developers and administrators managing this search infrastructure, OpenSearch Dashboards is a native, developer-friendly interface for indexing, searching, and managing your data. It serves as a comprehensive control center where you can interact directly with your indices, test queries, and monitor performance in real time. The following screenshot shows OpenSearch Dashboards which provides an interactive UI to explore, analyze and visualize search and log data.

While our example demonstrates lexical search functionality on a sample product catalog, OpenSearch Service is equally powerful for observability usecases. When handling time-series data from logs, metrics, or traces, OpenSearch excels at real-time analytics and visualization. For instance, DevOps teams can index application logs and system telemetry data, then use date histograms and statistical aggregations to identify performance bottlenecks or security anomalies as they occur. This real-time search allows IT teams to detect and respond to incidents with minimal delay. Using OpenSearch Dashboards, teams can create live operational dashboards that update automatically as new data streams in. For IoT applications monitoring thousands of sensors, this means temperature anomalies or equipment failures can trigger immediate alerts through OpenSearch’s alerting capabilities. These observability workloads benefit from the same distributed architecture that powers our product search example, with the added advantage of time-series optimized indices and retention policies for managing high-volume streaming data efficiently.

Beyond search management, you can configure alerts for specific conditions, set up notification channels for operational events, and enable data discovery features. If you want to experiment with the same search queries we implemented in our application, you can launch OpenSearch Dashboards and use relevant index and search APIs from the Dev Tools section, which is an ideal environment for developing and testing before implementing in your production application. Because our OpenSearch Service cluster resides within a private subnet, you need to create a Secure Shell (SSH) tunnel to access the dashboard. For more information and steps to do this, refer to How do I use an SSH tunnel to access OpenSearch Dashboards with Amazon Cognito authentication from outside a VPC? in the Knowledge Center. So far, we’ve explored OpenSearch’s query domain-specific language (DSL). However, for those coming in from a traditional database background, OpenSearch also offers SQL and Piped Processing Language (PPL) functionality, making the transition smoother. You can explore more on this at SQL and PPL in the OpenSearch documentation.

In this post, we introduced you to different types of keyword searches. You can also store documents as vector embeddings in OpenSearch and use it for semantic search, hybrid search, multimodal search, or to implement Retrieval Augmented Generation (RAG) pattern.

Conclusion

You can now build sample search applications by following the steps outlined in this post and the implementation details available at sample-for-amazon-opensearch-service-tutorials-101 on GitHub. By using the distributed architecture of Amazon OpenSearch Service, an AWS managed service, you get fast, scalable search capabilities that grow with your business, built-in security and compliance controls, and automated cluster management—all with pay-only-for-what-you-use pricing flexibility.

Ready to learn more? Check out the Amazon OpenSearch Service Developer Guide. For more insights, best practices and architectures, and industry trends, refer to Amazon OpenSearch Service blog posts and hands-on workshops at AWS Workshops. Please also visit the OpenSearch Service Migration Hub if you are ready to migrate legacy or self-managed workloads to OpenSearch Service.

We hope this detailed guide and accompanying code will help you get started. Try it out, let us know your thoughts in the comments section, and feel free to reach out to us for questions!

About the authors

Sriharsha Subramanya Begolli works as a Senior Solutions Architect with Amazon Web Services (AWS), based in Bengaluru, India. His primary focus is assisting large enterprise customers in modernizing their applications and developing cloud-based systems to meet their business objectives. His expertise lies in the domains of data and analytics.

Fraser Sequeira is a Startups Solutions Architect with Amazon Web Services (AWS) based in Melbourne, Australia. In his role at AWS, Fraser works closely with startups to design and build cloud-native solutions on AWS, with a focus on analytics and streaming workloads. With over 10 years of experience in cloud computing, Fraser has deep expertise in big data, real-time analytics, and building event-driven architecture on AWS. He enjoys staying on top of the latest technology innovations from AWS and sharing his learnings with customers. He spends his free time tinkering with new open source technologies.

Introducing vector search with UltraWarm in Amazon OpenSearch Service

2025-03-20 Kunal Kotwani

Post Syndicated from Kunal Kotwani original https://aws.amazon.com/blogs/big-data/introducing-vector-search-with-ultrawarm-in-amazon-opensearch-service/

Amazon OpenSearch Service has been providing vector database capabilities to enable efficient vector similarity searches using specialized k-nearest neighbor (k-NN) indexes to customers since 2019. This functionality has supported various use cases such as semantic search, Retrieval Augmented Generation (RAG) with large language models (LLMs), and rich media searching. With the explosion of AI capabilities and the increasing creation of generative AI applications, customers are seeking vector databases with rich feature sets.

OpenSearch Service also offers a multi-tiered storage solution to its customers in the form of UltraWarm and Cold tiers. UltraWarm provides cost-effective storage for less-active data with query capabilities, though with higher latency compared to hot storage. Cold tier offers even lower-cost archival storage for detached indexes that can be reattached when needed. Moving data to UltraWarm makes it immutable, which aligns well with use cases where data updates are infrequent like log analytics.

Until now, there was a limitation where UltraWarm or Cold storage tiers couldn’t store k-NN indexes. As customers adopt OpenSearch Service for vector use cases, we’ve observed that they’re facing high costs due to memory and storage becoming bottlenecks for their workloads.

To provide similar cost-saving economics for larger datasets, we are now supporting k-NN indexes in both UltraWarm and Cold tiers. This will enable you to save costs, especially for workloads where:

A significant portion of your vector data is accessed less frequently (for example, historical product catalogs, archived content embeddings, or older document repositories)
You need isolation between frequently and infrequently accessed workloads, minimizing the need to scale hot tier instances to help prevent interference from indexes that can be moved to the warm tier

In this post, we discuss this new capability and its use cases, and provide a cost-benefit analysis in different scenarios.

New capability: K-NN indexes in UltraWarm and Cold tiers

You can now enable UltraWarm and Cold tiers for your k-NN indexes from OpenSearch Service version 2.17 and up. This feature is available for both new and existing domains upgraded to version 2.17. K-NN indexes created after OpenSearch Service version 2.x are eligible for migration to warm and cold tiers. K-NN indexes using various types of engines (FAISS, NMSLib, and Lucene) are eligible to migrate.

Use cases

This multi-tiered approach to k-NN vector search benefits the following various use cases:

Long-term semantic search – Maintain searchability on years of historical text data for legal, research, or compliance purposes
Evolving AI models – Store embeddings from multiple versions of AI models, allowing comparisons and backward compatibility without the cost of keeping all data in hot storage
Large-scale image and video similarity – Build extensive libraries of visual content that can be searched efficiently, even as the dataset grows beyond the practical limits of hot storage
Ecommerce product recommendations – Store and search through vast product catalogs, moving less popular or seasonal items to cheaper tiers while maintaining search capabilities

Let’s explore real-world scenarios to illustrate the potential cost benefits of using k-NN indexes with UltraWarm and Cold storage tiers. We will be using us-east-1 as the representative AWS Region for these scenarios.

Scenario 1: Balancing hot and warm storage for mixed workloads

Let’s say you have 100 million vectors of 768 dimensions (around 330 GB of raw vectors) spread across 20 Lucene engine indexes of 5 million vectors each (roughly 16.5 GB), out of which 50% of data (about 10 indexes or 165 GB) is queried infrequently.

Domain setup without UltraWarm support

In this approach, you prioritize maximum performance by keeping all of the data in hot storage, providing the fastest possible query responses for the vectors. You deploy a cluster with 6x r6gd.4xlarge instances.

The monthly cost for this setup comes to $7,550 per month with a data instance cost of $6,700.

Although this provides top-tier performance for the queries, it might be over-provisioned given the mixed access patterns of your data.

Cost-saving strategy: UltraWarm domain setup

In this approach, you align your storage strategy with the observed access patterns, optimizing for both performance and cost. The hot tier continues to provide optimal performance for frequently accessed data, while less critical data moves to UltraWarm storage.

While UltraWarm queries experience higher latency compared to hot storage—this trade-off is often acceptable for less frequently accessed data. Additionally, since UltraWarm data becomes immutable, this strategy works best for stable datasets that don’t require any updates.

You keep the frequently accessed 50% of data (roughly 165 GB) in hot storage, allowing you to reduce your hot tier to 3x r6gd.4xlarge.search instances. For the less frequently accessed 50% of data (roughly 165 GB), you introduce 2x ultrawarm1.medium.search instances as UltraWarm nodes. This tier offers a cost-effective solution for data that doesn’t require the absolute fastest access times.

By tiering your data based on access patterns, you significantly reduce your hot tier footprint while introducing a small warm tier for less critical data. This strategy allows you to maintain high performance for frequent queries while optimizing costs for the entire system.

The hot tier continues to provide optimal performance for the majority of queries targeting frequently accessed data. For the warm tier, you see an increase in latency for queries on less frequently accessed data, but this is mitigated by effective caching on the UltraWarm nodes. Overall, the system maintains high availability and fault tolerance.

This balanced approach reduces your monthly cost to $5,350, with $3,350 for the hot tier and $350 for the warm tier, reducing the monthly costs by roughly 29% overall.

Scenario 2: Managing Growing Vector Database with Access-Based Patterns

Imagine your system processes and indexes vast amounts of content (text, images, and videos), generating vector embeddings using the Lucene engine for advanced content recommendation and similarity search. As your content library grows, you’ve observed clear access patterns where newer or popular content is queried frequently while older or less popular content sees decreased activity but still needs to be searchable.

To effectively leverage tiered storage in OpenSearch Service, consider organizing your data into separate indices based on expected query patterns. This index-level organization is important because data migration between tiers happens at the index level, allowing you to move specific indices to cost-effective storage tiers as their access patterns change.

Your current dataset consists of 150 GB of vector data, growing by 50 GB monthly as new content is added. The data access patterns show:

About 30% of your content receives 70% of the queries, typically newer or popular items
Another 30% sees moderate query volume
The remaining 40% is accessed infrequently but must remain searchable for completeness and occasional deep analysis

Given these characteristics, let’s explore a single-tiered and multi-tiered approach to managing this growing dataset efficiently.

Single-tiered configuration

For a single-tiered configuration, as the dataset expands, the vector data will grow to be around 400 GB over 6 months, all stored in a hot (default) tier. In the case of r6gd.8xlarge.search instances, the data instance count would be around 3 nodes.

The overall monthly costs for the domain under a single-tiered setup would be around $8050 with a data instance cost of around $6700.

Multi-tiered configuration

To optimize performance and cost, you implement a multi-tiered storage strategy using Index State Management (ISM) policies to automate the movement of indices between tiers as access patterns evolve:

Hot tier – Stores frequently accessed indices for fastest access
Warm tier – Houses moderately accessed indices with higher latency
Cold tier – Archives rarely accessed indices for cost-effective long-term retention

For the data distribution, you start with a total of 150 GB with a monthly growth of 50 GB. The following is the projected data distribution when the data reaches 400 GB at around the 6 month mark:

Hot tier – Approximately 100 GB (most frequently queried content) on 1x r6gd.8xlarge
Warm Tier – Approximately 100 GB (moderately accessed content) on 2x ultrawarm1.medium.search
Cold Tier – Approximately 200 GB (rarely accessed content)

Under the multi-tiered setup, the cost for the vector data domain totals $3880, including $2330 cost of data nodes, $350 cost of UltraWarm nodes, and $5.00 of cold storage costs.

You see compute savings as the hot tier instance size reduced by around 66%. Your overall cost savings were around 50% year-over-year with multi-tiered domains.

Scenario 3: Large-scale disk-based vector search with UltraWarm

Let’s consider a system managing 1 billion vectors of 768 dimensions distributed across 100 indexes of 10 million vectors each. The system predominantly uses disk-based vector search with 32x FAISS quantization for cost optimization, and about 70% of queries target 30% of the data, making it an ideal candidate for tiered storage.

Domain setup without UltraWarm support

In this approach, using disk-based vector search to handle the large-scale data, you deploy a cluster with 4x r6gd.4xlarge instances. This setup provides adequate storage capacity while optimizing memory usage through disk-based search.

The monthly cost for this setup comes to $6,500 per month with a data instance cost of $4,470.

Cost-saving strategy: UltraWarm domain setup

In this approach, you align your storage strategy with the observed query patterns, similar to Scenario 1.

You keep the frequently accessed 30% of data in hot storage, using 1x r6gd.4xlarge instances. For the less frequently accessed 70% of data, you use 2x ultrawarm1.medium.search instances.

You use disk-based vector search in both storage tiers to optimize memory usage. This balanced approach reduces your monthly cost to $3,270, with $1,120 for the hot tier and $400 for the warm tier, reducing the monthly costs by roughly 50% overall.

Get started with UltraWarm and Cold storage

To take advantage of k-NN indexes in UltraWarm and Cold tiers, make sure that your domain is running OpenSearch Service 2.17 or later. For instructions to migrate k-NN indexes across storage tiers, refer to UltraWarm storage for Amazon OpenSearch Service.

Consider the following best practices for multi-tiered vector search:

Analyze your query patterns to optimize data placement across tiers
Use Index State Management (ISM) to manage the data lifecycle across tiers transparently
Monitor cache hit rates using the k-NN stats and adjust tiering and node sizing as needed

Summary

The introduction of k-NN vector search capabilities in UltraWarm and Cold tiers for OpenSearch Service marks a significant step forward in providing cost-effective, scalable solutions for vector search workloads. This feature allows you to balance performance and cost by keeping frequently accessed data in hot storage for lowest latency, while moving less active data to UltraWarm for cost savings. While UltraWarm storage introduces some performance trade-offs and makes data immutable, these characteristics often align well with real-world access patterns where older data sees fewer queries and updates.

We encourage you to evaluate your current vector search workloads and consider how this multi-tier approach could benefit your use cases. As AI and machine learning continue to evolve, we remain committed to enhancing our services to meet your growing needs.

Stay tuned for future updates as we continue to innovate and expand the capabilities of vector search in OpenSearch Service.

About the Authors

Kunal Kotwani is a software engineer at Amazon Web Services, focusing on OpenSearch core and vector search technologies. His major contributions include developing storage optimization solutions for both local and remote storage systems that help customers run their search workloads more cost-effectively.

Navneet Verma is a senior software engineer at AWS OpenSearch . His primary interests include machine learning, search engines and improving search relevancy. Outside of work, he enjoys playing badminton.

Sorabh Hamirwasia is a senior software engineer at AWS working on the OpenSearch Project. His primary interest include building cost optimized and performant distributed systems.

Batch data ingestion into Amazon OpenSearch Service using AWS Glue

2025-01-13 Ravikiran Rao

Post Syndicated from Ravikiran Rao original https://aws.amazon.com/blogs/big-data/batch-data-ingestion-into-amazon-opensearch-service-using-aws-glue/

Organizations constantly work to process and analyze vast volumes of data to derive actionable insights. Effective data ingestion and search capabilities have become essential for use cases like log analytics, application search, and enterprise search. These use cases demand a robust pipeline that can handle high data volumes and enable efficient data exploration.

Apache Spark, an open source powerhouse for large-scale data processing, is widely recognized for its speed, scalability, and ease of use. Its ability to process and transform massive datasets has made it an indispensable tool in modern data engineering. Amazon OpenSearch Service—a community-driven search and analytics solution—empowers organizations to search, aggregate, visualize, and analyze data seamlessly. Together, Spark and OpenSearch Service offer a compelling solution for building powerful data pipelines. However, ingesting data from Spark into OpenSearch Service can present challenges, especially with diverse data sources.

This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.

Overview of solution

AWS Glue is a serverless data integration service that simplifies data preparation and integration tasks for analytics, machine learning, and application development. In this post, we focus on batch data ingestion into OpenSearch Service using Spark on AWS Glue.

AWS Glue offers multiple integration options with OpenSearch Service using various open source and AWS managed libraries, including:

In the following sections, we explore each integration method in detail, guiding you through the setup and implementation. As we progress, we incrementally build the architecture diagram shown in the following figure, providing a clear path for creating robust data pipelines on AWS. Each implementation is independent of the others. We chose to showcase them separately, because in a real-world scenario, only one of the three integration methods is likely to be used.

Image showing the high level architecture diagram

You can find the code base in the accompanying GitHub repo. In the following sections, we walk through the steps to implement the solution.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Access to a valid AWS account
The latest AWS Command Line Interface (AWS CLI) installed on your local machine
git, awk, curl, and bash installed on your local machine
Permission to create AWS resources
Familiarity with Apache Spark, AWS Glue, and Amazon OpenSearch Service

Clone the repository to your local machine

Clone the repository to your local machine and set the BLOG_DIR environment variable. All the relative paths assume BLOG_DIR is set to the repository location in your machine. If BLOG_DIR is not being used, adjust the path accordingly.

git clone [email protected]:aws-samples/opensearch-glue-integration-patterns.git
cd opensearch-glue-integration-patterns
export BLOG_DIR=$(pwd)

Deploy the AWS CloudFormation template to create the necessary infrastructure

The main focus of this post is to demonstrate how to use the mentioned libraries in Spark on AWS Glue to ingest data into OpenSearch Service. Though we center on this core topic, several key AWS components will need to be pre-provisioned for the integration examples, such as a Amazon Virtual Private Cloud (Amazon VPC), multiple Subnets, an AWS Key Management Service (AWS KMS) key, an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue role, and an OpenSearch Service cluster with domains for OpenSearch Service and Elasticsearch. To simplify the setup, we’ve automated the provisioning of this core infrastructure using the cloudformation/opensearch-glue-infrastructure.yaml AWS CloudFormation template.

Run the following commands

The CloudFormation template will deploy the necessary networking components (such as VPC and subnets), Amazon CloudWatch logging, AWS Glue role, and OpenSearch Service and Elasticsearch domains required to implement the proposed architecture. Use a strong password (8–128 characters, three of which are lowercase, uppercase, numbers, or special characters, and no /, “, or spaces) and adhere to your organization’s security standards for ESMasterUserPassword and OSMasterUserPassword in the following command:

cd ${BLOG_DIR}/cloudformation/
aws cloudformation deploy \
--template-file ${BLOG_DIR}/cloudformation/opensearch-glue-infrastructure.yaml \
--stack-name GlueOpenSearchStack \
--capabilities CAPABILITY_NAMED_IAM \
--region <AWS_REGION> \
--parameter-overrides \
ESMasterUserPassword=<ES_MASTER_USER_PASSWORD> \
OSMasterUserPassword=<OS_MASTER_USER_PASSWORD>

You should see a success message such as "Successfully created/updated stack – GlueOpenSearchStack" after the resources have been provisioned successfully. Provisioning this CloudFormation stack typically takes approximately 30 minutes to complete.

On the AWS CloudFormation console, locate the GlueOpenSearchStack stack, and confirm that its status is CREATE_COMPLETE.

Image showing the "CREATE_COMPLETE" status of cloudformation template

You can review the deployed resources on the Resources tab, as shown in the following screenshot.The screenshot does not display all the created resources.

Image showing the "Resources" tab of cloudformation template

Additional setup steps

In this section, we collect essential information, including the S3 bucket name and the OpenSearch Service and Elasticsearch domain endpoints. These details are required for executing the code in subsequent sections.

Capture the details of the provisioned resources

Use the following AWS CLI command to extract and save the output values from the CloudFormation stack to a file named GlueOpenSearchStack_outputs.txt. We refer to the values in this file in upcoming steps.

aws cloudformation describe-stacks \
--stack-name GlueOpenSearchStack \
--query 'sort_by(Stacks[0].Outputs[], &OutputKey)[].{Key:OutputKey,Value:OutputValue}' \
--output table \
--no-cli-pager \
--region <AWS_REGION> > ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Download NY Green Taxi December 2022 dataset and copy to S3 bucket

The purpose of this post is to demonstrate the technical implementation of ingesting data into OpenSearch Service using AWS Glue. Understanding the dataset itself is not essential, aside from its data format, which we discuss in AWS Glue notebooks in later sections. To learn more about the dataset, you can find additional information on the NYC Taxi and Limousine Commission website.

We specifically request that you download the December 2022 dataset, because we have tested the solution using this particular dataset:

S3_BUCKET_NAME=$(awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt)
mkdir -p ${BLOG_DIR}/datasets && cd ${BLOG_DIR}/datasets
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-12.parquet
aws s3 cp green_tripdata_2022-12.parquet s3://${S3_BUCKET_NAME}/datasets/green_tripdata_2022-12.parquet

Download the required JARs from the Maven repository and copy to S3 bucket

We’ve specified a particular JAR file version to ensure stable deployment experience. However, we recommend adhering to your organization’s security best practices and reviewing any known vulnerabilities in the version of the JAR files before deployment. AWS does not guarantee the security of any open-source code used here. Additionally, please verify the downloaded JAR file’s checksum against the published value to confirm its integrity and authenticity.

mkdir -p ${BLOG_DIR}/jars && cd ${BLOG_DIR}/jars
# OpenSearch Service jar
curl -O https://repo1.maven.org/maven2/org/opensearch/client/opensearch-spark-30_2.12/1.0.1/opensearch-spark-30_2.12-1.0.1.jar
aws s3 cp opensearch-spark-30_2.12-1.0.1.jar s3://${S3_BUCKET_NAME}/jars/opensearch-spark-30_2.12-1.0.1.jar
# Elasticsearch jar
curl -O https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/7.17.23/elasticsearch-spark-30_2.12-7.17.23.jar
aws s3 cp elasticsearch-spark-30_2.12-7.17.23.jar s3://${S3_BUCKET_NAME}/jars/elasticsearch-spark-30_2.12-7.17.23.jar

In the following sections, we implement the individual data ingestion methods as outlined in the architecture diagram.

Ingest data into OpenSearch Service using the OpenSearch Spark library

In this section, we load an OpenSearch Service index using Spark and the OpenSearch Spark library. We demonstrate this implementation by using AWS Glue notebooks, employing basic authentication using user name and password.

To demonstrate the ingestion mechanisms, we have provided the Spark-and-OpenSearch-Code-Steps.ipynb notebook with detailed instructions. Follow the steps in this section in conjunction with the instructions in the notebook.

Set up the AWS Glue Studio notebook

Complete the following steps:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-OpenSearch-Code-Steps.ipynb.
For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a name for the notebook (for example, Spark-and-OpenSearch-Code-Steps) and choose Save.

Image showing AWS Glue OpenSearch Notebook

Replace the placeholder values in the notebook

Complete the following steps to update the placeholders in the notebook:

In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:

cd ${BLOG_DIR}
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:

awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 in the notebook, replace <OPEN-SEARCH-DOMAIN-WITHOUT-HTTPS> with the OpenSearch Service domain name. You can get the domain name by executing the following command:

awk -F '|' '$2 ~ /OpenSearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell of the notebook to load data into the OpenSearch Service domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Spark write modes (append vs. overwrite)

It is recommended to write data incrementally into OpenSearch Service indexes using the append mode, as demonstrated in Step 8 in the notebook. However, in certain cases, you may need to refresh the entire dataset in the OpenSearch Service index. In these scenarios, you can use the overwrite mode, though it is not advised for large indexes. When using overwrite mode, the Spark library deletes rows from the OpenSearch Service index one by one and then rewrites the data, which can be inefficient for large datasets. To avoid this, you can implement a preprocessing step in Spark to identify insertions and updates, and then write the data into OpenSearch Service using append mode.

Ingest data into Elasticsearch using the Elasticsearch Hadoop library

In this section, we load an Elasticsearch index using Spark and the Elasticsearch Hadoop Library. We demonstrate this implementation by using AWS Glue as the engine for Spark.

Set up the AWS Glue Studio notebook

Complete the following steps to set up the notebook:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-Elasticsearch-Code-Steps.ipynb.
For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a name for the notebook (for example, Spark-and-ElasticSearch-Code-Steps) and choose Save.

Image showing AWS Glue Elasticsearch Notebook

Replace the placeholder values in the notebook

Complete the following steps:

In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the name of the interactive session by executing the following command:

awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket name. You can get the name of the S3 bucket by executing the following command:

awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 in the notebook, replace <ELASTIC-SEARCH-DOMAIN-WITHOUT-HTTPS> with the Elasticsearch domain name. You can get the domain name by executing the following command:

awk -F '|' '$2 ~ /ElasticsearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell in the notebook to load data to the Elasticsearch domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Ingest data into OpenSearch Service using the AWS Glue OpenSearch Service connection

In this section, we load an OpenSearch Service index using Spark and the AWS Glue OpenSearch Service connection.

Create the AWS Glue job

Complete the following steps to create an AWS Glue Visual ETL job:

On the AWS Glue console, choose ETL jobs in the navigation pane.
Under Create job, choose Visual ETL

This will open the AWS Glue job visual editor. Image showing AWS console page for AWS Glue to open Visual ETL