Tag Archives: Amazon OpenSearch Service

Batch data ingestion into Amazon OpenSearch Service using AWS Glue

Post Syndicated from Ravikiran Rao original https://aws.amazon.com/blogs/big-data/batch-data-ingestion-into-amazon-opensearch-service-using-aws-glue/

Organizations constantly work to process and analyze vast volumes of data to derive actionable insights. Effective data ingestion and search capabilities have become essential for use cases like log analytics, application search, and enterprise search. These use cases demand a robust pipeline that can handle high data volumes and enable efficient data exploration.

Apache Spark, an open source powerhouse for large-scale data processing, is widely recognized for its speed, scalability, and ease of use. Its ability to process and transform massive datasets has made it an indispensable tool in modern data engineering. Amazon OpenSearch Service—a community-driven search and analytics solution—empowers organizations to search, aggregate, visualize, and analyze data seamlessly. Together, Spark and OpenSearch Service offer a compelling solution for building powerful data pipelines. However, ingesting data from Spark into OpenSearch Service can present challenges, especially with diverse data sources.

This post showcases how to use Spark on AWS Glue to seamlessly ingest data into OpenSearch Service. We cover batch ingestion methods, share practical examples, and discuss best practices to help you build optimized and scalable data pipelines on AWS.

Overview of solution

AWS Glue is a serverless data integration service that simplifies data preparation and integration tasks for analytics, machine learning, and application development. In this post, we focus on batch data ingestion into OpenSearch Service using Spark on AWS Glue.

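AWS Glue offers multiple integration options with OpenSearch Service using various open source and AWS managed libraries, including:

  • The OpenSearch Spark library
  • The Elasticsearch Hadoop library
  • The AWS Glue OpenSearch Service connection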

In the following sections, we explore each integration method in detail, guiding you through the setup and implementation. As we progress, we incrementally build the architecture diagram shown in the following figure, providing a clear path for creating robust data pipelines on AWS. Each implementation is independent of the others. We chose to showcase them separately, because in a real-world scenario, only one of the three integration methods is likely to be used.

Image showing the high level architecture diagram

You can find the code base in the accompanying GitHub repo. In the following sections, we walk through the steps to implement the solution.

Prerequisites

Before you deploy this solution, make sure the following prerequisites are in place:

Clone the repository to your local machine

Clone the repository to your local machine and set the BLOG_DIR environment variable. All the relative paths assume BLOG_DIR is set to the repository location in your machine. If BLOG_DIR is not being used, adjust the path accordingly.

git clone git@github.com:aws-samples/opensearch-glue-integration-patterns.git
cd opensearch-glue-integration-patterns
export BLOG_DIR=$(pwd)

Deploy the AWS CloudFormation template to create the necessary infrastructure

The main focus of this post is to demonstrate how to use the mentioned libraries in Spark on AWS Glue to ingest data into OpenSearch Service. Although we center on this core topic, several key AWS components must be pre-provisioned for the integration examples, such as an Amazon Virtual Private Cloud (Amazon VPC), multiple subnets, an AWS Key Management Service (AWS KMS) key, an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue role, and an OpenSearch Service cluster with domains for OpenSearch Service and Elasticsearch. To simplify the setup, we’ve automated the provisioning of this core infrastructure using the cloudformation/opensearch-glue-infrastructure.yaml AWS CloudFormation template.

  1. Run the following commands

The CloudFormation template will deploy the necessary networking components (such as VPC and subnets), Amazon CloudWatch logging, AWS Glue role, and OpenSearch Service and Elasticsearch domains required to implement the proposed architecture. Use a strong password (8–128 characters, three of which are lowercase, uppercase, numbers, or special characters, and no /, “, or spaces) and adhere to your organization’s security standards for ESMasterUserPassword and OSMasterUserPassword in the following command:

cd ${BLOG_DIR}/cloudformation/
aws cloudformation deploy \
--template-file ${BLOG_DIR}/cloudformation/opensearch-glue-infrastructure.yaml \
--stack-name GlueOpenSearchStack \
--capabilities CAPABILITY_NAMED_IAM \
--region <AWS_REGION> \
--parameter-overrides \
ESMasterUserPassword=<ES_MASTER_USER_PASSWORD> \
OSMasterUserPassword=<OS_MASTER_USER_PASSWORD>

You should see a success message such as "Successfully created/updated stack – GlueOpenSearchStack" after the resources have been provisioned successfully. Provisioning this CloudFormation stack typically takes approximately 30 minutes to complete.

  2. On the AWS CloudFormation console, locate the GlueOpenSearchStack stack, and confirm that its status is CREATE_COMPLETE.

Image showing the "CREATE_COMPLETE" status of cloudformation template

You can review the deployed resources on the Resources tab, as shown in the following screenshot. The screenshot does not display all the created resources.

Image showing the "Resources" tab of cloudformation template

Additional setup steps

In this section, we collect essential information, including the S3 bucket name and the OpenSearch Service and Elasticsearch domain endpoints. These details are required for executing the code in subsequent sections.

Capture the details of the provisioned resources

Use the following AWS CLI command to extract and save the output values from the CloudFormation stack to a file named GlueOpenSearchStack_outputs.txt. We refer to the values in this file in upcoming steps.

aws cloudformation describe-stacks \
--stack-name GlueOpenSearchStack \
--query 'sort_by(Stacks[0].Outputs[], &OutputKey)[].{Key:OutputKey,Value:OutputValue}' \
--output table \
--no-cli-pager \
--region <AWS_REGION> > ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt
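If you prefer to read the saved values from Python rather than with awk, the following is a minimal sketch that parses the table-formatted file. It assumes the file has the two-column Key/Value layout that the preceding command produces; the sample keys and values here are hypothetical placeholders, not the real stack outputs.

```python
from typing import Dict, Optional


def parse_stack_outputs(table_text: str) -> Dict[str, str]:
    """Parse AWS CLI table output with Key and Value columns into a dict."""
    outputs: Dict[str, str] = {}
    for line in table_text.splitlines():
        # Data rows start with "|"; border rows start with "-" or "+".
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Skip the single-cell title row and the column-header row.
        if len(cells) != 2 or cells == ["Key", "Value"]:
            continue
        key, value = cells
        outputs[key] = value
    return outputs


def find_output(outputs: Dict[str, str], key_fragment: str) -> Optional[str]:
    """Return the first value whose key contains key_fragment (like awk's ~ match)."""
    for key, value in outputs.items():
        if key_fragment in key:
            return value
    return None


# Hypothetical sample; real keys and values come from your stack.
SAMPLE_TABLE = """\
--------------------------------------------------------
|                    DescribeStacks                    |
+---------------------------+--------------------------+
|            Key            |          Value           |
+---------------------------+--------------------------+
|  OpenSearchDomainEndpoint |  search-example.local    |
|  S3Bucket                 |  example-glue-bucket     |
+---------------------------+--------------------------+
"""

outputs = parse_stack_outputs(SAMPLE_TABLE)
print(find_output(outputs, "S3Bucket"))
```

To use it with the real file, read GlueOpenSearchStack_outputs.txt into a string and pass it to parse_stack_outputs.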

Download NY Green Taxi December 2022 dataset and copy to S3 bucket

The purpose of this post is to demonstrate the technical implementation of ingesting data into OpenSearch Service using AWS Glue. Understanding the dataset itself is not essential, aside from its data format, which we discuss in AWS Glue notebooks in later sections. To learn more about the dataset, you can find additional information on the NYC Taxi and Limousine Commission website.

We specifically request that you download the December 2022 dataset, because we have tested the solution using this particular dataset:

S3_BUCKET_NAME=$(awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt)
mkdir -p ${BLOG_DIR}/datasets && cd ${BLOG_DIR}/datasets
curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2022-12.parquet
aws s3 cp green_tripdata_2022-12.parquet s3://${S3_BUCKET_NAME}/datasets/green_tripdata_2022-12.parquet

Download the required JARs from the Maven repository and copy to S3 bucket

We’ve specified a particular JAR file version to ensure a stable deployment experience. However, we recommend adhering to your organization’s security best practices and reviewing any known vulnerabilities in the version of the JAR files before deployment. AWS does not guarantee the security of any open-source code used here. Additionally, please verify the downloaded JAR file’s checksum against the published value to confirm its integrity and authenticity.

mkdir -p ${BLOG_DIR}/jars && cd ${BLOG_DIR}/jars
# OpenSearch Service jar
curl -O https://repo1.maven.org/maven2/org/opensearch/client/opensearch-spark-30_2.12/1.0.1/opensearch-spark-30_2.12-1.0.1.jar
aws s3 cp opensearch-spark-30_2.12-1.0.1.jar s3://${S3_BUCKET_NAME}/jars/opensearch-spark-30_2.12-1.0.1.jar
# Elasticsearch jar
curl -O https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/7.17.23/elasticsearch-spark-30_2.12-7.17.23.jar
aws s3 cp elasticsearch-spark-30_2.12-7.17.23.jar s3://${S3_BUCKET_NAME}/jars/elasticsearch-spark-30_2.12-7.17.23.jar
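The checksum verification mentioned earlier can be sketched in Python as follows. This is a minimal example, assuming you have separately obtained the published SHA-1 value for each JAR (Maven Central typically serves it at the artifact URL with a .sha1 suffix). The function names are our own, and your organization may require a stronger digest or signature verification.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path: str, published_value: str) -> bool:
    """Compare a file's SHA-1 with a published value.

    Published checksum files sometimes contain "<hash>  <filename>",
    so only the first whitespace-separated token is compared.
    """
    tokens = published_value.split()
    if not tokens:
        return False
    return sha1_of_file(path) == tokens[0].lower()
```

For example, checksum_matches("opensearch-spark-30_2.12-1.0.1.jar", published_sha1) should return True before you copy the JAR to Amazon S3, where published_sha1 is the value you downloaded from the repository.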

In the following sections, we implement the individual data ingestion methods as outlined in the architecture diagram.

Ingest data into OpenSearch Service using the OpenSearch Spark library

In this section, we load an OpenSearch Service index using Spark and the OpenSearch Spark library. We demonstrate this implementation by using AWS Glue notebooks, employing basic authentication using user name and password.

To demonstrate the ingestion mechanisms, we have provided the Spark-and-OpenSearch-Code-Steps.ipynb notebook with detailed instructions. Follow the steps in this section in conjunction with the instructions in the notebook.

Set up the AWS Glue Studio notebook

Complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

  3. Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-OpenSearch-Code-Steps.ipynb.
  4. For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

  5. Enter a name for the notebook (for example, Spark-and-OpenSearch-Code-Steps) and choose Save.

Image showing AWS Glue OpenSearch Notebook

Replace the placeholder values in the notebook

Complete the following steps to update the placeholders in the notebook:

  1. In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the connection name by running the following command:
cd ${BLOG_DIR}
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt
  2. In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> so that the variable s3_bucket holds the bucket name. You can get the name of the S3 bucket by running the following command:
awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt
  3. In Step 4 in the notebook, replace <OPEN-SEARCH-DOMAIN-WITHOUT-HTTPS> with the OpenSearch Service domain endpoint (without the https:// prefix). You can get the endpoint by running the following command:
awk -F '|' '$2 ~ /OpenSearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell of the notebook to load data into the OpenSearch Service domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.
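To give a sense of what the notebook cells configure, here is a hedged sketch of the connector options for a basic-authentication write over HTTPS. The option names follow the opensearch-hadoop connector's configuration keys as we understand them (with the es prefix for the Elasticsearch connector); the helper function, port, and index name are our assumptions, and the notebook in the repository remains the reference for the actual settings.

```python
from typing import Dict


def connector_options(endpoint: str, user: str, password: str,
                      prefix: str = "opensearch") -> Dict[str, str]:
    """Build Spark connector options for a domain endpoint.

    prefix is "opensearch" for the OpenSearch Spark library and "es"
    for the Elasticsearch Hadoop library (assumed naming convention).
    """
    # The placeholder in the notebook expects the endpoint without https://.
    host = endpoint.replace("https://", "").rstrip("/")
    return {
        f"{prefix}.nodes": host,
        f"{prefix}.port": "443",
        f"{prefix}.net.ssl": "true",
        # Managed domains sit behind a load balancer, so node discovery is disabled.
        f"{prefix}.nodes.wan.only": "true",
        f"{prefix}.net.http.auth.user": user,
        f"{prefix}.net.http.auth.pass": password,
    }


# Inside a Spark session (not executed here), a write could look like:
#   opts = connector_options(domain_endpoint, user_name, password)
#   df.write.format("opensearch").options(**opts).mode("append").save("green_taxi")
```

In practice, you would fetch the password from a secrets store rather than hard-coding it in a notebook cell.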

Spark write modes (append vs. overwrite)

It is recommended to write data incrementally into OpenSearch Service indexes using the append mode, as demonstrated in Step 8 in the notebook. However, in certain cases, you may need to refresh the entire dataset in the OpenSearch Service index. In these scenarios, you can use the overwrite mode, though it is not advised for large indexes. When using overwrite mode, the Spark library deletes rows from the OpenSearch Service index one by one and then rewrites the data, which can be inefficient for large datasets. To avoid this, you can implement a preprocessing step in Spark to identify insertions and updates, and then write the data into OpenSearch Service using append mode.
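The preprocessing idea can be illustrated with a small sketch. It uses plain Python rather than Spark so that the logic is easy to follow; in a Spark job, the same comparison would typically be a join between the incoming DataFrame and a set of previously stored fingerprints. The function names, the id field, and the fingerprint store are all hypothetical.

```python
import hashlib
import json
from typing import Dict, Hashable, List, Tuple


def fingerprint(record: dict) -> str:
    """Return a stable hash of a record's content."""
    payload = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_changes(incoming: List[dict],
                  indexed_fingerprints: Dict[Hashable, str],
                  id_field: str = "id") -> Tuple[List[dict], List[dict]]:
    """Split incoming records into new documents and changed documents.

    Records whose fingerprint matches the stored one are skipped,
    so only the delta needs to be written in append mode.
    """
    inserts, updates = [], []
    for record in incoming:
        known = indexed_fingerprints.get(record[id_field])
        if known is None:
            inserts.append(record)
        elif known != fingerprint(record):
            updates.append(record)
    return inserts, updates
```

Note that writing the updated records without creating duplicates also requires the connector to use a stable document ID; that configuration is outside the scope of this sketch.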

Ingest data into Elasticsearch using the Elasticsearch Hadoop library

In this section, we load an Elasticsearch index using Spark and the Elasticsearch Hadoop Library. We demonstrate this implementation by using AWS Glue as the engine for Spark.

Set up the AWS Glue Studio notebook

Complete the following steps to set up the notebook:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Create job, choose Notebook.

Image showing AWS console page for AWS Glue to open notebook

  3. Upload the notebook file located at ${BLOG_DIR}/glue_jobs/Spark-and-Elasticsearch-Code-Steps.ipynb.
  4. For IAM role, choose the AWS Glue job IAM role that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

  5. Enter a name for the notebook (for example, Spark-and-ElasticSearch-Code-Steps) and choose Save.

Image showing AWS Glue Elasticsearch Notebook

Replace the placeholder values in the notebook

Complete the following steps:

  1. In Step 1 in the notebook, replace the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection name. You can get the connection name by running the following command:
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt
  2. In Step 1 in the notebook, replace the placeholder <S3-BUCKET-NAME> so that the variable s3_bucket holds the bucket name. You can get the name of the S3 bucket by running the following command:
awk -F '|' '$2 ~ /S3Bucket/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt
  3. In Step 4 in the notebook, replace <ELASTIC-SEARCH-DOMAIN-WITHOUT-HTTPS> with the Elasticsearch domain endpoint (without the https:// prefix). You can get the endpoint by running the following command:
awk -F '|' '$2 ~ /ElasticsearchDomainEndpoint/ {gsub(/^[ \t]+|[ \t]+$/, "", $3); print $3}' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the notebook

Run each cell in the notebook to load data to the Elasticsearch domain and read it back to verify the successful load. Refer to the detailed instructions within the notebook for execution-specific guidance.

Ingest data into OpenSearch Service using the AWS Glue OpenSearch Service connection

In this section, we load an OpenSearch Service index using Spark and the AWS Glue OpenSearch Service connection.

Create the AWS Glue job

Complete the following steps to create an AWS Glue Visual ETL job:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Create job, choose Visual ETL.

This will open the AWS Glue job visual editor.

Image showing AWS console page for AWS Glue to open Visual ETL

  3. Choose the plus sign, and under Sources, choose Amazon S3.

Image showing AWS console page for AWS Glue Visual Editor

  4. In the visual editor, choose the Data Source – S3 bucket node.
  5. In the Data source properties – S3 pane, configure the data source as follows:
    • For S3 source type, select S3 location.
    • For S3 URL, choose Browse S3, and choose the green_tripdata_2022-12.parquet file from the designated S3 bucket.
    • For Data format, choose Parquet.
  6. Choose Infer schema to let AWS Glue detect the schema of the data.

This will set up your data source from the specified S3 bucket.

Image showing AWS console page for AWS Glue Visual Editor

  7. Choose the plus sign again to add a new node.
  8. For Transforms, choose Drop Fields to include this transformation step.

This will allow you to remove any unnecessary fields from your dataset before loading it into OpenSearch Service.

Image showing AWS console page for AWS Glue Visual Editor

  9. Choose the Drop Fields transform node, then select the following fields to drop from the dataset:
    • payment_type
    • trip_type
    • congestion_surcharge

This will remove these fields from the data before it is loaded into OpenSearch Service.

Image showing AWS console page for AWS Glue Visual Editor

  10. Choose the plus sign again to add a new node.
  11. For Targets, choose Amazon OpenSearch Service.

This will configure OpenSearch Service as the destination for the data being processed.

Image showing AWS console page for AWS Glue Visual Editor

  12. Choose the Data target – Amazon OpenSearch Service node and configure it as follows:
    • For Amazon OpenSearch Service connection, choose the connection GlueOpenSearchServiceConnec-* from the drop down.
    • For Index, enter green_taxi. The green_taxi index was created earlier in the “Ingest data into OpenSearch Service using the OpenSearch Spark library” section.

This configures the job to write the processed data to the specified index in OpenSearch Service.

Image showing AWS console page for AWS Glue Visual Editor

  13. On the Job details tab, update the job details as follows:
    • For Name, enter a name (for example, Spark-and-Glue-OpenSearch-Connector).
    • For Description, enter an optional description (for example, AWS Glue job using Glue OpenSearch Connection to load data into Amazon OpenSearch Service).
    • For IAM role, choose the role starting with GlueOpenSearchStack-GlueRole-*.
    • For the Glue version, choose Glue 4.0 – Supports spark 3.3, Scala 2, Python 3
    • Leave the rest of the fields as default.
    • Choose Save to save the changes.

Image showing AWS console page for AWS Glue Visual Editor

  14. To run the AWS Glue job Spark-and-Glue-OpenSearch-Connector, choose Run.

This will initiate the job execution.

Image showing AWS console page for AWS Glue Visual Editor

  15. Choose the Runs tab and wait for the AWS Glue job to complete successfully.

You will see the status change to Succeeded when the job is complete.

Image showing AWS console page for AWS Glue job run status

Clean up

To clean up your resources, complete the following steps:

  1. Delete the CloudFormation stack:
aws cloudformation delete-stack \
--stack-name GlueOpenSearchStack \
--region <AWS_REGION>
  2. Delete the AWS Glue jobs:
    • On the AWS Glue console, under ETL jobs in the navigation pane, choose Visual ETL.
    • Select the jobs you created (Spark-and-Glue-OpenSearch-Connector, Spark-and-ElasticSearch-Code-Steps, and Spark-and-OpenSearch-Code-Steps) and on the Actions menu, choose Delete.

Conclusion

In this post, we explored several ways to ingest data into OpenSearch Service using Spark on AWS Glue. We demonstrated the use of three key libraries: the AWS Glue OpenSearch Service connection, the OpenSearch Spark Library, and the Elasticsearch Hadoop Library. The methods outlined in this post can help you streamline your data ingestion into OpenSearch Service.

If you’re interested in learning more and getting hands-on experience, we’ve created a workshop that walks you through the entire process in detail. You can explore the full setup for ingesting data into OpenSearch Service, handling both batch and real-time streams, and building dashboards. Check out the workshop Unified Real-Time Data Processing and Analytics Using Amazon OpenSearch and Apache Spark to deepen your understanding and apply these techniques step by step.


About the Authors

Ravikiran Rao is a Data Architect at Amazon Web Services and is passionate about solving complex data challenges for various customers. Outside of work, he is a theater enthusiast and amateur tennis player.

Vishwa Gupta is a Senior Data Architect with the AWS Professional Services Analytics Practice. He helps customers implement big data and analytics solutions. Outside of work, he enjoys spending time with family, traveling, and trying new food.

Suvojit Dasgupta is a Principal Data Architect at Amazon Web Services. He leads a team of skilled engineers in designing and building scalable data solutions for AWS customers. He specializes in developing and implementing innovative data architectures to address complex business challenges.

Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/cost-optimized-vector-database-introduction-to-amazon-opensearch-service-quantization-techniques/

The rise of generative AI applications has increased the need for semantic search and natural language search. These advanced search features help find and retrieve conceptually relevant documents from enterprise content repositories to serve as prompts for generative AI models. Raw data in source repositories (text, images, audio, video, and so on) is converted by embedding models into a standard numerical representation called vectors, which power semantic and natural language search. As organizations adopt more sophisticated large language models and foundation models to power their generative AI applications, the supplemental embedding models are also evolving to handle large, high-dimensional vector embeddings. As the vector volume grows, memory usage and computational requirements increase proportionally, resulting in higher operational costs. To mitigate this issue, various compression techniques can be used to optimize memory usage and computational efficiency.

Quantization is a lossy data compression technique that aims to lower computation and memory usage, leading to lower costs, especially for high-volume data workloads. There are various techniques to compress data depending on the type and volume of the data. The usual approach is to map an infinite set of values (or a relatively large finite set) to a smaller, discrete set of values. Vector compression can be achieved through two primary techniques: product quantization and scalar quantization. In product quantization, the original vector is broken into multiple sub-vectors, and each sub-vector is encoded into a fixed number of bits that represent the original vector. You then store and search only the encoded sub-vectors instead of the original vector. In scalar quantization, each dimension of the input vector is mapped from a 32-bit floating-point representation to a smaller data type.

Amazon OpenSearch Service, as a vector database, supports scalar and product quantization techniques to optimize memory usage and reduce operational costs.

OpenSearch as a vector database

OpenSearch is a distributed search and analytics service. The OpenSearch k-nearest neighbor (k-NN) plugin allows you to index, store, and search vectors. Vectors are stored in OpenSearch as 32-bit float arrays of type knn_vector, which supports up to 16,000 dimensions per vector.

OpenSearch uses approximate nearest neighbor search to provide scalable vector search. The approximate k-NN algorithm retrieves results based on an estimation of the nearest vectors to a given query vector. Two main methods for performing approximate k-NN are the graph-based Hierarchical Navigable Small-World (HNSW) and the cluster-based Inverted File (IVF). These data structures are constructed and loaded into memory during the initial vector search operation. As vector volume grows, both the data structures and associated memory requirements for search operations scale proportionally.

For example, each HNSW graph with 32-bit float data takes approximately 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory. Here, num_vectors represents the total number of vectors to be indexed, d is the number of dimensions determined by the embedding model you use to generate the vectors, and m is the number of edges in the HNSW graph, an index parameter that can be controlled to tune performance. Using this formula, memory requirements for vector storage for a configuration of 384 dimensions and an m value of 16 would be:

  • 1 million vectors: 1.830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000 bytes)
  • 1 billion vectors: 1830 GB (1.1 * (4 * 384 + 8 * 16) * 1,000,000,000 bytes)
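The estimates above can be reproduced with a few lines of Python. This is a minimal sketch of the formula as stated in this post; the function name is our own, and gigabytes are taken as 10^9 bytes to match the figures shown.

```python
def hnsw_fp32_memory_bytes(d: int, m: int, num_vectors: int) -> float:
    """Approximate memory for an HNSW graph over 32-bit float vectors.

    d is the vector dimension, m is the number of graph edges per node,
    and 1.1 is the overhead factor used in the formula above.
    """
    return 1.1 * (4 * d + 8 * m) * num_vectors


for count in (1_000_000, 1_000_000_000):
    gigabytes = hnsw_fp32_memory_bytes(d=384, m=16, num_vectors=count) / 1e9
    print(f"{count:,} vectors: {gigabytes:,.1f} GB")
```

Changing d or m in the call shows how quickly the footprint grows with higher-dimensional embeddings or denser graphs.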

Although approximate nearest neighbor search can be optimized to handle massive datasets with billions of vectors efficiently, the memory requirements for loading 32-bit full-precision vectors into memory during the search process can become prohibitively costly. To mitigate this, OpenSearch Service supports the following four quantization techniques:

  • Binary quantization
  • Byte quantization
  • FP16 quantization
  • Product quantization

These techniques fall within the broader categories of scalar and product quantization discussed earlier. In this post, you will learn quantization techniques for optimizing vector workloads on OpenSearch Service, focusing on memory reduction and cost-efficiency. The post also introduces the new disk-based vector search approach, which enables efficient querying of vectors stored on disk without loading them into memory. This approach integrates seamlessly with quantization techniques and features key configurations such as the on_disk mode and the compression_level parameter. These settings provide built-in, out-of-the-box scalar quantization at indexing time.

Binary quantization (up to 32x compression)

Binary quantization (BQ) is a type of scalar quantization. OpenSearch uses the FAISS engine’s binary quantization, enabling up to 32x compression during indexing. This technique reduces each vector dimension from the default 32-bit float to a 1-bit binary value, compressing the vectors into 0s and 1s. OpenSearch supports indexing, storing, and searching binary vectors. You can also choose to encode each vector dimension using 1, 2, or 4 bits, depending on the desired compression factor, as shown in the following example. The compression factor can be adjusted using the bits setting. A value of 2 yields 16x compression, while 4 results in 8x compression. The default setting is 1. In binary quantization, the training is handled natively at the time of indexing, allowing you to avoid an additional preprocessing step.

To implement binary quantization, define the vector type as knn_vector and specify the encoder name as binary with the desired number of encoding bits. Note that the encoder parameter refers to the method used to compress vector data before storing it in the index. You can tune performance with the space_type, m, and ef_construction parameters. See the OpenSearch documentation for information about the underlying configuration of the approximate k-NN.

PUT my-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2",
          "parameters": {
            "m": 16,
            "ef_construction": 512,
            "encoder": {
              "name": "binary",
              "parameters": {
                "bits": 1
              }
            }
          }
        }
      }
    }
  }
}

Memory requirements for implementing binary quantization with FAISS-HNSW:

1.1 * (bits * (d / 8) + 8 * m) * num_vectors bytes.

Compression | Encoding bits | Memory required for 1 billion vectors with d=384 and m=16 (in GB)
32x | 1 | 193.6
16x | 2 | 246.4
8x | 4 | 352.0
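The figures in the table follow directly from the formula. The following sketch recomputes them; the function name is our own, and gigabytes are taken as 10^9 bytes.

```python
def hnsw_binary_memory_bytes(d: int, m: int, num_vectors: int, bits: int = 1) -> float:
    """Approximate memory for FAISS-HNSW with binary quantization.

    bits is the number of encoding bits per dimension (1, 2, or 4).
    """
    if bits not in (1, 2, 4):
        raise ValueError("bits must be 1, 2, or 4")
    return 1.1 * (bits * (d / 8) + 8 * m) * num_vectors


for bits in (1, 2, 4):
    gigabytes = hnsw_binary_memory_bytes(384, 16, 1_000_000_000, bits) / 1e9
    print(f"bits={bits}: {gigabytes:.1f} GB")
```

Because the 8 * m graph term is not compressed, the overall memory reduction is smaller than the nominal per-vector compression factor.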

For detailed implementation steps on binary quantization, see the OpenSearch documentation.

Byte quantization (4x compression)

Byte quantization compresses 32-bit floating-point dimensions to 8-bit integers, ranging from –128 to +127, reducing memory usage by 75%. OpenSearch supports indexing, storing, and searching byte vectors, which must be converted to 8-bit format prior to ingestion. To implement byte vectors, specify the k-NN vector field data_type as byte in the index mapping. This feature is compatible with both Lucene and FAISS engines. An example of creating an index for byte-quantized vectors follows.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 3,
        "data_type": "byte",
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}

This method requires ingesting a byte-quantized vector into OpenSearch for direct storage in the k-NN vector field (of byte type). However, the recently introduced disk-based vector search feature eliminates the need for external vector quantization. This feature will be discussed in detail later in this blog.
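As an illustration of the external quantization step, the following sketch converts float vectors to signed 8-bit integers before ingestion. It uses a simple symmetric scaling based on the largest absolute value in the dataset, which is only one of several possible approaches and is our assumption; it is not a method prescribed by OpenSearch, and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence


def fit_scale(vectors: Iterable[Sequence[float]]) -> float:
    """Compute one scale factor for the whole dataset (symmetric scaling)."""
    max_abs = max((abs(x) for vector in vectors for x in vector), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero data.
    return 127.0 / max_abs if max_abs > 0 else 1.0


def quantize_to_bytes(vector: Sequence[float], scale: float) -> List[int]:
    """Scale, round, and clamp each dimension to the range [-128, 127]."""
    return [max(-128, min(127, round(x * scale))) for x in vector]


dataset = [[1.0, -1.0, 0.0], [0.25, 0.75, -0.1]]
scale = fit_scale(dataset)
byte_vectors = [quantize_to_bytes(vector, scale) for vector in dataset]
print(byte_vectors)
```

Using a single dataset-wide scale keeps relative magnitudes comparable across vectors; query vectors would need the same scale applied before searching.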

Memory requirements for implementing byte quantization with FAISS-HNSW:

1.1 * (1 * d + 8 * m) * num_vectors bytes.

For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Byte-quantized vectors in OpenSearch.
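To put the byte quantization formula in context, this sketch compares it with the 32-bit float baseline for the same configuration used earlier (d=384, m=16, 1 billion vectors). The generic function and its bytes_per_dim parameter are our own framing of the two formulas in this post.

```python
def hnsw_memory_bytes(d: int, m: int, num_vectors: int, bytes_per_dim: int) -> float:
    """Approximate FAISS-HNSW memory for a given per-dimension storage size."""
    return 1.1 * (bytes_per_dim * d + 8 * m) * num_vectors


fp32_gb = hnsw_memory_bytes(384, 16, 1_000_000_000, bytes_per_dim=4) / 1e9
byte_gb = hnsw_memory_bytes(384, 16, 1_000_000_000, bytes_per_dim=1) / 1e9
print(f"fp32: {fp32_gb:.1f} GB, byte: {byte_gb:.1f} GB, ratio: {fp32_gb / byte_gb:.2f}x")
```

Under this formula, the total footprint shrinks by somewhat less than 4x, because the graph edges (the 8 * m term) take the same space regardless of how the vector values are stored.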

FAISS FP16 quantization (2x compression)

FP16 quantization is a technique that uses 16-bit floating-point scalar representation, reducing the memory usage by 50%. Each vector dimension is converted from 32-bit to 16-bit floating-point, effectively halving the memory requirements. The compressed vector dimensions must be in the range [–65504.0, 65504.0]. To implement FP16 quantization, create the index with the k-NN vector field and configure the following:

  • Set the k-NN vector field method to hnsw and the engine to faiss.
  • Define the encoder parameter, setting its name to sq and its type to fp16.

Upon uploading 32-bit floating-point vectors to OpenSearch, the scalar quantization FP16 (SQfp16) automatically quantizes them to 16-bit floating-point vectors during ingestion and stores them in the vector field. The following example demonstrates the creation of the index for quantizing and storing FP16-quantized vectors.

PUT /my-vector-index
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 3,
        "space_type": "l2",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "encoder": {
              "name": "sq",
              "parameters": {
                "type": "fp16",
                "clip": true
              }
            },
            "ef_construction": 256,
            "m": 8
          }
        }
      }
    }
  }
}

Memory requirements for implementing FP16 quantization with FAISS-HNSW:

1.1 * (2 * d + 8 * m) * num_vectors bytes.

The preceding FP16 example introduces an optional Boolean parameter called clip, which defaults to false. When false, vectors with out-of-range values (values not between –65504.0 and +65504.0) are rejected. Setting clip to true enables rounding of out-of-range vector values to fit within the supported range. For detailed implementation steps, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput, and latency, see Optimizing OpenSearch with Faiss FP16 scalar quantization: Enhancing memory efficiency and cost-effectiveness.
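The effect of the clip parameter can be emulated locally to check whether your vectors would be affected. The following sketch mimics the behavior described above using Python's half-precision struct format; it is an illustration of the described semantics, not the plugin's actual implementation, and the function name is our own.

```python
import struct

FP16_MAX = 65504.0  # largest finite value representable in 16-bit floating point


def to_fp16(value: float, clip: bool = False) -> float:
    """Convert a value to FP16 precision, emulating the clip behavior.

    With clip=False, out-of-range values raise an error (the vector would be rejected).
    With clip=True, out-of-range values are moved to the nearest supported bound.
    """
    if abs(value) > FP16_MAX:
        if not clip:
            raise ValueError(f"{value} is outside the FP16 range")
        value = FP16_MAX if value > 0 else -FP16_MAX
    # Round-trip through IEEE 754 half precision ("e" format) to apply the rounding.
    return struct.unpack("<e", struct.pack("<e", value))[0]


print(to_fp16(70000.0, clip=True))
print(to_fp16(0.1))
```

Running your embeddings through a check like this before ingestion shows how many dimensions would be clipped and how much precision is lost for typical values.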

Product quantization

Product quantization (PQ) is an advanced dimension-reduction technique that offers significantly higher levels of compression. While conventional scalar quantization methods typically achieve up to 32x compression, PQ can provide compression levels of up to 64x, making it a more efficient solution for optimizing storage and cost. OpenSearch supports PQ with both the IVF and HNSW methods from the FAISS engine. Product quantization partitions each vector into m sub-vectors, each encoded with a number of bits determined by code_size. The resulting vector’s memory footprint is m * code_size bits.
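As a quick worked example of the m * code_size footprint, the following sketch computes the encoded size and the resulting compression relative to 32-bit floats. Note that m here is the number of PQ sub-vectors, not the HNSW edge count; the function names and the d=384 configuration are our own illustrations.

```python
def pq_code_bits(m: int, code_size: int) -> int:
    """Bits needed to store one PQ-encoded vector."""
    return m * code_size


def pq_compression_ratio(d: int, m: int, code_size: int) -> float:
    """Compression relative to storing d dimensions as 32-bit floats."""
    if d % m != 0:
        raise ValueError("d must be divisible by m so sub-vectors have equal length")
    return (32 * d) / pq_code_bits(m, code_size)


# A 4-dimensional vector split into 2 sub-vectors with 8-bit codes.
print(pq_code_bits(m=2, code_size=8), "bits,", pq_compression_ratio(4, 2, 8), "x")
# A hypothetical 384-dimensional vector split into 24 sub-vectors.
print(pq_compression_ratio(384, 24, 8), "x")
```

Fewer sub-vectors give higher compression but coarser encoding, so m is a trade-off between memory and recall.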

FAISS product quantization involves three key steps:

  1. Create and populate a training index to build the PQ model, optimizing for accuracy.
  2. Execute the _train API on the training index to generate the quantizer model.
  3. Construct the vector index, configuring the kNN field to use the prepared quantizer model.

The following example demonstrates the three steps to setting up product quantization.

Step 1: Create the training index. Populate the training index with an appropriate dataset, making sure that its dimension matches the train-index specification. Note that the training index requires a minimum of 256 documents.

PUT /train-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "train-field": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}

Step 2: Create a quantizer model called my-model by running the _train API on the training index you just created. Note that setting the encoder name to pq enables native vector quantization. Other encoder parameters include code_size and m. FAISS-HNSW requires a code_size of 8 and a training dataset of at least 256 (2^code_size) documents. For detailed parameter specifications, see the PQ parameter reference.

POST /_plugins/_knn/models/my-model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 4,
  "description": "My test model description",
  "method": {
    "name": "hnsw",
    "engine": "faiss",
    "parameters": {
      "encoder": {
        "name": "pq", 
         "parameters": {
           "code_size":8,
           "m":2
         }
      },
      "ef_construction": 256,
      "m": 8
    }
  }
}

Step 3: Map the quantizer model to your vector index.

PUT /my-vector-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2,
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "target-field": {
        "type": "knn_vector",
        "model_id": "my-model"
      }
    }
  }
}

Ingest the complete dataset into the newly created index, my-vector-index. The encoder will automatically process the incoming vectors, applying encoding and quantization based on the compression parameters (code_size and m) specified in the quantizer model configuration.
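As a rough sketch of the ingestion step, the following Python helper builds a request body for the OpenSearch _bulk API, which expects newline-delimited JSON with alternating action and document lines and a trailing newline. The helper name and the sequential document IDs are assumptions made for illustration. Sending the request (including authentication and request signing) is deliberately left out.

```python
import json


def build_bulk_body(index_name: str, field_name: str,
                    vectors: list[list[float]]) -> str:
    """Build an NDJSON body for POST /_bulk (hypothetical helper).

    Each vector becomes one document with a sequential _id, which is an
    assumption for this sketch; use your own identifiers in practice.
    """
    lines = []
    for doc_id, vector in enumerate(vectors, start=1):
        # Action line: tells OpenSearch which index and document ID to use.
        lines.append(json.dumps({"index": {"_index": index_name,
                                           "_id": str(doc_id)}}))
        # Source line: the document itself, holding the vector field.
        lines.append(json.dumps({field_name: vector}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

The same helper could be pointed at train-index and train-field to populate the training index from step 1, provided the vectors match the dimension declared in the mapping.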

Memory requirements for implementing product quantization with FAISS-HNSW:

1.1 * (((code_size / 8) * pq_m + 24 + 8 * m) * num_vectors) bytes. Here, code_size and pq_m are the code_size and m values set within the encoder parameter, m is the HNSW method parameter, and num_vectors is the total number of vectors.

During quantization, each training vector is broken down into multiple sub-vectors (sub-spaces), the number of which is defined by the configurable value m. The number of bits used to encode each sub-vector is controlled by the code_size parameter. Each sub-vector is then compressed, or quantized, separately by running k-means clustering with k defined as 2^code_size. With this technique, each vector is compressed to roughly m * code_size bits.
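The arithmetic behind product quantization can be sketched as follows. The function names are hypothetical, and the sketch assumes that the first m in the memory formula is the encoder's m (called pq_m here) while the 8 * m term uses the HNSW m.

```python
# Illustrative arithmetic only; parameter names are assumptions.

def pq_code_bits(pq_m: int, code_size: int) -> int:
    """Bits needed to store one quantized vector: m * code_size."""
    return pq_m * code_size


def centroids_per_subvector(code_size: int) -> int:
    """Number of k-means clusters trained per sub-vector: 2^code_size."""
    return 2 ** code_size


def pq_hnsw_memory_bytes(code_size: int, pq_m: int, hnsw_m: int,
                         num_vectors: int) -> float:
    """Estimate: 1.1 * ((code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors."""
    return 1.1 * ((code_size / 8) * pq_m + 24 + 8 * hnsw_m) * num_vectors
```

With code_size set to 8, each sub-vector is mapped to one of 256 centroids, which is why the training index needs at least 256 documents.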

For detailed implementation guidelines and understanding of the configurable parameters during product quantization, see the OpenSearch documentation. For performance metrics regarding accuracy, throughput and latency using FAISS IVF for PQ, see Choose the k-NN algorithm for your billion-scale use case with OpenSearch.

Disk-based vector search

Disk-based vector search optimizes query efficiency by using compressed vectors in memory while maintaining full-precision vectors on disk. This approach enables OpenSearch to perform searches across large vector datasets without the need to load entire vectors into memory, thus improving scalability and resource utilization. Implementation is achieved through two new configurations at index creation: mode and compression level. As of OpenSearch 2.17, the mode parameter can be set to either in_memory or on_disk during indexing. The previously discussed methods default to the in-memory mode. In this configuration, the vector index is constructed using either a graph (HNSW) or bucket (IVF) structure, which is then loaded into native memory during search operations. While offering excellent recall, this approach can increase memory usage and limit scalability for high-volume vector workloads.

The on_disk mode optimizes vector search efficiency by storing full-precision vectors on disk while using real-time, native quantization during indexing. Coupled with adjustable compression levels, this approach allows only compressed vectors to be loaded into memory, thereby improving memory and resource utilization and search performance. The following compression levels correspond to various scalar quantization methods discussed earlier.

  • 32x: Binary quantization (1-bit dimensions)
  • 4x: Byte and integer quantization (8-bit dimensions)
  • 2x: FP16 quantization (16-bit dimensions)

This method also supports other compression levels such as 16x and 8x that aren’t available with the in-memory mode. To enable disk-based vector search, create the index with mode set to on_disk as shown in the following example.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk"
      }
    }
  }
}

Configuring just the mode as on_disk employs the default configuration, which uses the FAISS engine and HNSW method with a 32x compression level (1-bit, binary quantization). The ef_construction to optimize index time latency defaults to 100. For more granular fine-tuning, you can override these k-NN parameters as shown in the example that follows.

PUT /my-vector-index
{
  "settings" : {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector_field": {
        "type": "knn_vector",
        "dimension": 8,
        "space_type": "innerproduct",
        "data_type": "float",
        "mode": "on_disk",
        "compression_level": "16x",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 512
          }
        }
      }
    }
  }
}

Because quantization is a lossy compression technique, higher compression levels typically result in lower recall. To improve recall during quantization, you can configure the disk-based vector search to run in two phases using the search time configuration parameter ef_search and the oversample_factor as shown in the following example.

GET my-vector-index/_search
{
  "query": {
    "knn": {
      "my_vector_field": {
        "vector": [1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5],
        "k": 5,
        "method_parameters": {
            "ef_search": 512
        },
        "rescore": {
            "oversample_factor": 10.0
        }
      }
    }
  }
}

In the first phase, oversample_factor * k results are retrieved from the quantized vectors in memory and the scores are approximated. In the second phase, the full-precision vectors of those oversample_factor * k results are loaded into memory from disk, and scores are recomputed against the full-precision query vector. The results are then reduced to the top k.
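The two phases can be illustrated with a small pure-Python sketch. It is a toy model under stated assumptions, not the OpenSearch implementation: sign-based 1-bit codes with Hamming distance stand in for the in-memory quantized vectors, squared L2 distance stands in for the configured space type, and all function names are hypothetical.

```python
# Toy model of two-phase (oversample, then rescore) search.

def to_bits(vector):
    """Quantize each dimension to one bit: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions at which two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def squared_l2(a, b):
    """Full-precision squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_phase_search(query, vectors, k, oversample_factor):
    codes = [to_bits(v) for v in vectors]  # in practice built at index time
    query_code = to_bits(query)

    # Phase 1: take oversample_factor * k candidates using approximate
    # scores computed on the compressed codes only.
    num_candidates = min(len(vectors), int(oversample_factor * k))
    candidates = sorted(range(len(vectors)),
                        key=lambda i: hamming(query_code, codes[i]))[:num_candidates]

    # Phase 2: rescore only those candidates with full-precision vectors,
    # then reduce to the top k.
    rescored = sorted(candidates, key=lambda i: squared_l2(query, vectors[i]))
    return rescored[:k]
```

In this toy model, a larger oversample_factor gives phase 2 more candidates to rescore, so true neighbors that the coarse codes ranked poorly have a better chance of reaching the final top k, at the cost of reading more full-precision vectors.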

The default oversample_factor for rescoring is determined by the dimension and compression level configured at indexing. For dimensions below 1,000, the factor is fixed at 5. For dimensions exceeding 1,000, the default factor varies based on the compression level, as shown in the following table.

Compression level | Default oversample_factor for rescoring
32x (default) | 3
16x | 2
8x | 2
4x | No default rescoring
2x | No default rescoring

As previously discussed, the oversample_factor can be dynamically adjusted at search time. This value presents a critical trade-off between accuracy and search efficiency. While a higher factor improves accuracy, it proportionally increases memory usage and reduces search throughput. See the OpenSearch documentation to learn more about disk-based vector search and understand the right usage for oversample_factor.

Performance assessment of quantization methods: Reviewing memory, recall, and query latency.

The OpenSearch documentation on approximate k-NN search provides a starting point for implementing vector similarity search. Additionally, Choose the k-NN algorithm for your billion-scale use case with OpenSearch offers valuable insights into designing efficient vector workloads for handling billions of vectors in production environments. It introduces product quantization techniques as a potential solution to reduce memory requirements and associated costs by scaling down the memory footprint.

The following table illustrates the memory requirements for storing and searching through 1 billion vectors using various quantization techniques. The table compares the default memory consumption of full-precision vectors using the HNSW method against the memory consumed by quantized vectors. The model employed in this analysis is sentence-transformers/all-MiniLM-L12-v2, which operates with 384 dimensions. The raw metadata is assumed to be no more than 100 GB.

Metric | Without quantization | Product quantization | Scalar quantization: FP16 vectors | Scalar quantization: Byte vectors | Scalar quantization: Binary vectors
m value | 16 | 16 | 16 | 16 | 16
pq_m, code_size | - | 16, 8 | - | - | -
Native memory consumption (GB) | 1830.4 | 184.8 | 985.6 | 563.2 | 193.6
Total storage = 100 GB + vector (GB) | 1930.4 | 284.8 | 1085.6 | 663.2 | 293.6
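The native memory figures in this table can be reproduced with a short Python sketch. It assumes that the full-precision, byte, and binary estimates follow the same pattern as the FP16 formula shown earlier, with the per-dimension term scaled by the number of bits per dimension, and that 1 GB is 10^9 bytes. Both assumptions are consistent with the table's values. The function names are hypothetical.

```python
# Reproduces the table's HNSW memory estimates under the assumptions above.

def hnsw_memory_gb(bits_per_dimension: int, dimension: int = 384,
                   m: int = 16, num_vectors: int = 10**9) -> float:
    """1.1 * (bytes per stored vector + 8 * m) * num_vectors, in GB."""
    bytes_per_vector = bits_per_dimension * dimension / 8 + 8 * m
    return 1.1 * bytes_per_vector * num_vectors / 1e9


def total_storage_gb(native_memory_gb: float, metadata_gb: float = 100.0) -> float:
    """Total storage as used in the table: raw metadata plus vector memory."""
    return metadata_gb + native_memory_gb


# Bits per dimension for each technique: float32, FP16, byte, binary.
ESTIMATES_GB = {name: hnsw_memory_gb(bits)
                for name, bits in [("full precision", 32), ("fp16", 16),
                                   ("byte", 8), ("binary", 1)]}
```

The product quantization column uses a different formula because the stored code size depends on pq_m and code_size rather than on the vector dimension.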

Reviewing the preceding table reveals that for a dataset comprising 1 billion vectors, the HNSW graph with 32-bit full-precision vectors requires approximately 1830 GB of memory. Compression techniques such as product quantization can reduce this to 184.8 GB, while scalar quantization offers varying levels of compression. The following table summarizes the correlation between compression techniques and their impact on key performance indicators, including cost savings, recall rate, and query latency. This analysis builds upon our previous assessment of memory usage to help you select a compression technique that meets your requirements.

The table presents two key search metrics: search latency at the 90th percentile (p90) and recall at 100.

  • Search latency @p90 – The latency within which 90% of search queries are completed.
  • recall@100 – The fraction of the top 100 ground truth neighbors found in the 100 results returned.

Metric | Without quantization | Product quantization | Scalar quantization: FP16 quantization [mode=in_memory] | Scalar quantization: Byte quantization [mode=in_memory] | Scalar quantization: Binary quantization [mode=on_disk]
Preconditions/Datasets | Applicable to all datasets | Recall depends on the nature of the training data | Works for dimension values in range [-65504 to 65504] | Works for dimension values in range [-128 to 127] | Works well for larger dimensions >=768
Preprocessing required? | No | Yes, preprocessing/training is required | No | No | No
Rescoring | No | No | No | No | Yes
Recall @100 | >=0.99 | >0.7 | >=0.95 | >=0.95 | >=0.90
p90 query latency (ms) | <50 ms | <50 ms | <50 ms | <50 ms | <200 ms
Cost (baseline $X) | $X | $0.1*X (up to 90% savings) | $0.5*X (up to 50% savings) | $0.25*X (up to 75% savings) | $0.15*X (up to 85% savings)
Sample cost for a billion vectors | $20,923.14 | $2,092.31 | $10,461.57 | $5,230.79 | $3,138.47

The sample cost estimate for a billion vectors is based on a configuration optimized for cost. Note that actual savings may vary based on your specific workload requirements and chosen configuration parameters. Notably, product quantization offers up to 90% cost reduction compared to the baseline HNSW graph-based vector search cost ($X). Scalar quantization similarly yields cost savings, ranging from 50% to 85%, in proportion to the compressed memory footprint. The choice of compression technique involves balancing cost-effectiveness, accuracy, and performance, because it impacts precision and latency.
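The sample costs in the table follow directly from the baseline and the cost factors listed for each technique. The following Python sketch shows the arithmetic. The dictionary keys and function name are hypothetical, and rounding half up to two decimal places is an assumption that happens to match the table's figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# Baseline sample cost for a billion vectors without quantization.
BASELINE_COST = Decimal("20923.14")

# Cost factors relative to the baseline $X, as listed in the table.
COST_FACTORS = {
    "product": Decimal("0.10"),
    "fp16": Decimal("0.50"),
    "byte": Decimal("0.25"),
    "binary": Decimal("0.15"),
}


def sample_cost(technique: str, baseline: Decimal = BASELINE_COST) -> Decimal:
    """Scale the baseline by the technique's factor and round to cents."""
    raw = baseline * COST_FACTORS[technique]
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

Decimal is used instead of float so that the half-cent case in the byte quantization column rounds predictably.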

Conclusion

By leveraging OpenSearch’s quantization techniques, organizations can make informed tradeoffs between cost efficiency, performance, and recall, empowering them to fine-tune their vector database operations for optimal results. These quantization techniques significantly reduce memory requirements, improve query efficiency and offer built-in encoders for seamless compression. Whether you’re dealing with large-scale text embeddings, image features, or any other high-dimensional data, OpenSearch’s quantization techniques offer efficient solutions for vector search requirements, enabling the development of cost-effective, scalable, and high-performance systems.

As you move forward with your vector database projects, we encourage you to:

  1. Explore OpenSearch’s compression techniques in-depth
  2. Evaluate applicability of the right technique to your specific use case
  3. Determine the appropriate compression levels based on your requirements for recall and search latency
  4. Measure and compare cost savings based on accuracy, throughput, and latency

Stay informed about the latest developments in this rapidly evolving field, and don’t hesitate to experiment with different quantization techniques to find the optimal balance between cost, performance, and accuracy for your applications.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open-source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various OpenSearch projects such as k-NN, Geospatial, and dashboard-maps.

Use CI/CD best practices to automate Amazon OpenSearch Service cluster management operations

Post Syndicated from Camille BIRBES original https://aws.amazon.com/blogs/big-data/use-ci-cd-best-practices-to-automate-amazon-opensearch-service-cluster-management-operations/

Quick and reliable access to information is crucial for making smart business decisions. That’s why companies are turning to Amazon OpenSearch Service to power their search and analytics capabilities. OpenSearch Service makes it straightforward to deploy, operate, and scale search systems in the cloud, enabling use cases like log analysis, application monitoring, and website search.

Efficiently managing OpenSearch Service indexes and cluster resources can lead to significant improvements in performance, scalability, and reliability – all of which directly impact a company’s bottom line. However, the industry lacks built-in and well-documented solutions to automate these important operational tasks.

Applying continuous integration and continuous deployment (CI/CD) to managing OpenSearch index resources can help do that. For instance, storing index configurations in a source repository allows for better tracking, collaboration, and rollback. Using infrastructure as code (IaC) tools can help automate resource creation, providing consistency and reducing manual work. Finally, using a CI/CD pipeline can automate deployments and streamline workflow.

In this post, we discuss two options to achieve this: the Terraform OpenSearch provider and the Evolution library. Which one is best suited to your use case depends on the tooling you are familiar with, your language of choice, and your existing pipeline.

Solution overview

Let’s walk through a straightforward implementation. For this use case, we use the AWS Cloud Development Kit (AWS CDK) to provision the relevant infrastructure as described in the following architecture diagram, AWS Lambda to trigger Evolution scripts, and AWS CodeBuild to apply Terraform files. You can find the code for the entire solution in the GitHub repo.

Solution Architecture Diagram

Prerequisites

To follow along with this post, you need to have the following:

  • Familiarity with Java and OpenSearch
  • Familiarity with the AWS CDK, Terraform, and the command line
  • The following software versions installed on your machine: Python 3.12, NodeJS 20, and AWS CDK 2.170.0 or higher
  • An AWS account, with an AWS Identity and Access Management (IAM) role configured with the relevant permissions

Build the solution

To build an automated solution for OpenSearch Service cluster management, follow these steps:

  1. Enter the following commands in a terminal to download the solution code; build the Java application; build the required Lambda layer; create an OpenSearch domain, two Lambda functions and a CodeBuild project; and deploy the code:
git clone https://github.com/aws-samples/opensearch-automated-cluster-management
cd opensearch-automated-cluster-management
cd app/openSearchMigration
mvn package
cd ../../lambda_layer
chmod a+x create_layer.sh
./create_layer.sh
cd ../infra
npm install
npx cdk bootstrap
aws iam create-service-linked-role --aws-service-name es.amazonaws.com
npx cdk deploy --require-approval never
  2. Wait 15 to 20 minutes for the infrastructure to finish deploying, then check that your OpenSearch domain is up and running, and that the Lambda functions and CodeBuild project have been created, as shown in the following screenshots.

OpenSearch domain provisioned successfully OpenSearch Migration Lambda function created successfully OpenSearchQuery Lambda function created successfully CodeBuild project created successfully

Before you use automated tools to create index templates, you can verify that none already exist using the OpenSearchQuery Lambda function.

  1. On the Lambda console, navigate to the relevant function.
  2. On the Test tab, choose Test.

The function should return the message “No index patterns created by Terraform or Evolution,” as shown in the following screenshot.

Check that no index patterns have been created

Apply Terraform files

First, you use Terraform with CodeBuild. The code is ready for you to test. Let’s look at a few important pieces of configuration:

  1. Define the required variables for your environment:
variable "OpenSearchDomainEndpoint" {
  type = string
  description = "OpenSearch domain URL"
}

variable "IAMRoleARN" {
  type = string
  description = "IAM Role ARN to interact with OpenSearch"
}
  2. Define and configure the provider:
terraform {
  required_providers {
    opensearch = {
      source = "opensearch-project/opensearch"
      version = "2.3.1"
    }
  }
}

provider "opensearch" {
  url = "https://${var.OpenSearchDomainEndpoint}"
  aws_assume_role_arn = "${var.IAMRoleARN}"
}

NOTE: As of the publication date of this post, there is a bug in the Terraform OpenSearch provider that will trigger when launching your CodeBuild project and that will prevent successful execution. Until it is fixed, please use the following version:

terraform {
  required_providers {
    opensearch = {
      source = "gnuletik/opensearch"
      version = "2.7.0"
    }
  }
}
  3. Create an index template:
resource "opensearch_index_template" "template_1" {
  name = "cicd_template_terraform"
  body = <<EOF
{
  "index_patterns": ["terraform_index_*"],
  "template": {
    "settings": {
      "number_of_shards": "1"
    },
    "mappings": {
        "_source": {
            "enabled": false
        },
        "properties": {
            "host_name": {
                "type": "keyword"
            },
            "created_at": {
                "type": "date",
                "format": "EEE MMM dd HH:mm:ss Z YYYY"
            }
        }
    }
  }
}
EOF
}

You are now ready to test.

  1. On the CodeBuild console, navigate to the relevant Project and choose Start Build.

The build should complete successfully, and you should see the following lines in the logs:

opensearch_index_template.template_1: Creating...
opensearch_index_template.template_1: Creation complete after 0s (id=cicd_template_terraform)
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

You can check that the index template has been properly created using the same Lambda function as earlier, and should see the following results.

Terraform index properly created

Run Evolution scripts

In the next step, you use the Evolution library. The code is ready for you to test. Let’s look at a few important pieces of code and configuration:

  1. To begin with, add the latest version of the Evolution core library and the AWS SDK as Maven dependencies. The full XML file is available in the GitHub repository. To check the Evolution library’s compatibility with different OpenSearch versions, see the Evolution documentation.
<dependency>
    <groupId>com.senacor.elasticsearch.evolution</groupId>
    <artifactId>elasticsearch-evolution-core</artifactId>
    <version>0.6.0</version><!--check the latest version-->
</dependency>
<dependency>
   <groupId>software.amazon.awssdk</groupId>
   <artifactId>auth</artifactId>
</dependency>
  2. Create an Evolution Bean and an AWS interceptor (which implements HttpRequestInterceptor).

Interceptors are open-ended mechanisms in which the SDK calls code that you write to inject behavior into the request and response lifecycle. The function of the AWS interceptor is to hook into the execution of API requests and create an AWS signed request stamped with the proper IAM roles. You can create your own implementation to sign all the requests made to OpenSearch within AWS; the code for the entire solution is available in the GitHub repo.

  3. Create your own OpenSearch client to manage automatic creation of indexes, mappings, templates, and aliases.

The default Elasticsearch client that comes bundled as part of the Maven dependency can’t be used to make PUT calls to the OpenSearch cluster. Therefore, you need to bypass the default REST client instance and add a callback that registers the AwsRequestSigningInterceptor.

The following is a sample implementation:

private RestClient getOpenSearchEvolutionRestClient() {
    return RestClient.builder(getHttpHost())
        .setHttpClientConfigCallback(hacb -> 
            hacb.addInterceptorLast(getAwsRequestSigningInterceptor()))
        .build();
}
  4. Use the Evolution Bean to call your migrate method, which is responsible for initiating the migration of the scripts defined using either classpath or filepath:
public void executeOpensearchScripts() {
    ElasticsearchEvolution opensearchEvolution = ElasticsearchEvolution.configure()
        .setEnabled(true) // true or false
        .setLocations(Arrays.asList("classpath:opensearch_migration/base",
            "classpath:opensearch_migration/dev")) // List of all locations where scripts are located.
        .setHistoryIndex("opensearch_changelog") // Tracker index to store history of scripts executed.
        .setValidateOnMigrate(false) // true or false
        .setOutOfOrder(true) // true or false
        .setPlaceholders(Collections.singletonMap("env","dev")) // list of placeholders which will get replaced in the script during execution.
        .load(getOpenSearchEvolutionRestClient());
    opensearchEvolution.migrate();
}
  5. An Evolution migration script represents a REST call to the OpenSearch API (for example, PUT /_index_template/cicd_template_evolution), where you define index patterns, settings, and mappings in JSON format. Evolution interprets these scripts, manages their versioning, and provides ordered execution. See the following example:
PUT /_index_template/cicd_template_evolution
Content-Type: application/json

{
  "index_patterns": ["evolution_index_*"],
  "template": {
    "settings": {
      "number_of_shards": "1"
    },
    "mappings": {
        "_source": {
            "enabled": false
        },
        "properties": {
            "host_name": {
                "type": "keyword"
            },
            "created_at": {
                "type": "date",
                "format": "EEE MMM dd HH:mm:ss Z YYYY"
            }
        }
    }
  }
}

The first two lines must be followed by a blank line. Evolution also supports comment lines in its migration scripts. Every line starting with # or // will be interpreted as a comment-line. Comment lines aren’t sent to OpenSearch. Instead, they are filtered by Evolution.
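To make the script layout concrete, the following Python sketch splits a migration script into its request line, headers, and body while dropping comment lines. It is an illustration of the format described above, not Evolution's own parser. The function name is hypothetical, and the sketch assumes a well-formed script whose first non-comment line is the request line.

```python
import json


def parse_migration_script(text: str) -> dict:
    """Split a migration script into method, path, headers, and body."""
    # Lines starting with # or // are comments and are never sent.
    lines = [line for line in text.splitlines()
             if not line.startswith(("#", "//"))]

    # Request line, for example: PUT /_index_template/cicd_template_evolution
    method, path = lines[0].split(maxsplit=1)

    # Header lines run until the mandatory blank line.
    headers = {}
    index = 1
    while index < len(lines) and lines[index].strip():
        name, value = lines[index].split(":", 1)
        headers[name.strip()] = value.strip()
        index += 1

    # Everything after the blank line is the request body.
    body = "\n".join(lines[index + 1:]).strip()
    return {"method": method, "path": path, "headers": headers, "body": body}
```

Passing the resulting body to `json.loads` is a quick way to catch malformed JSON in a script before it is deployed.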

The migration script file naming convention must follow a pattern:

  • Start with esMigrationPrefix which is by default V or the value that has been configured using the configuration option esMigrationPrefix
  • Followed by a version number, which must be numeric and can be structured by separating the version parts with a period (.)
  • Followed by the versionDescriptionSeparator: __ (the double underscore symbol)
  • Followed by a description, which can be any text your filesystem supports
  • End with esMigrationSuffixes which is by default .http and is configurable and case insensitive
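The naming rules above can be checked with a small Python helper. This is a sketch of the convention as described in this post, using the default prefix, separator, and suffix. The function name is hypothetical, and the check is not taken from the Evolution source code.

```python
import re

# A version is numeric, with optional parts separated by periods.
_VERSION = re.compile(r"\d+(?:\.\d+)*")


def parse_migration_filename(filename: str, prefix: str = "V",
                             separator: str = "__",
                             suffixes: tuple = (".http",)):
    """Return (version, description), or None if the name is invalid."""
    if not filename.startswith(prefix):
        return None
    # Suffix matching is case insensitive.
    lowered = filename.lower()
    suffix = next((s for s in suffixes if lowered.endswith(s.lower())), None)
    if suffix is None:
        return None
    stem = filename[len(prefix):len(filename) - len(suffix)]
    version, found, description = stem.partition(separator)
    if not found or not description or not _VERSION.fullmatch(version):
        return None
    return version, description
```

A check like this could run in the CI/CD pipeline before deployment, so that a misnamed script fails the build early.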

You’re now ready to execute your first automated change. An example of a migration script has already been created for you, which you can refer to in a previous section. It will create an index template named cicd_template_evolution.

  1. On the Lambda console, navigate to your function.
  2. On the Test tab, choose Test.

After a few seconds, the function should successfully complete. You can review the log output in the Details section, as shown in the following screenshots.

Migration function finish successfully

The index template now exists, and you can check that its configuration is indeed in line with the script, as shown in the following screenshot.

Evolution index template properly created

Clean up

To clean up the resources that were created as part of this post, run the following command (in the infra folder):

npx cdk destroy --all

Conclusion

In this post, we demonstrated how to automate OpenSearch index templates using CI/CD practices and tools such as Terraform or the Evolution library.

To learn more about OpenSearch Service, refer to the Amazon OpenSearch Service Developer Guide. To further explore the Evolution library, refer to the documentation. To learn more about the Terraform OpenSearch provider, refer to the documentation.

We hope this detailed guide and accompanying code will help you get started. Try it out, let us know your thoughts in the comments section, and feel free to reach out to us for questions!


About the Authors

Camille Birbes is a Senior Solutions Architect with AWS and is based in Hong Kong. He works with major financial institutions to design and build secure, scalable, and highly available solutions in the cloud. Outside of work, Camille enjoys any form of gaming, from board games to the latest video game.

Sriharsha Subramanya Begolli works as a Senior Solutions Architect with AWS, based in Bengaluru, India. His primary focus is assisting large enterprise customers in modernizing their applications and developing cloud-based systems to meet their business objectives. His expertise lies in the domains of data and analytics.

Enhancing Search Relevancy with Cohere Rerank 3.5 and Amazon OpenSearch Service

Post Syndicated from Breanne Warner original https://aws.amazon.com/blogs/big-data/enhancing-search-relevancy-with-cohere-rerank-3-5-and-amazon-opensearch-service/

This post is co-written with Elliott Choi from Cohere.

The ability to quickly access relevant information is a key differentiator in today’s competitive landscape. As user expectations for search accuracy continue to rise, traditional keyword-based search methods often fall short in delivering truly relevant results. In the rapidly evolving landscape of AI-powered search, organizations are looking to integrate large language models (LLMs) and embedding models with Amazon OpenSearch Service. In this blog post, we’ll dive into the various scenarios for how Cohere Rerank 3.5 improves search results for best matching 25 (BM25), a keyword-based algorithm that performs lexical search, in addition to semantic search. We will also cover how businesses can significantly improve user experience, increase engagement, and ultimately drive better search outcomes by implementing a reranking pipeline.

Amazon OpenSearch Service

Amazon OpenSearch Service is a fully managed service that simplifies the deployment, operation, and scaling of OpenSearch in the AWS Cloud to provide powerful search and analytics capabilities. OpenSearch Service offers robust search capabilities, including URI searches for simple queries and request body searches using a domain-specific language for complex queries. It supports advanced features such as result highlighting, flexible pagination, and k-nearest neighbor (k-NN) search for vector and semantic search use cases. The service also provides multiple query languages, including SQL and Piped Processing Language (PPL), along with customizable relevance tuning and machine learning (ML) integration for improved result ranking. These features make OpenSearch Service a versatile solution for implementing sophisticated search functionality, including the search mechanisms used to power generative AI applications.

Overview of traditional lexical search and semantic search using bi-encoders and cross-encoders

Two important techniques for handling end-user search queries are lexical search and semantic search. OpenSearch Service natively supports BM25, a lexical search method. Lexical search relies on exact keyword matching between the query and documents. For a natural language query searching for “super hero toys,” it retrieves documents containing those exact terms. While this method is fast and works well for queries targeted at specific terms, it can’t recognize the intent or context behind a query and fails to capture synonyms, potentially missing relevant results that use different words, such as “action figures of superheroes.”

Semantic search addresses this gap with embedding models. Bi-encoders are a specific type of embedding model designed to independently encode two pieces of text. Documents are first encoded into embeddings offline, and queries are encoded online at search time. In this approach, the query and document encodings are generated with the same embedding algorithm. The query’s encoding is then compared to the pre-computed document embeddings. The similarity between the query and documents is measured by their relative distances, despite being encoded separately. This allows the system to recognize synonyms and related concepts, for example, that “action figures” are related to “toys” and “comic book characters” to “super heroes.”

By contrast, processing the same query (“super hero toys”) with cross-encoders involves first retrieving a set of candidate documents using methods such as lexical search or bi-encoders. Each query-document pair is then jointly evaluated by the cross-encoder, which takes the combined text as input to deeply model interactions between the query and document. This approach allows the cross-encoder to understand context, disambiguate meanings, and capture nuances by analyzing every word in relation to each other. It also assigns precise relevance scores to each pair, re-ranking the documents so that those most closely matching the user’s intent (here, toys depicting superheroes) are prioritized. As a result, it significantly enhances search relevancy compared to methods that encode queries and documents independently.

It’s important to note that the effectiveness of semantic search, such as two-stage retrieval search pipelines, depends heavily on the quality of the initial retrieval stage. The primary goal of a robust first-stage retrieval is to efficiently recall a subset of potentially relevant documents from a large collection, setting the foundation for more sophisticated ranking in later stages. The quality of the first-stage results directly impacts the performance of subsequent ranking stages. The goal is to maximize recall and capture as many relevant documents as possible, because the later ranking stage has no way to recover excluded documents. A poor initial retrieval can limit the effectiveness of even the most sophisticated re-ranking algorithms.
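The dependency of the second stage on the first can be seen in a minimal Python sketch of a two-stage pipeline. Everything here is a simplified stand-in: term overlap replaces BM25, and rerank_score is a hypothetical callable representing a cross-encoder, which in practice would be a model call.

```python
from typing import Callable


def lexical_retrieve(query: str, documents: list[str], top_k: int) -> list[int]:
    """First stage: rank documents by shared query terms (BM25 stand-in)."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in enumerate(documents):
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def two_stage_search(query: str, documents: list[str], retrieve_k: int,
                     top_n: int,
                     rerank_score: Callable[[str, str], float]) -> list[int]:
    """Second stage: rerank only the candidates the first stage returned."""
    candidates = lexical_retrieve(query, documents, retrieve_k)
    reranked = sorted(candidates,
                      key=lambda doc_id: rerank_score(query, documents[doc_id]),
                      reverse=True)
    return reranked[:top_n]
```

Because the reranker only ever sees the first-stage candidates, a document such as “action figures of superheroes” that shares no exact terms with “super hero toys” can’t be surfaced by the second stage, however good the scoring function is.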

Overview of Cohere Rerank 3.5

Cohere is an AWS third-party model provider partner that provides advanced language AI models, including embeddings, language models, and reranking models. See Cohere Rerank 3.5 now generally available on Amazon Bedrock to learn more about accessing Cohere’s state-of-the-art models using Amazon Bedrock. The Cohere Rerank 3.5 model focuses on enhancing search relevance by reordering initial search results based on a deeper semantic understanding of the user query. Rerank 3.5 uses a cross-encoder architecture where the input of the model always consists of a data pair (for example, a query and a document) that is processed jointly by the encoder. The model outputs an ordered list of results, each with an assigned relevance score, as shown in the following GIF.

Cohere Rerank 3.5 with OpenSearch Service search

Many organizations rely on OpenSearch Service for their lexical search needs, benefiting from its robust and scalable infrastructure. When organizations want to enhance their search capabilities to match the sophistication of semantic search, they face the challenge of overhauling their existing systems, which is often a difficult engineering task and may not be feasible. Now, through a single Rerank API call in Amazon Bedrock, you can integrate Rerank into existing systems at scale. For financial services firms, this means more accurate matching of complex queries with relevant financial products and information. E-commerce businesses can improve product discovery and recommendations, potentially boosting conversion rates. The ease of integration through a single API call with Amazon OpenSearch Service enables quick implementation, offering a competitive edge in user experience without significant disruption or resource allocation.
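As a rough sketch of what that single call might look like from Python, the following builds a request for the `rerank` operation of the boto3 `bedrock-agent-runtime` client. The request shape, the model ARN, and the Region reflect our understanding of the API at the time of writing and are assumptions to verify against the current boto3 documentation; the helper names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Assumed ARN format for Cohere Rerank 3.5; Region and model ID are placeholders.
MODEL_ARN = "arn:aws:bedrock:us-west-2::foundation-model/cohere.rerank-v3-5:0"


def build_rerank_request(query: str, documents: List[str], top_n: int,
                         model_arn: str = MODEL_ARN) -> Dict[str, Any]:
    """Build keyword arguments for the Bedrock Rerank call (assumed shape)."""
    return {
        "queries": [{"type": "TEXT", "textQuery": {"text": query}}],
        "sources": [
            {
                "type": "INLINE",
                "inlineDocumentSource": {
                    "type": "TEXT",
                    "textDocument": {"text": doc},
                },
            }
            for doc in documents
        ],
        "rerankingConfiguration": {
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                # Never ask for more results than there are documents.
                "numberOfResults": min(top_n, len(documents)),
                "modelConfiguration": {"modelArn": model_arn},
            },
        },
    }


def rerank_with_bedrock(query: str, documents: List[str],
                        top_n: int = 3) -> List[Tuple[int, float]]:
    """Call Bedrock and return (original index, relevance score) pairs."""
    import boto3  # imported lazily so the builder works without AWS access

    client = boto3.client("bedrock-agent-runtime")
    response = client.rerank(**build_rerank_request(query, documents, top_n))
    return [(r["index"], r["relevanceScore"]) for r in response["results"]]
```

In an existing lexical search flow, the `documents` list would be the text of the top hits returned by OpenSearch Service, and the returned indexes would be used to reorder those hits before they are shown to the user.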

In benchmarks conducted by Cohere using normalized Discounted Cumulative Gain (nDCG), Cohere Rerank 3.5 improved accuracy compared to Cohere’s previous Rerank 3 model, as well as BM25 and hybrid search, across financial, e-commerce, and project management datasets. nDCG is a metric used to evaluate the quality of a ranking system by assessing how well the order of ranked items aligns with their actual relevance, rewarding systems that place relevant results at the top. In this study, @10 indicates that the metric was calculated considering only the top 10 items in the ranked list. The nDCG metric is helpful because metrics such as precision, recall, and the F-score measure predictive performance without taking into account the position of ranked results, whereas nDCG normalizes scores and discounts relevant results that are returned lower on the list. The following figures show the performance improvements of Cohere Rerank 3.5 for the financial domain as well as an e-commerce evaluation consisting of external datasets.
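To make the metric concrete, here is a minimal sketch of nDCG@k. It assumes the common formulation with linear gain and a log2 position discount, and it computes the ideal ordering from the same list of judged results; Cohere’s benchmark may use a different variant (for example, exponential gain), so treat this as an illustration of the idea rather than the exact evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over the top k ranks."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance of returned results, listed in the order they were ranked.
best_first = [3, 2, 1, 0]
worst_first = [0, 1, 2, 3]

print(ndcg_at_k(best_first, 10))
print(ndcg_at_k(worst_first, 10))
```

Both lists contain the same documents, so a position-blind metric such as precision would score them identically, while nDCG gives the first ordering a perfect score and penalizes the second for burying the most relevant result.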

Also, Cohere Rerank 3.5, when integrated with OpenSearch, can significantly enhance existing project management workflows by improving the relevance and accuracy of search results across engineering tickets, issue tracking systems, and open-source repository issues. This enables teams to quickly surface the most pertinent information from their extensive knowledge bases, boosting productivity. The following figure demonstrates the performance improvements of Cohere Rerank 3.5 for project management evaluation.

Combining reranking with BM25 for enterprise search is supported by studies from other organizations. For instance, Anthropic, an artificial intelligence startup founded in 2021 that focuses on developing safe and reliable AI systems, conducted a study that found that using reranked contextual embedding and contextual BM25 reduced the top-20-chunk retrieval failure rate by 67%, from 5.7% to 1.9%. The combination of BM25’s strength in exact matching with the semantic understanding of reranking models addresses the limitations of each approach when used alone and delivers a more effective search experience for users.

As organizations strive to improve their search capabilities, many find that traditional keyword-based methods such as BM25 have limitations in understanding context and user intent. This leads customers to explore hybrid search approaches that combine the strengths of keyword-based algorithms with the semantic understanding of modern AI models. OpenSearch Service 2.11 and later supports the creation of hybrid search pipelines using normalization processors directly within the OpenSearch Service domain. By transitioning to a hybrid search system, organizations can use the precision of BM25 while benefiting from the contextual awareness and relevance ranking capabilities of semantic search.
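The following sketch shows what a hybrid search pipeline definition with a normalization processor can look like, along with a small local model of what normalization and combination do to the scores. The weights and scores are made-up values, and the local functions only approximate the `min_max` and `arithmetic_mean` techniques; details such as how documents missing from one sub-query are handled may differ in OpenSearch.

```python
from typing import Dict, Tuple

# Body for PUT /_search/pipeline/<pipeline-name>; the weights are illustrative.
hybrid_pipeline_body = {
    "description": "Normalize and combine BM25 and semantic scores",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ],
}


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so different score ranges become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(bm25: Dict[str, float], semantic: Dict[str, float],
            weights: Tuple[float, float] = (0.3, 0.7)) -> Dict[str, float]:
    """Weighted arithmetic mean of the normalized scores per document."""
    nb, ns = min_max(bm25), min_max(semantic)
    w_b, w_s = weights
    return {
        doc: (w_b * nb.get(doc, 0.0) + w_s * ns.get(doc, 0.0)) / (w_b + w_s)
        for doc in set(nb) | set(ns)
    }


# Hypothetical raw scores: BM25 is unbounded, semantic similarity is in [0, 1].
bm25_scores = {"a": 12.0, "b": 8.0, "c": 4.0}
semantic_scores = {"a": 0.2, "b": 0.9, "c": 0.3}

combined = combine(bm25_scores, semantic_scores)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Without normalization, the larger BM25 values would dominate any direct sum; after rescaling, document "b" rises to the top because it is strong on the semantic side and reasonable on the keyword side.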

Cohere Rerank 3.5 acts as a final refinement layer, analyzing the semantic and contextual aspects of both the query and the initial search results. These models excel at understanding nuanced relationships between queries and potential results, considering factors like customer reviews, product images, or detailed descriptions to further refine the top results. This progression from keyword search to semantic understanding, and then applying advanced reranking, allows for a dramatic improvement in search relevance.

How to integrate Cohere Rerank 3.5 with OpenSearch Service

There are several options available to integrate and use Cohere Rerank 3.5 with OpenSearch Service. Teams can use OpenSearch Service ML connectors which facilitate access to models hosted on third-party ML platforms. Every connector is specified by a connector blueprint. The blueprint defines all the parameters that you need to provide when creating a connector.

In addition to the Bedrock Rerank API, teams can use the Amazon SageMaker connector blueprint for Cohere Rerank hosted on Amazon SageMaker for flexible deployment and fine-tuning of Cohere models. This connector option works with other AWS services for comprehensive ML workflows and allows teams to use the tools built into Amazon SageMaker for model performance monitoring and management. There is also a Cohere native connector option that provides direct integration with Cohere’s API, offering immediate access to the latest models, and is suitable for users with fine-tuned models on Cohere.

See this general reranking pipeline guide for OpenSearch Service 2.12 and later or this tutorial to configure a search pipeline that uses Cohere Rerank 3.5 to improve a first-stage retrieval system that can run on the native OpenSearch Service vector engine.
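As a rough sketch of what such a pipeline involves, the following builds the two request bodies used with the OpenSearch `rerank` response processor: one to create the search pipeline and one to run a search through it. The model ID, field name, and function names are hypothetical placeholders; the model ID would come from registering and deploying a connector-backed model as described in the linked guides.

```python
from typing import Any, Dict, Sequence


def rerank_pipeline_body(model_id: str,
                         document_fields: Sequence[str]) -> Dict[str, Any]:
    """Body for PUT /_search/pipeline/<pipeline-name>."""
    return {
        "description": "Re-rank first-stage hits with a deployed rerank model",
        "response_processors": [
            {
                "rerank": {
                    "ml_opensearch": {"model_id": model_id},
                    # Fields whose text is sent to the model with the query.
                    "context": {"document_fields": list(document_fields)},
                }
            }
        ],
    }


def rerank_search_body(query_text: str, field: str,
                       size: int = 10) -> Dict[str, Any]:
    """Body for GET /<index>/_search?search_pipeline=<pipeline-name>."""
    return {
        "size": size,
        # First stage: a plain lexical match query.
        "query": {"match": {field: query_text}},
        # Second stage: the text the reranker compares each hit against.
        "ext": {"rerank": {"query_context": {"query_text": query_text}}},
    }


pipeline = rerank_pipeline_body("<your-model-id>", ["description"])
search = rerank_search_body("super hero toys", "description")
```

The `size` of the first-stage query bounds how many candidates the reranker sees, so it is the main knob for trading recall against reranking cost.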

Conclusion

Integrating Cohere Rerank 3.5 with OpenSearch Service is a powerful way to enhance your search functionality and deliver a more meaningful and relevant search experience for your users. We covered the added benefits a rerank model could bring to various businesses and how a reranker can enhance search. By tapping into the semantic understanding of Cohere’s models, you can surface the most pertinent results, improve user satisfaction, and drive better business outcomes.


About the Authors

Breanne Warner is an Enterprise Solutions Architect at Amazon Web Services supporting healthcare and life science (HCLS) customers. She is passionate about supporting customers to use generative AI on AWS and evangelizing model adoption for 1P and 3P models. Breanne is also on the Women@Amazon board as co-director of Allyship with the goal of fostering inclusive and diverse culture at Amazon. Breanne holds a Bachelor of Science in Computer Engineering from University of Illinois at Urbana Champaign (UIUC).

Karan Singh is a generative AI Specialist for 3P models at AWS where he works with top-tier 3P foundational model providers to define and execute joint GTM motions that help customers train, deploy, and scale models to enable transformative business applications and use cases across industry verticals. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a Master of Science in Electrical Engineering from Northwestern University, and is currently an MBA Candidate at the Haas School of Business at University of California, Berkeley.

Hugo Tse is a Solutions Architect at Amazon Web Services supporting independent software vendors. He strives to help customers use technology to solve challenges and create business opportunities, especially in the domains of generative AI and storage. Hugo holds a Bachelor of Arts in Economics from the University of Chicago and a Master of Science in Information Technology from Arizona State University.

Elliott Choi is a Staff Product Manager at Cohere working on the Search and Retrieval Team. Elliott holds a Bachelor of Engineering and a Bachelor of Arts from the University of Western Ontario.

Introducing Amazon OpenSearch Service and Amazon Security Lake integration to simplify security analytics

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/introducing-amazon-opensearch-service-zero-etl-integration-for-amazon-security-lake/

Today, we’re announcing the general availability of Amazon OpenSearch Service zero-ETL integration with Amazon Security Lake. This integration enables organizations to efficiently search, analyze, and gain actionable insights from their security data, streamlining complex data engineering requirements and unlocking the full potential of security data. It’s a new way to query and analyze logs in place in Security Lake that minimizes the need to duplicate data and reduces the operational overhead of managing custom data pipelines. You can directly query your Security Lake data, saving the costs of moving data.

With OpenSearch Service zero-ETL integration with Security Lake, you can use the rich analytics capabilities of OpenSearch Dashboards to query and visualize your data in Security Lake. You can also analyze multiple data sources within a single tool and a single schema, the Open Cybersecurity Schema Framework (OCSF) schema, to help with threat-hunting and investigation scenarios.

For time-sensitive investigations and monitoring, you can optionally boost query performance by enabling additional accelerations such as indexed views and dashboards in Amazon OpenSearch Service when you need fast and frequent access to a subset of your data. These capabilities provide complete visibility into all your data stored in Security Lake, regardless of the log volume, so you can support security investigations, better understand your security posture, and gain security-relevant insights.

Getting started with direct queries with Amazon Security Lake
You can get started in a few steps. First, you need to enable Security Lake by creating a Security Lake subscriber. Then, you enable a data connection in Amazon OpenSearch Service. This will automatically create an OpenSearch Serverless collection to store your direct query results and indices.

1. Enable Security Lake and set up permissions for a data lake

To enable Security Lake in the AWS Management Console, specify the data sources that you want to collect, such as Amazon Route 53 DNS queries, AWS CloudTrail logs, Amazon VPC Flow logs, and AWS Security Hub findings, along with your AWS Regions. I chose several Regions and set the Amazon Simple Storage Service (Amazon S3) storage class and roll-up Regions to consolidate data.

Security Lake offers a 15-day trial at no cost so you can deploy it across your organization with the desired data sources and estimate the costs specific to your organization.

Once the enablement is complete, all collected data is ingested into an Amazon Simple Storage Service (Amazon S3) bucket in your account.

To access Security Lake data from an account other than the Security Lake delegated admin account, you should create an AWS Lake Formation subscriber to access and query data from AWS Glue tables associated with Security Lake. Enter the AWS account and external ID that’s authorized to access Security Lake and select the data sources to be accessed. Lake Formation provides cross-account permissions for security analysts to access data in the lake.

After you create the query subscriber, you can go to the account where you plan to deploy your OpenSearch resources and accept the AWS Resource Access Manager (AWS RAM) share that is shared by the Security Lake delegated admin account. The subscriber account will show the share status as pending until it’s manually accepted.

To learn more, visit Enabling Security Lake using the console and Create query subscriber procedures in the Amazon Security Lake User Guide.

2. Create a data connection with OpenSearch Service

You can create a zero-ETL integration in a few steps. In the OpenSearch Service console of the subscriber’s account, choose Connected data source in the Data connections section of the left navigation pane. You can then choose Security Lake as a data source type.

In the next step, you can set up the IAM permissions for accessing the Security Lake data source using the zero-ETL integration. It will also automatically create an OpenSearch Serverless collection and an OpenSearch application.

After the connection is created, you can select one of the pre-built OpenSearch dashboards that periodically query your data in Security Lake to create visualizations. You can create a dashboard using templates for VPC Flow Logs, WAF logs, and CloudTrail data sources in Security Lake.

The following is an example of a pre-built dashboard for VPC Flow logs.

To learn more about data connection, visit Data connections and permissions in the Amazon OpenSearch Service Developer Guide.

3. Query Security Lake data in the OpenSearch Dashboard

To directly query your Security Lake data in OpenSearch Dashboards, go to the Discover page.

On the Discover page, you can use the data picker workflow to locate a specific Security Lake table to query. There is one table for each Security Lake log source.

After making a selection, you can choose the query language that you want to use, either PPL (Piped Processing Language) or SQL (Structured Query Language), and then write and run your query. The following is a PPL sample result:
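To give a feel for the two languages, the following sketch builds the same aggregation once in PPL and once in SQL. The table name and the `activity_name` field are hypothetical placeholders; copy the real table name from the data picker and use fields that exist in your OCSF-formatted source.

```python
def top_activities_ppl(table: str, limit: int = 10) -> str:
    """PPL: pipe the table through an aggregation, a sort, and a limit."""
    return (
        f"source = {table} "
        "| stats count() as events by activity_name "
        "| sort - events "
        f"| head {limit}"
    )


def top_activities_sql(table: str, limit: int = 10) -> str:
    """SQL: the same question expressed as GROUP BY / ORDER BY / LIMIT."""
    return (
        "SELECT activity_name, COUNT(*) AS events "
        f"FROM {table} "
        "GROUP BY activity_name "
        "ORDER BY events DESC "
        f"LIMIT {limit}"
    )


# Hypothetical placeholder; not a real Security Lake table name.
TABLE = "<data_source>.<database>.<vpc_flow_table>"

print(top_activities_ppl(TABLE))
print(top_activities_sql(TABLE))
```

PPL reads left to right as a sequence of processing steps, which many analysts find natural for iterative investigation, while SQL is convenient if your team already has queries written for other data stores.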

You can also choose to search and run a pre-built query template to start your query. There are more than 200 SQL and PPL queries that cover all AWS log sources that are available in Security Lake. You can use the search box to find queries that you’re interested in. For example, search for “VPC Flow” to see all queries related to VPC Flow logs. There’s a description explaining each query and when you might want to use it.

If you want to perform multiple queries on the same data set, for example to support security investigations, you can create an on-demand indexed view for the results of your direct query. After the results are ingested into an OpenSearch index, you can perform low-latency subsequent queries and analysis using analytics features in OpenSearch.

To create an indexed view, choose Create indexed view and select a specified query, an index name, and a time range. After the view is created, the query results will be ingested and available to query as part of the newly created index under available indexed views.

To learn more, visit Searching data in the Amazon OpenSearch Service Developer Guide.

Now available
Amazon OpenSearch Service zero-ETL integration with Amazon Security Lake is now available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), South America (São Paulo), and Canada (Central) AWS Regions.

OpenSearch Service separately charges for only the compute needed (as OpenSearch Compute Units) to query your external data in addition to maintaining indexes in OpenSearch Service. For more information, see Amazon OpenSearch Service Pricing.

Give it a try and send feedback to the AWS re:Post for Amazon OpenSearch Service or through your usual AWS Support contacts.

Channy

New Amazon CloudWatch and Amazon OpenSearch Service launch an integrated analytics experience

Post Syndicated from Elizabeth Fuentes original https://aws.amazon.com/blogs/aws/new-amazon-cloudwatch-and-amazon-opensearch-service-launch-an-integrated-analytics-experience/

Today, Amazon Web Services (AWS) announces a new integrated analytics experience and zero-ETL integration between Amazon CloudWatch and Amazon OpenSearch Service. This integration simplifies log data analysis and visualization without data duplication, streamlining log management while reducing technical overhead and operational costs. CloudWatch Logs customers now have access to two additional query languages beyond CloudWatch Logs Insights QL, while OpenSearch customers can query CloudWatch logs in place without creating separate extract, transform, and load (ETL) pipelines.

Organizations often need different analytics capabilities for their log data. Some teams prefer CloudWatch Logs for its scalability and simplicity in centralizing logs from all their systems, applications, and AWS services. Others require OpenSearch Service for advanced analytics and visualizations. Previously, integration between these services required maintaining separate ingestion pipelines or creating ETL processes. This new integration helps customers get the best of both services: it eliminates this complexity by bringing the power of OpenSearch analytics directly to CloudWatch Logs, without any data copy.

Amazon CloudWatch Logs now supports OpenSearch Piped Processing Language (PPL) and OpenSearch SQL directly within the CloudWatch Logs Insights console. You can use SQL to analyze data and correlate logs using JOIN. You can use SQL functions (such as JSON, mathematical, datetime, and string functions) for intuitive log analytics. You can also use the OpenSearch PPL to filter, aggregate, and analyze data. With a few clicks, you can access pre-built, out-of-the-box dashboards for vended logs, such as Amazon Virtual Private Cloud (VPC), AWS CloudTrail, and AWS WAF. These dashboards enable faster monitoring and troubleshooting through visualizations, such as analyzing flows over time, top talkers, megabytes, and packets transferred over time, without having to configure individual widgets or build specific queries. You can analyze VPC flows over time, identify top talkers, track network traffic metrics, monitor web request trends in AWS WAF, or analyze API activity patterns in AWS CloudTrail.

Additionally, OpenSearch Service users can now analyze CloudWatch logs using OpenSearch Discover and run SQL and PPL, similar to how they analyze data in Amazon Simple Storage Service (Amazon S3), and build indexes and create dashboards directly without any ETL operations or separate ingestion pipelines.

Let’s explore how this integration works
To demonstrate the new OpenSearch SQL and PPL query capabilities in CloudWatch, I start in the CloudWatch console. In the navigation pane, I choose Logs then Logs Insights. After selecting log groups for the query, I can now use OpenSearch PPL or OpenSearch SQL query languages directly within CloudWatch Logs Insights, with no additional setup or integration required. Using this new capability, I can write complex queries using familiar SQL syntax or OpenSearch PPL, making log analysis more intuitive and efficient. In the Query commands menu, you can find sample queries to help you get started.

This example demonstrates how to use SQL JOIN to combine data from two log groups: pet adoptions and pet availability. By filtering for specific customer IDs, you can analyze related log records and trace IDs for troubleshooting purposes.
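The join logic described above can be tried locally with a small sketch. This uses Python’s built-in `sqlite3` module as a stand-in so the example is runnable anywhere; it is not the CloudWatch Logs Insights API, and the table names, field names, and records are all invented for illustration. In CloudWatch, the two sides of the join would be log groups rather than tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pet_adoptions (customer_id TEXT, trace_id TEXT, message TEXT);
    CREATE TABLE pet_availability (customer_id TEXT, trace_id TEXT, message TEXT);
    """
)
conn.executemany(
    "INSERT INTO pet_adoptions VALUES (?, ?, ?)",
    [
        ("cust-42", "t-100", "adoption started"),
        ("cust-7", "t-101", "adoption started"),
    ],
)
conn.executemany(
    "INSERT INTO pet_availability VALUES (?, ?, ?)",
    [
        ("cust-42", "t-200", "availability checked"),
        ("cust-42", "t-201", "availability confirmed"),
        ("cust-9", "t-202", "availability checked"),
    ],
)

# Correlate records from both sources for one customer and list their trace IDs.
rows = conn.execute(
    """
    SELECT a.customer_id,
           a.trace_id AS adoption_trace,
           b.trace_id AS availability_trace
    FROM pet_adoptions AS a
    INNER JOIN pet_availability AS b
            ON a.customer_id = b.customer_id
    WHERE a.customer_id = ?
    ORDER BY adoption_trace, availability_trace
    """,
    ("cust-42",),
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Each output row pairs an adoption record with a matching availability record for the same customer, which is the shape you need to follow a single request across two services by trace ID.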

One of the powerful features of this integration for CloudWatch Logs customers is the ability to create pre-built dashboards for Amazon VPC Flows, AWS CloudTrail and AWS WAF logs. Let’s explore this by creating a dashboard for AWS WAF logs. In the Analyze with OpenSearch tab, I choose Settings and follow the steps.

After a few minutes, my integration is ready and I go to Create an OpenSearch dashboard. In the options Select automatic dashboard type, I choose AWS WAF logs.

In the Dashboard data configuration tab, I can select Data synchronization frequency to occur every 15 minutes. I Select the log groups and View log samples of the selected log groups. I finish by choosing Create a dashboard.

After creating my dashboard, I can explore my logs. The AWS WAF logs dashboard provides comprehensive visibility into web application firewall metrics and events, with automatically configured visualizations that help you monitor and analyze security patterns.

Similarly, the CloudTrail dashboard offers deep insights into API activity across your AWS environment. It’s useful for monitoring API activity, auditing actions, and identifying potential security or compliance issues. 

The VPC Flow Logs dashboard provides detailed visualization of key metrics from your logs for network traffic analysis. You can analyze network traffic, detect unusual patterns, and monitor resource usage. The dashboard currently supports only VPC v2 fields (default format). Custom formatted fields are not supported.

With zero-ETL access to CloudWatch data from OpenSearch Service, I can also build an OpenSearch dashboard from the OpenSearch Service console without having to build and maintain an ETL process. For this, I go to Central management, select the new Connected data sources menu, choose Connect to create a new connected data source, and choose CloudWatch Logs.

In the next step, I name my data source and choose to Create a new role, which must have the necessary permissions to execute actions on OpenSearch Service. You can see them in the Sample custom policy.

In the Set up OpenSearch step, configure an OpenSearch data connection for CloudWatch Logs by selecting Create a new collection. As part of setting up the CloudWatch Logs source, a new OpenSearch Service serverless collection and an OpenSearch UI application are created to store the indexed views and provide a user interface to analyze your CloudWatch Logs data. I create a new collection, name it, and configure the OpenSearch application and workspace within the application. After setting the Data retention days, I choose Next and finish with Review and connect.

When the integration with CloudWatch is ready, I can choose between Explore logs without indexing data which will take me to a querying interface in Discover or Explore vended logs by creating a dashboard for Amazon VPC Flows, CloudTrail and AWS WAF logs.

After I select Explore logs, OpenSearch UI takes me to Discover in the application workspace I created during the data source setup. In Discover, I select the data picker and choose View all available data to access my CloudWatch Logs data source and log groups.

After I select the log groups, I can analyze my CloudWatch logs using OpenSearch SQL and PPL directly in Discover, without having to switch between applications.

To create a dashboard, I return to the Connected data sources overview page on the console. From there, I select Create dashboard, which allows me to visually analyze my CloudWatch data without having to define queries or build visualizations, as I previously did in the CloudWatch console.

After the dashboard is created, I navigate to OpenSearch resources where I can see the newly created indexes being populated with data in my Collection. After I have the data, I can go to the dashboard with the data from the CloudWatch logs that I selected in the configuration, and as more data comes in, it will be displayed in near real-time on the OpenSearch dashboard.

With this zero-ETL integration you can ingest data directly into OpenSearch, using its powerful query capabilities and visualization features while maintaining data consistency and reducing operational overhead.

Integration Highlights
For CloudWatch customers:

  • Query capabilities – Streamline log investigation by using OpenSearch SQL and PPL queries directly within the CloudWatch Logs Insights console.
  • Analytics features – With a few clicks, access pre-built, out-of-the-box dashboards for vended logs, such as VPC, AWS WAF, and CloudTrail logs. These dashboards enable faster monitoring and troubleshooting through visualizations for analyzing flows over time, top talkers, megabytes, and packets transferred over time, without having to configure individual widgets or build specific queries.
  • Getting started for CloudWatch users – Configure integration from CloudWatch Logs to OpenSearch Service. For more information, refer to the Amazon CloudWatch Logs query capabilities and Amazon CloudWatch Logs vended dashboard documentation.

For OpenSearch Service customers:

  • Zero-ETL integration – Access and analyze CloudWatch data directly from OpenSearch Service without building or maintaining ETL processes. This integration eliminates separate ingestion pipelines while reducing storage costs and operational overhead through simplified data management and zero data duplication.
  • Getting started for OpenSearch users – Create a data connection selecting CloudWatch as a data source from OpenSearch Service. For more information, refer to the Amazon OpenSearch Service Developer Guide.

Regional availability and pricing
This integration is now available in AWS Regions where Amazon OpenSearch Service direct query is available. For pricing details and free trial information, you can visit the Amazon CloudWatch Pricing and Amazon OpenSearch Service Pricing pages.

PS: Writing a blog post at AWS is always a team effort, even when you see only one name under the post title. In this case, I want to thank Joshua Bright, Ashok Swaminathan, Abeetha Bala, Calvin Weng, and Ronil Prasad for their generous help with screenshots, technical guidance, and sharing their expertise in both services, which made this integration overview possible and comprehensive.

Eli

Intel Accelerators on Amazon OpenSearch Service improve price-performance on vector search by up to 51%

Post Syndicated from Mulugeta Mammo original https://aws.amazon.com/blogs/big-data/intel-accelerators-on-amazon-opensearch-service-improve-price-performance-on-vector-search-by-up-to-51/

This post is co-written with Mulugeta Mammo and Akash Shankaran from Intel.

Today, we’re excited to announce the availability of Intel Advanced Vector Extensions 512 (AVX-512) technology acceleration on vector search workloads when you run OpenSearch 2.17+ domains with 4th generation Intel Xeon instances on Amazon OpenSearch Service. When you run OpenSearch 2.17+ domains on C/M/R 7i instances, you can gain up to 51% in vector search performance at no additional cost compared to previous R5 Intel instances.

Increasingly, application builders are using vector search to improve the search quality of their applications. This modern technique involves encoding content into numerical representations (vectors) that can be used to find similarities between content. For instance, it’s used in generative AI applications to match user queries to semantically similar knowledge articles providing context and grounding for generative models to perform tasks. However, vector search is computationally intensive, and higher compute and memory requirements can lead to higher costs than traditional search. Therefore, cost optimization levers are important to achieve a favorable balance of cost vs. benefit.

OpenSearch Service is a managed service for the OpenSearch search and analytics suite, which includes support for vector search. By running your OpenSearch 2.17+ domains on C/M/R 7i instances, you can achieve up to a 51% price-performance gain compared to the past R5 instances on OpenSearch Service. As we discuss in this post, this launch offers improvements to your infrastructure total cost of ownership (TCO) and savings.

Accelerating generative AI applications with vectorization

Let’s understand how these technologies come together through the building of a simple generative AI application. First, you bring vector search online by using machine learning (ML) models to encode your content (such as text, image or audio) into vectors. You then index these vectors into an OpenSearch Service domain, enabling real-time content similarity search that can be scaled to search billions of vectors in milliseconds. These vector searches provide contextually relevant insights, which can be further enriched by AI for hyper-personalization and integrated with generative models to power chatbots.

Vector search use cases extend beyond generative AI applications. Use cases include image to semantic search, and recommendations such as the following real-world use case from Amazon Music. The Amazon Music application uses vectorization to encode 100 million songs into vectors that represent both music tracks and customer preferences. These vectors are then indexed in OpenSearch, which manages over a billion vectors and handles up to 7,100 vector queries per second to analyze user listening behavior and provide real-time recommendations.

The indexing and search processes are computationally intensive, requiring calculations between vectors that are typically represented as 128–2,048 dimensions (numerical values). The Intel Xeon Scalable processors found on the 7th generation Intel instances use Intel AVX-512 to increase the speed and efficiency of vector operations through the following features:

  • Data parallel processing – By processing 512 bits of data at once (twice as many as its predecessor), Intel AVX-512 efficiently uses SIMD (single instruction, multiple data) to run multiple operations simultaneously, which provides significant speed-up.
  • Pathlength reduction – The speed-up is due to a significant improvement in pathlength, which is a measure of the number of instructions required to perform a unit of work in workloads.
  • Power performance savings – You can lower power performance costs by processing more data and performing more operations in a shorter amount of time.
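The effect of register width on pathlength can be estimated with simple arithmetic. The sketch below is an idealized model that assumes one vector instruction processes a full register of elements and ignores loads, reductions, and memory bandwidth; the 768-dimension example matches the dataset dimensionality mentioned later in this post, and the function names are our own.

```python
import math


def lanes(register_bits: int, element_bits: int) -> int:
    """How many elements fit in one SIMD register."""
    return register_bits // element_bits


def vector_ops_needed(dimensions: int, register_bits: int,
                      element_bits: int = 32) -> int:
    """Idealized count of SIMD operations to touch every element once."""
    return math.ceil(dimensions / lanes(register_bits, element_bits))


DIM = 768  # dimensions per vector

ops_256 = vector_ops_needed(DIM, 256)            # 256-bit registers, FP32
ops_512 = vector_ops_needed(DIM, 512)            # 512-bit registers, FP32
ops_512_fp16 = vector_ops_needed(DIM, 512, 16)   # 512-bit registers, FP16

print(ops_256, ops_512, ops_512_fp16)
```

Doubling the register width halves the number of operations per pass over a vector in this model, and halving the element size halves it again. Real speed-ups are smaller because distance computations also spend time on memory access and other work.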

Benchmarking vector search on OpenSearch

OpenSearch Service R7i instances with Intel AVX-512 are an excellent choice for OpenSearch vector workloads. They offer a high CPU-to-memory ratio, which further maximizes the compute potential while providing ample memory.

To verify just how much faster the new R7i instances perform, you can run OpenSearch benchmarks firsthand. Using your OpenSearch 2.17 domain, create a k-NN index configured to use either the Lucene or FAISS engine. Use the OpenSearch Benchmark with the public Cohere 10M 768D dataset to replicate the benchmarks published in this post. Replicate these tests using the older R5 instances as the baseline.
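As a starting point for such a test, the following sketch builds an index body for a k-NN index on either engine. The field name `embedding`, the `l2` space type, and the HNSW parameter values are illustrative assumptions rather than the settings used for the benchmarks in this post; choose values that match your dataset and workload.

```python
from typing import Any, Dict


def knn_index_body(engine: str, dimension: int = 768) -> Dict[str, Any]:
    """Body for PUT /<index-name> creating an HNSW k-NN index."""
    if engine not in ("lucene", "faiss"):
        raise ValueError("engine must be 'lucene' or 'faiss'")
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {               # hypothetical field name
                    "type": "knn_vector",
                    "dimension": dimension,  # must match the dataset vectors
                    "method": {
                        "name": "hnsw",
                        "engine": engine,
                        "space_type": "l2",  # illustrative choice
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                }
            }
        },
    }


lucene_body = knn_index_body("lucene")
faiss_body = knn_index_body("faiss")
```

Keeping everything except `engine` identical between the two bodies, and between the R5 and R7i domains, helps make sure that any difference you measure comes from the engine or the instance type rather than from the index configuration.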

In the following sections, we present the benchmarks that demonstrate the 51% price-performance gains between the R7i and the R5 instances.

Lucene engine results

In this post, we define price-performance as the number of documents that can be indexed or search queries executed given a fixed budget ($1), taking into account the instance cost. The following are results of price-performance with the Cohere 10M dataset.
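This definition can be written as a short calculation. The throughput and hourly price values below are made-up placeholders chosen for easy arithmetic; they are not measurements from this post or actual instance prices.

```python
def ops_per_dollar(ops_per_second: float, price_per_hour: float) -> float:
    """Operations (documents indexed or queries run) completed per $1 of instance time."""
    return ops_per_second * 3600 / price_per_hour


def price_performance_gain(candidate: float, baseline: float) -> float:
    """Relative improvement of the candidate over the baseline (0.2 means 20%)."""
    return candidate / baseline - 1


# Placeholder inputs: (operations per second, USD per instance-hour).
baseline = ops_per_dollar(ops_per_second=1000, price_per_hour=0.5)
candidate = ops_per_dollar(ops_per_second=1500, price_per_hour=0.625)

gain = price_performance_gain(candidate, baseline)
print(f"baseline:  {baseline:,.0f} ops per dollar")
print(f"candidate: {candidate:,.0f} ops per dollar")
print(f"gain:      {gain:.0%}")
```

In this placeholder example the candidate is 50% faster but also 25% more expensive per hour, so the price-performance gain is 20% rather than 50%. This is why raw throughput alone is not enough to compare instance generations.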

Up to a 44% improvement in price-performance is observed when using the Lucene engine and upgrading from R5 to R7i instances. The difference between the blue and orange bars in the following graphs illustrates the gains contributed by AVX-512 acceleration.

FAISS engine results

We also examine results from the same tests performed on k-NN indexes configured on the FAISS engine. A price-performance gain of up to 51% is achieved on index performance simply by upgrading from R5 to R7i instances. Again, the difference between the blue and orange bars demonstrates the additional gains contributed by AVX-512.

In addition to price-performance gains, search response times also improved by upgrading from R5 to R7i instances with AVX-512. P90 and P99 latencies were lower by 33% and 38%, respectively.

The FAISS engine has the added benefit of AVX-512 acceleration with FP16 quantized vectors. With FP16 quantization, vectors are compressed to half the size, reducing memory and storage requirements and in turn infrastructure costs. AVX-512 contributes to further price-performance gains.
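The memory effect of FP16 quantization can be checked with a small calculation. This sketch uses Python’s `struct` module to compare the raw payload of a 768-dimension vector in 32-bit and 16-bit floats, and to show the rounding that the smaller format introduces. It counts only the vector values themselves, not graph structures or other index overhead, and it does not use the FAISS quantizer.

```python
import struct
from typing import List, Sequence

DIM = 768               # dimensions per vector
NUM_VECTORS = 10_000_000

fp32_bytes = struct.calcsize(f"<{DIM}f")   # 4 bytes per value
fp16_bytes = struct.calcsize(f"<{DIM}e")   # 2 bytes per value


def fp16_round_trip(values: Sequence[float]) -> List[float]:
    """Store values as IEEE 754 half precision and read them back."""
    packed = struct.pack(f"<{len(values)}e", *values)
    return list(struct.unpack(f"<{len(values)}e", packed))


restored = fp16_round_trip([0.1, -0.25, 0.333])

print(fp32_bytes, fp16_bytes)
print(f"FP32 payload: {NUM_VECTORS * fp32_bytes / 1e9:.2f} GB")
print(f"FP16 payload: {NUM_VECTORS * fp16_bytes / 1e9:.2f} GB")
print(restored)
```

The payload halves exactly, while each value picks up a small rounding error. Whether that error affects recall depends on the dataset, so it is worth validating search quality on your own data before adopting quantization.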

Conclusion

If you’re looking to modernize search experiences on OpenSearch Service while potentially lowering costs, try out the OpenSearch vector engine on OpenSearch Service C7i, M7i, or R7i instances. Built on 4th Gen Intel Xeon processors, the latest Intel instances provide advanced features like Intel AVX-512 accelerators, improved CPU performance, and higher memory bandwidth than the previous generation, which makes them an excellent choice for optimizing your vector search workloads on OpenSearch Service.

Credits to: Vesa Pehkonen, Noah Staveley, Assane Diop, Naveen Tatikonda


About the Authors

Mulugeta Mammo is a Senior Software Engineer, and currently leads the OpenSearch Optimization team at Intel.

Vamshi Vijay Nakkirtha is a software engineering manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems.

Akash Shankaran is a Software Architect and Tech Lead in the Xeon software team at Intel working on OpenSearch. He works on pathfinding opportunities and enabling optimizations within databases, analytics, and data management domains.

Dylan Tong is a Senior Product Manager at Amazon Web Services. He leads the product initiatives for AI and machine learning (ML) on OpenSearch including OpenSearch’s vector database capabilities. Dylan has decades of experience working directly with customers and creating products and solutions in the database, analytics and AI/ML domain. Dylan holds a BSc and MEng degree in Computer Science from Cornell University.


Notices and disclaimers

Performance varies by use, configuration, and other factors. Learn more on the Performance Index website.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.

Accelerate your migration to Amazon OpenSearch Service with Reindexing-from-Snapshot

Post Syndicated from Hang Zuo original https://aws.amazon.com/blogs/big-data/accelerate-your-migration-to-amazon-opensearch-service-with-reindexing-from-snapshot/

It is appealing to migrate from self-managed OpenSearch and Elasticsearch clusters running legacy versions to Amazon OpenSearch Service to enjoy the ease of use, native integration with AWS services, and rich features from the open-source environment (OpenSearch is now part of the Linux Foundation). However, the data migration process can be daunting, especially when downtime and data consistency are critical concerns for your production workload.

In this post, we will introduce a new mechanism called Reindexing-from-Snapshot (RFS), and explain how it can address your concerns and simplify migrating to OpenSearch.

Key concepts

To understand the value of RFS and how it works, let’s look at a few key concepts in OpenSearch (and the same in Elasticsearch):

  1. OpenSearch index: An OpenSearch index is a logical container that stores and manages a collection of related documents. OpenSearch indices are composed of multiple OpenSearch shards, and each OpenSearch shard contains a single Lucene index.
  2. Lucene index and shard: OpenSearch is built as a distributed system on top of Apache Lucene, an open-source high-performance text search engine library. An OpenSearch index can contain multiple OpenSearch shards, and each OpenSearch shard maps to a single Lucene index. Each Lucene index (and, therefore, each OpenSearch shard) represents a completely independent search and storage capability hosted on a single machine. OpenSearch combines many independent Lucene indices into a single higher-level system to extend the capability of Lucene beyond what a single machine can support. OpenSearch provides resilience by creating and managing replicas of the Lucene indices as well as managing the allocation of data across Lucene indices and combining search results across all Lucene indices.
  3. Snapshots: Snapshots are backups of an OpenSearch cluster’s indexes and state in an off-cluster storage location (snapshot repository) such as Amazon Simple Storage Service (Amazon S3). As a backup strategy, snapshots can be created automatically in OpenSearch, or users can create a snapshot manually for restoring it on to a different domain or for data migration.

For example, when a document is added to the OpenSearch index, the distributed system layer picks a specific shard to host the document, and the document is ingested into that shard’s Lucene index. Operations on that document are then routed to the same shard (though the shard might have replicas). Search operations are performed on each shard of the OpenSearch index individually, and then a combined result is returned. A snapshot can be created to back up the cluster’s indexes and state, including cluster settings, node information, index settings, and shard allocation, so that the snapshot can be used for data migration.
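The routing and result-combining behavior described here can be sketched in a few lines. This is a simplified model, not OpenSearch's implementation: `zlib.crc32` stands in for the real routing hash, and the function names are hypothetical.

```python
import zlib


def pick_shard(routing_key: str, num_primary_shards: int) -> int:
    """Deterministically map a document's routing key (for example, its ID) to a shard.

    crc32 is a stand-in hash used to keep the sketch dependency-free.
    """
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards


def merge_top_k(per_shard_hits, k):
    """Combine per-shard results, given as (score, doc_id) pairs, into one top-k list."""
    all_hits = [hit for hits in per_shard_hits for hit in hits]
    return sorted(all_hits, key=lambda hit: hit[0], reverse=True)[:k]


# The same key always lands on the same shard, so later operations find the document.
shard = pick_shard("doc-42", num_primary_shards=5)

# Each shard searches independently; the coordinator merges the results.
combined = merge_top_k([[(0.9, "a"), (0.1, "b")], [(0.5, "c")]], k=2)
print(shard, combined)
```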

Why RFS?

RFS can transfer data from OpenSearch and Elasticsearch clusters at high throughput without impacting the performance of the source cluster. This is achieved by using shard-level independence and snapshots:

  1. Minimized performance impact to source clusters: Instead of retrieving data directly from the source cluster, RFS can use a snapshot of the source cluster for data migration. Documents are parsed from the snapshot and then reindexed to the target cluster, so that performance impact to the source clusters is minimized during migration. This maintains a smooth transition and minimal performance impact to end users, especially for production workloads.
  2. High throughput: Because shards are separate entities, RFS can retrieve, parse, extract and reindex the documents from each shard in parallel, to achieve high data throughput.
  3. Multi-version upgrades: RFS supports migrating data across multiple major versions (for example, from Elasticsearch 6.8 to OpenSearch 2.x), which can be a significant challenge with other data migration approaches. This is because the data indexed into OpenSearch (and Lucene) is only backward compatible for one major version. By incorporating reindexing as the core mechanism of the migration process, RFS can migrate data across multiple versions in one hop and make sure the data is fully updated and readable in the target cluster’s version, so that you don’t need to worry about the hidden technical debt imposed by having previous-version Lucene files in the new OpenSearch cluster.

How RFS works

OpenSearch and Elasticsearch snapshots are a directory tree that contains both data and metadata. Each index has its own sub-directory, and each shard has its own sub-directory under the directory of its parent index. The raw data for a given shard is stored in its corresponding shard sub-directory as a collection of Lucene files, which OpenSearch and Elasticsearch lightly obfuscate. Metadata files exist in the snapshot to provide details about the snapshot as a whole, the source cluster’s global metadata and settings, each index in the snapshot, and each shard in the snapshot.

The following is an example for the structure of an Elasticsearch 7.10 snapshot, along with a breakdown of its contents:

/snapshot/root
├── index-0 <-------------------------------------------- [1]
├── index.latest
├── indices
│   ├── DG4Ys006RDGOkr3_8lfU7Q <------------------------- [2]
│   │   ├── 0 <------------------------------------------ [3]
│   │   │   ├── __iU-NaYifSrGoeo_12o_WaQ <--------------- [4]
│   │   │   ├── __mqHOLQUtToG23W5r2ZWaKA <--------------- [4]
│   │   │   ├── index-gvxJ-ifiRbGfhuZxmVj9Hg 
│   │   │   └── snap-eBHv508cS4aRon3VuqIzWg.dat <-------- [5]
│   │   └── meta-tDcs8Y0BelM_jrnfY7OE.dat <-------------- [6]
│   └── _iayRgRXQaaRNvtfVfRdvg
│       ├── 0
│       │   ├── __DNRvbH6tSxekhRUifs35CA
│       │   ├── __NRek2UuKTKSBOGczcwftng
│       │   ├── index-VvqHYPQaRcuz0T_vy_bMyw
│       │   └── snap-eBHv508cS4aRon3VuqIzWg.dat
│       └── meta-tTcs8Y0BelM_jrnfY7OE.dat
├── meta-eBHv508cS4aRon3VuqIzWg.dat <-------------------- [7]
└── snap-eBHv508cS4aRon3VuqIzWg.dat <-------------------- [8]

The structure includes the following elements:

  1. Repository metadata file: JSON encoded and contains a mapping between the snapshots within the repository and the OpenSearch or Elasticsearch indices and shards stored within it.
  2. Index directory: Contains the data and metadata for a specific OpenSearch or Elasticsearch index.
  3. Shard directory: Contains the data and metadata for a specific shard of an OpenSearch or Elasticsearch index.
  4. Lucene Files: Lucene index files, lightly obfuscated by the snapshotting process. Large files from the source file system are split into multiple parts.
  5. Shard metadata file: SMILE encoded and contains details about all the Lucene files in the shard and a mapping between their in-snapshot representation and their original representation on the source machine they were pulled from (including the original file name and other details).
  6. Index metadata file: SMILE encoded and contains things such as the index aliases, settings, mappings, and number of shards.
  7. Global metadata file: SMILE encoded and contains things such as the legacy, index, and component templates.
  8. Snapshot metadata file: SMILE encoded and contains things such as whether the snapshot succeeded, the number of shards, how many shards succeeded, the OpenSearch or Elasticsearch version, and the indices in the snapshot.
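Given the layout above, the shard directories that a shard-level tool would process can be enumerated with a simple directory walk. The sketch below assumes the snapshot has been copied to a local file system; the function name is hypothetical and the demo builds a throwaway tree that mimics the example.

```python
import tempfile
from pathlib import Path


def list_shard_dirs(snapshot_root: str) -> dict[str, list[str]]:
    """Map each index directory name to its sorted shard directory names."""
    indices_dir = Path(snapshot_root) / "indices"
    layout = {}
    for index_dir in sorted(p for p in indices_dir.iterdir() if p.is_dir()):
        # Metadata files such as meta-*.dat are plain files and are skipped here.
        layout[index_dir.name] = sorted(
            p.name for p in index_dir.iterdir() if p.is_dir()
        )
    return layout


# Demo on a temporary tree shaped like the example snapshot.
with tempfile.TemporaryDirectory() as root:
    for index_name in ("DG4Ys006RDGOkr3_8lfU7Q", "_iayRgRXQaaRNvtfVfRdvg"):
        (Path(root) / "indices" / index_name / "0").mkdir(parents=True)
    layout = list_shard_dirs(root)

print(layout)
```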

RFS works by retrieving a local copy of a shard-level directory, unpacking its contents and de-obfuscating them, reading them as a Lucene index, and extracting the documents within. This is enabled because OpenSearch and Elasticsearch store the original format of documents added to an OpenSearch or Elasticsearch index in Lucene using the _source field; this feature is enabled by default and is what allows the standard _reindex REST API to work (among other things).
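Once documents and their identifiers have been extracted, they can be sent to the target cluster in batches. The following sketch shows one way to build a request body in the newline-delimited format of the `_bulk` API. It is an illustration of the idea rather than the RFS implementation, and the index name and documents are hypothetical.

```python
import json


def to_bulk_body(index_name, docs):
    """Build a _bulk request body from (doc_id, source) pairs.

    Each document becomes an action line followed by its _source line.
    Reusing the original ID means a repeated request replaces the document
    rather than duplicating it.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline


body = to_bulk_body("logs", [("1", {"message": "hello"}), ("2", {"message": "world"})])
print(body, end="")
```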

The user workflow for performing a document migration with RFS using the Migration Assistant is shown in the following figure:

The workflow is:

  1. The operator shells into the Migration Assistant console.
  2. The operator uses the console command line interface (CLI) to initiate a snapshot on their source cluster. The source cluster stores the snapshot in an S3 Bucket.
  3. The operator starts the document migration with RFS using the console CLI. This creates a single RFS Worker, which is a Docker container running in AWS Fargate.
  4. Each RFS worker provisioned pulls down an un-migrated shard from the snapshot bucket and reindexes its documents against the target cluster. Once finished, it proceeds to the next shard until all shards are completed.
  5. The operator monitors the progress of the migration using the console CLI, which reports both the number of shards yet to be migrated and the number that have been completed. The operator can scale the RFS worker fleet up or down to increase or reduce the rate of indexing on the target cluster.
  6. After all shards have been migrated to the target cluster, the operator scales the RFS worker fleet down to zero.

As previously mentioned, the RFS workers operate at the shard level, so you can provision one RFS worker for every shard in the snapshot to achieve maximum throughput. If an RFS worker stops unexpectedly in the middle of migrating a shard, another RFS worker will restart its migration from the beginning. The original document identifiers are preserved in the migration process, so the restarted migration overwrites the documents written by the failed attempt. RFS workers coordinate among themselves using metadata that they store in an index on the target cluster.
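The pick-up, complete, and retry behavior can be modeled with a small in-process simulation. This is only a sketch under simplifying assumptions: threads stand in for the Fargate containers, an in-memory queue stands in for the coordination metadata stored on the target cluster, and all names are hypothetical.

```python
import queue
import threading


def migrate_shards(shard_ids, num_workers, migrate_one):
    """Run migrate_one(shard_id) for every shard using num_workers threads."""
    pending = queue.Queue()
    for shard_id in shard_ids:
        pending.put(shard_id)
    completed = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                shard_id = pending.get_nowait()
            except queue.Empty:
                return  # no un-migrated shards left for this worker
            try:
                migrate_one(shard_id)
            except Exception:
                pending.put(shard_id)  # the shard is retried from the beginning
                continue
            with lock:
                completed.append(shard_id)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(completed)


# Demo: the first attempt on shard 3 fails, and the retry succeeds.
attempts = {}
attempts_lock = threading.Lock()


def flaky_migrate(shard_id):
    with attempts_lock:
        attempts[shard_id] = attempts.get(shard_id, 0) + 1
        first_try = attempts[shard_id] == 1
    if shard_id == 3 and first_try:
        raise RuntimeError("worker stopped unexpectedly")


done = migrate_shards(range(8), num_workers=4, migrate_one=flaky_migrate)
print(done, attempts[3])
```

Because each retry rewrites the same document IDs, repeating a shard is safe in this model, which mirrors why preserved identifiers matter in the real process.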

How RFS performs

To highlight the performance of RFS, let’s consider the following scenario: you have an Elasticsearch 7.10 source cluster containing 5 TiB (3.9 billion documents) and want to migrate to OpenSearch 2.15. With RFS, you can perform this migration in approximately 35 minutes, spending approximately $10 in Amazon Elastic Container Service (Amazon ECS) usage to run the RFS workers during the migration.

To demonstrate this capability, we created an Elasticsearch 7.10 source cluster in Amazon OpenSearch Service, with 1,024 shards and 0 replicas. We used AWS Glue to bulk-load sample data into the source cluster with the AWS Public Blockchain Dataset, and repeated the bulk-load process until 5 TiB of data (3.9 billion documents) was stored. We created an OpenSearch 2.15 cluster as the target cluster in Amazon OpenSearch Service, with 15 r7gd.16xlarge data nodes and 3 m7g.large master nodes, and used SigV4 for authentication. Using the Migration Assistant solution, we created a snapshot of the source cluster, stored it in S3, and performed a metadata migration so that the indices on the source were recreated on the target cluster with the same shard and replica counts. We then ran console backfill start and console backfill scale 200 to begin the RFS migration with 200 workers. RFS indexed data into the target cluster at 2,497 MiB per second. The migration was completed in approximately 35 minutes. We metered approximately $10 in ECS cost for running the RFS workers.
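As a quick sanity check, the reported duration follows directly from the data volume and the indexing rate. The sketch below uses only figures quoted in this section.

```python
MIB_PER_TIB = 1024 * 1024

data_mib = 5 * MIB_PER_TIB    # 5 TiB of source data
throughput_mib_per_s = 2_497  # reported indexing rate
shards, workers = 1_024, 200  # source shards and RFS workers

duration_min = data_mib / throughput_mib_per_s / 60
shards_per_worker = shards / workers

print(f"estimated duration: {duration_min:.1f} minutes")
print(f"average shards handled per worker: {shards_per_worker:.2f}")
```

With roughly five shards per worker, each worker cycles through several download and indexing phases during the run.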

To better highlight the performance, the following figures show metrics from the OpenSearch target cluster during this process.

In the preceding figures, you can see the cyclical variation in the document index rate and target cluster resource utilization as the 200 RFS workers pick up shards, complete a shard, and then pick up a new shard. At peak RFS indexing, we see the target cluster nodes maxing out their CPU and beginning to queue writes. The queue is cleared as shards complete and more workers transition to the downloading state. In general, we find that RFS performance is limited by the ability of the target cluster to absorb the traffic it generates. You can tune the RFS worker fleet to match what your target cluster can reliably ingest.

Conclusion

This blog post is designed to be a starting point for teams seeking guidance on how to use Reindexing-from-Snapshot as a straightforward, high throughput, and low-cost solution for data migration from self-managed OpenSearch and Elasticsearch clusters to Amazon OpenSearch Service. RFS is now part of the Migration Assistant solution and available from the AWS Solution Library. To use RFS to migrate to Amazon OpenSearch Service, try the Migration Assistant solution. To experience OpenSearch, try the OpenSearch Playground. To use the managed implementation of OpenSearch in the AWS Cloud, see Getting started with Amazon OpenSearch Service.


About the authors

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads the core experience in the next-gen OpenSearch UI and data migration to Amazon OpenSearch Service. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.

Chris Helma is a Senior Engineer at Amazon Web Services based in Austin, Texas. He is currently developing tools and techniques to enable users to shift petabyte-scale data workloads into OpenSearch. He has extensive experience building highly-scalable technologies in diverse areas such as search, security analytics, cryptography, and developer productivity. He has functional domain expertise in distributed systems, AI/ML, cloud-native design, and optimizing DevOps workflows. In his free time, he loves to explore specialty coffee and run through the West Austin hills.

Andre Kurait is a Software Development Engineer II at Amazon Web Services, based in Austin, Texas. He is currently working on Migration Assistant for Amazon OpenSearch Service. Prior to joining Amazon OpenSearch, Andre worked within Amazon Health Services. In his free time, Andre enjoys traveling, cooking, and playing in his church sport leagues. Andre holds Bachelor of Science degrees from the University of Kansas in Computer Science and Mathematics.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Introducing Point in Time queries and SQL/PPL support in Amazon OpenSearch Serverless

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/introducing-point-in-time-queries-and-sql-ppl-support-in-amazon-opensearch-serverless/

Today we announced support for three new features for Amazon OpenSearch Serverless: Point in Time (PIT) search, which enables you to maintain stable sorting for deep pagination in the presence of updates, and Piped Processing Language (PPL) and Structured Query Language (SQL), which give you new ways to query your data. Querying with SQL or PPL is useful if you’re already familiar with the language or want to integrate your domain with an application that uses them.

OpenSearch Serverless is a powerful and scalable search and analytics engine that enables you to store, search, and analyze large volumes of data while reducing the burden of manual infrastructure provisioning and scaling as you ingest, analyze, and visualize your time series and search data, simplifying data management and enabling you to derive actionable insights from data. The vector engine for OpenSearch Serverless also makes it easy for you to build modern machine learning (ML) augmented search experiences and generative artificial intelligence (generative AI) applications without needing to manage the underlying vector database infrastructure.

PIT search

Point in Time (PIT) search lets you run different queries against a dataset that’s fixed in time. Typically, when you run the same query on the same index at different points in time, you receive different results because documents are constantly indexed, updated, and deleted. With PIT, you can query against a state of your dataset for a point in time. Although OpenSearch still supports other ways of paginating results, PIT search provides superior capabilities and performance because it isn’t bound to a query and supports consistent pagination. When you create a PIT for a set of indexes, OpenSearch creates contexts to access data at that point in time and when you use a query with a PIT ID, it searches the contexts that are frozen in time to provide consistent results.

Using PIT involves the following high-level steps:

  1. Create a PIT.
  2. Run search queries with a PIT ID and use the search_after parameter for the next page of results.
  3. Close the PIT.

Create a PIT

When you create a PIT, OpenSearch Serverless provides a PIT ID, which you can use to run multiple queries on the frozen dataset. Even though the indexes continue to ingest data and modify or delete documents, the PIT references the data that hasn’t changed since the PIT creation.
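As a rough sketch, creating a PIT amounts to a single POST against the index. The path below follows the OpenSearch PIT API as we understand it; the index name and keep_alive value are hypothetical, and the helpers only build and parse data, so they can be used with whichever signed HTTP client you prefer.

```python
def create_pit_request(index: str, keep_alive: str = "1h"):
    """Return the (method, path) pair for creating a PIT on `index`."""
    return "POST", f"/{index}/_search/point_in_time?keep_alive={keep_alive}"


def pit_id_from_response(response: dict) -> str:
    """Extract the PIT ID from the parsed JSON response."""
    return response["pit_id"]


method, path = create_pit_request("my-index", keep_alive="100m")
print(method, path)
```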

Run a search query with the PIT ID

PIT search isn’t bound to a query, so you can run different queries on the same dataset, which is frozen in time.

When you run a query with a PIT ID, you can use the search_after parameter to retrieve the next page of results. This gives you control over the order of documents in the pages of results.

The following response contains the first 100 documents that match the query. To get the next set of documents, you can run the same query with the last document’s sort values as the search_after parameter, keeping the same sort and pit.id. You can use the optional keep_alive parameter to extend the PIT time.
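The pagination pattern can be sketched as two small helpers that build the search body for the first page and derive the body for the next page from the previous response. The sort field, PIT ID, and response below are hypothetical; in the OpenSearch API a PIT search is sent to `_search` without an index name, because the PIT already identifies the indexes.

```python
def pit_search_body(pit_id, sort, size=100, search_after=None, keep_alive=None):
    """Build a search body bound to a PIT, optionally resuming after a sort key."""
    pit = {"id": pit_id}
    if keep_alive is not None:
        pit["keep_alive"] = keep_alive  # optional: extends the PIT lifetime
    body = {"size": size, "query": {"match_all": {}}, "pit": pit, "sort": sort}
    if search_after is not None:
        body["search_after"] = search_after
    return body


def next_page_body(previous_body, previous_response):
    """Reuse the same query, sort, and PIT; continue after the last hit's sort values."""
    hits = previous_response["hits"]["hits"]
    if not hits:
        return None  # no more pages
    body = dict(previous_body)
    body["search_after"] = hits[-1]["sort"]
    return body


first = pit_search_body(pit_id="<pit-id>", sort=[{"timestamp": {"order": "asc"}}])
fake_response = {
    "hits": {
        "hits": [
            {"_id": "1", "sort": [1700000000000]},
            {"_id": "2", "sort": [1700000005000]},
        ]
    }
}
second = next_page_body(first, fake_response)
print(second["search_after"])
```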

Close the PIT

When your queries on the dataset are complete, you can delete the PIT using the DELETE operation. PITs automatically expire after the keep_alive duration.
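A delete request can be sketched in the same style. The path and body shape follow the OpenSearch PIT API as we understand it, and the PIT ID is a placeholder.

```python
def delete_pit_request(pit_ids):
    """Return (method, path, body) for deleting one or more PITs."""
    return "DELETE", "/_search/point_in_time", {"pit_id": list(pit_ids)}


method, path, body = delete_pit_request(["<pit-id>"])
print(method, path, body)
```

Deleting a PIT explicitly releases its resources sooner than waiting for the keep_alive duration to elapse.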

Considerations and limitations

Keep in mind the following limitations when using this feature:

SQL and PPL support

OpenSearch Serverless provides a primary query interface called query DSL that you can use to search your data. Query DSL is a flexible language with a JSON interface. In addition to DSL, you can now extract insights out of OpenSearch Serverless using the familiar SQL query syntax.

You can use the SQL and PPL APIs, exposed at the /_plugins/_sql and /_plugins/_ppl endpoints respectively, to search the data. You can use aggregations, group by, and where clauses to investigate your data and read your data as JSON documents or CSV tables, so you have the flexibility to use the format that works best for you. By default, queries return data in JDBC format. You can specify the response format as JDBC, standard OpenSearch JSON, CSV, or raw.

Use the /_plugins/_sql endpoint to send SQL queries to the SQL plugin, as shown in the following example.
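A minimal sketch of such a request follows. The path is written as `/_plugins/_sql`, which is how the OpenSearch SQL plugin documents it, with the response format selected through a `format` query parameter. The index name `accounts` and its fields are hypothetical, and the helper only builds the request so that you can send it with your own signed HTTP client.

```python
import json

ALLOWED_FORMATS = {"jdbc", "json", "csv", "raw"}


def sql_request(query: str, response_format: str = "jdbc"):
    """Return (method, path, body) for a SQL plugin query."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {response_format}")
    path = f"/_plugins/_sql?format={response_format}"
    return "POST", path, json.dumps({"query": query})


method, path, body = sql_request(
    "SELECT firstname, age FROM accounts WHERE age > 30 ORDER BY age DESC LIMIT 5",
    response_format="csv",
)
print(method, path)
print(body)
```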

Besides basic filtering and aggregation, OpenSearch SQL also supports complex queries, such as querying semi-structured data, set operations, sub-queries and limited JOINs. Beyond the standard functions, OpenSearch functions are provided for better analytics and visualization.

For PPL queries, use the /_plugins/_ppl endpoint to send queries to the SQL plugin.
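PPL queries are written as a pipeline of commands separated by the pipe character, so a request can be assembled from individual stages. In the sketch below, the index name and fields are hypothetical, and the path is written as `/_plugins/_ppl`, following the OpenSearch plugin documentation.

```python
import json


def ppl_request(source: str, *stages: str):
    """Return (method, path, body) for a PPL query built from pipeline stages."""
    query = " | ".join([f"source={source}", *stages])
    return "POST", "/_plugins/_ppl", json.dumps({"query": query})


method, path, body = ppl_request("accounts", "where age > 30", "fields firstname, age")
print(method, path)
print(body)
```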

Considerations and limitations

Keep in mind the following:

  • Query Workbench is not supported for SQL and PPL queries
  • The SQL and PPL CLI is supported and can be used to issue SQL and PPL queries
  • DELETE statements are not supported
  • SQL plugin data sources are not supported
  • The SQL query stats API is not supported

Summary

In this post, we discussed new features in OpenSearch Serverless. PIT is a useful feature when you need to maintain a consistent view of your data for pagination during search operations. SQL in OpenSearch Service bridges the gap between traditional relational database concepts and the flexibility of OpenSearch’s document-oriented data storage. You can send SQL and PPL queries to the _sql and _ppl endpoints, respectively, and use aggregations, group by, and where clauses to analyze your data.

For more information, refer to :


About the Authors

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.

Frank Dattalo is a Software Engineer with Amazon OpenSearch Service. He focuses on the search and plugin experience in Amazon OpenSearch Serverless. He has an extensive background in search, data ingestion, and AI/ML. In his free time, he likes to explore Seattle’s coffee landscape.

Milav Shah is an Engineering Leader with Amazon OpenSearch Service. He focuses on the search experience for OpenSearch customers. He has extensive experience building highly scalable solutions in databases, real-time streaming, and distributed computing. He also possesses functional domain expertise in verticals like Internet of Things, fraud protection, gaming, and ML/AI. In his free time, he likes to ride his bicycle, hike, and play chess.

Manage access controls in generative AI-powered search applications using Amazon OpenSearch Service and Amazon Cognito

Post Syndicated from Karim Akhnoukh original https://aws.amazon.com/blogs/big-data/manage-access-controls-in-generative-ai-powered-search-applications-using-amazon-opensearch-service-and-aws-cognito/

Organizations of all sizes and types are using generative AI to create products and solutions. A common adoption pattern is to introduce document search tools to internal teams, especially advanced document searches based on semantic search. In semantic search, documents are stored as vectors, a numeric representation of the document content, in a vector database such as Amazon OpenSearch Service, and are retrieved by performing similarity search with a vector representation of the search query.

In a real-world scenario, organizations want to make sure their users access only documents they are entitled to access. They are looking for a reliable and scalable solution to implement robust access controls to make sure these documents are only accessible to individuals who have a legitimate business need and the appropriate level of authorization. The permission mechanism has to be secure, built on top of built-in security features, and scalable for manageability when the user base scales out. Maintaining proper access controls for these sensitive assets is paramount, because unauthorized access could lead to severe consequences, such as data breaches, compliance violations, and reputational damage.

In this post, we show you how to manage user access to enterprise documents in generative AI-powered tools according to the access you assign to each persona.

Common use cases

The following are industry-specific use cases for document access management across different departments:

  • In R&D and engineering, access to product design documents evolves from restricted to broader as development progresses
  • HR maintains open access to general policies while limiting access to sensitive employee information
  • Finance and accounting documents require varying levels of access for auditing and executive decision-making
  • Sales and marketing teams carefully manage customer data and strategies, implementing tiered access for different roles and departments

These examples demonstrate the need for dynamic, role-based access control to balance information sharing with confidentiality in various business contexts.

Solution overview

By combining the powerful vector search capabilities of OpenSearch Service with the access control features provided by Amazon Cognito, this solution enables organizations to manage access controls based on custom user attributes and document metadata.

This approach simplifies the management of access rights, making sure only authorized users can access and interact with specific documents based on their roles, departments, and other relevant attributes. Following this approach, you can manage the access to your organization’s documents at scale. The following diagram depicts the solution architecture.

Solution diagram

The solution workflow consists of the following steps:

  1. The user accesses a smart search portal and lands on a web interface deployed on AWS Amplify.
  2. The user authenticates through an Amazon Cognito user pool and an access token is returned to the client. This access token is used to retrieve the custom attributes (key-value pairs) assigned to the user. In our case, we created two custom attributes (custom:department and custom:access_level).
  3. For each user query, an API is invoked on Amazon API Gateway to process the request. Each invocation includes the user access token in the header.
  4. The API is integrated with AWS Lambda, which processes the user query and generates the answers based on available documents and user access using retrieval augmented generation (RAG). The process starts by creating a vector based on the question (embedding) by invoking the embedding model.
  5. A query is sent to OpenSearch Service that includes the following:
    1. The embedding vector generated.
    2. User custom attributes retrieved by Lambda based on their access token, by calling the Amazon Cognito GetUser API.
    3. The query relies on the support of an efficient k-NN filter in OpenSearch Service to perform the search.
  6. Pre-filtered documents that relate to the user query are included in the prompt of the large language model (LLM) that summarizes the answer. Then, Lambda replies to the web interface with the LLM completion (reply).
  7. If the user’s access needs to be modified (assigned attributes), an API call is made through API Gateway to a Lambda function that processes the request to add or update the custom attributes’ value for a specific user.
  8. New attributes are reflected in the user’s profile in Amazon Cognito.
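
The attribute retrieval in step 5 can be sketched as a small parsing helper. It assumes the response shape of the Amazon Cognito GetUser API, where `UserAttributes` is a list of Name and Value pairs. Stripping the `custom:` prefix and splitting comma-separated values are assumptions made for this illustration and might differ from the code in the GitHub repo.

```python
def custom_attributes(get_user_response: dict) -> dict[str, list[str]]:
    """Collect custom:* attributes as {name_without_prefix: [values]}."""
    prefix = "custom:"
    attributes = {}
    for item in get_user_response.get("UserAttributes", []):
        name, value = item["Name"], item["Value"]
        if name.startswith(prefix):
            # Assumption: multiple values are stored as a comma-separated string.
            values = [part.strip() for part in value.split(",") if part.strip()]
            attributes[name[len(prefix):]] = values
    return attributes


# In Lambda, the response would come from a call such as:
#   boto3.client("cognito-idp").get_user(AccessToken=access_token)
sample_response = {
    "Username": "researcher",
    "UserAttributes": [
        {"Name": "email", "Value": "researcher@example.com"},
        {"Name": "custom:department", "Value": "research"},
        {"Name": "custom:access_level", "Value": "confidential, support"},
    ],
}
user_attributes = custom_attributes(sample_response)
print(user_attributes)
```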

Our solution is implemented and wrapped within AWS Cloud Development Kit (AWS CDK) stacks, which are available in the GitHub repo.

Our sample documents assume a fictional manufacturing company called Unicorn Robotics Factory, which develops robotic unicorns. The dataset contains over 900 documents that are a mix of engineering, roadmap, and business reporting documents. The following is an example of a document’s content:

**CONFIDENTIAL - UNICORNS ROBOTICS INTERNAL DOCUMENT**

**Project: "Galactic Unicorn"**

Unicorns Robotics is proud to announce the development of our latest project, the "Galactic Unicorn". 
This top-secret project aims to create a robotic unicorn that can travel through space and time, bringing magic and joy to children and adults alike.....

The associated metadata file for this document consists of the following:

{ "department": "research", "access_level": "confidential" }

Our solution in the GitHub repo takes care of loading the documents with associated metadata tags. For illustration purposes, we used the following mapping for the users and document access.

user access mapping

This solution is meant to delegate access management to the application tier, to simplify the implementation of use cases like generative AI-powered document search tools. However, if your use case requires a stricter approach to control document access, like multi-tenant environments or field-level security, you might want to use the fine-grained access control feature in OpenSearch Service. In our solution, we manage the access on the document level according to the assigned metadata.

Prerequisites

To deploy the solution, you need the following prerequisites:

Deploy the solution

To deploy the solution to your AWS account, refer to the Readme file in our GitHub repo.

Query documents with different personas

Now let’s test the application using different personas. In this example, we use the same users with their corresponding custom attributes as illustrated in the solution overview.

To start, let’s log in using the researcher account and run the search around a confidential document.

We ask, “What is the projected profit margin of the Galactic Unicorn project?” and get the result as shown in the following screenshot.

search using researcher access

The question invokes a query to OpenSearch Service using the custom attributes assigned to the researcher. The following code illustrates how the query is structured:

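# user_attributes is assumed to map each attribute name to the list of values
# granted to the user; query_vector is the embedding of the user's question.
must_conditions = []
for attr, values in user_attributes.items():
    # A document satisfies an attribute if it carries any one of the user's values.
    must_conditions.append(
        {
            "bool": {
                "should": [{"term": {attr: value}} for value in values],
                "minimum_should_match": 1,
            }
        }
    )

query = {
    "size": 5,
    "query": {
        "knn": {
            "doc_embedding": {
                "vector": query_vector,
                "k": 10,
                # Every attribute condition must hold for a document to be a candidate.
                "filter": {"bool": {"must": must_conditions}},
            }
        }
    },
}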
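
To make the filter semantics explicit, the following sketch expresses the same rule in plain Python: a document is visible only if, for every user attribute, the document's metadata value is one of the values assigned to the user. The function name and the sample personas are hypothetical; the document metadata matches the example shown earlier.

```python
def is_visible(doc_metadata: dict, user_attributes: dict) -> bool:
    """AND across attributes, OR across the values allowed for each attribute."""
    return all(
        doc_metadata.get(attr) in values for attr, values in user_attributes.items()
    )


doc = {"department": "research", "access_level": "confidential"}

# Hypothetical personas, for illustration only.
researcher = {"department": ["research"], "access_level": ["confidential", "support"]}
engineer = {"department": ["engineering"], "access_level": ["support"]}

print(is_visible(doc, researcher))
print(is_visible(doc, engineer))
```

Applying the rule as a k-NN filter, rather than after retrieval, means restricted documents never reach the LLM prompt.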

Let’s sign out and log in again with an engineer profile to test the same query. Based on the assigned attributes and document metadata, the result should look like that in the following screenshot.

search using engineer access

If you query support documents instead, you get the desired answer, as shown in the following screenshot.

tech question by engineer

Modify user access

As depicted in the solution diagram, we’ve added a feature in the web interface to allow you to modify user access, which you could use to perform further tests. To do so, log in as a tool admin and choose Manage Attributes. Then modify the custom attribute value for a given user, as shown in the following screenshot.

access modification

Clean up

When you delete a stack, most resources are deleted along with it, but not all. By default, the Amazon Simple Storage Service (Amazon S3) bucket, Amazon Cognito user pool, and OpenSearch Service domain are retained. However, our AWS CDK code alters this default behavior by setting the RemovalPolicy to DESTROY for these resources. If you want to retain them, you can adjust the RemovalPolicy in the AWS CDK code for the different resources.

You can use the following command to clean up the resources deployed to your AWS account:

make destroy

Conclusion

This post illustrated how to build a document search RAG solution that makes sure only authorized users can access and interact with specific documents based on their roles, departments, and other relevant attributes. It combines OpenSearch Service and Amazon Cognito custom attributes to make a tag-based access control mechanism that makes it straightforward to manage at scale.

For demonstration purposes, the following points weren’t included in the AWS CDK code. However, they’re still applicable and you might want to work on them before deploying for production purposes:


About the Authors

Karim Akhnoukh is a Solutions Architect at AWS working with manufacturing customers in Germany. He is passionate about applying machine learning and generative AI to solve customers’ business challenges. Besides work, he enjoys playing sports, aimless walks, and good quality coffee.

Ahmed Ewis is a Senior Solutions Architect at AWS GenAI Labs. He helps customers build generative AI-based solutions to solve business problems. When not collaborating with customers, he enjoys playing with his kids and cooking.

Fortune Hui is a Solutions Architect at AWS Hong Kong, working with conglomerate customers. He helps customers and partners build big data platform and generative AI applications. In his free time, he plays badminton and enjoys whisky.

Use Amazon Kinesis Data Streams to deliver real-time data to Amazon OpenSearch Service domains with Amazon OpenSearch Ingestion

Post Syndicated from M Mehrtens original https://aws.amazon.com/blogs/big-data/use-amazon-kinesis-data-streams-to-deliver-real-time-data-to-amazon-opensearch-service-domains-with-amazon-opensearch-ingestion/

In this post, we show how to use Amazon Kinesis Data Streams to buffer and aggregate real-time streaming data for delivery into Amazon OpenSearch Service domains and collections using Amazon OpenSearch Ingestion. You can use this approach for a variety of use cases, from real-time log analytics to integrating application messaging data for real-time search. In this post, we focus on the use case for centralizing log aggregation for an organization that has a compliance need to archive and retain its log data.

Kinesis Data Streams is a fully managed, serverless data streaming service that stores and ingests various streaming data in real time at any scale. For log analytics use cases, Kinesis Data Streams enhances log aggregation by decoupling producer and consumer applications, and providing a resilient, scalable buffer to capture and serve log data. This decoupling provides advantages over traditional architectures. As log producers scale up and down, Kinesis Data Streams can be scaled dynamically to persistently buffer log data. This prevents load changes from impacting an OpenSearch Service domain, and provides a resilient store of log data for consumption. It also allows for multiple consumers to process log data in real time, providing a persistent store of real-time data for applications to consume. This allows the log analytics pipeline to meet Well-Architected best practices for resilience (REL04-BP02) and cost (COST09-BP02).

OpenSearch Ingestion is a serverless pipeline that provides powerful tools for extracting, transforming, and loading data into an OpenSearch Service domain. OpenSearch Ingestion integrates with many AWS services, and provides ready-made blueprints to accelerate ingesting data for a variety of analytics use cases into OpenSearch Service domains. When paired with Kinesis Data Streams, OpenSearch Ingestion allows for sophisticated real-time analytics of data, and helps reduce the undifferentiated heavy lifting of creating a real-time search and analytics architecture.

Solution overview

In this solution, we consider a common use case for centralized log aggregation for an organization. Organizations might consider a centralized log aggregation approach for a variety of reasons. Many organizations have compliance and governance requirements that have stipulations for what data needs to be logged, and how long log data must be retained and remain searchable for investigations. Other organizations seek to consolidate application and security operations, and provide common observability toolsets and capabilities across their teams.

To meet such requirements, you need to collect data from log sources (producers) in a scalable, resilient, and cost-effective manner. Log sources may vary between application and infrastructure use cases and configurations, as illustrated in the following table.

Log Producer       | Example                         | Example Producer Log Configuration
Application Logs   | AWS Lambda                      | Amazon CloudWatch Logs
Application Agents | FluentBit                       | Amazon OpenSearch Ingestion
AWS Service Logs   | Amazon Web Application Firewall | Amazon S3

The following diagram illustrates an example architecture.

You can use Kinesis Data Streams for a variety of these use cases. You can configure Amazon CloudWatch Logs to send data to Kinesis Data Streams using a subscription filter (see Real-time processing of log data with subscriptions). If you send data with Kinesis Data Streams for analytics use cases, you can use OpenSearch Ingestion to create a scalable, extensible pipeline to consume your streaming data and write it to OpenSearch Service indexes. Kinesis Data Streams provides a buffer that can support multiple consumers, configurable retention, and built-in integration with a variety of AWS services. For other use cases, such as data stored in Amazon Simple Storage Service (Amazon S3) or data written by an agent such as Fluent Bit, the data can be delivered directly to OpenSearch Ingestion without an intermediate buffer, thanks to OpenSearch Ingestion’s built-in persistent buffers and automatic scaling.

Standardizing logging approaches reduces development and operational overhead for organizations. For example, you might standardize on all applications logging to CloudWatch logs when feasible, and also handle Amazon S3 logs where CloudWatch logs are unsupported. This reduces the number of use cases that a centralized team needs to handle in their log aggregation approach, and reduces the complexity of the log aggregation solution. For more sophisticated development teams, you might standardize on using FluentBit agents to write data directly to OpenSearch Ingestion to lower cost when log data doesn’t need to be stored in CloudWatch.

This solution focuses on using CloudWatch logs as a data source for log aggregation. For the Amazon S3 log use case, see Using an OpenSearch Ingestion pipeline with Amazon S3. For agent-based solutions, see the agent-specific documentation for integration with OpenSearch Ingestion, such as Using an OpenSearch Ingestion pipeline with Fluent Bit.

Prerequisites

Several key pieces of infrastructure used in this solution are required to ingest data into OpenSearch Service with OpenSearch Ingestion:

  • A Kinesis data stream to aggregate the log data from CloudWatch
  • An OpenSearch domain to store the log data

When creating the Kinesis data stream, we recommend starting with On-Demand mode. This will allow Kinesis Data Streams to automatically scale the number of shards needed for your log throughput. After you identify the steady state workload for your log aggregation use case, we recommend moving to Provisioned mode, using the number of shards identified in On-Demand mode. This can help you optimize long-term cost for high-throughput use cases.

In general, we recommend using one Kinesis data stream for your log aggregation workload. OpenSearch Ingestion supports up to 96 OCUs per pipeline, and 24,000 characters per pipeline definition file (see OpenSearch Ingestion quotas). This means that each pipeline can support a Kinesis data stream with up to 96 shards, because each OCU processes one shard. Using one Kinesis data stream simplifies the overall process to aggregate log data into OpenSearch Service, and simplifies the process for creating and managing subscription filters for log groups.
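
The sizing rule above reduces to simple arithmetic. The following Python sketch assumes the 96-OCU quota and the one-OCU-per-shard relationship stated in this post; the function names are illustrative and are not part of any AWS SDK.

```python
import math

# Values taken from the text above; check the current OpenSearch Ingestion
# quotas before relying on them.
MAX_OCUS_PER_PIPELINE = 96
SHARDS_PER_OCU = 1


def max_shards_per_pipeline() -> int:
    """Largest stream, in shards, that a single pipeline can consume."""
    return MAX_OCUS_PER_PIPELINE * SHARDS_PER_OCU


def streams_needed(total_shards: int) -> int:
    """Minimum number of stream and pipeline pairs for a given shard count."""
    if total_shards < 1:
        raise ValueError("total_shards must be at least 1")
    return math.ceil(total_shards / max_shards_per_pipeline())
```

For example, a workload that needs 97 shards would no longer fit behind one pipeline under these assumptions, so streams_needed(97) returns 2.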

Depending on the scale of your log workloads, and the complexity of your OpenSearch Ingestion pipeline logic, you may consider more Kinesis data streams for your use case. For example, you may consider one stream for each major log type in your production workload. Having log data for different use cases separated into different streams can help reduce the operational complexity of managing OpenSearch Ingestion pipelines, and allows you to scale and deploy changes to each log use case separately when required.

To create a Kinesis data stream, see Create a data stream.

To create an OpenSearch domain, see Creating and managing Amazon OpenSearch domains.

Configure log subscription filters

You can implement CloudWatch log group subscription filters at the account level or log group level. In both cases, we recommend creating a subscription filter with a random distribution method to make sure log data is evenly distributed across Kinesis data stream shards.

Account-level subscription filters are applied to all log groups in an account, and can be used to subscribe all log data to a single destination. This works well if you want to store all your log data in OpenSearch Service using Kinesis Data Streams. There is a limit of one account-level subscription filter per account. Using Kinesis Data Streams as the destination also allows you to have multiple log consumers to process the account log data when relevant. To create an account-level subscription filter, see Account-level subscription filters.

Log group-level subscription filters are applied on each log group. This approach works well if you want to store a subset of your log data in OpenSearch Service using Kinesis Data Streams, and if you want to use multiple different data streams to store and process multiple log types. There is a limit of two log group-level subscription filters per log group. To create a log group-level subscription filter, see Log group-level subscription filters.
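
If you prefer to script the log group-level filter, the following Python sketch shows one way to build the request for the CloudWatch Logs PutSubscriptionFilter API with the random distribution method recommended above. The filter name, the empty filter pattern (which matches every log event), and the helper names are assumptions for illustration; the role ARN is the IAM role that allows CloudWatch Logs to write to your data stream.

```python
def subscription_filter_request(log_group, stream_arn, role_arn,
                                filter_name="to-kinesis"):
    """Build the arguments for put_subscription_filter.

    filter_name is a hypothetical default; choose your own.
    """
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": "",          # empty pattern forwards all log events
        "destinationArn": stream_arn,
        "roleArn": role_arn,
        "distribution": "Random",     # spread records evenly across shards
    }


def create_subscription_filter(request):
    """Send the request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no dependencies

    boto3.client("logs").put_subscription_filter(**request)
```

Keeping the request builder separate from the API call makes it possible to review or unit test the settings before anything is created in your account.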

After you create your subscription filter, verify that log data is being sent to your Kinesis data stream. On the Kinesis Data Streams console, choose the link for your stream name.

Choose a shard with Starting position set as Trim horizon, and choose Get records.

You should see records with a unique Partition key column value and binary Data column. This is because CloudWatch sends data in .gzip format to compress log data.
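
To inspect what a record contains, you can decompress it yourself. The following Python sketch assumes that data holds the raw record bytes (for example, the Data value returned by the AWS SDK for Python) and that the payload follows the CloudWatch Logs subscription format, with owner, logGroup, logStream, and logEvents keys. The function name is illustrative.

```python
import gzip
import json


def decode_subscription_record(data: bytes) -> list:
    """Decompress one Kinesis record and return one dict per log event.

    Each returned event carries the shared owner, logGroup, and logStream
    values alongside its own id, timestamp, and message.
    """
    payload = json.loads(gzip.decompress(data))
    if payload.get("messageType") != "DATA_MESSAGE":
        return []  # control messages carry no log events
    shared = {key: payload[key] for key in ("owner", "logGroup", "logStream")}
    return [{**event, **shared} for event in payload["logEvents"]]
```

This is roughly the shape of the documents that the pipeline's JSON codec produces later in this post when it splits logEvents into individual entries and keeps the three metadata keys.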

Configure an OpenSearch Ingestion pipeline

Now that you have a Kinesis data stream and CloudWatch subscription filters to send data to the data stream, you can configure your OpenSearch Ingestion pipeline to process your log data. To begin, you create an AWS Identity and Access Management (IAM) role that allows read access to the Kinesis data stream and read/write access to the OpenSearch domain. The IAM principal that you use to create the pipeline also requires iam:PassRole permission on the pipeline role created in this step.

  1. Create an IAM role with the following permissions to read from your Kinesis data stream and access your OpenSearch domain:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "allowReadFromStream",
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:DescribeStreamConsumer",
                    "kinesis:DescribeStreamSummary",
                    "kinesis:GetRecords",
                    "kinesis:GetShardIterator",
                    "kinesis:ListShards",
                    "kinesis:ListStreams",
                    "kinesis:ListStreamConsumers",
                    "kinesis:RegisterStreamConsumer",
                    "kinesis:SubscribeToShard"
                ],
                "Resource": [
                    "arn:aws:kinesis:{region}:{account-id}:stream/{stream-name}"
                ]
            },
            {
                "Sid": "allowAccessToOS",
                "Effect": "Allow",
                "Action": [
                    "es:DescribeDomain",
                    "es:ESHttp*"
                ],
                "Resource": [
                    "arn:aws:es:{region}:{account-id}:domain/{domain-name}",
                    "arn:aws:es:{region}:{account-id}:domain/{domain-name}/*"
                ]
            }
        ]
    }

  2. Give your role a trust policy that allows access from osis-pipelines.amazonaws.com:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "osis-pipelines.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {
                        "aws:SourceAccount": "{account-id}"
                    },
                    "ArnLike": {
                        "aws:SourceArn": "arn:aws:osis:{region}:{account-id}:pipeline/*"
                    }
                }
            }
        ]
    }
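
If you manage several environments, filling in the placeholders by hand can be error-prone. The following Python sketch renders both policy documents shown above from four inputs. The function name and parameters are illustrative assumptions; the statements mirror the JSON in this post.

```python
import json

KINESIS_READ_ACTIONS = [
    "kinesis:DescribeStream",
    "kinesis:DescribeStreamConsumer",
    "kinesis:DescribeStreamSummary",
    "kinesis:GetRecords",
    "kinesis:GetShardIterator",
    "kinesis:ListShards",
    "kinesis:ListStreams",
    "kinesis:ListStreamConsumers",
    "kinesis:RegisterStreamConsumer",
    "kinesis:SubscribeToShard",
]


def pipeline_role_policies(region, account_id, stream_name, domain_name):
    """Return (permissions_policy, trust_policy) as dictionaries."""
    stream_arn = f"arn:aws:kinesis:{region}:{account_id}:stream/{stream_name}"
    domain_arn = f"arn:aws:es:{region}:{account_id}:domain/{domain_name}"
    permissions = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "allowReadFromStream",
                "Effect": "Allow",
                "Action": KINESIS_READ_ACTIONS,
                "Resource": [stream_arn],
            },
            {
                "Sid": "allowAccessToOS",
                "Effect": "Allow",
                "Action": ["es:DescribeDomain", "es:ESHttp*"],
                "Resource": [domain_arn, f"{domain_arn}/*"],
            },
        ],
    }
    trust = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {"Service": ["osis-pipelines.amazonaws.com"]},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"aws:SourceAccount": account_id},
                    "ArnLike": {
                        "aws:SourceArn":
                            f"arn:aws:osis:{region}:{account_id}:pipeline/*"
                    },
                },
            }
        ],
    }
    return permissions, trust
```

You can serialize either dictionary with json.dumps(policy, indent=4) and paste the result into the IAM console or pass it to your infrastructure tooling.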

For a pipeline to write data to a domain, the domain must have a domain-level access policy that allows the pipeline role to access it. If your domain uses fine-grained access control, the IAM role must also be mapped to a backend role in the OpenSearch Service security plugin that allows access to create and write to indexes.

  1. After you create your pipeline role, on the OpenSearch Service console, choose Pipelines under Ingestion in the navigation pane.
  2. Choose Create pipeline.
  3. Search for Kinesis in the blueprints, select the Kinesis Data Streams blueprint, and choose Select blueprint.
  4. Under Pipeline settings, enter a name for your pipeline, and set Max capacity for the pipeline to be equal to the number of shards in your Kinesis data stream.

If you’re using On-Demand mode for the data stream, choose a capacity equal to the current number of shards in the stream. This use case doesn’t require a persistent buffer, because Kinesis Data Streams provides a persistent buffer for the log data, and OpenSearch Ingestion tracks its position in the Kinesis data stream over time, preventing data loss on restarts.

  1. Under Pipeline configuration, update the pipeline source settings to use your Kinesis data stream name and pipeline IAM role Amazon Resource Name (ARN).

For full configuration information, see the OpenSearch Ingestion documentation for the Kinesis Data Streams source. For most configurations, you can use the default values. By default, the pipeline will write batches of 100 documents every 1 second, and will subscribe to the Kinesis data stream from the latest position in the stream using enhanced fan-out, checkpointing its position in the stream every 2 minutes. You can adjust this behavior as desired to tune how frequently the consumer checkpoints and where it begins in the stream, or to use polling instead of enhanced fan-out to reduce costs.

  source:
    kinesis-data-streams:
      acknowledgments: true
      codec:
        # JSON codec supports parsing nested CloudWatch events into
        # individual log entries that will be written as documents to
        # OpenSearch
        json:
          key_name: "logEvents"
          # These keys contain the metadata sent by CloudWatch Subscription Filters
          # in addition to the individual log events:
          # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#DestinationKinesisExample
          include_keys: ['owner', 'logGroup', 'logStream' ]
      streams:
        # Update to use your Kinesis Stream name used in your Subscription Filters:
        - stream_name: "KINESIS_STREAM_NAME"
          # Can customize initial position if you don't want OSI to consume the entire stream:
          initial_position: "EARLIEST"
          # Compression will always be gzip for CloudWatch, but will vary for other sources:
          compression: "gzip"
      aws:
        # Provide the Role ARN with access to KDS. This role should have a trust relationship with osis-pipelines.amazonaws.com
        # This must be the same role used below in the Sink configuration.
        sts_role_arn: "PIPELINE_ROLE_ARN"
        # Provide the region of the Data Stream.
        region: "REGION"

  1. Update the pipeline sink settings to include your OpenSearch domain endpoint URL and pipeline IAM role ARN.

The IAM role ARN must be the same for both the OpenSearch Service sink definition and the Kinesis Data Streams source definition. You can control what data gets indexed in different indexes using the index definition in the sink. For example, you can use metadata about the Kinesis data stream name to index by data stream (${getMetadata("stream_name")}), or you can use document fields to index data depending on the CloudWatch log group or other document data (${path/to/field/in/document}). In this example, we use three document-level fields (data_stream.type, data_stream.dataset, and data_stream.namespace) to index our documents, and create these fields in our pipeline processor logic in the next section:

  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "OPENSEARCH_ENDPOINT" ]
        # Route log data to different target indexes depending on the log context:
        index: "ss4o_${data_stream/type}-${data_stream/dataset}-${data_stream/namespace}"
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          # This role must be the same as the role used above for Kinesis.
          sts_role_arn: "PIPELINE_ROLE_ARN"
          # Provide the region of the domain.
          region: "REGION"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: false
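
To see how the index setting routes documents, the following Python sketch resolves ${path/to/field} references against a document. It is a simplified illustration and not the OpenSearch Ingestion implementation: it handles only field paths, not getMetadata expressions, and the function names are assumptions.

```python
import re

# Matches ${...} references such as ${data_stream/type}
_FIELD_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def lookup(document, path):
    """Walk a slash-separated path through nested dictionaries."""
    value = document
    for part in path.strip("/").split("/"):
        value = value[part]
    return value


def resolve_index(template, document):
    """Replace each ${path} in the template with the document's value."""
    return _FIELD_REFERENCE.sub(
        lambda match: str(lookup(document, match.group(1))), template
    )
```

With the default values set by the processors in the next section, a Lambda log event would land in ss4o_logs-lambda-default, and an unclassified event in ss4o_logs-general-default.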

Finally, you can update the pipeline configuration to include processor definitions to transform your log data before writing documents to the OpenSearch domain. For example, this use case adopts Simple Schema for Observability (SS4O) and uses the OpenSearch Ingestion pipeline to create the desired schema for SS4O. This includes adding common fields to associate metadata with the indexed documents, as well as parsing the log data to make data more searchable. This use case also uses the log group name to identify different log types as datasets, and uses this information to write documents to different indexes depending on their use cases.

  1. Rename the CloudWatch event timestamp to mark the observed timestamp when the log was generated using the rename_keys processor, and add the current timestamp as the processed timestamp when OpenSearch Ingestion handled the record using the date processor:
      #  Processor logic is used to change how log data is parsed for OpenSearch.
      processor:
        - rename_keys:
            entries:
            # Include CloudWatch timestamp as the observation timestamp - the time the log
            # was generated and sent to CloudWatch:
            - from_key: "timestamp"
              to_key: "observed_timestamp"
        - date:
            # Include the current timestamp that OSI processed the log event:
            from_time_received: true
            destination: "processed_timestamp"

  2. Use the add_entries processor to include metadata about the processed document, including the log group, log stream, account ID, AWS Region, Kinesis data stream information, and dataset metadata:
        - add_entries:
            entries:
            # Support SS4O common log fields (https://opensearch.org/docs/latest/observing-your-data/ss4o/)
            - key: "cloud/provider"
              value: "aws"
            - key: "cloud/account/id"
              format: "${owner}"
            - key: "cloud/region"
              value: "us-west-2"
            - key: "aws/cloudwatch/log_group"
              format: "${logGroup}"
            - key: "aws/cloudwatch/log_stream"
              format: "${logStream}"
            # Include default values for the data_stream:
            - key: "data_stream/namespace"
              value: "default"
            - key: "data_stream/type"
              value: "logs"
            - key: "data_stream/dataset"
              value: "general"
            # Include metadata about the source Kinesis message that contained this log event:
            - key: "aws/kinesis/stream_name"
              value_expression: "getMetadata(\"stream_name\")"
            - key: "aws/kinesis/partition_key"
              value_expression: "getMetadata(\"partition_key\")"
            - key: "aws/kinesis/sequence_number"
              value_expression: "getMetadata(\"sequence_number\")"
            - key: "aws/kinesis/sub_sequence_number"
              value_expression: "getMetadata(\"sub_sequence_number\")"

  3. Use conditional expression syntax to update the data_stream.dataset fields depending on the log source, to control what index the document is written to, and use the delete_entries processor to delete the original CloudWatch document fields that were renamed:
        - add_entries:
            entries:
            # Update the data_stream fields based on the log event context - in this case
            # classifying the log events by their source (CloudTrail or Lambda).
            # Additional logic could be added to classify the logs by business or application context:
            - key: "data_stream/dataset"
              value: "cloudtrail"
              add_when: "contains(/logGroup, \"cloudtrail\") or contains(/logGroup, \"CloudTrail\")"
              overwrite_if_key_exists: true
            - key: "data_stream/dataset"
              value: "lambda"
              add_when: "contains(/logGroup, \"/aws/lambda/\")"
              overwrite_if_key_exists: true
            - key: "data_stream/dataset"
              value: "apache"
              add_when: "contains(/logGroup, \"/apache/\")"
              overwrite_if_key_exists: true
        # Remove the default CloudWatch fields, as we re-mapped them to SS4O fields:
        - delete_entries:
            with_keys:
              - "logGroup"
              - "logStream"
              - "owner"

  4. Parse the log message fields to allow structured and JSON data to be more searchable in the OpenSearch indexes using the grok and parse_json processors:

Grok processors use pattern matching to parse data from structured text fields. For examples of built-in Grok patterns, see java-grok patterns and dataprepper grok patterns.

    # Use Grok parser to parse non-JSON apache logs
    - grok:
        grok_when: "/data_stream/dataset == \"apache\""
        match:
          message: ['%{COMMONAPACHELOG_DATATYPED}']
        target_key: "http"
    # Attempt to parse the log data as JSON to support field-level searches in the OpenSearch index:
    - parse_json:
        # Parse root message object into aws.cloudtrail to match SS4O standard for SS4O logs
        source: "message"
        destination: "aws/cloudtrail"
        parse_when: "/data_stream/dataset == \"cloudtrail\""
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse root message object as JSON when possible for Lambda function logs - can also set up Grok support
        # for Lambda function logs to capture non-JSON logging function data as searchable fields
        source: "message"
        destination: "aws/lambda"
        parse_when: "/data_stream/dataset == \"lambda\""
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse root message object as JSON when possible for general logs
        source: "message"
        destination: "body"
        parse_when: "/data_stream/dataset == \"general\""
        tags_on_failure: ["json_parse_fail"]
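
To reason about what these processors do to a single event, it can help to mirror the main steps in plain code. The following Python sketch is a simplified approximation, not the OpenSearch Ingestion implementation. It omits the date and grok processors, the Region, and the Kinesis metadata; it stores failure tags under a hypothetical tags key; and it treats non-object JSON as a parse failure, which is an assumption.

```python
import json


def classify_dataset(log_group):
    """Mirror the conditional add_entries: later matches overwrite earlier ones."""
    dataset = "general"
    if "cloudtrail" in log_group or "CloudTrail" in log_group:
        dataset = "cloudtrail"
    if "/aws/lambda/" in log_group:
        dataset = "lambda"
    if "/apache/" in log_group:
        dataset = "apache"
    return dataset


# Where parse_json writes the parsed message for each dataset
_JSON_DESTINATIONS = {
    "cloudtrail": ("aws", "cloudtrail"),
    "lambda": ("aws", "lambda"),
    "general": ("body",),
}


def transform(event):
    """Apply rename, add_entries, delete_entries, and parse_json to one event."""
    doc = dict(event)
    doc["observed_timestamp"] = doc.pop("timestamp")
    dataset = classify_dataset(doc["logGroup"])
    doc["data_stream"] = {"type": "logs", "dataset": dataset, "namespace": "default"}
    doc["cloud"] = {"provider": "aws", "account": {"id": doc.pop("owner")}}
    doc["aws"] = {
        "cloudwatch": {
            "log_group": doc.pop("logGroup"),
            "log_stream": doc.pop("logStream"),
        }
    }
    destination = _JSON_DESTINATIONS.get(dataset)
    if destination is None:
        return doc  # apache logs are handled by grok, which is not modeled here
    try:
        parsed = json.loads(doc["message"])
        if not isinstance(parsed, dict):
            raise ValueError("message is not a JSON object")
    except ValueError:
        doc["tags"] = ["json_parse_fail"]
        return doc
    target = doc
    for part in destination[:-1]:
        target = target.setdefault(part, {})
    target[destination[-1]] = parsed
    return doc
```

Running a few representative events from your own log groups through a sketch like this can help confirm the target index and field layout before you deploy the pipeline.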

When it’s all put together, your pipeline configuration will look like the following code:

version: "2"
kinesis-pipeline:
  source:
    kinesis-data-streams:
      acknowledgments: true
      codec:
        # JSON codec supports parsing nested CloudWatch events into
        # individual log entries that will be written as documents to
        # OpenSearch
        json:
          key_name: "logEvents"
          # These keys contain the metadata sent by CloudWatch Subscription Filters
          # in addition to the individual log events:
          # https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/SubscriptionFilters.html#DestinationKinesisExample
          include_keys: ['owner', 'logGroup', 'logStream' ]
      streams:
        # Update to use your Kinesis Stream name used in your Subscription Filters:
        - stream_name: "KINESIS_STREAM_NAME"
          # Can customize initial position if you don't want OSI to consume the entire stream:
          initial_position: "EARLIEST"
          # Compression will always be gzip for CloudWatch, but will vary for other sources:
          compression: "gzip"
      aws:
        # Provide the Role ARN with access to KDS. This role should have a trust relationship with osis-pipelines.amazonaws.com
        # This must be the same role used below in the Sink configuration.
        sts_role_arn: "PIPELINE_ROLE_ARN"
        # Provide the region of the Data Stream.
        region: "REGION"
        
  #  Processor logic is used to change how log data is parsed for OpenSearch.
  processor:
    - rename_keys:
        entries:
        # Include CloudWatch timestamp as the observation timestamp - the time the log
        # was generated and sent to CloudWatch:
        - from_key: "timestamp"
          to_key: "observed_timestamp"
    - date:
        # Include the current timestamp that OSI processed the log event:
        from_time_received: true
        destination: "processed_timestamp"
    - add_entries:
        entries:
        # Support SS4O common log fields (https://opensearch.org/docs/latest/observing-your-data/ss4o/)
        - key: "cloud/provider"
          value: "aws"
        - key: "cloud/account/id"
          format: "${owner}"
        - key: "cloud/region"
          value: "us-west-2"
        - key: "aws/cloudwatch/log_group"
          format: "${logGroup}"
        - key: "aws/cloudwatch/log_stream"
          format: "${logStream}"
        # Include default values for the data_stream:
        - key: "data_stream/namespace"
          value: "default"
        - key: "data_stream/type"
          value: "logs"
        - key: "data_stream/dataset"
          value: "general"
        # Include metadata about the source Kinesis message that contained this log event:
        - key: "aws/kinesis/stream_name"
          value_expression: "getMetadata(\"stream_name\")"
        - key: "aws/kinesis/partition_key"
          value_expression: "getMetadata(\"partition_key\")"
        - key: "aws/kinesis/sequence_number"
          value_expression: "getMetadata(\"sequence_number\")"
        - key: "aws/kinesis/sub_sequence_number"
          value_expression: "getMetadata(\"sub_sequence_number\")"
    - add_entries:
        entries:
        # Update the data_stream fields based on the log event context - in this case
        # classifying the log events by their source (CloudTrail or Lambda).
        # Additional logic could be added to classify the logs by business or application context:
        - key: "data_stream/dataset"
          value: "cloudtrail"
          add_when: "contains(/logGroup, \"cloudtrail\") or contains(/logGroup, \"CloudTrail\")"
          overwrite_if_key_exists: true
        - key: "data_stream/dataset"
          value: "lambda"
          add_when: "contains(/logGroup, \"/aws/lambda/\")"
          overwrite_if_key_exists: true
        - key: "data_stream/dataset"
          value: "apache"
          add_when: "contains(/logGroup, \"/apache/\")"
          overwrite_if_key_exists: true
    # Remove the default CloudWatch fields, as we re-mapped them to SS4O fields:
    - delete_entries:
        with_keys:
          - "logGroup"
          - "logStream"
          - "owner"
    # Use Grok parser to parse non-JSON apache logs
    - grok:
        grok_when: "/data_stream/dataset == \"apache\""
        match:
          message: ['%{COMMONAPACHELOG_DATATYPED}']
        target_key: "http"
    # Attempt to parse the log data as JSON to support field-level searches in the OpenSearch index:
    - parse_json:
        # Parse root message object into aws.cloudtrail to match SS4O standard for SS4O logs
        source: "message"
        destination: "aws/cloudtrail"
        parse_when: "/data_stream/dataset == \"cloudtrail\""
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse root message object as JSON when possible for Lambda function logs - can also set up Grok support
        # for Lambda function logs to capture non-JSON logging function data as searchable fields
        source: "message"
        destination: "aws/lambda"
        parse_when: "/data_stream/dataset == \"lambda\""
        tags_on_failure: ["json_parse_fail"]
    - parse_json:
        # Parse root message object as JSON when possible for general logs
        source: "message"
        destination: "body"
        parse_when: "/data_stream/dataset == \"general\""
        tags_on_failure: ["json_parse_fail"]

  sink:
    - opensearch:
        # Provide an AWS OpenSearch Service domain endpoint
        hosts: [ "OPENSEARCH_ENDPOINT" ]
        # Route log data to different target indexes depending on the log context:
        index: "ss4o_${data_stream/type}-${data_stream/dataset}-${data_stream/namespace}"
        aws:
          # Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          # This role must be the same as the role used above for Kinesis.
          sts_role_arn: "PIPELINE_ROLE_ARN"
          # Provide the region of the domain.
          region: "REGION"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          serverless: false

  1. When your configuration is complete, choose Validate pipeline to check your pipeline syntax for errors.
  2. In the Pipeline role section, optionally enter a suffix to create a unique service role that will be used to start your pipeline run.
  3. In the Network section, select VPC access.

For a Kinesis Data Streams source, you don’t need to select a virtual private cloud (VPC), subnets, or security groups. OpenSearch Ingestion only requires these attributes for HTTP data sources that are located within a VPC. For Kinesis Data Streams, OpenSearch Ingestion uses AWS PrivateLink to read from Kinesis Data Streams and write to OpenSearch domains or serverless collections.

  1. Optionally, enable CloudWatch logging for your pipeline.
  2. Choose Next to review and create your pipeline.

If you’re using account-level subscription filters for CloudWatch Logs in the account where OpenSearch Ingestion is running, exclude the pipeline’s log group from the account-level subscription. Otherwise, the OpenSearch Ingestion pipeline logs could create a recursive loop with the subscription filter, leading to high volumes of log data ingestion and cost.

  1. In the Review and create section, choose Create pipeline.

When your pipeline enters the Active state, you’ll see logs begin to populate in your OpenSearch domain or serverless collection.
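
The console steps above can also be scripted. The following Python sketch fills the quoted placeholders in the pipeline configuration and builds a request for the OpenSearch Ingestion CreatePipeline API. The helper names, the MinUnits value of 1, and the choice to replace only quoted placeholders are assumptions for illustration; network and logging options are left at their defaults.

```python
PLACEHOLDERS = (
    "KINESIS_STREAM_NAME",
    "PIPELINE_ROLE_ARN",
    "REGION",
    "OPENSEARCH_ENDPOINT",
)


def render_pipeline_config(template, values):
    """Replace each quoted placeholder in the YAML text with its value."""
    missing = [name for name in PLACEHOLDERS if name not in values]
    if missing:
        raise KeyError(f"missing values for: {', '.join(missing)}")
    body = template
    for name in PLACEHOLDERS:
        # Only quoted occurrences are replaced, so comments stay untouched
        body = body.replace(f'"{name}"', f'"{values[name]}"')
    return body


def create_pipeline_request(name, config_body, shard_count):
    """Build create_pipeline arguments, sizing max capacity to the shard count."""
    return {
        "PipelineName": name,
        "MinUnits": 1,              # assumed minimum; tune for your workload
        "MaxUnits": shard_count,    # one OCU per shard, as described earlier
        "PipelineConfigurationBody": config_body,
    }


def create_pipeline(request):
    """Send the request; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above have no dependencies

    return boto3.client("osis").create_pipeline(**request)
```

Rendering the configuration separately lets you review the final YAML, or validate it on the console, before creating the pipeline.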

Monitor the solution

To maintain the health of the log ingestion pipeline, there are several key areas to monitor:

  • Kinesis Data Streams metrics – You should monitor the following metrics:
    • FailedRecords – Indicates an issue in CloudWatch subscription filters writing to the Kinesis data stream. Reach out to AWS Support if this metric stays at a non-zero level for a sustained period.
    • ThrottledRecords – Indicates your Kinesis data stream needs more shards to accommodate the log volume from CloudWatch.
    • ReadProvisionedThroughputExceeded – Indicates your Kinesis data stream has more consumers consuming read throughput than supplied by the shard limits, and you may need to move to an enhanced fan-out consumer strategy.
    • WriteProvisionedThroughputExceeded – Indicates your Kinesis data stream needs more shards to accommodate the log volume from CloudWatch, or that your log volume is being unevenly distributed to your shards. Make sure the subscription filter distribution strategy is set to random, and consider enabling enhanced shard-level monitoring on the data stream to identify hot shards.
    • RateExceeded – Indicates that a consumer is incorrectly configured for the stream, and there may be an issue in your OpenSearch Ingestion pipeline causing it to subscribe too often. Investigate your consumer strategy for the Kinesis data stream.
    • MillisBehindLatest – Indicates the enhanced fan-out consumer isn’t keeping up with the load in the data stream. Investigate the OpenSearch Ingestion pipeline OCU configuration and make sure there are sufficient OCUs to accommodate the Kinesis data stream shards.
    • IteratorAgeMilliseconds – Indicates the polling consumer isn’t keeping up with the load in the data stream. Investigate the OpenSearch Ingestion pipeline OCU configuration and make sure there are sufficient OCUs to accommodate the Kinesis data stream shards, and investigate the polling strategy for the consumer.
  • CloudWatch subscription filter metrics – You should monitor the following metrics:
    • DeliveryErrors – Indicates an issue in CloudWatch subscription filter delivering data to the Kinesis data stream. Investigate data stream metrics.
    • DeliveryThrottling – Indicates insufficient capacity in the Kinesis data stream. Investigate data stream metrics.
  • OpenSearch Ingestion metrics – For recommended monitoring for OpenSearch Ingestion, see Recommended CloudWatch alarms.
  • OpenSearch Service metrics – For recommended monitoring for OpenSearch Service, see Recommended CloudWatch alarms for Amazon OpenSearch Service.
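As a starting point, the Kinesis Data Streams checks above could be expressed as CloudWatch alarm definitions. The following Python sketch only builds the alarm parameters. The stream name, thresholds, and alarm names are hypothetical, and it assumes the stream-level metric names that Kinesis Data Streams publishes in the AWS/Kinesis namespace (iterator age, for example, is published as GetRecords.IteratorAgeMilliseconds).

```python
from typing import Any

# Hypothetical thresholds; tune them for your own log volume.
KINESIS_ALARMS = [
    # (CloudWatch metric name, statistic, threshold)
    ("WriteProvisionedThroughputExceeded", "Sum", 0),
    ("ReadProvisionedThroughputExceeded", "Sum", 0),
    ("GetRecords.IteratorAgeMilliseconds", "Maximum", 60_000),
]


def build_alarm_definitions(stream_name: str) -> list[dict[str, Any]]:
    """Return keyword arguments for CloudWatch put_metric_alarm, one per metric."""
    alarms = []
    for metric, statistic, threshold in KINESIS_ALARMS:
        alarms.append({
            "AlarmName": f"{stream_name}-{metric}",
            "Namespace": "AWS/Kinesis",
            "MetricName": metric,
            "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
            "Statistic": statistic,
            "Period": 300,             # evaluate in 5-minute windows
            "EvaluationPeriods": 3,    # alarm after 3 consecutive breaches
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",
        })
    return alarms


if __name__ == "__main__":
    for alarm in build_alarm_definitions("cloudwatch-logs-stream"):
        print(alarm["AlarmName"], alarm["Threshold"])
        # With boto3 installed and credentials configured, each definition
        # could be applied with:
        #   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```

Keeping the definitions as plain data makes it easier to review thresholds before creating any alarms.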

Clean up

Make sure you clean up unwanted AWS resources created while following this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account:

  1. Delete your Kinesis data stream.
  2. Delete your OpenSearch Service domain.
  3. Use the DeleteAccountPolicy API to remove your account-level CloudWatch subscription filter.
  4. Delete your log group-level CloudWatch subscription filter:
    1. On the CloudWatch console, select the desired log group.
    2. On the Actions menu, choose Subscription Filters and Delete all subscription filter(s).
  5. Delete the OpenSearch Ingestion pipeline.

Conclusion

In this post, you learned how to create a serverless ingestion pipeline to deliver CloudWatch logs in real time to an OpenSearch domain or serverless collection using OpenSearch Ingestion. You can use this approach for a variety of real-time data ingestion use cases, and add it to existing workloads that use Kinesis Data Streams for real-time data analytics.

For other use cases for OpenSearch Ingestion and Kinesis Data Streams, consider the following:

To continue improving your log analytics use cases in OpenSearch, consider using some of the pre-built dashboards available in Integrations in OpenSearch Dashboards.


About the authors

M Mehrtens has been working in distributed systems engineering throughout their career, working as a Software Engineer, Architect, and Data Engineer. In the past, M has supported and built systems to process terabytes of streaming data at low latency, run enterprise machine learning pipelines, and share data across teams seamlessly with varying data toolsets and software stacks. At AWS, they are a Sr. Solutions Architect supporting US Federal Financial customers.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Achieve data resilience using Amazon OpenSearch Service disaster recovery with snapshot and restore

Post Syndicated from Samir Patel original https://aws.amazon.com/blogs/big-data/achieve-data-resilience-using-amazon-opensearch-service-disaster-recovery-with-snapshot-and-restore/

Amazon OpenSearch Service is a fully managed service offered by AWS that enables you to deploy, operate, and scale OpenSearch domains effortlessly. OpenSearch is a distributed search and analytics engine, which is an open-source project. OpenSearch Service seamlessly integrates with other AWS offerings, providing a robust solution for building scalable and resilient search and analytics applications in the cloud.

Disaster recovery is vital for organizations, offering a proactive strategy to mitigate the impact of unforeseen events like system failures, natural disasters, or cyberattacks.

In Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud, we introduced four major strategies for disaster recovery (DR) on AWS. These strategies enable you to prepare for and recover from a disaster. By using the best practices provided in the AWS Well-Architected Reliability Pillar to design your DR strategy, your workloads can remain available despite disaster events such as natural disasters, technical failures, or human actions. OpenSearch Service provides various DR solutions, including active-passive and active-active approaches. This post focuses on introducing an active-passive approach using a snapshot and restore strategy.

Snapshot and restore in OpenSearch Service

The snapshot and restore strategy in OpenSearch Service involves creating point-in-time backups, known as snapshots, of your OpenSearch domain. These snapshots capture the entire state of the domain, including indexes, mappings, and settings. In the event of data loss or system failure, these snapshots will be used to restore the domain to a specific point in time. Implementing a snapshot and restore strategy helps organizations meet Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs), providing minimal data loss and rapid system recovery in case of disasters.

Compared with other DR strategies, snapshot and restore results in longer downtime and greater data loss between the disaster event and recovery. However, it can still be the right strategy for your workload because it is the most straightforward and least expensive strategy to implement. Additionally, not all workloads require an RTO and RPO of minutes or less.

Solution overview

The following architecture diagram illustrates how manual snapshots are taken from the OpenSearch Service domain in the primary AWS Region and stored in an Amazon Simple Storage Service (Amazon S3) bucket in the secondary Region.

We walk through each step and discuss scenarios for failing over to the OpenSearch Service domain in the secondary Region in the event of a disaster in the primary Region, as well as how to fail back to the OpenSearch Service domain to resume operations in the primary Region.

The workflow consists of the following initial steps:

  1. OpenSearch Service is hosted in the primary Region, and all the active traffic is routed to the OpenSearch Service domain in the primary Region.
  2. The manual snapshots from the OpenSearch Service domain in the primary Region are transferred to the S3 bucket in the secondary Region on a predefined schedule.

This process can be programmatically scheduled using an AWS Lambda function, as described in Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service. This gives you the most effective protection from disasters of any scope of impact. In the event of a disaster in the primary Region, in addition to OpenSearch data recovery from backup, you must also be able to restore your infrastructure in the secondary Region. Infrastructure as code (IaC) methods such as using AWS CloudFormation or the AWS Cloud Development Kit (AWS CDK) enable you to deploy consistent infrastructure across Regions.

The following diagram illustrates the architecture in the event of a disaster.

The workflow consists of the following steps:

  1. In the event of a disaster making the OpenSearch Service domain in the primary Region unavailable, all active traffic routed to the primary Region’s OpenSearch Service domain will cease.
  2. When the OpenSearch Service domain becomes unavailable, the manual snapshots to Amazon S3 will no longer be taken at the predefined intervals.
  3. To fail over, launch the OpenSearch Service domain in the secondary Region using IaC. Restore manual snapshots from the S3 bucket in the secondary Region to the OpenSearch Service domain in the secondary Region. For log workloads, restore only recent or relevant logs to save time, and use this opportunity to purge unnecessary documents or indexes.
  4. Update the DNS controller (Amazon Route 53) to redirect traffic to the OpenSearch Service domain in the secondary Region.
  5. When the primary Region becomes available, set up manual snapshots from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region.

The following diagram illustrates the architecture after the primary Region becomes available.

The workflow consists of the following steps:

  1. When the primary Region becomes available again, destroy the existing OpenSearch domain in the primary Region. Launch a new OpenSearch Service domain in the primary Region.
  2. Restore manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain created in the previous step.
  3. Update Route 53 to redirect traffic to the new OpenSearch Service domain in the primary Region.
  4. Set up manual snapshots from the new OpenSearch Service domain in the primary Region to a new prefix in the S3 bucket in the secondary Region.
  5. After successfully failing back to the OpenSearch Service domain in the primary Region, destroy the OpenSearch Service domain in the secondary Region.

In this post, we demonstrate how to launch an OpenSearch Service domain in the primary Region and set up manual snapshots to an S3 bucket in the secondary Region. Then we simulate a failover to resume operations using the OpenSearch Service domain in the secondary Region in the event of a disaster. Finally, we illustrate the failback mechanism by reverting to the OpenSearch Service domain in the primary Region.

Regular operations

In this section, we discuss the regular operations to set up the solution architecture.

Launch an OpenSearch Service domain in the primary Region

Create an OpenSearch Service domain in the primary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode. Create indexes and populate them with documents.

Create an S3 bucket in the secondary Region

To store OpenSearch snapshots in the secondary Region, you need to create an S3 bucket in that Region. For instructions, see Creating a bucket.

Create the snapshot IAM role

The snapshot AWS Identity and Access Management (IAM) role is necessary to grant permissions specifically for managing snapshots within the OpenSearch Service domain. For instructions, see Creating an IAM role (console). We refer to this role as TheSnapshotRole in this post.

  1. Attach the following IAM policy to TheSnapshotRole:
    {
      "Version": "2012-10-17",
      "Statement": [{
          "Action": [
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name"
          ]
        },
        {
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name/*"
          ]
        }
      ]
    }

  2. Edit the trust relationship of TheSnapshotRole to specify OpenSearch Service in the Principal statement, as shown in the following example:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "",
    "Effect": "Allow",
    "Principal": {
      "Service": "es.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}

To register the snapshot repository, you need to be able to pass TheSnapshotRole to OpenSearch Service. You also need access to the es:ESHttpPut action.

  3. To grant both of these permissions, attach the following policy to the IAM role whose credentials are being used to sign the request:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::123456789012:role/TheSnapshotRole"
    },
    {
      "Effect": "Allow",
      "Action": "es:ESHttpPut",
      "Resource": "arn:aws:es:region:123456789012:domain/domain-name/*"
    }
  ]
}
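If you manage several buckets or domains, it can help to generate these policy documents rather than edit them by hand. The following Python sketch builds the same two permissions policies shown above from parameters. The function names and the sample bucket, account, and domain values are hypothetical.

```python
import json


def snapshot_role_policy(bucket_name: str) -> dict:
    """Permissions policy for TheSnapshotRole, scoped to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Action": ["s3:ListBucket"], "Effect": "Allow", "Resource": [bucket_arn]},
            # Object operations apply to the keys inside the bucket.
            {
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Effect": "Allow",
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


def caller_policy(account_id: str, region: str, domain_name: str,
                  role_name: str = "TheSnapshotRole") -> dict:
    """Policy for the identity that signs the repository registration request."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
            },
            {
                "Effect": "Allow",
                "Action": "es:ESHttpPut",
                "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(snapshot_role_policy("s3-bucket-name"), indent=2))
    print(json.dumps(caller_policy("123456789012", "us-east-1", "domain-name"), indent=2))
```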

Associate the IAM role or user to the OpenSearch security role for manual snapshots

Fine-grained access control introduces an additional step when registering a repository. Even if you use HTTP basic authentication for all other purposes, you need to map the manage_snapshots role to your IAM role that has iam:PassRole permissions to pass TheSnapshotRole. Snapshots can only be taken by a process or user associated with an IAM identity. This makes sure only authorized entities can create, manage, or restore snapshots.

One such method is to use Amazon Cognito. With Amazon Cognito, users can sign in with IAM credentials indirectly, either using proxy mapping with SAML or through user pool credentials. This setup provides a secure way to manage access while using the capabilities of IAM. The preferred method is to use a process that signs requests with AWS SigV4. This approach involves programmatically signing each request to OpenSearch with the appropriate IAM credentials, making sure only authorized processes can manage snapshots. This method is recommended because it provides a higher level of security and can be automated using Lambda functions as part of your backup and DR workflows.

  1. On OpenSearch Dashboards, navigate to the main menu and choose Security.
  2. Choose Roles and search for the manage_snapshots role.
  3. Choose Mapped users and choose Manage mappings.
  4. Add the Amazon Resource Name (ARN) of the IAM role that has permissions to pass TheSnapshotRole to the backend roles.

Register a snapshot repository on the OpenSearch Service domain

To register a snapshot repository, send a signed PUT request to the OpenSearch Service domain endpoint using curl, Postman, an integrated development environment (IDE) such as PyCharm or VS Code, or another method. Using a PUT request in OpenSearch Dashboards for repository registration is not supported. For more details, see Using OpenSearch Dashboards with Amazon OpenSearch Service.

The curl command is as follows:

curl --aws-sigv4 "aws:amz:us-east-1:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

Use the curl command to register a snapshot repository in the OpenSearch Service domain in the primary Region pointing to the S3 bucket in the secondary Region.
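If you prefer to send the signed request from a script, for example as part of a Lambda-based workflow, the following Python sketch shows one possible approach. It is not the only way to sign requests: it assumes boto3 and botocore are installed and that credentials for the IAM role with the permissions described earlier are available. The function names are hypothetical, and the request body mirrors the curl example.

```python
import json
import urllib.request


def build_repository_request(domain_endpoint: str, repository_name: str,
                             bucket_name: str, role_arn: str) -> tuple[str, bytes]:
    """Return the URL and JSON body for the repository registration request."""
    url = f"https://{domain_endpoint}/_snapshot/{repository_name}"
    body = json.dumps({
        "type": "s3",
        "settings": {
            "bucket": bucket_name,
            "endpoint": "s3.amazonaws.com",
            "role_arn": role_arn,
        },
    }).encode("utf-8")
    return url, body


def register_repository(url: str, body: bytes, domain_region: str) -> tuple[int, str]:
    """Sign the PUT request with SigV4 and send it.

    domain_region must be the Region of the OpenSearch Service domain.
    Requires boto3/botocore and valid AWS credentials at call time.
    """
    import boto3
    from botocore.auth import SigV4Auth
    from botocore.awsrequest import AWSRequest

    credentials = boto3.Session().get_credentials()
    aws_request = AWSRequest(method="PUT", url=url, data=body,
                             headers={"Content-Type": "application/json"})
    # "es" is the SigV4 service name used in the curl example above.
    SigV4Auth(credentials, "es", domain_region).add_auth(aws_request)

    signed = urllib.request.Request(url, data=body, method="PUT",
                                    headers=dict(aws_request.headers.items()))
    with urllib.request.urlopen(signed) as response:
        return response.status, response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder values; replace them before calling register_repository.
    url, body = build_repository_request(
        "DOMAIN_ENDPOINT", "os-snapshot-repo", "BUCKET_NAME", "ROLE_ARN")
    print(url)
    print(body.decode("utf-8"))
```

Separating the request construction from the signing step lets you inspect the URL and body before anything is sent to the domain.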

To verify the snapshot repository creation, run the following query:

GET /_snapshot/os-snapshot-repo

Take manual snapshots

To take a manual snapshot, perform the following steps from OpenSearch Dashboards. To include or exclude certain indexes and specify other settings, add a request body. For the request structure, see Take snapshots in the OpenSearch documentation.

  1. To create a manual snapshot, use the following query. In this query, the repository name is os-snapshot-repo and the snapshot name is 2023-11-18.

PUT /_snapshot/os-snapshot-repo/2023-11-18

  2. Verify that the snapshot was created and check which indexes it includes:

GET /_snapshot/os-snapshot-repo/_all

  3. Schedule your manual snapshot at a defined interval (for example, every 1 hour) based on your RPO requirements.

You can schedule this by creating an Amazon EventBridge rule to invoke a Lambda function every hour. For instructions, see Tutorial: Create an EventBridge scheduled rule for AWS Lambda functions. The Lambda function will transfer incremental manual snapshots into Amazon S3. For more information, see Unleash the power of Snapshot Management to take automated snapshots using Amazon OpenSearch Service.
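When snapshots run on a schedule, each one needs a unique name. The following Python sketch shows one hypothetical naming convention for an hourly schedule: it extends the date-based name used earlier (2023-11-18) with the UTC hour and returns the request path that a scheduled function could send as a signed PUT. The function name and the convention itself are assumptions, not part of the referenced Lambda solution.

```python
from datetime import datetime, timezone


def snapshot_request_path(repository: str, taken_at: datetime) -> str:
    """Build the PUT path for a snapshot named after the UTC date and hour."""
    # Normalize to UTC so names sort chronologically regardless of where
    # the scheduler runs.
    name = taken_at.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")
    return f"/_snapshot/{repository}/{name}"


if __name__ == "__main__":
    print(snapshot_request_path("os-snapshot-repo", datetime.now(timezone.utc)))
```

If your RPO calls for a different interval, adjust the timestamp format so that two consecutive runs can never produce the same name.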

Failover scenario

In a disaster, if your OpenSearch Service domain in the primary Region goes down, you can fail over to a domain in the secondary Region. This provides business continuity and minimizes downtime during unexpected Region failures.

To maintain business continuity during a disaster, you can use message queues like Amazon Simple Queue Service (Amazon SQS) and streaming solutions like Apache Kafka or Amazon Kinesis. These tools buffer incoming data in the primary Region, allowing you to replay traffic from a predefined period in the secondary Region when you fail over, which keeps the OpenSearch Service domain up to date with all recent changes.

Launch an OpenSearch Service domain in the secondary Region

Create an OpenSearch Service domain in the secondary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control enabled. Do not enable standby mode.

Depending on your RTO requirements, you can keep the OpenSearch Service domain in the secondary Region up and running if you have an RTO of less than 1 hour. However, it will incur additional costs. If you have an RTO of more than 1 hour, you can launch a new OpenSearch Service domain in the secondary Region during the failover activity to reduce operational costs.

Associate the IAM role or user to the OpenSearch security role for manual snapshots

Follow the instructions in the previous section to associate the IAM role with the OpenSearch security role.

Register a snapshot repository on the OpenSearch Service domain

To make sure your data is available for failover, you need to register a snapshot repository on the OpenSearch Service domain in the secondary Region. The snapshots taken from your OpenSearch Service domain in the primary Region can be restored. Use the following command:

curl --aws-sigv4 "aws:amz:us-west-2:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

The S3 bucket should be the bucket created in the secondary Region where the snapshots from your OpenSearch Service domain in the primary Region are stored.

Restore snapshots

Before you restore a snapshot, make sure that the destination domain doesn’t use Multi-AZ with standby.

After you register the snapshot repository on your OpenSearch Service domain in the secondary Region, the next step is to restore the desired indexes from the snapshot repository. This step makes sure your data is available in the OpenSearch Service domain in the secondary Region. You can selectively restore specific indexes from your snapshot, which gives you the flexibility to recover only the necessary data. Use the following command:

POST /_snapshot/<REPOSITORY_NAME>/<SNAPSHOT_NAME>/_restore
{
"indices": "movie-index"
}
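For log workloads, where you might restore only recent indexes, you can build the restore request body programmatically. The following Python sketch assumes a hypothetical naming convention in which daily indexes end with a date, such as logs-2023.11.18. It selects the indexes that fall within a retention window and joins them into the comma-separated indices value used by the restore request.

```python
from datetime import date, datetime, timedelta


def recent_indices(index_names: list[str], prefix: str,
                   days: int, today: date) -> list[str]:
    """Select indexes named <prefix>YYYY.MM.DD that are at most `days` old."""
    cutoff = today - timedelta(days=days)
    selected = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            index_day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            # Skip indexes that don't follow the assumed date suffix.
            continue
        if index_day >= cutoff:
            selected.append(name)
    return sorted(selected)


def restore_body(indices: list[str]) -> dict:
    """Build the JSON body for POST /_snapshot/<repo>/<snapshot>/_restore."""
    return {"indices": ",".join(indices)}


if __name__ == "__main__":
    names = ["logs-2023.11.10", "logs-2023.11.17", "logs-2023.11.18", "movie-index"]
    print(restore_body(recent_indices(names, "logs-", 2, date(2023, 11, 18))))
```

The list of index names would typically come from the snapshot details returned by the GET /_snapshot request shown earlier.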

Verify that all the necessary indexes have been restored to the OpenSearch Service domain in the secondary Region.

Update Route 53 to redirect traffic to the OpenSearch Service domain in the secondary Region

After you restore the snapshots to the OpenSearch Service domain in the secondary Region, update the DNS settings (Route 53) with the new OpenSearch Service domain endpoint to redirect indexing traffic to the OpenSearch Service domain in the secondary Region. Route 53, a scalable DNS service, can seamlessly redirect traffic to the new OpenSearch endpoint by updating its DNS records.

A Route 53 resource record set directs internet traffic to specific resources, such as an OpenSearch Service domain. It includes a domain name, a record type (for example, CNAME), and the DNS name or IP address of the endpoint. To redirect traffic to a new endpoint, update or create a new record set.
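As an illustration, the record update can be described as a Route 53 change batch. The following Python sketch builds an UPSERT for a CNAME record. The record name, endpoint value, and TTL are hypothetical, and the sketch only constructs the request payload rather than calling Route 53.

```python
def build_cname_change(record_name: str, target_endpoint: str, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that points a CNAME at a new endpoint."""
    return {
        "Comment": "Redirect traffic to the active OpenSearch Service domain",
        "Changes": [{
            # UPSERT creates the record if it is missing and updates it otherwise.
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target_endpoint}],
            },
        }],
    }


if __name__ == "__main__":
    batch = build_cname_change(
        "search.example.com",
        "search-secondary-domain.us-west-2.es.amazonaws.com",
    )
    print(batch)
    # With boto3 installed and credentials configured, the change could be
    # submitted with:
    #   boto3.client("route53").change_resource_record_sets(
    #       HostedZoneId="YOUR_HOSTED_ZONE_ID", ChangeBatch=batch)
```

A short TTL reduces how long clients keep resolving the old endpoint after a failover, at the cost of more DNS queries.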

Set up manual snapshots from the OpenSearch Service domain in the secondary Region to the Amazon S3 bucket in the primary Region

Complete the following steps to set up manual snapshots from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region:

  1. Create an S3 bucket in the primary Region, following the steps from earlier in this post.
  2. Associate the IAM role or user to the OpenSearch security role for taking manual snapshots in your OpenSearch Service domain in the secondary Region. For instructions, refer to the earlier section in this post.
  3. Register a snapshot repository on the OpenSearch Service domain in the secondary Region pointing to the S3 bucket in the primary Region. For instructions, refer to the earlier section in this post.
  4. Take manual snapshots of the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region, following the instructions from earlier in this post.
  5. Schedule your manual snapshot from the OpenSearch Service domain in the secondary Region to the S3 bucket in the primary Region at a defined interval (for example, every 1 hour) based on your RPO requirements.

Failback scenario

When the primary Region becomes available again, you can seamlessly revert to the OpenSearch Service domain in the primary Region. This failback process involves the following steps.

Destroy an existing OpenSearch Service domain in the primary Region

When the primary Region becomes available again, destroy the existing OpenSearch Service domain in the primary Region from the OpenSearch Service console. In the following screenshot, the primary Region is US East (N. Virginia).

Launch a new OpenSearch Service domain in the primary Region

Create an OpenSearch Service domain in the primary Region by following the instructions in Creating and managing Amazon OpenSearch Service domains with fine-grained access control. Do not enable standby mode.

Associate the IAM role or user to the OpenSearch security role for restoring manual snapshots

Follow the instructions from earlier in this post to associate the IAM role or user to the OpenSearch security role.

Register a snapshot repository on the OpenSearch Service domain

To make sure your data is available for failback, you need to register a snapshot repository on the new OpenSearch Service domain in the primary Region. The snapshots taken from your OpenSearch Service domain in the secondary Region can then be restored. Use the following command:

curl --aws-sigv4 "aws:amz:us-east-1:es" --user "ACCESS_KEY:SECRET_KEY" -XPUT "https://DOMAIN_ENDPOINT/_snapshot/REPOSITORY_NAME" -H 'Content-Type: application/json' -d '{ "type": "s3", "settings": { "bucket": "BUCKET_NAME", "endpoint": "s3.amazonaws.com", "role_arn": "ROLE_ARN" }}'

The S3 bucket should be the bucket created in the primary Region where the snapshots from your OpenSearch Service domain in the secondary Region are stored.

Restore manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain in the primary Region

To restore the manual snapshots, complete the following steps:

  1. Use the following code to restore the manual snapshots from the S3 bucket in the primary Region to the new OpenSearch Service domain in the primary Region:

POST /_snapshot/os-snapshot-repo/2023-11-18/_restore
{
"indices": "movie-index"
}

  2. Verify data integrity and make sure the primary domain is up to date by checking the document count of the index:

GET movie-index/_count

  3. Update Route 53 to redirect traffic to the new OpenSearch Service domain in the primary Region.
  4. Set up manual snapshots from the new OpenSearch Service domain in the primary Region to a new prefix in the S3 bucket in the secondary Region.
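The document count check in the preceding steps can be automated when you restore many indexes. The following Python sketch compares counts recorded on the secondary domain before failback with counts from the new primary domain and reports any differences. The function name and the sample numbers are hypothetical; in practice, the values would come from a _count request against each domain.

```python
from typing import Optional


def count_mismatches(source_counts: dict[str, int],
                     restored_counts: dict[str, int]) -> dict[str, tuple[int, Optional[int]]]:
    """Return {index: (expected, actual)} for indexes whose counts differ.

    An index that is missing from the restored domain is reported with
    an actual count of None.
    """
    mismatches = {}
    for index, expected in source_counts.items():
        actual = restored_counts.get(index)
        if actual != expected:
            mismatches[index] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    secondary = {"movie-index": 5000, "logs-2023.11.18": 120}
    primary = {"movie-index": 5000}
    print(count_mismatches(secondary, primary))
```

An empty result suggests the restore is complete for the indexes you checked. Matching counts don't prove the documents are identical, so treat this as a quick sanity check rather than full validation.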

Destroy the OpenSearch Service domain in the secondary Region

After you have successfully failed back to the OpenSearch Service domain in the primary Region, destroy the OpenSearch Service domain in the secondary Region. In the following screenshot, the secondary Region is US West (Oregon).

Conclusion

In this post, we explained how you can implement a DR pattern on OpenSearch Service using a snapshot and restore strategy. It’s highly recommended to define your RPO and RTO for your workload and choose an appropriate DR strategy. Then, using AWS services, you can design an architecture that achieves the RTO and RPO for your business needs.


About the Authors

Samir Patel is a Senior Data Architect at Amazon Web Services, where he specializes in OpenSearch, data analytics, and cutting-edge generative AI technologies. Samir works directly with enterprise customers to design and build customized solutions catered to their data analytics and cybersecurity needs. When not immersed in technical work, Samir pursues his passion for outdoor activities, including hiking, pickleball, and grilling with family and friends.

Sesha Sanjana Mylavarapu is an Associate Data Lake Consultant at AWS Professional Services. She specializes in cloud-based data management and collaborates with enterprise clients to design and implement scalable data lakes. She has a strong interest in data analytics and enjoys helping customers solve their business and technical challenges. Beyond her professional pursuits, Sanjana enjoys hiking and playing guitar, and is passionate about teaching yoga.

Vivek Gautam is a Senior Data Architect with specialization in data analytics at AWS Professional Services. He works with enterprise customers building data products, analytics platforms, streaming, and search solutions on AWS. When not building and designing data products, Vivek is a food enthusiast who also likes to explore new travel destinations and go on hikes.

AWS Weekly Roundup: 20 years of AWS News Blog, Express brokers for Amazon MSK, Windows Server 2025 images on EC2, and more (Nov 11, 2024)

Post Syndicated from Channy Yun (윤석찬) original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-20-years-of-aws-news-blog-express-brokers-for-amazon-msk-windows-server-2025-images-on-ec2-and-more-nov-11-2024/

Happy 20th Anniversary of the AWS News Blog! 🎉🥳🎊 On November 9, 2004, Jeff Barr published his first blog post. At the time, he started a personal blog site using TypePad. He wanted to speak to his readers with his personal voice, not the company or team.

On April 29, 2014, we created a new AWS blog site and migrated all posts to that page. There are currently over 4,300 posts on the AWS News Blog, with Jeff contributing over 3,200 of them.

Since December 2016, the AWS News Blog has added new writers, but we are still following Jeff’s leadership principles for AWS News Bloggers in accordance with Day One. What’s unique about the AWS News Blog is that the blog writers get to use the product teams’ features in advance, following the Customer Obsession leadership principle, and focus on walk-throughs of how customers can quickly use them to save time, in line with the Frugality principle.

I am very grateful for Jeff’s fundamental and pivotal role over the past 20 years, and I look forward to the next 20 years!

Last week’s launches
Here are some launches that got my attention:

New Express brokers for Amazon MSK – Express brokers are a new broker type for Amazon MSK Provisioned designed to deliver up to three times more throughput per broker, scale up to 20 times faster, and reduce recovery time by 90 percent as compared to standard Apache Kafka brokers. Express brokers come preconfigured with Kafka best practices by default, support all Kafka APIs, and provide the same low-latency performance, so you can continue using existing client applications without any changes.

New Amazon Kinesis Client Library 3.0 – You can now reduce compute costs to process streaming data by up to 33 percent with Kinesis Client Library (KCL) 3.0, compared to previous KCL versions. KCL 3.0 introduces an enhanced load balancing algorithm that continuously monitors resource utilization of the stream processing workers and automatically redistributes the load from overutilized workers to other underutilized workers. To learn more, read the AWS Big Data Blog post.

Microsoft Windows Server 2025 images on Amazon EC2 – We now support Microsoft Windows Server 2025 with License Included (LI) Amazon Machine Images (AMIs), providing customers with an easy and flexible way to launch the latest version of Windows Server. By running Windows Server 2025 on Amazon EC2, customers can take advantage of the security, performance, and reliability of AWS with the latest Windows Server features. To learn more about running Windows Server 2025 on Amazon EC2, visit Windows Workloads on AWS.

Anthropic’s Claude 3.5 Haiku model in Amazon Bedrock – Claude 3.5 Haiku is the next generation of Anthropic’s fastest model, combining rapid response times with improved reasoning capabilities, making it ideal for tasks that require both speed and intelligence. Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus, the largest model in Anthropic’s previous generation, on many intelligence benchmarks—including coding. To learn more, read the AWS News Blog post.

Amazon Bedrock Prompt Management GA – You can simplify the creation, testing, versioning, and sharing of prompts in Amazon Bedrock Prompt Management. At general availability, we added new features that provide enhanced options for configuring your prompts and enabling seamless integration for invoking them in your generative AI applications, such as structured prompts and Converse and InvokeModel API integration. To learn more, read the AWS Machine Learning blog post.

Six new synthetic generative voices for Amazon Polly – The generative engine is Amazon Polly’s most advanced text-to-speech (TTS) model, built on generative AI technology. We added six new synthetic female-sounding generative voices: Ayanda (South African English), Léa (French), Lucia (European Spanish), Lupe (American Spanish), Mía (Mexican Spanish), and Vicki (German). This extends the generative engine to thirteen voices and nine locales, giving you more options for highly expressive and engaging voices.

Amazon OpenSearch Service Extended Support – We announced the end of Standard Support and Extended Support timelines for legacy Elasticsearch versions and OpenSearch versions. Standard Support ends on Nov 7, 2025, for legacy Elasticsearch versions up to 6.7, Elasticsearch versions 7.1 through 7.8, OpenSearch versions from 1.0 through 1.2, and OpenSearch versions 2.3 through 2.9. With Extended Support, for an incremental flat fee over regular instance pricing, you continue to get critical security updates beyond the end of Standard Support. To learn more, read the AWS Big Data Blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items that you might find interesting:

CEO’s visit to an AWS data center – Matt Garman, CEO of AWS, had a great time visiting one of our AWS data centers recently, and was able to get a look at the continuous innovation delivered by the team. Of course, it’s no surprise that Amazon’s senior executives visit fulfillment centers, contact centers, and data centers to do real work for customers. AWS data centers are designed for customers in every aspect, for maximum resilience, performance, and energy efficiency.

AWS supports small businesses, creates jobs, sets up sustainability initiatives, and develops educational programs near AWS data centers. Get the latest updates – AWS in your community: Here’s what’s happening near data centers across the US on About Amazon News.

Amazon Q Business at Amazon – I introduced an Amazon story about using code transformation in Amazon Q Developer to migrate more than 30,000 old Java applications to Java 17. It saved over 4,500 developer years of effort compared to previous manual work and saved the company $260 million annually by moving to the latest Java version.

Here is another dogfooding story of Amazon Q Business at Amazon. Amazon built an internal chatbot with Amazon Q Business and it has resolved over 1 million internal Amazon developer questions, reducing time spent churning on manual technical investigations by more than 450,000 hours.

Our team onboarded Amazon Q Business with millions of internal documents and integrated Q Business into the tools our team uses every day. Now, instead of waiting hours for responses to complex technical questions on Q&A boards or Slack channels, developers can get answers in seconds.

TOURCast at PGA TOUR – If you enjoy golf, this news will be of interest to you. The PGA TOUR debuted TOURCast in Japan at the 2024 ZOZO Championship to capture and disseminate better statistical data and bring fans closer to the game, based on a new scoring system called ShotLink, powered by CDW. This marks the first time the PGA TOUR has been able to bring this technology to Asia, leveraging the flexibility and scalability of AWS to overcome unique challenges.


PGA TOUR volunteer setting up GPS equipment on the fairway at the ZOZO Championship that will input specific shot data and feed it back to ShotLink Select Plus. [IMAGE: PGA TOUR]

They’ve completely rebuilt their scoring system over the past two years on a new cloud stack. With AWS cloud, whether data comes from high-tech radar systems, cameras, or manual input, the system processes it all seamlessly.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS GenAI Lofts – AWS GenAI Lofts are about more than just the tech; they bring together startups, developers, investors, and industry experts. Whether you’re looking to gain deep insights or get your questions answered by generative AI pros, our GenAI Lofts have you covered and provide everything you need to start building your next innovation. Join events in São Paulo (through November 20) and Paris (through November 25).

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Jakarta, Indonesia (November 23), Kochi, India (December 14).

AWS re:Invent – You can still register for the annual learning event, taking place December 2–6 in Las Vegas. Surprisingly, Andy Jassy, CEO of Amazon, said he will come back and participate in AWS re:Invent this year. He said, “As always, the priority is to make this a learning event so customers can take nuggets back and change their own customer experiences and businesses. We’ll also have a bunch of goodies for you that we’ll announce and that we think folks will like.” Let’s meet there!

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

Channy

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Amazon OpenSearch Service announces Standard and Extended Support dates for Elasticsearch and OpenSearch versions

Post Syndicated from Arvind Mahesh original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-announces-standard-and-extended-support-dates-for-elasticsearch-and-opensearch-versions/

Amazon OpenSearch Service supports 19 versions of open source Elasticsearch and 11 versions of OpenSearch. Over the years, we have added several stability, resiliency, and security features to recent engine versions, helping customers derive better value from OpenSearch Service. As software versions grow older, we need to make sure that these versions continue to meet high security and compliance standards. Many of the legacy versions supported on OpenSearch Service, such as Elasticsearch versions 1.5 and 2.3, depend on third-party dependencies that are no longer actively supported. By moving to the latest engine versions, customers can derive maximum benefit from the new features, improved price-performance, and security improvements we make to OpenSearch.

Today, we’re announcing timelines for end of Standard Support and Extended Support for legacy Elasticsearch versions up to 6.7, Elasticsearch versions 7.1 through 7.8, OpenSearch versions from 1.0 through 1.2, and OpenSearch versions 2.3 through 2.9 available on Amazon OpenSearch Service. Versions that are under Standard Support receive regular bug fixes and security fixes, and versions in Extended Support receive critical security fixes and operating system patches for an additional flat fee per normalized instance hour. With Extended Support, we want to make sure that our customers continue to receive critical security fixes for an adequate time, while they plan to upgrade to more recent engine versions. For more details on Extended Support please see the FAQs.

End of Standard Support and Extended Support for Elasticsearch versions

See Table 1 that follows for end of Standard Support and Extended Support dates for legacy Elasticsearch versions available on OpenSearch Service. We recommend that customers running Elasticsearch versions upgrade to the latest OpenSearch versions. All Elasticsearch versions will receive at least 12 months of Extended Support, and version 5.6 will receive 36 months of Extended Support. After Extended Support ends for a version, domains running the specific version will not receive bug fixes or security updates.

Software version                      End of Standard Support   End of Extended Support
Elasticsearch versions 1.5 and 2.3    11/7/2025                 11/7/2026
Elasticsearch versions 5.1 to 5.5     11/7/2025                 11/7/2026
Elasticsearch version 5.6             11/7/2025                 11/7/2028
Elasticsearch versions 6.0 to 6.7     11/7/2025                 11/7/2026
Elasticsearch version 6.8             Not announced             Not announced
Elasticsearch versions 7.1 to 7.8     11/7/2025                 11/7/2026
Elasticsearch version 7.9             Not announced             Not announced
Elasticsearch version 7.10            Not announced             Not announced

End of Standard Support and Extended Support for OpenSearch versions

For OpenSearch versions running on Amazon OpenSearch Service, we will provide at least 12 months of Standard Support after the end of support date for the corresponding upstream open source OpenSearch version, or 12 months of Standard Support after the release of the next minor version on OpenSearch Service, whichever is longer. All OpenSearch versions will receive at least 12 months of Extended Support after the end of Standard Support date. For more details, check the open source OpenSearch maintenance policy.
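The date arithmetic behind these commitments can be sketched in a few lines of Python. This is only an illustration of the "whichever is longer" rule and the 12-month minimums described above, not an official calculator; the function names (`add_months`, `earliest_end_of_standard_support`, `earliest_end_of_extended_support`) are hypothetical, and because the policy states minimums, the results are the earliest possible dates rather than announced ones.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def earliest_end_of_standard_support(upstream_end_of_support: date,
                                     next_minor_release_on_service: date) -> date:
    """At least 12 months after either event, whichever is longer."""
    return max(add_months(upstream_end_of_support, 12),
               add_months(next_minor_release_on_service, 12))


def earliest_end_of_extended_support(end_of_standard_support: date,
                                     months: int = 12) -> date:
    """Extended Support lasts at least 12 months past Standard Support."""
    return add_months(end_of_standard_support, months)


# Cross-check against the Elasticsearch table: Standard Support ends 11/7/2025.
standard_end = date(2025, 11, 7)
print(earliest_end_of_extended_support(standard_end))             # most versions
print(earliest_end_of_extended_support(standard_end, months=36))  # version 5.6
```

The two printed dates line up with the 11/7/2026 and 11/7/2028 entries in the Elasticsearch table.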

See Table 2 that follows for end of Standard Support and Extended Support dates for various OpenSearch versions available on OpenSearch Service. For future updates on versions in Standard Support and Extended Support, follow supported versions.

Software version                       End of Standard Support   End of Extended Support
OpenSearch versions 1.0 to 1.2         11/7/2025                 11/7/2026
OpenSearch version 1.3                 Not announced             Not announced
OpenSearch versions 2.3 to 2.9         11/7/2025                 11/7/2026
OpenSearch versions 2.11 and higher    Not announced             Not announced

Upgrading OpenSearch Service domains: We recommend that you update your domains to the latest available OpenSearch version to derive maximum value out of OpenSearch Service. Minor version upgrades on OpenSearch tend to be seamless because they don’t contain breaking changes, and we recommend moving to the latest minor version, or a version for which end of support has not yet been announced. For example, if you are on OpenSearch version 1.2, you can move to OpenSearch version 1.3, because it’s the last minor version of the 1.x series and because presently it continues to be supported by the open source community and AWS. If you want to choose an Elasticsearch version, and you are running an older 6.x or 7.x version, you can move to version 6.8, or 7.10.

There are various ways to upgrade your cluster to a newer version, and the steps vary depending on the version your domain is running and the version you want to upgrade to. See Upgrading OpenSearch Service domains for detailed instructions on upgrading your domain to a new version. You can also use the Migration Assistant for Amazon OpenSearch Service to upgrade to newer versions.

Calculating Extended Support charges: Domains running versions under Extended Support will be charged a flat additional fee per normalized instance hour (NIH). For example, the fee is $0.0065 per NIH in the US East (N. Virginia) AWS Region. See the pricing page for exact pricing by Region.

NIH is computed as a factor of the instance size (for example, medium or large) and the number of instance hours. For example, if you’re running an m7g.medium.search instance for 24 hours in the US East (N. Virginia) Region, which is priced at $0.068 per instance hour (on-demand), you will typically pay $1.632 ($0.068 x 24). If you’re running a version that is in Extended Support, you will pay an additional $0.0065 per NIH, which is computed as $0.0065 x 24 (number of instance hours) x 2 (size normalization factor, which is 2 for medium-sized instances), which comes to $0.312 for Extended Support for 24 hours. The total amount that you will pay for 24 hours will be the sum of the standard instance usage cost and the Extended Support cost, which is $1.944 ($1.632 + $0.312, excluding storage cost). The following table shows the normalization factor for various instance sizes in OpenSearch Service.

Instance size Normalization Factor
nano 0.25
micro 0.5
small 1
medium 2
large 4
xlarge 8
2xlarge 16
4xlarge 32
8xlarge 64
9xlarge 72
10xlarge 80
12xlarge 96
16xlarge 128
18xlarge 144
24xlarge 192
32xlarge 256
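To make the NIH arithmetic concrete, here is a minimal Python sketch that reproduces the m7g.medium.search example using the normalization factors from the table. The helper name `extended_support_cost` is hypothetical, the $0.0065 and $0.068 rates are the US East (N. Virginia) example figures quoted in this post, and the sketch assumes the instance size is the middle segment of the instance type name. Check the pricing page for the rates that apply in your Region.

```python
from decimal import Decimal

# Size normalization factors, copied from the table in this post.
_FACTORS = {
    "nano": "0.25", "micro": "0.5", "small": "1", "medium": "2",
    "large": "4", "xlarge": "8", "2xlarge": "16", "4xlarge": "32",
    "8xlarge": "64", "9xlarge": "72", "10xlarge": "80", "12xlarge": "96",
    "16xlarge": "128", "18xlarge": "144", "24xlarge": "192", "32xlarge": "256",
}
NORMALIZATION_FACTOR = {size: Decimal(f) for size, f in _FACTORS.items()}


def extended_support_cost(instance_type: str, hours: int,
                          rate_per_nih: Decimal = Decimal("0.0065"),
                          instance_count: int = 1) -> Decimal:
    """Extended Support fee = rate per NIH x normalization factor x instance hours."""
    # Assumption: the size is the middle segment, e.g. "m7g.medium.search" -> "medium".
    size = instance_type.split(".")[1]
    return rate_per_nih * NORMALIZATION_FACTOR[size] * hours * instance_count


# Worked example from this post: one m7g.medium.search instance for 24 hours.
standard = Decimal("0.068") * 24                       # on-demand instance cost
extended = extended_support_cost("m7g.medium.search", 24)
total = standard + extended                            # excludes storage cost
print(standard, extended, total)
```

Decimal is used instead of float so that the dollar amounts match the figures in the text exactly.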

Summary

We add new capabilities across various vectors to the latest OpenSearch versions, which include new features, performance and resiliency improvements, and security improvements. We recommend that you update to recent OpenSearch versions to get the most benefit out of OpenSearch Service. For any questions on Standard and Extended Support options, see the FAQs. For further questions, contact AWS Support.


About the authors

Arvind Mahesh is a Senior Manager-Product at Amazon Web Services for Amazon OpenSearch Service. He has close to two decades of technology experience across a variety of domains such as Analytics, Search, Cloud, Network Security, and Telecom.

Kuldeep Yadav is a Senior Technical Program Manager at Amazon Web Services who is passionate about driving innovation and complex problem solving. He works closely with teams and customers in ensuring operational excellence and achieving more with less. Outside of work, he enjoys trekking and all sports.

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Amazon OpenSearch Service launches the next-generation OpenSearch UI

Post Syndicated from Hang Zuo original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-launches-the-next-generation-opensearch-ui/

Amazon OpenSearch Service launches a modernized operational analytics experience that can provide comprehensive observability spanning multiple data sources, so that you can gain insights from OpenSearch and other integrated data sources in one place. The launch also introduces OpenSearch Workspaces, which provide a tailored experience for popular use cases and support access control, so that you can create a private space for your use case and share it only with your collaborators. With the next-generation user interface (UI), the Discover experience has been improved to simplify interactive analysis, so that you can easily use features such as natural language query generation to gain insights from your data.

Multiple data sources: You might have already used OpenSearch Dashboards to provide an operational analytics experience for your OpenSearch clusters. OpenSearch Dashboards is co-located with a cluster, so each OpenSearch Dashboards instance can only work with one cluster. As you scale your workload across multiple clusters, there is no unified experience for analyzing your data in one place. In comparison, the next-generation OpenSearch UI is designed to work across multiple OpenSearch clusters to aggregate comprehensive insights in one place. An OpenSearch application is an instance of the next-generation OpenSearch UI. Currently, OpenSearch applications can be associated with multiple OpenSearch clusters (version 1.3 and above), Amazon OpenSearch Serverless collections, and integrated data sources such as Amazon S3. Each OpenSearch cluster can be associated with multiple OpenSearch applications, in addition to its co-located OpenSearch Dashboards, which will remain functional.

Workspace: With workspaces, you can easily create content specific to your use case in a private space and manage permissions for team collaboration. Workspaces provide curated experiences for popular use cases such as Observability, Security Analytics, and Search, so that you can find it straightforward to build content for your use case. Workspaces support collaborator management, so that you can share your workspace only with your intended collaborators and manage the permissions for each collaborator.

Discover: The improved Discover feature now provides a unified log exploration experience that adds the support for SQL and Piped Processing Language (PPL), in addition to the existing support for DQL and Lucene. Discover features a new data selector to support multiple data sources, a new visual design, query autocomplete and natural language query generation for improved usability. With the enhanced Discover interface, you can now analyze data from multiple sources without switching tools, reducing complexity and improving efficiency.

Solution Overview

The following diagram illustrates the architecture of OpenSearch Dashboards.

The following diagram illustrates the next-generation OpenSearch UI architecture.

In the following sections, we discuss the following topics:

  1. The process of creating an application
  2. Setting up and using the new Workspaces functionality
  3. The enhanced Discover experience

We’ll demonstrate how these improvements streamline data analysis, foster collaboration, and empower you to extract insights more efficiently across various use cases.

Create an application:

To begin using the next-generation OpenSearch UI, you can first create an application. An application is an instance of the OpenSearch UI (Dashboards), and you have the flexibility to create multiple applications within a single account. To create a new application, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose OpenSearch UI (Dashboards) under Central management in the navigation panel.
  2. Choose Create application.
  3. For application name, enter a descriptive name for your new application.
  4. AWS Identity and Access Management (IAM) is the default authentication mechanism. Optionally, you can select Authentication with IAM Identity Center (IDC), so that you can use credentials and access management from your existing identity providers to manage user access.
  5. For OpenSearch application admins, specify the IAM principals or IDC users that will have permissions to update or delete the application configuration. You will automatically be set as the first admin.

This page lists all the existing applications under your account in the current AWS Region. You can create a new application from this page.

This page shows the create application workflow. You can specify the application name, enable or disable IDC, and define application admins to create an application.

After you configure these settings and create the application, your new OpenSearch application will be ready for you to associate data sources and start using the enhanced UI capabilities.

Associating data sources:

After you create your new OpenSearch application, the next step is to associate the relevant data sources. This allows you to connect the application to the necessary OpenSearch domains, collections, and other data sources.

  1. On the application details page, choose Manage data sources.

You will be presented with a list of all the OpenSearch data sources you have access to, including managed domains and serverless collections.

  2. Select the data sources you want to associate with this application.

OpenSearch domains below version 1.3 are not compatible with the next-generation UI and are grayed out in the list. Additionally, if you need to connect to a domain within a virtual private cloud (VPC), you must authorize the OpenSearch application as a new principal under the domain’s security configuration. If you need to connect to a collection within a VPC, you must set its network policy to Private and enable AWS service private access for the OpenSearch application.

  3. Choose Save to finalize the data source association.

Your OpenSearch application is now ready to use, with access to the connected data.

Working with the OpenSearch application:

To access your new OpenSearch application, you can either choose the application URL or choose Launch application on the application details page. After you’ve successfully logged in either with IAM or IDC, you’ll be directed to the application’s homepage. From here, you can choose to create a new workspace or navigate to an existing one that you have access to.

Creating a new workspace:

A workspace is a tailored experience for your use case and team collaboration. There are five types of workspaces: Observability, Security Analytics, Search, Essentials, and Analytics. You can click on the info button to learn more about each workspace type. Existing workspaces will be listed on the homepage. To create a new workspace, complete the following steps:

  1. Choose Create workspace.
  2. Enter a name for your workspace.
  3. Optionally, you can select a different color for the workspace icon for easier identification.
  4. Select the type of workspace you want to create: Observability, Security Analytics, Search, Essentials, or Analytics.
  5. Add at least one data source for this workspace (from the list of data sources you previously associated with the application).

For this post, we create an Observability workspace named MyWorkspace and associate it with one Amazon OpenSearch Serverless collection and one Amazon OpenSearch Service managed cluster. You can always manage the data sources associated with a workspace, even after it has been created.

Invite Collaborators

After you create your new workspace, you can add users or groups as collaborators. Workspace collaborators are the users you want to invite to work with you in this workspace, and there are three available permission levels for collaborators: admin, read/write, and read-only. Read/write permission allows a collaborator to create, edit and delete the dashboards, visualizations, and saved queries within the workspace, whereas collaborators with read-only access can only view the results. Admin level gives a collaborator the same permissions as you to not only read/write but also update the configurations of the workspace or delete it.

To add collaborators to your workspace, complete the following steps:

  1. Choose Collaborators in the navigation panel.
  2. Choose Add collaborators.
  3. Choose the type of users you want to add as collaborators. You can add collaborators by their IAM Amazon Resource Name (ARN) or IDC username.
  4. Select a permission level for the collaborator from the three options: Read only, Read and write, and Admin.

If you do not know the ARN of your intended collaborator, follow the instructions to look up their ARN.

Improved navigation:

The improved navigation in workspaces provides a more contextual and purpose-built interface, ensuring that each workspace includes only the tools and features relevant to its use case. With enhanced clarity and better organization, the new navigation system is tailored to help you find the features you need quickly, improving overall productivity and minimizing time spent searching through menus.

Revamped Discover experience

Discover is now revamped to offer improved usability and efficiency. You can access multiple data sources, natural language query generation, a new data selector, and polished design with optimized data density, allowing you to effortlessly navigate and analyze your data:

  • Unified language selector – Discover now offers a unified language selector, allowing users to choose from SQL, PPL, Dashboards Query Language (DQL), or Lucene, making it convenient to work with your preferred query languages in one place.

  • Natural language query generation – Discover now supports natural language query building for PPL. Enter your questions in plain language, and Discover converts them to PPL syntax, making data exploration simpler and more accessible. This new feature empowers users of different skill levels to get insights without needing to fully understand the PPL syntax.

  • Powerful query autocomplete – The enhanced query bar includes autocomplete functionality and natural language query generation support, simplifying query building by offering relevant suggestions as you type, making it faster and more efficient to write complex queries.

  • New data selector – The new data selector makes it straightforward to connect to multiple data sources, bringing data from Amazon OpenSearch Service domains, serverless collections, and Amazon S3 into a unified view.

Conclusion

In this post, we discussed the features of the next-generation OpenSearch UI. These improvements streamline data analytics, foster collaboration, and empower you to extract insights more efficiently across various use cases.

You can create your own OpenSearch UI applications today in the US East (N. Virginia), US West (N. California, Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), South America (São Paulo), Europe (Frankfurt, Ireland, London, Paris) and Canada (Central) Regions.


About the Authors

Hang (Arthur) Zuo is a Senior Product Manager with Amazon OpenSearch Service. Arthur leads the core experience in the next-gen OpenSearch UI and data migration to Amazon OpenSearch Service. Arthur is passionate about cloud technologies and building data products that help users and businesses gain actionable insights and achieve operational excellence.

Rushabh Vora is a Principal Product Manager for the OpenSearch project of Amazon Web Services. Rushabh leads core experiences in data exploration, dashboards, visualizations, reporting, and data management to help organizations unlock insights at scale. Rushabh is passionate about cloud technologies and building products that enable businesses to make data-driven decisions and achieve operational excellence.

Sohaib Katariwala is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service based out of Chicago, IL. His interests are in all things data and analytics. More specifically he loves to help customers use AI in their data strategy to solve modern day challenges.

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He works closely with customers on their OpenSearch journey across various use cases including vector search, observability, and security analytics.

Xenia Tupitsyna is a UX Designer at OpenSearch. She is working on user experiences across security analytics solutions, anomaly detection, alerting, and core dashboards.

Improve OpenSearch Service cluster resiliency and performance with dedicated coordinator nodes

Post Syndicated from Akshay Zade original https://aws.amazon.com/blogs/big-data/improve-opensearch-service-cluster-resiliency-and-performance-with-dedicated-coordinator-nodes/

Today, we are announcing dedicated coordinator nodes for Amazon OpenSearch Service domains deployed on managed clusters. When you use Amazon OpenSearch Service to create OpenSearch domains, the data nodes serve dual roles: they coordinate data-related requests, such as indexing and search requests, and they do the work of processing those requests by indexing documents and responding to search queries. Additionally, data nodes serve OpenSearch Dashboards. Because of these multiple responsibilities, data nodes can become a hot spot in the OpenSearch Service domain, leading to resource scarcity and ultimately node failures. Dedicated coordinator nodes help you mitigate this problem by limiting request coordination and Dashboards to the coordinator nodes, and request processing to the data nodes. This leads to more resilient, scalable domains.

Amazon OpenSearch Service is a managed service that you can use to secure, deploy, and operate OpenSearch clusters at scale in the AWS Cloud. The service allows you to configure clusters with different types of nodes such as data nodes, dedicated cluster manager nodes, and UltraWarm nodes. When you send requests to your OpenSearch Service domain, the request is broadcast to the nodes with shards that will process that request. By assigning roles through deploying dedicated nodes, like dedicated cluster manager nodes, you concentrate the processing of those kinds of requests and remove that processing from nodes in other roles.

OpenSearch Service has recently expanded its node type options to include dedicated coordinator nodes, alongside data nodes, dedicated cluster manager nodes, and UltraWarm nodes. These dedicated coordinator nodes offload coordination tasks and dashboard hosting from data nodes, freeing up CPU and memory resources. By provisioning dedicated coordinator nodes, you can improve a cluster’s overall performance and resiliency. Dedicated coordinator nodes also let you scale the coordination capacity of your cluster independently of the data storage capacity. Dedicated coordinator nodes are available in Amazon OpenSearch Service for all OpenSearch engine versions. See the documentation for engine and version support.

A brief introduction to coordination

OpenSearch operates as a distributed system, where data is stored in multiple shards across various nodes. Consequently, a node handling a request must coordinate with several other nodes to store or retrieve data.

Here are a few examples of coordination operations performed to successfully serve different user requests:

  • A bulk indexing request might contain data that belongs to multiple shards. The coordination process splits such a request into multiple shard-specific subrequests and routes them to the corresponding shards for indexing.
  • A search request might require querying various shards that are present in different nodes. The coordination process splits the request into multiple shard level search requests and sends those requests to the corresponding data nodes holding the data. Each of those data nodes processes the data locally and returns a shard-level response. The coordination process gathers these responses and builds the final response.
  • For queries with aggregations, the coordination process performs the additional computation of re-aggregating the aggregation responses from data nodes.
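The three coordination operations above follow a scatter-gather pattern, which can be sketched in plain Python. This is a conceptual illustration only, not OpenSearch code: the function names are hypothetical, `zlib.crc32` is a stand-in for OpenSearch's own routing hash, and shard responses are modeled as simple lists and tuples.

```python
import heapq
import zlib
from collections import defaultdict


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard for a document (crc32 is a stand-in for the real routing hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def split_bulk(docs: list, num_shards: int) -> dict:
    """Split one bulk request into shard-specific subrequests."""
    per_shard = defaultdict(list)
    for doc in docs:
        per_shard[shard_for(doc["_id"], num_shards)].append(doc)
    return dict(per_shard)


def merge_search_hits(shard_responses: list, size: int) -> list:
    """Gather shard-level (score, doc_id) hits and keep the global top hits."""
    all_hits = (hit for hits in shard_responses for hit in hits)
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[0])


def merge_average(shard_partials: list):
    """Re-aggregate per-shard (sum, count) pairs into one global average."""
    total = sum(s for s, _ in shard_partials)
    count = sum(c for _, c in shard_partials)
    return total / count if count else None


subrequests = split_bulk([{"_id": f"doc-{i}"} for i in range(10)], num_shards=3)
top_hits = merge_search_hits([[(0.9, "a"), (0.2, "b")], [(0.7, "c")]], size=2)
average = merge_average([(10, 2), (20, 3)])
```

Note that the average is rebuilt from sums and counts rather than by averaging the per-shard averages, which would be wrong when shards hold different numbers of documents. This merge work is the CPU and memory cost that lands on whichever node coordinates the request.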

In OpenSearch Service, each data node is implicitly capable of coordination. In the absence of dedicated coordinator nodes, the data node receiving the request will perform the coordinating duties, though it might not have the relevant shards for the request. By adding dedicated coordinator nodes to a cluster, you can reduce the burden on data nodes. The following sections walk through some of the improvements.

Higher indexing and search throughput

In an OpenSearch cluster, each indexing request goes through three broad phases: coordination, primary, and replica. With coordination responsibilities offloaded to dedicated coordinators, the data nodes have more resources at their disposal for the primary and replica phases. By adding coordinator nodes, we observed as much as 15% higher indexing throughput in workloads such as Stack Overflow and Big5.

A search request in OpenSearch can involve something as trivial as looking up a single document by ID or something complex, such as bucketing a large amount of data and performing aggregations on each of the buckets. The impact of adding dedicated coordinator nodes can vary widely depending on the query. In a query workload containing date histograms with multiple aggregations such as average, p50, p99, and so on, we were able to achieve about 20% higher throughput. The term and multi-term aggregations also benefit from the addition of coordinator nodes. Depending on the key composition, we observed throughput improvements of 15% to 20%.

More resilient clusters

Dedicated coordinator nodes provide a separation of responsibilities that prevents data nodes from being overwhelmed by complex queries or sudden spikes in request volume. In the case of complex aggregations, the coordinator nodes absorb the CPU impact ensuring that the data nodes focus on filtering, matching, scoring, sorting, and returning the search response, and maintaining the integrity of the data. In addition to coordination responsibilities, coordinator nodes also serve the OpenSearch Dashboards frontend. This ensures that the dashboards stay responsive even during high loads, ensuring a smooth user experience.

Complex aggregations consume a lot of memory. Memory intensive operations can lead to out of memory (OOM) errors causing node crashes and data loss. By adding dedicated coordinator nodes in a cluster, you can isolate the impact away from the data nodes. Coordinator nodes can greatly improve performance by significantly reducing or even completely eliminating query-induced OOM errors on data nodes. Because coordinator nodes don’t hold any data, the cluster still remains functional even if one of the coordinator nodes fails.

Efficient scaling

Dedicated coordinator nodes separate a cluster’s coordination capacity from data storage capacity. This allows you to choose the amount of memory and CPU required for your workload without impacting the stored data. For example, a cluster with high throughput might require a lot of lightweight nodes while a cluster with complex aggregations should have fewer but larger nodes.

Having a dedicated coordinator node allows you to adjust the number of nodes according to anticipated traffic patterns. For example, you can scale up the number of coordinators in high traffic hours and scale them down during low traffic hours.

Smaller IP reservations for VPC domains

With dedicated coordinator nodes, you can achieve up to 90% reduction in the number of IP addresses reserved by the service in your VPC. This reduction allows deployments of larger clusters that might otherwise face resource constraints.

When you create a virtual private cloud (VPC) domain without dedicated coordinator nodes, OpenSearch Service places an elastic network interface (ENI) in the VPC for each data node. Each ENI is assigned an IP address. At the time of domain creation, the service reserves three IP addresses for each data node. See Architecture for more information. When dedicated coordinator nodes are used, the ENIs are attached to the coordinator nodes instead of the data nodes. Because there are typically fewer coordinator nodes than data nodes, fewer IP addresses are reserved. The following diagram shows the domain architecture of a VPC domain with dedicated coordinator nodes.
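A quick back-of-the-envelope calculation shows where the reduction comes from. The sketch below assumes that the three-IP reservation per node applies to coordinator nodes in the same way it applies to data nodes; the text above states the figure only for data nodes, so treat this as an assumption, and treat the function names as hypothetical.

```python
def reserved_ips(eni_node_count: int, ips_per_node: int = 3) -> int:
    """IP addresses reserved for the nodes that carry ENIs."""
    return eni_node_count * ips_per_node


def ip_reduction(data_nodes: int, coordinator_nodes: int) -> tuple:
    """Return (reserved before, reserved after, fractional reduction)."""
    before = reserved_ips(data_nodes)          # ENIs on every data node
    after = reserved_ips(coordinator_nodes)    # ENIs only on coordinator nodes
    return before, after, (before - after) / before


# 100 data nodes with coordinators sized at 10% of the data node count.
before, after, reduction = ip_reduction(data_nodes=100, coordinator_nodes=10)
print(f"{before} -> {after} reserved IPs ({reduction:.0%} fewer)")
```

With coordinators at 10% of the data node count, the reservation drops by 90%, which is consistent with the "up to 90%" figure above.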

Picking the right configuration

OpenSearch Service offers two key parameters for managing dedicated coordinator nodes:

  1. Instance type, which determines the memory and compute capacity of each coordinator node.
  2. Instance count, which specifies the number of coordinator nodes.

Identify your use case

To get the most benefits out of coordinator nodes, you must pick the right type as well as the right count. As a general rule, we recommend that you set the count to 10% of the number of data nodes and choose a size that’s similar to the size of the data nodes. See the documentation to find out the supported instance types for dedicated coordinator nodes. The following guidelines should help tailor the configuration further to specific workloads:

  • Indexing: Indexing requires compute power to split the bulk upload request payload into shard-specific chunks. We recommend using CPU optimized instances of a size similar to that of the data nodes. While the count is dependent on the indexing throughput that you want to achieve, 10% of the number of data nodes is a good starting point.
  • High search throughput: Achieving high search throughput requires a lot of network capacity. Increasing the number of coordinator nodes will sustain the traffic load while providing high availability. We recommend setting the coordinator node count to between 10% and 15% of the number of data nodes.
  • Complex aggregations: Aggregations are memory intensive. For example, to calculate a p50 value, a coordinator node must first gather the entire dataset in memory. Moreover, crunching these numbers requires CPU cycles. We recommend that you use general purpose coordinator nodes that are one size larger than the data nodes. While the node count can be tuned by the use case, 8% to 10% of the number of data nodes is a good start.
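The count guidelines above can be turned into a small starting-point calculator. This is only a sketch: the workload labels are hypothetical names chosen for this example, and rounding up to at least one node is an assumption rather than something the guidelines state.

```python
# Starting-point ratios (percent of the data node count) from the guidelines above.
COUNT_PERCENT = {
    "general": (10, 10),
    "indexing": (10, 10),
    "search": (10, 15),
    "aggregations": (8, 10),
}


def _ceil_percent(nodes: int, percent: int) -> int:
    """Integer ceiling of nodes * percent / 100, never below one node."""
    return max(1, (nodes * percent + 99) // 100)


def suggested_coordinator_count(data_nodes: int, workload: str = "general"):
    """Return a (low, high) range of coordinator nodes to start experimenting with."""
    low, high = COUNT_PERCENT[workload]
    return _ceil_percent(data_nodes, low), _ceil_percent(data_nodes, high)


if __name__ == "__main__":
    print(suggested_coordinator_count(40, "search"))        # (4, 6)
    print(suggested_coordinator_count(40, "aggregations"))  # (4, 4)
    print(suggested_coordinator_count(5))                   # (1, 1)
```

The result is only a first guess. The metrics in the next section are what tell you whether to move up or down from it.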

Coordinator metrics

While the guidelines above are a good start, every use case is unique. To arrive at an optimal configuration, you must experiment with your own workload, observe the performance, and identify the bottlenecks. OpenSearch Service provides some key metrics and APIs to observe how coordinator nodes are doing.

  • CoordinatorCPUUtilization metric: This metric provides information about how much CPU is being consumed on the coordinator nodes. This metric is available at both the node and the cluster levels. If you see CPU consistently breaching the 80% mark, it might be time to use larger coordinator nodes.
  • CoordinatorJVMMemoryPressure, CoordinatorJVMGCOldCollectionCount, and CoordinatorJVMGCOldCollectionTime metrics: The CoordinatorJVMMemoryPressure metric indicates the percentage of JVM memory used by the OpenSearch process. This metric is available at both the cluster and node levels. Consistently high JVM memory pressure suggests that the coordinator nodes are running short of memory for their coordination tasks. It’s important to assess this metric alongside the JVM garbage collection (GC) metrics, which show how many old generation GC runs have been triggered and how long they lasted. In a properly scaled cluster, GC runs should be infrequent and short. If GC runs occur too often, they might also negatively impact CPU performance.
  • CoordinatingWriteRejected metric: This metric should be evaluated alongside other metrics, such as PrimaryWriteRejected and ReplicaWriteRejected. An increase in primary or replica write rejections suggests that the data nodes are underscaled and unable to process requests quickly enough. However, if the CoordinatingWriteRejected metric rises independently of the other two, it indicates that the coordinating node is struggling to handle the indexing coordination process, preventing it from processing queued requests. Indexing requires many resources, any of which could be a bottleneck. You can alleviate indexing pressure where the CPU is the bottleneck with more or larger instances that have more vCPUs.
  • Circuit breaker statistics API: Circuit breakers prevent OpenSearch from causing a Java OutOfMemoryError. The circuit breaker statistics for coordinator nodes can be retrieved with the following API:
    _nodes/coordinating_only:true/stats/breaker
    Every time a circuit breaker trips for a request, the client receives a 429 error with the circuit_breaking_exception message. These errors indicate that the result size of the request was too big to fit on a coordinator node. To avoid them, we recommend using an instance with more memory.
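To make the circuit breaker check concrete, here is a minimal sketch that scans a breaker statistics response for breakers that have tripped. The sample payload is invented for illustration and follows the general shape of the node stats breaker response (nodes, then breakers, each with a tripped counter). Fetching the real response from your domain is left out.

```python
def tripped_breakers(stats: dict) -> list:
    """Return sorted (node_name, breaker_name, tripped_count) tuples for tripped breakers."""
    results = []
    for node_id, node in stats.get("nodes", {}).items():
        node_name = node.get("name", node_id)
        for breaker_name, values in node.get("breakers", {}).items():
            tripped = values.get("tripped", 0)
            if tripped > 0:
                results.append((node_name, breaker_name, tripped))
    return sorted(results)


# Illustrative payload only; the node ID, name, and numbers are made up.
sample = {
    "nodes": {
        "abc123": {
            "name": "coordinator-1",
            "breakers": {
                "request": {"limit_size_in_bytes": 1000, "estimated_size_in_bytes": 400, "tripped": 0},
                "parent": {"limit_size_in_bytes": 2000, "estimated_size_in_bytes": 1900, "tripped": 3},
            },
        }
    }
}

if __name__ == "__main__":
    print(tripped_breakers(sample))  # [('coordinator-1', 'parent', 3)]
```

A non-empty result on coordinator nodes is a hint to look at instances with more memory, as described above.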

Provision a dedicated coordinator node

You can add one or more dedicated coordinator nodes by updating the domain configuration with the appropriate options for coordinator nodes. This will trigger a blue/green deployment, and the domain will have dedicated coordinator nodes once the deployment is complete. Alternatively, you can create a new domain with dedicated coordinator nodes.

In either scenario, you can expand or reduce the number of coordinator nodes without requiring a blue/green deployment, giving you the flexibility to experiment.
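For readers who script their domain changes, the following sketch builds the cluster configuration fragment for coordinator nodes. It assumes the NodeOptions shape of the OpenSearch Service configuration API, so verify the field names against the current API reference before relying on it. The instance type and domain name are hypothetical placeholders.

```python
def coordinator_cluster_config(instance_type: str, count: int) -> dict:
    """Build a ClusterConfig fragment that enables dedicated coordinator nodes.

    Assumption: the configuration API accepts a NodeOptions list with a
    "coordinator" node type, as sketched here.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "NodeOptions": [
            {
                "NodeType": "coordinator",
                "NodeConfig": {"Enabled": True, "Type": instance_type, "Count": count},
            }
        ]
    }


# Hypothetical instance type, for illustration only.
config = coordinator_cluster_config("m6g.large.search", 3)

# With boto3 installed and credentials configured, the update could look like:
#   import boto3
#   boto3.client("opensearch").update_domain_config(
#       DomainName="my-domain", ClusterConfig=config)

if __name__ == "__main__":
    print(config)
```

Keeping the configuration in a small helper like this makes it easy to change only the count when you experiment with scaling.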

Conclusion

In real-world production environments, dedicated coordinator nodes in Amazon OpenSearch Service provide an effective way to separate coordination tasks from data processing. This shift enhances resource efficiency, often delivering up to a 15% increase in indexing throughput and a 20% improvement in query performance, depending on workload demands. By offloading coordination tasks, you reduce the risk of node overloads, improve system stability, and gain better cost control by scaling coordination and data tasks independently.

For workloads with complex queries and high traffic, dedicated coordinator nodes help ensure that your cluster maintains optimal performance and is prepared to handle future growth with greater resilience. Start experimenting with dedicated coordinator nodes today to unlock more efficient resource management and enhanced performance in your OpenSearch clusters.


About the author

Akshay Zade is a Senior SDE working for Amazon OpenSearch Service, passionate about solving real-world problems with the power of large-scale distributed systems. Outside of work, he enjoys drawing, painting, and diving into fantasy books.

Infor’s Amazon OpenSearch Service Modernization: 94% faster searches and 50% lower costs

Post Syndicated from Allan Pienaar original https://aws.amazon.com/blogs/big-data/infors-amazon-opensearch-service-modernization-94-faster-searches-and-50-lower-costs/

This post is cowritten by Arjan Hammink from Infor.

Robust storage and search capabilities are critical components of Infor’s enterprise business cloud software. Infor’s Intelligent Open Network (ION) OneView platform provides real-time reporting, dashboards, and data visualization to help customers access and analyze information across their organization. To enhance the search functionality within ION OneView, Infor used Amazon OpenSearch Service to improve their software products and offer better service to their customers by providing real-time visibility. By modernizing their use of OpenSearch Service, Infor has been able to deliver a 94% improvement in search performance for customers, along with a 50% reduction in storage costs.

In this post, we’ll explore Infor’s journey to modernize its search capabilities, the key benefits they achieved, and the technologies that powered this transformation. We’ll also discuss how Infor’s customers are now able to more effectively search through business messages, documents, and other critical data within the ION OneView platform.

Where Infor started

Infor’s ION OneView was built on top of Elasticsearch v5.x on Amazon OpenSearch Service, hosted across eight AWS Regions. This architecture enabled users to track business documents from a consolidated view, search using various criteria, and correlate messages while viewing content based on user roles. Over time, Infor expanded its functionality to include “Enrich” and “Archive” capabilities, which added significant complexity. The Enrich process would build searchable messages by aggregating related events, requiring constant document updates to the OpenSearch indices. The Archive process would then move these messages and events to Amazon Simple Storage Service (Amazon S3), while using a delete_by_query to remove the corresponding documents from OpenSearch Service. These read-update-write-delete workloads, coupled with large all-encompassing indices with shard sizes of over 100GB, resulted in high volumes of deleted documents and exponential data growth that the system struggled to keep up with. To address increasing performance needs, Infor continually horizontally scaled out their OpenSearch Service domain.

Challenges 

The key challenges Infor faced underscored the need for a more scalable, resilient, and cost-effective search capability that could seamlessly integrate with their cloud environment. These included the inability to effectively archive data because of high ingestion rates, resulting in longer upgrade and recovery times. Escalating costs from scaling the solution and the need for custom development to enable newer OpenSearch Service features created significant operational burdens. Additionally, Infor was seeing increasing search latency, with routine CPU utilization peaks of 75% and occasional spikes above 90% (as shown in the following figures), demonstrating the performance limitations of Infor’s existing infrastructure. Collectively, these issues drove Infor’s need for a modernized search solution.

SearchLatency Pre-Modernization

Screenshot shows CloudWatch metric SearchLatency before Modernization

CPUUtilization Pre-Modernization

Screenshot shows CloudWatch metric CPUUtilization before Modernization

Infor’s journey to modernize search with OpenSearch Service

To address the growing challenges with ION OneView, Infor partnered with AWS to undertake a comprehensive modernization effort. This involved optimizing operational processes, storage configurations, and instance selections, while also upgrading to the later versions within OpenSearch Service.

Operational review and enhancements

As a collaborative effort between Infor and AWS, a comprehensive operational review of Infor’s OpenSearch Service cluster was undertaken. With the help of slow logs and adjusting the logging thresholds, the review was able to identify long-running queries and the archival process consuming the largest amount of CPU capacity. Infor rewrote the long-running queries that used high cardinality fields, reducing the average query time.

Next, the team turned their attention to redesigning Infor’s archival process to reduce stress on the CPU. Instead of a single large index, we implemented independent indices based on customer license types. This improved delete performance by allowing the team to target old indices, using index aliases to manage the transition. We also replaced the delete_by_query approach, in which a query is first sent to locate the documents to delete, with standard delete requests that pass document IDs directly, because all the document IDs to be archived were known ahead of time. This reduced round-trip time and CPU stress compared to the sequential search requests performed by delete_by_query. Finally, the team tuned the refresh interval to the workload requirements, which improved indexing performance as well as memory and CPU utilization.
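To illustrate the ID-based delete idea, the following sketch builds a request body for the bulk API from a list of known document IDs. Batching the deletes through the bulk API is an assumption made for this example, because the post only says that document IDs were passed directly. The index name and IDs are hypothetical.

```python
import json


def bulk_delete_body(index: str, doc_ids) -> str:
    """Build a newline-delimited bulk request body with one delete action per ID.

    Note: older Elasticsearch versions may also require a "_type" in each action.
    """
    lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}}) for doc_id in doc_ids]
    if not lines:
        return ""
    # The bulk API expects every action line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical index name and document IDs.
    print(bulk_delete_body("messages-archive-candidate", ["a1", "b2"]), end="")
```

Because no search is needed to find the documents, each request goes straight to the shards that own the IDs, which is the saving described above.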

Storage optimization

The team switched from GP2 to GP3 storage, provisioning additional input/output operations per second (IOPS) and throughput only when needed. This resulted in a 9% reduction in storage costs for most of Infor’s workloads. In all use cases where IOPS was a bottleneck, the team was able to provision additional IOPS and throughput independent of the volume size using GP3, further reducing Infor’s overall storage costs. Additionally, we implemented a rollover strategy based on shard size, with the total number of shards divisible by the number of nodes, to bring shard sizes down to the recommended size of less than 50 GiB. This helped ensure an even distribution of data and workloads across the nodes for each index. The resulting performance improvements, together with the thread pool queues and latencies, indicated that more vCPUs would be beneficial. Appropriate master and data node instance types were chosen based on the new storage requirements. To support the reindexing process, the team also temporarily scaled up the storage and compute resources.
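The sharding rule described above can be sketched as a small calculation: pick the smallest shard count that is a multiple of the node count and keeps every shard under 50 GiB. This example only counts primary shards and assumes data is spread evenly. Both are simplifications, not details taken from Infor’s setup.

```python
def shard_count(index_size_gib: float, node_count: int, max_shard_gib: float = 50.0) -> int:
    """Smallest multiple of node_count that keeps each shard below max_shard_gib."""
    if index_size_gib <= 0 or node_count < 1:
        raise ValueError("index size must be positive and node_count at least 1")
    # Strictly fewer than max_shard_gib per shard, so add one after flooring.
    min_shards = int(index_size_gib // max_shard_gib) + 1
    # Round up to the next multiple of the node count for an even spread.
    return (min_shards + node_count - 1) // node_count * node_count


if __name__ == "__main__":
    shards = shard_count(600, 6)
    print(shards)        # 18
    print(600 / shards)  # size per shard in GiB, below 50
```

For a 600 GiB index on six nodes, 12 shards would sit exactly at the 50 GiB limit, so the sketch moves up to 18 shards, or three per node.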

Upgrading OpenSearch Service

After optimizing the storage and compute configurations based on best practices, the Infor ION team turned their attention to using the latest features of OpenSearch Service. With the shards now at an appropriate boundary and the memory and CPU utilization at the right levels, the team was able to seamlessly upgrade from Elasticsearch version 5.x to 6.x and then to 7.x in OpenSearch Service. Each major version upgrade required careful testing and client-side code changes to make sure that the appropriate compatible client libraries were used, and the team took the necessary time after each upgrade to thoroughly validate the system and provide a smooth transition for Infor’s customers. This commitment to a methodical upgrade process allowed Infor to take advantage of the latest OpenSearch Service features, such as Graviton support, performance improvements, bug fixes, and security posture improvements, while minimizing disruption to their users.

Optimizing instance selection for performance

In collaboration with the AWS team, Infor carefully evaluated local non-volatile memory express (NVMe)-backed instance types for their ION OneView search cluster, comparing options such as i3 and R6gd instances to balance memory, latency, and storage requirements. For write-heavy workloads, the team found that using NVMe storage provided better performance and price compared to Amazon Elastic Block Store (Amazon EBS) volumes because of the high IOPS requirement of the workload, allowing them to be less reliant on off-heap memory usage. By selecting the most appropriate instance types, the ION OneView search cluster was able to resize and scale down the number of data nodes by 63% while still achieving improved throughput and reduced latency. Staying on the latest AWS instance families was also a key consideration, and the team further optimized costs by purchasing Reserved Instances after establishing a good baseline for their performance and compute consumption, with discounts ranging from 30% to 50% depending on the commitment term.

Results

The following figures show the improvements of the modernization.

New indices with the correct shard size can be seen in the increase in shards, shown in the following figure.

Figure showing increase in shards with new indices and correct shard size

The updated shard strategy combined with a version upgrade led to a ten-fold increase in the volume of traffic and efficient archiving as shown in the following figure.

Figure illustrates 10x increase in traffic volume and improved archiving due to updated shard strategy and version upgrade

The SearchRate increase is shown in the following figure.

Figure shows increase in SearchRate

The following figure shows that the CPU increase was minimal compared to the traffic increase.

Figure demonstrates CPU increase was minimal compared to traffic increase

The SearchLatency reduction post upgrade and implementation of the new indexing and shard strategy is shown in the following figure.

Figure illustrates reduction in CloudWatch metric SearchLatency after upgrade and new indexing/shard strategy implementation

The following figure shows the monthly spend over the past 4 quarters for two Infor ION products.

Figure shows the monthly spend over 4 quarters for two Infor ION products.

Conclusion

Through their careful modernization of the OpenSearch Service infrastructure, Infor was able to achieve a 50% reduction in infrastructure costs coupled with a 94% improvement in cluster performance. The optimized clusters are now healthier and more resilient, enabling faster blue/green deployments to process even greater data volumes.

This successful transformation was driven by Infor’s close collaboration with the AWS team, using deep technical expertise and best practices to accelerate the optimization process and unlock the full potential of OpenSearch Service. Infor’s OpenSearch Service modernization has empowered the company to provide an improved, high-performing search experience for their customers at a significantly lower cost, positioning their ION OneView platform for continued growth and success.

Every workload is unique, with its own distinct characteristics. While the best practices outlined in the Amazon OpenSearch Service developer guide serve as a valuable guide, the most important step is to deploy, test, and continuously tune your own domains to find the optimal configuration, stability, and cost for your specific needs.


About the Authors

Allan Pienaar is an OpenSearch SME and Customer Success Engineer at AWS. He works closely with enterprise customers in ensuring operational excellence, maintaining production stability and optimizing cost using the Amazon OpenSearch Service.

Gokul Sarangaraju is a Senior Solutions Architect at AWS. He helps customers adopt AWS services and provides guidance in AWS cost and usage optimization. His areas of expertise include building scalable and cost-effective data analytics solutions using AWS services and tools.

Arjan Hammink is a Senior Director of Software Development at Infor, bringing over 25 years of expertise in software development and team management. He currently oversees Infor ION, a project he has been integral to since its inception in 2010 when he began as a Software Engineer. Infor ION is a robust middleware designed to streamline software integration, a key component of Infor OS, Infor’s cloud technology platform.

AWS Weekly Roundup: Agentic workflows, Amazon Transcribe, AWS Lambda insights, and more (October 21, 2024)

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/aws-weekly-roundup-agentic-workflows-amazon-transcribe-aws-lambda-insights-and-more-october-21-2024/

Agentic workflows are quickly becoming a cornerstone of AI innovation, enabling intelligent systems to autonomously handle and refine complex tasks in a way that mirrors human problem-solving. Last week, we launched Serverless Agentic Workflows with Amazon Bedrock, a new short course developed in collaboration with Dr. Andrew Ng and DeepLearning.AI.

Serverless Agentic Workflows with Amazon Bedrock

This hands-on course, taught by my colleague Mike Chambers, teaches how to build serverless agents that can handle complex tasks without the hassle of managing infrastructure. You will learn everything you need to know about integrating tools, automating workflows, and deploying responsible agents with built-in guardrails on Amazon Web Services (AWS) with Amazon Bedrock. The hands-on labs provided with the course let you apply your knowledge directly in an AWS environment, hosted by AWS Partner Vocareum. Find more information and enroll for free on the DeepLearning.AI course page.

Now, let’s turn our attention to other exciting news in the AWS universe from last week.

Last week’s launches
Here are some launches that got my attention:

Amazon Transcribe now supports streaming transcription in 30 additional languages – Amazon Transcribe has expanded its support to include 30 additional languages, bringing the total number of supported languages to 54. This enhancement helps you reach a broader global audience and improves accessibility across various industries, including contact centers, broadcasting, and e-learning. The expanded language support allows for more efficient content moderation, improved agent productivity, and automatic subtitling for live events and meetings.

AWS Lambda console now surfaces key function insights and supports real-time log analytics – The AWS Lambda console now features a built-in Amazon CloudWatch Metrics Insights dashboard and supports CloudWatch Logs Live Tail, providing instant visibility into critical function metrics and real-time log streaming. You can now identify and troubleshoot errors or performance issues for your Lambda functions without leaving the console, as well as view and analyze logs in real time as they become available. You can reduce context switching and accelerate the development and troubleshooting processes for serverless applications. Check out the launch post for more details.

Amazon Bedrock Model Evaluation now supports evaluating custom model import models – You can now evaluate custom models you’ve imported to Amazon Bedrock using the model evaluation feature. This helps you to complete the full cycle of selecting, customizing, and evaluating models before deploying them. To evaluate an imported model, select the custom model from the list of models to evaluate in the model selector tool when creating an evaluation job.

Amazon Q in AWS Supply Chain – You can now use Amazon Q, an interactive AI assistant, to analyze your supply chain data in AWS Supply Chain and get insights to operate your supply chain more efficiently. Amazon Q can answer your supply chain questions by diving into your data. This reduces the time spent searching for information and streamlines finding answers to improve your supply chain operations.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS news
Here are some additional news items and posts that you might find interesting:

New Amazon OpenSearch Service YouTube channel – The channel offers bite-sized tutorials, curated content, and organized playlists on topics such as log analytics, semantic search, vector databases, and operational best practices. You can also provide feedback to influence future channel content and the OpenSearch Service roadmap. Check out the launch post for more details and subscribe to the Amazon OpenSearch Service YouTube channel.

Deploying Generative AI Applications with NVIDIA NIM Microservices on Amazon Elastic Kubernetes Service (Amazon EKS) – This post shows you how to use Amazon EKS to orchestrate the deployment of pods containing NVIDIA NIM microservices, to enable quick-to-setup and optimized large-scale large language model (LLM) inference on Amazon EC2 G5 instances. It also demonstrates how to scale (both pod and cluster) by monitoring for custom metrics through Prometheus, and how you can load balance using an Application Load Balancer.

Instant Well-Architected CDK Resources with Solutions Constructs Factories – You can now create well-architected AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets and AWS Step Functions state machines with a single function call using the new AWS Solutions Constructs Factories. These factories handle all the best practices configuration for you while still allowing customization. Try using a Constructs factory the next time you need to deploy one of the supported resources.

Upcoming AWS events
Check your calendars and sign up for these AWS events:

AWS GenAI Lofts – AWS GenAI Lofts are about more than just the tech; they bring together startups, developers, investors, and industry experts. Whether you’re looking to gain deep insights, or get your questions answered by generative AI pros, our GenAI Lofts have you covered and provide everything you need to start building your next innovation. Join events in London (through October 25), Seoul (October 30–November 6), São Paulo (through November 20), and Paris (through November 25).

AWS Community Days – Join community-led conferences that feature technical discussions, workshops, and hands-on labs led by expert AWS users and industry leaders from around the world: Malta (November 8), Chile (November 9), and Kochi, India (December 14).

AWS re:Invent 2024 – Registration is now open for the annual tech extravaganza, taking place December 2–6 in Las Vegas. At re:Invent 2024, you’ll get a front row seat to hear real stories from customers and AWS leaders about navigating pressing topics, such as generative AI. Learn about new product launches, watch demos, and get behind-the-scenes insights during five headline-making keynotes.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Weekly Roundup!

— Antje

This post is part of our Weekly Roundup series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A customer’s journey with Amazon OpenSearch Ingestion pipelines

Post Syndicated from Navnit Shukla original https://aws.amazon.com/blogs/big-data/a-customers-journey-with-amazon-opensearch-ingestion-pipelines/

This is a guest post co-written with Mike Mosher, Sr. Principal Cloud Platform Network Architect at a multi-national financial credit reporting company.

I work for a multi-national financial credit reporting company that offers credit risk, fraud, targeted marketing, and automated decisioning solutions. We are an AWS early adopter and have embraced the cloud to drive digital transformation efforts. Our Cloud Center of Excellence (CCoE) team operates a global AWS Landing Zone, which includes a centralized AWS network infrastructure. We are also an AWS PrivateLink Ready Partner and offer our E-Connect solution to allow our B2B customers to connect to a range of products through private, secure, and performant connectivity.

Our E-Connect solution is a platform comprised of multiple AWS services like Application Load Balancer (ALB), Network Load Balancer (NLB), Gateway Load Balancer (GWLB), AWS Transit Gateway, AWS PrivateLink, AWS WAF, and third-party security appliances. All of these services and resources, as well as the large amount of network traffic across the platform, create a large number of logs, and we needed a solution to aggregate and organize these logs for quick analysis by our operations teams when troubleshooting the platform.

Our original design consisted of Amazon OpenSearch Service, selected for its ability to return specific log entries from extensive datasets in seconds. We also complemented this with Logstash, allowing us to use multiple filters to enrich and augment the data before sending to the OpenSearch cluster, facilitating a more comprehensive and insightful monitoring experience.

In this post, we share our journey, including the hurdles we faced, the solutions we thought about, and why we went with Amazon OpenSearch Ingestion pipelines to make our log management smoother.

Overview of the initial solution

We originally wanted to store and analyze the logs in an OpenSearch cluster, and decided to use the AWS-managed service for OpenSearch called Amazon OpenSearch Service. We also wanted to enrich these logs with Logstash, but there was no AWS-managed service for this, so we needed to deploy the application on an Amazon Elastic Compute Cloud (Amazon EC2) server. This setup meant that we had to take on a lot of server maintenance, including using AWS CodePipeline and AWS CodeDeploy to push new Logstash configurations to the server and restart the service. We also needed to patch and update the operating system (OS) and the Logstash application, and monitor server resources such as Java heap, CPU, memory, and storage.

The complexity extended to validating the network path from the Logstash server to the OpenSearch cluster, incorporating checks on Access Control Lists (ACLs) and security groups, as well as routes in the VPC subnets. Scaling beyond a single EC2 server introduced considerations for managing an auto scaling group, Amazon Simple Queue Service (Amazon SQS) queues, and more. Maintaining the continuous functionality of our solution became a significant effort, diverting focus from the core tasks of operating and monitoring the platform.

The following diagram illustrates our initial architecture.

Possible solutions for us:

Our team looked at multiple options to manage the logs from this platform. We possess a Splunk solution for storing and analyzing logs, and we did assess it as a potential competitor to OpenSearch Service. However, we opted against it for several reasons:

  • Our team is more familiar with OpenSearch Service and Logstash than Splunk.
  • Amazon OpenSearch Service, being a managed service in AWS, facilitates a smoother log transfer process compared to our on-premises Splunk solution. Also, transporting logs to the on-premises Splunk cluster would incur high costs, consume bandwidth on our AWS Direct Connect connections, and introduce unnecessary complexity.
  • Splunk’s pricing structure, based on storage in GBs, proved cost-prohibitive for the volume of logs we intended to store and analyze.

Initial designs for an OpenSearch Ingestion pipeline solution

The Amazon team approached me about a new feature they were launching: Amazon OpenSearch Ingestion. This feature offered a great solution to the problems we were facing with managing EC2 instances for Logstash. First, the new feature removed all the heavy lifting from our team of managing multiple EC2 instances, scaling the servers up and down based on traffic, and monitoring the ingestion of logs and the resources of the underlying servers. Second, Amazon OpenSearch Ingestion pipelines supported most if not all of the Logstash filters we were using in our current solution, which allowed us to use the same functionality of our current solution for enriching the logs.

We were thrilled to be accepted into the AWS beta program, emerging as one of its earliest and largest adopters. Our journey began with ingesting VPC flow logs for our internet ingress platform, alongside Transit Gateway flow logs connecting all VPCs in the AWS Region. Handling such a substantial volume of logs proved to be a significant task, with Transit Gateway flow logs alone reaching upwards of 14 TB per day. As we expanded our scope to include other logs like ALB and NLB access logs and AWS WAF logs, the scale of the solution translated to higher costs.

However, our enthusiasm was somewhat dampened by the challenges we faced initially. Despite our best efforts, we encountered performance issues with the domain. Through collaborative efforts with the AWS team, we uncovered misconfigurations within our setup. We had been using instances that were inadequately sized for the volume of data we were handling. Consequently, these instances were constantly operating at maximum CPU capacity, resulting in a backlog of incoming logs. This bottleneck cascaded into our OpenSearch Ingestion pipelines, forcing them to scale up unnecessarily, even as the OpenSearch cluster struggled to keep pace.

These challenges led to suboptimal performance from our cluster. We found ourselves unable to analyze flow logs or access logs promptly, sometimes waiting days after their creation. Additionally, the costs associated with these inefficiencies far exceeded our initial expectations.

However, with the assistance of the AWS team, we successfully addressed these issues, optimizing our setup for improved performance and cost-efficiency. This experience underscored the importance of proper configuration and collaboration in maximizing the potential of AWS services, ultimately leading to a more positive outcome for our data ingestion processes.

Optimized design for our OpenSearch Ingestion pipelines solution

We collaborated with AWS to enhance our overall solution, building one that is high performing, cost-effective, and aligned with our monitoring requirements. The solution selectively ingests specific log fields into the OpenSearch Service domain by using Amazon S3 Select in the pipeline source. Alternatively, selective ingestion can be done by filtering within the pipeline: you can use include_keys and exclude_keys in your sink to filter the data that’s routed to the destination. We also used the built-in Index State Management feature to remove logs older than a predefined period to reduce the overall cost of the cluster.
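As an illustration of the retention step, the following sketch builds an Index State Management policy document that moves indices to a delete state once they reach a given age. The retention period and index pattern are hypothetical, and the policy is deliberately minimal. It is not the configuration we ran in production.

```python
import json


def delete_after_policy(retention: str, index_patterns) -> dict:
    """Build an ISM policy body that deletes indices older than the retention period."""
    return {
        "policy": {
            "description": f"Delete indices older than {retention}",
            "default_state": "hot",
            "states": [
                {
                    # Indices stay here until they reach the retention age.
                    "name": "hot",
                    "actions": [],
                    "transitions": [
                        {"state_name": "delete", "conditions": {"min_index_age": retention}}
                    ],
                },
                {"name": "delete", "actions": [{"delete": {}}], "transitions": []},
            ],
            # Attach the policy automatically to new indices that match the patterns.
            "ism_template": {"index_patterns": list(index_patterns), "priority": 100},
        }
    }


if __name__ == "__main__":
    # Hypothetical retention period and index pattern.
    print(json.dumps(delete_after_policy("30d", ["flow-logs-*"]), indent=2))
```

The resulting document can be stored through the ISM policy API or the Index Management page in OpenSearch Dashboards.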

The ingested logs in OpenSearch Service empower us to derive aggregate data, providing insights into trends and issues across the entire platform. For additional detailed analysis of these logs including all original log fields, we use Amazon Athena tables with partitioning to quickly and cost-effectively query Amazon Simple Storage Service (Amazon S3) for logs stored in Parquet format.
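To show why partitioning keeps these Athena queries quick and inexpensive, here is a hedged sketch that builds a query restricted to a single day's partition, so only that day's Parquet files are scanned. The table name `vpc_flow_logs`, the `year`/`month`/`day` partition columns, and the selected fields are hypothetical; an actual table layout may differ.

```python
from datetime import date

# Values accepted for the hypothetical "action" column.
ALLOWED_ACTIONS = {"ACCEPT", "REJECT"}


def build_flow_log_query(table: str, day: date, action: str, limit: int = 100) -> str:
    """Return an Athena SQL string that reads only one day's partition.

    Assumes a table partitioned by string columns year, month, and day.
    """
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name")
    if action not in ALLOWED_ACTIONS:
        raise ValueError("action must be ACCEPT or REJECT")
    return (
        f"SELECT srcaddr, dstaddr, dstport, action\n"
        f"FROM {table}\n"
        # Filtering on partition columns lets Athena skip all other days.
        f"WHERE year = '{day:%Y}' AND month = '{day:%m}' AND day = '{day:%d}'\n"
        f"  AND action = '{action}'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Hypothetical table and date, for demonstration only.
    print(build_flow_log_query("vpc_flow_logs", date(2024, 5, 1), "REJECT"))
```

Because the filter is on partition columns, the amount of data scanned (and therefore the query cost) grows with the number of days queried rather than with the total size of the log archive.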

This comprehensive solution significantly enhances our platform visibility, reduces overall monitoring costs for handling a large log volume, and expedites our time to identify root causes when troubleshooting platform incidents.

The following diagram illustrates our optimized architecture.

Performance comparison

The following table compares the performance of the initial design with Logstash on Amazon EC2, the original OpenSearch Ingestion pipeline solution, and the optimized OpenSearch Ingestion pipeline solution.

| Criteria | Initial Design with Logstash on Amazon EC2 | Original Ingestion Pipeline Solution | Optimized Ingestion Pipeline Solution |
| --- | --- | --- | --- |
| Maintenance Effort | High: Solution required the team to manage multiple services and instances, taking effort away from managing and monitoring our platform. | Low: OpenSearch Ingestion managed most of the undifferentiated heavy lifting, leaving the team to only maintain the ingestion pipeline configuration file. | Low: OpenSearch Ingestion managed most of the undifferentiated heavy lifting, leaving the team to only maintain the ingestion pipeline configuration file. |
| Performance | High: EC2 instances with Logstash could scale up and down as needed in the auto scaling group. | Low: Due to insufficient resources on the OpenSearch cluster, the ingestion pipelines were constantly at max OpenSearch Compute Units (OCUs), causing log delivery to be delayed by multiple days. | High: Ingestion pipelines can scale up and down in OCUs as needed. |
| Real-time Log Availability | Medium: In order to pull, process, and deliver the large number of logs in Amazon S3, we needed a large number of EC2 instances. To save on cost, we ran fewer instances, which led to slower log delivery to OpenSearch. | Low: Due to insufficient resources on the OpenSearch cluster, the ingestion pipelines were constantly at max OCUs, causing log delivery to be delayed by multiple days. | High: The optimized solution was able to deliver a large number of logs to OpenSearch to be analyzed in near real time. |
| Cost Saving | Medium: Running multiple services and instances to send logs to OpenSearch increased the cost of the overall solution. | Low: Due to insufficient resources on the OpenSearch cluster, the ingestion pipelines were constantly at max OCUs, increasing the cost of the service. | High: The optimized solution was able to scale the ingestion pipeline OCUs up and down as needed, which kept the overall cost low. |
| Overall Benefit | Medium | Low | High |

Conclusion

In this post, we highlighted our journey to build a solution using OpenSearch Service and OpenSearch Ingestion pipelines. This solution allows us to focus on analyzing logs and supporting our platform, without needing to support the infrastructure that delivers logs to OpenSearch. We also highlighted the need to optimize the service in order to increase performance and reduce cost.

As a next step, we aim to explore the recently announced Amazon OpenSearch Service zero-ETL integration with Amazon S3 (in preview). This step is intended to further reduce the solution’s costs and provide flexibility in the timing and number of logs that are ingested.


About the Authors

Navnit Shukla serves as an AWS Specialist Solutions Architect with a focus on analytics. He possesses a strong enthusiasm for assisting clients in discovering valuable insights from their data. Through his expertise, he constructs innovative solutions that empower businesses to arrive at informed, data-driven choices. Notably, Navnit Shukla is the accomplished author of the book titled “Data Wrangling on AWS.” He can be reached via LinkedIn.

Mike Mosher is a Senior Principal Cloud Platform Network Architect at a multinational financial credit reporting company. He has more than 16 years of experience in on-premises and cloud networking and is passionate about building new architectures on the cloud that serve customers and solve problems. Outside of work, he enjoys time with his family and traveling back home to the mountains of Colorado.