All posts by Jon Handler

Boosting search relevance: Automatic semantic enrichment in Amazon OpenSearch Serverless

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/boosting-search-relevance-automatic-semantic-enrichment-in-amazon-opensearch-serverless/

Traditional search engines rely on word-to-word matching (referred to as lexical search) to find results for queries. Although this works well for specific queries such as television model numbers, it struggles with more abstract searches. For example, when searching for “shoes for the beach,” a lexical search merely matches individual words “shoes,” “beach,” “for,” and “the” in catalog items, potentially missing relevant products like “water-resistant sandals” or “surf footwear” that don’t contain the exact search terms.

Large language models (LLMs) create dense vector embeddings for text that expand retrieval beyond individual word boundaries to include the context in which words are used. Dense vector embeddings capture the relationship between shoes and beaches by learning how often they occur together, enabling better retrieval for more abstract queries through what is called semantic search.

Sparse vectors combine the benefits of lexical and semantic search. The process starts with a WordPiece tokenizer to create a limited set of tokens from text. A transformer model then assigns weights to these tokens. During search, the system calculates the dot-product of the weights on the tokens (from the reduced set) from the query with tokens from the target document. You get a blended score from the terms (tokens) whose weights are high for both the query and the target. Sparse vectors encode semantic information, like dense vectors, and supply word-to-word matching through the dot-product, giving you a hybrid lexical-semantic match. For a detailed understanding of sparse and dense vector embeddings, visit Improving document retrieval with sparse semantic encoders in the OpenSearch blog.
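To make the dot-product scoring concrete, here is a minimal Python sketch. The token weights are invented for illustration and are not output from the service-managed model; the function name sparse_dot is also hypothetical.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product over the tokens that two sparse vectors share."""
    # Iterate over the smaller mapping so the loop is as short as possible.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Hypothetical weights: the document "Red shoes" has been expanded with
# related tokens, so it can match a query that shares no surface words.
doc = {"red": 1.2, "shoes": 1.5, "footwear": 0.9, "crimson": 0.4}
query = {"crimson": 1.0, "footwear": 1.0}

score = sparse_dot(query, doc)
print(f"score = {score:.2f}")
```

Only tokens with nonzero weight on both sides contribute, which is why the score behaves like a lexical match over a semantically expanded vocabulary.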

Automatic semantic enrichment for Amazon OpenSearch Serverless makes implementing semantic search with sparse vectors effortless. You can now experiment with search relevance improvements and deploy to production with only a few clicks, requiring no long-term commitment or upfront investment. In this post, we show how automatic semantic enrichment removes friction and makes the implementation of semantic search for text data seamless, with step-by-step instructions to enhance your search functionality.

Automatic semantic enrichment

You could already enhance search relevance scoring beyond OpenSearch’s default lexical scoring, which uses the Okapi BM25 algorithm, by integrating dense vector and sparse vector models for semantic search through OpenSearch’s connector framework. However, implementing semantic search in OpenSearch Serverless has been complex and costly, requiring model selection, hosting, and integration with an OpenSearch Serverless collection.

Automatic semantic enrichment lets you automatically encode your text fields in your OpenSearch Serverless collections as sparse vectors by just setting the field type. During ingestion, OpenSearch Serverless automatically processes the data through a service-managed machine learning (ML) model, converting text to sparse vectors in native Lucene format.

Automatic semantic enrichment supports both English-only and multilingual options. The multilingual variant supports the following languages: Arabic, Bengali, Chinese, English, Finnish, French, Hindi, Indonesian, Japanese, Korean, Persian, Russian, Spanish, Swahili, and Telugu.

Model details and performance

Automatic semantic enrichment uses a service-managed, pre-trained sparse model that works effectively without requiring custom fine-tuning. The model analyzes the fields you specify, expanding them into sparse vectors based on learned associations from diverse training data. The expanded terms and their significance weights are stored in native Lucene index format for efficient retrieval. We’ve optimized this process using document-only mode, where encoding happens only during data ingestion. Search queries are merely tokenized rather than processed through the sparse model, making the solution both cost-effective and performant.

Our performance validation during feature development used the MS MARCO passage retrieval dataset, featuring passages averaging 334 characters. For relevance scoring, we measured average normalized discounted cumulative gain (NDCG) for the first 10 search results (ndcg@10) on the BEIR benchmark for English content and average ndcg@10 on MIRACL for multilingual content. We assessed latency through client-side 90th-percentile (p90) measurements and through the p90 of the took values reported in search responses. These benchmarks provide baseline performance indicators for both search relevance and response times.
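For readers unfamiliar with the metric, the following Python sketch shows one common way to compute ndcg@10 for a single query. It assumes graded relevance judgments with linear gain, and it computes the ideal ordering from the returned list only; benchmark harnesses typically use the full set of judged documents, so treat this as an approximation rather than the evaluation code used for the numbers in this post.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each result's relevance is discounted
    by log2 of its 1-based rank plus one."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical judgments for one query, in the order the engine returned them.
ranked_relevances = [3, 0, 2, 1]
print(f"ndcg@10 = {ndcg_at_k(ranked_relevances):.3f}")
```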

The following table shows the automatic semantic enrichment benchmark.

Language | Relevance improvement | P90 search latency
English | 20.0% over lexical search | 7.7% lower latency than lexical search (BM25 is 26 ms, and automatic semantic enrichment is 24 ms)
Multilingual | 105.1% over lexical search | 38.4% higher latency than lexical search (BM25 is 26 ms, and automatic semantic enrichment is 36 ms)

Given the unique nature of each workload, we encourage you to evaluate this feature in your development environment using your own benchmarking criteria before making implementation decisions.

Pricing

OpenSearch Serverless bills automatic semantic enrichment based on OpenSearch Compute Units (OCUs) consumed during sparse vector generation at indexing time. You’re charged only for actual usage during indexing. You can monitor this consumption using the Amazon CloudWatch metric SemanticSearchOCU. For specific details about model token limits and volume throughput per OCU, visit Amazon OpenSearch Service Pricing.

Prerequisites

Before you create an automatic semantic enrichment index, verify that you’ve been granted the necessary permissions for the task. Contact an account administrator for assistance if required. To work with automatic semantic enrichment in OpenSearch Serverless, you need the account-level AWS Identity and Access Management (IAM) permissions shown in the following policy. The permissions serve the following purposes:

  • The aoss:*Index IAM permissions are used to create and manage indices.
  • The aoss:APIAccessAll IAM permission is used to perform OpenSearch API operations.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
              "aoss:CreateIndex",
              "aoss:GetIndex",
              "aoss:APIAccessAll"
            ],
            "Resource": "<ARN of your Serverless Collection>"
        }
    ]
}

You also need an OpenSearch Serverless data access policy to create and manage indices and associated resources in the collection. For more information, visit Data access control for Amazon OpenSearch Serverless in the OpenSearch Serverless Developer Guide. Use the following policy:

[
    {
        "Description": "Create index permission",
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": ["index/<collection_name>/*"],
                "Permission": [
                  "aoss:CreateIndex",
                  "aoss:DescribeIndex",
                  "aoss:ReadDocument",
                  "aoss:WriteDocument"
                ]
            }
        ],
        "Principal": [
            "arn:aws:iam::<account_id>:role/<role_name>"
        ]
    },
    {
        "Description": "Create pipeline permission",
        "Rules": [
            {
                "ResourceType": "collection",
                "Resource": ["collection/<collection_name>"],
                "Permission": [
                  "aoss:CreateCollectionItems",
                  "aoss:"
                ]
            }
        ],
        "Principal": [
            "arn:aws:iam::<account_id>:role/<role_name>"
        ]
    },
    {
        "Description": "Create model permission",
        "Rules": [
            {
                "ResourceType": "model",
                "Resource": ["model/<collection_name>/*"],
                "Permission": ["aoss:CreateMLResources"]
            }
        ],
        "Principal": [
            "arn:aws:iam::<account_id>:role/<role_name>"
        ]
    }
]

To access private collections, set up the following network policy:

[
   {
      "Description":"Enable automatic semantic enrichment in private collection",
      "Rules":[
         {
            "ResourceType":"collection",
            "Resource":[
               "collection/<collection_name>"
            ]
         }
      ],
      "AllowFromPublic":false,
      "SourceServices":[
         "aoss.amazonaws.com"
      ]
   }
]

Set up an automatic semantic enrichment index

To set up an automatic semantic enrichment index, follow these steps:

  1. To create an automatic semantic enrichment index using the AWS Command Line Interface (AWS CLI), use the create-index command:
aws opensearchserverless create-index \
    --id <collection_id> \
    --index-name <index_name> \
    --index-schema <index_body>
  2. To describe the created index, use the get-index command:
aws opensearchserverless get-index \
    --id <collection_id> \
    --index-name <index_name>

You can also use AWS CloudFormation templates (Type: AWS::OpenSearchServerless::CollectionIndex) or the AWS Management Console to create semantic search during collection provisioning as well as after the collection is created.

Example: Index setup for product catalog search

This section shows how to set up a product catalog search index. You’ll implement semantic search on the title_semantic field (using an English model). For the product_id field, you’ll maintain default lexical search functionality.

In the following index schema, the title_semantic field has its field type set to text and its semantic_enrichment parameter set to status ENABLED, which enables automatic semantic enrichment on that field. You can use the language_options field to specify either english or multi-lingual. For comparison, this post also creates a nonsemantic title field named title_non_semantic. Use the following code:

aws opensearchserverless create-index \
    --id XXXXXXXXX \
    --index-name 'product-catalog' \
    --index-schema '{
    "mappings": {
        "properties": {
            "product_id": {
                "type": "keyword"
            },
            "title_semantic": {
                "type": "text",
                "semantic_enrichment": {
                    "status": "ENABLED",
                    "language_options": "english"
                }
            },
            "title_non_semantic": {
                "type": "text"
            }
        }
    }
}'
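If you generate index schemas from scripts, building the JSON programmatically avoids shell-quoting mistakes. The following Python sketch reproduces the schema above and prints an equivalent CLI command. The helper name build_index_schema is hypothetical, and <collection_id> remains a placeholder that you must replace.

```python
import json
import shlex


def build_index_schema(semantic_fields, other_fields, language="english"):
    """Return an index schema that enables semantic enrichment on some text fields."""
    properties = dict(other_fields)
    for name in semantic_fields:
        properties[name] = {
            "type": "text",
            "semantic_enrichment": {
                "status": "ENABLED",
                "language_options": language,
            },
        }
    return {"mappings": {"properties": properties}}


schema = build_index_schema(
    semantic_fields=["title_semantic"],
    other_fields={
        "product_id": {"type": "keyword"},
        "title_non_semantic": {"type": "text"},
    },
)

# Assemble the same create-index call shown above; json.dumps guarantees
# the schema is valid JSON and shlex.join quotes it safely for a shell.
command = [
    "aws", "opensearchserverless", "create-index",
    "--id", "<collection_id>",
    "--index-name", "product-catalog",
    "--index-schema", json.dumps(schema),
]
print(shlex.join(command))
```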

Data ingestion

After the index is created, you can ingest data through standard OpenSearch mechanisms, including client libraries, REST APIs, or directly through OpenSearch Dashboards. Here’s an example of how to add multiple documents using the bulk API in OpenSearch Dashboards Dev Tools:

POST _bulk
{"index": {"_index": "product-catalog"}}
{"title_semantic": "Red shoes", "title_non_semantic": "Red shoes", "product_id": "12345" }
{"index": {"_index": "product-catalog"}}
{"title_semantic": "Black shirt", "title_non_semantic": "Black shirt", "product_id": "6789" }
{"index": {"_index": "product-catalog"}}
{"title_semantic": "Blue hat", "title_non_semantic": "Blue hat", "product_id": "0000" }
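When you ingest from an application rather than Dev Tools, you need to produce the same newline-delimited body yourself. The following Python sketch builds it from a list of products; the helper name build_bulk_body is hypothetical, and sending the request (including request signing) is left out.

```python
import json


def build_bulk_body(index_name, documents):
    """Serialize documents into the newline-delimited format the bulk API expects:
    one action line followed by one source line per document."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


products = [("12345", "Red shoes"), ("6789", "Black shirt"), ("0000", "Blue hat")]
docs = [
    {"title_semantic": title, "title_non_semantic": title, "product_id": product_id}
    for product_id, title in products
]

body = build_bulk_body("product-catalog", docs)
print(body)
```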

Search against automatic semantic enrichment index

After the data is ingested, you can query the index:

POST product-catalog/_search?size=1
{
  "query": {
    "match":{
      "title_semantic":{
        "query": "crimson footwear"
      }
    }
  }
}

The following is the response:

{
    "took": 240,
    "timed_out": false,
    "_shards": {
        "total": 0,
        "successful": 0,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 7.6092715,
        "hits": [
            {
                "_index": "product-catalog",
                "_id": "Q61b35YBAkHYIP5jIOWH",
                "_score": 7.6092715,
                "_source": {
                    "title_semantic": "Red shoes",
                    "title_non_semantic": "Red shoes",
                    "title_semantic_embedding": {
                        "feet": 0.85673976,
                        "dress": 0.48490667,
                        "##wear": 0.26745942,
                        "pants": 0.3588211,
                        "hats": 0.30846077,
                        ...
                    },
                    "product_id": "12345"
                }
            }
        ]
    }
}

The search successfully matched the document with Red shoes despite the query using crimson footwear, demonstrating the power of semantic search. The system automatically generated semantic embeddings for the document (truncated here for brevity) which enable these intelligent matches based on meaning rather than exact keywords.
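To see which expansion tokens drive a match, you can sort the stored weights in a hit. The following Python sketch uses the truncated weights from the response above and assumes, based on that sample response, that the enrichment is returned under the field name plus an _embedding suffix. The helper name is hypothetical.

```python
def top_expansion_tokens(hit, field, n=3):
    """Return the n highest-weighted tokens stored for an enriched field."""
    weights = hit["_source"].get(field + "_embedding", {})
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:n]


# The hit from the response above, with only the weights that were shown.
hit = {
    "_source": {
        "title_semantic": "Red shoes",
        "title_semantic_embedding": {
            "feet": 0.85673976,
            "dress": 0.48490667,
            "##wear": 0.26745942,
            "pants": 0.3588211,
            "hats": 0.30846077,
        },
    }
}

for token, weight in top_expansion_tokens(hit, "title_semantic"):
    print(f"{token}: {weight:.3f}")
```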

Comparing search results

By running a similar query against the nonsemantic field title_non_semantic, you can confirm that nonsemantic fields can’t search based on context:

GET product-catalog/_search?size=1
{
  "query": {
    "match":{
      "title_non_semantic":{
        "query": "crimson footwear"
      }
    }
  }
}

The following is the search response:

{
    "took": 398,
    "timed_out": false,
    "_shards": {
        "total": 0,
        "successful": 0,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }
}

Limitations of automatic semantic enrichment

Automatic semantic enrichment is most effective when applied to small-to-medium sized fields containing natural language content, such as movie titles, product descriptions, reviews, and summaries. Although semantic search enhances relevance for most use cases, it might not be optimal for certain scenarios:

  • Very long documents – The current sparse model processes only the first 8,192 tokens of each document for English. For multilingual documents, it’s 512 tokens. For lengthy articles, consider implementing document chunking to ensure complete content processing.
  • Log analysis workloads – Semantic enrichment significantly increases index size, which might be unnecessary for log analysis where exact matching typically suffices. The additional semantic context rarely improves log search effectiveness enough to justify the increased storage requirements.

Consider these limitations when deciding whether to implement automatic semantic enrichment for your specific use case.
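For the long-document case, a simple chunking pass can keep each piece under the model’s token limit. The following Python sketch splits a token list into overlapping windows. It approximates tokens by whitespace splitting, which is an assumption: the service counts model tokens, which usually outnumber words, so leave headroom when choosing max_tokens. The function name and the overlap value are illustrative.

```python
def chunk_tokens(tokens, max_tokens=512, overlap=64):
    """Split a token list into windows of at most max_tokens, where each
    window repeats the last `overlap` tokens of the previous one."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical long article, tokenized by whitespace for illustration.
article = " ".join(f"word{i}" for i in range(1000))
chunks = chunk_tokens(article.split(), max_tokens=512, overlap=64)
print([len(chunk) for chunk in chunks])
```

Each chunk can then be indexed as its own document (or nested object) so that the full article contributes to search.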

Conclusion

Automatic semantic enrichment marks a significant advancement in making sophisticated search capabilities accessible to all OpenSearch Serverless users. By eliminating the traditional complexities of implementing semantic search, search developers can now enhance their search functionality with minimal effort and cost. Our feature supports multiple languages and collection types, with a pay-as-you-use pricing model that makes it economically viable for various use cases. Benchmark results are promising, particularly for English language searches, showing both improved relevance and reduced latency. However, although semantic search enhances most scenarios, certain use cases such as processing extremely long articles or log analysis might benefit from alternative approaches.

We encourage you to experiment with this feature and discover how it can optimize your search implementation so you can deliver better search experiences without the overhead of managing ML infrastructure. Check out the video and tech documentation for additional details.


About the Authors

Jon Handler is Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have generative AI, search, and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.

Arjun Kumar Giri is a Principal Engineer at AWS working on the OpenSearch Project. He primarily works on OpenSearch’s artificial intelligence and machine learning (AI/ML) and semantic search features. He is passionate about AI, ML, and building scalable systems.

Siddhant Gupta is a Senior Product Manager (Technical) at AWS, spearheading AI innovation within the OpenSearch Project from Hyderabad, India. With a deep understanding of artificial intelligence and machine learning, Siddhant architects features that democratize advanced AI capabilities, enabling customers to harness the full potential of AI without requiring extensive technical expertise. His work seamlessly integrates cutting-edge AI technologies into scalable systems, bridging the gap between complex AI models and practical, user-friendly applications.

Amazon OpenSearch Service vector database capabilities revisited

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-vector-database-capabilities-revisited/

In 2023, we blogged about OpenSearch Service vector database capabilities. Since then, OpenSearch and Amazon OpenSearch Service have developed to bring better performance, lower cost, and enhanced tradeoffs. We’ve improved the OpenSearch Service hybrid lexical and semantic search methods using both dense vectors and sparse vectors. We’ve simplified connecting with and managing large language models (LLMs) hosted in other environments. We’ve brought native chunking and streamlined searching for chunked documents.

Where 2023 saw the explosion of LLMs for generative AI and LLM-generated vector embeddings for semantic search, 2024 was a year of consolidation and reification. Applications relying on Retrieval Augmented Generation (RAG) started to move from proof of concept (POC) to production, with all of the attendant concerns on hallucinations, inappropriate content, and cost. Builders of search applications began to move their semantic search workloads to production, seeking improved relevance to drive their businesses.

As we enter 2025, OpenSearch Service support for OpenSearch 2.17 brings these improvements to the service. In this post, we walk through 2024’s innovations with an eye to how you can adopt new features to lower your cost, reduce your latency, and improve the accuracy of your search results and generated text.

Using OpenSearch Service as a vector database

Amazon OpenSearch Service as a vector database provides you with the core capabilities to store vector embeddings from LLMs and use vector and lexical information to retrieve documents based on their lexical similarity, as well as their proximity in vector space. OpenSearch Service continues to support three vector engines: Facebook AI Similarity Search (FAISS), Non-Metric Space Library (NMSLIB), and Lucene. The service supports exact nearest-neighbor matching and approximate nearest-neighbor matching (ANN). For ANN, the service provides both Hierarchical Navigable Small World (HNSW) and Inverted File (IVF) for storage and retrieval. The service further supports a wealth of distance metrics, including Euclidean distance, cosine similarity, Manhattan distance, and more.

The move to hybrid search

The job of a search engine is to take as input a searcher’s intent, captured as words, locations, numeric ranges, dates, (and, with multimodal search, rich media such as images, videos, and audio) and return a set of results from its collection of indexed documents that meet the searcher’s need. For some queries, such as “plumbing fittings for CPVC pipes,” the words in a product’s description and the words that a searcher uses are sufficient to bring the right results, using the standard Term Frequency-Inverse Document Frequency (TF/IDF) similarity metric. These queries are characterized by a high level of specificity in the searcher’s intent, which matches well to the words they use and the product’s name and description. When the searcher’s intent is more abstract, such as “a cozy place to curl up by the fire,” the words are less likely to provide a good match.

To best serve their users across the range of queries, builders have largely started to take a hybrid search approach, using both lexical and semantic retrieval with combined ranking. OpenSearch provides a hybrid search that can blend lexical queries, k-Nearest Neighbor (k-NN) queries, and neural queries using OpenSearch’s neural search plugin. Builders can implement three levels of hybrid search—lexical filtering along with vectors, combining lexical and vector scores, and out-of-the-box score normalization and blending.
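The third level, score normalization and blending, can be illustrated outside the engine. The following Python sketch applies min-max normalization to each result list and combines the normalized scores with a weighted arithmetic mean. The scores, weights, and function names are invented for illustration; this is not the OpenSearch implementation, only the general idea.

```python
def min_max_normalize(scores):
    """Rescale scores to the range [0, 1] so different scorers are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def blend(lexical, semantic, weights=(0.5, 0.5)):
    """Weighted arithmetic mean of normalized lexical and semantic scores.
    A document missing from one list contributes 0 for that list."""
    lex, sem = min_max_normalize(lexical), min_max_normalize(semantic)
    total = sum(weights)
    combined = {
        doc_id: (weights[0] * lex.get(doc_id, 0.0) + weights[1] * sem.get(doc_id, 0.0)) / total
        for doc_id in set(lex) | set(sem)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores: BM25 scores are unbounded, similarity scores are not.
lexical_scores = {"a": 12.0, "b": 6.0, "c": 3.0}
semantic_scores = {"b": 0.9, "c": 0.8, "d": 0.3}

ranked = blend(lexical_scores, semantic_scores)
for doc_id, score in ranked:
    print(f"{doc_id}: {score:.3f}")
```

Normalizing first matters because raw lexical and vector scores live on different scales; without it, one scorer would dominate the blend.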

In 2024, OpenSearch improved its hybrid search capability with conditional scoring logic, improved constructs, removal of repetitive and unnecessary calculations, and optimized data structures, yielding as much as a fourfold latency improvement. OpenSearch also added support for parallelization of the query processing for hybrid search, which can deliver up to 25% improvement in latency. OpenSearch released post-filtering for hybrid queries, which can help further dial in search results. 2024 also saw the release of OpenSearch Service’s support for aggregations for hybrid queries.

Sparse vector search is a different way of combining lexical and semantic information. Sparse vectors reduce corpus terms to around 32,000 terms, the same as or closely aligned with the source. Sparse vectors use weights that are mostly zero or near-zero to provide a weighted set of tokens that capture the meaning of the terms. Queries are translated to the reduced token set, with generalization provided by sparse models. In 2024, OpenSearch introduced two-phase processing for sparse vectors that improves latency for query processing.

Focus on accuracy

One of builders’ primary concerns in moving their workloads to production has been balancing retrieval accuracy (derivatively, generated text accuracy) with the cost and latency of the solution. Over the course of 2024, OpenSearch and OpenSearch Service brought capabilities for trading off between cost, latency, and accuracy. One area of innovation for the service was to bring out various methods for reducing the amount of RAM consumed by vector embeddings through k-NN vector quantization methods. Beyond these new methods, OpenSearch has long supported product quantization for the FAISS engine. Product quantization uses training to build centroids for vector clusters on reduced-dimension sub-vectors and queries by matching these centroids. We’ve blogged about the latency and cost benefits of product quantization.

You use a chunking strategy to divide up long documents into smaller, retrievable pieces. The insight for doing that is that large pieces of text have many pools of meaning, captured in sentences, paragraphs, tables, and figures. You choose chunks that are units of meaning, within pools of related words. In 2024, OpenSearch made this process work with a straightforward k-NN query, alleviating the need for custom processing logic. You can now represent long documents as multiple vectors in a nested field. When you run k-NN queries, each nested field is treated as a single vector (an encoded long document). Previously, you had to implement custom processing logic in your application to support the querying of documents represented as vector chunks. With this feature, you can run k-NN queries, making it seamless for you to create vector search applications.
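The following Python sketch shows the kind of application-side logic that chunked documents otherwise require: score every chunk, then rank each parent document once. It assumes the best-matching chunk determines the document’s score, which is a simplification for illustration and not a description of OpenSearch internals. The vectors and names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_chunked(query, docs, k=2):
    """Rank parent documents by their best-matching chunk vector."""
    scored = [
        (doc_id, max(cosine(query, chunk) for chunk in chunks))
        for doc_id, chunks in docs.items()
        if chunks
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Hypothetical 2-dimensional chunk embeddings; doc1 was split into two chunks.
docs = {
    "doc1": [[1.0, 0.0], [0.0, 1.0]],
    "doc2": [[0.7, 0.7]],
    "doc3": [[-1.0, 0.0]],
}

results = search_chunked([0.0, 1.0], docs, k=2)
print(results)
```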

Similarity search is designed around finding the k nearest vectors, representing the top-k most similar documents. In 2024, OpenSearch updated its k-NN query interface to include filtering k-NN results based on distance and vector score, alongside existing top-k support. This is ideal for use cases in which your goal is to retrieve all the results that are highly or sufficiently similar (for example, >= 0.95), minimizing the possibility of missing highly relevant results because they don’t meet a top-k threshold.
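The difference between the two retrieval modes is easy to see on precomputed similarity scores. In the following Python sketch, the scores and function names are hypothetical; the 0.95 threshold comes from the example above.

```python
def top_k(scores, k):
    """Classic k-NN: keep the k highest-scoring documents, however similar they are."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def at_least(scores, min_score):
    """Threshold search: keep every document whose score meets the minimum."""
    matches = [doc_id for doc_id, score in scores.items() if score >= min_score]
    return sorted(matches, key=scores.get, reverse=True)


# Hypothetical similarity scores for one query.
scores = {"a": 0.99, "b": 0.97, "c": 0.96, "d": 0.80}

print(top_k(scores, k=2))
print(at_least(scores, min_score=0.95))
```

With k=2, document c is dropped even though it clears the similarity bar; the threshold query keeps it and still excludes d.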

Reducing cost for production workloads

In 2024, OpenSearch introduced and extended scalar and binary quantization that reduce the number of bits used to store each vector. OpenSearch already supported product quantization for vectors. When using these scalar and binary quantization methods, OpenSearch reduces the number of bits used to store vectors in the k-NN index from 32-bit floating-point numbers down to as little as 1 bit per dimension. For scalar quantization, OpenSearch supports half precision (also called fp16), and quarter precision with 8-bit integers for two times and four times the compression, respectively.

For binary quantization, OpenSearch supports 1-bit, 2-bit, and 4-bit compression for 32, 16, and 8 times compression respectively. These quantization methods are lossy, reducing accuracy. In our testing, we’ve seen minimal impact on accuracy—as little as 2% on some standardized data sets—with up to 32 times reduction in RAM consumed.
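The compression factors follow directly from the bit counts. The following Python sketch works through the arithmetic for a hypothetical corpus of 1 million 768-dimension vectors; the corpus size and dimension are assumptions, and the estimate covers raw vector storage only, not index structures such as HNSW graphs.

```python
FULL_PRECISION_BITS = 32  # one 32-bit float per dimension


def vector_memory_bytes(num_vectors, dimensions, bits_per_dimension):
    """Raw storage for the vectors alone, in bytes."""
    return num_vectors * dimensions * bits_per_dimension / 8


def compression_ratio(bits_per_dimension):
    """How many times smaller than full-precision storage."""
    return FULL_PRECISION_BITS / bits_per_dimension


# Hypothetical workload.
num_vectors, dimensions = 1_000_000, 768

for label, bits in [("fp32", 32), ("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("2-bit", 2), ("1-bit", 1)]:
    gib = vector_memory_bytes(num_vectors, dimensions, bits) / 2**30
    print(f"{label:>5}: {gib:6.3f} GiB, {compression_ratio(bits):4.0f}x compression")
```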

In-memory handling of dense vectors drives cost in proportion to the number of vectors, the vector dimensions, and the parameters you set for indexing. In 2024, OpenSearch extended vector handling to include disk-based vector search. With disk-based search, OpenSearch keeps a reduced bit-count vector in memory for generating match candidates, retrieving full-precision vectors for the final scoring and ranking. The default compression of 32 times means a reduction in RAM needs by 32 times with an attendant reduction in the cost of the solution.
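The two-phase pattern can be sketched in a few lines of Python. This toy version binarizes vectors by sign, picks candidates by Hamming distance on the compact codes, and then rescores only those candidates with the full-precision vectors. The quantizer, the oversampling factor, and all names are assumptions for illustration; OpenSearch’s actual quantization and rescoring differ in detail.

```python
def binarize(vector):
    """1 bit per dimension: keep only the sign of each component."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_phase_search(query, vectors, k=1, oversample=3):
    # Phase 1: cheap candidate generation on the compressed codes,
    # which stand in for the reduced bit-count vectors held in memory.
    query_bits = binarize(query)
    codes = {doc_id: binarize(v) for doc_id, v in vectors.items()}
    candidates = sorted(codes, key=lambda d: hamming(query_bits, codes[d]))[: k * oversample]
    # Phase 2: rescore only the candidates with full-precision vectors,
    # which stand in for the vectors retrieved from disk.
    return sorted(candidates, key=lambda d: dot(query, vectors[d]), reverse=True)[:k]


# Hypothetical full-precision vectors.
vectors = {
    "a": [0.9, 0.1, -0.2],
    "b": [0.2, 0.8, -0.1],
    "c": [-0.5, -0.4, 0.9],
    "d": [0.1, 0.9, -0.3],
}

print(two_phase_search([0.1, 1.0, -0.2], vectors, k=1))
```

In this example the 1-bit codes cannot distinguish a, b, and d, so oversampling keeps all three as candidates and the full-precision pass picks the true nearest neighbor.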

In 2024, OpenSearch introduced support for JDK21, which users can use to run OpenSearch clusters on the latest Java version. OpenSearch further enhanced performance by adding support for Single Instruction, Multiple Data (SIMD) instruction sets for exact search queries. Previous versions have supported SIMD for ANN search queries. The integration of SIMD for exact search requires no additional configuration steps, making it a seamless performance improvement. You can expect a significant reduction in query latencies and a more efficient and responsive search experience, with approximately 1.5 times faster performance than non-SIMD implementations.

Increasing innovation velocity

In November 2023, OpenSearch 2.9 was released on Amazon OpenSearch Service. The release included high-level vector database interfaces such as neural search, hybrid search, and AI connectors. For instance, users can use neural search to run semantic queries with text input instead of vectors. Using AI connectors to services such as Amazon SageMaker, Amazon Bedrock, and OpenAI, neural search encodes text into vectors using the customers’ preferred models and rewrites text-based queries into k-NN queries transparently. Effectively, neural search alleviated the need for customers to develop and manage custom middleware to perform this functionality, which is required by applications that use the k-NN APIs.

With the following 2.11 and 2.13 releases, OpenSearch added high-level interfaces for multimodal and conversational search, respectively. With multimodal search, customers can run semantic queries using a combination of text and image inputs to find images. As illustrated in this OpenSearch blog post, multimodal enables new search paradigms. An ecommerce customer, for instance, could use a photo of a shirt and describe alterations such as “with desert colors” to shop for clothes fashioned to their tastes. Facilitated by a connector to Amazon Bedrock Titan Multimodal Embeddings G1, vector generation and query rewrites are handled by OpenSearch.

Conversational search enabled yet another search paradigm, which users can use to discover information through chat. Conversational searches run RAG pipelines, which use connectors to generative LLMs such as Anthropic’s Claude 3.5 Sonnet in Amazon Bedrock, OpenAI ChatGPT, or DeepSeek R1 to generate conversational responses. A conversational memory module provides LLMs with persistent memory by retaining conversation history.

With OpenSearch 2.17, support for search AI use cases was expanded through AI-native pipelines. With ML inference processors (search request, response, ingestion), customers can enrich data flows on OpenSearch with any machine learning (ML) model or AI service. Previously, enrichments were limited to a few model types such as text embedding models to support neural search. Without limitations on model type support, the full breadth of search AI use cases can be powered by OpenSearch search and ingest pipeline APIs.

Conclusion

OpenSearch continues to explore and enhance its features to build scalable, cost-effective, and low-latency semantic search and vector database solutions. The OpenSearch Service neural plugin, connector framework, and high-level APIs reduce complexity for builders, making the OpenSearch Service vector database more approachable and powerful. 2024’s improvements span text-based exact searches, semantic search, and hybrid search. These performance enhancements, feature innovations, and integrations provide a robust foundation for creating AI-driven solutions that provide better performance and more accurate results. Try out these new features with the latest version of OpenSearch.


About the Author

Jon Handler is Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have generative AI, search, and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.

Supercharge your RAG applications with Amazon OpenSearch Service and Aryn DocParse

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/supercharge-your-rag-applications-with-amazon-opensearch-service-and-aryn-docparse/

The old adage “garbage in, garbage out” applies to all search systems. Whether you are building for ecommerce, document retrieval, or Retrieval Augmented Generation (RAG), the quality of your search results depends on the quality of your search documents. Downstream, RAG systems improve the quality of generated answers by adding relevant data from other systems to the generative prompt. Most RAG solutions use a search engine to search for this relevant data. To get great responses, you need great search results, and to get great search results, you need great data. If you don’t properly partition, extract, enrich, and clean your data before loading it, your search results will reflect the poor quality of your search documents.

Aryn DocParse segments and labels PDF documents, runs OCR, extracts tables and images, and more. It turns your messy documents into beautiful, structured JSON, which is the first step of document extract, transform, and load (ETL). DocParse runs the open source Aryn Partitioner and its state-of-the-art, open source deep learning DETR AI model trained on over 80,000 enterprise documents. This leads to up to 6 times more accurate data chunking and 2 times improved recall on vector search or RAG when compared to off-the-shelf systems. The following screenshot is an example of how DocParse would segment a page in an ETL pipeline. You can visualize labeled bounding boxes for each document segment using the Aryn Playground.

In this post, we demonstrate how to use Amazon OpenSearch Service with purpose-built document ETL tools, Aryn DocParse and Sycamore, to quickly build a RAG application that relies on complex documents. We use over 75 PDF reports from the National Transportation Safety Board (NTSB) about aircraft incidents. You can refer to the following example document from the collection. As you can see, these documents are complex, containing tables, images, section headings, and complicated layouts.

Let’s get started!

Prerequisites

Complete the following prerequisite steps:

  1. Create an OpenSearch Service domain. For more details, see Creating and managing Amazon OpenSearch Service domains. You can create a domain using the AWS Management Console, AWS Command Line Interface (AWS CLI), or SDK. Be sure to choose public access for your domain, and set up a user name and password for your domain’s primary user so that you can run the notebook from your laptop, Amazon SageMaker Studio, or an Amazon Elastic Compute Cloud (EC2) instance. To keep costs low, you can create an OpenSearch Service domain with a single t3.small search node in a dev/test configuration for this example. Take note of the domain’s endpoint to use in later steps.
  2. Get an Aryn API key.
  3. You will be using Anthropic’s Claude large language model (LLM) on Amazon Bedrock in the ETL pipeline, so make sure your notebook has access to AWS credentials with the required permissions.
  4. Have access to a Jupyter environment to open and run the notebook.

Use DocParse and Sycamore to chunk data and load OpenSearch Service

Although you can generate an ETL pipeline to load your OpenSearch Service domain using the Aryn DocPrep UI, we will instead focus on the underlying Sycamore document ETL library and write a pipeline from scratch.

Sycamore was designed to make it straightforward for developers and data engineers to define complex data transformations over large collections of documents. Borrowing some ideas from popular dataflow frameworks like Apache Spark, Sycamore has a core abstraction called the DocSet. Each DocSet represents a collection of unstructured documents, and is scalable from a single document to many thousands. Each document in a DocSet has an arbitrary set of key-value properties as metadata, as well as an ordered list of elements. An Element corresponds to a chunk of the document that can be processed and embedded separately, such as a table, headline, text passage, or image. Like documents, Elements can also contain arbitrary key-value properties to encode domain- or application-specific metadata.
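
To make the data model concrete, here is a minimal sketch of the document and element shapes described above. The class and field names are illustrative assumptions, not Sycamore's actual classes, and a plain Python list stands in for the DocSet.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Element:
    """One chunk of a document, such as a table, headline, or text passage."""
    type: str
    text: str
    properties: dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    """Key-value metadata plus an ordered list of elements."""
    properties: dict[str, Any] = field(default_factory=dict)
    elements: list[Element] = field(default_factory=list)


# A plain list stands in for a DocSet here (assumption for illustration).
doc = Document(
    properties={"path": "reports/example.pdf"},  # hypothetical path
    elements=[
        Element(type="Section-header", text="Analysis"),
        Element(type="Text", text="The pilot reported a loss of engine power."),
    ],
)
docset = [doc]
element_types = [e.type for d in docset for e in d.elements]
```

Because both documents and elements carry their own properties, later pipeline steps can attach extracted entities to a document and summaries to individual elements.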

Notebook walkthrough

We’ve created a Jupyter notebook that uses Sycamore to orchestrate data preparation and loading. This notebook uses Sycamore to create a data processing pipeline that sends documents to DocParse for initial document segmentation and data extraction, then runs entity extraction and data transforms, and finally loads data into OpenSearch Service using a connector.

Copy the notebook into your Amazon SageMaker JupyterLab space, launch it using a Python kernel, then walk through the cells along with the following procedures.

To install Sycamore with the OpenSearch Service connector and local inference features necessary to create vector embeddings, run the first cell of the notebook:

!pip install 'sycamore-ai[opensearch,local-inference]'

In the second cell of the notebook, fill in your ARYN_API_KEY. You should be able to complete the example in the notebook for less than $1.

Cell 3 does the initial work of reading the source data and preparing a DocSet for that data. After initializing the Sycamore context and setting paths, this code calls out to DocParse to create a partitioned_docset:

partitioned_docset = (
  docset.partition(
    partitioner=ArynPartitioner(
      extract_table_structure=True,
      extract_images=True
    )
  ).materialize(
      path="./opensearch-tutorial/partitioned-docset",
      source_mode=sycamore.MATERIALIZE_USE_STORED
    )
)
partitioned_docset.execute()

The previous code uses materialize to create and save a checkpoint. In future runs, the code will use the materialized view to save a few minutes of time. partitioned_docset.execute() forces the pipeline to execute. Sycamore uses lazy execution to create efficient query plans, and would otherwise execute the pipeline at a much later step.
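
The following toy sketch illustrates the two ideas in the previous paragraph: steps are only recorded until execute() is called, and a stored result is reused on later runs. The LazyPipeline class is a hypothetical stand-in written for this illustration; it is not how Sycamore is implemented, and it keeps the checkpoint in memory rather than on disk.

```python
class LazyPipeline:
    """Toy pipeline that records steps and runs them only on execute()."""

    def __init__(self, source, steps=None):
        self._source = source
        self._steps = list(steps or [])
        self._cache = None  # stands in for a checkpoint stored on disk

    def map(self, fn):
        # Record the step and return a new pipeline; nothing runs yet.
        return LazyPipeline(self._source, self._steps + [fn])

    def execute(self):
        # Run the recorded steps once, then reuse the stored result.
        if self._cache is None:
            items = list(self._source)
            for fn in self._steps:
                items = [fn(item) for item in items]
            self._cache = items
        return self._cache


calls = []


def shout(text):
    calls.append(text)  # track how many times real work happens
    return text.upper()


pipeline = LazyPipeline(["a", "b"]).map(shout)
calls_before = len(calls)    # still zero: the step was only recorded
first = pipeline.execute()   # runs the recorded steps
second = pipeline.execute()  # served from the stored result
```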

After this step, each document in the DocSet now includes the partitioned output from DocParse, including bounding boxes, text content, and images from that document, stored as elements.

Entity extraction

Part of the key to building good retrieval for RAG is adding structured information that enables accurate filtering for the search query. Sycamore provides LLM-powered transforms that can extract this information and store it as structured properties, enriching the document. Sycamore can do unsupervised or supervised schema extraction, where it pulls out fields based on a JSON schema you provide. When executing these types of transforms, Sycamore will take a specified number of elements from each document, use an LLM to extract the specified fields, and include them as properties in the document.

Cell 4 uses supervised schema extraction, setting the schema as the fields you want to extract. You can add additional information that is passed to the LLM performing the entity extraction. The location property is an example of this:

schema = {
            'type': 'object',
            'properties': {'accidentNumber': {'type': 'string'},
                           'dateAndTime': {'type': 'date'},
                           'location': {
                             'type': 'string', 
                             'description': 'US State where the incident occurred'
                           },
                           'aircraft': {'type': 'string'},
                           'aircraftDamage': {'type': 'string'},
                           'injuries': {'type': 'string'},
                           'definingEvent': {'type': 'string'}},
            'required': ['accidentNumber',
                         'dateAndTime',
                         'location',
                         'aircraft']
    }

schema_name = 'FlightAccidentReport'
property_extractor=LLMPropertyExtractor(llm=llm, num_of_elements=20, schema_name=schema_name, schema=schema)

The LLMPropertyExtractor uses the schema you provided to add additional properties to the document. Next, summarize the images to add additional information to improve retrieval.

Image summarization

There’s more information in your documents than just text—as the saying goes, a picture is worth 1,000 words! When your documents contain images, you can capture the information in those images using Sycamore’s SummarizeImages transform. SummarizeImages uses an LLM to compute a text summary for the image, then adds the summary to that element. Sycamore will also send related information about the image, like a caption, to the LLM to aid with summarization. The following code (in cell 4) takes advantage of DocParse type labeling to automatically apply SummarizeImages to image elements:

enriched_docset = enriched_docset.transform(SummarizeImages, summarizer=LLMImageSummarizer(llm=llm))

This cell can take up to 20 minutes to complete.

Now that your image elements contain additional retrieval information, it’s time to clean and normalize the text in the elements and extracted entities.

Data cleaning and formatting

Unless you are in direct control of the creation of the documents you are processing, you will likely need to normalize that data and make it ready for search. Sycamore makes it straightforward for you to clean messy data and bring it to a regular form, fixing data quality issues.

For example, in the NTSB data, dates in the incident report are not all formatted the same way, and some US state names are shown as abbreviations. Sycamore makes it straightforward to write custom transformations in Python, and also provides several useful cleaning and formatting transforms. Cell 4 uses two functions in Sycamore to format the state names and dates:

formatted_docset = (
  enriched_docset
  
  # Converts state abbreviations to their full names.
  .map(lambda doc: USStateStandardizer.standardize(
    doc, key_path = ["properties","entity","location"])
  )

  # Converts datetime into a common format
  .map(lambda doc: DateTimeStandardizer.standardize(
    doc, key_path = ["properties","entity","dateAndTime"])
  )
)

The elements are now in normal form, with extracted entities and image descriptions. The next step is to merge together semantically related elements to create chunks.
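
If you want to see what this kind of normalization involves, the following sketch does the same two jobs with only the Python standard library. The state mapping, the list of date formats, and the function names are assumptions made for illustration; Sycamore's standardizers may behave differently.

```python
from datetime import datetime

# Small illustrative subset; a real mapping would cover every state.
STATE_NAMES = {"CA": "California", "TX": "Texas", "NY": "New York"}
# Hypothetical input formats, tried in order.
DATE_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"]


def standardize_state(value: str) -> str:
    """Expand a two-letter abbreviation; leave other values untouched."""
    cleaned = value.strip()
    return STATE_NAMES.get(cleaned.upper(), cleaned)


def standardize_date(value: str) -> str:
    """Return the date as YYYY-MM-DD, or the input unchanged if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value


entity = {"location": "ca", "dateAndTime": "January 12, 2023"}
cleaned = {
    "location": standardize_state(entity["location"]),
    "dateAndTime": standardize_date(entity["dateAndTime"]),
}
```

Returning the input unchanged when nothing matches keeps unexpected values visible downstream instead of silently dropping them.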

Create final chunks and vector embeddings

When you prepare for RAG, you create chunks—parts of the full document that are related information. You design your chunks so that as a search result they can be added to the prompt to provide a unit of meaning and information. There are many ways to approach chunking. If you have small documents, sometimes the whole document is a chunk. If you have larger documents, sentences, paragraphs, or even sections can be a chunk. As you iterate on your end application, it’s common to adjust the chunking strategy to fine-tune the accuracy of retrieval. Sycamore automates the process of building chunks by merging together the elements of the DocSet.

At this stage of the processing in cell 4, each document in our DocSet has a set of elements. The following code merges elements together using a chunking strategy to create larger elements that will improve query results. For instance, the DocSet might have an element that is a table and an element that is a caption for that table. Merging those elements together creates a chunk that’s a better search result.

We will use Sycamore’s Merge transform with the GreedySectionMerger merging strategy to add elements in the same document section together into larger chunks:

merger = GreedySectionMerger(
  tokenizer=HuggingFaceTokenizer(
    "sentence-transformers/all-MiniLM-L6-v2"),
  max_tokens=512
)
chunked_docset = formatted_docset.merge(merger=merger)
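
As a rough illustration of greedy merging under a token budget, the sketch below joins consecutive elements until the next one would exceed max_tokens. It is a simplification written for this post's explanation: it counts whitespace-separated words instead of using a model tokenizer, and it ignores section boundaries, which the real GreedySectionMerger takes into account.

```python
def greedy_merge(elements: list[str], max_tokens: int) -> list[str]:
    """Join consecutive elements until adding the next would exceed max_tokens."""
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for text in elements:
        n_tokens = len(text.split())  # whitespace "tokenizer", for illustration only
        if current and current_tokens + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


elements = [
    "Table 1. Injuries",              # caption
    "Crew 1 Passengers 0",            # table contents
    "The airplane departed at noon",  # unrelated passage
]
chunks = greedy_merge(elements, max_tokens=8)
```

With a budget of 8 tokens, the caption and table land in one chunk and the passage starts a new one.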

With chunks created, it’s time to add vector embeddings for the chunks.

Create vector embeddings

Use vector embeddings to enable semantic search in OpenSearch Service. With semantic search, you retrieve documents that are close to a query in a multidimensional space, rather than by matching words exactly. In RAG systems, it’s common to use semantic search along with lexical search for a hybrid search. Hybrid search combines the strengths of both retrieval methods.

The code in cell 4 creates vector embeddings for each chunk. You can use a variety of different AI models with Sycamore’s embed transform to create vector embeddings. You can run these locally or use a service like Amazon Bedrock or OpenAI. The embedding model you choose has a huge impact on your search quality, and it’s common to experiment with this variable as well. In this example, you create embeddings locally using a model called GTE:

model_name = "thenlper/gte-small"
embedded_docset = chunked_docset.spread_properties(["entity", "path"]).explode().embed(
      embedder=SentenceTransformerEmbedder(batch_size=10_000, model_name=model_name)
)
embedded_docset = embedded_docset.materialize(
  path="./opensearch-tutorial/embedded-docset",
  source_mode=sycamore.MATERIALIZE_USE_STORED
)
embedded_docset.execute()

You use materialize again here, so you can checkpoint the processed DocSet before loading. If there is an error when loading the indexes, you can retry without running the last few steps of the pipeline again.

Load OpenSearch Service

The final ETL step is loading the prepared data into OpenSearch Service vector and keyword indexes to power hybrid search for the RAG application. Sycamore makes loading indexes straightforward with its set of connectors. Cell 5 adds configuration, specifying the OpenSearch Service domain endpoint and what indexes to create. If you’re following along, be sure to replace YOUR-DOMAIN-ENDPOINT, YOUR-OPENSEARCH-USERNAME, and YOUR-OPENSEARCH-PASSWORD in cell 5 with the actual values.

If you copied your domain endpoint from the console, it will start with the https:// URL scheme. When you replace YOUR-DOMAIN-ENDPOINT, be sure to remove https://.
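
If you prefer to handle this in code rather than by hand, a small helper like the following can normalize the value. The function name and the example domain are made up for illustration.

```python
def strip_scheme(endpoint: str) -> str:
    """Remove a leading http:// or https:// and any trailing slash."""
    for scheme in ("https://", "http://"):
        if endpoint.startswith(scheme):
            endpoint = endpoint[len(scheme):]
            break
    return endpoint.rstrip("/")


# The domain name below is a made-up placeholder.
host = strip_scheme("https://search-example-abc123.us-east-1.es.amazonaws.com/")
```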

In cell 6, Sycamore’s OpenSearch connector loads the data into an OpenSearch index:

embedded_docset.write.opensearch(
    os_client_args=openSearch_client_args,
    index_name="aryn-rag-demo",
    index_settings=index_settings,
)

Congratulations! You’ve completed some of the core processing steps to take raw PDFs and prepare them as a source for retrieval in a RAG application. In the next cells, you will run a couple of RAG queries.

Run a RAG query on OpenSearch using Sycamore

In cell 7, Sycamore’s query and summarize functions create a RAG pipeline on the data. The query step uses OpenSearch’s vector search to retrieve the relevant passages for RAG. Then, cell 8 runs a second RAG query that filters on metadata that Sycamore extracted in the ETL pipeline, yielding even better results. You could also use an OpenSearch hybrid search pipeline to perform hybrid vector and lexical retrieval.

Cell 7 asks “What was common with incidents in Texas, and how does that differ from incidents in California?” Sycamore’s summarize_data transform runs the RAG query, and uses the LLM specified for generation (in this case, it’s Anthropic’s Claude):

Based on the provided data, it appears that the common factor among the incidents 
in Texas was that many of them involved substantial aircraft damage, with some resulting 
in injuries or fatalities. The incidents covered a range of aircraft types, including small
planes like Cessnas and Pipers, as well as a helicopter. The defining events varied, 
including loss of control on the ground, engine failures, fuel issues, and collisions 
with terrain or objects.

In contrast, the incidents in California seemed to primarily involve substantial aircraft
damage as well, but with fewer injuries reported. The defining events included loss of 
control on the ground, collisions during takeoff or landing, and a miscellaneous/other event.
One key difference is that the Texas incidents included a fatal accident (CEN23FA084) 
involving a Piper PA46 that resulted in 4 fatalities and 1 serious injury after impacting 
terrain. The California incidents did not appear to have any fatal accidents based on the 
provided data.

Additionally, while both states had incidents involving loss of control on the ground, the 
Texas incidents seemed to have a higher proportion of engine failures, fuel issues, and 
collisions with terrain or objects as defining events compared to California.

Overall, while both states experienced aviation incidents resulting in substantial aircraft
damage, the Texas incidents tended to be more severe in terms of injuries and fatalities, 
with a higher prevalence of engine failures, fuel issues, and terrain/object collisions as 
contributing factors.

Using metadata filters in a RAG query

Cell 8 makes a small adjustment to the code to add a filter to the vector search, filtering for documents from incidents with the location of California. Filters increase the accuracy of chatbot responses by removing irrelevant data from the result the RAG pipeline passes to the LLM in the prompt.

To add a filter, cell 8 adds a filter clause to the k-nearest neighbors (k-NN) query:

os_query["query"]["knn"]["embedding"]["filter"] = {"match_phrase": {"properties.entity.location": "California"}}
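
For context, here is a sketch of what the full query body might look like once the filter is attached. The embedding field name and the filter come from the line above; the helper function, the tiny three-value vector, and the size and k values are assumptions for illustration. A real query vector has as many dimensions as the embedding model produces.

```python
def build_filtered_knn_query(query_vector, k, state):
    """Build a k-NN query body and attach a match_phrase filter on location."""
    os_query = {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }
    # Same assignment as in the notebook cell, with the state as a parameter.
    os_query["query"]["knn"]["embedding"]["filter"] = {
        "match_phrase": {"properties.entity.location": state}
    }
    return os_query


os_query = build_filtered_knn_query([0.1, 0.2, 0.3], k=5, state="California")
```

Placing the filter inside the knn clause asks the engine to apply it as part of the vector search, rather than filtering after the nearest neighbors have already been chosen.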

The output from the RAG query is as follows:

Based on the database entries provided, several incidents occurred in California during January 2023:

1. On January 12th, a Cessna 180K aircraft sustained substantial damage in a collision during takeoff 
or landing at Agua Caliente Springs, California. There was 1 person on board with no injuries reported.

2. On January 20th, a Cessna 195A aircraft sustained substantial damage due to a loss of control on the 
ground at Calexico, California. There were 3 people on board with no injuries.  

3. On January 15th, a Piper PA-28-180 aircraft sustained substantial damage in a miscellaneous incident 
at San Diego, California during an instructional flight. There were 4 people on board with no injuries.

4. On January 1st, a Cessna 172 aircraft sustained substantial damage in a collision during takeoff or 
landing at Watsonville, California during an instructional flight. There was 1 serious injury reported.

5. On January 27th, a Cessna T210N aircraft sustained substantial damage when it descended into a ravine 
and impacted the ground about 2,000 feet short of the runway threshold at Murrieta, California. There were
1 serious injury and 1 minor injury reported. The engine did not respond during the landing approach.

The details provided in the database entries, such as aircraft type, location, date/time, damage level, 
injuries, and a brief description of the defining event, serve as evidence for these incidents occurring 
in California during the specified time period.

Clean up

Be sure to clean up the resources you deployed for this walkthrough:

  1. Delete your OpenSearch Service domain.
  2. Remove any Jupyter environments you created.

Conclusion

In this post, you used Aryn DocParse and Sycamore to parse, extract, enrich, clean, embed, and load data into vector and keyword indexes in OpenSearch Service. You then used Sycamore to run RAG queries on this data. Your second RAG query used an OpenSearch filter on metadata to get a more accurate result.

The way in which your documents are parsed, enriched, and processed has a significant impact on the quality of your RAG queries. You can use the examples in this post to build your own RAG systems with Aryn and OpenSearch Service, and iterate on the processing and retrieval strategies as you build your generative AI application.


About the Authors

Jon Handler is Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Jon is the founding Chief Product Officer at Aryn. Prior to that, he was the SVP of Product Management at Dremio, a data lake company. Earlier, Jon was a Director at AWS, and led product management for in-memory database services (Amazon ElastiCache and Amazon MemoryDB for Redis), Amazon EMR (Apache Spark and Hadoop), and founded and was GM of the blockchain division. Jon has an MBA from Stanford Graduate School of Business and a BA in Chemistry from Washington University in St. Louis.

Improve search results for AI using Amazon OpenSearch Service as a vector database with Amazon Bedrock

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/improve-search-results-for-ai-using-amazon-opensearch-service-as-a-vector-database-with-amazon-bedrock/

Artificial intelligence (AI) has transformed how humans interact with information in two major ways: search applications and generative AI. Search applications include ecommerce websites, document repository search, customer support call centers, customer relationship management, matchmaking for gaming, and application search. Generative AI use cases include chatbots with Retrieval-Augmented Generation (RAG), intelligent log analysis, code generation, document summarization, and AI assistants. AWS recommends Amazon OpenSearch Service as a vector database for Amazon Bedrock, with the two services serving as the building blocks to power your solution for these workloads.

In this post, you’ll learn how to use OpenSearch Service and Amazon Bedrock to build AI-powered search and generative AI applications. You’ll learn about how AI-powered search systems employ foundation models (FMs) to capture and search context and meaning across text, images, audio, and video, delivering more accurate results to users. You’ll learn how generative AI systems use these search results to create original responses to questions, supporting interactive conversations between humans and machines.

The post addresses common questions such as:

  1. What is a vector database and how does it support generative AI applications?
  2. Why is Amazon OpenSearch Service recommended as a vector database for Amazon Bedrock?
  3. How do vector databases help prevent AI hallucinations?
  4. How can vector databases improve recommendation systems?
  5. What are the scaling capabilities of OpenSearch as a vector database?

How vector databases work in the AI workflow

When you’re building for search, FMs and other AI models convert various types of data (text, images, audio, and video) into mathematical representations called vectors. When you use vectors for search, you encode your data as vectors and store those vectors in a vector database. You further convert your query into a vector and then query the vector database to find related items by minimizing the distance between vectors.
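
The following sketch shows the idea with a brute-force search over a handful of made-up three-dimensional vectors. It ranks by cosine similarity, which is equivalent to minimizing cosine distance. A vector database does the same job at scale, usually with approximate indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda item: cosine_similarity(query, vectors[item]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions.
vectors = {
    "sandals": [0.9, 0.1, 0.0],
    "snow boots": [0.1, 0.9, 0.0],
    "surf footwear": [0.8, 0.2, 0.1],
}
top = nearest([1.0, 0.0, 0.0], vectors, k=2)
```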

When you’re building for generative AI, you use FMs, such as large language models (LLMs), to generate text, video, audio, images, code, and more from a prompt. The prompt might contain text, such as a user’s question, along with other media such as images, audio, or video. However, generative AI models can produce hallucinations: outputs that appear convincing but contain factual errors. To address this challenge, you employ vector search to retrieve accurate information from a vector database. You add this information to the prompt in a process called Retrieval-Augmented Generation (RAG).
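
As a minimal sketch of the augmentation step, the function below places retrieved passages ahead of the user's question. The prompt wording and the function name are assumptions for illustration; production systems tune this template carefully.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Place retrieved passages ahead of the question so the model can ground its answer."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


prompt = build_rag_prompt(
    "Which state had the incident?",
    ["The incident occurred in Texas.", "The aircraft sustained substantial damage."],
)
```

Numbering the passages makes it possible to ask the model to cite which passage supports each statement.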

Why is Amazon OpenSearch Service the recommended vector database for Amazon Bedrock?

Amazon Bedrock is a fully managed service that provides FMs from leading AI companies, and the tools to customize these FMs with your data to improve their accuracy. With Amazon Bedrock, you get a serverless, no-fuss solution to adopt your selected FM and use it for your generative AI application.

Amazon OpenSearch Service is a fully managed service that you can use to deploy and operate OpenSearch in the AWS Cloud. OpenSearch is an open source search, log analytics, and vector database solution composed of two parts: a search engine and vector database, and OpenSearch Dashboards, a log analytics, observability, security analytics, and dashboarding solution. OpenSearch Service can help you to deploy and operate your search infrastructure with native vector database capabilities, pre-built templates, and simplified setup. API calls and integration templates streamline connectivity with Amazon Bedrock FMs, while the OpenSearch Service vector engine can deliver as low as single-digit millisecond latencies for searches across billions of vectors, making it ideal for real-time AI applications.

OpenSearch is a specialized type of database technology that was originally designed for latency- and throughput-optimized matching and retrieval of large and small blocks of unstructured text with ranked results. OpenSearch ranks results based on a measure of similarity to the search query, returning the most similar results. This similarity matching has evolved over time. Before FMs, search engines used a word-frequency scoring system called term frequency/inverse document frequency (TF/IDF). OpenSearch Service uses TF/IDF to score a document based on the rarity of the search terms in all documents and how often the search terms appeared in the document it’s scoring.
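
To illustrate the idea, here is a sketch of one common textbook form of TF/IDF, where a document's score is the sum over query terms of the term's count in the document multiplied by the log of its rarity across all documents. This is an illustration of the concept only; the scoring formula a search engine actually uses differs in its details.

```python
import math
from collections import Counter


def tf_idf_scores(query: str, docs: list[str]) -> list[float]:
    """Score each document as the sum over query terms of tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(docs)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)  # documents containing the term
            if df == 0:
                continue
            idf = math.log(n_docs / df)  # rarer terms get a larger weight
            score += counts[term] * idf
        scores.append(score)
    return scores


docs = ["red shoes", "red red hat", "blue shoes for the beach"]
scores = tf_idf_scores("beach shoes", docs)
```

The third document scores highest because it contains both query terms, and "beach" is rarer across the collection than "shoes".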

With the rise of AI/ML, OpenSearch added the ability to compute a similarity score for the distance between vectors. To search with vectors, you add vector embeddings produced by FMs and other AI/ML technologies to your documents. To score documents for a query, OpenSearch computes the distance from the document’s vector to a vector from the query. OpenSearch further provides field-based filtering and matching and hybrid vector and lexical search, which you use to incorporate terms in your queries. OpenSearch hybrid search performs a lexical and a vector query in parallel, producing a similarity score with built-in score normalization and blending to improve the accuracy of the search result compared with lexical or vector similarity alone.
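
Lexical and vector scores live on different scales, so they need to be normalized before they can be blended. The sketch below uses min-max normalization and a weighted mean, which is one possible combination; the function names, weights, and scores are assumptions for illustration and don't reproduce the engine's internal implementation.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend(lexical, vector, lexical_weight=0.5):
    """Normalize each result set, then combine with a weighted mean."""
    lex, vec = min_max(lexical), min_max(vector)
    doc_ids = set(lex) | set(vec)
    return {
        doc_id: lexical_weight * lex.get(doc_id, 0.0)
        + (1 - lexical_weight) * vec.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


lexical = {"a": 12.0, "b": 4.0, "c": 8.0}  # unbounded lexical-style scores
vector = {"a": 0.2, "b": 0.9, "c": 0.8}    # similarity-style scores
combined = blend(lexical, vector)
best = max(combined, key=combined.get)
```

Document "c" is neither the top lexical nor the top vector match, but it ranks first after blending because it does well on both.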

OpenSearch Service supports three vector engines: Facebook AI Similarity Search (FAISS), Non-Metric Space Library (NMSLib), and Apache Lucene. It supports exact nearest neighbor search, and approximate nearest neighbor (ANN) search with either the hierarchical navigable small world (HNSW) or inverted file (IVF) algorithms. OpenSearch Service supports vector quantization methods, including disk-based vector quantization, so you can optimize cost, latency, and retrieval accuracy for your solution.

Use case 1: Improve your search results with AI/ML

To improve your search results with AI/ML, you use a vector-generating ML model, most frequently an LLM or multi-modal model that produces embeddings for text and image inputs. You use Amazon OpenSearch Ingestion, or a similar technology, to send your data to OpenSearch Service. You use the OpenSearch Neural plugin to integrate the model, referenced by its model ID, into an OpenSearch ingest pipeline. The ingest pipeline calls Amazon Bedrock to create vector embeddings for every document during ingestion.

To query OpenSearch Service as a vector database, you use an OpenSearch neural query to call Amazon Bedrock to create an embedding for the query. The neural query uses the vector database to retrieve nearest neighbors.

The service offers pre-built CloudFormation templates that construct OpenSearch Service integrations to connect to Amazon Bedrock foundation models for remote inference. These templates simplify the setup of the connector that OpenSearch Service uses to contact Amazon Bedrock.

After you’ve created the integration, you can refer to the model_id when you set up your ingest and search pipelines.
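
As a sketch of where the model_id goes, the following builds the request bodies for an ingest pipeline with a text_embedding processor and for a neural query. The field names (text and text_embedding) and the helper functions are assumptions for illustration; substitute your own index fields and the model ID that the integration returns.

```python
def ingest_pipeline_body(model_id: str, source_field: str, vector_field: str) -> dict:
    """Body for creating an ingest pipeline that uses the text_embedding processor."""
    return {
        "description": "Create embeddings at ingest time",
        "processors": [
            {
                "text_embedding": {
                    "model_id": model_id,
                    "field_map": {source_field: vector_field},
                }
            }
        ],
    }


def neural_query_body(model_id: str, vector_field: str, query_text: str, k: int = 5) -> dict:
    """Body for a search request that uses the neural query type."""
    return {
        "query": {
            "neural": {
                vector_field: {"query_text": query_text, "model_id": model_id, "k": k}
            }
        }
    }


MODEL_ID = "<your model_id>"  # placeholder, not a real ID
pipeline = ingest_pipeline_body(MODEL_ID, "text", "text_embedding")
query = neural_query_body(MODEL_ID, "text_embedding", "shoes for the beach")
```

The same model ID appears in both bodies so that documents and queries are embedded by the same model.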

Use case 2: Amazon OpenSearch Serverless as an Amazon Bedrock knowledge base

Amazon OpenSearch Serverless offers an auto-scaled, high-performing vector database that you can use to build with Amazon Bedrock for RAG, and AI agents, without having to manage the vector database infrastructure. When you use OpenSearch Serverless, you create a collection—a collection of indexes for your application’s search, vector, and logging needs. For vector database use cases, you send your vector data to your collection’s indices, and OpenSearch Serverless creates a vector database that provides fast vector similarity and retrieval.

When you use OpenSearch Serverless as a vector database, you pay only for storage for your vectors and the compute needed to serve your queries. Serverless compute capacity is measured in OpenSearch Compute Units (OCUs). You can deploy OpenSearch Serverless starting at just one OCU for development and test workloads for about $175/month. OpenSearch Serverless scales up and down automatically to accommodate your ingestion and search workloads.
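
The monthly figure follows from simple arithmetic. The sketch below assumes 730 hours in an average month and an hourly rate of $0.24 per OCU, which is an assumption chosen to match the figure above; check current pricing for your AWS Region. Storage is billed separately and isn't included.

```python
HOURS_PER_MONTH = 730             # average hours in a month
ASSUMED_RATE_PER_OCU_HOUR = 0.24  # assumption for illustration; check current pricing


def monthly_compute_cost(ocus: float, rate: float = ASSUMED_RATE_PER_OCU_HOUR) -> float:
    """Estimated monthly compute cost in USD, excluding storage."""
    return round(ocus * rate * HOURS_PER_MONTH, 2)


one_ocu = monthly_compute_cost(1)
four_ocus = monthly_compute_cost(4)
```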

With Amazon OpenSearch Serverless, you get an autoscaled, performant vector database that is seamlessly integrated with Amazon Bedrock as a knowledge base for your generative AI solution. You use the Amazon Bedrock console to automatically create vectors from your data in up to five data stores, including an Amazon Simple Storage Service (Amazon S3) bucket and store them in an Amazon OpenSearch Serverless collection.

After you’ve configured your data source and selected a model, select Amazon OpenSearch Serverless as your vector store, and Amazon Bedrock and OpenSearch Serverless will take it from there. Amazon Bedrock will automatically retrieve source data from your data source, apply the parsing and chunking strategies you have configured, and index vector embeddings in OpenSearch Serverless. An API call will synchronize your data source with the OpenSearch Serverless vector store.

The Amazon Bedrock retrieve_and_generate() runtime API call makes it straightforward for you to implement RAG with Amazon Bedrock and your OpenSearch Serverless knowledge base.

response = bedrock_agent_runtime_client.retrieve_and_generate(
  input={
    'text': prompt,
  },
  retrieveAndGenerateConfiguration={
    'type': 'KNOWLEDGE_BASE',
    'knowledgeBaseConfiguration': {
      'knowledgeBaseId': knowledge_base_id,
      'modelArn': model_arn,
    }
  }
)
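
After the call returns, you typically want the generated text and the sources behind it. The following sketch pulls both out of a response dictionary. The helper name is hypothetical, and the response below is a hand-written stand-in trimmed to the fields the helper reads, based on our understanding of the response shape; check the API reference for the full structure.

```python
def summarize_response(response: dict) -> dict:
    """Pull the generated text and the source locations out of a response dict."""
    sources = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            sources.append(ref.get("location", {}))
    return {"answer": response.get("output", {}).get("text", ""), "sources": sources}


# Hand-written stand-in for a real response; the bucket name is made up.
fake_response = {
    "output": {"text": "Miami grew faster."},
    "citations": [
        {
            "retrievedReferences": [
                {"location": {"type": "S3", "s3Location": {"uri": "s3://example-bucket/doc.pdf"}}}
            ]
        }
    ],
}
summary = summarize_response(fake_response)
```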

Conclusion

In this post, you learned how Amazon OpenSearch Service and Amazon Bedrock work together to deliver AI-powered search and generative AI applications and why OpenSearch Service is the AWS recommended vector database for Amazon Bedrock. You learned how to add Amazon Bedrock FMs to generate vector embeddings for OpenSearch Service semantic search to bring meaning and context to your search results. You learned how OpenSearch Serverless provides a tightly integrated knowledge base for Amazon Bedrock that simplifies using foundation models for RAG and other generative AI. Get started with Amazon OpenSearch Service and Amazon Bedrock today to enhance your AI-powered applications with improved search capabilities and more reliable generative AI outputs.


About the author

Jon Handler is Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Use DeepSeek with Amazon OpenSearch Service vector databases and Amazon SageMaker

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/use-deepseek-with-amazon-opensearch-service-vector-databases-and-amazon-sagemaker/

DeepSeek-R1 is a powerful and cost-effective AI model that excels at complex reasoning tasks. When combined with Amazon OpenSearch Service, it enables robust Retrieval Augmented Generation (RAG) applications. This post shows you how to set up RAG using DeepSeek-R1 on Amazon SageMaker with an OpenSearch Service vector database as the knowledge base. This example provides a solution for enterprises looking to enhance their AI capabilities.

OpenSearch Service provides rich capabilities for RAG use cases, as well as vector embedding-powered semantic search. You can use the flexible connector framework and search flow pipelines in OpenSearch to connect to models hosted by DeepSeek, Cohere, and OpenAI, as well as models hosted on Amazon Bedrock and SageMaker. In this post, we build a connection to DeepSeek’s text generation model, supporting a RAG workflow to generate text responses to user queries.

Solution overview

The following diagram illustrates the solution architecture.

In this walkthrough, you will use a set of scripts to create the preceding architecture and data flow. First, you will create an OpenSearch Service domain, and deploy DeepSeek-R1 to SageMaker. You will execute scripts to create an AWS Identity and Access Management (IAM) role for invoking SageMaker, and a role for your user to create a connector to SageMaker. You will create an OpenSearch connector and model that will enable the retrieval_augmented_generation processor within OpenSearch to execute a user query, perform a search, and use DeepSeek to generate a text response. You will create a connector to SageMaker with Amazon Titan Text Embeddings V2 to create embeddings for a set of documents with population statistics. Finally, you will execute the query to compare population growth in Miami and New York City.

Prerequisites

We’ve created and open-sourced a GitHub repo with all the code you need to follow along with the post and deploy it for yourself. You will need the following prerequisites:

Deploy DeepSeek on Amazon SageMaker

You will need to have or deploy DeepSeek with an Amazon SageMaker endpoint. To learn more about deploying DeepSeek-R1 on SageMaker, refer to Deploying DeepSeek-R1 Distill Model on AWS using Amazon SageMaker AI.

Create an OpenSearch Service domain

Refer to Create an Amazon OpenSearch Service domain for instructions on how to create your domain. Make note of the domain Amazon Resource Name (ARN) and domain endpoint, both of which can be found in the General information section of each domain on the OpenSearch Service console.

Download and prepare the code

Run the following steps from your local computer or workspace that has Python and git:

  1. If you haven’t already, clone the repo into a local folder using the following command:
git clone https://github.com/Jon-AtAWS/opensearch-examples.git
  2. Create a Python virtual environment:
cd opensearch-examples/opensearch-deepseek-rag
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

The example scripts use environment variables for setting some common parameters. Set these up now using the following commands. Be sure to update with your AWS Region, your SageMaker endpoint ARN and URL, your OpenSearch Service domain’s endpoint and ARN, and your domain’s primary user and password.

export DEEPSEEK_AWS_REGION='<your current region>'
export SAGEMAKER_MODEL_INFERENCE_ARN='<your SageMaker endpoint’s ARN>' 
export SAGEMAKER_MODEL_INFERENCE_ENDPOINT='<your SageMaker endpoint’s URL>'
export OPENSEARCH_SERVICE_DOMAIN_ARN='<your domain’s ARN>'
export OPENSEARCH_SERVICE_DOMAIN_ENDPOINT='<your domain’s API endpoint>'
export OPENSEARCH_SERVICE_ADMIN_USER='<your domain’s master user name>'
export OPENSEARCH_SERVICE_ADMIN_PASSWORD='<your domain’s master user password>'
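
Because every script reads its parameters from the environment, a quick check that all seven variables are set can save a failed run. The following is a minimal sketch and is not part of the repo; the helper name missing_variables is hypothetical, and the variable names are the ones exported above.

```python
import os

# Variable names exported earlier in this post.
REQUIRED_VARS = [
    "DEEPSEEK_AWS_REGION",
    "SAGEMAKER_MODEL_INFERENCE_ARN",
    "SAGEMAKER_MODEL_INFERENCE_ENDPOINT",
    "OPENSEARCH_SERVICE_DOMAIN_ARN",
    "OPENSEARCH_SERVICE_DOMAIN_ENDPOINT",
    "OPENSEARCH_SERVICE_ADMIN_USER",
    "OPENSEARCH_SERVICE_ADMIN_PASSWORD",
]


def missing_variables(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```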

You now have the code base and have your virtual environment set up. You can examine the contents of the opensearch-deepseek-rag directory. For clarity of purpose and reading, we’ve encapsulated each of seven steps in its own Python script. This post will guide you through running these scripts. We’ve also chosen to use environment variables to pass parameters between scripts. In an actual solution, you would encapsulate the code in classes and pass the values where needed. Coding this way is clearer, but is less efficient and doesn’t follow coding best practices. Use these scripts as examples to pull from.

First, you will set up permissions for your OpenSearch Service domain to connect to your SageMaker endpoint.

Set up permissions

You will create two IAM roles. The first will allow OpenSearch to call your SageMaker endpoint. The second will allow you to make the create connector API call to OpenSearch.

  1. Examine the code in create_invoke_role.py.
  2. Return to the command line, and execute the script:
python create_invoke_role.py
  3. Execute the command line from the script’s output to set the INVOKE_DEEPSEEK_ROLE environment variable.

You have created a role named invoke_deepseek_role, with a trust relationship for OpenSearch Service to assume the role, and with a permission policy that allows OpenSearch Service to invoke your SageMaker endpoint. The script outputs the ARNs for your role and policy and additionally a command line command to add the role to your environment. Execute that command before running the next script. Make a note of the role ARN in case you need to return at a later time.
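
To make the two parts of the role concrete, here is a minimal sketch of the two policy documents such a role needs. It is an illustration, not the contents of create_invoke_role.py: the function names are hypothetical, the example ARN is made up, and the service principal opensearchservice.amazonaws.com is an assumption that you should confirm against the script.

```python
import json


def build_trust_policy(service_principal="opensearchservice.amazonaws.com"):
    """Trust policy that lets the given service principal assume the role.

    The default principal is an assumption; check create_invoke_role.py.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_invoke_policy(endpoint_arn):
    """Permission policy that allows invoking one SageMaker endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["sagemaker:InvokeEndpoint"],
                "Resource": endpoint_arn,
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN, for illustration only.
    example_arn = "arn:aws:sagemaker:us-east-1:111122223333:endpoint/example-endpoint"
    print(json.dumps(build_trust_policy(), indent=2))
    print(json.dumps(build_invoke_policy(example_arn), indent=2))
```

Scoping the Resource to the single endpoint ARN, rather than a wildcard, keeps the role limited to the model you deployed.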

Now you need to create a role for your user to be able to create a connector in OpenSearch Service.

  1. Examine the code in create_connector_role.py.
  2. Return to the command line and execute the script:
python create_connector_role.py
  3. Execute the command line from the script’s output to set the CREATE_DEEPSEEK_CONNECTOR_ROLE environment variable.

You have created a role named create_deepseek_connector_role, with a trust relationship with the current user and permissions to write to OpenSearch Service. You need these permissions to call the OpenSearch create_connector API, which packages a connection to a remote model host, DeepSeek in this case. The script prints the policy’s and role’s ARNs, and additionally a command line command to add the role to your environment. Execute that command before running the next script. Again, make note of the role ARN, just in case.

Now that you have your roles created, you will tell OpenSearch about them. The fine-grained access control feature includes an OpenSearch role, ml_full_access, that will allow authenticated entities to execute API calls within OpenSearch.

  1. Examine the code in setup_opensearch_security.py.
  2. Return to the command line and execute the script:
python setup_opensearch_security.py

You set up the OpenSearch Service security plugin to recognize two AWS roles: create_deepseek_connector_role and LambdaInvokeOpenSearchMLCommonsRole. You will use the second role later, when you connect with an embedding model and load data into OpenSearch to use as a RAG knowledge base. Now that you have permissions in place, you can create the connector.
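
Mapping IAM roles to an OpenSearch role goes through the security plugin's role mapping API. As a rough sketch (the helper name, account ID, and example role name are hypothetical, and setup_opensearch_security.py may structure the call differently), the request looks like this:

```python
def build_role_mapping(role_arns, opensearch_role="ml_full_access"):
    """Return (method, path, body) mapping IAM role ARNs to an OpenSearch role."""
    path = f"/_plugins/_security/api/rolesmapping/{opensearch_role}"
    # With fine-grained access control, IAM role ARNs are mapped as backend roles.
    body = {"backend_roles": list(role_arns)}
    return "PUT", path, body


if __name__ == "__main__":
    # Hypothetical account ID and connector role name, for illustration only.
    arns = [
        "arn:aws:iam::111122223333:role/example-connector-role",
        "arn:aws:iam::111122223333:role/LambdaInvokeOpenSearchMLCommonsRole",
    ]
    method, path, body = build_role_mapping(arns)
    print(method, path)
    print(body)
```

A PUT replaces the whole mapping for the role, so include any backend roles that are already mapped.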

Create the connector

You create a connector with configuration that tells OpenSearch how to connect, provides credentials for the target model host, and provides prompt details. For more information, see Creating connectors for third-party ML platforms.

  1. Examine the code in create_connector.py.
  2. Return to the command line and execute the script:
python create_connector.py
  3. Execute the command line from the script’s output to set the DEEPSEEK_CONNECTOR_ID environment variable.

The script will create the connector to call the SageMaker endpoint and return the connector ID. The connector is an OpenSearch construct that tells OpenSearch how to connect to an external model host. You don’t use it directly; you create an OpenSearch model for that.
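
For orientation, the body of a create connector request (POST /_plugins/_ml/connectors/_create) for a SageMaker endpoint has roughly the following shape. This is a sketch under assumptions, not the code in create_connector.py: the function name, connector name, and description are hypothetical, and the request_body template assumes the endpoint accepts a JSON object with an inputs field.

```python
import json


def build_connector_body(region, endpoint_url, invoke_role_arn):
    """Build a connector definition for a SageMaker-hosted text generation model."""
    return {
        "name": "DeepSeek R1 on SageMaker",  # hypothetical name
        "description": "Connector to a DeepSeek-R1 SageMaker endpoint",
        "version": 1,
        "protocol": "aws_sigv4",  # OpenSearch signs requests with SigV4
        "credential": {"roleArn": invoke_role_arn},  # the invoke role created earlier
        "parameters": {"region": region, "service_name": "sagemaker"},
        "actions": [
            {
                "action_type": "predict",
                "method": "POST",
                "headers": {"content-type": "application/json"},
                "url": endpoint_url,
                # OpenSearch substitutes ${parameters.inputs} at call time;
                # the payload shape is an assumption about the endpoint.
                "request_body": '{ "inputs": "${parameters.inputs}" }',
            }
        ],
    }


if __name__ == "__main__":
    body = build_connector_body(
        "us-east-1",
        "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/example/invocations",
        "arn:aws:iam::111122223333:role/invoke_deepseek_role",
    )
    print(json.dumps(body, indent=2))
```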

Create an OpenSearch model

When you work with machine learning (ML) models in OpenSearch, you use OpenSearch’s ml-commons plugin to create a model. ML models are an OpenSearch abstraction that lets you perform ML tasks like sending text for embeddings during indexing, or calling out to a large language model (LLM) to generate text in a search pipeline. The model interface provides you with a model ID in a model group that you then use in your ingest pipelines and search pipelines.

  1. Examine the code in create_deepseek_model.py.
  2. Return to the command line and execute the script:
python create_deepseek_model.py
  3. Execute the command line from the script’s output to set the DEEPSEEK_MODEL_ID environment variable.

You created an OpenSearch ML model group and model that you can use to create ingest and search pipelines. The _register API places the model in the model group and references your SageMaker endpoint through the connector (connector_id) you created.
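
The sequence of ml-commons calls behind this step can be sketched as three requests: register a model group, register a remote model that points at the connector, and deploy the model. The helper names and the example names and IDs below are hypothetical; create_deepseek_model.py may differ in detail.

```python
def model_group_request(name, description):
    """Request that registers a model group."""
    return "POST", "/_plugins/_ml/model_groups/_register", {
        "name": name,
        "description": description,
    }


def model_register_request(name, model_group_id, connector_id):
    """Request that registers a remote model backed by a connector."""
    return "POST", "/_plugins/_ml/models/_register", {
        "name": name,
        "function_name": "remote",  # the model runs outside the cluster
        "model_group_id": model_group_id,
        "description": "DeepSeek-R1 hosted on SageMaker",
        "connector_id": connector_id,
    }


def model_deploy_request(model_id):
    """Request that deploys a registered model so it can serve predictions."""
    return "POST", f"/_plugins/_ml/models/{model_id}/_deploy", None


if __name__ == "__main__":
    # Hypothetical names and IDs, for illustration only.
    print(model_group_request("deepseek_group", "Models for the DeepSeek RAG example"))
    print(model_register_request("deepseek_r1", "group-id", "connector-id"))
    print(model_deploy_request("model-id"))
```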

Verify your setup

You can run a query to verify your setup and make sure that you can connect to DeepSeek on SageMaker and receive generated text. Complete the following steps:

  1. On the OpenSearch Service console, choose Dashboard under Managed clusters in the navigation pane.
  2. Choose your domain’s dashboard.

Amazon OpenSearch Service console on the AWS console showing where to click to reveal a domain’s details

  3. Choose the OpenSearch Dashboards URL (dual stack) link to open OpenSearch Dashboards.
  4. Log in to OpenSearch Dashboards with your primary user name and password.
  5. Dismiss the welcome dialog by choosing Explore on my own.
  6. Dismiss the new look and feel dialog.
  7. Confirm the global tenant in the Select your tenant dialog.
  8. Navigate to the Dev Tools tab.
  9. Dismiss the welcome dialog.

You can also get to Dev Tools by expanding the navigation menu (three lines) to reveal the navigation pane, and scrolling down to Dev Tools.

OpenSearch Dashboards home screen, with an indicator on where to click to open the Dev Tools tab

The Dev Tools page provides a left pane where you enter REST API calls. You execute the commands and the right pane shows the output of the command. Enter the following command in the left pane, replace <your model ID> with the ID of the model you created, and run the command by placing the cursor anywhere in the command and choosing the run icon.

POST _plugins/_ml/models/<your model ID>/_predict
{
  "parameters": {
    "inputs": "Hello"
  }
}

You should see output like the following screenshot.
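
If you prefer to run the same check from Python instead of Dev Tools, the request can be built with the standard library. This is a sketch only: the function name is hypothetical, it assumes the domain endpoint is given with its https:// scheme, and it assumes the domain accepts HTTP basic authentication with the primary user. The request is built but not sent.

```python
import base64
import json
import urllib.request


def build_predict_request(domain_endpoint, model_id, user, password, inputs="Hello"):
    """Build (but do not send) the _predict request shown above."""
    url = f"{domain_endpoint.rstrip('/')}/_plugins/_ml/models/{model_id}/_predict"
    body = json.dumps({"parameters": {"inputs": inputs}}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical endpoint, model ID, and credentials, for illustration only.
    request = build_predict_request(
        "https://search-example.us-east-1.es.amazonaws.com", "abc123", "admin", "secret"
    )
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request)
```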

Congratulations! You’ve now created and deployed an ML model that can use the connector you created to call to your SageMaker endpoint, and use DeepSeek to generate text. Next, you will use your model in an OpenSearch search pipeline to automate a RAG workflow.

Set up a RAG workflow

RAG is a way of adding information to the prompt so that the LLM generating the response is more accurate. An overall generative application like a chatbot orchestrates a call to external knowledge bases and augments the prompt with knowledge from those sources. We’ve created a small knowledge base comprising population information.

OpenSearch provides search pipelines, which are sets of OpenSearch search processors that are applied to the search request sequentially to build a final result. OpenSearch has processors for hybrid search, reranking, and RAG, among others. You define your processor and then send your queries to the pipeline. OpenSearch responds with the final result.

When you build a RAG application, you choose a knowledge base and a retrieval mechanism. In most cases, you will use an OpenSearch Service vector database as a knowledge base, performing a k-nearest neighbor (k-NN) search to incorporate semantic information in the retrieval with vector embeddings. OpenSearch Service provides integrations with vector embedding models hosted in Amazon Bedrock and SageMaker (among other options).

Make sure that your domain is running OpenSearch 2.9 or later, and that fine-grained access control is enabled for the domain. Then complete the following steps:

  1. On the OpenSearch Service console, choose Integrations in the navigation pane.
  2. Choose Configure domain under Integration with text embedding models through Amazon SageMaker.

  3. Choose Configure public domain.
  4. If you created a virtual private cloud (VPC) domain instead, choose Configure VPC domain.

You will be redirected to the AWS CloudFormation console.

  5. For Amazon OpenSearch Endpoint, enter your endpoint.
  6. Leave everything else as default values.

The CloudFormation stack requires a role to create a connector to the all-MiniLM-L6-v2 model, hosted on SageMaker, called LambdaInvokeOpenSearchMLCommonsRole. You enabled access for this role when you ran setup_opensearch_security.py. If you changed the name in that script, be sure to change it in the Lambda Invoke OpenSearch ML Commons Role Name field.

  7. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and choose Create stack.

For simplicity, we’ve elected to use the open source all-MiniLM-L6-v2 model, hosted on SageMaker, for embedding generation. To achieve high search quality for production workloads, you should fine-tune lightweight models like all-MiniLM-L6-v2, or use OpenSearch Service integrations with models such as Cohere Embed V3 on Amazon Bedrock or Amazon Titan Text Embeddings V2, which are designed to deliver high out-of-the-box quality.

Wait for CloudFormation to deploy your stack and the status to change to CREATE_COMPLETE.

  8. Choose the stack’s Outputs tab on the CloudFormation console and copy the value for ModelID.

The AWS CloudFormation console showing the template results for the integration template and where to find the model ID

You will use this model ID to connect with your embedding model.

  1. Examine the code in load_data.py.
  2. Return to the command line and set an environment variable with the model ID of the embedding model:
export EMBEDDING_MODEL_ID='<the model ID from CloudFormation’s output>'
  3. Execute the script to load data into your domain:
python load_data.py

The script creates the population_data index and an OpenSearch ingest pipeline that calls SageMaker using the connector referenced by the embedding model ID. The ingest pipeline’s field mapping tells OpenSearch the source and destination fields for each document’s embedding.
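
The two definitions involved can be sketched as follows. The function names, the pipeline name, and the field names text and embedding are hypothetical (load_data.py defines its own), while the dimension of 384 matches the output size of all-MiniLM-L6-v2.

```python
def build_ingest_pipeline(embedding_model_id, source_field="text", vector_field="embedding"):
    """Ingest pipeline that embeds source_field into vector_field for each document."""
    return {
        "description": "Generate embeddings for population documents",
        "processors": [
            {
                "text_embedding": {
                    "model_id": embedding_model_id,
                    # field_map: source text field -> destination vector field
                    "field_map": {source_field: vector_field},
                }
            }
        ],
    }


def build_index_body(pipeline_name, vector_field="embedding", dimension=384):
    """Index definition with k-NN enabled and the ingest pipeline applied by default."""
    return {
        "settings": {"index.knn": True, "default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                vector_field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


if __name__ == "__main__":
    # PUT /_ingest/pipeline/<pipeline name> with the first body,
    # then PUT /population_data with the second.
    print(build_ingest_pipeline("embedding-model-id"))
    print(build_index_body("population_pipeline"))
```

Setting default_pipeline on the index means documents are embedded on ingest without naming the pipeline in each indexing request.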

Now that you have your knowledge base prepared, you can run a RAG query.

  1. Examine the code in run_rag.py.
  2. Return to the command line and execute the script:
python run_rag.py

The script creates a search pipeline with an OpenSearch retrieval_augmented_generation processor. The processor automates running an OpenSearch k-NN query to retrieve relevant information and adding that information to the prompt. It uses the generation_model_id and connector to the DeepSeek model on SageMaker to generate a text response for the user’s question. The OpenSearch neural query (line 55 of run_rag.py) takes care of generating the embedding for the k-NN query using the embedding_model_id. In the ext section of the query, you provide the user’s question for the LLM. The llm_model is set to bedrock/claude because the parameterization and actions are the same as they are for DeepSeek. You’re still using DeepSeek to generate text.
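
As a rough picture of what the script sends, the search pipeline and the query can be sketched as two request bodies. The function names, tag, prompts, field names, and timeout value are hypothetical; the structure follows the retrieval_augmented_generation processor and the neural query described above, and run_rag.py remains the reference.

```python
def build_rag_pipeline(generation_model_id, context_fields):
    """Search pipeline with a retrieval_augmented_generation response processor."""
    return {
        "response_processors": [
            {
                "retrieval_augmented_generation": {
                    "tag": "deepseek_rag",  # hypothetical tag
                    "description": "RAG with DeepSeek on SageMaker",
                    "model_id": generation_model_id,
                    # Document fields whose text is added to the prompt as context.
                    "context_field_list": list(context_fields),
                    "system_prompt": "You are a helpful assistant.",
                    "user_instructions": "Answer using only the search results.",
                }
            }
        ]
    }


def build_rag_query(question, embedding_model_id, vector_field="embedding", k=5):
    """Neural (k-NN) query plus the ext section that carries the LLM question."""
    return {
        "query": {
            "neural": {
                vector_field: {
                    "query_text": question,
                    "model_id": embedding_model_id,
                    "k": k,
                }
            }
        },
        "ext": {
            "generative_qa_parameters": {
                "llm_model": "bedrock/claude",  # as in the post; DeepSeek still generates
                "llm_question": question,
                "context_size": k,
                "timeout": 30,  # seconds; hypothetical value
            }
        },
    }


if __name__ == "__main__":
    # PUT /_search/pipeline/<name> with the first body, then
    # GET /population_data/_search?search_pipeline=<name> with the second.
    print(build_rag_pipeline("deepseek-model-id", ["text"]))
    print(build_rag_query("What is the population of Miami?", "embedding-model-id"))
```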

Examine the output from OpenSearch Service. The user asked the question “What’s the population increase of New York City from 2021 to 2023? How is the trending comparing with Miami?” The first portion of the result shows the hits—documents OpenSearch retrieved from the semantic query—as the population statistics for New York City and Miami. The next section of the response includes the prompt, as well as DeepSeek’s answer.

Okay, so I need to figure out the population increase of New York City from 2021 to 2023 and compare it with Miami's growth. 
Let me start by looking at the data provided in the search results.
From SEARCH RESULT 2, I see that in 2021, NYC had a population of 18,823,000. 
In 2022, it was 18,867,000, and in 2023, it's 18,937,000. 
So, the increase from 2021 to 2022 is 18,867,000 - 18,823,000 = 44,000. 
Then from 2022 to 2023, it's 18,937,000 - 18,867,000 = 70,000. 
Adding those together, the total increase from 2021 to 2023 is 44,000 + 70,000 = 114,000.
Now, looking at Miami's data in SEARCH RESULT 1. In 2021, Miami's population was 6,167,000, in 2022 it was 6,215,000, and in 2023 it's 6,265,000. 
The increase from 2021 to 2022 is 6,215,000 - 6,167,000 = 48,000. From 2022 to 2023, it's 6,265,000 - 6,215,000 = 50,000. 
So, the total increase is 48,000 + 50,000 = 98,000.
Comparing the two, NYC's increase of 114,000 is higher than Miami's 98,000.
So, NYC's population increased more over that period."
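
Because a generated answer can contain arithmetic slips, it is worth checking the numbers against the retrieved documents. A minimal check, using only the figures quoted in the response above (the variable and function names are illustrative):

```python
# Population figures as quoted in DeepSeek's response above.
nyc = {2021: 18_823_000, 2022: 18_867_000, 2023: 18_937_000}
miami = {2021: 6_167_000, 2022: 6_215_000, 2023: 6_265_000}


def increase(series, start, end):
    """Absolute population change between two years."""
    return series[end] - series[start]


if __name__ == "__main__":
    print("NYC 2021-2023:", increase(nyc, 2021, 2023))
    print("Miami 2021-2023:", increase(miami, 2021, 2023))
```

Both totals match the values in the generated answer (114,000 and 98,000).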

Congratulations! You’ve connected to an embedding model, created a knowledge base, and used that knowledge base, along with DeepSeek, to generate a text response to a question on population changes in New York City and Miami. You can adapt the code from this post to create your own knowledge base and run your own queries.

Clean up

To avoid incurring additional charges, clean up the resources you deployed:

  1. Delete the SageMaker deployment of DeepSeek. For instructions, see Cleaning Up.
  2. If your Jupyter notebook has lost context, you can delete the endpoint:
    1. On the SageMaker console, under Inference in the navigation pane, choose Endpoints.
    2. Select your endpoint and choose Delete.
  3. Delete the CloudFormation stack that connects to SageMaker for the embedding model.
  4. Delete the OpenSearch Service domain you created.

Conclusion

The OpenSearch connector framework is a flexible way for you to access models you host on other platforms. In this example, you connected to the open source DeepSeek model that you deployed on SageMaker. DeepSeek’s reasoning capabilities, augmented with a knowledge base in the OpenSearch Service vector engine, enabled it to answer a question comparing population growth in New York and Miami.

Find out more about AI/ML capabilities of OpenSearch Service, and let us know how you are using DeepSeek and other generative models to build!


About the Authors

Jon Handler is the Director of Solutions Architecture for Search Services at Amazon Web Services, based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads for OpenSearch. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph. D. in Computer Science and Artificial Intelligence from Northwestern University.

Yaliang Wu is a Software Engineering Manager at AWS, focusing on OpenSearch projects, machine learning, and generative AI applications.

Amazon OpenSearch Service: Managed and community driven

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-managed-and-community-driven/

I’ve always loved the problem of search. At its core, search is about receiving a question, understanding that question, and then retrieving the best answer for it. A long time ago, I did an AI robotics project for my PhD that married a library of plan fragments to a real-world situation, through search. I’ve worked on and built a commercial search engine from the ground up in a prior job. And in my career at AWS, I’ve worked as a solutions architect, helping our customers adopt our search services in all their incarnations.

Like many developers, I share a passion for open source. This stems partly from my academic background, where scholars work for the greater good, building upon and benefiting from previous achievements in their fields. I’ve used and contributed to numerous open source technologies, ranging from small projects with a single purpose to large-scale initiatives with passionate, engaged communities. The search community has its own special, academic flavor, because search itself is related to long-standing academic endeavors like information retrieval, psychology, and (symbolic) AI. Open source software has played a prominent role in this community. Search technology has been democratized, especially over the past 10–15 years, through open source projects like Apache Lucene, Apache Solr, the Apache License, 2.0 versions of Elasticsearch, and OpenSearch.

It’s that context that makes me so excited that today the Linux Foundation announced the OpenSearch Software Foundation. As part of the creation of the OpenSearch Foundation, AWS has transferred ownership of OpenSearch to the Linux Foundation. At the launch of the project in April of 2021, in introducing OpenSearch, we spoke of our desire to “ensure users continue to have a secure, high-quality, fully open source search and analytics suite with a rich roadmap of new and innovative functionality.” We’ve maintained that desire and commitment, and with this transfer, are deepening that commitment, and bringing in the broader community with open governance to help with that goal.

There are two key points regarding this announcement: first, nothing is changing if you’re a customer of Amazon OpenSearch Service; second, a lot is changing on the open source side, and that’s a net benefit for the service. We’re moving into a future that includes an acceleration in innovation for the OpenSearch Project, driven by deeper collaboration and participation with the community. Ultimately, that’s going to come to the service and benefit our AWS customers.

Amazon OpenSearch Service: How we’ve worked

Amazon’s focus from the beginning was to work on OpenSearch in the open. Our first task was to release a working code base with code import and renaming capabilities. We launched OpenSearch 1.0 in July 2021, followed by renaming our managed service to Amazon OpenSearch Service in September 2021. With the launch of Amazon OpenSearch Service, we announced support for OpenSearch 1.0 as an engine choice.

As our team at Amazon and the community grew and innovated in the OpenSearch Project, we brought those changes to Amazon OpenSearch Service along with support for the corresponding versions. At AWS, we embraced open source by jointly publishing and discussing ideas, RFCs, and feature requests with the community. As time passed and the project progressed, we onboarded community maintainers and accepted contributions from various sources within and outside AWS.

As an Amazon OpenSearch Service customer, you’ll continue to see updates and new versions flowing from open source to our managed service. You’ll also experience ongoing innovation driven by our investment in growing the project, its community, and code base.

Today the OpenSearch project has significant momentum, with more than 700 million software downloads and participation from thousands of contributors and more than 200 project maintainers. The OpenSearch Software Foundation launches with support from premier members AWS, SAP, and Uber and general members Aiven, Aryn, Atlassian, Canonical, Digital Ocean, Eliatra, Graylog, NetApp® Instaclustr, and Portal26.

Amazon OpenSearch Service: Going forward

This announcement doesn’t change anything for Amazon OpenSearch Service. Amazon remains committed to innovating for and contributing to the OpenSearch Project, with a growing number of committers and maintainers. If anything, this innovation will accelerate with broader and deeper participation bringing more diverse ideas from the global community. At the core of this commitment is our founding and continuing desire to “ensure users continue to have a secure, high-quality, fully open source search and analytics suite with a rich roadmap of new and innovative functionality.” We plan to continue closely working with the project, contributing code improvements and bringing those improvements to our managed service.

This announcement doesn’t change how you connect with or use Amazon OpenSearch Service. OpenSearch Service will continue to be a fully managed service, providing OpenSearch and OpenSearch Dashboards at service-provided endpoints, and with the full suite of existing managed-service features. If you’re using Amazon OpenSearch Service, you won’t need to change anything. There won’t be any licensing changes or cost changes driven by the move to a foundation.

Amazon will continue bringing its expertise to the project, funding new innovations where our customers need them the most, such as cloud-native, large-scale distributed systems, search, analytics, machine learning, and AI. The Linux Foundation will also facilitate collaboration with other open source organizations such as the Cloud Native Computing Foundation (CNCF), which is instrumental for cloud-native, open source projects. Our goal will remain to solve some of the most challenging customer problems, open source first. Finally, given the open source nature of the product, we think there’s a big opportunity, and we are excited to partner with our customers to solve their problems together, in code.

We’ve always encouraged our customers to participate in the OpenSearch Project. Now, the project has a well-defined structure and management, with a governing board and a technical steering committee, each staffed with members from diverse backgrounds, both in and out of Amazon. The governing board will look after the project’s funding and management, and the technical steering committee will take care of the technical direction of the project. This opens the door wider for you to directly participate in shaping the technology you’re using in our managed service. If you’re an Amazon OpenSearch Service customer, the project welcomes your contributions, big or small, from filing issues and feature requests to commenting on RFCs and contributing code.

Conclusion

This is an exciting time, for the project, for the community, and for Amazon OpenSearch Service. As an AWS customer, you don’t need to make any changes in use, and there aren’t any changes in the Apache License, 2.0 or the pricing. But, moving to the Linux Foundation will help bring the spirit of cooperation from the open source world to the technology and from there to Amazon OpenSearch Service. As search continues to mature, together we’ll continue to get better at understanding questions, and providing relevant results.

You can read more about the OpenSearch Foundation announcement on the AWS Open Source blog.



Reducing long-term logging expenses by 4,800% with Amazon OpenSearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/reducing-long-term-logging-expenses-by-4800-with-amazon-opensearch-service/

When you use Amazon OpenSearch Service for time-bound data like server logs, service logs, application logs, clickstreams, or event streams, storage cost is one of the primary drivers for the overall cost of your solution. Over the last year, OpenSearch Service has released features that have opened up new possibilities for storing your log data in various tiers, enabling you to trade off data latency, durability, and availability. In October 2023, OpenSearch Service announced support for im4gn data nodes, with NVMe SSD storage of up to 30 TB. In November 2023, OpenSearch Service introduced or1, the OpenSearch-optimized instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon Simple Storage Service (Amazon S3) to provide 11 nines of durability. Finally, in May 2024, OpenSearch Service announced general availability for Amazon OpenSearch Service zero-ETL integration with Amazon S3. These new features join OpenSearch’s existing UltraWarm instances, which provide an up to 90% reduction in storage cost per GB, and UltraWarm’s cold storage option, which lets you detach UltraWarm indexes and durably store rarely accessed data in Amazon S3.

This post works through an example to help you understand the trade-offs available in cost, latency, throughput, data durability and availability, retention, and data access, so that you can choose the right deployment to maximize the value of your data and minimize the cost.

Examine your requirements

When designing your logging solution, you need a clear definition of your requirements as a prerequisite to making smart trade-offs. Carefully examine your requirements for latency, durability, availability, and cost. Additionally, consider which data you choose to send to OpenSearch Service, how long you retain data, and how you plan to access that data.

For the purposes of this discussion, we divide OpenSearch instance storage into two classes: ephemeral backed storage and Amazon S3 backed storage. The ephemeral backed storage class includes OpenSearch nodes that use Nonvolatile Memory Express SSDs (NVMe SSDs) and Amazon Elastic Block Store (Amazon EBS) volumes. The Amazon S3 backed storage class includes UltraWarm nodes, UltraWarm cold storage, or1 instances, and Amazon S3 storage you access with the service’s zero-ETL with Amazon S3. When designing your logging solution, consider the following:

  • Latency – If you need results in milliseconds, then you must use ephemeral backed storage. If seconds or minutes are acceptable, you can lower your cost by using Amazon S3 backed storage.
  • Throughput – As a general rule, ephemeral backed storage instances will provide higher throughput. Instances that have NVMe SSDs, like the im4gn, generally provide the best throughput, with EBS volumes providing good throughput. or1 instances take advantage of Amazon EBS storage for primary shards while using Amazon S3 with segment replication to reduce the compute cost of replication, thereby offering indexing throughput that can match or even exceed NVMe-based instances.
  • Data durability – Data stored in the hot tier (you deploy these as data nodes) has the lowest latency, and also the lowest durability. OpenSearch Service provides automated recovery of data in the hot tier through replicas, which provide durability with added cost. Data that OpenSearch stores in Amazon S3 (UltraWarm, UltraWarm cold storage, zero-ETL with Amazon S3, and or1 instances) gets the benefit of 11 nines of durability from Amazon S3.
  • Data availability – Best practices dictate that you use replicas for data in ephemeral backed storage. When you have at least one replica, you can continue to access all of your data, even during a node failure. However, each replica adds a multiple of cost. If you can tolerate temporary unavailability, you can reduce replicas through or1 instances, with Amazon S3 backed storage.
  • Retention – Data in all storage tiers incurs cost. The longer you retain data for analysis, the more cumulative cost you incur for each GB of that data. Identify the maximum amount of time you must retain data before it loses all value. In some cases, compliance requirements may restrict your retention window.
  • Data access – Amazon S3 backed storage instances generally have a much higher storage to compute ratio, providing cost savings but with insufficient compute for high-volume workloads. If you have high query volume or your queries span a large volume of data, ephemeral backed storage is the right choice. Direct query (Amazon S3 backed storage) is perfect for large volume queries for infrequently queried data.

As you consider your requirements along these dimensions, your answers will guide your choices for implementation. To help you make trade-offs, we work through an extended example in the following sections.

OpenSearch Service cost model

To understand how to cost an OpenSearch Service deployment, you need to understand the cost dimensions. OpenSearch Service has two different deployment options: managed clusters and serverless. This post considers managed clusters only, because Amazon OpenSearch Serverless already tiers data and manages storage for you. When you use managed clusters, you configure data nodes, UltraWarm nodes, and cluster manager nodes, selecting Amazon Elastic Compute Cloud (Amazon EC2) instance types for each of these functions. OpenSearch Service deploys and manages these nodes for you, providing OpenSearch and OpenSearch Dashboards through a REST endpoint. You can choose Amazon EBS backed instances or instances with NVMe SSD drives. OpenSearch Service charges an hourly cost for the instances in your managed cluster. If you choose Amazon EBS backed instances, the service will charge you for the storage provisioned, and any provisioned IOPS you configure. If you choose or1 nodes, UltraWarm nodes, or UltraWarm cold storage, OpenSearch Service charges for the Amazon S3 storage consumed. Finally, the service charges for data transferred out.

Example use case

We use an example use case to examine the trade-offs in cost and performance. The cost and sizing of this example are based on best practices, and are directional in nature. Although you can expect to see similar savings, all workloads are unique and your actual costs may vary substantially from what we present in this post.

For our use case, Fizzywig, a fictitious company, is a large soft drink manufacturer. They have many plants for producing their beverages, with copious logging from their manufacturing line. They started out small, with an all-hot deployment and generating 10 GB of logs daily. Today, that has grown to 3 TB of log data daily, and management is mandating a reduction in cost. Fizzywig uses their log data for event debugging and analysis, as well as historical analysis over one year of log data. Let’s compute the cost of storing and using that data in OpenSearch Service.

Ephemeral backed storage deployments

Fizzywig’s current deployment is 189 r6g.12xlarge.search data nodes (no UltraWarm tier), with ephemeral backed storage. When you index data in OpenSearch Service, OpenSearch builds and stores index data structures that are usually about 10% larger than the source data, and you need to leave 25% free storage space for operating overhead. Three TB of daily source data will use 4.125 TB of storage for the first (primary) copy, including overhead. Fizzywig follows best practices, using two replica copies for maximum data durability and availability, with the OpenSearch Service Multi-AZ with Standby option, increasing the storage need to 12.375 TB per day. To store 1 year of data, multiply by 365 days to get 4.5 PB of storage needed.
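The storage arithmetic in the preceding paragraph can be reproduced with a short sketch. The constants come straight from the text; the variable and function names are illustrative only, and the 25% headroom is applied as a multiplier, as the post does.

```python
# Sizing sketch for the Fizzywig example; constants are taken from the text.
DAILY_SOURCE_TB = 3.0       # daily log volume
INDEX_OVERHEAD = 0.10       # index structures are about 10% larger than the source
FREE_SPACE_RESERVE = 0.25   # operating headroom, applied as a multiplier
COPIES = 3                  # primary plus two replicas
RETENTION_DAYS = 365


def daily_storage_tb(source_tb: float, copies: int) -> float:
    """Storage consumed by one day of ingest across all copies."""
    per_copy = source_tb * (1 + INDEX_OVERHEAD) * (1 + FREE_SPACE_RESERVE)
    return per_copy * copies


primary_tb = daily_storage_tb(DAILY_SOURCE_TB, 1)
all_copies_tb = daily_storage_tb(DAILY_SOURCE_TB, COPIES)
yearly_pb = all_copies_tb * RETENTION_DAYS / 1000  # decimal units: 1 PB = 1,000 TB

print(f"primary copy per day: {primary_tb:.3f} TB")
print(f"all copies per day:   {all_copies_tb:.3f} TB")
print(f"one year, all copies: {yearly_pb:.2f} PB")
```

Changing COPIES or RETENTION_DAYS shows how quickly replica count and retention drive the total.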

To provision this much storage, they could also choose im4gn.16xlarge.search instances or or1.16xlarge.search instances. The following table gives the instance counts for each of these instance types with one, two, or three copies of the data.

. Max Storage (GB) per Node Primary (1 Copy) Primary + Replica (2 Copies) Primary + 2 Replicas (3 Copies)
im4gn.16xlarge.search 30,000 52 104 156
or1.16xlarge.search 36,000 42 84 126
r6g.12xlarge.search 24,000 63 126 189

The preceding table and the following discussion are strictly based on storage needs. or1 instances and im4gn instances both provide higher throughput than r6g instances, which will reduce cost further. The amount of compute saved varies between 10–40% depending on the workload and the instance type. These savings do not pass straight through to the bottom line; they require scaling and modification of the index and shard strategy to fully realize them. The preceding table and subsequent calculations take the general assumption that these deployments are over-provisioned on compute, and are storage-bound. You would see more savings for or1 and im4gn, compared with r6g, if you had to scale higher for compute.

The following table represents the total cluster costs for the three different instance types across the three different data storage sizes specified. These are based on on-demand US East (N. Virginia) AWS Region costs and include instance hours, Amazon S3 cost for the or1 instances, and Amazon EBS storage costs for the or1 and r6g instances.

. Primary (1 Copy) Primary + Replica (2 Copies) Primary + 2 Replicas (3 Copies)
im4gn.16xlarge.search $3,977,145 $7,954,290 $11,931,435
or1.16xlarge.search $4,691,952 $9,354,996 $14,018,041
r6g.12xlarge.search $4,420,585 $8,841,170 $13,261,755

This table gives you the one-copy, two-copy, and three-copy costs (including Amazon S3 and Amazon EBS costs, where applicable) for this 4.5 PB workload. For this post, “one copy” refers to the first copy of your data, with the replication factor set to zero. “Two copies” includes a replica copy of all of the data, and “three copies” includes a primary and two replicas. As you can see, each replica adds a multiple of cost to the solution. Of course, each replica adds availability and durability to the data. With one copy (primary only), you would lose data in the case of a single node outage (with an exception for or1 instances). With one replica, you might lose some or all data in a two-node outage. With two replicas, you could lose data only in a three-node outage.

The or1 instances are an exception to this rule. or1 instances can support a one-copy deployment. These instances use Amazon S3 as a backing store, writing all index data to Amazon S3, as a means of replication, and for durability. Because all acknowledged writes are persisted in Amazon S3, you can run with a single copy, but with the risk of losing availability of your data in case of a node outage. If a data node becomes unavailable, any impacted indexes will be unavailable (red) during the recovery window (usually 10–20 minutes). Carefully evaluate whether you can tolerate this unavailability with your customers as well as your system (for example, your ingestion pipeline buffer). If so, you can drop your cost from $14 million to $4.7 million based on the one-copy (primary) column illustrated in the preceding table.

Reserved Instances

OpenSearch Service supports Reserved Instances (RIs), with 1-year and 3-year terms, with no up-front cost (NURI), partial up-front cost (PURI), or all up-front cost (AURI). All reserved instance commitments lower cost, with 3-year, all up-front RIs providing the deepest discount. Applying a 3-year AURI discount to Fizzywig’s workload gives the annual costs shown in the following table.

. Primary Primary + Replica Primary + 2 Replicas
im4gn.16xlarge.search $1,909,076 $3,818,152 $5,727,228
or1.16xlarge.search $3,413,371 $6,826,742 $10,240,113
r6g.12xlarge.search $3,268,074 $6,536,148 $9,804,222

RIs provide a straightforward way to save cost, with no code or architecture changes. Adopting RIs for this workload brings the im4gn cost for three copies down to $5.7 million, and the one-copy cost for or1 instances down to $3.4 million.
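To see how much the reservation actually saves per instance type, you can compare the one-copy totals from the on-demand and 3-year AURI tables. The following sketch uses only figures from those tables; the function name is illustrative. The smaller effective discount for or1 and r6g is plausibly because their totals include storage charges that a reservation doesn't reduce, but treat that reading as an assumption.

```python
# One-copy (primary only) annual totals from the two tables in this post.
on_demand = {
    "im4gn.16xlarge.search": 3_977_145,
    "or1.16xlarge.search": 4_691_952,
    "r6g.12xlarge.search": 4_420_585,
}
reserved_3yr_auri = {
    "im4gn.16xlarge.search": 1_909_076,
    "or1.16xlarge.search": 3_413_371,
    "r6g.12xlarge.search": 3_268_074,
}


def effective_discount(on_demand_cost: float, reserved_cost: float) -> float:
    """Fraction of the on-demand total saved by reserving."""
    return 1 - reserved_cost / on_demand_cost


for name, cost in on_demand.items():
    saved = effective_discount(cost, reserved_3yr_auri[name])
    print(f"{name}: {saved:.1%} lower than on demand")
```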

Amazon S3 backed storage deployments

The preceding deployments are useful as a baseline and for comparison. In actuality, you would choose one of the Amazon S3 backed storage options to keep costs manageable.

OpenSearch Service UltraWarm instances store all data in Amazon S3, using UltraWarm nodes as a hot cache on top of this full dataset. UltraWarm works best for interactive querying of data in small time-bound slices, such as running multiple queries against 1 day of data from 6 months ago. Evaluate your access patterns carefully and consider whether UltraWarm’s cache-like behavior will serve you well. UltraWarm first-query latency scales with the amount of data you need to query.

When designing an OpenSearch Service domain for UltraWarm, you need to decide on your hot retention window and your warm retention window. Most OpenSearch Service customers use a hot retention window that varies between 7–14 days, with warm retention making up the rest of the full retention period. For our Fizzywig scenario, we use 14 days hot retention and 351 days of UltraWarm retention. We also use a two-copy (primary and one replica) deployment in the hot tier.

The 14-day, hot storage need (based on a daily ingestion rate of 4.125 TB) is 115.5 TB. You can deploy six instances of any of the three instance types to support this indexing and storage. UltraWarm stores a single replica in Amazon S3, and doesn’t need additional storage overhead, making your 351-day storage need 1.158 PiB. You can support this with 58 UltraWarm1.large.search instances. The following table gives the total cost for this deployment, with 3-year AURIs for the hot tier. The or1 instances’ Amazon S3 cost is rolled into the S3 column.

. Hot UltraWarm S3 Total
im4gn.16xlarge.search $220,278 $1,361,654 $333,590 $1,915,523
or1.16xlarge.search $337,696 $1,361,654 $418,136 $2,117,487
r6g.12xlarge.search $270,410 $1,361,654 $333,590 $1,965,655

You can further reduce the cost by moving data to UltraWarm cold storage. Cold storage reduces cost by reducing availability of the data—to query the data, you must issue an API call to reattach the target indexes to the UltraWarm tier. A typical pattern for 1 year of data keeps 14 days hot, 76 days in UltraWarm, and 275 days in cold storage. Following this pattern, you use 6 hot nodes and 13 UltraWarm1.large.search nodes. The following table illustrates the cost to run Fizzywig’s 3 TB daily workload. The or1 cost for Amazon S3 usage is rolled into the UltraWarm nodes + S3 column.

. Hot UltraWarm nodes + S3 Cold Total
im4gn.16xlarge.search $220,278 $377,429 $261,360 $859,067
or1.16xlarge.search $337,696 $461,975 $261,360 $1,061,031
r6g.12xlarge.search $270,410 $377,429 $261,360 $909,199
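The hot and UltraWarm sizing used in both scenarios above can be checked with a small sketch. The per-node capacity of 20 TB for the UltraWarm large instance is an assumption that the post doesn't state, so verify it against current service documentation. Function names are illustrative.

```python
import math

DAILY_SOURCE_TB = 3.0
INDEX_OVERHEAD = 0.10
FREE_SPACE_RESERVE = 0.25
ULTRAWARM_NODE_TB = 20  # assumption: usable storage per UltraWarm large node


def hot_storage_tb(days: int, copies: int) -> float:
    """Hot tier needs index overhead, free-space headroom, and replicas."""
    per_day = DAILY_SOURCE_TB * (1 + INDEX_OVERHEAD) * (1 + FREE_SPACE_RESERVE)
    return days * per_day * copies


def warm_storage_tb(days: int) -> float:
    """UltraWarm keeps one copy in Amazon S3 with no extra headroom."""
    return days * DAILY_SOURCE_TB * (1 + INDEX_OVERHEAD)


def warm_nodes(days: int) -> int:
    """Round up to whole UltraWarm nodes."""
    return math.ceil(warm_storage_tb(days) / ULTRAWARM_NODE_TB)


print(f"14 days hot, 2 copies: {hot_storage_tb(14, 2):.1f} TB")
print(f"351 days warm: {warm_storage_tb(351):.1f} TB on {warm_nodes(351)} nodes")
print(f"76 days warm:  {warm_storage_tb(76):.1f} TB on {warm_nodes(76)} nodes")
```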

By employing Amazon S3 backed storage options, you’re able to reduce cost even further, with a single-copy or1 deployment at $337,000, and a maximum of $1 million annually with or1 instances.

OpenSearch Service zero-ETL for Amazon S3

When you use OpenSearch Service zero-ETL for Amazon S3, you keep all your secondary and older data in Amazon S3. Secondary data is the higher-volume data that has lower value for direct inspection, such as VPC Flow Logs and WAF logs. For these deployments, you keep the majority of infrequently queried data in Amazon S3, and only the most recent data in your hot tier. In some cases, you sample your secondary data, keeping a percentage in the hot tier as well. Fizzywig decides that they want to have 7 days of all of their data in the hot tier. They will access the rest with direct query (DQ).

When you use direct query, you can store your data in JSON, Parquet, and CSV formats. Parquet format is optimal for direct query and provides about 75% compression on the data. Fizzywig is using Amazon OpenSearch Ingestion, which can write Parquet format data directly to Amazon S3. Their 3 TB of daily source data compresses to 750 GB of daily Parquet data. OpenSearch Service maintains a pool of compute units for direct query. You are billed hourly for these OpenSearch Compute Units (OCUs), scaling based on the amount of data you access. For this conversation, we assume that Fizzywig will have some debugging sessions and run 50 queries daily over one day worth of data (750 GB). The following table summarizes the annual cost to run Fizzywig’s 3 TB daily workload, 7 days hot, 358 days in Amazon S3.

. Hot DQ Cost OR1 S3 Raw Data S3 Total
im4gn.16xlarge.search $220,278 $2,195 $0 $65,772 $288,245
or1.16xlarge.search $337,696 $2,195 $84,546 $65,772 $490,209
r6g.12xlarge.search $270,410 $2,195 $0 $65,772 $338,377

That’s quite a journey! Fizzywig’s cost for logging has come down from as high as $14 million annually to as low as $288,000 annually using direct query with zero-ETL from Amazon S3. That’s a savings of about 98%, or roughly a 48-fold reduction!
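The size of the saving can be checked directly from the table totals. The following sketch takes the highest figure in this post (or1, three copies, on demand) and the lowest (im4gn with direct query), and expresses the change both as a percentage reduction and as a fold decrease. Variable names are illustrative.

```python
# Highest and lowest annual totals from the tables in this post.
BASELINE_TOTAL = 14_018_041  # or1.16xlarge.search, three copies, on demand

# Components of the im4gn direct query row.
direct_query_parts = {
    "hot": 220_278,
    "dq_cost": 2_195,
    "or1_s3": 0,
    "raw_data_s3": 65_772,
}
DIRECT_QUERY_TOTAL = 288_245

# A percentage reduction can never exceed 100%; the fold decrease is the ratio.
reduction_pct = (BASELINE_TOTAL - DIRECT_QUERY_TOTAL) / BASELINE_TOTAL * 100
fold_decrease = BASELINE_TOTAL / DIRECT_QUERY_TOTAL

print(f"row total check: {sum(direct_query_parts.values()):,}")
print(f"reduction: {reduction_pct:.1f}%")
print(f"fold decrease: {fold_decrease:.1f}x")
```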

Sampling and compression

In this post, we have looked at one data footprint to let you focus on data size, and the trade-offs you can make depending on how you want to access that data. OpenSearch has additional features that can further change the economics by reducing the amount of data you store.

For logs workloads, you can employ OpenSearch Ingestion sampling to reduce the size of data you send to OpenSearch Service. Sampling is appropriate when your data as a whole has statistical characteristics where a part can be representative of the whole. For example, if you’re running an observability workload, you can often send as little as 10% of your data to get a representative sampling of the traces of request handling in your system.

You can further employ a compression algorithm for your workloads. OpenSearch Service recently released support for Zstandard (zstd) compression that can bring higher compression rates and lower decompression latencies as compared to the default, best compression.

Conclusion

With OpenSearch Service, Fizzywig was able to balance cost, latency, throughput, durability and availability, data retention, and preferred access patterns. They were able to cut the cost of their logging solution by about 98%, and management was thrilled.

Across the board, im4gn comes out with the lowest absolute dollar amounts. However, there are a couple of caveats. First, or1 instances can provide higher throughput, especially for write-intensive workloads. This may mean additional savings through reduced need for compute. Additionally, with or1’s added durability, you can maintain availability and durability with lower replication, and therefore lower cost. Another factor to consider is RAM; the r6g instances provide additional RAM, which speeds up queries for lower latency. When coupled with UltraWarm, and with different hot/warm/cold ratios, r6g instances can also be an excellent choice.

Do you have a high-volume, logging workload? Have you benefitted from some or all of these methods? Let us know!


About the Author

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have vector, search, and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor’s of the Arts from the University of Pennsylvania, and a Master’s of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Amazon OpenSearch H2 2023 in review

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-h2-2023-in-review/

2023 was a busy year for Amazon OpenSearch Service! Learn more about the releases that OpenSearch Service launched in the first half of 2023.

In the second half of 2023, OpenSearch Service added support for two new OpenSearch versions: 2.9 and 2.11. These two versions introduce new features in the search space, machine learning (ML) search space, migrations, and the operational side of the service.

With the release of zero-ETL integration with Amazon Simple Storage Service (Amazon S3), you can analyze your data sitting in your data lake using OpenSearch Service to build dashboards and query the data without the need to move your data from Amazon S3.

OpenSearch Service also announced a new zero-ETL integration with Amazon DynamoDB through the DynamoDB plugin for Amazon OpenSearch Ingestion. OpenSearch Ingestion takes care of bootstrapping and continuously streams data from your DynamoDB source.

OpenSearch Serverless announced the general availability of the Vector Engine for Amazon OpenSearch Serverless along with other features to enhance your experience with time series collections, manage your cost for development environments, and quickly scale your resources to match your workload demands.

In this post, we discuss the new releases in OpenSearch Service to empower your business with search, observability, security analytics, and migrations.

Build cost-effective solutions with OpenSearch Service

With the zero-ETL integration for Amazon S3, OpenSearch Service now lets you query your data in place, saving cost on storage. Data movement is an expensive operation because you need to replicate data across different data stores. This increases your data footprint and drives cost. Moving data also adds the overhead of managing pipelines to migrate the data from one source to a new destination.

OpenSearch Service also added new instance types for data nodes, Im4gn and OR1, to help you further optimize your infrastructure cost. With up to 30 TB of NVMe solid state drive (SSD) storage, the Im4gn instance provides dense storage and better performance. OR1 instances use segment replication and remote-backed storage to greatly increase throughput for indexing-heavy workloads.

Zero-ETL from DynamoDB to OpenSearch Service

In November 2023, DynamoDB and OpenSearch Ingestion introduced a zero-ETL integration for OpenSearch Service. OpenSearch Service domains and OpenSearch Serverless collections provide advanced search capabilities, such as full-text and vector search, on your DynamoDB data. With a few clicks on the AWS Management Console, you can now seamlessly load and synchronize your data from DynamoDB to OpenSearch Service, eliminating the need to write custom code to extract, transform, and load the data.

Direct query (zero-ETL for Amazon S3 data, in preview)

OpenSearch Service announced a new way for you to query operational logs in Amazon S3 and S3-based data lakes without needing to switch between tools to analyze operational data. Previously, you had to copy data from Amazon S3 into OpenSearch Service to take advantage of OpenSearch’s rich analytics and visualization features to understand your data, identify anomalies, and detect potential threats.

However, continuously replicating data between services can be expensive and requires operational work. With the OpenSearch Service direct query feature, you can access operational log data stored in Amazon S3, without needing to move the data itself. Now you can perform complex queries and visualizations on your data without any data movement.

Support of Im4gn with OpenSearch Service

Im4gn instances are optimized for workloads that manage large datasets and need high storage density per vCPU. Im4gn instances come in sizes large through 16xlarge, with up to 30 TB in NVMe SSD disk size. Im4gn instances are built on AWS Nitro System SSDs, which offer high-throughput, low-latency disk access for best performance. OpenSearch Service Im4gn instances support all OpenSearch versions and Elasticsearch versions 7.9 and above. For more details, refer to Supported instance types in Amazon OpenSearch Service.

Introducing OR1, an OpenSearch Optimized Instance family for indexing heavy workloads

In November 2023, OpenSearch Service launched OR1, the OpenSearch Optimized Instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon S3 to provide 11 9s of durability. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon S3 as it arrives. OR1 instances use OpenSearch’s segment replication feature to enable replica shards to read data directly from Amazon S3, avoiding the resource cost of indexing in both primary and replica shards. The OR1 instance family also supports automatic data recovery in the event of failure. For more information about OR1 instance type options, refer to Current generation instance types in OpenSearch Service.

Enable your business with security analytics features

The Security Analytics plugin in OpenSearch Service supports out-of-the-box prepackaged log types and provides security detection rules (SIGMA rules) to detect potential security incidents.

In OpenSearch 2.9, the Security Analytics plugin added support for custom log types and native support for the Open Cybersecurity Schema Framework (OCSF) data format. With this new support, you can build detectors with OCSF data stored in Amazon Security Lake to analyze security findings and mitigate any potential incident. You can also create your own custom detection rules for these log types.

Build ML-powered search solutions

In 2023, OpenSearch Service invested in eliminating the heavy lifting required to build next-generation search applications. With features such as search pipelines, search processors, and AI/ML connectors, OpenSearch Service enabled rapid development of search applications powered by neural search, hybrid search, and personalized results. Additionally, enhancements to the kNN plugin improved storage and retrieval of vector data. Newly launched optional plugins for OpenSearch Service enable seamless integration with additional language analyzers and Amazon Personalize.

Search pipelines

Search pipelines provide new ways to enhance search queries and improve search results. You define a search pipeline and then send your queries to it. When you define the search pipeline, you specify processors that transform and augment your queries, and re-rank your results. The prebuilt query processors include date conversion, aggregation, string manipulation, and data type conversion. The results processor in the search pipeline intercepts and adapts results on the fly before passing them to the next phase. Both request and response processing for the pipeline are performed on the coordinator node, so there is no shard-level processing.
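As a rough illustration of defining a pipeline and then sending queries to it, the following sketch builds the two REST requests as plain Python data without calling a cluster. The pipeline name, index name, and field names are hypothetical. The filter_query and rename_field processors are used here as examples of a request processor and a response processor; check the OpenSearch documentation for the processors available in your version.

```python
import json


def search_pipeline_request(name: str) -> tuple[str, str, dict]:
    """Build a request that defines a search pipeline (not sent anywhere)."""
    body = {
        "description": "Restrict queries to public documents and rename a field",
        "request_processors": [
            # Runs before the query executes: adds a filter to every query.
            {"filter_query": {"query": {"term": {"visibility": "public"}}}}
        ],
        "response_processors": [
            # Runs on the results: renames a field in each hit.
            {"rename_field": {"field": "message", "target_field": "notification"}}
        ],
    }
    return "PUT", f"/_search/pipeline/{name}", body


def search_request(index: str, pipeline: str, query: dict) -> tuple[str, str, dict]:
    """Build a search request that is routed through the named pipeline."""
    return "GET", f"/{index}/_search?search_pipeline={pipeline}", {"query": query}


for method, path, body in (
    search_pipeline_request("my_pipeline"),
    search_request("my_index", "my_pipeline", {"match": {"message": "timeout"}}),
):
    print(method, path)
    print(json.dumps(body, indent=2))
```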

Optional plugins

OpenSearch Service lets you associate preinstalled optional OpenSearch plugins to use with your domain. An optional plugin package is compatible with a specific OpenSearch version, and can only be associated to domains with that version. Available plugins are listed on the Packages page on the OpenSearch Service console. The optional plugins include the Amazon Personalize plugin, which integrates OpenSearch Service with Amazon Personalize, and new language analyzers such as Nori, Sudachi, STConvert, and Pinyin.

Support for new language analyzers

OpenSearch Service added support for four new language analyzer plugins: Nori (Korean), Sudachi (Japanese), Pinyin (Chinese), and STConvert Analysis (Chinese). These are available in all AWS Regions as optional plugins that you can associate with domains running any OpenSearch version. You can use the Packages page on the OpenSearch Service console to associate these plugins to your domain, or use the Associate Package API.

Neural search feature

Neural search is generally available with OpenSearch Service version 2.9 and later. Neural search allows you to integrate with ML models that are hosted remotely using the model serving framework. When you use a neural query during search, neural search converts the query text into vector embeddings, uses vector search to compare the query and document embeddings, and returns the closest results. During ingestion, neural search transforms document text into vector embeddings and indexes both the text and its vector embeddings in a vector index.
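The ingestion and query sides described above can be sketched as two request bodies. This is a minimal sketch that only builds JSON and doesn't contact a domain. The model ID, pipeline name, and field names are placeholders, and the text_embedding processor and neural query shapes should be confirmed against the documentation for your version.

```python
import json

MODEL_ID = "<your-model-id>"  # placeholder: ID of a model registered with the domain


def ingest_pipeline_body(text_field: str, vector_field: str) -> dict:
    """Ingest pipeline that embeds text_field into vector_field at index time."""
    return {
        "description": "Generate embeddings during ingestion",
        "processors": [
            {"text_embedding": {"model_id": MODEL_ID, "field_map": {text_field: vector_field}}}
        ],
    }


def neural_query_body(vector_field: str, query_text: str, k: int) -> dict:
    """Neural query: the query text is embedded, then matched by vector search."""
    return {
        "query": {
            "neural": {
                vector_field: {"query_text": query_text, "model_id": MODEL_ID, "k": k}
            }
        }
    }


print("PUT /_ingest/pipeline/nlp-pipeline")
print(json.dumps(ingest_pipeline_body("passage_text", "passage_embedding"), indent=2))
print("GET /my-index/_search")
print(json.dumps(neural_query_body("passage_embedding", "shoes for the beach", 10), indent=2))
```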

Integration with Amazon Personalize

OpenSearch Service introduced an optional plugin to integrate with Amazon Personalize in OpenSearch versions 2.9 or later. The OpenSearch Service plugin for Amazon Personalize Search Ranking allows you to improve the end-user engagement and conversion from your website and application search by taking advantage of the deep learning capabilities offered by Amazon Personalize. As an optional plugin, the package is compatible with OpenSearch version 2.9 or later, and can only be associated to domains with that version.

Efficient query filtering with OpenSearch’s k-NN FAISS

OpenSearch Service introduced efficient query filtering with OpenSearch’s k-NN FAISS in version 2.9 and later. OpenSearch’s efficient vector query filters capability evaluates the available filtering strategies, either pre-filtering with approximate nearest neighbor (ANN) search or filtering with exact k-nearest neighbor (k-NN) search, and picks the one that delivers accurate, low-latency vector search queries. In earlier OpenSearch versions, vector queries on the FAISS engine used post-filtering techniques, which enabled filtered queries at scale but could return fewer than the requested “k” results. Efficient vector query filters deliver low latency and accurate results, enabling you to employ hybrid search across vector and lexical techniques.
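To see why post-filtering can return fewer than k results, consider the following toy example. It uses brute-force distance over five made-up documents, not the FAISS engine, and the function names are illustrative. It only demonstrates the ordering of the filter and nearest-neighbor steps.

```python
import math

# Toy corpus: 2-D vectors with a filterable attribute (made-up data).
docs = [
    {"id": 1, "vec": (0.0, 0.0), "color": "red"},
    {"id": 2, "vec": (0.1, 0.0), "color": "blue"},
    {"id": 3, "vec": (0.2, 0.0), "color": "blue"},
    {"id": 4, "vec": (0.9, 0.0), "color": "red"},
    {"id": 5, "vec": (1.0, 0.0), "color": "red"},
]


def nearest(candidates, query, k):
    """Exact k-NN by Euclidean distance over the given candidates."""
    return sorted(candidates, key=lambda d: math.dist(query, d["vec"]))[:k]


def post_filter(query, k, predicate):
    """Find the k nearest first, then drop the ones that fail the filter."""
    return [d["id"] for d in nearest(docs, query, k) if predicate(d)]


def pre_filter(query, k, predicate):
    """Apply the filter first, then find the k nearest among the survivors."""
    return [d["id"] for d in nearest([d for d in docs if predicate(d)], query, k)]


def is_red(doc):
    return doc["color"] == "red"


print(post_filter((0.0, 0.0), 2, is_red))  # [1]: only one of the top 2 is red
print(pre_filter((0.0, 0.0), 2, is_red))   # [1, 4]: a full k results
```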

Byte-quantized vectors in OpenSearch Service

With the new byte-quantized vector introduced with 2.9, you can reduce memory requirements by a factor of 4 and significantly reduce search latency, with minimal loss in quality (recall). With this feature, the usual 32-bit floats that are used for vectors are quantized or converted to 8-bit signed integers. For many applications, existing float vector data can be quantized with little loss in quality. Comparing benchmarks, you will find that using byte vectors rather than 32-bit floats results in a significant reduction in storage and memory usage while also improving indexing throughput and reducing query latency. An internal benchmark showed the storage usage was reduced by up to 78%, and RAM usage was reduced by up to 59% (for the glove-200-angular dataset). Recall values for angular datasets were lower than those of Euclidean datasets.
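The factor-of-4 memory reduction follows from storing 1 byte per dimension instead of 4. The following sketch shows one simple way to quantize a float vector into signed bytes by scaling to the largest magnitude. This scaling scheme is an assumption chosen for illustration, not necessarily how you should quantize your own data, and the function name is hypothetical.

```python
from array import array


def quantize_to_int8(vector) -> array:
    """Scale so the largest magnitude maps to 127, then round into [-128, 127]."""
    scale = max(abs(x) for x in vector) or 1.0  # avoid dividing by zero
    return array(
        "b",  # signed 8-bit integers
        (max(-128, min(127, round(x / scale * 127))) for x in vector),
    )


float_vec = array("f", [0.12, -0.5, 0.33, 1.0])  # 32-bit floats
byte_vec = quantize_to_int8(float_vec)

float_bytes = len(float_vec) * float_vec.itemsize
byte_bytes = len(byte_vec) * byte_vec.itemsize

print(list(byte_vec))
print(f"{float_bytes} bytes as floats, {byte_bytes} bytes as int8")
```

The quantized values lose precision, which is why recall can drop slightly and why the effect differs between datasets.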

AI/ML connectors

OpenSearch 2.9 and later supports integrations with ML models hosted on AWS services or third-party platforms. This allows system administrators and data scientists to run ML workloads outside of their OpenSearch Service domain. The ML connectors come with a supported set of ML blueprints—templates that define the set of parameters you need to provide when sending API requests to a specific connector. OpenSearch Service provides connectors for several platforms, such as Amazon SageMaker, Amazon Bedrock, OpenAI ChatGPT, and Cohere.

OpenSearch Service console integrations

OpenSearch 2.9 and later added a new integrations feature on the console. Integrations provides you with an AWS CloudFormation template to build your semantic search use case by connecting to your ML models hosted on SageMaker or Amazon Bedrock. The CloudFormation template generates the model endpoint and registers the model ID with the OpenSearch Service domain you provide as input to the template.

Hybrid search and range normalization

The normalization processor and hybrid query build on top of two features released earlier in 2023: neural search and search pipelines. Because lexical and semantic queries return relevance scores on different scales, fine-tuning hybrid search queries was difficult.

OpenSearch Service 2.11 now supports a combination and normalization processor for hybrid search. You can now perform hybrid search queries, combining lexical and natural language-based k-NN vector search queries. OpenSearch Service also enables you to tune your hybrid search results for maximum relevance using multiple scoring combination and normalization techniques.
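To illustrate why normalization matters, the following sketch rescales two made-up score sets with min-max normalization and combines them with a weighted arithmetic mean. It is a simplified model of the idea, not the processor’s implementation. The scores, weights, and the handling of documents missing from one result set are assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different scoring scales become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine(lexical, semantic, weights=(0.5, 0.5)):
    """Weighted arithmetic mean of normalized scores; a missing score counts as 0."""
    lex, sem = min_max(lexical), min_max(semantic)
    combined = {
        doc: weights[0] * lex.get(doc, 0.0) + weights[1] * sem.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: lexical scores are unbounded, semantic scores sit near 0 to 1.
lexical_scores = {"doc1": 12.4, "doc2": 7.1, "doc3": 3.0}
semantic_scores = {"doc2": 0.92, "doc3": 0.88, "doc4": 0.45}

ranked = combine(lexical_scores, semantic_scores)
for doc, score in ranked:
    print(f"{doc}: {score:.3f}")
```

Without normalization, the larger lexical scores would dominate any sum; after rescaling, a document that does well in both result sets can rank first.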

Multimodal search with Amazon Bedrock

OpenSearch Service 2.11 launches support for multimodal search, which allows you to search text and image data using multimodal embedding models. To generate vector embeddings, you need to create an ingest pipeline that contains a text_image_embedding processor, which converts the text or image binaries in a document field to vector embeddings. You can use the neural query clause, either in the k-NN plugin API or Query DSL queries, to do a combination of text and image searches. You can use the new OpenSearch Service integration features to quickly start with multimodal search.
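A sketch of the ingest pipeline and query bodies for multimodal search follows. It only builds JSON. The model ID and field names are placeholders, and the exact parameter names (embedding, field_map, query_image) reflect our reading of the OpenSearch documentation and should be verified for your version.

```python
import json

MODEL_ID = "<your-multimodal-model-id>"  # placeholder


def multimodal_pipeline_body(text_field: str, image_field: str, vector_field: str) -> dict:
    """Ingest pipeline that embeds a text field and an image field into one vector."""
    return {
        "description": "Text and image embedding pipeline",
        "processors": [
            {
                "text_image_embedding": {
                    "model_id": MODEL_ID,
                    "embedding": vector_field,
                    "field_map": {"text": text_field, "image": image_field},
                }
            }
        ],
    }


def multimodal_query_body(vector_field: str, query_text: str, image_b64: str, k: int) -> dict:
    """Neural query that searches with text, a base64-encoded image, or both."""
    return {
        "query": {
            "neural": {
                vector_field: {
                    "query_text": query_text,
                    "query_image": image_b64,
                    "model_id": MODEL_ID,
                    "k": k,
                }
            }
        }
    }


print(json.dumps(multimodal_pipeline_body("image_description", "image_binary", "vector_embedding"), indent=2))
print(json.dumps(multimodal_query_body("vector_embedding", "orange table", "<base64-image>", 5), indent=2))
```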

Neural sparse retrieval

Neural sparse search, a new efficient method of semantic retrieval, is available in OpenSearch Service 2.11. Neural sparse search operates in two modes: bi-encoder and document-only. With the bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, only documents are passed through deep encoders, while search queries are tokenized. A document-only sparse encoder generates an index that is 10.4% of the size of a dense encoding index. For a bi-encoder, the index size is 7.2% of the size of a dense encoding index. Neural sparse search is enabled by sparse encoding models that create sparse vector embeddings: a set of <token: weight> pairs representing the text entry and its corresponding weight in the sparse vector. To learn more about the pre-trained models for sparse neural search, refer to Sparse encoding models.
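Scoring with sparse vectors comes down to a dot product over the tokens that the query and the document share. The following sketch uses made-up token weights; in practice a sparse encoding model produces them. The function name is illustrative.

```python
def sparse_dot(query: dict[str, float], doc: dict[str, float]) -> float:
    """Dot product over tokens present in both sparse vectors."""
    if len(query) > len(doc):
        query, doc = doc, query  # iterate over the smaller vector
    return sum(weight * doc[token] for token, weight in query.items() if token in doc)


# Made-up <token: weight> pairs for a query and a catalog item.
query_vec = {"shoes": 1.2, "beach": 0.9, "sandals": 0.4}
doc_vec = {"sandals": 1.1, "water": 0.7, "beach": 0.5, "resistant": 0.6}

score = sparse_dot(query_vec, doc_vec)
print(f"score: {score:.2f}")  # beach and sandals contribute; shoes has no match
```

Because only a handful of tokens carry weight, the vectors stay small, which is consistent with the index size reductions described above.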

Neural sparse search reduces costs, improves search relevance, and has lower latency. You can use the new OpenSearch Service integrations features to quickly start with neural sparse search.

OpenSearch Ingestion updates

OpenSearch Ingestion is a fully managed and auto scaled ingestion pipeline that delivers your data to OpenSearch Service domains and OpenSearch Serverless collections. Since its release in 2023, OpenSearch Ingestion has continued to add new features to make it straightforward to transform and move your data from supported sources to downstream destinations like OpenSearch Service, OpenSearch Serverless, and Amazon S3.

New migration features in OpenSearch Ingestion

In November 2023, OpenSearch Ingestion announced the release of new features to support data migration from self-managed Elasticsearch version 7.x domains to the latest versions of OpenSearch Service.

OpenSearch Ingestion also supports the migration of data from OpenSearch Service managed domains running OpenSearch version 2.x to OpenSearch Serverless collections.

Learn how you can use OpenSearch Ingestion to migrate your data to OpenSearch Service.

Improve data durability with OpenSearch Ingestion

In November 2023, OpenSearch Ingestion introduced persistent buffering for push-based sources like HTTP sources (HTTP, Fluentd, FluentBit) and OpenTelemetry collectors.

By default, OpenSearch Ingestion uses in-memory buffering. With persistent buffering, OpenSearch Ingestion stores your data in a disk-based store that is more resilient. If you have existing ingestion pipelines, you can enable persistent buffering for these pipelines, as shown in the following screenshot.

Support of new plugins

In early 2023, OpenSearch Ingestion added support for Amazon Managed Streaming for Apache Kafka (Amazon MSK). OpenSearch Ingestion uses the Kafka plugin to stream data from Amazon MSK to OpenSearch Service managed domains or OpenSearch Serverless collections. To learn more about setting up Amazon MSK as a data source, see Using an OpenSearch Ingestion pipeline with Amazon Managed Streaming for Apache Kafka.

OpenSearch Serverless updates

OpenSearch Serverless continued to enhance your serverless experience with OpenSearch by introducing a new collection type, vector search, to store embeddings and run similarity search. OpenSearch Serverless now supports shard replica scaling to handle spikes in query throughput. And if you are using a time series collection, you can now set up your custom data retention policy to match your data retention requirements.

Vector Engine for OpenSearch Serverless

In November 2023, we launched the vector engine for Amazon OpenSearch Serverless. The vector engine makes it straightforward to build modern ML-augmented search experiences and generative artificial intelligence (generative AI) applications without needing to manage the underlying vector database infrastructure. It also enables you to run hybrid search, combining vector search and full-text search in the same query, removing the need to manage and maintain separate data stores or a complex application stack.

OpenSearch Serverless lower-cost dev and test environments

OpenSearch Serverless now supports development and test workloads by allowing you to avoid running a replica. Removing replicas eliminates the need to have redundant OCUs in another Availability Zone solely for availability purposes. If you are using OpenSearch Serverless for development and testing, where availability is not a concern, you can drop your minimum OCUs from 4 to 2.

OpenSearch Serverless supports automated time-based data deletion using data lifecycle policies

In December 2023, OpenSearch Serverless announced support for managing data retention of time series collections and indexes. With the new automated time-based data deletion feature, you can specify how long you want to retain data. OpenSearch Serverless automatically manages the lifecycle of the data based on this configuration. To learn more, refer to Amazon OpenSearch Serverless now supports automated time-based data deletion.
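A retention rule for a data lifecycle policy can be expressed as a small JSON document. The following sketch builds one. The collection name, index pattern, and retention period are hypothetical, and the field names (Rules, ResourceType, Resource, MinIndexRetention) reflect our understanding of the policy format, so verify them against the OpenSearch Serverless documentation before use.

```python
import json


def retention_policy(collection: str, index_pattern: str, days: int) -> str:
    """Build a lifecycle policy document that retains matching indexes for `days` days."""
    policy = {
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": [f"index/{collection}/{index_pattern}"],
                "MinIndexRetention": f"{days}d",
            }
        ]
    }
    return json.dumps(policy, indent=2)


# Hypothetical collection and index pattern.
print(retention_policy("fizzywig-logs", "plant-*", 15))
```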

OpenSearch Serverless announced support for scaling up replicas at shard level

At launch, OpenSearch Serverless supported increasing capacity automatically in response to growing data sizes. With the new shard replica scaling feature, OpenSearch Serverless automatically detects shards under duress due to sudden spikes in query rates and dynamically adds new shard replicas to handle the increased query throughput while maintaining fast response times. This approach proves to be more cost-efficient than simply adding new index replicas.

AWS user notifications to monitor your OCU usage

You can now configure notifications for when OCU utilization is approaching or has reached the maximum configured limits for search or ingestion. With the new AWS User Notifications integration, the system notifies you whenever a capacity threshold is breached, which removes the need to monitor the service constantly. For more information, see Monitoring Amazon OpenSearch Serverless using AWS User Notifications.

Enhance your experience with OpenSearch Dashboards

OpenSearch 2.9 in OpenSearch Service introduced new features to make it straightforward to quickly analyze your data in OpenSearch Dashboards. These new features include the new out-of-the box, preconfigured dashboards with OpenSearch Integrations, and the ability to create alerting and anomaly detection from an existing visualization in your dashboards.

OpenSearch Dashboard integrations

OpenSearch 2.9 added support for OpenSearch integrations in OpenSearch Dashboards. OpenSearch integrations include preconfigured dashboards so you can quickly start analyzing your data coming from popular sources such as Amazon CloudFront, AWS WAF, AWS CloudTrail, and Amazon Virtual Private Cloud (Amazon VPC) flow logs.

Alerting and anomalies in OpenSearch Dashboards

In OpenSearch Service 2.9, you can create a new alerting monitor directly from your line chart visualization in OpenSearch Dashboards. You can also associate the existing monitors or detectors previously created in OpenSearch to the dashboard visualization.

This new feature helps reduce context switching between dashboards and the Alerting and Anomaly Detection plugins. Refer to the following dashboard to add an alerting monitor to detect drops in average data volume in your services.

OpenSearch expands geospatial aggregations support

With OpenSearch version 2.9, OpenSearch Service added support for three types of geoshape data aggregation through the API: geo_bounds, geo_hash, and geo_tile.

The geoshape field type provides the possibility to index location data in different geographic formats such as a point, a polygon, or a linestring. With the new aggregation types, you have more flexibility to aggregate documents from an index using metric and multi-bucket geospatial aggregations.
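
As a rough illustration of the aggregations described above, the following sketch builds a search request body that combines one metric aggregation and two multi-bucket aggregations. It assumes a hypothetical geoshape field named location, and it assumes the query DSL names geo_bounds, geohash_grid, and geotile_grid for the three aggregation types; check the names against the documentation for your OpenSearch version before relying on them.

```python
import json


def geoshape_aggregations(field: str, precision: int = 4) -> dict:
    """Build a search body with one bounds metric and two grid bucket aggregations."""
    return {
        "size": 0,  # return only aggregation results, not the matching documents
        "aggs": {
            # Metric aggregation: the bounding box that contains all shapes.
            "bounding_box": {"geo_bounds": {"field": field}},
            # Multi-bucket aggregations: group shapes into grid cells.
            "by_geohash": {"geohash_grid": {"field": field, "precision": precision}},
            "by_geotile": {"geotile_grid": {"field": field, "precision": precision}},
        },
    }


# "location" is a hypothetical field name used only for this example.
body = geoshape_aggregations("location")
print(json.dumps(body, indent=2))
```

The body is plain JSON, so it can be sent with any HTTP client or OpenSearch client library to the _search endpoint of an index.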

OpenSearch Service operational updates

OpenSearch Service removed the need to run a blue/green deployment when changing the dedicated cluster manager nodes. Additionally, the service improved Auto-Tune events and added new Auto-Tune metrics to track the changes within your OpenSearch Service domain.

OpenSearch Service now lets you update cluster manager nodes without blue/green deployment

Since early in the second half of 2023, OpenSearch Service has allowed you to modify the instance type or instance count of dedicated cluster manager nodes without the need for a blue/green deployment. This enhancement allows quicker updates with minimal disruption to your domain operations, all while avoiding any data movement.

Previously, updating your dedicated cluster manager nodes on OpenSearch Service meant using a blue/green deployment to make the change. Although blue/green deployments are meant to avoid any disruption to your domains, the deployment uses additional resources on the domain, so it is recommended that you perform them during low-traffic periods. Now you can update cluster manager instance types or instance counts without requiring a blue/green deployment, so these updates can complete faster while avoiding any potential disruption to your domain operations. In cases where you modify both the cluster manager instance type and count, OpenSearch Service will still use a blue/green deployment to make the change. You can use the dry-run option to check whether your change requires a blue/green deployment.

Enhanced Auto-Tune experience

In September 2023, OpenSearch Service added new Auto-Tune metrics and improved Auto-Tune events that give you better visibility into the domain performance optimizations made by Auto-Tune.

Auto-Tune is an adaptive resource management system that automatically updates OpenSearch Service domain resources to improve efficiency and performance. For example, Auto-Tune optimizes memory-related configuration such as queue sizes, cache sizes, and Java virtual machine (JVM) settings on your nodes.

With this launch, you can now audit the history of the changes, as well as track them in real time from the Amazon CloudWatch console.

Additionally, OpenSearch Service now publishes details of the changes to Amazon EventBridge when Auto-Tune settings are recommended or applied to an OpenSearch Service domain. These Auto-Tune events will also be visible on the Notifications page on the OpenSearch Service console.

Accelerate your migration to OpenSearch Service with the new Migration Assistant solution

In November 2023, the OpenSearch team launched a new open-source solution—Migration Assistant for Amazon OpenSearch Service. The solution supports data migration from self-managed Elasticsearch and OpenSearch domains to OpenSearch Service, supporting Elasticsearch 7.x (<=7.10), OpenSearch 1.x, and OpenSearch 2.x as migration sources. The solution facilitates the migration of the existing and live data between source and destination.

Conclusion

In this post, we covered the new releases in OpenSearch Service to help you innovate your business with search, observability, security analytics, and migrations. We provided you with information about when to use each new feature in OpenSearch Service, OpenSearch Ingestion, and OpenSearch Serverless.

Learn more about OpenSearch Dashboards and OpenSearch plugins and the new exciting OpenSearch assistant using OpenSearch playground.

Check out the features described in this post. We appreciate your feedback.


About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Muslim Abu Taha is a Sr. OpenSearch Specialist Solutions Architect dedicated to guiding clients through seamless search workload migrations, fine-tuning clusters for peak performance, and ensuring cost-effectiveness. With a background as a Technical Account Manager (TAM), Muslim brings a wealth of experience in assisting enterprise customers with cloud adoption and optimizing their diverse workloads. Muslim enjoys spending time with his family, traveling, and exploring new places.

Amazon OpenSearch Service’s vector database capabilities explained

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-services-vector-database-capabilities-explained/

OpenSearch is a scalable, flexible, and extensible open-source software suite for search, analytics, security monitoring, and observability applications, licensed under the Apache 2.0 license. It comprises a search engine, OpenSearch, which delivers low-latency search and aggregations, OpenSearch Dashboards, a visualization and dashboarding tool, and a suite of plugins that provide advanced capabilities like alerting, fine-grained access control, observability, security monitoring, and vector storage and processing. Amazon OpenSearch Service is a fully managed service that makes it simple to deploy, scale, and operate OpenSearch in the AWS Cloud.

As an end-user, when you use OpenSearch’s search capabilities, you generally have a goal in mind—something you want to accomplish. Along the way, you use OpenSearch to gather information in support of achieving that goal (or maybe the information is the original goal). We’ve all become used to the “search box” interface, where you type some words, and the search engine brings back results based on word-to-word matching. Let’s say you want to buy a couch in order to spend cozy evenings with your family around the fire. You go to Amazon.com, and you type “a cozy place to sit by the fire.” Unfortunately, if you run that search on Amazon.com, you get items like fire pits, heating fans, and home decorations—not what you intended. The problem is that couch manufacturers probably didn’t use the words “cozy,” “place,” “sit,” and “fire” in their product titles or descriptions.

In recent years, machine learning (ML) techniques have become increasingly popular to enhance search. Among them is the use of embedding models, a type of model that can encode a large body of data into an n-dimensional space where each entity is encoded into a vector, a data point in that space, and organized such that similar entities are closer together. An embedding model, for instance, could encode the semantics of a corpus. By searching for the vectors nearest to an encoded document, a technique known as k-nearest neighbor (k-NN) search, you can find the most semantically similar documents. Sophisticated embedding models can support multiple modalities, for instance, encoding the image and text of a product catalog and enabling similarity matching on both modalities.
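
To make the idea of nearest-neighbor search concrete, here is a minimal sketch of exact k-NN over a handful of two-dimensional vectors. The catalog entries and their coordinates are invented for illustration; a real embedding model would produce vectors with hundreds of dimensions.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(query, vectors, k):
    """Return the ids of the k vectors closest to the query, nearest first."""
    ranked = sorted(vectors, key=lambda vid: euclidean(query, vectors[vid]))
    return ranked[:k]


# Hypothetical embeddings: similar items are placed close together.
catalog = {
    "couch": [0.9, 0.8],
    "armchair": [0.8, 0.7],
    "fire pit": [0.1, 0.9],
    "heating fan": [0.2, 0.3],
}
query = [0.85, 0.8]  # stands in for the encoded search text
print(knn(query, catalog, 2))
```

This brute-force version compares the query with every vector, which is why specialized indexes matter once the catalog grows.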

A vector database provides efficient vector similarity search by providing specialized indexes like k-NN indexes. It also provides other database functionality like managing vector data alongside other data types, workload management, access control and more. OpenSearch’s k-NN plugin provides core vector database functionality for OpenSearch, so when your customer searches for “a cozy place to sit by the fire” in your catalog, you can encode that prompt and use OpenSearch to perform a nearest neighbor query to surface that 8-foot, blue couch with designer arranged photographs in front of fireplaces.

Using OpenSearch Service as a vector database

With OpenSearch Service’s vector database capabilities, you can implement semantic search, Retrieval Augmented Generation (RAG) with LLMs, recommendation engines, and search rich media.

Semantic search

With semantic search, you improve the relevance of retrieved results using language-based embeddings on search documents. You enable your search customers to use natural language queries, like “a cozy place to sit by the fire” to find their 8-foot-long blue couch. For more information, refer to Building a semantic search engine in OpenSearch to learn how semantic search can deliver a 15% relevance improvement, as measured by normalized discounted cumulative gain (nDCG) metrics compared with keyword search. For a concrete example, our Improve search relevance with ML in Amazon OpenSearch Service workshop explores the difference between keyword and semantic search, based on a Bidirectional Encoder Representations from Transformers (BERT) model, hosted by Amazon SageMaker to generate vectors and store them in OpenSearch. The workshop uses product question answers as an example to show how keyword search using the keywords/phrases of the query leads to some irrelevant results. Semantic search is able to retrieve more relevant documents by matching the context and semantics of the query. The following diagram shows an example architecture for a semantic search application with OpenSearch Service as the vector database.

Architecture diagram showing how to use Amazon OpenSearch Service to perform semantic search to improve relevance

Retrieval Augmented Generation with LLMs

RAG is a method for building trustworthy generative AI chatbots using generative LLMs like OpenAI's ChatGPT or Amazon Titan Text. With the rise of generative LLMs, application developers are looking for ways to take advantage of this innovative technology. One popular use case involves delivering conversational experiences through intelligent agents. Perhaps you’re a software provider with knowledge bases for product information, customer self-service, or industry domain knowledge like tax reporting rules or medical information about diseases and treatments. A conversational search experience provides an intuitive interface for users to sift through information through dialog and Q&A. Generative LLMs on their own are prone to hallucinations, where the model generates a believable but factually incorrect response. RAG addresses this problem by complementing generative LLMs with an external knowledge base that is typically built using a vector database hydrated with vector-encoded knowledge articles.

As illustrated in the following diagram, the query workflow starts with a question that is encoded and used to retrieve relevant knowledge articles from the vector database. Those results are sent to the generative LLM whose job is to augment those results, typically by summarizing the results as a conversational response. By complementing the generative model with a knowledge base, RAG grounds the model on facts to minimize hallucinations. You can learn more about building a RAG solution in the Retrieval Augmented Generation module of our semantic search workshop.

Architecture diagram showing how to use Amazon OpenSearch Service to perform retrieval-augmented generation
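
The query workflow described above can be sketched in a few lines. This is only a toy under stated assumptions: embed is a stand-in that counts words from a fixed vocabulary rather than a real embedding model, the articles are invented, and the final call to a generative LLM is left out because it depends on the model you choose.

```python
import math

# Hypothetical vocabulary for the toy encoder below.
VOCAB = ["tax", "filing", "deadline", "refund", "flu", "symptoms", "treatment"]


def embed(text):
    """Toy stand-in for an embedding model: word counts over a fixed vocabulary."""
    words = text.lower().split()
    return [words.count(term) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question, articles, k=1):
    """Encode the question and return the k most similar knowledge articles."""
    q = embed(question)
    ranked = sorted(articles, key=lambda text: cosine(q, embed(text)), reverse=True)
    return ranked[:k]


def build_prompt(question, context):
    """Combine retrieved articles and the question into a prompt for the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only these articles:\n{joined}\n\nQuestion: {question}"


articles = [
    "The tax filing deadline is in April and a refund follows filing",
    "Common flu symptoms include fever and treatment is rest",
]
question = "When is the tax deadline"
context = retrieve(question, articles)
prompt = build_prompt(question, context)
print(prompt)
```

In a real deployment, the retrieve step would be a vector query against OpenSearch, and the prompt would be sent to the generative model.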

Recommendation engine

Recommendations are a common component in the search experience, especially for ecommerce applications. Adding a user experience feature like “more like this” or “customers who bought this also bought that” can drive additional revenue through getting customers what they want. Search architects employ many techniques and technologies to build recommendations, including Deep Neural Network (DNN) based recommendation algorithms such as the two-tower neural net model, YouTubeDNN. A trained embedding model encodes products, for example, into an embedding space where products that are frequently bought together are considered more similar, and therefore are represented as data points that are closer together in the embedding space. Another possibility is that product embeddings are based on co-rating similarity instead of purchase activity. You can employ this affinity data through calculating the vector similarity between a particular user’s embedding and vectors in the database to return recommended items. The following diagram shows an example architecture of building a recommendation engine with OpenSearch as a vector store.

Architecture diagram showing how to use Amazon OpenSearch Service as a recommendation engine

Media search

Media search enables users to query the search engine with rich media like images, audio, and video. Its implementation is similar to semantic search—you create vector embeddings for your search documents and then query OpenSearch Service with a vector. The difference is you use a computer vision deep neural network (e.g. Convolutional Neural Network (CNN)) such as ResNet to convert images into vectors. The following diagram shows an example architecture of building an image search with OpenSearch as the vector store.

Architecture diagram showing how to use Amazon OpenSearch Service to search rich media like images, videos, and audio files

Understanding the technology

OpenSearch uses approximate nearest neighbor (ANN) algorithms from the NMSLIB, FAISS, and Lucene libraries to power k-NN search. These search methods employ ANN to improve search latency for large datasets. Of the three search methods the k-NN plugin provides, this method offers the best search scalability for large datasets. The engine details are as follows:

  • Non-Metric Space Library (NMSLIB) – NMSLIB implements the HNSW ANN algorithm
  • Facebook AI Similarity Search (FAISS) – FAISS implements both HNSW and IVF ANN algorithms
  • Lucene – Lucene implements the HNSW algorithm

Each of the three engines used for approximate k-NN search has its own attributes that make one more sensible to use than the others in a given situation. You can follow the general information in this section to help determine which engine will best meet your requirements.

In general, NMSLIB and FAISS should be selected for large-scale use cases. Lucene is a good option for smaller deployments, and it also offers benefits like smart filtering, where the optimal filtering strategy (pre-filtering, post-filtering, or exact k-NN) is automatically applied depending on the situation. The following table summarizes the differences between each option.

| | NMSLIB-HNSW | FAISS-HNSW | FAISS-IVF | Lucene-HNSW |
| --- | --- | --- | --- | --- |
| Max Dimension | 16,000 | 16,000 | 16,000 | 1024 |
| Filter | Post filter | Post filter | Post filter | Filter while search |
| Training Required | No | No | Yes | No |
| Similarity Metrics | l2, innerproduct, cosinesimil, l1, linf | l2, innerproduct | l2, innerproduct | l2, cosinesimil |
| Vector Volume | Tens of billions | Tens of billions | Tens of billions | < Ten million |
| Indexing latency | Low | Low | Lowest | Low |
| Query Latency & Quality | Low latency & high quality | Low latency & high quality | Low latency & low quality | High latency & high quality |
| Vector Compression | Flat | Flat, Product Quantization | Flat, Product Quantization | Flat |
| Memory Consumption | High | High, Low with PQ | Medium, Low with PQ | High |
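
The engine choice shows up in the index mapping. The following sketch builds an index body for an HNSW-backed knn_vector field and rejects dimensions above the limits listed in the table. The field name embedding and the m and ef_construction values are illustrative assumptions, not recommendations.

```python
import json

# Maximum dimensions per engine, taken from the comparison table.
MAX_DIMENSION = {"nmslib": 16000, "faiss": 16000, "lucene": 1024}


def knn_index_body(dimension, engine="lucene", space_type="l2"):
    """Build index settings and mappings for an HNSW knn_vector field."""
    if dimension > MAX_DIMENSION[engine]:
        raise ValueError(f"{engine} supports at most {MAX_DIMENSION[engine]} dimensions")
    return {
        "settings": {"index": {"knn": True}},  # enable k-NN for this index
        "mappings": {
            "properties": {
                "embedding": {  # hypothetical field name
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "engine": engine,
                        "space_type": space_type,
                        # Illustrative values; tune for your workload.
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                }
            }
        },
    }


print(json.dumps(knn_index_body(768, engine="faiss"), indent=2))
```

The resulting body is what you would send when creating the index.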

Approximate and exact nearest-neighbor search

The OpenSearch Service k-NN plugin supports three different methods for obtaining the k-nearest neighbors from an index of vectors: approximate k-NN, score script (exact k-NN), and painless extensions (exact k-NN).

Approximate k-NN

The first method takes an approximate nearest neighbor approach—it uses one of several algorithms to return the approximate k-nearest neighbors to a query vector. Usually, these algorithms sacrifice indexing speed and search accuracy in return for performance benefits such as lower latency, smaller memory footprints, and more scalable search. Approximate k-NN is the best choice for searches over large indexes (that is, hundreds of thousands of vectors or more) that require low latency. You should not use approximate k-NN if you want to apply a filter on the index before the k-NN search, which greatly reduces the number of vectors to be searched. In this case, you should use either the score script method or painless extensions.
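
As a minimal sketch, an approximate k-NN search is expressed with the knn query type. The field name embedding and the query vector below are placeholders.

```python
import json


def approximate_knn_query(field, vector, k):
    """Build a search body that asks for the k approximate nearest neighbors."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }


# Placeholder field name and vector for illustration.
approx_body = approximate_knn_query("embedding", [0.1, 0.4, 0.3], k=5)
print(json.dumps(approx_body))
```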

Score script

The second method extends the OpenSearch Service score script functionality to run a brute force, exact k-NN search over knn_vector fields or fields that can represent binary objects. With this approach, you can run k-NN search on a subset of vectors in your index (sometimes referred to as a pre-filter search). This approach is preferred for searches over smaller bodies of documents or when a pre-filter is needed. Using this approach on large indexes may lead to high latencies.
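
A pre-filtered exact search with the score script method might look like the following sketch. The inner query narrows the candidate set, and the knn_score script then scores only those documents. The field names and the filter are hypothetical.

```python
import json


def score_script_query(field, vector, pre_filter, k, space_type="l2"):
    """Build an exact k-NN search that scores only documents matching pre_filter."""
    return {
        "size": k,
        "query": {
            "script_score": {
                # Only documents matching this filter are scored.
                "query": {"bool": {"filter": pre_filter}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": field,
                        "query_value": vector,
                        "space_type": space_type,
                    },
                },
            }
        },
    }


# Hypothetical field names and filter values.
script_body = score_script_query(
    "embedding", [0.1, 0.4, 0.3], {"term": {"color": "blue"}}, k=3
)
print(json.dumps(script_body))
```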

Painless extensions

The third method adds the distance functions as painless extensions that you can use in more complex combinations. Similar to the k-NN score script, you can use this method to perform a brute force, exact k-NN search across an index, which also supports pre-filtering. This approach has slightly slower query performance compared to the k-NN score script. If your use case requires more customization over the final score, you should use this approach over score script k-NN.
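
With painless extensions, the distance function becomes one term in a script that you control. The sketch below adds 1.0 to the cosine similarity so that scores stay non-negative; the field name and filter are again hypothetical, and any further customization of the score would go into the script source.

```python
import json


def painless_knn_query(field, vector, pre_filter, k):
    """Build an exact k-NN search whose score is computed by a painless script."""
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"bool": {"filter": pre_filter}},
                "script": {
                    # The distance function is a building block in the expression.
                    "source": "1.0 + cosineSimilarity(params.query_value, doc[params.field])",
                    "params": {"field": field, "query_value": vector},
                },
            }
        },
    }


# Hypothetical field names and filter values.
painless_body = painless_knn_query(
    "embedding", [0.1, 0.4, 0.3], {"term": {"color": "blue"}}, k=3
)
print(json.dumps(painless_body))
```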

Vector search algorithms

The simplest way to find similar vectors is to use k-nearest neighbors (k-NN) algorithms, which compute the distance between a query vector and the other vectors in the vector database. As we mentioned earlier, the score script k-NN and painless extensions search methods use the exact k-NN algorithms under the hood. However, in the case of extremely large datasets with high dimensionality, this creates a scaling problem that reduces the efficiency of the search. Approximate nearest neighbor (ANN) search methods can overcome this by employing tools that restructure indexes more efficiently and reduce the dimensionality of searchable vectors. There are different ANN search algorithms; for example, locality sensitive hashing, tree-based, cluster-based, and graph-based. OpenSearch implements two ANN algorithms: Hierarchical Navigable Small Worlds (HNSW) and Inverted File (IVF). For a more detailed explanation of how the HNSW and IVF algorithms work in OpenSearch, see the blog post “Choose the k-NN algorithm for your billion-scale use case with OpenSearch”.

Hierarchical Navigable Small Worlds

The HNSW algorithm is one of the most popular algorithms out there for ANN search. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.
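
The traversal idea can be sketched on a single-layer graph. This is a simplification, not the HNSW implementation used by any of the engines: it omits the layered hierarchy and graph construction, and the graph, entry point, and ef value (the size of the working result set) are hand-picked for illustration.

```python
import heapq
import math


def greedy_search(graph, vectors, query, entry, k, ef=4):
    """Best-first traversal of a proximity graph; returns up to k approximate neighbors."""
    visited = {entry}
    start = math.dist(vectors[entry], query)
    candidates = [(start, entry)]   # min-heap: closest unexpanded node first
    results = [(-start, entry)]     # max-heap (negated): best ef nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # the closest remaining candidate cannot improve the results
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            nd = math.dist(vectors[neighbor], query)
            if len(results) < ef or nd < -results[0][0]:
                heapq.heappush(candidates, (nd, neighbor))
                heapq.heappush(results, (-nd, neighbor))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    ordered = sorted((-neg, n) for neg, n in results)
    return [n for _, n in ordered][:k]


# Six points on a line, each linked to its immediate neighbors.
vectors = {i: [float(i), 0.0] for i in range(6)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 5] for i in range(6)}
print(greedy_search(graph, vectors, [4.2, 0.0], entry=0, k=2, ef=3))
```

Starting from node 0, the search walks along the edges toward the query and stops once no unexpanded candidate is closer than the worst result it is holding.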

Inverted File

The IVF algorithm separates your index vectors into a set of buckets, then, to reduce your search time, only searches through a subset of these buckets. However, if the algorithm just randomly split up your vectors into different buckets, and only searched a subset of them, it would yield a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.
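
The bucket idea can be sketched as follows. In practice the representative vectors come from a training step (for example, clustering a sample of the data); here they are hand-picked, and the nprobe parameter name for the number of buckets to search is an assumption borrowed from common IVF terminology.

```python
import math


def nearest_bucket(centroids, vector):
    """Index of the representative vector closest to the given vector."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], vector))


def build_ivf(vectors, centroids):
    """Assign every vector id to the bucket with the closest representative."""
    buckets = {i: [] for i in range(len(centroids))}
    for vid, vector in vectors.items():
        buckets[nearest_bucket(centroids, vector)].append(vid)
    return buckets


def ivf_search(query, vectors, centroids, buckets, k, nprobe=1):
    """Search only the nprobe buckets whose representatives are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(centroids[i], query))
    candidates = [vid for b in order[:nprobe] for vid in buckets[b]]
    return sorted(candidates, key=lambda vid: math.dist(vectors[vid], query))[:k]


# Hand-picked representatives and vectors for illustration.
centroids = [[0.0, 0.0], [10.0, 10.0]]
points = {"a": [1.0, 1.0], "b": [0.0, 2.0], "c": [9.0, 9.0], "d": [11.0, 10.0]}
buckets = build_ivf(points, centroids)
print(buckets)
print(ivf_search([8.0, 9.0], points, centroids, buckets, k=1))
```

Raising nprobe searches more buckets, which trades latency for a better approximation.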

Vector similarity metrics

All search engines use a similarity metric to rank and sort results and bring the most relevant results to the top. When you use a plain text query, the similarity metric is called TF-IDF, which measures the importance of the terms in the query and generates a score based on the number of textual matches. When your query includes a vector, the similarity metrics are spatial in nature, taking advantage of proximity in the vector space. OpenSearch supports several similarity or distance measures:

  • Euclidean distance – The straight-line distance between points.
  • L1 (Manhattan) distance – The sum of the absolute differences of all of the vector components. L1 distance measures how many orthogonal city blocks you need to traverse from point A to point B.
  • L-infinity (chessboard) distance – The number of moves a King would make on an n-dimensional chessboard. It’s different than Euclidean distance on the diagonals: a diagonal step on a 2-dimensional chessboard is 1.41 Euclidean units away, but only 1 L-infinity unit away.
  • Inner product – The product of the magnitudes of two vectors and the cosine of the angle between them. Usually used for natural language processing (NLP) vector similarity.
  • Cosine similarity – The cosine of the angle between two vectors in a vector space.
  • Hamming distance – For binary-coded vectors, the number of bits that differ between the two vectors.
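
For reference, each of these measures fits in a line or two of plain Python. This is a sketch of the mathematical definitions only, not of how any engine computes or normalizes scores.

```python
import math


def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """L1 distance: sum of absolute component differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """L-infinity distance: largest absolute component difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def inner_product(a, b):
    """Sum of component-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


# One diagonal step on a 2-dimensional grid, measured three ways.
origin, diagonal = [0, 0], [1, 1]
print(euclidean(origin, diagonal), manhattan(origin, diagonal), chebyshev(origin, diagonal))
```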

Advantage of OpenSearch as a vector database

When you use OpenSearch Service as a vector database, you can take advantage of the service’s features like usability, scalability, availability, interoperability, and security. More importantly, you can use OpenSearch’s search features to enhance the search experience. For example, you can use Learning to Rank in OpenSearch to integrate user clickthrough behavior data into your search application and improve search relevance. You can also combine OpenSearch text search and vector search capabilities to search documents with keyword and semantic similarity. You can also use other fields in the index to filter documents to improve relevance. For advanced users, you can use a hybrid scoring model to combine OpenSearch’s text-based relevance score, computed with the Okapi BM25 function and its vector search score to improve the ranking of your search results.
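
One way to think about a hybrid scoring model is as a weighted blend of two normalized score lists. The sketch below uses min-max normalization and a fixed weight; both are assumptions chosen for illustration rather than the method OpenSearch applies, and the scores are invented.

```python
def min_max(scores):
    """Rescale scores to the range 0..1 so that the two lists are comparable."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def hybrid_rank(text_scores, vector_scores, text_weight=0.5):
    """Rank documents by a weighted sum of normalized text and vector scores."""
    nt, nv = min_max(text_scores), min_max(vector_scores)
    combined = {
        doc: text_weight * nt.get(doc, 0.0) + (1 - text_weight) * nv.get(doc, 0.0)
        for doc in set(nt) | set(nv)
    }
    return sorted(combined, key=combined.get, reverse=True)


# Invented scores: BM25-style text relevance and vector similarity.
text_scores = {"couch": 2.0, "fire pit": 8.0, "sofa": 5.0}
vector_scores = {"couch": 0.9, "fire pit": 0.2, "sofa": 0.8}
print(hybrid_rank(text_scores, vector_scores, text_weight=0.3))
```

A document that scores reasonably well on both signals can outrank one that leads on only one of them.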

Scale and limits

OpenSearch as a vector database supports billions of vector records. Keep in mind the following guidance regarding the number of vectors and dimensions when sizing your cluster.

Number of vectors

OpenSearch VectorDB takes advantage of the sharding capabilities of OpenSearch and can scale to billions of vectors at single-digit millisecond latencies by sharding vectors and scaling horizontally through the addition of more nodes. The number of vectors that can fit in a single machine is a function of the off-heap memory availability on the machine. The number of nodes required will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. The more nodes, the more memory and better performance. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. This is set to half of the instance’s RAM, capped at approximately 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. This is set to 0.5.
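
As a quick worked example, the per-node calculation, memory_available = (node_memory - jvm_size) * circuit_breaker_limit, can be written as a small function using the parameter values listed above. The instance sizes are arbitrary examples, and the heap cap is taken as exactly 32 GB for simplicity.

```python
def memory_available_gb(node_memory_gb, circuit_breaker_limit=0.5, max_heap_gb=32.0):
    """Estimate the native memory (GB) available for vector indexes on one node."""
    # The JVM heap takes half of the instance RAM, capped at about 32 GB.
    jvm_size_gb = min(node_memory_gb / 2, max_heap_gb)
    return (node_memory_gb - jvm_size_gb) * circuit_breaker_limit


for ram_gb in (16, 64, 128):
    print(ram_gb, "GB RAM ->", memory_available_gb(ram_gb), "GB for vectors")
```

Because the heap is capped, larger instances leave a bigger share of their memory for vector indexes.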

The total cluster memory estimate depends on the total number of vector records and the algorithm you choose. HNSW and IVF have different memory requirements. You can refer to Memory Estimation for more details.

Number of dimensions

OpenSearch’s current dimension limit for the vector field knn_vector is 16,000 dimensions. Each dimension is represented as a 32-bit float. The more dimensions, the more memory you’ll need to index and search. The number of dimensions is usually determined by the embedding models that translate the entity to a vector. There are a lot of options to choose from when building your knn_vector field. To determine the correct methods and parameters to choose, refer to Choosing the right method.
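
Because each dimension is a 32-bit float, the raw size of the vector data follows directly from the vector count and the dimension. The sketch below computes only that lower bound; it deliberately leaves out the algorithm-specific index overhead, and the example figures are arbitrary.

```python
def raw_vector_bytes(num_vectors, dimension, bytes_per_dimension=4):
    """Raw storage for the vector values alone, excluding any index structures."""
    return num_vectors * dimension * bytes_per_dimension


# Example: one million 768-dimensional vectors.
size_bytes = raw_vector_bytes(1_000_000, 768)
print(size_bytes / 1e9, "GB of raw vector data")
```

Doubling the dimension doubles this figure, which is why the choice of embedding model has a direct effect on cluster sizing.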

Customer stories:

Amazon Music

Amazon Music is always innovating to provide customers with unique and personalized experiences. One of Amazon Music’s approaches to music recommendations is a remix of a classic Amazon innovation, item-to-item collaborative filtering, and vector databases. Using data aggregated based on user listening behavior, Amazon Music has created an embedding model that encodes music tracks and customer representations into a vector space where neighboring vectors represent tracks that are similar. 100 million songs are encoded into vectors, indexed into OpenSearch, and served across multiple geographies to power real-time recommendations. OpenSearch currently manages 1.05 billion vectors and supports a peak load of 7,100 vector queries per second to power Amazon Music recommendations.

The item-to-item collaborative filter continues to be among the most popular methods for online product recommendations because of its effectiveness at scaling to large customer bases and product catalogs. OpenSearch makes it easier to operationalize and further the scalability of the recommender by providing scale-out infrastructure and k-NN indexes that grow linearly with respect to the number of tracks and support similarity search in logarithmic time.

The following figure visualizes the high-dimensional space created by the vector embedding.

A visualization of the vector encoding of Amazon Music entries in the large vector space

Brand protection at Amazon

Amazon strives to deliver the world’s most trustworthy shopping experience, offering customers the widest possible selection of authentic products. To earn and maintain our customers’ trust, we strictly prohibit the sale of counterfeit products, and we continue to invest in innovations that ensure only authentic products reach our customers. Amazon’s brand protection programs build trust with brands by accurately representing and completely protecting their brand. We strive to ensure that public perception mirrors the trustworthy experience we deliver. Our brand protection strategy focuses on four pillars: (1) Proactive Controls (2) Powerful Tools to Protect Brands (3) Holding Bad Actors Accountable (4) Protecting and Educating Customers. Amazon OpenSearch Service is a key part of Amazon’s Proactive Controls.

In 2022, Amazon’s automated technology scanned more than 8 billion attempted changes daily to product detail pages for signs of potential abuse. Our proactive controls found more than 99% of blocked or removed listings before a brand ever had to find and report it. These listings were suspected of being fraudulent, infringing, counterfeit, or at risk of other forms of abuse. To perform these scans, Amazon created tooling that uses advanced and innovative techniques, including the use of advanced machine learning models to automate the detection of intellectual property infringements in listings across Amazon’s stores globally. A key technical challenge in implementing such an automated system is the ability to search for protected intellectual property within a vast billion-vector corpus in a fast, scalable, and cost-effective manner. Leveraging Amazon OpenSearch Service’s scalable vector database capabilities and distributed architecture, we successfully developed an ingestion pipeline that has indexed a total of 68 billion 128- and 1024-dimension vectors into OpenSearch Service to enable brands and automated systems to conduct infringement detection, in real time, through a highly available and fast (sub-second) search API.

Conclusion

Whether you’re building a generative AI solution, searching rich media and audio, or bringing more semantic search to your existing search-based application, OpenSearch is a capable vector database. OpenSearch supports a variety of engines, algorithms, and distance measures that you can employ to build the right solution. OpenSearch provides a scalable engine that can support vector search at low latency and up to billions of vectors. With OpenSearch and its vector DB capabilities, your users can find that 8-foot blue couch easily, and relax by a cozy fire.


About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Jianwei Li is a Principal Analytics Specialist TAM at Amazon Web Services. Jianwei provides consulting services to help customers design and build modern data platforms. Jianwei has worked in the big data domain as a software developer, consultant, and tech leader.

Dylan Tong is a Senior Product Manager at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

Vamshi Vijay Nakkirtha is a Software Engineering Manager working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include distributed systems. He is an active contributor to various plugins, like k-NN, GeoSpatial, and dashboard-maps.

Serverless logging with Amazon OpenSearch Service and Amazon Kinesis Data Firehose

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/serverless-logging-with-amazon-opensearch-service-and-amazon-kinesis-data-firehose/

In this post, you will learn how you can use Amazon Kinesis Data Firehose to build a log ingestion pipeline to send VPC flow logs to Amazon OpenSearch Serverless. First, you create the OpenSearch Serverless collection you use to store VPC flow logs, then you create a Kinesis Data Firehose delivery pipeline that forwards the flow logs to OpenSearch Serverless. Finally, you enable delivery of VPC flow logs to your Firehose delivery stream. The following diagram illustrates the solution workflow.

OpenSearch Serverless is a new serverless option offered by Amazon OpenSearch Service. OpenSearch Serverless makes it simple to run petabyte-scale search and analytics workloads without having to configure, manage, or scale OpenSearch clusters. OpenSearch Serverless automatically provisions and scales the underlying resources to deliver fast data ingestion and query responses for even the most demanding and unpredictable workloads.

Kinesis Data Firehose is a popular service that delivers streaming data from over 20 AWS services to over 15 analytical and observability tools such as OpenSearch Serverless. It gives you a fast and easy way to send your VPC flow logs to your OpenSearch Serverless collection in minutes, without writing a single line of code and without building or managing your own data ingestion and delivery infrastructure.

VPC flow logs capture information about the traffic going to and from the network interfaces in your VPC. With Kinesis Data Firehose support for OpenSearch Serverless, you can set up analysis of your VPC flow logs with just a few clicks. Kinesis Data Firehose provides a true end-to-end serverless mechanism to deliver your flow logs to OpenSearch Serverless, where you can use OpenSearch Dashboards to search through those logs, create dashboards, detect anomalies, and send alerts. VPC flow logs help you answer questions like:

  • What percentage of your traffic is getting dropped?
  • How much traffic is getting generated for specific sources and destinations?
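
Both questions can be answered from the raw records themselves. The following is a minimal Python sketch, assuming records in the default (version 2) flow log format, which is space-delimited; the sample records and the function names are made up for illustration and are not part of the solution in this post.

```python
from collections import Counter

# Field order of the default (version 2) VPC flow log record format.
DEFAULT_FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line):
    """Split one space-delimited flow log line into a field dictionary."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    return dict(zip(DEFAULT_FIELDS, values))


def ok_records(lines):
    """Yield parsed records, skipping NODATA and SKIPDATA entries."""
    for line in lines:
        record = parse_record(line)
        if record["log_status"] == "OK":
            yield record


def rejected_percentage(lines):
    """Return the share of records (0-100) whose action is REJECT."""
    actions = Counter(record["action"] for record in ok_records(lines))
    total = sum(actions.values())
    return 100.0 * actions["REJECT"] / total if total else 0.0


def bytes_by_pair(lines):
    """Sum the bytes transferred per (source, destination) address pair."""
    totals = Counter()
    for record in ok_records(lines):
        totals[(record["srcaddr"], record["dstaddr"])] += int(record["bytes"])
    return totals


# Made-up sample records, for illustration only.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.1.9 49152 443 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.7 10.0.0.5 51000 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d 10.0.0.5 10.0.1.9 49153 443 6 4 360 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d 203.0.113.7 10.0.0.5 51001 22 6 2 120 1700000060 1700000120 REJECT OK",
]

print(rejected_percentage(SAMPLE))             # 50.0
print(bytes_by_pair(SAMPLE).most_common(1))
```

In practice you would run the equivalent aggregations in OpenSearch Dashboards rather than in a script; the sketch only shows what the records contain.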

Create your OpenSearch Serverless collection

To get started, you first create a collection. An OpenSearch Serverless collection is a logical grouping of one or more indexes that represent an analytics workload. Complete the following steps:

  1. On the OpenSearch Service console, choose Collections under Serverless in the navigation pane.
  2. Choose Create a collection.
  3. For Collection name, enter a name (for example, vpc-flow-logs).
  4. For Collection type, choose Time series.
  5. For Encryption, choose your preferred encryption setting:
    1. Choose Use AWS owned key to use a key that AWS owns and manages for you.
    2. Choose a different AWS KMS key to use your own AWS Key Management Service (AWS KMS) key.
  6. For Network access settings, choose your preferred setting:
    1. Choose VPC to use a VPC endpoint.
    2. Choose Public to use a public endpoint.

AWS recommends that you use a VPC endpoint for all production workloads. For this walkthrough, select Public.

  7. Choose Create.

It should take a couple of minutes to create the collection.

The following graphic gives a quick demonstration of creating the OpenSearch Serverless collection via the preceding steps.
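
If you prefer to script these console steps, the following Python sketch shows one way to do it. It assumes the boto3 `opensearchserverless` client and its `create_security_policy` and `create_collection` operations; the helper names and the `-enc` and `-net` policy name suffixes are illustrative choices, not something the console creates for you.

```python
import json


def encryption_policy(collection_name):
    """Encryption policy document that selects an AWS owned key."""
    return {
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}
        ],
        "AWSOwnedKey": True,
    }


def network_policy(collection_name):
    """Network policy document that allows public access to the
    collection endpoint and to OpenSearch Dashboards."""
    resource = [f"collection/{collection_name}"]
    return [
        {
            "Rules": [
                {"ResourceType": "collection", "Resource": resource},
                {"ResourceType": "dashboard", "Resource": resource},
            ],
            "AllowFromPublic": True,
        }
    ]


def create_collection(collection_name, region):
    """Create the policies and a time series collection (calls AWS)."""
    import boto3  # imported here so the policy builders work without boto3

    client = boto3.client("opensearchserverless", region_name=region)
    client.create_security_policy(
        name=f"{collection_name}-enc",  # hypothetical policy name
        type="encryption",
        policy=json.dumps(encryption_policy(collection_name)),
    )
    client.create_security_policy(
        name=f"{collection_name}-net",  # hypothetical policy name
        type="network",
        policy=json.dumps(network_policy(collection_name)),
    )
    return client.create_collection(name=collection_name, type="TIMESERIES")
```

The encryption policy has to exist before the collection is created, which is why the sketch creates the policies first. As in the walkthrough, public access is for demonstration only.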

At this point, you have successfully created a collection for OpenSearch Serverless. Next, you create a delivery pipeline for Kinesis Data Firehose.

Create a Kinesis Data Firehose delivery stream

To set up a delivery stream for Kinesis Data Firehose, complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, specify Direct PUT.

Check out Source, Destination, and Name to learn more about different sources supported by Kinesis Data Firehose.

  3. For Destination, choose Amazon OpenSearch Serverless.
  4. For Delivery stream name, enter a name (for example, vpc-flow-logs).
  5. Under Destination settings, in the OpenSearch Serverless collection settings, choose Browse.
  6. Select vpc-flow-logs.
  7. Choose Choose.

If your collection is still creating, wait a few minutes and try again.

  8. For Index, specify vpc-flow-logs.
  9. In the Backup settings section, select Failed data only for the Source record backup in Amazon S3.

Kinesis Data Firehose uses Amazon Simple Storage Service (Amazon S3) to back up failed data that it attempts to deliver to your chosen destination. If you want to keep all data, select All data.

  10. For S3 Backup Bucket, choose Browse to select an existing S3 bucket, or choose Create to create a new bucket.
  11. Choose Create delivery stream.

The following graphic gives a quick demonstration of creating the Kinesis Data Firehose delivery stream via the preceding steps.
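
The same settings can be expressed as a request for the Firehose `CreateDeliveryStream` API. The following Python sketch only builds the request; the function name is hypothetical, and it assumes you already have an IAM role that Firehose can assume (the console creates one for you, the API does not).

```python
def delivery_stream_request(name, collection_endpoint, index_name,
                            role_arn, bucket_arn, backup_all=False):
    """Build keyword arguments for the Firehose CreateDeliveryStream API."""
    return {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "DirectPut",  # the Direct PUT source from step 2
        "AmazonOpenSearchServerlessDestinationConfiguration": {
            "RoleARN": role_arn,
            "CollectionEndpoint": collection_endpoint,
            "IndexName": index_name,
            # "Failed data only" versus "All data" in the console
            "S3BackupMode": "AllDocuments" if backup_all else "FailedDocumentsOnly",
            "S3Configuration": {"RoleARN": role_arn, "BucketARN": bucket_arn},
        },
    }


# Placeholder values, for illustration only.
request = delivery_stream_request(
    name="vpc-flow-logs",
    collection_endpoint="https://example-id.us-east-1.aoss.amazonaws.com",
    index_name="vpc-flow-logs",
    role_arn="arn:aws:iam::123456789012:role/example-firehose-role",
    bucket_arn="arn:aws:s3:::example-backup-bucket",
)
```

With valid values, you could pass the result to `boto3.client("firehose").create_delivery_stream(**request)`.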

At this point, you have successfully created a delivery stream for Kinesis Data Firehose, which you will use to stream data from your VPC flow logs and send it to your OpenSearch Serverless collection.

Set up the data access policy for your OpenSearch Serverless collection

Before you send any logs to OpenSearch Serverless, you need to create a data access policy within OpenSearch Serverless that allows Kinesis Data Firehose to write to the vpc-flow-logs index in your collection. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose the Configuration tab on the details page for the vpc-flow-logs delivery stream you just created.
  2. In the Permissions section, note down the AWS Identity and Access Management (IAM) role.
  3. Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
  4. Under Data access, choose Manage data access.
  5. Choose Create access policy.
  6. In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
  7. Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Provide the IAM role name that you got from the permissions page of the Firehose delivery stream, and the account ID for your AWS account.
    [
      {
        "Rules": [
          {
            "ResourceType": "index",
            "Resource": [
              "index/<collection-name>/<index-name>"
            ],
            "Permission": [
              "aoss:WriteDocument",
              "aoss:CreateIndex",
              "aoss:UpdateIndex"
            ]
          }
        ],
        "Principal": [
          "arn:aws:sts::<aws-account-id>:assumed-role/<IAM-role-name>/*"
        ]
      }
    ]

  8. Choose Create.

The following graphic gives a quick demonstration of creating the data access policy via the preceding steps.
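
If you manage several collections, filling in the placeholders by hand gets error-prone. The following Python sketch generates the same policy document from its inputs; the function names are hypothetical, and the account ID and role name in the example are placeholders.

```python
import json

# Permissions Firehose needs in order to write to the index (from the policy above).
WRITE_PERMISSIONS = ["aoss:WriteDocument", "aoss:CreateIndex", "aoss:UpdateIndex"]


def firehose_principal(account_id, role_name):
    """ARN pattern that matches any session of the Firehose delivery role."""
    return f"arn:aws:sts::{account_id}:assumed-role/{role_name}/*"


def data_access_policy(collection_name, index_name, principals, permissions):
    """Build an OpenSearch Serverless data access policy for one index."""
    return [
        {
            "Rules": [
                {
                    "ResourceType": "index",
                    "Resource": [f"index/{collection_name}/{index_name}"],
                    "Permission": list(permissions),
                }
            ],
            "Principal": list(principals),
        }
    ]


policy = data_access_policy(
    "vpc-flow-logs",
    "vpc-flow-logs",
    [firehose_principal("123456789012", "example-firehose-role")],
    WRITE_PERMISSIONS,
)
print(json.dumps(policy, indent=2))
```

You can paste the printed JSON into the console editor, or pass it as the `policy` string to the `create_access_policy` operation of the boto3 `opensearchserverless` client with `type="data"`.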

Set up VPC flow logs

In the final step of this post, you enable flow logs for your VPC with the destination as Kinesis Data Firehose, which sends the data to OpenSearch Serverless.

  1. Navigate to the AWS Management Console.
  2. Search for “VPC” and then choose Your VPCs in the search result (hover over the VPC rectangle to reveal the link).
  3. Choose the VPC ID link for one of your VPCs.
  4. On the Flow Logs tab, choose Create flow log.
  5. For Name, enter a name.
  6. Leave the Filter set to All. You can limit the traffic by selecting Accept or Reject.
  7. Under Destination, select Send to Kinesis Firehose in the same account.
  8. For Kinesis Firehose delivery stream name, choose vpc-flow-logs.
  9. Choose Create flow log.

The following graphic gives a quick demonstration of creating a flow log for your VPC following the preceding steps.
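
For reference, the console steps map onto the Amazon EC2 `CreateFlowLogs` API. The following Python sketch builds the request parameters; the function name and the VPC ID and ARN in the example are placeholders.

```python
def flow_log_request(vpc_id, delivery_stream_arn, traffic_type="ALL"):
    """Build keyword arguments for the EC2 CreateFlowLogs API."""
    if traffic_type not in ("ALL", "ACCEPT", "REJECT"):
        raise ValueError("traffic_type must be ALL, ACCEPT, or REJECT")
    return {
        "ResourceIds": [vpc_id],
        "ResourceType": "VPC",
        "TrafficType": traffic_type,  # the Filter setting in the console
        "LogDestinationType": "kinesis-data-firehose",
        "LogDestination": delivery_stream_arn,
    }


# Placeholder values, for illustration only.
request = flow_log_request(
    "vpc-0123456789abcdef0",
    "arn:aws:firehose:us-east-1:123456789012:deliverystream/vpc-flow-logs",
)
```

With real values, you could pass the result to `boto3.client("ec2").create_flow_logs(**request)`.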

Examine the VPC flow logs data in your collection using OpenSearch Dashboards

You won’t be able to access your collection data until you configure data access. Data access policies allow users to access the actual data within a collection.

To create a data access policy for OpenSearch Dashboards, complete the following steps:

  1. Navigate to the vpc-flow-logs collection details page on the OpenSearch Serverless dashboard.
  2. Under Data access, choose Manage data access.
  3. Choose Create access policy.
  4. In the Name and description section, specify an access policy name, add a description, and select JSON as the policy definition method.
  5. Add the following policy in the JSON editor. Provide the collection name and index you specified during the delivery stream creation in the policy. Additionally, provide the IAM user and the account ID for your AWS account. You need to make sure that you have the AWS access and secret keys for the principal that you specified as an IAM user.
    [
      {
        "Rules": [
          {
            "Resource": [
              "index/<collection-name>/<index-name>"
            ],
            "Permission": [
              "aoss:ReadDocument"
            ],
            "ResourceType": "index"
          }
        ],
        "Principal": [
          "arn:aws:iam::<aws-account-id>:user/<IAM-user-name>"
        ]
      }
    ]

  6. Choose Create.
  7. Navigate to OpenSearch Serverless and choose the collection you created (vpc-flow-logs).
  8. Choose the OpenSearch Dashboards URL and log in with your IAM access key and secret key for the user you specified under Principal.
  9. Navigate to dev tools within OpenSearch Dashboards and run the following query to retrieve the VPC flow logs for your VPC:
    GET <index-name>/_search
    {
      "query": {
        "match_all": {}
      }
    }

The query returns the data as shown in the following screenshot, which contains information such as account ID, interface ID, source IP address, destination IP address, and more.

Create dashboards

After the data is flowing into OpenSearch Serverless, you can easily create dashboards to monitor the activity in your VPC. The following example dashboard shows overall traffic, accepted and rejected traffic, bytes transmitted, and some charts with the top sources and destinations.

Clean up

If you don’t want to continue using the solution, be sure to delete the resources you created:

  1. Return to the AWS console and in the VPCs section, disable the flow logs for your VPC.
  2. In the OpenSearch Serverless dashboard, delete your vpc-flow-logs collection.
  3. On the Kinesis Data Firehose console, delete your vpc-flow-logs delivery stream.

Conclusion

In this post, you created an end-to-end serverless pipeline to deliver your VPC flow logs to OpenSearch Serverless using Kinesis Data Firehose. In this example, you built a delivery pipeline for your VPC flow logs, but you can also use Kinesis Data Firehose to send logs from Amazon Kinesis Data Streams and Amazon CloudWatch, which in turn can be sent to OpenSearch Serverless collections for running analytics on those logs. With serverless solutions on AWS, you can focus on your application development rather than worrying about the ingestion pipeline and tools to visualize your logs.

Get hands-on with OpenSearch Serverless by taking the Getting Started with Amazon OpenSearch Serverless workshop and check out other pipelines for analyzing your logs.

If you have feedback about this post, share it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.


About the authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Monitor your Amazon ES domains with Amazon Elasticsearch Service Monitor

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/monitor-your-amazon-es-domains-with-amazon-elasticsearch-service-monitor/

Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services.

Amazon ES provides a wealth of information about your domain, surfaced through Amazon CloudWatch metrics (for more information, see Instance metrics). Your domain’s dashboard on the AWS Management Console collects key metrics and provides a view of what’s going on with that domain. This view is limited to that single domain, and for a subset of the available metrics. What if you’re running many domains? How can you see all their metrics in one place? You can set CloudWatch alarms at the single domain level, but what about anomaly detection and centralized alerting?

In this post, we detail Amazon Elasticsearch Service Monitor, an open-source monitoring solution for all the domains in your account, across all Regions, backed by a set of AWS CloudFormation templates delivered through the AWS Cloud Development Kit (AWS CDK). The templates deploy an Amazon ES domain in a VPC, an Nginx proxy for Kibana access, and an AWS Lambda function. The function is invoked by CloudWatch Events to pull metrics from all your Amazon ES domains and send them to the previously created monitoring domain for your review.

Your Amazon ES monitoring domain is an ideal way to monitor your Amazon ES infrastructure. We provide dashboards at the account and individual domain level. We also provide basic alerts that you can use as a template to build your own alerting solution.

Prerequisites

To bootstrap the solution, you need a few tools in your development environment:

Create and deploy the AWS CDK monitoring tool

Complete the following steps to set up the AWS CDK monitoring tool in your environment. Depending on your operating system, the commands may differ. This walkthrough uses Linux and bash.

Clone the code from the GitHub repo:

# clone the repo
$ git clone https://github.com/aws-samples/amazon-elasticsearch-service-monitor.git
# move to directory
$ cd amazon-elasticsearch-service-monitor

We provide a bash bootstrap script to prepare your environment for running the AWS CDK and deploying the architecture. The bootstrap.sh script is in the amazon-elasticsearch-service-monitor directory. The script creates a Python virtual environment and downloads some further dependencies. It creates an Amazon Elastic Compute Cloud (Amazon EC2) key pair to facilitate accessing Kibana, then adds that key pair to your local SSH setup. Finally, it prompts for an email address where the stack sends alerts. You can edit email_default in the script or enter it at the command line when you run the script. See the following code:

$ bash bootstrap.sh
Collecting astroid==2.4.2
  Using cached astroid-2.4.2-py3-none-any.whl (213 kB)
Collecting attrs==20.3.0
  Using cached attrs-20.3.0-py2.py3-none-any.whl (49 kB)

After the script is complete, enter the Python virtual environment:

$ source .env/bin/activate
(.env) $

Bootstrap the AWS CDK

The AWS CDK creates resources in your AWS account to enable it to track your deployments. You bootstrap the AWS CDK with the bootstrap command:

# bootstrap the cdk
(.env) $ cdk bootstrap aws://yourAccountID/yourRegion

Deploy the architecture

The monitoring_cdk directory collects all the components that enable the AWS CDK to deploy the following architecture.

You can review amazon-elasticsearch-service-monitor/monitoring_cdk/monitoring_cdk_stack.py for further details.

The architecture has the following components:

  • An Amazon Virtual Private Cloud (Amazon VPC) spanning two Amazon EC2 Availability Zones.
  • An Amazon ES cluster with two t3.medium data nodes, one in each Availability Zone, with 100 GB of EBS storage.
  • An Amazon DynamoDB table for tracking the timestamp for the last pull from CloudWatch.
  • A Lambda function to fetch CloudWatch metrics across all Regions and all domains. By default, it fetches the data every 5 minutes, which you can change if needed.
  • An EC2 instance that acts as an SSH tunnel to access Kibana, because our setup is secured and in a VPC.
  • A default Kibana dashboard to visualize metrics across all domains.
  • Default email alerts to the newly launched Amazon ES cluster.
  • An index template and Index State Management (ISM) policy to delete indexes older than 366 days. (You can change this to a different retention period if needed.)
  • A monitoring stack with the option to enable UltraWarm (UW), which is disabled by default. You can change the settings in the monitoring_cdk_stack.py file to enable UW.
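
To make the role of the DynamoDB timestamp concrete, the following Python sketch shows one way a scheduled function could decide which time window to request from CloudWatch on each run. It is not the code from the repository; the function name and the alignment to whole 5-minute periods are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

PERIOD = timedelta(minutes=5)  # matches the default 5-minute schedule


def next_window(last_pull, now, period=PERIOD):
    """Return the (start, end) range to fetch, covering only complete
    periods since the last pull, or None if no full period has elapsed."""
    complete_periods = (now - last_pull) // period
    if complete_periods < 1:
        return None
    return last_pull, last_pull + complete_periods * period


last_pull = datetime(2021, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2021, 1, 1, 0, 12, 30, tzinfo=timezone.utc)
start, end = next_window(last_pull, now)
print(start.isoformat(), end.isoformat())
```

After a successful pull, the function would store `end` as the new timestamp, so a late or skipped invocation picks up where the previous one stopped instead of leaving a gap.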

The monitoring_cdk_stack.py file contains several constants at the top that let you control the domain configuration, its sizing, and the Regions to monitor. It also specifies the username and password for the admin user of your domain. You should edit and replace those constants with your own values.

For example, the following code indicates which Regions to monitor:

REGIONS_TO_MONITOR='["us-east-1", "us-east-2", "us-west-1", "us-west-2", "af-south-1", "ap-east-1", "ap-south-1", "ap-northeast-1", "ap-northeast-2", "ap-southeast-1", "ap-southeast-2", "ca-central-1", "eu-central-1", "eu-west-1", "eu-west-2", "eu-west-3", "eu-north-1", "eu-south-1", "me-south-1", "sa-east-1"]'
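
Note that the constant is a JSON array stored inside a Python string, so a stray quote or trailing comma only surfaces when the value is parsed. The following sketch shows a quick way to check an edited value before you deploy; `parse_regions` is a hypothetical helper, not a function from the repository, and the value shown is shortened.

```python
import json

# Shortened example value; use your own list of Regions.
REGIONS_TO_MONITOR = '["us-east-1", "us-west-2", "eu-west-1"]'


def parse_regions(raw):
    """Parse the JSON string and return the Regions without duplicates."""
    regions = json.loads(raw)
    if not isinstance(regions, list) or not all(isinstance(r, str) for r in regions):
        raise ValueError("REGIONS_TO_MONITOR must be a JSON array of strings")
    # dict.fromkeys drops duplicates while keeping the original order.
    return list(dict.fromkeys(regions))


print(parse_regions(REGIONS_TO_MONITOR))
```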

Run the following command:

(.env)$ cdk deploy

The AWS CDK prompts you to apply security changes; enter y for yes.

After the app is deployed, you get the Kibana URL, user, and password to access Kibana. After you log in, use the following sections to navigate around dashboards and alerts.

After the stack is deployed, you receive an email to confirm the subscription; make sure to confirm the email to start getting the alerts.

Pre-built monitoring dashboards

The monitoring tool comes with pre-built dashboards. To access them, complete the following steps:

  1. Navigate to the IP obtained after deployment.
  2. Log in to Kibana.
    Be sure to use the endpoint you received, provided as an output from the cdk deploy command.
  3. In the navigation pane, choose Dashboard.

The Dashboards page displays the default dashboards.

The Domain Metrics At A glance dashboard gives a 360-degree view of all Amazon ES domains across Regions.

The Domain Overview dashboard gives more detailed metrics for a particular domain, to help you deep dive into issues in a specific domain.

Pre-built alerts

The monitoring framework comes with pre-built alerts, as summarized in the following table. These alerts notify you on key resources like CPU, disk space, and JVM. We also provide alerts for cluster status, snapshot failures, and more. You can use the following alerts as a template to create your own alerts and monitoring for search and indexing latencies and volumes, for example.

Alert Type | Frequency
Cluster Health – Red | 5 Min
Cluster Index Writes Blocked | 5 Min
Automated Snapshot Failure | 5 Min
JVM Memory Pressure > 80% | 5 Min
CPU Utilization > 80% | 15 Min
No Kibana Healthy Nodes | 15 Min
Invalid Host Header Requests | 15 Min
Cluster Health – Yellow | 30 Min

Clean up

To clean up the stacks, destroy the monitoring-cdk stack; all other stacks are torn down due to dependencies:

# Enter into python virtual environment
$ source .env/bin/activate
(.env)$ cdk destroy

CloudWatch logs need to be removed separately.

Pricing

Running this solution incurs charges of less than $10 per day for one domain, with an additional $2 per day for each additional domain.
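
As a rough planning aid, the two figures above can be combined into an upper-bound estimate. The following sketch treats $10 and $2 as ceilings taken from this post; it is not a pricing calculator, and actual charges depend on your Region and usage.

```python
def max_daily_cost(domains, first_domain=10.0, each_additional=2.0):
    """Upper-bound daily cost in USD for monitoring the given number of domains."""
    if domains < 1:
        raise ValueError("domains must be at least 1")
    return first_domain + each_additional * (domains - 1)


print(max_daily_cost(5))  # 18.0
```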

Conclusion

In this post, we discussed Amazon Elasticsearch Service Monitor, an open-source monitoring solution for all the domains in your account, across all Regions. Amazon ES monitoring domains are an ideal way to monitor your Amazon ES infrastructure. Try it out and leave your thoughts in the comments.


About the Authors

Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.

Prashant Agrawal is a Specialist Solutions Architect at Amazon Web Services based in Seattle, WA. Prashant works closely with the Amazon Elasticsearch Service team, helping customers migrate their workloads to the AWS Cloud. Before joining AWS, Prashant helped various customers use Elasticsearch for their search and analytics use cases.

Get started with fine-grained access control in Amazon Elasticsearch Service

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/security/get-started-with-fine-grained-access-control-in-amazon-elasticsearch-service/

Amazon Elasticsearch Service (Amazon ES) provides fine-grained access control, powered by the Open Distro for Elasticsearch security plugin. The security plugin adds Kibana authentication and access control at the cluster, index, document, and field levels that can help you secure your data. You now have many different ways to configure your Amazon ES domain to provide access control. In this post, I offer basic configuration information to get you started.

Figure 1: A high-level view of data flow and security

Figure 1 details the authentication and access control provided in Amazon ES. The left half of the diagram details the different methods of authenticating. Looking horizontally, requests originate either from Kibana or directly access the REST API. When using Kibana, you can use a login screen powered by the Open Distro security plugin, your SAML identity provider, or Amazon Cognito. Each of these methods results in an authenticated identity: SAML providers via the response, Amazon Cognito via an AWS Identity and Access Management (IAM) identity, and Open Distro via an internal user identity. When you use the REST API, you can use AWS Signature V4 request signing (SigV4 signing), or user name and password authentication. You can also send unauthenticated traffic, but your domain should be configured to reject all such traffic.

The right side of the diagram details the access control points. You can consider the handling of access control in two phases to better understand it—authentication at the edge by IAM and authentication in the Amazon ES domain by the Open Distro security plugin.

First, requests from Kibana or direct API calls have to reach your domain endpoint. If you follow best practices and the domain is in an Amazon Virtual Private Cloud (VPC), you can use Amazon Elastic Compute Cloud (Amazon EC2) security groups to allow or deny traffic based on the originating IP address or security group of the Amazon EC2 instances. Best practice includes least privilege based on subnet ACLs and security group ingress and egress restrictions. In this post, we assume that your requests are legitimate, meet your access control criteria, and can reach your domain.

When a request reaches the domain endpoint (the edge of your domain), it can be anonymous or it can carry identity and authentication information as described previously. Each Amazon ES domain carries a resource-based IAM policy. With this policy, you can allow or deny traffic based on an IAM identity attached to the request. When your policy specifies an IAM principal, Amazon ES evaluates the request against the allowed Actions in the policy and allows or denies the request. If the request carries no IAM identity (for example, it carries a SAML assertion, or a user name and password), you should leave the domain policy open and pass traffic through to fine-grained access control in Amazon ES without any checks. You should employ IAM security best practices and add IAM restrictions for direct-to-API access control once your domain is set up.

The Open Distro for Elasticsearch security plugin has its own internal user database for user name and password authentication and handles access control for all users. When traffic reaches the Elasticsearch cluster, the plugin validates any user name and password authentication information against this internal database to identify the user and grant a set of permissions. If a request comes with identity information from either SAML or an IAM role, you map that backend role onto the roles or users that you have created in Open Distro security.
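
User name and password authentication against the REST API is carried as HTTP basic authentication. The following Python sketch builds such a request with the standard library and does not send it; the endpoint, path, and credentials are placeholders, and the function name is hypothetical.

```python
import base64
import urllib.request


def basic_auth_request(endpoint, path, user, password):
    """Build (but do not send) a request that carries HTTP basic credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{endpoint.rstrip('/')}/{path.lstrip('/')}")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder endpoint and credentials, for illustration only.
request = basic_auth_request(
    "https://vpc-example-domain.us-east-1.es.amazonaws.com",
    "_cluster/health",
    "admin",
    "secret",
)
print(request.full_url)
```

Because the credentials are only base64-encoded, this method relies on the TLS encryption of the connection. You could send the request with `urllib.request.urlopen(request)` from a host that can reach the domain endpoint.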

Amazon ES documentation and the Open Distro for Elasticsearch documentation give more information on all of these points. For this post, I walk through a basic console setup for a new domain.

Console setup

The Amazon ES console provides a guided wizard that lets you configure (and reconfigure) your Amazon ES domain. Step 1 offers you the opportunity to select some predefined configurations that carry through the wizard. In step 2, you choose the instances to deploy in your domain. In step 3, you configure the security. This post focuses on step 3. See also these tutorials that explain using an IAM master user and using an HTTP-authenticated master user.

Note: At the time of writing, you cannot enable fine-grained access control on existing domains; you must create a new domain and enable the feature at domain creation time. You can use fine-grained access control with Elasticsearch versions 6.8 and later.

Set your endpoint

Amazon ES gives you a DNS name that resolves to an IP address that you use to send traffic to the Elasticsearch cluster in the domain. The IP address can be in the IP space of the public internet, or it can resolve to an IP address in your VPC. Although fine-grained access control gives you the means to secure your cluster even when the endpoint is a public IP address, we recommend using VPC access as the more secure option, as shown in Figure 2.

Figure 2: Select VPC access

With the endpoint in your VPC, you use security groups to control which ports accept traffic and limit access to the endpoints of your Amazon ES domain to IP addresses in your VPC. Make sure to use least privilege when setting up security group access.

Enable fine-grained access control

Enable fine-grained access control, as shown in Figure 3.

Figure 3: Enabled fine-grained access control

Set up the master user

The master user is the administrator identity for your Amazon ES domain. This user can set up additional users in the Amazon ES security plugin, assign roles to them, and assign permissions for those roles. You can choose user name and password authentication for the master user, or use an IAM identity. User name and password authentication, shown in Figure 4, is simpler to set up and—with a strong password—may provide sufficient security depending on your use case. We recommend you follow your organization’s policy for password length and complexity. If you lose this password, you can return to the domain’s dashboard in the AWS Management Console and reset it. You’ll use these credentials to log in to Kibana. Following best practices on choosing your master user, you should move to an IAM master user once setup is complete.

Note: Password strength is a function of length, complexity of characters (e.g., upper and lower case letters, numbers, and special characters), and unpredictability to decrease the likelihood the password could be guessed or cracked over a period of time.

Figure 4: Setting up the master username and password

Do not enable Amazon Cognito authentication

When you use Kibana, Amazon ES includes a login experience. You currently have three choices for the source of the login screen:

  1. The Open Distro security plugin
  2. Amazon Cognito
  3. Your SAML-compliant system

You can apply fine-grained access control regardless of how you log in. However, setting up fine-grained access control for the master user and additional users is most straightforward if you use the login experience provided by the Open Distro security plugin. After your first login, and when you have set up additional users, you should migrate to either Cognito or SAML for login, taking advantage of the additional security they offer. To use the Open Distro login experience, disable Amazon Cognito authentication, as shown in Figure 5.

Figure 5: Amazon Cognito authentication is not enabled

If you plan to integrate with your SAML identity provider, select the Prepare SAML authentication check box. You complete the setup when the domain is active.

Figure 6: Choose Prepare SAML authentication if you plan to use it

Use an open access policy

When you create your domain, you attach an IAM policy to it that controls whether your traffic must be signed with AWS SigV4 request signing for authentication. Policies that specify an IAM principal require that you use AWS SigV4 signing to authenticate those requests. The domain sends your traffic to IAM, which authenticates signed requests to resolve the user or role that sent the traffic. The domain and IAM apply the policy access controls and either accept the traffic or reject it based on the commands. This is done down to the index level for single-index API calls.

When you use fine-grained access control, your traffic is also authenticated by the Amazon ES security plugin, which makes the IAM authentication redundant. Create an open access policy, as shown in Figure 7, which doesn’t specify a principal and so doesn’t require request signing. This may be acceptable, since you can choose to require an authenticated identity on all traffic. The security plugin authenticates the traffic as above, providing access control based on the internal database.

Figure 7: Selected open access policy
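
For reference, an open access policy is an ordinary resource-based policy with a wildcard principal. The following Python sketch builds one for a given domain; the function name is hypothetical and the Region, account ID, and domain name are placeholders. Compare the output with what the console generates for your domain before relying on it.

```python
import json


def open_access_policy(region, account_id, domain_name):
    """Build a domain access policy with no specific principal, so that
    requests don't need SigV4 signing; fine-grained access control is
    then responsible for authenticating and authorizing traffic."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},
                "Action": "es:*",
                "Resource": f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*",
            }
        ],
    }


# Placeholder values, for illustration only.
print(json.dumps(open_access_policy("us-east-1", "123456789012", "example-domain"), indent=2))
```

Use a policy like this only together with fine-grained access control and, preferably, a VPC endpoint; on its own it allows anonymous access to the domain.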

Encrypted data

Amazon ES provides an option to encrypt data in transit and at rest for any domain. When you enable fine-grained access control, encryption is mandatory: the console selects the corresponding check boxes automatically and you can’t clear them. These include Transport Layer Security (TLS) for requests to the domain and for traffic between nodes in the domain, and encryption of data at rest through AWS Key Management Service (AWS KMS), as shown in Figure 8.

Figure 8: Enabled encryption

Accessing Kibana

When you complete the domain creation wizard, it takes about 10 minutes for your domain to activate. Return to the console and the Overview tab of your Amazon ES dashboard. When the Domain Status is Active, select the Kibana URL. Since you created your domain in your VPC, you must be able to access the Kibana endpoint via proxy, VPN, SSH tunnel, or similar. Use the master user name and password that you configured earlier to log in to Kibana, as shown in Figure 9. As detailed above, you should only ever log in as the master user to set up additional users—administrators, users with read-only access, and others.

Figure 9: Kibana login page

Conclusion

Congratulations, you now know the basic steps to set up the minimum configuration to access your Amazon ES domain with a master user. You can examine the settings for fine-grained access control in the Kibana console Security tab. Here, you can add additional users, assign permissions, map IAM users to security roles, and set up your Kibana tenancy. We’ll cover those topics in future posts.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Elasticsearch Service forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jon Handler

Jon is a Principal Solutions Architect at AWS. He works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a Ph.D. in Computer Science and Artificial Intelligence from Northwestern University.

Author

Sajeev Attiyil Bhaskaran

Sajeev is a Senior Cloud Engineer focused on big data and analytics. He works with AWS customers to provide architectural and engineering assistance and guidance. He dives deep into big data technologies and streaming solutions. He also does onsite and online sessions for customers to design best solutions for their use cases. In his free time, he enjoys spending time with his wife and daughter.