Tag Archives: Amazon OpenSearch Service

Amazon OpenSearch Service Under the Hood: OpenSearch Optimized Instances (OR1)

Post Syndicated from Bukhtawar Khan original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-under-the-hood-opensearch-optimized-instancesor1/

Amazon OpenSearch Service recently introduced the OpenSearch Optimized Instance family (OR1), which delivers up to 30% price-performance improvement over existing memory optimized instances in internal benchmarks, and uses Amazon Simple Storage Service (Amazon S3) to provide 11 9s of durability. With this new instance family, OpenSearch Service uses OpenSearch innovation and AWS technologies to reimagine how data is indexed and stored in the cloud.

Today, customers widely use OpenSearch Service for operational analytics because of its ability to ingest high volumes of data while also providing rich and interactive analytics. In order to provide these benefits, OpenSearch is designed as a high-scale distributed system with multiple independent instances indexing data and processing requests. As the velocity and volume of your operational analytics data grow, bottlenecks may emerge. To sustainably support high indexing volume and provide durability, we built the OR1 instance family.

In this post, we discuss how the reimagined data flow works with OR1 instances and how it can provide high indexing throughput and durability using a new physical replication protocol. We also dive deep into some of the challenges we solved to maintain correctness and data integrity.

Designing for high throughput with 11 9s of durability

OpenSearch Service manages tens of thousands of OpenSearch clusters. We’ve gained insights into typical cluster configurations that customers use to meet high throughput and durability goals. To achieve higher throughput, customers often choose to drop replica copies to save on replication latency; however, this configuration sacrifices availability and durability. Other customers require high durability and as a result need to maintain multiple replica copies, which raises their operating costs.

The OpenSearch Optimized Instance family provides additional durability while also keeping costs lower by storing a copy of the data on Amazon S3. With OR1 instances, you can configure multiple replica copies for high read availability while maintaining indexing throughput.
The following diagram illustrates an indexing flow involving a metadata update in OR1.

Indexing Request Flow in OR1

During indexing operations, individual documents are indexed into Lucene and also appended to a write-ahead log, also known as a translog. Before sending an acknowledgement back to the client, all translog operations are persisted to the remote data store backed by Amazon S3. If any replica copies are configured, the primary copy performs checks to detect the possibility of multiple writers (control flow) on all replica copies for correctness reasons.
The following diagram illustrates the segment generation and replication flow in OR1 instances.

Replication Flow in OR1

Periodically, as new segment files are created, the OR1 instances copy those segments to Amazon S3. When the transfer is complete, the primary publishes new checkpoints to all replica copies, notifying them of a new segment being available for download. The replica copies subsequently download the newer segments and make them searchable. This model decouples the data flow, which happens using Amazon S3, from the control flow (checkpoint publication and term validation), which happens over inter-node transport communication.

The following diagram illustrates the recovery flow in OR1 instances.

Recovery Flow in OR1

OR1 instances persist not only the data, but also the cluster metadata, like index mappings, templates, and settings, in Amazon S3. This makes sure that in the event of a cluster-manager quorum loss, which is a common failure mode in non-dedicated cluster-manager setups, OpenSearch can reliably recover the last acknowledged metadata.

In the event of an infrastructure failure, an OpenSearch domain can end up losing one or more nodes. In such an event, the new instance family guarantees recovery of both the cluster metadata and the index data up to the latest acknowledged operation. As new replacement nodes join the cluster, the internal cluster recovery mechanism bootstraps the new set of nodes and then recovers the latest cluster metadata from the remote cluster metadata store. After the cluster metadata is recovered, the recovery mechanism starts to hydrate the missing segment data and translog from Amazon S3. Then all uncommitted translog operations, up to the last acknowledged operation, are replayed to reinstate the lost copy.

The new design doesn’t modify the way searches work. Queries are processed normally by either the primary or a replica copy of each shard in the index. Because data replication uses Amazon S3, you may see longer delays (in the 10-second range) before all copies are consistent to a particular point in time.

A key advantage of this architecture is that it serves as a foundational building block for future innovations, like separation of readers and writers, and helps segregate compute and storage layers.

How redefining the replication strategy boosts the indexing throughput

OpenSearch supports two replication strategies: logical (document) and physical (segment) replication. With logical replication, the data is indexed on all the copies independently, leading to redundant computation on the cluster. The OR1 instances use the new physical replication model, where data is indexed only on the primary copy and additional copies are created by copying data from the primary. With a high number of replica copies, the node hosting the primary copy would require significant network bandwidth to replicate the segments to all the copies. The new OR1 instances solve this problem by durably persisting the segments to Amazon S3, which is configured as a remote storage option. This also helps replicas scale without bottlenecking on the primary.
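On OR1 instances, OpenSearch Service configures segment replication and the Amazon S3 remote store automatically. For context, in self-managed OpenSearch the physical replication model is enabled per index through an index setting; the following is a minimal sketch, with my-index as a hypothetical index name:

PUT /my-index
{
  "settings": {
    "index": {
      "replication.type": "SEGMENT",
      "number_of_replicas": 2
    }
  }
}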

After the segments are uploaded to Amazon S3, the primary sends out a checkpoint request, notifying all replicas to download the new segments. The replica copies then download only the incremental segments. Because this process frees up the compute resources on replicas that would otherwise be spent redundantly indexing data, and the network overhead that primaries would incur replicating data, the cluster can sustain higher indexing throughput. If replicas aren’t able to process newly created segments, due to overload or slow network paths, replicas that fall too far behind are marked as failed to prevent them from returning stale results.

Why high durability is a good idea, but hard to do well

Although all committed segments are durably persisted to Amazon S3 whenever they are created, one of the key challenges in achieving high durability is synchronously writing all uncommitted operations to a write-ahead log on Amazon S3 before acknowledging the request to the client, without sacrificing throughput. The new semantics introduce additional network latency for individual requests, but we avoid a throughput impact by batching and draining requests on a single thread for up to a specified interval, while making sure other threads continue to index requests. As a result, you can drive higher throughput with more concurrent client connections by optimally batching your bulk payloads.
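For example, grouping many documents into one _bulk request lets OR1 amortize the translog upload and acknowledgment across all of them. The following is an illustrative sketch; the logs index and its fields are hypothetical:

POST /_bulk
{ "index": { "_index": "logs" } }
{ "message": "GET /search HTTP/1.1 200", "@timestamp": "2024-01-01T00:00:00Z" }
{ "index": { "_index": "logs" } }
{ "message": "GET /cart HTTP/1.1 404", "@timestamp": "2024-01-01T00:00:01Z" }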

Other challenges in designing a highly durable system include enforcing data integrity and correctness at all times. Although events like network partitions are rare, they can break the correctness of the system, so the system needs to be prepared to deal with these failure modes. While switching to the new segment replication protocol, we therefore also introduced a few other protocol changes, like detecting multiple writers on each replica. The protocol makes sure that an isolated writer can’t acknowledge a write request while another newly promoted primary, based on the cluster-manager quorum, is concurrently accepting newer writes.

The new instance family automatically detects the loss of a primary shard while recovering data, and performs extensive checks on network reachability before the data can be re-hydrated from Amazon S3 and the cluster is brought back to a healthy state.

For data integrity, all files are extensively checksummed to make sure we are able to detect and prevent network or file system corruption that may leave data unreadable. Furthermore, all files, including metadata, are designed to be immutable, providing additional safety against corruption, and are versioned to prevent accidental mutating changes.

Reimagining how data flows

The OR1 instances hydrate copies directly from Amazon S3 in order to perform recovery of lost shards during an infrastructure failure. By using Amazon S3, we are able to free up the primary node’s network bandwidth, disk throughput, and compute, and therefore provide a more seamless in-place scaling and blue/green deployment experience by orchestrating the entire process with minimal primary node coordination.

OpenSearch Service provides automatic data backups called snapshots at hourly intervals, which means that in case of accidental modifications to data, you have the option to go back to a previous point-in-time state. However, with the new OpenSearch instance family, we’ve discussed that the data is already durably persisted on Amazon S3. So how do snapshots work when we already have the data present on Amazon S3?

With the new instance family, snapshots serve as checkpoints, referencing the already present segment data as it exists at a point in time. This makes snapshots more lightweight and faster because they don’t need to re-upload any additional data. Instead, they upload metadata files that capture the view of the segments at that point in time, which we call shallow snapshots. The benefit of shallow snapshots extends to all operations, namely creation, deletion, and cloning of snapshots. You still have the option to snapshot an independent copy with manual snapshots for other administrative operations.
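Manual snapshots continue to use the standard snapshot API against a repository you register. The following is a minimal sketch; the repository name my-s3-repo and the index pattern are hypothetical:

PUT /_snapshot/my-s3-repo/my-manual-snapshot
{
  "indices": "books*",
  "include_global_state": false
}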

Summary

OpenSearch is open source, community-driven software. Most of the foundational changes, including the replication model, remote-backed storage, and remote cluster metadata, have been contributed to open source; in fact, we follow an open source first development model.

Efforts to improve throughput and reliability are a never-ending cycle as we continue to learn and improve. The new OpenSearch optimized instances serve as a foundational building block, paving the way for future innovations. We are excited to continue our efforts in improving reliability and performance and to see what new and existing solutions builders can create using OpenSearch Service. We hope this leads to a deeper understanding of the new OpenSearch instance family, how this offering achieves high durability and better throughput, and how it can help you configure clusters based on the needs of your business.

If you’re excited to contribute to OpenSearch, open up a GitHub issue and let us know your thoughts. We would also love to hear about your success stories achieving high throughput and durability on OpenSearch Service. If you have other questions, please leave a comment.


About the Authors

Bukhtawar Khan is a Principal Engineer working on Amazon OpenSearch Service. He is interested in building distributed and autonomous systems. He is a maintainer and an active contributor to OpenSearch.

Gaurav Bafna is a Senior Software Engineer working on OpenSearch at Amazon Web Services. He is fascinated about solving problems in distributed systems. He is a maintainer and an active contributor to OpenSearch.

Sachin Kale is a senior software development engineer at AWS working on OpenSearch.

Rohin Bhargava is a Sr. Product Manager with the Amazon OpenSearch Service team. His passion at AWS is to help customers find the correct mix of AWS services to achieve success for their business goals.

Ranjith Ramachandra is a Senior Engineering Manager working on Amazon OpenSearch Service. He is passionate about highly scalable distributed systems, high performance and resilient systems.

Simplify your query management with search templates in Amazon OpenSearch Service

Post Syndicated from Arun Lakshmanan original https://aws.amazon.com/blogs/big-data/simplify-your-query-management-with-search-templates-in-amazon-opensearch-service/

Amazon OpenSearch Service is an Apache-2.0-licensed distributed search and analytics suite offered by AWS. This fully managed service allows organizations to secure data, perform keyword and semantic search, analyze logs, alert on anomalies, explore interactive log analytics, implement real-time application monitoring, and gain a more profound understanding of their information landscape. OpenSearch Service provides the tools and resources needed to unlock the full potential of your data. With its scalability, reliability, and ease of use, it’s a valuable solution for businesses seeking to optimize their data-driven decision-making processes and improve overall operational efficiency.

This post delves into the transformative world of search templates. We unravel the power of search templates in revolutionizing the way you handle queries, providing a comprehensive guide to help you navigate through the intricacies of this innovative solution. From optimizing search processes to saving time and reducing complexities, discover how incorporating search templates can elevate your query management game.

Search templates

Search templates empower developers to define intricate queries once within OpenSearch and reuse them across various application scenarios, eliminating the complexity of generating queries in application code. This flexibility also lets you modify your queries without recompiling your application. Search templates in OpenSearch use Mustache, a logic-free templating language, and a stored template can be reused by name. A Mustache-based search template has a query structure and placeholders for variable values; double curly braces ({{}}) serve as the placeholders, and they are replaced with the actual values at runtime. You use the _search/template API to run the query, specifying the actual values that OpenSearch should use.

Mustache enables you to generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

In the following example, the search template runs the query in the “source” block by passing in the values for the field and value parameters from the “params” block:

GET /myindex/_search/template
{
  "source": {
    "query": {
      "bool": {
        "must": [
          {
            "match": {
              "{{field}}": "{{value}}"
            }
          }
        ]
      }
    }
  },
  "params": {
    "field": "place",
    "value": "sweethome"
  }
}

You can store templates in the cluster with a name and refer to them in a search instead of attaching the template in each request. You use the PUT _scripts API to publish a template to the cluster. Let’s say you have an index of books, and you want to search for books with publication date, ratings, and price. You could create and publish a search template as follows:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        }
      }
    }
  }
}

In this example, you define a search template called find_book that uses the mustache template language with defined placeholders for the gte_date, gte_rating, and lte_price parameters.

To use the search template stored in the cluster, you can send a request to OpenSearch with the appropriate parameters. For example, you can search for books that have been published in the last year, with ratings greater than 4.0, and priced less than $20:

POST /books/_search/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}

This query will return all books that have been published in the last year, with a rating of at least 4.0, and a price less than $20 from the books index.
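Stored templates are managed with the same _scripts API, so you can also retrieve or delete a template by name. For example, using the find_book template from this post:

GET /_scripts/find_book

DELETE /_scripts/find_book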

Default values in search templates

Default values are values that are used for search parameters when the query that engages the template doesn’t specify values for them. In the context of the find_book example, you can set default values for the from, size, and gte_date parameters in case they are not provided in the search request. To set default values, you can use the following mustache template:

PUT /_scripts/find_book
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "publish_date": {
                  "gte": "{{gte_date}}{{^gte_date}}now-1y{{/gte_date}}"
                }
              }
            },
            {
              "range": {
                "rating": {
                  "gte": "{{gte_rating}}"
                }
              }
            },
            {
              "range": {
                "price": {
                  "lte": "{{lte_price}}"
                }
              }
            }
          ]
        }
      },
      "from": "{{from}}{{^from}}0{{/from}}",
      "size": "{{size}}{{^size}}2{{/size}}"
    }
  }
}

In this template, the {{from}}, {{size}}, and {{gte_date}} parameters are placeholders that can be filled in with specific values when the template is used in a search. If no value is specified for {{from}}, {{size}}, and {{gte_date}}, OpenSearch uses the default values of 0, 2, and now-1y, respectively. Note that from and size are siblings of the query clause in the search body, not part of it. This means that if a user searches for books without specifying from, size, and gte_date, the search returns just two books matching the search criteria from the past year.

You can also use the render API as follows if you have a stored template and want to validate it:

POST _render/template
{
  "id": "find_book",
  "params": {
    "gte_date": "now-1y",
    "gte_rating": 4.0,
    "lte_price": 20
  }
}
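If the template and parameters render correctly, the response returns the fully expanded query under a template_output key, which you can inspect before running the actual search. With the defaults in place, the output should look roughly like the following (Mustache renders the substituted values as strings):

{
  "template_output": {
    "query": {
      "bool": {
        "filter": [
          {
            "range": {
              "publish_date": {
                "gte": "now-1y"
              }
            }
          },
          {
            "range": {
              "rating": {
                "gte": "4.0"
              }
            }
          },
          {
            "range": {
              "price": {
                "lte": "20"
              }
            }
          }
        ]
      }
    },
    "from": "0",
    "size": "2"
  }
}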

Conditions in search templates

A conditional statement allows you to control the flow of your search template based on certain conditions. It’s often used to include or exclude parts of the search request based on the parameters that are passed in. The syntax is as follows:

{{#condition}}
  ... code to run if the condition is true ...
{{/condition}}

The following example searches for books based on the gte_date, gte_rating, and lte_price parameters and an optional is_available parameter. The conditional section includes the term query on the in_stock field only if the is_available parameter is present in the search request. If the is_available parameter is not present, the term query is skipped.

GET /books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#is_available}}
        {
          "term": {
            "in_stock": "{{is_available}}"
          }
        },
        {{/is_available}}
          {
            "range": {
              "publish_date": {
                "gte": "{{gte_date}}"
              }
            }
          },
          {
            "range": {
              "rating": {
                "gte": "{{gte_rating}}"
              }
            }
          },
          {
            "range": {
              "price": {
                "lte": "{{lte_price}}"
              }
            }
          }
        ]
      }
    }
  }""",
  "params": {
    "gte_date": "now-3y",
    "gte_rating": 4.0,
    "lte_price": 20,
    "is_available": true
  }
}

By using a conditional statement in this way, you can make your search requests more flexible and efficient by only including the necessary filters when they are needed.

Because the template source contains Mustache section markers, it isn’t valid JSON on its own, so it needs to be passed as a string enclosed in triple quotes (""") in the payload.

Loops in search templates

A loop is a feature of mustache templates that allows you to iterate over an array of values and run the same code block for each item in the array. It’s often used to generate a dynamic list of filters or queries based on the values passed in the search request. The syntax is as follows:

{{#array}}
  ... code to run for each item ...
{{/array}}

The following example searches for books based on a title query ({{name}}) and an array of categories to filter the search results. The Mustache loop generates a match filter for each item in the list array.

GET books/_search/template
{
  "source": """{
    "query": {
      "bool": {
        "must": [
        {{#list}}
        {
          "match": {
            "category": "{{.}}"
          }
        },
        {{/list}}
        {
          "match": {
            "title": "{{name}}"
          }
        }
        ]
      }
    }
  }""",
  "params": {
    "name": "killer",
    "list": ["Classics", "comics", "Horror"]
  }
}

The search request is rendered as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category": "Classics"
          }
        },
        {
          "match": {
            "category": "comics"
          }
        },
        {
          "match": {
            "category": "Horror"
          }
        },
        {
          "match": {
            "title": "killer"
          }
        }
      ]
    }
  }
}

The loop has generated a match filter for each item in the list array, resulting in a more flexible search request that filters by multiple categories. By using loops, you can generate dynamic filters or queries based on the values passed in the search request, making your search requests more flexible and powerful.

Advantages of using search templates

The following are key advantages of using search templates:

  • Maintainability – By separating the query definition from the application code, search templates make it straightforward to manage changes to the query or tune search relevancy. You don’t have to compile and redeploy your application.
  • Consistency – You can construct search templates that allow you to design standardized query patterns and reuse them throughout your application, which can help maintain consistency across your queries.
  • Readability – Because templates use a terse and expressive syntax, complicated queries are easier to read, test, and debug.
  • Testing – Search templates can be tested and debugged independently of the application code, facilitating simpler problem-solving and relevancy tuning without having to re-deploy the application. You can easily create A/B testing with different templates for the same search.
  • Flexibility – Search templates can be quickly updated or adjusted to account for modifications to the data or search specifications.

Best practices

Consider the following best practices when using search templates:

  •  Before deploying your template to production, make sure it is fully tested. You can test the effectiveness and correctness of your template with example data. It is highly recommended to run the application tests that use these templates before publishing.
  • Search templates allow for the addition of input parameters, which you can use to modify the query to suit the needs of a particular use case. Reusing the same template with varied inputs is made simpler by parameterizing the inputs.
  • Manage the templates in an external source control system.
  • Avoid hard-coding values inside the query—instead, use defaults.

Conclusion

In this post, you learned the basics of search templates, a powerful feature of OpenSearch, and how templates help streamline search queries and improve performance. With search templates, you can build more robust search applications in less time.

If you have feedback about this post, submit it in the comments section. If you have questions about this post, start a new thread on the Amazon OpenSearch Service forum or contact AWS Support.

Stay tuned for more exciting updates and new features in OpenSearch Service.


About the authors

Arun Lakshmanan is a Search Specialist with Amazon OpenSearch Service based out of Chicago, IL. He has over 20 years of experience working with enterprise customers and startups. He loves to travel and spend quality time with his family.

Madhan Kumar Baskaran works as a Search Engineer at AWS, specializing in Amazon OpenSearch Service. His primary focus involves assisting customers in constructing scalable search applications and analytics solutions. Based in Bengaluru, India, Madhan has a keen interest in data engineering and DevOps.

Hybrid Search with Amazon OpenSearch Service

Post Syndicated from Hajer Bouafif original https://aws.amazon.com/blogs/big-data/hybrid-search-with-amazon-opensearch-service/

Amazon OpenSearch Service has long supported both lexical and semantic search, the latter facilitated by the k-nearest neighbors (k-NN) plugin. By using OpenSearch Service as a vector database, you can seamlessly combine the advantages of both lexical and vector search. The introduction of the neural search feature in OpenSearch Service 2.9 further simplifies integration with artificial intelligence (AI) and machine learning (ML) models, facilitating the implementation of semantic search.

Lexical search using TF/IDF or BM25 has been the workhorse of search systems for decades. These traditional lexical search algorithms match user queries with exact words or phrases in your documents. Lexical search is well suited for exact matches, provides low latency, offers good interpretability of results, and generalizes well across domains. However, this approach does not consider the context or meaning of the words, which can lead to irrelevant results.

In the past few years, semantic search methods based on vector embeddings have become increasingly popular to enhance search. Semantic search enables a more context-aware search, understanding the natural language questions of user queries. However, semantic search powered by vector embeddings requires fine-tuning of the ML model for the associated domain (such as healthcare or retail) and more memory resources compared to basic lexical search.

Both lexical search and semantic search have their own strengths and weaknesses. Combining lexical and vector search improves the quality of search results by using their best features in a hybrid model. OpenSearch Service 2.11 now supports out-of-the-box hybrid query capabilities that make it straightforward for you to implement a hybrid search model combining lexical search and semantic search.

This post explains the internals of hybrid search and how to build a hybrid search solution using OpenSearch Service. We experiment with sample queries to explore and compare lexical, semantic, and hybrid search. All the code used in this post is publicly available in the GitHub repository.

Hybrid search with OpenSearch Service

In general, hybrid search to combine lexical and semantic search involves the following steps:

  1. Run a semantic and lexical search using a compound search query clause.
  2. Each query type provides scores on different scales. For example, a Lucene lexical search query returns a score between 0 and infinity, whereas a semantic query using the Faiss engine returns scores between 0 and 1. Therefore, you need to normalize the scores coming from each type of query to put them on the same scale before combining them. In a distributed search engine, this normalization needs to happen at the global level rather than the shard or node level.
  3. After the scores are all on the same scale, they’re combined for every document.
  4. Reorder the documents based on the new combined score and render the documents as a response to the query.

Prior to OpenSearch Service 2.11, search practitioners would need to use compound query types to combine lexical and semantic search queries. However, this approach does not address the challenge of global normalization of scores as mentioned in Step 2.

OpenSearch Service 2.11 added the support of hybrid query by introducing the score normalization processor in search pipelines. Search pipelines take away the heavy lifting of building normalization of score results and combination outside your OpenSearch Service domain. Search pipelines run inside the OpenSearch Service domain and support three types of processors: search request processor, search response processor, and search phase results processor.

In a hybrid search, the search phase results processor runs between the query phase and fetch phase at the coordinator node (global) level. The following diagram illustrates this workflow.

Search Pipeline

The hybrid search workflow in OpenSearch Service contains the following phases:

  • Query phase – The first phase of a search request is the query phase, where each shard in your index runs the search query locally and returns the document ID matching the search request with relevance scores for each document.
  • Score normalization and combination – The search phase results processor runs between the query phase and fetch phase. It uses the normalization processor to normalize scoring results from the BM25 and k-NN subqueries; the processor supports the min_max and L2-Euclidean distance normalization methods. The processor then combines all scores, compiles the final list of ranked document IDs, and passes them to the fetch phase. It supports arithmetic_mean, geometric_mean, and harmonic_mean to combine scores. An example pipeline follows this list.
  • Fetch phase – The final phase is the fetch phase, where the coordinator node retrieves the documents that match the final ranked list and returns the search query result.
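Concretely, the normalization and combination settings are declared in a search pipeline. The following is a minimal sketch of such a pipeline; the pipeline name and the weights are illustrative:

PUT /_search/pipeline/hybrid-search-pipeline
{
  "description": "Normalizes and combines lexical and semantic scores",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [0.3, 0.7]
          }
        }
      }
    }
  ]
}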

Solution overview

In this post, you build a web application where you can search through a sample image dataset in the retail space, using a hybrid search system powered by OpenSearch Service. Let’s assume that the web application is a retail shop and you as a consumer need to run queries to search for women’s shoes.

For a hybrid search, you combine a lexical and semantic search query against the text captions of images in the dataset. The end-to-end search application high-level architecture is shown in the following figure.

Solution Architecture

The workflow contains the following steps:

  1. You use an Amazon SageMaker notebook to index image captions and image URLs from the Amazon Berkeley Objects Dataset stored in Amazon Simple Storage Service (Amazon S3) into OpenSearch Service using the OpenSearch ingest pipeline. This dataset is a collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. You only use the item images and item names in US English. For demo purposes, you use approximately 1,600 products.
  2. OpenSearch Service calls the embedding model hosted in SageMaker to generate vector embeddings for the image caption. You use the GPT-J-6B variant embedding model, which generates 4,096 dimensional vectors.
  3. Now you can enter your search query in the web application hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance (c5.large). The application client triggers the hybrid query in OpenSearch Service.
  4. OpenSearch Service calls the SageMaker embedding model to generate vector embeddings for the search query.
  5. OpenSearch Service runs the hybrid query, combines the semantic search and lexical search scores for the documents, and sends back the search results to the EC2 application client.

Let’s look at Steps 1, 2, 4, and 5 in more detail.

Step 1: Ingest the data into OpenSearch

In Step 1, you create an ingest pipeline in OpenSearch Service using the text_embedding processor to generate vector embeddings for the image captions.

After you define a k-NN index with the ingest pipeline, you run a bulk index operation to store your data into the k-NN index. In this solution, you only index the image URLs, text captions, and caption embeddings where the field type for the caption embeddings is k-NN vector.
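The following sketch shows roughly what these two definitions look like; the pipeline name, index name, and field names are illustrative, and the model ID comes from the SageMaker ML connector setup described in the next section:

PUT /_ingest/pipeline/caption-embedding-pipeline
{
  "description": "Generate vector embeddings for image captions",
  "processors": [
    {
      "text_embedding": {
        "model_id": "<embedding model ID>",
        "field_map": {
          "caption": "caption_embedding"
        }
      }
    }
  ]
}

PUT /products
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "caption-embedding-pipeline"
  },
  "mappings": {
    "properties": {
      "image_url": { "type": "text" },
      "caption": { "type": "text" },
      "caption_embedding": {
        "type": "knn_vector",
        "dimension": 4096,
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2"
        }
      }
    }
  }
}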

Step 2 and Step 4: OpenSearch Service calls the SageMaker embedding model

In these steps, OpenSearch Service uses the SageMaker ML connector to generate the embeddings for the image captions and query. The blue box in the preceding architecture diagram refers to the integration of OpenSearch Service with SageMaker using the ML connector feature of OpenSearch. This feature is available in OpenSearch Service starting from version 2.9. It enables you to create integrations with other ML services, such as SageMaker.

Step 5: OpenSearch Service runs the hybrid search query

OpenSearch Service uses the search phase results processor to perform a hybrid search. For hybrid scoring, OpenSearch Service uses the normalization, combination, and weights configuration settings that are set in the normalization processor of the search pipeline.
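At query time, the application sends a hybrid query that wraps the lexical and neural subqueries and references the search pipeline. The following is a minimal sketch; the index, field, and pipeline names are illustrative:

GET /products/_search?search_pipeline=hybrid-search-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "caption": {
              "query": "women shoes"
            }
          }
        },
        {
          "neural": {
            "caption_embedding": {
              "query_text": "women shoes",
              "model_id": "<embedding model ID>",
              "k": 5
            }
          }
        }
      ]
    }
  }
}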

Prerequisites

Before you deploy the solution, make sure you have the following prerequisites:

  • An AWS account
  • Familiarity with the Python programming language
  • Familiarity with AWS Identity and Access Management (IAM), Amazon EC2, OpenSearch Service, and SageMaker

Deploy the hybrid search application to your AWS account

To deploy your resources, use the provided AWS CloudFormation template. Supported AWS Regions are us-east-1, us-west-2, and eu-west-1. Complete the following steps to launch the stack:

  1. On the AWS CloudFormation console, create a new stack.
  2. For Template source, select Amazon S3 URL.
  3. For Amazon S3 URL, enter the path for the template for deploying hybrid search.
  4. Choose Next.
  5. Name the stack hybridsearch.
  6. Keep the remaining settings as default and choose Submit.
  7. The template stack should take 15 minutes to deploy. When it’s done, the stack status will show as CREATE_COMPLETE.
  8. When the stack is complete, navigate to the stack Outputs tab.
  9. Choose the SagemakerNotebookURL link to open the SageMaker notebook in a separate tab.
  10. In the SageMaker notebook, navigate to the AI-search-with-amazon-opensearch-service/opensearch-hybridsearch directory and open HybridSearch.ipynb.
  11. If the notebook prompts you to set the kernel, choose the conda_pytorch_p310 kernel from the drop-down menu, then choose Set Kernel.
  12. The notebook should look like the following screenshot.

Now that the notebook is ready to use, follow the step-by-step instructions in the notebook. With these steps, you create an OpenSearch SageMaker ML connector and a k-NN index, ingest the dataset into an OpenSearch Service domain, and host the web search application on Amazon EC2.

Run a hybrid search using the web application

The web application is now deployed in your account and you can access the application using the URL generated at the end of the SageMaker notebook.

Open the Application using the URL

Copy the generated URL and enter it in your browser to launch the application.

Application UI components

Complete the following steps to run a hybrid search:

  1. Use the search bar to enter your search query.
  2. Use the drop-down menu to select the search type. The available options are Keyword Search, Vector Search, and Hybrid Search.
  3. Choose GO to render results for your query or regenerate results based on your new settings.
  4. Use the left pane to tune your hybrid search configuration:
    • Under Weight for Semantic Search, adjust the slider to choose the weight for the semantic subquery. The weights for the lexical and semantic subqueries must total 1.0; the closer this weight is to 1.0, the more weight is given to the semantic subquery, and the remainder (1.0 minus this value) is given to the lexical subquery.
    • For Select the normalization type, choose the normalization technique (min_max or L2).
    • For Select the Score Combination type, choose the score combination techniques: arithmetic_mean, geometric_mean, or harmonic_mean.

Experiment with Hybrid Search

In this post, you run four experiments to understand the differences between the outputs of each search type.

As a customer of this retail shop, you are looking for women’s shoes, and you don’t know yet what style of shoes you would like to purchase. You expect that the retail shop should be able to help you decide according to the following parameters:

  • Not to deviate from the primary attributes of what you search for.
  • Provide versatile options and styles to help you understand your preference of style and then choose one.

As your first step, enter the search query “women shoes” and choose 5 as the number of documents to output.

Next, run the following experiments and review the observations for each search type.

Experiment 1: Lexical search

For a lexical search, choose Keyword Search as your search type, then choose GO.

The keyword search runs a lexical query, looking for same words between the query and image captions. In the first four results, two are women’s boat-style shoes identified by common words like “women” and “shoes.” The other two are men’s shoes, linked by the common term “shoes.” The last result is of style “sandals,” and it’s identified based on the common term “shoes.”

In this experiment, the keyword search provided three relevant results out of five—it doesn’t completely capture the user’s intention to have shoes only for women.

Experiment 2: Semantic search

For a semantic search, choose Vector Search as the search type, then choose GO.

The semantic search provided results that all belong to one particular style of shoes, “boots.” Even though the term “boots” was not part of the search query, the semantic search understands that the terms “shoes” and “boots” are similar because they are found to be nearest neighbors in the vector space.

In this experiment, when the user didn’t mention any specific shoe styles like boots, the results limited the user’s choices to a single style. This hindered the user’s ability to explore a variety of styles and make a more informed decision on their preferred style of shoes to purchase.

Let’s see how hybrid search can help in this use case.

Experiment 3: Hybrid search

Choose Hybrid Search as the search type, then choose GO.

In this example, the hybrid search uses both lexical and semantic search queries. The results show two “boat shoes” and three “boots,” reflecting a blend of both lexical and semantic search outcomes.

In the top two results, “boat shoes” directly matched the user’s query and were obtained through lexical search. In the lower-ranked items, “boots” was identified through semantic search.

In this experiment, the hybrid search gave equal weights to both lexical and semantic search, which allowed users to quickly find what they were looking for (shoes) while also presenting additional styles (boots) for them to consider.

Experiment 4: Fine-tune the hybrid search configuration

In this experiment, set the weight of the vector subquery to 0.8, which means the keyword search subquery has a weight of 0.2. Keep the normalization and score combination settings set to default. Then choose GO to generate new results for the preceding query.

Providing more weight to the semantic subquery resulted in higher scores for the semantic search results. You can see an outcome similar to the semantic search results from the second experiment, with five images of boots for women.

You can further fine-tune the hybrid search results by adjusting the combination and normalization techniques.

In a benchmark conducted by the OpenSearch team using publicly available datasets such as BEIR and Amazon ESCI, they concluded that the min_max normalization technique combined with the arithmetic_mean score combination technique provides the best results in a hybrid search.

You need to thoroughly test the different fine-tuning options to choose what is the most relevant to your business requirements.

Overall observations

From all the previous experiments, we can conclude that the hybrid search in the third experiment returned a combination of results that is relevant to the user, providing exact matches along with additional styles to choose from. The hybrid search matches the expectation of the retail shop customer.

Clean up

To avoid incurring continued AWS usage charges, make sure you delete all the resources you created as part of this post.

To clean up your resources, make sure you delete the S3 bucket you created within the application before you delete the CloudFormation stack.

OpenSearch Service integrations

In this post, you deployed a CloudFormation template to host the ML model in a SageMaker endpoint and spun up a new OpenSearch Service domain, then you used a SageMaker notebook to run steps to create the SageMaker-ML connector and deploy the ML model in OpenSearch Service.

You can achieve the same setup for an existing OpenSearch Service domain by using the ready-made CloudFormation templates from the OpenSearch Service console integrations. These templates automate the steps of SageMaker model deployment and SageMaker ML connector creation in OpenSearch Service.

Conclusion

In this post, we provided a complete solution to run a hybrid search with OpenSearch Service using a web application. The experiments in the post provided an example of how you can combine the power of lexical and semantic search in a hybrid search to improve the search experience for your end-users for a retail use case.

We also explained the new features available in version 2.9 and 2.11 in OpenSearch Service that make it effortless for you to build semantic search use cases such as remote ML connectors, ingest pipelines, and search pipelines. In addition, we showed you how the new score normalization processor in the search pipeline makes it straightforward to establish the global normalization of scores within your OpenSearch Service domain before combining multiple search scores.

Learn more about ML-powered search with OpenSearch and set up hybrid search in your own environment using the guidelines in this post. The solution code is also available on the GitHub repo.


About the Authors

Hajer Bouafif, Analytics Specialist Solutions ArchitectHajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Praveen Mohan Prasad, Analytics Specialist Technical Account ManagerPraveen Mohan Prasad is an Analytics Specialist Technical Account Manager at Amazon Web Services and helps customers with pro-active operational reviews on analytics workloads. Praveen actively researches on applying machine learning to improve search relevance.

Amazon OpenSearch H2 2023 in review

Post Syndicated from Jon Handler original https://aws.amazon.com/blogs/big-data/amazon-opensearch-h2-2023-in-review/

2023 was a busy year for Amazon OpenSearch Service! Learn more about the releases that OpenSearch Service launched in the first half of 2023.

In the second half of 2023, OpenSearch Service added support for two new OpenSearch versions: 2.9 and 2.11. These two versions introduce new features in search, ML-powered search, migrations, and the operational side of the service.

With the release of zero-ETL integration with Amazon Simple Storage Service (Amazon S3), you can analyze your data sitting in your data lake using OpenSearch Service to build dashboards and query the data without the need to move your data from Amazon S3.

OpenSearch Service also announced a new zero-ETL integration with Amazon DynamoDB through the DynamoDB plugin for Amazon OpenSearch Ingestion. OpenSearch Ingestion takes care of bootstrapping and continuously streams data from your DynamoDB source.

OpenSearch Serverless announced the general availability of the Vector Engine for Amazon OpenSearch Serverless along with other features to enhance your experience with time series collections, manage your cost for development environments, and quickly scale your resources to match your workload demands.

In this post, we discuss the new releases in OpenSearch Service to empower your business with search, observability, security analytics, and migrations.

Build cost-effective solutions with OpenSearch Service

With the zero-ETL integration for Amazon S3, OpenSearch Service now lets you query your data in place, saving cost on storage. Data movement is an expensive operation because you need to replicate data across different data stores. This increases your data footprint and drives cost. Moving data also adds the overhead of managing pipelines to migrate the data from one source to a new destination.

OpenSearch Service also added new instance types for data nodes—Im4gn and OR1—to help you further optimize your infrastructure cost. With up to 30 TB of Non-Volatile Memory Express (NVMe) solid state drive (SSD) storage, the Im4gn instance provides dense storage and better performance. OR1 instances use segment replication and remote-backed storage to greatly increase throughput for indexing-heavy workloads.

Zero-ETL from DynamoDB to OpenSearch Service

In November 2023, DynamoDB and OpenSearch Ingestion introduced a zero-ETL integration for OpenSearch Service. OpenSearch Service domains and OpenSearch Serverless collections provide advanced search capabilities, such as full-text and vector search, on your DynamoDB data. With a few clicks on the AWS Management Console, you can now seamlessly load and synchronize your data from DynamoDB to OpenSearch Service, eliminating the need to write custom code to extract, transform, and load the data.

Direct query (zero-ETL for Amazon S3 data, in preview)

OpenSearch Service announced a new way for you to query operational logs in Amazon S3 and S3-based data lakes without needing to switch between tools to analyze operational data. Previously, you had to copy data from Amazon S3 into OpenSearch Service to take advantage of OpenSearch’s rich analytics and visualization features to understand your data, identify anomalies, and detect potential threats.

However, continuously replicating data between services can be expensive and requires operational work. With the OpenSearch Service direct query feature, you can access operational log data stored in Amazon S3, without needing to move the data itself. Now you can perform complex queries and visualizations on your data without any data movement.

Support of Im4gn with OpenSearch Service

Im4gn instances are optimized for workloads that manage large datasets and need high storage density per vCPU. Im4gn instances come in sizes large through 16xlarge, with up to 30 TB of NVMe SSD storage. Im4gn instances are built on AWS Nitro SSDs, which offer high-throughput, low-latency disk access for best performance. OpenSearch Service Im4gn instances support all OpenSearch versions and Elasticsearch versions 7.9 and above. For more details, refer to Supported instance types in Amazon OpenSearch Service.

Introducing OR1, an OpenSearch Optimized Instance family for indexing heavy workloads

In November 2023, OpenSearch Service launched OR1, the OpenSearch Optimized Instance family, which delivers up to 30% price-performance improvement over existing instances in internal benchmarks and uses Amazon S3 to provide 11 9s of durability. A domain with OR1 instances uses Amazon Elastic Block Store (Amazon EBS) volumes for primary storage, with data copied synchronously to Amazon S3 as it arrives. OR1 instances use OpenSearch’s segment replication feature to enable replica shards to read data directly from Amazon S3, avoiding the resource cost of indexing in both primary and replica shards. The OR1 instance family also supports automatic data recovery in the event of failure. For more information about OR1 instance type options, refer to Current generation instance types in OpenSearch Service.

Enable your business with security analytics features

The Security Analytics plugin in OpenSearch Service supports out-of-the-box prepackaged log types and provides security detection rules (SIGMA rules) to detect potential security incidents.

In OpenSearch 2.9, the Security Analytics plugin added support for custom log types and native support for the Open Cybersecurity Schema Framework (OCSF) data format. With this new support, you can build detectors with OCSF data stored in Amazon Security Lake to analyze security findings and mitigate potential incidents. The plugin also lets you create your own custom detection rules.

Build ML-powered search solutions

In 2023, OpenSearch Service invested in eliminating the heavy lifting required to build next-generation search applications. With features such as search pipelines, search processors, and AI/ML connectors, OpenSearch Service enabled rapid development of search applications powered by neural search, hybrid search, and personalized results. Additionally, enhancements to the kNN plugin improved storage and retrieval of vector data. Newly launched optional plugins for OpenSearch Service enable seamless integration with additional language analyzers and Amazon Personalize.

Search pipelines

Search pipelines provide new ways to enhance search queries and improve search results. You define a search pipeline and then send your queries to it. When you define the search pipeline, you specify processors that transform and augment your queries, and re-rank your results. The prebuilt query processors include date conversion, aggregation, string manipulation, and data type conversion. The results processor in the search pipeline intercepts and adapts results on the fly before they are rendered to the next phase. Both request and response processing for the pipeline are performed on the coordinator node, so there is no shard-level processing.
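As an illustration, a search pipeline with a single request processor might look like the following sketch; the pipeline name and the filter are hypothetical. Any query sent through this pipeline is automatically narrowed to in-stock items:

PUT /_search/pipeline/in-stock-pipeline
{
  "request_processors": [
    {
      "filter_query": {
        "description": "Only return items that are in stock",
        "query": {
          "term": {
            "in_stock": true
          }
        }
      }
    }
  ]
}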

Optional plugins

OpenSearch Service lets you associate preinstalled optional OpenSearch plugins to use with your domain. An optional plugin package is compatible with a specific OpenSearch version, and can only be associated with domains running that version. Available plugins are listed on the Packages page on the OpenSearch Service console. The optional plugins include the Amazon Personalize plugin, which integrates OpenSearch Service with Amazon Personalize, and new language analyzers such as Nori, Sudachi, STConvert, and Pinyin.

Support for new language analyzers

OpenSearch Service added support for four new language analyzer plugins: Nori (Korean), Sudachi (Japanese), Pinyin (Chinese), and STConvert Analysis (Chinese). These are available in all AWS Regions as optional plugins that you can associate with domains running any OpenSearch version. You can use the Packages page on the OpenSearch Service console to associate these plugins to your domain, or use the Associate Package API.

Neural search feature

Neural search is generally available with OpenSearch Service version 2.9 and later. Neural search allows you to integrate with ML models that are hosted remotely using the model serving framework. When you use a neural query during search, neural search converts the query text into vector embeddings, uses vector search to compare the query and document embeddings, and returns the closest results. During ingestion, neural search transforms document text into vector embeddings and indexes both the text and its vector embeddings in a vector index.

Integration with Amazon Personalize

OpenSearch Service introduced an optional plugin to integrate with Amazon Personalize in OpenSearch versions 2.9 or later. The OpenSearch Service plugin for Amazon Personalize Search Ranking allows you to improve the end-user engagement and conversion from your website and application search by taking advantage of the deep learning capabilities offered by Amazon Personalize. As an optional plugin, the package is compatible with OpenSearch version 2.9 or later, and can only be associated to domains with that version.

Efficient query filtering with OpenSearch’s k-NN FAISS

OpenSearch Service introduced efficient query filtering with OpenSearch’s k-NN FAISS in version 2.9 and later. OpenSearch’s efficient vector query filters capability intelligently evaluates optimal filtering strategies—pre-filtering with approximate nearest neighbor (ANN) or filtering with exact k-nearest neighbor (k-NN)—to determine the best strategy to deliver accurate and low-latency vector search queries. In earlier OpenSearch versions, vector queries on the FAISS engine used post-filtering techniques, which enabled filtered queries at scale but could return fewer than the requested k number of results. Efficient vector query filters deliver low latency and accurate results, enabling you to employ hybrid search across vector and lexical techniques.
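With efficient filtering, the filter is expressed inside the k-NN query itself rather than applied afterward. The following is a minimal sketch; the index, field, vector values, and filter are hypothetical:

GET /products/_search
{
  "query": {
    "knn": {
      "caption_embedding": {
        "vector": [0.12, -0.07, 0.91],
        "k": 10,
        "filter": {
          "term": {
            "category": "shoes"
          }
        }
      }
    }
  }
}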

Byte-quantized vectors in OpenSearch Service

With byte-quantized vectors, introduced in 2.9, you can reduce memory requirements by a factor of 4 and significantly reduce search latency, with minimal loss in quality (recall). With this feature, the usual 32-bit floats that are used for vectors are quantized or converted to 8-bit signed integers. For many applications, existing float vector data can be quantized with little loss in quality. Comparing benchmarks, you will find that using byte vectors rather than 32-bit floats results in a significant reduction in storage and memory usage while also improving indexing throughput and reducing query latency. An internal benchmark showed that storage usage was reduced by up to 78%, and RAM usage was reduced by up to 59% (for the glove-200-angular dataset). Recall values for angular datasets were lower than those of Euclidean datasets.
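Byte quantization is declared in the k-NN field mapping. The following is a minimal sketch, assuming the Lucene engine (which supports the byte data type) and a hypothetical index; the ingested vector values must already be 8-bit signed integers:

PUT /byte-vector-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 200,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "engine": "lucene",
          "space_type": "l2"
        }
      }
    }
  }
}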

AI/ML connectors

OpenSearch 2.9 and later supports integrations with ML models hosted on AWS services or third-party platforms. This allows system administrators and data scientists to run ML workloads outside of their OpenSearch Service domain. The ML connectors come with a supported set of ML blueprints—templates that define the set of parameters you need to provide when sending API requests to a specific connector. OpenSearch Service provides connectors for several platforms, such as Amazon SageMaker, Amazon Bedrock, OpenAI ChatGPT, and Cohere.

OpenSearch Service console integrations

OpenSearch 2.9 and later added a new integrations feature on the console. Integrations provides you with an AWS CloudFormation template to build your semantic search use case by connecting to your ML models hosted on SageMaker or Amazon Bedrock. The CloudFormation template generates the model endpoint and registers the model ID with the OpenSearch Service domain you provide as input to the template.

Hybrid search and range normalization

The normalization processor and hybrid query build on top of two features released earlier in 2023—neural search and search pipelines. Because lexical and semantic queries return relevance scores on different scales, fine-tuning hybrid search queries was difficult.

OpenSearch Service 2.11 now supports a combination and normalization processor for hybrid search. You can now perform hybrid search queries, combining a lexical query and a natural language-based k-NN vector search query. OpenSearch Service also enables you to tune your hybrid search results for maximum relevance using multiple score combination and normalization techniques.

Multimodal search with Amazon Bedrock

OpenSearch Service 2.11 launched support for multimodal search, which allows you to search text and image data using multimodal embedding models. To generate vector embeddings, you need to create an ingest pipeline that contains a text_image_embedding processor, which converts the text or image binaries in a document field to vector embeddings. You can use the neural query clause, either in the k-NN plugin API or Query DSL queries, to run combined text and image searches. You can use the new OpenSearch Service integration features to quickly get started with multimodal search.
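
The following abbreviated sketch outlines the two pieces involved, an ingest pipeline with the text_image_embedding processor and a neural query that supplies both text and an image; the pipeline, index, field, and model names are illustrative:

import requests

DOMAIN = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder

# Ingest pipeline: converts text and image binaries into one multimodal embedding
pipeline = {
    "processors": [
        {
            "text_image_embedding": {
                "model_id": "<multimodal-model-id>",
                "embedding": "vector_embedding",
                "field_map": {"text": "image_description", "image": "image_binary"},
            }
        }
    ]
}
requests.put(f"{DOMAIN}/_ingest/pipeline/multimodal-pipeline", json=pipeline, timeout=30)

# Neural query over the combined embedding, mixing text and a base64-encoded image
query = {
    "query": {
        "neural": {
            "vector_embedding": {
                "query_text": "red running shoes",
                "query_image": "<base64-encoded-image>",
                "model_id": "<multimodal-model-id>",
                "k": 5,
            }
        }
    }
}
requests.post(f"{DOMAIN}/images/_search", json=query, timeout=30)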

Neural sparse retrieval

Neural sparse search, a new efficient method of semantic retrieval, is available in OpenSearch Service 2.11. Neural sparse search operates in two modes: bi-encoder and document-only. With the bi-encoder mode, both documents and search queries are passed through deep encoders. In document-only mode, only documents are passed through deep encoders, while search queries are tokenized. A document-only sparse encoder generates an index that is 10.4% of the size of a dense encoding index. For a bi-encoder, the index size is 7.2% of the size of a dense encoding index. Neural sparse search is enabled by sparse encoding models that create sparse vector embeddings: a set of <token: weight> pairs representing the text entry and its corresponding weight in the sparse vector. To learn more about the pre-trained models for sparse neural search, refer to Sparse encoding models.

Neural sparse search reduces costs, improves search relevance, and has lower latency. You can use the new OpenSearch Service integrations features to quickly start with neural sparse search.
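
For example, once documents have been indexed through a sparse encoding ingest processor, a query can be as simple as the following sketch; the index, field, and model names are assumptions:

import requests

DOMAIN = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder

# In document-only mode, the query text is tokenized against the sparse index;
# in bi-encoder mode, the same model_id encodes the query as well.
query = {
    "query": {
        "neural_sparse": {
            "passage_embedding": {                 # sparse vector field (assumed name)
                "query_text": "what is the capital of france",
                "model_id": "<sparse-encoding-model-id>",
            }
        }
    }
}
response = requests.post(f"{DOMAIN}/docs/_search", json=query, timeout=30)
print(response.json()["hits"]["hits"])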

OpenSearch Ingestion updates

OpenSearch Ingestion is a fully managed and auto scaled ingestion pipeline that delivers your data to OpenSearch Service domains and OpenSearch Serverless collections. Since its release in 2023, OpenSearch Ingestion continues to add new features to make it straightforward to transform and move your data from supported sources to downstream destinations like OpenSearch Service, OpenSearch Serverless, and Amazon S3.

New migration features in OpenSearch Ingestion

In November 2023, OpenSearch Ingestion announced the release of new features to support data migration from self-managed Elasticsearch version 7.x domains to the latest versions of OpenSearch Service.

OpenSearch Ingestion also supports the migration of data from OpenSearch Service managed domains running OpenSearch version 2.x to OpenSearch Serverless collections.

Learn how you can use OpenSearch Ingestion to migrate your data to OpenSearch Service.

Improve data durability with OpenSearch Ingestion

In November 2023, OpenSearch Ingestion introduced persistent buffering for push-based sources like HTTP sources (HTTP, Fluentd, Fluent Bit) and OpenTelemetry collectors.

By default, OpenSearch Ingestion uses in-memory buffering. With persistent buffering, OpenSearch Ingestion stores your data in a disk-based store that is more resilient. If you have existing ingestion pipelines, you can enable persistent buffering for these pipelines, as shown in the following screenshot.

Support of new plugins

In early 2023, OpenSearch Ingestion added support for Amazon Managed Streaming for Apache Kafka (Amazon MSK). OpenSearch Ingestion uses the Kafka plugin to stream data from Amazon MSK to OpenSearch Service managed domains or OpenSearch Serverless collections. To learn more about setting up Amazon MSK as a data source, see Using an OpenSearch Ingestion pipeline with Amazon Managed Streaming for Apache Kafka.

OpenSearch Serverless updates

OpenSearch Serverless continued to enhance your serverless experience with OpenSearch by introducing support for a new vector search collection type to store embeddings and run similarity search. OpenSearch Serverless now supports shard replica scaling to handle spikes in query throughput. And if you are using a time series collection, you can now set up a custom data retention policy to match your data retention requirements.

Vector Engine for OpenSearch Serverless

In November 2023, we launched the vector engine for Amazon OpenSearch Serverless. The vector engine makes it straightforward to build modern ML-augmented search experiences and generative artificial intelligence (generative AI) applications without needing to manage the underlying vector database infrastructure. It also enables you to run hybrid search, combining vector search and full-text search in the same query, removing the need to manage and maintain separate data stores or a complex application stack.

OpenSearch Serverless lower-cost dev and test environments

OpenSearch Serverless now supports development and test workloads by allowing you to avoid running a replica. Removing replicas eliminates the need to have redundant OCUs in another Availability Zone solely for availability purposes. If you are using OpenSearch Serverless for development and testing, where availability is not a concern, you can drop your minimum OCUs from 4 to 2.

OpenSearch Serverless supports automated time-based data deletion using data lifecycle policies

In December 2023, OpenSearch Serverless announced support for managing data retention of time series collections and indexes. With the new automated time-based data deletion feature, you can specify how long you want to retain data. OpenSearch Serverless automatically manages the lifecycle of the data based on this configuration. To learn more, refer to Amazon OpenSearch Serverless now supports automated time-based data deletion.

OpenSearch Serverless announced support for scaling up replicas at shard level

At launch, OpenSearch Serverless supported increasing capacity automatically in response to growing data sizes. With the new shard replica scaling feature, OpenSearch Serverless automatically detects shards under duress due to sudden spikes in query rates and dynamically adds new shard replicas to handle the increased query throughput while maintaining fast response times. This approach proves to be more cost-efficient than simply adding new index replicas.

AWS user notifications to monitor your OCU usage

With the new AWS User Notifications integration, you can configure the system to send notifications when OCU utilization approaches or reaches the maximum configured limits for search or ingestion, eliminating the need to monitor the service constantly. For more information, see Monitoring Amazon OpenSearch Serverless using AWS User Notifications.

Enhance your experience with OpenSearch Dashboards

OpenSearch 2.9 in OpenSearch Service introduced new features to make it straightforward to quickly analyze your data in OpenSearch Dashboards. These include new out-of-the-box, preconfigured dashboards with OpenSearch integrations, and the ability to create alerting monitors and anomaly detectors from an existing visualization in your dashboards.

OpenSearch Dashboard integrations

OpenSearch 2.9 added the support of OpenSearch integrations in OpenSearch Dashboards. OpenSearch integrations include preconfigured dashboards so you can quickly start analyzing your data coming from popular sources such as AWS CloudFront, AWS WAF, AWS CloudTrail, and Amazon Virtual Private Cloud (Amazon VPC) flow logs.

Alerting and anomalies in OpenSearch Dashboards

In OpenSearch Service 2.9, you can create a new alerting monitor directly from your line chart visualization in OpenSearch Dashboards. You can also associate the existing monitors or detectors previously created in OpenSearch to the dashboard visualization.

This new feature helps reduce context switching between dashboards and the Alerting and Anomaly Detection plugins. The following dashboard shows how to add an alerting monitor to detect drops in average data volume in your services.

OpenSearch expands geospatial aggregations support

With OpenSearch version 2.9, OpenSearch Service added support for three types of geoshape data aggregation through the API: geo_bounds, geo_hash, and geo_tile.

The geoshape field type lets you index location data in different geographic formats, such as a point, a polygon, or a linestring. With the new aggregation types, you have more flexibility to aggregate documents from an index using metric and multi-bucket geospatial aggregations.
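
For instance, the following sketch computes the bounding box that encloses all matching geoshapes with a geo_bounds metric aggregation; the domain endpoint, index, and field names are placeholders:

import requests

DOMAIN = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder

# Metric aggregation over a geoshape field: returns the enclosing bounding box
query = {
    "size": 0,
    "aggs": {
        "delivery_area": {
            "geo_bounds": {"field": "service_region"}  # geoshape field (assumed name)
        }
    },
}
response = requests.post(f"{DOMAIN}/stores/_search", json=query, timeout=30)
print(response.json()["aggregations"]["delivery_area"]["bounds"])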

OpenSearch Service operational updates

OpenSearch Service removed the need to run a blue/green deployment when changing the dedicated cluster manager nodes of a domain. Additionally, the service improved Auto-Tune events with the support of new Auto-Tune metrics to track the changes within your OpenSearch Service domain.

OpenSearch Service now lets you update domain manager nodes without blue/green deployment

As of early H2 of 2023, OpenSearch Service allowed you to modify the instance type or instance count of dedicated cluster manager nodes without the need for blue/green deployment. This enhancement allows quicker updates with minimal disruption to your domain operations, all while avoiding any data movement.

Previously, updating your dedicated cluster manager nodes on OpenSearch Service meant using a blue/green deployment to make the change. Although blue/green deployments are meant to avoid any disruption to your domains, because the deployment utilizes additional resources on the domain, it is recommended that you perform them during low-traffic periods. Now you can update cluster manager instance types or instance counts without requiring a blue/green deployment, so these updates complete faster while avoiding any potential disruption to your domain operations. In cases where you modify both the cluster manager instance type and count, OpenSearch Service will still use a blue/green deployment to make the change. You can use the dry-run option to check whether your change requires a blue/green deployment.
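
For example, the following sketch uses the boto3 DryRun option to check whether a cluster manager change would trigger a blue/green deployment; the domain name and instance type are placeholders, and the exact shape of the dry-run results is worth verifying against the current API documentation:

import boto3

client = boto3.client("opensearch")

# Dry run: validates the change and reports the deployment type without applying it
response = client.update_domain_config(
    DomainName="my-domain",                        # placeholder domain name
    ClusterConfig={
        "DedicatedMasterType": "m6g.large.search", # new cluster manager instance type
    },
    DryRun=True,
)
# Expected to indicate whether a blue/green deployment would be required
print(response.get("DryRunResults"))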

Enhanced Auto-Tune experience

In September 2023, OpenSearch Service added new Auto-Tune metrics and improved Auto-Tune events that give you better visibility into the domain performance optimizations made by Auto-Tune.

Auto-Tune is an adaptive resource management system that automatically updates OpenSearch Service domain resources to improve efficiency and performance. For example, Auto-Tune optimizes memory-related configuration such as queue sizes, cache sizes, and Java virtual machine (JVM) settings on your nodes.

With this launch, you can now audit the history of the changes, as well as track them in real time from the Amazon CloudWatch console.

Additionally, OpenSearch Service now publishes details of the changes to Amazon EventBridge when Auto-Tune settings are recommended or applied to an OpenSearch Service domain. These Auto-Tune events will also be visible on the Notifications page on the OpenSearch Service console.

Accelerate your migration to OpenSearch Service with the new Migration Assistant solution

In November 2023, the OpenSearch team launched a new open source solution: Migration Assistant for Amazon OpenSearch Service. The solution supports data migration from self-managed Elasticsearch and OpenSearch domains to OpenSearch Service, supporting Elasticsearch 7.x (<=7.10), OpenSearch 1.x, and OpenSearch 2.x as migration sources. The solution facilitates the migration of both existing and live data between source and destination.

Conclusion

In this post, we covered the new releases in OpenSearch Service to help you innovate your business with search, observability, security analytics, and migrations. We provided you with information about when to use each new feature in OpenSearch Service, OpenSearch Ingestion, and OpenSearch Serverless.

Learn more about OpenSearch Dashboards and OpenSearch plugins, and try the exciting new OpenSearch Assistant using the OpenSearch playground.

Check out the features described in this post, and we appreciate you providing us your valuable feedback.


About the Authors

Jon Handler is a Senior Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with OpenSearch and Amazon OpenSearch Service, providing help and guidance to a broad range of customers who have search and log analytics workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included 4 years of coding a large-scale, ecommerce search engine. Jon holds a Bachelor of the Arts from the University of Pennsylvania, and a Master of Science and a PhD in Computer Science and Artificial Intelligence from Northwestern University.

Hajer Bouafif is an Analytics Specialist Solutions Architect at Amazon Web Services. She focuses on Amazon OpenSearch Service and helps customers design and build well-architected analytics workloads in diverse industries. Hajer enjoys spending time outdoors and discovering new cultures.

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Muslim Abu Taha is a Sr. OpenSearch Specialist Solutions Architect dedicated to guiding clients through seamless search workload migrations, fine-tuning clusters for peak performance, and ensuring cost-effectiveness. With a background as a Technical Account Manager (TAM), Muslim brings a wealth of experience in assisting enterprise customers with cloud adoption and optimizing their workloads. Muslim enjoys spending time with his family, traveling, and exploring new places.

Build a RAG data ingestion pipeline for large-scale ML workloads

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/big-data/build-a-rag-data-ingestion-pipeline-for-large-scale-ml-workloads/

For building any generative AI application, enriching large language models (LLMs) with new data is imperative. This is where the Retrieval Augmented Generation (RAG) technique comes in. RAG is a machine learning (ML) architecture that uses external documents (like Wikipedia) to augment its knowledge and achieve state-of-the-art results on knowledge-intensive tasks. To ingest these external data sources, vector databases have evolved, which can store vector embeddings of the data source and allow for similarity searches.

In this post, we show how to build a RAG extract, transform, and load (ETL) ingestion pipeline to ingest large amounts of data into an Amazon OpenSearch Service cluster and use Amazon Relational Database Service (Amazon RDS) for PostgreSQL with the pgvector extension as a vector data store. Each service implements k-nearest neighbor (k-NN) or approximate nearest neighbor (ANN) algorithms and distance metrics to calculate similarity. We introduce the integration of Ray into the RAG contextual document retrieval mechanism. Ray is an open source, Python, general purpose, distributed computing library. It allows distributed data processing to generate and store embeddings for a large amount of data, parallelizing across multiple GPUs. We use a Ray cluster with these GPUs to run parallel ingest and query for each service.

In this experiment, we attempt to analyze the following aspects for OpenSearch Service and the pgvector extension on Amazon RDS:

  • As a vector store, the ability to scale and handle a large dataset with tens of millions of records for RAG
  • Possible bottlenecks in the ingest pipeline for RAG
  • How to achieve optimal performance in ingestion and query retrieval times for OpenSearch Service and Amazon RDS

To understand more about vector data stores and their role in building generative AI applications, refer to The role of vector datastores in generative AI applications.

Overview of OpenSearch Service

OpenSearch Service is a managed service for secure analysis, search, and indexing of business and operational data. OpenSearch Service supports petabyte-scale data with the ability to create multiple indexes on text and vector data. With optimized configuration, it aims for high recall for the queries. OpenSearch Service supports ANN as well as exact k-NN search. OpenSearch Service supports a selection of algorithms from the NMSLIB, FAISS, and Lucene libraries to power the k-NN search. We created the ANN index for OpenSearch with the Hierarchical Navigable Small World (HNSW) algorithm because it’s regarded as a better search method for large datasets. For more information on the choice of index algorithm, refer to Choose the k-NN algorithm for your billion-scale use case with OpenSearch.

Overview of Amazon RDS for PostgreSQL with pgvector

The pgvector extension adds open source vector similarity search to PostgreSQL. By utilizing the pgvector extension, PostgreSQL can perform similarity searches on vector embeddings, providing businesses with a speedy and proficient solution. pgvector provides two types of vector similarity searches: exact nearest neighbor, which yields 100% recall, and approximate nearest neighbor (ANN), which provides better performance than exact search with a trade-off on recall. For searches over an index, you can choose how many centers to use in the search; more centers provide better recall at a cost in performance.
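
As a brief sketch of these trade-offs in code, the following snippet (using psycopg2) creates an IVFFlat index and adjusts the number of probed centers at query time; the connection string, table name, and parameter values are illustrative:

import psycopg2

conn = psycopg2.connect("host=my-rds-endpoint dbname=rag user=postgres")  # placeholder DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, embedding vector(768));")

# IVFFlat ANN index: 'lists' controls how many centers partition the vectors
cur.execute("CREATE INDEX IF NOT EXISTS docs_embedding_idx "
            "ON docs USING ivfflat (embedding vector_l2_ops) WITH (lists = 5000);")

# More probes = better recall at the cost of query latency
cur.execute("SET ivfflat.probes = 10;")

# '<->' is pgvector's L2 distance operator
query_vec = "[" + ",".join(["0"] * 768) + "]"  # placeholder 768-dimensional query embedding
cur.execute("SELECT id FROM docs ORDER BY embedding <-> %s::vector LIMIT 5;", (query_vec,))
print(cur.fetchall())
conn.commit()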

Solution overview

The following diagram illustrates the solution architecture.

Let’s look at the key components in more detail.

Dataset

We use OSCAR data as our corpus and the SQUAD dataset to provide sample questions. These datasets are first converted to Parquet files. Then we use a Ray cluster to convert the Parquet data to embeddings. The created embeddings are ingested to OpenSearch Service and Amazon RDS with pgvector.

OSCAR (Open Super-large Crawled Aggregated corpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture. Data is distributed by language in both original and deduplicated form. The OSCAR corpus is approximately 609 million records and takes up about 4.5 TB as raw JSONL files. The JSONL files are then converted to Parquet format, which minimizes the total size to 1.8 TB. We further scaled the dataset down to 25 million records to save time during ingestion.

SQuAD (Stanford Question Answering Dataset) is a reading comprehension dataset consisting of questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable. We use SQuAD, licensed as CC-BY-SA 4.0, to provide sample questions. It has approximately 100,000 questions, with over 50,000 unanswerable questions written by crowd workers to look similar to answerable ones.

Ray cluster for ingestion and creating vector embeddings

In our testing, we found that the GPUs make the biggest impact to performance when creating the embeddings. Therefore, we decided to use a Ray cluster to convert our raw text and create the embeddings. Ray is an open source unified compute framework that enables ML engineers and Python developers to scale Python applications and accelerate ML workloads. Our cluster consisted of 5 g4dn.12xlarge Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance was configured with 4 NVIDIA T4 Tensor Core GPUs, 48 vCPU, and 192 GiB of memory. We chunked each text record into 1,000-character pieces with a 100-character overlap, which works out to approximately 200 chunks per record. For the model used to create embeddings, we settled on all-mpnet-base-v2 to create a 768-dimensional vector space.
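
A simplified sketch of this chunking and embedding approach (without the Ray parallelism) looks like the following; the chunking helper and sample record are illustrative:

from sentence_transformers import SentenceTransformer

# 768-dimensional embedding model, as used in the experiment
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def chunk(text, size=1000, overlap=100):
    """Split text into overlapping fixed-size pieces."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

record = "example raw text " * 12000       # stand-in for one large OSCAR record
pieces = chunk(record)                     # roughly one chunk per 900 characters
embeddings = model.encode(pieces, batch_size=128)  # one 768-dim vector per chunk
print(len(pieces), embeddings.shape)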

Infrastructure setup

We used the following RDS instance types and OpenSearch service cluster configurations to set up our infrastructure.

The following are our RDS instance type properties:

  • Instance type: db.r7g.12xlarge
  • Allocated storage: 20 TB
  • Multi-AZ: True
  • Storage encrypted: True
  • Enable Performance Insights: True
  • Performance Insight retention: 7 days
  • Storage type: gp3
  • Provisioned IOPS: 64,000
  • Index type: IVF
  • Number of lists: 5,000
  • Distance function: L2

The following are our OpenSearch Service cluster properties:

  • Version: 2.5
  • Data nodes: 10
  • Data node instance type: r6g.4xlarge
  • Primary nodes: 3
  • Primary node instance type: r6g.xlarge
  • Index algorithm: HNSW (engine: nmslib)
  • Refresh interval: 30 seconds
  • ef_construction: 256
  • m: 16
  • Distance function: L2

We used large configurations for both the OpenSearch Service cluster and RDS instances to avoid any performance bottlenecks.

We deploy the solution using an AWS Cloud Development Kit (AWS CDK) stack, as outlined in the following section.

Deploy the AWS CDK stack

The AWS CDK stack allows us to choose OpenSearch Service or Amazon RDS for ingesting data.

Prerequisites

Before proceeding with the installation, change the Boolean values for Amazon RDS and OpenSearch Service under cdk, bin, src.tc to either true or false, depending on your preference.

You also need a service-linked AWS Identity and Access Management (IAM) role for the OpenSearch Service domain. For more details, refer to Amazon OpenSearch Service Construct Library. You can also run the following command to create the role:

aws iam create-service-linked-role --aws-service-name es.amazonaws.com

Then install the dependencies and deploy the stack:

npm install
cdk deploy

This AWS CDK stack will deploy the following infrastructure:

  • A VPC
  • A jump host (inside the VPC)
  • An OpenSearch Service cluster (if using OpenSearch Service for ingestion)
  • An RDS instance (if using Amazon RDS for ingestion)
  • An AWS Systems Manager document for deploying the Ray cluster
  • An Amazon Simple Storage Service (Amazon S3) bucket
  • An AWS Glue job for converting the OSCAR dataset JSONL files to Parquet files
  • Amazon CloudWatch dashboards

Download the data

Run the following commands from the jump host:

stack_name="RAGStack"
output_key="S3bucket"

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

bucket_name=$(aws cloudformation describe-stacks --stack-name "$stack_name" --query "Stacks[0].Outputs[?OutputKey=='bucketName'].OutputValue" --output text )

Before cloning the git repo, make sure you have a Hugging Face profile and access to the OSCAR data corpus. You need to use the user name and password for cloning the OSCAR data:

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2301
cd OSCAR-2301
git lfs pull --include en_meta
cd en_meta
for F in `ls *.zst`; do zstd -d $F; done
rm *.zst
cd ..
aws s3 sync en_meta s3://$bucket_name/oscar/jsonl/

Convert JSONL files to Parquet

The AWS CDK stack created the AWS Glue ETL job oscar-jsonl-parquet to convert the OSCAR data from JSONL to Parquet format.

After you run the oscar-jsonl-parquet job, the files in Parquet format should be available under the parquet folder in the S3 bucket.

Download the questions

From your jump host, download the questions data and upload it to your S3 bucket:

stack_name="RAGStack"
output_key="S3bucket"

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

bucket_name=$(aws cloudformation describe-stacks --stack-name "$stack_name" --query "Stacks[0].Outputs[?OutputKey=='bucketName'].OutputValue" --output text )

wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
cat train-v2.0.json | jq '.data[].paragraphs[].qas[].question' > questions.csv
aws s3 cp questions.csv s3://$bucket_name/oscar/questions/questions.csv

Set up the Ray cluster

As part of the AWS CDK stack deployment, we created a Systems Manager document called CreateRayCluster.

To run the document, complete the following steps:

  1. On the Systems Manager console, under Documents in the navigation pane, choose Owned by Me.
  2. Open the CreateRayCluster document.
  3. Choose Run.

The run command page will have the default values populated for the cluster.

The default configuration requests 5 g4dn.12xlarge instances. Make sure your account has service quota limits to support this. The relevant service limit is Running On-Demand G and VT instances. The default for this is 64, but this configuration requires 240 vCPUs.

  4. After you review the cluster configuration, select the jump host as the target for the run command.

This command will perform the following steps:

  • Copy the Ray cluster files
  • Set up the Ray cluster
  • Set up the OpenSearch Service indexes
  • Set up the RDS tables

You can monitor the output of the commands on the Systems Manager console. This process will take 10–15 minutes for the initial launch.

Run ingestion

From the jump host, connect to the Ray cluster:

sudo -i
cd /rag
ray attach llm-batch-inference.yaml

The first time you connect to the host, install the requirements. These files should already be present on the head node.

pip install -r requirements.txt

For either of the ingestion methods, if you get an error like the following, it’s related to expired credentials. The current workaround (as of this writing) is to place credential files in the Ray head node. To avoid security risks, don’t use IAM users for authentication when developing purpose-built software or working with real data. Instead, use federation with an identity provider such as AWS IAM Identity Center (successor to AWS Single Sign-On).

OSError: When reading information for key 'oscar/parquet_data/part-00497-f09c5d2b-0e97-4743-ba2f-1b2ad4f36bb1-c000.snappy.parquet' in bucket 'ragstack-s3bucket07682993-1e3dic0fvr3rf': AWS Error [code 15]: No response body.

Usually, the credentials are stored in the file ~/.aws/credentials on Linux and macOS systems, and %USERPROFILE%\.aws\credentials on Windows, but these are short-term credentials with a session token. You also can’t override the default credential file, and so you need to create long-term credentials without the session token using a new IAM user.

To create long-term credentials, you need to generate an AWS access key and AWS secret access key. You can do that from the IAM console. For instructions, refer to Authenticate with IAM user credentials.

After you create the keys, connect to the jump host using Session Manager, a capability of Systems Manager, and run the following command:

$ aws configure
AWS Access Key ID [None]: <Your AWS Access Key>
AWS Secret Access Key [None]: <Your AWS Secret access key>
Default region name [None]: us-east-1
Default output format [None]: json

Now you can rerun the ingestion steps.

Ingest data into OpenSearch Service

If you’re using OpenSearch Service, run the following script to ingest the files:

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

python embedding_ray_os.py

When it’s complete, run the script that runs simulated queries:

python query_os.py

Ingest data into Amazon RDS

If you’re using Amazon RDS, run the following script to ingest the files:

export AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone | sed 's/\(.*\)[a-z]/\1/')
aws configure set region $AWS_REGION

python embedding_ray_rds.py

When it’s complete, make sure to run a full vacuum on the RDS instance.

Then run the following script to run simulated queries:

python query_rds.py

Set up the Ray dashboard

Before you set up the Ray dashboard, you should install the AWS Command Line Interface (AWS CLI) on your local machine. For instructions, refer to Install or update the latest version of the AWS CLI.

Complete the following steps to set up the dashboard:

  1. Install the Session Manager plugin for the AWS CLI.
  2. Copy the temporary credentials for bash/zsh for your AWS account and run them in your local terminal.
  3. Create a session.sh file in your machine and copy the following content to the file:
#!/bin/bash
echo Starting session to $1 to forward to port $2 using local port $3
aws ssm start-session --target $1 --document-name AWS-StartPortForwardingSession --parameters '{"portNumber":["'$2'"], "localPortNumber":["'$3'"]}'
  4. Change the directory to where this session.sh file is stored.
  5. Run the command chmod +x session.sh to give executable permission to the file.
  6. Run the following command:
./session.sh <Ray cluster head node instance ID> 8265 8265

For example:

./session.sh i-021821beb88661ba3 8265 8265

You will see a message like the following:

Starting session to i-021821beb88661ba3 to forward to port 8265 using local port 8265

Starting session with SessionId: abcdefgh-Isengard-0d73d992dfb16b146
Port 8265 opened for sessionId abcdefgh-Isengard-0d73d992dfb16b146.
Waiting for connections...

Open a new tab in your browser and enter localhost:8265.

You will see the Ray dashboard and statistics of the jobs and cluster running. You can track metrics from here.

For example, you can use the Ray dashboard to observe load on the cluster. As shown in the following screenshot, during ingest, the GPUs are running close to 100% utilization.

You can also use the RAG_Benchmarks CloudWatch dashboard to see the ingestion rate and query response times.

Extensibility of the solution

You can extend this solution to plug in other AWS or third-party vector stores. For every new vector store, you will need to create scripts for configuring the data store as well as ingesting data. The rest of the pipeline can be reused as needed.

Conclusion

In this post, we shared an ETL pipeline that you can use to put vectorized RAG data in both OpenSearch Service as well as Amazon RDS with the pgvector extension as vector datastores. The solution used a Ray cluster to provide the necessary parallelism to ingest a large data corpus. You can use this methodology to integrate any vector database of your choice to build RAG pipelines.


About the Authors

Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held a variety of positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.

David Christian is a Principal Solutions Architect based out of Southern California. He has his bachelor’s in Information Security and a passion for automation. His focus areas are DevOps culture and transformation, infrastructure as code, and resiliency. Prior to joining AWS, he held roles in security, DevOps, and system engineering, managing large-scale private and public cloud environments.

Prachi Kulkarni is a Senior Solutions Architect at AWS. Her specialization is machine learning, and she is actively working on designing solutions using various AWS ML, big data, and analytics offerings. Prachi has experience in multiple domains, including healthcare, benefits, retail, and education, and has worked in a range of positions in product engineering and architecture, management, and customer success.

Richa Gupta is a Solutions Architect at AWS. She is passionate about architecting end-to-end solutions for customers. Her specialization is machine learning and how it can be used to build new solutions that lead to operational excellence and drive business revenue. Prior to joining AWS, she worked in the capacity of a Software Engineer and Solutions Architect, building solutions for large telecom operators. Outside of work, she likes to explore new places and loves adventurous activities.

In-stream anomaly detection with Amazon OpenSearch Ingestion and Amazon OpenSearch Serverless

Post Syndicated from Rupesh Tiwari original https://aws.amazon.com/blogs/big-data/in-stream-anomaly-detection-with-amazon-opensearch-ingestion-and-amazon-opensearch-serverless/

Unsupervised machine learning analytics has emerged as a powerful tool for anomaly detection in today’s data-rich landscape, especially with the growing volume of machine-generated data. In-stream anomaly detection offers real-time insights into data anomalies, enabling proactive response. Amazon OpenSearch Serverless focuses on delivering seamless scalability and management of search workloads; Amazon OpenSearch Ingestion complements this by providing a robust solution for anomaly detection on indexed data.

In this post, we provide a solution using OpenSearch Ingestion that empowers you to perform in-stream anomaly detection within your own AWS environment.

In-stream anomaly detection with OpenSearch Ingestion

OpenSearch Ingestion makes the process of in-stream anomaly detection straightforward and less costly. In-stream anomaly detection helps you save on indexing and avoids the need for extensive resources to handle big data. It lets organizations apply the appropriate resources at the appropriate time, managing large data efficiently and saving money. Using peer forwarders and aggregate processors can make things more complex and expensive; OpenSearch Ingestion reduces these issues.

Let’s look at a use case showing an OpenSearch Ingestion configuration YAML for in-stream anomaly detection.

Solution overview

In this example, we walk through the setup of OpenSearch Ingestion using a random cut forest anomaly detector for monitoring log counts within a 5-minute period. We also index the raw logs to provide a comprehensive demonstration of the incoming data flow. If your use case requires the analysis of raw logs, you can streamline the process by bypassing the initial pipeline and focusing directly on in-stream anomaly detection, indexing only the identified anomalies.

The following diagram illustrates our solution architecture.

The configuration outlines two OpenSearch Ingestion pipelines. The first, non-ad-pipeline, ingests HTTP data, timestamps it, and forwards it to both ad-pipeline and an OpenSearch index, non-ad-index. The second, ad-pipeline, receives this data, performs aggregation based on the ID within a 5-minute window, and conducts anomaly detection. Results are stored in the index ad-anomaly-index. This setup showcases data processing, anomaly detection, and storage within OpenSearch Service, enhancing analysis capabilities.

Implement the solution

Complete the following steps to set up the solution:

  1. Create a pipeline role.
  2. Create a collection.
  3. Create a pipeline in which you specify the pipeline role.

The pipeline assumes this role in order to sign requests to the OpenSearch Serverless collection endpoint. Specify the values for the keys within the following pipeline configuration:

  • For sts_role_arn, specify the Amazon Resource Name (ARN) of the pipeline role that you created.
  • For hosts, specify the endpoint of the collection that you created.
  • Set serverless to true.
version: "2"
# 1st pipeline
non-ad-pipeline:
  source:
    http:
      path: "/${pipelineName}/test_ingestion_path"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
  sink:
    - pipeline:
        name: "ad-pipeline"
    - opensearch:
        hosts:
          [
            "https://{collection-id}.us-east-1.aoss.amazonaws.com",
          ]
        index: "non-ad-index"
        
        aws:
          sts_role_arn: "arn:aws:iam::{account-id}:role/pipeline-role"
          region: "us-east-1"
          serverless: true
# 2nd pipeline
ad-pipeline:
  source:
    pipeline:
      name: "non-ad-pipeline"
  processor:
    - aggregate:
        identification_keys: ["id"]
        action:
          count:
        group_duration: "300s"
    - anomaly_detector:
        keys: ["value"] # value will have sum of logs
        mode:
          random_cut_forest:
            output_after: 200 
  sink:
    - opensearch:
        hosts:
          [
            "https://{collection-id}.us-east-1.aoss.amazonaws.com",
          ]
        aws:
          sts_role_arn: "arn:aws:iam::{account-id}:role/pipeline-role"
          region: "us-east-1"
          serverless: true
        index: "ad-anomaly-index"

For a detailed guide on the required parameters and any limitations, see Supported plugins and options for Amazon OpenSearch Ingestion pipelines.

  4. After you update the configuration, confirm the validity of your pipeline settings by choosing Validate pipeline.

A successful validation will display a message stating “Pipeline configuration validation successful,” as shown in the following screenshot.

If validation fails, refer to Troubleshooting Amazon OpenSearch Service for troubleshooting and guidance.

Cost estimation for OpenSearch Ingestion

You are only charged for the number of Ingestion OpenSearch Compute Units (Ingestion OCUs) that are allocated to a pipeline, regardless of whether there’s data flowing through the pipeline. OpenSearch Ingestion immediately accommodates your workloads by scaling pipeline capacity up or down based on usage. For an overview of expenses, refer to Amazon OpenSearch Ingestion.

The following table shows approximate monthly costs based on specified throughputs and compute needs. Let’s assume that operation occurs from 8:00 AM to 8:00 PM on weekdays, with a cost of $0.24 per OCU per hour.

The formula would be: Total Cost/Month = OCU Requirement * OCU Price * Hours/Day * Days/Month.

Throughput   Compute Required (OCUs)   Total Cost/Month (USD)
1 Gbps       10                        576
10 Gbps      100                       5,760
50 Gbps      500                       28,800
100 Gbps     1,000                     57,600
500 Gbps     5,000                     288,000
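
Expressed in code, the table is just the formula above applied row by row:

# Reproducing the table's arithmetic under the stated assumptions
OCU_PRICE = 0.24       # USD per OCU-hour, the example rate above
HOURS_PER_DAY = 12     # 8:00 AM to 8:00 PM
DAYS_PER_MONTH = 20    # weekdays only

for gbps, ocus in [(1, 10), (10, 100), (50, 500), (100, 1000), (500, 5000)]:
    cost = ocus * OCU_PRICE * HOURS_PER_DAY * DAYS_PER_MONTH
    print(f"{gbps:>4} Gbps: {ocus:>5} OCUs -> ${cost:,.0f}/month")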

Clean up

When you are done using the solution, delete the resources you created, including the pipeline role, pipeline, and collection.

Summary

With OpenSearch Ingestion, you can explore in-stream anomaly detection with OpenSearch Service. The use case in this post demonstrates how OpenSearch Ingestion simplifies the process, achieving more with fewer resources. It showcases the service’s ability to analyze log rates, generate anomaly notifications, and empower proactive response to anomalies. With OpenSearch Ingestion, you can improve operational efficiency and enhance real-time risk management capabilities.

Leave any thoughts and questions in the comments.


About the Authors

Rupesh Tiwari, an AWS Solutions Architect, specializes in modernizing applications with a focus on data analytics, OpenSearch, and generative AI. He’s known for creating scalable, secure solutions that leverage cloud technology for transformative business outcomes, also dedicating time to community engagement and sharing expertise.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Petabyte-scale log analytics with Amazon S3, Amazon OpenSearch Service, and Amazon OpenSearch Ingestion

Post Syndicated from Jagadish Kumar original https://aws.amazon.com/blogs/big-data/petabyte-scale-log-analytics-with-amazon-s3-amazon-opensearch-service-and-amazon-opensearch-ingestion/

Organizations often need to manage a high volume of data that is growing at an extraordinary rate. At the same time, they need to optimize operational costs to unlock the value of this data for timely insights and do so with a consistent performance.

With this massive data growth, data proliferation across your data stores, data warehouse, and data lakes can become equally challenging. With a modern data architecture on AWS, you can rapidly build scalable data lakes; use a broad and deep collection of purpose-built data services; ensure compliance via unified data access, security, and governance; scale your systems at a low cost without compromising performance; and share data across organizational boundaries with ease, allowing you to make decisions with speed and agility at scale.

You can take all your data from various silos, aggregate that data in your data lake, and perform analytics and machine learning (ML) directly on top of that data. You can also store other data in purpose-built data stores to analyze and get fast insights from both structured and unstructured data. This data movement can be inside-out, outside-in, around the perimeter, or shared across boundaries.

For example, application logs and traces from web applications can be collected directly in a data lake, and a portion of that data can be moved out to a log analytics store like Amazon OpenSearch Service for daily analysis. We think of this concept as inside-out data movement. The analyzed and aggregated data stored in Amazon OpenSearch Service can again be moved to the data lake to run ML algorithms for downstream consumption from applications. We refer to this concept as outside-in data movement.

Let’s look at an example use case. Example Corp. is a leading Fortune 500 company that specializes in social content. They have hundreds of applications generating data and traces at approximately 500 TB per day and have the following criteria:

  • Have logs available for fast analytics for 2 days
  • Beyond 2 days, have data available in a storage tier that can be made available for analytics with a reasonable SLA
  • Retain the data beyond 1 week in cold storage for 30 days (for purposes of compliance, auditing, and others)

In the following sections, we discuss three possible solutions to address similar use cases:

  • Tiered storage in Amazon OpenSearch Service and data lifecycle management
  • On-demand ingestion of logs using Amazon OpenSearch Ingestion
  • Amazon OpenSearch Service direct queries with Amazon Simple Storage Service (Amazon S3)

Solution 1: Tiered storage in OpenSearch Service and data lifecycle management

OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budgeting requirements, you can choose the best strategy to balance cost and performance. You can also migrate data between different storage tiers.

Hot storage is used for indexing and updating, and provides the fastest access to data. Hot storage takes the form of an instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node.

UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and doesn’t need the same performance as hot storage. UltraWarm nodes use Amazon S3 with related caching solutions to improve performance.

Cold storage is optimized to store infrequently accessed or historical data. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data.

For more details on data tiers within OpenSearch Service, refer to Choose the right storage tier for your needs in Amazon OpenSearch Service.

Solution overview

The workflow for this solution consists of the following steps:

  1. Incoming data generated by the applications is streamed to an S3 data lake.
  2. Data is ingested into OpenSearch Service using S3-SQS near-real-time ingestion through notifications set up on the S3 buckets.
  3. After 2 days, hot data is migrated to UltraWarm storage to support read queries.
  4. After 5 days in UltraWarm, the data is migrated to cold storage for 21 days and detached from any compute. The data can be reattached to UltraWarm when needed. Data is deleted from cold storage after 21 days.
  5. Daily indexes are maintained for easy rollover. An Index State Management (ISM) policy automates the rollover or deletion of indexes that are older than 2 days.

The following is a sample ISM policy that rolls over data into the UltraWarm tier after 2 days, moves it to cold storage after 5 days, and deletes it from cold storage after 21 days:

{
    "policy": {
        "description": "hot warm delete workflow",
        "default_state": "hot",
        "schema_version": 1,
        "states": [
            {
                "name": "hot",
                "actions": [
                    {
                        "rollover": {
                            "min_index_age": "2d",
                            "min_primary_shard_size": "30gb"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm"
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "replica_count": {
                            "number_of_replicas": 5
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "5d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "cold_migration": {
                            "start_time": null,
                            "end_time": null,
                            "timestamp_field": "@timestamp",
                            "ignore": "none"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "21d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "retry": {
                            "count": 3,
                            "backoff": "exponential",
                            "delay": "1m"
                        },
                        "cold_delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": {
            "index_patterns": [
                "log*"
            ],
            "priority": 100
        }
    }
}

Considerations

UltraWarm uses sophisticated caching techniques to enable querying for infrequently accessed data. Although the data access is infrequent, the compute for UltraWarm nodes needs to be running all the time to make this access possible.

When operating at PB scale, to reduce the area of effect of any errors, we recommend decomposing the implementation into multiple OpenSearch Service domains when using tiered storage.

The next two patterns remove the need to have long-running compute and describe on-demand techniques where the data is either brought when needed or queried directly where it resides.

Solution 2: On-demand ingestion of logs data through OpenSearch Ingestion

OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.

With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. You configure your data producers to send data to OpenSearch Ingestion. It automatically delivers the data to the domain or collection that you specify. You can also configure OpenSearch Ingestion to transform your data before delivering it. OpenSearch Ingestion is serverless, so you don’t need to worry about scaling your infrastructure, operating your ingestion fleet, and patching or updating the software.

There are two ways that you can use Amazon S3 as a source to process data with OpenSearch Ingestion. The first option is S3-SQS processing. You can use S3-SQS processing when you require near-real-time scanning of files after they are written to S3. It requires an Amazon Simple Queue Service (Amazon SQS) queue that receives S3 Event Notifications. You can configure S3 buckets to raise an event any time an object is stored or modified within the bucket to be processed.

Alternatively, you can use a one-time or recurring scheduled scan to batch process data in an S3 bucket. To set up a scheduled scan, configure your pipeline with a schedule at the scan level that applies to all your S3 buckets, or at the bucket level. You can configure scheduled scans with either a one-time scan or a recurring scan for batch processing.

For a comprehensive overview of OpenSearch Ingestion, see Amazon OpenSearch Ingestion. For more information about the Data Prepper open source project, visit Data Prepper.

Solution overview

We present an architecture pattern with the following key components:

  • Application logs are streamed into the data lake, which helps feed hot data into OpenSearch Service in near-real time using OpenSearch Ingestion S3-SQS processing.
  • ISM policies within OpenSearch Service handle index rollovers or deletions. ISM policies let you automate these periodic, administrative operations by triggering them based on changes in the index age, index size, or number of documents. For example, you can define a policy that moves your index into a read-only state after 2 days and then deletes it after a set period of 3 days.
  • Cold data is available in the S3 data lake to be consumed on demand into OpenSearch Service using OpenSearch Ingestion scheduled scans.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Incoming data generated by the applications is streamed to the S3 data lake.
  2. For the current day, data is ingested into OpenSearch Service using S3-SQS near-real-time ingestion through notifications set up in the S3 buckets.
  3. Daily indexes are maintained for easy rollover. An ISM policy automates the rollover or deletion of indexes that are older than 2 days.
  4. If a request is made for analysis of data beyond 2 days and the data is not in the UltraWarm tier, data will be ingested using the OpenSearch Ingestion one-time Amazon S3 scan feature for the specific time window.

For example, if the present day is January 10, 2024, and you need data from January 6, 2024 at a specific interval for analysis, you can create an OpenSearch Ingestion pipeline with an Amazon S3 scan in your YAML configuration, with the start_time and end_time to specify when you want the objects in the bucket to be scanned:

version: "2"
ondemand-ingest-pipeline:
  source:
    s3:
      codec:
        newline:
      compression: "gzip"
      scan:
        start_time: 2023-12-28T01:00:00
        end_time: 2023-12-31T09:00:00
        buckets:
          - bucket:
              name: <bucket-name>
      aws:
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::<acct num>:role/PipelineRole"
    
    acknowledgments: true
  processor:
    - parse_json:
    - date:
        from_time_received: true
        destination: "@timestamp"           
  sink:
    - opensearch:                  
        index: "logs_ondemand_20231231"
        hosts: [ "https://search-XXXX-domain-XXXXXXXXXX.us-east-1.es.amazonaws.com" ]
        aws:                  
          sts_role_arn: "arn:aws:iam::<acct num>:role/PipelineRole"
          region: "us-east-1"

Considerations

Take advantage of compression

Data in Amazon S3 can be compressed, which reduces your overall data footprint and results in significant cost savings. For example, if you are generating 15 PB of raw JSON application logs per month, you can use a compression mechanism like GZIP to reduce the size to approximately 1 PB or less.
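
A quick way to estimate the achievable ratio is to compress a representative sample of your own logs and compare sizes, as in the following sketch (the file name is a placeholder):

import gzip
import os
import shutil

src = "sample-logs.jsonl"  # a representative sample of your raw JSON logs

# Compress the sample with GZIP
with open(src, "rb") as f_in, gzip.open(src + ".gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

ratio = os.path.getsize(src) / os.path.getsize(src + ".gz")
print(f"Compression ratio: {ratio:.1f}x")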

Stop the pipeline when possible

OpenSearch Ingestion scales automatically between the minimum and maximum OCUs set for the pipeline. After the pipeline has completed the Amazon S3 scan for the specified duration mentioned in the pipeline configuration, the pipeline continues to run for continuous monitoring at the minimum OCUs.

For on-demand ingestion for past time durations where you don’t expect new objects to be created, consider using supported pipeline metrics such as recordsOut.count to create Amazon CloudWatch alarms that can stop the pipeline. For a list of supported metrics, refer to Monitoring pipeline metrics.

CloudWatch alarms perform an action when a CloudWatch metric exceeds a specified value for some amount of time. For example, you might want to monitor recordsOut.count to be 0 for longer than 5 minutes to initiate a request to stop the pipeline through the AWS Command Line Interface (AWS CLI) or API.
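
As a hedged sketch, the following snippet creates such an alarm with boto3 and shows the OSIS StopPipeline call that an alarm-triggered AWS Lambda function could make; the metric namespace and dimension names are assumptions to verify against your pipeline’s published metrics:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when no records leave the pipeline for 5 minutes
cloudwatch.put_metric_alarm(
    AlarmName="ondemand-ingest-idle",
    Namespace="AWS/OSIS",                  # assumed namespace for pipeline metrics
    MetricName="recordsOut.count",
    Dimensions=[{"Name": "PipelineName", "Value": "ondemand-ingest-pipeline"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="LessThanOrEqualToThreshold",
    TreatMissingData="breaching",
    # In practice, add AlarmActions (for example, an SNS topic that triggers
    # a Lambda function) so the alarm can act automatically.
)

# The call the triggered function would make to stop the pipeline
boto3.client("osis").stop_pipeline(PipelineName="ondemand-ingest-pipeline")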

Solution 3: OpenSearch Service direct queries with Amazon S3

OpenSearch Service direct queries with Amazon S3 (preview) is a new way to query operational logs in Amazon S3 and S3 data lakes without needing to switch between services. You can now analyze infrequently queried data in cloud object stores and simultaneously use the operational analytics and visualization capabilities of OpenSearch Service.

OpenSearch Service direct queries with Amazon S3 provides zero-ETL integration to reduce the operational complexity of duplicating data or managing multiple analytics tools by enabling you to directly query your operational data, reducing costs and time to action. This zero-ETL integration is configurable within OpenSearch Service, where you can take advantage of various log type templates, including predefined dashboards, and configure data accelerations tailored to that log type. Templates include VPC Flow Logs, Elastic Load Balancing logs, and NGINX logs, and accelerations include skipping indexes, materialized views, and covered indexes.

With OpenSearch Service direct queries with Amazon S3, you can perform complex queries that are critical to security forensics and threat analysis and correlate data across multiple data sources, which aids teams in investigating service downtime and security events. After you create an integration, you can start querying your data directly from OpenSearch Dashboards or the OpenSearch API. You can audit connections to ensure that they are set up in a scalable, cost-efficient, and secure way.

Direct queries from OpenSearch Service to Amazon S3 use Spark tables within the AWS Glue Data Catalog. After the table is cataloged in your AWS Glue metadata catalog, you can run queries directly on your data in your S3 data lake through OpenSearch Dashboards.

Solution overview

The following diagram illustrates the solution architecture.

This solution consists of the following key components:

  • The hot data for the current day is stream processed into OpenSearch Service domains through the event-driven architecture pattern using the OpenSearch Ingestion S3-SQS processing feature
  • The hot data lifecycle is managed through ISM policies attached to daily indexes
  • The cold data resides in your Amazon S3 bucket, and is partitioned and cataloged

The following screenshot shows a sample http_logs table that is cataloged in the AWS Glue metadata catalog. For detailed steps, refer to Data Catalog and crawlers in AWS Glue.

Before you create a data source, you should have an OpenSearch Service domain with version 2.11 or later and a target S3 table in the AWS Glue Data Catalog with the appropriate AWS Identity and Access Management (IAM) permissions. The IAM role needs access to the desired S3 buckets and read and write access to the AWS Glue Data Catalog. The following is a sample role and trust policy with appropriate permissions to access the AWS Glue Data Catalog through OpenSearch Service:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "directquery.opensearchservice.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

The following is a sample custom policy with access to Amazon S3 and AWS Glue:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": "es:ESHttp*",
            "Resource": "arn:aws:es:*:<acct_num>:domain/*"
        },
        {
            "Sid": "Statement2",
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*",
                "s3:Describe*"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ]
        },
        {
            "Sid": "GlueCreateAndReadDataCatalog",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:GetDatabases",
                "glue:CreateTable",
                "glue:GetTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:CreatePartition",
                "glue:BatchCreatePartition",
                "glue:GetUserDefinedFunctions"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:<acct_num>:catalog",
                "arn:aws:glue:us-east-1:<acct_num>:database/*",
                "arn:aws:glue:us-east-1:<acct_num>:table/*"
            ]
        }
    ]
}

To create a new data source on the OpenSearch Service console, provide the name of your new data source, specify the data source type as Amazon S3 with the AWS Glue Data Catalog, and choose the IAM role for your data source.

After you create a data source, you can go to OpenSearch Dashboards for the domain, which you use to configure access control, define tables, set up log type-based dashboards for popular log types, and query your data.

After you set up your tables, you can query your data in your S3 data lake through OpenSearch Dashboards. You can run a sample SQL query for the http_logs table you cataloged in the AWS Glue Data Catalog, as shown in the following screenshot.

Best practices

Ingest only the data you need

Work backward from your business needs and establish the right datasets you’ll need. Evaluate if you can avoid ingesting noisy data and ingest only curated, sampled, or aggregated data. Using these cleaned and curated datasets will help you optimize the compute and storage resources needed to ingest this data.

Reduce the size of data before ingestion

When you design your data ingestion pipelines, use strategies such as compression, filtering, and aggregation to reduce the size of the ingested data. This reduces the amount of data transferred over the network and stored in your data layer.
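
As a small illustration, and assuming your pipeline's S3 source is configured to read gzip-compressed objects, compressing log files before uploading them to Amazon S3 shrinks both the network transfer and the stored objects (the bucket name and prefix are placeholders):

# Compress the log file, then upload the smaller object to S3
gzip -9 app-2024-01-23.log
aws s3 cp app-2024-01-23.log.gz s3://<bucket-name>/logs/year=2024/month=01/day=23/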

Conclusion

In this post, we discussed solutions that enable petabyte-scale log analytics using OpenSearch Service in a modern data architecture. You learned how to create a serverless ingestion pipeline to deliver logs to an OpenSearch Service domain, manage indexes through ISM policies, configure IAM permissions to start using OpenSearch Ingestion, and create the pipeline configuration for data in your data lake. You also learned how to set up and use the OpenSearch Service direct queries with Amazon S3 feature (preview) to query data from your data lake.

To choose the right architecture pattern for your workloads when using OpenSearch Service at scale, consider performance, latency, cost, and data volume growth over time:

  • Use a tiered storage architecture with Index State Management policies when you need fast access to your hot data and want to balance cost and performance with UltraWarm nodes for read-only data.
  • Use on-demand ingestion of your data into OpenSearch Service when you can tolerate ingestion latency for querying data that isn't retained on your hot nodes. You can achieve significant cost savings by storing compressed data in Amazon S3 and ingesting it on demand into OpenSearch Service.
  • Use the direct queries with Amazon S3 feature when you want to directly analyze your operational logs in Amazon S3 with the rich analytics and visualization features of OpenSearch Service.

As a next step, refer to the Amazon OpenSearch Developer Guide to explore logs and metric pipelines that you can use to build a scalable observability solution for your enterprise applications.


About the Authors

Jagadish Kumar (Jag) is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service. He is deeply passionate about Data Architecture and helps customers build analytics solutions at scale on AWS.


Muthu Pitchaimani is a Senior Specialist Solutions Architect with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.


Sam Selvan is a Principal Specialist Solution Architect with Amazon OpenSearch Service.

Top Architecture Blog Posts of 2023

Post Syndicated from Andrea Courtright original https://aws.amazon.com/blogs/architecture/top-architecture-blog-posts-of-2023/

2023 was a rollercoaster year in tech, and we at the AWS Architecture Blog feel so fortunate to have shared in the excitement. As we move into 2024 and all of the new technologies we could see, we want to take a moment to highlight the brightest stars from 2023.

As always, thanks to our readers and to the many talented and hardworking Solutions Architects and other contributors to our blog.

I give you our 2023 cream of the crop!

#10: Build a serverless retail solution for endless aisle on AWS

In this post, Sandeep and Shashank help retailers and their customers alike in this guided approach to finding inventory that doesn’t live on shelves.

Figure 1. Building endless aisle architecture for order processing

Check it out!

#9: Optimizing data with automated intelligent document processing solutions

Who else dreads wading through large amounts of data in multiple formats? Just me? I didn’t think so. Using Amazon AI/ML and content-reading services, Deependra, Anirudha, Bhajandeep, and Senaka have created a solution that is scalable and cost-effective to help you extract the data you need and store it in a format that works for you.

Figure 2: AI-based intelligent document processing engine

Check it out!

#8: Disaster Recovery Solutions with AWS managed services, Part 3: Multi-Site Active/Passive

Disaster recovery posts are always popular, and this post by Brent and Dhruv is no exception. Their creative approach in part 3 of this series is most helpful for customers who have business-critical workloads with higher availability requirements.

Figure 3. Warm standby with managed services

Check it out!

#7: Simulating Kubernetes-workload AZ failures with AWS Fault Injection Simulator

Continuing with the theme of “when bad things happen,” we have Siva, Elamaran, and Re’s post about preparing for workload failures. If resiliency is a concern (and it really should be), the secret is test, test, TEST.

Figure 4. Architecture flow for Microservices to simulate a realistic failure scenario

Check it out!

#6: Let’s Architect! Designing event-driven architectures

Luca, Laura, Vittorio, and Zamira weren’t content with their four top-10 spots last year – they’re back with some things you definitely need to know about event-driven architectures.

Figure 5. Let’s Architect artwork

Check it out!

#5: Use a reusable ETL framework in your AWS lake house architecture

As your lake house increases in size and complexity, you could find yourself facing maintenance challenges, and Ashutosh and Prantik have a solution: frameworks! The reusable ETL template with AWS Glue templates might just save you a headache or three.

Figure 6. Reusable ETL framework architecture

Check it out!

#4: Invoking asynchronous external APIs with AWS Step Functions

It’s possible that AWS’ menagerie of services doesn’t have everything you need to run your organization. (Possible, but not likely; we have a lot of amazing services.) If you are using third-party APIs, then Jorge, Hossam, and Shirisha’s architecture can help you maintain a secure, reliable, and cost-effective relationship among all involved.

Figure 7. Invoking Asynchronous External APIs architecture

Check it out!

#3: Announcing updates to the AWS Well-Architected Framework

The Well-Architected Framework continues to help AWS customers evaluate their architectures against its six pillars. They are constantly striving for improvement, and Haleh’s diligence in keeping us up to date has not gone unnoticed. Thank you, Haleh!

Figure 8. Well-Architected logo

Check it out!

#2: Let’s Architect! Designing architectures for multi-tenancy

The practically award-winning Let’s Architect! series strikes again! This time, Luca, Laura, Vittorio, and Zamira were joined by Federica to discuss multi-tenancy and why that concept is so crucial for SaaS providers.

Figure 9. Let’s Architect

Check it out!

And finally…

#1: Understand resiliency patterns and trade-offs to architect efficiently in the cloud

Haresh, Lewis, and Bonnie revamped this 2022 post into a masterpiece that completely stole our readers’ hearts and is among the top posts we’ve ever made!

Figure 10. Resilience patterns and trade-offs

Check it out!

Bonus! Three older special mentions

These three posts were published before 2023, but we think they deserve another round of applause because you, our readers, keep coming back to them.

Thanks again to everyone for their contributions during a wild year. We hope you’re looking forward to the rest of 2024 as much as we are!

Introducing Terraform support for Amazon OpenSearch Ingestion

Post Syndicated from Rahul Sharma original https://aws.amazon.com/blogs/big-data/introducing-terraform-support-for-amazon-opensearch-ingestion/

Today, we are launching Terraform support for Amazon OpenSearch Ingestion. Terraform is an infrastructure as code (IaC) tool that helps you build, deploy, and manage cloud resources efficiently. OpenSearch Ingestion is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and Amazon OpenSearch Serverless collections. In this post, we explain how you can use Terraform to deploy OpenSearch Ingestion pipelines. As an example, we use an HTTP source as input and an Amazon OpenSearch Service domain (Index) as output.

Solution overview

The steps in this post deploy a publicly accessible OpenSearch Ingestion pipeline with Terraform, along with the other supporting resources the pipeline needs to ingest data into Amazon OpenSearch Service. We have implemented Tutorial: Ingesting data into a domain using Amazon OpenSearch Ingestion, using Terraform.

We create the following resources with Terraform:

  • An IAM role for the pipeline to assume, with the policies it needs to write to the domain
  • An OpenSearch Service domain that serves as the sink
  • An Amazon CloudWatch log group for the pipeline logs
  • The OpenSearch Ingestion pipeline itself

The pipeline that you create exposes an HTTP source as input and an Amazon OpenSearch sink to save batches of events.

Prerequisites

To follow the steps in this post, you need the following:

  • An active AWS account.
  • Terraform installed on your local machine. For more information, see Install Terraform.
  • The necessary IAM permissions required to create the AWS resources using Terraform.
  • awscurl for sending HTTPS requests through the command line with AWS Sigv4 authentication. For instructions on installing this tool, see the GitHub repo.

Create a directory

In Terraform, infrastructure is managed as code, called a project. A Terraform project contains various Terraform configuration files, such as main.tf, provider.tf, variables.tf, and output.tf. Let's create a directory on the server or machine that we can use to connect to AWS services using the AWS Command Line Interface (AWS CLI):

mkdir osis-pipeline-terraform-example

Change to the directory:

cd osis-pipeline-terraform-example

Create the Terraform configuration

Create a file to define the AWS resources:

touch main.tf

Enter the following configuration in main.tf and save your file:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.36"
    }
  }

  required_version = ">= 1.2.0"
}

provider "aws" {
  region = "eu-central-1"
}

data "aws_region" "current" {}
data "aws_caller_identity" "current" {}
locals {
    account_id = data.aws_caller_identity.current.account_id
}

output "ingest_endpoint_url" {
  value = tolist(aws_osis_pipeline.example.ingest_endpoint_urls)[0]
}

resource "aws_iam_role" "example" {
  name = "exampleosisrole"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Sid    = ""
        Principal = {
          Service = "osis-pipelines.amazonaws.com"
        }
      },
    ]
  })
}

resource "aws_opensearch_domain" "test" {
  domain_name           = "osi-example-domain"
  engine_version = "OpenSearch_2.7"
  cluster_config {
    instance_type = "r5.large.search"
  }
  encrypt_at_rest {
    enabled = true
  }
  domain_endpoint_options {
    enforce_https       = true
    tls_security_policy = "Policy-Min-TLS-1-2-2019-07"
  }
  node_to_node_encryption {
    enabled = true
  }
  ebs_options {
    ebs_enabled = true
    volume_size = 10
  }
 access_policies = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "${aws_iam_role.example.arn}"
      },
      "Action": "es:*"
    }
  ]
}

EOF

}

resource "aws_iam_policy" "example" {
  name = "osis_role_policy"
  description = "Policy for OSIS pipeline role"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
        {
          Action = ["es:DescribeDomain"]
          Effect = "Allow"
          Resource = "arn:aws:es:${data.aws_region.current.name}:${local.account_id}:domain/*"
        },
        {
          Action = ["es:ESHttp*"]
          Effect = "Allow"
          Resource = "arn:aws:es:${data.aws_region.current.name}:${local.account_id}:domain/osi-test-domain/*"
        }
    ]
})
}

resource "aws_iam_role_policy_attachment" "example" {
  role       = aws_iam_role.example.name
  policy_arn = aws_iam_policy.example.arn
}

resource "aws_cloudwatch_log_group" "example" {
  name = "/aws/vendedlogs/OpenSearchIngestion/example-pipeline"
  retention_in_days = 365
  tags = {
    Name = "AWS Blog OSIS Pipeline Example"
  }
}

resource "aws_osis_pipeline" "example" {
  pipeline_name               = "example-pipeline"
  pipeline_configuration_body = <<-EOT
            version: "2"
            example-pipeline:
              source:
                http:
                  path: "/test_ingestion_path"
              processor:
                - date:
                    from_time_received: true
                    destination: "@timestamp"
              sink:
                - opensearch:
                    hosts: ["https://${aws_opensearch_domain.test.endpoint}"]
                    index: "application_logs"
                    aws:
                      sts_role_arn: "${aws_iam_role.example.arn}"   
                      region: "${data.aws_region.current.name}"
        EOT
  max_units                   = 1
  min_units                   = 1
  log_publishing_options {
    is_logging_enabled = true
    cloudwatch_log_destination {
      log_group = aws_cloudwatch_log_group.example.name
    }
  }
  tags = {
    Name = "AWS Blog OSIS Pipeline Example"
  }
}

Create the resources

Initialize the directory:

terraform init

Review the plan to see what resources will be created:

terraform plan

Apply the configuration and answer yes to run the plan:

terraform apply

The process might take around 7–10 minutes to complete.

Test the pipeline

After you create the resources, you should see the ingest_endpoint_url output displayed. Copy this value and export it in your environment variable:

export OSIS_PIPELINE_ENDPOINT_URL=<Replace with value copied>
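
Alternatively, you can capture the output directly with the Terraform CLI instead of copying it by hand:

export OSIS_PIPELINE_ENDPOINT_URL=$(terraform output -raw ingest_endpoint_url)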

Send a sample log with awscurl. If you're not using your default AWS profile, add the --profile flag with the appropriate profile for credentials:

awscurl --service osis --region eu-central-1 -X POST -H "Content-Type: application/json" -d '[{"time":"2014-08-11T11:40:13+00:00","remote_addr":"122.226.223.69","status":"404","request":"GET http://www.k2proxy.com//hello.html HTTP/1.1","http_user_agent":"Mozilla/4.0 (compatible; WOW64; SLCC2;)"}]' https://$OSIS_PIPELINE_ENDPOINT_URL/test_ingestion_path

You should receive a 200 OK as a response.

To verify that the data was ingested through the OpenSearch Ingestion pipeline and saved in the OpenSearch Service domain, navigate to the domain on the OpenSearch Service console and get its endpoint. Replace <OPENSEARCH ENDPOINT URL> in the following snippet and run it:

awscurl --service es --region eu-central-1 -X GET https://<OPENSEARCH ENDPOINT URL>/application_logs/_search | json_pp 

You should see the ingested log document in the search response.

Clean up

To destroy the resources you created, run the following command and answer yes when prompted:

terraform destroy

The process might take around 30–35 minutes to complete.

Conclusion

In this post, we showed how you can use Terraform to deploy OpenSearch Ingestion pipelines. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion and Terraform. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. OpenSearch Ingestion blueprints allow you to build data pipelines with minimal configuration changes and manage them with ease using Terraform. To learn more, check out the Terraform documentation for Amazon OpenSearch Ingestion.


About the Authors

Rahul Sharma is a Technical Account Manager at Amazon Web Services. He is passionate about the data technologies that help leverage data as a strategic asset and is based out of New York City, New York.

Farhan Angullia is a Cloud Application Architect at AWS Professional Services, based in Singapore. He primarily focuses on modern applications with microservice software patterns, and advocates for implementing robust CI/CD practices to optimize the software delivery lifecycle for customers. He enjoys contributing to the open source Terraform ecosystem in his spare time.

Arjun Nambiar is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-native technologies and is based out of Seattle, Washington.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Use Amazon OpenSearch Ingestion to migrate to Amazon OpenSearch Serverless

Post Syndicated from Muthu Pitchaimani original https://aws.amazon.com/blogs/big-data/use-amazon-opensearch-ingestion-to-migrate-to-amazon-opensearch-serverless/

Amazon OpenSearch Serverless is an on-demand auto scaling configuration for Amazon OpenSearch Service. Since its release, interest in OpenSearch Serverless has been steadily growing. Customers prefer to let the service manage its capacity automatically rather than having to manually provision capacity. Until now, customers have had to rely on custom code or third-party solutions to move data between provisioned OpenSearch Service domains and OpenSearch Serverless.

We recently introduced a feature with Amazon OpenSearch Ingestion (OSI) to make this migration even more effortless. OSI is a fully managed, serverless data collector that delivers real-time log, metric, and trace data to OpenSearch Service domains and OpenSearch Serverless collections.

In this post, we outline the steps to migrate data between provisioned OpenSearch Service domains and OpenSearch Serverless. Migration of metadata such as security roles and dashboard objects will be covered in a subsequent post.

Solution overview

The following diagram shows the necessary components for moving data between OpenSearch Service provisioned domains and OpenSearch Serverless using OSI. You will use OSI with an OpenSearch Service domain as the source and an OpenSearch Serverless collection as the sink.

Prerequisites

Before getting started, complete the following steps to create the necessary resources:

  1. Create an AWS Identity and Access Management (IAM) role that the OpenSearch Ingestion pipeline will assume to write to the OpenSearch Serverless collection. This role needs to be specified in the sts_role_arn parameter of the pipeline configuration.
  2. Attach a permissions policy to the role to allow it to read data from the OpenSearch Service domain. The following is a sample policy with least privileges:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Action":"es:ESHttpGet",
             "Resource":[
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_cat/indices",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/scroll",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search"
             ]
          },
          {
             "Effect":"Allow",
             "Action":"es:ESHttpPost",
             "Resource":[
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search/point_in_time",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/*/_search/scroll"
             ]
          },
          {
             "Effect":"Allow",
             "Action":"es:ESHttpDelete",
             "Resource":[
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/point_in_time",
                "arn:aws:es:us-east-1:{account-id}:domain/{domain-name}/_search/scroll"
             ]
          }
       ]
    }

  3. Attach a permissions policy to the role to allow it to send data to the collection. The following is a sample policy with least privileges:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "aoss:BatchGetCollection",
            "aoss:APIAccessAll"
          ],
          "Effect": "Allow",
          "Resource": "arn:aws:aoss:{region}:{your-account-id}:collection/{collection-id}"
        },
        {
          "Action": [
            "aoss:CreateSecurityPolicy",
            "aoss:GetSecurityPolicy",
            "aoss:UpdateSecurityPolicy"
          ],
          "Effect": "Allow",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "aoss:collection": "{collection-name}"
            }
          }
        }
      ]
    }

  4. Configure the role's trust relationship as follows, so that OpenSearch Ingestion can assume it:
    {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Principal": {
                        "Service": "osis-pipelines.amazonaws.com"
                    },
                    "Action": "sts:AssumeRole"
                }
            ]
        }

  5. It’s recommended to add the aws:SourceAccount and aws:SourceArn condition keys to the policy for protection against the confused deputy problem:
    "Condition": {
        "StringEquals": {
            "aws:SourceAccount": "{your-account-id}"
        },
        "ArnLike": {
            "aws:SourceArn": "arn:aws:osis:{region}:{your-account-id}:pipeline/*"
        }
    }

  6. Map the OpenSearch Ingestion pipeline role ARN as a backend role on the domain (we show a simplified example that uses the all_access role; a sketch of this mapping appears after this list). For production scenarios, make sure to use a role with just enough permissions to read and write.
  7. Create an OpenSearch Serverless collection, which is where data will be ingested.
  8. Associate a data policy, as shown in the following code, to grant the OpenSearch Ingestion role permissions on the collection:
    [
      {
        "Rules": [
          {
            "Resource": [
              "index/collection-name/*"
            ],
            "Permission": [
              "aoss:CreateIndex",
              "aoss:UpdateIndex",
              "aoss:DescribeIndex",
              "aoss:WriteDocument",
            ],
            "ResourceType": "index"
          }
        ],
        "Principal": [
          "arn:aws:iam::{account-id}:role/pipeline-role"
        ],
        "Description": "Pipeline role access"
      }
    ]

  9. If the collection is defined as a VPC collection, you need to create a network policy and configure it in the ingestion pipeline.
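
The following is a sketch of the backend role mapping from step 6, using the OpenSearch Security REST API through awscurl. It assumes fine-grained access control is enabled on the domain and that your credentials map to a security admin; note that PUT replaces the existing mapping for the role, so include any backend roles and users that are already mapped:

awscurl --service es --region us-east-1 -X PUT "https://<domain-endpoint>/_plugins/_security/api/rolesmapping/all_access" -H "Content-Type: application/json" -d '
{
  "backend_roles": ["arn:aws:iam::{account-id}:role/pipeline-role"]
}'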

Now you’re ready to move data from your provisioned domain to OpenSearch Serverless.

Move data from provisioned domains to Serverless

Set up Amazon OpenSearch Ingestion

To get started, you must have an active OpenSearch Service domain (source) and OpenSearch Serverless collection (sink). Complete the following steps to set up an OpenSearch Ingestion pipeline for migration:

  1. On the OpenSearch Service console, choose Pipeline under Ingestion in the navigation pane.
  2. Choose Create a pipeline.
  3. For Pipeline name, enter a name (for example, octank-migration).
  4. For Pipeline capacity, you can define the minimum and maximum capacity to scale up the resources. For now, you can leave the default minimum as 1 and maximum as 4.
  5. For Configuration Blueprint, select AWS-OpenSearchDataMigrationPipeline.
  6. Update the following information for the source:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Service endpoint.
    2. Uncomment distribution_version if your source cluster is an OpenSearch Service cluster with compatibility mode enabled; otherwise, leave it commented.
    3. Uncomment indices, include, index_name_regex, and add an index name or pattern that you want to migrate (for example, octank-iot-logs-2023.11.0*).
    4. Update region under aws where your source domain is (for example, us-west-2).
    5. Update sts_role_arn under aws to the role that has permission to read data from the OpenSearch Service domain (for example, arn:aws:iam::111122223333:role/osis-pipeline). This role should be added as a backend role within the OpenSearch Service security roles.
  7. Update the following information for the sink:
    1. Uncomment hosts and specify the endpoint of the existing OpenSearch Serverless endpoint.
    2. Update sts_role_arn under aws to the role that has permission to write data into the OpenSearch Serverless collection (for example, arn:aws:iam::111122223333:role/osis-pipeline). This role should be added as part of the data access policy in the OpenSearch Serverless collection.
    3. Update the serverless flag to be true.
    4. For index, you can leave it as default, which will get the metadata from the source index and write to the same name in the destination as of the sources. Alternatively, if you want to have a different index name at the destination, modify this value with your desired name.
    5. For document_id, you can get the ID from the document metadata in the source and use the same in the target. Note that custom document IDs are supported only for the SEARCH type of collection; if you have your collection as TIMESERIES or VECTORSEARCH, you should comment this line.
  8. Next, you can validate your pipeline to check the connectivity of source and sink to make sure the endpoint exists and is accessible.
  9. For Network settings, choose your preferred setting:
    1. Choose VPC access and select your VPC, subnet, and security group to set up the access privately.
    2. Choose Public to use public access. AWS recommends that you use a VPC endpoint for all production workloads, but for this walkthrough, select Public.
  10. For Log Publishing Option, you can either create a new Amazon CloudWatch group or use an existing CloudWatch group to write the ingestion logs. This provides access to information about errors and warnings raised during the operation, which can help during troubleshooting. For this walkthrough, choose Create new group.
  11. Choose Next, and verify the details you specified for your pipeline settings.
  12. Choose Create pipeline.

It should take a couple of minutes to create the ingestion pipeline.
The following graphic gives a quick demonstration of creating the OpenSearch Ingestion pipeline via the preceding steps.

Verify ingested data in the target OpenSearch Serverless collection

After the pipeline is created and active, log in to OpenSearch Dashboards for your OpenSearch Serverless collection and run the following command to list the indexes:

GET _cat/indices?v
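
If you prefer the command line, a roughly equivalent check uses awscurl against the collection endpoint (a placeholder below), signing requests for the aoss service:

awscurl --service aoss --region <region> -X GET "https://<collection-endpoint>/_cat/indices?v"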

The following graphic gives a quick demonstration of listing the indexes before and after the pipeline becomes active.

Conclusion

In this post, we saw how OpenSearch Ingestion can ingest data into an OpenSearch Serverless collection without the need for third-party solutions. With minimal data producer configuration, it automatically ingested data into the collection. OSI also allows you to transform or reindex data from Elasticsearch version 7.x before ingestion into an OpenSearch Service domain or OpenSearch Serverless collection. OSI eliminates the need to provision, scale, or manage servers. AWS offers various resources for you to quickly start building pipelines using OpenSearch Ingestion. You can use various built-in pipeline integrations to quickly ingest data from Amazon DynamoDB, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Security Lake, Fluent Bit, and many more. OpenSearch Ingestion blueprints enable you to build data pipelines with minimal configuration changes.


About the Authors

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Prashant Agrawal is a Sr. Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Rahul Sharma is a Technical Account Manager at Amazon Web Services. He is passionate about the data technologies that help leverage data as a strategic asset and is based out of New York City, New York.

Enable advanced search capabilities for Amazon Keyspaces data by integrating with Amazon OpenSearch Service

Post Syndicated from Rajesh Kantamani original https://aws.amazon.com/blogs/big-data/enable-advanced-search-capabilities-for-amazon-keyspaces-data-by-integrating-with-amazon-opensearch-service/

Amazon Keyspaces (for Apache Cassandra) is a fully managed, serverless, and Apache Cassandra-compatible database service offered by AWS. It caters to developers in need of a highly available, durable, and fast NoSQL database backend. When you start the process of designing your data model for Amazon Keyspaces, it’s essential to possess a comprehensive understanding of your access patterns, similar to the approach used in other NoSQL databases. This allows for the uniform distribution of data across all partitions within your table, thereby enabling your applications to achieve optimal read and write throughput. In cases where your application demands supplementary query features, such as conducting full-text searches on the data stored in a table, you may explore the utilization of alternative services like Amazon OpenSearch Service to meet these particular needs.

Amazon OpenSearch Service is a powerful and fully managed search and analytics service. It empowers businesses to explore and gain insights from large volumes of data quickly. OpenSearch Service is versatile, allowing you to perform text and geospatial searches. Amazon OpenSearch Ingestion is a fully managed, serverless data collection solution that efficiently routes data to your OpenSearch Service domains and Amazon OpenSearch Serverless collections. It eliminates the need for third-party tools to ingest data into your OpenSearch service setup. You simply configure your data sources to send information to OpenSearch Ingestion, which then automatically delivers the data to your specified destination. Additionally, you can configure OpenSearch Ingestion to apply data transformations before delivery.

In this post, we explore the process of integrating Amazon Keyspaces and Amazon OpenSearch Service using AWS Lambda and Amazon OpenSearch Ingestion to enable advanced search capabilities. The content includes a reference architecture, a step-by-step guide on infrastructure setup, sample code for implementing the solution within a use case, and an AWS Cloud Development Kit (AWS CDK) application for deployment.

Solution overview

AnyCompany, a rapidly growing eCommerce platform, faces a critical challenge in efficiently managing its extensive product and item catalog while enhancing the shopping experience for its customers. Currently, customers struggle to find specific products quickly due to limited search capabilities. AnyCompany aims to address this issue by implementing advanced search functionality that enables customers to easily search for the products. This enhancement is expected to significantly improve customer satisfaction and streamline the shopping process, ultimately boosting sales and retention rates.

The following diagram illustrates the solution architecture.

The workflow includes the following steps:

  1. Amazon API Gateway is set up to issue a POST request to the AWS Lambda function when there is a need to insert, update, or delete data in Amazon Keyspaces.
  2. The Lambda function passes this modification to Amazon Keyspaces and holds the change, waiting for a success return code from Amazon Keyspaces that confirms the data persistence.
  3. After it receives the 200 return code, the Lambda function initiates an HTTP request to the OpenSearch Ingestion data pipeline asynchronously.
  4. The OpenSearch Ingestion process moves the transaction data to the OpenSearch Serverless collection.
  5. We then utilize the dev tools in OpenSearch Dashboards to execute various search patterns.

Prerequisites

Complete the following prerequisite steps:

  1. Ensure the AWS Command Line Interface (AWS CLI) is installed and the user profile is set up.
  2. Install Node.js, npm, and the AWS CDK Toolkit.
  3. Install Python and jq.
  4. Use an integrated developer environment (IDE), such as Visual Studio Code.

Deploy the solution

The solution is detailed in an AWS CDK project. You don’t need any prior knowledge of AWS CDK. Complete the following steps to deploy the solution:

  1. Clone the GitHub repository to your IDE and navigate to the cloned repository's directory. This project is structured like a standard Python project:
    git clone <repo-link>
    cd <repo-dir>

  2. On macOS and Linux, complete the following steps to set up your virtual environment:
    • Create a virtual environment
      $ python3 -m venv .venv

    • After the virtual environment is created, activate it:
      $ source .venv/bin/activate

  3. For Windows users, activate the virtual environment as follows.
    % .venv\Scripts\activate.bat

  4. After you activate the virtual environment, install the required dependencies:
    (.venv) $ pip install -r requirements.txt

  5. Bootstrap AWS CDK in your account:
    (.venv) $ cdk bootstrap aws://<aws_account_id>/<aws_region>

After the bootstrap process completes, you’ll see a CDKToolkit AWS CloudFormation stack on the AWS CloudFormation console. AWS CDK is now ready for use.

  6. You can synthesize the CloudFormation template for this code:
    (.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
    (.venv) $ export CDK_DEFAULT_REGION=<aws_region>
    (.venv) $ cdk synth -c iam_user_name=<your-iam-user-name> --all
    

  7. Use the cdk deploy command to create the stack:
    (.venv) $ cdk deploy -c iam_user_name=<your-iam-user-name> --all
    

    When the deployment process is complete, you’ll see the following CloudFormation stacks on the AWS CloudFormation console:

  • OpsApigwLambdaStack
  • OpsServerlessIngestionStack
  • OpsServerlessStack
  • OpsKeyspacesStack
  • OpsCollectionPipelineRoleStack

CloudFormation stack details

The CloudFormation template deploys the following components:

  1. An API named keyspaces-OpenSearch-Endpoint in API Gateway, which handles mutations (inserts, updates, and deletes) via the POST method to Lambda, compatible with OpenSearch Ingestion.
  2. A keyspace named productsearch, along with a table called product_by_item. The chosen partition key for this table is product_id. The following screenshot shows an example of the table’s attributes and data provided for reference using the CQL editor.
  3. A Lambda function called OpsApigwLambdaStack-ApiHandler* that forwards the transaction to Amazon Keyspaces. After the transaction is committed in Amazon Keyspaces, we send a response code of 200 to the client and asynchronously send the transaction to the OpenSearch Ingestion pipeline.
  4. The OpenSearch ingestion pipeline, named serverless-ingestion. This pipeline publishes records to an OpenSearch Serverless collection under an index named products. The key for this collection is product_id. Additionally, the pipeline specifies the actions it can handle. The delete action supports delete operations; the index action is the default action, which supports insert and update operations.

We have chosen an OpenSearch Serverless collection as our target, so we included serverless: true in our configuration file. To keep things simple, we haven’t altered the network_policy_name settings, but you have the option to specify a different network policy name if needed. For additional details on how to set up network access for OpenSearch Serverless collections, refer to Creating network policies (console).

version: "2"
product-pipeline:
  source:
    http:
      path: "/${pipelineName}/test_ingestion_path"
  processor:
    - date:
        from_time_received: true
        destination: "@timestamp"
  sink:
    - opensearch:
        hosts: [ "<OpenSearch_Endpoint>" ]
        document_root_key: "item"
        index_type: custom
        index: "products"
        document_id_field: "item/product_id"
        flush_timeout: -1
        actions:
          - type: "delete"
            when: '/operation == "delete"'
          - type: "index"                      
        aws:
          sts_role_arn: "arn:aws:iam::<account_id>:role/OpenSearchCollectionPipelineRole"
          region: "us-east-1"
          serverless: true
        # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"

You can incorporate a dead-letter queue (DLQ) into your pipeline to handle and store events that fail to process. This allows for easy access and analysis of these events. If your sinks refuse data due to mapping errors or other problems, redirecting this data to the DLQ will facilitate troubleshooting and resolving the issue. For detailed instructions on configuring DLQs, refer to Dead-letter queues. To reduce complexity, we don’t configure the DLQs in this post.

Now that all components have been deployed, we can test the solution and conduct various searches on the OpenSearch Service index.

Test the solution

Complete the following steps to test the solution:

  1. On the API Gateway console, navigate to your API and choose the ANY method.
  2. Choose the Test tab.
  3. For Method type, choose POST.

This is the only method that OpenSearch Ingestion supports for inserts, deletes, and updates.

  4. For Request body, enter the input.

The following are some of the sample requests:

{"operation": "insert", "item": {"product_id": 1, "product_name": "Reindeer sweater", "product_description": "A Christmas sweater for everyone in the family." } }
{"operation": "insert", "item": {"product_id": 2, "product_name": "Bluetooth Headphones", "product_description": "High-quality wireless headphones with long battery life."}}
{"operation": "insert", "item": {"product_id": 3, "product_name": "Smart Fitness Watch", "product_description": "Advanced watch tracking fitness and health metrics."}}
{"operation": "insert", "item": {"product_id": 4, "product_name": "Eco-Friendly Water Bottle", "product_description": "Durable and eco-friendly bottle for hydration on-the-go."}}
{"operation": "insert", "item": {"product_id": 5, "product_name": "Wireless Charging Pad", "product_description": "Convenient pad for fast wireless charging of devices."}}

If the test is successful, you should see a return code of 200 in API Gateway. The following is a sample response:

{"message": "Ingestion completed successfully for {'operation': 'insert', 'item': {'product_id': 100, 'product_name': 'Reindeer sweater', 'product_description': 'A Christmas sweater for everyone in the family.'}}."}

If the test is successful, you should see the updated records in the Amazon Keyspaces table.

  5. Now that you have loaded some sample data, run a sample query to confirm that the data you loaded using API Gateway is actually persisted to OpenSearch Service. The following is a query against the products index for product_name = sweater:
awscurl --service aoss --region us-east-1 -X POST "<OpenSearch_Endpoint>/products/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "term": {
      "product_name": "sweater"
    }
  }
}' | jq '.'

  6. To update a record, send a request body like the first example below. If the record doesn't already exist, this operation inserts it.
  7. To delete a record, send a request body like the second example below.
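
The following curl commands sketch those two request bodies sent to the API's invoke URL (the URL and item values are placeholders; you can also paste just the JSON bodies into the API Gateway test console):

# Update: handled by the pipeline's default index action, which upserts the document
curl -X POST "https://<api-id>.execute-api.us-east-1.amazonaws.com/<stage>" \
  -H "Content-Type: application/json" \
  -d '{"operation": "update", "item": {"product_id": 1, "product_name": "Reindeer sweater", "product_description": "A Christmas sweater, now in more sizes."}}'

# Delete: matched by the pipeline's delete action (operation == "delete")
curl -X POST "https://<api-id>.execute-api.us-east-1.amazonaws.com/<stage>" \
  -H "Content-Type: application/json" \
  -d '{"operation": "delete", "item": {"product_id": 1}}'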

Monitoring

You can use Amazon CloudWatch to monitor the pipeline metrics. The following graph shows the number of documents successfully sent to OpenSearch Service.

Run queries on Amazon Keyspaces data in OpenSearch Service

There are several methods to run search queries against an OpenSearch Service collection, the most popular being awscurl and the dev tools in OpenSearch Dashboards. For this post, we use the dev tools in OpenSearch Dashboards.

To access the dev tools, navigate to the OpenSearch Serverless collections page and select the dashboard radio button adjacent to the ingestion-collection.

On the OpenSearch Dashboards page, choose the Dev Tools option.

This action brings up the Dev Tools console, enabling you to run various search queries, either to validate the data or simply to query it.

Enter your query, and use the size parameter to determine how many records you want displayed. Choose the play icon to run the query; the results appear in the right pane.

The following are some of the different search queries that you can run against the ingestion-collection for different search needs. For more search methods and examples, refer to Searching data in Amazon OpenSearch Service.

Full text search

In a search for Bluetooth headphones, we adopted an exact full-text search approach, formulating a query to align precisely with the term "Bluetooth Headphones" across the product database. This allowed us to evaluate a broad range of Bluetooth headphones, concentrating on those that best met our search parameters.
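
The following awscurl query is one way to express that search; the endpoint is a placeholder, and the same JSON body also runs in the Dev Tools console:

awscurl --service aoss --region us-east-1 -X POST "<OpenSearch_Endpoint>/products/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "product_name": "Bluetooth Headphones"
    }
  }
}' | jq '.'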

Fuzzy search

We used a fuzzy search query to navigate through product descriptions, even when they contain variations or misspellings of our search term. For instance, by setting the value to "chrismas" and the fuzziness to AUTO, our search could accommodate common misspellings or close approximations in the product descriptions. This approach is particularly useful for capturing a wider range of relevant results, especially when dealing with terms that are often misspelled or have multiple variations.
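
The following is a sketch of that fuzzy query; the endpoint is a placeholder, and the body also runs as-is in Dev Tools:

awscurl --service aoss --region us-east-1 -X POST "<OpenSearch_Endpoint>/products/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "fuzzy": {
      "product_description": {
        "value": "chrismas",
        "fuzziness": "AUTO"
      }
    }
  }
}' | jq '.'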

Wildcard search

To discover a variety of products, we employed a wildcard search within the product descriptions. By using the pattern Fit*s, we instructed the search to look for descriptions containing a term that begins with "Fit" and ends with "s", with any characters in between. This method is effective for capturing a range of products with similar naming patterns or attributes, making sure we don't miss relevant items that fit within a certain category but have slightly different names or features.
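
The following is a sketch of the wildcard query. Because analyzed text fields store lowercased tokens, the pattern is written in lowercase here so that it matches terms such as "fitness" in the product descriptions; the endpoint is again a placeholder:

awscurl --service aoss --region us-east-1 -X POST "<OpenSearch_Endpoint>/products/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "wildcard": {
      "product_description": {
        "value": "fit*s"
      }
    }
  }
}' | jq '.'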

Keep in mind that wildcard queries often perform poorly, because they require iterating through a large set of terms. Avoid placing a wildcard character at the beginning of a query; doing so forces a scan of nearly every term in the index, which is expensive in both compute and time.

Troubleshooting

A status code other than 200 indicates a problem either in the Amazon Keyspaces operation or the OpenSearch Ingestion operation. View the CloudWatch logs of the Lambda function OpsApigwLambdaStack-ApiHandler* and the OpenSearch Ingestion pipeline logs to troubleshoot the failure.

You will see the following errors in the ingestion pipeline logs. They occur because the pipeline endpoint is publicly accessible and receives unauthenticated requests, and they are harmless. As a best practice, you can enable VPC access for the serverless collection, which provides an inherent layer of security.

  • 2024-01-23T13:47:42.326 [armeria-common-worker-epoll-3-1] ERROR com.amazon.osis.HttpAuthorization - Unauthenticated request: Missing Authentication Token
  • 2024-01-23T13:47:42.327 [armeria-common-worker-epoll-3-1] ERROR com.amazon.osis.HttpAuthorization - Authentication status: 401

Clean up

To prevent additional charges and to effectively remove resources, delete the CloudFormation stacks by running the following command:

(.venv) $ cdk destroy -c iam_user_name=<your-iam-user-name> --force --all

Verify that the following CloudFormation stacks are deleted from the CloudFormation console:

  • OpsApigwLambdaStack
  • OpsServerlessIngestionStack
  • OpsServerlessStack
  • OpsKeyspacesStack
  • OpsCollectionPipelineRoleStack

Finally, delete the CDKToolkit CloudFormation stack to remove the AWS CDK resources.

Conclusion

In this post, we delved into enabling diverse search scenarios on data stored in Amazon Keyspaces by using the capabilities of OpenSearch Service. Through the use of Lambda and OpenSearch Ingestion, we managed the data movement seamlessly. Furthermore, we provided insights into testing the deployed solution using a CloudFormation template, ensuring a thorough grasp of its practical application and effectiveness.

Test the procedure that is outlined in this post by deploying the sample code provided and share your feedback in the comments section.


About the authors

Rajesh Kantamani is a Senior Database Solution Architect. He specializes in assisting customers with designing, migrating, and optimizing database solutions on Amazon Web Services, ensuring scalability, security, and performance. In his spare time, he loves spending time outdoors with family and friends.

Sylvia, a Senior DevOps Architect, specializes in designing and automating DevOps processes to guide clients through their DevOps transformation journey. During her leisure time, she finds joy in activities such as biking, swimming, practicing yoga, and photography.

Knowledge Bases for Amazon Bedrock now supports Amazon Aurora PostgreSQL and Cohere embedding models

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/knowledge-bases-for-amazon-bedrock-now-supports-amazon-aurora-postgresql-and-cohere-embedding-models/

During AWS re:Invent 2023, we announced the general availability of Knowledge Bases for Amazon Bedrock. With a knowledge base, you can securely connect foundation models (FMs) in Amazon Bedrock to your company data for Retrieval Augmented Generation (RAG).

In my previous post, I described how Knowledge Bases for Amazon Bedrock manages the end-to-end RAG workflow for you. You specify the location of your data, select an embedding model to convert the data into vector embeddings, and have Amazon Bedrock create a vector store in your AWS account to store the vector data, as shown in the following figure. You can also customize the RAG workflow, for example, by specifying your own custom vector store.

Knowledge Bases for Amazon Bedrock

Since my previous post in November, there have been a number of updates to Knowledge Bases, including the availability of Amazon Aurora PostgreSQL-Compatible Edition as an additional custom vector store option next to vector engine for Amazon OpenSearch Serverless, Pinecone, and Redis Enterprise Cloud. But that’s not all. Let me give you a quick tour of what’s new.

Additional choice for embedding model
The embedding model converts your data, such as documents, into vector embeddings. Vector embeddings are numeric representations of text data within your documents. Each embedding aims to capture the semantic or contextual meaning of the data.

Cohere Embed v3 – In addition to Amazon Titan Text Embeddings, you can now also choose from two additional embedding models, Cohere Embed English and Cohere Embed Multilingual, each supporting 1,024 dimensions.

Check out the Cohere Blog to learn more about Cohere Embed v3 models.

Additional choice for vector stores
Each vector embedding is put into a vector store, often with additional metadata such as a reference to the original content the embedding was created from. The vector store indexes the stored vector embeddings, which enables quick retrieval of relevant data.

Knowledge Bases gives you a fully managed RAG experience that includes creating a vector store in your account to store the vector data. You can also select a custom vector store from the list of supported options and provide the vector database index name as well as index field and metadata field mappings.

We have made three recent updates to vector stores that I want to highlight: The addition of Amazon Aurora PostgreSQL-Compatible and Pinecone serverless to the list of supported custom vector stores, as well as an update to the existing Amazon OpenSearch Serverless integration that helps to reduce cost for development and testing workloads.

Amazon Aurora PostgreSQL – In addition to vector engine for Amazon OpenSearch Serverless, Pinecone, and Redis Enterprise Cloud, you can now also choose Amazon Aurora PostgreSQL as your vector database for Knowledge Bases.

Aurora is a relational database service that is fully compatible with MySQL and PostgreSQL. This allows existing applications and tools to run without the need for modification. Aurora PostgreSQL supports the open source pgvector extension, which allows it to store, index, and query vector embeddings.

Many of Aurora’s features for general database workloads also apply to vector embedding workloads:

  • Aurora offers up to 3x the database throughput when compared to open source PostgreSQL, extending to vector operations in Amazon Bedrock.
  • Aurora Serverless v2 provides elastic scaling of storage and compute capacity based on real-time query load from Amazon Bedrock, ensuring optimal provisioning.
  • Aurora global database provides low-latency global reads and disaster recovery across multiple AWS Regions.
  • Blue/green deployments replicate the production database in a synchronized staging environment, allowing modifications without affecting the production environment.
  • Aurora Optimized Reads on Amazon EC2 R6gd and R6id instances use local storage to enhance read performance and throughput for complex queries and index rebuild operations. With vector workloads that don’t fit into memory, Aurora Optimized Reads can offer up to 9x better query performance over Aurora instances of the same size.
  • Aurora seamlessly integrates with AWS services such as Secrets Manager, IAM, and RDS Data API, enabling secure connections from Amazon Bedrock to the database and supporting vector operations using SQL.

For a detailed walkthrough of how to configure Aurora for Knowledge Bases, check out this post on the AWS Database Blog and the User Guide for Aurora.

Pinecone serverless – Pinecone recently introduced Pinecone serverless. If you choose Pinecone as a custom vector store in Knowledge Bases, you can provide either Pinecone or Pinecone serverless configuration details. Both options are supported.

Reduce cost for development and testing workloads in Amazon OpenSearch Serverless
When you choose the option to quickly create a new vector store, Amazon Bedrock creates a vector index in Amazon OpenSearch Serverless in your account, removing the need to manage anything yourself.

Since becoming generally available in November, vector engine for Amazon OpenSearch Serverless gives you the choice to disable redundant replicas for development and testing workloads, reducing cost. You can start with just two OpenSearch Compute Units (OCUs), one for indexing and one for search, cutting the costs in half compared to using redundant replicas. Additionally, fractional OCU billing further lowers costs, starting with 0.5 OCUs and scaling up as needed. For development and testing workloads, a minimum of 1 OCU (split between indexing and search) is now sufficient, reducing cost by up to 75 percent compared to the 4 OCUs required for production workloads.

Usability improvement – Redundant replicas disabled is now the default selection when you choose the quick-create workflow in Knowledge Bases for Amazon Bedrock. Optionally, you can create a collection with redundant replicas by selecting Update to production workload.

For more details on vector engine for Amazon OpenSearch Serverless, check out Channy’s post.

Additional choice for FM
At runtime, the RAG workflow starts with a user query. Using the embedding model, you create a vector embedding representation of the user’s input prompt. This embedding is then used to query the database for similar vector embeddings to retrieve the most relevant text as the query result. The query result is then added to the original prompt, and the augmented prompt is passed to the FM. The model uses the additional context in the prompt to generate the completion, as shown in the following figure.

Knowledge Bases for Amazon Bedrock

Anthropic Claude 2.1 – In addition to Anthropic Claude Instant 1.2 and Claude 2, you can now choose Claude 2.1 for Knowledge Bases. Compared to previous Claude models, Claude 2.1 doubles the supported context window size to 200K tokens.

Check out the Anthropic Blog to learn more about Claude 2.1.

Now available
Knowledge Bases for Amazon Bedrock, including the additional choice in embedding models, vector stores, and FMs, is available in the AWS Regions US East (N. Virginia) and US West (Oregon).

Learn more

Read more about Knowledge Bases for Amazon Bedrock

— Antje

Build SAML identity federation for Amazon OpenSearch Service domains within a VPC

Post Syndicated from Mahdi Ebrahimi original https://aws.amazon.com/blogs/big-data/build-saml-identity-federation-for-amazon-opensearch-service-domains-within-a-vpc/

Amazon OpenSearch Service is a fully managed search and analytics service powered by the Apache Lucene search library that can be operated within a virtual private cloud (VPC). A VPC is a virtual network that’s dedicated to your AWS account. It’s logically isolated from other virtual networks in the AWS Cloud. Placing an OpenSearch Service domain within a VPC enables secure communication between OpenSearch Service and other services within the VPC without the need for an internet gateway, NAT device, or a VPN connection. All traffic remains securely within the AWS Cloud, providing a safe environment for your data. To connect to an OpenSearch Service domain running inside a private VPC, enterprise customers use one of two available options: either integrate their VPC with their enterprise network through VPN or AWS Direct Connect, or make the cluster endpoint publicly accessible through a reverse proxy. Refer to How can I access OpenSearch Dashboards from outside of a VPC using Amazon Cognito authentication for a detailed evaluation of the available options and the corresponding pros and cons.

For managing access to OpenSearch Dashboards in enterprise customers’ environments, OpenSearch Service supports Security Assertion Markup Language (SAML) integration with the customer’s existing identity providers (IdPs) to offer single sign-on (SSO). Although SAML integration for publicly accessible OpenSearch Dashboards works out of the box, enabling SAML for OpenSearch Dashboards within a VPC requires careful design with various configurations.

This post outlines an end-to-end solution for integrating SAML authentication for OpenSearch Service domains running in a VPC. It provides a step-by-step deployment guideline and is accompanied by AWS Cloud Development Kit (AWS CDK) applications, which automate all the necessary configurations.

Overview of solution

The following diagram describes the step-by-step authentication flow for accessing a private OpenSearch Service domain through SSO using SAML identity federation. The access is enabled over public internet through private NGINX reverse proxy servers running on Amazon Elastic Container Service (Amazon ECS) for high availability.

Solution overview

The workflow consists of the following steps:

  1. The user navigates to the OpenSearch Dashboards URL in their browser.
  2. The browser resolves the domain IP address and sends the request.
  3. AWS WAF rules make sure that only allowlisted IP address ranges are permitted.
  4. The Application Load Balancer forwards the request to the NGINX reverse proxy.
  5. NGINX adds the necessary headers and forwards the request to OpenSearch Dashboards.
  6. OpenSearch Dashboards detects that the request is not authenticated. It replies with a redirect to the integrated SAML IdP for authentication.
  7. The user is redirected to the SSO login page.
  8. The IdP verifies the user’s identity and generates a SAML assertion token.
  9. The user is redirected back to the OpenSearch Dashboards URL.
  10. The request goes through Steps 1–5 again until it reaches OpenSearch Dashboards. This time, OpenSearch Dashboards detects the accompanying SAML assertion and allows the request.

In the following sections, we set up a NGINX reverse proxy in private subnets to provide access to OpenSearch Dashboards for a domain deployed inside VPC private subnets. We then enable SAML authentication for OpenSearch Dashboards using a SAML 2.0 application and use a custom domain endpoint to access OpenSearch Dashboards to see the SAML authentication in action.

Prerequisites

Before you get started, complete the prerequisite steps in this section.

Install required tools

First, install the AWS CDK. For more information, refer to the AWS CDK v2 Developer Guide.

Prepare required AWS resources

Complete the following steps to set up your AWS resources:

  1. Create an AWS account.
  2. Create an Amazon Route 53 public hosted zone such as mydomain.com to be used for routing internet traffic to your domain. For instructions, refer to Creating a public hosted zone.
  3. Request an AWS Certificate Manager (ACM) public certificate for the hosted zone. For instructions, refer to Requesting a public certificate.
  4. Create a VPC with public and private subnets.
  5. Enable AWS IAM Identity Center. For instructions, refer to Enable IAM Identity Center.

Prepare your OpenSearch Service cluster

This post is accompanied by a standalone AWS CDK application (opensearch-domain) that deploys a sample OpenSearch Service domain in private VPC subnets. The deployed domain is for demonstration purposes only, and is optional.

If you have an existing OpenSearch Service domain in VPC that you want to use for SAML integration, apply the following configurations:

  1. On the Cluster configuration tab, choose Edit and select Enable custom endpoint in the Custom endpoint section.
  2. For Custom hostname, enter a fully qualified domain name (FQDN) such as opensearch.mydomain.com, which you want to use to access your cluster. Note that the domain name of the provided FQDN (for example, mydomain.com) must be the same as the public hosted zone you created earlier.
  3. For AWS certificate, choose the SSL certificate you created earlier.
  4. In the Summary section, optionally enable a dry run analysis by selecting Dry run, or deselect it, and then choose Save changes.

Otherwise, download the accompanying opensearch-domain AWS CDK application and unzip it. Then, edit the cdk.json file at the root of the unzipped folder and configure the required parameters:

  • vpc_cidr – The CIDR block in which to create the VPC. You may leave the default of 10.0.0.0/16.
  • opensearch_cluster_name – The name of the OpenSearch Service cluster. You may leave the default value of opensearch. It will also be used, together with the hosted_zone_name parameter, to build the FQDN of the custom domain URL.
  • hosted_zone_id – The Route 53 public hosted zone ID.
  • hosted_zone_name – The Route 53 public hosted zone name (for example, mydomain.com). The resulting FQDN with the default example values will then be opensearch.mydomain.com.

Finally, run the following commands to deploy the AWS CDK application:

cd opensearch-domain

# Create a Python environment and install the required dependencies
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-dev.txt
pip install -r requirements.txt

# Deploy the CDK application
cdk deploy

With the prerequisites in place, refer to the following sections for a step-by-step guide to deploy this solution.

Create a SAML 2.0 application

We use IAM Identity Center as the source of identity for our SAML integration. The same configuration should apply to other SAML 2.0-compliant IdPs. Consult your IdP documentation.

  1. On the IAM Identity Center console, choose Groups in the navigation pane.
  2. Create a new group called Opensearch Admin, and add users to it.
    This will be the SAML group that receives full permissions in OpenSearch Dashboards. Take note of the group ID.
  3. Choose Applications in the navigation pane.
  4. Create a new custom SAML 2.0 application.
  5. Download the IAM Identity Center SAML metadata file to use in a later step.
  6. For Application start URL, enter [Custom Domain URL]/_dashboards/.
    The custom domain URL is composed of communication protocol (https://) followed by the FQDN, which you used for your OpenSearch Service cluster in the prerequisites (for example, https://opensearch.mydomain.com). Look under your OpenSearch Service cluster configurations, if in doubt.
  7. For Application ACS URL, enter [Custom Domain URL]/_dashboards/_opendistro/_security/saml/acs.
  8. For Application SAML audience, enter [Custom Domain URL] (without any trailing slash).
    OpenSearch domain SAML properties
  9. Choose Submit.
  10. In the Assigned users section, select Opensearch Admin and choose Assign Users.
  11. On the Actions menu, choose Edit attribute mappings.
  12. Define attribute mappings as shown in the following screenshot and choose Save changes.

Deploy the AWS CDK application

Complete the following steps to deploy the AWS CDK application:

  1. Download and unzip the opensearch-domain-saml-integration AWS CDK application.
  2. Add your private SSL key and certificate to AWS Secrets Manager and create two secrets called Key and Crt. For example, see the following code:
    KEY=$(cat private.key | base64) && aws secretsmanager create-secret --name Key --secret-string "$KEY"
    CRT=$(cat certificate.crt | base64) && aws secretsmanager create-secret --name Crt --secret-string "$CRT"

    You can use the following command to generate a self-signed certificate. This is for testing only; do not use this for production environments.

    openssl req -new -newkey rsa:4096 -days 1095 -nodes -x509 -subj '/' -keyout private.key -out certificate.crt

  3. Edit the cdk.json file and set the required parameters inside the nested config object:
  • aws_region – The target AWS Region for your deployment (for example, eu-central-1).
  • vpc_id – The ID of the VPC into which the OpenSearch Service domain has been deployed.
  • opensearch_cluster_security_group_id – The ID of the security group used by the OpenSearch Service domain, or any other security group that allows inbound connections to that domain on ports 80 and 443. This group ID will be used by the Application Load Balancer to forward traffic to your OpenSearch Service domain.
  • hosted_zone_id – The Route 53 public hosted zone ID.
  • hosted_zone – The Route 53 public hosted zone name (for example, mydomain.com).
  • opensearch_custom_domain_name – An FQDN such as opensearch.mydomain.com, which you want to use to access your cluster. Note that the domain name of the provided FQDN (mydomain.com) must be the same as the hosted_zone parameter.
  • opensearch_custom_domain_certificate_arn – The ARN of the certificate stored in ACM.
  • opensearch_domain_endpoint – The OpenSearch Service VPC domain endpoint (for example, vpc-opensearch-abc123.eu-central-1.es.amazonaws.com).
  • vpc_dns_resolver – This must be 10.0.0.2 if your VPC CIDR is 10.0.0.0/16 (the Amazon DNS server sits at the base of the VPC network range plus two). See Amazon DNS server for further details.
  • alb_waf_ip_whitelist_cidrs – This is an optional list of zero or more IP CIDR ranges that will be automatically allow listed in AWS WAF to permit access to the OpenSearch Service domain. If not specified, after the deployment you will need to manually add relevant IP CIDR ranges to the AWS WAF IP set to allow access. For example, ["1.2.3.4/32", "5.6.7.0/24"].
  4. Deploy the OpenSearch Service domain SAML integration AWS CDK application:
    cd opensearch-domain-saml-integration
    
    # Create a Python environment and install the required dependencies
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements-dev.txt
    pip install -r requirements.txt
    
    # Deploy the CDK application
    cdk deploy

Enable SAML authentication for your OpenSearch Service cluster

When the application deployment is complete, enable SAML authentication for your cluster:

  1. On the OpenSearch Service console, navigate to your domain.
  2. On the Security configuration tab, choose Edit.
  3. Select Enable SAML authentication.
  4. Choose Import from XML file and import the IAM Identity Center SAML metadata file that you downloaded in an earlier step.
  5. For SAML master backend role, use the group ID you saved earlier.
  6. Expand the Additional settings section and for Roles, enter the SAML 2.0 attribute name you mapped earlier when you created the SAML 2.0 application in IAM Identity Center.
  7. Configure the domain access policy for SAML integration.
  8. Submit changes and wait for OpenSearch Service to apply the configurations before proceeding to the next section.

Test the solution

Complete the following steps to see the solution in action:

  1. On the IAM Identity Center console, choose Dashboard in the navigation pane.
  2. In the Settings summary section, choose the link under AWS access portal URL.
  3. Sign in with your user name and password (register your password if this is your first login).
    If your account was successfully added to the admin group, a SAML application logo is visible.
  4. Choose Custom SAML 2.0 application to be redirected to the OpenSearch Service dashboards through SSO without any additional login attempts.
    Alternatively, you could skip logging in to the access portal and directly point your browser to the OpenSearch Dashboards URL. In that case, OpenSearch Dashboards would first redirect you to the access portal to log in, which would redirect you back to the OpenSearch Dashboards UI after a successful login, resulting in the same outcome as shown in the following screenshot.

Troubleshooting

Your public-facing IP must be allowlisted by the AWS WAF rule; otherwise, a 403 Forbidden error is returned. Allowlist your IP CIDR range via the AWS CDK alb_waf_ip_whitelist_cidrs property as described in the installation guide, and redeploy the AWS CDK application for the changes to take effect.

Clean up

When you’re finished with this configuration, clean up the resources to avoid future charges.

  1. On the OpenSearch Service console, navigate to the Security configuration tab of your OpenSearch Service domain and choose Edit.
  2. Deselect Enable SAML authentication and choose Save changes.
  3. After the SAML integration is disabled, delete the opensearch-domain-saml-integration stack using cdk destroy.
  4. Optionally, if you used the provided OpenSearch Service sample AWS CDK stack (opensearch-domain), delete it using cdk destroy.

Conclusion

OpenSearch Service lets enterprise customers use their preferred SAML 2.0 IdPs, such as IAM Identity Center, for clusters running inside private VPC subnets, following AWS best practices.

In this post, we showed you how to integrate an OpenSearch Service domain within a VPC with an existing SAML IdP for SSO access to OpenSearch Dashboards using IAM Identity Center. The provided solution securely manages network access to the resources using AWS WAF to restrict access only to authorized network segments or specific IP addresses.

To get started, refer to How can I access OpenSearch Dashboards from outside of a VPC using Amazon Cognito authentication for a further comparison of access patterns for OpenSearch Service domains in private VPCs.


About the Authors

Mahdi Ebrahimi is a Senior Cloud Infrastructure Architect with Amazon Web Services. He excels in designing distributed, highly-available software systems. Mahdi is dedicated to delivering cutting-edge solutions that empower his customers to innovate in the rapidly evolving landscape of the automotive industry.

Dmytro Protsiv is a Cloud Applications Architect with Amazon Web Services. He is passionate about helping customers to solve their business challenges around application modernization.

Luca Menichetti is a Big Data Architect with Amazon Web Services. He helps customers develop performant and reusable solutions to process data at scale. Luca is passionate about managing organisations' data architecture, enabling data analytics and machine learning. Having worked around the Hadoop ecosystem for a decade, he really enjoys tackling problems in NoSQL environments.

Krithivasan Balasubramaniyan is a Principal Consultant with Amazon Web Services. He enables global enterprise customers in their digital transformation journey and helps architect cloud native solutions.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Track Amazon OpenSearch Service configuration changes more easily with new visibility improvements

Post Syndicated from Siddhant Gupta original https://aws.amazon.com/blogs/big-data/track-amazon-opensearch-service-configuration-with-improved-visibility/

Amazon OpenSearch Service offers multiple domain configuration settings to meet your workload-specific requirements. As part of standard service operations, you may be required to update these configuration settings on a regular basis. Recently, Amazon OpenSearch Service launched visibility improvements that allow you to track configuration changes more effectively. We’ve introduced granular and more descriptive configuration statuses that enable you to set up alarms and use them in automation to minimize manual monitoring.

We recommend that you take advantage of these visibility improvements in your applications. These changes are backward compatible; if your automations rely on the legacy processing parameter to determine configuration change status, they should continue to work without any disruption. To simplify tracking of multiple in-flight configuration change requests, Amazon OpenSearch Service now allows a configuration change request only when the Domain Processing Status is Active. Additional details are in the section 'Single configuration change at a time'.

Solution overview

Earlier, configuration change status visibility was available through the processing parameter in the OpenSearch Service APIs, and as a Domain Status field on the OpenSearch Service console. We have now introduced the following changes to improve the configuration update experience:

  • Introduced two new parameters, DomainProcessingStatus and ConfigChangeStatus, in the API responses. Similarly, added Domain Processing Status and Configuration Change Status fields in the console. These changes provide better visibility through multiple, intuitive statuses. Earlier statuses were limited to only two values: Active and Processing.
  • Ability to easily compare active and in-flight configurations for clarity. Earlier, it required multiple steps.
  • Amazon OpenSearch Service has now adopted the approach of allowing a single configuration change request at a time. There is no limit on the number of domain configuration changes you can bundle in a single request. However, you can submit the next configuration request only when the previous request is complete and the domain processing status becomes Active. This improvement streamlines configuration updates and addresses previous challenges of tracking multiple in-flight configuration change requests.
  • Ability to cancel a change request in case of a validation failure. Previously, when instances were unavailable, domains remained in processing state. Now, upon encountering any validation failure, you can cancel the change request and retry after some time.
  • Domain processing status turns to Active only after all the background activities, including shard movement, are complete. This means that you can confidently use the newly introduced statuses in your automation scripts without needing to infer whether internal processes, such as data movement, are complete.

How do you get granular details to track the configuration update status?

As part of recent improvements, Amazon OpenSearch Service introduced the DomainProcessingStatus and ConfigChangeStatus parameters in the APIs, along with the respective Domain Processing Status and Configuration Change Status fields on the console. You can rely on these statuses to get accurate and consistent information during different configuration change scenarios, such as changes with or without blue/green deployments, and changes triggered by the operator or by OpenSearch Service. Let us explore these enhanced visibility experiences; a polling sketch follows the list below.

  1. Domain processing status visibility: You can track the status of domain-level configuration changes through the Domain Processing Status field on the console. Similarly, API responses include the DomainProcessingStatus parameter. The values and a brief description are provided in the following details:
    1. Active: No configuration change is in progress. You can submit a new configuration change request.
    2. Creating: New domain creation is in progress.
    3. Modifying: This status indicates that one or more configuration changes, such as new data node addition, Amazon Elastic Block Store (Amazon EBS) GP3 storage provisioning, or setting up KMS keys, are in progress. In other words, changes made through the UpdateDomainConfig API set the status to Modifying. The Modifying status also covers situations where domains require shard movement to complete configuration changes. Note: For backward compatibility, we have kept the behavior of the processing parameter unchanged in the API responses; it is set to false as soon as the core configuration changes are complete, without waiting for shard movement to finish.
    4. Upgrading Engine Version: Engine version upgrades are in progress, such as from Elasticsearch version 7.9 to OpenSearch version 1.0.
    5. Updating Service Software: This status refers to configuration changes related to service software updates.
    6. Deleting: Domain deletion is progressing.
    7. Isolated: This represents domains that are suspended due to a variety of reasons, such as account-related billing issues or domains that are not compliant with critical security patch updates.
  2. Configuration change status visibility: Configuration changes can be initiated by the user (e.g., new data node addition, instance type change) or by the Service (e.g., AutoTune and mandatory service software updates). You can find the latest status details through Configuration Change Status field in the console, and through the ConfigChangeStatus parameter in API responses. Below are the values and a brief description:
    1. Pending: Indicates that a configuration change request has been submitted.
    2. Initializing: Service is initializing a configuration change request.
    3. Validating: Service is validating the requested changes and resources required.
    4. Validation Failed: Requested changes failed validation. At this point, no configuration changes are applied. Some possible validation failures could be the presence of red indices in the domain, unavailability of a chosen instance type, and low disk space. Here is a list of potential validation failures. During a validation failure event, you can cancel, retry, or edit configuration changes.
    5. Awaiting user inputs: Scenarios where the user may be able to fix validation errors, such as an invalid KMS key. In this status, the user can edit the configuration changes.
    6. Applying changes: Service is applying requested configuration changes.
    7. Cancelled: During the Validation Failed status, you can either choose Cancel on the console or call the CancelDomainConfigChange API. All the applied changes that were part of the change request will be rolled back.
    8. Completed: Requested configuration changes have been successfully completed.
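As referenced earlier, the following minimal Python (boto3) sketch polls these statuses before submitting the next change request. It assumes a recent boto3 version that surfaces the DomainProcessingStatus and ConfigChangeStatus fields; the domain name is a placeholder.

import time

import boto3

client = boto3.client("opensearch")

def wait_until_active(domain_name, poll_seconds=60):
    # Poll until the domain processing status returns to Active, printing
    # the configuration change status along the way.
    while True:
        status = client.describe_domain(DomainName=domain_name)["DomainStatus"]
        processing = status.get("DomainProcessingStatus")
        change = status.get("ChangeProgressDetails") or {}
        print(f"domain: {processing}, change: {change.get('ConfigChangeStatus')}")
        if processing == "Active":
            return
        time.sleep(poll_seconds)

wait_until_active("my-domain")  # placeholder domain name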

Console enhancements

The Amazon OpenSearch Service console offers enhanced visibility to track configuration change progress. Below are a few screenshots to give you an idea of these improvements.

  • Amazon OpenSearch Service console provides Domain Processing Status, Configuration Change Status, and Change ID fields. Note: To know the change details associated with the Change ID, you can use the DescribeDomainChangeProgress API.

  • Configuration change summary. To see a side-by-side comparison of your active configurations and requested changes, on the domain detail page, navigate to the Cluster configuration tab and scroll down to the Configuration change summary section. The Pending changes field shows the status of the pending properties at that time and does not include changes that have already been applied. You can also get similar details from the DescribeDomain and DescribeDomainConfig APIs through the ModifyingProperties parameter.

Cancelling during validation failure. In the following screenshots, you can see a new option to cancel a change request when a configuration change request fails validation. For example, when you encounter a SubnetNotFound error, you can use the Cancel request button to roll back to the previous active configuration, fix the issue, and then retry the configuration update.

Single configuration change at a time

Previously, it was not straightforward to track the success and failure of individual change requests when several requests were made. To provide a simplified experience, OpenSearch Service now limits you to a single change request at a time. In a single configuration change request, you can bundle multiple changes at once. After a configuration change request is submitted, it must be completed before you can request the next configuration change through the console or through the UpdateDomainConfig API. This simplified experience makes it easier to keep track of changes that have been requested and their most recent status. If your automation calls the configuration update APIs multiple times, update it to group multiple configuration changes in a single update call, or wait for each update to complete before you submit the next configuration change; a sketch of such a bundled call follows. You can update the domain configuration when the domain processing status becomes Active. For a list of changes that might need a blue/green deployment, see the Amazon OpenSearch Service documentation.
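As a sketch of such a bundled call, the following boto3 snippet combines an instance count change and an EBS storage change into a single UpdateDomainConfig request; the domain name and values are placeholders.

import boto3

client = boto3.client("opensearch")

# Bundle several changes into a single request instead of issuing them
# one at a time, then wait for the domain to become Active again.
response = client.update_domain_config(
    DomainName="my-domain",  # placeholder
    ClusterConfig={"InstanceType": "r6g.large.search", "InstanceCount": 4},
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100},
)

print(response["DomainConfig"].get("ChangeProgressDetails"))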

The below screenshot shows an example alert on the ‘Edit domain’ page informing the user that another change or update is in progress. OpenSearch Service no longer allows you to submit new configuration update requests, and the ‘Apply change’ button is disabled until the change in progress is completed.

API changes

You can use the DescribeDomain, DescribeDomainChangeProgress, and DescribeDomainConfig APIs to get detailed configuration update statuses. In addition, you can use CancelDomainConfigChange to cancel a change request in the event of a validation failure, as sketched below. Refer to the Amazon OpenSearch Service API documentation for details.
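Cancelling a change request that failed validation is a single call; this sketch assumes a boto3 version that includes the CancelDomainConfigChange operation and uses a placeholder domain name.

import boto3

client = boto3.client("opensearch")

# Roll back a change request that is in the Validation Failed state.
client.cancel_domain_config_change(DomainName="my-domain")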

Conclusion

In this post, we showed you how to get granular information about a configuration update request. These newly introduced changes will help you gain better visibility into the progress of configuration change requests and easily distinguish between applied changes and pending ones. Ensure that the DomainProcessingStatus value is Active before submitting configuration change requests. The ability to cancel changes in the event of validation failures gives you better control in getting your domain out of the processing state in a self-service manner. Visit the product documentation to learn more.


About the Authors

Siddhant Gupta is a Sr. Technical Product Manager at Amazon Web Services based in Hyderabad, India. Siddhant has been with Amazon for over six years and is currently working with the OpenSearch Service team, helping with new region launches, pricing strategy, and bringing EC2 and EBS innovations to OpenSearch Service customers. He is passionate about analytics and machine learning. In his free time, he loves traveling, fitness activities, spending time with his family and reading non-fiction books.

Deniz Ercelebi is a Sr. UX Designer at Amazon OpenSearch Service. In her role she contributes to the creation, implementation, and successful delivery of design solutions for complex problems. Her personal drive is fueled by a passion for user experience, a dedication to customer-centric solutions, and a firm belief in collaborative innovation.

Shashank Gupta is a Sr. Software Developer at Amazon OpenSearch Service, specializing in the enhancement of the Managed service aspect of the platform. His primary focus is on optimizing the managed experience, spanning from the console to APIs and resource provisioning in an efficient manner. With a dedicated commitment to innovation, Shashank aims to elevate the overall customer experience by introducing inventive solutions within the service.

Power neural search with AI/ML connectors in Amazon OpenSearch Service

Post Syndicated from Aruna Govindaraju original https://aws.amazon.com/blogs/big-data/power-neural-search-with-ai-ml-connectors-in-amazon-opensearch-service/

With the launch of the neural search feature for Amazon OpenSearch Service in OpenSearch 2.9, it’s now effortless to integrate with AI/ML models to power semantic search and other use cases. OpenSearch Service has supported both lexical and vector search since the introduction of its k-nearest neighbor (k-NN) feature in 2020; however, configuring semantic search required building a framework to integrate machine learning (ML) models to ingest and search. The neural search feature facilitates text-to-vector transformation during ingestion and search. When you use a neural query during search, the query is translated into a vector embedding and k-NN is used to return the nearest vector embeddings from the corpus.

To use neural search, you must set up an ML model. We recommend configuring AI/ML connectors to AWS AI and ML services (such as Amazon SageMaker or Amazon Bedrock) or third-party alternatives. Starting with version 2.9 on OpenSearch Service, AI/ML connectors integrate with neural search to simplify and operationalize the translation of your data corpus and queries to vector embeddings, thereby removing much of the complexity of vector hydration and search.

In this post, we demonstrate how to configure AI/ML connectors to external models through the OpenSearch Service console.

Solution overview

Specifically, this post walks you through connecting to a model in SageMaker. Then we guide you through using the connector to configure semantic search on OpenSearch Service as an example of a use case that is supported through connection to an ML model. Amazon Bedrock and SageMaker integrations are currently supported on the OpenSearch Service console UI, and the list of UI-supported first- and third-party integrations will continue to grow.

For any models not supported through the UI, you can instead set them up using the available APIs and the ML blueprints. For more information, refer to Introduction to OpenSearch Models. You can find blueprints for each connector in the ML Commons GitHub repository.

Prerequisites

Before connecting the model via the OpenSearch Service console, create an OpenSearch Service domain. Map an AWS Identity and Access Management (IAM) role by the name LambdaInvokeOpenSearchMLCommonsRole as the backend role on the ml_full_access role using the Security plugin on OpenSearch Dashboards, as shown in the following video. The OpenSearch Service integrations workflow is pre-filled to use the LambdaInvokeOpenSearchMLCommonsRole IAM role by default to create the connector between the OpenSearch Service domain and the model deployed on SageMaker. If you use a custom IAM role on the OpenSearch Service console integrations, make sure the custom role is mapped as the backend role with ml_full_access permissions prior to deploying the template.
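The backend role mapping shown in the video can also be applied through the Security plugin REST API. The following Python sketch assumes a master user with permissions to manage role mappings; the endpoint, credentials, and account ID are placeholders. Note that PUT replaces any existing mapping for the role.

import requests

ENDPOINT = "https://vpc-opensearch-abc123.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")  # placeholder credentials
ROLE_ARN = "arn:aws:iam::123456789012:role/LambdaInvokeOpenSearchMLCommonsRole"

# Map the IAM role as a backend role on the ml_full_access security role.
response = requests.put(
    f"{ENDPOINT}/_plugins/_security/api/rolesmapping/ml_full_access",
    json={"backend_roles": [ROLE_ARN]},
    auth=AUTH,
)
print(response.status_code, response.json())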

Deploy the model using AWS CloudFormation

The following video demonstrates the steps to use the OpenSearch Service console to deploy a model within minutes on Amazon SageMaker and generate the model ID via the AI connectors. The first step is to choose Integrations in the navigation pane on the OpenSearch Service AWS console, which routes to a list of available integrations. The integration is set up through a UI, which will prompt you for the necessary inputs.

To set up the integration, you only need to provide the OpenSearch Service domain endpoint and a model name to uniquely identify the model connection. By default, the template deploys the Hugging Face sentence-transformers model, djl://ai.djl.huggingface.pytorch/sentence-transformers/all-MiniLM-L6-v2.

When you choose Create Stack, you are routed to the AWS CloudFormation console. The CloudFormation template deploys the architecture detailed in the following diagram.

The CloudFormation stack creates an AWS Lambda application that deploys a model from Amazon Simple Storage Service (Amazon S3), creates the connector, and generates the model ID in the output. You can then use this model ID to create a semantic index.

If the default all-MiniLM-L6-v2 model doesn’t serve your purpose, you can deploy any text embedding model of your choice on the chosen model host (SageMaker or Amazon Bedrock) by providing your model artifacts as an accessible S3 object. Alternatively, you can select one of the following pre-trained language models and deploy it to SageMaker. For instructions to set up your endpoint and models, refer to Available Amazon SageMaker Images.

SageMaker is a fully managed service that brings together a broad set of tools to enable high-performance, low-cost ML for any use case, delivering key benefits such as model monitoring, serverless hosting, and workflow automation for continuous training and deployment. SageMaker allows you to host and manage the lifecycle of text embedding models, and use them to power semantic search queries in OpenSearch Service. When connected, SageMaker hosts your models and OpenSearch Service is used to query based on inference results from SageMaker.

View the deployed model through OpenSearch Dashboards

To verify the CloudFormation template successfully deployed the model on the OpenSearch Service domain and get the model ID, you can use the ML Commons REST GET API through OpenSearch Dashboards Dev Tools.

The ML Commons plugin provides REST APIs to view the model status. The following command allows you to see the status of a remote model:

GET _plugins/_ml/models/<model_id>

As shown in the following screenshot, a DEPLOYED status in the response indicates the model is successfully deployed on the OpenSearch Service cluster.

Alternatively, you can view the model deployed on your OpenSearch Service domain using the Machine Learning page of OpenSearch Dashboards.

This page lists the model information and the statuses of all the models deployed.

Create the neural pipeline using the model ID

When the status of the model shows as either DEPLOYED in Dev Tools or green and Responding in OpenSearch Dashboards, you can use the model ID to build your neural ingest pipeline. The following ingest pipeline is run in your domain’s OpenSearch Dashboards Dev Tools. Make sure you replace the model ID with the unique ID generated for the model deployed on your domain.

PUT _ingest/pipeline/neural-pipeline
{
  "description": "Semantic Search for retail product catalog ",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "sfG4zosBIsICJFsINo3X",
        "field_map": {
           "description": "desc_v",
           "name": "name_v"
        }
      }
    }
  ]
}

Create the semantic search index using the neural pipeline as the default pipeline

You can now define your index mapping with the default pipeline configured to use the new neural pipeline you created in the previous step. Ensure the vector fields are declared as knn_vector and the dimensions are appropriate to the model that is deployed on SageMaker. If you have retained the default configuration to deploy the all-MiniLM-L6-v2 model on SageMaker, keep the following settings as is and run the command in Dev Tools.

PUT semantic_demostore
{
  "settings": {
    "index.knn": true,  
    "default_pipeline": "neural-pipeline",
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "desc_v": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "cosinesimil"
        }
      },
      "name_v": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "engine": "nmslib",
          "space_type": "cosinesimil"
        }
      },
      "description": {
        "type": "text" 
      },
      "name": {
        "type": "text" 
      } 
    }
  }
}

Ingest sample documents to generate vectors

For this demo, you can ingest the sample retail demostore product catalog to the new semantic_demostore index. Replace the user name, password, and domain endpoint with your domain information and ingest raw data into OpenSearch Service:

curl -XPOST -u 'username:password' 'https://domain-end-point/_bulk' --data-binary @semantic_demostore.json -H 'Content-Type: application/json'

Validate the new semantic_demostore index

Now that you have ingested your dataset into the OpenSearch Service domain, validate that the required vectors were generated by running a simple search that fetches all fields. Confirm that the fields defined as knn_vector contain the required vectors.

Compare lexical search and semantic search powered by neural search using the Compare Search Results tool

The Compare Search Results tool on OpenSearch Dashboards is available for production workloads. You can navigate to the Compare search results page and compare query results between lexical search and neural search configured to use the model ID generated earlier.
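For reference, a neural query against this index looks like the following Python sketch. The endpoint, credentials, and model ID are placeholders; the query embeds query_text with the referenced model and runs k-NN against the desc_v vector field.

import json

import requests

ENDPOINT = "https://domain-end-point"  # placeholder, as in the ingest step
AUTH = ("username", "password")  # placeholder credentials

query = {
    "size": 5,
    "query": {
        "neural": {
            "desc_v": {
                "query_text": "comfortable shoes for running",
                "model_id": "sfG4zosBIsICJFsINo3X",  # your model ID
                "k": 5,
            }
        }
    },
}

response = requests.get(
    f"{ENDPOINT}/semantic_demostore/_search",
    auth=AUTH,
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
print(json.dumps(response.json(), indent=2))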

Clean up

You can delete the resources you created by following the instructions in this post and deleting the CloudFormation stack. This deletes the Lambda resources and the S3 bucket that contains the model that was deployed to SageMaker. Complete the following steps:

  1. On the AWS CloudFormation console, navigate to your stack details page.
  2. Choose Delete.

  3. Choose Delete to confirm.

You can monitor the stack deletion progress on the AWS CloudFormation console.

Note that deleting the CloudFormation stack doesn’t delete the model deployed on SageMaker or the AI/ML connector that was created. This is because these models and connectors can be associated with multiple indexes within the domain. To delete a specific model and its associated connector, use the model APIs as shown in the following screenshots.

First, undeploy the model from the OpenSearch Service domain memory:

POST /_plugins/_ml/models/<model_id>/_undeploy

Then you can delete the model from the model index:

DELETE /_plugins/_ml/models/<model_id>

Lastly, delete the connector from the connector index:

DELETE /_plugins/_ml/connectors/<connector_id>

Conclusion

In this post, you learned how to deploy a model in SageMaker, create the AI/ML connector using the OpenSearch Service console, and build a neural search index. The ability to configure AI/ML connectors in OpenSearch Service simplifies the vector hydration process by making the integrations to external models native. You can create a neural search index in minutes using the neural ingest pipeline and neural search, which use the model ID to generate vector embeddings on the fly during ingestion and search.

To learn more about these AI/ML connectors, refer to Amazon OpenSearch Service AI connectors for AWS services, AWS CloudFormation template integrations for semantic search, and Creating connectors for third-party ML platforms.


About the Authors

Aruna Govindaraju is an Amazon OpenSearch Specialist Solutions Architect and has worked with many commercial and open source search engines. She is passionate about search, relevancy, and user experience. Her expertise with correlating end-user signals with search engine behavior has helped many customers improve their search experience.

Dagney Braun is a Principal Product Manager at AWS focused on OpenSearch.

Detect, mask, and redact PII data using AWS Glue before loading into Amazon OpenSearch Service

Post Syndicated from Michael Hamilton original https://aws.amazon.com/blogs/big-data/detect-mask-and-redact-pii-data-using-aws-glue-before-loading-into-amazon-opensearch-service/

Many organizations, small and large, are working to migrate and modernize their analytics workloads on Amazon Web Services (AWS). There are many reasons for customers to migrate to AWS, but one of the main reasons is the ability to use fully managed services rather than spending time maintaining infrastructure, patching, monitoring, backups, and more. Leadership and development teams can spend more time optimizing current solutions and even experimenting with new use cases, rather than maintaining the current infrastructure.

With the ability to move fast on AWS, you also need to be responsible with the data you’re receiving and processing as you continue to scale. These responsibilities include being compliant with data privacy laws and regulations and not storing or exposing sensitive data like personally identifiable information (PII) or protected health information (PHI) from upstream sources.

In this post, we walk through a high-level architecture and a specific use case that demonstrates how you can continue to scale your organization’s data platform without needing to spend large amounts of development time to address data privacy concerns. We use AWS Glue to detect, mask, and redact PII data before loading it into Amazon OpenSearch Service.

Solution overview

The following diagram illustrates the high-level solution architecture. We have defined all layers and components of our design in line with the AWS Well-Architected Framework Data Analytics Lens.

os_glue_architecture

The architecture comprises a number of components:

Source data

Data may be coming from many tens to hundreds of sources, including databases, file transfers, logs, software as a service (SaaS) applications, and more. Organizations may not always have control over what data comes through these channels and into their downstream storage and applications.

Ingestion: Data lake batch, micro-batch, and streaming

Many organizations land their source data into their data lake in various ways, including batch, micro-batch, and streaming jobs. For example, Amazon EMR, AWS Glue, and AWS Database Migration Service (AWS DMS) can all be used to perform batch and or streaming operations that sink to a data lake on Amazon Simple Storage Service (Amazon S3). Amazon AppFlow can be used to transfer data from different SaaS applications to a data lake. AWS DataSync and AWS Transfer Family can help with moving files to and from a data lake over a number of different protocols. Amazon Kinesis and Amazon MSK also have capabilities to stream data directly to a data lake on Amazon S3.

S3 data lake

Using Amazon S3 for your data lake is in line with the modern data strategy. It provides low-cost storage without sacrificing performance, reliability, or availability. With this approach, you can bring compute to your data as needed and only pay for the capacity it needs to run.

In this architecture, raw data can come from a variety of sources (internal and external), which may contain sensitive data.

Using AWS Glue crawlers, we can discover and catalog the data, which builds the table schemas for us and ultimately makes it straightforward to use AWS Glue ETL with the PII transform to detect and mask or redact any sensitive data that may have landed in the data lake.

Business context and datasets

To demonstrate the value of our approach, let’s imagine you’re part of a data engineering team for a financial services organization. Your requirements are to detect and mask sensitive data as it is ingested into your organization’s cloud environment. The data will be consumed by downstream analytical processes. In the future, your users will be able to safely search historical payment transactions based on data streams collected from internal banking systems. Search results served to operations teams, customers, and interfacing applications must have sensitive fields masked.

The following table shows the data structure used for the solution. For clarity, we have mapped raw to curated column names. You’ll notice that multiple fields within this schema are considered sensitive data, such as first name, last name, Social Security number (SSN), address, credit card number, phone number, email, and IPv4 address.

Raw Column Name Curated Column Name Type
c0 first_name string
c1 last_name string
c2 ssn string
c3 address string
c4 postcode string
c5 country string
c6 purchase_site string
c7 credit_card_number string
c8 credit_card_provider string
c9 currency string
c10 purchase_value integer
c11 transaction_date date
c12 phone_number string
c13 email string
c14 ipv4 string

Use case: PII batch detection before loading to OpenSearch Service

Customers who implement the following architecture have built their data lake on Amazon S3 to run different types of analytics at scale. This solution is suitable for customers who don’t require real-time ingestion to OpenSearch Service and plan to use data integration tools that run on a schedule or are triggered through events.

batch_architecture

Before data records land on Amazon S3, we implement an ingestion layer to bring all data streams reliably and securely to the data lake. Kinesis Data Streams is deployed as an ingestion layer for accelerated intake of structured and semi-structured data streams. Examples of these are relational database changes, applications, system logs, or clickstreams. For change data capture (CDC) use cases, you can use Kinesis Data Streams as a target for AWS DMS. Applications or systems generating streams containing sensitive data are sent to the Kinesis data stream via one of the three supported methods: the Amazon Kinesis Agent, the AWS SDK for Java, or the Kinesis Producer Library. As a last step, Amazon Kinesis Data Firehose helps us reliably load near-real-time batches of data into our S3 data lake destination.
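In addition to the three producer options above, a quick way to illustrate the record shape is a small boto3 producer; the stream name and field values below are synthetic placeholders following the schema described earlier.

import json

import boto3

kinesis = boto3.client("kinesis")

# Synthetic record following the raw schema (c0..c14 mapped to curated names).
record = {
    "first_name": "Jane",
    "last_name": "Doe",
    "ssn": "123-45-6789",
    "email": "jane.doe@example.com",
    "credit_card_number": "4111111111111111",
    "purchase_value": "150",
}

kinesis.put_record(
    StreamName="pii-data-stream",  # placeholder stream name
    Data=json.dumps(record),
    PartitionKey=record["email"],
)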

The following screenshot shows how data flows through Kinesis Data Streams via the Data Viewer and retrieves sample data that lands on the raw S3 prefix. For this architecture, we followed the data lifecycle for S3 prefixes as recommended in Data lake foundation.

kinesis raw data

As you can see from the details of the first record in the following screenshot, the JSON payload follows the same schema as in the previous section. You can see the unredacted data flowing into the Kinesis data stream, which will be obfuscated later in subsequent stages.

raw_json

After the data is collected and ingested into Kinesis Data Streams and delivered to the S3 bucket using Kinesis Data Firehose, the processing layer of the architecture takes over. We use the AWS Glue PII transform to automate detection and masking of sensitive data in our pipeline. As shown in the following workflow diagram, we took a no-code, visual ETL approach to implement our transformation job in AWS Glue Studio.

glue studio nodes

First, we access the source Data Catalog table raw from the pii_data_db database. The table has the schema structure presented in the previous section. To keep track of the raw processed data, we used job bookmarks.

glue catalog

We use the AWS Glue DataBrew recipes in the AWS Glue Studio visual ETL job to transform two date attributes to be compatible with OpenSearch expected formats. This allows us to have a full no-code experience.

We use the Detect PII action to identify sensitive columns. We let AWS Glue determine this based on selected patterns, detection threshold, and sample portion of rows from the dataset. In our example, we used patterns that apply specifically to the United States (such as SSNs) and may not detect sensitive data from other countries. You may look for available categories and locations applicable to your use case or use regular expressions (regex) in AWS Glue to create detection entities for sensitive data from other countries.

It’s important to select the correct sampling method that AWS Glue offers. In this example, it’s known that the data coming in from the stream has sensitive data in every row, so it’s not necessary to sample 100% of the rows in the dataset. If you have a requirement where no sensitive data is allowed to reach downstream sources, consider sampling 100% of the data for the patterns you chose, or scan the entire dataset and act on each individual cell to ensure all sensitive data is detected. The benefit you get from sampling is reduced cost, because you don’t have to scan as much data.

PII Options

The Detect PII action allows you to select a default string when masking sensitive data. In our example, we use the string **********.

selected_options
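For readers who prefer code, the following PySpark sketch shows the equivalent masking logic in a simplified form. It is a stand-in for the no-code Detect PII action, not the script AWS Glue Studio generates; the bucket paths and the list of sensitive columns are assumptions based on the schema above.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-pii-sketch").getOrCreate()

MASK = "**********"
# Columns the Detect PII step flagged as sensitive in this dataset.
SENSITIVE_COLUMNS = [
    "first_name", "last_name", "ssn", "address",
    "credit_card_number", "phone_number", "email", "ipv4",
]

df = spark.read.json("s3://my-bucket/raw/")  # placeholder raw prefix

# Overwrite each sensitive column with the mask string.
for column in SENSITIVE_COLUMNS:
    df = df.withColumn(column, F.lit(MASK))

df.write.mode("overwrite").json("s3://my-bucket/curated/")  # placeholder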

We use the apply mapping operation to rename and remove unnecessary columns such as ingestion_year, ingestion_month, and ingestion_day. This step also allows us to change the data type of one of the columns (purchase_value) from string to integer.

schema

From this point on, the job splits into two output destinations: OpenSearch Service and Amazon S3.

Our provisioned OpenSearch Service cluster is connected via the built-in OpenSearch connector for AWS Glue. We specify the OpenSearch index we’d like to write to, and the connector handles the credentials, domain, and port. In the screenshot below, we write to the specified index index_os_pii.

opensearch config

We store the masked dataset in the curated S3 prefix. There, the data is normalized for a specific use case and safe for consumption by data scientists or for ad hoc reporting needs.

opensearch target s3 folder

For unified governance, access control, and audit trails of all datasets and Data Catalog tables, you can use AWS Lake Formation. This helps you restrict access to the AWS Glue Data Catalog tables and underlying data to only those users and roles who have been granted necessary permissions to do so.

After the batch job runs successfully, you can use OpenSearch Service to run search queries or reports. As shown in the following screenshot, the pipeline masked sensitive fields automatically with no code development efforts.

You can identify trends from the operational data, such as the amount of transactions per day filtered by credit card provider, as shown in the preceding screenshot. You can also determine the locations and domains where users make purchases. The transaction_date attribute helps us see these trends over time. The following screenshot shows a record with all of the transaction’s information redacted appropriately.

json masked

For alternate methods on how to load data into Amazon OpenSearch, refer to Loading streaming data into Amazon OpenSearch Service.

Furthermore, sensitive data can also be discovered and masked using other AWS solutions. For example, you could use Amazon Macie to detect sensitive data inside an S3 bucket, and then use Amazon Comprehend to redact the sensitive data that was detected. For more information, refer to Common techniques to detect PHI and PII data using AWS Services.

Conclusion

This post discussed the importance of handling sensitive data within your environment and various methods and architectures to remain compliant while also allowing your organization to scale quickly. You should now have a good understanding of how to detect, mask, or redact and load your data into Amazon OpenSearch Service.


About the authors

Michael Hamilton is a Sr Analytics Solutions Architect focusing on helping enterprise customers modernize and simplify their analytics workloads on AWS. He enjoys mountain biking and spending time with his wife and three children when not working.

Daniel Rozo is a Senior Solutions Architect with AWS supporting customers in the Netherlands. His passion is engineering simple data and analytics solutions and helping customers move to modern data architectures. Outside of work, he enjoys playing tennis and biking.

Achieve high availability in Amazon OpenSearch Multi-AZ with Standby enabled domains: A deep dive into failovers

Post Syndicated from Anshu Agarwal original https://aws.amazon.com/blogs/big-data/achieve-high-availability-in-amazon-opensearch-multi-az-with-standby-enabled-domains-a-deep-dive-into-failovers/

Amazon OpenSearch Service recently introduced Multi-AZ with Standby, a deployment option designed to provide businesses with enhanced availability and consistent performance for critical workloads. With this feature, managed clusters can achieve 99.99% availability while remaining resilient to zonal infrastructure failures.

In this post, we explore how search and indexing works with Multi-AZ with Standby and delve into the underlying mechanisms that contribute to its reliability, simplicity, and fault tolerance.

Background

Multi-AZ with Standby deploys OpenSearch Service domain instances across three Availability Zones, with two zones designated as active and one as standby. This configuration ensures consistent performance, even in the event of zonal failures, by maintaining the same capacity across all zones. Importantly, this standby zone follows a statically stable design, eliminating the need for capacity provisioning or data movement during failures.

During regular operations, the active zone handles coordinator traffic for both read and write requests, as well as shard query traffic. The standby zone, on the other hand, only receives replication traffic. OpenSearch Service utilizes a synchronous replication protocol for write requests. This enables the service to promptly promote a standby zone to active status in the event of a failure (mean time to failover <= 1 minute), known as a zonal failover. The previously active zone is then demoted to standby mode, and recovery operations commence to restore its healthy state.

Search traffic routing and failover to guarantee high availability

In an OpenSearch Service domain, a coordinator is any node that handles HTTP(S) requests, especially indexing and search requests. In a Multi-AZ with Standby domain, the data nodes in the active zone act as coordinators for search requests.

During the query phase of a search request, the coordinator determines the shards to be queried and sends a request to the data node hosting the shard copy. The query is run locally on each shard and matched documents are returned to the coordinator node. The coordinator node, which is responsible for sending the request to nodes containing shard copies, runs the process in two steps. First, it creates an iterator that defines the order in which nodes need to be queried for a shard copy so that traffic is uniformly distributed across shard copies. Subsequently, the request is sent to the relevant nodes.

In order to create an ordered list of nodes to be queried for a shard copy, the coordinator node uses various algorithms. These algorithms include round-robin selection, adaptive replica selection, preference-based shard routing, and weighted round-robin.

For Multi-AZ with Standby, the weighted round-robin algorithm is used for shard copy selection. In this approach, active zones are assigned a weight of 1, and the standby zone is assigned a weight of 0. This ensures that no read traffic is sent to data nodes in the standby Availability Zone.

The weights are stored in cluster state metadata as a JSON object:

"weighted_shard_routing": {
    "awareness": {
        "zone": {
            "us-east-1b": 0,
            "us-east-1d": 1,
            "us-east-1c": 1
         }
     },
     "_version": 3
}
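Conceptually, zones with a weight of zero simply drop out of the shard copy rotation. The following Python sketch illustrates the idea; it is a simplification for intuition, not OpenSearch’s actual selection code.

import itertools

# Zone weights as stored in cluster state: the standby zone gets weight 0.
zone_weights = {"us-east-1b": 0, "us-east-1d": 1, "us-east-1c": 1}

# Shard copies keyed by the zone hosting them (illustrative node names).
shard_copies = {"us-east-1b": "node-b", "us-east-1d": "node-d", "us-east-1c": "node-c"}

# Only zones with a nonzero weight participate in the rotation, so the
# standby zone never receives shard query traffic.
eligible = [zone for zone, weight in zone_weights.items() if weight > 0]
rotation = itertools.cycle(eligible)

for _ in range(4):
    zone = next(rotation)
    print(f"query shard copy on {shard_copies[zone]} ({zone})")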

As shown in the following screenshot, the us-east-1b Availability Zone has its zone status as StandBy, indicating that the data nodes in this Availability Zone are in the standby state and don’t receive search or indexing requests from the load balancer.

Availability Zone status in AWS Console

To maintain steady-state operations, the standby Availability Zone is rotated every 30 minutes, ensuring all network paths across Availability Zones are exercised. This proactive approach verifies the availability of read paths, further enhancing the system’s resilience during potential failures. The following diagram illustrates this architecture.

Steady State Operation

In the preceding diagram, Zone-C has a weighted round-robin weight set to zero. This ensures that the data nodes in the standby zone don’t receive any indexing or search traffic. When the coordinator queries data nodes for shard copies, it uses the weighted round-robin weights to decide the order in which nodes are queried. Because the weight is zero for the standby Availability Zone, no coordinator requests are sent to it.

In an OpenSearch Service cluster, the active and standby zones can be checked at any time using Availability Zone rotation metrics, as shown in the following screenshot.

Availability Zone rotation metrics

During zonal outages, the standby Availability Zone seamlessly switches to fail-open mode for search requests. This means that the shard query traffic is routed to all Availability Zones, even those in standby, when a healthy shard copy is unavailable in the active Availability Zone. This fail-open approach safeguards search requests from disruption during failures, ensuring continuous service. The following diagram illustrates this architecture.

Read Failover during Zonal Failure

In the preceding diagram, during the steady state, shard query traffic is sent to the data nodes in the active Availability Zones (Zone-A and Zone-B). Due to node failures in Zone-A, the standby Availability Zone (Zone-C) fails open to take shard query traffic so that search requests aren’t impacted. Eventually, Zone-A is detected as unhealthy, and read failover switches Zone-A to standby.

How failover ensures high availability during write impairment

The OpenSearch Service replication model follows a synchronous primary-backup model, where acknowledgement from all shard copies is required before a write request can be acknowledged to the user. One notable drawback of this replication model is its susceptibility to slowdowns in the event of any impairment in the write path. These systems rely on an active leader node to identify failures or delays and then broadcast this information to all nodes. The duration it takes to detect these issues (mean time to detect) and subsequently resolve them (mean time to repair) largely determines how long the system will operate in an impaired state. Additionally, any networking event that affects inter-zone communications can significantly impede write requests due to the synchronous nature of replication.

OpenSearch Service utilizes an internal node-to-node communication protocol for replicating write traffic and coordinating metadata updates through an elected leader. Consequently, putting the zone experiencing stress in standby wouldn’t effectively address the issue of write impairment.

Zonal write failover: Cutting off inter-zone replication traffic

For Multi-AZ with Standby, to mitigate potential performance issues caused during unforeseen events like zonal failures and networking events, zonal write failover is an effective approach. This approach involves graceful removal of nodes in the impacted zone from the cluster, effectively cutting off ingress and egress traffic between zones. By severing the inter-zone replication traffic, the impact of zonal failures can be contained within the affected zone. This provides a more predictable experience for customers and ensures that the system continues to operate reliably.

Graceful write failover

The orchestration of a write failover within OpenSearch Service is carried out by the elected leader node through a well-defined mechanism. This mechanism involves a consensus protocol for cluster state publication, ensuring unanimous agreement among all nodes to designate a single zone (at all times) for decommissioning. Importantly, metadata related to the affected zone is replicated across all nodes to ensure its persistence, even during a full restart in the event of an outage.

Furthermore, the leader node ensures a smooth and graceful transition by initially placing the nodes in the impacted zone on standby for a duration of 5 minutes before initiating I/O fencing. This deliberate approach prevents any new coordinator traffic or shard query traffic from being directed to the nodes within the impacted zone. This, in turn, allows these nodes to complete their ongoing tasks gracefully and gradually drain any inflight requests before being taken out of service. The following diagram illustrates this architecture.

Write Failover during Networking Event

When a write failover involves the zone hosting the leader node, OpenSearch Service follows these key steps:

  • Leader abdication – If the leader node happens to be located in a zone scheduled for write failover, the system ensures that the leader node voluntarily steps down from its leadership role. This abdication is carried out in a controlled manner, and the entire process is handed over to another eligible node, which then takes charge of the actions required.
  • Prevent reelection of the to-be-decommissioned leader – When the eligible leader node initiates the write failover action, it ensures that any to-be-decommissioned leader-eligible nodes can’t participate in further elections. This is achieved by excluding the to-be-decommissioned leader node from the voting configuration, effectively preventing it from voting during any critical phase of the cluster’s operation.

Metadata related to the write failover zone is stored within the cluster state, and this information is published to all nodes in the distributed OpenSearch Service cluster as follows:

"decommissionedAttribute": {
    "awareness": {
        "zone": "us-east-1c"
     },
     "status": "successful",
     "requestID": "FLoyf5v9RVSsaAquRNKxIw"
}
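In open-source OpenSearch, this zonal write failover workflow is exposed through the cluster decommission API. A minimal sketch of decommissioning a zone and later recommissioning it (the zone value is illustrative):

PUT /_cluster/decommission/awareness/zone/us-east-1c

DELETE /_cluster/decommission/awareness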

The following screenshot shows how write failover helps recover availability during a networking slowdown in a zone.

Write Failover helps recovering availability

Zonal recovery after write failover

The process of zonal recommissioning plays a crucial role in the recovery phase following a zonal write failover. After the impacted zone has been restored and is considered stable, the previously decommissioned nodes rejoin the cluster, typically within 2 minutes of the recommission action. This enables them to synchronize with their peer nodes and initiates the recovery process for replica shards, effectively restoring the cluster to its desired state.

Conclusion

The introduction of OpenSearch Service Multi-AZ with Standby provides businesses with a powerful solution to achieve high availability and consistent performance for critical workloads. With this deployment option, businesses can enhance their infrastructure’s resilience, simplify cluster configuration and management, and enforce best practices. With features like weighted round-robin shard copy selection, proactive failover mechanisms, and fail-open standby Availability Zones, OpenSearch Service Multi-AZ with Standby ensures a reliable and efficient search experience for demanding enterprise environments.

For more information about Multi-AZ with Standby, refer to Amazon OpenSearch Service Under the Hood: Multi-AZ with Standby.


About the Author


Anshu Agarwal
 is a Senior Software Engineer working on AWS OpenSearch at Amazon Web Services. She is passionate about solving problems related to building scalable and highly reliable systems.


Rishab Nahata
 is a Software Engineer working on OpenSearch at Amazon Web Services. He is fascinated by solving problems in distributed systems. He is an active contributor to OpenSearch.


Bukhtawar Khan
is a Principal Engineer working on Amazon OpenSearch Service. He is interested in distributed and autonomous systems. He is an active contributor to OpenSearch.


Ranjith Ramachandra
is an Engineering Manager working on Amazon OpenSearch Service at Amazon Web Services.

Amazon OpenSearch Service search enhancements: 2023 roundup

Post Syndicated from Dagney Braun original https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-search-enhancements-2023-roundup/

What users expect from search engines has evolved over the years. Just returning lexically relevant results quickly is no longer enough for most users. Users now seek methods that return more relevant results through semantic understanding, or that search by the visual similarity of images rather than the textual metadata describing them. Amazon OpenSearch Service includes many features that allow you to enhance your search experience. We are excited about the OpenSearch Service features and enhancements we’ve added to that toolkit in 2023.

2023 was a year of rapid innovation within the artificial intelligence (AI) and machine learning (ML) space, and search has been a significant beneficiary of that progress. Throughout 2023, Amazon OpenSearch Service invested in enabling search teams to use the latest AI/ML technologies to improve and augment existing search experiences, without having to rewrite applications or build bespoke orchestrations, unlocking rapid development, iteration, and productization. These investments include the introduction of new search methods as well as functionality to simplify implementation of the methods available, which we review in this post.

Background: Lexical and semantic search

Before we get started, let’s review lexical and semantic search.

Lexical search

In lexical search, the search engine compares the words in the search query to the words in the documents, matching word for word. Only items that have words the user typed match the query. Traditional lexical search, based on term frequency models like BM25, is widely used and effective for many search applications. However, lexical search techniques struggle to go beyond the words included in the user’s query, resulting in highly relevant potential results not always being returned.

Semantic search

In semantic search, the search engine uses an ML model to encode text or other media (such as images and videos) from the source documents as a dense vector in a high-dimensional vector space. This is also called embedding the text into the vector space. It similarly codes the query as a vector and then uses a distance metric to find nearby vectors in the multi-dimensional space to find matches. The algorithm for finding nearby vectors is called k-nearest neighbors (k-NN). Semantic search doesn’t match individual query terms—it finds documents whose vector embedding is near the query’s embedding in the vector space and therefore semantically similar to the query. This allows you to return highly relevant items even if they don’t contain any of the words that were in the query.
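As a concrete illustration, the following sketch creates a small vector field and runs an approximate k-NN query against it. The index name, field name, and 4-dimensional vectors are illustrative; real embedding models typically produce vectors with hundreds of dimensions:

PUT /demo-index
{
    "settings": { "index.knn": true },
    "mappings": {
        "properties": {
            "embedding": { "type": "knn_vector", "dimension": 4 }
        }
    }
}

GET /demo-index/_search
{
    "size": 3,
    "query": {
        "knn": {
            "embedding": { "vector": [0.1, 0.4, -0.2, 0.9], "k": 3 }
        }
    }
}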

OpenSearch has provided vector similarity search (k-NN and approximate k-NN) for several years, which has been valuable for customers who adopted it. However, not all customers who have the opportunity to benefit from k-NN have adopted it, due to the significant engineering effort and resources required to do so.

2023 releases: Fundamentals

In 2023, several features and improvements were launched on OpenSearch Service, including new features that are fundamental building blocks for continued search enhancements.

The OpenSearch Compare Search Results tool

The Compare Search Results tool, generally available in OpenSearch Service version 2.11, allows you to compare search results from two ranking techniques side by side, in OpenSearch Dashboards, to determine whether one query produces better results than the other. For customers who are interested in experimenting with the latest search methods powered by ML-assisted models, the ability to compare search results is critical. This can include comparing lexical search, semantic search, and hybrid search techniques to understand the benefits of each technique against your corpus, or adjustments such as field weighting and different stemming or lemmatization strategies.

The following screenshot shows an example of using the Compare Search Results tool.


To learn more about semantic search and cross-modal search and experiment with a demo of the Compare Search Results tool, refer to Try semantic search with the Amazon OpenSearch Service vector engine.

Search pipelines

Search practitioners are looking to introduce new ways to enhance search queries as well as results. With the general availability of search pipelines, starting in OpenSearch Service version 2.9, you can build search query and result processing as a composition of modular processing steps, without complicating your application software. By integrating processors for functions such as filters, and with the ability to add a script to run on newly indexed documents, you can make your search applications more accurate and efficient and reduce the need for custom development.

Search pipelines incorporate three built-in processors (filter_query, rename_field, and a script request processor), as well as new developer-focused APIs that enable developers who want to build their own processors to do so. OpenSearch will continue adding built-in processors to further expand this functionality in the coming releases.
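As an example, the following sketch defines a search pipeline with a filter_query request processor that restricts every query routed through the pipeline, then runs a search with it. The pipeline, index, and field names are illustrative:

PUT /_search/pipeline/visible_docs
{
    "request_processors": [
        {
            "filter_query": {
                "description": "Return only documents marked public",
                "query": { "term": { "visibility": "public" } }
            }
        }
    ]
}

GET /my-index/_search?search_pipeline=visible_docs
{
    "query": { "match": { "title": "opensearch" } }
}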

The following diagram illustrates the search pipelines architecture.

Byte-sized vectors in Lucene

Until now, the k-NN plugin in OpenSearch has supported indexing and querying vectors of type float, with each vector element occupying 4 bytes. This can be expensive in memory and storage, especially for large-scale use cases. With the new byte vector feature in OpenSearch Service version 2.9, you can reduce memory requirements by a factor of 4 and significantly reduce search latency, with minimal loss in quality (recall). To learn more, refer to Byte-quantized vectors in OpenSearch.
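To use byte vectors, you declare the data type on the knn_vector field and index vectors whose elements are signed integers in the range [-128, 127]. A minimal mapping sketch, assuming the Lucene engine (index and field names are illustrative):

PUT /byte-vector-index
{
    "settings": { "index.knn": true },
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 4,
                "data_type": "byte",
                "method": { "name": "hnsw", "engine": "lucene", "space_type": "l2" }
            }
        }
    }
}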

Support for new language analyzers

OpenSearch Service previously supported language analyzer plugins such as IK (Chinese), Kuromoji (Japanese), and Seunjeon (Korean), among several others. We added support for Nori (Korean), Sudachi (Japanese), Pinyin (Chinese), and STConvert Analysis (Chinese). These new plugins are available as a new package type, ZIP-PLUGIN, along with the previously supported TXT-DICTIONARY package type. You can navigate to the Packages page of the OpenSearch Service console to associate these plugins to your cluster, or use the AssociatePackage API.

2023 releases: Ease-of-use enhancements

The OpenSearch Service also made improvements in 2023 to enhance ease of use within key search features.

Semantic search with neural search

Previously, implementing semantic search meant that your application was responsible for the middleware to integrate text embedding models into search and ingest, orchestrating the encoding of the corpus, and then using a k-NN search at query time.

OpenSearch Service introduced neural search in version 2.9, enabling builders to create and operationalize semantic search applications with significantly reduced undifferentiated heavy lifting. Your application no longer needs to deal with the vectorization of documents and queries; semantic search does that, and invokes k-NN during query time. Semantic search via the neural search feature transforms documents or other media into vector embeddings and indexes both the text and its vector embeddings in a vector index. When you use a neural query during search, neural search converts the query text into a vector embedding, uses vector search to compare the query and document embeddings, and returns the closest results. This functionality was initially released as experimental in OpenSearch Service version 2.4, and is now generally available with version 2.9.
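In practice, you attach a text_embedding ingest processor so documents are embedded at index time, and then query with the neural query type. The following sketch assumes a text embedding model has already been deployed; the pipeline, index, and field names and the <your-model-id> placeholder are illustrative:

PUT /_ingest/pipeline/nlp-ingest-pipeline
{
    "processors": [
        {
            "text_embedding": {
                "model_id": "<your-model-id>",
                "field_map": { "description": "description_embedding" }
            }
        }
    ]
}

GET /products/_search
{
    "query": {
        "neural": {
            "description_embedding": {
                "query_text": "warm waterproof winter jacket",
                "model_id": "<your-model-id>",
                "k": 10
            }
        }
    }
}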

AI/ML connectors to enable AI-powered search features

With OpenSearch Service 2.9, you can use out-of-the-box AI connectors to AWS AI and ML services and third-party alternatives to power features like neural search. For instance, you can connect to external ML models hosted on Amazon SageMaker, which provides comprehensive capabilities to manage models successfully in production. If you want to use the latest foundation models via a fully managed experience, you can use connectors for Amazon Bedrock to power use cases like multimodal search. Our initial release includes a connector to Cohere Embed, and through SageMaker and Amazon Bedrock, you have access to more third-party options. You can configure some of these integrations on your domains through the OpenSearch Service console integrations (see the following screenshot), and even automate model deployment to SageMaker.

Integrated models are cataloged in your OpenSearch Service domain, so that your team can discover the variety of models that are integrated and readily available for use. You even have the option to enable granular security controls on your model and connector resources to govern model and connector level access.

To foster an open ecosystem, we created a framework to empower partners to easily build and publish AI connectors. Technology providers can simply create a blueprint, which is a JSON document that describes secure RESTful communication between OpenSearch and your service. Technology partners can publish their connectors on our community site, and you can immediately use these AI connectors—whether for a self-managed cluster or on OpenSearch Service. You can find blueprints for each connector in the ML Commons GitHub repository.
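To give a sense of the blueprint shape, the following abbreviated sketch is modeled on the published SageMaker remote inference blueprint: a name, a protocol, credentials, and one or more actions describing the remote endpoint. The endpoint URL, role ARN, and request body mapping are placeholders and vary by model; the blueprints in the ML Commons repository are the authoritative field reference:

POST /_plugins/_ml/connectors/_create
{
    "name": "sagemaker-embedding-connector",
    "description": "Connector to a SageMaker text embedding endpoint",
    "version": 1,
    "protocol": "aws_sigv4",
    "parameters": { "region": "us-east-1", "service_name": "sagemaker" },
    "credential": { "roleArn": "<role-arn-with-invoke-access>" },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint-name>/invocations",
            "headers": { "content-type": "application/json" },
            "request_body": "{ \"inputs\": ${parameters.input} }"
        }
    ]
}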

Hybrid search supported by score combination

Semantic technologies such as vector embeddings for neural search and generative AI large language models (LLMs) for natural language processing have revolutionized search, reducing the need for manual synonym list management and fine-tuning. On the other hand, text-based (lexical) search outperforms semantic search in some important cases, such as part numbers or brand names. Hybrid search, the combination of the two methods, gives 14% higher search relevancy (as measured by NDCG@10—a measure of ranking quality) than BM25 alone, so customers want to use hybrid search to get the best of both. For more information about detailed benchmarking score accuracy and performance, refer to Improve search relevance with hybrid search, generally available in OpenSearch 2.10.

Until now, combining them has been challenging given the different relevancy scales for each method. Previously, to implement a hybrid approach, you had to run multiple queries independently, then normalize and combine scores outside of OpenSearch. With the launch of the new hybrid score combination and normalization query type in OpenSearch Service 2.11, OpenSearch handles score normalization and combination in one query, making hybrid search easier to implement and a more efficient way to improve search relevance.
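Concretely, you define a search pipeline with a normalization-processor, then issue a hybrid query that wraps the lexical and neural sub-queries. A sketch, where the weights, index and field names, and model ID are illustrative:

PUT /_search/pipeline/hybrid-pipeline
{
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": { "technique": "min_max" },
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": { "weights": [0.3, 0.7] }
                }
            }
        }
    ]
}

GET /products/_search?search_pipeline=hybrid-pipeline
{
    "query": {
        "hybrid": {
            "queries": [
                { "match": { "description": "warm winter jacket" } },
                {
                    "neural": {
                        "description_embedding": {
                            "query_text": "warm winter jacket",
                            "model_id": "<your-model-id>",
                            "k": 10
                        }
                    }
                }
            ]
        }
    }
}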

New search methods

Lastly, OpenSearch Service now features new search methods.

Neural sparse retrieval

OpenSearch Service 2.11 introduced neural sparse search, a new kind of sparse embedding method that is similar in many ways to classic term-based indexing, but with low-frequency words and phrases better represented. Sparse semantic retrieval uses transformer models (such as BERT) to build information-rich embeddings that solve for the vocabulary mismatch problem in a scalable way, while having similar computational cost and latency to lexical search. This new sparse retrieval functionality with OpenSearch offers two modes with different advantages: a document-only mode and a bi-encoder mode. The document-only mode can deliver low-latency performance more comparable to BM25 search, with limitations for advanced syntax as compared to dense methods. The bi-encoder mode can maximize search relevance while performing at higher latencies. With this update, you can now choose the method that works best for your performance, accuracy, and cost requirements.
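Querying a sparse index looks much like dense neural search: the neural_sparse query expands the query text into weighted tokens using the sparse encoding model. A sketch, where the index and field names and model ID are illustrative:

GET /my-sparse-index/_search
{
    "query": {
        "neural_sparse": {
            "passage_embedding": {
                "query_text": "wide receiver touchdown record",
                "model_id": "<your-sparse-model-id>"
            }
        }
    }
}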

Multi-modal search

OpenSearch Service 2.11 introduces text and image multimodal search using neural search. This functionality allows you to search image and text pairs, like product catalog items (product image and description), based on visual and semantic similarity. This enables new search experiences that can deliver more relevant results. For instance, you can search for “white blouse” to retrieve products with images that match that description, even if the product title is “cream colored shirt.” The ML model that powers this experience is able to associate semantics and visual characteristics. You can also search by image to retrieve visually similar products or search by both text and image to find the products most similar to a particular product catalog item.

You can now build these capabilities into your application to connect directly to multimodal models and run multimodal search queries without having to build custom middleware. The Amazon Titan Multimodal Embeddings model can be integrated with OpenSearch Service to support this method. Refer to Multimodal search for guidance on how to get started with multimodal semantic search, and look out for more input types to be added in future releases. You can also try out the demo of cross-modal textual and image search, which shows searching for images using textual descriptions.
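A multimodal neural query can carry text, an image, or both, with the image passed as a Base64-encoded string. A sketch, where the index and field names, model ID, and image payload are placeholders:

GET /products/_search
{
    "query": {
        "neural": {
            "vector_embedding": {
                "query_text": "white blouse",
                "query_image": "<base64-encoded-image>",
                "model_id": "<your-multimodal-model-id>",
                "k": 5
            }
        }
    }
}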

Summary

OpenSearch Service offers an array of different tools to build your search application, but the best implementation will depend on your corpus and your business needs and goals. We encourage search practitioners to begin testing the search methods available in order to find the right fit for your use case. In 2024 and beyond, you can expect to continue to see this fast pace of search innovation in order to keep the latest and greatest search technologies at the fingertips of OpenSearch search practitioners.


About the Authors

Dagney Braun is a Senior Manager of Product at Amazon Web Services OpenSearch Team. She is passionate about improving the ease of use of OpenSearch, and expanding the tools available to better support all customer use-cases.

Stavros Macrakis is a Senior Technical Product Manager on the OpenSearch project of Amazon Web Services. He is passionate about giving customers the tools to improve the quality of their search results.

Dylan Tong is a Senior Product Manager at Amazon Web Services. He leads the product initiatives for AI and machine learning (ML) on OpenSearch including OpenSearch’s vector database capabilities. Dylan has decades of experience working directly with customers and creating products and solutions in the database, analytics and AI/ML domain. Dylan holds a BSc and MEng degree in Computer Science from Cornell University.

Amazon OpenSearch Serverless now supports automated time-based data deletion 

Post Syndicated from Satish Nandi original https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-now-supports-automated-time-based-data-deletion/

We recently announced a new enhancement to OpenSearch Serverless for managing data retention of time series collections and indexes. OpenSearch Serverless for Amazon OpenSearch Service makes it straightforward to run search and analytics workloads without having to think about infrastructure management. With the new automated time-based data deletion feature, you can specify how long you want to retain data, and OpenSearch Serverless automatically manages the lifecycle of the data based on this configuration.

To analyze time series data such as application logs and events in OpenSearch, you must create and ingest data into indexes. Typically, these logs are generated continuously and ingested frequently, such as every few minutes, into OpenSearch. Large volumes of logs can consume much of the available resources in the clusters, such as storage, and therefore need to be managed efficiently to maintain performance. You can manage the lifecycle of the indexed data by using automated tooling to create daily indexes. You can then use scripts to rotate the indexed data from the primary storage in clusters to a secondary remote storage to maintain performance and control costs, and then delete the aged data after a certain retention period.

The new automated time-based data deletion feature in OpenSearch Serverless minimizes the need to manually create and manage daily indexes or write data lifecycle scripts. You can now create a single index and OpenSearch Serverless will handle creating a timestamped collection of indexes under one logical grouping automatically. You only need to configure the desired data retention policies for your time series data collections. OpenSearch Serverless will then efficiently roll over indexes from primary storage to Amazon Simple Storage Service (Amazon S3) as they age, and automatically delete aged data per the configured retention policies, reducing the operational overhead and saving costs.

In this post, we discuss the new data lifecycle policies and how to get started with them in OpenSearch Serverless.

Solution overview

Consider a use case where the fictitious company Octank Broker collects logs from its web services and ingests them into OpenSearch Serverless for service availability analysis. The company wants to track web access and root-cause failures with 4xx and 5xx error types. Generally, the server issues are of interest within an immediate timeframe, say a few days. After 30 days, these logs are no longer of interest.

Octank wants to retain its log data for 7 days. If the collections or indexes are configured for 7 days’ data retention, then after 7 days, OpenSearch Serverless deletes the data and the indexes are no longer available for search. Note: Document counts in search results might reflect data that is marked for deletion for a short time.

You can configure data retention by creating a data lifecycle policy. The retention time can be unlimited, or you can provide a specific length of time in days and hours, with a minimum retention of 24 hours and a maximum of 10 years. If the retention time is unlimited, as the name suggests, no data is deleted.

To start using data lifecycle policies in OpenSearch Serverless, you can follow the steps outlined in this post.

Prerequisites

This post assumes that you have already set up an OpenSearch Serverless collection. If not, refer to Log analytics the easy way with Amazon OpenSearch Serverless for instructions.

Create a data lifecycle policy

You can create a data lifecycle policy from the AWS Management Console, the AWS Command Line Interface (AWS CLI), AWS CloudFormation, AWS Cloud Development Kit (AWS CDK), and Terraform. To create a data lifecycle policy via the console, complete the following steps:

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Choose Create data lifecycle policy.
  • For Data lifecycle policy name, enter a name (for example, web-logs-policy).
  • Choose Add under Data lifecycle.
  • Under Source Collection, choose the collection to which you want to apply the policy (for example, web-logs-collection).
  • Under Indexes, enter the index or index patterns to apply the retention duration (for example, web-logs).
  • Under Data retention, disable Unlimited (to set up the specific retention for the index pattern you defined).
  • Enter the hours or days after which you want to delete data from Amazon S3.
  • Choose Create.

The following graphic gives a quick demonstration of creating the OpenSearch Serverless Data lifecycle policies via the preceding steps.
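You can also create the same policy from the AWS CLI. A sketch that mirrors the preceding console steps, retaining the web-logs indexes in web-logs-collection for Octank’s 7 days:

aws opensearchserverless create-lifecycle-policy \
    --name web-logs-policy \
    --type retention \
    --policy '{"Rules":[{"ResourceType":"index","Resource":["index/web-logs-collection/web-logs*"],"MinIndexRetention":"7d"}]}'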

View the data lifecycle policy

After you have created the data lifecycle policy, you can view the policy by completing the following steps:

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Select the policy you want to view (for example, web-logs-policy).
  • Choose the hyperlink under Policy name.

This page will show you the details such as the index pattern and its retention period for a specific index and collection. The following graphic gives a quick demonstration of viewing the OpenSearch Serverless data lifecycle policies via the preceding steps.

Update the data lifecycle policy

After you have created the data lifecycle policy, you can modify and update it to add more rules. For example, you can add another index pattern, or add a new collection with a new index pattern, to set up retention. The following example shows the steps to add another rule in the policy for the syslogs index under syslogs-collection.

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Select the policy you want to edit (for example, web-logs-policy), then choose Edit.
  • Choose Add under Data lifecycle.
  • Under Source Collection, choose the collection you are going to use for setting up the data lifecycle policy (for example, syslogs-collection).
  • Under Indexes, enter the index or index patterns to set retention for (for example, syslogs).
  • Under Data retention, disable Unlimited (to set up specific retention for the index pattern you defined).
  • Enter the hours or days after which you want to delete data from Amazon S3.
  • Choose Save.

The following graphic gives a quick demonstration of updating existing data lifecycle policies via the preceding steps.

Delete the data lifecycle policy

Delete the existing data lifecycle policy with the following steps:

  • On the OpenSearch Service console, choose Data lifecycle policies under Serverless in the navigation pane.
  • Select the policy you want to edit (for example, web-logs-policy).
  • Choose Delete.

Data lifecycle policy rules

In a data lifecycle policy, you specify a series of rules. These rules let you manage the retention period of data associated with indexes or collections that match them. Each rule consists of a resource type (index), a retention period, and a list of resources (indexes) that the retention period applies to.

You define the retention period with one of the following formats:

  • “MinIndexRetention”: “24h” – OpenSearch Serverless retains the index data for a specified period in hours or days. You can set this period to be from 24 hours (24h) to 3,650 days (3650d).
  • “NoMinIndexRetention”: true – OpenSearch Serverless retains the index data indefinitely.

When data lifecycle policy rules overlap, within or across policies, the rule with a more specific resource name or pattern for an index overrides a rule with a more general resource name or pattern for any indexes that are common to both rules. For example, in the following policy, two rules apply to the index index/sales/logstash. In this situation, the second rule takes precedence because index/sales/log* is the longest match to index/sales/logstash. Therefore, OpenSearch Serverless sets no retention period for the index.
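A policy with that shape looks like the following sketch; the 30-day retention on the broader pattern is illustrative:

{
    "Rules": [
        {
            "ResourceType": "index",
            "Resource": ["index/sales/*"],
            "MinIndexRetention": "30d"
        },
        {
            "ResourceType": "index",
            "Resource": ["index/sales/log*"],
            "NoMinIndexRetention": true
        }
    ]
}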

Summary

Data lifecycle policies provide a consistent and straightforward way to manage indexes in OpenSearch Serverless. With data lifecycle policies, you can automate data management and avoid human errors. Deleting non-relevant data without manual intervention reduces your operational load, saves storage costs, and helps keep the system performant for search.


About the authors

Prashant Agrawal is a Senior Search Specialist Solutions Architect with Amazon OpenSearch Service. He works closely with customers to help them migrate their workloads to the cloud and helps existing customers fine-tune their clusters to achieve better performance and save on cost. Before joining AWS, he helped various customers use OpenSearch and Elasticsearch for their search and log analytics use cases. When not working, you can find him traveling and exploring new places. In short, he likes doing Eat → Travel → Repeat.

Satish Nandi is a Senior Product Manager with Amazon OpenSearch Service. He is focused on OpenSearch Serverless and has years of experience in networking, security, and ML/AI. He holds a Bachelor’s degree in Computer Science and an MBA in Entrepreneurship. In his free time, he likes to fly airplanes and hang gliders and ride his motorcycle.

Accelerate analytics on Amazon OpenSearch Service with AWS Glue through its native connector

Post Syndicated from Basheer Sheriff original https://aws.amazon.com/blogs/big-data/accelerate-analytics-on-amazon-opensearch-service-with-aws-glue-through-its-native-connector/

As the volume and complexity of analytics workloads continue to grow, customers are looking for more efficient and cost-effective ways to ingest and analyze data. Data moves from online systems such as databases, CRMs, and marketing systems to data stores such as data lakes on Amazon Simple Storage Service (Amazon S3), data warehouses in Amazon Redshift, and purpose-built stores such as Amazon OpenSearch Service, Amazon Neptune, and Amazon Timestream.

OpenSearch Service is used for multiple purposes, such as observability, search analytics, consolidation, cost savings, compliance, and integration. OpenSearch Service also has vector database capabilities that let you implement semantic search and Retrieval Augmented Generation (RAG) with large language models (LLMs) to build recommendation and media search engines. Previously, to integrate with OpenSearch Service, you could use open source clients for specific programming languages such as Java, Python, or JavaScript or use REST APIs provided by OpenSearch Service.

Movement of data across data lakes, data warehouses, and purpose-built stores is achieved by extract, transform, and load (ETL) processes using data integration services such as AWS Glue. AWS Glue is a serverless data integration service that makes it straightforward to discover, prepare, and combine data for analytics, machine learning (ML), and application development. AWS Glue provides both visual and code-based interfaces to make data integration effortless. Using a native AWS Glue connector increases agility, simplifies data movement, and improves data quality.

In this post, we explore the AWS Glue native connector to OpenSearch Service and discover how it eliminates the need to build and maintain custom code or third-party tools to integrate with OpenSearch Service. This accelerates analytics pipelines and search use cases, providing instant access to your data in OpenSearch Service. You can now use data stored in OpenSearch Service indexes as a source or target within the AWS Glue Studio no-code, drag-and-drop visual interface or directly in an AWS Glue ETL job script. When combined with AWS Glue ETL capabilities, this new connector simplifies the creation of ETL pipelines, enabling ETL developers to save time building and maintaining data pipelines.

Solution overview

The new native OpenSearch Service connector is a powerful tool that can help organizations unlock the full potential of their data. It enables you to efficiently read and write data from OpenSearch Service without needing to install or manage OpenSearch Service connector libraries.

In this post, we demonstrate exporting the New York City Taxi and Limousine Commission (TLC) Trip Record Data dataset into OpenSearch Service using the AWS Glue native connector. The following diagram illustrates the solution architecture.

By the end of this post, your visual ETL job will resemble the following screenshot.

Prerequisites

To follow along with this post, you need a running OpenSearch Service domain. For setup instructions, refer to Getting started with Amazon OpenSearch Service. For simplicity, ensure the domain is publicly accessible, and note the primary user name and password for later use.

Note that as of this writing, the AWS Glue OpenSearch Service connector doesn’t support Amazon OpenSearch Serverless, so you need to set up a provisioned domain.

Create an S3 bucket

We use an AWS CloudFormation template to create an S3 bucket to store the sample data. Complete the following steps:

  1. Choose Launch Stack.
  2. On the Specify stack details page, enter a name for the stack.
  3. Choose Next.
  4. On the Configure stack options page, choose Next.
  5. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Submit.

The stack takes about 2 minutes to deploy.

Create an index in the OpenSearch Service domain

To create an index in the OpenSearch Service domain, complete the following steps:

  1. On the OpenSearch Service console, choose Domains in the navigation pane.
  2. Open the domain you created as a prerequisite.
  3. Choose the link under OpenSearch Dashboards URL.
  4. On the navigation menu, choose Dev Tools.
  5. Enter the following code to create the index:
PUT /yellow-taxi-index
{
  "mappings": {
    "properties": {
      "VendorID": {
        "type": "integer"
      },
      "tpep_pickup_datetime": {
        "type": "date",
        "format": "epoch_millis"
      },
      "tpep_dropoff_datetime": {
        "type": "date",
        "format": "epoch_millis"
      },
      "passenger_count": {
        "type": "integer"
      },
      "trip_distance": {
        "type": "float"
      },
      "RatecodeID": {
        "type": "integer"
      },
      "store_and_fwd_flag": {
        "type": "keyword"
      },
      "PULocationID": {
        "type": "integer"
      },
      "DOLocationID": {
        "type": "integer"
      },
      "payment_type": {
        "type": "integer"
      },
      "fare_amount": {
        "type": "float"
      },
      "extra": {
        "type": "float"
      },
      "mta_tax": {
        "type": "float"
      },
      "tip_amount": {
        "type": "float"
      },
      "tolls_amount": {
        "type": "float"
      },
      "improvement_surcharge": {
        "type": "float"
      },
      "total_amount": {
        "type": "float"
      },
      "congestion_surcharge": {
        "type": "float"
      },
      "airport_fee": {
        "type": "integer"
      }
    }
  }
}

Create a secret for OpenSearch Service credentials

In this post, we use basic authentication and store our authentication credentials securely using AWS Secrets Manager. Complete the following steps to create a Secrets Manager secret:

  1. On the Secrets Manager console, choose Secrets in the navigation pane.
  2. Choose Store a new secret.
  3. For Secret type, select Other type of secret.
  4. For Key/value pairs, create a key opensearch.net.http.auth.user whose value is your OpenSearch Service user name, and a key opensearch.net.http.auth.pass whose value is the corresponding password.
  5. Choose Next.
  6. Complete the remaining steps to create your secret.

Create an IAM role for the AWS Glue job

Complete the following steps to configure an AWS Identity and Access Management (IAM) role for the AWS Glue job:

  1. On the IAM console, create a new role.
  2. Attach the AWS managed policy AWSGlueServiceRole.
  3. Attach the following policy to the role. Replace each ARN with the corresponding ARN of the OpenSearch Service domain, Secrets Manager secret, and S3 bucket.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "OpenSearchPolicy",
            "Effect": "Allow",
            "Action": [
                "es:ESHttpPost",
                "es:ESHttpPut"
            ],
            "Resource": [
                "arn:aws:es:<region>:<aws-account-id>:domain/<amazon-opensearch-domain-name>"
            ]
        },
        {
            "Sid": "GetDescribeSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:<region>:<aws-account-id>:secret:<secret-name>"
        },
        {
            "Sid": "S3Policy",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:ListBucket",
                "s3:GetBucketAcl",
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::<bucket-name>",
                "arn:aws:s3:::<bucket-name>/*"
            ]
        }
    ]
}

Create an AWS Glue connection

Before you can use the OpenSearch Service connector, you need to create an AWS Glue connection for connecting to OpenSearch Service. Complete the following steps:

  1. On the AWS Glue console, choose Connections in the navigation pane.
  2. Choose Create connection.
  3. For Name, enter opensearch-connection.
  4. For Connection type, choose Amazon OpenSearch.
  5. For Domain endpoint, enter the domain endpoint of OpenSearch Service.
  6. For Port, enter HTTPS port 443.
  7. For Resource, enter yellow-taxi-index.

In this context, resource means the index of OpenSearch Service where the data is read from or written to.

  8. Select Wan only enabled.
  9. For AWS Secret, choose the secret you created earlier.
  10. Optionally, if you’re connecting to an OpenSearch Service domain in a VPC, specify a VPC, subnet, and security group to run AWS Glue jobs inside the VPC. For security groups, a self-referencing inbound rule is required. For more information, see Setting up networking for development for AWS Glue.
  11. Choose Create connection.

Create an ETL job using AWS Glue Studio

Complete the following steps to create your AWS Glue ETL job:

  1. On the AWS Glue console, choose Visual ETL in the navigation pane.
  2. Choose Create job and Visual ETL.
  3. On the AWS Glue Studio console, change the job name to opensearch-etl.
  4. Choose Amazon S3 for the data source and Amazon OpenSearch for the data target.

Between the source and target, you can optionally insert transform nodes. In this solution, we create a job that has only source and target nodes for simplicity.

  5. In the Data source properties section, specify the S3 bucket where the sample data is located, and choose Parquet as the data format.
  6. In the Data sink properties section, specify the connection you created in the previous section (opensearch-connection).
  7. Choose the Job details tab, and in the Basic properties section, specify the IAM role you created earlier.
  8. Choose Save to save your job, and choose Run to run the job.
  9. Navigate to the Runs tab to check the status of the job. When it is successful, the run status should be Succeeded.
  10. After the job runs successfully, navigate to OpenSearch Dashboards, and log in to the dashboard.
  11. Choose Dashboards Management on the navigation menu.
  12. Choose Index patterns, and choose Create index pattern.
  13. Enter yellow-taxi-index for Index pattern name.
  14. Choose tpep_pickup_datetime for Time.
  15. Choose Create index pattern. This index pattern will be used to visualize the index.
  16. Choose Discover on the navigation menu, and choose yellow-taxi-index.


You have now created an index in OpenSearch Service and loaded data into it from Amazon S3 in just a few steps using the AWS Glue OpenSearch Service native connector.

Clean up

To avoid incurring charges, clean up the resources in your AWS account by completing the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. From the list of jobs, select the job opensearch-etl, and on the Actions menu, choose Delete.
  3. On the AWS Glue console, choose Data connections in the navigation pane.
  4. Select opensearch-connection from the list of connectors, and on the Actions menu, choose Delete.
  5. On the IAM console, choose Roles in the navigation page.
  6. Select the role you created for the AWS Glue job and delete it.
  7. On the CloudFormation console, choose Stacks in the navigation pane.
  8. Select the stack you created for the S3 bucket and sample data and delete it.
  9. On the Secrets Manager console, choose Secrets in the navigation pane.
  10. Select the secret you created, and on the Actions menu, choose Delete.
  11. Reduce the waiting period to 7 days and schedule the deletion.

Conclusion

The integration of AWS Glue with OpenSearch Service adds the powerful ability to perform data transformation when integrating with OpenSearch Service for analytics use cases. This enables organizations to streamline data integration and analytics with OpenSearch Service. The serverless nature of AWS Glue means no infrastructure management, and you pay only for the resources consumed while your jobs are running. As organizations increasingly rely on data for decision-making, this native Spark connector provides an efficient, cost-effective, and agile solution to swiftly meet data analytics needs.


About the authors

Basheer Sheriff is a Senior Solutions Architect at AWS. He loves to help customers solve interesting problems leveraging new technology. He is based in Melbourne, Australia, and likes to play sports such as football and cricket.

Shunsuke Goto is a Prototyping Engineer working at AWS. He works closely with customers to build their prototypes and also helps customers build analytics systems.